Nvidia GT300 core: Speculation

I can inform you that current A1 samples are clocked at 700/1600/1100 MHz. So they come to an impressive 2,457 GFLOPS and 281 GB/s of memory bandwidth.
Nice, isn't it? ;)

Source: Hardware-Infos
 
I can inform you that current A1 samples are clocked at 700/1600/1100 MHz. So they come to an impressive 2,457 GFLOPS and 281 GB/s of memory bandwidth.
Nice, isn't it? ;)

Source: Hardware-Infos

I'm not convinced GFLOPS numbers and the amount of memory bandwidth are a good indication of performance.

Just take a look at the R600 (a massive amount of memory bandwidth at the time) and the Radeon HD 4870 X2, which pushes 2.4 TFLOPS.
 
My query is why can't software threading (fibre based) enjoy the same benefit? In software threading the scheduler would sleep/queue fibres.

IMO, if you are talking about fibers as in lightweight threads, the second a fiber has to check, via software, whether it can run, it has too much overhead. Also, a fiber context switch still has to restore its context, which could well be a full trip to/from L2. Switch granularity likely has to be coarse for good performance (in order to amortize the switch cost).
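
To put that overhead in concrete terms, here's a toy sketch (nothing real, the struct and all the names are invented) of what a software fibre scheduler's inner loop ends up doing - every pass pays for the readiness poll, and every resume pays for refilling the fibre's saved registers through the cache:

Code:
// Purely hypothetical sketch: the inner loop of a software fibre scheduler.
struct FibreContext {
    volatile bool ready;              // set externally when the awaited data arrives
    float regs[32];                   // saved live registers for this fibre
    void (*resume)(float* regs);      // continuation to run once data is ready
};

void run_fibres(FibreContext* fibres, int n) {
    int remaining = n;
    while (remaining > 0) {
        for (int i = 0; i < n; ++i) {
            if (!fibres[i].ready)     // software poll: pure overhead when the
                continue;             // fibre still isn't runnable
            fibres[i].resume(fibres[i].regs);   // "context restore" = reloading
            fibres[i].ready = false;            // regs[] from L1/L2
            --remaining;              // each fibre runs once in this toy example
        }
    }
}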

If you are talking about fibers as per LRB, it seems to me that LRB fibers are more like compiler loop unrolling. I thought the idea here was to, at compile time, compute an order of operations which would most likely ensure that when the fiber returns to execution it wouldn't have to check anything and could just continue executing (clearly, in the worst case, just stall).

I look at LRB (in terms of NVidia's hardware) as having a fixed set of 4 blocks (or work groups), with fixed round-robin scheduling between the warps of a block. Anything outside of this formula (besides just round-robin of more blocks) requires software overhead. With NVidia's hardware you have the flexibility of warps of a block running out of order without software overhead.

If we speculate that GT300 goes the route of having the ability to lock down a section of registers, shared memory, and warps (out of the 32/core maximum in GT200) for a "pinned block", then I'd bet they add something new in the hardware to ensure warps of a pinned block don't have to poll (i.e. wake up, check if they should run, then either run or sleep again). Perhaps a dedicated instruction to wait on data from a hardware queue, or to wait until a reserved line gets modified (in the case that they implemented writable caches).
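
For contrast, roughly what a "pinned" persistent block has to do today, without such an instruction, sketched in made-up CUDA (the queue layout and names are all invented; the point is only the wasted wake-check-spin passes):

Code:
// Speculative sketch only - none of this is an NVIDIA API.  A single "pinned"
// persistent block consumes a ring buffer that some producer (host or another
// kernel) appends to by advancing *queue_tail; the producer is assumed to make
// each item visible before bumping the tail.
__global__ void pinned_consumer(const int* queue, int queue_size,
                                volatile const int* queue_tail,  // advanced by the producer
                                volatile const int* exit_flag)
{
    __shared__ int claimed;
    int head = 0;                        // only this one block consumes, so the
                                         // read position can live in a register
    while (true) {
        if (threadIdx.x == 0) {
            if (*exit_flag)              claimed = -2;   // told to quit
            else if (head < *queue_tail) claimed = head++;
            else                         claimed = -1;   // nothing queued yet
        }
        __syncthreads();                 // every thread sees the same `claimed`
        if (claimed == -2)
            return;
        if (claimed >= 0) {
            int item = queue[claimed % queue_size];
            // ... all threads of the block process `item` here ...
        }
        __syncthreads();                 // before thread 0 overwrites `claimed`
        // claimed == -1 passes are the pure polling cost that a dedicated
        // "wait on hardware queue" / "wait on modified line" op would remove
    }
}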

The way I see it, where the atomic operation is performed doesn't impinge on whether the hardware runs 10s or 100s of contexts in order to hide latency, or whether software threading is used.

Where the atomic operation is performed is very important because it factors greatly into the overall throughput of atomic operations.
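
A simplified CUDA illustration of the point (kernel names invented): every thread hitting one global atomic serialises at whatever unit services global atomics, while accumulating in shared memory first (which GT200-class hardware already supports) and issuing one global atomic per block recovers most of the throughput.

Code:
__global__ void count_naive(const int* data, int n, int* counter)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && data[i] > 0)
        atomicAdd(counter, 1);              // every matching thread contends globally
}

__global__ void count_local_first(const int* data, int n, int* counter)
{
    __shared__ int local;
    if (threadIdx.x == 0) local = 0;
    __syncthreads();

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && data[i] > 0)
        atomicAdd(&local, 1);               // cheap: stays inside the core

    __syncthreads();
    if (threadIdx.x == 0)
        atomicAdd(counter, local);          // one global atomic per block
}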

I've got a number of issues with that paper:

Yeah, part of the problem here is that even beyond what you have mentioned and testing applications re-written for the different hardware options (cache/nocache, cache size, etc), one would also have to compare everything assuming the same base amount of silicon to implement all the hardware options...
 
I'm not convinced GFLOPS numbers and the amount of memory bandwidth are a good indication of performance.

Well, that's true when comparing architectures from different companies. But it's a safe bet that GT300 will scale at least linearly with the theoretical increases vs GT200 (in shader-bound scenarios).
 
Well, that's true when comparing architectures from different companies. But it's a safe bet that GT300 will scale at least linearly with the theoretical increases vs GT200 (in shader-bound scenarios).
Is it a safe bet to think that GT300 and GT200 are the same architecture?
I mean, ~2.5 TFLOPS is a 2.5x improvement over GT200 parts; that's a huge improvement.
There are two choices: either the chip is really, really huge, or Charlie has got some stuff right and Nvidia said goodbye to a bunch of fixed-function hardware.
 
IMO, if you are talking about fibers as in lightweight threads, the second a fiber has to check, via software, whether it can run, it has too much overhead. Also, a fiber context switch still has to restore its context, which could well be a full trip to/from L2. Switch granularity likely has to be coarse for good performance (in order to amortize the switch cost).
Clearly Larrabee is doing this already since 32 registers can't hold the entire state of a reasonably complex pixel shader for much more than a quad of pixels. So L1/L2 traffic is "continuous" from the point of view of a Larrabee core.

ATI also has coarse context switching - i.e. a preference for long clauses of TEX instructions and a minimal number of clauses, in total, over the lifetime of the shader. That's because switching adds latency to the duration of the shader (don't know how much).

The overhead cost for software scheduling is a trade of utility versus area to implement the funny stuff. e.g. for all we know Intel's put dedicated atomic functionality in the cache/bus as part of the cross-core TLB fabric.

If you are talking about fibers as per LRB, it seems to me that LRB fibers are more like compiler loop unrolling. I thought the idea here was to, at compile time, compute an order of operations which would most likely ensure that when the fiber returns to execution it wouldn't have to check anything and could just continue executing (clearly, in the worst case, just stall).

I look at LRB (in terms of NVidia's hardware) as having a fixed set of 4 blocks (or work groups), with fixed round-robin scheduling between the warps of a block. Anything outside of this formula (besides just round-robin of more blocks) requires software overhead. With NVidia's hardware you have the flexibility of warps of a block running out of order without software overhead.
There's basic hardware scheduling to cover latencies such as texture fetches, cache fetches (not sure) and branch misprediction.

Fragment shading is atomic per pixel. The scheduler task that dishes out pixel-shading/output-merger work has a scoreboard for these atomics. So between rasterisation and shading there's an "atomic queue". This is obviously fairly brutal and nowhere near as fine-grained as fibre-level sleeping, of course.

I don't know if Intel plans to routinely sleep individual fibres rather than threads (contexts). I can't tell how they've implemented global atomics. Hell, maybe there's no hardware support at all.

I don't really understand how texturing faults in Larrabee, but apparently it does, seemingly through the core's own TLB. I'm unclear if the thread sleeps or if the thread's expected to sleep fibres, or how things wake up - seemingly the thread issues a request to memory, perhaps the return of the data simply wakes the thread. Maybe the thread builds a list of faults before submitting them as a burst and then sleeping - so it doesn't attempt to run any fibres until they have all successfully gotten data.

Or maybe it can wake as texture results return then go back to sleep? Using a gather-register full of addresses where texture results are expected, the thread is woken when any of that data appears in L2/L1?:

Code:
// clause 1: issue 1 texture fetch per fibre
TEX: x1
repeat until TEXGatherAddressList is empty
    scan for a fibre whose data has arrived
    ALU: x6   // dependent instructions executed for that fibre

// clause 2: issue 3 texture fetches per fibre
TEX: x3
repeat until TEXGatherAddressList is empty
    scan for a fibre whose data has arrived
    ALU: x25  // dependent instructions executed for that fibre

I don't know if gather can effectively drive this sleep-until-woken thing. If it can, then it seems like a technique applicable to atomics too, wherever it is, out of the pipeline, that the atomics are actuated.

If we speculate that GT300 goes the route of having the ability to lock down a section of registers, shared memory, and warps (out of the 32/core maximum in GT200) for a "pinned block", then I'd bet they add something new in the hardware to ensure warps of a pinned block don't have to poll (i.e. wake up, check if they should run, then either run or sleep again). Perhaps a dedicated instruction to wait on data from a hardware queue, or to wait until a reserved line gets modified (in the case that they implemented writable caches).
This is purely about load-balancing in my view. A persistent kernel wakes up when the scoreboarder determines that overall throughput will suffer.

So I doubt the kernel does any polling as such.

Yeah, part of the problem here is that even beyond what you have mentioned and testing applications re-written for the different hardware options (cache/nocache, cache size, etc), one would also have to compare everything assuming the same base amount of silicon to implement all the hardware options...
Not to mention what clocks you can run it at.

Jawed
 
Is it a safe bet to think that GT300 and GT200 are the same architecture?
I mean, ~2.5 TFLOPS is a 2.5x improvement over GT200 parts; that's a huge improvement.
There are two choices: either the chip is really, really huge, or Charlie has got some stuff right and Nvidia said goodbye to a bunch of fixed-function hardware.

Well I think it's a safe bet that "512" refers to the number of scalar MADD ALUs. So in that regard it should scale linearly. But that says nothing about the other bottlenecks that may have shifted.
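
Quick back-of-the-envelope check of that scaling, counting MADD+MUL as 3 flops per ALU per clock (the usual convention for these parts) and taking the rumoured GT300 clocks from the top of the thread at face value:

Code:
// Theoretical single-precision throughput, counting MADD+MUL dual issue as 3 flops/clock.
float theoretical_gflops(int alus, float shader_mhz, float flops_per_clock) {
    return alus * shader_mhz * flops_per_clock / 1000.0f;
}
// GTX 280 (GT200):              240 * 1296 MHz * 3 = ~933 GFLOPS
// Rumoured GT300 (A1 samples):  512 * 1600 MHz * 3 = ~2458 GFLOPS  => roughly 2.6x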
 
Clearly Larrabee is doing this already since 32 registers can't hold the entire state of a reasonably complex pixel shader for much more than a quad of pixels. So L1/L2 traffic is "continuous" from the point of view of a Larrabee core.

With LRB I'd suspect that the actual register-to-SIMD-group (e.g. 4 quads) allocation changes throughout the shader, with a mix of areas (you could think of these as clauses) that are separately loop-unrolled and get different kinds of register allocation over the duration of the shader. Just a guess. But it might be good to think of register allocation as cross-clause (many fibers) rather than per shader.

If GT300 is some kind of MIMD or DWF however, I don't see this applying.

The overhead cost for software scheduling is a trade of utility versus area to implement the funny stuff. e.g. for all we know Intel's put dedicated atomic functionality in the cache/bus as part of the cross-core TLB fabric.

I wonder if they are using the x86 LOCK prefix for atomic operations.

Fragment shading is atomic per pixel. The scheduler task that dishes out pixel-shading/output-merger work has a scoreboard for these atomics. So between rasterisation and shading there's an "atomic queue". This is obviously fairly brutal and nowhere near as fine-grained as fibre-level sleeping, of course.

In CUDA one can take advantage of out-of-order warp execution. SIMD groups of quads could also execute out-of-order (might have to sync block prior to output however?). Also if NVidia does provide "warp as a task" functionality for GT300, then the finer warp granularity scheduling could indeed be useful.
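
Rough sketch of what I mean by exploiting out-of-order warps, with each warp acting as its own task (the kernel and the per-task work are placeholders; it also leans on warp-synchronous execution, which is fine on the hardware we're discussing but would need __syncwarp() on much later GPUs):

Code:
__global__ void warp_as_task(const float* tasks, int num_tasks,
                             int* next_task, float* results)
{
    __shared__ volatile int claimed[8];        // one slot per warp; assumes blockDim.x <= 256
    int warp = threadIdx.x >> 5;
    int lane = threadIdx.x & 31;

    while (true) {
        if (lane == 0)
            claimed[warp] = atomicAdd(next_task, 1);   // this warp grabs its own task
        // no __syncthreads(): each warp advances independently, in whatever
        // order the hardware scheduler happens to run it
        int t = claimed[warp];
        if (t >= num_tasks)
            return;
        results[t] = tasks[t] * 2.0f;          // stand-in for the real per-task work
    }
}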

I don't really understand how texturing faults in Larrabee, but apparently it does, seemingly through the core's own TLB. I'm unclear if the thread sleeps or if the thread's expected to sleep fibres, or how things wake up - seemingly the thread issues a request to memory, perhaps the return of the data simply wakes the thread. Maybe the thread builds a list of faults before submitting them as a burst and then sleeping - so it doesn't attempt to run any fibres until they have all successfully gotten data.

What about submitting entire cache lines full of offsets (and LOD, etc.) for 16 fetches as one unit? This would make sense with the idea of compiler cross-fiber register allocation (which I talked about above). Then have enough fibers issuing groups of 16 fetches such that when you return to the first fiber again, you can assume the TEX result data is in L2 (if it isn't there, the core takes an L2 miss and waits for the TEX result only on that hyperthread).
 
Fibering != unrolling, unless you want to have fun constantly thrashing the instruction cache.
 
The overhead cost for software scheduling is a trade of utility versus area to implement the funny stuff. e.g. for all we know Intel's put dedicated atomic functionality in the cache/bus as part of the cross-core TLB fabric.
I'm not sure what dedicated functionality there would be. Atomic operations resolve to control of cache lines, and this has been rolled into a CPU architecture's coherence scheme. The dedicated logic in this case would be the cache controller. Atomic operations, as far as the actual ALU work is concerned, are nothing special once the cache line statuses have been handled.

Usually, whatever the TLB would be doing with regards to page status has already been done, as the CPU first loads the cache line with the atomic variable, then tries to assert ownership. Involving the TLB past the initial load attempt is begging to cause an exception or interrupt related to page table or the TLB needing to be filled, which would run counter to the desire to keep atomic ops short.
I'm not sure what a TLB fabric would entail.

I don't really understand how texturing faults in Larrabee, but apparently it does, seemingly through the core's own TLB. I'm unclear if the thread sleeps or if the thread's expected to sleep fibres, or how things wake up - seemingly the thread issues a request to memory, perhaps the return of the data simply wakes the thread. Maybe the thread builds a list of faults before submitting them as a burst and then sleeping - so it doesn't attempt to run any fibres they have all successfully gotten data.
The statements concerning this indicate the core is in charge of handling pages not in memory, a hopefully rare occurrence for Larrabee. My impression was more that the core itself handles paging things in, which is more than just the TLB, as this involves creating page table entries, which is something a simple texture unit might not be able to do.

Or maybe it can wake as texture results return then go back to sleep? Using a gather-register full of addresses where texture results are expected, the thread is woken when any of that data appears in L2/L1?:
Data doesn't just appear in the L2/L1 in a coherent cache scheme, barring some very non-x86 and possibly no-CPU type cache behavior.
The data would have to have been pulled in by the core, which is currently only guessing when the data will have actually been added--absent some kind of special interrupt that has not yet been disclosed.
A low-rent scheme would be for the texture units to write out 16-bit error flags per each qquad's texture access.
Set that flag as a bit mask for a standard gather using the base addresses of each texture access.

If all bits are 0, the operation is skipped.
For all bits set to 1, each gather load is a standard x86 load, which will hit the fault and initiate the standard x86 handler.
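
In scalar, pseudocode-ish C, just to pin the idea down (this is not Larrabee code; the function and everything in it is invented):

Code:
// Scalar paraphrase of the flag-mask idea above.  The texture unit is assumed
// to have left a 16-bit fault mask for the qquad; the shader then touches only
// the faulted addresses with ordinary loads, so the standard x86 fault/handler
// path fires for exactly those lanes.
void retouch_faulted_texels(unsigned short fault_mask,
                            const float* base_addr[16],
                            float out[16])
{
    if (fault_mask == 0)
        return;                               // nothing faulted: skip the whole pass
    for (int lane = 0; lane < 16; ++lane)
        if (fault_mask & (1u << lane))
            out[lane] = *base_addr[lane];     // plain load; a not-present page takes
                                              // the usual fault path
}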
 
Fibering != unrolling, unless you want to have fun constantly thrashing the instruction cache.

Oops, so perhaps scratch my idea of grouped clause register allocation... seemed as if it might work when all hyperthreads are running the same shader, but perhaps not!
 
Isn't the shader clock supposed to be twice the core clock in G80 and GT200? Clearly not here, if this is true. I am a bit sceptical that it has been ditched. :???:

Nope, it's at least twice the core clock. Most configs have shaders running at 2.2-2.5x the core clock. For example my GTX 285 is at 700/1585 right now.
 
I wonder if they are using the x86 LOCK prefix for atomic operations.
The first Larrabee paper stated that communication among the four hardware threads of a core went through a queue updated with the CMPXCHG instruction without using the LOCK prefix. This is possible because the four logical threads (1 FE and 3 BE in Intel's nomenclature) are pinned to the core's hardware contexts, and thus CMPXCHG works atomically even without the LOCK prefix. This was done to avoid the unnecessary cache-coherency overhead of such an operation. Communication with the outside world, on the other hand, requires primitives involving the LOCK prefix and suffering the full cost of a cache-coherency broadcast.
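
For the GPU-minded, a loose CUDA analogy of that split (not Larrabee code, just the same cheap-local versus expensive-global idea): synchronisation that stays inside one block/core can use shared-memory atomics, while anything the rest of the chip must observe pays for a global atomic.

Code:
__device__ int g_ticket;                     // chip-wide counter (the "LOCKed" case)

__global__ void local_vs_global(int* out)
{
    __shared__ int s_ticket;                 // visible only within this block
    if (threadIdx.x == 0) s_ticket = 0;
    __syncthreads();

    int local_id  = atomicAdd(&s_ticket, 1);  // stays on-chip, no global traffic
    int global_id = atomicAdd(&g_ticket, 1);  // full-cost trip to the global atomic units

    out[blockIdx.x * blockDim.x + threadIdx.x] = local_id + global_id;
}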
 
I'm not sure what dedicated functionality there would be. Atomic operations resolve to control of cache lines, and this has been rolled into a CPU architecture's coherence scheme. The dedicated logic in this case would be the cache controller. Atomic operations, as far as the actual ALU work is concerned, are nothing special once the cache line statuses have been handled.
I was dallying with the thought of the programmer having to roll their own global atomics :oops: forgetting that cache coherence + line-locking should lie at the heart of this.

Usually, whatever the TLB would be doing with regards to page status has already been done, as the CPU first loads the cache line with the atomic variable, then tries to assert ownership. Involving the TLB past the initial load attempt is begging to cause an exception or interrupt related to page table or the TLB needing to be filled, which would run counter to the desire to keep atomic ops short.
I'm not sure what a TLB fabric would entail.
:oops: I should have just referred to the cache-coherency fabric. I was mixing the need for virtualised addressing common to all cores into cache-coherency communication, but I was referring to the former.

For 32 cores on the same die to all cooperate in a single virtual address space there does need to be some kind of TLB fabric though, doesn't there? I suppose they're all caching the same page table, which implies changes to the page table have to be atomic, but that's a whole other kettle of fish. I don't know how the "Pentium core" that Larrabee's based on does this, or how scalable it is.

The statements concerning this indicate the core is in charge of handling pages not in memory, a hopefully rare occurrence for Larrabee. My impression was more that the core itself handles paging things in, which is more than just the TLB, as this involves creating page table entries, which is something a simple texture unit might not be able to do.
Hmm, I was forgetting that D3D programmers currently have to implement their own virtualised texture scheme and that it wouldn't work like this for normal textures, so this is looking ahead to a future D3D, or just at people like Sweeney.

Though I'm wondering what happens when "AGP texturing" is required...

Data doesn't just appear in the L2/L1 in a coherent cache scheme, barring some very non-x86 and possibly no-CPU type cache behavior.
The data would have to have been pulled in by the core, which is currently only guessing when the data will have actually been added--absent some kind of special interrupt that has not yet been disclosed.
A low-rent scheme would be for the texture units to write out 16-bit error flags per each qquad's texture access.
Set that flag as a bit mask for a standard gather using the base addresses of each texture access.

If all bits are 0, the operation is skipped.
For all bits set to 1, each gather load is a standard x86 load, which will hit the fault and initiate the standard x86 handler.
I was thinking that there's an addressing scheme for texture results, e.g. a block of addresses are reserved for 16 quads of texture results returned by the TU. A portion of L2 acts as a stream-through buffer for these texture results. But that's mostly a sideline issue for this topic.

I suppose the chip works in one of two modes: real and virtual addressing. Normally the TUs fetch texels from flat addresses and return results in a real-addressed block (which is the L2 stream-through buffer).

Slides 12 and 33 here:

http://s08.idav.ucdavis.edu/forsyth-larrabee-graphics-architecture.pdf

indicate that there are TU TLBs, which allow it to work independently of the core, fetching page table entries if need be and managing its own page load requests. The owning thread is oblivious to all this stuff (though the core may receive mirrored TLB entries if they're changed?) and just ends up stalling when the texture results don't appear.

Alternatively the programmer can elect to have hard faults activated, which seems to mean that TU-TLB is a mirror of core-TLB. I guess as soon as a TU-TLB entry miss occurs or a page is listed as not in physical RAM, the TU abandons the TEX. The TU fires the address back to the core which sleeps the thread and then services either the missing TLB entry or the page load, or both.

When the page is ready to be used by the TU the core wakens the thread which re-submits the TEX.

Presumably textures are striped into pages by mip-level, and each texture-mip doesn't cross page boundaries. This means there's only one page fault per TEX instruction per mip level - so in theory you could get two or more page misses for a single TEX, but you wouldn't be generating a miss per pixel in the qquad.

So overall I don't think this would use the gather mechanic I described before - it seems like it's just a paging mechanic. Though I suppose you could use gather with page-base addresses, hmm...

One of the interesting things here is that a TU-TLB is logically a mirror of 8 cores' TLBs, when hard-faulting is active. And each core's TLB is logically 4-way threaded too (though I'm not sure if Larrabee actually threads core-TLB). Whether hard- or soft-faulting there's 32 cores and 8 TUs all cooperating in page table maintenance.

Jawed
 
Isn't the shader clock supposed to be twice the core clock in G80 and GT200? Clearly not here, if this is true. I am a bit sceptical that it has been ditched. :???:

ALU:TMU frequency ratio

G80= 2.35x
G92= 2.49x
GT200b = 2.27x
(hypothetical G300) = 2.29x
 
With LRB I'd suspect that the actual register-to-SIMD-group (e.g. 4 quads) allocation changes throughout the shader, with a mix of areas (you could think of these as clauses) that are separately loop-unrolled and get different kinds of register allocation over the duration of the shader. Just a guess. But it might be good to think of register allocation as cross-clause (many fibers) rather than per shader.
To be honest I don't think of the 32 registers as anything other than pipeline registers. A shader's state registers, r0, r1 etc. map to addresses. How the compiler or programmer maps between state-register-addresses and pipeline-registers is completely flexible, e.g. if shader execution is emulated as clause-by-clause loops over the fibres, then the body of each loop has a fixed configuration of instructions<->registers, so shader state registers must all be mapped consistently into these registers. If the shader is formed as some chaotic intermingling of fibres :oops: then, erm, well leave that to the compiler.

If GT300 is some kind of MIMD or DWF however, I don't see this applying.
It should be the operand collector's problem - i.e. it translates warp+thread register-address into dynamic-warp lane and all that gritty stuff.

In CUDA one can take advantage of out-of-order warp execution. SIMD groups of quads could also execute out-of-order (might have to sync block prior to output however?). Also if NVidia does provide "warp as a task" functionality for GT300, then the finer warp granularity scheduling could indeed be useful.
Yep, having warps as contexts in their own right works very nicely - though any kind of render target RMW, intra-shader, obviously causes all sorts of fun as you're trying to construct a scoreboarder of your own using a task-warp as an atomic-singleton. Certainly not saying it won't work, but clearly this ends up looking a lot like Larrabee. Here I'm talking about long-winded RMW clauses within a shader, e.g. fetch pixel on instruction 1 with 10 or 20 dependent instructions before returning a result to the render target.

Though I have to admit the way D3D-11 render target RMW is described seems to imply that the developer can't make this atomic. If atomic is desired then an extra pass is required using D3D-CS and, optionally, atomic - depends on whether CS is consuming an append buffer with pixel address+value pairs or whether it's trying to post-process a render target, I guess.

So render-target RMW seems to be un-ordered chaos with fingers-crossed - prolly won't see much use?

Obviously CUDA's different and we're now just waiting to see if persistent warps are a possibility.

What about submitting entire cache lines full of offsets (and LOD, etc.) for 16 fetches as one unit? This would make sense with the idea of compiler cross-fiber register allocation (which I talked about above). Then have enough fibers issuing groups of 16 fetches such that when you return to the first fiber again, you can assume the TEX result data is in L2 (if it isn't there, the core takes an L2 miss and waits for the TEX result only on that hyperthread).
Yeah, this kind of thing seems to be the usual case of hiding latency by using a combination of fibres/qquads with ALUs in the shadow of the batch of requests + SMT. If the thread runs out of ALUs to run it means that the task-allocation routine has made the screen space tile too small. I suppose this is inevitable sometimes no matter what - after all GPUs can't always hide worst case texturing latency (e.g. perlin noise based on 3D texture lookups).

Jawed
 
For 32 cores on the same die to all cooperate in a single virtual address space there does need to be some kind of TLB fabric though, doesn't there? I suppose they're all caching the same page table, which implies changes to the page table have to be atomic, but that's a whole other kettle of fish. I don't know how the "Pentium core" that Larrabee's based on does this, or how scalable it is.
The page table is just data in memory, granted it has system-level significance.
The various cores and tex units might cache parts of it at any given time in their TLBs. There's no need in x86 for a TLB fabric, as they are caches that can stay coherent like their neighbor L1 and L2s.
Since it is system-critical, changing the page table would require additional work that would be serializing. As AMD's Barcelona chip showed, there are a number of actions related to TLBs where there are assumed to be atomic operations on page table structures. The failure to handle them atomically is basically game-over.

It might be fun to see how much work Intel has done in verifying that Larrabee can handle TLB updates across so many cores, given what AMD experienced with just four. TLB-related errata concerning non-atomic updates are documented as still being present in Intel and AMD designs, but they are typically prevented by microcode patches.

It would be interesting to see just how much the texture units can modify the page tables. Maybe they can update some of the bookkeeping bits per entry. If they write to the L2 buffer, it might have to be labelled dirty, or maybe it has to be already initialized to a fixed status by the core.
That the texture unit gives up on a fault makes sense as modifying the actual page table is an OS-level operation.
I would assume Larrabee's software sets up as much as it can ahead of time and tries to keep it as unchanged as possible, given the overhead.

Slides 12 and 33 here:

http://s08.idav.ucdavis.edu/forsyth-larrabee-graphics-architecture.pdf

indicate that there are TU TLBs, which allow it to work independently of the core, fetching page table entries if need be and managing its own page load requests. The owning thread is oblivious to all this stuff (though the core may receive mirrored TLB entries if they're changed?) and just ends up stalling when the texture results don't appear.
I wouldn't expect the core to receive mirrored TLB entries, if the texture unit somehow modifies them.
An alteration would invalidate all cached copies of the page table entry. The CPU, if it were to require that entry, would miss and have to fill the TLB before trying to complete the memory access.
It should be on-chip, as the texture unit would have a copy. It might be that the entries are mostly not modified by the texture units to avoid TLB thrashing.

Alternatively the programmer can elect to have hard faults activated, which seems to mean that TU-TLB is a mirror of core-TLB.
My take was that texture units have more specialized TLB behavior that allows them to behave in a rather non-x86 manner. The texture unit can, at programmer discretion, give up when a full core would be required to service a miss or fault.
This might make good performance sense, as fiddling with the TLB can inject unpredictable latencies.
I would think that the texture unit with hard faults enabled would still defer to the core if it encounters a fault that invokes an OS routine.

One of the interesting things here is that a TU-TLB is logically a mirror of 8 cores' TLBs, when hard-faulting is active. And each core's TLB is logically 4-way threaded too (though I'm not sure if Larrabee actually threads core-TLB). Whether hard- or soft-faulting there's 32 cores and 8 TUs all cooperating in page table maintenance.

I didn't get the impression that there was any mirroring of TLBs in the pdf. What leads to this conclusion? TLB entries can be shared, but an update to one TLB's entry will typically only lead to the invalidation of old copies cached elsewhere. Broadcasting an update wouldn't normally be done. The other cores would just have to service a miss, if they happen to need the address.

TLBs are typically a shared resource local to a core. Given how intimately they are tied to the memory pipeline and how many there would be if they were per-thread, I'd bet this hasn't changed.
 