G80 Architecture from CUDA

Stacks are hard on the GPU. Even with the PDC (Parallel Data Cache), you have to share the space with all threads in the warp, and you have to be careful about conflicts on bank access. In the GPGPU community, we adapt data-structure traversal to support "restart" or "backtrack" methods; see Foley et al.'s paper from Graphics Hardware last year or Horn et al.'s paper from the upcoming I3D, both on k-D tree raytracing. The latter emulates a small fixed-size stack using the register file, using looping constructs instead of pure streaming.

With scatter and gather, you could emulate stacks in GPU memory (and even in host memory on ATI chips with fine-grained scatter), but it becomes *extremely* expensive. You are now talking about tremendous amounts of latency to cover, and you still have to define a bounded region of memory for each thread, basically allocating for the worst-case stack usage. Someone could probably extend Lefohn's Glift GPU data-structure work to make this easier to use, but it's likely still going to be expensive.
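To make the short-stack idea concrete, here's a minimal sketch (my own illustration, not code from either paper; the 4-entry depth and all names are invented): a tiny per-thread stack lives in a local array the compiler can keep in registers. On overflow the oldest entry is dropped, and when a pop later comes up empty the traversal loop restarts from the root with a tightened t-range instead of terminating.
Code:
#define SHORT_STACK_SIZE 4

struct ShortStack {
    int entries[SHORT_STACK_SIZE];  // node indices awaiting a visit
    int count;                      // number of valid entries
    int dropped;                    // set once an entry has been discarded

    __device__ void init() { count = 0; dropped = 0; }

    __device__ void push(int node) {
        if (count == SHORT_STACK_SIZE) {        // full: forget the oldest
            for (int i = 1; i < SHORT_STACK_SIZE; ++i)
                entries[i - 1] = entries[i];
            --count;
            dropped = 1;                        // a restart will be needed
        }
        entries[count++] = node;
    }

    // Returns the next node, or -1 when empty. If 'dropped' is set at
    // that point, the caller restarts from the root (re-descending with
    // the t-range advanced past what's already been visited) rather
    // than declaring the traversal finished.
    __device__ int pop() {
        return (count > 0) ? entries[--count] : -1;
    }
};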

The main issue with recursion or stacks is that the memory requirements are unbounded. On the CPU this really isn't a problem, but once you have to handle 100s-1000s of threads, as on a GPU (or a CPU some way into the future), the required space gets quite high.

I have now done both a grid-based and a BIH-tree-based raytracer in CUDA. Both are pretty naive, giving me about 1 and 3 MRays/sec respectively (grid/BIH) for million-polygon models. I simply put the stacks for the BIH one in shared memory; there is enough space there for one stack per thread. I probably have shitloads of bank conflicts though, and lots of divergent execution :(
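For what it's worth, the layout matters a lot for those bank conflicts. A sketch of the idea (my numbers: 64 threads per block, a 32-entry stack per thread, so 64 * 32 * 4 bytes = 8KB of the 16KB shared memory): interleave each thread's stack entries by thread index, so the 16 threads of a half-warp always hit 16 different banks, instead of giving each thread a contiguous slice, which lands every thread in the same bank whenever their stack pointers agree.
Code:
#define NTHREADS    64   // threads per block (assumed)
#define STACK_DEPTH 32   // entries per thread: 64 * 32 * 4 bytes = 8KB

__global__ void traverseBIH(/* ... scene and rays ... */)
{
    __shared__ int stack[STACK_DEPTH * NTHREADS];
    int sp = 0;   // per-thread stack pointer

    // Entry i of thread t lives at stack[i * NTHREADS + t]: the threads
    // of a half-warp then address 16 consecutive words, i.e. 16 distinct
    // banks. The "obvious" stack[t * STACK_DEPTH + i] layout puts every
    // thread in the same bank whenever their stack pointers match.
    stack[sp++ * NTHREADS + threadIdx.x] = 0;             // push root

    while (sp > 0) {
        int node = stack[--sp * NTHREADS + threadIdx.x];  // pop
        // ... intersect node, push far child / descend near child ...
        (void)node;   // placeholder so the sketch compiles
    }
}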
 
The CUDA website was just updated with an XLS spreadsheet for calculating occupancy: http://developer.download.nvidia.com/compute/cuda/CUDA_Occupancy_calculator.xls
Ooh, that's great, I bet it will clear up a lot of heartache! Very nice.

Now the interesting part: click the GPU data tab. There you find:
Code:
GPU:                                            G80
Multiprocessors per GPU                         16
Threads / Warp                                  32
Warps / Multiprocessor                          24
Threads / Multiprocessor                        768
Thread Blocks / Multiprocessor                  8
Total # of 32-bit registers / Multiprocessor    8192
Shared Memory / Multiprocessor (bytes)          16384

Nice little run-down of the main stats. Has anything like this been released publicly before, so concisely?
No.

So, I make it 512KB of register file across the entire GPU = 16 multiprocessors * 8192 registers * 4 bytes.

Jawed
 
For comparison, I think G71 is capable of supporting four 128-bit registers per thread, with 880 threads per quad processor: 6 quad processors * 880 threads * 4 registers * 16 bytes = 330KB.

It's worth noting that registers in G80 are "single channel" 32-bits, i.e. one of RGBA, not RGBA as a unit as we're used to thinking of them from previous GPUs. Clearly that lines up with the "scalar" mindset of G80.

It makes me wonder if the pipeline FIFO structure in G7x and older is, actually, the "register file". i.e. there is no register file as such, all registers are held in the "ever circulating FIFO". Hmm...

Jawed
 
It makes me wonder if the pipeline FIFO structure in G7x and older is, actually, the "register file". i.e. there is no register file as such, all registers are held in the "ever circulating FIFO". Hmm...

Jawed

I think you're right; I remember thinking something like that for GeForce FX... the number of 'in-flight quads' within the pipeline at any one time was the quotient of the register file size and the amount of register space a quad takes up.

So FP16 was faster (than FP32) because it was possible to pack two FP16 temporary registers into the same space as a single FP32 temporary register, thus reducing register file pressure (which allowed more 'in-flight quads' within the pipeline).
 
I also think texturing-related data (e.g. barycentrics?) may have space allocated in the FIFO. There's a statement in a PS3-RSX presentation that the number of allowable registers doubles if the shader performs no texturing. I presume this behaviour translates back to NV4x GPUs.

So, the memory for all the FIFOs in the GPU might actually be twice as large as I've suggested above.

Jawed
 
So FP16 was faster (than FP32) because it was possible to pack two FP16 temporary registers into the same space as a single FP32 temporary register, thus reducing register file pressure (which allowed more 'in-flight quads' within the pipeline).
You're mixing 2 different ideas here, imho.
You reduce register pressure when you load less data; you have more threads in flight when you allocate less data per thread.
Going from full precision to half precision addresses both problems :)
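A toy worked example (all numbers invented for illustration): with 64KB of register storage per quad pipeline and four vec4 temporaries live per pixel, FP32 costs 4 * 16 = 64 bytes per pixel, or 256 bytes per quad, so 256 quads fit in flight; demoting those temporaries to FP16 halves that to 128 bytes per quad and doubles the in-flight quads to 512.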
 
It makes me wonder if the pipeline FIFO structure in G7x and older is, actually, the "register file". i.e. there is no register file as such, all registers are held in the "ever circulating FIFO". Hmm...
I think that's true for R300/R400, but the way the NV4x/G7x/RSX scale gracefully with more data per pixel makes me doubt that.

Your FIFO would have to vary (quite finely) in width to get this sort of characteristic. As data shuffles through, you'd need a switch to determine where exactly the data would go next at each stage in the FIFO. I think it's easier to just keep the data stationary and access the blocks you need.

Conceptually, though, it's the same thing. The way the shader units process the data is still FIFO-like whether the data circulates in a physical FIFO or resides in a cache-like register file.
 
I think that's true for R300/R400, but the way the NV4x/G7x/RSX scale gracefully with more data per pixel makes me doubt that.

Your FIFO would have to vary (quite finely) in width to get this sort of characteristic. As data shuffles through, you'd need a switch to determine where exactly the data would go next at each stage in the FIFO. I think it's easier to just keep the data stationary and access the blocks you need.
I imagine the FIFO as broken into sections:
  • first ALU
  • TMU (perhaps 2 sections, LOD/Bias and texel-fetch/filter)
  • second ALU
with whatever appears at the head of each section feeding the computation stage that follows, and the succeeding section's tail being written with the full "state" the stage produces (including "unchanged registers").

But my concept of a FIFO requires variable-rate clocking (for the FIFO sections considered as a unit, i.e. all at the same rate) to cope with the fact that the pipeline supports more pixels in flight when the register allocation is low (e.g. 4x the pixels when one FP32 register is allocated) and fewer when it is high (e.g. 1/8th the pixels when 32 FP32s are allocated). Either that, or each computation stage has variable-length delays within it to align data as it's fetched and stored. Maybe both techniques have to be combined to cover the full range of possibilities?

Conceptually, though, it's the same thing. The way the shader units process the data is still FIFO-like whether the data circulates in a physical FIFO or resides in a cache-like register file.
Thinking more about this, the register fetch limit seems to imply that a single-ported fetch from the register file can support whatever allocation of registers is required. As long as the fetch works on 4x 128-bit units, any allocation of registers from 1 to 32 can be accommodated, since the throughput of the pipeline varies with the register allocation.

e.g. if 32 FP32s are allocated per pixel, then the pipeline has 8 clocks to fetch the 4 FP32s it can consume. So the computation unit would have to align the data coming out of these successive fetches (i.e. use "holds").

So, on balance, it seems to me the simplest approach is probably the register file with a single 512-bit port per pixel (ganged for the entire quad, if desired, I suppose).

Jawed
 
A separate thing I've been wondering about relates to the number of registers supported by a multiprocessor. With 8192 registers (each 32-bit) supported, an SM4 shader program that allocated 4096 vec4 FP32s (128 bits each) could only hold 2048 of those registers in the register file at any one time (even assuming a single pixel is scheduled for execution!).

This implies that the driver would have to "swap" portions of the shader code (and state) to video memory at boundaries of the subsets of the register allocation.

I wonder if the case of >2048 allocated vec4 registers is considered so unlikely that it's simply not been coded for. In truth, the limit presumably bites much sooner: for a single warp of 32 threads it's 8192 / 32 = 256 scalar registers per thread, so anything over 64 allocated vec4 registers would already overflow the file.

Or, perhaps the general trickery of vec4->scalar instruction scheduling allows the driver to get round this issue quite easily?

Playing with the CUDA Occupancy Calculator indicates a hard limit of 64 32-bit registers per thread, which is effectively one quarter of what I was expecting. Hmm...
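(By the spreadsheet's own figures, that limit also bites on occupancy: a kernel using the full 64 registers per thread can keep at most 8192 / 64 = 128 threads, i.e. 4 warps, resident per multiprocessor, against the 768-thread / 24-warp maximum.)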

Jawed
 
Playing with the CUDA Occupancy Calculator indicates a hard limit of 64 32-bit registers per thread, which is effectively one quarter of what I was expecting. Hmm...
Dunno about GPGPU stuff, but with <graphics shaders> it's really, really hard to use more than 5 or 6 128-bit registers (especially if you don't use DB...). It's easy to construct cases where you use more than that, but I doubt they're very practical
 
The four-light shader from Far Cry uses more registers than that (seemingly .xyz most of the time, though).

But yeah, the "limits" in SM4 do seem spectacularly over-the-top in comparison with what SM3 allows.

Still, I'd be curious to see what happens to G80 if SM3's full 32 128-bit registers are assigned, since the CUDA documentation implies it can only support half that. The limitation might be CUDA-specific, or it might be a bug in the spreadsheet... :???:

I wonder if this is partly the problem the Folding@Home guys are having trying to get the Brook DX9 code to run on G80: maybe it's trying to use all 32 128-bit registers it's allowed under SM3, and G80 is refusing the code or falling over when it "flips" between partitions of registers.

Jawed
 
I expect the compiler can spill registers to thread-local off-chip memory if needed, allowing it to support the full complement of 4096 vec4 "registers". It could trade off fast register access against threads in flight... This would also be necessary for dynamically indexed local arrays -- I've never seen a real register file that supports that.

Take a look at section 4.2.2.4 in the CUDA programming guide; it sounds like the CUDA compiler is doing exactly this.
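For illustration, a trivial sketch of the kind of code that forces this (kernel and variable names are my own invention): the per-thread array below is indexed with a value only known at runtime, so the compiler cannot map it onto the register file and places it in thread-local, off-chip "local" memory instead.
Code:
__global__ void spillExample(const int* idx, float* out)
{
    float scratch[32];                   // per-thread array

    for (int i = 0; i < 32; ++i)         // fill with something
        scratch[i] = i * 2.0f;

    int j = idx[threadIdx.x] & 31;       // index unknown at compile time
    out[threadIdx.x] = scratch[j];       // dynamic indexing -> the array
                                         // is compiled to local memory
}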
 
I wouldn't expect performance to hold up for an SM4 shader program that has hundreds or thousands of registers defined. I only asked the question originally because it's what D3D10 specifies as a requirement of all hardware, no ifs, buts or maybes ... (erm, supposedly :LOL: )

But it is curious that G80 CUDA appears to run out of registers at 16 vec4s. That's not even what SM3 specifies - which is why I suspect a bug in the spreadsheet.

Anyway, shared memory provides a handy "overspill".

Jawed
 
The number of defined registers isn't significant from a performance standpoint. What counts is the maximum number of concurrently live registers, and compilers can do lots of transformations to reduce that: data values can be refetched, results recomputed, computations reordered, results spilled to storage and so on. 64 live registers seems quite enough to me.
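A toy example of the distinction (hypothetical code, not from any real shader): both functions define the same values, but the second recomputes x * x + 1.0f at its point of use instead of carrying it live across the loop, so it needs one register fewer while the loop runs - the sort of rematerialization a compiler applies automatically.
Code:
__device__ float keepLive(const float* data, int n, float x)
{
    float a = x * x + 1.0f;        // defined here, live across the loop
    float sum = 0.0f;
    for (int i = 0; i < n; ++i)
        sum += data[i];
    return sum * a;                // ...and only consumed here
}

__device__ float rematerialize(const float* data, int n, float x)
{
    float sum = 0.0f;
    for (int i = 0; i < n; ++i)
        sum += data[i];
    return sum * (x * x + 1.0f);   // recomputed at the point of use
}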
 