G80 Architecture from CUDA

Discussion in 'Architecture and Products' started by Rufus, Feb 16, 2007.

  1. ector

    Newcomer

    Joined:
    Nov 3, 2002
    Messages:
    111
    Likes Received:
    2
    Location:
    Sweden
    I have now done both a grid-based and a BIH-tree-based raytracer in CUDA. Both are pretty naive, giving me about 1 / 3 MRays/sec (grid / BIH) for million-polygon models. I simply put the stacks for the BIH one in shared memory; there is enough space there for one stack per thread. I probably have loads of bank conflicts though, and lots of divergent execution :(
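    For what it's worth, the bank conflicts from per-thread stacks are often avoidable by interleaving the stacks rather than storing each one contiguously. A toy Python sketch of the addressing, assuming G80's 16 banks of 32-bit words and a 16-thread half-warp (the stack depth is made up for illustration):

```python
# Sketch: why contiguous per-thread stacks in shared memory cause bank
# conflicts, and how interleaving avoids them. Assumes 16 banks of 32-bit
# words and a 16-thread half-warp, as on G80. Hypothetical layout only.

NUM_BANKS = 16
HALF_WARP = 16
STACK_DEPTH = 8  # words per thread stack (illustrative)

def bank(word_index):
    return word_index % NUM_BANKS

# Layout A: each thread's stack is a contiguous block of STACK_DEPTH words.
# Threads pushing at the same depth hit addresses STACK_DEPTH words apart,
# so with STACK_DEPTH = 8 only 2 distinct banks are touched.
def banks_contiguous(depth):
    return {bank(tid * STACK_DEPTH + depth) for tid in range(HALF_WARP)}

# Layout B: interleaved -- word d of thread t lives at d * HALF_WARP + t.
# Threads at the same depth then touch 16 consecutive words: all 16 banks.
def banks_interleaved(depth):
    return {bank(depth * HALF_WARP + tid) for tid in range(HALF_WARP)}

print(len(banks_contiguous(0)))   # 2 banks  -> 8-way conflicts
print(len(banks_interleaved(0)))  # 16 banks -> conflict-free
```

    With the interleaved layout, threads at the same stack depth touch one word per bank, so a half-warp's pushes and pops serialise no further.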
     
  2. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    11,716
    Likes Received:
    2,137
    Location:
    London
    Ooh, that's great, I bet it will clear up a lot of heartache! Very nice.

    No.

    So, I make it 512KB of register file across the entire GPU = 16 multiprocessors * 8192 registers * 4 bytes.
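    The arithmetic above, spelled out (a trivial sketch using only the figures from this post):

```python
# G80 register file estimate: 16 multiprocessors, each with 8192 32-bit registers.
multiprocessors = 16
regs_per_mp = 8192
bytes_per_reg = 4  # 32-bit

total_bytes = multiprocessors * regs_per_mp * bytes_per_reg
print(total_bytes // 1024)  # 512 (KB)
```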

    Jawed
     
  3. PeterAce

    Regular

    Joined:
    Sep 15, 2003
    Messages:
    490
    Likes Received:
    10
    Location:
    UK, Bedfordshire
    For comparison how large (in KB) is G70's register file(s)?
     
  4. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    11,716
    Likes Received:
    2,137
    Location:
    London
    I think it's capable of supporting four 128-bit registers per thread, with 880 threads per quad processor. 6 quad processors * 880 * 4 * 16 bytes is 330KB.
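    Spelling out that estimate (figures as quoted in this post; nothing here is measured):

```python
# G70 register file estimate: 6 quad processors, 880 threads each,
# four 128-bit (vec4 FP32) registers per thread.
quads = 6
threads_per_quad = 880
regs_per_thread = 4
bytes_per_reg = 16  # 128-bit vec4

total_bytes = quads * threads_per_quad * regs_per_thread * bytes_per_reg
print(total_bytes // 1024)  # 330 (KB)
```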

    It's worth noting that registers in G80 are "single channel" 32-bits, i.e. one of RGBA, not RGBA as a unit as we're used to thinking of them from previous GPUs. Clearly that lines up with the "scalar" mindset of G80.

    It makes me wonder if the pipeline FIFO structure in G7x and older is, actually, the "register file". i.e. there is no register file as such, all registers are held in the "ever circulating FIFO". Hmm...

    Jawed
     
  5. PeterAce

    Regular

    Joined:
    Sep 15, 2003
    Messages:
    490
    Likes Received:
    10
    Location:
    UK, Bedfordshire
    I think you're right; I remember thinking something like that for GeForce FX: the number of 'in flight quads' within the pipeline at one time was the quotient of the register file size and the amount of memory a quad takes up.

    So FP16 was faster (than FP32) because it was possible to pack two FP16 temporary registers in the same space as a single FP32 temporary register, thus reducing register file pressure (which allowed more 'in flight quads' within the pipeline).
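    That relationship can be sketched numerically. A toy Python model — the register file size is the 330KB G70 estimate from earlier in the thread, and the temporary counts are made up:

```python
# Toy model of 'in flight quads' ~= register file size / per-quad footprint.
# Halving the per-temporary size (FP16 packing) doubles the quads in flight.

REGFILE_BYTES = 330 * 1024  # the G70 estimate from earlier in the thread

def quads_in_flight(temps_per_pixel, bytes_per_temp):
    bytes_per_quad = 4 * temps_per_pixel * bytes_per_temp  # 4 pixels per quad
    return REGFILE_BYTES // bytes_per_quad

fp32 = quads_in_flight(4, 16)  # four vec4 FP32 temporaries per pixel
fp16 = quads_in_flight(4, 8)   # same temporaries packed as FP16
print(fp16 // fp32)  # 2 -- twice as many quads in flight
```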
     
  6. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    11,716
    Likes Received:
    2,137
    Location:
    London
    I also think texturing related data (e.g. barycentrics?) may have space allocated in the FIFO. There's a statement in a PS3-RSX presentation that the number of allowable registers doubles if the shader performs no texturing. I presume this behaviour translates back to NV4x GPUs.

    So, the memory for all the FIFOs in the GPU might actually be twice as large as I've suggested above.

    Jawed
     
  7. nAo

    nAo Nutella Nutellae
    Veteran

    Joined:
    Feb 6, 2002
    Messages:
    4,400
    Likes Received:
    440
    Location:
    San Francisco
    You're mixing two different ideas here, imho.
    You reduce register pressure when you load less data; you have more threads in flight when you allocate less data per thread.
    Going from full precision to half precision addresses both problems :)
     
    #67 nAo, Mar 22, 2007
    Last edited: Mar 22, 2007
  8. Mintmaster

    Veteran

    Joined:
    Mar 31, 2002
    Messages:
    3,897
    Likes Received:
    87
    I think that's true for R300/R400, but the way the NV4x/G7x/RSX scale gracefully with more data per pixel makes me doubt that.

    Your FIFO would have to vary (quite finely) in width to get this sort of characteristic. As data shuffles through, you'd need a switch to determine where exactly the data would go next at each stage in the FIFO. I think it's easier to just keep the data stationary and access the blocks you need.

    Conceptually, though, it's the same thing. The way the shader units process the data is still FIFO-like whether the data circulates in a physical FIFO or resides in a cache-like register file.
     
  9. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    11,716
    Likes Received:
    2,137
    Location:
    London
    I imagine the FIFO as broken into sections:
    • first ALU
    • TMU (perhaps 2 sections, LOD/Bias and texel-fetch/filter)
    • second ALU
    with whatever appears at the head of each feeding into the computation stage that follows, with the succeeding section's tail being written with the full "state" produced by the stage (including "unchanged registers").

    But, my concept of a FIFO requires variable-rate clocking (for the FIFO sections considered as a unit, i.e. all at the same rate) to cope with the fact that the pipeline can support more pixels in flight when the register allocation is low (e.g. 4x the pixels when 1 FP32 register is allocated) and fewer when it's high (e.g. 1/8th the pixels when 32 FP32s are allocated). Either that or each computation stage has variable-length delays within it to align data as it's fetched and stored. Maybe you'd have to combine both techniques to cover the full range of possibilities?

    Thinking more about this, the register fetch limit seems to imply that a single-ported fetch from the register file is capable of supporting whatever allocations of registers are required. As long as the fetch works on 4x 128-bit units, then any allocation of registers from 1 to 32 can be accommodated, since the throughput of the pipeline varies in rate depending on the register allocation.

    e.g. if 32 FP32s are allocated per pixel, then the pipeline has 8 clocks to fetch the 4 FP32s it can consume. So the computation unit would have to align the data coming out of these successive fetches (i.e. use "holds").
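    That fetch-rate relationship is easy to model. A toy sketch, assuming (as above) a port that reads four 128-bit registers per clock — the "clocks per pixel" framing is mine:

```python
# Toy model of the variable-rate fetch idea: with a port that reads
# four 128-bit (vec4) registers per clock, a pixel allocating R vec4
# registers can only be issued every ceil(R / 4) clocks.

FETCH_WIDTH_VEC4 = 4  # vec4 registers fetched per clock through the port

def clocks_per_pixel(allocated_vec4):
    return -(-allocated_vec4 // FETCH_WIDTH_VEC4)  # ceiling division

print(clocks_per_pixel(1))   # 1 clock  -> full rate
print(clocks_per_pixel(4))   # 1 clock
print(clocks_per_pixel(32))  # 8 clocks -> 1/8th throughput, as above
```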

    So, on balance, it seems to me the simplest approach is prolly a register file with a single 512-bit port per pixel (ganged for the entire quad, if desired, I suppose).

    Jawed
     
  10. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    11,716
    Likes Received:
    2,137
    Location:
    London
    A separate thing I've been wondering relates to the number of registers that are supported by a multiprocessor. With 8192 registers (each 32-bit) supported, that means a SM4 shader program that allocated 4096 vec4 fp32s (128-bits each) could only hold 2048 of those registers in memory at any one time (assuming a single pixel is scheduled for execution!).

    This implies that the driver would have to "swap" portions of the shader code (and state) to video memory at boundaries of the subsets of the register allocation.

    I wonder if the case of >2048 allocated vec4 registers is considered so unlikely that it's not been coded. In truth, presumably, the limit bites much sooner, e.g. for a single warp of 32 threads, anything over 64 allocated vec4 registers per thread would already exceed the file.

    Or, perhaps the general trickery of vec4->scalar instruction scheduling allows the driver to get round this issue quite easily?

    Playing with the CUDA Occupancy Calculator indicates a hard limit of 64 32-bit registers per thread, which is effectively one quarter of what I was expecting. Hmm...
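    For reference, the resident-thread trade-off being discussed, as a trivial sketch (8192 32-bit registers per multiprocessor, as above; not measured behaviour):

```python
# Toy model: with a fixed per-multiprocessor register budget, the per-thread
# allocation caps how many threads can be resident at once.

REGS_PER_MP = 8192  # 32-bit registers per multiprocessor, per the thread above

def resident_threads(regs_per_thread):
    return REGS_PER_MP // regs_per_thread

# One warp using 64 vec4 (256 scalar) registers per thread exhausts the file:
print(resident_threads(256))  # 32 threads -- exactly one warp
# At the 64-scalar (16 vec4) Occupancy Calculator limit:
print(resident_threads(64))   # 128 threads
```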

    Jawed
     
  11. nAo

    nAo Nutella Nutellae
    Veteran

    Joined:
    Feb 6, 2002
    Messages:
    4,400
    Likes Received:
    440
    Location:
    San Francisco
    Dunno about GPGPU stuff, but with graphics shaders it's really, really hard to use more than 5 or 6 128-bit registers (especially if you don't use DB..). It's easy to construct cases where you use more than that, but I doubt they're very practical
     
  12. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    11,716
    Likes Received:
    2,137
    Location:
    London
    The four light shader from Far Cry uses more registers than that (seemingly .xyz most of the time though).

    But yeah, the "limits" in SM4 do seem spectacularly over-the-top in comparison with what SM3 allows.

    Still, I'd be curious to see what happens to G80 if SM3's full 32 128-bit registers are assigned, since the CUDA documentation implies it can only support half that. The limitation might be CUDA-specific, or it might be a bug in the spreadsheet... :???:

    I wonder if this is partly the problem the Folding@Home guys have had trying to get the Brook DX9 code to run on G80: maybe it's trying to use all 32 128-bit registers it's allowed under SM3, but G80 is refusing the code or falling over when it "flips" between partitions of registers.

    Jawed
     
  13. armchair_architect

    Newcomer

    Joined:
    Nov 28, 2006
    Messages:
    128
    Likes Received:
    8
    I expect the compiler can spill registers to thread-local off-chip memory if needed, allowing it to support the full complement of 4096 vec4 "registers". It could trade off fast register access against threads in flight... This would also be necessary for dynamically indexed local arrays -- I've never seen a real register file that supports that.

    Take a look at section 4.2.2.4 in the CUDA programming guide, sounds like the CUDA compiler is doing exactly this.
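    A toy sketch of the spill accounting (the 64-register figure is the Occupancy Calculator limit quoted earlier; the cost model is entirely hypothetical):

```python
# Toy model of register spilling: values beyond the hardware per-thread
# limit overflow to thread-local (off-chip) memory. Illustrative only.

HW_REGS_PER_THREAD = 64  # scalar registers, per the Occupancy Calculator figure

def spilled_values(live_values):
    """How many live 32-bit values must overflow to local memory."""
    return max(0, live_values - HW_REGS_PER_THREAD)

print(spilled_values(48))        # 0 -- fits in the register file
print(spilled_values(4096 * 4))  # a 4096-vec4 shader spills almost everything
```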
     
  14. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    11,716
    Likes Received:
    2,137
    Location:
    London
    I wouldn't expect performance to hold up for a SM4 shader program that has hundreds or thousands of registers defined. I only asked the question originally because it's what D3D10 specifies as a requirement of all hardware, no ifs, buts or maybes ... (erm, supposedly :lol: )

    But it is curious that G80 CUDA appears to run out of registers at 16 vec4s. That's not even what SM3 specifies - which is why I suspect a bug in the spreadsheet.

    Anyway, shared memory provides a handy "overspill".

    Jawed
     
  15. stepz

    Newcomer

    Joined:
    Dec 11, 2003
    Messages:
    66
    Likes Received:
    3
    The number of defined registers isn't significant from a performance standpoint. What counts is the maximum number of concurrently live registers. Compilers can do lots of transformations to reduce the number of live registers: data values can be refetched, results recomputed, computations reordered, results spilled to storage, and so on. 64 live registers seems quite enough to me.
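    The defined-vs-live distinction can be illustrated with a toy liveness scan — the program and helper below are entirely made up, just to show that six defined values can need only three registers at once:

```python
# Toy liveness scan over a straight-line program: count the peak number of
# simultaneously live values, which is what actually needs registers.

# Each instruction: (value it defines, values it uses). Hypothetical program.
program = [
    ("a", []), ("b", []), ("c", ["a", "b"]),  # a, b die at c
    ("d", []), ("e", ["c", "d"]),             # c, d die at e
    ("f", ["e"]),                             # e dies at f
]

def max_live(prog):
    # last use (or definition) of each value
    last = {}
    for i, (d, uses) in enumerate(prog):
        last[d] = i
        for u in uses:
            last[u] = i
    peak = 0
    live = set()
    for i, (d, uses) in enumerate(prog):
        live.add(d)
        peak = max(peak, len(live))
        live -= {v for v in live if last[v] == i}  # retire dead values
    return peak

print(len(program), max_live(program))  # 6 values defined, only 3 live at once
```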
     