NVIDIA Fermi: Architecture discussion

Discussion in 'Architecture and Products' started by Rys, Sep 30, 2009.

  1. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    10,873
    Likes Received:
    767
    Location:
    London
    My argument with this "idealised" stance is that the actual sequence of operations undertaken by the processor results in varying performance for the same ALU:TEX.

    http://developer.amd.com/gpu_assets/PLDI08Tutorial.pdf

    You can clearly see in versions 3, 4 and 5 that, despite identical ALU:TEX, performance varies substantially.

    5 is faster than 3 despite the fact that 5 has fewer threads in flight per SIMD than 3. The estimated thread count for version 3 is 256/28 = 9, while for version 5 it is 256/38 = 6. (Both estimates are subject to clause-temporary overhead. Also, I suspect that 256 is not the correct baseline; something like 240 might be better, not sure...)

    Evergreen GPUs support 16-long TEX clauses as opposed to the 8-long clauses seen in R600-RV790. There are two reasons to do this:
    1. all clause switches increase the latency experienced by an individual hardware thread as switching has latency, so packing TEX instructions into a lower count of clauses reduces the total latency experienced
    2. TEX clauses are sensitive to cache behaviour, so a doubling in TEX clause length can increase coherency
    So, in summary, the idealised stance is merely a starting point.

    Going back in time, your argument is that if AMD doubles ALU:TEX to, e.g., 8:1 in the next GPU while leaving the overall ALU/TEX architecture alone, each ALU would only need half the register file. The 256KB of aggregate register file per SIMD we see in Evergreen would be enough for the next GPU. Well, clearly this is fallacious, as version 5 above would be reduced to a mere 3 hardware threads, killing throughput (3 hardware threads means that neither ALU nor TEX clauses can be 100% occupied by hardware threads, since both require pairs of hardware threads for full utilisation).
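    The thread-count estimates in the two paragraphs above reduce to simple integer division. A minimal sketch, assuming the 256-registers-per-thread baseline Jawed uses (he notes ~240 may be more accurate) and ignoring clause-temporary overhead; `threads_in_flight` is a hypothetical helper, not anything from AMD's tooling:

```python
# Back-of-envelope sketch of the threads-in-flight estimates above.
# Assumption: a budget of 256 registers per SIMD lane, divided by the
# per-thread register counts (28 and 38) of the PLDI08 tutorial kernels.
def threads_in_flight(register_budget, regs_per_thread):
    return register_budget // regs_per_thread

print(threads_in_flight(256, 28))  # version 3: 9 hardware threads
print(threads_in_flight(256, 38))  # version 5: 6 hardware threads

# The hypothetical 8:1 GPU with the register file halved per ALU:
print(threads_in_flight(128, 38))  # 3 threads: too few to keep ALU+TEX
                                   # clause pairs fully occupied
```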

    Separately, I've long maintained that careful management of register spill can be used to amplify the effective size of the register file (hardly news: CPUs are continuously doing this). NVidia's older architectures and ATI's current one take a rather naive and useless all-or-nothing approach to register spill, i.e. there's zero optimisation for the register-spill case, so once it's induced the wheels fall off. These GPUs are set up for fencing, and don't like it when you come armed with a baseball bat.

    AMD will have to catch-up and implement register spill properly. One of a long list of catch-ups, in comparison with Fermi.

    Jawed
     
  2. CarstenS

    Veteran Subscriber

    Joined:
    May 31, 2002
    Messages:
    4,798
    Likes Received:
    2,056
    Location:
    Germany
    You can rasterize 32 pixels all right, but then - what are you going to do with two of them when the next part of the pipe only fits 30/28 at a time?

    Now I know what you meant :) 'twas my silly mistake…
     
  3. Dave Baumann

    Dave Baumann Gamerscore Wh...
    Moderator Legend

    Joined:
    Jan 29, 2002
    Messages:
    14,079
    Likes Received:
    648
    Location:
    O Canada!
    Really? To what end?
     
  4. mczak

    Veteran

    Joined:
    Oct 24, 2002
    Messages:
    3,015
    Likes Received:
    112
    Coupling this to memory clock could make sense IMHO, but it doesn't fit the overclocking data Tridam gathered. Apparently none of the fillrate numbers budge a bit when the memory clock changes (which is imho quite remarkable anyway, since it suggests that despite the low memory clock it's not really bandwidth limited, even though the ROPs are the major bandwidth consumers); instead they scale linearly with core clock.
     
  5. trinibwoy

    trinibwoy Meh
    Legend

    Joined:
    Mar 17, 2004
    Messages:
    10,430
    Likes Received:
    433
    Location:
    New York
    What is the next part of the pipe that you're referring to that can only accommodate 30 pixels?
     
  6. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    10,873
    Likes Received:
    767
    Location:
    London
    Performance on workloads beyond the low-hanging fruit that's been much of the focus so far? Dunno, does AMD want to compete on OpenCL performance?

    The strange thing about register spill is that it's not a million miles away from the way the ATI compiler currently has to allocate clause-temporary registers. It can only allocate clause-temporaries after determining lifetime, something that's also required when evaluating register spill.

    Seems to me any argument against register spill is like arguing against local memory/L1 back before G80 came out. The signs were clear in papers from back then that GPGPU needed to go in this direction. No matter how clean the Brook stream model is, it's too restrictive in the real world. Fitting the entire context into registers is a "too-clean" model, it fails with increased kernel complexity.

    If you want to argue that the programmer should explicitly manage spillage simply by doing their own reads/writes to global memory, well I think that's a step too far when competing architectures scale smoothly with growth of work-item context. A few KB of context per work-item shouldn't be treated like a pure global memory resource, just because it's 512 bytes too much to fit into registers for a given latency-hiding constraint.

    Jawed
     
  7. no-X

    Veteran

    Joined:
    May 28, 2005
    Messages:
    2,298
    Likes Received:
    247
    GTX480 has 1.5 times as many ROPs; 1.5× better results are quite expectable to me...
     
  8. CarstenS

    Veteran Subscriber

    Joined:
    May 31, 2002
    Messages:
    4,798
    Likes Received:
    2,056
    Location:
    Germany
    The shader engine, actually. In current configurations it can output a maximum of 30 ppc to the ROPs.
     
  9. mczak

    Veteran

    Joined:
    Oct 24, 2002
    Messages:
    3,015
    Likes Received:
    112
    Except that z sample rate is often bandwidth limited, especially without AA (with 24bit z and 160GB/s you could only do ~53GSamples/s if there's no compression).
    Though by comparing HD4890 to HD5870, I'm actually not sure that theory is true any longer, because the z sample results are exactly twice as high no matter the AA setting... So it looks like there's something else preventing HD5870 from reaching its peak z sample rate except with 8xAA... Maybe it just can't push that many pixels...
    FWIW GT200b seems to only reach half its potential, since afaik it should also be capable of 8xZ per clock just like g80/g92/gf100.
    I agree though if it's not memory bandwidth limited due to good compression, the results make sense, and whatever holds AMD chips back at non-AA is not a problem for nvidia chips.
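    The ~53 GSamples/s ceiling quoted above follows from a one-line calculation. A sketch of that arithmetic, assuming uncompressed 24-bit z samples and the post's 160 GB/s bandwidth figure:

```python
# Sketch of the z-fill bandwidth ceiling: with uncompressed 24-bit
# (3-byte) z samples, raw memory bandwidth caps the sample rate.
bandwidth = 160e9             # bytes/s (160 GB/s, the figure from the post)
bytes_per_z_sample = 24 // 8  # 24-bit z, no compression assumed
peak_gsamples = bandwidth / bytes_per_z_sample / 1e9
print(round(peak_gsamples))   # ~53 GSamples/s
```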
     
  10. trinibwoy

    trinibwoy Meh
    Legend

    Joined:
    Mar 17, 2004
    Messages:
    10,430
    Likes Received:
    433
    Location:
    New York
    Well, now we're right back where we started. My original question is why is it assumed that each SM can only output 2 pixels per clock? That makes no sense to me since unified architectures since G80 are all about running different workloads in parallel. So you're telling me that if only half the chip is running pixel shaders, fillrate falls to 16 ppc?
     
  11. CarstenS

    Veteran Subscriber

    Joined:
    May 31, 2002
    Messages:
    4,798
    Likes Received:
    2,056
    Location:
    Germany
    You lost me. AFAIK each GPC can rasterize 8 ppc and send the work off to the shader engine. That makes for 32 ppc rasterized for both GTX 480 and 470. What follows is pixel shading. The shader engine as a whole can process 32 ppc, but in current configs only a maximum of 15 out of 16 SMs are active, so it's only 15/16ths of 32 pixels (i.e. 30) that can be sent off at a time.

    If they're in a format the ROPs can process in a single cycle: voilà, 28-30 ppc fillrate.
    If the ROPs take two cycles, fillrate is halved.
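    The per-clock arithmetic above can be sketched in a few lines. The 4-GPC/8-ppc figures are from the post; the 15- and 14-SM configurations are my assumption for GTX 480 and GTX 470 respectively, to show where the 28-30 ppc range comes from:

```python
# Sketch of the pixels-per-clock arithmetic in the post above.
gpcs, ppc_per_gpc = 4, 8
rasterized = gpcs * ppc_per_gpc  # 32 ppc out of the rasterizers

# Shader engine output scales with the fraction of active SMs
# (assumed: 15 SMs active on GTX 480, 14 on GTX 470, of 16 total).
shaded = {n: rasterized * n // 16 for n in (15, 14)}
print(rasterized, shaded)  # 32 {15: 30, 14: 28}
```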
     
  12. trinibwoy

    trinibwoy Meh
    Legend

    Joined:
    Mar 17, 2004
    Messages:
    10,430
    Likes Received:
    433
    Location:
    New York
    Well, you keep saying that, but as with my first question I'm asking where it's coming from. Who said all SMs combined can only output 32 ppc?
     
  13. cho

    cho
    Regular

    Joined:
    Feb 9, 2002
    Messages:
    416
    Likes Received:
    2
    1 warp = 8 pixel quads

    Fermi can do 2 warps per SM per 2 shader clocks,

    so on Fermi, it should be 4 pixels per SM per shader clock?
     
  14. no-X

    Veteran

    Joined:
    May 28, 2005
    Messages:
    2,298
    Likes Received:
    247
    http://www.pcgameshardware.com/aid,743526/Some-gory-guts-of-Geforce-GTX-470/480-explained/News/
     
  15. trinibwoy

    trinibwoy Meh
    Legend

    Joined:
    Mar 17, 2004
    Messages:
    10,430
    Likes Received:
    433
    Location:
    New York
    Well it can retire two instructions for two half-warps every hot clock. So that's two full warps per scheduler clock. So potentially 64 pixels per SM. Hence my amazement that it could be only 2.

    Yep, that's why I originally asked:

    Was hoping there was something more detailed out there.
     
  16. Mintmaster

    Veteran

    Joined:
    Mar 31, 2002
    Messages:
    3,897
    Likes Received:
    87
    I'm not assuming the latter at all (hence my use of quads/clk throughout my post), and the former is a solid assumption. Remember that Xenos is able to do this, and it's more closely related to R600 onwards than R520. What's so hard about it? All Cypress has to do is store vertex indices for each quad so that the interpolation routine can fetch the correct data. With RV770 and earlier, the shading engines had no idea which triangle the quads came from because interpolation was done earlier and stuffed into the registers.

    Because they already did the minimal work required in Xenos and probably earlier.

    You're wrong about "so few small triangles, ever". I'm having trouble finding public information about triangle size distributions, but here's an old one with GLQuake II information:
    http://citeseerx.ist.psu.edu/viewdo...CAA817BB2?doi=10.1.1.56.8726&rep=rep1&type=ps
    They do glide interception, but don't mention the rendering resolution (let's assume 640x480). They find 41% of the visible triangles are 1-25 pixels in size. How is this possible if 640*480 pixels over ~1300 triangles gives an average of roughly 236 pixels per triangle? Well, that's the area-weighted average triangle size and says nothing about the distribution. Nowadays we have 10x the pixels and well over 100x the triangles, so that likely means even more triangles are tiny. Still think triangles are big?
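    The average-vs-distribution point can be illustrated with made-up numbers (purely hypothetical triangle areas, not data from the cited paper): a large share of triangles can be tiny even when the average triangle size is large, because a few big triangles dominate the total.

```python
# Hypothetical illustration: 41% of triangles are tiny, yet the average
# triangle size is large, because the big triangles dominate the area.
areas = [5] * 41 + [400] * 59  # 41 tiny triangles, 59 large ones
average = sum(areas) / len(areas)
small_share = sum(a <= 25 for a in areas) / len(areas)
print(round(average), small_share)  # 238 0.41
```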

    It's just plain stupid not to have multiple triangles per hardware thread.

    Look, it's very easy to test this. Draw a fullscreen mesh with a 10,000 cycle pixel shader and no AA, and see what happens as triangle edge length decreases from 100 pixels down to 0.1. If you're right, it will bottom out at 1/64th of large triangle performance. If I'm right it will bottom out at 1/4th.

    Do you have a split personality? You are the one that proposed >10M visible triangles per frame, not me. You are the one that brought up one-pixel triangles, not me.

    If you have a 40 cycle shader (up to 400 flops, 10 bilinear fetches), ATI's tessellator will avoid idling the SIMDs only if the rasterizer generates a buffered average of six quads per triangle. Note that I'm counting wavefronts with as little as one visible sample in each of its 16 quads as keeping a SIMD busy. The tessellator is a bottleneck for triangles with an area of 5, 10, even 15 pixels. One-pixel triangles are completely irrelevant. Do you get it yet?

    As I explained earlier, a shadowmapped game with 4M triangles per frame will work out to around 300k visible triangles in the final rendering view. That's an average of 5-10 pixels per triangle, depending on resolution, and thus is not comparable to your strawman case of single pixel polygons. Heaven shows us that 4M triangles is enough for tessellation performance to be a major factor.
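    The 5-10 pixels-per-triangle figure checks out as a quick division. A sketch, where the ~300k visible-triangle count is from the post and the two resolutions are my assumption of typical settings for the era:

```python
# Rough check of the 5-10 pixels/triangle claim for ~300k visible triangles.
visible_tris = 300_000
px_per_tri = {(w, h): w * h / visible_tris
              for (w, h) in [(1680, 1050), (1920, 1200)]}  # assumed resolutions
print(px_per_tri)  # ~5.9 and ~7.7 pixels per triangle
```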

    Just because IHVs have under 50% quad-filling efficiency for <10 pixel triangles doesn't mean that they'll throw in the towel and let wavefront efficiency drop to <10%.
    Why are you assuming 20 fps when B3D provides perf numbers much higher than that? You don't know how much that state (which can have multiple draw calls, BTW) changes without tessellation, so you're going about it the wrong way. That's why I looked at total frame time, because we have differential data in the review.

    That's a dumb assumption. Why do you think vertex load is negligible? Only the objects that the programmer bothered to cull with the CPU will be eliminated, and this is a demo.

    And this is why I use render time differential. All the assumptions you're making are grossly flawed and unnecessary with my approach. The difference in render time that I'm highlighting is due only to the difference in workload brought about by enabling tessellation.

    Stop putting words into my mouth AGAIN. I never said doubling setup was easy. I said the rasterizer, shading engine, and render backend could be left untouched. You said it needs to be overhauled to take advantage of faster tessellation, and that's just plain wrong.

    Leaving the setup alone and just improving the tessellator to one triangle per clock, on the other hand, is very easy, and I'm baffled as to why Cypress can't do that. Laziness? A bug? Planned obsolescence?

    Has anyone timed how many triangles per clock we see from R600 onwards when creating tri-strips in the geometry shader? Maybe they just left the geometry amplification hardware unchanged.
     
  17. CarstenS

    Veteran Subscriber

    Joined:
    May 31, 2002
    Messages:
    4,798
    Likes Received:
    2,056
    Location:
    Germany
    Didn't I confirm this? Sorry, I didn't want to do too much advertising for our own site.
    And no, apart from the cited paragraph, Nvidia hasn't said anything more detailed about it yet.


    That's why (not only) GTX480 can achieve its peak "fillrate" with longer shaders too.
    For example, this shader from MDolenc's Fillrate Tester also achieves "peak fillrate throughput" (i.e. ~20ish GPix/s) despite being more than just one operation per pixel:
    Code:
    ps_2_0
    
    dcl v0
    dcl v1
    
    def c0, 0.3f, 0.7f, 0.2f, 0.4f
    def c1, 0.9f, 0.3f, 0.8f, 0.6f
    
    add r0, c0, v1
    mad r1, c1, r0, -v0
    mad r2, v1, r1, c1
    mad r3, r0, r1, r2
    mov oC0, r3
     
    #3937 CarstenS, Apr 1, 2010
    Last edited by a moderator: Apr 1, 2010
  18. rpg.314

    Veteran

    Joined:
    Jul 21, 2008
    Messages:
    4,298
    Likes Received:
    0
    Location:
    /
    No, it takes 2 clocks to retire 2 warps. 32 ALUs/SM means 1 warp per clock is the effective peak.
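    The peak-rate arithmetic above in miniature. A sketch, assuming the 32-lane SM and 32-thread warp size discussed in this thread for Fermi:

```python
# One 32-thread warp spread over 32 ALU lanes completes in one hot clock,
# so retiring 2 warps takes 2 clocks: the effective peak is 1 warp/clock.
warp_size = 32
alus_per_sm = 32
warps_per_hot_clock = alus_per_sm / warp_size
print(warps_per_hot_clock)  # 1.0
```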
     
  19. Colourless

    Colourless Monochrome wench
    Veteran

    Joined:
    Feb 6, 2002
    Messages:
    1,274
    Likes Received:
    30
    Location:
    Somewhere in outback South Australia
    In reference to Quake 2, most of the screen space is covered by the world geometry, which is large triangles, and not all that many of them. All the small triangles will primarily be in enemy models. The enemies themselves didn't have all that many triangles, but they weren't particularly large on screen. Overall there weren't many triangles being rendered, and the game really isn't that useful to discuss things more than 10 years later.

    Talking about what percentage of the visible triangles are less than 25 pixels isn't as useful as talking about what percentage of the screen is covered by triangles less than 25 pixels.
     
  20. trinibwoy

    trinibwoy Meh
    Legend

    Joined:
    Mar 17, 2004
    Messages:
    10,430
    Likes Received:
    433
    Location:
    New York
    Ok.

    Yes, hot clock, but it's the scheduler clock that I mentioned in my comment (since we're talking about feeding the ROPs). Or are you saying it's only 1 warp per scheduler clock?
     