NVIDIA Fermi: Architecture discussion

Discussion in 'Architecture and Products' started by Rys, Sep 30, 2009.

  1. Acert93

    Acert93 Artist formerly known as Acert93
    Legend

    Joined:
    Dec 9, 2004
    Messages:
    7,782
    Likes Received:
    162
    Location:
    Seattle
    Are there any good overviews of the architecture and potential performance yet? I just recently read about the quasi 4 triangles/clock arrangement and the monster (?) tessellation performance. I haven't been following Fermi much due to the vaporware/high-pitched FUD, so please excuse my disconnect; it sounds like Fermi has some neat tricks up its sleeve. Maybe NV has something for SLI as well? (I must admit I am excited about their laptop dock with the Gateway, I hope that catches on!)
     
  2. Mintmaster

    Veteran

    Joined:
    Mar 31, 2002
    Messages:
    3,897
    Likes Received:
    87
    Figure it's a good time to start some architecture discussion again.

    In the leaked Hexus benchmarks for Heaven 2.0, we see that changing the tessellation level imparts a performance hit. I believe that for the most part this is due to increased load on triangle setup (including clipping and culling), because the additional hull/domain/vertex shader load should be minimal, and the usual clumping of triangles will prevent the GPU from hiding this bottleneck behind pixel processing. So let's do a little analysis:

    No tessellation, normal tessellation:
    HD5870: 40.5 fps, 26.3 fps ==> 13.3 ms extra processing time
    GTX480: 45.9 fps, 36.9 fps ==> 5.3 ms

    Fermi crunches through this additional load 2.5 times faster.

    Normal tessellation, extreme tessellation:
    HD5870: 26.3 fps, 17.0 fps ==> 20.8 ms
    GTX480: 36.9 fps, 29.5 fps ==> 6.8 ms

    Fermi crunches through this larger additional load 3.1 times faster.
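
The frame-time arithmetic above can be reproduced with a short script. This is only a sketch in Python: the function and variable names are made up for illustration, and the fps figures are the leaked Hexus numbers quoted in this post, not new measurements.

```python
def frame_time_ms(fps):
    """Convert frames per second to milliseconds per frame."""
    return 1000.0 / fps


def extra_ms(fps_before, fps_after):
    """Extra time per frame once a setting has lowered the frame rate."""
    return frame_time_ms(fps_after) - frame_time_ms(fps_before)


# Leaked Heaven 2.0 figures quoted above: (no tessellation, normal, extreme)
RESULTS = {
    "HD5870": (40.5, 26.3, 17.0),
    "GTX480": (45.9, 36.9, 29.5),
}


def step_costs(card):
    """Return the added ms for none -> normal and for normal -> extreme."""
    none, normal, extreme = RESULTS[card]
    return extra_ms(none, normal), extra_ms(normal, extreme)


if __name__ == "__main__":
    hd = step_costs("HD5870")
    gtx = step_costs("GTX480")
    labels = ("none -> normal", "normal -> extreme")
    for label, a, b in zip(labels, hd, gtx):
        print(f"{label}: HD5870 +{a:.1f} ms, GTX480 +{b:.1f} ms, ratio {a / b:.1f}x")
```

Comparing added milliseconds rather than fps drops isolates the cost of the extra tessellation work from whatever else each card spends the frame on.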

    We know Cypress can do one triangle per clock, and this is what NVidia has said about Fermi:
    http://www.bjorn3d.com/read.php?cID=1778&pageID=8321
    http://www.techreport.com/articles.x/18332/2
    Not quite the expected result, given that Cypress is clocked faster, but Cypress is probably a little below 1 tri/clk on average, so close enough :smile:
     
  3. Dave Baumann

    Dave Baumann Gamerscore Wh...
    Moderator Legend

    Joined:
    Jan 29, 2002
    Messages:
    14,090
    Likes Received:
    694
    Location:
    O Canada!
    FYI - The final release of the Heaven benchmark does not have a "No Tessellation" mode, but a "Moderate" mode, so if that is accurate I don't know if they were using an RC release of the bench. While I don't know what the performance differences on Fermi are, we do see differences in performance between the RC and the final release, to the tune of about 10% for Cypress.
     
  4. Sinistar

    Sinistar I LIVE
    Regular Subscriber

    Joined:
    Aug 11, 2004
    Messages:
    660
    Likes Received:
    74
    Location:
    Indiana
    Just checked: it has disabled, moderate, normal, and extreme.
     
  5. nAo

    nAo Nutella Nutellae
    Veteran

    Joined:
    Feb 6, 2002
    Messages:
    4,400
    Likes Received:
    440
    Location:
    San Francisco
    I wouldn't obsess too much about setup rates. They are important, but with so many very small triangles (turn on the wireframe mode and see :) ) I wouldn't be surprised if tessellation in that test kills pixel shader performance, which in turn could be a new bottleneck. And who knows... Fermi could be doing something clever about it.
     
  6. Mintmaster

    Veteran

    Joined:
    Mar 31, 2002
    Messages:
    3,897
    Likes Received:
    87
    Well, remember that every clock these cards can do 2000 to 3200 flops and output 8 quads. You need a long pixel shader to be unable to take advantage of multiple tris per clock on 1-2 quad triangles.
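
To put rough numbers on that, here is a back-of-the-envelope sketch in Python. It takes the 2000 and 3200 flops/clock and 8 quads/clock figures from the post, assumes a quad is 2x2 pixels, and ignores texturing, latency, and scheduling entirely. The function names are hypothetical.

```python
PIXELS_PER_QUAD = 4  # a quad is a 2x2 block of pixels


def flops_per_pixel_slot(flops_per_clock, quads_per_clock):
    """ALU work available per pixel slot when this many quads arrive per clock."""
    return flops_per_clock / (quads_per_clock * PIXELS_PER_QUAD)


def budget_table(flops_per_clock):
    """Budget at peak quad output and at one tiny triangle per clock."""
    return {
        "8 quads/clk (peak output)": flops_per_pixel_slot(flops_per_clock, 8),
        "2 quads/clk (1 tri/clk, 2-quad tris)": flops_per_pixel_slot(flops_per_clock, 2),
        "1 quad/clk (1 tri/clk, 1-quad tris)": flops_per_pixel_slot(flops_per_clock, 1),
    }


if __name__ == "__main__":
    for flops in (2000, 3200):
        print(f"{flops} flops/clk")
        for case, budget in budget_table(flops).items():
            print(f"  {case}: {budget:.1f} flops per pixel slot")
```

Under these simplifications, a pixel shader would need several hundred flops per pixel slot before the ALUs, rather than the one-triangle-per-clock limit, became the constraint on 1-2 quad triangles.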

    It may be inefficient to throw away half the samples in a quad on tiny triangles, but it's even more wasteful to have the majority of your shader engine idling due to lack of quads to work on, dreaming of 50% efficiency :smile:

    If you're right about the pixel shader load increasing with tessellation (and I suspect you are), then we need to subtract a few milliseconds from those numbers. It's probably roughly equal for both cards, because they have similar performance without tessellation and thus similar pixel crunching ability, but it would wind up making the ratio bigger.
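
A quick sketch of why subtracting the same amount from both cards makes the ratio bigger. It uses the 13.3 ms and 5.3 ms figures from the earlier analysis. The shared pixel-shader costs tried here are purely hypothetical values, since nobody in the thread has measured them.

```python
def ratio_after_removing_shared_cost(slow_ms, fast_ms, shared_ms):
    """Ratio of the remaining costs once an equal shared cost is removed from both."""
    if not 0 <= shared_ms < fast_ms:
        raise ValueError("shared cost must be non-negative and below the faster card's total")
    return (slow_ms - shared_ms) / (fast_ms - shared_ms)


HD5870_EXTRA_MS = 13.3  # none -> normal tessellation, from the earlier post
GTX480_EXTRA_MS = 5.3

if __name__ == "__main__":
    for shared in (0.0, 1.0, 2.0, 3.0):  # hypothetical pixel-shader overheads in ms
        r = ratio_after_removing_shared_cost(HD5870_EXTRA_MS, GTX480_EXTRA_MS, shared)
        print(f"shared pixel cost {shared:.1f} ms -> geometry-side ratio {r:.2f}x")
```

Because the same amount comes off a smaller number on the Fermi side, the ratio grows quickly as the assumed shared cost rises.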
     
  7. CarstenS

    Legend Subscriber

    Joined:
    May 31, 2002
    Messages:
    5,800
    Likes Received:
    3,920
    Location:
    Germany
  8. Mintmaster

    Veteran

    Joined:
    Mar 31, 2002
    Messages:
    3,897
    Likes Received:
    87
    Some tests at hardware.fr:
    http://www.hardware.fr/articles/787-6/dossier-nvidia-geforce-gtx-480.html

    Apparently Cypress can only generate one tessellated triangle every three clocks. So it's not setup. I wonder why ATI went with such a slow implementation? I don't see what's so difficult about it. This would explain why NVidia was claiming a 600% advantage in some directed tests.

    It also makes my reasoning above rather moot. :oops:
     
  9. CarstenS

    Legend Subscriber

    Joined:
    May 31, 2002
    Messages:
    5,800
    Likes Received:
    3,920
    Location:
    Germany
    Now, Damien - put in a 5770 please!
     
  10. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    11,716
    Likes Received:
    2,137
    Location:
    London
    Doesn't Xenos have half-speed tessellation? Presumably the same is true for R600...RV790's tessellator?

    I wonder if the reason for that factor is due to multi-passing? If the tessellator can only amplify by X per iteration, then worst case amplification on those older GPUs is a factor of 8, e.g. as two passes of 4x.

    In D3D11 the amplification factor is a maximum of 32x. 1/3 of that isn't a very comforting number, though - so I'm unsure if the lack of agreement with what's seen in HD5870 is significant or not.

    EDIT: doh, three iterations: x, y and z :?:

    The other side of the coin, though, is that reasonable scenarios such as 834 v 618 for "tessellation + displacement mapping" which is 35%, or 978 v 878 "adaptive tessellation + displacement mapping" which is 11%, seem like what a developer would aim for.
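
For reference, the percentages quoted above follow from a simple ratio. A minimal Python sketch, using only the figures from this post. The post does not state the units or which card each number belongs to, so the argument names are deliberately neutral.

```python
def percent_advantage(higher, lower):
    """How much larger `higher` is than `lower`, as a percentage of `lower`."""
    return (higher / lower - 1.0) * 100.0


if __name__ == "__main__":
    cases = {
        "tessellation + displacement mapping": (834, 618),
        "adaptive tessellation + displacement mapping": (978, 878),
    }
    for name, (a, b) in cases.items():
        print(f"{name}: {percent_advantage(a, b):.0f}%")
```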

    Jawed
     
  11. cal_guy

    Newcomer

    Joined:
    Jun 27, 2008
    Messages:
    217
    Likes Received:
    3
  12. GZ007

    Regular

    Joined:
    Jan 22, 2010
    Messages:
    416
    Likes Received:
    0
    Try out this SDK demo: http://developer.download.nvidia.com/SDK/10.5/direct3d/samples.html#InstancedTessellation. It's a DX10 tessellation demo. With the max of 32 tessellation levels I get vsynced 60 fps on my 4850 (at any resolution).
    Maybe the whole DX11 HS, tessellator, DS pipeline is quite overcomplicated if a software implementation can be this fast.
     
  13. no-X

    Veteran

    Joined:
    May 28, 2005
    Messages:
    2,455
    Likes Received:
    471
    Sorry, if this has been answered, but reviews are not completely consistent...

    GF100 has 64 texturing units. Each one consists of 1 addressing unit and 4 texture samplers, but how many filtering units? 1 or 4? Or are the 4 units capable of both sampling and/or filtering?

    Thanks!
     
  14. trinibwoy

    trinibwoy Meh
    Legend

    Joined:
    Mar 17, 2004
    Messages:
    12,059
    Likes Received:
    3,119
    Location:
    New York
    Don't know what Anandtech is talking about. Each SM can calculate 4 addresses and fetch 16 point samples per clock as they now support Gather4. However they can still produce only 4 filtered samples per clock. It's no different to AMD's setup. So it should be 16 addressing and 16 filtering units per GPC for a 1:1 ratio.

    Each unit can fetch 4 samples (Gather4) or produce 1 filtered sample per clock.
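
The per-clock tallies implied by that description can be sketched as follows. This is an illustrative Python model, not a description of the hardware: the class and function names are made up, and the only inputs are the per-unit rates stated above plus the 64-unit total from the question.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TextureUnit:
    """Per-clock rates for one unit, as described in the post above."""
    addresses: int = 1
    point_samples: int = 4      # Gather4 fetch
    filtered_samples: int = 1


def per_clock_totals(num_units, unit=TextureUnit()):
    """Scale the per-unit rates up to a group of identical units."""
    return {
        "addresses": num_units * unit.addresses,
        "point_samples": num_units * unit.point_samples,
        "filtered_samples": num_units * unit.filtered_samples,
    }


if __name__ == "__main__":
    # 4 units per SM and 16 per GPC come from the post; 64 is the GF100 total in the question.
    for label, units in (("per SM", 4), ("per GPC", 16), ("full GF100", 64)):
        print(label, per_clock_totals(units))
```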
     
  15. cal_guy

    Newcomer

    Joined:
    Jun 27, 2008
    Messages:
    217
    Likes Received:
    3
    Thanks for clearing that up.
     
  16. A.L.M.

    Newcomer

    Joined:
    Jun 2, 2008
    Messages:
    144
    Likes Received:
    0
    Location:
    Looking for a place to call home
    So Fermi is roughly three times faster than Cypress in triangle setup...
    This is something I was thinking about, looking at iXBT theoretical tests...

    http://translate.google.it/translat.../gf100-2-part2.shtml&sl=ru&tl=en&hl=&ie=UTF-8

    It seems like it's not that the tessellator per se is much slower than Fermi's (Detail Tessellation), but that when you combine a heavy load on triangle setup with tessellation, you end up with Fermi winning by far....
    By the way, looking at those tests, I don't think Fermi is much more future oriented than Cypress...
    It seems like Fermi is much better in Geometry Shaders and more recent pixel shaders, but it's worse than Cypress in SSAA scenes and with compute shaders.
     
  17. Kaotik

    Kaotik Drunk Member
    Legend

    Joined:
    Apr 16, 2003
    Messages:
    10,245
    Likes Received:
    4,465
    Location:
    Finland
    From what I've understood with my limited knowledge, the geometry part is indeed the "strong point" of the Fermi architecture. But is it really that much of a limiting factor on other architectures? In pure shader power, which is used for geometry shaders as well, HD5 for example is a lot faster, and this also shows in most pixel shader tests.
     
  18. Mintmaster

    Veteran

    Joined:
    Mar 31, 2002
    Messages:
    3,897
    Likes Received:
    87
    That makes the 1/3 factor for Cypress even more disappointing. Also, remember that each vertex generated by the tessellator creates two triangles, so Cypress is only generating 1 vertex every six clocks.
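
The vertex-rate arithmetic can be written out explicitly. A minimal Python sketch, assuming the two-triangles-per-vertex relation stated above for a regular tessellated mesh. The 850 MHz core clock is an assumption added here (the HD 5870 reference clock) and is used only to give a sense of scale.

```python
from fractions import Fraction


def vertices_per_clock(tris_per_clock, tris_per_vertex=2):
    """Vertex rate implied by a triangle rate when each new vertex yields two triangles."""
    return Fraction(tris_per_clock) / tris_per_vertex


CYPRESS_TESS_TRIS_PER_CLOCK = Fraction(1, 3)  # one tessellated triangle every three clocks
ASSUMED_CORE_CLOCK_HZ = 850_000_000           # assumption: HD 5870 reference core clock

if __name__ == "__main__":
    v = vertices_per_clock(CYPRESS_TESS_TRIS_PER_CLOCK)
    print(f"vertices per clock: {v}")
    print(f"clocks per vertex: {1 / v}")
    print(f"about {float(v * ASSUMED_CORE_CLOCK_HZ) / 1e6:.0f} M vertices/s at the assumed clock")
```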

    It could be due to a data flow bottleneck. Damien mentions that reading multiple vertices per clock slows down non-Fermi GPUs. Not sure if he's talking about domain shaders or vertex shaders, though they're basically the same thing. Does Evergreen still use a separate vertex cache?

    I really doubt it. Remember that tessellation factors are floating point numbers, allowing smooth transitions. All vertices are defined by the same formula, so there's no need for iteration.

    The tessellator doesn't even calculate the positions of the vertices. All it does is create room for the vertex in the pipeline and give 16-bit (0..1) barycentric coordinates to the domain shader. I'd be shocked if ATI didn't put in the maybe 10 million transistors needed to do that math quickly. Given that the B3D article on Cypress said that a lot of shader time was spent in the domain shader, it could be data contention for the patch's control points, stalling the domain shader to the point of only allowing one control point to be read by only one thread (vertex) every two cycles. That would suck...

    This is pretty simple geometry with a low resolution displacement map (in terms of features wrt resolution), though. You may not be able to be so adaptive in the real world.
     
  19. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    11,716
    Likes Received:
    2,137
    Location:
    London
    That's not right, either.

    Naked vertices produced by the tessellator only have u,v.

    So, still no good idea why it's apparently 3x slowed-down.

    Jawed
     
  20. Mintmaster

    Veteran

    Joined:
    Mar 31, 2002
    Messages:
    3,897
    Likes Received:
    87
    It doesn't matter how fast the ALUs can crunch through the vertex/hull/domain shaders, because it can only assemble and set up one triangle per clock (which only need 0.5 vertices per clock to run at max speed if the mesh is good). To put that in perspective, Cypress can use 1/10th of its shading power on a 600 flop vertex shader and still saturate the triangle setup.
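
The "1/10th" figure can be checked with a few lines of Python. This sketch assumes the 3200 flops/clock figure mentioned earlier in the thread is the Cypress one, and takes the 0.5 vertices per triangle and 600-flop shader straight from the paragraph above. The function name is hypothetical.

```python
def alu_share_to_saturate_setup(flops_per_vertex, verts_per_tri,
                                tris_per_clock, chip_flops_per_clock):
    """Fraction of total ALU throughput needed to feed setup at its peak triangle rate."""
    needed_flops_per_clock = flops_per_vertex * verts_per_tri * tris_per_clock
    return needed_flops_per_clock / chip_flops_per_clock


if __name__ == "__main__":
    share = alu_share_to_saturate_setup(
        flops_per_vertex=600,       # the long vertex shader from the post
        verts_per_tri=0.5,          # well-formed mesh, as stated above
        tris_per_clock=1,           # setup limit
        chip_flops_per_clock=3200,  # assumed Cypress peak, from earlier in the thread
    )
    print(f"ALU share needed: {share:.1%}")
```

That works out to just under a tenth of the shader array, so the ALUs have plenty of headroom before vertex shading, rather than setup, limits triangle throughput.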

    What we're learning about tessellation, though, is that the bottleneck is even tighter than that for setup. If my theory is right, the ALUs are just stuck in the domain shader waiting for data.
     