NVIDIA Fermi: Architecture discussion

Are there any good overviews of the architecture and potential performance yet? I just recently read about the quasi 4 triangles/clock arrangement and the monster (?) tessellation performance. I haven't been following Fermi much due to the vaporware talk and high-pitched FUD, so please excuse my disconnect. It sounds like Fermi has some neat tricks up its sleeve. Maybe NV has something for SLI as well? (I must admit I am excited about their laptop dock with the Gateway; I hope that catches on!)
 
Figure it's a good time to start some architecture discussion again.

In the leaked Hexus benchmarks for Heaven 2.0, we see that changing the tessellation level imparts a performance hit. I believe that for the most part this is due to increased load on triangle setup (including clipping and culling), because the additional hull/domain/vertex shader load should be minimal, and the usual clumping of triangles will prevent the GPU from hiding this bottleneck behind pixel processing. So let's do a little analysis:

No tessellation, normal tessellation:
HD5870: 40.5 fps, 26.3 fps ==> 13.3 ms extra processing time
GTX480: 45.9 fps, 36.9 fps ==> 5.3 ms

Fermi crunches through this additional load 2.5 times faster.

Normal tessellation, extreme tessellation:
HD5870: 26.3 fps, 17.0 fps ==> 20.8 ms
GTX480: 36.9 fps, 29.5 fps ==> 6.8 ms

Fermi crunches through this larger additional load 3.1 times faster.
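For anyone who wants to check the arithmetic, here is a minimal Python sketch of the calculation above: convert each fps figure to a frame time, subtract, and compare the two cards. The fps values are the ones quoted in this post; the function and variable names are made up for illustration.

```python
def extra_frame_time_ms(fps_before, fps_after):
    """Extra time per frame (in ms) when the frame rate drops from fps_before to fps_after."""
    return 1000.0 / fps_after - 1000.0 / fps_before

# (no tessellation, normal, extreme) fps, as quoted from the leaked Hexus numbers
results = {
    "HD5870": (40.5, 26.3, 17.0),
    "GTX480": (45.9, 36.9, 29.5),
}

def step_costs(fps):
    """Return the added cost of (none -> normal, normal -> extreme) in ms."""
    none, normal, extreme = fps
    return extra_frame_time_ms(none, normal), extra_frame_time_ms(normal, extreme)

hd_normal, hd_extreme = step_costs(results["HD5870"])
gtx_normal, gtx_extreme = step_costs(results["GTX480"])

# Ratio of added frame time: how much faster GTX480 absorbs the extra load
ratio_normal = hd_normal / gtx_normal
ratio_extreme = hd_extreme / gtx_extreme

print(f"none -> normal:    HD5870 {hd_normal:.1f} ms, GTX480 {gtx_normal:.1f} ms, ratio {ratio_normal:.1f}")
print(f"normal -> extreme: HD5870 {hd_extreme:.1f} ms, GTX480 {gtx_extreme:.1f} ms, ratio {ratio_extreme:.1f}")
```

Note that this attributes the whole frame-time difference to the tessellation workload, which is the same simplification the analysis above makes.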

We know Cypress can do one triangle per clock, and this is what NVIDIA has said about Fermi:
http://www.bjorn3d.com/read.php?cID=1778&pageID=8321
"Fermi can theoretically produce 4 but in reality it can parallel produce 2.5 - 2.7 triangles per clock cycle."
http://www.techreport.com/articles.x/18332/2
"Nvidia tells us that in directed tests, GF100 has averaged as many as 3.2 triangles per clock, which is still quite formidable."

Not quite the expected result, given that Cypress is clocked faster, but Cypress is probably a little below 1 tri/clk on average, so close enough :smile:
 
In the leaked Hexus benchmarks for Heaven 2.0, we see that changing the tessellation level imparts a performance hit.
FYI - The final release of the Heaven benchmark does not have a "No Tessellation" mode, but a "Moderate" mode, so if those labels are accurate they may have been using an RC release of the bench. While I don't know what the performance differences on Fermi are, we do see differences in performance between the RC and the final release, to the tune of about 10% for Cypress.
 
I wouldn't obsess too much about setup rates. They are important, but with so many very small triangles (turn on the wireframe mode and see :) ) I wouldn't be surprised if tessellation in that test kills pixel shader performance, which in turn could be a new bottleneck. And who knows... Fermi could be doing something clever about it.
 
Well, remember that every clock these cards can do 2000 to 3200 flops and output 8 quads. You need a long pixel shader to be unable to take advantage of multiple tris per clock on 1-2 quad triangles.

It may be inefficient to throw away half the samples in a quad on tiny triangles, but it's even more wasteful to have the majority of your shader engine idling due to lack of quads to work on, dreaming of 50% efficiency :smile:

If you're right about the pixel shader load increasing with tessellation (and I suspect you are), then we need to subtract a few milliseconds from those numbers. It's probably roughly equal for both cards, because they have similar performance without tessellation and thus similar pixel crunching ability, but it would wind up making the ratio bigger.
 
Doesn't Xenos have half-speed tessellation? Presumably the same is true for R600...RV790's tessellator?

I wonder if the reason for that factor is due to multi-passing? If the tessellator can only amplify by X per iteration, then worst case amplification on those older GPUs is a factor of 8, e.g. as two passes of 4x.

In D3D11 the amplification factor is a maximum of 32x. 1/3 of that isn't a very comforting number, though - so I'm unsure if the lack of agreement with what's seen in HD5870 is significant or not.

EDIT: doh, three iterations: x, y and z :?:

The other side of the coin, though, is that reasonable scenarios such as 834 v 618 for "tessellation + displacement mapping" which is 35%, or 978 v 878 "adaptive tessellation + displacement mapping" which is 11%, seem like what a developer would aim for.

Jawed
 
Try out this SDK demo http://developer.download.nvidia.com/SDK/10.5/direct3d/samples.html#InstancedTessellation. It's a DX10 tessellation demo. With a max of 32 tessellation levels I get vsynced 60 fps on my 4850 (at any resolution).
Maybe the whole DX11 HS, tessellator, DS pipeline is quite overcomplicated if a software implementation can be this fast.
 
Sorry, if this has been answered, but reviews are not completely consistent...

GF100 has 64 texturing units. Each one consists of 1 addressing unit and 4 texture samplers, but how many filtering units? 1 or 4? Or are the 4 units capable of both sampling and filtering?

Thanks!
 
What's the rationale for the 4:1 texture filtering unit to texture addressing unit ratio?

http://anandtech.com/video/showdoc.aspx?i=3783&p=3

Don't know what Anandtech is talking about. Each SM can calculate 4 addresses and fetch 16 point samples per clock as they now support Gather4. However they can still produce only 4 filtered samples per clock. It's no different to AMD's setup. So it should be 16 addressing and 16 filtering units per GPC for a 1:1 ratio.

Each unit can fetch 4 samples (Gather4) or produce 1 filtered sample per clock.
 
Thanks for clearing that up.
 
So Fermi is roughly three times faster than Cypress in triangle setup...
This is something I was thinking about, looking at iXBT theoretical tests...

http://translate.google.it/translat.../gf100-2-part2.shtml&sl=ru&tl=en&hl=&ie=UTF-8

It seems it's not that Cypress's tessellator per se is much slower than Fermi's (see Detail Tessellation); rather, when you combine a heavy load on triangle setup with tessellation, Fermi wins by far...
By the way, looking at those tests, I don't think Fermi is much more future-oriented than Cypress...
It seems like Fermi is much better in geometry shaders and more recent pixel shaders, but it's worse than Cypress in SSAA scenes and with compute shaders.
 
From what I understand with my limited knowledge, geometry is indeed the "strong point" of the Fermi architecture. But is it really such a limiting factor on other architectures? In pure shader power, which is used for geometry shaders as well, HD5 for example is a lot faster, and this also shows in most pixel shader tests.
 
Doesn't Xenos have half-speed tessellation? Presumably the same is true for R600...RV790's tessellator?
That makes the 1/3 factor for Cypress even more disappointing. Also, remember that each vertex generated by the tessellator creates two triangles, so Cypress is only generating 1 vertex every six clocks.

It could be due to a data flow bottleneck. Damien mentions that reading multiple vertices per clock slows down non-Fermi GPUs. Not sure if he's talking about domain shaders or vertex shaders, though they're basically the same thing. Does Evergreen still use a separate vertex cache?

I wonder if the reason for that factor is due to multi-passing? If the tessellator can only amplify by X per iteration, then worst case amplification on those older GPUs is a factor of 8, e.g. as two passes of 4x.
I really doubt it. Remember that tessellation factors are floating point numbers, allowing smooth transitions. All vertices are defined by the same formula, so there's no need for iteration.

The tessellator doesn't even calculate the positions of the vertices. All it does is create room for the vertex in the pipeline and give 16-bit (0..1) barycentric coordinates to the domain shader. I'd be shocked if ATI didn't put in the maybe 10 million transistors needed to do that math quickly. Given that the B3D article on Cypress said that a lot of shader time was spent in the domain shader, it could be data contention for the patch's control points, stalling the domain shader to the point of only allowing one control point to be read by only one thread (vertex) every two cycles. That would suck...
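To make the division of labour concrete, here is a minimal Python sketch of what the domain shader side does with the coordinates the tessellator hands over. It assumes the simplest possible case, a flat triangle patch with three control points and plain linear interpolation; real domain shaders typically evaluate a curved surface and apply displacement, and the function name is hypothetical.

```python
def domain_shader_flat(control_points, u, v):
    """Evaluate a position on a flat triangle patch.

    control_points: three (x, y, z) tuples, the patch corners.
    u, v: barycentric coordinates in 0..1 supplied by the tessellator;
          the third coordinate is implied as w = 1 - u - v.
    """
    w = 1.0 - u - v
    p0, p1, p2 = control_points
    # Weighted sum of the corners, done per component (x, y, z)
    return tuple(u * a + v * b + w * c for a, b, c in zip(p0, p1, p2))

patch = ((0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0))

corner = domain_shader_flat(patch, 1.0, 0.0)           # lands exactly on the first control point
centroid = domain_shader_flat(patch, 1.0 / 3, 1.0 / 3)  # middle of the patch
print(corner, centroid)
```

Every generated vertex has to read the patch's control points, which is why contention on that data could stall the domain shader as described above.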

The other side of the coin, though, is that reasonable scenarios such as 834 v 618 for "tessellation + displacement mapping" which is 35%, or 978 v 878 "adaptive tessellation + displacement mapping" which is 11%, seem like what a developer would aim for.
This is pretty simple geometry with a low resolution displacement map (in terms of features wrt resolution), though. You may not be able to be so adaptive in the real world.
 
From what I understand with my limited knowledge, geometry is indeed the "strong point" of the Fermi architecture. But is it really such a limiting factor on other architectures? In pure shader power, which is used for geometry shaders as well, HD5 for example is a lot faster, and this also shows in most pixel shader tests.
It doesn't matter how fast the ALUs can crunch through the vertex/hull/domain shaders, because the chip can only assemble and set up one triangle per clock (which only needs 0.5 vertices per clock to run at max speed if the mesh is good). To put that in perspective, Cypress can use 1/10th of its shading power on a 600 flop vertex shader and still saturate the triangle setup.
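A rough Python sketch of that 1/10th figure, under stated assumptions: the 3200 flops/clock value is taken to be the Cypress end of the "2000 to 3200 flops" range quoted earlier in the thread, and the constant and function names are made up for illustration.

```python
CYPRESS_FLOPS_PER_CLOCK = 3200   # assumed: upper figure quoted earlier in the thread
SETUP_TRIS_PER_CLOCK = 1.0       # one triangle set up per clock
VERTICES_PER_TRIANGLE = 0.5      # well-ordered mesh, as stated above
VERTEX_SHADER_FLOPS = 600        # the example shader length used above

def alu_fraction_to_saturate_setup(flops_per_clock, tris_per_clock,
                                   verts_per_tri, shader_flops):
    """Fraction of ALU throughput needed to feed triangle setup at full rate."""
    vertices_per_clock = tris_per_clock * verts_per_tri
    return vertices_per_clock * shader_flops / flops_per_clock

fraction = alu_fraction_to_saturate_setup(
    CYPRESS_FLOPS_PER_CLOCK, SETUP_TRIS_PER_CLOCK,
    VERTICES_PER_TRIANGLE, VERTEX_SHADER_FLOPS)
print(f"{fraction:.1%} of shading power keeps setup saturated")
```

With these inputs the result comes out a little under one tenth, which matches the estimate above.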

What we're learning about tessellation, though, is that the bottleneck is even tighter than that for setup. If my theory is right, the ALUs are just stuck in the domain shader waiting for data.
 