NVIDIA Fermi: Architecture discussion

Fermi still benefits in non-tessellated games, where Evergreen is limited to 1 tri/clock. Did you notice how several reviews mentioned that Fermi has a bigger advantage at lower resolutions (unless memory comes into play)? That's because the pixel load goes down but the vertex load stays the same, so proportionally the vertex load increases and becomes a bigger factor in the FPS you get.
Maybe, but I've seen no evidence this really is due to the distributed setup of Fermi. It could just as well be attributed to other inefficiencies in Evergreen, like the larger batch size for instance, which I think should make more of a difference at low res if you have the same vertex count.
 
Interesting numbers from Fermi analysis in Polish review by PCLab.pl

http://pclab.pl/art41241-6.html

Performance drops for GTX285, GTX480 and HD5870 when going from:
- 0xAF to 16xAF
- 1xAA to 4xAA
- 4xAA to 8xAA

in various titles :!:

Seems to confirm the computerbase.de numbers. Generally, GTX285 has the smallest drop for enabling 16xAF, followed by HD5870, and GTX480 has the highest hit. That makes sense.
And as expected, Fermi fixes the quite large hit for 8xAA that the GTX285 had - overall the 4x/8xAA hits seem quite comparable between GTX480 and HD5870 - though with some very large, interesting differences in some titles.

btw I still have no idea of the ROP/L2 clock. Going by Tridam's Z fill rate numbers (assuming 8 Z samples per ROP per clock), which peak at ~160 GSamples/s in the best case (at 8xAA), that would give a ROP clock of only roughly 450 MHz, which seems awfully low (160/8/48, plus some allowance for inefficiencies), to the point of making no sense. In fact it would mean that in raw pixel throughput it isn't limited by its 32 pixel/clock rasterization rate (at 700 MHz) but rather by its 48 ROPs. Not sure this makes sense, but in any case ROP throughput is indeed slightly lower (by Tridam's numbers) than the GTX285's, even though the rasterization throughput should be slightly higher (700 MHz vs. 648 MHz).

edit: in fact the ROP clock really must be that low (on the order of 450 MHz). I didn't think about pixel fill rate for other pixel formats, which obviously wouldn't be affected by rasterization limits. That is a really low clock, which makes little sense.
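For reference, a rough version of that estimate, taking Tridam's ~160 GSamples/s best-case Z rate, 8 Z samples per ROP per clock, and 48 ROPs as given above:

\[
\frac{160\ \text{GSamples/s}}{8\ \text{samples/ROP/clk} \times 48\ \text{ROPs}} \approx 417\ \text{MHz},
\]

which, with some allowance for inefficiency, lands in the ~450 MHz region rather than anywhere near a 700 MHz core clock.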
 
I wouldn't say "insufficient", but Nvidia's advantage will decrease. The HD 5450 scores 86 Mtri/s in the same test.
Besides, does this 5x directly translate into real games? Unigine, the best-case scenario, shows a ~35% lead. It's far less in other games like Pripyat etc.
 
Cedar doesn't have the same prim rates as the rest of the stack. Cypress, Juniper and Redwood are the same as each other.
So Dave, can you give us an idea as to why Cypress can only make one tessellated vertex every 6 clocks, or one tri every 3?

Yeah, the limit for maximum tessellation is 2 triangles per vertex, but lower tessellation levels (sporadic use or adaptive) make that lower.
I said think harder! :devilish:

The first three/four vertices are always trivial. For a tri-patch, they're (0,0), (1,0), and (0,1). Every vertex generated after those creates two triangles, regardless of tessellation factors. (I use 'vertex' loosely in this paragraph.)
I think TS has a vertex-centric view with the triangles coming out in the wash, because tessellation factors are per edge, not per triangle.
There's also a factor for the face. The edge factors are necessary for continuity between patches. When two patches have different factors, you need to have vertices on the edges match up or you get ugly seam problems. All four factors are used to tessellate.
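To put rough numbers on the 2-triangles-per-vertex limit discussed above, here's a minimal sketch that counts vertices and triangles for a uniformly subdivided tri patch. This is an idealized pattern, not the exact D3D11 tessellator output (which uses the per-edge and interior factors just mentioned), but any fine triangulation approaches the same ratio:

```python
# Sketch only: uniformly subdivide a triangle domain, splitting each edge
# into f segments. Not the exact D3D11 tessellator pattern, but any fine
# planar triangulation approaches 2 triangles per vertex.
def uniform_tri_patch(f):
    vertices = (f + 1) * (f + 2) // 2   # rows of 1, 2, ..., f+1 vertices
    triangles = f * f                   # f^2 small triangles
    return vertices, triangles

for f in (1, 2, 3, 8, 16, 64):
    v, t = uniform_tri_patch(f)
    print(f"factor {f:2d}: {v:4d} verts, {t:4d} tris, {t / v:.2f} tris/vertex")
```

At factor 1 the ratio is only 0.33 and at factor 3 it's 0.9, which is the "lower tessellation levels make it lower" point; at factor 64 it reaches about 1.91.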
Exactly. Generating 10 million visible triangles per frame whose average area is <1 pixel requires vastly more hardware support in the rasterisation/fragment shading/back-end part of the GPU, and is therefore pointless.
It's not pointless by any means. Cypress has 20 SIMDs, yet in your scenario it's getting fed one quad every three clocks. Unless you have a 60 cycle shader (up to 60 fetches and 2400 flops), fragment shading ability is sitting idle. The RBEs can handle 24x the tessellator throughput. The rasterizer, according to Dave, can handle 6x the throughput. Even the dated setup engine can handle 3x the throughput.
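Spelling out that arithmetic (a sketch using the one-quad-every-three-clocks figure above):

\[
3\ \text{clocks/quad} \times 20\ \text{SIMDs} = 60\ \text{clocks between new quads per SIMD},
\]

so any fragment shader shorter than roughly 60 cycles leaves the SIMDs waiting on geometry.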
And most of a GPU's work is due to an order of magnitude of amplification derived from those resulting triangles.
That's a silly argument. First of all, it's not a factor of 10, it's less than a factor of four. Second, you need a lot of work to pack samples together and reduce that amplification. Third, doing so doesn't always speed up processing, because texturing can't share LOD calcs between all pixels of a quad.

Finally, and most importantly, it's a lame excuse. If your quads only have a few samples to be written, that's no reason to have 80% of your SIMDs outputting zero quads/clk.

BTW, 10 million triangles does not mean <1 pixel average area. 50% are frustum culled due to object-level CPU culling granularity. 40% of the rest are backface culled. Half of the rest are in the shadow map (or more, due to triangles that are off screen but cast shadows). Over half of the rest are invisible due to overdraw. So now we're down to screen res divided by (10M * 50% * 60% * 50% * 40%). That is most certainly not <1 pix/tri average area.
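Putting numbers on that chain (the culling percentages are the ones above; 1920x1200 is my assumption for the render target):

\[
10\text{M} \times 0.5 \times 0.6 \times 0.5 \times 0.4 = 600\text{K visible triangles}, \qquad \frac{1920 \times 1200\ \text{pixels}}{600\text{K}} \approx 3.8\ \text{pixels/triangle}.
\]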
since we know that Heaven's absolute triangle counts are low (i.e. < 2 million - though obviously some multiple of that for extreme mode).
And where do you get that information? B3D's review says that three states have 3.7 million triangles, and that accounts for only 71% of the frame time, so there's more.

Let's assume a little more than 4M triangles from tessellation. Without tessellation, GPU time limited by geometry is probably minimal, so those triangles probably add 12M cycles to the frame time, or 14ms. B3D's numbers show the following fps without/with tessellation at different AA settings: 64/38, 43/27, 34/23. Render time differences: 11ms, 14ms, 14ms.

Clearly, tessellation time alone is more than enough to account for the performance impact.
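As a sanity check on those numbers (using the 3 clocks/triangle Cypress rate discussed above, and assuming the HD 5870's 850 MHz core clock, which isn't stated in the post):

\[
4\text{M tris} \times 3\ \text{clk/tri} = 12\text{M clocks}, \qquad \frac{12 \times 10^6\ \text{clocks}}{850\ \text{MHz}} \approx 14\ \text{ms},
\]

while the measured frame-time deltas work out to \(1/38 - 1/64 \approx 11\) ms, \(1/27 - 1/43 \approx 14\) ms and \(1/23 - 1/34 \approx 14\) ms.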
It's not triangle count that matters, it's area per triangle.
And what do you think happens to area per triangle in a real game with triangles off screen, backfacing, in shadow maps, with higher detail displacement maps needing more tessellation, etc? The DXSDK sample has one square mesh that fits entirely within the screen.
Though I remain cautious about the hardwareluxx graph until someone reproduces that HD5870 result.
As will I. There's a review out there that shows the GTX480 having a lower AA impact than Cypress much of the time.
 
Besides, does this 5x directly translate into real games? Unigine, the best-case scenario, shows a ~35% lead. It's far less in other games like Pripyat etc.
I think for the most part this will allow NV users to play games that support tessellation at higher settings. It isn't likely to produce a direct performance benefit, as games have to be made to play well with lower-end hardware that doesn't have anywhere near the geometry power.
 
Maybe, but I've seen no evidence this really is due to the distributed setup of Fermi. It could just as well be attributed to other inefficiencies in Evergreen, like the larger batch size for instance, which I think should make more of a difference at low res if you have the same vertex count.
What makes you think batch sizes in games are a bigger problem for Cypress than Fermi? Heck, what makes you think those problems even exist? Tridam's numbers show a huge advantage for Fermi in culling, and I know for sure that this will help game performance. Show me some batch size data for Fermi vs Cypress.

EDIT: Wait, are you talking about wavefront/warp size, i.e. 8x8 vs. 8x4? That's hardly an issue in games. They rarely have any shaders doing incoherent branching. It would also be a very weak function of resolution, giving the advantage to Fermi only for a very limited range of incoherence. Moreover, we would have seen it with G80-GT200, but there GT200 shrinks its lead over RV770 at low resolutions due to its slower setup.
 
What makes you think batch sizes in games are a bigger problem for Cypress than Fermi?
Because batch size is twice as big?
Heck, what makes you think those problems even exist?
I don't know if this really is a problem. I'm just saying there could be other reasons why GF100 performs relatively better at lower resolutions (it's really only true for about half the benchmarks anyway, and I don't know if the games affected are indeed those with a higher triangle count, as you'd expect). If you think about it, in theory better/larger caches could be a reason too (you should get relatively fewer texture cache hits at lower resolutions).
Just saying there are a lot of reasons why GF100 could be more efficient at lower resolutions, and no doubt the higher geometry throughput is one of them - I'm just not convinced this is really what we're seeing in those game benchmarks where this pattern can be observed.

edit: actually the part about texture caches shouldn't be true.
 
btw I still have no idea of the ROP/L2 clock. Going by Tridam's Z fill rate numbers (assuming 8 Z samples per ROP per clock), which peak at ~160 GSamples/s in the best case (at 8xAA), that would give a ROP clock of only roughly 450 MHz, which seems awfully low (160/8/48, plus some allowance for inefficiencies), to the point of making no sense. In fact it would mean that in raw pixel throughput it isn't limited by its 32 pixel/clock rasterization rate (at 700 MHz) but rather by its 48 ROPs. Not sure this makes sense, but in any case ROP throughput is indeed slightly lower (by Tridam's numbers) than the GTX285's, even though the rasterization throughput should be slightly higher (700 MHz vs. 648 MHz).

At some point I thought the ROP/L2 clock was around 425/450 MHz or even synchronized with the memory clock at 412 MHz. However Nvidia insisted that there are 2 clock domains, GPC and ROP/L2 and that they run at the same 700 MHz frequency.
 
Not really, because the cards process pixels at half the speed of the higher end cards, so the time spent is the same.

Look at my example earlier in the thread:
So Juniper would spend 3ms on geometry and 15ms on pixels at the low resolution for the same 56fps that Cypress got at high res, and a half-Fermi would spend 2ms on geometry and 15 ms on pixels for 59fps. The 13% is now reduced to 6%.

Ah yes, that makes complete sense, though it's based on the assumption that measured geometry performance will scale linearly.
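Working through the quoted example:

\[
\frac{1000\ \text{ms}}{(3 + 15)\ \text{ms}} \approx 56\ \text{fps (Juniper)}, \qquad \frac{1000\ \text{ms}}{(2 + 15)\ \text{ms}} \approx 59\ \text{fps (half-Fermi)}, \qquad \frac{18}{17} \approx 1.06,
\]

so the ~13% gap of the full-size comparison shrinks to about 6%.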
 
Interesting numbers from Fermi analysis in Polish review by PCLab.pl

http://pclab.pl/art41241-6.html

Performance drops for GTX285, GTX480 and HD5870 when going from:
- 0xAF to 16xAF
- 1xAA to 4xAA
- 4xAA to 8xAA

in various titles :!:

I have only read a handful of reviews, but they seem to be quite different (some claiming a 20% difference, some, like the above, showing them very close in most titles). That said, considering the size and power draw of Fermi, I am shocked at how close ATI's 5870 is.

Has anyone tested overclocked 5870s at a similar power draw to compare relative performance per watt as well as per mm^2?
 
According to Nvidia, in the Fermi architecture the transcendental ops and texture interpolation hardware are actually separated now (guess you didn't know that). There are now four transcendental units per 32 ALUs (that you knew), which is a 2:1 ratio change vs. the previous generation. Nvidia said they felt this was a reasonable balance, given that the more decoupled nature of the Fermi design would allow them to use the units more efficiently. Now, whether this separation of TEX interpolation from the transcendentals means they're carrying it out in the normal shader ALUs like AMD does, or whether there's extra hardware involved, we cannot tell just yet.

That's an interesting nugget. So do the SFUs get cheaper as a result, or are they the same but just too slow to keep up with the texturing rate? We'll probably never know.
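For the ratio change, a quick tally (using the publicly stated per-SM unit counts, which aren't spelled out above): GT200 paired 8 ALUs with 2 SFUs per SM, while a Fermi SM pairs 32 ALUs with 4 SFUs:

\[
\text{GT200: } 8{:}2 = 4{:}1, \qquad \text{Fermi: } 32{:}4 = 8{:}1,
\]

i.e. half as many SFUs per ALU.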
 
That's an interesting nugget. So do the SFUs get cheaper as a result, or are they the same but just too slow to keep up with the texturing rate? We'll probably never know.
I guess the SFUs should get a bit cheaper. But probably more importantly, using the normal ALUs for interpolation shouldn't really make them more complicated; there's nothing fancy about interpolation, it's really just mul and add anyway (Intel handles interpolation in the main ALUs too, not in the MathBox). In any case I'm pretty sure there's no special hardware for interpolation (like the tidbit suggested there might be). Oh, and the SFUs, even if there are only half as many, wouldn't have had any problem keeping up with the texturing rate either, as that is halved as well. Maybe the main reason why interpolation is handled by the main ALUs is that there are only half as many SFUs: you don't want to keep them busy with interpolation (which you can easily do with the normal ALUs), instead freeing them up for the stuff you really need them for.
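On the "it's really just mul and add" point, here's a generic plane-equation sketch of attribute interpolation (an illustration of the math only, not a claim about NVIDIA's or AMD's actual data path):

```python
# Screen-space plane equation per attribute: attr(x, y) = a*x + b*y + c.
# Evaluating it is two multiply-adds per attribute per pixel, which maps
# directly onto ordinary shader MAD ALUs.
def interp(plane, x, y):
    a, b, c = plane
    return a * x + b * y + c

# Perspective-correct interpolation evaluates attr/w and 1/w with the same
# kind of plane equation and divides at the end -- still just mul/add plus
# one reciprocal.
def interp_perspective(plane_attr_over_w, plane_one_over_w, x, y):
    return interp(plane_attr_over_w, x, y) / interp(plane_one_over_w, x, y)
```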

I also found the paragraph about rasterization rate interesting. It suggests it's really not 8 pix/clock per GPC, but 2 pix/clock per SM. That makes sense though, I guess, with the PolyMorph engines.
 
I also found the paragraph about rasterization rate interesting. It suggests it's really not 8 pix/clock per GPC, but 2 pix/clock per SM. That makes sense though, I guess, with the PolyMorph engines.
I'm pretty sure they were talking about what comes out of the shader engine, not what can possibly get fed into it or generated at the top.
 
At some point I thought the ROP/L2 clock was around 425/450 MHz or even synchronized with the memory clock at 412 MHz. However Nvidia insisted that there are 2 clock domains, GPC and ROP/L2 and that they run at the same 700 MHz frequency.
Hmm, but the fillrate numbers make no sense if you assume the ROPs run at 700 MHz. Apparently at least some of them (without blend) can't be limited by memory bandwidth at all, and unless those tests do something very strange I can't see why a different format would make them half as fast if they were basically limited by that 32 (30 actually) pixel/clock rasterization limit.
Any chance you could run some overclocking tests to disprove nvidia :).
 
Hmm, but the fillrate numbers make no sense if you assume the ROPs run at 700 MHz. Apparently at least some of them (without blend) can't be limited by memory bandwidth at all, and unless those tests do something very strange I can't see why a different format would make them half as fast if they were basically limited by that 32 (30 actually) pixel/clock rasterization limit.
Which specific fillrate numbers are you talking about? I'm a bit lost atm.
 
Hmm, but the fillrate numbers make no sense if you assume the ROPs run at 700 MHz. Apparently at least some of them (without blend) can't be limited by memory bandwidth at all, and unless those tests do something very strange I can't see why a different format would make them half as fast if they were basically limited by that 32 (30 actually) pixel/clock rasterization limit.
Any chance you could run some overclocking tests to disprove nvidia :).

I don't have the proper tool to overclock yet :/
 
Those here:
http://forum.beyond3d.com/showthread.php?t=56155&page=219
They fit perfectly with 48 ROPs running somewhere below 450 MHz. There's no way to fit the data assuming the ROPs run at 700 MHz...

It's late and I may be wrong, but I'm not following how those results aren't in line with the fact that their rasterizers output 32 pixels per clock total with color, and 8 times that with z-only... what am I missing? Assuming inherent inefficiencies, the numbers fit more or less fine AFAICT.
 