NVIDIA Fermi: Architecture discussion

It's late and I may be wrong, but I'm not following how those results aren't in line with the fact that their rasterizers output 32 pixels per clock total with color, and 8 times that with z-only... what am I missing? Assuming inherent inefficiencies, the numbers fit more or less fine AFAICT.
If it were limited by the 32 pixels/clock, why would the results for other pixel formats be exactly half? Unless somehow not enough data could be passed from the SMs to the ROPs, this makes no sense.
(edit: in fact even then it makes no sense as the data passed is the same anyway)
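For what it's worth, a quick back-of-the-envelope of those ceilings (a sketch only; the 4 raster engines × 8 px/clk figure and the ~700 MHz core clock are my assumptions here, not measured numbers):

```python
# Rough fillrate ceilings for GF100, assuming 4 raster engines at 8 px/clk
# each and a ~700 MHz core clock (assumed figures, not measurements).
RASTER_PX_PER_CLK = 4 * 8        # 32 colour pixels/clk total
Z_ONLY_MULTIPLIER = 8            # 8x rate for z-only, per the post above
CORE_CLK_MHZ = 700

colour_fill = RASTER_PX_PER_CLK * CORE_CLK_MHZ * 1e6 / 1e9   # Gpix/s
z_only_fill = colour_fill * Z_ONLY_MULTIPLIER

print(f"colour fill ceiling: {colour_fill:.1f} Gpix/s")
print(f"z-only fill ceiling: {z_only_fill:.1f} Gpix/s")
```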
 
I said think harder! :devilish:

The first three/four vertices are always trivial; for a tri-patch they're (0,0), (1,0), and (0,1). Every vertex generated after those creates two triangles, regardless of tessellation factors. (I use 'vertex' loosely in this paragraph.)
In general with lower tessellation factors and anisotropy, you'll get notably less than 2 triangles per extra vertex. Go count some tessellated patches' triangles and vertices if you don't believe me. I counted one tessellated patch with 146 triangles and 87 vertices, earlier ;)
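To make the counting concrete, here's a rough sketch for a tri-patch subdivided uniformly with integer factor n (it ignores fractional partitioning and per-edge factors, which is why a real count like 146/87 won't match it exactly):

```python
# Triangle/vertex counts for a triangle subdivided uniformly with integer
# factor n: (n+1)(n+2)/2 vertices and n^2 triangles.
def tri_patch_counts(n):
    verts = (n + 1) * (n + 2) // 2
    tris = n * n
    return verts, tris

for n in range(2, 9):
    v, t = tri_patch_counts(n)
    extra_v = v - 3                      # vertices beyond the three corners
    ratio = t / extra_v
    print(f"factor {n}: {v:3d} verts, {t:3d} tris, "
          f"{ratio:.2f} tris per extra vertex")
```

The ratio only creeps towards 2 triangles per extra vertex as the factor grows, which is the point being made above about low factors.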

There's also a factor for the face (the inside tessellation factor). The edge factors are necessary for continuity between patches: when two adjacent patches have different factors, you need the vertices on the shared edge to match up or you get ugly seam problems. All four factors are used to tessellate.
I don't remember hearing about a face factor, and I was under the impression developers have to manage adjacent-edge tessellation factors (and orientation) very carefully in order to ensure there are no T-junctions.

I don't see how a face factor can work when a patch's edges, each with a different factor, each abut another patch with its own face factor.

If you compare these two shots:

http://unigine.com/devlog/090928-dragon_no_tesselation_wire1.jpg
http://unigine.com/devlog/090928-dragon_tesselation_wire1.jpg

it appears they are tessellating all triangles' edges to the same factor regardless of size. Perhaps that's their work-around for T-junctions.

It's not pointless by any means. Cypress has 20 SIMDs, yet in your scenario it's getting fed one quad every three clocks.
AMD says that the architecture's comfort zone bottoms out at 8 fragments per clock, in effect. It's not "my scenario", it's how the hardware works.

Unless you have a 60 cycle shader (up to 60 fetches and 2400 flops), fragment shading ability is sitting idle. The RBEs can handle 24x the tessellator throughput. The rasterizer, according to Dave, can handle 6x the throughput.
The rasteriser is the bottleneck on hardware thread generation, I presume: a new hardware thread can be started only once every 4 cycles per group of 10 SIMDs, i.e. only one SIMD in the group can start a hardware thread in any 4-cycle window, when setup is exporting 1-pixel triangles.

TS, in this scenario, is producing triangles every 3 cycles. So the SIMDs can't go any faster. They're starved by the huge granularity of rasterisation and thread generation, not by lack of triangles.

The architecture is designed for big triangles, spanning >64 fragments, with a single triangle coming out of setup and, in the best case, being sent to both rasterisers and fully occupying them both, resulting in the generation of a total of 128 fragments during 4 cycles.
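A quick sketch of the two regimes being contrasted, using the figures quoted in this post (1-pixel triangles once every 3 cycles versus a big triangle feeding both rasterisers at 16 px/clk each):

```python
# Fragments per clock in the two regimes discussed above (assumed figures).
# Small-triangle case: 1-pixel triangles, one every 3 cycles from TS.
small_tri_frags_per_clk = 1 / 3

# Big-triangle case: a triangle spanning >64 fragments feeds both
# rasterisers at 16 fragments/clk each, i.e. 128 fragments over 4 cycles.
big_tri_frags_per_clk = 2 * 16

print(f"small triangles: {small_tri_frags_per_clk:.2f} fragments/clk")
print(f"big triangles:   {big_tri_frags_per_clk} fragments/clk")
print(f"ratio: ~{big_tri_frags_per_clk / small_tri_frags_per_clk:.0f}x")
```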

That's a silly argument. First of all, it's not a factor of 10, it's less than a factor of four. Second, you need a lot of work to pack samples together and reduce that amplification. Third, doing so doesn't always speed up processing, because texturing can't share LOD calcs between all pixels of a quad.
You're missing the point: this architecture is designed for big triangles.

Finally, and most importantly, it's a lame excuse. If your quads only have a few samples to be written, that's no reason to have 80% of your SIMDs outputting zero quads/clk.
Don't shoot the messenger.

BTW, 10 million triangles does not mean <1 pixel average area. 50% are frustum culled due to object-level CPU culling granularity. 40% of the rest are backface culled. Half of the rest are rendered to the shadow map (or more, counting triangles that are off screen but still cast shadows). Over half of the rest are invisible due to overdraw. So now we're down to screen res divided by (10M * 50% * 60% * 50% * 40% ≈ 600k) triangles; a ~2 Mpixel screen spread over 600k triangles is roughly 3-4 pixels each. That is most certainly not <1 pix/tri avg area.
I'm talking about triangles entering rasterisation.

And where do you get that information? B3D's review says that three states have 3.7 million triangles, and that accounts for only 71% of the frame time, so there's more.
It says that the maximum triangles in any draw call is 1.6 million, coming out of TS.

A lot of the remaining frame time will be taken with post-processing.

Let's assume a little more than 4M triangles from tessellation. Without tessellation, GPU time limited by geometry is probably minimal, so those triangles probably add 12M cycles (3 cycles each) to the frame time, or 14ms at Cypress's 850MHz. B3D's numbers show the following fps without/with tessellation at different AA settings: 64/38, 43/27, 34/23. Render time differences: 11ms, 14ms, 14ms.
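As a quick check of those deltas, converting the quoted fps figures to milliseconds:

```python
# Render-time deltas implied by the fps pairs quoted above (without/with
# tessellation at three AA settings).
fps_pairs = [(64, 38), (43, 27), (34, 23)]
for without, with_tess in fps_pairs:
    delta_ms = 1000 / with_tess - 1000 / without
    print(f"{without} -> {with_tess} fps: +{delta_ms:.0f} ms")
```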
All academic for an architecture that likes big triangles, I'm afraid. This is completely the wrong workload. It's like giving NV40 nested, divergent control flow in a pixel shader. It'll just puke it back in your face.

Jawed
 
If it were limited by the 32 pixels/clock, why would the results for other pixel formats be exactly half? Unless somehow not enough data could be passed from the SMs to the ROPs, this makes no sense.
(edit: in fact even then it makes no sense as the data passed is the same anyway)

Why wouldn't they be? I think their ROPs do 1 FP16 pixel per 2 clocks (not 100% sure, foggy recollection alert - tomorrow morning I'll make sure), so that'd be 16 FP16 pixels per clock in total - pretty much in line with what monsieur Triolet is getting, again taking into account inherent inefficiencies. Cypress is full-rate there, but doesn't have the BW to do 32 FP16 pixel writes per clock. So the only outlier remaining would be 11:11:10, which seems to be half-rate too (about that one I don't remember any mentions). Am I still missing something? :???:
 
One thing I liked:
The longer your register file, the more the girls will like you! - True? Compared to GT200, the registers available per individual core are lower in the GF100/Fermi architecture. But don't graphics and CUDA programs tend to get longer, consuming more register space, we wonder?

Nvidia says they've been looking across a variety of workloads, including long-running programs, and are pretty satisfied with the ratio of floating-point units to register space (FP:RF) in Fermi. In general, they said, they find the scalar architecture to be very helpful in minimizing RF requirements, and the addition of L1 cache in Fermi improves spill performance. That's when the register file is full and you need to store the data somewhere - that's when the new L1 cache comes in. This effectively gives the architecture a capacity amplifier.
This is something I have mentioned several times in architecture threads, and I was going to make my own thread about it last year.

It is a common misconception that registers per SM is the metric that matters for hiding latency. What you want to look at is registers per texture unit, because that's the latency you want to hide. If you double the ALUs but keep the TUs the same (or in this case reduce them), then you do not need to double the total register count to have the same latency-hiding ability. I wrote a program to simulate the way SIMD engines process wavefronts, and it confirms my conviction on the matter.

Latency hiding = # threads / tex throughput

(More specifically, the last term is average texture clause throughput. I know NVidia doesn't use clauses, but you can still group texture accesses together by dependency to create quasi-clauses and get a slightly underestimated value of latency hiding.)
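A tiny worked example of that formula, with made-up numbers (the wavefront size, resident count and fetch rate below are illustrative, not any particular chip's):

```python
# Worked example of "latency hiding = # threads / tex throughput":
# if each resident thread has one outstanding texture fetch, the fetch
# latency you can cover is the time it takes the texture units to chew
# through one fetch per thread. All figures below are assumed.
resident_wavefronts = 8
threads_per_wavefront = 64
tex_fetches_per_clk = 16          # texture throughput feeding this SIMD

threads = resident_wavefronts * threads_per_wavefront
hideable_latency_clks = threads / tex_fetches_per_clk
print(f"{threads} threads / {tex_fetches_per_clk} fetches/clk "
      f"= {hideable_latency_clks:.0f} clks of latency hidden")
```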
 
One thing I liked:

This is something I have mentioned several times in architecture threads, and I was going to make my own thread about it last year.

It is a common misconception that registers per SM is the metric that matters for hiding latency. What you want to look at is registers per texture unit, because that's the latency you want to hide. If you double the ALUs but keep the TUs the same (or in this case reduce them), then you do not need to double the total register count to have the same latency-hiding ability. I wrote a program to simulate the way SIMD engines process wavefronts, and it confirms my conviction on the matter.

Latency hiding = # threads / tex throughput

(More specifically, the last term is average texture clause throughput. I know NVidia doesn't use clauses, but you can still group texture accesses together by dependency to create quasi-clauses and get a slightly underestimated value of latency hiding.)
I kind of agree with what you say here, but going forward the denominator of your formula should be replaced by a term that takes into consideration texture throughput and GS from/to UAV throughput (and perhaps global atomics too).
 
Why wouldn't they be? I think their ROPs do 1 FP16 pixel per 2 clocks(not 100% sure, foggy recollection alert - tomorrow morning I'll make sure), so that'd be 16 FP16 pixels per clock in total - pretty much in line with what monsieur Triolet is getting, again taking into account inherent inefficiencies. Cypress is full-rate there, but doesn't have the BW to do 32 FP16 pixel writes per clock. So the only outlier remaining would be 11:11:10, which seems to be half-rate too(about that one I don't remember any mentions). Am I still missing something?:???:
Yes, the obvious. Pixel throughput may be limited to 32 pix/clock due to rasterization limits, but there are 48 ROPs...
 
I kind of agree with what you say here, but going forward the denominator of your formula should be replaced by a term that takes into consideration texture throughput and GS from/to UAV throughput (and perhaps global atomics too).
Well if you want to be picky :smile:

Yeah, tex throughput really means average grouped memory access throughput. So for any program, count the number of groups of independent loads of all types, and divide by the number of cycles it takes to complete the program at the theoretical throughput (noting that math, branches, loads and tex fetches can often be done in parallel). This groups/clk is the number we need in the denominator.

Of course, if you only have a few wavefronts, then variations within a program won't average out very well. I had a very simple scheduler in my simulation, so sometimes I got a sort of 'shader aliasing' where all the wavefronts were doing tex fetches at the same time or math at the same time, and only when I randomized the fetch latency did things even out.
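For anyone who wants to play with this, here's a minimal toy model in the same spirit (my own sketch, not the original program): wavefronts alternate between an ALU burst and a long texture fetch, and ALU utilisation is measured as the number of resident wavefronts grows.

```python
import random

def simulate(num_wavefronts, alu_clks=8, fetch_latency=300,
             fetch_issue_clks=4, total_clks=20000, jitter=0.25):
    """Toy model: each wavefront loops ALU burst -> texture fetch.
    The ALU serves one wavefront per clock; a fetch returns after
    fetch_latency clocks (+/- jitter). Returns ALU utilisation."""
    ready_at = [0] * num_wavefronts          # clock when each wavefront may run
    remaining_alu = [alu_clks] * num_wavefronts
    busy = 0
    for clk in range(total_clks):
        # pick the first ready wavefront (very simple scheduler)
        for i in range(num_wavefronts):
            if ready_at[i] <= clk:
                busy += 1
                remaining_alu[i] -= 1
                if remaining_alu[i] == 0:
                    # issue a texture fetch; sleep until it returns
                    lat = fetch_latency * (1 + random.uniform(-jitter, jitter))
                    ready_at[i] = clk + fetch_issue_clks + int(lat)
                    remaining_alu[i] = alu_clks
                break
    return busy / total_clks

for n in (2, 4, 8, 16, 32, 64):
    print(f"{n:3d} wavefronts -> ALU utilisation {simulate(n):.0%}")
```

With these made-up numbers you need on the order of fetch_latency / alu_clks wavefronts before the ALU stops starving, and the randomised latency is what smooths out the 'shader aliasing' mentioned above.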
 
Yes, the obvious. Pixel throughput may be limited to 32 pix/clock due to rasterization limits, but there are 48 ROPs...

But if the ROPs only get fed 32, or rather 30, pixels and take two clocks for FP16 or four for multichannel FP32, then the results are perfectly in line with what you'd expect; I don't think you can split that FP16 stuff between different octo-ROP units.
 
Remember the cookie monster? :D

I've managed to find an uber-high-resolution shot of GF100's die, and looked in more detail at this part:



Does it look like an array of pad structures -- an interface for something? And there is a clock generator, marked with the little red rectangle.
 
But if the ROPs only get fed 32, or rather 30, pixels and take two clocks for FP16 or four for multichannel FP32, then the results are perfectly in line with what you'd expect; I don't think you can split that FP16 stuff between different octo-ROP units.
No, they can't get split. But obviously those 32 pixels get spread to all 6 octo-ROPs (well, on average), and the only way you would get only half rate is if somehow those pixels were fed with zero buffering. I don't think that makes sense, though admittedly I'm not quite sure what the datapath from the SMs (or GPCs?) to the ROPs looks like.
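To lay out the arithmetic being argued about (a sketch; the 32 px/clk raster feed and the half-rate FP16 per ROP are the figures assumed in this thread, not confirmed specs):

```python
# Theoretical per-clock throughputs under the assumptions in this thread.
raster_px_per_clk = 4 * 8          # 32 px/clk out of the rasterisers
num_rops = 48
fp16_px_per_rop_per_clk = 0.5      # half-rate FP16 writes per ROP (assumed)

rop_fp16_px_per_clk = num_rops * fp16_px_per_rop_per_clk   # 24 px/clk
ceiling = min(raster_px_per_clk, rop_fp16_px_per_clk)

print(f"raster-limited: {raster_px_per_clk} px/clk")
print(f"ROP-limited (FP16): {rop_fp16_px_per_clk:.0f} px/clk")
print(f"expected ceiling if pixels spread across all ROPs: {ceiling:.0f} px/clk")
# A measured rate around half the INT8 result would instead suggest the
# pixels are not being buffered/spread evenly across the ROP partitions.
```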
 
I imagine there's a crossbar from the shader engine feeding the octo-ROPs through L2 in a round-robin/first-free fashion. In normal mode, i.e. single-cycle ROP operations, you'd get what you expect given the 32/30/28 ppc from the shaders and have some ROPs idling in between. With multi-cycle operations you can make use of the abundance of ROPs, as for example in the case of 8xAA or the more intense data formats.

The crossbar seems to be needed anyway so that you can make the L2-caches talk to each other and appear as a unified cache.

But obviously, Cypress is able to operate much closer to its theoretical peaks in these tests.
 
Hmm, but the fillrate numbers make no sense if you assume the ROPs run at 700MHz. Apparently at least some of them (without blending) can't be limited by memory bandwidth at all, and unless those tests do something very strange I can't see why a different format would make them half as fast if they were basically limited by that 32 (30 actually) pixel/clock rasterization limit.
Any chance you could run some overclocking tests to disprove nvidia :).

I checked with overclocking. Increasing the memory frequency doesn't increase fillrate on any format. However increasing the shader frequency from 1400 to 1540 MHz directly gives 10% more fillrate on every format. There could still be a fixed multiplier applying a lower clock to the ROPs/L2, but it's linked to the shader clock.
 
This thread is strictly for discussing GT300's architecture (not market conditions, how loud fans are, etc). This discussion should be on the technical side (and thus may be above some people's pay grade; you should ask yourself "do I belong in this thread" before posting).
 
I checked with overclocking. Increasing the memory frequency doesn't increase fillrate on any format. However increasing the shader frequency from 1400 to 1540 MHz directly gives 10% more fillrate on every format. There could still be a fixed multiplier applying a lower clock to the ROPs/L2, but it's linked to the shader clock.
Oh, interesting. So at least it's possible the ROP clock is the same...
I am puzzled in this case, though, about the ROP throughput (the FP11 format, for instance). Any theories?
 