NVIDIA Fermi: Architecture discussion

Discussion in 'Architecture and Products' started by Rys, Sep 30, 2009.

  1. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    10,873
    Likes Received:
    767
    Location:
    London
    My argument with this "idealised" stance is that the actual sequence of operations undertaken by the processor results in varying performance for the same ALU:TEX.

    http://developer.amd.com/gpu_assets/PLDI08Tutorial.pdf

    You can clearly see in versions 3, 4 and 5 that, despite identical ALU:TEX, performance varies substantially.

    5 is faster than 3 despite the fact that 5 has fewer threads in flight per SIMD than 3. The estimated thread count for version 3 is 256/28 = 9, while for version 5 it is 256/38 = 6. (Both estimates are subject to clause-temporary overhead. Also, I suspect that 256 is not the correct baseline; something like 240 might be better, not sure...)

    Evergreen GPUs support 16-long TEX clauses as opposed to the 8-long clauses seen in R600-RV790. There are two reasons to do this:
    1. all clause switches increase the latency experienced by an individual hardware thread as switching has latency, so packing TEX instructions into a lower count of clauses reduces the total latency experienced
    2. TEX clauses are sensitive to cache behaviour, so a doubling in TEX clause length can increase coherency
    So, in summary, the idealised stance is merely a starting point.

    Going back in time, your argument is that if AMD doubles ALU:TEX to, e.g., 8:1 in the next GPU while leaving the overall ALU/TEX architecture alone, each ALU would only need half the register file. The 256KB of aggregate register file per SIMD we see in Evergreen would be enough for the next GPU. Well, clearly this is fallacious, as version 5 above would be reduced to a mere 3 hardware threads, killing throughput (3 hardware threads means that neither ALU nor TEX clauses can be 100% occupied by hardware threads, since both require pairs of hardware threads for full utilisation).
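    The thread-count estimates in the two paragraphs above reduce to simple integer division. A minimal sketch, assuming the 256-registers-per-thread baseline Jawed uses (he notes ~240 may be more accurate) and ignoring clause-temporary overhead; `threads_in_flight` is a hypothetical helper, not anything from AMD's tooling:

```python
# Back-of-envelope sketch of the threads-in-flight estimates above.
# Assumption: a budget of 256 registers per SIMD lane, divided by the
# per-thread register counts (28 and 38) of the PLDI08 tutorial kernels.
def threads_in_flight(register_budget, regs_per_thread):
    return register_budget // regs_per_thread

print(threads_in_flight(256, 28))  # version 3: 9 hardware threads
print(threads_in_flight(256, 38))  # version 5: 6 hardware threads

# The hypothetical 8:1 GPU with the register file halved per ALU:
print(threads_in_flight(128, 38))  # 3 threads: too few to keep ALU+TEX
                                   # clause pairs fully occupied
```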

    Separately, I've long maintained that careful management of register spill can be used to amplify the effective size of the register file (hardly news: CPUs are continuously doing this). NVidia's older architectures and ATI's current one take a rather naive and useless all-or-nothing approach to register spill, i.e. there's zero optimisation for the register-spill case, so once it's induced the wheels fall off. These GPUs are set up for fencing, and don't like it when you come armed with a baseball bat.

    AMD will have to catch-up and implement register spill properly. One of a long list of catch-ups, in comparison with Fermi.

    Jawed
     
  2. CarstenS

    Veteran Subscriber

    Joined:
    May 31, 2002
    Messages:
    4,798
    Likes Received:
    2,056
    Location:
    Germany
    You can rasterize 32 pixels all right, but then - what are you going to do with two of them when the next part of the pipe only fits 30/28 at a time?

    Now I know what you meant :) 'twas my silly mistake…
     
  3. Dave Baumann

    Dave Baumann Gamerscore Wh...
    Moderator Legend

    Joined:
    Jan 29, 2002
    Messages:
    14,079
    Likes Received:
    648
    Location:
    O Canada!
    Really? To what end?
     
  4. mczak

    Veteran

    Joined:
    Oct 24, 2002
    Messages:
    3,015
    Likes Received:
    112
    Coupling this to memory clock could make sense IMHO, but it doesn't fit the overclocking data Tridam gathered. Apparently none of the fillrate numbers budge a bit when the memory clock changes (which is imho quite remarkable anyway, since it suggests that despite the low memory clock it's not really bandwidth limited, even though the ROPs are the major bandwidth consumers); instead they scale linearly with core clock.
     
  5. trinibwoy

    trinibwoy Meh
    Legend

    Joined:
    Mar 17, 2004
    Messages:
    10,430
    Likes Received:
    433
    Location:
    New York
    What is the next part of the pipe that you're referring to that can only accommodate 30 pixels?
     
  6. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    10,873
    Likes Received:
    767
    Location:
    London
    Performance on workloads beyond the low-hanging fruit that's been much of the focus so far? Dunno, does AMD want to compete on OpenCL performance?

    The strange thing about register spill is that it's not a million miles away from the way the ATI compiler currently has to allocate clause-temporary registers. It can only allocate clause-temporaries after determining lifetime, something that's also required when evaluating register spill.

    Seems to me any argument against register spill is like arguing against local memory/L1 back before G80 came out. The signs were clear in papers from back then that GPGPU needed to go in this direction. No matter how clean the Brook stream model is, it's too restrictive in the real world. Fitting the entire context into registers is a "too-clean" model, it fails with increased kernel complexity.

    If you want to argue that the programmer should explicitly manage spillage simply by doing their own reads/writes to global memory, well I think that's a step too far when competing architectures scale smoothly with growth of work-item context. A few KB of context per work-item shouldn't be treated like a pure global memory resource, just because it's 512 bytes too much to fit into registers for a given latency-hiding constraint.

    Jawed
     
  7. no-X

    Veteran

    Joined:
    May 28, 2005
    Messages:
    2,298
    Likes Received:
    247
    GTX480 has 1.5 times as many ROPs; 1.5× better results are quite expectable to me...
     
  8. CarstenS

    Veteran Subscriber

    Joined:
    May 31, 2002
    Messages:
    4,798
    Likes Received:
    2,056
    Location:
    Germany
    The shader engine, actually. In current configurations it can output a maximum of 30 ppc to the ROPs.
     
  9. mczak

    Veteran

    Joined:
    Oct 24, 2002
    Messages:
    3,015
    Likes Received:
    112
    Except that z sample rate is often bandwidth limited, especially without AA (with 24bit z and 160GB/s you could only do ~53GSamples/s if there's no compression).
    Though by comparing HD4890 to HD5870, I'm actually not sure that theory is true any longer, because the z sample results are exactly twice as high no matter the AA setting... So it looks like there's something else preventing HD5870 from reaching its peak z sample rate except with 8xAA... Maybe it just can't push that many pixels...
    FWIW GT200b seems to only reach half its potential, since afaik it should also be capable of 8xZ per clock just like g80/g92/gf100.
    I agree though if it's not memory bandwidth limited due to good compression, the results make sense, and whatever holds AMD chips back at non-AA is not a problem for nvidia chips.
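    The ~53 GSamples/s ceiling quoted above follows from a one-line calculation. A sketch of that arithmetic, assuming uncompressed 24-bit z samples and the post's 160 GB/s bandwidth figure:

```python
# Sketch of the z-fill bandwidth ceiling: with uncompressed 24-bit
# (3-byte) z samples, raw memory bandwidth caps the sample rate.
bandwidth = 160e9             # bytes/s (160 GB/s, the figure from the post)
bytes_per_z_sample = 24 // 8  # 24-bit z, no compression assumed
peak_gsamples = bandwidth / bytes_per_z_sample / 1e9
print(round(peak_gsamples))   # ~53 GSamples/s
```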
     
  10. trinibwoy

    trinibwoy Meh
    Legend

    Joined:
    Mar 17, 2004
    Messages:
    10,430
    Likes Received:
    433
    Location:
    New York
    Well, now we're right back where we started. My original question is why is it assumed that each SM can only output 2 pixels per clock? That makes no sense to me since unified architectures since G80 are all about running different workloads in parallel. So you're telling me that if only half the chip is running pixel shaders, fillrate falls to 16 ppc?
     
  11. CarstenS

    Veteran Subscriber

    Joined:
    May 31, 2002
    Messages:
    4,798
    Likes Received:
    2,056
    Location:
    Germany
    You lost me. AFAIK each GPC can rasterize 8 ppc and send the work off to the shader engine. That makes for 32 ppc rasterized for both GTX 480 and 470. What follows is pixel shading. The shader engine as a whole can process 32 ppc, but in current configs only a maximum of 15 out of 16 SMs are active, so it's only 15/16ths of 32 pixels (i.e. 30) that can be sent off at a time.

    If they're in a format the ROPs can process in a single cycle: voilà, 28-30 ppc fillrate.
    If the ROPs take two cycles, fillrate is halved.
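    The per-clock arithmetic above can be sketched in a few lines. The 4-GPC/8-ppc figures are from the post; the 15- and 14-SM configurations are my assumption for GTX 480 and GTX 470 respectively, to show where the 28-30 ppc range comes from:

```python
# Sketch of the pixels-per-clock arithmetic in the post above.
gpcs, ppc_per_gpc = 4, 8
rasterized = gpcs * ppc_per_gpc  # 32 ppc out of the rasterizers

# Shader engine output scales with the fraction of active SMs
# (assumed: 15 SMs active on GTX 480, 14 on GTX 470, of 16 total).
shaded = {n: rasterized * n // 16 for n in (15, 14)}
print(rasterized, shaded)  # 32 {15: 30, 14: 28}
```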
     
  12. trinibwoy

    trinibwoy Meh
    Legend

    Joined:
    Mar 17, 2004
    Messages:
    10,430
    Likes Received:
    433
    Location:
    New York
    Well, you keep saying that, but as with my first question I'm asking where it's coming from. Who said all SMs combined can only output 32 ppc?
     
  13. cho

    cho
    Regular

    Joined:
    Feb 9, 2002
    Messages:
    416
    Likes Received:
    2
    1 warp = 8 pixel quads

    Fermi can do 2 warps per SM per 2 shader clocks,

    so on Fermi, it should be 4 pixels per SM per shader clock?
     
  14. no-X

    Veteran

    Joined:
    May 28, 2005
    Messages:
    2,298
    Likes Received:
    247
    http://www.pcgameshardware.com/aid,743526/Some-gory-guts-of-Geforce-GTX-470/480-explained/News/
     
  15. trinibwoy

    trinibwoy Meh
    Legend

    Joined:
    Mar 17, 2004
    Messages:
    10,430
    Likes Received:
    433
    Location:
    New York
    Well it can retire two instructions for two half-warps every hot clock. So that's two full warps per scheduler clock. So potentially 64 pixels per SM. Hence my amazement that it could be only 2.

    Yep, that's why I originally asked:

    Was hoping there was something more detailed out there.
     
  16. Mintmaster

    Veteran

    Joined:
    Mar 31, 2002
    Messages:
    3,897
    Likes Received:
    87
    I'm not assuming the latter at all (hence my use of quads/clk throughout my post), and the former is a solid assumption. Remember that Xenos is able to do this, and it's more closely related to R600 onwards than R520. What's so hard about it? All Cypress has to do is store vertex indices for each quad so that the interpolation routine can fetch the correct data. With RV770 and earlier, the shading engines had no idea which triangle the quads came from because interpolation was done earlier and stuffed into the registers.

    Because they already did the minimal work required in Xenos and probably earlier.

    You're wrong about "so few small triangles, ever". I'm having trouble finding public information about triangle size distributions, but here's an old one with GLQuake II information:
    http://citeseerx.ist.psu.edu/viewdo...CAA817BB2?doi=10.1.1.56.8726&rep=rep1&type=ps
    They do glide interception, but don't mention the rendering resolution (let's assume 640x480). They find 41% of the visible triangles are 1-25 pixels in size. How is this possible if 640*480 pixels over ~1300 triangles gives an average of roughly 236 pixels per triangle? Well, that's the area-weighted average triangle size and says nothing about the distribution. Nowadays we have 10x the pixels and well over 100x the triangles, so that likely means even more triangles are tiny. Still think triangles are big?
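    The average-vs-distribution point can be illustrated with made-up numbers (purely hypothetical triangle areas, not data from the cited paper): a large share of triangles can be tiny even when the average triangle size is large, because a few big triangles dominate the total.

```python
# Hypothetical illustration: 41% of triangles are tiny, yet the average
# triangle size is large, because the big triangles dominate the area.
areas = [5] * 41 + [400] * 59  # 41 tiny triangles, 59 large ones
average = sum(areas) / len(areas)
small_share = sum(a <= 25 for a in areas) / len(areas)
print(round(average), small_share)  # 238 0.41
```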

    It's just plain stupid not to have multiple triangles per hardware thread.

    Look, it's very easy to test this. Draw a fullscreen mesh with a 10,000 cycle pixel shader and no AA, and see what happens as triangle edge length decreases from 100 pixels down to 0.1. If you're right, it will bottom out at 1/64th of large triangle performance. If I'm right it will bottom out at 1/4th.

    Do you have a split personality? You are the one that proposed >10M visible triangles per frame, not me. You are the one that brought up one-pixel triangles, not me.

    If you have a 40 cycle shader (up to 400 flops, 10 bilinear fetches), ATI's tessellator will avoid idling the SIMDs only if the rasterizer generates a buffered average of six quads per triangle. Note that I'm counting wavefronts with as little as one visible sample in each of its 16 quads as keeping a SIMD busy. The tessellator is a bottleneck for triangles with an area of 5, 10, even 15 pixels. One-pixel triangles are completely irrelevant. Do you get it yet?

    As I explained earlier, a shadowmapped game with 4M triangles per frame will work out to around 300k visible triangles in the final rendering view. That's an average of 5-10 pixels per triangle, depending on resolution, and thus is not comparable to your strawman case of single pixel polygons. Heaven shows us that 4M triangles is enough for tessellation performance to be a major factor.
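    The 5-10 pixels-per-triangle figure checks out as a quick division. A sketch, where the ~300k visible-triangle count is from the post and the two resolutions are my assumption of typical settings for the era:

```python
# Rough check of the 5-10 pixels/triangle claim for ~300k visible triangles.
visible_tris = 300_000
px_per_tri = {(w, h): w * h / visible_tris
              for (w, h) in [(1680, 1050), (1920, 1200)]}  # assumed resolutions
print(px_per_tri)  # ~5.9 and ~7.7 pixels per triangle
```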

    Just because IHVs have under 50% quad-filling efficiency for <10 pixel triangles doesn't mean that they'll throw in the towel and let wavefront efficiency drop to <10%.
    Why are you assuming 20 fps when B3D provides perf numbers much higher than that? You don't know how much that state (which can have multiple draw calls, BTW) changes without tessellation, so you're going about it the wrong way. That's why I looked at total frame time, because we have differential data in the review.

    That's a dumb assumption. Why do you think vertex load is negligible? Only the objects that the programmer bothered to cull with the CPU will be eliminated, and this is a demo.

    And this is why I use render time differential. All the assumptions you're making are grossly flawed and unnecessary with my approach. The difference in render time that I'm highlighting is due only to the difference in workload brought about by enabling tessellation.

    Stop putting words into my mouth AGAIN. I never said doubling setup was easy. I said the rasterizer, shading engine, and render backend could be left untouched. You said it needs to be overhauled to take advantage of faster tessellation, and that's just plain wrong.

    Leaving the setup alone and just improving the tessellator to one triangle per clock, on the other hand, is very easy, and I'm baffled as to why Cypress can't do that. Laziness? A bug? Planned obsolescence?

    Has anyone timed how many triangles per clock we see from R600 onwards when creating tri-strips in the geometry shader? Maybe they just left the geometry amplification hardware unchanged.
     
  17. CarstenS

    Veteran Subscriber

    Joined:
    May 31, 2002
    Messages:
    4,798
    Likes Received:
    2,056
    Location:
    Germany
    Didn't I confirm this? Sorry, I didn't want to do too much advertising for our own site.
    And no, apart from the cited paragraph, Nvidia hasn't said anything more detailed about it yet.


    That's why (not only) GTX480 can achieve its peak "fillrate" with longer shaders too.
    For example, this shader from MDolenc's Fillrate Tester also achieves "peak fillrate throughput" (i.e. ~20ish GPix/s) despite being more than just one operation per pixel:
    Code:
    ps_2_0
    
    dcl v0
    dcl v1
    
    def c0, 0.3f, 0.7f, 0.2f, 0.4f
    def c1, 0.9f, 0.3f, 0.8f, 0.6f
    
    add r0, c0, v1
    mad r1, c1, r0, -v0
    mad r2, v1, r1, c1
    mad r3, r0, r1, r2
    mov oC0, r3
     
    #3937 CarstenS, Apr 1, 2010
    Last edited by a moderator: Apr 1, 2010
  18. rpg.314

    Veteran

    Joined:
    Jul 21, 2008
    Messages:
    4,298
    Likes Received:
    0
    Location:
    /
    No, it takes 2 clocks to retire 2 warps. 32 ALUs/SM means 1 warp per clock is the effective peak.
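    The peak-rate arithmetic above in miniature. A sketch, assuming the 32-lane SM and 32-thread warp size discussed in this thread for Fermi:

```python
# One 32-thread warp spread over 32 ALU lanes completes in one hot clock,
# so retiring 2 warps takes 2 clocks: the effective peak is 1 warp/clock.
warp_size = 32
alus_per_sm = 32
warps_per_hot_clock = alus_per_sm / warp_size
print(warps_per_hot_clock)  # 1.0
```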
     
  19. Colourless

    Colourless Monochrome wench
    Veteran

    Joined:
    Feb 6, 2002
    Messages:
    1,274
    Likes Received:
    30
    Location:
    Somewhere in outback South Australia
    In reference to Quake 2, most of the screen space is covered by the world geometry, which is large triangles, and not all that many of them. All the small triangles will primarily be in enemy models. The enemies themselves didn't have all that many triangles, but they weren't particularly large on screen. Overall there weren't many triangles being rendered, and the game really isn't that useful to discuss things more than 10 years later.

    Talking about what percentage of the visible triangles are less than 25 pixels isn't as useful as talking about what percentage of the screen is covered by triangles less than 25 pixels.
     
  20. trinibwoy

    trinibwoy Meh
    Legend

    Joined:
    Mar 17, 2004
    Messages:
    10,430
    Likes Received:
    433
    Location:
    New York
    Ok.

    Yes, hot clock, but it's the scheduler clock that I mentioned in my comment (since we're talking about feeding the ROPs). Or are you saying it's only 1 warp per scheduler clock?
     