NVIDIA Fermi: Architecture discussion

5 is faster than 3 despite the fact that 5 has fewer threads in flight per SIMD than 3. The estimated thread count for version 3 is 256/28 = 9, while for version 5 it is 256/38 = 6. (Both estimates are subject to clause-temporary overhead. Also, I suspect that 256 is not the correct baseline; something like 240 might be better, not sure...)
This is a special case, though, as you have monstrous register use. Since you don't have enough threads to hide the latency of a fetch immediately followed by an ALU clause that uses the data, the ordering of the fetches is very important.
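To make the arithmetic explicit, here's a quick sketch (the 256-registers-per-fragment budget and the per-version register counts are this thread's estimates, not published figures):

```cuda
#include <cstdio>

// Rough occupancy model: the aggregate per-fragment register budget of an
// Evergreen SIMD divided by the registers a shader version uses gives the
// hardware threads that can be in flight (integer division rounds down).
int threads_in_flight(int reg_budget_per_fragment, int regs_used)
{
    return reg_budget_per_fragment / regs_used;
}

int main()
{
    printf("version 3: %d threads\n", threads_in_flight(256, 28)); // 9
    printf("version 5: %d threads\n", threads_in_flight(256, 38)); // 6
    // Halving the per-ALU register file, as discussed below:
    printf("version 5, halved file: %d threads\n",
           threads_in_flight(128, 38));                            // 3
    return 0;
}
```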

Going back in time, your argument is that if AMD doubles ALU:TEX, e.g. to 8:1 in the next GPU, while leaving the overall ALU/TEX architecture alone, each ALU would only need half the register file, and the 256KB of aggregate register file per SIMD we see in Evergreen would be enough for the next GPU. Well, clearly this is fallacious, as version 5 above would be reduced to a mere 3 hardware threads, killing throughput (3 hardware threads means that the ALU and TEX clauses cannot both be 100% occupied, since each requires pairs of hardware threads for full utilisation).
The number of threads has to do with the number of registers per hw thread (i.e. registers per fragment * fragments per hw thread). It doesn't matter what the ALU:TEX ratio of the SIMD engine is. If the hw thread size grows to 128 fragments, then I agree (only for programs with extremely high register use, though), but I doubt ATI is going to do that, because the branching granularity gap with Fermi would really start to get wide.

Fermi did not increase the wavefront size, but it decreased TEX throughput per SM. That's why it can reduce the number of registers per SM without hurting latency hiding.
 
In reference to Quake 2, most of the screen space is covered by the world geometry, which consists of large triangles, and not all that many of them. The small triangles will primarily be in the enemy models. The enemies themselves didn't have all that many triangles, but they weren't particularly large on screen either. Overall there weren't many triangles being rendered, and the game really isn't that useful for discussing things more than 10 years later.
I still say that when you multiply pixel count by 10 and geometry count by a few hundred, it's going to skew the distribution towards a greater percentage of small triangles, not a lower one. It's not like artists only used more polys on world geometry and kept the same low-poly enemies.
Talking about what percentage of the visible triangles are smaller than 25 pixels isn't as useful as talking about what percentage of the screen is covered by triangles smaller than 25 pixels.
Not when we're talking about geometry bottlenecks, because triangles come in clumps of similar size (often zero size), so you can't buffer out this inefficiency. You can't process those millions of small triangles while having the rasterizer and shading engines work on the large triangles, because the workload just doesn't arrive in such a neatly interleaved fashion.

So if you want to know for how many cycles you will be limited by geometry processing or setup, the number of small triangles is the important factor.
 
Regarding the GTX 400's pixel throughput, the puzzle seems finally solved. Over the Easter holiday I had a rather lengthy email conversation with Nvidia Tech PR, and exchanged some PMs with Damien, who kindly shared the results he was getting.

The bottleneck, as it seems right now, is the connection between the shader engines and the ROPs, which is tailored to accommodate 32 pixels of 32 bits at a time. Formats like RGB9E5 or RGB10A2 and the like take up as many slots as fully blown FP16 pixels, thus being half rate here also. I can only guess at what the connection itself looks like, but it seems like it can (for each pixel of theoretical throughput) operate on four lanes of 8 bits at a time. If the channels exceed 8 bits, as in RGB9E5, it takes double the time, either serialized or with paired lanes (and then serialized 2 by 2). More than 16 bits and we go to four cycles / groups of four.

One-channel FP32 pixels seem to be able to occupy a single time slot across all four lanes, so maybe this is the base unit, and it can really be split two- and four-way.

I wrote about it at somewhat greater length (and in German) here: http://www.pcgameshardware.de/aid,7...-Fuellraten-Raetsel-geloest/Grafikkarte/Test/
But the gist is as I said above. Hopefully Damien will follow soon with his update, and maybe he can do some tests with two FP16 channels…
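To make those guessed packing rules concrete, here's a toy model (the rules are conjecture inferred from the measurements above, not anything Nvidia has confirmed):

```cuda
#include <cstdio>

// Toy model of the guessed shader-engine -> ROP link: each pixel slot is
// four lanes of 8 bits per cycle. Channels wider than 8 bits cost a second
// pass over the lanes; wider than 16 bits, four passes. A lone 32-bit
// channel can instead be byte-split across all four lanes in one cycle.
int cycles_per_pixel(int channels, int max_bits_per_channel)
{
    if (channels == 1)
        return 1;                              // e.g. R32F fills all four lanes
    if (max_bits_per_channel <= 8)  return 1;  // RGBA8
    if (max_bits_per_channel <= 16) return 2;  // FP16, RGB10A2, RGB9E5
    return 4;                                  // four-channel FP32
}

int main()
{
    printf("RGBA8   : %d cycle(s)\n", cycles_per_pixel(4, 8));   // 1 -> full rate
    printf("RGB10A2 : %d cycle(s)\n", cycles_per_pixel(4, 10));  // 2 -> half rate
    printf("RGB9E5  : %d cycle(s)\n", cycles_per_pixel(3, 9));   // 2 -> half rate
    printf("RGBA16F : %d cycle(s)\n", cycles_per_pixel(4, 16));  // 2 -> half rate
    printf("R32F    : %d cycle(s)\n", cycles_per_pixel(1, 32));  // 1 -> full rate
    printf("RGBA32F : %d cycle(s)\n", cycles_per_pixel(4, 32));  // 4 -> quarter rate
    return 0;
}
```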
 
Interesting. I always thought format conversion happened in the ROPs, but that suggests there's some logic for it somewhere at the end of the pixel shader pipe.
It wouldn't explain the slow 4-channel fp32 blend result, but maybe this is really something like 1/4 full speed per channel for the blender, so 1/16 of the nominal (34 GPix/s) ROP rate. At least that would fit all the fp32 blend data for both the GTX 285 and GTX 480...
A bit strange that you'd have 48 ROPs when for most things they aren't any better than 32, though maybe they are very cheap anyway.
Maybe it's time to spend those transistors on getting color data back to the shader clusters instead, doing the blending there and writing back through some generic memory controller (which would still handle color compression); that ROP design sounds kinda lame. Well, for color at least.
 

The way Carsten described the issue it sounded like a bandwidth problem.
 
A bit strange that you'd have 48 ROPs when for most things they aren't any better than 32, though maybe they are very cheap anyway.

Well, it's obvious why they aren't better than 32 for most things, but I'm trying to understand if they're better than 32 for anything :???:
 
Well, if fp32 blend is really quarter speed per channel, that would at least be good for a somewhat faster fp32 blend rate... Of course, that would just be 48 incredibly slow fp32 blend units vs. 32 incredibly slow fp32 blend units, but at least that's something...
Also the increased z fill numbers - they are still way below what should be possible, as far as I understand, but maybe the big increase there compared to the GTX 285 is also (at least partly) due to that.

edit: so here's what I think these chips can do with blending per ROP, if they had enough bandwidth:
Cypress (probably all Evergreen):
- full rate int8, fp10
- half rate fp16
- quarter rate fp32
Except for fp10, those would be the same as RV7xx

Fermi:
- full rate int8
- half rate fp16, fp10
- quarter rate (per channel! hence 1/16 for 4 channels) fp32
Those would all be the same as GT200.
All numbers except the fp32 ones would be limited by that 32-pixel (at 32 bits) shader->ROP connection, hence be the same for 32 or 48 ROPs.
(btw, does that 32-pixel number go down with cluster count? If it's just a bandwidth limitation it shouldn't, right, since the other clusters can just send pixels down more often?)

Without blending it would be:
Cypress:
- full rate int8, fp10, fp16
- half rate fp32
Again the same as RV7xx, except for fp10

Fermi:
- full rate int8
- half rate fp16, fp10
- one fp32 result per clock, i.e. full rate single-channel fp32, quarter rate 4-channel fp32
All numbers (without exception) limited by the 32-pixel (at 32 bits) shader->ROP connection, hence the same for 32 or 48 ROPs

That theory is on shaky ground... I won't even touch z fillrate here...
Some 2-channel fp32 numbers, please...
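For what it's worth, here's what those per-ROP fractions plus the guessed 32-pixel link would give on a GTX 480 (the 48 ROPs and 700 MHz ROP clock are public specs; the fractions and the link cap are this thread's conjecture):

```cuda
#include <cstdio>
#include <algorithm>

// Peak GTX 480 blend rates implied by the per-ROP fractions listed above,
// capped by the guessed 32-pixel (at 32 bits) shader->ROP link.
struct Fmt { const char *name; double rop_frac; int link_slots; };

int main()
{
    const int    rops = 48;
    const double ghz  = 0.700;   // GTX 480 ROP clock

    const Fmt fmts[] = {
        { "int8 blend",      1.0,        1 },  // one 32-bit slot per pixel
        { "fp16/fp10 blend", 1.0 / 2.0,  2 },  // two slots per pixel
        { "fp32 x4 blend",   1.0 / 16.0, 4 },  // quarter rate per channel
    };

    for (const Fmt &f : fmts) {
        double rop_rate  = rops * ghz * f.rop_frac;    // GPix/s from the ROPs
        double link_rate = 32.0 * ghz / f.link_slots;  // GPix/s from the link
        printf("%-16s %5.2f GPix/s (%s-limited)\n", f.name,
               std::min(rop_rate, link_rate),
               rop_rate < link_rate ? "ROP" : "link");
    }
    return 0;
}
```

On this model, int8 and fp16 land on the link cap (which is why 48 ROPs look no better than 32 there), while 4-channel fp32 blend bottoms out at the ROPs themselves at about 2.1 GPix/s.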
 
Well, if fp32 blend is really quarter speed per channel, that would at least be good for a somewhat faster fp32 blend rate

That theory makes sense, but at a high level the measured performance seems to track shader throughput more closely than anything else. This is based on Damien's numbers:

[image: pixelfill.png]
 

Right. So it's still a mystery...
btw, we're always talking about a 32 pixel per clock rasterization limit. But that apparently doesn't affect z fill rate (neither for Cypress nor for GF100), so is this only true for pixels actually going to the shader core? I'm wondering what this actually measures...
 

I don't really understand what's going on with z-fillrate in general. Can someone confirm how we get high z-fillrates even when AA is not enabled? Are the rasterizers actually capable of producing more depth samples than color samples per clock?
 

Yes. I've touched on this in the Cypress architecture piece. There are a number of simplifications that are possible in that area when doing Z-only rendering, and both IHVs leverage this.
 
New subtopic:

How much work would it be for a Fermi follow-on to add preemptive multitasking support?

The hardware already has multiple simultaneous kernel execution support. It already has a cache, and registers can spill to that cache. It already has a powerful scheduler. It even has the luxury of being able to deschedule a running kernel block without any runtime speed loss, since other blocks are already using the resources, so there's no context switch penalty.

So as I see it, the main feature needed for preemptive multitasking is the ability to save and restore context, meaning program counters for each warp, predicate masks for each warp, registers, and shared memory. That's a pretty big context, but remember that unlike a CPU, the hardware doesn't need to stall while saving or restoring this context... it just needs to be descheduled and then the bundle of context saved to device memory. (The word "just" here may hide a big job, though.)
In fact, with the uniform 64-bit memory space, even the biggest part of the context, the shared memory, may not need to be explicitly saved... it could just be pushed out and restored lazily by treating it exactly like dirty L1 cache lines... the shared memory hardware IS the L1 cache, so this functionality is already there.

So, am I missing something else that would be needed to allow kernels to be dynamically task-switched? I'm thinking of all kinds of obvious applications, like letting a physics kernel cede way to a graphics kernel and only use the GPU throughput when it's otherwise idle. Or, for that matter, running CUDA jobs while still using the GPU for the graphics display.

You could also imagine it being useful for having lots of background tasks idling away, waiting for work. Your particle system code would always be running, waking up only occasionally when there's data to process... if there wasn't any, it'd announce that it was going back to sleep for a while until a timer (or an explicit event trigger) reschedules it. Note that this means the CPU is not involved at all!
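To put a rough number on "pretty big context", here's a sketch using the published GF100 per-SM sizes (the per-warp scheduler state is pure guesswork):

```cuda
#include <cstdio>

// Rough size of the state that would have to be spilled to preempt a
// fully loaded Fermi SM. Register file and shared memory are the published
// GF100 numbers; the per-warp scheduler state is a guess.
int main()
{
    const int regfile  = 32768 * 4;   // 32K 32-bit registers = 128 KB per SM
    const int shared   = 48 * 1024;   // up to 48 KB shared memory per SM
    const int warps    = 48;          // max resident warps per SM
    const int per_warp = 64;          // PC, active/predicate masks,
                                      // divergence stack -- guesswork
    const int sms      = 15;          // GTX 480

    int per_sm = regfile + shared + warps * per_warp;
    printf("per SM : ~%d KB\n", per_sm / 1024);
    printf("chip   : ~%.1f MB\n", sms * per_sm / (1024.0 * 1024.0));
    // At ~170 GB/s of device bandwidth, writing ~2.6 MB out and reading
    // the next context back in costs on the order of 30 microseconds.
    return 0;
}
```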
 
From my understanding of what I've been told, multiple concurrent kernels on GF100 only refers to kernels of the same type, i.e. different physics solvers for cloth, fluid and so on. They all have to belong, however, to the same operational mode/context, i.e. CUDA or graphics. Not sure about DX compute shaders, though.

Currently, heavy use of CUDA kernels will still bring the Windows GUI (Win7, Aero Glass) to a crawl.
 

No, in CUDA it's actually arbitrary kernels... any SM could be running up to 4 different kernels at the same time. What it CAN'T do is suspend one of those kernels, swap out its state and swap in that of a different kernel, then resume later by swapping back.

You're right that the GPU is in either CUDA or graphics mode, though. But if you could context switch, the CUDA kernels could be swapped out, the graphics run to paint the frame, then the CUDA kernels swapped back in.

This switching ability isn't in Fermi now (at least nobody has even hinted at it), but my question is mostly about how hard it'd be to add, since the hardware can already do most of the substeps of context switching.
 
They could have the driver put extra conditional jumps into the shaders for cooperative multitasking (i.e. if the task-switching flag is set, save all the context and end the kernel).
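Something like this hypothetical sketch; the flag and checkpoint buffers are made-up names, and a real driver would do the rewriting below the source level:

```cuda
// Driver-injected cooperative preemption, sketched at the source level:
// the kernel polls a flag, and if it's set, checkpoints its live state and
// exits; a later relaunch resumes from the checkpoint.
__device__ volatile int preempt_flag;   // host sets this (cudaMemcpyToSymbol)
                                        // to request a task switch

__global__ void work(const float *data, float *acc_save, int *iter_save, int n)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= n) return;

    // Restore the checkpoint (buffers zero-initialised before the first
    // launch, so this also covers a fresh start).
    float acc = acc_save[tid];

    for (int i = iter_save[tid]; i < 1000; ++i) {
        acc += data[tid] * i;           // stand-in for the real work

        if (preempt_flag) {             // the driver-inserted conditional jump
            acc_save[tid]  = acc;       // checkpoint live state...
            iter_save[tid] = i + 1;     // ...and where to resume
            return;                     // end the kernel; relaunch resumes it
        }
    }
    acc_save[tid]  = acc;
    iter_save[tid] = 1000;              // finished
}
```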
 
No, in CUDA it's actually arbitrary kernels... any SM could be running up to 4 different kernels at the same time. What it CAN'T do is suspend one of those kernels, swap out its state and swap in that of a different kernel, then resume later by swapping back.
That was what I meant by "multiple concurrent kernels on GF100 only refers to kernels of the same type, i.e. different physics solvers for cloth, fluid and so on. They all have to belong, however, to the same operational mode/context, i.e. CUDA or graphics".

This switching ability isn't in Fermi now (at least nobody has even hinted at it), but my question is mostly about how hard it'd be to add, since the hardware can already do most of the substeps of context switching.
Right - it doesn't seem to be available in hardware, otherwise Nvidia would have boasted about it too. AFAIK they only went on about having reduced the context-switch time for the whole chip to 20 microseconds, which is supposed to be a couple of times faster than with previous GeForce cards.
 