Mintmaster
Sorry, I wasn't too clear there. I also meant for you to spend more time thinking about the next paragraph. If 384 threads are enough in some 16-SIMD refresh of G80, then why does G80 have 768?
I'm talking about the ALUs when working on ALU instructions. If you look at my example, you see that this is the only aspect of the ALUs that affects how much latency hiding is necessary. I need to make sure at least 6 warps have ALU instructions ready to run. If that parameter were, say, 16, then I'd start running into trouble and could no longer extend my example to the double-width, double-warp-size G80 without a loss in performance.
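Here's a rough sketch, in Python, of where that warp count comes from. The ~24-clock register read-after-write latency and the 4-clock issue slot per warp instruction are figures I'm assuming for illustration (they aren't from this thread), and warps_to_hide_alu_latency is just a made-up helper name:

# How many warps need an ALU instruction ready to keep the SIMD fed,
# on G80 vs. the double-width, double-warp-size variant.
def warps_to_hide_alu_latency(raw_latency_clocks, simd_width, warp_size):
    issue_clocks_per_warp = warp_size // simd_width        # clocks one warp instruction occupies the SIMD
    return -(-raw_latency_clocks // issue_clocks_per_warp) # ceiling division

print(warps_to_hide_alu_latency(24, simd_width=8,  warp_size=32))  # G80:   6 warps
print(warps_to_hide_alu_latency(24, simd_width=16, warp_size=64))  # G80**: still 6 warps

Doubling SIMD width and warp size together leaves that parameter at 6, and 6 warps of 64 is only 384 threads, still half of the 768 in flight.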
You said: "85 bytes per fragment is 5 vec4 registers."

I know (21 FP32 scalars sounds better). That limitation isn't holding back G80 much. The only time this issue would affect the double-ALU G80 and not the current G80 is if a shader is both heavily ALU limited and also frequently hits the read-after-write limitations of the register file.
Such an odd shader is not the reason you think the RF needs to be doubled, so there's no need to discuss this any further.
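(For anyone keeping score on the register-budget phrasing above, the conversion is just plain arithmetic on the 85-byte figure, nothing else assumed:

# The same per-fragment register budget, expressed three ways.
bytes_per_fragment = 85
fp32_scalars = bytes_per_fragment / 4   # 21.25 -> "21 FP32 scalars"
vec4_registers = fp32_scalars / 4       # ~5.3  -> roughly "5 vec4 registers"
print(fp32_scalars, vec4_registers)
)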
You said: "I disagree, because shaders can have 'hotspots', e.g. where a combination of register allocation per fragment and dependent texturing, say, causes you to chew through available threads, resulting in an ALU pipeline stall. As Mike Shebanow says, as you consider smaller and smaller windows of instructions, the shader average throughput is irrelevant - the small window has its own effective ALU:TEX ratio, which you have to balance against other resource constraints (e.g. registers per fragment). The program counters for threads will 'bunch up' behind this hotspot."

You need a hotspot that's quite a few instructions in length to get all the warps to bunch up like that. In any case, when I said "statistics takes care of other scenarios", I meant it in terms of their applicability to my argument.
A bunch of dependent texture lookups is going to be texture throughput limited unless you have a huge swath of ALU instructions outside this portion of the shader, which makes "bunching up" less likely in the first place. That's not your typical game shader, and "statistics" also takes into account the variation in shaders out there. You don't double the register file solely to improve the performance of 0.01% of the fragments your hardware will ever see.
Also, keep in mind the scope of our debate. We are comparing two methods of doubling the ALU:TEX ratio of G80. Your way is to double the number of multiprocessors to 4 per cluster. My way is to double the SIMD width and warp size and nothing else - not even the register file. Clearly your way has some advantages in corner cases. However, it's also far more costly. My contention is that overall the difference in performance will be minimal. Even in extreme cases of a 100% serially dependent instruction stream, as in my example, there will be no performance hit. Clearly texture latency hiding is not affected, and there is no pressing need to double the register file.
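To put a number on the "100% serially dependent instruction stream" case, here's a toy warp-scheduler simulation in Python. The ~24-clock read-after-write latency and 4-clock issue slot are the same assumed figures as in the earlier sketch, and idle_fraction is a made-up helper, so treat this as an illustration of the argument rather than a model of the real scheduler:

# A 100% serially dependent ALU stream: each warp can issue its next instruction
# only after the previous one's (assumed) ~24-clock read-after-write latency.
# G80** keeps 768 threads per multiprocessor, i.e. 12 resident warps of 64.
def idle_fraction(resident_warps, raw_latency=24, issue_clocks=4, instructions=200):
    ready_at = [0] * resident_warps   # clock at which each warp's next instruction becomes ready
    done = [0] * resident_warps       # instructions issued so far, per warp
    clock = busy = 0
    while min(done) < instructions:
        ready = [w for w in range(resident_warps)
                 if ready_at[w] <= clock and done[w] < instructions]
        if ready:
            w = min(ready, key=lambda i: done[i])  # round-robin-ish: least-progressed ready warp first
            done[w] += 1
            ready_at[w] = clock + raw_latency      # the next instruction depends on this one
            busy += issue_clocks
            clock += issue_clocks                  # the warp occupies the SIMD for 4 clocks
        else:
            clock += 1                             # nothing ready: pipeline bubble
    return 1 - busy / clock

print(idle_fraction(12))  # G80** with 768 threads (12 warps of 64): 0.0, no bubbles
print(idle_fraction(6))   # even exactly 6 ready warps keeps the pipe full: 0.0
print(idle_fraction(5))   # below 6, the ALU pipe finally starves: > 0

G80 itself sits at 24 resident warps of 32 under the same assumptions, so both configurations are comfortably above the 6-warp requirement even in this worst case.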
What NVidia does decide to do is a different story. You had some good points about why NVidia would like to keep 32-pixel warps, and you're probably right.
-----------------------------
I thought of another way to summarize my argument against your claim:
"Also, you have to double the size of the RF (can't keep the size constant), because you want to double the number of objects in flight, since your ALU pipe is now chewing through them twice as fast. Otherwise you've just lost half your latency-hiding."

Consider 3 different shaders:
-X: 200 scalar ALU, 10 TEX
-Y: 100 scalar ALU, 10 TEX
-Z: 100 scalar ALU, 5 TEX
I'll refer to my double-width, double warp size, equal RF modification as G80**. I claim:
A) G80** will run X as fast as G80 runs Y. With double the ALU instructions, we can now feed the double-width SIMD, hiding the latency of the same 10 texture instructions almost identically.
B) G80 will run Y at the same speed as Z (since we're ALU limited).
C) Both G80 and G80** will run Z twice as fast as X.
D) The purpose of doubling the ALU:TEX ratio can be summarized as having equal performance with double the ALU load. A shows we've done that.
E) From A, B, and C, we see that G80** is twice as fast as G80 in X and Z.
So whether you look at D or E, we've accomplished our goal without doubling the register file. If you dig further, you can see the fundamentals of this argument are the same as in my other example. Equal TMU throughput between G80** and G80 is the reason A is possible, and is the primary factor allowing the thread count to remain equal.
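And here's a quick sanity check of A, B, C, and E in Python, using a pure bottleneck model: per-fragment time = max(ALU time, TEX time), with latency ignored since both chips keep the same 768 threads in flight. The 8:1 and 16:1 scalar-ALU-to-TEX issue rates are illustrative picks, chosen only so that G80** has twice G80's ALU:TEX ratio; they're not real G80 throughput figures:

# Bottleneck model for shaders X, Y, Z on G80 and the double-width G80**.
shaders = {"X": (200, 10), "Y": (100, 10), "Z": (100, 5)}  # (scalar ALU, TEX) per fragment
chips   = {"G80": (8, 1), "G80**": (16, 1)}                # (ALU ops/clock, TEX ops/clock), illustrative

times = {}
for chip, (alu_rate, tex_rate) in chips.items():
    for name, (alu, tex) in shaders.items():
        times[(chip, name)] = max(alu / alu_rate, tex / tex_rate)  # clocks per fragment

# A) G80** runs X as fast as G80 runs Y
print(times[("G80**", "X")], times[("G80", "Y")])      # 12.5 12.5
# B) G80 runs Y and Z at the same (ALU-limited) speed
print(times[("G80", "Y")], times[("G80", "Z")])        # 12.5 12.5
# C) both chips run Z twice as fast as X
print(times[("G80", "X")] / times[("G80", "Z")],
      times[("G80**", "X")] / times[("G80**", "Z")])   # 2.0 2.0
# E) G80** is twice as fast as G80 on X and Z
print(times[("G80", "X")] / times[("G80**", "X")],
      times[("G80", "Z")] / times[("G80**", "Z")])     # 2.0 2.0

The only shader that gains less than 2x here is Y, which becomes texture limited on G80** (10 vs 12.5 clocks per fragment); that's consistent with D, since the goal of doubling the ALU:TEX ratio is holding performance steady under double the ALU load rather than doubling the speed of every shader.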