The Official NVIDIA G80 Architecture Thread

16 fragments x 4 clocks looks unhealthy in comparison with a vec3+SF GPU: 4 batches, each of 4 fragments x 1 clock (both GPUs considered as having 16-SIMD ALUs). The latter is 4x faster on the MUL. But that's an extreme case, generally G80 wins out significantly.
Whoops, sorry, I went to bed before realising that the MUL performs the same on both GPUs - since G80 is "4x wider" per clock and 4 clocks per fragment is the "normal" rate for a vec4 operation anyway.

Jawed
 
I don't think I fully get the example (brain is fried - just got home from work) but couldn't you contrive something similar to get similarly poor utilization of the vec3? Also if I understand what you're trying to demonstrate - this 4-clock idle time during the RSQ is specific to G80's implementation, not to scalar architectures in general.
Yeah the vec3+SF architecture is the same speed on the MUL and yes, specific to G80's implementation.

Jawed
 
The worst case on G80 is 0% utilisation of the primary ALU:

DP4 r0.w, r1, r2
RSQ r0.w, r0.w
MUL r3, r4, r0.w

The MUL has to wait until the RSQ has completed (which in turn has to wait until the DP4 has completed). So while the SF ALU is working on the RSQ, the primary ALU is idle, which is four clocks.
You think G80 is devoid of any intelligence in its scheduling? We already know it can operate on multiple batches simultaneously, and when the SF unit is used for interpolation during texturing you aren't losing ALU cycles.

DP4 takes 4 cycles, MUL takes 4 cycles, so the 4 cycles for the RSQ can be hidden completely. One example (for 32 pixel batches):

cycles 1-8: DP4 for batch 1
cycles 9-16: DP4 for batch 2, RSQ for batch 1
cycles 17-24: MUL for batch 1, RSQ for batch 2
cycles 25-32: MUL for batch 2

And so on. The primary ALU is at 100% for all cycles. Of course, G80 can handle a lot more than just 2 batches in flight, so we'd see a lot more batches interleaved than this simple example. After all, it can hide hundreds of clocks for texture latency.
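The interleaving above can be sketched as a toy greedy scheduler (my own illustration, not G80's actual issue logic; it assumes each op occupies its unit for one 8-cycle group, as in the 32-pixel example above):

```python
# Toy model of the interleaving example: per batch, DP4 (primary ALU)
# -> RSQ (SF unit) -> MUL (primary ALU), each op occupying its unit for
# one 8-cycle group. Each group, every unit greedily issues the oldest
# batch whose previous op has already finished.

def schedule(num_batches):
    """Return (timeline, total_groups); timeline entries are
    (group, unit, op, batch)."""
    ops = [("ALU", "DP4"), ("SF", "RSQ"), ("ALU", "MUL")]
    progress = {b: 0 for b in range(1, num_batches + 1)}
    timeline, group = [], 0
    while any(i < len(ops) for i in progress.values()):
        snapshot = dict(progress)      # decide from state at group start
        busy = set()
        for b in sorted(progress):
            i = snapshot[b]
            if i == len(ops):
                continue
            unit, op = ops[i]
            if unit not in busy:       # one op per unit per group
                busy.add(unit)
                timeline.append((group, unit, op, b))
                progress[b] += 1
        group += 1
    return timeline, group

# With two batches the primary ALU is busy in every 8-cycle group,
# matching the schedule above; with one batch it idles during the RSQ.
timeline, groups = schedule(2)
for entry in timeline:
    print(entry)
print("total groups:", groups)  # 4 groups = 32 cycles
```

Running it with a single batch reproduces Jawed's worst case: the primary ALU sits out the group where the SF unit runs the RSQ.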
 
No, the primary ALU and the SF are bound together by co-issue, meaning they have to be from the same thread.

Jawed
 
Because Bob didn't correct me.
He did correct you, but you didn't seem to pay much attention to it... He implied this code would only take 5 clocks. As to whether he was thinking of a Vec4 MUL or of a Scalar MUL... :) (I do have my little idea of how you can get to either of those figures, anyway)


Uttar
 
The entire point of my code sample was to show that a single stream of code (i.e. not something like an unrolled loop with implicitly parallel or partially overlapping loop iterations) with per-clock instruction dependency will cause the primary ALU to sit idle.

It's an extreme case and Bob's hypothetical case doesn't answer the dependency issue I'm talking about. Bob's answer depends solely on co-issue, not on multiple-thread scheduling.

Jawed
 
Uttar said:
As to whether he was thinking of a Vec4 MUL or of a Scalar MUL...
Whoops. Fixed now.

Jawed said:
Bob's answer depends solely on co-issue, not on multiple-thread scheduling.
No it doesn't.

Jawed said:
per-clock instruction dependency will cause the primary ALU to sit idle.
There are 2 major cases where the primary ALU will be idle: some global hazard stalls all threads (texturing from system memory with poor locality, for example), or your shaders are very heavy on ALU1 ops, at a ratio of more than 1:4 scalars.

For example, if you took your original shader and replaced all ops by RSQs, well now you'd have no ALU0 ops and so it will obviously be idle.
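The 1:4 point can be put in back-of-the-envelope form (my own sketch, assuming the SF unit retires ops at one quarter of the primary ALU's rate and enough batches in flight to hide dependencies):

```python
# Rough steady-state utilisation of the primary ALU (ALU0), assuming the
# SF unit (ALU1) takes 4x as long per op and dependencies are fully
# hidden by batch interleaving. More than 1 SF op per 4 scalar ops makes
# ALU1 the bottleneck and leaves ALU0 partly idle.

def alu0_utilisation(alu0_ops, alu1_ops):
    if alu0_ops == 0:
        return 0.0
    alu0_time = alu0_ops            # 1 time unit per ALU0 op
    alu1_time = 4 * alu1_ops        # assumed 4x cost per SF op
    return alu0_time / max(alu0_time, alu1_time)

print(alu0_utilisation(4, 1))  # 1.0 -- exactly at the 1:4 ratio
print(alu0_utilisation(2, 1))  # 0.5 -- SF-bound, ALU0 half idle
print(alu0_utilisation(0, 5))  # 0.0 -- all RSQs, ALU0 idle
```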

There are other, smaller cases that can happen, but they tend not to be all that frequent in real life (and can be worked around to some extent by the compiler).
 