G80 programmable power

http://www.beyond3d.com/articles/xenos/index.php?p=08

The Xenos shader contains a large number of independent groups of pixels and vertices (threads), each 16 wide. In order to hide the latency of an instruction for a given thread, a number of other threads are used to "fill in the gaps". By doing this, the ALUs are fully utilized all the time, and the shader can have a direct data dependency on every instruction and still run at full rate. Xenos has a very large number of these independent threads ready to process, so there are always enough independent instructions to execute such that the ALUs are fully utilized. Each of these threads can be executing a different shader, can be at different places within the same shader, can be pixels or vertices, etc.

With this complex organisation (the threading mechanisms, and the number of threads that are active or ready to become active so that the system hides latency effectively), ATI's testing indicates an average of about 95% efficiency over the shader array under general-purpose graphics usage conditions.
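To see concretely why having lots of independent thread groups in flight hides ALU latency, here's a toy scheduler sketch in C (everything here is made up for illustration: a hypothetical 4-cycle result latency and a round-robin pick of whichever group is ready to issue):

    #include <stdio.h>

    /* Toy model: each "thread group" issues one dependent instruction,
     * then must wait ALU_LATENCY cycles before it can issue again.
     * With at least ALU_LATENCY groups in flight, the ALU never idles. */
    #define ALU_LATENCY 4     /* hypothetical cycles until a result is ready */
    #define CYCLES      64

    static int busy_cycles(int groups_in_flight)
    {
        int ready_at[8] = {0};   /* cycle at which each group can issue again */
        int busy = 0;

        for (int cycle = 0, g = 0; cycle < CYCLES; cycle++) {
            for (int tries = 0; tries < groups_in_flight; tries++) {
                int cand = (g + tries) % groups_in_flight;
                if (ready_at[cand] <= cycle) {          /* group is ready */
                    ready_at[cand] = cycle + ALU_LATENCY;
                    busy++;                             /* ALU issued this cycle */
                    g = (cand + 1) % groups_in_flight;
                    break;
                }
            }
        }
        return busy;
    }

    int main(void)
    {
        for (int n = 1; n <= 8; n *= 2)
            printf("%d group(s) in flight -> ALU busy %2d of %d cycles\n",
                   n, busy_cycles(n), CYCLES);
        return 0;
    }

With one group the ALU issues only a quarter of the time; with four or more groups (at least the latency) it never stalls, which is the same trick Xenos plays, just at a much larger scale.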

See, Jawed, your diagram doesn't take the threading into account. Look at the rather large discrepancy between your tests and ATI's.

This is in fact the basis of my claim that scalar is not necessarily better than vec4, because it all depends on the implementation. I think R580 has brute force, but is far from using its ALUs efficiently.
 
See, Jawed, your diagram doesn't take the threading into account.
You're confusing load-balancing/scheduling efficacy in the Xenos description (hiding latency) with per-ALU utilisation, which is what I've shown in my diagrams.

Jawed
 
This is in fact the basis of my claim that scalar is not necessarily better than vec4, because it all depends on the implementation. I think R580 has brute force, but is far from using its ALUs efficiently.
The trouble is you can't separate ALU structure from register file structure.

Now, it's true that with each passing year, refinements in the relationship between these two components ameliorate utilisation losses.

MAD is a good example of the somewhat testy relationship: MAD requires 3 operands, the only instruction to do so. So if you build an ALU pipeline that supports a single-cycle MAD, you've built a register file that is perhaps more complex than necessary. Some GPUs, I think, have a 2-cycle MAD, doing MUL followed by ADD, which means only 2 operands need fetching per cycle.
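As a rough sketch of what that means for operand traffic (illustrative only, not any particular GPU's pipeline):

    /* Operand traffic for d = a*b + c, written in C just to show the data flow.
     * Single-cycle MAD: reads a, b and c in the same cycle, so the
     * register file needs three read ports for this one instruction. */
    float mad_single_cycle(float a, float b, float c)
    {
        return a * b + c;            /* 3 operands fetched together */
    }

    /* Two-cycle MUL + ADD: never more than two reads per cycle, so a
     * simpler two-port register file suffices, at the cost of a cycle. */
    float mad_two_cycle(float a, float b, float c)
    {
        float t = a * b;             /* cycle 1: fetch a, b */
        return t + c;                /* cycle 2: fetch c (t is forwarded) */
    }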

Now, as it happens, I think R600 does some kind of packing, and has a 2-cycle MAD, i.e. can only fetch 2 operands per clock - but we're just gonna have to wait and see how competitive ATI was when it designed the ALU/register-file.

Jawed
 
I've already posted how I think R600 might execute this:

[Image: b3d76.gif]


I've diagrammed a jumbo ALU, 32 wide - it doesn't really matter whether it's doing 8x vec4s per clock or 32x scalars. The RCPs might be quicker (2 cycles?) and the RSQ might take longer (8 cycles?).

But it is just a guess.

Jawed
 
I'm confused: if vec4 is better than scalar and takes up less die space, why in the world does Nvidia go scalar and use it as a marketing checkpoint? Not like it's a checkpoint that's gonna sell cards.
 
I'm confused: if vec4 is better than scalar and takes up less die space, why in the world does Nvidia go scalar and use it as a marketing checkpoint? Not like it's a checkpoint that's gonna sell cards.

Scalar is better for general-purpose stuff, and I speculate that vec4 is better for 3D graphics.
 
Scalar is better for general-purpose stuff, and I speculate that vec4 is better for 3D graphics.

Yet basically all the programmers who wanted good efficiency out of the Emotion Engine's VUs chose the scalar approach, processing vertices in parallel within the same register file (instead of the Vec4/AoS approach of 1x 128-bit register = 1x 128-bit vector = 4x 32-bit components or fields), and it was the same path chosen for the SPEs from the get-go (AoS/Vec4 functionality was, let's say, deprecated ;)).
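For anyone who hasn't seen the two layouts side by side, here's a minimal C sketch of what's being described (type and function names are mine, purely for illustration): AoS/Vec4 keeps one vertex per 128-bit register, while the SoA/"scalar" style packs the same component of four vertices into one register so every lane does useful work.

    /* AoS ("Vec4"): one 128-bit register holds one vertex, {x, y, z, w}. */
    struct VertexAoS { float x, y, z, w; };

    /* SoA: one 128-bit register holds the same component of 4 vertices,
     * so each field below maps to a single 4-wide register. */
    struct VertexSoA4 { float x[4], y[4], z[4], w[4]; };

    /* Scale positions by s.  In AoS form a vec4 op processes one vertex and
     * the .w lane carries no useful math; in SoA form each line of the inner
     * loop corresponds to one 4-wide multiply doing useful work on 4 vertices. */
    void scale_aos(struct VertexAoS *v, int n, float s)
    {
        for (int i = 0; i < n; i++) {
            v[i].x *= s;
            v[i].y *= s;
            v[i].z *= s;     /* .w untouched: one lane of the vec4 op is wasted */
        }
    }

    void scale_soa(struct VertexSoA4 *v, int nblocks, float s)
    {
        for (int i = 0; i < nblocks; i++)
            for (int j = 0; j < 4; j++) {   /* conceptually one SIMD op per component */
                v[i].x[j] *= s;
                v[i].y[j] *= s;
                v[i].z[j] *= s;
            }
    }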
 
Scalar is better for general-purpose stuff, and I speculate that vec4 is better for 3D graphics.
Rwolf, if ALL the operations in the shader used all four channels, then, yes, vec4 would be the obvious choice, but shaders don't use all the channels all the time. For example, a standard dot product only sources three of the four components, thus in a vector scheme ~25% of the hardware is not utilised. In this regard, a scalar system is far more efficient in its use of the FP units.
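A quick illustration of where the ~25% goes (plain C standing in for shader code, just to make the point concrete):

    /* dot3 only sources x, y and z.  On a vec4 ALU a DP3 leaves the fourth
     * multiplier lane with nothing useful to do, so 3 of 4 lanes do useful
     * work (~75%).  A scalar machine simply issues the five useful ops
     * (3 MULs, 2 ADDs) and wastes nothing. */
    struct Vec4 { float x, y, z, w; };

    float dot3(struct Vec4 a, struct Vec4 b)
    {
        return a.x * b.x + a.y * b.y + a.z * b.z;   /* .w never sourced */
    }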
 
Rwolf, if ALL the operations in the shader used all four channels, then, yes, vec4 would be the obvious choice, but shaders don't use all the channels all the time. For example, a standard dot product only sources three of the four components, thus in a vector scheme ~25% of the hardware is not utilised. In this regard, a scalar system is far more efficient in its use of the FP units.

What % of shaders in a typical next-gen game would you say run faster on a vec4 system? Which do you see future game software being faster on: vec4 or scalar ALUs?
 
What % of shaders in a typical next-gen game would you say run faster on a vec4 system? Which do you see future game software being faster on: vec4 or scalar ALUs?

0%. [Assuming identical performance otherwise, and 4 times as many scalar ALUs as vec4 ALUs.]
They would all run faster on scalar ALUs.

Most color operations are done on vec3, lots of geometry ops too.
 
Look, there are two ways to think about efficiency: per-flop and per-transistor. It is completely and utterly impossible to argue against the fact that scalar ALUs are more efficient per flop. As for the per-transistor metric, that's more arguable. I would tend to believe scalar is more efficient for DX10-level applications, but I obviously don't have enough data at my disposal to judge, so feel free to disagree.

And rwolf, you might want to realise that everyone but you was initially talking about per-flop efficiency. And saying "vec4 is better for 3D graphics" is ridiculous. Perhaps if you made that statement a bit less bold, by saying something along the lines of "Vec3+Scalar is the best approach in terms of transistor efficiency for 3D graphics", then I might take you seriously. As it is, you should just go hide under a rock or something already...


Uttar
 
That's a good point, but R580 is half the transistors and half the speed, with substantially less bandwidth.


Most of those "extra" transistors are for unification, not the shader units. The shader ALUs in G80 run at twice the clock, but theoretically they also only do about 25% more scalar operations in the same amount of time as a stock R580, yet performance is much more than 25% better.

We will have a better idea once R600 comes out of which way is more wasteful in terms of silicon and transistors. But I am willing to bet both approaches will end up very similar in transistor usage, while performance will vary on an efficiency basis.

Rangers, yes, shader ALUs are decreasing as an overall percentage of the silicon; more threads in flight and unification have been increasing the die space needed for the control silicon.
Even G80 without its secondary MUL, which gives it fewer overall flops than an R580, is much faster than the R580. Where is that performance coming from? We all know flops don't mean much, but why? It's the way those flops are used, and when they are used, that makes the difference.
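For what it's worth, plugging in the commonly quoted peak figures (which I'm taking on trust here, and counting pixel-shader ALUs only for R580) backs that up:

    G80, MAD only:         128 ALUs x 2 flops x 1.35 GHz ~ 346 GFLOPS
    G80, MAD + extra MUL:  128 ALUs x 3 flops x 1.35 GHz ~ 518 GFLOPS
    R580 pixel shaders:    48 pipes x 12 flops x 0.65 GHz ~ 374 GFLOPS

So on paper the MAD-only G80 really is a touch behind R580, yet it wins comfortably in practice.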
 
What are all those holes doing there?
Are these the correct holes?

[Image: b3d82.gif]


I realised last night I should diagram this as one half of a cluster, which is why it's 8 wide, and there are 16 pipes (8 clusters x 2) in total.

Jawed
 
Most color operations are done on vec3, lots of geometry ops too.

Which means that Vec3 is actually more efficient per transistor than Vec4. I.e. if you have 200 GFLOPS in your pixel shading array, you get more use out of them if your ALUs are set up in a Vec3 + Scalar configuration than if they are set up in a Vec4 + Scalar configuration. That's why it always used to bug me when people said things like "Xenos is more efficient at pixel shading than G7x".
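A small sketch of why that co-issue matters (illustrative C with made-up names; each function body models what one issue slot could carry):

    struct Vec3 { float r, g, b; };

    /* Vec3 + scalar: the three vector lanes do the colour math while the
     * spare scalar lane handles an unrelated scalar op in the same slot,
     * so all four lanes do useful work. */
    void issue_vec3_plus_scalar(struct Vec3 *colour, float lightScale,
                                float *fog, float fogStep)
    {
        colour->r *= lightScale;
        colour->g *= lightScale;
        colour->b *= lightScale;     /* vec3 lanes */
        *fog += fogStep;             /* co-issued scalar lane */
    }

    /* Vec4 doing the same vec3 work: the fourth lane has nothing useful
     * to contribute, so a quarter of the unit idles for this instruction. */
    void issue_vec4_for_vec3_work(struct Vec3 *colour, float lightScale)
    {
        colour->r *= lightScale;
        colour->g *= lightScale;
        colour->b *= lightScale;     /* fourth lane wasted */
    }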

Load-balancing efficiency and per-ALU efficiency are two very different things! ;)
 