G80 programmable power

How many shader operations can the G80 do per unified shader ALU?
I know the Xenos can do 10 x 48 ALUs x 500MHz = 240 GFLOP/s

Taking into account that the G80 does 10 as well, that gives it:
10 x 128 ALUs x 1350MHz = 1728 GFLOP/s (1.728 TFLOP/s)

This would mean it's a very big leap up from G7x and R580, something which isn't really shown by games.
Why is that?
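The arithmetic in the question can be sketched as follows. Note that the "10 ops per ALU" figure is the marketing number used above, which the replies dispute; the function name here is just for illustration:

```python
# Reproducing the peak-FLOPS arithmetic from the question above.
# The ops-per-ALU figure is a marketing number and is disputed
# later in the thread; this only reproduces the math as stated.

def peak_gflops(ops_per_alu, num_alus, clock_mhz):
    """Theoretical peak in GFLOP/s = ops/ALU/clock * ALUs * clock."""
    return ops_per_alu * num_alus * clock_mhz / 1000.0

xenos = peak_gflops(10, 48, 500)    # 240.0 GFLOP/s
g80 = peak_gflops(10, 128, 1350)    # 1728.0 GFLOP/s, as in the question

print(f"Xenos: {xenos} GFLOP/s")
print(f"G80 (as computed in the question): {g80} GFLOP/s")
```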
 
Your math is completely off, thanks to the confusion of mixed marketing messages. Plus you're asking about shader operations but quoting FLOPS, and Xenos' ALUs do more per clock than G80's ALUs. Maybe someone can quickly supply a link to previous discussions about this, but you might want to search for old threads.
 
A G80 ALU does 2 FLOPs per clock (arguably sometimes 3, if you count that good ole missing MUL...) per "shader ALU". Keep in mind that a fully scalar design will be more efficient per FLOP than a vec4+scalar one such as Xenos', however. And please note that this response is a MASSIVE oversimplification - indeed, please just use that good ole Search button instead if possible! :)


Uttar
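Redoing the G80 number with the 2 FLOPs (MADD) per scalar ALU per clock figure above, or 3 if the disputed "missing MUL" is counted (the function name is just for illustration):

```python
# G80 peak with a more realistic FLOPs-per-ALU count: 2 for a MADD,
# or 3 if the contested "missing MUL" is included.
def g80_gflops(flops_per_alu, num_alus=128, clock_mhz=1350):
    return flops_per_alu * num_alus * clock_mhz / 1000.0

print(g80_gflops(2))  # MADD only: 345.6 GFLOP/s
print(g80_gflops(3))  # MADD + the "missing MUL": 518.4 GFLOP/s
```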
 
You can construct cases where vec4 is just as efficient as scalar, but Uttar is correct: in real code, scalar will be more efficient. The question is how much more efficient, versus any die-area costs. This is something consumers won't be able to tell, due to too many variables.
 
It's a least-common-denominator sort of thing. You need to deal with nice even numbers, and x/1 is a lot simpler than x/4 when you're dealing with numbers between 1 and 4.

It gets a bit more complicated when you start looking at how many processors fit in a given area.
 
There are other issues with vector designs as well, such as data rotation. Even if one has a perfectly 4-vector series of operations, the data may need a "transposition" due to things like matrix layout or texture reads (Fetch4 complicates this even more). Scalar processors will thus generally guarantee fewer idle ALUs, and thus more efficient use of the hardware. However, I don't know anything about the manufacturing difficulty of each design... if they were the same size/cost, of course 128 vec4 ALUs would be better than 128 scalar ones ;)
 
It's a simple bin-packing issue. You end up with occasional "holes" in the fpu utilisation.
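A toy sketch of that bin-packing effect, assuming a vec4 unit that issues one instruction per clock (the instruction mix below is made up):

```python
# Toy model of the bin-packing "holes": a vec4 unit issues one
# instruction per clock, so an instruction using w of the 4 lanes
# leaves (4 - w) lanes idle. A scalar unit issues single-lane
# instructions with no waste (ignoring all other overheads).
def vec4_utilisation(widths):
    used = sum(widths)        # lanes doing real work
    issued = 4 * len(widths)  # lanes available over all issue slots
    return used / issued

# A made-up mix: the component width of each operation in a shader.
shader = [4, 3, 1, 2, 3, 1, 4, 1]
print(f"vec4 utilisation: {vec4_utilisation(shader):.0%}")
# Scalar utilisation is 100% by construction in this toy model.
```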

Agreed, but you have to duplicate a great deal of logic to get that efficiency, allowing you to have fewer scalar units.

But if the scalar ALU takes 4 clocks to execute an instruction and the vec4+scalar takes 1 clock, the scalar is not as efficient (at least in this implementation). Now, in the case of G80 this would only affect rarely used instructions, and we don't have any info on R600 yet.
 
Agreed, but you have to duplicate a great deal of logic to get that efficiency, allowing you to have fewer scalar units.

But if the scalar ALU takes 4 clocks to execute an instruction and the vec4+scalar takes 1 clock, the scalar is not as efficient (at least in this implementation). Now, in the case of G80 this would only affect rarely used instructions, and we don't have any info on R600 yet.

I think it's pretty much a similar idea to using AoS (.xyzw) or SoA (.xxxx, .yyyy, etc.) organisation for your data on vector units such as the PlayStation 2's VUs: in the end, even though the unit was designed with AoS usage in mind and had support for horizontal math and broadcast operations, developers wanting to get better utilisation of the two vector units re-arranged their data and processed vertices/data vectors in parallel (4 vertices at a time), and it worked out pretty well ;).
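A minimal sketch of the AoS-to-SoA rearrangement described above, in plain Python (the vertex data is made up):

```python
# AoS ("array of structures"): each element is one vertex [x, y, z, w].
aos = [[1, 2, 3, 1], [4, 5, 6, 1], [7, 8, 9, 1], [0, 1, 0, 1]]

# SoA ("structure of arrays"): one array per component, so a vec4-wide
# operation processes the same component of 4 vertices at once,
# keeping every lane busy with identical scalar work.
soa = {
    "x": [v[0] for v in aos],
    "y": [v[1] for v in aos],
    "z": [v[2] for v in aos],
    "w": [v[3] for v in aos],
}

# e.g. scaling x for all 4 vertices is one full-width vector op:
soa["x"] = [x * 2 for x in soa["x"]]
print(soa["x"])  # [2, 8, 14, 0]
```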
 
rwolf, do you think a vec4 approach is overall better than scalar? A more direct question: do you think G80 would be a better product had Nvidia designed it with vec4 in mind?
 
Agreed, but you have to duplicate a great deal of logic to get that efficiency, allowing you to have fewer scalar units.

But if the scalar ALU takes 4 clocks to execute an instruction and the vec4+scalar takes 1 clock, the scalar is not as efficient (at least in this implementation). Now, in the case of G80 this would only affect rarely used instructions, and we don't have any info on R600 yet.


Hasn't the G80 shown the efficiency? Let's compare with a vec3+scalar, 48-ALU R580: it could theoretically do 192 scalar operations at 650 MHz.

The G80 does 128 scalar operations at 1350 MHz.

Let's even out the MHz to get comparable scalar throughput: that's 265-ish scalar operations for a G80 at 650 MHz.

Still, the G80 is much more efficient overall. When it comes down to end performance it's more than 25% faster on most occasions (well, all occasions I can think of), so efficiency is greater, at least versus the way the vec ALUs are set up in the R580.
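The clock normalisation in the post above works out like this (a back-of-envelope sketch):

```python
# Normalising both parts to the same clock, as in the post above.
r580_scalar_ops = 48 * 4  # 48 vec3+scalar ALUs ~ 192 scalar lanes
g80_scalar_ops = 128      # 128 scalar ALUs

# Scale G80's per-clock rate down to R580's 650 MHz for comparison:
g80_at_650 = g80_scalar_ops * 1350 / 650
print(round(g80_at_650))  # ~266 scalar ops per (650 MHz) clock
```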
 
Agreed, but you have to duplicate a great deal of logic to get that efficiency, allowing you to have fewer scalar units.

But if the scalar ALU takes 4 clocks to execute an instruction and the vec4+scalar takes 1 clock, the scalar is not as efficient (at least in this implementation). Now, in the case of G80 this would only affect rarely used instructions, and we don't have any info on R600 yet.
If you can make a vec4+scalar that executes in 1 clock, then your scalar shouldn't take 4 clocks. Either way, it doesn't matter if they are pipelined: just work on other pixels while you wait for the result to be available.

Hmm, it seems you're talking about the special function unit with the 4-clocks comment. That's a separate issue from the vector-vs-scalar debate.
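A toy model of the latency-hiding point: with enough independent pixels in flight, a pipelined ALU still retires one op per clock regardless of instruction latency. The cost model below is a made-up simplification:

```python
# Toy model of latency hiding: an ALU with an N-clock pipeline still
# retires one result per clock, provided there are at least N
# independent pixels (threads) to rotate between while each waits.
def clocks_to_finish(num_pixels, ops_per_pixel, pipeline_depth):
    if num_pixels >= pipeline_depth:
        # Enough independent work: issue every clock, drain at the end.
        return num_pixels * ops_per_pixel + (pipeline_depth - 1)
    # Too few pixels: each op waits on the previous one for the same
    # pixel. This is a crude worst-case bound, not a real scheduler.
    return num_pixels * ops_per_pixel * pipeline_depth

print(clocks_to_finish(100, 10, 4))  # pipelined: 1003 clocks for 1000 ops
print(clocks_to_finish(2, 10, 4))    # starved: 80 clocks for 20 ops
```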
 
Hasn't the G80 shown the efficiency? Let's compare with a vec3+scalar, 48-ALU R580: it could theoretically do 192 scalar operations at 650 MHz.

The G80 does 128 scalar operations at 1350 MHz.

Let's even out the MHz to get comparable scalar throughput: that's 265-ish scalar operations for a G80 at 650 MHz.

Still, the G80 is much more efficient overall. When it comes down to end performance it's more than 25% faster on most occasions (well, all occasions I can think of), so efficiency is greater, at least versus the way the vec ALUs are set up in the R580.

Meh, that's not impressive the exact way you lay it out. G80 probably doesn't improve much on its theoretical increase.

Plus, R580 was probably quite a bit texture-limited. What if you did the theoretical comparison with a G71? It would probably look even worse for G80.

Not to mention that G80's die is much bigger per ALU/shader op... so you really are in trouble when trying to say it's an efficiency upgrade.

That said, I don't see why Nvidia would have done it if they didn't see a benefit, plus there are so many other issues here (like DX10 and CUDA capability). It seems ALUs as a portion of the die are declining right now (the 64 rumoured in R600 isn't impressive by past-gen standards either).
 
I like to use this awkward snippet of code to show how pipeline utilisation can fall off:

[image: b3d68.jpg]

This is how I think it executes on R580 (I've corrected an error that was in prior postings of this):

[image: b3d75.gif]

And this is how I think it executes on G80:

[image: b3d79.gif]

Jawed
 
Hasn't the G80 shown the efficiency? Let's compare with a vec3+scalar, 48-ALU R580: it could theoretically do 192 scalar operations at 650 MHz.

The G80 does 128 scalar operations at 1350 MHz.

Let's even out the MHz to get comparable scalar throughput: that's 265-ish scalar operations for a G80 at 650 MHz.

Still, the G80 is much more efficient overall. When it comes down to end performance it's more than 25% faster on most occasions (well, all occasions I can think of), so efficiency is greater, at least versus the way the vec ALUs are set up in the R580.

That's a good point, but R580 has half the transistors and half the clock speed, with substantially less bandwidth.
 
Eric Demers (AKA Sireric) in the B3D R580 Arch Interview said:
I'd have to check with the compiler team, but on average, I think we see about 2.3 scalars per instructions being close to the average. Being able to do 2 full scalars (one using VEC and one Scalar) pretty much means that we are pegged out; as well the smaller ALU gets used a lot as well, giving an effective 2~4 scalars per cycle. As well, the average shader instruction (multiple scalars) to texture ratio is around 3 right now, and from what ISVs are telling us, it's likely to increase in the next few years. Consequently, the number of ALUs seems to be hitting the "sweet" spot for new applications (while being slightly underutilized for older apps) as well.

Bold mine: does this mean that when the MADD vec3 and the ADD vec3 are only processing a scalar each, the several other potential FLOPs in that ALU are wasted?

Seems scalar ALUs are more optimal.

*Edit: Ah, Jawed's diagrams were not there when I posted this; they seem to concur with what I wrote wrt wasted FLOPs.
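A back-of-envelope reading of the interview figure, assuming the vec3+scalar pair offers 4 scalar lanes per issue slot (an assumption; the quote's "smaller ALU" lanes are ignored here):

```python
# Rough lane-utilisation estimate from the interview quote above.
# Assumption: one vec3 + one scalar = 4 scalar lanes per issue slot;
# the extra "smaller ALU" mentioned in the quote is ignored.
lanes_per_slot = 4
avg_scalars_used = 2.3  # average figure from the interview quote

utilisation = avg_scalars_used / lanes_per_slot
print(f"~{utilisation:.0%} of lanes busy on average")
```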
 
I like to use this awkward snippet of code to show how pipeline utilisation can fall off:

[image: b3d68.jpg]

This is how I think it executes on R580 (I've corrected an error that was in prior postings of this):

[image: b3d75.gif]

And this is how I think it executes on G80:

[image: b3d79.gif]

Jawed

Does your diagram take into account optimisations by the shader compiler? Also, comparing R580 to G80 is not comparing vec4 to scalar: R580 is bottlenecked by the fact that three ALUs and three half-ALUs service a single pixel. R600 is supposed to be implemented more like Xenos, and Xenos is much more efficient, isn't it?
 