Typical GPU Efficiency

the hell does "efficiency" mean in this context? that's pure marketing FUD, assuming he's talking about theoretical throughput compared to actual IPC in a given app, it's going to vary enormously from app to app, from scene to scene, hell, maybe even from frame to frame. it means nothing.
 
Active processing, perhaps? I can see how "efficiency" might relate to the frequency and length of stalls due to memory latency, or something like that.
 
ya.. I'm thinking efficiency is referring to something like the number of "functional units" within a given chip, which are actually working at any given time, compared to the total number of them...


"functional units" may refer to ALUs, or perhaps more than that..
 
The Baron said:
... it's going to vary enormously from app to app, from scene to scene, hell, maybe even from frame to frame. it means nothing.
The bottleneck varies that often, but the bottleneck is of utmost importance. And he said "typical" which probably means something like "average." And that is certainly not a negligible percentage. It shows that there are a lot of idle processors on high-end PC cards.

He's probably talking about idle versus working ALUs and texture units. Between all the various things that knock efficiency down, 60% is not ridiculous to me.
 
Basically, the whole claim here is that because current games don't saturate the vertex shader units, the chips aren't being run efficiently.

This has nothing to do with pixel fillrate, really, since almost no cards are vertex limited in current games, and since the R500 has a unified shader architecture it doesn't have spare vertex processing capacity sitting idle.
 
Yes, but at the same time it makes all those extra transistors in the vertex shaders "wasted" for the purpose of most games.
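
To put a rough number on that "wasted" capacity, here's a toy Python sketch (the pipe counts and the workload split are made up, purely for illustration) comparing a split PS/VS design against an idealised unified pool when the workload is heavily pixel-bound:

# Toy model: utilisation of a split shader design vs an idealised unified one.
# All unit counts and workload figures below are hypothetical.

def split_utilisation(ps_pipes, vs_pipes, pixel_share):
    # pixel_share: fraction of total shading work that is pixel work.
    ps_load = pixel_share / ps_pipes          # work per pixel pipe
    vs_load = (1.0 - pixel_share) / vs_pipes  # work per vertex pipe
    scale = 1.0 / max(ps_load, vs_load)       # run until the bottleneck pool saturates
    return scale / (ps_pipes + vs_pipes)      # work done / total capacity

if __name__ == "__main__":
    # e.g. 16 pixel pipes + 6 vertex pipes, workload 95% pixel shading
    print(f"split:   {split_utilisation(16, 6, 0.95):.0%}")
    print(f"unified: {1.0:.0%}  (idealised: any unit takes either kind of work)")

Stir in the intra-unit and latency losses discussed below and the 50-60% figure stops looking outlandish.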
 
While I do believe that the 50-60% figure makes sense, I don't believe the 100% for Xenos.
 
Xmas said:
While I do believe that the 50-60% figure makes sense, I don't believe the 100% for Xenos.

They have two kinds of functional units (TMUs and ALUs) in the shader unit. Because of this it is still possible for one function to block the other, and the 100% goes away. You can try to work around this by adding more threads and memory for those threads. But as long as there is more than one kind of functional unit in the shader unit it will be impossible to always hit 100%.

But even if we had only ALUs in the shader we can still go deeper. ALUs have blocks for MUL, ADD and SF. If an instruction only uses the MUL block, can we still say that the ALU works at 100%? IMHO no, because the FLOPs from the ADD and SF blocks are not used.
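
To make that concrete, here's a small Python sketch; the instruction-to-block mapping and the instruction mix are invented for illustration, not taken from any real ISA:

# Toy model: an ALU built from MUL, ADD and SF (special function) blocks.
# An instruction only lights up the blocks it needs, so even an ALU that
# issues an instruction every cycle leaves FLOPs on the table.

BLOCKS = ("MUL", "ADD", "SF")

# Hypothetical mapping from instruction to the blocks it uses.
USES = {
    "mul": {"MUL"},
    "add": {"ADD"},
    "mad": {"MUL", "ADD"},
    "rsq": {"SF"},
}

def block_utilisation(stream):
    busy = {b: 0 for b in BLOCKS}
    for instr in stream:
        for b in USES[instr]:
            busy[b] += 1
    return {b: busy[b] / len(stream) for b in BLOCKS}

if __name__ == "__main__":
    stream = ["mad", "mad", "mul", "add", "rsq", "mad", "mul", "mad"]
    util = block_utilisation(stream)
    for b, u in util.items():
        print(f"{b}: {u:.0%} busy")
    print(f"average across blocks: {sum(util.values()) / len(util):.0%}")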
 
Given the context of the article, what he's talking about is ALU utilisation efficiency, and that doesn't just relate to the notion of distinct VS and PS but also to losses of efficiency through the latency of both ALU instruction processing and texturing. Xenos works to hide all these latencies, and they are actually giving a figure of 95% utilisation rate for the ALU arrays (in shader-bound applications).
 
DaveBaumann said:
Given the context of the article, what he's talking about is ALU utilisation efficiency, and that doesn't just relate to the notion of distinct VS and PS but also to losses of efficiency through the latency of both ALU instruction processing and texturing. Xenos works to hide all these latencies, and they are actually giving a figure of 95% utilisation rate for the ALU arrays (in shader-bound applications).

But surely if all these latencies are such a large bottleneck (those not relating to separate shaders) then ATI would also be working to hide them in their other architectures? Namely R520.

Or is there something unique about the R500 that makes this only possible with that chip?
 
Well, Xenos can only handle so many threads. What if all of them are waiting for texture fetches to complete? And it's not like other architectures aren't built to hide these latencies.
 
How many threads do you think it can handle? What proportion of those shader programs is likely to be texture instructions, all demanding them at the same time? And that there'll be no VS instructions in the stack to process?

True enough though, Xenos will be a bit sucky as a fixed function processor, and it does demand a balance of operations tending towards ALU utilisation.
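
For what it's worth, the relationship is easy to sketch (the 200-cycle fetch latency and the instruction mixes here are invented numbers, just to show the shape of it): the fewer ALU instructions a shader has between texture fetches, the more threads you need in flight to keep the ALUs fed.

# Back-of-envelope: threads needed to hide texture fetch latency.
# All numbers are hypothetical; only the relationship matters.

def threads_to_hide(fetch_latency_cycles, alu_instrs_per_fetch):
    # While one thread waits on its fetch, the others run ALU work;
    # we need enough of them to cover the whole latency.
    return 1 + fetch_latency_cycles // alu_instrs_per_fetch

if __name__ == "__main__":
    for alu_per_fetch in (2, 5, 10, 20):
        n = threads_to_hide(200, alu_per_fetch)
        print(f"{alu_per_fetch:>2} ALU instrs between fetches -> "
              f"~{n} threads to cover a 200-cycle fetch")

On those made-up numbers a texture-heavy shader could chew through a 64-thread pool, while an ALU-heavy one leaves plenty of slack for the scheduler.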
 
DaveBaumann said:
How many threads do you think it can handle? What proportion of those shader programs is likely to be texture instructions, all demanding them at the same time? And that there'll be no VS instructions in the stack to process?
IIRC the leaked specs state 64 threads, but I don't know what granularity that is.
Post processing effects will likely fit this description. Virtually no VS work, and tons of pixels running the exact same shader.
 
The leak isn't accurate on the number of threads active; it's more than that. When they are running the same shader there is no reason for all pixels to be executing the same instruction at the same time.
 
The Baron said:
the hell does "efficiency" mean in this context? that's pure marketing FUD, assuming he's talking about theoretical throughput compared to actual IPC in a given app, it's going to vary enormously from app to app, from scene to scene, hell, maybe even from frame to frame. it means nothing.

Good point. I think he's talking about PC scenarios like running Word or a browser versus running a demanding 3d game, etc. In a PC the chips are always running at 100% peak, but sometimes the programs require no more than 50-60% of a given chip's resources--hence he says "efficiency."

I agree with you that it's a poor choice of word because that's much different from assigning some kind of "slowdown" due to poor processing efficiency. If you are running Word, for instance, and only using 50% of the performance resources of your hardware (or 20% or 30%, etc.), Word is still running at 100% efficiency on your hardware because it is running absolutely as fast as your hardware can run it. I think the "inefficiency" he was talking about is actually *software inefficiency* which in his context simply means the software isn't always "efficient" in terms of utilizing the maximum processing power of the chips.

I think understanding this does help in understanding the differences between console architectures and PC architectures, but I agree with you that "efficiency" is a somewhat cumbersome way to impart the concept. I don't think he intended it as FUD, really, just that his contrast of the design strategies of PCs and consoles was not worded as well as it might have been. Consoles have to be both cheap and relatively fast, so the emphasis has to be placed on writing console gaming software that attempts to always use the maximum performance resources available in the chips.
 
Something I would be very curious about is whether the change from pixel to vertex processing requires some kind of state change in the functional unit, or whether it can have both pixel and vertex instructions in flight together (or at least one behind the other)?
 
A "thread" is something that has a single state. 3 threads are being processed, there are many, many threads ready to be processed and interleved with other threads.
 
WaltC said:
The Baron said:
the hell does "efficiency" mean in this context? that's pure marketing FUD ...

Good point. I think he's talking about PC scenarios like running Word or a browser versus running a demanding 3d game, etc. In a PC the chips are always running at 100% peak, but sometimes the programs require no more than 50-60% of a given chip's resources--hence he says "efficiency." ...


...huh?


Surely he's just saying that because current graphics chips have fixed-function pixel OR vertex pipelines, most of the time only about 60% of the transistors (in the pipelines, at least) are being used in modern games, because the games are either pixel- or vertex-shader bound and thus either some pixel pipes or some vertex pipes aren't being used most of the time?
 
I wrote about efficiency recently:

If only we could talk in terms of pixel shader instructions, comparisons would start to get meaningful. This example shows SM3 executing 102 instructions in 46.75 cycles, 2.2 instructions per cycle:

http://www.beyond3d.com/forum/viewtopic.php?p=327176#327176

Bearing in mind that NV40 is capable of executing 4 shader instructions per cycle (peak), 55% efficiency, averaged over a long shader like this, seems like a fair representation of the wasteful design that a superscalar ALU architecture amounts to, as transistor budgets go up.
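
Spelling out the arithmetic (the figures are the ones from the linked example):

# Efficiency figure for the SM3 example above.
instructions = 102
cycles = 46.75
peak_ipc = 4  # NV40 peak shader instructions per cycle

ipc = instructions / cycles
print(f"achieved: {ipc:.2f} instructions/cycle")  # ~2.2, as quoted
print(f"of peak:  {ipc / peak_ipc:.0%}")          # ~55%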

Similarly, having ALUs that cannot operate while at least some of the texturing is being performed leads to a further loss of efficiency, though as shaders get longer (and texturing operations amount to a lower percentage of instructions) this particular efficiency loss falls off.
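
A quick sketch of that fall-off, assuming (purely for illustration) a fixed count of texture instructions per shader and a fixed ALU stall per texture op:

# How the texturing penalty shrinks as shaders get longer.
# Both constants below are invented for illustration.

TEX_INSTRS = 8       # texture instructions per shader
STALL_PER_TEX = 4    # cycles the ALUs sit idle per texture instruction

for alu_instrs in (16, 32, 64, 128, 256):
    idle = TEX_INSTRS * STALL_PER_TEX
    print(f"{alu_instrs:>3} ALU instrs: ALU busy ~{alu_instrs / (alu_instrs + idle):.0%}")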

In other words more and more transistors will be sitting idle as IHVs progress through 90nm into 65nm and beyond, as the number of pipelines increases. Something's got to give and that appears to be what ATI's doing with Xenos and R600.

It appears that R520 will prolly be some kind of superscalar design too (R420 is, but the second ALU has limited, PS1.4-level, functionality). So R520's only improvement in pipeline efficiency will presumably come from making all the ALUs in the pixel pipelines equivalently functional.

Another area where ALU efficiency is lost is when dynamic branching occurs. Currently, in NV40, pixel shader code loses efficiency in branching because around 1000 or so separate pixels are all lumped together, running the longest execution path through the shader. E.g. if one pixel is lit by 5 lights, all ~1000 pixels in the batch are "lit by 5 lights", though predication prevents the superfluous code from having any effect on the pixels lit by fewer than 5 lights.

I think it's fair to say everyone was expecting that this branch commonality would operate at the quad level in NV40, but it's turned out (through experiment) to measure at a larger level of granularity. The loss of efficiency, here, is catastrophic.
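
To see how catastrophic, here's a toy simulation of the "5 lights" example; the batch sizes, the lighting distribution and the per-light cost are all made up:

# Cost of dynamic branching at different batch granularities. Every pixel
# in a batch executes the longest path taken within that batch; predication
# merely masks the superfluous results. All numbers are illustrative.

import random

random.seed(1)
PIXELS = 64_000
COST_PER_LIGHT = 10  # cycles of shader work per light (made up)

# Each pixel lit by 1-5 lights, skewed towards fewer lights.
lights = [random.choices(range(1, 6), weights=(5, 4, 3, 2, 1))[0]
          for _ in range(PIXELS)]

def batched_cost(batch_size):
    total = 0
    for i in range(0, PIXELS, batch_size):
        batch = lights[i:i + batch_size]
        total += max(batch) * COST_PER_LIGHT * len(batch)
    return total

ideal = sum(lights) * COST_PER_LIGHT  # perfect per-pixel branching
for size in (4, 64, 1024):  # quad-level vs coarser granularities
    waste = batched_cost(size) / ideal - 1
    print(f"batch of {size:>4}: ~{waste:.0%} extra work vs ideal")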

It means that developers have avoided implementing shader code that performs dynamic per-pixel branching.

It'll be interesting to see if G70 and R520 can do quad-level dynamic-branching.

Jawed
 