Typical GPU Efficiency

Charmaka · Jun 15, 2005

Bob said:
A point that hasn't been addressed is how much additional hardware is needed to keep such a unified architecture utilized at 100% (or even 90%). If you need ~40% more transistors due to more FIFOs, register file ports, thread management, scoreboarding, etc, then a 60% efficient GPU with 40% more hardware would do just as well as a 100% efficient GPU. Not only that, but it'll have a higher peak performance, which opens the door for more optimizations.

If the 100% processor is defined as having a performance of 1, a 60% processor of the same size has performance 0.6 and a 60% processor which is 40% larger has a performance of 0.84. You'd need ~66% more transistors to get the same kind of speed.
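The arithmetic above can be checked with a quick sketch. It assumes a deliberately simple model that goes no further than the numbers in this post: throughput scales linearly with both utilisation and transistor count. The function names are made up for illustration.

```python
def relative_performance(efficiency, relative_size):
    """Throughput relative to a 100%-efficient chip of size 1.0,
    assuming performance scales linearly with both factors."""
    return efficiency * relative_size


def break_even_size(efficiency, target=1.0):
    """Relative size needed at the given efficiency to reach `target` throughput."""
    return target / efficiency


same_size = relative_performance(0.6, 1.0)         # 60% efficient, same size
forty_pct_larger = relative_performance(0.6, 1.4)  # 60% efficient, 40% larger
needed = break_even_size(0.6)                      # size that matches the 100% chip

print(f"{same_size:.2f} {forty_pct_larger:.2f} {needed:.3f}")
```

Under that linear assumption the break-even size is 1/0.6, i.e. roughly two thirds more transistors, which is where the ~66% figure comes from.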

Jawed said:
I think it's fair to say everyone was expecting that this branch commonality would operate at the quad level in NV40, but it's turned out (through experiment) to measure at a larger level of granularity. The loss of efficiency, here, is catastrophic.

It means that developers have avoided implementing shader code that performs dynamic per-pixel branching.

Forgive my somewhat poor memory, but isn't that basically what the comments in that ATI presentation about SM3.0 said? That branching in the NV40 was too slow to be useful?

Bob · Jun 15, 2005

If the 100% processor is defined as having a performance of 1, a 60% processor of the same size has performance 0.6 and a 60% processor which is 40% larger has a performance of 0.84. You'd need ~66% more transistors to get the same kind of speed.

Ok, so I screwed up my numbers. My question still stands though: is eff(A) / cost(A) > eff(B) / cost(B)? Having a higher efficiency helps you for sure, but does it also increase your cost? By how much?

What do you think is the typical efficiency of Xenos? What do you think is the typical efficiency of RSX? What are the performance and area differences?

How do you compute efficiency?

So many questions... I, for one, would like one of each GPU so I could test them.

Jawed · Jun 15, 2005

Charmaka said:
Forgive my somewhat poor memory, but isn't that basically what the comments in that ATI presentation about SM3.0 said? That branching in the NV40 was too slow to be useful?

Yeah, though it's a Crytek presentation, available from both ATI and NVidia:

http://www.ati.com/developer/gdc/D3DTutorial08_FarCryAndDX9.pdf

from:

http://www.ati.com/developer/techpapers.html

http://developer.nvidia.com/object/gdc_2005_presentations.html

The other relevant page:

http://graphics.stanford.edu/~yoel/notes/

written by Jeremy Sugarman who's building a ray tracer implemented in Brook for GPUs, a "general purpose" programming language for GPUs.

Take a look at the entries for 17 March and 21 February 2005.

Jawed

Mintmaster · Jun 15, 2005

Chalnoth said:
Well, not really. One can see where he's coming from when you just consider a game that has both large and small triangles to render in any given scene. The large triangles will be pixel-limited, and thus the vertex units will sit idle. The small triangles will be vertex-limited, and thus the pixel units will sit idle.

Precisely.

Even for one object, the triangles will be different sizes. No matter how you optimize the geometry, clusters where the faces are at a glancing angle will have higher vertex loads, and they will always change with object motion and rotation. Then you also have culled triangles (backfaces as well as those outside the view frustum), where the pixel/vertex ratio is zero. You really want to fly through these as fast as possible.
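A rough sketch of why this imbalance hurts a fixed vertex/pixel split. Everything in it is an assumption for illustration: the workload numbers, the 6/16 unit split, the idea that batches run one after another with no FIFO overlap between them, and a unified pool that has no scheduling overhead at all.

```python
def fixed_split_time(vertex_work, pixel_work, vertex_units, pixel_units):
    """Time for one batch when vertex and pixel units are separate:
    the slower side sets the pace and the other side idles."""
    return max(vertex_work / vertex_units, pixel_work / pixel_units)


def unified_time(vertex_work, pixel_work, total_units):
    """Idealised unified pool: any unit can take any work, no overhead."""
    return (vertex_work + pixel_work) / total_units


# Hypothetical batches as (vertex_work, pixel_work) in arbitrary units.
batches = [
    (10, 400),   # large triangles: pixel-limited
    (200, 40),   # small triangles: vertex-limited
    (100, 0),    # culled triangles: no pixels at all
]

VERTEX_UNITS, PIXEL_UNITS = 6, 16   # assumed split, for illustration only
TOTAL_UNITS = VERTEX_UNITS + PIXEL_UNITS

total_work = sum(v + p for v, p in batches)
t_fixed = sum(fixed_split_time(v, p, VERTEX_UNITS, PIXEL_UNITS) for v, p in batches)
t_unified = sum(unified_time(v, p, TOTAL_UNITS) for v, p in batches)

# Utilisation = useful work / (units * elapsed time).
util_fixed = total_work / (TOTAL_UNITS * t_fixed)
util_unified = total_work / (TOTAL_UNITS * t_unified)

print(f"fixed split: {util_fixed:.0%}, unified: {util_unified:.0%}")
```

The idealised pool reaches full utilisation by construction, so the gap it shows is an upper bound; real hardware sits somewhere between the two, with FIFOs smoothing the fixed split and thread management costing the unified design something.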

What you're talking about is exactly what I did at ATI several years back. Checking the balance, the stalls, location of the bottlenecks, etc.

Bob said:
...more transistors due to more FIFOs, register file ports, thread management, scoreboarding, etc...

True, but remember that traditional GPUs have very large FIFOs to try and eliminate the aforementioned intra-batch fluctuations in pixel/vertex balance. These may actually be reduced since C1 can control the processing rate on both ends of these FIFOs.

DemoCoder · Jun 15, 2005

Well, there's an obvious way to solve this dilemma quite easily without going through all the effort of unified shaders.

Remove vertex shaders from the GPU.

Waaah!? Yep, "unify" them by making only one kind of shader (the most important IMHO) exist: the pixel shader.

Yes, remove them, and move those VS units to the CPU, where they can be far more functional and general purpose at processing geometry. Then let multithreading on the CPU handle keeping the units busy with lots of general purpose work (geometry, physics, sound, etc)

arjan de lumens · Jun 15, 2005

DemoCoder said:
Remove vertex shaders from the GPU.

Waaah!? Yep, "unify" them by making only one kind of shader (the most important IMHO) exist: the pixel shader.

Yes, remove them, and move those VS units to the CPU, where they can be far more functional and general purpose at processing geometry. Then let multithreading on the CPU handle keeping the units busy with lots of general purpose work (geometry, physics, sound, etc)

Sounds like throwing the baby out with the bath-water. There are a whole bunch of reasons why I don't think that is a good idea:

Bandwidth between CPU and GPU - even assuming the CPU could deliver adequate shading performance, I suspect you would need tens of gigabytes/sec to do this properly.
Power efficiency wrt instruction fetch/decode/scheduling, register file lookups etc and number of functional units, and clock speed issues. A 3GHz vertex shader will likely burn about 3-5 times more energy per executed instruction than an array of 500MHz shaders - very bad, given that power consumption is becoming more and more THE limiting factor on both CPU and GPU performance.
Latency masking is generally harder with a general-purpose CPU than with the data streaming model that GPUs normally do for vertex data. This goes for both instruction execution latencies and latencies due to cache misses. You generally need a quite large pool of threads to mask latency as efficiently as current vertex shaders do.
Parallelism within a vertex array - taking a general-purpose multithreaded CPU, it is generally not very feasible to start one thread per vertex like dedicated vertex shader hardware would be likely to do, and if you subdivide the vertex array into too large chunks for suitable multithreading, you are likely to get memory access patterns that maximize the number of DRAM page breaks.
Vertex texturing, especially with filtered texture lookups, will not perform well at all.

Charmaka · Jun 15, 2005

Not to mention the fact that, if your game happens to be CPU-limited anyway, you're just further crippling your performance.

DemoCoder · Jun 15, 2005

I'm speaking of a Cell-like architecture, not a P4 hyperthreading technique. The SPEs are already streaming based, and already have the requisite bandwidth to the GPU. Obviously, this is not going to happen on the PC anytime soon. The problem with vertex shaders is that they are not babies, they are embryos. It's like throwing the embryo out with the bathwater. They are way too limited to do anything non-trivial.

Just about everything "interesting" is moving to per-pixel, and I see vertex shaders almost returning to performing what fixed function hardware used to do in the old days: transform + skinning. The ability to manipulate objects, and create/delete vertices is where the future is headed, and I'm not sure it makes sense for a GPU to tackle this. The only real benefit is data compression, since the non-tessellated/geometry shaded data can be sent over a slow bus, but we know that in the majority of today's most demanding games, the PCIE/AGP bus isn't even a limit.

So to me, unified shading isn't a big deal until WGF2.0/Geometry Shading. Most games aren't vertex shader limited, and those that make heavy use of techniques like stencil volumes are best left doing volume extrusion on the CPU, not the GPU.

_xxx_ · Jun 16, 2005

DaveBaumann said:
A "thread" is something that has a single state. 3 threads are being processed, there are many, many threads ready to be processed and interleaved with other threads.

How many B3D threads can be processed in parallel?

_xxx_ · Jun 16, 2005

Jawed said:
...wasteful design that a superscalar ALU architecture amounts to...

Care to elaborate - what would be the better alternative? You can't NOT go superscalar IMHO.

Megadrive1988 · Jun 16, 2005

Xmas said:
While I do believe that the 50-60% figure makes sense, I don't believe the 100% for Xenos.

of course it's not going to be 100% for Xenos. anyway, I thought I read that it could be 95% for Xenos. I could believe 90+ percent, in relation to the supposed 50 to 60 percent for current non-unified shader architectures.

_xxx_ said:
How many B3D threads can be processed in parallel?

lmao

Megadrive1988 · Jun 16, 2005

DemoCoder said:
Well, there's an obvious way to solve this dilemma quite easily without going through all the effort of unified shaders.

Remove vertex shaders from the GPU.

Waaah!? Yep, "unify" them by making only one kind of shader (the most important IMHO) exist: the pixel shader.

Yes, remove them, and move those VS units to the CPU, where they can be far more functional and general purpose at processing geometry. Then let multithreading on the CPU handle keeping the units busy with lots of general purpose work (geometry, physics, sound, etc)

kind of like Playstation2 with Vector Units / vertex shaders on the CPU and a pixel-shader based Graphics Synthesizer

Jawed · Jun 16, 2005

_xxx_ said:
Jawed said:

...wasteful design that a superscalar ALU architecture amounts to...

Care to elaborate - what would be the better alternative? You can't NOT go superscalar IMHO.

Unified shader pool, i.e. Xenos. NV40's 2nd ALU spends most of its time doing nothing.

Jawed

nAo · Jun 16, 2005

Jawed said:
Unified shader pool, i.e. Xenos. NV40's 2nd ALU spends most of its time doing nothing.

How do you know?

Inane_Dork · Jun 16, 2005

nAo said:
Jawed said:

Unified shader pool, i.e. Xenos. NV40's 2nd ALU spends most of its time doing nothing.

How do you know?

He has a really good magnifying glass and excellent eyesight.

Jawed · Jun 16, 2005

OK, I'll post it all again for the hard of reading:

http://www.beyond3d.com/forum/viewtopic.php?p=547353#547353

I wrote about efficiency recently:

If only we could talk in terms of pixel shader instructions, comparisons would start to get meaningful. This example shows SM3 executing 102 instructions in 46.75 cycles, 2.2 instructions per cycle:

http://www.beyond3d.com/forum/viewtopic.php?p=327176#327176

Bearing in mind that NV40 is capable of executing 4 shader instructions per cycle (peak), 55% efficiency, averaged over a long shader like this, seems like a fair representation of the wasteful design that a superscalar ALU architecture amounts to, as transistor budgets go up.
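For anyone who wants to re-run the numbers, here is the calculation as a minimal sketch. It takes the three inputs from the text at face value (102 instructions, 46.75 cycles, a peak of 4 instructions per cycle) and assumes that "efficiency" simply means achieved rate divided by peak rate.

```python
instructions = 102        # shader instructions in the linked example
cycles = 46.75            # cycles quoted for that shader
peak_per_cycle = 4        # claimed NV40 peak instructions per cycle

achieved_per_cycle = instructions / cycles
efficiency = achieved_per_cycle / peak_per_cycle

print(f"{achieved_per_cycle:.2f} instr/cycle, {efficiency:.1%} of peak")
```

The result depends entirely on what is counted as one instruction, so it is only as meaningful as that definition.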

Similarly, having ALUs that cannot operate while at least some of the texturing is being performed leads to a greater loss of efficiency. Though as shaders get longer (and texturing operations amount to a lower percentage of instructions) this particular efficiency loss falls off.

In other words more and more transistors will be sitting idle as IHVs progress through 90nm into 65nm and beyond, as the number of pipelines increases. Something's got to give and that appears to be what ATI's doing with Xenos and R600.

It appears that R520 will prolly be some kind of superscalar design too (R420 is, but the second ALU has limited, PS1.4, functionality). So R520's only improvement in pipeline efficiency will, presumably, come from making all ALUs in the pixel pipelines equivalently functional.

Another area where ALU efficiency is lost is when dynamic branching occurs. Currently, in NV40, pixel shader code causes a loss of efficiency in branching because around 1000 or so separate pixels are all lumped together, running the longest execution path through the shader. e.g. if one pixel is lit by 5 lights, all ~1000 pixels in the batch are "lit by 5 lights", though predication prevents the superfluous code having any effect on those pixels lit by fewer than 5 lights.
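The effect of batch size on branching can be sketched with a toy model. The assumptions are all mine: cost is proportional to the number of lights, every pixel in a batch pays for the worst pixel in that batch, and light counts are random per pixel, which is far less coherent than a real scene and so exaggerates the loss.

```python
import random


def executed_vs_useful(light_counts, batch_size):
    """Return (executed, useful) work when every pixel in a batch pays for
    the longest path taken by any pixel in that batch."""
    executed = useful = 0
    for start in range(0, len(light_counts), batch_size):
        batch = light_counts[start:start + batch_size]
        executed += max(batch) * len(batch)   # whole batch runs the worst case
        useful += sum(batch)                  # work the pixels actually needed
    return executed, useful


random.seed(0)  # fixed seed so the sketch is repeatable
# Hypothetical scene: each pixel is lit by 1 to 5 lights, chosen at random.
pixels = [random.randint(1, 5) for _ in range(64_000)]

for size in (4, 1000):   # quad-level vs the ~1000-pixel batches described above
    executed, useful = executed_vs_useful(pixels, size)
    print(f"batch of {size}: {useful / executed:.0%} of executed work was useful")
```

With spatially coherent lighting the large batches would fare better than this, but the direction of the effect is the same: the bigger the batch, the more often one expensive pixel drags the rest along.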

I think it's fair to say everyone was expecting that this branch commonality would operate at the quad level in NV40, but it's turned out (through experiment) to measure at a larger level of granularity. The loss of efficiency, here, is catastrophic.

It means that developers have avoided implementing shader code that performs dynamic per-pixel branching.

It'll be interesting to see if G70 and R520 can do quad-level dynamic-branching.

Jawed

_xxx_ · Jun 16, 2005

Jawed said:
_xxx_ said:

Jawed said:

...wasteful design that a superscalar ALU architecture amounts to...

Care to elaborate - what would be the better alternative? You can't NOT go superscalar IMHO.

Unified shader pool, i.e. Xenos. NV40's 2nd ALU spends most of its time doing nothing.

Jawed

It's got nothing to do with that. Superscalar per definition:

Superscalar architecture refers to the use of multiple execution units, to allow the processing of more than one instruction at a time. This can be thought of as a form of "internal multiprocessing", since there really are multiple parallel processors inside the CPU. Most modern processors are superscalar; some have more parallel execution units than others.

Know what I mean? From the purely architectural standpoint, a processor for this kind of use MUST be superscalar and will surely remain that way for the future.

EDIT:
even better def:

A superscalar architecture is one in which several instructions can be initiated simultaneously and executed independently

from here.

Jawed · Jun 16, 2005

From the NVidia_ImageQuality_v03.pdf:

The NVIDIA GeForce 6 Series introduces an innovative shader architecture that can double the number of operations executed per cycle (Figures 1 and 2). Two shading units per pixel deliver a twofold increase in pixel operations in any given cycle. This increased performance enables a host of complex computations and pixel operations. The result is stunning visual effects and a new level of image sophistication within fast-moving bleeding-edge games and other real-time interactive applications.

This is superscalar, multiple instruction execution per thread per clock. Superscalar doesn't refer to the parallel pipeline architecture of graphics cards.

Jawed

_xxx_ · Jun 16, 2005

Yeah, well that's what I said above, isn't it? Even one of those units is probably superscalar itself. So how would you build a modern processor, be it CPU, GPU or whatever without making it superscalar (in terms of "multiple instructions/cycle independently")?

nAo · Jun 16, 2005

Jawed said:
Bearing in mind that NV40 is capable of executing 4 shader instructions per cycle (peak), 55% efficiency, averaged over a long shader like this, seems like a fair representation of the wasteful design that a superscalar ALU architecture amounts to, as transistor budgets go up.

Do you realize NV40 pixel pipelines can execute a variable number of instructions per clock cycle since ALUs support dual issue and co-issue?
Obviously you can't count the shader instructions in a shader and then divide that number by the number of clock cycles needed to execute that shader.

To figure out a decent and meaningful number you should calculate how many flops are being executed per clock (as an average value over a set of 'common' shaders), otherwise you're dividing apples by oranges.

Similarly, having ALUs that cannot operate while at least some of the texturing is being performed leads to a greater loss of efficiency. Though as shaders get longer (and texturing operations amount to a lower percentage of instructions) this particular efficiency loss falls off.

Au contraire: if what you wrote before is true, using an ALU to help with texturing would increase efficiency, since, as you stated, an ALU is sitting idle most of the time.

NV40 ALUs stall only on subsequent instructions that need the result of a texture fetch; if there are other, non-dependent instructions to execute, they don't stall.
I don't think the second ALU is idle most of the time, otherwise NVIDIA's hardware designers wouldn't have put a second ALU there.
Obviously they run and analyze thousands of shaders (as ATI does..) to understand which ALU structure is the most suitable for current and near-future shader workloads.

In other words more and more transistors will be sitting idle as IHVs progress through 90nm into 65nm and beyond, as the number of pipelines increases. Something's got to give and that appears to be what ATI's doing with Xenos and R600.

What ATI is doing with Xenos is not related to what you're talking about here: they're addressing another kind of stall, caused by a lack of vertices or pixels to shade. With a unified shading scheme they're not addressing efficiency problems related to partially used ALUs.

It'll be interesting to see if G70 and R520 can do quad-level dynamic-branching.

Dunno about R520 but G70 doesn't address the dynamic branching problem AFAIK.
