Typical GPU Efficiency

_xxx_ said:
Yeah, well that's what I said above, isn't it? Even one of those units is probably superscalar itself. So how would you build a modern processor, be it CPU, GPU or whatever without making it superscalar (in terms of "multiple instructions/cycle independently")?

You missed the "per thread" part. Older GPUs had one ALU for shader instructions, per pipe.

Both NV40 and R420 have 2 ALUs for shader instructions, per pipe.

NVidia is making the claim that NV40 is superscalar (as opposed to earlier GPUs). Despite the fact that older GPUs can co-issue a vec and a scalar.

I'm using NVidia's terminology here. It actually fits quite well, because most of the time the second ALU (which can do vec2+vec2) is sitting idle. Now if that's not a reduction in efficiency, then I don't know what is :!:

R420 seems to suffer from the same problem, its mini-ALU for PS1.4 often sits doing nothing.

Jawed
 
Jawed said:
_xxx_ said:
most of the time the second ALU (which can do vec2+vec2) is sitting idle. Now if that's not a reduction in efficiency, then I don't know what is :!:

Not having that second ALU in apps that need it would probably introduce more painful issues, I guess?

EDIT:

And nV40 is certainly superscalar, just like R420. Not in whatever nV terms, but per definition. I'll quote the science again:
• Superscalar architectures allow several instructions
to be issued and completed per clock cycle.
• A superscalar architecture consists of a number of
pipelines that are working in parallel.
• Depending on the number and kind of parallel units
available, a certain number of instructions can be
executed in parallel.
 
nAo said:
Jawed said:
Bearing in mind that NV40 is capable of executing 4 shader instructions per cycle (peak), 55% efficiency, averaged over a long shader like this, seems like a fair representation of the wasteful design that a superscalar ALU architecture amounts to, as transistor budgets go up.
Do you realize NV40 pixel pipelines can execute a variable number of instructions per clock cycle since ALUs support dual issue and co-issue?

Of course I do.

Obviously you can't count the shader instructions in a shader and then divide that number by the number of clock cycles needed to execute that shader :)

If a pipeline is capable of executing 4 instructions per clock, but on average only executes 2.2 instructions per clock, then it's running at 55% efficiency, and it's a wasteful architecture. If the mix of instructions doesn't suit the ALU design, then it's wasteful.
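Jawed's 55% figure, taken at face value, is just this division (a back-of-the-envelope sketch using the numbers from the post, not a measurement):

```python
# Jawed's instruction-slot efficiency metric, using his own numbers.
# Both figures come from the discussion, not from any benchmark.
peak_instr_per_clock = 4.0   # claimed peak: vec3+scalar+vec2+vec2
avg_instr_per_clock = 2.2    # the average he quotes for long shaders

efficiency = avg_instr_per_clock / peak_instr_per_clock
print(efficiency)  # 0.55, i.e. 55%
```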

To figure out a decent and meaningful number you should calculate how many flops are being executed per clock (as an average value over a set of 'common' shaders), otherwise you're dividing apples by oranges.

As I said, we have an architecture designed to execute 4 instructions per clock, vec3+scalar+vec2+vec2. If the instruction mix on long shaders is entirely unlike that, consisting of vec3+scalar... well you can see where this is going. The vec2+vec2 ALU can also do scalar operations, but it seems there aren't enough of those in the code, either, to make full use of the second ALU.

Why don't you provide an analysis of the average cycle count of the shaders you've written. You write SM3 code, don't you? I don't, which is why I have to scrub around looking for "long, complex" shaders.

Similarly, having ALUs that cannot operate while at least some of the texturing is being performed leads to a greater loss of efficiency. Though as shaders get longer (and texturing operations amount to a lower percentage of instructions) this particular efficiency loss falls off.
Au contraire, if what you wrote before is true, using an ALU to help with texturing would increase efficiency, since, as you stated, an ALU is sitting idle most of the time :)
You obviously don't understand NV40 very well then do you? The ALU that can't work cos it's doing texturing work is the vec3+scalar ALU, which is the ALU that does most of the work!

NV40 ALUs stall only on subsequent instructions needing the result of a texture fetch; if you have other non-dependent instructions to execute it doesn't stall.

I agree. That's not the point I'm making there. R420 has a dedicated ALU for texture address calculation which leaves the main ALU free to execute other shader code as a TMU operation is initiated.

I don't think the second ALU is idle most of the time, otherwise NVIDIA's hardware designers wouldn't have put a second ALU there.

These are the same HW designers that built NV30 aren't they?...

Obviously they run and analyze thousands of shaders (as ATI does...) to understand which ALU structure is the most suitable for current and near-future shader workloads.

Which is why NVidia designed a dynamic branching architecture in NV40 that is entirely useless for per-pixel dynamic branching, because the pixel thread batch size is measured at approximately 1000.

In other words more and more transistors will be sitting idle as IHVs progress through 90nm into 65nm and beyond, as the number of pipelines increases. Something's got to give and that appears to be what ATI's doing with Xenos and R600.
What ATI is doing with Xenos is not related to what you're talking about here, since they're addressing another kind of stall, one caused by a lack of vertices or pixels to shade. With a unified shading scheme they're not addressing efficiency problems related to partially used ALUs.
No, Xenos is a design with one ALU per thread. There is no chance for an ALU to go idle because a thread cannot issue more than one instruction per clock (either because of dependency, or because of an incorrect instruction mix: a vec2 ALU sitting idle because there are no vec2 instructions).

It'll be interesting to see if G70 and R520 can do quad-level dynamic-branching.
Dunno about R520 but G70 doesn't address the dynamic branching problem AFAIK.

Well if R520 does that's going to create a riot!

Jawed
 
xxx, you're missing the point, R420 and NV40 are superscalar PIPELINES.

R300 is a good example of a non-superscalar pipeline.

Xenos is not a superscalar design. Each ALU is always active. The PS1.4 ALU in R420 and the vec2+vec2 ALU in NV40 are both inactive a lot of the time.

This is why these designs average out at 50-60% efficiency.

Jawed
 
Jawed said:
If a pipeline is capable of executing 4 instructions per clock, but on average only executes 2.2 instructions per clock, then it's running at 55% efficiency, and it's a wasteful architecture. If the mix of instructions doesn't suit the ALU design, then it's wasteful.
If it executes one vec4 mul and one vec4 mad, it's running at 100%. Despite the capability of executing a peak 4 instructions/clock (and leaving the SFU and NRM_PP out of the picture).

As I said, we have an architecture designed to execute 4 instructions per clock, vec3+scalar+vec2+vec2. If the instruction mix on long shaders is entirely unlike that, consisting of vec3+scalar... well you can see where this is going. The vec2+vec2 ALU can also do scalar operations, but it seems there aren't enough of those in the code, either, to make full use of the second ALU.
Both mul and mad shader units are capable of 3:1 and 2:2 splits.

You obviously don't understand NV40 very well then do you? The ALU that can't work cos it's doing texturing work is the vec3+scalar ALU, which is the ALU that does most of the work!
"The ALU that can't work cos it's doing texturing work" - how can it not work when it's doing work?
And no, the mul ALU is not the one that is doing most of the work, simply because it's restricted to multiplication. The other one can do multiply-add and dot products.

Which is why NVidia designed a dynamic branching architecture in NV40 that is entirely useless for per-pixel dynamic branching, because the pixel thread batch size is measured at approximately 1000.
That doesn't make it entirely useless. But there's room for improvement.

No, Xenos is a design with one ALU per thread. There is no chance for an ALU to go idle because a thread cannot issue more than one instruction per clock (either because of dependency, or because of an incorrect instruction mix: vec2 ALU sitting idle because there are no vec2 instructions)
Xenos ALUs are vec4 + scalar, so the scalar part could sit idle, or a vec2 instruction wastes two channels. And it always processes 16 vertices/4 quads at once; some of them could be wasted.
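Xmas's point about a vec4+scalar ALU can be put as a simple lane count. A sketch, assuming a 5-lane model (4 vector + 1 scalar); the numbers are illustrative, not from Xenos documentation:

```python
# Lane-utilization sketch for an assumed vec4+scalar ALU (5 lanes/clock).
LANES = 5  # 4 vector lanes + 1 scalar lane

def lane_utilization(lanes_used):
    """Fraction of the ALU's lanes doing useful work this clock."""
    return lanes_used / LANES

print(lane_utilization(2))  # vec2 issued alone: 0.4
print(lane_utilization(5))  # vec4 + scalar co-issued: 1.0
```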

Well if R520 does that's going to create a riot!

Jawed
Well, ATI does warn about some performance implications with dynamic branching, too.
 
Xmas, where on earth do you get the idea that NV40 has a MUL ALU and a MAD ALU?

I'm not saying you're wrong, it's just I've never seen NV40's pipeline described in those terms.

Jawed
 
Jawed said:
If a pipeline is capable of executing 4 instructions per clock, but on average only executes 2.2 instructions per clock, then it's running at 55% efficiency, and it's a wasteful architecture. If the mix of instructions doesn't suit the ALU design, then it's wasteful.
You would be right if all shader ops comprised the same number of floating point ops, and if the ALUs couldn't change their configuration to act in different ways when needed (co-issue).
But this is not the case on NV40 ;)
NV40 ALUs can execute N operations per clock cycle (even more than 4...), but even if a pixel pipe is executing just 2 shader ops per clock, that doesn't mean it's not efficient.
As an example, we can have a shader composed of dot4 and mul4 instructions.
The first ALU can do a dot4 per cycle, the second one can do a mul4 per cycle. We can have a 100-instruction shader with 50 dot4s and 50 mul4s being executed in 50 cycles.
In this case the ALUs are running full throttle; nonetheless you're saying the second ALU is sitting idle because the pixel pipe is not executing 4 shader ops per cycle merely because it's capable of that. That's funny, LOL :)
You forgot that when a pixel pipe is doing a lot of shader ops per cycle, each shader op is working on a smaller data unit, because NV40 ALUs are reconfigurable.
In the end it doesn't matter how many shader ops you're executing; what matters is how many units are working. We don't care if an ALU is doing a mul4 or 2 mul2s, we care about it running and doing calculations most of the time.
That's why your calculation is plain wrong.
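nAo's dot4/mul4 example shows exactly where the two metrics diverge. A sketch with his numbers (illustrative only):

```python
# nAo's example: a 100-instruction shader (50 dot4 + 50 mul4) executed
# in 50 cycles, the dot4 on one ALU and the mul4 on the other each clock.
instructions = 100
cycles = 50
peak_instr_per_clock = 4   # Jawed's peak (vec3+scalar+vec2+vec2)

# Jawed's metric: instructions retired vs. peak instruction slots.
instr_efficiency = instructions / (cycles * peak_instr_per_clock)  # 0.5

# nAo's view: both ALUs are doing useful work every single cycle.
alu_utilization = 2 / 2  # ALUs busy / ALUs present = 1.0

print(instr_efficiency, alu_utilization)  # 0.5 1.0
```

Same shader, "50% efficient" by one metric and fully busy by the other.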

As I said, we have an architecture designed to execute 4 instructions per clock, vec3+scalar+vec2+vec2
No, we have an architecture that is designed in some cases as you wrote above, but it can work in other ways; there are many different configurations. That's why your calculation is wrong: you're summing and dividing different things. This is basic algebra and you're not getting it right :)

If the instruction mix on long shaders is entirely unlike that, consisting of vec3+scalar... well you can see where this is going. The vec2+vec2 ALU can also do scalar operations, but it seems there aren't enough of those in the code, either, to make full use of the second ALU.
The instruction mix can change, that's not a problem. The ALUs ARE NOT WORKING in a fixed configuration; they change each clock cycle according to what they're scheduled to execute in any given clock cycle.


You obviously don't understand NV40 very well then do you? The ALU that can't work cos it's doing texturing work is the vec3+scalar ALU, which is the ALU that does most of the work!
The 'texture work', as you call it, lasts one clock cycle.
An ALU is used to do perspective correction on the texture coordinates.


R420 has a dedicated ALU for texture address calculation which leaves the main ALU free to execute other shader code as a TMU operation is initiated.
NV40 can execute other shader code too (using BOTH ALUs), as long as those other instructions don't depend upon a previous texture fetch.
There was a full thread about this..

These are the same HW designers that built NV30 aren't they?...
Dunno if they're the same, but I bet smart people learn from their errors.

Which is why NVidia designed a dynamic branching architecture in NV40 that is entirely useless for per-pixel dynamic branching, because the pixel thread batch size is measured at approximately 1000.
It's not useless; it has limitations, it's far from perfect and it could be improved, but it's not useless.
There are cases where dynamic branching is going to help you... there are other cases where dynamic branching will slow shader execution.

No, Xenos is a design with one ALU per thread.
I know, but this isn't related to a unified shading scheme ;)

There is no chance for an ALU to go idle because a thread cannot issue more than one instruction per clock (either because of dependency, or because of an incorrect instruction mix: vec2 ALU sitting idle because there are no vec2 instructions)
Dunno how flexible Xenos ALUs are, but every time you don't have a vec4 and a scalar op to execute per cycle, the ALU will be partially idle ;)
NV40 ALUs address this problem with co-issue.
At the end of the day you should count how many flops you're executing against how many flops you could potentially execute; shader ops or instructions per clock cycle don't mean anything.
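nAo's suggested metric could be sketched like this (the 12 flops/clock peak is a made-up illustration, not an NV40 figure):

```python
# Flop-based efficiency: flops executed vs. peak flops available.
# PEAK_FLOPS_PER_CLOCK is an assumed, purely illustrative number.
PEAK_FLOPS_PER_CLOCK = 12

def flop_efficiency(flops_executed, cycles, peak=PEAK_FLOPS_PER_CLOCK):
    """Fraction of the pipe's peak flop rate actually used."""
    return flops_executed / (cycles * peak)

# e.g. a shader that executes 330 flops over 50 cycles:
print(flop_efficiency(330, 50))  # 0.55
```

The point of the metric is that a vec2 op counts as 2 flops rather than as one full "instruction slot", so reconfigurable ALUs aren't penalized for splitting.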
 
Jawed said:
xxx, you're missing the point, R420 and NV40 are superscalar PIPELINES.

No, they're superscalar processors, already just because they have multiple pipelines. The meaning of the word superscalar is just that - multiple instructions per clock cycle at the processor level, not in the individual pipelines.

R300 is a good example of a non-superscalar pipeline.

Dunno, maybe, but it is a superscalar processor.

Xenos is not a superscalar design. Each ALU is always active. The PS1.4 ALU in R420 and the vec2+vec2 ALU in NV40 are both inactive a lot of the time.

But it surely is. ALU's being active or not has nothing to do with the processor being superscalar or not.

I get what you're saying, but I'm just arguing the meaning of the term itself. All these GPUs _are_ superscalar, and in the case of NV40 and R520 they're also superpipelined (dividing the stages of a pipeline into substages and thus increasing the number of instructions in flight in the pipeline at a given moment), which R300, as I get it, is not (probably that's what you meant?).
 
xxx, one last time, in NVidia's marketing material they describe the PIPELINE as superscalar. I've got no reason to argue with that definition.

Of course a GPU is superscalar, taken as a whole.

Jawed
 
Jawed said:
xxx, one last time, in NVidia's marketing material they describe the PIPELINE as superscalar. I've got no reason to argue with that definition.

Of course a GPU is superscalar, taken as a whole.

Jawed

Ok, we were talking about different things then. I was all about GPU as a whole :)
 
nAo said:
Dunno how much flexible Xenos ALU are but everytime you don't have a vec4 and a scalar op to execute per cycle then the ALU will be partially idle ;)

I agree. Plus there's a loss of efficiency due to triangle edges - and fragment batch size when per-pixel dynamic branches are executed.

My argument is that NV40's ALU architecture means that more of its ALU capacity will sit idle per clock than in Xenos's architecture.

NV40 ALUs address this problem with co-issue.
At the end of the day you should count how many flop you're executing and how many flops you could potentially execute, shaders ops or instructions per clock cycle doesn't mean anything.

I'd agree, except that it seems that a vec2+vec2 ALU (which can also be a scalar+scalar ALU) is not well suited to average shader code. There's an over-abundance of capacity for instruction types that are marginal. If the bulk of pixel shader code is vec3 then why create such a vast amount of vec2 and scalar capacity?

Jawed
 
Xmas said:
Well, ATI does warn about some performance implications with dynamic branching, too.

Yes, you're right. I forgot about that. ATI's annoyingly vague too.

The GPU Gems chapter (thanks for that nAo!) at least describes batches of fragments as "hundreds" in size which hints at NV40's problem.

Jawed
 
nAo said:
The instruction mix can change , that's not a problem, ALUs ARE NOT WORKING on a fixed configuration, they change each clock cycle according on what they're scheduled to execute any given clock cycle.
To be more precise, they change every "quad batch cycle", i.e. after a given instruction has been performed on all quads in a batch.


Jawed said:
Xmas, nAo, look at page 15:

http://www.hotchips.org/archives/hc16/2_Mon/13_HC16_Sess3_Pres1_bw.pdf

Texture address calculation in the first ALU.

Page 16 is interesting, as well...

Jawed
Yes. Look at the example shader code. Every kind of split (vec4 only, 3:1, 2:2) appears in both ALUs.

btw,
http://www.beyond3d.com/previews/nvidia/nv40/index.php?p=9
http://www.3dcenter.org/artikel/nv40_pipeline/index_e.php
;)
 
This is interesting:

Code:
# clock 1
texld r0, t0, s0; # tex fetch
madr r0, r0, c1.r, c1.g # _bx2 in tex
nrm_pp r1.rgb, t4 # nrm in shader 0
dp3 r1.r, r1, r0 # 3D dot product in shader 1
mul r0.a, r0, r0 # dual issue in shader 1

# clock 2
mul r1.a, r0.a, c2.a # dual issue in shader 0
mul r0.rgb, r1.r, r0 # dual issue in shader 0
add r0.a, r1.r, r1.r # fx2 in shader 0
mad r0.rg, r0.a, c1, c1.a # mad w/2 const in shader 1
mul r1.ba, r1.a, r0.a, c2 # dual issue in shader 1

# clock 3
rcp r0.a, r0.a # reciprocal in shader 0
mul r0.rg, r0, r0.a # div instruction in shader 0
mul r0.a, r0.a, r1.a # dual issue in shader 0
texld r2, r0, s1 # texture fetch
mad r2.rgb, r0.a, r2, c5 # mad in shader 1
abs r0.a, r0.a # abs in shader 1
log r0.a, r0.a # log in shader 1

# clock 4
rcp r0.a, t1.a # reciprocal in shader 0
mul r0.rg, t1, r0.a # div instruction in shader 0
mul r0.a, r0.a, c2.g # dual issue in shader 0
texld r1, r0, s3 # tex fetch
mad r1.rgb, r1, c4, -r2 # mad in shader 1
exp r0.a, r0.a # dual issue in shader 1

# clock 5
texld r0, r1.bar, s2 # texture coordinates swizzle
mad r0.rgb, r0, v0, r1 # color calculation in shader 1
mul r0.a, r1, v0 # dual issue in shader 1

# clock 6
mul r1.rgb, r0.a, c5.a # mul in shader 0
mad r0.rgb, r1, r0.a, r0 # mad in shader 1
mov r0.a, c3.a # move in shader 1
mov oC0, r0 # move in shader 1

I count 30 instructions executed in 6 clocks.

Very impressive.

So, as an example of NV40's capability, how representative is that?

Jawed

(edited, 30, not 31, sigh)
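A quick tally of the listing above (counts transcribed by hand from the "# clock N" groups, so treat them as illustrative):

```python
# Instructions issued per clock in the example shader listing,
# counted by hand from the six "# clock N" groups.
per_clock = [5, 5, 7, 6, 3, 4]  # clocks 1..6

total = sum(per_clock)        # 30 instructions, matching the count above
avg = total / len(per_clock)  # 5.0 instructions/clock on this snippet
print(total, avg)  # 30 5.0
```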
 
Jawed, superscalar design and a unified architecture are completely orthogonal ideas to one another. The efficiency that was talked about with respect to the Xenos has nothing to do with any sort of superscalar issues. It has to do with pixel pipelines waiting for vertex pipelines and vice versa. That is the efficiency that a unified architecture fixes, not the efficiency of making use of all of a superscalar pipeline's units.

And the pipelines of the Xenos are most assuredly superscalar in themselves, it'd be rather silly for them not to be.
 
Xmas, if I understand correctly, then NV40 is able to do:

MUL in ALU 1, which can be either vec3+scalar or vec2+vec2 or vec2+scalar or scalar+scalar

MAD in ALU 2, which can be either vec3+scalar or vec2+vec2 or vec2+scalar or scalar+scalar

So you could do a vec3+scalar MUL and a vec3+scalar MAD in one clock, for example.

So the reason that pipeline efficiency falls off in NV40 is due to dependency in code. e.g. the MUL and MAD can only operate in one clock if there's no dependency between them.

Xenos avoids dependency by having a single ALU, and it seemingly uses thread interleaving to avoid inter-dependent instruction latency.

Jawed
 
Jawed said:
So the reason that pipeline efficiency falls off in NV40 is due to dependency in code. e.g. the MUL and MAD can only operate in one clock if there's no dependency between them.

Xenos avoids dependency by having a single ALU, and it seemingly uses thread interleaving to avoid inter-dependent instruction latency.

Jawed
Buzz, it seems Xenos has the same problem:
"Providing developers throw instructions at our architecture in the right way, Xenos can run at 100 per cent efficiency all the time."
 