Jawed said:
If a pipeline is capable of executing 4 instructions per clock, but on average only executes 2.2 instructions per clock, then it's running at 55% efficiency, and it's a wasteful architecture. If the mix of instructions doesn't suit the ALU design, then it's wasteful.
You would be right if all shader ops comprised the same number of floating-point ops and if the ALUs couldn't change their configuration to act in different ways when needed (co-issue).
But this is not the case on NV40.
NV40 ALUs can execute N operations per clock cycle (even more than 4), but even if a pixel pipe is executing just 2 shader ops per clock, this doesn't mean it's inefficient.
As an example, take a shader composed of dot4 and mul4 instructions.
The first ALU can do a dot4 per cycle, the second one can do a mul4 per cycle. A 100-instruction shader with 50 dot4 and 50 mul4 can then be executed in 50 cycles.
In this case the ALUs are running full throttle; nonetheless you're saying the second ALU is sitting idle because the pixel pipe is not executing 4 shader ops per cycle, just because it's capable of that. That's funny, LOL.
You forgot that when a pixel pipe is doing a lot of shader ops per cycle, each shader op is working on a smaller data unit, because NV40 ALUs are reconfigurable.
In the end it doesn't matter how many shader ops you're executing; what matters is how many units are working. We don't care if an ALU is doing a mul4 or 2 mul2, we care about it running and doing calculations most of the time.
That's why your calculation is plain wrong.
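To put numbers on the dot4/mul4 example, here is a rough Python sketch. The model is my own simplification, not a description of the real hardware: two ALUs, each 4 lanes wide, one vec4 op per ALU per clock, no dependencies between instructions, and a nominal peak of 4 ops per clock taken from the quoted figure. All names are made up for illustration.

```python
# Hypothetical toy model, not real NV40 behaviour.
LANES_PER_ALU = 4        # assumption: each ALU is 4 lanes wide
NUM_ALUS = 2             # assumption: two ALUs per pixel pipe
PEAK_OPS_PER_CLOCK = 4   # nominal peak from the quoted post


def run_shader(num_dot4, num_mul4):
    """ALU 0 takes the dot4s, ALU 1 takes the mul4s, one op each per clock."""
    cycles = max(num_dot4, num_mul4)
    ops = num_dot4 + num_mul4
    busy_lanes = ops * LANES_PER_ALU                  # every op fills 4 lanes
    total_lanes = cycles * NUM_ALUS * LANES_PER_ALU   # lanes available overall
    return {
        "cycles": cycles,
        "ops_per_clock": ops / cycles,
        "op_count_efficiency": ops / cycles / PEAK_OPS_PER_CLOCK,
        "lane_utilization": busy_lanes / total_lanes,
    }


stats = run_shader(50, 50)
print(stats)
```

Under these assumptions the op-count metric reports 50% while every lane of both ALUs is busy on every clock, which is the mismatch I'm pointing at.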
As I said, we have an architecture designed to execute 4 instructions per clock, vec3+scalar+vec2+vec2
No, the architecture works that way in some cases, as you wrote above, but it can also work in other ways. There are many different configurations, and that's why your calculation is wrong: you're summing and dividing different things. This is basic algebra and you're not getting it right.
If the instruction mix on long shaders is entirely unlike that, consisting of vec3+scalar... well you can see where this is going. The vec2+vec2 ALU can also do scalar operations, but it seems there aren't enough of those in the code, either, to make full use of the second ALU.
The instruction mix can change, that's not a problem. The ALUs ARE NOT WORKING in a fixed configuration; they change each clock cycle according to what they're scheduled to execute in that cycle.
You obviously don't understand NV40 very well then do you? The ALU that can't work cos it's doing texturing work is the vec3+scalar ALU, which is the ALU that does most of the work!
The 'texture work', as you call it, lasts one clock cycle.
An ALU is used to do perspective correction on texture coordinates.
R420 has a dedicated ALU for texture address calculation which leaves the main ALU free to execute other shader code as a TMU operation is initiated.
NV40 can execute other shader code too (using BOTH ALUs) as long as those instructions don't depend on a previous texture fetch.
There was a full thread about this...
These are the same HW designers that built NV30 aren't they?...
Dunno if they're the same, but I bet smart people learn from their errors.
Which is why NVidia designed a dynamic branching architecture in NV40 that is entirely useless for per-pixel dynamic branching, because the pixel thread batch size is measured at approximately 1000.
It's not useless. It has limitations, it's far from perfect and it could be improved, but it's not useless.
There are cases where dynamic branching is going to help you... and there are other cases where dynamic branching will slow shader execution.
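A quick way to see both cases is a generic batch model, sketched below in Python. This is an assumption on my part, not measured NV40 behaviour: pixels are processed in fixed-size batches, a batch whose pixels all go the same way pays for one side of the branch only, and a batch with mixed pixels pays for both sides. The costs (10 and 1) are arbitrary, the batch size of 1000 is the figure quoted above, and the overhead of the branch instruction itself is ignored.

```python
# Hypothetical batch-granularity model of dynamic branching.
def batch_cost(takes_branch, cost_if, cost_else):
    """Cost of one batch: uniform batches run one side, mixed batches run both."""
    if all(takes_branch):
        return cost_if
    if not any(takes_branch):
        return cost_else
    return cost_if + cost_else


def frame_cost(pixels, batch_size, cost_if, cost_else):
    """Sum the cost over consecutive batches of `batch_size` pixels."""
    total = 0
    for start in range(0, len(pixels), batch_size):
        total += batch_cost(pixels[start:start + batch_size], cost_if, cost_else)
    return total


BATCH = 1000
coherent = [True] * 2000 + [False] * 2000   # large uniform regions
interleaved = [True, False] * 2000          # branch flips every pixel

cost_coherent = frame_cost(coherent, BATCH, 10, 1)
cost_interleaved = frame_cost(interleaved, BATCH, 10, 1)
cost_no_branch = (len(coherent) // BATCH) * (10 + 1)  # always run both sides

print(cost_coherent, cost_interleaved, cost_no_branch)
```

With large coherent regions the branch halves the cost in this toy setup; with pixel-level divergence every batch runs both sides, so you gain nothing and in practice would also pay the branch overhead.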
No, Xenos is a design with one ALU per thread.
I know, but this is not related to a unified shading scheme.
There is no chance for an ALU to go idle because a thread cannot issue more than one instruction per clock (either because of dependency, or because of an incorrect instruction mix: vec2 ALU sitting idle because there are no vec2 instructions)
Dunno how flexible Xenos ALUs are, but every time you don't have a vec4 and a scalar op to execute per cycle, the ALU will be partially idle.
NV40 ALUs address this problem with co-issue.
At the end of the day you should count how many flops you're executing and how many flops you could potentially execute; shader ops or instructions per clock cycle don't mean anything.
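Here is what that counting looks like as a small Python sketch. It assumes a single hypothetical 4-wide multiplier ALU with a peak of 4 flops per clock, and only multiply ops; the op names and the table are illustrative, not taken from any hardware documentation.

```python
# Hypothetical single-ALU model: executed flops vs potential flops.
PEAK_FLOPS_PER_CLOCK = 4  # assumption: one 4-wide multiplier
MUL_FLOPS = {"mul4": 4, "mul3": 3, "mul2": 2, "mul1": 1}


def measure(schedule):
    """schedule: list of cycles, each a list of ops co-issued in that cycle.

    Returns (ops per clock, executed flops / potential flops).
    """
    cycles = len(schedule)
    ops = sum(len(cycle) for cycle in schedule)
    flops = sum(MUL_FLOPS[op] for cycle in schedule for op in cycle)
    return ops / cycles, flops / (cycles * PEAK_FLOPS_PER_CLOCK)


wide = measure([["mul4"]] * 10)            # one mul4 per clock
split = measure([["mul2", "mul2"]] * 10)   # two co-issued mul2 per clock
partial = measure([["mul3"]] * 10)         # one lane idle every clock

print(wide, split, partial)
```

The mul4 and the 2x mul2 schedules differ by a factor of two in ops per clock yet both hit full flop utilization, while the mul3 schedule has the same op rate as mul4 and leaves a quarter of the ALU idle.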