I wrote about efficiency recently:
If only we could talk in terms of pixel shader instructions, comparisons would start to get meaningful. This example shows an SM3 shader executing 102 instructions in 46.75 cycles, which works out to about 2.2 instructions per cycle:
http://www.beyond3d.com/forum/viewtopic.php?p=327176#327176
Bearing in mind that NV40 is capable of executing 4 shader instructions per cycle (peak), that's 55% efficiency. Averaged over a long shader like this, it seems like a fair representation of how wasteful a superscalar ALU architecture becomes as transistor budgets go up.
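To make the arithmetic explicit, here's a quick sketch in Python. The function name is just something I made up for illustration; the numbers are the ones quoted above.

```python
def alu_utilization(instructions, cycles, peak_ipc):
    """Return (achieved instructions per cycle, fraction of peak)."""
    ipc = instructions / cycles
    return ipc, ipc / peak_ipc

# 102 instructions in 46.75 cycles, against NV40's peak of 4 per cycle
ipc, utilization = alu_utilization(102, 46.75, 4)
print(f"{ipc:.2f} instructions/cycle, {utilization:.0%} of peak")
# prints "2.18 instructions/cycle, 55% of peak"
```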
Similarly, having ALUs that cannot operate while at least some of the texturing is being performed leads to a further loss of efficiency, though as shaders get longer (and texturing operations amount to a lower percentage of instructions) this particular loss falls off.
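A rough way to see why the texturing loss shrinks with shader length is a toy model like the one below. Everything in it is an assumption for illustration: the function name, the idea that each texture instruction blocks the ALU for a fixed number of cycles, and the instruction counts. It is not a description of how NV40 actually schedules.

```python
def texture_stall_loss(alu_instructions, texture_instructions,
                       stall_cycles_per_texture=1.0, alu_ipc=1.0):
    """Fraction of total cycles in which the ALU sits idle behind texturing.

    Assumes (hypothetically) that every texture instruction blocks the ALU
    for a fixed number of cycles and that nothing overlaps.
    """
    alu_cycles = alu_instructions / alu_ipc
    stalled_cycles = texture_instructions * stall_cycles_per_texture
    return stalled_cycles / (alu_cycles + stalled_cycles)

# Hold the texture count fixed and let the ALU part of the shader grow.
losses = [texture_stall_loss(alu, 8) for alu in (10, 50, 200)]
for alu, loss in zip((10, 50, 200), losses):
    print(f"{alu:3d} ALU instructions + 8 texture: {loss:.0%} of cycles idle")
```

With the texture count held constant, the idle fraction drops as the ALU instruction count rises, which is the fall-off described above.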
In other words, more and more transistors will be sitting idle as IHVs progress through 90nm into 65nm and beyond, and as the number of pipelines increases. Something's got to give, and that appears to be the problem ATI is tackling with Xenos and R600.
It appears that R520 will probably be some kind of superscalar design too (R420 is, but the second ALU has limited, PS1.4, functionality). So R520's only improvement in pipeline efficiency will, presumably, come from making all ALUs in the pixel pipelines equivalently functional.
Another area where ALU efficiency is lost is dynamic branching. Currently, in NV40, branching in pixel shader code is inefficient because around 1000 separate pixels are all lumped together, running the longest execution path through the shader. E.g. if one pixel is lit by 5 lights, all ~1000 pixels in the batch are "lit by 5 lights", though predication prevents the superfluous code having any effect on those pixels lit by fewer than 5 lights.
I think it's fair to say everyone was expecting that this branch commonality would operate at the quad level in NV40, but it's turned out (through experiment) to operate at a much coarser granularity. The loss of efficiency, here, is catastrophic.
It means that developers have avoided implementing shader code that performs dynamic per-pixel branching.
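To put a rough number on the batch-size effect, here's a small sketch in Python. It assumes every pixel in a batch pays for the longest path taken by any pixel in that batch, as described above. The scene is made up: 1% of pixels lit by 5 lights, the rest by 1, scattered at random (real scenes are more spatially coherent, so this is pessimistic). The function name and pixel count are mine too.

```python
import random

def branch_efficiency(lights_per_pixel, batch_size):
    """Useful light iterations divided by iterations actually executed,
    when every pixel in a batch runs as many iterations as the batch's
    most heavily lit pixel."""
    useful = sum(lights_per_pixel)
    executed = 0
    for start in range(0, len(lights_per_pixel), batch_size):
        batch = lights_per_pixel[start:start + batch_size]
        executed += max(batch) * len(batch)
    return useful / executed

rng = random.Random(0)  # fixed seed so the run is repeatable
# Hypothetical scene: 1% of pixels lit by 5 lights, the rest by 1.
pixels = [5 if rng.random() < 0.01 else 1 for _ in range(100_000)]

for batch_size in (1, 4, 1000):
    eff = branch_efficiency(pixels, batch_size)
    print(f"batch of {batch_size:4d} pixels: {eff:.0%} efficient")
```

A batch size of 4 stands in for quad-level branching and 1000 for what NV40 measures at. Under these assumptions the quad case stays far closer to ideal than the 1000-pixel case.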
It'll be interesting to see if G70 and R520 can do quad-level dynamic branching.
Jawed