What I'm not seeing is any possibility for a huge leap in processing power per pipeline.
Why am I not surprised?
The R420 is a derivative of the R300 architecture, so they've had years to fine-tune the hardware design. Here are a few examples of what they could do.
1) Reduce the number of clock cycles it takes to complete a number of key DX9 instructions. We know that a number of them took multiple clock cycles to execute; it shouldn't take much to fine-tune those.
2) Add a second ALU to each pipe. Don't assume the transistor budget rules this out: ATi and nvidia count transistors differently. nvidia include ALL transistors, including cache etc., which ATi don't. So basing comparisons on nvidia cards is a bad idea, and even basing them on previous ATi chips leads to erroneous assumptions.
3) Double-clock each pipe's ALU.
Even if none of the above are utilised, which I very highly doubt, there are still lots of options open to ATi for boosting the performance of the R420.
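To show why any one of options 1) to 3) could matter, here's a rough back-of-the-envelope sketch. Everything in it is an assumption for illustration only: the function name, the baseline of 2 cycles per instruction, and the idea that the ALUs are kept perfectly busy are all made up, and none of the numbers are real R300 or R420 figures.

```python
def relative_throughput(alus_per_pipe=1, alu_clock_multiplier=1.0,
                        cycles_per_instruction=2.0):
    """Instructions completed per core clock, per pipe.

    Hypothetical model: assumes perfect ALU utilisation and ignores
    memory bandwidth, texture fetches and scheduling limits.
    """
    return alus_per_pipe * alu_clock_multiplier / cycles_per_instruction


# Made-up baseline: one ALU, core clock, 2 cycles per instruction.
baseline = relative_throughput()

options = {
    "1) halve cycles per instruction": relative_throughput(cycles_per_instruction=1.0),
    "2) second ALU per pipe": relative_throughput(alus_per_pipe=2),
    "3) double-clocked ALU": relative_throughput(alu_clock_multiplier=2.0),
}

for name, value in options.items():
    print(f"{name}: {value / baseline:.1f}x baseline")
```

In this idealised model each option on its own doubles per-pipe throughput for the affected instructions. Real hardware wouldn't scale that cleanly, but it shows a big per-pipeline jump isn't off the table.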
You may not see any possibility, but I suspect the main reasons for that are:
a) Lack of real information on the R420.
b) Transistor counting disparities between ATi and nvidia, i.e. it's NOT 222M (for the NV40) vs 160M to 180M (for the R420).
c) Your own personal bias.
I honestly don't see a, b or c as very good grounds for making the sort of claims that you currently are.
Now I'm not saying you're wrong and I'm right; it's far too early to make bold claims about what ATi can and cannot do with the R420.
particularly when stencil volume shadows are used.
Unless ATi add their own Stencil acceleration features...