Uttar said:
Small reminder: MADD+MADD is very unlikely to ever happen on G70 because of register file restrictions.
I disagree. While the four FP32s restriction does make dual-issued FP32 MADDs pretty unlikely (two operands need to be shared by both MADDs), partial precision MADDs seem fairly viable (there are plenty of them in 3DMk06) and won't bust the register bandwidth limit even when all operands are different.
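To make the register-bandwidth argument concrete, here's a rough sketch of the counting involved. It is only a toy model and rests on my own assumptions: the budget is four FP32 reads per clock, an FP16 (_PP) operand costs half an FP32 read, and a register shared by both MADDs is read only once. The names `FP32_BUDGET`, `COST` and `can_dual_issue` are made up for illustration.

```python
# Toy model (assumed, not vendor-documented): a read budget of 4 FP32
# values per clock, with an FP16 (_PP) operand costing half an FP32 read.
FP32_BUDGET = 4.0
COST = {"fp32": 1.0, "fp16": 0.5}

def can_dual_issue(madd_a, madd_b, precision):
    """Each MADD is a tuple of three register names (a * b + c).

    A register used by both MADDs is assumed to be read once,
    so only distinct names count against the budget.
    """
    distinct = set(madd_a) | set(madd_b)
    return len(distinct) * COST[precision] <= FP32_BUDGET

# Six different FP32 operands: 6 reads, over the budget.
print(can_dual_issue(("r0", "r1", "r2"), ("r3", "r4", "r5"), "fp32"))
# Two operands shared: 4 distinct FP32 reads, just fits.
print(can_dual_issue(("r0", "r1", "r2"), ("r0", "r1", "r3"), "fp32"))
# Six different FP16 operands: 3 FP32-equivalents, fits.
print(can_dual_issue(("r0", "r1", "r2"), ("r3", "r4", "r5"), "fp16"))
```

Under those assumptions the FP32 pair only fits when two operands are shared, while the _PP pair fits even with six distinct operands.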
The original NV40 was (MUL or TEX) + MAD. The primary advantage of G70's MAD+MAD is being able to do single-cycle LERPs like ATI, although only when there's no texturing (SUB&MAD), and having a lot more flexibility when it comes to instruction reordering and Vec2+Vec2/Vec3+Vec1 optimizations (part of the advantages in the G70 pipeline for that last point are, however, afaik unrelated).
2+2 and 3+1 are features of NV40, too, so it's a fairly subtle advantage in this respect for G7x.
Personally I would tend to believe that in 3:1 ALU:TEX ratio games, it is a reasonable estimate to say that one of NVIDIA's 24 PS pipelines is equivalent to one of ATI's 48 PS pipelines. This is because NVIDIA's pipelines can do VERY slightly more per clock, and you can roughly imagine the texturing operation every 3 clocks wasting that back.
I tend to agree, but NVidia's architectures seem to be more sensitive to overall register count - as long as the register count is no more than about 4 or 5 FP32s then they're OK. So they become very much dependent on being able to use _PP to maintain performance. Which seems viable as shorter shaders prolly won't reveal FP16-precision errors.
Now, on the other hand, if you decrease the ALU:TEX ratio, NVIDIA's texturing abilities increase while their arithmetic ones decrease, which gives them an obvious advantage. So below that 1:3, you'd conceptualize each of NVIDIA's pipelines to do more and more than ATI's "pipelines", up until the theoretical point of 1:0 and below where it'd become a (24/16) performance ratio between NVIDIA and ATI (DX7-era games, and some DX8-era ones).
Yep, I think this is where the heavy advantage for NV40 and G70 fragment pipelines comes from, with so few games having much arithmetic intensity.
Now, what's more interesting is what happens when the ALU:TEX ratio goes beyond 3:1. Interestingly enough, NVIDIA's ALU1 gets asked to do texture addressing less and less, so their arithmetic power per pipeline begins to surpass ATI's by a wider margin.
Generally I agree - the NVidia pipeline appears "more flexible", able to gracefully trade texturing and ALU proportions. But I think the true cost transpires in heavy register (and/or FP32 precision) usage.
The only other thing that's worth noting is that the 3:1 ALU:TEX thing has become a little muddled, as far as I can tell. ATI was recommending 3:1 for R420. To me this means that R580 needs about 9:1 to flourish. The 3:1 ratio in R420 seems to be a function of the latency-hiding capability of the fragment pipeline (i.e. thread size), with the partially decoupled texturing providing a fair degree of texture "pre-fetching", though limited by R420's "stalling" upon dependent texturing. With fairly intensive texturing in most games, I think it's fair to say R420 prolly never saw much in the way of 3:1 until, ahem, after R520 had released, and so analysis of this point in respect of R420 hasn't happened...
Well, that's my interpretation, anyway.
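For what it's worth, the 9:1 figure can be read as the R420 recommendation scaled by the change in ALUs per texture unit. A minimal sketch of that arithmetic follows. It assumes the recommended ratio scales linearly with ALUs per TMU, and it uses 16 ALUs : 16 TMUs for R420 and 48 ALUs : 16 TMUs for R580. The function `scaled_alu_tex_ratio` is just an illustrative name.

```python
from fractions import Fraction

def scaled_alu_tex_ratio(base_ratio, base_alus, base_tmus, new_alus, new_tmus):
    """Scale a recommended ALU:TEX instruction ratio by the change in
    ALUs per TMU (linear scaling is an assumption of this sketch)."""
    base_per_tmu = Fraction(base_alus, base_tmus)
    new_per_tmu = Fraction(new_alus, new_tmus)
    return base_ratio * new_per_tmu / base_per_tmu

# Assumed unit counts: R420 with 16 ALUs and 16 TMUs, R580 with 48 ALUs and 16 TMUs.
r580_ratio = scaled_alu_tex_ratio(3, 16, 16, 48, 16)
print(f"R580 would want roughly {r580_ratio}:1 ALU:TEX")
```

Three times the ALUs per TMU times the original 3:1 gives 9:1, on this reading at least.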
Obviously, they won't reach the equivalent of ATI's 48 pipelines, but perhaps 28-30 quite easily. Which obviously is why NVIDIA doesn't get beaten by 2-2.5x in purely arithmetic tests. Obviously, 3:1 is NVIDIA's weakness, but it gets less dramatic not only below that ratio, but also above it.
Truly intense arithmetic tests seem to be all over the shop:
http://www.digit-life.com/articles2/video/3dmark06/3dmark06_11.html
which shows a 35% advantage per fragment pipe for G7x.
The two PS3 tests (Steep Parallax Mapping and Fur) show the opposite, though:
http://www.digit-life.com/articles2/video/r580-part2.html
27% and 18% advantage per pipe in favour of R580 - but they prolly make use of dynamic branching as a performance tweak.
The PS2 tests on that page, Parallax Mapping and Frozen Glass, show a heavy dependency on _PP for G70. In FP32, though, the former shows a 35% advantage for G70 while the latter shows a 79% advantage.
(7800GTX-512 assumed to be 550MHz and R580 assumed to be 650MHz.)
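In case the per-pipe figures look opaque, here is a sketch of the kind of normalisation involved: divide each card's frame rate by pipes times clock, then compare. The pipe counts (24 and 48) and clocks are the ones assumed in this thread. The frame rates below are placeholders, not measured results, and the function names are made up for illustration.

```python
def per_pipe_per_clock(fps, pipes, mhz):
    """Frame rate normalised by fragment pipe count and core clock."""
    return fps / (pipes * mhz)

def per_pipe_advantage(fps_a, pipes_a, mhz_a, fps_b, pipes_b, mhz_b):
    """Fractional per-pipe, per-clock advantage of card A over card B."""
    a = per_pipe_per_clock(fps_a, pipes_a, mhz_a)
    b = per_pipe_per_clock(fps_b, pipes_b, mhz_b)
    return a / b - 1.0

# Placeholder frame rates (NOT benchmark data): both cards at 100 fps.
g70_fps, r580_fps = 100.0, 100.0
adv = per_pipe_advantage(g70_fps, 24, 550, r580_fps, 48, 650)
print(f"G70 per-pipe, per-clock advantage: {adv:+.0%}")
```

The point of the placeholder case is that equal frame rates would already imply a large per-pipe advantage for the 24-pipe part, since it has half the pipes and a lower clock.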
Jawed