128 units * 1350 MHz * 3 ops/clock / 64 bilinears / 575 MHz = ~14.1 scalars/bilinear, or a vector ratio of ~3.5:1.
If you ignore the extra MUL/SFU hardware, the ratio becomes closer to 2.5:1.
Arguably, you could say the following, too, but you may also claim it isn't completely fair:
R520: 16 units * 650 MHz * 3 ops/clock / 16 bilinears / 650 MHz = 3:1
R580: 48 units * 650 MHz * 3 ops/clock / 16 bilinears / 650 MHz = 9:1
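For anyone who wants to replay the arithmetic, here is a minimal sketch of the peak-rate calculation used in the lines above. The function and variable names are made up for illustration; the unit counts, clocks and ops/clock are simply the figures quoted in this post.

```python
def alu_tex_ratio(alu_units, alu_mhz, ops_per_clock, bilinears, tex_mhz):
    """Peak ALU ops issued per bilinear sample (hypothetical helper)."""
    return (alu_units * alu_mhz * ops_per_clock) / (bilinears * tex_mhz)

# G80: 128 scalar units at 1350 MHz, 64 bilinears at 575 MHz.
g80_scalar = alu_tex_ratio(128, 1350, 3, 64, 575)
g80_vec4 = g80_scalar / 4  # expressed as a Vec4-style ratio

# R520 / R580: ALUs and texture units share the same clock, so it cancels out.
r520 = alu_tex_ratio(16, 650, 3, 16, 650)
r580 = alu_tex_ratio(48, 650, 3, 16, 650)

print(f"G80:  {g80_scalar:.1f} scalars/bilinear, ~{g80_vec4:.1f}:1 as vectors")
print(f"R520: {r520:.0f}:1")
print(f"R580: {r580:.0f}:1")
```

Since only the ratio of the two clocks matters, the exact MHz value used for the ATI parts makes no difference to the result.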
Obviously, the biggest problem with these numbers is that these units aren't quite equivalent to four of G80's scalar units. To simplify the comparison, let us first leave the MUL/SFU units out of consideration, as ATI also has a dedicated MUL for perspective correction, and some dedicated hardware for SFU (I'm not sure how it compares to NVIDIA's, so let us assume it is roughly identical).
Secondly, ATI's units are Vec3+Scalar, which is obviously less efficient than four purely scalar units. In the worst possible case, it's half as efficient; on average, it's certainly not quite that bad. Furthermore, the R580 also has extra ADD units. They aren't always usable and/or exposed, but they still are far from dormant.
So, let's be really generous and say that for advanced workloads (which, you could argue, might have more scalar ops), four scalar units in NVIDIA's architecture have a 20% higher average effective throughput than one arithmetic pipeline in R5xx. ATI would say it's lower, NVIDIA would say it's higher, so let's keep that as a reasonable guesstimate, shall we? For G71, I'd say 2:1 is also a fair estimate, but that's extremely subjective; given the various inefficiencies compared to R520, I'd be tempted to say it's slightly lower than that, but then again its theoretical peak is slightly higher too...
Anyway, we have the following (quite subjective and approximate, obviously) numbers:
G80: 1.2*128/4 R5xx-equivalent units * 1350 MHz * 2 ops/clock / 64 bilinears / 575 MHz = ~2.8:1, so roughly 3:1
R520: 16 units * 650 MHz * 2 ops/clock / 16 bilinears / 650 MHz = 2:1
R580: 48 units * 650 MHz * 2 ops/clock / 16 bilinears / 650 MHz = 6:1
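The adjusted comparison can be sketched the same way. Again, the helper name is invented for illustration; the 1.2 factor is the guesstimate from above, and the 1.0 case is only there to show what happens if you grant G80's scalar design no efficiency bonus at all.

```python
def r5xx_equivalent_ratio(units, alu_mhz, bilinears, tex_mhz, ops_per_clock=2):
    """ALU:TEX ratio counted in R5xx-style pipelines (hypothetical helper).

    ops_per_clock defaults to 2 because the MUL/SFU hardware is left out.
    """
    return (units * alu_mhz * ops_per_clock) / (bilinears * tex_mhz)

def g80_ratio(efficiency):
    # Four G80 scalar units are treated as `efficiency` R5xx pipelines.
    equivalent_units = efficiency * 128 / 4
    return r5xx_equivalent_ratio(equivalent_units, 1350, 64, 575)

g80_no_bonus = g80_ratio(1.0)  # no efficiency advantage assumed
g80_guess = g80_ratio(1.2)     # the 20% guesstimate from this post

r520 = r5xx_equivalent_ratio(16, 650, 16, 650)
r580 = r5xx_equivalent_ratio(48, 650, 16, 650)

for name, value in [("G80 (x1.0)", g80_no_bonus), ("G80 (x1.2)", g80_guess),
                    ("R520", r520), ("R580", r580)]:
    print(f"{name}: {value:.2f}:1")
```

Whichever factor you pick in that range, G80 lands between R520 and R580; the factor only moves it closer to one or the other.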
And feel free to say that was an exercise in futility, as it remains highly approximate, but I think it clearly illustrates the point that G80 has an ALU ratio between those of R520 and R580, and how far it is from R580's depends on the latter's efficiency for the given workload. Obviously, in the future, there will be some room to grow G8x's ALU ratio IMO, but it remains to be seen by how much, and in what timeframe NVIDIA thinks this will be necessary.
For G84 and/or G86, an easy way to increase the ratio would be to get rid of the extra bilinear unit per addresser. You would expect the low-end parts to be tested less frequently with heavy anisotropic filtering than the high-end ones, so such a compromise would make sense. Another idea would be to make a MADD out of the current MUL, but it's hard to say how expensive that would be. Or, they could do neither, or reserve that for future parts, heh. Who knows at this point.
Uttar