G71 does math in parallel: it uses other fragments' math to hide the bilinear latency of the just-issued fragment (even if, on G71, the "math" those other fragments are running is actually a texturing instruction occupying the top ALU). Latency hiding is about keeping the ALUs busy, not about keeping the TMUs busy.
You just told me that G80 only needs 192 threads to keep the ALUs running at maximum efficiency! Do I have to point it out to you? 768 >> 192.
The degree of latency hiding a GPU is designed for is set by the maximum number of threads that can be waiting on texture results at once, and that depends on texturing throughput, not ALU throughput. Read below.
Or you have 768 objects that are all waiting for a texture result because the math is dependent on that result. For the sake of texture-cache efficiency (reduced thrashing), it's generally better to perform "round-robin" texturing, rather than letting a subset of objects get one or two texture fetches ahead of others.
You don't need all the threads marching in lockstep to keep the texture accesses happening in the same order. The math in each thread is independent of the others. A few one-time stalls and the texture fetches get spaced out as necessary.
Consider an example with 512 threads (16 warps) per multiprocessor and a 192-clock fetch latency. Imagine the shader has 10 scalar ALU instructions between each TEX, every instruction is dependent on the previous one, and we start off in the state you mentioned, with every pixel needing a fetch.
We start fetching: warp 1a, w2a, ..., w16a, w1b, w2b, ..., w16b (a and b refer to the two multiprocessors). It takes 8 base clocks to texture a warp, so after w8b is issued, w1a's results start coming in and we can start doing math (slowly at first, due to insufficient warps). After w14b is issued, w1a-w6a's results are here, so we have 6 warps to rotate between to keep efficiency at its peak. w1a eventually gets done (and now needs a new texture fetch), followed by w2a, etc., and during this time w7a, w8a, etc. start receiving their results, ready to be fed into the ALUs. The process keeps going, and eventually multiprocessor b can start doing math, and so on.
The point is that everything spaces out rather quickly. At that point, 15 warps (480 threads) each needing 10 instructions gives you 600 ALU clocks (4 per warp instruction), or about 255 base clocks at G80's shader-to-base clock ratio, in which to wait for texture results from the first warp.
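Just to make the bookkeeping easier to follow, here's a quick back-of-the-envelope check of those numbers (a toy calculation, not a simulator; the shader-to-base clock ratio is assumed from the 8800 GTX's 1350/575 MHz clocks):

```python
# Quick arithmetic check of the example: 2 multiprocessors sharing one TMU quad,
# 192-clock fetch latency, 10 dependent scalar ALU ops between TEX instructions.
WARP_SIZE       = 32
TMU_RATE        = 4             # fragments textured per base clock (one quad)
FETCH_LATENCY   = 192           # base clocks from TEX issue to result
ALU_OPS_PER_TEX = 10            # dependent scalar instructions between fetches
SPS_PER_MP      = 8             # scalar ALUs per multiprocessor
SHADER_PER_BASE = 1350 / 575    # assumed 8800 GTX shader:base clock ratio (~2.35)

warp_fetch = WARP_SIZE / TMU_RATE      # 8 base clocks to texture one warp
# w1a's results arrive one latency after its fetch, i.e. after 192 / 8 = 24
# warps (w1a..w8b) have gone through the TMU.
warps_before_return = FETCH_LATENCY / warp_fetch

# Once things space out, one multiprocessor has 15 warps of math to chew on
# while the 16th waits for its next fetch.
alu_clocks  = 15 * ALU_OPS_PER_TEX * (WARP_SIZE / SPS_PER_MP)   # 600 shader clocks
base_clocks = alu_clocks / SHADER_PER_BASE                      # ~255 base clocks

print(warp_fetch, warps_before_return)      # 8.0 24.0
print(alu_clocks, int(base_clocks))         # 600.0 255
```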
Note that if you had any fewer ALU instructions, you'd be texture-throughput limited (remember, multiprocessor b is also using the TMU), so the ALUs would have to idle. This is the critical point you missed.
Now, it's true that you need a little consideration for the fact that 6 warps need to be in the shader for full efficiency, and it takes a few clocks for a warp to get through the ALU pipeline, so you do need a little more latency hiding than just the 192 clocks of texture latency. Still, texture latency is the major factor in determining how many threads you need, so "# of threads ~= tex. latency * TMU throughput" is a good approximation. That's the total # of threads being fed by the TMU, so you have to count both multiprocessors.
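Plugging in the figures from this example (illustrative numbers, not spec-sheet exact), the rule of thumb looks like this:

```python
# threads ~= texture latency * TMU throughput, plus a small allowance for the
# warps that have to be in each multiprocessor's ALU pipeline anyway.
TEX_LATENCY   = 192   # base clocks
TMU_RATE      = 4     # fragments per base clock for the shared quad
WARP_SIZE     = 32
WARPS_FOR_ALU = 6     # warps needed in flight to keep one MP's ALUs fed

threads_for_latency = TEX_LATENCY * TMU_RATE      # 768, across both multiprocessors
threads_for_alu     = WARPS_FOR_ALU * WARP_SIZE   # 192, per multiprocessor

print(threads_for_latency, threads_for_alu)       # 768 192
```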
768 fragments is an onscreen quad of 32*24 pixels. I don't think that number of fragments needing texturing "simultaneously" is some kind of rare coincidence.
With less regular spacing between ALU and texture instructions, statistics takes over. It would be a huge fluke to get a big pile of simultaneous texture requests unless you have a texture-heavy shader.
When ATI expanded the ALU pipe count going from R520 to R580, do you think they kept the register file capacity the same? No, it grew 3x too. The batch count per cluster stayed the same (128 batches) and the size of each batch tripled (16 fragments to 48).
Yeah, and IMO that was a stupid design decision. Why on earth do you need 6144 fragments for 12 Vec3+scalar ALUs and one TMU quad? G71 has around that many fragments to hide the latency of six TMU quads. R580 can wait a whopping 1500+ clocks for a texture result when the TMUs are churning out results at their peak rate, as I mentioned before. The ALUs can wait ~500 clocks before needing to return to the same batch. It's ridiculous.
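For what it's worth, those R580 figures fall straight out of the batch numbers above (128 batches of 48 fragments per cluster, with 12 ALU pipes and one TMU quad per cluster assumed from this discussion):

```python
# Per-cluster R580 figures derived from the batch counts quoted above.
BATCHES    = 128
BATCH_SIZE = 48     # fragments per batch (R520: 16)
ALU_PIPES  = 12     # Vec3+scalar pipes per cluster
TMU_RATE   = 4      # fragments textured per clock (one quad)

fragments_in_flight = BATCHES * BATCH_SIZE            # 6144
tex_wait = fragments_in_flight / TMU_RATE             # 1536 clocks at peak TMU rate
alu_wait = (BATCHES - 1) * (BATCH_SIZE / ALU_PIPES)   # ~508 clocks before a batch
                                                      # gets the ALUs again
print(fragments_in_flight, tex_wait, alu_wait)        # 6144 1536.0 508.0
```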
Anyway, do we have solid proof that R580 really does have triple the register file, say, by running a pure math shader and increasing the register count? If it's true, it would really suggest that R520 is a whole lot of nothing, i.e. >150M transistors of scheduling logic.