How balanced you are depends on the size of your units... Consider that also in the context that the ALU-TEX (and TEX-ROP, etc.) ratio varies from game to game, from frame to frame, and within a single frame.
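To make the balance argument concrete, here's a back-of-envelope sketch (all numbers invented, not measured): given a shader's ALU:TEX instruction mix and the hardware's per-clock throughput for each unit, whichever unit has the higher per-pixel cost is the bottleneck — and that flips from shader to shader within one frame.

```python
# Back-of-envelope sketch, illustrative numbers only: which unit limits
# a given shader on a given ALU:TEX hardware ratio.

def limiting_unit(alu_instrs, tex_instrs, alu_rate, tex_rate):
    """Return the unit whose per-pixel cost (instructions / throughput)
    is highest, i.e. the bottleneck for this shader."""
    alu_cost = alu_instrs / alu_rate
    tex_cost = tex_instrs / tex_rate
    return "ALU" if alu_cost >= tex_cost else "TEX"

# Hardware with a 3:1 ALU:TEX throughput ratio (48 ALU pipes vs 16
# texture units -- R580-ish, purely for illustration):
print(limiting_unit(alu_instrs=12, tex_instrs=2, alu_rate=48, tex_rate=16))  # ALU
print(limiting_unit(alu_instrs=4,  tex_instrs=4, alu_rate=48, tex_rate=16))  # TEX
```

Same hardware, two shaders, two different bottlenecks — which is why a single fixed ratio can't be "right" for everything.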
And, sadly, we're still in a world where all credence is given to average frame rates. It makes my blood boil. The distortions brought about by silly-high max-FPS have no place in architectural analysis.
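A made-up frame-time trace shows the distortion: averaging per-frame instantaneous FPS lets a handful of silly-high-FPS frames drag the headline number way up, even though frames-per-wall-clock-second (what you actually experience) barely moves.

```python
# Invented trace: 90 frames at 25 ms (40 FPS) plus 10 nearly-empty
# frames at 2 ms (500 FPS) -- a brief silly-high max-FPS burst.
frame_times_ms = [25.0] * 90 + [2.0] * 10

# Naive "average FPS": the mean of per-frame instantaneous FPS.
# The ten 500 FPS frames dominate the average.
naive_avg_fps = sum(1000.0 / t for t in frame_times_ms) / len(frame_times_ms)

# Total frames divided by total wall-clock time.
overall_fps = 1000.0 * len(frame_times_ms) / sum(frame_times_ms)

print(f"naive average FPS:     {naive_avg_fps:.0f}")  # 86
print(f"frames/second overall: {overall_fps:.0f}")    # 44
```

Ninety percent of the run is a 40 FPS experience, yet the naive average reports more than double that.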
[...] it would be unfair to say that R580's number of ALUs was "wasteful" when, clearly, they were quite damn cheap in terms of transistor count.
Oh, I agree completely. Even when you take into account the associated register file to support gazillions of fragments in flight, R580's ALUs were a low-cost upgrade. I'm still sure ATI could have produced the same performance in the same games with 32 ALU pipes, not 48, but the difference in die size would have barely been worth discussing.
I do agree that NVIDIA's ALU ratio is too low though, and their triangle setup performance is also too low given everything else. In a Z-only or shadowmap pass, G80 must be so horribly triangle setup-limited that it's not even funny. Similarly, R600 and RV630 must be horribly Z-limited in the same scenarios.
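A rough sketch of why a Z-only pass ends up setup-limited (the per-clock figures here are assumptions for illustration, not quoted specs): if setup produces one triangle per clock while the back end can consume Z samples at a much higher rate, any triangle covering fewer pixels than that rate leaves the Z hardware partly idle.

```python
# Illustrative sketch, assumed figures: setup at 1 triangle/clock,
# Z-only throughput at 192 samples/clock (a G80-ish ballpark, treat
# as an assumption). In a Z-only or shadowmap pass, triangles smaller
# than 192 pixels starve the Z units.

SETUP_TRIS_PER_CLOCK = 1    # assumed
Z_SAMPLES_PER_CLOCK = 192   # assumed

def z_unit_utilisation(avg_tri_pixels):
    """Fraction of peak Z throughput achieved when setup-limited."""
    pixels_per_clock = SETUP_TRIS_PER_CLOCK * avg_tri_pixels
    return min(1.0, pixels_per_clock / Z_SAMPLES_PER_CLOCK)

for size in (16, 64, 192, 1024):
    print(f"{size:5d}-pixel triangles -> "
          f"{z_unit_utilisation(size):.0%} of peak Z rate")
```

With densely tessellated geometry (16-pixel triangles), the Z units sit at under 10% of peak — which is why doubling setup rate matters more than it first appears.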
The most damning aspect of R6xx, for me, is the fact that ATI put together all that bandwidth and then left it on the table. Now is the time for 4xAA samples per loop, not in another 2 years' time. NVidia got this right, even if only just. Xenos was a huge misdirection. ARGH.
If NV and AMD got all their ratios and all their units absolutely perfect on their first try, then where would be the fun in discussing all this?!
I can't help thinking that R600 looks like an autumn 2005/spring 2006 GPU that was just a bit too ambitious for the process it was due to come out on... Vista slippage took the pressure off ATI. Eric has said it himself that it was never designed to be 2x the performance of R5xx (even if it occasionally gets there).
I would really like to see NVidia go to a 4-SIMDs-per-cluster design (or double the width of each SIMD). It would also be really good if they dropped the "free trilinear" TMU architecture - G84 seems quite happy without it. I think G80's problem is that the thread scheduling costs so much that they had to cap the number of clusters, so they bumped up the TMUs to compensate - but in doing so they had to move "NVIO" off-die, because TMU scaling is very coarse-grained.
ATI, meanwhile, was determined to implement an architecture that would run until "D3D12" - I'm referring to the virtualisation and threading model ...
Jawed