Jawed
Legend
My argument with this "idealised" stance is that the actual sequence of operations undertaken by the processor results in varying performance for the same ALU:TEX. This is something I have mentioned several times in architecture threads, and I was going to make my own thread about it last year.
It is a common misconception that registers per SM is the metric that matters for hiding latency. What you want to look at is registers per texture unit, because that's the latency you want to hide. If you double the ALUs but keep the TUs the same (or in this case reduce them), then you do not need to double the total register count to have the same latency hiding ability. I wrote a program to simulate the way SIMD engines process wavefronts and it confirms my conviction on the matter.
Latency hiding = # threads / tex throughput
(More specifically, the last term is average texture clause throughput. I know NVidia doesn't use clauses, but you can still group texture accesses together by dependency to create quasi-clauses and get a slightly underestimated value of latency hiding.)
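To make the quoted relation concrete, here is a minimal Python sketch of it. The function name and the example numbers are hypothetical, chosen only to show that the result depends on threads in flight and texture throughput, and not directly on the ALU count. It is not the simulator mentioned in the quote.

```python
def latency_hiding(threads_in_flight: float, tex_throughput: float) -> float:
    """Quoted relation: latency hiding = # threads / tex throughput.

    tex_throughput is assumed to be in threads serviced per clock, so the
    result is in clocks. Both units are assumptions for illustration.
    """
    if tex_throughput <= 0:
        raise ValueError("tex_throughput must be positive")
    return threads_in_flight / tex_throughput


# Hypothetical figures, not measurements from any real GPU.
baseline = latency_hiding(threads_in_flight=512, tex_throughput=4)

# Doubling the ALUs changes neither input, so the figure is unchanged.
doubled_alus = latency_hiding(threads_in_flight=512, tex_throughput=4)

# Halving texture throughput doubles the latency the same threads can cover.
fewer_tus = latency_hiding(threads_in_flight=512, tex_throughput=2)

print(baseline, doubled_alus, fewer_tus)
```

Note that this captures only the idealised relation; it says nothing about the ordering of clauses, which is the point argued below.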
http://developer.amd.com/gpu_assets/PLDI08Tutorial.pdf
You can clearly see in versions 3, 4 and 5 that, despite identical ALU:TEX, performance varies substantially.
5 is faster than 3 despite the fact that 5 has fewer threads in flight per SIMD than 3. The estimated thread count for version 3 is 256/28 = 9, while for version 5 it is 256/38 = 6. (Both estimates are subject to clause-temporary overhead. Also, I suspect that 256 is not the correct baseline; something like 240 might be better, not sure...)
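The thread estimates above are just an integer division, sketched here in Python. The helper name is hypothetical, 28 and 38 are taken to be the per-thread register counts of versions 3 and 5, and clause-temporary overhead is ignored.

```python
def estimated_hardware_threads(gprs_per_thread: int, gpr_budget: int = 256) -> int:
    """Hardware threads that fit, assuming gpr_budget registers are
    available to be divided among threads (256 is the baseline used above)."""
    if gprs_per_thread <= 0:
        raise ValueError("gprs_per_thread must be positive")
    return gpr_budget // gprs_per_thread


version_3 = estimated_hardware_threads(28)  # 256 // 28
version_5 = estimated_hardware_threads(38)  # 256 // 38

# Same estimate with the suspected alternative baseline of 240.
version_3_alt = estimated_hardware_threads(28, gpr_budget=240)
version_5_alt = estimated_hardware_threads(38, gpr_budget=240)

print(version_3, version_5, version_3_alt, version_5_alt)
```

With a 240 baseline the estimate for version 3 drops by one thread while version 5 is unchanged, so the comparison between the two holds either way.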
Evergreen GPUs support 16-long TEX clauses as opposed to the 8-long clauses seen in R600-RV790. There are two reasons to do this:
- all clause switches increase the latency experienced by an individual hardware thread as switching has latency, so packing TEX instructions into a lower count of clauses reduces the total latency experienced
- TEX clauses are sensitive to cache behaviour, so a doubling in TEX clause length can increase coherency
Going back in time, your argument is that if AMD doubles ALU:TEX, e.g. to 8:1 in the next GPU, while leaving the overall ALU/TEX architecture alone, each ALU would only need half the register file. The 256KB of aggregate register file per SIMD we see in Evergreen would be enough for the next GPU. Well, clearly this is fallacious, as version 5 above would be reduced to a mere 3 hardware threads, killing throughput (3 hardware threads means that the ALU and TEX clauses cannot both be 100% occupied by hardware threads, since both require pairs of hardware threads for full utilisation).
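A rough Python sketch of that objection, under stated assumptions: halving the per-ALU register file is modelled as halving the 256 baseline to 128, and "pairs of hardware threads" is read as 2 threads for ALU clauses plus 2 for TEX clauses. All names are hypothetical.

```python
THREADS_PER_CLAUSE_TYPE = 2  # assumed: a pair each for ALU and TEX clauses


def threads_with_budget(gprs_per_thread: int, gpr_budget: int) -> int:
    """Hardware threads that fit in the given register budget."""
    return gpr_budget // gprs_per_thread


def both_clause_types_fully_occupied(hardware_threads: int) -> bool:
    """True if there are enough threads to keep ALU and TEX clauses
    busy at the same time, under the pair-per-clause-type assumption."""
    return hardware_threads >= 2 * THREADS_PER_CLAUSE_TYPE


today = threads_with_budget(38, gpr_budget=256)   # version 5 as it stands
halved = threads_with_budget(38, gpr_budget=128)  # same shader, half the budget

print(today, both_clause_types_fully_occupied(today))
print(halved, both_clause_types_fully_occupied(halved))
```

Under these assumptions version 5 goes from 6 threads to 3, one short of the 4 needed to keep both clause types occupied.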
Separately, I've long maintained that careful management of register spill can be used to amplify the effective size of the register file (hardly news: CPUs are continuously doing this). NVidia's older architectures and ATI's current one take a rather naive and useless all-or-nothing approach to register spill, i.e. there's zero optimisation for the register-spill case, so once it's induced the wheels fall off. These GPUs are set up for fencing, and don't like it when you come armed with a baseball bat.
AMD will have to catch up and implement register spill properly. One of a long list of catch-ups, in comparison with Fermi.
Jawed