You asked the right question though, as it's a useful comparison.
But how did HD2900GT work? It had 12 TMUs.
R600 architecture matches the count of quad-ALUs per SIMD to the count of quad-TMUs, so in R6xx the TMU count says nothing about the number of ALU SIMDs. The 240 ALU lanes are 4 SIMDs, each containing 3 quad VLIW-5 units (60 ALU lanes per SIMD). So there are 3 quad-TMUs (12 TMUs), and each quad-TMU feeds its results solely to the same-numbered quad-ALU inside all 4 of the SIMDs.
Code:
TMU1   TMU2   TMU3
 |      |      |
ALU1   ALU2   ALU3  - SIMD 1
 |      |      |
ALU1   ALU2   ALU3  - SIMD 2
 |      |      |
ALU1   ALU2   ALU3  - SIMD 3
 |      |      |
ALU1   ALU2   ALU3  - SIMD 4
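To make the arithmetic behind that diagram explicit, here is a rough Python sketch of the routing as I've described it. The function and constant names are made up for illustration; only the counts (4 SIMDs, 3 quads per SIMD, VLIW-5) come from the description above.

```python
# Sketch of the R600-style quad-TMU to quad-ALU routing described above.
# All names here are hypothetical; only the counts come from the post.
NUM_SIMDS = 4        # SIMDs in HD2900GT
QUADS_PER_SIMD = 3   # quad-ALUs per SIMD == number of quad-TMUs
PIXELS_PER_QUAD = 4  # a quad is 2x2 pixels
VLIW_WIDTH = 5       # VLIW-5: five ALU lanes per pixel


def alu_lanes():
    """Total ALU lanes across all SIMDs."""
    return NUM_SIMDS * QUADS_PER_SIMD * PIXELS_PER_QUAD * VLIW_WIDTH


def tmu_count():
    """Total TMUs: one quad-TMU per quad-ALU column."""
    return QUADS_PER_SIMD * PIXELS_PER_QUAD


def consumers_of(quad_tmu):
    """(simd, quad_alu) pairs fed by a quad-TMU, numbered from 1."""
    if not 1 <= quad_tmu <= QUADS_PER_SIMD:
        raise ValueError("no such quad-TMU")
    # A quad-TMU feeds only the same-numbered quad-ALU, in every SIMD.
    return [(simd, quad_tmu) for simd in range(1, NUM_SIMDS + 1)]


print(alu_lanes())      # 240
print(tmu_count())      # 12
print(consumers_of(2))  # [(1, 2), (2, 2), (3, 2), (4, 2)]
```

Note that adding SIMDs in this model changes alu_lanes() but not tmu_count(), which is the point about TMU count telling you nothing about SIMD count.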
In that sense this new design is like R600. The question then is: what kind of bus architecture and inter-cache sharing is used, and how far do texturing results travel?
Because R6xx never exceeded 4 SIMDs, we don't know what TMU sharing arrangement would have been implemented with more SIMDs and more TMUs. For example, it appears likely these new GPUs have TMUs localised to an RPE, whereas R600 shares them across all SIMDs.
R5xx and R6xx don't match ROPs to MCs, whereas R7xx does. R7xx therefore keeps a vast amount of bandwidth local to an MC, so it's easier to get away without a ring bus. That said, the internal bandwidth from L2s->L1s is ~4x higher than off-die bandwidth. Being unidirectional helps, I guess (obviously commands go the other way, but they don't need much bandwidth).
There's also shader export data that needs to go to the MCs (GS export or PS export) but that's relatively low bandwidth, e.g. 32 4-byte results per clock.
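As a quick sanity check on "relatively low bandwidth", here's the arithmetic in Python. The 32 results and 4 bytes are the figures quoted above; the clock parameter is a free input of my own, not a claim about any particular chip.

```python
# Back-of-envelope export bandwidth, using the example figures above.
# Names are hypothetical; the clock is whatever you want to plug in.
RESULTS_PER_CLOCK = 32
BYTES_PER_RESULT = 4


def export_bytes_per_clock():
    """Bytes of shader export data heading to the MCs each clock."""
    return RESULTS_PER_CLOCK * BYTES_PER_RESULT


def export_gb_per_s(clock_mhz):
    """Export bandwidth in GB/s (10^9 bytes) for an assumed core clock."""
    return export_bytes_per_clock() * clock_mhz * 1e6 / 1e9


print(export_bytes_per_clock())  # 128
```

So that's 128 bytes per clock, whatever the clock turns out to be.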
Barts with 64 TMUs might only have 8 L1s, which is an improvement on the 10 of Juniper. Each L1 in this setup would be dual-ported, feeding two quad-TMUs, which obviously adds local complexity.
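The L1 counts fall out of simple division. A sketch, assuming one L1 port per quad-TMU; the helper name is invented, and the 40 TMUs for Juniper is my own figure rather than something stated above.

```python
# Sketch: how many L1s you need if each L1 feeds a fixed number of quad-TMUs.
# Hypothetical helper; assumes TMUs always come in groups of four.
def l1_count(tmus, quad_tmus_per_l1):
    quad_tmus, leftover = divmod(tmus, 4)
    if leftover:
        raise ValueError("TMU count is not a whole number of quads")
    l1s, unserved = divmod(quad_tmus, quad_tmus_per_l1)
    if unserved:
        raise ValueError("quad-TMUs do not divide evenly among L1s")
    return l1s


print(l1_count(64, 2))  # 8  (the speculated Barts: dual-ported L1s)
print(l1_count(40, 1))  # 10 (Juniper: one L1 per quad-TMU)
```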
The other side of the bandwidth question is making read/write UAV access efficient. Global atomics are very efficient in Evergreen, but more generic operations are un-cached. Really they should be cached (this would also help register spill). The patents I've been referring to seem to hint that the caches will hold data written to them by SIMDs, not merely texel data.
In the extreme it's possible they've implemented a distributed coherent read/write L1, which would make read-write UAV work really fast. L2, in this scenario, wouldn't do coherency, it merely increases efficiency of MC and ROP operations. I suppose...