Maybe I'm missing something but isn't the whole point of the R600 setup where one quad TMU serves the same quad in each SIMD to better balance texturing work across the chip? If each TMU is instead tied to a SIMD then that TMU goes unused if that SIMD isn't running code requiring texturing.
I think the balancing of texturing workload comes from:
- the high level thread allocation policy of the GPU spreads all workloads equally - screen-space tiling of pixel shader workload is the best example of this
- L2 cache associativity - supports the coherency of multiple concurrent vertex/geometry/pixel threads so that texels aren't evicted too early
The sheer count of threads in each shader unit then keeps the TUs busy. Don't forget that texturing is a "look ahead" process in R6xx (just like R5xx) - texture results can be delivered dozens of clock cycles ahead of when they're actually required.
Looking at the way code is assembled on R6xx it seems that up to 8 texture fetches are performed in a single clause. (This comes at considerable register cost...)
In R6xx, if at least one of the threads running in the four SIMDs needs texturing work, the TMUs get utilized.
I think it's reasonable to view R600 as having a single 16-wide TU which is shared across all four ALU SIMDs (that are 16-wide). We know L2 is centralised in R600 so it makes sense that the TUs are organised as a single SIMD processor. Each texturing clause then runs on the TU over 4 clocks, delivering 64 texturing results back to the originating batch.
So assume that RV770, with its 24 ALU quads, has a 32-wide TU, with quads A-H.
This is where I've revised my thinking, working in terms of batch size, not in terms of ALU SIMD width.
In the 12-SIMD RV770 each batch is 32-wide (2 quads * 4 clocks), i.e. 8 quads (the numbers below are quads within the batch):
- TU A - batch 1
- TU B - batch 2
- TU C - batch 3
- ...
- TU H - batch 8
So each of the 12 SIMDs takes it in turn to "control" the TU, for what is effectively 1 TU clock per instruction in the TU clause.
In the 4-SIMD RV770, each batch is 96-wide (6 quads * 4 clocks), i.e. 24 quads:
- TU A - batch 1, 9, 17
- TU B - batch 2, 10, 18
- ...
- TU H - batch 8, 16, 24
And so each of the 4 SIMDs takes it in turn to control the TU, with each batch's texture clause running for 3 TU clocks per instruction.
Note that the mapping from TU to ALUs is not 1:1. The mapping is from a physical quad in the TU to logical quads in the batch. In the latter configuration, batch quads 1, 9 and 17 belong to SIMD quads 1, 3, 5, while batch quads 8, 16 and 24 belong to SIMD quads 2, 4, 6.
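To make that mapping concrete, here's a rough sketch in Python. It assumes batch quads are simply dealt out in order, wrapping around the 8 TU quads and the 6 SIMD quads, which is my reading of the numbers above rather than anything confirmed about the hardware. The function names are made up for illustration.

```python
import string

def map_batch_quad(q, tu_quads=8, simd_quads=6):
    """Return (TU quad letter, TU clock, SIMD quad, ALU clock) for batch quad q.

    q is 1-based. Assumption: quads are assigned in order, wrapping across
    the physical quads of each unit (not a documented hardware policy).
    """
    i = q - 1
    tu_letter = string.ascii_uppercase[i % tu_quads]
    tu_clock = i // tu_quads + 1
    simd_quad = i % simd_quads + 1
    alu_clock = i // simd_quads + 1
    return tu_letter, tu_clock, simd_quad, alu_clock

def quads_on_tu(letter, batch_quads=24, tu_quads=8, simd_quads=6):
    """List the batch quads that land on one physical TU quad."""
    return [q for q in range(1, batch_quads + 1)
            if map_batch_quad(q, tu_quads, simd_quads)[0] == letter]

if __name__ == "__main__":
    # 4-SIMD RV770 guess: 24-quad batch, 8 TU quads, 6 quads per SIMD
    for letter in "AH":
        qs = quads_on_tu(letter)
        simd = [map_batch_quad(q)[2] for q in qs]
        print(f"TU {letter}: batch quads {qs} -> SIMD quads {simd}")
```

With the defaults this gives batch quads 1, 9, 17 on TU A (SIMD quads 1, 3, 5) and 8, 16, 24 on TU H (SIMD quads 2, 4, 6), matching the lists above. Calling it with batch_quads=8 and simd_quads=2 gives the 12-SIMD case, where each TU quad sees exactly one batch quad.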
This latter organisation isn't what I proposed earlier.
I've revised it because I think the key is that there's a single TU, and I've found a way of thinking about a batch that enables "filling" a single TU processor.
I'm averse to the 12-SIMD version simply because of the large amount of control overhead... Also, I wonder if it's compatible with the concept of a single TU. Note that in this configuration each clause instruction only runs for 1 clock in the TU pipeline. Is it reasonable to presume the TU can execute a different instruction on each successive clock, or does it need to run each instruction for several clocks?
This is similar to the way the ALU pipeline runs an instruction for 4 clocks. In R600 the TU runs an instruction for 4 clocks (still guessing). In the 4-SIMD RV770 each instruction would run for 3 clocks.
Hmm...
---
My earlier suggestion for a 4-SIMD RV770 featured four 8-wide TUs. Each TU would be under the control of a single ALU SIMD. Each TU clause would run for 12 clocks per instruction (24 quads in the batch divided by 2 quads in the TU)... Seems pretty unlikely.
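
For comparison, here's the clocks-per-instruction arithmetic for each configuration discussed in this post, as a small Python sketch. The only inputs are the guesses already stated (4 ALU clocks per instruction, batch = quads per SIMD * 4, TU widths as assumed); the function name and the configuration labels are mine.

```python
def tu_clocks_per_instruction(alu_quads, simds, tu_quads, alu_clocks=4):
    """Return (batch size in quads, TU clocks per texture instruction).

    Assumes a batch is (quads per SIMD) * (ALU clocks per instruction)
    and that the TU processes tu_quads quads of the batch per clock.
    """
    quads_per_simd = alu_quads // simds
    batch_quads = quads_per_simd * alu_clocks
    tu_clocks = -(-batch_quads // tu_quads)  # ceiling division
    return batch_quads, tu_clocks

# (total ALU quads, SIMD count, quads in the TU serving one batch)
configs = {
    "R600, 4 SIMDs, one 16-wide TU": (16, 4, 4),
    "RV770, 12 SIMDs, one 32-wide TU": (24, 12, 8),
    "RV770, 4 SIMDs, one 32-wide TU": (24, 4, 8),
    "RV770, 4 SIMDs, 8-wide TU per SIMD": (24, 4, 2),
}

if __name__ == "__main__":
    for name, (alu_quads, simds, tu_quads) in configs.items():
        batch, clocks = tu_clocks_per_instruction(alu_quads, simds, tu_quads)
        print(f"{name}: batch = {batch} quads ({batch * 4} wide), "
              f"{clocks} TU clock(s) per instruction")
```

That works out to 4, 1, 3 and 12 TU clocks per instruction respectively, which are the figures I'm weighing up here.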
---
So, after all that, the 1 clock per TU clause instruction makes me think it's unlikely that RV770 is a 12-SIMD design. But that presumes all this stuff about there being a single SIMD for the TUs is correct. I'm left reckoning the 4-SIMD design is most likely (though 3 clocks per TU clause instruction makes me a bit wary, it would be nicer if it was 4).
Jawed