Perhaps each TU can have short-cycle access to its nearest neighbour's L1, but it seems sensible to assume there are frequent cases where one L1's data set falls in the same locality as another's.
To keep replication from gutting the effectiveness of the caches, maybe each TU can have delayed access to other L1s, or the global data share picks up on shared lines and saves a copy.
In prior GPUs (going back to R300) L1s are per filter pipe - that is, per pixel. So a texel actually appears in multiple L1s. I think it was in uncompressed form - certainly in Xenos the texels are decompressed before being put in L1.
In RV770 we can see a decompression unit within the TU, and indeed R6xx is the same. I did think for a while that texels were stored in uncompressed form within R6xx's texture caches, but a decompression unit sitting inside the TU implies they stay compressed until they're fetched out of L1.
So historically it's been normal to have duplication across the L1s in ATI's GPUs.
But now we're talking about duplication across quads, not within a quad. Again, in the past, I think that was par for the course.
What I'm wary of is that making the TUs interconnect many:many with the L1s is just way more complexity than necessary. With a crossbar between L1 and L2 there's already a many:many. The L1s should be 10s of KB. In R600 the L1s are 32KB for texels and 32KB for vertex data. 16KB for texels in RV630/5 and 0KB for vertex data in RV610/20.
There could be some kind of relationship with the local data share per SIMD, and I wish I had some clarity as to its use.
Synchronization within and between SIMDs would be facilitated by the data shares, but they could also be used to house temp copies of L1 lines, or even contexts for pending clauses from other batches to keep more in flight.
I suppose LDS is for inter-element sharing and perhaps is blind to context.
I suspect LDS is for only one context at a time:
- fetch from register file into LDS (e.g. R1, R4, R12)
- then any number of inter-element reads
- since LDS is not a register file, inter-element writes are not allowed, so there is no "clean-up", it's just released
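To make that concrete, here's a rough CUDA-flavoured analogy (my own sketch, not ATI's ISA - the kernel name, sizes and access pattern are all made up): each element deposits one of its register values into the share exactly once, then any number of inter-element reads follow, and nothing is ever written back, so the share is simply released at the end.

// Hypothetical analogy in CUDA terms, not ATI hardware behaviour.
// Launch with up to 256 threads per block.
__global__ void lds_model(const float* in, float* out, int n)
{
    __shared__ float lds[256];          // stand-in for the per-SIMD LDS

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float r1 = (i < n) ? in[i] : 0.0f;  // the "register file" value, e.g. R1

    lds[threadIdx.x] = r1;              // the one-off fetch from registers into LDS
    __syncthreads();

    // any number of inter-element reads; no inter-element writes, so no clean-up
    float left  = lds[(threadIdx.x + blockDim.x - 1) % blockDim.x];
    float right = lds[(threadIdx.x + 1) % blockDim.x];

    if (i < n)
        out[i] = r1 + 0.5f * (left + right);
}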
In R6xx there's a constant cache that's under programmer control. This cache is to support D3D10 constant buffers. CBs can be huge (4096 float4s - 64KB) and there can be 16 of them bound to a shader at any one time. So one of the Sequencer instructions is to fetch specific cache lines for the ALU instructions to then read from.
Because R6xx issues clauses of ALU instructions (from 1 to 128 instructions) which are "atomic", when the sequencer performs a constant cache fetch the cache is set up for the entire duration of the clause.
So I'm thinking that LDS inter-element sharing would work the same way. Instead of a Sequencer instruction to fetch from global constant cache into the ALU's constant cache lines (I think there are 2 lines), there's an LDS fetch instruction that reads from register file into LDS. Naturally these fetch instructions are invisible to the ALUs, there's no latency as such.
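As a toy software analogy (mine, not how the hardware exposes it), the "set it up before the clause, then the clause reads it for free" pattern looks like this - here the one-off fetch stages part of a constant buffer before the run of ALU work:

// Toy CUDA analogy only - on R6xx the Sequencer's fetch happens in hardware,
// invisibly to the ALUs; this just mimics the shape of it.
__global__ void clause_model(const float* cb,   // stand-in for a bound constant buffer
                             const float* in, float* out, int n)
{
    __shared__ float line[64];                  // stand-in for the pinned cache lines

    // the "Sequencer fetch": performed once, before the ALU clause starts
    for (int j = threadIdx.x; j < 64; j += blockDim.x)
        line[j] = cb[j];
    __syncthreads();

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    // the "ALU clause": every instruction reads the pre-fetched lines,
    // so none of them pays a fetch latency of its own
    float x = in[i];
    x = x * line[0] + line[1];
    x = x * line[2] + line[3];
    out[i] = x;
}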
Presumably GDS is similar, but implies that Sequencers talk to each other. Presumably the bandwidth is low in this case, e.g. 1 register per element, as opposed to, say, 4 registers per element in LDS.
Fun.
If that were the case, vertex work and synchronization operations would be bottlenecked at one access per cycle for the entire chip, assuming the data shares enable synchronization primitives.
The GDS would become a global serialization structure, something that could have been handled with a few "mass halt" signal lines instead.
I'm thinking those caches might be banked or multiported. Maybe not 10-way, but definitely more than single-ported.
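The pay-off from banking, just to spell it out with illustrative numbers: split the share into 16 banks and any cycle where the elements hit 16 distinct banks completes in one pass, while same-bank collisions serialise. A CUDA-ish sketch (the bank count is an assumption for illustration, not a known ATI figure):

// Illustrative only: stride-1 addressing spreads across the banks,
// a stride equal to the bank count piles everything onto one bank.
__global__ void banking_model(float* out)
{
    __shared__ float share[16 * 16];

    int t = threadIdx.x;                      // assume a 16-thread block
    for (int j = t; j < 16 * 16; j += blockDim.x)
        share[j] = (float)j;                  // fill the whole share
    __syncthreads();

    float conflict_free = share[t];           // bank = t % 16 -> 16 distinct banks, one pass
    float serialised    = share[t * 16];      // bank = (t * 16) % 16 = 0 -> all hit bank 0
    out[t] = conflict_free + serialised;
}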
I presume synchronisation of elements across a wodge of contexts (hardware threads) is something that the Sequencers would signal to each other. This would be just another status bit for each context.
NVidia has the concept of blocks, within which elements can share data, each block consisting of multiple contexts. The developer specifies the block size based on the amount of per-element data that needs to be shared through PDC.
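For reference, in CUDA that trade-off is explicit: the PDC (per-block shared memory, 16KB per multiprocessor on G80) is a fixed pot, so the amount each element wants to share caps the block size. A minimal sketch with made-up sizes:

// Sketch of the block-size trade-off: each thread shares ELEMS_PER_THREAD
// floats through shared memory, so the block size is bounded by the budget.
#define ELEMS_PER_THREAD 8
#define BLOCK_SIZE       128    // 128 threads * 8 floats * 4 bytes = 4KB shared

__global__ void block_share(const float* in, float* out, int n)
{
    __shared__ float pdc[BLOCK_SIZE * ELEMS_PER_THREAD];

    int base = (blockIdx.x * BLOCK_SIZE + threadIdx.x) * ELEMS_PER_THREAD;
    for (int j = 0; j < ELEMS_PER_THREAD; ++j)
        pdc[threadIdx.x * ELEMS_PER_THREAD + j] = (base + j < n) ? in[base + j] : 0.0f;
    __syncthreads();

    // every element in the block can now read every other element's shared values
    float sum = 0.0f;
    for (int j = 0; j < BLOCK_SIZE * ELEMS_PER_THREAD; ++j)
        sum += pdc[j];

    if (base < n)
        out[blockIdx.x * BLOCK_SIZE + threadIdx.x] = sum;
}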
In theory, since ATI virtualises the register file, it's possible to share amongst all extant contexts' registers. Looking at the newer RV770 diagrams there's no "memory read/write cache" (which I presume was the mechanism for moving register file data to/from video memory) as there was in R6xx, so presumably that's what GDS is doing. But GDS looks rather isolated. Bit confusing.
Jawed