Given a unified TU/RBE, random RT access implies reading/writing data held by non-local RBEs (or a local fetch plus some crazy "tile" cache coherency, which I'm guessing is highly unlikely).
Scatter isn't coherent across strands in any meaningful sense. Not only that, but scatter implies arbitrary write order to colliding addresses. D3D11 basically says you're on your own if you use colliding addresses from distinct strands. So there's no coherency to maintain, as far as I can tell.
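Just to illustrate what colliding scatter means in practice, here's a throwaway CUDA sketch (kernel and buffer names made up):

```cuda
// Hypothetical sketch: strands scattering through an index buffer.
// If idx[] maps several strands to the same slot, the final value of that
// slot is whichever store retires last - nothing guarantees which, and
// there's nothing for the hardware to keep coherent.
__global__ void collidingScatter(int *out, const int *idx, const int *val, int n)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < n)
        out[idx[tid]] = val[tid];
}
```

D3D11's UAV scatter gives you the same non-guarantee: colliding addresses from distinct strands are simply your problem.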
During normal pixel shading, render back end operations (colour write, blend) happen in cluster memory associated with the TU/RBE. Fetching data from memory is no different from fetching a tile of texels, conceptually. Say the fetch is actioned through L2, which lives close to the MCs.
Atomics are obviously different, because the memory system is explicitly told to serialise accesses to each given address (but not to serialise the entire set of all accesses by all strands). This functionality would stay outside of the clusters.
I presume that if a kernel does a non-atomic read of an address while that same address is a candidate for atomic updates by that kernel, then it's up to the programmer to fence these properly, or suffer the consequences of indeterminate ordering.
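Roughly, in CUDA terms (a sketch only; the D3D11 compute equivalent would use its own intrinsics):

```cuda
// Hypothetical sketch: counter[0] receives atomic updates from every strand,
// but the plain read at the end isn't ordered against other strands' atomics.
__global__ void atomicVsPlainRead(int *counter, int *snapshot, int n)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= n) return;

    atomicAdd(&counter[0], 1);   // the memory system serialises these per address

    __threadfence();             // orders this strand's own accesses; it does NOT
                                 // wait for other strands' atomics to complete

    snapshot[tid] = counter[0];  // non-atomic read: any value from 1 to n
}
```

Without stronger synchronisation than that, each strand's snapshot is indeterminate, which is exactly the kind of ordering the programmer has to take responsibility for.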
How do you see a unified TU/RBE working?
The key aspect to me is that a pixel shader is allowed to fetch data from its position in all render targets currently bound (8 distinct buffers + Z/stencil). This is logically the same as fetching from a set of non-compressed textures. The pixel shader is then able to update all of those buffers, again solely for its position.
These operations are a bit like reading/writing shared memory. I think it was Trinibwoy who suggested a while back that NVidia could do ROP processing in the multiprocessors using shared memory as a buffer.
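Something along those lines, as a made-up CUDA sketch (launched with 16x16 blocks, one block owning one screen tile; the fragment format and names are purely illustrative):

```cuda
// Hypothetical sketch of "ROP in the multiprocessor": one block owns a
// 16x16 screen tile, stages the colour buffer in shared memory, blends a
// batch of fragments into it, then writes the whole tile back.
#define TILE 16

struct Fragment { int x, y; float4 colour; };   // x, y are tile-local; colour.w = alpha

__global__ void blendTile(float4 *colourBuf, int pitch,
                          int tileX, int tileY,
                          const Fragment *frags, int numFrags)
{
    __shared__ float4 tile[TILE][TILE];

    int lx = threadIdx.x, ly = threadIdx.y;
    int gx = tileX * TILE + lx, gy = tileY * TILE + ly;

    // Fetch the tile - conceptually no different from fetching texels.
    tile[ly][lx] = colourBuf[gy * pitch + gx];
    __syncthreads();

    // Blend the fragment batch serially so ordering within the tile is well defined.
    if (lx == 0 && ly == 0) {
        for (int i = 0; i < numFrags; ++i) {
            Fragment f = frags[i];
            float4 d = tile[f.y][f.x];
            float a = f.colour.w;   // classic src-alpha over blend
            tile[f.y][f.x] = make_float4(f.colour.x * a + d.x * (1.f - a),
                                         f.colour.y * a + d.y * (1.f - a),
                                         f.colour.z * a + d.z * (1.f - a),
                                         a + d.w * (1.f - a));
        }
    }
    __syncthreads();

    // Write the tile back through the memory system.
    colourBuf[gy * pitch + gx] = tile[ly][lx];
}
```

Blending serially from a single strand is obviously not how you'd really do it, but it keeps the within-tile ordering well defined, which is the part the RBE normally guarantees.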
Currently, in ATI's architecture, shared memory is idle while pixel shading. It's only usable when running a compute kernel. So, LDS might be a candidate for this kind of usage.
I suspect the more stringent texture-filtering requirements of D3D11 might make ATI return to single-cycle fp16 filtering, which would then provide the precision required to perform 8-bit pixel blending at full speed.
Z-test, hierarchical-Z and colour/Z (de-)compression for render targets sound like things that should stay close to the MCs. Clearly the tiled nature of render buffers makes it possible to separate the reading-from-memory/decompression of a buffer from the RBE operations, and those from the compression/writing-back-to-memory. These operations are all atomic at the tile level, and nominally only one cluster is performing atomic updates on any given tile. The question then becomes one of the added latencies that arise in moving render buffer data into the clusters and back out. I'm not convinced the latencies matter, per se.
The simple case of append, with any kind of structured data, has no strict ordering defined as far as I can tell, i.e. each cluster can generate a local tile of data to be appended; when the tile is full it can be posted to the memory system to be slotted (and compacted?) into its destination in memory.
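In CUDA-ish terms, roughly (sketch only; assumes a 256-entry tile and a block of at most 256 strands, names made up):

```cuda
// Hypothetical sketch of tile-local append: each block collects its elements
// in shared memory, then reserves a contiguous slot range in the global
// buffer with one atomicAdd and copies the tile out.
__global__ void localAppend(const int *input, int n,
                            int *outBuf, unsigned int *outCount)
{
    __shared__ int tile[256];              // assumes blockDim.x <= 256
    __shared__ unsigned int tileCount;
    __shared__ unsigned int base;

    if (threadIdx.x == 0) tileCount = 0;
    __syncthreads();

    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < n && (input[tid] & 1)) {     // example predicate: keep odd values
        unsigned int slot = atomicAdd(&tileCount, 1u);
        tile[slot] = input[tid];
    }
    __syncthreads();

    // Post the full (or final, partial) tile: one atomic reserves the range.
    if (threadIdx.x == 0)
        base = atomicAdd(outCount, tileCount);
    __syncthreads();

    for (unsigned int i = threadIdx.x; i < tileCount; i += blockDim.x)
        outBuf[base + i] = tile[i];
}
```

The order in which different clusters' tiles land in the destination is just whatever the atomic hands out, which is fine precisely because append doesn't define a strict ordering.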
I can't see why both ATI and NVidia GPUs couldn't work this way, to be honest.
Jawed