First, the eight 16kB segments are called "global memory R/W cache with atomics" and the block diagram shows two paths: one from the write-combining buffer straight to the memory controller, and one from the write-combining buffer through the global memory R/W cache to the memory controller.
Yeah, that's what prompted me to do the edit.
Furthermore, it is explicitly stated that while read-only loads go through the texture caches, unordered loads are served by that R/W cache. I know this conflicts with what Micah Villmow said in the Stream Developer Forum. But he was obviously referring to the current state of the OpenCL implementation, and he also mentioned (in another thread) that it uses raw UAVs and not typed ones (as opposed to DirectCompute, which uses typed or structured UAVs), so it wouldn't work either way. I would interpret this to mean that it should be possible for the hardware; how to actually get it done is not really clear to me.
I can't find in the ISA a cached read that doesn't use the vertex or texture caches, other than as a side-effect of atomics (i.e. RTN).
Well, that's not strictly true: MEM_RD allows cached reads, but if writes and reads are in the same kernel, caching is not allowed - see the UNCACHED bit of MEM_RD_WORD_0 on page 2-58. I don't know which cache is implicated in cached reads.
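For reference, the case that UNCACHED bit seems aimed at is simply a kernel that both reads and writes the same UAV/global buffer - something like this OpenCL sketch (kernel and buffer names are just made up for illustration):

[code]
// Hypothetical kernel that reads and writes the same global buffer.
// If I'm reading MEM_RD_WORD_0 right, the loads in a kernel like this
// would have to go out with UNCACHED set.
__kernel void scale_in_place(__global float *data, float factor)
{
    size_t gid = get_global_id(0);
    data[gid] = data[gid] * factor;   // read and write hit the same UAV
}
[/code]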
I suspect the write-combining cache, because I interpret this to be a part of shader export functionality, and shader export functionality has to be able to send data back for more shading (e.g. vertex data is collated and sent back for pixel shading).
There is no read-path shown for the caches, i.e. the atomic RTN path isn't shown. I suspect it's from R/W cache through shader export. If that's true, then the entire path for cached MEM_RD is through colour buffer cache and write-combining cache. Or through colour buffer cache and shader export, bypassing write-combining.
With all these paths it certainly seems logical that general UAV cached RW should occur through the eight 16KB colour buffer caches. I can't think why it doesn't, but everywhere I turn in the ISA it's barred. It might be nothing more than addressing restrictions - restrictions like those seen in R700 and earlier GPUs, where only a single UAV is available. Providing fully generic multi-UAV RW addressing/caching seems to be the crunch point, but then I don't get why atomics aren't hobbled. I can only think that the atomics are stuffed into a queue for address collision resolution, and the latency of that queue is deemed too slow for general RW.
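To make the collision scenario concrete, this is the sort of access pattern I have in mind - lots of work-items hammering the same address, which a collision-resolution queue would have to serialise. A rough OpenCL sketch (kernel and names invented, and the mapping to a hardware queue is my speculation):

[code]
#pragma OPENCL EXTENSION cl_khr_global_int32_base_atomics : enable

// Histogram-style kernel: many work-items can land on the same bin,
// so the atomics on that address have to be serialised somewhere.
__kernel void histogram(__global const uint *values,
                        __global int *bins,
                        uint num_bins)
{
    uint v = values[get_global_id(0)];
    atom_add(&bins[v % num_bins], 1);   // collides when bins clash
}
[/code]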
Generally I feel this kind of stuff was a victim of the belt-tightening that resulted in a 334mm² Cypress instead of a ~480mm² one.
I expect AMD to implement something pretty much the same as Fermi's cache hierarchy. Though that has a wrinkle or two, because in Fermi the TEX L1 and the L1/shared-memory sit beside each other - similar to how Larrabee's semi-decoupled TUs have their own L1s. Since ATI's L1 holds decompressed texels and is a dual-purpose texture and vertex cache, this distinction might not apply.
Btw, for all EXPORT_RAT_INST_xxx instructions one can specify either MEM_EXPORT_WRITE(_IND) or MEM_EXPORT_WRITE(_IND)_ACK. The latter doesn't return until the write has been carried out to memory. Looks almost like a write-through switch, doesn't it?
ACK might be solely to ensure that the Sequencer can keep track of logical barriers, i.e. in order to effect a work-group memory barrier it needs to know that the writes have completed. Note also that fetches can have ACK. At the hardware-thread level fetch ACK isn't required because of the clause-by-clause serialisation, but work-group scheduling can't work without ACK.
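The sort of thing I mean is a work-group-scoped global fence: the Sequencer can't let the work-group past the barrier until it knows the preceding writes have landed, which is what the write ACKs would tell it. Rough OpenCL-level sketch (names invented):

[code]
__kernel void stage_and_consume(__global int *scratch, __global int *out)
{
    size_t gid = get_global_id(0);

    scratch[gid] = (int)gid * 2;        // global write

    // Work-group memory barrier: every work-item's global writes must
    // be visible before any work-item continues - this is where the
    // hardware needs to know the writes completed.
    barrier(CLK_GLOBAL_MEM_FENCE);

    out[gid] = scratch[gid ^ 1];        // read a neighbour's write
                                        // (same work-group, assuming an
                                        // even work-group size)
}
[/code]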
Atomics don't have ACK; they only have RTN, which doesn't imply ACK (though it doesn't imply a lack of write-through either). It would make sense that atomics don't wait for write-through - after all, atomics are being optimised for latency, and a memory barrier can always be erected if write-through is required.
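At the OpenCL level that RTN/no-RTN split presumably just maps to whether the atom_* return value gets consumed - again only a sketch, and the compiler mapping is my assumption:

[code]
#pragma OPENCL EXTENSION cl_khr_global_int32_base_atomics : enable

__kernel void append(__global int *counter, __global int *out, int value)
{
    // Return value is used, so this would need the RTN form.
    int slot = atom_add(counter, 1);
    out[slot] = value;
}

__kernel void count_only(__global int *counter)
{
    // Return value is discarded, so a non-RTN atomic would do.
    atom_add(counter, 1);
}
[/code]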
So I'm dubious it's for write-through, per se - I can't think of a scenario in which the reduced-latency for cache operations (as opposed to shader operations) associated with write-through would be useful.
But you are right, the documentation is still quite incomplete in some respects. It is impossible (at least for me) to figure out how some things are supposed to work.
Not to mention, the IL documentation seems to be lagging even further behind.
Jawed