Untyped read/write/atomic with MUBUF, image read/write/atomic with MIMG.
MTBUF has read/write for typed buffers, with the type being dictated by a resource constant.
The AMD presentation doesn't list atomics for MTBUF, and with its lack of ordering it does sound like it could be used for a UAV.
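To keep the three encodings straight, here is a quick sketch of the breakdown above as a lookup table. The operation sets only mirror my reading of the AMD presentation, not the ISA manual, and the names `VMEM_ENCODINGS` and `encodings_supporting` are made up for illustration.

```python
# Summary of the vector-memory encodings as described in this post.
# Assumption: the operation sets follow the AMD presentation as read here;
# they are not checked against the ISA documentation.
VMEM_ENCODINGS = {
    "MUBUF": {"data": "untyped buffer", "ops": {"read", "write", "atomic"}},
    "MTBUF": {"data": "typed buffer", "ops": {"read", "write"}},
    "MIMG": {"data": "image", "ops": {"read", "write", "atomic"}},
}


def encodings_supporting(op):
    """Return the encoding names listed as supporting `op`, sorted by name."""
    return sorted(
        name for name, info in VMEM_ENCODINGS.items() if op in info["ops"]
    )


if __name__ == "__main__":
    for op in ("read", "write", "atomic"):
        print(op, encodings_supporting(op))
```

Asking for "atomic" leaves out MTBUF, which is the gap noted above.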
Nvidia's graphics export pipe is the most integrated with the cache hierarchy, since the ROPs use the L2.
AMD seems to be less so since GDS and graphics export have a side path and the ROPs are separate.
At a high level, IVB resembles an earlier AMD GPU, possibly from before the introduction of that little UAV cache that preceded the R/W cache hierarchy.
The ROP path seems specialized enough to keep a separation for all three. Nvidia has done the most to update the graphics domain, which is why its ROP path seems the most tightly integrated.
AMD's compute side has been overhauled, but it seems like its current design has compromised on a CU array that prioritizes each CU being able to serve different compute clients. The modestly evolved graphics domain exists at a slight remove, with the specialized export bus between the freer compute array and the ordered ROP and GDS hardware.
Perhaps Intel hasn't opted for closing the loop yet because of the cost involved in making the leap, and because it's really not hurting as badly for compute performance thanks to its CPU dominance.
And who knows, maybe adding a small CPU core or two (Bobcat-ish) to a discrete GPU might not be such a bad idea after all.
They may do it because the shrinking volume of the discrete market may make it too expensive to have a GPU-only chip. There may be a range of APUs, with some tilted very heavily toward GPU capability. Perhaps a gamer system with dual sockets, one heavy on the CPU, the other on the GPU?