This strikes me as a curious discrepancy. How do compute shaders get by without using ROPs, and why? Are ROPs in current GPUs merely a vestigial organ of past GPU design practices, or are they really needed if compute shaders can do without them?
ROPs are units which process the pre-defined output calculations on a pre-defined raster on a pre-defined buffer, offering special features such as multi-sampling. That means the ROPs also know about stuff that happens in the rasterizer, and they know about the pixel neighbourhood in the same quad, which lets them write/blend in 4-pixel bursts. There are controls to specify what ROPs do, like blending modes, alpha test etc. That's a lot of global state, and it's extremely slow to change. But they are extremely good and fast at what they do. They are the equivalent of a TMU, but for writing. There are only a few of those units available, enough to write to at most 4 different surfaces simultaneously per input-set (which is a pixel instance). That's unique.
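To make the "pre-defined, fixed-function" part concrete, here is a rough Python sketch of what a ROP effectively computes for one common blend mode (SrcAlpha / InvSrcAlpha), applied to a whole 2x2 quad at once. This is a model, not any real API; the function names are mine:

```python
# Toy model of fixed-function ROP-style alpha blending
# (blend mode SrcAlpha / InvSrcAlpha); names are illustrative, not a real API.

def blend_pixel(src, dst):
    """src, dst: (r, g, b, a) tuples with components in [0, 1]."""
    a = src[3]
    return tuple(s * a + d * (1.0 - a) for s, d in zip(src, dst))

def rop_write_quad(src_quad, dst_quad):
    """ROPs know the 2x2 quad neighbourhood, so they blend 4 pixels per burst."""
    return [blend_pixel(s, d) for s, d in zip(src_quad, dst_quad)]
```

The point of the sketch: the blend equation is baked-in global state selected per render target, not something each pixel shader instance decides on the fly.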
Compute doesn't have any pre-defined outputs; you bind outputs the same way you bind inputs. You can't attach ROPs to compute shaders, because no rasterizer is supposed to be involved, and absolutely no persistent state is wanted to be connected to a compute shader. All state is local and essentially buffers of various kinds (constant buffer, read-only texture buffer, read-write texture buffer, plain memory buffer etc.). They are kept minimal to allow much more flexibility. Their writing capability is severely limited though, basically only variations of uint-moves. But they can be yielded very cheaply because there is almost no chip-wide state to switch. There are more units available for writing than there are ROPs, enough for 8 different surfaces. But you can only write to one per input-set (which is a thread) at a time, and at most 2 uints/floats per write.
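As an illustration of the "only variations of uint-moves" point: a thread's store to a raw buffer is essentially an untyped move of 32-bit words, so writing a float means reinterpreting its bits as a uint first. A hedged Python sketch (the helper names are made up):

```python
import struct

def float_to_uint_bits(f):
    """Reinterpret a 32-bit float's bit pattern as a uint (like asuint())."""
    return struct.unpack('<I', struct.pack('<f', f))[0]

def store_2x32(buffer, index, x, y):
    """Model a thread writing its maximum of 2 uints/floats to a raw buffer."""
    buffer[index] = float_to_uint_bits(x)
    buffer[index + 1] = float_to_uint_bits(y)
```

There is no blending, no format conversion, no multi-sampling here: the hardware just moves the words you hand it.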
One pixel shader instance equals one compute thread in granularity. You can run 256 quads (256x4 = 1024 pixels) against 1024 compute threads, yet you could issue 4x the number of writes with ROPs than with compute, and at 2x the size: 4 floats per ROP write vs. 2 floats per thread write. The maximum is thus 4x4x4 bytes = 64 bytes per pixel via ROPs and 1x2x4 bytes = 8 bytes per thread.
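The arithmetic above can be checked back-of-the-envelope (same numbers as in the text, which vary by architecture):

```python
# Back-of-the-envelope check of the peak-write arithmetic above.
quads = 256
pixels = quads * 4                 # 1024 pixel shader instances
threads = 1024                     # matching compute thread count

rop_bytes_per_pixel = 4 * 4 * 4    # 4 surfaces x 4 floats x 4 bytes = 64
thread_bytes = 1 * 2 * 4           # 1 surface  x 2 floats x 4 bytes = 8
advantage = rop_bytes_per_pixel // thread_bytes   # ROP path wins 8x per instance
```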
Additionally, because the compute outputs are tightly coupled to the memory controller, you need to worry about bank conflicts. The ROPs have caches, so that issue doesn't exist there. On the other hand, in compute you have absolute control over what you do with your output, including atomic operations. Nothing like that exists for ROPs.
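A toy model of the bank-conflict problem: assume a banked memory where each 4-byte word maps to bank word_index % 32 (the bank count of 32 is an assumption; it differs between architectures). Unit-stride accesses from a group of threads spread across all banks, while a stride equal to the bank count serializes every access onto one bank:

```python
from collections import Counter

NUM_BANKS = 32  # assumed bank count; architecture-dependent

def conflict_degree(byte_addresses):
    """Max number of threads hitting the same bank (1 = conflict-free)."""
    banks = [(addr // 4) % NUM_BANKS for addr in byte_addresses]
    return max(Counter(banks).values())

unit_stride = [4 * t for t in range(32)]    # 32 threads, consecutive words
bad_stride = [128 * t for t in range(32)]   # stride of 32 words: all bank 0
```

With the unit stride every thread lands in a different bank; with the 128-byte stride all 32 threads collide on bank 0 and their writes serialize.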
With DX12 we get typed UAV writes, so writing will get a bit better, but we're still far from having ROP-like writing functionality for compute.
I hope I got it all broadly correct; some numbers differ between architectures.