Jawed
L2 (global memory) atomics are effectively a superset of ROP functions, but it could be argued that, when the rasteriser generates the work items, there is no need to invalidate L1 lines. Compute atomics must invalidate simply because any CU anywhere on the GPU could issue an atomic to that same address, so every L1 line that caches that L2 line needs to be invalidated.
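To make the invalidation argument concrete, here is a toy model. It is a hypothetical sketch, not how any real GPU's caches are built. Each CU has an L1, modelled as a set of cached line addresses. A compute atomic has to drop every L1 copy of the line, because any CU might hold one. The "raster-owned" case rests on a loudly stated assumption: the rasteriser routes every fragment touching a given line to one CU, so no other L1 can hold a stale copy. The class and method names are made up for illustration.

```python
class ToyGpuCaches:
    """Hypothetical model: per-CU L1 caches sitting in front of a shared L2."""

    def __init__(self, num_cus):
        # Each entry is the set of line addresses currently held by that CU's L1.
        self.l1 = [set() for _ in range(num_cus)]
        self.invalidations = 0

    def load(self, cu, line):
        # A read that fills the line into this CU's L1.
        self.l1[cu].add(line)

    def compute_atomic(self, cu, line):
        # The atomic is performed at L2. Any CU anywhere may hold a copy of
        # this line, so every L1 copy (including the issuer's) is now stale.
        for cache in self.l1:
            if line in cache:
                cache.discard(line)
                self.invalidations += 1

    def raster_owned_atomic(self, cu, line):
        # ASSUMPTION: the rasteriser sends all work for this line to `cu`,
        # so no other L1 can hold a copy and nothing has to be invalidated.
        # The owner's copy is treated as updated in place and stays valid.
        self.l1[cu].add(line)


# Usage: four CUs all cache line 0x1000, then one of them issues a compute atomic.
gpu = ToyGpuCaches(4)
for cu in range(4):
    gpu.load(cu, 0x1000)
gpu.compute_atomic(0, 0x1000)
print(gpu.invalidations)  # 4

# Raster-owned case: only CU 2 ever touches line 0x2000.
gpu2 = ToyGpuCaches(4)
gpu2.load(2, 0x2000)
gpu2.raster_owned_atomic(2, 0x2000)
print(gpu2.invalidations)  # 0
```

The cost of the compute case grows with the number of L1s that might share the line. In the raster-owned case that cost disappears, but only because of the ownership guarantee assumed above.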
128B cache lines are "relatively large" compared with render target pixels (at least for simple 32-bit-per-pixel formats, where one line holds 32 pixels). Of course, the alignment of rasterised fragments to cache lines is quite coarse, and will generally suffer "quad misalignment". So it seems best to talk about the bandwidth amplification of L1 in terms of cache lines rather than pixels.
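Here is a rough way to count amplification in cache lines rather than pixels. It is a sketch that assumes a plain row-major render target with 4 bytes per pixel and 128B lines. Real GPUs typically use tiled or swizzled surface layouts, so the absolute numbers here are illustrative only. The function names are hypothetical.

```python
LINE_BYTES = 128
BYTES_PER_PIXEL = 4  # simple 32-bit-per-pixel format


def lines_touched(pixels, pitch_px):
    """Count distinct 128B lines touched by (x, y) pixels.

    ASSUMPTION: the surface is linear row-major, with a pitch of pitch_px pixels.
    """
    lines = set()
    for x, y in pixels:
        byte_addr = (y * pitch_px + x) * BYTES_PER_PIXEL
        lines.add(byte_addr // LINE_BYTES)
    return len(lines)


def amplification(pixels, pitch_px):
    """Bytes moved in whole cache lines divided by bytes the fragments actually use."""
    useful = len(set(pixels)) * BYTES_PER_PIXEL
    fetched = lines_touched(pixels, pitch_px) * LINE_BYTES
    return fetched / useful


# A 2x2 quad whose rows each fall inside one line, versus the same quad
# shifted by one pixel so that it straddles a line boundary.
inside = [(30, 0), (31, 0), (30, 1), (31, 1)]
straddling = [(31, 0), (32, 0), (31, 1), (32, 1)]
print(lines_touched(inside, 1024), amplification(inside, 1024))          # 2 16.0
print(lines_touched(straddling, 1024), amplification(straddling, 1024))  # 4 32.0
```

A one-pixel shift doubles the number of lines (and bytes) touched for the same four fragments. That is why a per-line count describes the L1 traffic better than a per-pixel one.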