I do agree that the SM 6.0 feature list looks highly GCN-centric (WavePrefixSum is also a single instruction on GCN, and WaveBallot returns a 64-bit mask, which maps efficiently to a single 64-bit scalar register on GCN). Console developers (us included) have given plenty of feedback to get these features included in PC DirectX. These features make it much easier to port console games to PC.

Word of warning - these docs probably got posted by accident and are certainly not finalized. Ordered atomics as in GCN (they are really more like fences) in particular are something I doubt we can support efficiently, and I certainly haven't heard that NVIDIA can either, although I have no firm info on that one.
Ordered atomics are similar to rasterizer ordered views, but for compute shaders. If multiple threads (waves) access the same atomic counter, the accesses are executed in submission order. GPUs tend to execute waves (and workgroups) roughly in submission order, which minimizes the potential stalls. When stalls do occur, the latency can be hidden in much the same way as memory latency (cache misses).
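To make the ordering guarantee concrete, here is a small CPU-side sketch in C++. It is only a toy model, not how any GPU implements this: `simulateOrderedAtomicAdd`, `OrderedResult` and the `arrivalOrder` parameter are names I made up for illustration. Waves reach the counter in arbitrary order, but each one is only let through once every earlier-submitted wave has gone, so the value each wave gets back depends on submission order alone.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical result type for this sketch: the counter value each wave
// observed, plus how many waves had to wait for an earlier-submitted wave.
struct OrderedResult {
    std::vector<uint32_t> offsets;  // indexed by submission index
    std::size_t stalledWaves = 0;
};

// counts[i]       : amount wave i adds to the shared counter
// arrivalOrder[k] : submission index of the k-th wave to reach the atomic;
//                   assumed to be a permutation of 0..counts.size()-1
OrderedResult simulateOrderedAtomicAdd(const std::vector<uint32_t>& counts,
                                       const std::vector<std::size_t>& arrivalOrder) {
    assert(counts.size() == arrivalOrder.size());
    OrderedResult result;
    result.offsets.assign(counts.size(), 0);
    std::vector<bool> arrived(counts.size(), false);
    uint32_t counter = 0;
    std::size_t next = 0;  // submission index that is allowed to proceed

    for (std::size_t wave : arrivalOrder) {
        assert(wave < counts.size() && !arrived[wave]);
        arrived[wave] = true;
        if (wave != next) ++result.stalledWaves;  // has to wait for its turn

        // Release every wave whose turn has come, strictly in submission order.
        while (next < counts.size() && arrived[next]) {
            result.offsets[next] = counter;  // old value returned by the atomic add
            counter += counts[next];
            ++next;
        }
    }
    return result;
}
```

With `counts = {3, 1, 4, 2}` the offsets come out as `{0, 3, 4, 8}` whether the waves arrive as `{0, 1, 2, 3}` or as `{1, 0, 3, 2}`. Only the stall count differs (0 versus 2), which is why roughly in-order execution keeps the cost low.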
The best thing about ordered atomics is that this feature allows you to execute algorithms with prefix sums in a single pass. Without ordered atomics you often need to: store intermediate results (from registers & LDS) to memory -> wait for GPU idle -> execute a separate global prefix sum kernel for the output data -> wait for GPU idle -> load the prefix sum data + all other data back to registers & LDS. It's clear that even if ordered atomics cause some stalls, they are a significant efficiency boost (performance & power) for algorithms containing global prefix sums.
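A rough CPU-side comparison of the two approaches, again in C++ and again just a sketch under my own assumptions: `scanMultiPass` and `scanSinglePass` are hypothetical names, the outer loop over groups stands in for workgroups running in submission order, and a plain variable stands in for the ordered atomic counter. The point is that in the single-pass version each group still has its local scan on hand (think registers / LDS) when it receives its global base, so nothing has to round-trip through memory.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Multi-pass exclusive scan: three "kernels", with a GPU idle point between each.
std::vector<uint32_t> scanMultiPass(const std::vector<uint32_t>& in, std::size_t groupSize) {
    assert(groupSize > 0);
    const std::size_t groups = (in.size() + groupSize - 1) / groupSize;
    std::vector<uint32_t> out(in.size()), totals(groups, 0);

    // Pass 1: local exclusive scan per group; intermediate results go to memory.
    for (std::size_t g = 0; g < groups; ++g) {
        const std::size_t end = std::min(in.size(), (g + 1) * groupSize);
        uint32_t sum = 0;
        for (std::size_t i = g * groupSize; i < end; ++i) { out[i] = sum; sum += in[i]; }
        totals[g] = sum;
    }
    // Pass 2: separate global prefix sum over the group totals.
    uint32_t running = 0;
    for (uint32_t& t : totals) { const uint32_t tmp = t; t = running; running += tmp; }
    // Pass 3: reload everything and add each group's base.
    for (std::size_t i = 0; i < in.size(); ++i) out[i] += totals[i / groupSize];
    return out;
}

// Single-pass exclusive scan: one "kernel", global base fetched in submission order.
std::vector<uint32_t> scanSinglePass(const std::vector<uint32_t>& in, std::size_t groupSize) {
    assert(groupSize > 0);
    std::vector<uint32_t> out(in.size());
    std::vector<uint32_t> local(groupSize);  // stands in for registers / LDS
    uint32_t orderedCounter = 0;             // stands in for the ordered atomic

    for (std::size_t begin = 0; begin < in.size(); begin += groupSize) {
        const std::size_t n = std::min(groupSize, in.size() - begin);
        uint32_t sum = 0;
        for (std::size_t j = 0; j < n; ++j) { local[j] = sum; sum += in[begin + j]; }

        // Ordered atomic add: returns the old value, i.e. the sum of all
        // earlier-submitted groups. Modeled here by the in-order loop.
        const uint32_t base = orderedCounter;
        orderedCounter += sum;

        for (std::size_t j = 0; j < n; ++j) out[begin + j] = base + local[j];
    }
    return out;
}
```

Both functions return the same exclusive prefix sum. The difference is that the multi-pass one writes, idles and reloads twice, while the single-pass one only depends on the counter being handed out in submission order.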