Actually, shader interlocks are supported on recent AMD HW. The reason they aren't exposed in either the GL or VK drivers is that using them is a bad idea: executing critical sections is a high-latency operation. Their recommendation is that you're better off using per-pixel linked lists for arbitrary blending or OIT.
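For reference, the linked-list approach recommended there looks roughly like the sketch below (GLSL fragment shader; the buffer/image names and layouts are placeholders of mine):
Code:
// Append pass: push this fragment onto the pixel's linked list, no ordering required.
layout(binding = 0, r32ui) coherent uniform uimage2D head_pointers; // per-pixel list heads
layout(binding = 0) uniform atomic_uint node_counter;               // global node allocator

struct Node { vec4 color; float depth; uint next; };
layout(std430, binding = 1) buffer NodeBuffer { Node nodes[]; };

void append_fragment(vec4 color, float depth)
{
    uint node = atomicCounterIncrement(node_counter);   // allocate a node
    uint prev = imageAtomicExchange(head_pointers, ivec2(gl_FragCoord.xy), node); // swap in new head
    nodes[node].color = color;
    nodes[node].depth = depth;
    nodes[node].next  = prev;                            // link to previous head
}
A separate resolve pass then sorts and blends each pixel's list, so no critical section is needed during rasterization.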
Ordered shader interlock is implemented, you mean? And only on Vega / RDNA family. (SOPP, S_SENDMSG, MSG_ORDERED_PS_DONE).
There is no native unordered shader interlock support, and the ordered one appears to be hard-wired with severe implications for rasterization efficiency (not just the denoted critical section is blocked off, but the whole work generation is stalled).
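For context, on hardware and drivers where it is exposed, the ordered interlock looks like this from the GL side (GL_ARB_fragment_shader_interlock); just a minimal sketch with placeholder image names:
Code:
#extension GL_ARB_fragment_shader_interlock : require
layout(pixel_interlock_ordered) in;   // ordered variant; pixel_interlock_unordered also exists

layout(binding = 0, rgba8) coherent uniform image2D color_buf;

void main()
{
    beginInvocationInterlockARB();    // enter the per-pixel critical section, in primitive order
    vec4 dst = imageLoad(color_buf, ivec2(gl_FragCoord.xy));
    // ... programmable blending against dst would go here ...
    imageStore(color_buf, ivec2(gl_FragCoord.xy), dst);
    endInvocationInterlockARB();      // leave the critical section
}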
With the instructions supported so far, you could only construct unordered critical section support by using a mix of atomics and sleeps. That's the worst-case scenario, as you get serialized execution with random latency in between the serialized parts on top.
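Spelled out, such a construction would look roughly like this (the lock buffer and the sleep() call are placeholders of mine; there is no real intrinsic for the latter):
Code:
layout(std430, binding = 2) buffer LockBuffer { uint lock; }; // 0 = free, 1 = taken

void critical_section()
{
    // Keep trying to take the lock; every loser serializes behind the current owner.
    while(atomicCompSwap(lock, 0u, 1u) != 0u) {
        sleep();                  // placeholder, S_SLEEP is not exposed by any intrinsic
    }
    // ... exclusive work here ...
    atomicExchange(lock, 0u);     // release the lock
}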
Special CS support with first-to-arrive logic is actually simpler (as you may cache the guard value locally once shared mode has been reached), but still inefficient to implement in software:
Code:
// First-to-arrive guard: 0 = uninitialized, 1 = initialization in progress, 2 = ready
if(init_guard == 2) {
    // NOP, lucky cache hit: already initialized
} else {
    int state = atomicCompSwap(init_guard, 0, 1);
    if(state == 0) {
        // We won the race: initialize, then publish the result
        init();
        atomicExchange(init_guard, 2);
    } else while(state != 2) {
        // Someone else is initializing: wait until they publish
        sleep();
        state = atomicCompSwap(init_guard, 2, 2);
    }
}
With the sleep instruction (SOPP S_SLEEP) not exposed by any intrinsic, the atomicCompSwap loop is still a bad choice. So there has to be some hardware arbitration, or at least an intrinsic, to handle this properly without an (unthrottled) spin-lock.
The whole thing is then probably interleaved with memory management. There is no visible page fault handler in RDNA, but in order to provide the benefits described in the patent, that logic has to operate at least on a virtual memory segment which is subject to being dropped on L2 cache eviction. LDS and GDS don't fit the size requirements, and spilling to main memory defeats the point of using texture compression.
At least for RDNA 1.0, I don't see such a capability documented yet, but it doesn't sound too far off either.
For the purpose of texture space shading, sub-allocations linked from an instanced lookup table should suffice. Effectively the good old tiled / partially resident texture, but with a device-managed allocation strategy.
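In shader terms, that indirection could look roughly like this (a sketch; the page table, tile pool and tile size are placeholders of mine):
Code:
layout(binding = 1) uniform usampler2D page_table;    // one texel per virtual tile -> physical slice index
layout(binding = 2) uniform sampler2DArray tile_pool; // the sub-allocated physical tiles

const ivec2 TILE_SIZE = ivec2(64);

vec4 sample_tiled(ivec2 virt_texel)
{
    ivec2 tile  = virt_texel / TILE_SIZE;              // which virtual tile
    uint  phys  = texelFetch(page_table, tile, 0).r;   // indirection through the lookup table
    ivec2 local = virt_texel % TILE_SIZE;               // position inside the tile
    return texelFetch(tile_pool, ivec3(local, int(phys)), 0);
}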