I have no idea how they manage that — there could be a watchdog that monitors cache utilization and occupancy and suspends/launches new waves to optimize things, or it could be a more primitive load-balancing system that operates at the driver level. It is also not at all clear what AMD does.
Most likely both require assistance from the shader compiler to flag the lower part of the stack as "preemptable", so rather than a hard register count, you now have a peak working-set size, plus a total register count for the deepest part of the call tree. In the worst case this requires hoisting a couple of registers from a lower page to an upper page on call, to minimize the hot working set by compacting the active registers.
You would want 3 instructions aiding this:
- (Optional) Flagging a higher part of the virtual register file as "soon to be required" - that's the first thing you do in the preamble of a function call. When this instruction isn't used, the first "bad" register access needs to send the wave into a trap state.
- (Optional) Flagging a lower part of the virtual register file as "cold" - permit or even encourage preemption, but contents must be preserved. This is the last thing you do in the preamble of a call, after hoisting all live registers into the new hot set. When this instruction isn't used, there is no voluntary preemption, and bad scheduling decisions may have to be resolved by essentially randomized preemption to break deadlocks.
- Flagging a part of the register file as "discarded" - that's what you do in the footer of a function call.
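The three flagging instructions above can be sketched as a toy software model. This is purely illustrative: the names (`prefetch_regs`, `release_cold`, `discard_regs`) and the per-register state tracking are my invention, not anything from an actual ISA.

```python
# Toy model of the three hypothetical register-file flagging instructions.
# States a virtual register can be in from the scheduler's point of view.
HOT, COLD, DISCARDED, UNMAPPED = "hot", "cold", "discarded", "unmapped"

class VirtualRegisterFile:
    def __init__(self, size):
        self.state = [UNMAPPED] * size

    def prefetch_regs(self, lo, hi):
        """Flag [lo, hi) as 'soon to be required' (first thing in the preamble)."""
        for r in range(lo, hi):
            self.state[r] = HOT

    def release_cold(self, lo, hi):
        """Flag [lo, hi) as cold: preemptible, but contents preserved
        (last thing in the preamble, after hoisting live registers)."""
        for r in range(lo, hi):
            self.state[r] = COLD

    def discard_regs(self, lo, hi):
        """Flag [lo, hi) as discarded: contents dead (function footer)."""
        for r in range(lo, hi):
            self.state[r] = DISCARDED

    def access(self, r):
        """A 'bad' access to a non-resident register traps the wave."""
        if self.state[r] in (UNMAPPED, DISCARDED):
            raise RuntimeError(f"wave trapped: register v{r} not resident")
        return self.state[r]

vrf = VirtualRegisterFile(64)
vrf.prefetch_regs(16, 32)   # preamble: upper page soon required
vrf.release_cold(0, 16)     # preamble: lower page now cold
print(vrf.access(20))       # resident, fine
vrf.discard_regs(16, 32)    # footer: callee registers dead
```

Accessing a discarded or unmapped register here raises, standing in for the trap state a wave would enter until the interrupt routine resolves the dependency.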
Not all functions necessarily need to do all the flagging; that would be wasteful. Only recursive functions (after a set number of folded recursions) or functions with a huge individual register footprint warrant taking the risk of dynamic register allocation.
I don't think the shader compiler can properly estimate what share of its time the shader spends at the "big" versus the "small" working-set size, so it's a static scheduling decision to allow e.g. 50%, 100% or even 200% overcommitment of the "worst case register count". The "hot working set", however, remains a hard limit that can't be trivially violated without catastrophic preemption.
A bad estimate that results in excessive swapping of the register file may be corrected at runtime by increasing the reserved "hot" working set of the offending shader for future waves. You are aiming to keep the chances of worst-case spilling low, but you do have to accept that worst case in order to reap the benefits of overcommitment.
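The occupancy effect of this static overcommitment decision can be sketched with a back-of-envelope calculation. All numbers (physical register count, per-wave footprints) are made up for illustration; the point is just how guaranteeing only the hot set lets more waves fit.

```python
# Hypothetical occupancy math for static overcommitment of the register file.
def waves_that_fit(phys_regs, worst_case, hot_set, overcommit):
    """overcommit = 0.0 / 0.5 / 1.0 / 2.0 for 0% / 50% / 100% / 200% extra.
    Each wave is budgeted its worst case shrunk by the overcommit factor,
    but never less than its hot working set (the hard limit)."""
    budget = max(hot_set, int(worst_case / (1.0 + overcommit)))
    return phys_regs // budget

PHYS = 1024           # physical registers per SIMD (invented number)
WORST, HOT = 256, 64  # deepest call tree vs. peak hot working set

for oc in (0.0, 0.5, 1.0, 2.0):
    n = waves_that_fit(PHYS, WORST, HOT, oc)
    print(f"{int(oc * 100):>3}% overcommit -> {n} waves resident")
```

With these invented numbers, going from 0% to 200% overcommitment triples the resident wave count, at the cost of accepting that the worst case (every wave simultaneously at its deepest call) forces swapping.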
Based on their slides, AMD's other types of on-chip memory (LDS/L0 cache) are depicted as physically separate ...
That slide might very well be accurate. But even in the worst case, nothing speaks against memory management performed by a memory-management kernel operating outside the "normal" visibility constraints. You don't necessarily have only the instructions that the ISA docs expose as "user visible" either, and it's almost safe to assume that software interrupt routines with elevated privileges have existed for a while now.
Also, if you have some "race condition" in your program where you're continuously modifying certain memory locations (threadgroup or tile) throughout the lifetime of the program, isn't it dangerous (crash/incorrect results) to allocate registers from other pools of on-chip memory?
No. You should rarely even be able to access a register that's outside the active working set / assigned part of the register file if the shader compiler did its job properly. When you do, or when you happen to hit something that has truly been preempted, you will most likely land in some sort of trap state for the wave. And you can only resume once the dependency is resolved, so you were never able to see anything stale. Since this is a rare case, resolving the dependency may very well happen in pure software, as an ordinary interrupt routine.
Other than latency, it's never visible to the wave that it has just caused hot-swapping of parts of the register file, LDS, or any of the other resources that are not tied directly into the memory hierarchy.
There is only the cache hierarchy backed by system memory, and "registers" and "threadgroup memory" are special ways of addressing, using narrow "local" addresses instead of 64-bit (48/52 bits used) virtual addresses. Everything is backed by system memory, but ideally almost all "register" accesses come out of cache.
You can't assume that it's tied into the "regular" memory hierarchy just because it's mapped into a single logical address space for the sake of unified instruction encoding. Backing different parts of the logical address space with different fixed functions or hardware pools is not that uncommon, and neither is assigning different retention and mapping strategies based on virtual address range. E.g. direct addressing for control registers, basic offset addressing for registers in the register file with a linear layout, page-table-based addressing for main memory. And not all those hardware pools need to be tied into a hierarchy by direct interconnects in order to enable preemption or reloading; interrupts can handle all the rare cases where a virtual address from any of those distinct ranges is currently unmapped.
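The range-based backing described above amounts to a simple decode step. The following sketch shows the idea with entirely invented address ranges and pool names; real hardware partitions the space differently and does this in fixed-function logic, not software.

```python
# Toy decoder: one flat logical address space whose ranges are backed by
# different pools with different addressing/translation strategies.
RANGES = [
    (0x0000_0000, 0x0000_1000, "control registers (direct addressing)"),
    (0x0000_1000, 0x0001_0000, "register file (base + offset, linear layout)"),
    (0x0001_0000, 0x0010_0000, "LDS / threadgroup memory"),
    (0x0010_0000, 1 << 48,     "main memory (page-table translated)"),
]

def decode(addr):
    """Map a logical address to its backing pool; unmapped -> interrupt."""
    for lo, hi, pool in RANGES:
        if lo <= addr < hi:
            return pool
    # Rare case: no pool currently maps this address, software resolves it.
    return "fault -> interrupt routine"

print(decode(0x0000_1040))   # lands in the register-file range
print(decode(0x8000_0000))   # lands in the page-table-translated range
```

The key point the sketch captures: the pools behind each range need no direct interconnect between each other; only the decode and the fault path have to be uniform.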
The patent only describes the instructions mentioned above that flag a part of the register file as "discarded" or "soon to be required". And from what I can skim of it, the patent isn't even trying to describe the whole mechanism from the perspective of the user code in a way that would permit an efficient system; rather it focuses on the implementation of the necessary interrupt routine that performs the actual register file compaction. Because there's a tricky little constraint: the control block of a wave only reserves a single offset into the register file (and maybe an upper limit?) rather than a full mapping table, so registers actually need to be shifted around and control blocks updated in order to achieve contiguous blocks in which a new wave can be placed.
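The compaction constraint can be illustrated with a small simulation. Because each wave's control block holds only a single base offset, freeing space means physically sliding live blocks together and patching those offsets; the structure below is hypothetical, not taken from the patent.

```python
# Sketch of register-file compaction when each wave's control block stores
# only a single base offset (no per-page mapping table).
def compact(blocks):
    """blocks: list of (wave_id, base, size) live allocations, possibly
    with holes between them. Returns the shifted blocks plus the first
    free physical register after compaction."""
    blocks = sorted(blocks, key=lambda b: b[1])
    cursor, moved = 0, []
    for wave_id, base, size in blocks:
        # Slide the block down to close the hole below it; the interrupt
        # routine would copy the registers and rewrite the control block's
        # base offset to 'cursor'.
        moved.append((wave_id, cursor, size))
        cursor += size
    return moved, cursor

live = [("wave0", 0, 32), ("wave2", 96, 48), ("wave5", 200, 16)]
moved, free_base = compact(live)
print(moved)      # blocks now contiguous from offset 0
print(free_base)  # everything above this is one contiguous free run
```

Without the compaction step, the holes at 32..96 and 144..200 would be individually too fragmented to place a large new wave, even though the total free space suffices.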
In the most naive form only the top of the call stack can be returned to the pool. In fact, I believe this patent was filed to describe a possible pure interrupt-based solution for an older version of RDNA, not RDNA4. And also that it was rejected internally at AMD first in that naive form! (And by the way: it's a pure software patent. That's not enforceable in most of the civilized world.)
Because flagging the lower half of the call stack as eligible for preemption in the case of deep call stacks is necessary to improve scheduling decisions by a lot, but that one actually does require a hardware change: instead of just "base" and "upper limit", the control block also needs to track a "currently accessible lower limit", so that a truncated view of the register file can be expressed at all, and thus be trapped on.
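The hardware change argued for here boils down to one extra bounds check. A minimal sketch, with invented field names, of a control block that tracks all three values and traps on accesses into the preempted lower window:

```python
# Sketch of a wave control block extended with a "currently accessible
# lower limit", enabling a truncated view of the register file.
class ControlBlock:
    def __init__(self, base, upper_limit):
        self.base = base                # single offset into physical file
        self.upper_limit = upper_limit  # registers reserved for this wave
        self.lower_limit = 0            # everything accessible initially

    def truncate(self, new_lower):
        """Lower part [0, new_lower) flagged cold and handed back for
        preemption; accesses below this now trap."""
        self.lower_limit = new_lower

    def check(self, reg):
        """Bounds check performed on every register access."""
        if not (self.lower_limit <= reg < self.upper_limit):
            raise RuntimeError(
                f"trap: v{reg} outside accessible window "
                f"[{self.lower_limit}, {self.upper_limit})")
        return self.base + reg  # translated physical register index

cb = ControlBlock(base=128, upper_limit=64)
cb.truncate(16)        # lower 16 registers preempted away
print(cb.check(20))    # still resident, translates normally
# cb.check(4) would trap until the preempted page is restored
```

With only "base" and "upper limit", the `truncate` step is inexpressible, which is exactly why the naive interrupt-only scheme can return just the top of the call stack to the pool.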