AMD RDNA4 Architecture Speculation

It seems to me that the scheme you describe is more akin to what AMD is doing (as described by the patent here: Register Compaction with Early Release). If Apple indeed operated that way, there would be no change in occupancy on "heavy" shaders and one would still need to do dispatch fine-tuning — but M3 behavior seems to be very different. Some pieces of evidence are the dramatic performance improvements on complex shaders (e.g. Blender — and that's before the hardware RT kicks in) and the Blender patches (where it is mentioned that no dispatch group fine-tuning is needed, as the system will take care of occupancy automatically).
How is my description of Apple closer to AMD when they don't seem to have any unified/flexible on-chip memory? Based on their slides, AMD's other types of on-chip memory (LDS/L0 cache) are depicted to be physically separate ...

How/where else would they release register memory to? Do they just spill to higher-level memory (to allocate more registers) later on, to increase occupancy mid-shader? That doesn't seem very fast or performant ...

It'll be interesting to see what information their ISA docs disclose about this subject ...
All this does raise an interesting question — I don't think we have a very good idea how these things are managed. Getting through multiple layers of marketing BS to the actual technical details can be incredibly tricky. At least my understanding of what Apple does is that they virtualize everything, even register access. So registers are fundamentally backed by system memory and can be cached closer to the SIMDs to improve performance (here is the relevant patent: CACHE CONTROL TO PRESERVE REGISTER DATA). This seems to suggest that you can in principle launch as many waves as you want, just that in some instances the performance will suck since you'll run out of cache. I have no idea how they manage that — there could be a watchdog that monitors cache utilization and occupancy and suspends/launches new waves to optimize things, or it could be a more primitive load-balancing system that operates at the driver level. It is also not clear at all what AMD does. It doesn't seem like their system is fully automatic.
I imagine that Apple absolutely can launch as many waves as they want at the start, but I think they "starve certain sources of memory by priority", carving out other pools of on-chip memory before they spill to device memory, so it still fits with their claim that "occupancy" isn't dependent on worst-case register allocation ...

Also, if you have some "race condition" in your program where you're continuously modifying certain memory locations (threadgroup or tile) throughout the lifetime of the program, isn't it dangerous (crash/incorrect results) to allocate registers from those other pools of on-chip memory? I know that they can allocate more registers by spilling to higher-level memory, but then performance is a concern again ...
 
The way I understood dr_ribit's description of what Apple is doing (I didn't read the patent) is that there are no other pools of on-chip memory. There is only the cache hierarchy backed by system memory, and "registers" and "threadgroup memory" are special ways of addressing, using narrow "local" addresses instead of 64-bit (48/52 bits used) virtual addresses. Everything is backed by system memory, but ideally almost all "register" accesses come out of cache.

I assume that register "deallocation" is simply a bit flag in the instruction encoding per register input which indicates that this register value is no longer needed, thus the cache entry can be discarded without writing back to system memory.
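Purely as an illustration of that idea (this is not Apple's actual instruction format, just a sketch of a per-source "last use" hint):

```cpp
// Illustrative only: one way a "discard after read" bit could be carried per
// source operand, so the register cache may drop the line instead of writing
// it back to memory. Field names and widths are invented.
#include <cstdint>

struct SrcOperand {
    uint16_t reg_index   : 8;  // which "register" (really a narrow local address)
    uint16_t discard_bit : 1;  // set on the last read: cache entry can be
                               // invalidated without write-back
    uint16_t reserved    : 7;
};

struct Instruction {
    uint8_t    opcode;
    uint8_t    dst_reg;
    SrcOperand src[2];         // an FMA-style op would carry a third source
};
```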
 
I have no idea how they manage that — there could be a watchdog that monitors cache utilization and occupancy and suspends/launches new waves to optimize things, or it could be a more primitive load-balancing system that operates at the driver level. It is also not clear at all what AMD does.
Most likely both require assistance from the shader compiler to flag the lower part of the stack as "preemptable", so rather than a hard register count, you now have a peak working set size, and a total register count for the deepest part of the call tree. Worst case, it requires hoisting a couple of registers from a lower page to an upper page on call to actually minimize the hot working set size by compaction of active registers.

You would want 3 instructions aiding this:
  • (Optional) Flagging a higher part of the virtual register file as "soon to be required" - that's the first thing you need to do in the preamble of a function call. When not using this instruction, the first "bad" register access needs to send the wave into a trap state.
  • (Optional) Flagging a lower part of the virtual register file as "cold" - permit or even encourage preemption, but contents must be preserved. This is what you need to do last in the preamble of a call, after hoisting all registers into the new hot set. When not using this instruction, there's no voluntary preemption supported, and bad scheduling decisions may need to happen to resolve deadlocks by essentially randomized preemption.
  • Flagging a part of the register set as "discarded" - that's what you do in the footer of a function call.
Not all functions necessarily need to do all the flagging; that would be wasteful. Only recursive functions (after a set number of folded recursions) or functions with a huge individual register footprint warrant taking the risk of dynamic register allocation.
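As a sketch of how such a calling convention might look, with entirely hypothetical intrinsics standing in for the three flagging instructions (this is speculation, not any shipping ISA; only the placement within a call is the point):

```cpp
#include <cstdint>

// Stubs standing in for the three hypothetical flagging instructions.
void flag_soon_required(uint32_t base, uint32_t count) {}  // "will need [base, base+count) shortly"
void flag_cold(uint32_t base, uint32_t count) {}           // "may be preempted, but preserve contents"
void flag_discard(uint32_t base, uint32_t count) {}        // "dead, drop without write-back"

void callee_with_large_register_frame() {
    // Preamble, step 1: announce the registers we are about to start using,
    // so a miss can trap the wave early (or be avoided entirely).
    flag_soon_required(64, 64);

    // ... hoist the few still-live caller values up into the new hot set ...

    // Preamble, step 2 (last): mark the caller's lower frame as cold so the
    // scheduler may voluntarily preempt that range for other waves.
    flag_cold(0, 64);

    // ... body works only within the hot set [64, 128) ...

    // Footer: our registers are dead; return the backing storage to the pool.
    flag_discard(64, 64);
}
```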

I don't think the shader compiler can properly estimate what share of its time the shader spends in the "big" vs. the "small" working set, so it's a static scheduling decision to allow e.g. 50%, 100% or even 200% overcommitment of the "worst case register count", while the "hot working set" remains a hard limit that can't be trivially undercut without catastrophic preemption.

A bad estimation that results in excessive swapping of the register file may be corrected at runtime by increasing the reserved "hot" working set of the offending shader for future waves. You are aiming to keep the chances of worst-case spillage low, but you do have to accept that worst case in order to reap the benefits of overcommitment.

Based on their slides, AMD's other types of on-chip memory (LDS/L0 cache) are depicted to be physically separate ...
That slide might very well be accurate. But even in the worst case, nothing speaks against memory management performed by a memory management kernel that is operating outside the "normal" visibility constraints. You don't necessarily have only the instructions available that the ISA docs expose as "user visible" either, and it's almost safe to assume that software interrupt routines with elevated privileges have existed for a while now.

Also, if you have some "race condition" in your program where you're continuously modifying certain memory locations (threadgroup or tile) throughout the lifetime of the program, isn't it dangerous (crash/incorrect results) to allocate registers from those other pools of on-chip memory?
No. You should rarely even be able to access a register that's outside the active working set / assigned part of the register file if the shader compiler did its job properly. When you do, or you happen to hit something that has truly been preempted, you will most likely hit some sort of trap state for the wave. And you can only resume once the dependency is resolved, so you were never able to see anything stale. Since this is a rare case, resolving this dependency may very well happen in pure software, just an ordinary interrupt routine.

Other than latency, it's never visible to the wave that it has just caused hot-swapping of parts of the register file, LDS or really any of the other resources that are not tied in directly with the memory hierarchy.

There is only the cache hierarchy backed by system memory, and "registers" and "threadgroup memory" are special ways of addressing, using narrow "local" addresses instead of 64-bit (48/52 bits used) virtual addresses. Everything is backed by system memory, but ideally almost all "register" accesses come out of cache.
You can't assume that it's tied in with the "regular" memory hierarchy just because it's mapped into a single logical address space for the sake of unified instruction encoding. Backing different parts of the logical address space with different fixed functions or hardware pools is not that uncommon, and neither is assigning different retention and mapping strategies based on virtual address range. E.g. direct addressing for control registers, basic offset addressing for registers in the register file with a linear layout, page-table-based addressing for main memory. And not all of those hardware pools need to be tied into a hierarchy by direct interconnects in order to enable preemption or reloading; interrupts can handle all the rare cases where a virtual address from any of those distinct ranges is currently unmapped.

The patent only describes the instructions mentioned above to flag a part of the register file as "discarded" and "soon to be required". Though from what I can skim of the patent, it's not even trying to describe the whole mechanism from the perspective of the user code in a way that would permit an efficient system, but rather focuses on the implementation of the necessary interrupt routine that performs the actual register file compaction. Because there's a tricky little constraint: the control block of a wave only reserves a single offset into the register file (and maybe an upper limit?) rather than a full mapping table, so registers in the register file actually need to be shifted around and control blocks need to be updated in order to achieve contiguous blocks in which a new wave can be placed.
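To illustrate that constraint (names and layout invented, since I've only skimmed the patent myself): if each wave's control block holds just a base offset, freeing a block in the middle means live allocations have to be slid down and the affected bases rewritten before a new wave fits into one contiguous hole.

```cpp
#include <cstdint>
#include <cstring>
#include <vector>

struct WaveControlBlock {
    uint32_t base;   // start of this wave's registers in the physical file
    uint32_t count;  // registers currently owned (no per-register mapping table)
};

// Rough sketch of the compaction step: assumes `waves` is kept sorted by base
// and that the affected waves are held while their registers are moved.
void compact(std::vector<WaveControlBlock>& waves, uint32_t* regfile) {
    uint32_t next_free = 0;
    for (auto& w : waves) {
        if (w.base != next_free) {
            std::memmove(regfile + next_free, regfile + w.base,
                         w.count * sizeof(uint32_t));
            w.base = next_free;  // control block update after the move
        }
        next_free += w.count;
    }
    // Everything above next_free is now one contiguous free region.
}
```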

In the most naive form, only the top of the callstack can be returned to the pool. In fact, I believe this patent was filed to describe a possible pure interrupt-based solution for an older version of RDNA, not RDNA4, and that it was rejected internally at AMD first in that naive form! (And by the way: it's a pure software patent. That's not enforceable in most of the civilized world :) )

Flagging the lower half of the callstack as eligible for preemption in the case of deep callstacks is necessary to improve scheduling decisions by a lot, but that one actually does require a hardware change: instead of just "base" and "upper limit", the control block also needs to track a "currently accessible lower limit" so that a truncated view of the register file can be expressed at all, and accesses outside it trapped.
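A minimal illustration of that control block change, with invented field names:

```cpp
#include <cstdint>

// With only base/upper, everything in [0, upper) must stay resident. Adding a
// "lower" field lets the control block express a truncated view, so an access
// below it can be trapped and resolved instead of silently served stale data.
struct WaveRegWindow {
    uint32_t base;    // physical offset of this wave's allocation
    uint32_t lower;   // logical registers below this may have been preempted
    uint32_t upper;   // logical registers at/above this are not allocated yet
};

// Conceptual per-access check (done by hardware, shown as code for clarity):
bool access_ok(const WaveRegWindow& w, uint32_t logical_reg) {
    return logical_reg >= w.lower && logical_reg < w.upper;  // else: trap the wave
}
```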
 
You can't assume that it's tied in with the "regular" memory hierarchy just because it's mapped into a single logical address space for the sake of unified instruction encoding.
Sorry for the confusion and going somewhat off-topic, but I was referring to what Apple is doing since M3, not AMD. And that's not about "a single logical address space for the sake of unified instruction encoding", but explicitly about dynamically sharing the on-chip memory between registers, threadgroup memory, and cache. They can still be separate logical address spaces.

"And now that register, threadgroup, tile, stack, and buffer data are all cached on chip, this has allowed us to redesign the on-chip memories into fewer larger caches that service all these memory types. This flexibility will benefit shaders that don't make heavy use of each memory type. In the past, if a compute kernel didn't use, for example, threadgroup memory, its corresponding on-chip storage would go completely unused. Now, the on-chip storage will be dynamically assigned to the memory types that are used by your shaders, giving them more on-chip storage than they had in the past, and ultimately, better performance."
 
but explicitly about dynamically sharing the on-chip memory between registers, threadgroup memory, and cache.
Huh, curious that this approach ended up being efficient at all. I mean that this implies that the M3 is backing the register file with a cache with multi-way associativity rather than a simple adder+mux. There's definitely been a trade-off there between better occupancy and increased complexity in the by far hottest path, considering that you usually need multiple registers per instruction and thread. I can't see that paying off, in terms of required transistors per cache line. Maybe in terms of efficiency though, as most of the extra silicon required to give the L0 a sufficient number of ports is inactive on average, as opposed to needing more SRAM cells, especially when assuming a limited associativity.

Amazing how fundamentally different those architectures are in detail, and yet so (relatively) close in terms of power efficiency and design goals. I suppose "transistor count" has actually stopped being a meaningful metric a while back (other than raw wafer cost), and it's since been a race to reduce the number of active transistors per cycle instead, and to willingly accept "padding" transistors all over the place.
 
Huh, curious that this approach ended up being efficient at all. I mean that this implies that the M3 is backing the register file with a cache with multi-way associativity rather than a simple adder+mux. There's definitely been a trade-off there between better occupancy and increased complexity in the by far hottest path, considering that you usually need multiple registers per instruction and thread. I can't see that paying off, in terms of required transistors per cache line. Maybe in terms of efficiency though, as most of the extra silicon required to give the L0 a sufficient number of ports is inactive on average, as opposed to needing more SRAM cells, especially when assuming a limited associativity.
I'd imagine that there would be a small "register L0"/operand reuse cache to cover the majority of register accesses. So a small number of comparators + mux for an SRAM block that's significantly smaller than the traditional GPU register file. And, unlike accesses to global or threadgroup memory, register indexing is uniform across a simdgroup.
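Something like this toy model, purely for illustration (sizes and replacement policy made up; real hardware would do the compare/select per lane in parallel rather than in a software loop):

```cpp
#include <array>
#include <cstdint>

constexpr int kLanes   = 32;   // one simdgroup
constexpr int kEntries = 8;    // tiny operand reuse cache in front of the big store

struct OperandCacheLine {
    bool     valid = false;
    uint16_t reg   = 0;                     // register index (uniform per simdgroup)
    std::array<uint32_t, kLanes> data{};    // one value per lane
};

struct OperandCache {
    std::array<OperandCacheLine, kEntries> lines;
    int victim = 0;                         // dumb round-robin replacement

    const std::array<uint32_t, kLanes>& read(uint16_t reg, const uint32_t* backing) {
        for (auto& l : lines)                             // the "comparators"
            if (l.valid && l.reg == reg) return l.data;   // the "mux" picks the hit
        auto& l = lines[victim];                          // miss: refill from the
        victim = (victim + 1) % kEntries;                 // larger backing SRAM
        l.valid = true;
        l.reg = reg;
        for (int lane = 0; lane < kLanes; ++lane)
            l.data[lane] = backing[reg * kLanes + lane];
        return l.data;
    }
};
```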
 
RDNA4 Instruction Set Architecture Reference Guide was posted over at Anandtech's forum.
My apologies if it has already been posted in the thread.
"3.3.3. Dynamic VGPR Allocation & Deallocation" provides some meat for the dynamic register allocation discussion above.

It seems very similar to the setmaxnreg PTX instruction which Nvidia added in Hopper and which is available on all Blackwell GPUs.
The corresponding SASS instruction is USETMAXREG (https://docs.nvidia.com/cuda/cuda-binary-utilities/index.html#hopper-instruction-set)
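For comparison, here is roughly what that looks like on the NVIDIA side in CUDA C++ with inline PTX (needs sm_90a or newer). The wrapper names and the 40/232 producer/consumer split are just an example of how persistent-kernel code tends to use it, not anything canonical:

```cpp
#include <cstdint>

// Warpgroup-wide register reallocation via the setmaxnreg PTX instruction.
// Every warp of the warpgroup must execute the same call, and the register
// count is an immediate: a multiple of 8 in [24, 256].
template <uint32_t RegCount>
__device__ void warpgroup_reg_dealloc() {  // shrink: return registers to the block's pool
    asm volatile("setmaxnreg.dec.sync.aligned.u32 %0;\n" :: "n"(RegCount));
}

template <uint32_t RegCount>
__device__ void warpgroup_reg_alloc() {    // grow: claim registers from the block's pool
    asm volatile("setmaxnreg.inc.sync.aligned.u32 %0;\n" :: "n"(RegCount));
}

// Example split for a 3-warpgroup (384-thread) block: the data-movement
// warpgroup shrinks itself so the two math warpgroups can grow.
__global__ void kernel() {
    const bool is_producer = (threadIdx.x / 128) == 0;
    if (is_producer) warpgroup_reg_dealloc<40>();
    else             warpgroup_reg_alloc<232>();
    // ... rest of the kernel ...
}
```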
 
It seems the one subtle difference is that:

* Nvidia requires a workgroup (thread block) register budget set at compile time. So the pool is fixed-size and shared only amongst warps within the block.

* AMD seems to require no fixed budget set at compile time, as far as I can tell from the recent LLVM patches. So it seems the VGPRs can flow freely between any active waves of any workgroup on the CU.
 
Yep, it's a *lot* more flexible than NVIDIA's, except for the limitation that it's "all or nothing": your entire WGP/CU only runs kernels with this either enabled or disabled at a time, not both. And the "granularity" of 16 or 32 VGPRs is *GPU wide* (with an associated max of 128 or 256 VGPRs; the fact that you might want >128 is the key reason for the 32 granularity) and cannot be changed per kernel, i.e. you probably have to drain everything and wait for full idle to change it. That's not great, since I can think of good reasons to use >128 VGPRs for matrix-multiplication-like kernels, so the granularity might just end up being 32 most of the time, which is quite big... but understandable to minimise cost.

There's no magic bullet to prevent deadlocks in the general case with threadgroup barriers. There's no automatic spilling or anything - it's not a cache. But S_ALLOC_VGPR can return failure, which the kernel could use to change its behaviour and manually avoid deadlock, e.g. by manually spilling to "3.3.6. Scratch (Private) Memory" to give other waves the chance to get what they need. There's a minimum of 1 allocation (16 or 32 VGPRs) for all waves whether they are active or not, and (optionally?) a guarantee of the full 8 allocations (128 or 256 VGPRs) for at least 1 wave at any time. That feels like it's reserving quite a large % of the register file in the 32-granularity mode, but I didn't calculate it.
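To make the failure-return point concrete, a control-flow sketch with entirely made-up names standing in for S_ALLOC_VGPR and the scratch fallback; only the branch structure is the point:

```cpp
// All names are invented stand-ins, not real intrinsics.
bool try_alloc_vgprs(int blocks);       // stands in for S_ALLOC_VGPR: false if the CU-wide pool is exhausted
void free_vgprs(int blocks);            // return the blocks to the shared pool
void run_fast_path_in_registers();      // variant compiled to keep its working set in VGPRs
void run_slow_path_via_scratch();       // variant that keeps it in scratch (private) memory instead

void register_hungry_stage() {
    if (try_alloc_vgprs(4)) {           // e.g. 4 blocks of 16/32 VGPRs
        run_fast_path_in_registers();
        free_vgprs(4);
    } else {
        // Pool exhausted: fall back instead of waiting, so other waves stuck
        // at the same threadgroup barrier can still make progress.
        run_slow_path_via_scratch();
    }
}
```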

I like it, unlike some other attempts at similar goals.

It remains fairly simple and avoids some limitations of NVIDIA's approach (e.g. if you need 3 warpgroups/384 threads, you can't use all 512 registers, because the allocation granularity per warpgroup is 8, and 512 is not divisible by 3, so you can only use 3 × 168 = 504... I think maybe you can bypass that by launching a 4th warpgroup and not using it, but then that'll influence the compiler's heuristics in other ways, and I'd rather not risk it - not very important, just highlighting how hacky NVIDIA's approach is here compared to AMD's).

For those who have better knowledge of AMD/RDNA's architecture than I do: when they say it doesn't apply to graphics, does that include raytracing? Or is that 'compute' and this is expected to be aggressively used in raytracing? And do they definitely not support this for graphics shaders at all, or might that be redacted from the public ISA because they consider it too limited/complicated to be exposed publicly?
 
For those who have better knowledge of AMD/RDNA's architecture than I do: when they say it doesn't apply to graphics, does that include raytracing? Or is that 'compute' and this is expected to be aggressively used in raytracing? And do they definitely not support this for graphics shaders at all, or might that be redacted from the public ISA because they consider it too limited/complicated to be exposed publicly?
This feature is stated to support only wave32.

AMD seems to default to wave64 for graphics shaders. Meanwhile, RT shaders are compiled as CS, typically wave32.

So most likely a case of “no use case in wave64, don’t bother”.
 