AMD RDNA4 Architecture Speculation

It seems to me that the scheme you describe is more akin to what AMD is doing (as described by the patent here: Register Compaction with Early Release). If Apple indeed operated that way, there would be no change in occupancy on "heavy" shaders and one would still need to do dispatch fine-tuning — but M3 behavior seems to be very different. Some pieces of evidence are the dramatic performance improvements on complex shaders (e.g. Blender — and that's before the hardware RT kicks in) and the Blender patches themselves (where it is mentioned that no dispatch group fine-tuning is needed, as the system will take care of occupancy automatically).
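To put some completely made-up numbers on the occupancy point (register-file size and per-wave counts below are illustrative, not real RDNA4 or Apple figures): with a fixed register file, the wave count is pinned by the shader's worst-case allocation, whereas if registers can be released mid-shader the effective occupancy can rise once the heavy phase is over.

    # Illustrative occupancy arithmetic; all numbers are hypothetical.
    REGISTER_FILE_PER_SIMD = 1024        # total registers available per SIMD
    WORST_CASE_REGS_PER_WAVE = 256       # peak allocation of a "heavy" shader
    STEADY_STATE_REGS_PER_WAVE = 96      # allocation after the heavy phase, if early release works

    static_occupancy = REGISTER_FILE_PER_SIMD // WORST_CASE_REGS_PER_WAVE     # 4 waves
    relaxed_occupancy = REGISTER_FILE_PER_SIMD // STEADY_STATE_REGS_PER_WAVE  # 10 waves

    print(f"occupancy limited by worst case: {static_occupancy} waves")
    print(f"occupancy if registers can be released mid-shader: {relaxed_occupancy} waves")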
How is my description of Apple closer to AMD when they don't seem to have any unified/flexible on-chip memory? Based on their slides, AMD's other types of on-chip memory (LDS/L0 cache) are depicted as physically separate ...

How/where else would they release register memory to? Do they just spill to higher-level memory (to allocate more registers later on) to increase occupancy during mid-shader execution? That doesn't seem very fast or performant ...

It'll be interesting to see what information their ISA docs disclose about this subject ...
All this does raise an interesting question — I don't think we have a very good idea how these things are managed. Getting through multiple layers of marketing BS to the actual technical details can be incredibly tricky. At least my understanding of what Apple does is that they virtualize everything, even register access. So registers are fundamentally backed by system memory and can be cached closer to the SIMDs to improve performance (here is the relevant patent: CACHE CONTROL TO PRESERVE REGISTER DATA). This seems to suggest that you can in principle launch as many waves as you want, just that in some instances the performance will suck since you'll run out of cache. I have no idea how they manage that — there could be a watchdog that monitors cache utilization and occupancy and suspends/launches new waves to optimize things, or it could be a more primitive load-balancing system that operates at the driver level. It is also not clear at all what AMD does. It doesn't seem like their system is fully automatic.
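To sketch what I mean by a watchdog (pure speculation on my part; the thresholds, names, and policy below are all invented, not from any Apple documentation): something in hardware or firmware could periodically compare cache pressure against watermarks and decide whether to bring in more waves or park some.

    # Speculative sketch of a cache-pressure watchdog; entirely hypothetical.
    HIGH_WATERMARK = 0.90   # park waves above this cache utilization
    LOW_WATERMARK = 0.60    # launch more waves below this

    def watchdog_tick(cache_utilization, resident_waves, pending_waves, max_waves):
        """Return (waves_to_launch, waves_to_suspend) for this tick."""
        if cache_utilization > HIGH_WATERMARK and resident_waves > 1:
            return 0, 1    # back off: suspend one wave to relieve cache pressure
        if cache_utilization < LOW_WATERMARK and pending_waves > 0 and resident_waves < max_waves:
            return 1, 0    # headroom available: bring in another wave
        return 0, 0        # steady state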
I imagine that Apple absolutely can launch as many waves as they want at the start, but I think they "starve certain sources of memory by priority" by carving out other pools of on-chip memory before they spill to device memory, so it still fits with their claim that 'occupancy' isn't dependent on worst-case register allocation ...
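Something like this toy allocator is what I have in mind (the pool names and priority order are just my guess, not anything Apple has documented): when a wave needs more register backing than the register cache can hold, walk a priority list of on-chip pools and only fall back to device memory last.

    # Hypothetical priority order for backing register data; pool names are invented.
    SPILL_PRIORITY = ["register_cache", "unused_threadgroup_slice", "unused_tile_slice", "device_memory"]

    def pick_backing_store(bytes_needed, free_bytes):
        """free_bytes: dict mapping pool name -> bytes currently unused in that pool."""
        for pool in SPILL_PRIORITY:
            if free_bytes.get(pool, 0) >= bytes_needed:
                return pool
        return "device_memory"    # last resort: spill off-chip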

Also, if you have some "race condition" in your program where you're continuously modifying certain memory locations (threadgroup or tile) throughout the lifetime of the program, isn't it dangerous (crash/incorrect results) to allocate registers from those other pools of on-chip memory? I know that they can allocate more registers by spilling to higher-level memory, but then performance is a concern again ...
 
The way I understood dr_ribit's description of what Apple is doing (I didn't read the patent) is that there are no other pools of on-chip memory. There is only the cache hierarchy backed by system memory, and "registers" and "threadgroup memory" are special ways of addressing, using narrow "local" addresses instead of 64-bit (48/52 bits used) virtual addresses. Everything is backed by system memory, but ideally almost all "register" accesses come out of cache.
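Roughly, the picture in my head is something like this (the mapping and names below are invented for illustration, not taken from the patent): a narrow "register" or "threadgroup" address is just an offset that gets rebased into the wave's or threadgroup's slice of the ordinary 64-bit virtual address space, and from there it goes through the normal cache hierarchy like any other access.

    # Toy model of narrow local addresses rebased into a 64-bit VA; layout is invented.
    def local_to_virtual(space, local_offset, wave_base_va, tg_base_va):
        """Map a narrow local address to a full virtual address."""
        if space == "register":
            return wave_base_va + local_offset    # per-wave register backing region
        if space == "threadgroup":
            return tg_base_va + local_offset      # per-threadgroup backing region
        raise ValueError("global accesses already carry a full 64-bit virtual address")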

I assume that register "deallocation" is simply a bit flag in the instruction encoding per register input which indicates that this register value is no longer needed, thus the cache entry can be discarded without writing back to system memory.
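In cache terms, I imagine the effect would be something like this (a sketch only; neither the flag nor the cache interface comes from any real ISA documentation): a "last use" bit on a source operand tells the cache that the line backing the register is dead, so it can be invalidated instead of written back.

    # Sketch of a hypothetical "last use" operand flag; not real ISA or cache behavior.
    def read_register_operand(cache, reg_addr, last_use):
        value = cache.read(reg_addr)
        if last_use:
            cache.invalidate(reg_addr)    # drop the line; no write-back to system memory needed
        return value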
 