I have no idea how they manage that — there could be a watchdog that monitors cache utilization and occupancy and suspends/launches new waves to optimize things, or it could be a more primitive load-balancing system that operates at the driver level. It is also not at all clear what AMD does.
Most likely both require assistance from the shader compiler to flag the lower part of the stack as "preemptable", so rather than a hard register count, you now have a peak working-set size, plus a total register count for the deepest part of the call tree. In the worst case this requires hoisting a couple of registers from a lower page to an upper page on call, to minimize the hot working set by compacting the active registers.
You would want 3 instructions aiding this:
- (Optional) Flagging a higher part of the virtual register file as "soon to be required" - that's the first thing you do in the preamble of a function call. When this instruction isn't used, the first "bad" register access needs to send the wave into a trap state.
- (Optional) Flagging a lower part of the virtual register file as "cold" - permit or even encourage preemption, but contents must be preserved. This is the last thing you do in the preamble of a call, after hoisting all live registers into the new hot set. When this instruction isn't used, there is no voluntary preemption, and bad scheduling decisions may have to be resolved by essentially randomized preemption to break deadlocks.
- Flagging a part of the register file as "discarded" - that's what you do in the footer of a function call.
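The three flagging instructions above can be sketched as a toy software model. This is purely illustrative: the names (`prefetch_regs`, `release_cold`, `discard_regs`) and the per-register state tracking are my invention, not anything from an actual ISA.

```python
# Toy model of the three hypothetical register-file flagging instructions.
# States a virtual register can be in from the scheduler's point of view.
HOT, COLD, DISCARDED, UNMAPPED = "hot", "cold", "discarded", "unmapped"

class VirtualRegisterFile:
    def __init__(self, size):
        self.state = [UNMAPPED] * size

    def prefetch_regs(self, lo, hi):
        """Flag [lo, hi) as 'soon to be required' (first thing in the preamble)."""
        for r in range(lo, hi):
            self.state[r] = HOT

    def release_cold(self, lo, hi):
        """Flag [lo, hi) as cold: preemptible, but contents preserved
        (last thing in the preamble, after hoisting live registers)."""
        for r in range(lo, hi):
            self.state[r] = COLD

    def discard_regs(self, lo, hi):
        """Flag [lo, hi) as discarded: contents dead (function footer)."""
        for r in range(lo, hi):
            self.state[r] = DISCARDED

    def access(self, r):
        """A 'bad' access to a non-resident register traps the wave."""
        if self.state[r] in (UNMAPPED, DISCARDED):
            raise RuntimeError(f"wave trapped: register v{r} not resident")
        return self.state[r]

vrf = VirtualRegisterFile(64)
vrf.prefetch_regs(16, 32)   # preamble: upper page soon required
vrf.release_cold(0, 16)     # preamble: lower page now cold
print(vrf.access(20))       # resident, fine
vrf.discard_regs(16, 32)    # footer: callee registers dead
```

Accessing a discarded or unmapped register here raises, standing in for the trap state a wave would enter until the interrupt routine resolves the dependency.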
Not all functions necessarily need to do all the flagging; that would be wasteful. Only recursive functions (after a set number of folded recursions) or functions with a huge individual register footprint warrant taking the risk of dynamic register allocation.
I don't think the shader compiler can properly estimate what share of its time the shader spends at the "big" versus the "small" working-set size, so it's a static scheduling decision to allow e.g. 50%, 100% or even 200% overcommitment of the "worst case register count". The "hot working set", however, remains a hard limit that can't be trivially violated without catastrophic preemption.
A bad estimate that results in excessive swapping of the register file may be corrected at runtime by increasing the reserved "hot" working set of the offending shader for future waves. You are aiming to keep the chances of worst-case spilling low, but you do have to accept that worst case in order to reap the benefits of overcommitment.
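The occupancy effect of this static overcommitment decision can be sketched with a back-of-envelope calculation. All numbers (physical register count, per-wave footprints) are made up for illustration; the point is just how guaranteeing only the hot set lets more waves fit.

```python
# Hypothetical occupancy math for static overcommitment of the register file.
def waves_that_fit(phys_regs, worst_case, hot_set, overcommit):
    """overcommit = 0.0 / 0.5 / 1.0 / 2.0 for 0% / 50% / 100% / 200% extra.
    Each wave is budgeted its worst case shrunk by the overcommit factor,
    but never less than its hot working set (the hard limit)."""
    budget = max(hot_set, int(worst_case / (1.0 + overcommit)))
    return phys_regs // budget

PHYS = 1024           # physical registers per SIMD (invented number)
WORST, HOT = 256, 64  # deepest call tree vs. peak hot working set

for oc in (0.0, 0.5, 1.0, 2.0):
    n = waves_that_fit(PHYS, WORST, HOT, oc)
    print(f"{int(oc * 100):>3}% overcommit -> {n} waves resident")
```

With these invented numbers, going from 0% to 200% overcommitment triples the resident wave count, at the cost of accepting that the worst case (every wave simultaneously at its deepest call) forces swapping.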
Based on their slides, AMD's other types of on-chip memory (LDS/L0 cache) are depicted as physically separate ...
That slide might very well be accurate. But even in the worst case, nothing speaks against memory management performed by a memory-management kernel operating outside the "normal" visibility constraints. You don't necessarily have only the instructions that the ISA docs expose as "user visible" either, and it's almost safe to assume that software interrupt routines with elevated privileges have existed for a while now.
Also, if you have some "race condition" in your program where you're continuously modifying certain memory locations (threadgroup or tile) throughout the lifetime of the program, isn't it dangerous (crash/incorrect results) to allocate registers from other pools of on-chip memory?
No. You should rarely even be able to access a register that's outside the active working set / assigned part of the register file if the shader compiler did its job properly. When you do, or when you happen to hit something that has truly been preempted, you will most likely land in some sort of trap state for the wave. And you can only resume once the dependency is resolved, so you were never able to see anything stale. Since this is a rare case, resolving the dependency may very well happen in pure software, as an ordinary interrupt routine.
Other than latency, it's never visible to the wave that it has just caused hot-swapping of parts of the register file, LDS, or any of the other resources that are not tied directly into the memory hierarchy.
There is only the cache hierarchy backed by system memory, and "registers" and "threadgroup memory" are special ways of addressing, using narrow "local" addresses instead of 64-bit (48/52 bits used) virtual addresses. Everything is backed by system memory, but ideally almost all "register" accesses come out of cache.
You can't assume that it's tied into the "regular" memory hierarchy just because it's mapped into a single logical address space for the sake of unified instruction encoding. Backing different parts of the logical address space with different fixed functions or hardware pools is not that uncommon, and neither is assigning different retention and mapping strategies based on virtual address range. E.g. direct addressing for control registers, basic offset addressing for registers in the register file with a linear layout, page-table-based addressing for main memory. And not all those hardware pools need to be tied into a hierarchy by direct interconnects in order to enable preemption or reloading; interrupts can handle all the rare cases where a virtual address from any of those distinct ranges is currently unmapped.
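The range-based backing described above amounts to a simple decode step. The following sketch shows the idea with entirely invented address ranges and pool names; real hardware partitions the space differently and does this in fixed-function logic, not software.

```python
# Toy decoder: one flat logical address space whose ranges are backed by
# different pools with different addressing/translation strategies.
RANGES = [
    (0x0000_0000, 0x0000_1000, "control registers (direct addressing)"),
    (0x0000_1000, 0x0001_0000, "register file (base + offset, linear layout)"),
    (0x0001_0000, 0x0010_0000, "LDS / threadgroup memory"),
    (0x0010_0000, 1 << 48,     "main memory (page-table translated)"),
]

def decode(addr):
    """Map a logical address to its backing pool; unmapped -> interrupt."""
    for lo, hi, pool in RANGES:
        if lo <= addr < hi:
            return pool
    # Rare case: no pool currently maps this address, software resolves it.
    return "fault -> interrupt routine"

print(decode(0x0000_1040))   # lands in the register-file range
print(decode(0x8000_0000))   # lands in the page-table-translated range
```

The key point the sketch captures: the pools behind each range need no direct interconnect between each other; only the decode and the fault path have to be uniform.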
The patent only describes the instructions mentioned above that flag a part of the register file as "discarded" or "soon to be required". And from what I can skim of it, the patent isn't even trying to describe the whole mechanism from the perspective of the user code in a way that would permit an efficient system; rather it focuses on the implementation of the necessary interrupt routine that performs the actual register file compaction. Because there's a tricky little constraint: the control block of a wave only reserves a single offset into the register file (and maybe an upper limit?) rather than a full mapping table, so registers actually need to be shifted around and control blocks updated in order to achieve contiguous blocks in which a new wave can be placed.
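The compaction constraint can be illustrated with a small simulation. Because each wave's control block holds only a single base offset, freeing space means physically sliding live blocks together and patching those offsets; the structure below is hypothetical, not taken from the patent.

```python
# Sketch of register-file compaction when each wave's control block stores
# only a single base offset (no per-page mapping table).
def compact(blocks):
    """blocks: list of (wave_id, base, size) live allocations, possibly
    with holes between them. Returns the shifted blocks plus the first
    free physical register after compaction."""
    blocks = sorted(blocks, key=lambda b: b[1])
    cursor, moved = 0, []
    for wave_id, base, size in blocks:
        # Slide the block down to close the hole below it; the interrupt
        # routine would copy the registers and rewrite the control block's
        # base offset to 'cursor'.
        moved.append((wave_id, cursor, size))
        cursor += size
    return moved, cursor

live = [("wave0", 0, 32), ("wave2", 96, 48), ("wave5", 200, 16)]
moved, free_base = compact(live)
print(moved)      # blocks now contiguous from offset 0
print(free_base)  # everything above this is one contiguous free run
```

Without the compaction step, the holes at 32..96 and 144..200 would be individually too fragmented to place a large new wave, even though the total free space suffices.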
In the most naive form only the top of the call stack can be returned to the pool. In fact, I believe this patent was filed to describe a possible pure interrupt-based solution for an older version of RDNA, not RDNA4. And also that it was rejected internally at AMD first in that naive form! (And by the way: it's a pure software patent. That's not enforceable in most of the civilized world.)
Because flagging the lower half of the call stack as eligible for preemption in the case of deep call stacks is necessary to improve scheduling decisions by a lot, but that one actually does require a hardware change: instead of just "base" and "upper limit", the control block also needs to track a "currently accessible lower limit", so that a truncated view of the register file can be expressed at all, and thus be trapped on.
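The hardware change argued for here boils down to one extra bounds check. A minimal sketch, with invented field names, of a control block that tracks all three values and traps on accesses into the preempted lower window:

```python
# Sketch of a wave control block extended with a "currently accessible
# lower limit", enabling a truncated view of the register file.
class ControlBlock:
    def __init__(self, base, upper_limit):
        self.base = base                # single offset into physical file
        self.upper_limit = upper_limit  # registers reserved for this wave
        self.lower_limit = 0            # everything accessible initially

    def truncate(self, new_lower):
        """Lower part [0, new_lower) flagged cold and handed back for
        preemption; accesses below this now trap."""
        self.lower_limit = new_lower

    def check(self, reg):
        """Bounds check performed on every register access."""
        if not (self.lower_limit <= reg < self.upper_limit):
            raise RuntimeError(
                f"trap: v{reg} outside accessible window "
                f"[{self.lower_limit}, {self.upper_limit})")
        return self.base + reg  # translated physical register index

cb = ControlBlock(base=128, upper_limit=64)
cb.truncate(16)        # lower 16 registers preempted away
print(cb.check(20))    # still resident, translates normally
# cb.check(4) would trap until the preempted page is restored
```

With only "base" and "upper limit", the `truncate` step is inexpressible, which is exactly why the naive interrupt-only scheme can return just the top of the call stack to the pool.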