AMD RDNA4 Architecture Speculation

According to a former Apple graphics engineer, a more accurate description of their "dynamic caching" technology would be dynamic 'deallocation', which allows them to release unified/flexible on-chip memory at runtime depending on which path of execution or branch is taking place within a shader. This does not help them increase occupancy, since their hardware is unable to issue more waves opportunistically ...

Based on AMD's slides about their dynamic register allocation technology, their hardware can have variable occupancy during mid-shader execution, but there's no mention or hint of a 'unified/flexible' on-chip memory pool where allocation can vary between each type of memory (register/tile/buffer/stack) as with Apple's dynamic caching ...

That seems counter to the Apple info.
 
That seems counter to the Apple info.
From how I understood that presentation in the context of his consultation, prior GPU designs had a "maximum fixed amount" of each specific memory type (register/threadgroup/tile) that you could allocate BEFORE spilling to higher-level caches/memory. In this design you usually have unused memory resources depending on the shader/kernel (compute = unused tile memory, graphics = unused threadgroup memory, etc.), and if you wanted to allocate more of a specific memory resource than was possible, you would spill the allocation to slower/higher-latency caches and memory ...

What dynamic caching does is let you flexibly carve out unused memory resources to allocate more memory for the memory types that are in use. Occupancy is improved in the sense that you can avoid more cases of spilling to higher-latency memory, so your shader/kernel spends less time waiting/idling on memory accesses, but otherwise you won't see the hardware launch more waves. It's conceptually similar to Nvidia Volta's unified L1/shared memory pool but it goes one step further and unifies register memory space as well!
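The fixed-partition vs. carved-out-pool distinction described above can be sketched as a toy allocator. All sizes and the allocation policy here are invented for illustration; this is not Apple's actual design.

```python
# Toy model: fixed per-type on-chip partitions vs. one unified pool
# ("dynamic caching"-style). Caps and pool size are made-up numbers.

FIXED = {"register": 64, "threadgroup": 32, "tile": 32}  # KB, illustrative caps

def fixed_alloc(request):
    """Each memory type has its own cap; overflow spills to slower memory."""
    return {t: max(0, need - FIXED[t]) for t, need in request.items()}

def unified_alloc(request, pool_kb=128):
    """One pool: unused capacity of one type can back another type."""
    total = sum(request.values())
    return max(0, total - pool_kb)  # spill only if the whole pool overflows

# A compute kernel: heavy register use, no tile memory in use.
kernel = {"register": 96, "threadgroup": 16, "tile": 0}

print(fixed_alloc(kernel))    # registers overflow the 64 KB cap -> 32 KB spilled
print(unified_alloc(kernel))  # 112 KB total fits in the 128 KB pool -> 0 spilled
```

The same total request spills in the fixed scheme but fits in the unified one, which is the "less time idling on memory accesses, but no extra waves" effect the post describes.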

On AMD, their latest hardware design can apparently dynamically vary the number of waves in flight throughout the execution of a shader/kernel ...
 
GB203 has the same 64MB of L2 as N48 has for IC. The difference is just the additional 8MB of L2 on N48, which doesn't sound like a lot.

N48 also has 24MB in vector registers vs 21MB on GB203. And 6MB of WGP cache/LDS vs 10.5MB SM cache on GB203. Basically a wash in terms of on-chip storage.

One of the biggest differences between the architectures is the scheduler-to-register ratio. Maximum thread occupancy is only 33% higher on N48 (16 vs 12), but it has 3x the register capacity per scheduler. In theory N48 should be much better at keeping its SIMDs fed with work on complex shaders.
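The totals and ratios in the two posts above can be checked back-of-the-envelope. The per-unit figures (64 CUs x 2 SIMD32s x 192 KB VGPR for N48, 84 SMs x 4 partitions x 64 KB for GB203) are public spec numbers; treat them as approximate.

```python
# N48: 64 CUs, 2 SIMD32 schedulers each, 192 KB of VGPRs per SIMD,
# up to 16 waves per SIMD.
n48_schedulers = 64 * 2
n48_regs_kb = n48_schedulers * 192

# GB203: 84 SMs, 4 scheduler partitions each, 64 KB of registers per
# partition, up to 12 warps per partition.
gb203_schedulers = 84 * 4
gb203_regs_kb = gb203_schedulers * 64

print(n48_regs_kb / 1024, "MB vs", gb203_regs_kb / 1024, "MB")  # 24.0 MB vs 21.0 MB
print("registers per scheduler:", 192, "KB vs", 64, "KB")       # the 3x ratio
print("occupancy ceiling ratio:", round(16 / 12, 2))            # ~1.33, i.e. +33%
```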
 
@Lurkmass The vid I linked says on-chip register memory is now dynamically allocated and de-allocated, which allows for higher thread occupancy by being able to schedule more SIMDgroups at the same time. They say prior to Apple family 9 they would have to allocate the worst case in terms of registers from the register file for the entire execution of the shader. For Apple family 9 they show the registers can be dynamically allocated for each part of the program, instead of the worst case.

Maybe I'm misunderstanding the difference you are explaining.
 
The vid I linked says on-chip register memory is now dynamically allocated and de-allocated, which allows for higher thread occupancy by being able to schedule more SIMDgroups at the same time. They say prior to Apple family 9 they would have to allocate the worst case in terms of registers from the register file for the entire execution of the shader.
yes, AMD also does that.
But Apple goes an extra mile and makes the VRF, L1 and shmem a unified SRAM pile that is dynamically allocated at runtime. So while RDNA4 is limited to 192K of VRF, Apple's limit is whatever the maximum allocation available from the shared SRAM slab is.
AMD has a patent for this too but idk when they will implement it.
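The fixed-VRF vs. unified-slab distinction above can be illustrated with made-up numbers. The 192K VRF figure is from the post; the slab size and the amount claimed by L1/shmem below are invented for the sketch.

```python
# How a fixed VRF caps waves-in-flight versus a unified SRAM slab that
# can hand more of its capacity to registers. Slab size and the L1/shmem
# reservation are illustrative, not real Apple figures.

def max_waves_fixed(regs_per_wave_kb, vrf_kb=192):
    """Fixed register file: only the VRF can hold registers."""
    return vrf_kb // regs_per_wave_kb

def max_waves_unified(regs_per_wave_kb, slab_kb=256, other_use_kb=32):
    """Unified slab: whatever L1/shmem isn't using can back registers."""
    return (slab_kb - other_use_kb) // regs_per_wave_kb

heavy = 24  # KB of registers per wave, a register-hungry shader
print(max_waves_fixed(heavy))    # 192 // 24 = 8 waves
print(max_waves_unified(heavy))  # (256 - 32) // 24 = 9 waves
```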
In theory N48 should be much better at keeping its SIMDs fed with work on complex shaders.
yeah that's how it does RTRT pretty much.
Mind that RDNA4 does not have a FF BVH walker the way Nvidia or Intel have them.
 
This seems like a semantics argument.

TSMC N4 is claimed (at least by TSMC) to be an iterative node enhancement with density, efficiency, and performance gains.

TSMC 4N, despite the naming, is from all reporting just a customization of TSMC N5.
TSMC lists N4 as a member of the 5nm node family. N4 is a generic improvement of N5; 4N is a custom improvement of N5. There is no reason to believe one or the other is better until proven otherwise. In fact, the power efficiency of N4 is worse than e.g. N5P (as stated by TSMC).
 
They're all kinda custom given the amount of DTCO involved and custom metal stacks.
4N is basically NVIDIA's customization of N5 (4N basically means "For NVIDIA"). However, TSMC says N4P is a better version of N5: it's said that N4P has 6% higher transistor density and 6% higher performance than N5 (or 22% better efficiency than N5).
 
4N is basically NVIDIA's customization of N5 (4N basically means "For NVIDIA"). However, TSMC says N4P is a better version of N5: it's said that N4P has 6% higher transistor density and 6% higher performance than N5 (or 22% better efficiency than N5).
They all use the latest PDKs available, so whatever NV uses is N4P too. PDKs are generally forwards-compatible within a given node family (AMD is using N5X/N4X overdrive xtor tunings for desktop CPU parts even if the A0 TO was N5P/N4P, etc.).

Either way N48 is N4C (or at least TPU says so), aka the cost-down version with fewer masks etc.
 
TPU is wrong though, Andreas Schilling (from HardwareLuxx) got it straight from AMD that it's N4P.
Not sure, Hardwareluxx made mistakes before too.
I'll wait for a die teardown by, let's say, people.

Either way all client gfx shipped this year will be some kind of N5/N4 family derivative.
 
4N is basically NVIDIA's customization of N5 (4N basically means "For NVIDIA"). However, TSMC says N4P is a better version of N5: it's said that N4P has 6% higher transistor density and 6% higher performance than N5 (or 22% better efficiency than N5).
ALL major players customize the process to some extent. We have literally zero information confirming or even suggesting 4N is any different or more customized or whatever. NVIDIA wanting to call their process with their own name doesn't tell anything.
 
ALL major players customize the process to some extent. We have literally zero information confirming or even suggesting 4N is any different or more customized or whatever. NVIDIA wanting to call their process with their own name doesn't tell anything.
We are in agreement here, I think N4P is better (possibly significantly better) than 4N.
 
We are in agreement here, I think N4P is better (possibly significantly better) than 4N.
Could be, or it could be literally the same thing with a few tweaks here and there, or anything else really. We can't even be certain all 4Ns are literally the same process; they could have moved to new PDKs, for example, as suggested earlier.
 
Is the AI accelerator a new hardware block or similar to RDNA 3 WMMA on the main vector unit? Hard to tell from the diagram
According to Osvaldo, both RT and ML are still shared, just more efficiently.

This confirms AMD sticking to their guns in RT & ML accel: both still rely heavily on shared CU resources (VGPRs & SIMD32s), but the latter have big improvements (better allocation, block moves etc.) so sharing is more efficient.

 
@Lurkmass The vid I linked says on-chip register memory is now dynamically allocated and de-allocated, which allows for higher thread occupancy by being able to to schedule more SIMDgroups at the same time. They say prior to Apple family 9 they would have to allocate the worst-case in terms of registers from the register file for the entire execution of the shader. For Apple family 9 they show the registers can be dynamically allocated for each part of the program, instead of the worst case.

Maybe I'm misunderstanding the difference you are explaining.
How I interpret it is that Apple family 9 GPUs may naturally raise the floor in terms of thread occupancy, depending on what the compiler does, because we can allocate more register memory (from other sources of memory types) than what our limited register file sizes would normally allow, so that we would start with a higher initial baseline number of SIMDgroups at the start of execution as opposed to prior hardware generations ...

An implied design point behind dynamic caching is that there is a "one-way decay" model at play, where register memory can only be 'demoted' to other (tile/threadgroup/cache) memory types and can't be promoted back into register memory during execution ...

These constraints would seem to line up with the information given by the ex-Apple employee (still needing to allocate for the worst case at the start & no mid-shader increase in available registers), hence the conclusion that freeing up register memory does not change the number of SIMDgroups in flight, because that released memory is then reused to allocate other memory resources ...
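This "one-way decay" reading can be sketched as a tiny state machine: registers are allocated worst-case at launch, can be demoted to back other memory types mid-shader, but can never be re-promoted. Purely illustrative, not Apple's actual hardware behavior.

```python
# One-way decay sketch: released register memory backs other memory
# types, but demoted capacity can't become registers again mid-shader.

class OnChipPool:
    def __init__(self, register_kb):
        self.register_kb = register_kb   # worst-case allocation at launch
        self.demoted_kb = 0              # released registers reused elsewhere

    def release_registers(self, kb):
        """A branch no longer needs these registers: demote them."""
        kb = min(kb, self.register_kb)
        self.register_kb -= kb
        self.demoted_kb += kb

    def grow_registers(self, kb):
        """One-way decay: demoted memory can't be re-promoted."""
        raise RuntimeError("registers cannot be re-promoted mid-shader")

pool = OnChipPool(register_kb=96)
pool.release_registers(32)                # 32 KB freed for tile/threadgroup use
print(pool.register_kb, pool.demoted_kb)  # 64 32
```

Because `grow_registers` can never succeed, the SIMDgroup count fixed at launch never rises mid-shader, matching the ex-employee's "no opportunistic extra waves" point.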
 
According to Osvaldo, both RT and ML are still shared, just more efficiently.

RDNA 4 CU tensor throughput matches Blackwell’s SM for all formats (no FP4 though). That’s without the use of dedicated ALUs. It’s a very elegant design. Will be interesting to see benchmarks of mixed tensor and standard compute workloads. The 9070 XT has more TOPS than the 5070 Ti.
 