AMD RDNA4 Architecture Speculation

This seems like a semantics argument.

TSMC N4 is (at least as claimed by TSMC) an iterative node enhancement with density, efficiency, and performance gains.

TSMC 4N, despite the naming, is by all reporting just a customization of TSMC N5.
TSMC lists N4 as a member of the 5nm node family. N4 is a generic improvement of N5; 4N is a custom improvement of N5. There is no reason to believe one or the other is better until proven otherwise. In fact, the power efficiency of N4 is worse than that of e.g. N5P (as stated by TSMC).
 
They're all kinda custom given the amount of DTCO involved and custom metal stacks.
4N is basically NVIDIA's customization of N5 (4N basically means "For NVIDIA"). However, TSMC says N4P is an improved version of N5: it's said to have 6% higher transistor density and 6% higher performance than N5 (or 22% better power efficiency than N5).
 
4N is basically NVIDIA's customization of N5 (4N basically means "For NVIDIA"). However, TSMC says N4P is an improved version of N5: it's said to have 6% higher transistor density and 6% higher performance than N5 (or 22% better power efficiency than N5).
They all use the latest PDKs available, so whatever NV uses is effectively N4P too. PDKs are generally forward-compatible within a given node family (AMD is using N5X/N4X overdrive xtor tunings for desktop CPU parts even if the A0 tape-out was N5P/N4P, etc.).

Either way, N48 is N4C (or at least TPU says so), aka the cost-down version with fewer masks etc.
 
TPU is wrong though; Andreas Schilling (from HardwareLuxx) got it straight from AMD that it's N4P.
Not sure; Hardwareluxx has made mistakes before too.
I'll wait for a die teardown by, let's say, people.

Either way, all client gfx shipped this year will be some kind of N5/N4-family derivative.
 
4N is basically NVIDIA's customization of N5 (4N basically means "For NVIDIA"). However, TSMC says N4P is an improved version of N5: it's said to have 6% higher transistor density and 6% higher performance than N5 (or 22% better power efficiency than N5).
ALL major players customize the process to some extent. We have literally zero information confirming or even suggesting 4N is any different or more customized or whatever. NVIDIA wanting to call the process by their own name doesn't tell us anything.
 
ALL major players customize the process to some extent. We have literally zero information confirming or even suggesting 4N is any different or more customized or whatever. NVIDIA wanting to call the process by their own name doesn't tell us anything.
We are in agreement here; I think N4P is better (possibly significantly better) than 4N.
 
We are in agreement here; I think N4P is better (possibly significantly better) than 4N.
Could be, or it could be literally the same thing with a few tweaks here and there, or anything else really. We can't even be certain all 4Ns are literally the same process; they could have moved to new PDKs, for example, as suggested earlier.
 
Is the AI accelerator a new hardware block, or similar to RDNA 3 WMMA on the main vector unit? Hard to tell from the diagram.
According to Osvaldo, both RT and ML are still shared, just more efficiently.

This confirms AMD sticking to their guns on RT & ML accel: both still rely heavily on shared CU resources (VGPRs & SIMD32s), but the latter have big improvements (better allocation, block moves, etc.), so sharing is more efficient.

 
@Lurkmass The vid I linked says on-chip register memory is now dynamically allocated and de-allocated, which allows for higher thread occupancy by being able to schedule more SIMDgroups at the same time. They say that prior to Apple family 9 they would have to allocate the worst case in terms of registers from the register file for the entire execution of the shader. For Apple family 9 they show that registers can be dynamically allocated for each part of the program, instead of for the worst case.

Maybe I'm misunderstanding the difference you are explaining.
How I interpret it is that Apple family 9 GPUs may naturally raise the floor in terms of thread occupancy, depending on what the compiler does, because we can allocate more register memory (from other sources of memory types) than what our limited register file sizes would normally allow, so we would start with a higher initial baseline number of SIMDgroups at the start of execution as opposed to prior hardware generations ...

An implied design point behind dynamic caching is that there is a "one-way decay" model at play where register memory can only be 'demoted' to other (tile/threadgroup/cache) memory types, and you can't promote it back into register memory during execution ...

These constraints would seem to line up with the information (still needing to allocate for the worst case at the start & no mid-shader increase in available registers) given by the ex-Apple employee, hence the conclusion that freeing up register memory does not change the number of SIMDgroups in flight, because that released memory is then reused to allocate other memory resources ...
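As a toy illustration of the worst-case-vs-per-phase point (made-up register counts and register file size; this is not Apple's actual allocator, and it ignores the other memory types registers can be demoted to):

Code:
# How many SIMDgroups fit in a register file when allocation is
# worst-case-for-the-whole-shader vs. only what the launch phase needs.
REGISTER_FILE_WORDS = 4096   # hypothetical per-core register file capacity
SIMDGROUP_WIDTH = 32         # threads per SIMDgroup

def max_simdgroups(regs_per_thread):
    return REGISTER_FILE_WORDS // (regs_per_thread * SIMDGROUP_WIDTH)

live_regs_by_phase = [24, 96, 40]   # live registers per thread, per shader phase

# Pre-family-9 style: reserve the worst case for the entire execution
print(max_simdgroups(max(live_regs_by_phase)))   # 1 resident SIMDgroup

# Family-9 style: launch occupancy is set by what is actually live up front,
# so more SIMDgroups can be scheduled at the start of execution
print(max_simdgroups(live_regs_by_phase[0]))     # 5 resident SIMDgroups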
 
According to Osvaldo, both RT and ML are still shared, just more efficiently.

RDNA 4 CU tensor throughput matches Blackwell’s SM for all formats (no FP4 though). That’s without the use of dedicated ALUs. It’s a very elegant design. Will be interesting to see benchmarks of mixed tensor and standard compute workloads. The 9070 XT has more TOPS than the 5070 Ti.
 
How are they getting the same throughput then?

I’m going off AMD’s numbers, but I’m not sure how they’re calculated. GB203 does 1024 FP16 ops per clock per SM, and an N48 CU hits the same number with only 128 FP32 ALUs. They’re squeezing 8 FP16 ops out of each ALU. This is without sparsity. Black magic maybe.
 
I’m going off AMD’s numbers, but I’m not sure how they’re calculated. GB203 does 1024 FP16 ops per clock per SM, and an N48 CU hits the same number with only 128 FP32 ALUs. They’re squeezing 8 FP16 ops out of each ALU. This is without sparsity. Black magic maybe.
RDNA 3 was “256 FP16 ops/clk” by virtue of doing 64 Dot2 per clock. Each Dot2 is worth 4 ops (C = C + (A1*B1 + A2*B2)).

It is probably done as many passes of two fp16x2 operands + one 32-bit operand, repeatedly fed into the Dot2 ALU. That would use half of the theoretical maximum operand bandwidth, which is six 32-bit operands.

So what likely happened with RDNA 4 is that WMMA gets rearranged with an improved mixed-precision dot product ALU — Dot4 for FP16 (8 ops/clk), Dot8 for FP8 (16 ops/clk), Dot16 for INT4 (32 ops/clk). This could be tapping off the max operand bandwidth in the same way as VOPD “dual issue”. For example, FP16 Dot4 takes two VGPR pairs (two fp16x4) + one 32-bit operand.

This still puts RDNA 4 at half the ops/clk of CDNA 3. Though IMO this is in line with expectation — CDNA 2/3 CU had doubled the operand bandwidth to support full rate FP64 / 2xFP32. This also meant they can feed twice the packed data into the dot product ALUs.
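To sanity-check the op counting above, here's a quick back-of-the-envelope calculator (plain Python, purely illustrative; the 128-lane figure, the Dot4/Dot8/Dot16 widths, and the ~2.97 GHz / 64 CU numbers for a 9070 XT-class part are assumptions from this thread, not confirmed specs):

Code:
# Ops per clock for packed dot-product ALUs: one Dot-N is N multiplies
# plus N accumulate adds, i.e. 2*N ops per lane per clock.
def cu_ops_per_clk(lanes, dot_width):
    return lanes * 2 * dot_width

# RDNA 3 WMMA, FP16: 64 Dot2 issued per clock per CU
print(cu_ops_per_clk(64, 2))     # 256 ops/clk, matching the figure above

# Speculated RDNA 4, FP16: Dot4 fed VOPD-style across 128 lanes
print(cu_ops_per_clk(128, 4))    # 1024 ops/clk, matching AMD's headline number

# Same scaling for the narrower formats
print(cu_ops_per_clk(128, 8))    # FP8  Dot8  -> 2048 ops/clk
print(cu_ops_per_clk(128, 16))   # INT4 Dot16 -> 4096 ops/clk

# Rough dense FP16 TOPS for a 64-CU part at ~2.97 GHz boost
print(64 * cu_ops_per_clk(128, 4) * 2.97e9 / 1e12)   # ~195 TOPS

Presumably the four-digit "AI TOPS" marketing figures then come from the INT4 rate plus sparsity, but that's my reading of the scaling, not a confirmed breakdown.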
 
I’m going off AMD’s numbers, but I’m not sure how they’re calculated. GB203 does 1024 FP16 ops per clock per SM, and an N48 CU hits the same number with only 128 FP32 ALUs. They’re squeezing 8 FP16 ops out of each ALU. This is without sparsity. Black magic maybe.
Right, magic.
They have additional ALUs for that. The difference with Nvidia (probably, I'm not entirely sure how Nvidia does that in Blackwell either) is that in AMD's case the ALUs are a part of the SIMDs, while in Nvidia's SM they are a separate "SIMD" from the shading one - which presumably allows Nvidia to run both workloads simultaneously - but I doubt that it happens a lot in practice due to bandwidth limitations.
 
while in Nvidia's SM they are a separate "SIMD" from the shading one - which presumably allows Nvidia to run both workloads simultaneously
They very evidently aren't, which is why NV moved to 1 GEMM core per SMSP since Ampere.
Scheduling is a PITA!
 
Right, magic.
They have additional ALUs for that. The difference with Nvidia (probably, I'm not entirely sure how Nvidia does that in Blackwell either) is that in AMD's case the ALUs are a part of the SIMDs, while in Nvidia's SM they are a separate "SIMD" from the shading one - which presumably allows Nvidia to run both workloads simultaneously - but I doubt that it happens a lot in practice due to bandwidth limitations.
I guess in RDNA4 the WMMA ops are handled by an actual dedicated ALU, but very tightly integrated with the vector units, i.e. sharing the same data path and issue port, so concurrent execution is not possible, just like on RDNA3, but the capabilities are significantly enhanced -- extended type support and sparsity.
 