AMD RDNA4 Architecture Speculation

How are they getting the same throughput then?

I’m going off AMD’s numbers, but I’m not sure how they’re calculated. GB203 does 1024 FP16 ops per clock per SM, and an N48 CU hits the same number with only 128 FP32 ALUs. That’s squeezing 8 FP16 ops out of each ALU, without sparsity. Black magic, maybe.
 
I’m going off AMD’s numbers, but I’m not sure how they’re calculated. GB203 does 1024 FP16 ops per clock per SM, and an N48 CU hits the same number with only 128 FP32 ALUs. That’s squeezing 8 FP16 ops out of each ALU, without sparsity. Black magic, maybe.
RDNA 3 was “256 FP16 ops/clk” by virtue of doing 64 Dot2 per clock. Each Dot2 is worth 4 ops (C = C + (A1*B1 + A2*B2)).

It was probably done as repeated passes of two fp16x2 operands plus one 32-bit operand fed into the Dot2 ALU. That would use half of the theoretical maximum operand bandwidth of six 32-bit operands.

So what likely happened with RDNA 4 is that WMMA got rearranged around an improved mixed-precision dot-product ALU: Dot4 for FP16 (8 ops/clk), Dot8 for FP8 (16 ops/clk), Dot16 for INT4 (32 ops/clk). This could be tapping the maximum operand bandwidth in the same way VOPD “dual issue” does. For example, FP16 Dot4 takes two VGPR pairs (two fp16x4) plus one 32-bit operand.

This still puts RDNA 4 at half the ops/clk of CDNA 3, though IMO that is in line with expectations: the CDNA 2/3 CU doubled the operand bandwidth to support full-rate FP64 / 2x FP32, which also means it can feed twice the packed data into the dot-product ALUs.
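The throughput arithmetic above can be sketched in a few lines. This is a hedged illustration, not hardware documentation: the lane and ALU counts are the ones quoted in this thread, and the DotN functions are a scalar stand-in for the packed instructions being speculated about.

```python
# Scalar sketch of the packed dot-product forms discussed above.
# A DotN instruction computes C = C + sum(A_i * B_i) over N packed
# low-precision pairs, and counts as 2*N ops (N multiplies + N adds).

def dotn(acc, a, b):
    """Dot2 when len(a) == 2, Dot4 when len(a) == 4, and so on."""
    assert len(a) == len(b)
    for x, y in zip(a, b):
        acc += x * y
    return acc

def ops_per_issue(n):
    # N multiplies + N adds per DotN issue
    return 2 * n

# RDNA 3: 64 Dot2 per clock -> 64 * 4 = 256 FP16 ops/clk (the quoted figure)
print(64 * ops_per_issue(2))    # 256
# Speculated RDNA 4 CU: 128 ALUs each doing one Dot4 per clock
print(128 * ops_per_issue(4))   # 1024 FP16 ops/clk
print(128 * ops_per_issue(8))   # 2048 FP8 ops/clk (Dot8)
print(128 * ops_per_issue(16))  # 4096 INT4 ops/clk (Dot16)
```

Under this reading, nothing exotic is needed to hit the 1024 figure: 128 ALUs times 8 ops per Dot4 issue gets there directly, and CDNA 3's doubled operand bandwidth would simply double each of these numbers.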
 
I’m going off AMD’s numbers, but I’m not sure how they’re calculated. GB203 does 1024 FP16 ops per clock per SM, and an N48 CU hits the same number with only 128 FP32 ALUs. That’s squeezing 8 FP16 ops out of each ALU, without sparsity. Black magic, maybe.
Right, magic.
They have additional ALUs for that. The difference from Nvidia (probably; I'm not entirely sure how Nvidia does it in Blackwell either) is that in AMD's case the ALUs are part of the SIMDs, while in Nvidia's SM they are a "SIMD" separate from the shading one, which presumably allows Nvidia to run both workloads simultaneously. I doubt that happens a lot in practice, though, due to bandwidth limitations.
 
Right, magic.
They have additional ALUs for that. The difference from Nvidia (probably; I'm not entirely sure how Nvidia does it in Blackwell either) is that in AMD's case the ALUs are part of the SIMDs, while in Nvidia's SM they are a "SIMD" separate from the shading one, which presumably allows Nvidia to run both workloads simultaneously. I doubt that happens a lot in practice, though, due to bandwidth limitations.
I guess in RDNA4 the WMMA ops are handled by an actual dedicated ALU, but one very tightly integrated with the vector units, i.e. sharing the same data path and issue port, so concurrent execution is not possible, just like on RDNA3; but the capabilities are significantly enhanced: extended type support and sparsity.
 
I guess in RDNA4 the WMMA ops are handled by an actual dedicated ALU, but one very tightly integrated with the vector units, i.e. sharing the same data path and issue port, so concurrent execution is not possible, just like on RDNA3; but the capabilities are significantly enhanced: extended type support and sparsity.
Naaa, if AMD wanted a dedicated GEMM block they could've done MFMA without DPFP/SPFP GEMM support.
 
They have additional ALUs for that.

Doesn’t seem that way. The Dot4 explanation above makes more sense.

The difference from Nvidia (probably; I'm not entirely sure how Nvidia does it in Blackwell either) is that in AMD's case the ALUs are part of the SIMDs, while in Nvidia's SM they are a "SIMD" separate from the shading one, which presumably allows Nvidia to run both workloads simultaneously. I doubt that happens a lot in practice, though, due to bandwidth limitations.

Yeah.
 
Right, magic.
They have additional ALUs for that. The difference from Nvidia (probably; I'm not entirely sure how Nvidia does it in Blackwell either) is that in AMD's case the ALUs are part of the SIMDs, while in Nvidia's SM they are a "SIMD" separate from the shading one, which presumably allows Nvidia to run both workloads simultaneously. I doubt that happens a lot in practice, though, due to bandwidth limitations.
A warp-wide 16x8x16 HMMA instruction with 16-bit accumulation should take 16 cycles on GB20x. It takes 8 registers as input and 2 as output. With 32-bit accumulation it takes 32 cycles, but only 2 more registers as input and output. You should be able to issue instructions to other pipelines (including the F32 ALUs) while the tensor cores are doing their work; there's plenty of register bandwidth left over.
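As a back-of-envelope check on that claim, the average register traffic works out like this. The cycle and register counts are the ones from the post above; reading "2 more registers" as +2 input and +2 output for the 32-bit case is my interpretation, not a published figure.

```python
# Average warp-register traffic per cycle for the two HMMA cases above.
def regs_per_cycle(in_regs, out_regs, cycles):
    return (in_regs + out_regs) / cycles

# 16x8x16 HMMA, FP16 accumulate: 8 in + 2 out over 16 cycles
fp16_acc = regs_per_cycle(8, 2, 16)
# 16x8x16 HMMA, FP32 accumulate: assumed 10 in + 4 out over 32 cycles
fp32_acc = regs_per_cycle(10, 4, 32)

print(fp16_acc)  # 0.625 registers/cycle
print(fp32_acc)  # 0.4375 registers/cycle
```

Either way the tensor pipeline averages well under one register access per cycle, which is consistent with the claim that the F32 ALUs can keep issuing concurrently without starving for register bandwidth.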
 
They very evidently aren't, which is why NV moved to 1 GEMM core per SMSP since Ampere.
which presumably allows Nvidia to run both workloads simultaneously. I doubt that happens a lot in practice, though, due to bandwidth limitations.
NVIDIA definitely runs tensor and FP32 ops concurrently, especially now that their tensor cores are busy almost 100% of the time (doing upscaling, frame generation, denoising, HDR post-processing, and, in the future, neural rendering).

The latest NVIDIA generations have become much better at mixing all three workloads (tensor + ray + FP32) concurrently. I read somewhere (I can't find the source now) that ray tracing + tensor is the most common concurrent pairing, followed by ray tracing + FP32 and tensor + FP32.

 
I see lots of noise, flickering, and ghosting .. I guess it's okay as a first step, but overall it still needs a lot of work.

That's fine. They're making real moves now, which bodes well for broader RT adoption in the coming years. Hopefully they follow through and we see meaningful improvements in AMD-partnered titles.
 
I see lots of noise, flickering, and ghosting .. I guess it's okay as a first step, but overall it still needs a lot of work.
Like most AMD demos, they look kind of cheap.

On the tech side, it clearly isn't ready. The end of the demo with all the foliage is pretty much a mess of pixels. But all in all, I'm really surprised that they are already tackling the whole package of technologies at once. Two years from now it should already be pretty good.
 
It’s very ghosty. The most positive thing I can take away from this launch is that continued improvements in rasterization architecture are indeed possible. AMD has made a solid step forward.
 