AMD RDNA4 Architecture Speculation

How are they getting the same throughput then?

I’m going off AMD’s numbers but I’m not sure how they’re calculated. GB203 does 1024 FP16 ops per clock per SM and an N48 CU hits the same number with only 128 FP32 ALUs. They’re squeezing 8 FP16 ops out of each ALU. This is without sparsity. Black magic maybe.
 
I’m going off AMD’s numbers but I’m not sure how they’re calculated. GB203 does 1024 FP16 ops per clock per SM and an N48 CU hits the same number with only 128 FP32 ALUs. They’re squeezing 8 FP16 ops out of each ALU. This is without sparsity. Black magic maybe.
RDNA 3 was “256 FP16 ops/clk” by virtue of doing 64 Dot2 per clock. Each Dot2 is worth 4 ops (C = C + (A1*B1 + A2*B2)).

It is probably done as multiple passes of two fp16x2 operands + one 32-bit accumulator operand, repeatedly fed into the Dot2 ALU. That would use half of the theoretical maximum operand bandwidth, which is six 32-bit operands.

So what likely happened with RDNA 4 is that WMMA gets rearranged with an improved mixed-precision dot product ALU — Dot4 for FP16 (8 ops/clk), Dot8 for FP8 (16 ops/clk), Dot16 for INT4 (32 ops/clk). This could be tapping off the max operand bandwidth in the same way as VOPD “dual issue”. For example, FP16 Dot4 takes two VGPR pairs (two fp16x4) + one 32-bit operand.
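As a quick sanity check of that reading, the ops/clk numbers fall out of simple counting. A rough sketch below; the lane count and the convention of counting a DotN as 2*N ops are my assumptions, not anything AMD has published:

```python
# Toy sanity check of the speculated RDNA 4 WMMA rates per CU.
# Assumptions (mine, not AMD's published breakdown): 128 lanes per CU, each
# lane retiring one DotN per clock, and a DotN counted as 2*N ops
# (N multiplies + N adds, accumulation included).

LANES_PER_CU = 128

def dot_ops(n):
    # C = C + A1*B1 + ... + An*Bn -> n muls + n adds = 2*n ops
    return 2 * n

for name, n in [("FP16 Dot4", 4), ("FP8 Dot8", 8), ("INT4 Dot16", 16)]:
    per_lane = dot_ops(n)
    per_cu = per_lane * LANES_PER_CU
    print(f"{name}: {per_lane} ops/ALU/clk, {per_cu} ops/clk per CU")

# FP16 Dot4 -> 8 ops/ALU/clk, 1024 ops/clk per CU, matching the figure upthread.
```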

This still puts RDNA 4 at half the ops/clk of CDNA 3. Though IMO this is in line with expectations: the CDNA 2/3 CU doubled the operand bandwidth to support full-rate FP64 / 2xFP32, which also means they can feed twice the packed data into the dot product ALUs.
 
I’m going off AMD’s numbers but I’m not sure how they’re calculated. GB203 does 1024 FP16 ops per clock per SM and an N48 CU hits the same number with only 128 FP32 ALUs. They’re squeezing 8 FP16 ops out of each ALU. This is without sparsity. Black magic maybe.
Right, magic.
They have additional ALUs for that. The difference with Nvidia (probably, I'm not entirely sure on how Nvidia does that in Blackwell either) is that in AMD's case the ALUs are a part of the SIMDs while in Nvidia's SM they are a separate "SIMD" to the shading one - which presumably allows Nvidia to run both workloads simultaneously - but I doubt that it happens a lot in practice due to bandwidth limitations.
 
Right, magic.
They have additional ALUs for that. The difference with Nvidia (probably, I'm not entirely sure on how Nvidia does that in Blackwell either) is that in AMD's case the ALUs are a part of the SIMDs while in Nvidia's SM they are a separate "SIMD" to the shading one - which presumably allows Nvidia to run both workloads simultaneously - but I doubt that it happens a lot in practice due to bandwidth limitations.
I guess in RDNA4 the WMMA ops are handled by an actual dedicated ALU, but very tightly integrated with the vector units, i.e. sharing the same data path and issue port, so concurrent execution is not possible, just like on RDNA3 -- but the capabilities are significantly enhanced: extended type support and sparsity.
 
I guess in RDNA4 the WMMA ops are handled by an actual dedicated ALU, but very tightly integrated with the vector units, i.e. sharing the same data path and issue port, so concurrent execution is not possible, just like on RDNA3 -- but the capabilities are significantly enhanced: extended type support and sparsity.
Naaa, if AMD wanted a dedicated GEMM block they could've done MFMA without DPFP/SPFP GEMM support.
 
They have additional ALUs for that.

Doesn’t seem that way. The dot4 explanation above makes more sense.

The difference with Nvidia (probably, I'm not entirely sure on how Nvidia does that in Blackwell either) is that in AMD's case the ALUs are a part of the SIMDs while in Nvidia's SM they are a separate "SIMD" to the shading one - which presumably allows Nvidia to run both workloads simultaneously - but I doubt that it happens a lot in practice due to bandwidth limitations.

Yeah.
 
Right, magic.
They have additional ALUs for that. The difference with Nvidia (probably, I'm not entirely sure on how Nvidia does that in Blackwell either) is that in AMD's case the ALUs are a part of the SIMDs while in Nvidia's SM they are a separate "SIMD" to the shading one - which presumably allows Nvidia to run both workloads simultaneously - but I doubt that it happens a lot in practice due to bandwidth limitations.
A warp-wide 16x8x16 HMMA instruction with 16-bit accumulation should take 16 cycles on GB20x. It takes 8 registers as input and 2 as output. With 32-bit accumulation it takes 32 cycles, but only 2 more registers as input and output. You should be able to issue instructions to other pipelines (including the F32 ALUs) while the tensor cores are doing their work; there's plenty of register bandwidth left over.
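A quick back-of-envelope using those figures; the register-file read bandwidth below is an assumed illustrative value, not an NVIDIA spec:

```python
# Back-of-envelope for the m16n8k16 HMMA case above (FP16 accumulation).
# Figures from the post: 16 cycles per instruction, 8 input + 2 output
# registers per thread. The register-file read bandwidth is an assumed
# illustrative value, not an NVIDIA-published number.

M, N, K = 16, 8, 16
CYCLES = 16
WARP = 32
REG_BYTES = 4
SMSP_PER_SM = 4

flops_per_warp = 2 * M * N * K                       # 4096 (FMA = 2 ops)
flops_per_clk_smsp = flops_per_warp / CYCLES         # 256
flops_per_clk_sm = flops_per_clk_smsp * SMSP_PER_SM  # 1024, the GB203 figure upthread

operand_bytes = 8 * WARP * REG_BYTES                 # 1024 B of inputs per instruction
operand_bytes_per_clk = operand_bytes / CYCLES       # 64 B/cycle
assumed_rf_read_bw = 3 * WARP * REG_BYTES            # ~384 B/cycle if the RF feeds 3 operands/clk

print(flops_per_clk_sm)                              # 1024.0
print(operand_bytes_per_clk, assumed_rf_read_bw)     # 64.0 384 -> plenty left for the F32 ALUs
```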
 
They very evidently aren't, which is why NV moved to 1 GEMM core per SMSP since Ampere.
which presumably allows Nvidia to run both workloads simultaneously - but I doubt that it happens a lot in practice due to bandwidth limitations
NVIDIA definitely runs tensor and fp32 ops concurrently, especially now with their tensor cores busy almost 100% of the time (doing upscaling, frame generation, denoising, HDR post processing, and in the future neural rendering).

The latest NVIDIA generations have become considerably better at mixing all 3 workloads (tensor + ray + fp32) concurrently. I read somewhere (I can't find the source now) that ray tracing + tensor is the most common concurrent combination, followed by ray tracing + fp32 and tensor + fp32.

 
I see lots of noise, flickering, and ghosting .. I guess it's okay as a first step, but overall it still needs a lot of work.

That’s fine. They’re making real moves now, which bodes well for broader RT adoption in coming years. Hopefully they follow through and we see meaningful improvements in AMD-partnered titles.
 
I see lots of noise, flickering, and ghosting .. I guess it's okay as a first step, but overall it still needs a lot of work.
Like most AMD demos, they look kind of cheap.

On the tech side, it clearly isn't ready. The end of the demo with all the foliage is pretty much a mess of pixels. But all in all, I'm really surprised that they are already tackling the whole package of technologies at once. Two years from now it should already be pretty good.
 
It’s very ghosty. The most positive thing I can take away from this launch is that continued improvements in rasterization architecture are indeed possible. AMD has made a solid step forward.
 
From how I understood that presentation in the context of his consultation, prior GPU designs used to have a "maximum fixed amount" of each specific memory type (register/threadgroup/tile) that you could allocate BEFORE spilling to higher-level caches/memory. Normally in this design you have unused memory resources depending on the shader/kernel (compute = unused tile memory, graphics = unused threadgroup memory, etc.), and if you wanted to allocate more of a specific memory resource than possible, you would spill the allocation to slower/higher-latency caches and memory ...

What dynamic caching does is let you flexibly carve out unused memory resources to allocate more memory for the other memory types that are in use. Occupancy is improved in the sense that you avoid more cases of spilling to higher-latency memory, so your shader/kernel spends less time waiting/idling on memory accesses, but otherwise you won't see the hardware launch more waves. It's conceptually similar to Nvidia Volta's unified L1/shared memory pool, but it goes one step further and unifies the register memory space as well!

On AMD, their latest hardware design can apparently dynamically vary the number of waves in flight throughout the execution of a shader/kernel ...
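To make that spilling argument concrete, here is a toy model of fixed per-type partitions versus one unified pool; every capacity and kernel footprint below is made up purely for illustration:

```python
# Toy model of fixed per-type on-chip memory partitions vs. one unified pool.
# All capacities and the example kernel's footprint are made-up numbers purely
# to illustrate the spilling argument, not real Apple (or AMD) figures.

FIXED = {"registers": 64, "threadgroup": 32, "tile": 32}  # KiB per core, invented
UNIFIED_POOL = sum(FIXED.values())                        # same 128 KiB, but shared

def spilled_fixed(demand):
    # Any per-type overflow spills, even while other partitions sit unused
    # (e.g. a compute kernel that never touches tile memory).
    return sum(max(0, demand.get(k, 0) - cap) for k, cap in FIXED.items())

def spilled_unified(demand):
    # Only demand beyond the total on-chip capacity spills.
    return max(0, sum(demand.values()) - UNIFIED_POOL)

compute_kernel = {"registers": 96, "threadgroup": 24, "tile": 0}  # register heavy

print(spilled_fixed(compute_kernel))    # 32 -> KiB spilled despite idle tile memory
print(spilled_unified(compute_kernel))  # 0  -> unused tile capacity absorbs the demand
```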

It seems to me that the scheme you describe is more akin to what AMD is doing (as described by the patent here: Register Compaction with Early Release). If Apple indeed operated that way, there would be no change in occupancy on "heavy" shaders and one would still need to do dispatch fine-tuning, but M3 behavior seems to be very different. Some pieces of evidence are the dramatic performance improvements on complex shaders (e.g. Blender, and that's before the hardware RT kicks in) and the Blender patches (where it is mentioned that no dispatch group fine-tuning is needed, as the system will take care of occupancy automatically).

All this does raise an interesting question: I don't think we have a very good idea how these things are managed, and getting through multiple layers of marketing BS to the actual technical details can be incredibly tricky. At least my understanding of what Apple does is that they virtualize everything, even register access. So registers are fundamentally backed by system memory and can be cached closer to the SIMDs to improve performance (here is the relevant patent: CACHE CONTROL TO PRESERVE REGISTER DATA). This seems to suggest that you can in principle launch as many waves as you want, just that in some instances performance will suffer since you'll run out of cache. I have no idea how they manage that: there could be a watchdog that monitors cache utilization and occupancy and suspends/launches new waves to optimize things, or it could be a more primitive load-balancing system that operates at the driver level. It is also not clear at all what AMD does; it doesn't seem like their system is fully automatic.
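Just to illustrate the tradeoff I mean (not how Apple actually does it), here is a crude toy model where more waves hide more latency until their combined register footprint overflows a register cache; all the constants are invented:

```python
# Crude toy model of cache-backed registers: more waves hide more latency,
# but once their combined register footprint overflows the register cache,
# register accesses start going to slower memory. Every constant is invented.

REG_CACHE_KIB = 208       # invented register-cache capacity per core
REGS_PER_WAVE_KIB = 16    # invented per-wave register footprint
MISS_PENALTY = 4.0        # invented relative cost of a spilled register access

def relative_throughput(waves):
    latency_hiding = min(1.0, waves / 8)                     # saturates ~8 waves (assumed)
    resident = min(waves, REG_CACHE_KIB // REGS_PER_WAVE_KIB)
    miss_rate = (waves - resident) / waves                   # waves whose registers spill
    access_cost = (1 - miss_rate) + miss_rate * MISS_PENALTY
    return latency_hiding / access_cost

for w in (2, 4, 8, 13, 16, 24):
    print(w, round(relative_throughput(w), 2))

# Throughput climbs with wave count until the register cache is oversubscribed,
# then drops -- which is why something (a hardware watchdog or a driver-level
# balancer) would need to find the sweet spot automatically.
```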
 