AMD RDNA4 Architecture Speculation

My guess is that in RDNA4 the WMMA ops are handled by an actual dedicated ALU, but one very tightly integrated with the vector units, i.e. sharing the same data path and issue port, so concurrent execution is not possible, just like on RDNA3 -- but the capabilities are significantly enhanced: extended type support and sparsity.
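For reference, here's roughly what a single WMMA tile multiply looks like at source level today -- a minimal sketch using AMD's rocWMMA fragment API with the RDNA3 16x16x16 fp16 shape, and assuming RDNA4 keeps the same programming model while extending the types:

```cpp
// Minimal sketch: one wave computes a 16x16x16 tile D = A*B + C via rocWMMA.
// On RDNA3 the mma_sync below lowers to V_WMMA_F32_16X16X16_F16; the open
// question is only what hardware that instruction gets issued to.
#include <rocwmma/rocwmma.hpp>

using rocwmma::float16_t;
using rocwmma::float32_t;

__global__ void wmma_tile(const float16_t* a, const float16_t* b, float32_t* c)
{
    rocwmma::fragment<rocwmma::matrix_a, 16, 16, 16, float16_t, rocwmma::row_major> fragA;
    rocwmma::fragment<rocwmma::matrix_b, 16, 16, 16, float16_t, rocwmma::col_major> fragB;
    rocwmma::fragment<rocwmma::accumulator, 16, 16, 16, float32_t> fragAcc;

    rocwmma::fill_fragment(fragAcc, 0.0f);
    rocwmma::load_matrix_sync(fragA, a, 16);           // leading dimension = 16
    rocwmma::load_matrix_sync(fragB, b, 16);
    rocwmma::mma_sync(fragAcc, fragA, fragB, fragAcc); // the WMMA op in question
    rocwmma::store_matrix_sync(c, fragAcc, 16, rocwmma::mem_row_major);
}
```

Whatever the hardware arrangement turns out to be, it changes where that instruction executes, not this source-level picture.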
Naaa, if AMD wanted a dedicated GEMM block they could've done MFMA without DPFP/SPFP GEMM support.
 
They have additional ALUs for that.

Doesn’t seem that way. The dot4 explanation above makes more sense.

The difference with Nvidia (probably; I'm not entirely sure how Nvidia does it in Blackwell either) is that in AMD's case the ALUs are part of the SIMDs, while in Nvidia's SM they are a separate "SIMD" from the shading one. That presumably allows Nvidia to run both workloads simultaneously, but I doubt it happens much in practice due to bandwidth limitations.

Yeah.
 
Right, magic.
A warp-wide 16x8x16 HMMA instruction with 16-bit accumulation should take 16 cycles on GB20x. It takes 8 registers as input and 2 as output. With 32-bit accumulation it takes 32 cycles, but only 2 more registers as input and output. You should be able to issue instructions to other pipelines (including the F32 ALUs) while the tensor cores are doing their work; there's plenty of register bandwidth left over.
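For concreteness, those register counts match the PTX-level shape. A minimal sketch (sm_80+; the operand layout is the standard mma.sync m16n8k16 one from the PTX ISA):

```cpp
// One warp computes D = A*B + C for a 16x8x16 tile with fp16 accumulation.
// Per thread: 4 A regs + 2 B regs + 2 C regs in (8 x .b32 total), 2 D regs out,
// matching the 8-in/2-out count above. The f32-accumulate variant widens C and D
// to 4 regs each, i.e. 2 more in and 2 more out.
__device__ void hmma_16x8x16_f16acc(const unsigned (&a)[4],
                                    const unsigned (&b)[2],
                                    const unsigned (&c)[2],
                                    unsigned (&d)[2])
{
    asm volatile(
        "mma.sync.aligned.m16n8k16.row.col.f16.f16.f16.f16 "
        "{%0,%1}, {%2,%3,%4,%5}, {%6,%7}, {%8,%9};\n"
        : "=r"(d[0]), "=r"(d[1])
        : "r"(a[0]), "r"(a[1]), "r"(a[2]), "r"(a[3]),
          "r"(b[0]), "r"(b[1]),
          "r"(c[0]), "r"(c[1]));
}
```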
 
They very evidently aren't, which is why NV moved to 1 GEMM core per SMSP since Ampere.
NVIDIA definitely runs tensor and fp32 ops concurrently, especially now with their tensor cores busy almost 100% of the time (doing upscaling, frame generation, denoising, HDR post processing, and in the future neural rendering).

The latest NVIDIA generations have become considerably better at mixing all three workloads (tensor + ray + fp32) concurrently. I read somewhere (I can't find the source now) that ray tracing + tensor is the most common concurrent pairing, followed by ray tracing + fp32 and tensor + fp32.
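To make the overlap point concrete, here's a toy kernel (plain nvcuda::wmma API; the buffers and loop are made up for illustration) where the fp32 FMA chain has no dependence on the tensor-core accumulator, so nothing forces the scheduler to serialize the two pipelines:

```cpp
#include <mma.h>
#include <cuda_fp16.h>
using namespace nvcuda;

// Toy illustration of tensor + fp32 overlap within one warp (sm_70+).
// The fp32 work (acc) is independent of the tensor-core work (fragAcc),
// so the warp scheduler can issue FFMAs while HMMAs are in flight.
__global__ void overlap_demo(const half* A, const half* B, float* C,
                             const float* x, float* y, int iters)
{
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> fragA;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> fragB;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> fragAcc;

    wmma::fill_fragment(fragAcc, 0.0f);
    wmma::load_matrix_sync(fragA, A, 16);
    wmma::load_matrix_sync(fragB, B, 16);

    float acc = 0.0f;
    for (int i = 0; i < iters; ++i) {
        wmma::mma_sync(fragAcc, fragA, fragB, fragAcc); // tensor-core pipe
        acc = fmaf(x[threadIdx.x], 1.0001f, acc);       // fp32 pipe, independent
    }
    y[threadIdx.x] = acc;
    wmma::store_matrix_sync(C, fragAcc, 16, wmma::mem_row_major);
}
```

Whether real workloads actually get that overlap is another question; as noted above, it depends on register bandwidth and occupancy.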

 
I see lots of noise, flickering, and ghosting... I guess it's okay as a first step, but overall it still needs a lot of work.

That's fine. They're making real moves now, which bodes well for broader RT adoption in the coming years. Hopefully they follow through and we see meaningful improvements in AMD-partnered titles.
 
Like most AMD demos, they look kind of cheap.

On the tech side, it clearly isn't ready. The end of the demo with all the foliage is pretty much a mess of pixels. But all in all, I'm really surprised that they are already tackling the whole package of technologies at once. Two years from now it should already be pretty good.
 
It’s very ghosty. The most positive thing I can take away from this launch is that continued improvements in rasterization architecture are indeed possible. AMD has made a solid step forward.
 
Likely not going to be at the prices I want, but I wholeheartedly welcome a potential price war between 16GB variants of sub-$500 (?) cards.

 