AMD RDNA4 Architecture Speculation

My guess is that in RDNA4 the WMMA ops are handled by an actual dedicated ALU, but one very tightly integrated with the vector units, i.e. sharing the same data path and issue port, so concurrent execution is not possible, just like on RDNA3 -- but the capabilities are significantly enhanced: extended type support and sparsity.
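For reference, here's roughly what a single WMMA tile multiply looks like at source level today -- a minimal sketch using AMD's rocWMMA fragment API with the RDNA3 16x16x16 fp16 shape, and assuming RDNA4 keeps the same programming model while extending the types:

```cpp
// Minimal sketch: one wave computes a 16x16x16 tile D = A*B + C via rocWMMA.
// On RDNA3 the mma_sync below lowers to V_WMMA_F32_16X16X16_F16; the open
// question is only what hardware that instruction gets issued to.
#include <rocwmma/rocwmma.hpp>

using rocwmma::float16_t;
using rocwmma::float32_t;

__global__ void wmma_tile(const float16_t* a, const float16_t* b, float32_t* c)
{
    rocwmma::fragment<rocwmma::matrix_a, 16, 16, 16, float16_t, rocwmma::row_major> fragA;
    rocwmma::fragment<rocwmma::matrix_b, 16, 16, 16, float16_t, rocwmma::col_major> fragB;
    rocwmma::fragment<rocwmma::accumulator, 16, 16, 16, float32_t> fragAcc;

    rocwmma::fill_fragment(fragAcc, 0.0f);
    rocwmma::load_matrix_sync(fragA, a, 16);           // leading dimension = 16
    rocwmma::load_matrix_sync(fragB, b, 16);
    rocwmma::mma_sync(fragAcc, fragA, fragB, fragAcc); // the WMMA op in question
    rocwmma::store_matrix_sync(c, fragAcc, 16, rocwmma::mem_row_major);
}
```

Whatever the hardware arrangement turns out to be, it changes where that instruction executes, not this source-level picture.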
Naaa, if AMD wanted a dedicated GEMM block they could've done MFMA without DPFP/SPFP GEMM support.
 
They have additional ALUs for that.

Doesn’t seem that way. The dot4 explanation above makes more sense.

The difference with Nvidia (probably; I'm not entirely sure how Nvidia does it in Blackwell either) is that in AMD's case the ALUs are part of the SIMDs, while in Nvidia's SM they are a separate "SIMD" from the shading one. That presumably allows Nvidia to run both workloads simultaneously, but I doubt it happens much in practice due to bandwidth limitations.

Yeah.
 
Right, magic.
A warp-wide 16x8x16 HMMA instruction with 16-bit accumulation should take 16 cycles on GB20x. It takes 8 registers as input and 2 as output. With 32-bit accumulation it takes 32 cycles, but only 2 more registers as input and output. You should be able to issue instructions to other pipelines (including the F32 ALUs) while the tensor cores are doing their work; there's plenty of register bandwidth left over.
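For concreteness, those register counts match the PTX-level shape. A minimal sketch (sm_80+; the operand layout is the standard mma.sync m16n8k16 one from the PTX ISA):

```cpp
// One warp computes D = A*B + C for a 16x8x16 tile with fp16 accumulation.
// Per thread: 4 A regs + 2 B regs + 2 C regs in (8 x .b32 total), 2 D regs out,
// matching the 8-in/2-out count above. The f32-accumulate variant widens C and D
// to 4 regs each, i.e. 2 more in and 2 more out.
__device__ void hmma_16x8x16_f16acc(const unsigned (&a)[4],
                                    const unsigned (&b)[2],
                                    const unsigned (&c)[2],
                                    unsigned (&d)[2])
{
    asm volatile(
        "mma.sync.aligned.m16n8k16.row.col.f16.f16.f16.f16 "
        "{%0,%1}, {%2,%3,%4,%5}, {%6,%7}, {%8,%9};\n"
        : "=r"(d[0]), "=r"(d[1])
        : "r"(a[0]), "r"(a[1]), "r"(a[2]), "r"(a[3]),
          "r"(b[0]), "r"(b[1]),
          "r"(c[0]), "r"(c[1]));
}
```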
 
They very evidently aren't, which is why NV moved to 1 GEMM core per SMSP since Ampere.
NVIDIA definitely runs tensor and fp32 ops concurrently, especially now with their tensor cores busy almost 100% of the time (doing upscaling, frame generation, denoising, HDR post processing, and in the future neural rendering).

The latest NVIDIA generations have become considerably better at mixing all three workloads (tensor + ray + fp32) concurrently. I read somewhere (I can't find the source now) that ray tracing + tensor is the most common concurrent pairing, followed by ray tracing + fp32 and tensor + fp32.
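To make the overlap point concrete, here's a toy kernel (plain nvcuda::wmma API; the buffers and loop are made up for illustration) where the fp32 FMA chain has no dependence on the tensor-core accumulator, so nothing forces the scheduler to serialize the two pipelines:

```cpp
#include <mma.h>
#include <cuda_fp16.h>
using namespace nvcuda;

// Toy illustration of tensor + fp32 overlap within one warp (sm_70+).
// The fp32 work (acc) is independent of the tensor-core work (fragAcc),
// so the warp scheduler can issue FFMAs while HMMAs are in flight.
__global__ void overlap_demo(const half* A, const half* B, float* C,
                             const float* x, float* y, int iters)
{
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> fragA;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> fragB;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> fragAcc;

    wmma::fill_fragment(fragAcc, 0.0f);
    wmma::load_matrix_sync(fragA, A, 16);
    wmma::load_matrix_sync(fragB, B, 16);

    float acc = 0.0f;
    for (int i = 0; i < iters; ++i) {
        wmma::mma_sync(fragAcc, fragA, fragB, fragAcc); // tensor-core pipe
        acc = fmaf(x[threadIdx.x], 1.0001f, acc);       // fp32 pipe, independent
    }
    y[threadIdx.x] = acc;
    wmma::store_matrix_sync(C, fragAcc, 16, wmma::mem_row_major);
}
```

Whether real workloads actually get that overlap is another question; as noted above, it depends on register bandwidth and occupancy.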

 
I see lots of noise, flickering, and ghosting... I guess it's okay as a first step, but overall it still needs a lot of work.

That's fine. They're making real moves now, which bodes well for broader RT adoption in the coming years. Hopefully they follow through and we see meaningful improvements in AMD-partnered titles.
 
Like most AMD demos, they look kind of cheap.

On the tech side, it clearly isn't ready. The end of the demo with all the foliage is pretty much a mess of pixels. But all in all, I'm really surprised that they are already tackling the whole package of technologies at once. Two years from now it should already be pretty good.
 
It’s very ghosty. The most positive thing I can take away from this launch is that continued improvements in rasterization architecture are indeed possible. AMD has made a solid step forward.
 
Likely not going to be at the prices I want, but I wholeheartedly welcome a potential price war between 16GB variants of sub-$500 (?) cards.

 