AMD RDNA4 Architecture Speculation

So still no BVH traversal in hardware? Instead we get one more intersection engine? Am I understanding that right?

Seems that way.

HUB did some projections of 9070 performance based on AMD’s numbers and it’s looking like a repeat of RDNA 3 vs Ada. Similar raster for a lot less money but falling short in RT. It’s great to see AMD publicly embracing path tracing which bodes well for future architectures.
 
1740752506763.png
Average here is -7%, so that's not "2%" already.
Then you have to discard the FC6 result there, as that one was basically never bound by its RT implementation and mostly shows the shading difference. After that you get a 9% difference.
Then there are other RT titles and modes which they are not showing here.
As I've said, expect the difference to be in the 10-20% range based on the 7900 GRE results. I've done some math with these.
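To make the adjustment above concrete, here's a minimal sketch of the recalculation. The per-game deltas are hypothetical placeholders (the chart's actual numbers aren't reproduced here); the point is just how one near-zero outlier pulls the average up:

```python
# Hypothetical per-game deltas (9070 XT vs. competitor, negative = slower).
# These values are illustrative only -- they are NOT the chart's real data.
deltas = {
    "Game A": -9.0,
    "Game B": -9.0,
    "Game C": -9.0,
    "Game D": -9.0,
    "Game E": -9.0,
    "FC6":     3.0,  # shader-bound; barely stressed by its RT implementation
}

avg_all = sum(deltas.values()) / len(deltas)

# Drop the outlier that isn't actually RT-bound:
no_fc6 = {k: v for k, v in deltas.items() if k != "FC6"}
avg_no_fc6 = sum(no_fc6.values()) / len(no_fc6)

print(f"average incl. FC6: {avg_all:+.0f}%")    # -7%
print(f"average excl. FC6: {avg_no_fc6:+.0f}%")  # -9%
```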

And what's the point of OC vs non-OC comparisons? GB203 cards also overclock.
 
One way would be that AMD is selling a $600 SKU based on a full chip which is awfully close in size/complexity to GB203, which Nvidia is selling at $1000. That's a solid price premium on the Nvidia side which seemingly has no production-level explanation, so it can be attributed to Nvidia's margin.
Another way would be that this SKU isn't even hitting the cut-down one (5070 Ti) on the Nvidia side while being more complex than the full one (5080). So in PPA AMD is still behind Nvidia's architecture here.

Gotta factor in GDDR7 vs GDDR6, which doesn't technically affect PPA but certainly affects BOM. You also need to consider power: a 9070 XT at 5080 power will be somewhat faster than at 304W.
 
One way would be that AMD is selling a $600 SKU based on a full chip which is awfully close in size/complexity to GB203, which Nvidia is selling at $1000. That's a solid price premium on the Nvidia side which seemingly has no production-level explanation, so it can be attributed to Nvidia's margin.
You don't say so 🤔
 
Dynamic register allocation seems like a pretty big deal and very impressive if it works well. Static allocation has been a bedrock of GPU compute since day one. Hopefully AMD shares more about how that works.

The presentation was way more focused on RT and ML than I expected. AMD even left out the usual flops and bandwidth numbers. It was all about TOPS. Is the AI accelerator a new hardware block or similar to RDNA 3 WMMA on the main vector unit? Hard to tell from the diagram.

architecture-6.jpg
 
At the 22:40 mark: "Dynamic register allocation..."

Optimizations for better VOPD handling?

Seems like they've adopted Apple's "Dynamic Caching" idea from the M3. Registers are dynamically allocated at runtime instead of being sized for the worst case. Apple's solution also dynamically allocates threadgroup memory and stack memory.
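A quick sketch of why this matters for occupancy. The budget, wave cap, and register counts below are illustrative assumptions, not RDNA4 specifics; the point is that static allocation reserves for a shader's worst-case path, while dynamic allocation lets occupancy track the registers actually live:

```python
# Illustrative SIMD parameters (assumptions, not real RDNA4 figures):
VGPR_BUDGET = 1536   # registers available per SIMD
MAX_WAVES = 16       # hardware cap on resident waves per SIMD

def occupancy(vgprs_per_wave: int) -> int:
    """Waves per SIMD for a given per-wave register footprint."""
    return min(MAX_WAVES, VGPR_BUDGET // vgprs_per_wave)

# Static allocation: every wave reserves the worst-case path's 256 VGPRs
# up front, even if that path rarely executes.
static_waves = occupancy(256)

# Dynamic allocation: registers are handed out as paths actually need
# them, so typical occupancy tracks the ~96 live registers instead.
dynamic_waves = occupancy(96)

print(static_waves, dynamic_waves)  # 6 vs 16 waves resident
```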
 
Dynamic register allocation seems like a pretty big deal and very impressive if it works well. Static allocation has been a bedrock of GPU compute since day one. Hopefully AMD shares more about how that works.

The presentation was way more focused on RT and ML than I expected. AMD even left out the usual flops and bandwidth numbers. It was all about TOPS. Is the AI accelerator a new hardware block or similar to RDNA 3 WMMA on the main vector unit? Hard to tell from the diagram.
From what I've gathered so far, I'd say it's a (big, but) evolutionary step from RDNA3.
 
From what I've gathered so far, I'd say it's a (big, but) evolutionary step from RDNA3.

What does the AI accelerator do exactly? Is it operand collection / format conversion for matrix ops that then run on the 64 regular ALUs in each SIMD?

Another interesting bit - seems RDNA 3's 256KB L1 shader array cache is no longer there. No mention of it in the deck or the diagram.

rdna3-arch.jpg

rdna4-arch.jpg
 
Seems like they've adopted Apple's "Dynamic Caching" idea from the M3. Registers are dynamically allocated at runtime instead of being sized for the worst case. Apple's solution also dynamically allocates threadgroup memory and stack memory.
that's actually from Gen (yes the Intel gen).
Is it operand collection / format conversion for matrix ops that then run on the 64 regular ALUs in each SIMD?
yea.
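A conceptual sketch of that "accelerator as front-end" model: a block that unpacks and converts low-precision operands, with the multiply-accumulates themselves running on the regular SIMD ALUs. The function names, scales, and tile sizes here are all illustrative, not AMD's actual design:

```python
# The 'AI accelerator' step: operand collection / format conversion.
# Quantized values (stored as small ints) are expanded to the format
# the regular vector ALUs consume.
def dequantize(packed: list[int], scale: float) -> list[float]:
    return [v * scale for v in packed]

# The work the ordinary SIMD ALUs would then do: multiply-accumulates.
def simd_dot(a: list[float], b: list[float]) -> float:
    acc = 0.0
    for x, y in zip(a, b):
        acc += x * y
    return acc

a_packed = [1, 2, 3, 4]  # quantized activations (hypothetical)
w_packed = [4, 3, 2, 1]  # quantized weights (hypothetical)
a = dequantize(a_packed, 0.5)
w = dequantize(w_packed, 0.25)
print(simd_dot(a, w))  # 2.5
```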
Another interesting bit - seems RDNA 3's 256KB L1 shader array cache is no longer there. No mention of it in the deck or the diagram.
demoted to a buffer.
Dynamic register allocation seems like a pretty big deal and very impressive if it works well. Static allocation has been a bedrock of GPU compute since day one. Hopefully AMD shares more about how that works.
The quirkiest part is the OoO memory fills, à la Cortex-A510.
 
According to TPU, FSR4 requires 779 AI TOPS, which pretty much confirms it has very little to do with PSSR (which runs on the PS5 Pro's ~300 TOPS) and will hopefully be a much superior solution. Also, the 9070 (non-XT) offers almost 1200 TOPS, around 4x the PS5 Pro's AI capability, at raster levels that are presumably more like 50% higher, so clearly there's little to no architectural relation there either from an AI perspective.
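The back-of-envelope arithmetic behind those claims, using only the figures cited in the post:

```python
ps5_pro_tops = 300    # PS5 Pro AI throughput cited in the post
rx_9070_tops = 1200   # ~9070 (non-XT) figure cited in the post
fsr4_req_tops = 779   # TPU's reported FSR4 requirement

print(rx_9070_tops / ps5_pro_tops)   # 4.0 -> "around 4x"
print(fsr4_req_tops > ps5_pro_tops)  # True -> PSSR-class hardware falls short
```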

As a product the 9070 XT seems pretty exciting: ~4070 Ti Super-level performance for 75% of the price, with what will hopefully be an upscaler comparable to DLSS 3, along with comparable frame-gen capabilities. They even apparently have their own AI-based denoiser in response to Ray Reconstruction. Hopefully it's competitive.
 