AMD RDNA4 Architecture Speculation

So still no BVH traversal in hardware? Instead we get one more intersection engine? Am I understanding that right?

Seems that way.

HUB did some projections of 9070 performance based on AMD’s numbers and it’s looking like a repeat of RDNA 3 vs Ada. Similar raster for a lot less money but falling short in RT. It’s great to see AMD publicly embracing path tracing which bodes well for future architectures.
 
1740752506763.png
Average here is -7%, so that's not "2%" already.
Then you have to discard the FC6 result there, as that one was basically never bound by its RT implementation and mostly shows the shading difference. After that you get a 9% difference.
Then there are other RT titles and modes which they are not showing here.
As I've said, expect the difference to be in the 10-20% range based on the 7900 GRE results. I've done some math with these.
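To make the adjustment above concrete, here's a minimal sketch of the recalculation. The per-game deltas are hypothetical placeholders (the chart's actual numbers aren't reproduced here); the point is just how one near-zero outlier pulls the average up:

```python
# Hypothetical per-game deltas (9070 XT vs. competitor, negative = slower).
# These values are illustrative only -- they are NOT the chart's real data.
deltas = {
    "Game A": -9.0,
    "Game B": -9.0,
    "Game C": -9.0,
    "Game D": -9.0,
    "Game E": -9.0,
    "FC6":     3.0,  # shader-bound; barely stressed by its RT implementation
}

avg_all = sum(deltas.values()) / len(deltas)

# Drop the outlier that isn't actually RT-bound:
no_fc6 = {k: v for k, v in deltas.items() if k != "FC6"}
avg_no_fc6 = sum(no_fc6.values()) / len(no_fc6)

print(f"average incl. FC6: {avg_all:+.0f}%")    # -7%
print(f"average excl. FC6: {avg_no_fc6:+.0f}%")  # -9%
```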

And what's the point of OC vs non-OC comparisons? GB203 cards also overclock.
 
One way would be that AMD is selling a $600 SKU based on a full chip which is awfully close in size/complexity to GB203, which Nvidia is selling at $1000. That's a solid price premium on the Nvidia side which seemingly has no production-level explanation, so it can be attributed to Nvidia's margin.
Another way would be that this SKU isn't even hitting the cut-down one (5070 Ti) on the Nvidia side while being more complex than the full one (5080). So in PPA AMD is still behind Nvidia's architecture here.

Gotta factor in GDDR7 vs GDDR6, which doesn't technically affect PPA but certainly affects BOM. You also need to consider power: a 9070 XT at 5080 power will be somewhat faster than at 304W.
 
One way would be that AMD is selling a $600 SKU based on a full chip which is awfully close in size/complexity to GB203, which Nvidia is selling at $1000. That's a solid price premium on the Nvidia side which seemingly has no production-level explanation, so it can be attributed to Nvidia's margin.
You don't say so 🤔
 
Dynamic register allocation seems like a pretty big deal and very impressive if it works well. Static allocation has been a bedrock of GPU compute since day one. Hopefully AMD shares more about how that works.

The presentation was way more focused on RT and ML than I expected. AMD even left out the usual flops and bandwidth numbers. It was all about TOPS. Is the AI accelerator a new hardware block or similar to RDNA 3 WMMA on the main vector unit? Hard to tell from the diagram.

architecture-6.jpg
 
At the 22:40 mark: "Dynamic register allocation..."

Optimizations for better VOPD handling?

Seems like they've adopted Apple's "Dynamic Caching" idea from the M3. Registers are dynamically allocated at runtime instead of being sized for the worst case. Apple's solution also dynamically allocates threadgroup memory and stack memory.
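A quick sketch of why this matters for occupancy. The budget, wave cap, and register counts below are illustrative assumptions, not RDNA4 specifics; the point is that static allocation reserves for a shader's worst-case path, while dynamic allocation lets occupancy track the registers actually live:

```python
# Illustrative SIMD parameters (assumptions, not real RDNA4 figures):
VGPR_BUDGET = 1536   # registers available per SIMD
MAX_WAVES = 16       # hardware cap on resident waves per SIMD

def occupancy(vgprs_per_wave: int) -> int:
    """Waves per SIMD for a given per-wave register footprint."""
    return min(MAX_WAVES, VGPR_BUDGET // vgprs_per_wave)

# Static allocation: every wave reserves the worst-case path's 256 VGPRs
# up front, even if that path rarely executes.
static_waves = occupancy(256)

# Dynamic allocation: registers are handed out as paths actually need
# them, so typical occupancy tracks the ~96 live registers instead.
dynamic_waves = occupancy(96)

print(static_waves, dynamic_waves)  # 6 vs 16 waves resident
```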
 
Dynamic register allocation seems like a pretty big deal and very impressive if it works well. Static allocation has been a bedrock of GPU compute since day one. Hopefully AMD shares more about how that works.

The presentation was way more focused on RT and ML than I expected. AMD even left out the usual flops and bandwidth numbers. It was all about TOPS. Is the AI accelerator a new hardware block or similar to RDNA 3 WMMA on the main vector unit? Hard to tell from the diagram.
From what I've gathered so far, I'd say it's a (big, but) evolutionary step from RDNA3.
 
From what I've gathered so far, I'd say it's a (big, but) evolutionary step from RDNA3.

What does the AI accelerator do exactly? Is it operand collection / format conversion for matrix ops that then run on the 64 regular ALUs in each SIMD?

Another interesting bit - seems RDNA 3's 256KB L1 shader array cache is no longer there. No mention of it in the deck or the diagram.

rdna3-arch.jpg

rdna4-arch.jpg
 
Seems like they've adopted Apple's "Dynamic Caching" idea from the M3. Registers are dynamically allocated at runtime instead of being sized for the worst case. Apple's solution also dynamically allocates threadgroup memory and stack memory.
that's actually from Gen (yes the Intel gen).
Is it operand collection / format conversion for matrix ops that then run on the 64 regular ALUs in each SIMD?
yea.
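A conceptual sketch of that "accelerator as front-end" model: a block that unpacks and converts low-precision operands, with the multiply-accumulates themselves running on the regular SIMD ALUs. The function names, scales, and tile sizes here are all illustrative, not AMD's actual design:

```python
# The 'AI accelerator' step: operand collection / format conversion.
# Quantized values (stored as small ints) are expanded to the format
# the regular vector ALUs consume.
def dequantize(packed: list[int], scale: float) -> list[float]:
    return [v * scale for v in packed]

# The work the ordinary SIMD ALUs would then do: multiply-accumulates.
def simd_dot(a: list[float], b: list[float]) -> float:
    acc = 0.0
    for x, y in zip(a, b):
        acc += x * y
    return acc

a_packed = [1, 2, 3, 4]  # quantized activations (hypothetical)
w_packed = [4, 3, 2, 1]  # quantized weights (hypothetical)
a = dequantize(a_packed, 0.5)
w = dequantize(w_packed, 0.25)
print(simd_dot(a, w))  # 2.5
```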
Another interesting bit - seems RDNA 3's 256KB L1 shader array cache is no longer there. No mention of it in the deck or the diagram.
demoted to a buffer.
Dynamic register allocation seems like a pretty big deal and very impressive if it works well. Static allocation has been a bedrock of GPU compute since day one. Hopefully AMD shares more about how that works.
The quirkiest part is the OoO memory fills, à la Cortex-A510.
 
According to TPU, FSR4 requires 779 AI TOPS, which pretty much confirms it has very little to do with PSSR (which runs on the PS5 Pro's ~300 TOPS) and will hopefully be a much superior solution. Also, the 9070 (non-XT) offers almost 1200 TOPS, around 4x the PS5 Pro's AI capability, at raster levels that are presumably more like 50% higher, so clearly there's little to no architectural relation there either from an AI perspective.
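The back-of-envelope arithmetic behind those claims, using only the figures cited in the post:

```python
ps5_pro_tops = 300    # PS5 Pro AI throughput cited in the post
rx_9070_tops = 1200   # ~9070 (non-XT) figure cited in the post
fsr4_req_tops = 779   # TPU's reported FSR4 requirement

print(rx_9070_tops / ps5_pro_tops)   # 4.0 -> "around 4x"
print(fsr4_req_tops > ps5_pro_tops)  # True -> PSSR-class hardware falls short
```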

As a product the 9070 XT seems pretty exciting: ~4070 Ti Super-level performance for 75% of the price, with what will hopefully be an upscaler comparable to DLSS 3, along with comparable frame-gen capabilities. They even apparently have their own AI-based denoiser in response to Ray Reconstruction. Hopefully it's competitive.
 