RDNA3 Efficiency [Spinoff from RDNA4]

I feel like I have missed something obvious here based on these latest posts.

The 71% uplift came between Vega and Navi, with a new architecture and an improved process. Navi10 to Navi20 gave a 20% uplift on the same process. Finally Navi20 to Navi30 gave a 35% uplift on a new process again. Is 35% improved power efficiency a good result on an improved process if the previous architecture managed a 20% uplift on the same process?

I find the discussion interesting but hard to follow for a lay person such as myself.
 
Is 35% improved power efficiency a good result on an improved process if the previous architecture managed a 20% uplift on the same process?
No, especially when they downclocked the shit out of N3x'es.
It's alright but not at .93v.
 
I'm not drawing any conclusions about RDNA 2. I was trying to understand the claim that RDNA 3 is broken.

Even if we accept that it didn't ultimately meet AMD's internal targets and is pulling too much power at nominal voltages, the fact remains that delivered performance is fine and competitive with other architectures on TSMC 5nm. So it's definitely not "broken" or an "outlier".

"AMD aimed really high and missed" is a more believable story.
Well, weren't the original claims for the N5 process that it could deliver 30% lower power consumption at the same clock speeds? So if that still applies and you're comparing the 6700 XT and 7800 XT, which clock almost the same according to TPU, then only 12% of the gain in efficiency is coming from architectural improvements. If you compared RDNA 2 to RDNA 1 also at the same clocks, you would see a much larger gain.
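One way to sanity-check that kind of split, assuming the process and architecture gains compose multiplicatively at roughly iso-clock (the measured_gain value below is just a placeholder, not an actual TPU number):

def process_efficiency_factor(power_reduction):
    # 30% lower power at the same clock -> 1 / (1 - 0.30) ~= 1.43x perf/W
    return 1.0 / (1.0 - power_reduction)

def architecture_factor(total_gain, power_reduction):
    # whatever is left of the measured perf/W gain after the process share
    return total_gain / process_efficiency_factor(power_reduction)

measured_gain = 1.60   # placeholder perf/W ratio between the two cards, NOT a real TPU figure
node_power_cut = 0.30  # TSMC's claimed N5-vs-N7 power reduction at iso-clock

arch = architecture_factor(measured_gain, node_power_cut)
print(f"process: {process_efficiency_factor(node_power_cut):.2f}x, "
      f"architecture: {arch:.2f}x (~{(arch - 1) * 100:.0f}%)")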
 
I feel like I have missed something obvious here based on these latest posts.

The 71% uplift came between Vega and Navi, with a new architecture and an improved process. Navi10 to Navi20 gave a 20% uplift on the same process. Finally Navi20 to Navi30 gave a 35% uplift on a new process again. Is 35% improved power efficiency a good result on an improved process if the previous architecture managed a 20% uplift on the same process?

I find the discussion interesting but hard to follow for a lay person such as myself.

It's not so simple to look at it just in those terms.

By design, RDNA2 traded die space, which really means higher cost, for efficiency via on-chip cache.

By design, RDNA3 traded efficiency for lower cost via chiplets and a mixed-process design.
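As a toy illustration of the first trade, a big on-chip cache buys efficiency because every hit avoids a much more expensive trip to GDDR. The energy-per-bit and traffic numbers below are made-up placeholders just to show the shape of the calculation, not RDNA2 figures:

def memory_power_watts(traffic_gb_s, hit_rate, e_cache_pj_bit, e_dram_pj_bit):
    # average power spent moving data, given a last-level cache hit rate
    bits_per_second = traffic_gb_s * 1e9 * 8
    energy_per_bit_pj = hit_rate * e_cache_pj_bit + (1 - hit_rate) * e_dram_pj_bit
    return bits_per_second * energy_per_bit_pj * 1e-12  # pJ/s -> W

# placeholder inputs: 500 GB/s of demand traffic, ~1 pJ/bit on-chip, ~8 pJ/bit GDDR
for hit_rate in (0.0, 0.5, 0.8):
    print(f"hit rate {hit_rate:.0%}: ~{memory_power_watts(500, hit_rate, 1.0, 8.0):.0f} W")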
 
I feel like I have missed something obvious here based on these latest posts.

The 71% uplift came between Vega and Navi, with a new architecture and an improved process. Navi10 to Navi20 gave a 20% uplift on the same process. Finally Navi20 to Navi30 gave a 35% uplift on a new process again. Is 35% improved power efficiency a good result on an improved process if the previous architecture managed a 20% uplift on the same process?

I find the discussion interesting but hard to follow for a lay person such as myself.
There's so much cherry picking going on.

One of the problems is that you can define efficiency in various ways, and doing so for an entire architecture is even messier. As somebody showed before, 'efficiency' can vary a fair bit even amongst the same range of products on the same architecture. And GPUs have been increasingly pushed hard out of the box compared to before, especially in the midrange segment.

RDNA2 absolutely has demonstrated a 50% uplift over RDNA1 in performance per watt, whether you're talking 4K gaming, iso-performance, or iso-clocks. The reason we don't see this result in many out-of-the-box product benchmarks is that all of these architectural efficiency improvements were used to push clock speeds much harder. And it was quite impressive to achieve this with no process improvement.
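To make the point about definitions concrete, here's a trivial sketch with invented numbers showing how out-of-the-box perf/W and frame-capped (iso-performance) perf/W can tell different stories for the same two cards:

# the (fps, watts) pairs are invented placeholders, not benchmark results
cards = {
    "old_gen": {"out_of_box": (60, 220), "capped_60fps": (60, 220)},
    "new_gen": {"out_of_box": (90, 300), "capped_60fps": (60, 160)},
}

for scenario in ("out_of_box", "capped_60fps"):
    old_fps, old_w = cards["old_gen"][scenario]
    new_fps, new_w = cards["new_gen"][scenario]
    ratio = (new_fps / new_w) / (old_fps / old_w)
    print(f"{scenario}: new gen has {ratio:.2f}x the perf/W of old gen")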

Given RDNA3's significant architecture changes and switch to 5nm, I'd say it's fair to be disappointed in its lack of efficiency improvement. Whether we want to call it an 'outlier' or not is a difficult thing to get into the weeds with, but it almost assuredly should have achieved more, and it failed to hit AMD's own claimed figures because its performance was not as good as expected, for reasons that are still unknown.
 
Given RDNA3's significant architecture changes and switch to 5nm, I'd say it's fair to be disappointed in its lack of efficiency improvement. Whether we want to call it an 'outlier' or not is a difficult thing to get into the weeds with, but it almost assuredly should have achieved more, and it failed to hit AMD's own claimed figures because its performance was not as good as expected, for reasons that are still unknown.
Agreed; it seems like RDNA3 totally wasted the newer node advantage.

TSMC's 5nm node uses EUV and improves logic density by 1.8x compared to 7nm; for SRAM, the density is 1.3x higher. Just like the 7nm node, 5nm has two variants, one optimized for mobile and the other for HPC. The mobile variant allows 15% higher performance or 30% lower power consumption, and a second version of 5nm is 7% faster. Both versions also use EUV. TSMC is gaining some traction for 5nm.
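Worth noting that those headline figures don't translate directly to a whole die, since a GPU is a mix of logic, SRAM and analog/IO. A quick illustrative blend (the area split is an assumption, not a real die breakdown):

scaling = {"logic": 1.8, "sram": 1.3, "analog_io": 1.0}      # quoted N5 factors; analog/IO assumed ~1x
area_mix = {"logic": 0.60, "sram": 0.25, "analog_io": 0.15}  # assumed area split, not real data

new_area = sum(share / scaling[block] for block, share in area_mix.items())
print(f"effective density gain: {1 / new_area:.2f}x (vs 1.8x for pure logic)")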
 
No, it's mostly in line with the node bump.
The issue is that the uarch is a mess and has what seems to be a Cac (switching capacitance) problem.
I think AMD has a weaker scheduler than Nvidia for dual-issue operations. AMD needed to make a bigger chip; 14k cores at 3.1 GHz would have got close to a 4090.
 
I think AMD has a weaker scheduler than Nvidia for dual-issue operations. AMD needed to make a bigger chip; 14k cores at 3.1 GHz would have got close to a 4090.
Err? 14k shader units @ 3.1 GHz would have run circles around literally everything out there. It would have had 128% more cores @ ~30% higher clock compared to the 7900 XTX.
 
Err? 14k shader units @ 3.1 GHz would have run circles around literally everything out there. It would have had 128% more cores @ ~30% higher clock compared to the 7900 XTX.
I assume he's counting the 7900 XTX as 12288 cores rather than 6144.
I think AMD has a weaker scheduler than Nvidia for dual-issue operations. AMD needed to make a bigger chip; 14k cores at 3.1 GHz would have got close to a 4090.
The issue isn't so much scheduling but register file bandwidth. Each SIMD lane can only load 4 values from registers per clock cycle. A single FMA needs 3, so it's impossible to do two together unless the value is already loaded in the ALU.

That leaves it only capable of dual-issuing pure adds or pure multiplies, or cases where the compiler knows a value doesn't need to be fetched from registers. The last case is largely only going to be relevant to dot products, and therefore matrix multiplication.

My take on the dual issue is that it's basically a matrix multiplication accelerator. Like a lightweight tensor core that is occasionally of use for other operations.
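A minimal sketch of that register-port constraint as I understand it from the above (my own toy model, not AMD's actual issue logic): with 4 register reads per lane per cycle, two 3-operand FMAs only fit if enough operands come from somewhere other than the register file.

REG_READS_PER_LANE_PER_CYCLE = 4

def can_dual_issue(op_a_operands, op_b_operands, forwarded=0):
    # 'forwarded' = operands satisfied by the ALU / register cache instead of a register-file read
    reads_needed = op_a_operands + op_b_operands - forwarded
    return reads_needed <= REG_READS_PER_LANE_PER_CYCLE

print(can_dual_issue(3, 3))               # FMA + FMA, all operands from registers -> False
print(can_dual_issue(3, 3, forwarded=2))  # FMA + FMA, two operands reused         -> True
print(can_dual_issue(2, 2))               # add + add or mul + mul                 -> True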
 
I assume he's counting the 7900 XTX as 12288 cores rather than 6144.

The issue isn't so much scheduling but register file bandwidth. Each SIMD lane can only load 4 values from registers per clock cycle. A single FMA needs 3, so it's impossible to do two together unless the value is already loaded in the ALU.

That leaves it only capable of dual-issuing pure adds or pure multiplies, or cases where the compiler knows a value doesn't need to be fetched from registers. The last case is largely only going to be relevant to dot products, and therefore matrix multiplication.

My take on the dual issue is that it's basically a matrix multiplication accelerator. Like a lightweight tensor core that is occasionally of use for other operations.
Doing 2 FMAs in only a 64-bit instruction is a bit silly and clearly a sign it was hacked on top of what was already there - still, it was presumably a decent PPA improvement given how little time they had to do it (and not as silly as SGX-XT doing up to 13 flops in 64 bits [Vec4 FMA + 3-way sum-of-squares] about 15 years ago, so it could always be worse...)

On the other hand, RDNA3's Dual-FMA is "automatic" for Wave64, and assuming the compiler makes decent use of the register cache (despite the fact it's per-bank so a bit tricky), I assume it should be reasonably high efficiency even for FMAs! Register bandwidth can be a bottleneck, sure, but the bigger problem seems to be instruction length for encoding fields...

I don't know what AMD's current heuristics for Wave32 vs Wave64 are, though, or what other trade-offs they might have in their HW that could encourage them to use Wave32 despite this. I'm not sure if any of their profiling tools make this obvious, or whether it's possible to figure it out yourself with public tools on third-party games.
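A toy way to see why the heuristic matters, under the assumptions above (my simplification, not AMD's model): with Wave64 the second FP32 pipe is used 'for free', while with Wave32 it only helps for the fraction of instructions the compiler manages to pair as VOPD.

def lanes_per_cycle_wave64():
    return 64.0  # one wave64 VALU op per cycle across both 32-wide pipes (assumed ideal)

def lanes_per_cycle_wave32(pair_rate):
    # N ops of 32 lanes each; a paired (VOPD) op shares an issue cycle with its partner
    issue_cycles_per_op = 1.0 - pair_rate / 2.0
    return 32.0 / issue_cycles_per_op

for pair_rate in (0.0, 0.3, 1.0):
    print(f"wave32 with {pair_rate:.0%} VOPD pairing: {lanes_per_cycle_wave32(pair_rate):.0f} lanes/cycle "
          f"(wave64: {lanes_per_cycle_wave64():.0f})")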
 
I don't know what AMD's current heuristics for Wave32 vs Wave64 are, though, or what other trade-offs they might have in their HW that could encourage them to use Wave32 despite this. I'm not sure if any of their profiling tools make this obvious, or whether it's possible to figure it out yourself with public tools on third-party games.
For GFX11, their driver strongly prefers to compile shaders in Wave64 unless there's a lot of non-uniform branching divergence within a wave (as is the case with ray tracing shaders) or subgroup shuffle operations. This can be verified in RGP under the 'Overview' tab, in the 'Pipelines' subtab, by expanding the "Bucket ID" for each pipeline ...
 