AMD: RDNA 3 Speculation, Rumours and Discussion

DavidGraham · Aug 14, 2022

I feel this means AMD is skimping on RT hardware yet again.

Jawed · Aug 14, 2022

Relative counts of TMUs and ROPs (per CU) don't need to change if you're getting >2x scaling in ALU count per mm² by cutting out a load of scheduling hardware at the CU and SIMD level while simultaneously re-wiring the vector register file and pipeline forwarding to reduce the over-provisioning seen in prior RDNA. So compute density gets a massive boost and power consumption per FLOP should also fall substantially.

The problem remains actual compute throughput in this design.

DegustatoR · Aug 14, 2022

Jawed said:
The problem remains actual compute throughput in this design.

Eh, I mean this looks pretty much like a copy of what Nv did in Ampere but without the loss of a dedicated integer pipeline.
I think it will do fine in FP compute, likely above what Ampere has shown compared to Turing.
Whether this will be enough to resolve RDNA2's weak points remains to be seen though.

Seanspeed · Aug 14, 2022

DegustatoR said:
Eh, I mean this looks pretty much like a copy of what Nv did in Ampere but without the loss of a dedicated integer pipeline.

Nvidia didn't add additional ALUs for Ampere; they just repurposed existing ones.

Perhaps that's what AMD is also doing for RDNA3, but what it sounds like so far is an actual physical doubling of the ALUs, which would be a very different situation.

DegustatoR · Aug 14, 2022

Seanspeed said:
Nvidia didn't add additional ALUs for Ampere; they just repurposed existing ones.

They did. You can't "repurpose" integer ALUs for FP math.

Seanspeed said:
Perhaps that's what AMD is also doing for RDNA3, but what it sounds like so far is an actual physical doubling of the ALUs, which would be a very different situation.

The difference would be in the absence of integer SIMD which would mean that a) actual performance gain in mixed math will likely be higher (since there won't be cases where the previously present integer pipeline will do the same stuff again leading to no performance change) and b) area spent will probably be higher - but this is mostly irrelevant if the SIMDs are redesigned anyway.

That being said, I wonder if the scheduling overhead will become higher, leading to more stalls inside the WGP. Nvidia's solution is rather elegant in how it uses its dual SIMDs, but AMD can't really copy that unless they are willing to go to 128-lane widths.

Samwell · Aug 14, 2022

Jawed said:
Relative counts of TMUs and ROPs (per CU) don't need to change if you're getting >2x scaling in ALU count per mm² by cutting out a load of scheduling hardware at the CU and SIMD level while simultaneously re-wiring the vector register file and pipeline forwarding to reduce the over-provisioning seen in prior RDNA. So compute density gets a massive boost and power consumption per FLOP should also fall substantially.

The problem remains actual compute throughput in this design.

The TMU count per CU/WGP won't change, but a WGP will now have double the ALU count, so the TEX/ALU ratio is halved. This will still improve performance, but it won't double it, and perf per GFLOP will be much lower. The compute limitation of RDNA2 will shift more to a TMU or whatever other limitation, as seen in Ampere.

I just expect a bigger improvement than in Ampere, as Ampere's solution was very basic: adding FP32 units and not changing much else.

Quite funny this time. Both architectures are converging: Nvidia is adding cache, AMD is doubling its ALU count per WGP. That might be an advantage for both if games focus on very similar ALU/TEX/ROP performance ratios, while being a problem for Intel.
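
To put rough numbers on the "improves, but won't double" argument above, here is a toy sketch. It assumes a pass takes as long as its slowest unit, and the per-WGP counts (128 ALU lanes, 8 TMUs) and the workload split are illustrative assumptions, not measured figures.

```python
# Toy bottleneck model: a pass is limited by whichever unit takes longest.
# All numbers are illustrative assumptions, not measurements.

def tex_per_alu(alus_per_wgp: int, tmus_per_wgp: int) -> float:
    """Texture units available per ALU lane in one WGP."""
    return tmus_per_wgp / alus_per_wgp

def pass_time(alu_work: float, tex_work: float,
              alu_rate: float, tex_rate: float) -> float:
    """Time for a pass when ALU and TEX work overlap perfectly."""
    return max(alu_work / alu_rate, tex_work / tex_rate)

# Assumed baseline WGP: 128 ALU lanes, 8 TMUs. Doubled ALUs: 256 lanes, 8 TMUs.
ratio_before = tex_per_alu(128, 8)
ratio_after = tex_per_alu(256, 8)

# Hypothetical pass: ALU-bound at baseline, with texturing not far behind.
time_before = pass_time(alu_work=1.0, tex_work=0.7, alu_rate=1.0, tex_rate=1.0)
time_after = pass_time(alu_work=1.0, tex_work=0.7, alu_rate=2.0, tex_rate=1.0)
speedup = time_before / time_after

print(f"TEX per ALU: {ratio_before:.4f} -> {ratio_after:.4f}")
print(f"speedup from doubling ALUs: {speedup:.2f}x")
```

In this made-up case the pass becomes texture-bound after the doubling, so the gain lands between 1x and 2x. Real frames mix many passes with different limits, so treat this only as an illustration of the shifting bottleneck.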

DegustatoR · Aug 14, 2022

Samwell said:
Both architectures are converging.

Has been the case since R600.

Jawed · Aug 14, 2022

Samwell said:
The compute limitation of RDNA2 will shift more to a TMU

I doubt texturing, per se, will be much of a bottleneck - it's not why 6950XT is slower than 3090Ti (744 versus 625 gigatexels/s). There are format-related performance variations there, though, and games are now so complex it's hard to compare across architectures...

pTmdfx · Aug 14, 2022

I like this mass confusion prior to launch, hehe.

Let’s muddle the water with — according to LLVM patches so far — Wave32 VOPD co-issue supporting only a tiny subset (10-ish) of (mostly FP32) ALU opcodes?

So is this truly doubling ALU? Or are there more alternative theories to it? Like while co-issue coverage is deliberately limited, VALU pipeline is indeed doubled in full within a “SIMD” to execute an extra wavefront in parallel (which co-issue can steal)? Spicy indeed. :mrgreen:

Samwell · Aug 14, 2022

Jawed said:
I doubt texturing, per se, will be much of a bottleneck - it's not why 6950XT is slower than 3090Ti (744 versus 625 gigatexels/s). There are format-related performance variations there, though, and games are now so complex it's hard to compare across architectures...

There is never a single bottleneck; it's always a mixture, but the bottleneck will shift with a doubling of ALUs. A 3090Ti has far more GFLOPS, so it should be a lot faster than a 6950XT, but it's not, because other bottlenecks are limiting it. The same will happen when AMD doubles its ALUs.

iroboto · Aug 15, 2022

Bondrewd said:
Yeap.
And their CPU efficiency regressed.
Whatever, Apple sucks now.

Not related to your post, it's just easier to reply here, I apologize. But what are your thoughts on the RDNA chiplets here with respect to the consoles? Eventually there will be another generation; any ideas on what that console architecture may look like?

Bondrewd · Aug 15, 2022

iroboto said:
any ideas on what that console architecture may look like

Gonna bet on some kind of 2.5D chiplet solution in consoles for the next gen, maybe even for this midgen refresh.
As you've seen, N5 costs can be pretty rough, and with N3 flavours it's even worse.

Honestly depends on TSMC/ASE fanout capacities.
N33 isn't tiled because AMD was afraid they'd run out of InFO/FoCoS slots in all of Taiwan.

Jawed · Aug 15, 2022

pTmdfx said:
I like this mass confusion prior to launch, hehe.

Let’s muddle the water with — according to LLVM patches so far — Wave32 VOPD co-issue supporting only a tiny subset (10-ish) of (mostly FP32) ALU opcodes?

It bugs the hell out of me that the sets of available co-issuable OPs per VOPD-half are not even symmetric.

trinibwoy · Aug 15, 2022

Samwell said:
There is never a single bottleneck; it's always a mixture, but the bottleneck will shift with a doubling of ALUs. A 3090Ti has far more GFLOPS, so it should be a lot faster than a 6950XT, but it's not, because other bottlenecks are limiting it. The same will happen when AMD doubles its ALUs.

Flops aren’t the determining factor in predicting which card should be faster. If anything, high flops help with the rare ALU-bound pass during a frame. Post-processing shaders eat them up.

The vast majority of passes though are memory bandwidth/latency limited on a 3090. I’m guessing it’s mostly latency and that’s why Ampere does a bit better at higher resolutions where there is more work available to hide that latency.
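
One way to picture why peak flops alone don't predict the winner is a simple roofline-style estimate. This is only a sketch: the two cards and their numbers are hypothetical, and the model covers bandwidth limits only, not the latency effects guessed at above.

```python
# Roofline-style estimate: attainable throughput is capped by either peak
# compute or by memory bandwidth times arithmetic intensity.
# The card figures below are hypothetical, chosen only for illustration.

def attainable_gflops(peak_gflops: float, bandwidth_gbs: float,
                      flops_per_byte: float) -> float:
    """Upper bound on sustained GFLOPS for a pass with the given intensity."""
    return min(peak_gflops, bandwidth_gbs * flops_per_byte)

# Two hypothetical cards with equal bandwidth but a 2x gap in peak flops.
card_a = {"peak_gflops": 40000.0, "bandwidth_gbs": 1000.0}
card_b = {"peak_gflops": 20000.0, "bandwidth_gbs": 1000.0}

# A bandwidth-bound pass (4 flops/byte) and an ALU-heavy pass (64 flops/byte).
for intensity in (4.0, 64.0):
    a = attainable_gflops(flops_per_byte=intensity, **card_a)
    b = attainable_gflops(flops_per_byte=intensity, **card_b)
    print(f"{intensity:>5.1f} flops/byte: A={a:.0f} GFLOPS, B={b:.0f} GFLOPS")
```

In this sketch the two cards tie on the low-intensity pass and only separate on the ALU-heavy one, which matches the idea that extra flops mostly pay off in the rare ALU-bound pass.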

iroboto · Aug 15, 2022

trinibwoy said:
Flops aren’t the determining factor in predicting which card should be faster. If anything, high flops help with the rare ALU-bound pass during a frame. Post-processing shaders eat them up.

The vast majority of passes though are memory bandwidth/latency limited on a 3090. I’m guessing it’s mostly latency and that’s why Ampere does a bit better at higher resolutions where there is more work available to hide that latency.

Post-processing stage: would a high-TF card have any more trouble doing a lot of post-processing at lower resolution versus doing less post-processing at higher resolution?

I recall reading a quick dev take that said that post processing is really the step that starts making the image look good.

trinibwoy · Aug 15, 2022

iroboto said:
Post-processing stage: would a high-TF card have any more trouble doing a lot of post-processing at lower resolution versus doing less post-processing at higher resolution?

I recall reading a quick dev take that said that post processing is really the step that starts making the image look good.

Depends on where the flops are coming from. If it’s a very wide chip you may literally not have enough work available to fill the chip at lower resolutions. If the flops are coming from clocks and a narrower arch then it’s a little easier.
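
A toy model of the wide-versus-narrow point, under loudly stated assumptions: both hypothetical chips have the same peak throughput, every wave costs a fixed 1000 cycles, and waves run in whole "rounds" limited by how many the chip can execute in parallel. None of these figures describe a real GPU.

```python
import math

# Toy occupancy model: a pass runs in rounds, each round executing up to
# `parallel_waves` waves at once. All parameters are hypothetical.

def pass_time_us(waves: int, parallel_waves: int, clock_ghz: float,
                 cycles_per_wave: int = 1000) -> float:
    """Microseconds to finish `waves` waves on a chip of the given width."""
    rounds = math.ceil(waves / parallel_waves)
    # 1 GHz is 1000 cycles per microsecond.
    return rounds * cycles_per_wave / (clock_ghz * 1000.0)

# Same peak throughput: wide and slow versus narrow and fast.
wide = {"parallel_waves": 200, "clock_ghz": 1.0}
narrow = {"parallel_waves": 100, "clock_ghz": 2.0}

for waves in (50, 200):
    t_wide = pass_time_us(waves, **wide)
    t_narrow = pass_time_us(waves, **narrow)
    print(f"{waves} waves: wide={t_wide} us, narrow={t_narrow} us")
```

With only 50 waves the wide chip sits mostly idle and the narrow, higher-clocked one finishes first; at 200 waves the two tie. Real schedulers overlap work far more flexibly, so this only illustrates the direction of the effect.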

Silent_Buddha · Aug 15, 2022

If the article that was previously posted is correct, then AMD will be further reducing FP64 performance with RDNA 3. That should lead to some transistor savings, correct? Those savings would, of course, then be used for other things.

Regards,
SB

DegustatoR · Aug 15, 2022

Silent_Buddha said:
If the article that was previously posted is correct, then AMD will be further reducing FP64 performance with RDNA 3. That should lead to some transistor savings, correct? Those savings would, of course, then be used for other things.

Regards,
SB

RDNA2 is at about 15:1 FP64 rate which isn't exactly an FP64 monster. How much can they "save" going down from that? I think it's more likely that the ratio of FP32 to FP64 ALUs will go up simply because they'll double the former while leaving the same number per WGP for the latter.
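
The arithmetic behind that last point can be sketched as below. The per-WGP figures are assumptions for illustration (128 FP32 lanes per clock and an FP64 rate of 8 per clock, i.e. a 16:1 starting ratio close to the "about 15:1" mentioned), and the code only tracks throughput ratios; it says nothing about how FP64 is actually implemented in hardware.

```python
from fractions import Fraction

# Ratio of FP32 to FP64 throughput per WGP per clock.
# The per-WGP rates are assumed values for illustration only.

def fp32_to_fp64_ratio(fp32_per_clk: int, fp64_per_clk: int) -> Fraction:
    """FP32:FP64 throughput ratio as an exact fraction."""
    return Fraction(fp32_per_clk, fp64_per_clk)

baseline = fp32_to_fp64_ratio(128, 8)   # assumed current WGP
doubled = fp32_to_fp64_ratio(256, 8)    # FP32 doubled, FP64 rate unchanged

print(f"baseline {baseline}:1, after doubling FP32 {doubled}:1")
```

So the ratio can double without removing any FP64 capability at all, simply because the FP32 side grows.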

no-X · Aug 15, 2022

DegustatoR said:
RDNA2 is at about 15:1 FP64 rate which isn't exactly an FP64 monster. How much can they "save" going down from that? I think it's more likely that the ratio of FP32 to FP64 ALUs will go up simply because they'll double the former while leaving the same number per WGP for the latter.

There are no FP64 ALUs in GCN / RDNA.

DegustatoR · Aug 15, 2022

no-X said:
There are no FP64 ALUs in GCN / RDNA.

And how do they do FP64 math?
