AMD: RDNA 3 Speculation, Rumours and Discussion

Status
Not open for further replies.
Relative counts of TMUs and ROPs (per CU) don't need to change if you're getting >2x scaling in ALU count per mm² by cutting out a load of scheduling hardware at the CU and SIMD level while simultaneously re-wiring the vector register file and pipeline forwarding to reduce the over-provisioning seen in prior RDNA. So compute density gets a massive boost and power consumption per FLOP should also fall substantially.

The problem remains actual compute throughput in this design.
 
The problem remains actual compute throughput in this design.
Eh, I mean this looks pretty much like a copy of what Nv did in Ampere but without the loss of a dedicated integer pipeline.
I think it will do fine in FP compute, likely above what Ampere has shown compared to Turing.
Whether this will be enough to resolve RDNA2's weak points remains to be seen though.
 
Eh, I mean this looks pretty much like a copy of what Nv did in Ampere but without the loss of a dedicated integer pipeline.
Nvidia didn't add additional ALUs for Ampere; they just repurposed existing ones.

Perhaps that's what AMD is also doing for RDNA3, but what it sounds like so far is an actual physical doubling of the ALUs, which would be a very different situation.
 
Nvidia didn't add additional ALUs for Ampere; they just repurposed existing ones.
They did. You can't "repurpose" integer ALUs for FP math.

Perhaps that's what AMD is also doing for RDNA3, but what it sounds like so far is an actual physical doubling of the ALUs, which would be a very different situation.
The difference would be the absence of an integer SIMD, which would mean that a) the actual performance gain in mixed math will likely be higher (since there won't be cases where the previously present integer pipeline does the same work again, leading to no performance change), and b) the area spent will probably be higher - but this is mostly irrelevant if the SIMDs are redesigned anyway.

That being said, I wonder if the scheduling overhead will become higher, leading to more stalls inside the WGP. Nvidia's solution is rather elegant in how they use their dual SIMDs, but AMD can't really copy that unless they're willing to go to 128-lane widths.
 
Relative counts of TMUs and ROPs (per CU) don't need to change if you're getting >2x scaling in ALU count per mm² by cutting out a load of scheduling hardware at the CU and SIMD level while simultaneously re-wiring the vector register file and pipeline forwarding to reduce the over-provisioning seen in prior RDNA. So compute density gets a massive boost and power consumption per FLOP should also fall substantially.

The problem remains actual compute throughput in this design.

The TMU count per CU/WGP won't change, but a WGP will now have double the ALU count, so the ALU:TEX ratio is halved. This will still improve performance, but it won't double it, and perf/GFLOP will be much lower. RDNA2's compute limitation will shift more toward the TMUs, or whatever other limitation was seen in Ampere.
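To put rough numbers on that ratio shift, here's a back-of-the-envelope sketch; the per-WGP figures are illustrative assumptions, not confirmed RDNA3 specs:

```python
# Back-of-the-envelope ALU:TEX arithmetic for an RDNA2-style WGP.
# All figures are illustrative assumptions, not confirmed specs.
simd_lanes = 4 * 32        # 4x SIMD32 per WGP
tmus_per_wgp = 8           # 2 CUs x 4 TMUs

flops_per_clk = simd_lanes * 2             # FMA counts as 2 FLOPs per lane
ratio_rdna2 = flops_per_clk / tmus_per_wgp

# Doubling the FP32 ALUs while keeping TMUs fixed halves texels per FLOP.
ratio_rdna3 = (flops_per_clk * 2) / tmus_per_wgp

print(ratio_rdna2, ratio_rdna3)  # 32.0 64.0 FLOPs per texel-clock
```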

I just expect a bigger improvement than in Ampere, as Ampere's solution was very basic: adding FP32 units without changing much else.

Quite funny this time: both architectures are converging. Nvidia is adding cache, AMD is doubling their ALU count per WGP. It might be an advantage for both, when games target very similar ALU/TEX/ROP performance ratios, while being a problem for Intel.
 
The compute limitation of RDNA2 will shift more to a TMU
I doubt texturing, per se, will be much of a bottleneck - it's not why the 6950 XT is slower than the 3090 Ti (744 versus 625 gigatexels/s). There are format-related performance variations there, though, and games are now so complex that it's hard to compare across architectures...
 
I like this mass confusion prior to launch, hehe.

Let's muddy the waters: according to the LLVM patches so far, Wave32 VOPD co-issue supports only a tiny subset (10-ish) of (mostly FP32) ALU opcodes.

So is this truly a doubling of the ALUs? Or are there alternative theories? Like: while co-issue coverage is deliberately limited, the VALU pipeline is indeed fully doubled within a “SIMD” to execute an extra wavefront in parallel (which co-issue can steal)? Spicy indeed. :mrgreen:
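The limited-co-issue idea can be sketched as a toy instruction packer. The opcode whitelists below are placeholders, not the actual VOPD tables from the LLVM patches:

```python
# Toy model of Wave32 VOPD co-issue: only a small whitelist of FP32 ops
# can pair, and the two halves (OPX/OPY) accept different subsets.
# These opcode sets are placeholders, not the real LLVM VOPD tables.
OPX_OK = {"fmac", "add", "mul", "mov"}
OPY_OK = {"add", "mul", "mov", "cndmask"}   # asymmetric on purpose

def pack_vopd(ops):
    """Greedily pair adjacent ops into VOPD slots where the whitelist allows."""
    issued, i = [], 0
    while i < len(ops):
        if i + 1 < len(ops) and ops[i] in OPX_OK and ops[i + 1] in OPY_OK:
            issued.append((ops[i], ops[i + 1]))   # dual-issued in one cycle
            i += 2
        else:
            issued.append((ops[i],))              # single issue
            i += 1
    return issued

print(pack_vopd(["fmac", "mul", "sqrt", "add", "mov"]))
# -> [('fmac', 'mul'), ('sqrt',), ('add', 'mov')]  (3 issue slots, not 5)
```

The point of the toy: with a narrow whitelist, how much of a real shader dual-issues depends entirely on how often eligible ops happen to sit next to each other.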
 
I doubt texturing, per se, will be much of a bottleneck - it's not why the 6950 XT is slower than the 3090 Ti (744 versus 625 gigatexels/s). There are format-related performance variations there, though, and games are now so complex that it's hard to compare across architectures...
There is never a single bottleneck, it's always a mixture, but the bottleneck will shift with the doubling of ALUs. The 3090 Ti has far more GFLOPS, so it should be a lot faster than the 6950 XT, but it isn't, because other bottlenecks limit it. The same will happen when AMD doubles its ALUs.
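For a rough sense of scale, here are paper specs (boost-clock figures from public spec sheets, so approximate):

```python
# Approximate boost-clock paper specs from public spec sheets.
cards = {
    "RX 6950 XT":  {"tflops_fp32": 23.7, "gtexels_s": 744},
    "RTX 3090 Ti": {"tflops_fp32": 40.0, "gtexels_s": 625},
}

# FLOPs available per texel sampled: the 3090 Ti has roughly 2x the
# 6950 XT's, yet isn't ~2x faster - something other than FLOPs limits it.
flops_per_texel = {name: s["tflops_fp32"] * 1e3 / s["gtexels_s"]
                   for name, s in cards.items()}
for name, fpt in flops_per_texel.items():
    print(name, round(fpt, 1), "FLOPs per texel")
```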
 
Yeap.
And their CPU efficiency regressed.
Whatever, Apple sucks now.
Not related to your post, just easier to reply this way, I apologize. What are your thoughts on the RDNA chiplets here with regard to the consoles? Eventually there will be another generation; any ideas on what that console architecture may look like?
 
any ideas on what that console architecture may look like
Gonna bet on some kind of 2.5D chiplet solution in consoles for the next gen, maybe even for this midgen refresh.
As you've seen, N5 costs can be pretty rough, and with N3 flavours it's even worse.

Honestly depends on TSMC/ASE fanout capacities.
N33 isn't tiled because AMD was afraid they'd run out of InFO/FoCoS slots in all of Taiwan.
 
I like this mass confusion prior to launch, hehe.

Let's muddy the waters: according to the LLVM patches so far, Wave32 VOPD co-issue supports only a tiny subset (10-ish) of (mostly FP32) ALU opcodes.
It bugs the hell out of me that the sets of available co-issuable OPs per VOPD-half are not even symmetric.
 
There is never a single bottleneck, it's always a mixture, but the bottleneck will shift with the doubling of ALUs. The 3090 Ti has far more GFLOPS, so it should be a lot faster than the 6950 XT, but it isn't, because other bottlenecks limit it. The same will happen when AMD doubles its ALUs.

Flops aren’t the determining factor in predicting which card should be faster. If anything high flops help with the rare ALU bound pass during a frame. Post processing shaders eat them up.

The vast majority of passes though are memory bandwidth/latency limited on a 3090. I’m guessing it’s mostly latency and that’s why Ampere does a bit better at higher resolutions where there is more work available to hide that latency.
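The latency-hiding argument follows from Little's law: the data in flight must cover bandwidth × latency. A minimal sketch, with assumed (not measured) numbers:

```python
# Little's law: bytes_in_flight = bandwidth * latency.
# Both figures are assumptions for illustration, not measured Ampere values.
bandwidth_gbs = 1008   # GB/s, GDDR6X-class
latency_ns = 400       # assumed average memory latency

# GB/s x ns conveniently gives bytes (the 1e9 and 1e-9 factors cancel).
inflight_bytes = bandwidth_gbs * latency_ns
print(inflight_bytes, "bytes must be in flight to saturate the bus")
# Higher resolutions put more independent pixel loads in flight, which is
# consistent with Ampere scaling relatively better at 4K.
```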
 
Flops aren’t the determining factor in predicting which card should be faster. If anything high flops help with the rare ALU bound pass during a frame. Post processing shaders eat them up.

The vast majority of passes though are memory bandwidth/latency limited on a 3090. I’m guessing it’s mostly latency and that’s why Ampere does a bit better at higher resolutions where there is more work available to hide that latency.
Regarding the post-processing stage: would a high-TFLOP card have any more trouble doing a lot of post-processing at a lower resolution versus doing less post-processing at a higher resolution?

I recall reading a quick dev take that said that post processing is really the step that starts making the image look good.
 
Regarding the post-processing stage: would a high-TFLOP card have any more trouble doing a lot of post-processing at a lower resolution versus doing less post-processing at a higher resolution?

I recall reading a quick dev take that said that post processing is really the step that starts making the image look good.

Depends on where the flops are coming from. If it’s a very wide chip you may literally not have enough work available to fill the chip at lower resolutions. If the flops are coming from clocks and a narrower arch then it’s a little easier.
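The "not enough work to fill a wide chip" point can be sanity-checked with a toy occupancy estimate; the lane count and waves-per-lane target are assumptions:

```python
# Toy occupancy estimate: pixels available vs. threads the GPU wants in flight.
lanes = 12288          # hypothetical wide-GPU FP32 lane count
waves_per_lane = 8     # assumed threads in flight per lane to hide latency

ratios = {}
for name, (w, h) in {"1080p": (1920, 1080), "4K": (3840, 2160)}.items():
    ratios[name] = (w * h) / (lanes * waves_per_lane)
    print(name, round(ratios[name], 1), "x the threads the machine wants")
# A single full-screen pass still fills the chip at 1080p, but small
# partial-resolution passes (bloom chains, quarter-res effects) divide
# these numbers down quickly on a wide part.
```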
 
If the article that was previously posted is correct, then AMD will be further reducing FP64 performance with RDNA 3. That should lead to some transistor savings, correct? Those savings would, of course, then be used for other things.

Regards,
SB
 
If the article that was previously posted is correct, then AMD will be further reducing FP64 performance with RDNA 3. That should lead to some transistor savings, correct? Those savings would, of course, then be used for other things.

Regards,
SB
RDNA2 is at about a 15:1 FP64 rate, which isn't exactly an FP64 monster. How much can they "save" going down from that? I think it's more likely that the ratio of FP32 to FP64 ALUs will go up simply because they'll double the former while leaving the same number per WGP for the latter.
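The ratio arithmetic, with illustrative per-WGP throughputs (not confirmed figures):

```python
# If FP32 throughput per WGP doubles while FP64 throughput stays put,
# the FP32:FP64 ratio doubles with no FP64-side changes at all.
fp32_ops_per_clk = 256   # illustrative per-WGP figure
fp64_ops_per_clk = 16    # illustrative, however FP64 is actually implemented

ratio_now = fp32_ops_per_clk // fp64_ops_per_clk
ratio_after_doubling = (2 * fp32_ops_per_clk) // fp64_ops_per_clk
print(ratio_now, "->", ratio_after_doubling)  # 16 -> 32
```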
 
RDNA2 is at about a 15:1 FP64 rate, which isn't exactly an FP64 monster. How much can they "save" going down from that? I think it's more likely that the ratio of FP32 to FP64 ALUs will go up simply because they'll double the former while leaving the same number per WGP for the latter.
There are no FP64 ALUs in GCN / RDNA.
 