> So, we're back to 4 SIMDs per CU (or whatever arbitrary name you will call it). Now all we need is a 4-cycle round robin cadence to be back at GCN with 4x throughput.

That's the Work Group Processor, where two CUs are tightly integrated (sharing constant/instruction caches and the LDS). I think this is analogous to Nvidia's TPC, where two SMs are chained together (three in GT200).
> That's the Work Group Processor

No, that's per 'CU'.
> I think this is analogous to Nvidia's TPC, where two SMs are chained together (three in GT200).

TPC is just a physical layout thingy; you can't just double the shmem available to a single SM with a toggle.
> They just now need to keep re-posting snippets from this thread

I'd say some techtubers already did that with a smattering of Chinese-forum auto-translate (along with good ol' FUD).
> So, we're back to 4 SIMDs per CU (or whatever arbitrary name you will call it). Now all we need is a 4-cycle round robin cadence to be back at GCN with 4x throughput.

Triple confirmed: RDNA 3 has 128-wide wavefronts and yet again zero-cycle ALU instruction latency.
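For reference, here's the sustained lanes-per-clock arithmetic behind the banter: GCN's 4-cycle wave64 cadence never changed per-clock throughput, it only changed how often each SIMD started a new instruction. The 256-ALU WGP figure is the rumour discussed further down this thread, not a confirmed spec; a trivial C++ sketch:

```cpp
// Sustained FP32 lanes per clock for the configurations being compared.
// The RDNA 3 number is a rumour, not a confirmed specification.
#include <cstdio>

int main() {
    const int gcn_cu    = 4 * 16;       // 4x SIMD16, wave64 issued over 4 cycles
    const int rdna_cu   = 2 * 32;       // 2x SIMD32, single-cycle issue
    const int rdna_wgp  = 2 * rdna_cu;  // two CUs per WGP = 128 lanes/clk
    const int rdna3_wgp = 256;          // rumoured RDNA 3 WGP ALU count
    printf("GCN CU: %d  RDNA CU: %d  RDNA WGP: %d  rumoured RDNA3 WGP: %d lanes/clk\n",
           gcn_cu, rdna_cu, rdna_wgp, rdna3_wgp);
    return 0;
}
```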
> We can infer that 2 SIMDs can cooperate in a non-pixel-shader workgroup because it reduces latency

Not sure whether the programming model reflects your speculation:
> Not sure whether the programming model reflects your speculation:

I was referring to CU mode, not WGP mode, though I admit I didn't reinforce that constraint at that point.
> * Memory access within a workgroup is coherent only for requests skipping L0$, or atomic operations.

(except when you specify the CU mode at dispatch time, presumably).
> * Barriers are monitored at the WGP level anyway, since it is a workgroup primitive.

In CU mode barriers are CU level.
> * LDS are equally reachable by all SIMDs in WGP.

Only in WGP mode.
> Though the ISA documentation does say explicitly that CU mode can lead to higher effective LDS bandwidth. But that sounds more like a specific matter of certain memory access patterns disfavoring(?) the WGP mode of the LDS, rather than something to do with some unspecified tight integration within a WGP half.

There are two LDS arrays inside the LDS, each array localised to its parent CU. So in CU mode you get doubled bandwidth.
> I was referring to CU mode, not WGP mode, though I admit I didn't reinforce that constraint at that point.

The nuance here is that these are all currently execution-mode differences at runtime, which could well be emulated on top of hardware that is architecturally WGP-level. You are trying to extrapolate that there are more benefits from this mode, which is possible. But the reason could also be as simple as a compatibility mode for kernels written against programming-model assumptions that have been around for a decade (e.g. wavefront size = 64), which isn't something a shader compiler alone can fix/patch transparently.
In CU mode barriers are CU level.
Only in WGP mode.
There are two LDS arrays inside LDS, each array is localised to the parent CU. So in CU mode you get doubled bandwidth. I didn't see the note in the ISA about reduced bandwidth in WGP mode, but that merely backs up what I was saying before.
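To make the CU mode / WGP mode distinction concrete from the software side: the kernel source is identical either way, because the mode is a compile/dispatch-time property rather than something you write in the shader (for gfx10+ LLVM has a `cumode` subtarget feature, and I believe clang exposes `-mcumode` / `-mno-cumode`, though double-check the flag names). What changes is which SIMDs the workgroup's waves land on, and therefore what the barrier and the LDS allocation span: all 4 SIMDs of the WGP in WGP mode, the 2 SIMDs of one CU in CU mode, which is where the two-LDS-arrays / doubled-bandwidth argument above comes from. A minimal HIP-style sketch, nothing RDNA-specific in the source:

```cpp
// Minimal HIP sketch: a workgroup sum staged through LDS ("__shared__").
// Nothing here selects CU vs WGP mode -- that is a compile/dispatch property,
// not part of the kernel source. Only the hardware mapping of the workgroup
// (which SIMDs it spans, which LDS half backs the allocation) changes.
#include <hip/hip_runtime.h>
#include <cstdio>
#include <vector>

constexpr int WG = 256;  // workgroup size

__global__ void block_sum(const float* in, float* out) {
    __shared__ float lds[WG];            // backed by the WGP's LDS (one half of it in CU mode)
    const int tid = threadIdx.x;
    lds[tid] = in[blockIdx.x * WG + tid];
    __syncthreads();                     // workgroup barrier
    for (int stride = WG / 2; stride > 0; stride >>= 1) {
        if (tid < stride) lds[tid] += lds[tid + stride];
        __syncthreads();
    }
    if (tid == 0) out[blockIdx.x] = lds[0];
}

int main() {
    const int blocks = 4, n = blocks * WG;
    std::vector<float> h_in(n, 1.0f), h_out(blocks, 0.0f);
    float *d_in = nullptr, *d_out = nullptr;
    hipMalloc((void**)&d_in, n * sizeof(float));
    hipMalloc((void**)&d_out, blocks * sizeof(float));
    hipMemcpy(d_in, h_in.data(), n * sizeof(float), hipMemcpyHostToDevice);
    block_sum<<<blocks, WG>>>(d_in, d_out);
    hipMemcpy(h_out.data(), d_out, blocks * sizeof(float), hipMemcpyDeviceToHost);
    printf("block 0 sum = %g (expect %d)\n", h_out[0], WG);
    hipFree(d_in);
    hipFree(d_out);
    return 0;
}
```

Build with hipcc; the point is simply that CU vs WGP mode never shows up at this level, which is why it could plausibly be just a compatibility/performance toggle rather than a programming-model feature.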
L0$ per WGP, L1$ per shader array. Well, that's how RDNA (2) is configured.
The ALU:RA ratio is pretty high in RDNA 2 already, so making it much higher may well be harmless given that the entire BVH, in the worst case, can't fit into infinity cache (i.e. lots of latency). More WGPs in Navi 31 still means a theoretical increase in intersection throughput (e.g. 3x Navi 21).
We can infer that 2 SIMDs can cooperate in a non-pixel-shader workgroup because it reduces latency (wall clock latency for all the work items in the workgroup).
I think you're going to have to be very specific in pointing out a problem with what I've written. I'm not saying I haven't made a mistake, but while you're hiding the problem you've identified I can't read your mind and I'm not motivated to find the problem in text that I've already spent well over 6 hours writing (I posted version 3, in case you're wondering).
> When AMD puts more shaders in the WGPs, how does this affect their ray tracing approach? I thought they need more WGPs for ray tracing, not fewer?

It's one RA per CU as it is, or two RAs per WGP. They are closely coupled to, if not an integral part of, the TMU. If they don't change the ALU:TEX ratio, there's no reason to believe RT performance won't scale with ALU performance. Bigger ∞$ would only help RT perf, especially if they can make the BVH stick in that $.
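Putting rough numbers on that: Navi 21 has 40 WGPs (80 CUs, so 80 RAs at one per CU), and the rumour below is 30 WGPs per GCD with two GCDs for Navi 31. Assuming the one-RA-per-CU ratio is kept, which is purely my assumption, the intersection-hardware count scales like this:

```cpp
// Ray-accelerator count comparison, assuming the "one RA per CU / two per WGP"
// ratio is unchanged and the rumoured Navi 31 WGP count is right.
#include <cstdio>

int main() {
    const int ra_per_wgp  = 2;              // 1 RA per CU, 2 CUs per WGP
    const int navi21_wgps = 40;
    const int navi31_wgps = 2 * 30;         // rumour: 30 WGPs per GCD, 2 GCDs
    const int navi21_ras  = navi21_wgps * ra_per_wgp;   // 80
    const int navi31_ras  = navi31_wgps * ra_per_wgp;   // 120
    printf("Navi 21: %d RAs, Navi 31 (rumoured): %d RAs -> %.2fx at iso-clock\n",
           navi21_ras, navi31_ras, (double)navi31_ras / navi21_ras);
    return 0;
}
```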
How did you get more WGPs in Navi 31? Navi 21 has 40, latest rumours say Navi 31 has only 30.
30 per die, but Navi31 seems to have two dies, so the total is 60.
Oh yes that’s right. Though that’s not 3x Navi 21 if latest rumors are accurate.
> It should be >3x Navi 21

Scaling hard!
> 5120 ALUs on Navi 21 vs. 15360 ALUs on dual-chiplet Navi 31. Furthermore, Navi 31 has up to 512MB Infinity Cache, 4x the 128MB in Navi 21.

Correct.
> Think the entire lineup gets 4x LLC bumps as one last SRAM huzzah.

4*96MB = 384MB LLC on Navi 32?
> Oh yes that's right. Though that's not 3x Navi 21 if latest rumors are accurate.
It's 60 WGPs but each WGP now has 256 ALUs, whereas one WGP in Navi 21 has 128 ALUs.
5120 ALUs on Navi 21 vs. 15360 ALUs on dual-chiplet Navi 31. Furthermore, Navi 31 has up to 512MB Infinity Cache, 4x the 128MB in Navi 21.
We should also expect N31's clocks to reach higher frequencies, since it's made on N5P instead of Navi 21's N7P.
It should be >3x Navi 21, even if power consumption jumps to RTX 3090 levels.
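Spelling out the arithmetic in the post above (rumoured Navi 31 configuration vs. the shipping Navi 21):

```cpp
// ALU arithmetic from the post above: rumoured Navi 31 config vs. Navi 21.
#include <cstdio>

int main() {
    const int navi21_alus = 40 * 128;   // 40 WGPs x 128 ALUs = 5120
    const int navi31_alus = 60 * 256;   // 60 WGPs x 256 ALUs = 15360 (rumour)
    printf("ALU ratio: %.1fx\n", (double)navi31_alus / navi21_alus);   // 3.0x
    // Peak FP32 per GHz (2 FLOPs per ALU per clock for FMA):
    printf("FP32 per GHz: Navi 21 %.2f TFLOPS, Navi 31 %.2f TFLOPS\n",
           navi21_alus * 2 / 1e3, navi31_alus * 2 / 1e3);
    // Whatever clock bump N5P allows over N7P multiplies on top of this.
    return 0;
}
```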
4*96MB = 384MB LLC on Navi 32?
128MB LLC on Navi 33?
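Those two guesses are just a straight 4x of the RDNA 2 Infinity Cache sizes (128MB Navi 21, 96MB Navi 22, 32MB Navi 23), i.e. the "4x across the lineup" rumour applied per die; a quick check:

```cpp
// "4x LLC across the lineup" rumour applied to the RDNA 2 Infinity Cache sizes.
#include <cstdio>

int main() {
    const int rdna2_llc_mb[] = {128, 96, 32};   // Navi 21, 22, 23
    const char* rdna3[]      = {"Navi 31", "Navi 32", "Navi 33"};
    for (int i = 0; i < 3; ++i)
        printf("%s: %d MB (4 x %d MB)\n", rdna3[i], 4 * rdna2_llc_mb[i], rdna2_llc_mb[i]);
    return 0;
}
```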
> But these lower tier chips will be single chiplet presumably and it's just double the LLC per chiplet from what I've read?

GCDs have no LLC.
> My expectation at this stage is that the single chiplet N31

No such thing.
> The difference here being that hopefully scaling is much less like 2 GPUs in Crossfire and more like 1 big GPU

It is one big GPU.
> the rest of the stack will be a more traditional performance uplift of around 50% (a bit more if we're very lucky).

No, it's the biggest gen-on-gen uplift in eons, across the stack.