The whitepaper says: "Each compute unit has access to double the LDS capacity and bandwidth".
The whitepaper is quite explicit about CU mode:
"The RDNA architecture has two modes of operation for the LDS, compute-unit and workgroupprocessor mode, which are controlled by the compiler. The former is designed to match the behavior of the GCN architecture and
statically divides the LDS capacity into equal portions between the two pairs of SIMDs. By matching the capacity of the GCN architecture, this mode ensures that existing shaders will run efficiently. However, the work-group processor mode allows using larger allocations of the LDS to boost performance for a single work-group."
It does not seem like CU mode versus WGP mode should matter for peak throughput.
Agreed, the peak is the same. The peak count of DWORDs read or written per LDS bank per cycle is constant whether in CU mode or WGP mode.
After all, in WGP mode you are meant to be able to address all of the LDS, which means both arrays collectively have to be capable of delivering 2x 128B/cycle to either of the SIMD pairs anyway.
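Quick back-of-the-envelope check. Only the 2x 128B/cycle figure is stated above; the "32 banks x 4 bytes per array" breakdown is my assumption about where it comes from:

```cuda
// Back-of-the-envelope peak LDS bandwidth for one WGP. The 2x 128B/cycle
// figure is stated above; the "32 banks x 4 bytes per array" breakdown is an
// assumption about where it comes from.
#include <cstdio>

int main()
{
    const int arrays          = 2;   // two LDS arrays per WGP
    const int banks_per_array = 32;  // assumed
    const int bytes_per_bank  = 4;   // one DWORD per bank per cycle

    const int per_array = banks_per_array * bytes_per_bank;   // 128 B/cycle
    const int per_wgp   = arrays * per_array;                 // 256 B/cycle

    printf("per array: %d B/cycle, per WGP: %d B/cycle\n", per_array, per_wgp);
    // The total is the same whether the two arrays are presented as one
    // WGP-wide pool (WGP mode) or split per CU (CU mode) -- only the
    // addressing changes.
    return 0;
}
```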
An atomic instruction like BUFFER_ATOMIC_SWAP means the peak is capped. In WGP mode all 64 banks of LDS are addressable, but an atomic that returns data, by definition, can only touch a single bank per cycle. This is the scenario I was referring to originally when I was talking about a CU-less design.
If anything could hamper bandwidth, it is perhaps the documented fact that a SIMD pair shares the request & return buses, in which case effective bandwidth can indeed be halved in non-ideal scenarios where all active work-items in a workgroup are biased towards one of the two SIMD pairs. But even then, it does not fit the "localized LDS = CU mode better" theory.
I was referring specifically to bandwidth, and there are some LDS operations where it appears that bandwidth is higher in CU mode, because a CU can only touch one array in LDS.
I can't think of a mechanism in WGP mode that can guarantee that two BUFFER_ATOMIC_SWAP instructions can always run concurrently. If two SIMDs (e.g. 0 and 2) are both doing BUFFER_ATOMIC_SWAP but they are in distinct workgroups, then it should be possible for LDS to run BUFFER_ATOMIC_SWAP for both concurrently. Or, BUFFER_ATOMIC_SWAP for one workgroup and some other "less serial" LDS instruction for the other.
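To make the pattern concrete, here is a minimal sketch using CUDA shared memory and atomicExch() as stand-ins for LDS and a returning atomic swap (that substitution is my assumption; RDNA's DS/buffer atomic paths are obviously not CUDA). It only generates the traffic pattern and cannot show how WGP mode arbitrates between the two workgroups' atomics:

```cuda
// Sketch of the "returning atomic" traffic pattern, using CUDA shared memory
// and atomicExch() as stand-ins for LDS and BUFFER_ATOMIC_SWAP (an assumption;
// RDNA's DS/buffer atomic paths are not CUDA). It only generates the access
// pattern -- it cannot show how WGP mode arbitrates between two workgroups.
__global__ void atomic_swap_sketch(unsigned int *out, int spread)
{
    __shared__ unsigned int s[32];               // roughly one word per bank
    if (threadIdx.x < 32) s[threadIdx.x] = 0;
    __syncthreads();

    // spread == 0: every lane targets word 0 -> a single bank, fully serialized.
    // spread == 1: lane i targets word i % 32 -> banks can work in parallel.
    int idx = spread ? (threadIdx.x % 32) : 0;
    unsigned int old = atomicExch(&s[idx], threadIdx.x);   // atomic that returns data

    out[blockIdx.x * blockDim.x + threadIdx.x] = old;      // keep the returned value
}
// Launching two blocks is the interesting case -- two workgroups each issuing
// returning atomics, which is exactly the concurrency question above:
// atomic_swap_sketch<<<2, 64>>>(d_out, /*spread=*/0);
```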
Yes, I'm looking for corner cases. I'm trying to identify the problems with a CU-less WGP.
I'm not following your thought process. Of course wavefronts within a workgroup can be allocated to different SIMDs within an SM/CU - this has been the case forever. What is it that we need to infer exactly?
And that's all I was inferring.
I was contrasting a workgroup of 64 pixels (work items), which is allocated to a single SIMD, with a non-pixel-shader workgroup of 64 work items that will be allocated across both SIMDs. I was pointing out that the pixel shader allocation to SIMDs was unusual, and that LDS appears to these two SIMDs in a CU the same as in GCN (where it is shared by four SIMDs). A half of LDS (one array out of the two) is dedicated to a CU in CU mode.
TMUs, RAs and L0$ bandwidth are dedicated to a CU all of the time, whether in CU or WGP mode.
Though we also know that RDNA (2) has increased throughputs and sizes versus GCN at the CU level (e.g. L0$ cache lines are twice the size of L1$ in GCN).
I'm referring to your use of the term workgroup in the post that I quoted. You seem to be treating them as groups of threads that must be launched and retired together and that reside on the same SIMD - this is a wavefront not a workgroup. AMD explicitly says that 64-wide wavefronts run on the same 32-wide SIMD and just take multiple cycles to retire each instruction. There is no suggestion that multiple SIMDs will co-operate in the execution of such a wavefront.
What you've written is true for a pixel shader for 64 pixels. It's described as a wave64 and corresponds with a workgroup of size 64. The unusual aspect here is that it's scheduled on a single SIMD with 2 work items sharing a SIMD lane, whereas in general workgroups are spread across SIMDs width first.
But note that a workgroup of 128 work items bound to an RDNA CU necessarily results in 2 work items sharing a SIMD lane, with increasing multiples for larger workgroups up to those of size 1024. So the pixel shader configured as wave64 appears to be a special case of workgroup. As a special case it's designed explicitly to gain from VGPR, LDS and TMU locality for pixels, whose quad work-item layout and use of quad-based derivatives is a special case of locality.
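For anyone following along, the arithmetic is easy to tabulate. A throwaway host-side helper; the only hardware figures assumed are 32-wide SIMDs and 2 SIMDs per CU:

```cuda
// Host-side sketch of the arithmetic above: how many wavefronts a workgroup
// needs, and how many work items share a SIMD lane. Purely illustrative,
// not any AMD API.
#include <cstdio>

void map_workgroup(int workgroup_size, int wave_size)
{
    const int simd_width   = 32;               // RDNA SIMD width
    const int lanes_per_cu = 2 * simd_width;   // two SIMDs per CU

    int wavefronts          = (workgroup_size + wave_size - 1) / wave_size;
    int items_per_lane_wave = wave_size / simd_width;     // 1 for wave32, 2 for wave64
    int items_per_lane_cu   = (workgroup_size + lanes_per_cu - 1) / lanes_per_cu;

    printf("workgroup=%4d wave%d: %2d wavefront(s), %d item(s)/lane within a wave, %2d item(s)/lane across one CU\n",
           workgroup_size, wave_size, wavefronts, items_per_lane_wave, items_per_lane_cu);
}

int main()
{
    map_workgroup(64,   64);  // pixel-shader wave64: 2 items/lane within the wave, even though the CU has 64 lanes
    map_workgroup(128,  32);  // 128 items on one CU: 2 items per lane
    map_workgroup(1024, 32);  // 1024 items on one CU: 16 items per lane
    return 0;
}
```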
This then leads me to believe that "wave64" for pixel shading is a hack that ties together two hardware threads on a SIMD. Not only is there a clue in the gotcha that I described earlier, but one mode of operation (alternating halves per instruction) is just like two hardware threads that issue on alternating cycles - which is a feature of RDNA.
In the end pixel shaders are merely a subset of compute shaders, where some of the work item state (values in registers and constant buffers) is driven by rasterisation.
A workgroup per the OpenCL spec is the equivalent of a CUDA block and consists of multiple wavefronts that can be distributed across the SIMDs in an SM/WGP. So it's a bit confusing as to what you're referring to when you describe groups of threads. Do you mean wavefronts (bound to a SIMD) or workgroups (bound to a WGP)?
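In CUDA terms, using warp ~ wavefront and block ~ workgroup as the rough equivalents (an analogy, not an exact match to RDNA):

```cuda
// Terminology pin-down in CUDA terms (warp ~ wavefront, block ~ workgroup --
// an analogy, not an exact match to RDNA).
__global__ void who_am_i(int *out)
{
    int block_id = blockIdx.x;        // workgroup/block: scheduled onto a WGP/SM,
                                      // its warps may land on different SIMDs
    int warp_id  = threadIdx.x / 32;  // wavefront/warp: issues from a single SIMD
    int lane_id  = threadIdx.x % 32;  // lane within that wavefront

    out[blockIdx.x * blockDim.x + threadIdx.x] = block_id * 10000 + warp_id * 100 + lane_id;
}
// e.g. who_am_i<<<4, 256>>>(d_out);  // 4 workgroups, each made of 8 wavefronts
```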
You can see the problem AMD has introduced with RDNA: two different wavefront sizes. Are they both hardware threads? Or is wave64 an emulated hardware thread, simulated by running two hardware threads?
Running a wave64 on a single SIMD appears to be enough to define it as a hardware thread. I believe that's where you're coming from. But a SIMD can support multiple hardware threads from the same workgroup, too. So that definition looks shaky.
Also, you can say (and I believe this is really your point) that running all work items from a workgroup on a SIMD means it's a hardware thread. So wave64 is a hardware thread. Whether it is or isn't a hardware thread, it's still a workgroup.
I'm sure that leaves you unhappy with what I've said. Ah well. Maybe RDNA 3 won't have wave64. Three distinct execution models for pixel shaders is beginning to take the piss. Hoping that the compiler will choose the right one is how you drive customers away. (RDNA took a year+ for drivers to settle at adequate performance?...)
What's interesting about LDS usage by a pixel shader for attribute interpolation is that the bandwidth of LDS is multiplied by sharing data across 64 work items (e.g. a single triangle has 64 or more pixels). Whether wave64 is a hardware thread or not, this is an example of shared data and reduced wall clock latency for pixel shading.
Of course that falls apart when the triangles are small: you really want wave32 (32 work item workgroups) then.
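A minimal sketch of that sharing effect, using CUDA shared memory in place of LDS (my substitution; this is not how RDNA's interpolation path actually stages data, just the reuse pattern being described):

```cuda
// Sketch of the sharing effect: per-triangle attributes are staged once in
// shared memory (standing in for LDS) and then read by every pixel of the
// block, so one attribute fetch feeds many interpolations. The layout and the
// "one triangle per block" simplification are assumptions for illustration.
struct TriAttr { float v0, v1, v2; };   // one scalar attribute at the 3 vertices

__global__ void interpolate_sketch(const TriAttr *tris, const int *tri_of_pixel,
                                    const float2 *bary, float *out, int num_pixels)
{
    __shared__ TriAttr attr;            // one triangle's attributes per block

    int pix = blockIdx.x * blockDim.x + threadIdx.x;
    bool active = (pix < num_pixels);

    if (threadIdx.x == 0 && active)     // fetch the attributes exactly once
        attr = tris[tri_of_pixel[pix]];
    __syncthreads();                    // every thread reaches the barrier

    if (!active) return;

    // Barycentric interpolation: up to 64 pixels reuse the same staged data.
    float i = bary[pix].x, j = bary[pix].y;
    out[pix] = attr.v0 * (1.0f - i - j) + attr.v1 * i + attr.v2 * j;
}
```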
All this came up because I'm trying to work out how RDNA 3 with no CUs inside a WGP configures TMUs, RAs and LDS. Bearing in mind that a "WGP" seems as if it would actually be a "compute unit", a compute unit generally has a single TMU and a single LDS. So my question was whether a CU with 8 SIMDs can have adequate performance with a single TMU and a single LDS.
Your stance is correct if the ratio between RA and WGP stays the same. I think Jawed was supposing that the ratio between RA and ALU stays the same, which could also be the case. Or it may be that the ratio between RA and WGP is indeed the same but the RA capabilities are increased... There are too few details at the moment to give a definitive answer. I would find it very strange, however, if AMD increased the base shading power almost threefold (which is not the limit of RDNA2) while giving the ray tracing hardware (which is the weakest point of RDNA2) only a moderate increase.
Need to be careful to distinguish the two things that RAs do:
- box intersections (4 tests per ray query; see the slab-test sketch after this list) - intrinsic performance here seems completely fine
- triangle intersections (1 test per ray query - I am not sure if this is a confirmed fact)
and separate them from the idea of "hardware acceleration accepts a ray and produces the triangle that was hit" (black box).
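For reference, what a single "box test" boils down to is a ray/AABB slab test, and a node test is then four of them per ray. The code below is the textbook form; AMD has not disclosed the exact arithmetic, so treat it as an illustration of the operation's cost rather than a description of the hardware:

```cuda
// The textbook ray/AABB slab test; an RA node test would be four of these per
// ray query. Not a description of AMD's actual hardware arithmetic.
#include <math.h>

struct AABB { float3 lo, hi; };

__host__ __device__ bool ray_aabb(float3 o, float3 inv_d, float t_max, const AABB &b)
{
    // Slab test per axis, keeping the tightest [t_near, t_far] interval.
    float tx0 = (b.lo.x - o.x) * inv_d.x, tx1 = (b.hi.x - o.x) * inv_d.x;
    float ty0 = (b.lo.y - o.y) * inv_d.y, ty1 = (b.hi.y - o.y) * inv_d.y;
    float tz0 = (b.lo.z - o.z) * inv_d.z, tz1 = (b.hi.z - o.z) * inv_d.z;

    float t_near = fmaxf(fmaxf(fminf(tx0, tx1), fminf(ty0, ty1)), fminf(tz0, tz1));
    float t_far  = fminf(fminf(fmaxf(tx0, tx1), fmaxf(ty0, ty1)), fmaxf(tz0, tz1));

    return t_near <= t_far && t_far >= 0.0f && t_near <= t_max;
}

// A 4-wide BVH node test is then just four slab tests for one ray:
__host__ __device__ int test_node(float3 o, float3 inv_d, float t_max, const AABB box[4])
{
    int hit_mask = 0;
    for (int i = 0; i < 4; ++i)
        if (ray_aabb(o, inv_d, t_max, box[i])) hit_mask |= 1 << i;
    return hit_mask;
}
```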
Because BVH traversal is a shader program, increased work items and/or rays in flight may actually be the key to improved performance. If AMD can do work item reorganisation for improved coherence during traversal, then that's another win. Indeed work item reorganisation may be the only way that AMD can meaningfully improve performance.
The alternative could be narrower SIMDs.
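For what "work item reorganisation" could look like in software, a common trick is to sort ray indices by a coherence key (direction octant here, which is my choice of key) before or between traversal steps. Whether AMD would do anything like this in hardware is pure speculation; this sketch only shows the software flavour:

```cuda
// Software-style ray reordering: sort ray indices by a coherence key so that
// neighbouring lanes walk similar parts of the BVH. The octant key is just
// one common choice; nothing here reflects how RDNA 3 might do it in hardware.
#include <thrust/device_vector.h>
#include <thrust/sequence.h>
#include <thrust/sort.h>

struct Ray { float3 o, d; };

__global__ void octant_keys(const Ray *rays, unsigned int *keys, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    // 3-bit key from the sign of each direction component.
    keys[i] = (rays[i].d.x < 0.f) | ((rays[i].d.y < 0.f) << 1) | ((rays[i].d.z < 0.f) << 2);
}

void reorder_rays(const thrust::device_vector<Ray> &rays,
                  thrust::device_vector<int> &order)   // traversal then follows 'order'
{
    int n = (int)rays.size();
    thrust::device_vector<unsigned int> keys(n);
    order.resize(n);
    thrust::sequence(order.begin(), order.end());      // 0..n-1

    octant_keys<<<(n + 255) / 256, 256>>>(thrust::raw_pointer_cast(rays.data()),
                                          thrust::raw_pointer_cast(keys.data()), n);

    // Rays in the same octant end up adjacent, so a wavefront gets similar rays.
    thrust::sort_by_key(keys.begin(), keys.end(), order.begin());
}
```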
"RA as being like texturing" is a useful metaphor. Both operations depend upon global memory accesses and so their latency needs to be hidden.
I have not found a way to determine the precise extent of RDNA 2's bottlenecks. I've been working on it for a couple of weeks, but haven't got very far. Of course bottlenecks are super-slippery in real time graphics, so there's a limit to what can be gleaned.