AMD: RDNA 3 Speculation, Rumours and Discussion

Status
Not open for further replies.
The whitepaper says: "Each compute unit has access to double the LDS capacity and bandwidth".
The whitepaper is quite explicit about CU mode:

"The RDNA architecture has two modes of operation for the LDS, compute-unit and work-group processor mode, which are controlled by the compiler. The former is designed to match the behavior of the GCN architecture and statically divides the LDS capacity into equal portions between the two pairs of SIMDs. By matching the capacity of the GCN architecture, this mode ensures that existing shaders will run efficiently. However, the work-group processor mode allows using larger allocations of the LDS to boost performance for a single work-group."

It does not seem like CU mode should matter for peak throughput.
Agreed, the peak is the same. The peak count of DWORDs read or written per LDS bank is constant whether in CU mode or WGP mode.

After all, in WGP mode you are meant to be able to address all of the LDS, which means both arrays collectively have to be capable of delivering 2x 128B/cycle to either of the SIMD pairs anyway.
An atomic instruction like BUFFER_ATOMIC_SWAP means the peak is capped. In WGP mode all 64 banks of LDS are addressable but an atomic that returns data, by definition, can only touch a single bank per cycle. This is the scenario I was referring to originally when I was talking about a CU-less design.
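That claim can be turned into a toy cycle-count model. This is purely an illustration of the argument in the post, not AMD's documented behaviour: assume a normal LDS read retires at one DWORD per bank per cycle, while a returning atomic is limited to one bank per cycle in total.

```python
from collections import Counter

def ds_read_cycles(addrs, num_banks=32):
    # Normal LDS read: each bank can return one DWORD per cycle,
    # so the cycle count is the worst per-bank conflict count.
    banks = Counter((a // 4) % num_banks for a in addrs)
    return max(banks.values())

def returning_atomic_cycles(addrs):
    # The claim above: an atomic that returns data can only touch
    # one bank per cycle, so every lane serializes regardless of
    # how the addresses spread across the banks.
    return len(addrs)

conflict_free = [4 * i for i in range(32)]     # one lane per bank
print(ds_read_cycles(conflict_free))           # 1 cycle
print(returning_atomic_cycles(conflict_free))  # 32 cycles
```

Under this model the peak is indeed capped for returning atomics even with a perfectly conflict-free address pattern, which is the corner case being discussed.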

If anything could hamper bandwidth, it is perhaps the documented fact that a SIMD pair shares the request and return buses, in which case effective bandwidth can indeed be halved in non-ideal scenarios where all active work-items in a workgroup are biased towards one of the two SIMD pairs. But even then, it does not fit the "localized LDS = CU mode better" theory.
I was referring specifically to bandwidth and there are some LDS operations where it appears that bandwidth is higher in CU mode because a CU can only touch one array in LDS.

I can't think of a mechanism in WGP mode that can guarantee that two BUFFER_ATOMIC_SWAP instructions always run concurrently. If two SIMDs (e.g. 0 and 2) are both doing BUFFER_ATOMIC_SWAP but they are in distinct workgroups, then it should be possible for the LDS to run BUFFER_ATOMIC_SWAP for both concurrently. Or BUFFER_ATOMIC_SWAP for one workgroup and some other, "less serial", LDS instruction.

Yes, I'm looking for corner cases. I'm trying to identify the problems with a CU-less WGP.

I'm not following your thought process. Of course wavefronts within a workgroup can be allocated to different SIMDs within an SM/CU - this has been the case forever. What is it that we need to infer exactly?
And that's all I was inferring.

I was contrasting a workgroup of 64 pixels (work items), which is allocated to a single SIMD, with a non-pixel-shader workgroup of 64 work items, which will be allocated across both SIMDs. I was pointing out that the pixel shader allocation to SIMDs is unusual. And that LDS appears to these two SIMDs in a CU the same as it does in GCN (where it is shared by four SIMDs). A half of LDS (one array out of the two) is dedicated to a CU in CU mode.

TMUs, RAs and L0$ bandwidth are dedicated to a CU all of the time, whether in CU or WGP mode.

Though we also know that RDNA (2) has increased throughputs and sizes versus GCN at the CU level (e.g. L0$ cache lines are twice the size of L1$ in GCN).

I'm referring to your use of the term workgroup in the post that I quoted. You seem to be treating them as groups of threads that must be launched and retired together and that reside on the same SIMD - this is a wavefront not a workgroup. AMD explicitly says that 64-wide wavefronts run on the same 32-wide SIMD and just take multiple cycles to retire each instruction. There is no suggestion that multiple SIMDs will co-operate in the execution of such a wavefront.
What you've written is true for a pixel shader for 64 pixels. It's described as a wave64 and corresponds with a workgroup of size 64. The unusual aspect here is that it's scheduled on a single SIMD with 2 work items sharing a SIMD lane, whereas in general workgroups are spread across SIMDs width first.

But note that a workgroup of 128 work items bound to an RDNA CU necessarily results in 2 work items sharing a SIMD lane, with the multiple increasing for larger workgroups up to the maximum size of 1024. So the pixel shader configured as wave64 appears to be a special case of workgroup. As a special case it's designed explicitly to gain from VGPR, LDS and TMU locality for pixels, whose quad work-item layout and use of quad-based derivatives is a special case of locality.
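The lane-sharing arithmetic above can be sketched quickly (assuming, as in the post, a 2-SIMD RDNA CU with 32-wide SIMDs and a workgroup spread width-first across them):

```python
import math

def items_per_lane(wg_size, simds=2, simd_width=32):
    # Work items that end up sharing one SIMD lane when a workgroup
    # is spread width-first across a CU's SIMDs (2 SIMD32s per RDNA CU).
    return math.ceil(wg_size / (simds * simd_width))

print(items_per_lane(64))    # 1: one pass covers the whole CU
print(items_per_lane(128))   # 2: two work items share each lane
print(items_per_lane(1024))  # 16: the workgroup-size ceiling
```

A wave64 pixel shader on a single SIMD gives 64 / 32 = 2 items per lane as well, which is why it reads as the special case described.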

This then leads me to believe that "wave64" for pixel shading is a hack that ties together two hardware threads on a SIMD. Not only is there a clue in the gotcha that I described earlier, but one mode of operation (alternating halves per instruction) is just like two hardware threads that issue on alternating cycles - which is a feature of RDNA.

In the end pixel shaders are merely a subset of compute shaders, where some of the work item state (values in registers and constant buffers) is driven by rasterisation.

A workgroup per the OpenCL spec is the equivalent of a CUDA block and consists of multiple wavefronts that can be distributed across the SIMDs in an SM/WGP. So it's a bit confusing as to what you're referring to when you describe groups of threads. Do you mean wavefronts (bound to a SIMD) or workgroups (bound to a WGP)?
You can see the problem AMD has introduced with RDNA: two different wavefront sizes. Are they both hardware threads? Or is wave64 an emulated hardware thread, simulated by running two hardware threads?

Running a wave64 on a single SIMD appears to be enough to define it as a hardware thread. I believe that's where you're coming from. But a SIMD can support multiple hardware threads from the same workgroup, too. So that definition looks shaky.

Also, you can say (and I believe this is really your point) that running all work items from a workgroup on a SIMD means it's a hardware thread. So wave64 is a hardware thread. Whether it is or isn't a hardware thread, it's still a workgroup.

I'm sure that leaves you unhappy with what I've said. Ah well. Maybe RDNA 3 won't have wave64. Three distinct execution models for pixel shaders is beginning to take the piss. Hoping that the compiler will choose the right one is how you drive customers away. (RDNA took a year+ for drivers to settle at adequate performance?...)

What's interesting about LDS usage by a pixel shader for attribute interpolation is that the bandwidth of LDS is multiplied by sharing data across 64 work items (e.g. a single triangle has 64 or more pixels). Whether wave64 is a hardware thread or not, this is an example of shared data and reduced wall clock latency for pixel shading.

Of course that falls apart when the triangles are small: you really want wave32 (32 work item workgroups) then.

All this came up because I'm trying to work out how RDNA 3 with no CUs inside a WGP configures TMUs, RAs and LDS. Bearing in mind that a "WGP" seems as if it would actually be a "compute unit", a compute unit generally has a single TMU and a single LDS. So my question was whether a CU with 8 SIMDs can have adequate performance with a single TMU and a single LDS.

Your stance is correct if the ratio between RAs and WGPs stays the same. I think Jawed was supposing that the ratio between RAs and ALUs stays the same, which could also be. Or it may be that the ratio between RAs and WGPs is indeed the same, but the RA capabilities are increased... There are too few details atm for a definitive answer. I would find it very strange, however, if AMD increased the base shading power almost threefold (which is not the limit of RDNA2) while giving the Ray Tracing hardware (which is the weakest point of RDNA2) only a moderate increase.
Need to be careful to distinguish the two things that RAs do:
  1. box intersections (4 tests per ray query) - intrinsic performance here seems completely fine
  2. triangle intersections (1 test per ray query - I am not sure if this is a confirmed fact)
and separate them from the idea of "hardware acceleration accepts a ray and produces the triangle that was hit" (black box).

Because BVH traversal is a shader program, increased work items and/or rays in flight may actually be the key to improved performance. If AMD can do work item reorganisation for improved coherence during traversal, then that's another win. Indeed work item reorganisation may be the only way that AMD can meaningfully improve performance.

The alternative could be narrower SIMDs.

"RA as being like texturing" is a useful metaphor. Both operations depend upon global memory accesses and so their latency needs to be hidden.

I have not found a way to determine the precise extent of RDNA 2's bottlenecks. I've been working on it for a couple of weeks, but haven't got very far. Of course bottlenecks are super-slippery in real time graphics, so there's a limit to what can be gleaned.
 
Well that makes more sense although N32 looks like an over-engineered part existing solely to make a point in this case. It will be interesting to see how it would compare to a single die solution of similar complexity.

Over-engineered maybe but could still be a win from a yield and performance perspective. The hard part will be moving data between the dies without blowing the power budget.

Don't think a 160SM single die is really feasible.

Why would a 160SM die be necessary? An RDNA3 WGP is 2x as wide as an Ampere SM. So if just talking flops you would need 80 Ampere SMs to match 40 RDNA3 WGPs. That is well within the reach of a monolithic 5nm die.
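The FLOPS matching in that post checks out arithmetically (taking the public Ampere figure of 128 FP32 ops per SM per clock, and the rumoured 8x SIMD32 RDNA 3 WGP, i.e. 256 FP32 lanes; the WGP width is thread speculation, not a confirmed spec):

```python
def ampere_fp32_per_clk(sms):
    # Public Ampere spec: 128 FP32 ops per SM per clock.
    return sms * 128

def rdna3_fp32_per_clk(wgps, simds_per_wgp=8, simd_width=32):
    # Rumoured RDNA 3 WGP: 8 SIMD32s = 256 FP32 lanes per WGP.
    return wgps * simds_per_wgp * simd_width

# 80 Ampere SMs match 40 rumoured RDNA 3 WGPs on raw FP32 throughput.
assert ampere_fp32_per_clk(80) == rdna3_fp32_per_clk(40) == 10240
```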
 

That's only talking FLOPS, but in games it's a little different: if the capabilities of those SMs are similar to Ampere's, then in rasterization there is not much difference between an Ampere SM and an RDNA2 CU (with a slight advantage for Ampere so far), looking at what a 3080 and a 6800XT can do with 68 SMs vs 72 CUs (in Ray Tracing it's different, but next gen is unknown, so that's a big question mark). So if we go by ALU count and what those ALUs do for actual performance, you'll need 144 Ampere SMs to match/slightly beat the equivalent of 160 CUs / 10240 FP32 units (that being probably N32, while N31 seems to be going for 60 WGPs / 15360 shader units). Incidentally, the rumours about Lovelace seem to point at exactly that count on 5nm.

But then there is the issue of feeding all those ALUs - that is, bandwidth. AMD is throwing tons of cache at the problem. It remains to be seen whether Nvidia will do the same for Lovelace. They can go again for a chip near the reticle limit, but if AMD delivers what N31 seems to be, I find it hard to see Lovelace competing on sheer numbers. Of course, there will be arch improvements, but that goes both ways.
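For reference, the rumoured unit counts in that post are internally consistent (an RDNA 2 CU has 64 FP32 lanes; the 256-lane RDNA 3 WGP is thread speculation, not a confirmed figure):

```python
# Sanity-checking the rumoured counts from the thread.
cu_lanes = 64    # RDNA 2 CU: 2x SIMD32 with dual-issue FMA = 64 FP32/clk
wgp_lanes = 256  # rumoured RDNA 3 WGP: 8x SIMD32 (speculation)

assert 160 * cu_lanes == 10240   # "160 CU / 10240 FP32" equivalent (N32?)
assert 60 * wgp_lanes == 15360   # "60 WGP / 15360 shader units" (N31?)
```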
 
An 8 SIMD WGP will have a different performance profile compared to a 4 SIMD WGP (even disregarding all the other changes for now) so I'm not so sure that you can just extrapolate N31 performance from RDNA2 results, either by unit numbers or FLOPs.
 

Just as you cannot extrapolate that performance will be lower rather than the same or even higher, especially since historically a regression in performance between iterative steps of the same architecture family has basically never happened. So let's see.
 
You can't, of course, but if we go by that "historically" then let me remind you about Vega scaling. Or Ampere compared to Turing, for that matter.
 
Ehhhhh gfx11 is a pretty clean break.

It may well be, but I simply don't think AMD would decrease the "per ALU" performance. The task with RDNA and its successive steps was clearly to increase ALU utilization, which was quite lacking in GCN. I may be wrong, but I don't think AMD will lose this focus, especially looking at their recent history.
 
Over-engineered maybe but could still be a win from a yield and performance perspective. The hard part will be moving data between the dies without blowing the power budget.

Calculated the approximate power cost for Infinity Fabric v1 a while back, and even with that the power budgets for two dies were tolerable for today's high-end GPUs, let alone whatever improved version they have now. Not trivial, or even cheap, mind, but within budget for a <400 watt card.
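The back-of-envelope version of that calculation is simple. The ~2 pJ/bit figure used here is a commonly cited number for first-generation on-package Infinity Fabric links; it is an assumption for illustration, not a number from this thread:

```python
def link_power_watts(gb_per_s, pj_per_bit):
    # Off-die data movement cost: bandwidth times energy per bit.
    return gb_per_s * 1e9 * 8 * pj_per_bit * 1e-12

# ~2 pJ/bit assumed for first-gen on-package Infinity Fabric.
print(link_power_watts(512, 2.0))   # ~8.2 W per 512 GB/s of traffic
print(link_power_watts(2048, 2.0))  # ~32.8 W at 2 TB/s
```

Even at multi-TB/s cross-die bandwidth, the link power lands in the tens of watts, which is why it fits inside a <400 W card budget while still being far from free.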

The real power question for these upcoming GPUs versus "the rumors" is how exactly either AMD or Nvidia plans to get the power usage of compute itself down. Double the performance in two years or less? Last I looked, there was no secret silicon miracle in the back pocket of either that they were just waiting to pull out. So when was the last time any major GPU architecture doubled its power efficiency in that short a time? It almost feels rhetorical.

For AMD I can see it, with the big qualifier that it only applies to the hardware RT titles it currently performs the worst in; sure, maybe the performance there could double, especially with the higher headroom they have for cranking up power. But Nvidia's cards are already at 300+ watts; doubling performance, really?
 
It's 2.7x or so in raster gen over gen.
What is even going to make use of this level of performance?
Isn't that like 7 to 8x more performance than the current generation consoles?
 
But Nvidia's cards are already at 300+ watts; doubling performance, really?
NVidia with a monster cache could go 256-bit too? NVidia in its COPA paper:

arXiv:2104.02188

has been contemplating extremely large L3 cache (see section E). Silly to assume that NVidia is not going to do crazy things. Returning big time to TSMC might be a reflection that TSMC's tech is crucial to NVidia's plans, not just for data centre but everything from 2022. No, I'm not suggesting NVidia will put 1GB of L3 in Lovelace, but 256MB?

What is even going to make use of this level of performance?
Isn't that like 7 to 8x more performance than the current generation consoles?
Well AMD needs to sort out ray tracing performance. Less than 2x gain over Navi 21 will be seen as a major fail and consoles are not a useful benchmark for ray tracing performance. Also, leet console dev ray tracing voodoo is going to make Navi 31 look worse anyway, so the pressure is on.

I’ll take super resolution 8k downscaled to 4k please.
Yep, super-sampled 4K is nice. Especially when sat a metre from a 48" where them pixels are not what you'd call "retina".

It would be amusing if LG brought 8K OLED to small TVs such as 48" in a couple of years.
 
What is even going to make use of this level of performance?
First thing I come up with is full-scene fluid sim (fog, smoke) and volumetric lighting (glowing projectiles with awesome glow). That's the stuff where we need brute-force power, because there are not many options to fake it well or to reduce the work with clever algorithms.
But that's still some complexity to implement. And on/off makes a big difference, and it can't scale down well. So it's not that practical or attractive for the current cross-platform games landscape.
Well, maybe if next gen is the only target it would scale down well enough.

Other than that, high resolution and fps as usual. :/
If they want better RT perf, HW traversal is the obvious improvement to make.

Maybe arcade machines would make sense again? Competing cinemas? If there were no covid?

Well, I really hope chiplets will also help with smaller, more practical GPUs after this nice proof of concept. I'm impressed, but also disappointed about another high-end craze.
 