AMD: RDNA 3 Speculation, Rumours and Discussion

There is a valid reason for not going the "workgroup count" route, and that is marketing. Look at Nvidia and their "CUDA cores" marketing, which counts each FP32-capable ALU as a separate "core" even though the logic is shared at SM level. It's bad to market "SMs" when the 3080 and the 2080 Ti have the same count, but you can market "8704" vs "4352" CUDA cores instead. Yes, peak FP32 capability is double per SM (though it's also true you cannot count on that peak rate in every workload); marketing loves numbers, and the bigger the number, the better. Now, if AMD went for a "workgroup" count they would be advertising roughly 25% fewer units per die, even though the FP resources would be 50% more per die. Would you market the card as "30 workgroups per die" vs "40 workgroups per die", or as "7680 ALUs per die" vs "5120 ALUs per die"? For people who understand the tech terms it's the same thing; for most customers, 30 is less than 40.
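For what it's worth, here is the arithmetic behind those numbers under this thread's speculated configurations (a quick sketch, nothing official):

```python
# Marketing numbers under the configs speculated in this thread -- not confirmed specs.

def fp32_lanes(wgps, simds_per_wgp, lanes_per_simd=32):
    """Total FP32 ALU lanes for a given WGP count and WGP width."""
    return wgps * simds_per_wgp * lanes_per_simd

navi21_style = fp32_lanes(wgps=40, simds_per_wgp=4)   # 40 WGPs x 4 SIMD32 = 5120 lanes
speculated   = fp32_lanes(wgps=30, simds_per_wgp=8)   # 30 WGPs x 8 SIMD32 = 7680 lanes

print(navi21_style, speculated)                        # 5120 7680
print(f"{speculated / navi21_style:.2f}x the lanes from 25% fewer WGPs")
```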

Intel isn't playing that game and is counting EUs, not SIMD lanes. It's debatable though whether the number of EU/CU/SM is more accurate or more helpful than the number of SIMD lanes when it comes to graphics and compute performance.

This is especially true when the width of each of these units is vastly different. An EU is 8-wide and a CU is 64-wide. It doesn’t make sense to compare them directly. Also an EU isn’t functionally equivalent to an SM or CU. It’s closer to just one of the SIMDs (or partitions in Nvidia’s case) within those units.
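To make the unit-versus-lane point concrete, here's a quick lane-count normalisation using some illustrative configurations (the specific parts are just examples, not anything claimed in this thread):

```python
# FP32 lane counts behind different vendors' "unit" marketing.
# Part configs are illustrative examples only.

units = {
    "Xe-LP, 96 EUs (8-wide)":      (96, 8),     # e.g. a DG1-class part
    "RDNA 2, 40 CUs (64-wide)":    (40, 64),    # e.g. a Navi 22-class part
    "Ampere, 48 SMs (128 FP32)":   (48, 128),   # peak FP32; shared-datapath caveat applies
}

for name, (count, width) in units.items():
    print(f"{name}: {count * width} FP32 lanes")
# 768, 2560 and 6144 lanes -- the unit counts alone (96, 40, 48) tell you very little
```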
 

Well, there is no simple response to that, as the structures are different among vendors, and so are the capabilities of the units, whether "CUs" or "CUDA cores". For certain things it's a no-brainer to count per WGP or SM or EU, because there are shared structures (flow control, registers, cache, and so on), so logically that is the "block" you use when building your GPU. But if you are interested in peak FP, then it's the FP ALU count to look at, even if bottlenecks here and there can reduce the actual FP ALU utilization by a lot (even in the software rather than in the hardware). As said, though, many people look only at who has the "bigger number". It was so during the megahertz race (even after the Pentium 4 came out), then with the amount of VRAM (the opinion being: I have a 6 GByte card, so it's more powerful than your 4 GByte one, regardless of what the actual GPU die and memory bus are), and now we have the number of ALUs or the TeraFLOPs.
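And for the "bigger number" itself, peak FP32 is just ALU lanes x 2 flops (FMA) x clock; a quick sketch using Navi 21's public figures:

```python
# Peak FP32 throughput from the ALU count alone; sustained rates will be lower.

def peak_tflops(fp32_lanes, clock_ghz):
    # one FMA = 2 flops per lane per clock
    return fp32_lanes * 2 * clock_ghz / 1000.0

# Navi 21 (RX 6900 XT class): 5120 lanes at ~2.25 GHz boost
print(f"{peak_tflops(5120, 2.25):.1f} TFLOPS")   # ~23.0 TFLOPS
```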
 
Navi 21: 4 shader engines, 10 WGPs each, 4 32-wide SIMDs per WGP.
Navi 31: 6 shader engines, 10 WGPs each, 4 32-wide SIMDs per WGP.
 
So, any guess about the GCD size? At 5nm, I think we could see something around the 350 mm^2 die size. The interposer+cache is trickier, I think. I don't think AMD would use a 5nm process for that, so on 7nm we would see something near or exceeding the 400mm^2, counting only the SRAM. (512 Mbytes)...
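As a rough sanity check on that SRAM figure, assuming the commonly quoted ~0.027 um^2 N7 high-density bitcell and a guessed overhead factor for tags, sense amps and routing (both are assumptions on my part):

```python
# Back-of-the-envelope area for 512 MBytes of SRAM on N7.
# Bitcell size and overhead factors below are assumptions, not measured values.

bits = 512 * 8 * 2**20                 # 512 MBytes in bits
bitcell_um2 = 0.027                    # commonly quoted TSMC N7 HD bitcell
raw_mm2 = bits * bitcell_um2 / 1e6     # bitcells only: ~116 mm^2

for overhead in (2.0, 3.5):            # tags, sense amps, decoders, routing
    print(f"overhead {overhead}x: ~{raw_mm2 * overhead:.0f} mm^2")
# ~232 to ~406 mm^2 -- the 400 mm^2 guess needs a hefty overhead factor to hold
```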
 
Navi 21: 4 shader engines, 10 WGPs each, 4 32-wide SIMDs per WGP.
Navi 31: 6 shader engines, 10 WGPs each, 4 32-wide SIMDs per WGP.
8.
8 SIMDs per WGP.
At 5nm, I think we could see something around the 350 mm^2 die size.
Less.
The interposer+cache is trickier
The what.
The thing is SoIC, no passive slabs or anything.
so on 7nm we would see something near or exceeding the 400mm^2, counting only the SRAM. (512 Mbytes)...
Think different.
Think the smallest possible reuse unit.
 
So 3 shader engines, 10 WGPs each, 8x32 wide SIMDs each WGP.
Per GCD.

RDNA3 looks like the largest departure from GCN yet, at least from a high level perspective.
 
I very much welcome our many-GPU ULTRA HALO overlords back.
Hopefully PC does another 4 slot Red Devil 13 cuz why not.

So 3 shader engines, 10 WGPs each, 8x32 wide SIMDs each WGP.
Per GCD.
Yep.
RDNA3 looks like the largest departure from GCN yet, at least from a high level perspective.
True dat.
Ideologically the fat SM approach feels closer to previous NV gens or IMG A/B-series.

There's a ton of uArch changes for gfx11 (both variants) but we gotta talk about them at a later date, if ever.
 

Oh, well, I was conservative; my raw calculation was a little less than 330 mm^2 or so, but then it depends on the actual scaling and not on the gross estimates given by TSMC.

The what.
The thing is SoIC, no passive slabs or anything.

Oh well, that was my mistake on the cache part: I thought it was stacked, but on second thought, yes, it makes no sense to use passive parts when your inter-die communication is done through the cache die.

Think different.
Think the smallest possible reuse unit.

In effect TSMC was developing 7nm stacked on 7nm and 5nm stacked on 5nm, but there was no hint about hybrid stacking. It would be interesting to see what the real product will look like.
 
Oh, well, I was conservative; my raw calculation was a little less than 330 mm^2 or so, but then it depends on the actual scaling and not on the gross estimates given by TSMC.
Depends on how AMD implements stuff too!
CDNA1 to CDNA2 PPA will be funny given iso node.
In effect TSMC was developing 7nm stacked on 7nm and 5nm stacked on 5nm
No, it's just generic hybrid bonding.
All the nodes define is minimum pitch.
but there was no hint about hybrid stacking
C'mon, the usual Taiwan IP vendor (what's its name?) has already announced a 3D d2d solution for 7 on 5.
 
So 3 shader engines, 10 WGPs each, 8x32 wide SIMDs each WGP.
Per GCD.

RDNA3 looks like the largest departure from GCN yet, at least from a high level perspective.

It does make sense in terms of scaling fixed function hardware. Double FP throughput for the same number of TMUs. Wonder what will happen with ray accelerators. It’s not clear that they’re a bottleneck on RDNA2.

The WGP memory subsystem should be interesting. That’s 8 waves potentially hitting L1 and LDS each clock.
 
It does make sense in terms of scaling fixed function hardware. Double FP throughput for the same number of TMUs. Wonder what will happen with ray accelerators. It’s not clear that they’re a bottleneck on RDNA2.

It is quite possible the ray accelerators and TMUs are "beefier" now, with even wider data paths, possibly enabling concurrent utilization of TMU and Ray accelerator(s).
 
We're already familiar with this, in a sense. When a 64-item workgroup runs on an RDNA (2) SIMD, the compiler generates "hi" and "lo" 32-work-item halves, which are either tackled by alternating each instruction for each half, or scheduling one half to run to completion followed by the other.

Currently we are left to infer that a workgroup of more than 32 work-items will run concurrently (though not necessarily in lock-step) across both SIMDs in an RDNA (2) CU, unless it's a pixel shader (which is usually 64 work items assigned to a single SIMD).

These 2 statements appear to contradict each other. AMD explicitly says that 64-item wavefronts are run on a single SIMD over multiple cycles. Why do we need to infer anything about running across SIMDs?

Also I think you’re using workgroup where you should be using wavefront.
 
It does make sense in terms of scaling fixed function hardware. Double FP throughput for the same number of TMUs. Wonder what will happen with ray accelerators. It’s not clear that they’re a bottleneck on RDNA2.

The WGP memory subsystem should be interesting. That’s 8 waves potentially hitting L1 and LDS each clock.
L0$ per WGP, L1$ per shader array. Well, that's how RDNA (2) is configured.

8 SIMDs sharing a single TMU is what I've been suggesting, because texturing rates with more TMUs would be disproportionately high. But Carsten's point about bandwidth into each SIMD from L0$ still stands. A multi-ported L0$ doesn't seem like a good idea (increased latency and LDS-like banking rules making latency/bandwidth more unpredictable).

Instead, an L0$ per SIMD. But the problem with that is the quantity of L0$ taken by data that's present in nearby L0$s (i.e. within the same WGP) - duplication is going to waste a lot of these 8 L0$s.

So I am really struggling to justify, one way or another, how L0$ (and to a similar extent TMU) works when there's 8 SIMDs. Maybe a 32KB L0$ is enough for 8 SIMDs and maybe a single TMU is the same.
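For a feel of the numbers, here's the demand side only, if every SIMD in a WGP issued a 32-lane dword load in the same clock (this says nothing about how many ports or banks the L0$ actually has):

```python
# Worst-case per-clock load demand on a shared L0$; pure demand-side arithmetic.

def peak_load_bytes_per_clk(simds, lanes=32, bytes_per_lane=4):
    return simds * lanes * bytes_per_lane

print(peak_load_bytes_per_clk(2))   #  256 B/clk -- one RDNA 2 CU (2 SIMDs) on its L0$
print(peak_load_bytes_per_clk(8))   # 1024 B/clk -- 8 SIMDs hammering a single shared L0$
```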

An individual SIMD has a bursty relationship with L0$ and TMU, because of latency hiding. Additionally, clause-based operation of L0$ and TMU commands intensifies this burstiness, since groups of multiple commands will be issued (effectively by the compiler), rather than them being spaced out. These groups minimise the count of context switches seen by a single hardware thread.

So perhaps 8 SIMDs all doing their bursty thing can be seen as safely keeping out of each others' way in the general case.

Ray accelerators, on the other hand, seem to be an even more troublesome question. Since AMD is SIMD-traversing a BVH, one can argue that the intersection test rate is fine with one RA per WGP (8 SIMDs). We could expect the compiler to make bursty BVH queries (several queries at a time per work item, e.g. multiple rays and/or multiple child-nodes per work-item).

The ALU:RA ratio is pretty high in RDNA 2 already, so making it much higher may well be harmless given that the entire BVH, in the worst case, can't fit into infinity cache (i.e. lots of latency). More WGPs in Navi 31 still means a theoretical increase in intersection throughput (e.g. 3x Navi 21).
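Putting a number on that ratio, with the known RDNA 2 figures on one side and this thread's speculated one-RA-per-8-SIMD WGP on the other:

```python
# FP32 lanes per ray accelerator: RDNA 2 WGP vs the speculated RDNA 3 WGP.
# The "1 RA per WGP" figure is speculation from this thread, not a known spec.

rdna2_lanes_per_wgp = 4 * 32     # 4 SIMD32 per WGP
rdna2_ras_per_wgp   = 2          # one ray accelerator per CU, two CUs per WGP

spec_lanes_per_wgp  = 8 * 32     # 8 SIMD32 per WGP (speculated)
spec_ras_per_wgp    = 1          # speculated

print(rdna2_lanes_per_wgp // rdna2_ras_per_wgp)   #  64 lanes per RA
print(spec_lanes_per_wgp // spec_ras_per_wgp)     # 256 lanes per RA -- 4x the ratio
```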

These 2 statements appear to contradict each other. AMD explicitly says that 64-item wavefronts are run on a single SIMD over multiple cycles. Why do we need to infer anything about running across SIMDs?
The maximum size of a workgroup is at least 128 work items (across all GPUs and all non-proprietary APIs running on those GPUs). In the good old days a workgroup could be up to 1024 work items. The different APIs impose varying restrictions on the size of a workgroup, which further muddies the waters. And then there's the consoles, which appear to have the loosest restrictions (for a given GPU architecture).

RDNA (2) supports 1024 work items in a workgroup.

We can infer that 2 SIMDs can cooperate in a non-pixel-shader workgroup because it reduces latency (wall clock latency for all the work items in the workgroup). Pixel shading is a special case, where 64 work items sharing a SIMD as hi and lo halves, in general, benefits directly from attribute interpolation sharing (LDS locality) and texel locality.

For non-pixel-shading kernels (and specifically those that have low or no use of LDS) there is less reason not to spread a workgroup across all SIMDs. This becomes "essential" when a workgroup is high in work items and also high in register allocation: the register file in a given SIMD is literally too small. Additionally, if only 2 workgroups can fit into a CU, then you want both SIMDs to be time-sliced by the work of one or the other of the workgroups. By making them both time-slice you maximise concurrent utilisation.
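A quick register-file sanity check on the "literally too small" point, assuming RDNA (2)'s 128 KB of VGPRs per SIMD32 and an illustrative 64-VGPR allocation:

```python
# Why a big workgroup with a fat register allocation can't be resident on one SIMD.
# 128 KB of VGPRs per SIMD32 (1024 regs x 32 lanes x 4 bytes); the allocation is illustrative.

VGPR_FILE_BYTES = 1024 * 32 * 4            # 128 KiB per SIMD

def workgroup_vgpr_bytes(work_items, vgprs_per_item):
    return work_items * vgprs_per_item * 4

need = workgroup_vgpr_bytes(1024, 64)      # a 1024-item workgroup at 64 VGPRs per item
print(need // 1024, "KiB needed vs", VGPR_FILE_BYTES // 1024, "KiB per SIMD")
# 256 KiB needed vs 128 KiB available -> it has to spread across at least two SIMDs
```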

You can argue that wall-clock latency is worse when workgroups fight over a single SIMD: one hardware thread from workgroup 1 wants to run because it's received its data from L0$ and a hardware thread on the same SIMD for workgroup 2 wants to run because it's received its LDS data. I think you'd agree that's in the category of "low probability".

Of course there are scenarios where WGP mode is preferred for compute (when an algorithm requires large: VGPR and/or LDS and/or work item allocations).

This comes back to my question: why does RDNA (2) even have a compute unit concept? Is it merely for L0$, TMU and RA (scheduling/throughput)? Or is it to make corner-case GCN-focussed shaders happy instead of suffering performance that falls off a cliff? Or...?

Also I think you’re using workgroup where you should be using wavefront.
I think you're going to have to be very specific in pointing out a problem with what I've written. I'm not saying I haven't made a mistake, but while you're hiding the problem you've identified, I can't read your mind, and I'm not motivated to find the problem in text that I've already spent well over 6 hours writing (I posted version 3, in case you're wondering).

I deliberately use "workgroup" and "hardware thread" because in discussing multiple architectures, and architectures over time from the same IHV, you get contradictory models of the hardware if you use "wavefront" and "block" and "warp". Remember, G80 had two warp sizes, for example: 16 and 32. Graphics and compute APIs can easily hide hardware threading models, which fucks up discussions of the hardware.

For example I have a theory that RDNA (2) only has one hardware thread size: 32. The idea that pixel shaders are "wave64" (implying that they are a hardware thread of 64 work items) contradicts this, but can easily be viewed as an abstraction to maximise equivalency with the operation of GCN. There's even a gotcha in the operation of "wave64 mode":

"Subvector looping imposes a rule that the “body code” cannot let the working half of the exec mask go to zero. If it might go to zero, it must be saved at the start of the loop and be restored before the end since the S_SUBVECTOR_LOOP_* instructions determine which pass they’re in by looking at which half of EXEC is zero."

from RDNA_Shader_ISA.pdf (amd.com)

It looks like a fossil of GCN, where only VCC and EXEC hardware registers are 64-bit (both required to support wave64). Perhaps RDNA 3 will entirely abandon wave64. One motivation could be that 8 SIMDs per WGP need filling, so LDS locality for attribute interpolation and L0$ locality for texturing (burstiness) is less of a win when there's so many SIMDs to keep occupied.

(Amusingly, ISA also says this about S_SUBVECTOR_LOOP_BEGIN and S_SUBVECTOR_LOOP_END: "This opcode has well-defined semantics in wave32 mode but the author of this document is not aware of any practical wave32 programming scenario where it would make sense to use this opcode.")

Also, "SGPRs are no longer allocated: every wave gets a fixed number of SGPRs" - means that per-SIMD hardware-thread ID directly indexes SGPRs, which would also mesh easily with there being only one hardware thread size of 32.

Wave64 and the necessity of optional sub-vector looping smell like a hack... Sub-vector looping helps with VGPR allocation, as it happens. But AMD also increased the size of the register files in RDNA...
 