It does make sense in terms of scaling fixed-function hardware: double the FP throughput for the same number of TMUs. I wonder what will happen with ray accelerators, though; it's not clear that they're a bottleneck on RDNA 2.
The WGP memory subsystem should be interesting. That’s 8 waves potentially hitting L1 and LDS each clock.
L0$ per WGP, L1$ per shader array. Well, that's how RDNA (2) is configured.
8 SIMDs sharing a single TMU is what I've been suggesting, because texturing rates with more TMUs would be disproportionately high. But Carsten's point about bandwidth into each SIMD from L0$ still stands. A multi-ported L0$ doesn't seem like a good idea (increased latency and LDS-like banking rules making latency/bandwidth more unpredictable).
Instead, an L0$ per SIMD. The problem with that is the amount of L0$ capacity taken up by data that's also present in nearby L0$s (i.e. within the same WGP): duplication is going to waste a lot of those 8 L0$s.
So I am really struggling to justify, one way or another, how L0$ (and to a similar extent the TMU) works when there are 8 SIMDs. Maybe a 32KB L0$ is enough for 8 SIMDs, and maybe a single TMU is enough too.
An individual SIMD has a bursty relationship with L0$ and TMU, because of latency hiding. Additionally, clause-based operation of L0$ and TMU commands intensifies this burstiness, since groups of multiple commands will be issued together (effectively by the compiler), rather than being spaced out. These groups minimise the number of context switches seen by a single hardware thread.
So perhaps 8 SIMDs all doing their bursty thing can be seen as safely keeping out of each other's way in the general case.
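To get a feel for whether 8 bursty clients really can stay out of each other's way at a single-ported L0$, here's a toy Monte Carlo sketch in Python. Every parameter in it (burst length, how often a SIMD starts a burst, the one-request-per-cycle port, random arbitration) is an assumption for illustration, nothing AMD has documented; the only point is to see how often more than one SIMD wants the port in the same cycle.

```python
# Toy model of 8 SIMDs sharing a single-ported L0$. All parameters
# are made up for illustration, not anything AMD has published.
import random

SIMDS = 8
CYCLES = 100_000
BURST_LEN = 4        # assumed number of L0$ requests per clause-style burst
DUTY = 0.02          # assumed chance an idle SIMD starts a burst on a given cycle

def simulate(seed=0):
    random.seed(seed)
    outstanding = [0] * SIMDS        # unserved burst requests per SIMD
    served = waiting = 0
    for _ in range(CYCLES):
        for s in range(SIMDS):
            if outstanding[s] == 0 and random.random() < DUTY:
                outstanding[s] = BURST_LEN
        want = [s for s in range(SIMDS) if outstanding[s] > 0]
        if want:
            winner = random.choice(want)   # single port: one request served per cycle
            outstanding[winner] -= 1
            served += 1
            waiting += len(want) - 1       # everyone else queues this cycle
    return served, waiting

served, waiting = simulate()
print(f"requests served: {served}, SIMD-cycles spent queuing: {waiting}")
```

Dial the duty cycle up and the queuing grows quickly, which is really just restating the question: is one 32KB L0$ behind one port enough for 8 SIMDs?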
Ray accelerators, on the other hand, seem to be an even more troublesome question. Since AMD is SIMD-traversing a BVH, one can argue that the intersection test rate is fine with one RA per WGP (8 SIMDs). We could expect the compiler to make bursty BVH queries (several queries at a time per work item, e.g. multiple rays and/or multiple child nodes per work item).
The ALU:RA ratio is pretty high in RDNA 2 already, so making it much higher may well be harmless, given that the entire BVH, in the worst case, can't fit into Infinity Cache (i.e. lots of latency). More WGPs in Navi 31 still means a theoretical increase in intersection throughput (e.g. 3x Navi 21).
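For scale, the peak intersection-test arithmetic is trivial. Navi 21's figures are public (80 CUs, one ray accelerator per CU, up to 4 box or 1 triangle test per RA per clock); the "3x" case below is just the speculation from this post plugged in, not a known Navi 31 configuration, and the 2.25GHz figure is simply Navi 21's boost clock reused as a placeholder.

```python
# Back-of-envelope peak intersection-test rates. Navi 21 numbers are
# public; the tripled RA count and the reused 2.25 GHz clock are
# placeholders for the speculation above, not a known configuration.
def peak_tests_per_second(num_ra, clock_ghz, tests_per_clock):
    return num_ra * clock_ghz * 1e9 * tests_per_clock

navi21 = peak_tests_per_second(80, 2.25, 4)      # 4 box tests per RA per clock
tripled = peak_tests_per_second(240, 2.25, 4)    # hypothetical 3x RA count
print(f"Navi 21 peak box tests/s:         {navi21:.2e}")
print(f"Hypothetical 3x peak box tests/s: {tripled:.2e}")
```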
These 2 statements appear to contradict each other. AMD explicitly says that 64-item wavefronts are run on a single SIMD over multiple cycles. Why do we need to infer anything about running across SIMDs?
Across all GPUs and all non-proprietary APIs running on them, the maximum size of a workgroup is at least 128 work items. In the good old days a workgroup could be up to 1024 work items at most. The different APIs impose varying restrictions on the size of a workgroup, which further muddies the waters. And then there are the consoles, which appear to have the loosest restrictions (for a given GPU architecture).
RDNA (2) supports 1024 work items in a workgroup.
We can infer that 2 SIMDs can cooperate on a non-pixel-shader workgroup because it reduces latency (wall-clock latency for all the work items in the workgroup). Pixel shading is a special case, where 64 work items sharing a SIMD as hi and lo halves generally benefits directly from attribute interpolation sharing (LDS locality) and texel locality.
For non-pixel-shading kernels (and specifically those that make low or no use of LDS) there is less reason not to spread a workgroup across all SIMDs. This becomes "essential" when a workgroup is high in work items and also high in register allocation: the register file in a given SIMD is literally too small. Additionally, if only 2 workgroups can fit into a CU, then you want both SIMDs to be time-sliced by the work of one or the other of the workgroups. By making them both time-slice you maximise concurrent utilisation.
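As a quick sketch of the "register file is literally too small" case: the 128KB-per-SIMD VGPR file is the RDNA (2) figure, while the workgroup sizes and per-work-item VGPR counts below are arbitrary examples, not measurements of any real shader.

```python
# How many SIMDs' register files does one workgroup's VGPR allocation need?
# 128 KB per SIMD is the RDNA (2) figure (1024 VGPRs x 32 lanes x 4 bytes);
# the example workgroup sizes and VGPR counts are arbitrary.
VGPR_FILE_BYTES_PER_SIMD = 128 * 1024

def simds_needed(workgroup_size, vgprs_per_work_item):
    bytes_needed = workgroup_size * vgprs_per_work_item * 4
    return -(-bytes_needed // VGPR_FILE_BYTES_PER_SIMD), bytes_needed

for wg, vgprs in [(256, 32), (1024, 64), (1024, 128)]:
    simds, nbytes = simds_needed(wg, vgprs)
    print(f"{wg:4} work items x {vgprs:3} VGPRs = {nbytes // 1024:3} KB"
          f" -> at least {simds} SIMD register file(s)")
```

And that's before leaving any room for a second workgroup's waves to hide latency behind the first.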
You can argue that wall-clock latency is worse when workgroups fight over a single SIMD: one hardware thread from workgroup 1 wants to run because it's received its data from L0$ and a hardware thread on the same SIMD for workgroup 2 wants to run because it's received its LDS data. I think you'd agree that's in the category of "low probability".
Of course there are scenarios where WGP mode is preferred for compute (when an algorithm requires large VGPR and/or LDS and/or work-item allocations).
This comes back to my question: why does RDNA (2) even have a compute unit concept? Is it merely for L0$, TMU and RA (scheduling/throughput)? Or is it to make corner-case GCN-focussed shaders happy instead of suffering performance that falls off a cliff? Or...?
Also I think you’re using workgroup where you should be using wavefront.
I think you're going to have to be very specific in pointing out a problem with what I've written. I'm not saying I haven't made a mistake, but while you're hiding the problem you've identified I can't read your mind and I'm not motivated to find the problem in text that I've already spent well over 6 hours writing (I posted version 3, in case you're wondering).
I deliberately use "workgroup" and "hardware thread" because in discussing multiple architectures, and architectures over time from the same IHV, you get contradictory models of the hardware if you use "wavefront" and "block" and "warp". Remember, G80 had two warp sizes, for example: 16 and 32. Graphics and compute APIs can easily hide hardware threading models, which fucks up discussions of the hardware.
For example I have a theory that RDNA (2) only has one hardware thread size: 32. The idea that pixel shaders are "wave64" (implying that they are a hardware thread of 64 work items) contradicts this, but can easily be viewed as an abstraction to maximise equivalency with the operation of GCN. There's even a gotcha in the operation of "wave64 mode":
"Subvector looping imposes a rule that the “body code” cannot let the working half of the exec mask go to zero. If it might go to zero, it must be saved at the start of the loop and be restored before the end since the S_SUBVECTOR_LOOP_* instructions determine which pass they’re in by looking at which half of EXEC is zero."
from RDNA_Shader_ISA.pdf (amd.com)
It looks like a fossil of GCN, where only the VCC and EXEC hardware registers are 64-bit (both required to support wave64). Perhaps RDNA 3 will entirely abandon wave64. One motivation could be that 8 SIMDs per WGP need filling, so LDS locality for attribute interpolation and L0$ locality for texturing (burstiness) is less of a win when there are so many SIMDs to keep occupied.
(Amusingly, ISA also says this about S_SUBVECTOR_LOOP_BEGIN and S_SUBVECTOR_LOOP_END: "This opcode has well-defined semantics in wave32 mode but the author of this document is not aware of any practical wave32 programming scenario where it would make sense to use this opcode.")
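To picture what that rule is guarding against, here's a toy Python sketch of "wave64 as two passes over the halves of EXEC". It's a conceptual illustration only; the real S_SUBVECTOR_LOOP_BEGIN/END semantics are whatever the ISA document says, not this.

```python
# Toy model of the quoted rule: a wave64 subvector loop runs the body
# once per non-zero half of the 64-bit EXEC mask, and the loop
# instructions tell which pass they're in by which half is zero.
# Conceptual sketch only -- not real S_SUBVECTOR_LOOP_* behaviour.
def subvector_loop(exec64, body):
    halves = (("lo", exec64 & 0xFFFFFFFF), ("hi", (exec64 >> 32) & 0xFFFFFFFF))
    for name, working in halves:
        if working == 0:
            continue                 # nothing live in this half, skip the pass
        # Per the quote: if 'body' is allowed to drive this working half
        # to zero, the "which half of EXEC is zero?" check can no longer
        # distinguish first pass from second, hence the save/restore rule.
        body(name, working)

def body(name, active_mask):
    print(f"{name} half pass, active lanes = {active_mask:#010x}")

subvector_loop(0x0000FFFF000000FF, body)   # lanes live in both halves
```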
Also, "SGPRs are no longer allocated: every wave gets a fixed number of SGPRs" means that the per-SIMD hardware-thread ID directly indexes SGPRs, which would also mesh easily with there being only one hardware-thread size of 32.
Wave64 and the necessity of optional sub-vector looping smell like a hack... Sub-vector looping helps with VGPR allocation, as it happens. But AMD also increased the size of the register files in RDNA...