The whitepaper says: "Each compute unit has access to double the LDS capacity and bandwidth".
The whitepaper is quite explicit about CU mode:
"The RDNA architecture has two modes of operation for the LDS, compute-unit and workgroupprocessor mode, which are controlled by the compiler. The former is designed to match the behavior of the GCN architecture and
statically divides the LDS capacity into equal portions between the two pairs of SIMDs. By matching the capacity of the GCN architecture, this mode ensures that existing shaders will run efficiently. However, the work-group processor mode allows using larger allocations of the LDS to boost performance for a single work-group."
It does not seem like CU mode versus WGP mode should matter for peak throughput.
Agreed, the peak is the same. The peak count of DWORDs read or written per LDS bank per cycle is constant whether in CU mode or WGP mode.
After all, in WGP mode you are meant to be able to address all of the LDS, which means both arrays collectively have to be capable of delivering 2x 128B/cycle to either of the SIMD pairs anyway.
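Quick back-of-the-envelope check. Only the 2x 128B/cycle figure is stated above; the "32 banks x 4 bytes per array" breakdown is my assumption about where it comes from:

```cuda
// Back-of-the-envelope peak LDS bandwidth for one WGP. The 2x 128B/cycle
// figure is stated above; the "32 banks x 4 bytes per array" breakdown is an
// assumption about where it comes from.
#include <cstdio>

int main()
{
    const int arrays          = 2;   // two LDS arrays per WGP
    const int banks_per_array = 32;  // assumed
    const int bytes_per_bank  = 4;   // one DWORD per bank per cycle

    const int per_array = banks_per_array * bytes_per_bank;   // 128 B/cycle
    const int per_wgp   = arrays * per_array;                 // 256 B/cycle

    printf("per array: %d B/cycle, per WGP: %d B/cycle\n", per_array, per_wgp);
    // The total is the same whether the two arrays are presented as one
    // WGP-wide pool (WGP mode) or split per CU (CU mode) -- only the
    // addressing changes.
    return 0;
}
```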
An atomic instruction like BUFFER_ATOMIC_SWAP means the peak is capped. In WGP mode all 64 banks of LDS are addressable, but an atomic that returns data, by definition, can only touch a single bank per cycle. This is the scenario I was referring to originally when I was talking about a CU-less design.
If anything could hamper bandwidth, it is perhaps the documented fact that a SIMD pair shares the request & return buses, in which case effective bandwidth can indeed be halved in non-ideal scenarios where all active work-items in a workgroup are biased towards one of the two SIMD pairs. But even then, it does not fit the "localized LDS = CU mode better" theory.
I was referring specifically to bandwidth, and there are some LDS operations where it appears that bandwidth is higher in CU mode, because a CU can only touch one array in LDS.
I can't think of a mechanism in WGP mode that can guarantee that two BUFFER_ATOMIC_SWAP instructions can always run concurrently. If two SIMDs (e.g. 0 and 2) are both doing BUFFER_ATOMIC_SWAP but they are in distinct workgroups, then it should be possible for LDS to run BUFFER_ATOMIC_SWAP for both concurrently. Or, BUFFER_ATOMIC_SWAP for one workgroup and some other "less serial" LDS instruction for the other.
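To make the pattern concrete, here is a minimal sketch using CUDA shared memory and atomicExch() as stand-ins for LDS and a returning atomic swap (that substitution is my assumption; RDNA's DS/buffer atomic paths are obviously not CUDA). It only generates the traffic pattern and cannot show how WGP mode arbitrates between the two workgroups' atomics:

```cuda
// Sketch of the "returning atomic" traffic pattern, using CUDA shared memory
// and atomicExch() as stand-ins for LDS and BUFFER_ATOMIC_SWAP (an assumption;
// RDNA's DS/buffer atomic paths are not CUDA). It only generates the access
// pattern -- it cannot show how WGP mode arbitrates between two workgroups.
__global__ void atomic_swap_sketch(unsigned int *out, int spread)
{
    __shared__ unsigned int s[32];               // roughly one word per bank
    if (threadIdx.x < 32) s[threadIdx.x] = 0;
    __syncthreads();

    // spread == 0: every lane targets word 0 -> a single bank, fully serialized.
    // spread == 1: lane i targets word i % 32 -> banks can work in parallel.
    int idx = spread ? (threadIdx.x % 32) : 0;
    unsigned int old = atomicExch(&s[idx], threadIdx.x);   // atomic that returns data

    out[blockIdx.x * blockDim.x + threadIdx.x] = old;      // keep the returned value
}
// Launching two blocks is the interesting case -- two workgroups each issuing
// returning atomics, which is exactly the concurrency question above:
// atomic_swap_sketch<<<2, 64>>>(d_out, /*spread=*/0);
```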
Yes, I'm looking for corner cases. I'm trying to identify the problems with a CU-less WGP.
I'm not following your thought process. Of course wavefronts within a workgroup can be allocated to different SIMDs within an SM/CU - this has been the case forever. What is it that we need to infer exactly?
And that's all I was inferring.
I was contrasting a workgroup of 64 pixels (work items), which is allocated to a single SIMD, with a non-pixel-shader workgroup of 64 work items that will be allocated across both SIMDs. I was pointing out that the pixel shader allocation to SIMDs was unusual, and that LDS appears to these two SIMDs in a CU the same as in GCN (where it is shared by four SIMDs). A half of LDS (one array out of the two) is dedicated to a CU in CU mode.
TMUs, RAs and L0$ bandwidth are dedicated to a CU all of the time, whether in CU or WGP mode.
Though we also know that RDNA (2) has increased throughputs and sizes versus GCN at the CU level (e.g. L0$ cache lines are twice the size of L1$ in GCN).
I'm referring to your use of the term workgroup in the post that I quoted. You seem to be treating them as groups of threads that must be launched and retired together and that reside on the same SIMD - this is a wavefront not a workgroup. AMD explicitly says that 64-wide wavefronts run on the same 32-wide SIMD and just take multiple cycles to retire each instruction. There is no suggestion that multiple SIMDs will co-operate in the execution of such a wavefront.
What you've written is true for a pixel shader for 64 pixels. It's described as a wave64 and corresponds with a workgroup of size 64. The unusual aspect here is that it's scheduled on a single SIMD with 2 work items sharing a SIMD lane, whereas in general workgroups are spread across SIMDs width first.
But note that a workgroup of 128 work items bound to an RDNA CU necessarily results in 2 work items sharing a SIMD lane, with increasing multiples for larger workgroups up to those of size 1024. So the pixel shader configured as wave64 appears to be a special case of workgroup. As a special case it's designed explicitly to gain from VGPR, LDS and TMU locality for pixels, whose quad work-item layout and use of quad-based derivatives is a special case of locality.
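For anyone following along, the arithmetic is easy to tabulate. A throwaway host-side helper; the only hardware figures assumed are 32-wide SIMDs and 2 SIMDs per CU:

```cuda
// Host-side sketch of the arithmetic above: how many wavefronts a workgroup
// needs, and how many work items share a SIMD lane. Purely illustrative,
// not any AMD API.
#include <cstdio>

void map_workgroup(int workgroup_size, int wave_size)
{
    const int simd_width   = 32;               // RDNA SIMD width
    const int lanes_per_cu = 2 * simd_width;   // two SIMDs per CU

    int wavefronts          = (workgroup_size + wave_size - 1) / wave_size;
    int items_per_lane_wave = wave_size / simd_width;     // 1 for wave32, 2 for wave64
    int items_per_lane_cu   = (workgroup_size + lanes_per_cu - 1) / lanes_per_cu;

    printf("workgroup=%4d wave%d: %2d wavefront(s), %d item(s)/lane within a wave, %2d item(s)/lane across one CU\n",
           workgroup_size, wave_size, wavefronts, items_per_lane_wave, items_per_lane_cu);
}

int main()
{
    map_workgroup(64,   64);  // pixel-shader wave64: 2 items/lane within the wave, even though the CU has 64 lanes
    map_workgroup(128,  32);  // 128 items on one CU: 2 items per lane
    map_workgroup(1024, 32);  // 1024 items on one CU: 16 items per lane
    return 0;
}
```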
This then leads me to believe that "wave64" for pixel shading is a hack that ties together two hardware threads on a SIMD. Not only is there a clue in the gotcha that I described earlier, but one mode of operation (alternating halves per instruction) is just like two hardware threads that issue on alternating cycles - which is a feature of RDNA.
In the end pixel shaders are merely a subset of compute shaders, where some of the work item state (values in registers and constant buffers) is driven by rasterisation.
A workgroup per the OpenCL spec is the equivalent of a CUDA block and consists of multiple wavefronts that can be distributed across the SIMDs in an SM/WGP. So it's a bit confusing as to what you're referring to when you describe groups of threads. Do you mean wavefronts (bound to a SIMD) or workgroups (bound to a WGP)?
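In CUDA terms, using warp ~ wavefront and block ~ workgroup as the rough equivalents (an analogy, not an exact match to RDNA):

```cuda
// Terminology pin-down in CUDA terms (warp ~ wavefront, block ~ workgroup --
// an analogy, not an exact match to RDNA).
__global__ void who_am_i(int *out)
{
    int block_id = blockIdx.x;        // workgroup/block: scheduled onto a WGP/SM,
                                      // its warps may land on different SIMDs
    int warp_id  = threadIdx.x / 32;  // wavefront/warp: issues from a single SIMD
    int lane_id  = threadIdx.x % 32;  // lane within that wavefront

    out[blockIdx.x * blockDim.x + threadIdx.x] = block_id * 10000 + warp_id * 100 + lane_id;
}
// e.g. who_am_i<<<4, 256>>>(d_out);  // 4 workgroups, each made of 8 wavefronts
```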
You can see the problem AMD has introduced with RDNA: two different wavefront sizes. Are they both hardware threads? Or is wave64 an emulated hardware thread, simulated by running two hardware threads?
Running a wave64 on a single SIMD appears to be enough to define it as a hardware thread. I believe that's where you're coming from. But a SIMD can support multiple hardware threads from the same workgroup, too. So that definition looks shaky.
Also, you can say (and I believe this is really your point) that running all work items from a workgroup on a SIMD means it's a hardware thread. So wave64 is a hardware thread. Whether it is or isn't a hardware thread, it's still a workgroup.
I'm sure that leaves you unhappy with what I've said. Ah well. Maybe RDNA 3 won't have wave64. Three distinct execution models for pixel shaders is beginning to take the piss. Hoping that the compiler will choose the right one is how you drive customers away. (RDNA took a year+ for drivers to settle at adequate performance?...)
What's interesting about LDS usage by a pixel shader for attribute interpolation is that the bandwidth of LDS is multiplied by sharing data across 64 work items (e.g. a single triangle has 64 or more pixels). Whether wave64 is a hardware thread or not, this is an example of shared data and reduced wall clock latency for pixel shading.
Of course that falls apart when the triangles are small: you really want wave32 (32 work item workgroups) then.
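A minimal sketch of that sharing effect, using CUDA shared memory in place of LDS (my substitution; this is not how RDNA's interpolation path actually stages data, just the reuse pattern being described):

```cuda
// Sketch of the sharing effect: per-triangle attributes are staged once in
// shared memory (standing in for LDS) and then read by every pixel of the
// block, so one attribute fetch feeds many interpolations. The layout and the
// "one triangle per block" simplification are assumptions for illustration.
struct TriAttr { float v0, v1, v2; };   // one scalar attribute at the 3 vertices

__global__ void interpolate_sketch(const TriAttr *tris, const int *tri_of_pixel,
                                    const float2 *bary, float *out, int num_pixels)
{
    __shared__ TriAttr attr;            // one triangle's attributes per block

    int pix = blockIdx.x * blockDim.x + threadIdx.x;
    bool active = (pix < num_pixels);

    if (threadIdx.x == 0 && active)     // fetch the attributes exactly once
        attr = tris[tri_of_pixel[pix]];
    __syncthreads();                    // every thread reaches the barrier

    if (!active) return;

    // Barycentric interpolation: up to 64 pixels reuse the same staged data.
    float i = bary[pix].x, j = bary[pix].y;
    out[pix] = attr.v0 * (1.0f - i - j) + attr.v1 * i + attr.v2 * j;
}
```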
All this came up because I'm trying to work out how RDNA 3 with no CUs inside a WGP configures TMUs, RAs and LDS. Bearing in mind that a "WGP" seems as if it would actually be a "compute unit", a compute unit generally has a single TMU and a single LDS. So my question was whether a CU with 8 SIMDs can have adequate performance with a single TMU and a single LDS.
Your stance is correct if the ratio between RA and WGP stays the same. I think Jawed was supposing that the ratio between RA and ALU stays the same, which could also be the case. Or it may be that the ratio between RA and WGP is indeed the same but the RA capabilities are increased... There are too few details at the moment to give a definitive answer. I would find it very strange, however, if AMD increased the base shading power almost threefold (which is not the limit of RDNA2) while giving the ray tracing hardware (which is the weakest point of RDNA2) only a moderate increase.
Need to be careful to distinguish the two things that RAs do:
- box intersections (4 tests per ray query; see the slab-test sketch after this list) - intrinsic performance here seems completely fine
- triangle intersections (1 test per ray query - I am not sure if this is a confirmed fact)
and separate them from the idea of "hardware acceleration accepts a ray and produces the triangle that was hit" (black box).
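For reference, what a single "box test" boils down to is a ray/AABB slab test, and a node test is then four of them per ray. The code below is the textbook form; AMD has not disclosed the exact arithmetic, so treat it as an illustration of the operation's cost rather than a description of the hardware:

```cuda
// The textbook ray/AABB slab test; an RA node test would be four of these per
// ray query. Not a description of AMD's actual hardware arithmetic.
#include <math.h>

struct AABB { float3 lo, hi; };

__host__ __device__ bool ray_aabb(float3 o, float3 inv_d, float t_max, const AABB &b)
{
    // Slab test per axis, keeping the tightest [t_near, t_far] interval.
    float tx0 = (b.lo.x - o.x) * inv_d.x, tx1 = (b.hi.x - o.x) * inv_d.x;
    float ty0 = (b.lo.y - o.y) * inv_d.y, ty1 = (b.hi.y - o.y) * inv_d.y;
    float tz0 = (b.lo.z - o.z) * inv_d.z, tz1 = (b.hi.z - o.z) * inv_d.z;

    float t_near = fmaxf(fmaxf(fminf(tx0, tx1), fminf(ty0, ty1)), fminf(tz0, tz1));
    float t_far  = fminf(fminf(fmaxf(tx0, tx1), fmaxf(ty0, ty1)), fmaxf(tz0, tz1));

    return t_near <= t_far && t_far >= 0.0f && t_near <= t_max;
}

// A 4-wide BVH node test is then just four slab tests for one ray:
__host__ __device__ int test_node(float3 o, float3 inv_d, float t_max, const AABB box[4])
{
    int hit_mask = 0;
    for (int i = 0; i < 4; ++i)
        if (ray_aabb(o, inv_d, t_max, box[i])) hit_mask |= 1 << i;
    return hit_mask;
}
```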
Because BVH traversal is a shader program, increased work items and/or rays in flight may actually be the key to improved performance. If AMD can do work item reorganisation for improved coherence during traversal, then that's another win. Indeed work item reorganisation may be the only way that AMD can meaningfully improve performance.
The alternative could be narrower SIMDs.
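For what "work item reorganisation" could look like in software, a common trick is to sort ray indices by a coherence key (direction octant here, which is my choice of key) before or between traversal steps. Whether AMD would do anything like this in hardware is pure speculation; this sketch only shows the software flavour:

```cuda
// Software-style ray reordering: sort ray indices by a coherence key so that
// neighbouring lanes walk similar parts of the BVH. The octant key is just
// one common choice; nothing here reflects how RDNA 3 might do it in hardware.
#include <thrust/device_vector.h>
#include <thrust/sequence.h>
#include <thrust/sort.h>

struct Ray { float3 o, d; };

__global__ void octant_keys(const Ray *rays, unsigned int *keys, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    // 3-bit key from the sign of each direction component.
    keys[i] = (rays[i].d.x < 0.f) | ((rays[i].d.y < 0.f) << 1) | ((rays[i].d.z < 0.f) << 2);
}

void reorder_rays(const thrust::device_vector<Ray> &rays,
                  thrust::device_vector<int> &order)   // traversal then follows 'order'
{
    int n = (int)rays.size();
    thrust::device_vector<unsigned int> keys(n);
    order.resize(n);
    thrust::sequence(order.begin(), order.end());      // 0..n-1

    octant_keys<<<(n + 255) / 256, 256>>>(thrust::raw_pointer_cast(rays.data()),
                                          thrust::raw_pointer_cast(keys.data()), n);

    // Rays in the same octant end up adjacent, so a wavefront gets similar rays.
    thrust::sort_by_key(keys.begin(), keys.end(), order.begin());
}
```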
"RA as being like texturing" is a useful metaphor. Both operations depend upon global memory accesses and so their latency needs to be hidden.
I have not found a way to determine the precise extent of RDNA 2's bottlenecks. I've been working on it for a couple of weeks, but haven't got very far. Of course bottlenecks are super-slippery in real time graphics, so there's a limit to what can be gleaned.