Ok, I got you. Why limit the conversation to just pixel shaders though? AMD's description of 64-item "wavefronts" appears to apply to compute workloads as well.
From AMD PowerPoint- White Template (gpuopen.com), page 18:
Compiler makes the decision
- Compute and vertex shaders usually as Wave32, pixel shaders usually as Wave64
- Heuristics will continue to be tuned for the foreseeable future
That implies, at the very least, a strong bias towards wave64 being used solely for pixel shading.
So the gotcha (for my argument that the hardware really only has 32-work-item hardware threads) is the idea that compute shaders can be issued as 64-work-item hardware threads. And, indeed, that 128-work-item workgroups could be issued as two 64-work-item hardware threads instead of four 32-work-item hardware threads.
I can't think of a case where "Allows higher occupancy (# threads per lane)" would apply to compute and improve performance. It's the only stated benefit (of the two listed) under the Wave64 column of the table that applies to compute at all.
In compute, work items sharing a SIMD lane is not part of the programming model. The closest you can get is chip-specific data parallel processing (DPP) instructions that work on subsets of 8 or 16 work items, and those won't share data across the boundary between work items 0:31 and 32:63.
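To illustrate what I mean, here's a minimal Python sketch of a row-style DPP permutation. The 16-lane row width and the shift-by-one behaviour are just illustrative assumptions, not a model of any particular instruction encoding:

```python
# Sketch of a DPP-style "row" operation on a 64-work-item wave:
# lanes are grouped into 16-lane rows and data movement stays
# inside a row.  Numbers are assumptions for illustration only.

WAVE_SIZE = 64
ROW_SIZE = 16  # assumed row width, purely for illustration

def row_shift_right(values, shift=1):
    """Each lane reads the value from `shift` lanes below it, but only
    within its own 16-lane row; out-of-row reads keep the lane's own
    value here (a stand-in for bound control)."""
    result = []
    for lane in range(WAVE_SIZE):
        row_base = (lane // ROW_SIZE) * ROW_SIZE
        src = lane - shift
        result.append(values[src] if src >= row_base else values[lane])
    return result

wave = list(range(WAVE_SIZE))   # lane i holds the value i
shifted = row_shift_right(wave)

# No lane in the upper half (32:63) ever picks up a value from the
# lower half (0:31): the movement never crosses a 16-lane row, let
# alone the 32-lane boundary discussed above.
assert all(v >= 32 for v in shifted[32:])
print(shifted[28:36])  # [27, 28, 29, 30, 32, 32, 33, 34]
```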
So the only scenario where work items sharing a lane is effectively exposed is pixel shading attribute interpolation. Maybe someone can think of something else?
So how would compute get higher occupancy with Wave64 versus Wave32 and be faster (i.e. worth doing)? Is there a mix of workgroup size combined with VGPR and LDS allocations that does this?
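For what it's worth, this is the kind of back-of-envelope arithmetic I have in mind. The wave-slot count and register-file depth below are made-up round numbers, not verified RDNA figures:

```python
# Back-of-envelope "work items in flight per SIMD" under wave32 vs
# wave64.  All limits are assumed round numbers for illustration.

WAVE_SLOTS_PER_SIMD = 16      # assumed max resident waves per SIMD
VGPR_ENTRIES_PER_SIMD = 1024  # assumed register-file depth per lane

def items_in_flight(wave_size, vgprs_per_item):
    # A wave64 occupies twice the register-file depth of a wave32 with
    # the same per-work-item VGPR count, because the SIMD is 32 lanes
    # wide and the upper 32 work items need their own copies.
    entries_per_wave = vgprs_per_item * (wave_size // 32)
    waves_by_vgpr = VGPR_ENTRIES_PER_SIMD // entries_per_wave
    waves = min(WAVE_SLOTS_PER_SIMD, waves_by_vgpr)
    return waves * wave_size

for vgprs in (32, 64, 128):
    w32 = items_in_flight(32, vgprs)
    w64 = items_in_flight(64, vgprs)
    print(f"{vgprs:3d} VGPRs/item: wave32 -> {w32:4d} items, wave64 -> {w64:4d} items")
```

With those made-up limits, wave64 only pulls ahead when the wave-slot count (rather than VGPRs or LDS) is the binding constraint, and that's exactly the scenario I'm struggling to map onto a real compute workload.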
Locality doesn't seem to be a major motivating factor for keeping it on the same SIMD, as you get much the same benefit as long as you're on the same CU. AMD's whitepaper only has this to say on the matter: "While the RDNA architecture is optimized for wave32, the existing wave64 mode can be more effective for some applications." They don't mention which applications benefit from wave64.
Yes, that's why AMD PowerPoint- White Template (gpuopen.com) seems to be more precise (yet remains vague, citing "heuristics"). In truth, in PC gaming we can never really exclude heuristics, because drivers put too much distance between us and the metal.
Locality is explicitly relevant for attribute interpolation (pixels that share a triangle can share parts of the LDS data). And locality affects texture filtering. So both of these relate specifically to pixel shading.
Is it important that it's one hardware thread vs two? I guess my only point is that from a software perspective it doesn't matter.
Developers probably can't access the wave32/64 decision, so it "doesn't matter to them". Well, you could argue that the more dedicated might complain to AMD that the driver stinks for their game, and AMD then makes a decision in the driver for them.
In trying to understand the hardware, and why "CUs" are still part of RDNA, wrapped inside a WGP, the hardware thread sizes on offer might be informative.
CUs might be present simply to soften the complexity of getting RDNA working. Drivers were very troublesome for quite a while after the 5700 XT launched, despite the helping hand of CU mode, and things perhaps would have been worse if there were only a single TMU per WGP, with LDS being a single array instead of the two we currently have.
CU mode combined with wave64 mode looks like an intentional lowest-complexity fallback: the backstop for when the driver team is struggling to adapt to a new architecture. G-buffer fill might be the perfect use case, but that's export-bandwidth bound these days, isn't it?
If RDNA 3 has no CUs, does it still need wave64 mode? Or is wave64 mode more important in that case?