So
COMPUTE UNIT SORTING FOR REDUCED DIVERGENCE - Advanced Micro Devices, Inc. (freepatentsonline.com)
is extremely underwhelming. All the tricky questions:
- how thresholds for the profitability of performing reorganisation are assessed
- degradation as the count of branch targets grows
- how much work-item state has to be moved during reorganisation
- nested divergent control flow (e.g. the count of branch targets keeps increasing, and it's even worse if it does so at a high rate relative to instructions executed)
- impact on execution (stalls) caused by having to wait for enough hardware threads to "settle" as a prerequisite for sorting and reorganisation
were completely ignored. Actual sorting techniques were not provided, the capacities of the hardware were not acknowledged as determining factors for sorting, and there was no hint of mitigations for the difficulties of moving data around a SIMD, CU or WGP.
RDNA does have cross-lane data reads which could be a major component of "intra-wavefront" (intra-hardware thread) reorganisation. The bandwidth is low though, so there's a high cost to moving VGPRs on demand and then returning them after work items reconverge to their original scheduling. For example, there might be 5 VGPRs required for the different sections of code targeted by control flow, but each work item has 20 VGPRs allocated in total.
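To put the cost in concrete terms, here's a minimal host-side sketch in C++ (my own model, not shader code and not anything from the document) of what "moving VGPRs on demand" amounts to: only the registers that are live across the divergent region are permuted, one cross-lane read per register per lane, and they have to be moved back with the inverse mapping after reconvergence. The 20/5 figures are just the example counts from above.

```cpp
#include <array>
#include <cstdint>

// Host-side model of one 32-lane hardware thread.
constexpr int LANES      = 32;
constexpr int NUM_VGPRS  = 20;  // total VGPRs allocated per work item (example from the text)
constexpr int LIVE_VGPRS = 5;   // VGPRs live across the divergent region (example from the text)

using RegFile = std::array<std::array<std::uint32_t, LANES>, NUM_VGPRS>;

// Model of a cross-lane "permute"-style read: each destination lane reads the
// value held by the source lane given in mapping[lane]. One register at a time,
// which is why cross-lane bandwidth, not capacity, is the limiting factor.
void permute_register(std::array<std::uint32_t, LANES>& reg,
                      const std::array<int, LANES>& mapping) {
    std::array<std::uint32_t, LANES> src = reg;
    for (int lane = 0; lane < LANES; ++lane)
        reg[lane] = src[mapping[lane]];
}

// Inverse mapping, used to return work items to their original scheduling.
std::array<int, LANES> invert(const std::array<int, LANES>& mapping) {
    std::array<int, LANES> inv{};
    for (int lane = 0; lane < LANES; ++lane)
        inv[mapping[lane]] = lane;
    return inv;
}

// Reorganise: move only the live registers according to the sort's lane mapping;
// the dead 15 stay put. Call again with invert(mapping) after reconvergence.
void reorganise(RegFile& rf, const std::array<int, LANES>& mapping) {
    for (int r = 0; r < LIVE_VGPRS; ++r)
        permute_register(rf[r], mapping);
}
```

Even in that best case it's 2 x 5 cross-lane moves per lane (out and back), which is where the low cross-lane bandwidth bites.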
The description of the "inter-wavefront" technique avoids tackling the question of whether the two hardware threads in question are running on the same SIMD. The closest the document comes to acknowledging this question is: "Different wavefronts of a single workgroup do not execute in the simultaneous SIMD manner described herein, although such wavefronts can execute concurrently on different SIMD units 138 of a single compute unit 132."
So, all in all, the document barely acknowledges the logistical problems of the algorithms it presents. I'm not convinced there's anything novel (patentable) in the algorithms presented.
It's nice that the document refers expressly to ray tracing and shows the problems associated with "uber ray shading", which arise when "closest hit", "miss" and "any hit" shaders are composed into a single uber shader (on top of the actual BVH traversal shader).
The general case of divergent control flow while shading ray results is unavoidable, so anything that mitigates SIMD-wastage is welcome.
The idea of multiple execution items per work item is presented.
We're already familiar with this, in a sense. When a 64-item workgroup runs on an RDNA (2) SIMD, the compiler generates "hi" and "lo" 32-work-item halves, which are tackled either by alternating each instruction between the two halves, or by scheduling one half to run to completion followed by the other.
From the perspective of a hardware thread this is sort of as if each work item is mapped to two. For instance, in a pixel shader that is generally run as a 64-work-item workgroup, pixels 0 and 32 can be thought of as two execution items, both sharing work item 0 in the hardware thread. The register file can be viewed the same way: r0 in execution item 0 is matched by r32 in execution item 32.
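To make that register-file view concrete, here's a toy mapping in C++ (the 32-VGPR budget is my assumption, chosen only so that the r0/r32 correspondence lines up):

```cpp
// Toy model of the hi/lo split: a 64-work-item workgroup on a 32-lane SIMD.
// Each lane hosts two execution items (pixel p and pixel p+32); the hi half's
// copy of logical register r lives at a separate physical slot in the same lane.
constexpr int LANES        = 32;
constexpr int LOGICAL_REGS = 32;  // assumed register budget of the shader

struct Location { int lane; int physical_reg; };

// Where does logical register `r` of work item (pixel) `p` live?
Location locate(int p, int r) {
    int half = p / LANES;  // 0 = "lo" half, 1 = "hi" half
    return { p % LANES, r + half * LOGICAL_REGS };
}
// locate(0, 0)  -> { lane 0, physical r0  }
// locate(32, 0) -> { lane 0, physical r32 }   (the "r0 matched by r32" view)
```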
The document presents this as a way to run ray tracing: 2 or more rays (each being an execution item) share a work item. Then intra-wavefront reorganisation is used to "move" rays so that, after sorting, divergence is minimised across all of the execution items.
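As a sketch of what the sorting step itself could look like (host-side C++ again, not anything the document actually provides): give each execution item a key, such as the index of the shader it will branch to next, then a stable sort over the keys yields a slot mapping that groups same-target rays together. How that mapping gets applied, and at what cost, is exactly what the document glosses over.

```cpp
#include <algorithm>
#include <array>
#include <numeric>

constexpr int SLOTS = 64;  // e.g. 32 lanes x 2 execution items per work item

// Key per execution item: e.g. the index of the shader (closest-hit, miss,
// any-hit variant) that the ray will branch to next.
using Keys    = std::array<int, SLOTS>;
using Mapping = std::array<int, SLOTS>;

// Produce a slot mapping such that execution items with the same branch target
// end up in adjacent slots (and so, ideally, in the same subset of lanes).
Mapping sort_by_branch_target(const Keys& keys) {
    Mapping order{};
    std::iota(order.begin(), order.end(), 0);
    std::stable_sort(order.begin(), order.end(),
                     [&](int a, int b) { return keys[a] < keys[b]; });
    return order;  // order[new_slot] = old_slot holding that execution item
}
```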
The document also presents the idea of using criteria other than the branch target as input to the sorting algorithm (though it doesn't actually put this into the claims), e.g. the direction of rays. Say rays are assigned to one of the two hemispheres of a sphere: then they can be sorted into two groups, with the expectation of increased coherence as they traverse the BVH (or at least for their next intersection test).
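A hypothetical key function for that direction-based criterion (my illustration; the document only gestures at the idea), which could feed straight into the sort above:

```cpp
struct Vec3 { float x, y, z; };

float dot(const Vec3& a, const Vec3& b) {
    return a.x * b.x + a.y * b.y + a.z * b.z;
}

// Hypothetical sort key: 0 or 1 depending on which hemisphere (about `axis`)
// the ray direction points into. Coarse, but rays in the same hemisphere are
// more likely to take similar paths on their next traversal/intersection test.
int hemisphere_key(const Vec3& ray_dir, const Vec3& axis) {
    return dot(ray_dir, axis) >= 0.0f ? 0 : 1;
}
```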
Throughout there's a de-emphasis of hardware functionality; instead, the descriptions focus on shader code, inserted by the compiler, to perform the sorting. We can liken this to how vertex attribute interpolation is inserted by the compiler into pixel shaders. So the result of running such code (sorting by arbitrary conditions in order to "minimise divergence") could, ideally, be fed into hardware. Or it might just be turned into a stream of cross-lane or LDS-mediated data moves.
For example, in the same way that an execution mask is a 32- or 64-wide register (per hardware thread) that SIMDs refer to when running instructions, a lane-mapping mask would instruct the SIMD and the operand collector how to run reorganised work items. There is no attempt to elucidate this, as the document is merely about sorting work items for reduced divergence and reorganising them within the workgroup (as if by magic).
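To make that analogy concrete, here's a sketch of what a "lane-mapping mask" might mean at operand-collection time: alongside the per-lane exec bit, a per-lane source index tells the SIMD which lane's registers to read from. This is purely my illustration; the document never specifies any such mechanism.

```cpp
#include <array>
#include <cstdint>

constexpr int LANES = 32;

// Hypothetical per-hardware-thread state: the familiar exec mask plus a
// per-lane source index ("lane-mapping mask") set up by the sorting code.
struct ThreadState {
    std::uint32_t exec;                        // 1 bit per lane, as today
    std::array<std::uint8_t, LANES> src_lane;  // which lane's operands to collect
};

// One VALU-style op under both masks: lane i reads its operands from
// src_lane[i] instead of from itself, and only executes if its exec bit is set.
void vadd_mapped(std::array<std::uint32_t, LANES>& dst,
                 const std::array<std::uint32_t, LANES>& a,
                 const std::array<std::uint32_t, LANES>& b,
                 const ThreadState& st) {
    for (int i = 0; i < LANES; ++i)
        if (st.exec & (1u << i))
            dst[i] = a[st.src_lane[i]] + b[st.src_lane[i]];
}
```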
I don't see why it implies a major reworking. In RDNA each SIMD is already independent with its own instruction scheduler, wavefront controller, scalar ALU and SFU.
Currently we are left to infer that a workgroup of more than 32 work-items will run concurrently (though not necessarily in lock-step) across both SIMDs in an RDNA (2) CU, unless it's a pixel shader (which is usually 64 work items assigned to a single SIMD).
What appears as a logically single LDS is effectively two LDSs, each with its own crossbar and queue, in order to serve its parent CU.
So LDS and texturing (addressing, fetching, filtering) are notionally controlled by CU-level scheduling hardware.
Only if workgroup processing mode is activated will the hardware threads have the option to occupy all four SIMDs and gain access to all physical addresses in LDS (e.g. a single 256-work-item workgroup uses all 128KB of LDS). In this scenario the crossbars and queues have, in the worst case, to be merged into a single functional unit, to deal with atomics (e.g. read after increment).
So now the CUs can no longer own their use of LDS; the WGP has to take on that responsibility. Some logic is therefore required to support dual responsibility for LDS usage.
Without CUs, the WGP has full control in theory. Can that be simplified so that LDS is always singular? Or does LDS remain as two arrays? With two arrays in CU mode, a WGP has "doubled" bandwidth, because the two arrays are truly independent. In WGP mode is that doubled bandwidth still available?
In a CU-less WGP would that doubled bandwidth be available? Or would it still be constrained to 32 lanes at a time?
Without CUs, what's the allocation policy for hardware threads across SIMDs inside the WGP, when those hardware threads are from the same workgroup? Always in (0,1) (2,3) pairs? Or greedy, finding the SIMD groups with the least work?
What if there's 8 SIMDs? Do you really want to use anything other than greedy allocation?
What's the effect of allocation policy on LDS bandwidth and latency?
Blimey I've spent over 6 hours on this today and it isn't even bedtime. Result!