One plausible case for a 16-lane SIMD, IMO, is:
(1) The native wavefront size remains 32 lanes, including the VRFs, operand collectors and the memory pipelines (which seem to happily remain 128-byte cache lines);
(2) Two 16-lane SIMDs as the VALU pipelines. Native 32-wide wavefronts are executed on them with double issue and transparent half-skipping, as with Wave64.
(3) Compiler-scheduled co-issuing (VOPD) is introduced to offset the potential ILP loss caused by double issuing.
I know you said "two 16-lane SIMDs" (in which case instruction throughput would be halved), but I think it would be four per CU. This is based on my theory that VOPD for 32-bit resultants is not happening: the risk of not being able to use VOPD is too high due to serial dependencies (eating up more hardware threads), whereas for 16-bit resultants the cost of serial dependencies preventing VOPD is relatively low, because relatively little code has 16-bit resultants in the first place.
(FWIW, I think deleting the RBEs in favour of shader code is a great use of VOPD for 16-bit resultants. Most render targets are 8 or 16 bits per component.)
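For reference, and purely going by the v_dual_* syntax that has appeared in LLVM's GFX11 work: a VOPD pair is two independent ops fused by the compiler into one dual-issue instruction, printed with a "::" separator. Whether 16-bit-resultant variants exist is exactly the open question. The operands here are illustrative only; the real constraint set (e.g. destination VGPR bank parity) is more involved.
Code:
; two independent VALU ops fused by the compiler into one
; dual-issue VOPD instruction
v_dual_fmac_f32 v0, v2, v4 :: v_dual_mul_f32 v1, v3, v5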
Incidentally, I think four SIMD16 per CU, with two CUs per WGP means that two GCDs each of 48 WGPs are required for Navi 31/32.
Probably no change to instruction latencies, except for cross-lane modifiers/instructions, which will have to deal with a partial-bypass hazard (note: `S_DELAY_ALU`?).
I thought I'd collect some hazards that are new as of GFX11 (from reviews.llvm.org):
Code:
// New hazard predicates, gated to GFX11 onward:
bool hasVALUPartialForwardingHazard() const {
  return getGeneration() >= GFX11;
}
bool hasVALUTransUseHazard() const { return getGeneration() >= GFX11; }
Neither of those is used anywhere yet.
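Reading the names (just my interpretation; the patch doesn't spell it out), the second predicate would cover a VALU instruction consuming a transcendental result too soon after it was produced, e.g.:
Code:
; hypothetical GFX11 sequence illustrating a "VALU trans use" hazard
v_rcp_f32 v0, v1      ; TRANS op produces v0
v_mul_f32 v2, v0, v3  ; VALU reads v0 straight away, inside the hazard window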
Meanwhile these were added in a commit:
Code:
let SubtargetPredicate = isGFX11Plus in {
  def S_WAIT_EVENT : SOPP_Pseudo<"s_wait_event", (ins s16imm:$simm16),
                                 "$simm16">;
  def S_DELAY_ALU : SOPP_Pseudo<"s_delay_alu", (ins DELAY_FLAG:$simm16),
                                "$simm16">;
} // End SubtargetPredicate = isGFX11Plus
With DELAY_FLAG (GFX11) being printed like so (from reviews.llvm.org):
Code:
static const std::array<const char *, 12> InstIds = {
"NO_DEP", "VALU_DEP_1", "VALU_DEP_2",
"VALU_DEP_3", "VALU_DEP_4", "TRANS32_DEP_1",
"TRANS32_DEP_2", "TRANS32_DEP_3", "FMA_ACCUM_CYCLE_1",
"SALU_CYCLE_1", "SALU_CYCLE_2", "SALU_CYCLE_3"};
[...]
static const std::array<const char *, 6> InstSkips = {
"SAME", "NEXT", "SKIP_1", "SKIP_2", "SKIP_3", "SKIP_4"};
Is that telling us that there are four VALUs? Similarly, is it telling us that there are three transcendental ALUs? In my opinion, no on both counts. The compiler doesn't have a model of the relative number of ALUs (SALU, VALU, TRANS), because the compiler has a hardware-thread-centric view, not a CU- or WGP-centric view.
Are the cycle counts for SALU telling us that SALU has a 3-cycle intrinsic loop?
I don't think S_DELAY_ALU is for intra-VALU timing hazards; instead, it appears to relate to relationships amongst:
- SALU and VALU or
- SALU and TRANS or
- VALU and TRANS
presumably for VCC validity or resultant availability. But looking at the InstIds and InstSkips arrays above, we can see there are two categories of "delay" here: InstIds (which describes either a first or a second instruction) and InstSkips, which are just immediate skip counts. So the former could, perhaps, be applied to intra-VALU timing hazards? Or is it for instructions with 16-bit resultants, used to work out when a "clause" of VOPD instructions can commence?
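For what it's worth, the same patch prints a full S_DELAY_ALU along these lines (my reading: instid0 covers the next instruction, instskip selects which later instruction instid1 covers):
Code:
s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_3)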
(4) VALU issue rate: up to 2 VALU instructions (from two different waves) or 1 VOPD bundle per cycle; the 2 inst/cycle rate can only be sustained with either VOPD or having >=1 half-disabled Wave32 in flight.
I strongly believe that the hardware cannot dynamically combine hardware threads for simultaneous issue in one cycle. The entire history of AMD GPUs shows no tendency towards instruction issue that is outside of the compiler's direct control. If the compiler cannot see across hardware threads, then the hardware cannot do simultaneous issue within a SIMD type (either VALU or TRANS).
It seems to me that this S_DELAY_ALU instruction is specifically providing the compiler with the ability to control execution for a single hardware thread.
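In other words, something like this (an illustrative sequence, assuming the obvious reading of VALU_DEP_1 as "wait on the 1st-previous VALU result"):
Code:
v_add_f32 v0, v1, v2            ; producer
s_delay_alu instid0(VALU_DEP_1) ; this wave stalls itself until v0 is ready
v_mul_f32 v3, v0, v4            ; consumer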
This way they will not regress on achievable ILP on paper (still 1.0). Meanwhile, it can potentially execute some divergent kernels better, provided that those kernels either naturally converge to one half of the wavefront or explicitly reorder the lanes at sync points, e.g. with a wave ballot+shuffle.
I agree that SIMD16 is useful for half-hardware-thread predication, and I also agree that "wave ballot+shuffle" is an extremely attractive concept. The latter has been at the back of my mind in pretty much all RDNA 3 speculation, specifically because ray tracing puts enormous pressure on work-item divergence.
This also doesn't pose many questions for the VRF and memory pipelines, since they can remain 32 lanes wide. Though a possible crack in this theory is whether/how the VRF can sustain 2x half-disabled Wave32s being co-issued at full rate for long VALU runs.
I think that's answered simply by VOPD being exclusively for 16-bit resultants: packed 16-bit operands shouldn't need any more VRF bandwidth than a single 32-bit op does.
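A concrete illustration of why the bandwidth cost looks benign: 16-bit values already travel packed two per 32-bit VGPR in existing packed-math ops, so one VGPR-wide read feeds two 16-bit operations.
Code:
; two f16 adds per lane, fed by single 32-bit VGPR reads
v_pk_add_f16 v0, v1, v2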
Anyway, overall, I think SIMD16 has a compelling feel to it, right now.