Now that I think more about it… And do they definitely not support this for graphics shaders at all, or might that be redacted from the public ISA because they consider it too limited/complicated to be exposed publicly?
It might also have to do with how registers are allocated for wave64. I recall reading that RDNA 3 seems to:
1. swizzle the VGPR banks used for the wave64 high lanes (0<->1, 2<->3), so that e.g. V0_lo is on bank 0 but V0_hi is on bank 1.
2.
This scheme allows wave64 to feed both 32-lane VALU pipes with two 64-lane operands without bank conflicts.
For example, V0_lo, V0_hi, V1_lo and V1_hi can all be read in a single cycle with this scheme.
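To make the swizzle concrete, here is a rough sketch of the mapping as I understand it. The bank count (4), the "bank = register index mod 4" rule and the function names are all my own assumptions for illustration, not something taken from AMD documentation; the only part that comes from the description above is the 0<->1, 2<->3 swap for the high half.

```python
NUM_BANKS = 4  # assumption: four VGPR banks, selected by register index mod 4


def vgpr_bank(reg: int, high_half: bool) -> int:
    """Bank holding one 32-lane half of VGPR `reg` under the assumed swizzle."""
    base = reg % NUM_BANKS
    # Assumed swizzle for the wave64 high lanes: 0<->1, 2<->3,
    # which is the same as flipping the lowest bit of the bank index.
    return base ^ 1 if high_half else base


def bank_table(num_regs: int) -> dict:
    """Map names like 'V0_lo' / 'V0_hi' to their (hypothetical) bank."""
    table = {}
    for reg in range(num_regs):
        table[f"V{reg}_lo"] = vgpr_bank(reg, high_half=False)
        table[f"V{reg}_hi"] = vgpr_bank(reg, high_half=True)
    return table


if __name__ == "__main__":
    for name, bank in bank_table(4).items():
        print(f"{name}: bank {bank}")
```

Under these assumptions V0_lo lands on bank 0 and V0_hi on bank 1, matching the example above. The sketch only models which bank each half lives in; it says nothing about how many read ports each bank has or how the two VALU pipes are wired to them.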
Depending on how the scheme is actually implemented, it could be at odds (?) with how s_alloc_vgpr works.
Naively, though, it could still work if the lo and hi halves are stored as physically adjacent pairs.