An RDNA SIMD32 register file has twice the register file capacity (in terms of KB) of a GCN SIMD16 register file.

My confusion: VGPRs are not twice as much now, in a way register pressure problems are magically gone now, no?
Since RDNA's register IDs correspond to 32 work-items, a register is individually half the length of the 64-wide register of GCN.
The register file has 4x as many individually addressable registers, although they are half-size. If a single CU in RDNA were asked to support the same number of work-items as GCN (2x Wave32 wavefronts or 1 Wave64 per GCN wavefront without certain optimizations), it would have the same register capacity per work-item.
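To make the capacity comparison concrete, here is a rough sketch of the arithmetic. The register counts, lane widths and SIMD counts per CU are my assumptions from the commonly cited figures (256 registers x 64 lanes per GCN SIMD, 1024 registers x 32 lanes per RDNA SIMD32, 4 bytes per lane); they are not stated in this thread, so correct me if they are off.

```python
BYTES_PER_LANE = 4  # assumed: one 32-bit value per work-item per register

def regfile_bytes(num_registers, lanes):
    """Capacity of one SIMD's vector register file, in bytes."""
    return num_registers * lanes * BYTES_PER_LANE

# Assumed figures (not taken from this thread).
GCN_REGS, GCN_LANES, GCN_SIMDS_PER_CU = 256, 64, 4
RDNA_REGS, RDNA_LANES, RDNA_SIMDS_PER_CU = 1024, 32, 2

gcn_simd_bytes = regfile_bytes(GCN_REGS, GCN_LANES)
rdna_simd_bytes = regfile_bytes(RDNA_REGS, RDNA_LANES)

gcn_cu_bytes = GCN_SIMDS_PER_CU * gcn_simd_bytes
rdna_cu_bytes = RDNA_SIMDS_PER_CU * rdna_simd_bytes

def bytes_per_work_item(cu_bytes, work_items):
    """Register capacity available to each work-item resident on the CU."""
    return cu_bytes / work_items

print("per-SIMD KB:", gcn_simd_bytes // 1024, "vs", rdna_simd_bytes // 1024)
print("per-CU KB:  ", gcn_cu_bytes // 1024, "vs", rdna_cu_bytes // 1024)
```

Under those assumptions the per-SIMD file doubles and the register count quadruples, but the per-CU total is unchanged, so the same number of resident work-items gets the same bytes per work-item.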
I'm trying to find the slide or reference that characterized RDNA as slightly improving register pressure, since I didn't see it described as "gone".
The finer granularity of Wave32 might spare shaders that are particularly poor at utilizing 64 threads from having to allocate a full 64-wide wavefront context.
Wave64 has a sub-vector execution loop that takes advantage of how Wave64 is issued as 2x Wave32 instructions: it splits the execution into two Wave32 halves and treats each half as a single iteration of an internal loop.
Registers used for results internal to that loop can be assigned to the same Wave32 register ID, as the software knows the intermediate results of the halves are separated in time--saving some capacity if that mode is used.
Perhaps someone has parsed the ISA doc better than I have, but I didn't see reference to the LDS allocation values in the wavefront context being extended to give them the ability to allocate more LDS.

Sort of killer feature would be the option to double accessible LDS for certain workgroups, while running others that do not use any LDS on the other half of one WGP. If technically possible, would be worth some work to make it happen!
The issue latency is 1/4 of GCN. For scalar forwarding, there appears to be a 2-cycle latency, which is better though not relevant as far as scalar register file footprint goes. It's scalar, so no savings in register width. It's also RDNA, which has shifted to a static 128 registers per wavefront, which in capacity terms is worse than GCN though it's rendered moot by the architecture having enough register file space to hard-wire the allocation.

It is not magic, it is an effect of the shorter execution latency. Each instruction result has a reservation in the register file. Since latency in RDNA is one quarter of GCN, register file entries used for temporary results effectively have one quarter the footprint (measured in bytes times cycles).
The vector result forwarding latency is worse than GCN, with RDNA needing 5 clock cycles before a dependent instruction can issue versus GCN's 4.
Depending on what the limiting factor is for getting from the point of a temp register being written and its being consumed, there would be cases where a temp generated ahead of a serial chain would live longer with RDNA than GCN.
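A toy version of the "bytes times cycles" footprint, applied to a serial dependency chain, shows why I don't think the quarter-footprint claim holds there. The only inputs are the 4-cycle and 5-cycle dependent-issue figures above; the function names and the 10-instruction chain are made up for illustration, and the model ignores everything else that can delay a consumer.

```python
VALUE_BYTES = 4  # assumed: one 32-bit temp per work-item

# Cycles before a dependent vector instruction can issue (figures from the post).
GCN_DEPENDENT_ISSUE = 4
RDNA_DEPENDENT_ISSUE = 5

def temp_footprint(value_bytes, live_cycles):
    """Footprint of one temp for one work-item, in byte-cycles."""
    return value_bytes * live_cycles

def footprint_ahead_of_chain(chain_length, dependent_issue_cycles):
    """Temp written before a serial chain and only consumed at its end.

    It stays live for the whole chain, so its lifetime scales with the
    dependent-issue latency rather than with the issue rate.
    """
    return temp_footprint(VALUE_BYTES, chain_length * dependent_issue_cycles)

CHAIN = 10  # hypothetical chain length
gcn = footprint_ahead_of_chain(CHAIN, GCN_DEPENDENT_ISSUE)
rdna = footprint_ahead_of_chain(CHAIN, RDNA_DEPENDENT_ISSUE)
print("byte-cycles per work-item:", gcn, "(GCN) vs", rdna, "(RDNA)")
```

Because this is per work-item, the wave width cancels out, and in this dependency-limited case the RDNA temp occupies its register for longer, not a quarter as long.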
Wave64 can provide register savings, within the limits of sub-vector mode. If running in Wave32 on a workload that can readily use up 64 work-items, needing 2x the wavefronts to get the same number of work-items leaves overall occupancy similar.
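As a back-of-the-envelope check on the occupancy point, here is a small sketch. It assumes 1024 Wave32-sized registers per SIMD32 and a hypothetical shader needing 64 VGPRs per work-item, 32 of which are temps internal to the sub-vector loop; it ignores allocation granularity and any hardware cap on wave slots, so treat the numbers as illustrative only.

```python
REGS_IN_FILE = 1024  # assumed Wave32-sized registers per SIMD32

def work_items_in_flight(regs_in_file, regs_per_wave, wave_width):
    """Work-items resident on one SIMD when registers are the only limit."""
    waves = regs_in_file // regs_per_wave
    return waves * wave_width

VGPRS = 64         # hypothetical per-work-item register need
SHARED_TEMPS = 32  # hypothetical temps internal to the sub-vector loop

# Wave32: each wave needs VGPRS Wave32-sized registers.
wave32 = work_items_in_flight(REGS_IN_FILE, VGPRS, 32)

# Wave64 without sub-vector savings: each register takes two Wave32-sized slots.
wave64 = work_items_in_flight(REGS_IN_FILE, 2 * VGPRS, 64)

# Wave64 with sub-vector mode: loop-internal temps share one slot across halves.
wave64_subvector = work_items_in_flight(REGS_IN_FILE, 2 * VGPRS - SHARED_TEMPS, 64)

print(wave32, wave64, wave64_subvector)
```

Plain Wave32 and plain Wave64 end up with the same number of work-items in flight, and only the shared temps in sub-vector mode move the needle.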