Instead you could have a CU-wide load buffer that keeps the loaded data until it is copied to a VGPR just before its first use. Data would be copied from this buffer to the target VGPR when the s_waitcnt is satisfied. The compiler could then use that VGPR for other purposes instead of leaving it as a dummy storage slot, which would practically increase the usable register count, since average register lifetime would be much shorter: there would be no need to extend a register's lifetime just to hide memory latency. This would actually allow more latency hiding, as the compiler could be more aggressive in moving loads away from their uses. Such a load buffer wouldn't need to be as fast as registers, because a data load (and the s_waitcnt before reading it) is a much less frequent action than addressing a register.
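To make the idea concrete, here's a toy Python model of the proposed scheme. Everything here (the `LoadBuffer` class, load IDs, the `targets` map) is illustrative, not real hardware state: loads park their results in a CU-wide buffer, and the destination VGPR is only written when the s_waitcnt retires them, so the compiler could reuse that VGPR while the load is in flight.

```python
# Hypothetical sketch: a load deposits its result into a CU-wide buffer,
# and the destination VGPR is only claimed when s_waitcnt retires the load.
# All names are illustrative, not actual hardware state.

class LoadBuffer:
    def __init__(self):
        self.slots = {}          # load id -> returned data
        self.pending = []        # loads issued but not yet waited on

    def issue_load(self, load_id, data):
        # Memory returns data into the buffer, not into a VGPR.
        self.slots[load_id] = data
        self.pending.append(load_id)

    def s_waitcnt(self, vgprs, targets):
        # Only now are the VGPRs written; until this point the compiler
        # could have reused them for other values.
        for load_id in self.pending:
            vgprs[targets[load_id]] = self.slots.pop(load_id)
        self.pending.clear()

vgprs = {}
buf = LoadBuffer()
buf.issue_load(0, 0xCAFE)       # v5 is not yet occupied by the load result
vgprs[5] = 123                  # compiler reuses v5 while the load is in flight
buf.issue_load(1, 0xBEEF)
buf.s_waitcnt(vgprs, {0: 5, 1: 6})  # results land in v5 and v6 only now
```

The point of the sketch is just the lifetime: `vgprs[5]` holds a live unrelated value between issue and waitcnt, which is exactly the slot the current scheme wastes.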
This would change where VMCNT is decremented, since it would have to be tracked in the memory pipeline, which may add complexity I'll delve into later.
I think there's already some amount of buffering just to get the data from a 64-wide scatter/gather coalesced and moved into the register file.
How deep do you think this CU-wide buffer would need to be?
I suppose the worst case within one wavefront is firing off 15 vector memory ops to max out VMCNT, and then setting a waitcnt of 0. The ISA doc does say that execution can continue once the outstanding count is at or below the value given to s_waitcnt, but on the other hand issuing a 16th operation would exceed the currently documented representation for that counter.
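That counting behavior can be sketched as a toy counter model, assuming the 4-bit VMCNT field described in the older GCN ISA docs (so at most 15 outstanding vector memory ops are representable):

```python
# Toy model of per-wavefront VMCNT bookkeeping. The 4-bit field width is
# an assumption from older GCN ISA docs; cycle-accurate behavior is not
# modeled, only the counting rules.

VMCNT_MAX = 15

class WavefrontCounter:
    def __init__(self):
        self.vmcnt = 0

    def issue_vmem(self):
        # A 16th outstanding op can't be represented in a 4-bit counter.
        if self.vmcnt >= VMCNT_MAX:
            raise RuntimeError("exceeds the documented VMCNT encoding")
        self.vmcnt += 1

    def load_returned(self):
        self.vmcnt -= 1

    def s_waitcnt(self, threshold):
        # Execution may continue once the outstanding count is at or
        # below the encoded threshold.
        return self.vmcnt <= threshold

wf = WavefrontCounter()
for _ in range(15):
    wf.issue_vmem()          # 15 ops in flight: the documented maximum
assert not wf.s_waitcnt(0)   # waitcnt 0 must stall here
for _ in range(15):
    wf.load_returned()
assert wf.s_waitcnt(0)       # all returned, execution continues
```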
Potentially another reason for trying to move data into the register file sooner is to reduce the burden on the vector memory pipeline. If the waitcnt value hitting 0 became the point where the program could continue, the vector memory path might be on the hook for up to 40*14*64*4 bytes of data (40 wavefronts, each with 14 returned-but-unretired loads of 64 lanes × 4 bytes) before one more returning load could get one of the wavefronts to VMCNT=0 and data could start moving.
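The back-of-the-envelope arithmetic, with all factors taken from the scenario above rather than from any specific chip:

```python
# Worst-case buffered data: every wavefront parked on s_waitcnt vmcnt(0)
# with 14 of its 15 loads already returned, none able to drain.

wavefronts = 40   # wavefronts per CU (assumed GCN-style occupancy limit)
buffered   = 14   # returned loads per wavefront that can't yet retire
lanes      = 64   # lanes per wavefront
bytes_lane = 4    # one dword register per lane

total_bytes = wavefronts * buffered * lanes * bytes_lane
print(total_bytes, total_bytes // 1024)   # 143360 bytes, i.e. 140 KiB
```

That's on the order of the CU's register file itself, which is the burden being pointed at.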
Anything less, and there might be deadlock where no wavefront can buffer enough loads to satisfy its waitcnt (ed: assuming back-pressure throttles load issue), and the deadlock threshold could be even lower if something were done like N loads, *foo, workgroup_barrier, wait_cnt=0.
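A minimal sketch of that progress condition, under the stated assumption that the buffer only drains once some wavefront's waitcnt is satisfied and that back-pressure stops further returns when the buffer is full (the function and its parameters are hypothetical):

```python
# If the buffer can't hold enough returned loads for at least one
# wavefront to reach its waitcnt threshold, and back-pressure blocks
# further returns, nothing ever drains: deadlock.

def can_make_progress(capacity_loads, outstanding_per_wf, waitcnt_threshold):
    # One wavefront needs (outstanding - threshold) of its loads buffered
    # before its s_waitcnt releases and its buffer slots free up.
    needed = outstanding_per_wf - waitcnt_threshold
    return capacity_loads >= needed

# 15 outstanding loads, waiting for vmcnt(0): 15 slots suffice, 14 do not.
assert can_make_progress(capacity_loads=15, outstanding_per_wf=15, waitcnt_threshold=0)
assert not can_make_progress(capacity_loads=14, outstanding_per_wf=15, waitcnt_threshold=0)
```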
Moving data into the register file as soon as one row can be filled would allow the memory pipeline to vary in its capability, perhaps down to one register's worth of data at a time, without worrying about forward progress.
In terms of sequencing, a vector load is potentially updating information in the vector, scalar, and memory domains. With the current method, there's probably already a queue of some kind that allows the SIMD to start loading data into the register file and then decrement VMCNT. If s_waitcnt became the point where this starts happening, it might for one thing complicate the process, because now there's an implicit wait inside s_waitcnt itself: the count would reach 0 based on what happened in the memory pipeline, and a SIMD would need to detect this and then arbitrate access to load/forward the data from another domain. The pipeline logic that works for forwarding with the hard-wired latency of the register file would not have lead time for when VMCNT decrements, which might mean an additional stall.
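A rough timeline comparison of the two orderings, with made-up cycle counts just to show where the extra stall would come from (both functions and both latency constants are assumptions, not measured behavior):

```python
# Made-up cycle numbers to illustrate the ordering difference, not real
# latencies. "Current": data is written to the register file as it
# returns, VMCNT decrements afterward. "Buffered": VMCNT decrements in
# the memory pipeline, and the copy into the VGPR only starts once
# s_waitcnt observes the count, adding copy/arbitration delay.

RETURN_CYCLE = 100   # cycle the load data arrives at the CU (assumed)
COPY_LATENCY = 4     # assumed cost to move buffer contents into the RF

def current_scheme():
    data_in_rf = RETURN_CYCLE            # written into the RF on return
    vmcnt_zero = data_in_rf + 1          # decrement trails the write
    return max(data_in_rf, vmcnt_zero)   # dependent op can issue here

def buffered_scheme():
    vmcnt_zero = RETURN_CYCLE                # decrement in the memory pipe
    data_in_rf = vmcnt_zero + COPY_LATENCY   # copy starts at s_waitcnt
    return data_in_rf

assert buffered_scheme() > current_scheme()  # the copy shows up as a stall
```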
In other scenarios, there might be a new class of manual NOP/wait states for s_waitcnt, since it would be initiating the update of the register file. Unlike other wait states on vector registers, there would be a more universal need for vector NOPs in front of a waitcnt.