Except for the cases described in table 4.2, when no non-dependent instructions are available.
Most of these scenarios would not work with an S_WAITCNT, because it relies on counters that are incremented when an instruction of a given type is issued and decremented when it completes. The source instructions of these side effects, or of the crossovers from the SIMD to the scalar domain, would already have completed as far as data dependence goes.
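To make that concrete, here is a toy model of a single issue/complete counter. The class and function names are made up for illustration and this is only my reading of the counter behavior, not a description of the real hardware. It shows that once the producing instruction has completed, a wait on the counter falls straight through, so it cannot cover a hazard that outlives completion.

```python
class WaitCounter:
    """Toy issue/complete counter (hypothetical model, not real hardware)."""

    def __init__(self):
        self.outstanding = 0

    def issue(self):
        # Counter goes up when an instruction of this type is issued.
        self.outstanding += 1

    def complete(self):
        # Counter goes down when that instruction completes.
        self.outstanding -= 1

    def wait_satisfied(self, threshold):
        # A wait passes once the outstanding count is at or below the threshold.
        return self.outstanding <= threshold


def wait_blocks_while_in_flight():
    cnt = WaitCounter()
    cnt.issue()
    # Producer still in flight: a wait for 0 would stall here.
    return not cnt.wait_satisfied(0)


def wait_blocks_after_completion():
    cnt = WaitCounter()
    cnt.issue()
    cnt.complete()
    # Producer has completed, so the counter is back at 0 and the wait
    # passes immediately, even if some side effect still needs wait states.
    return not cnt.wait_satisfied(0)
```

In this model `wait_blocks_while_in_flight()` is true and `wait_blocks_after_completion()` is false, which is the gap the NOPs or independent instructions have to fill.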
NOPs (or non-dependent instructions) avoid a thread switch. Issuing an S_WAITCNT on VALU_CNT would imply a thread switch once the non-dependent instructions run out. That sounds like it should be "costly", yet the same argument could apply to an S_WAITCNT on LGKM_CNT when fetching data from LDS, since those fetches should generally be very low latency (similar in magnitude to the values shown in table 4.2, at least for a single LDS instruction).
The NOPs or independent instructions in table 4.2 probably prevent undefined behavior. The ISA document is stating that the hardware won't catch these dependences, so it's going to blindly issue even if the end results make no sense. The wait counters wouldn't catch these because the originating instructions have completed.
The LDS is a common resource between wavefronts, and each operation can vary in duration based on bank conflicts. One common thread for these wait states is that they are used in cases where the CU's control logic has deferred some authority to outside scheduling and arbitration, be it the LDS, the GDS, the graphics pipeline at the other end of the export bus, the scalar cache controller, or the vector memory pipeline.
The SIMDs seem to be simpler and have little self-management capability.
There are several awkward points about SI that I'm curious about. One is that S_WAITCNT is severely constrained if you want to use scalar memory reads. These, of all the operations, can complete out of order. It doesn't sound like the architecture will stop you from using a different wait value, but who knows what it'll do.
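A small sketch of why out-of-order completion constrains the wait value. This is a toy model with hypothetical names, assuming the wait passes as soon as the outstanding count drops to the threshold. With in-order completion, a nonzero threshold tells you the oldest operations are done. With out-of-order completion, it only tells you how many are done, not which ones.

```python
def completed_when_wait_passes(issue_order, completion_order, threshold):
    """Return the ops that have finished at the moment a wait for
    `threshold` outstanding ops would pass (toy model, not real hardware)."""
    outstanding = set(issue_order)
    done = []
    for op in completion_order:
        if len(outstanding) <= threshold:
            break  # the wait is satisfied; execution continues from here
        outstanding.remove(op)
        done.append(op)
    return done


issued = ["A", "B", "C"]

# In-order return: waiting for <= 2 outstanding guarantees A (the oldest) is back.
in_order = completed_when_wait_passes(issued, ["A", "B", "C"], 2)

# Out-of-order return: the same wait passes, but it was C that came back, not A.
out_of_order = completed_when_wait_passes(issued, ["C", "A", "B"], 2)

# Waiting for 0 is the only value that is safe regardless of return order.
wait_zero = completed_when_wait_passes(issued, ["C", "A", "B"], 0)
```

Here `in_order` contains only A while `out_of_order` contains only C, so code that reads A's destination after the second wait would be reading data that has not arrived.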
Another corner case is VALU instructions with an LDS operand. How are these being handled, and what implications are there for sourcing from something that is not fixed latency?
Is this one reason why there isn't a wait count available for VALU instructions?
And that still leaves the mystery of why this counter is named/sized but nothing is said about its use. (Hardware bug for this particular scenario?)
It's effectively constrained to 1 or 0 for now. The way the architecture acts, an S_WAITCNT for VALU_CNT would only be allowed a value of 0.
This might be a case of something going on behind the scenes that they can't fully hide. It could be "wiggle room" left over from when they came to a crossroads in the design and didn't finalize the decision until later, or it could be "room to grow".
Another place where AMD may have been vacillating is the HW_ID table. The SIMD and CU identifier bits are separated by two reserved bits.
Does this mean they are reserving the possibility of increasing the SIMDs per CU, the CUs per array, or some combination of the two?
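For what the gap would allow, here is a rough sketch of decoding the two fields. The bit positions and widths below are my assumption, chosen only to match the description of two reserved bits sitting between the SIMD and CU identifiers; check the HW_ID table in the ISA document for the actual layout.

```python
# ASSUMED layout (placeholder values, not taken from the ISA document):
SIMD_SHIFT, SIMD_BITS = 4, 2          # SIMD identifier
RESERVED_SHIFT, RESERVED_BITS = 6, 2  # the two reserved bits in between
CU_SHIFT, CU_BITS = 8, 4              # CU identifier


def field(value, shift, bits):
    """Extract `bits` bits of `value` starting at bit `shift`."""
    return (value >> shift) & ((1 << bits) - 1)


def decode_hw_id(hw_id):
    """Split a raw HW_ID value into the fields discussed above."""
    return {
        "simd_id": field(hw_id, SIMD_SHIFT, SIMD_BITS),
        "reserved": field(hw_id, RESERVED_SHIFT, RESERVED_BITS),
        "cu_id": field(hw_id, CU_SHIFT, CU_BITS),
    }


# Example value: CU 5, SIMD 3, reserved bits clear.
example = decode_hw_id((5 << CU_SHIFT) | (3 << SIMD_SHIFT))

# If the reserved bits were handed entirely to one field, its range would grow to:
max_simds_if_widened = 1 << (SIMD_BITS + RESERVED_BITS)
max_cus_if_widened = 1 << (CU_BITS + RESERVED_BITS)
```

Under these assumed widths, giving both reserved bits to the SIMD field would quadruple the addressable SIMDs per CU, and giving them to the CU field would do the same for CUs, which is why the gap reads like deliberate headroom.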
One thing that could come up in the future is if the SIMDs become more capable of self-management, which might necessitate adding a wait count.
Vector multi-issue or the introduction of 16-wide instructions for closer alignment with future CPU extensions might make the count more important.