I disagree. These CNT registers are "watched" by the top-level logic that controls thread arbitration.
I guess I'm trying to follow AMD's choice of terminology: completion is when the data an instruction is meant to write back has actually been written back and the counter is decremented.
The issuing instruction (that increments CNT) always completes immediately. The text talks about "completion" in the sense of data being returned, not in terms of thread state. The thread will be switched-out upon S_WAITCNT.
Writeback would be the last stage prior to instruction completion, unless there's an exception detect stage for something like ECC failures.
I am somewhat unclear on what you mean by "switched out". The ten hardware threads per SIMD don't switch out in the sense the phrase carries for standard CPU cores and thread context switches. The wavefront is simply flagged as not ready when the SIMD's issue cycle comes back around. It is still in the group of 10, with hopefully up to 9 others ready to issue in its stead.
But S_WAITCNT implies sleeping the thread until the counter drops to (or below) the value specified, so I don't see how your final sentence would apply.
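For what it's worth, here's a toy sketch of how I picture the mechanism (the slot count, counter name, and round-robin order are just assumptions for illustration, not a description of the actual hardware):

```python
# Toy model of the counters and arbitration as I understand them.
# A wave blocked on S_WAITCNT isn't context-switched anywhere; it is
# simply flagged not-ready and the issue logic skips over it.

class Wavefront:
    def __init__(self, wave_id):
        self.wave_id = wave_id
        self.lgkm_cnt = 0          # outstanding LDS/GDS/constant/message ops
        self.waitcnt_limit = None  # set by S_WAITCNT, cleared once satisfied

    def issue_lds_read(self):
        self.lgkm_cnt += 1         # counter incremented at issue

    def lds_data_returned(self):
        self.lgkm_cnt -= 1         # decremented once the data is written back

    def s_waitcnt(self, lgkmcnt):
        self.waitcnt_limit = lgkmcnt

    def ready(self):
        if self.waitcnt_limit is None:
            return True
        if self.lgkm_cnt <= self.waitcnt_limit:
            self.waitcnt_limit = None  # wait satisfied, wave is selectable again
            return True
        return False                   # still resident, just not selectable


def pick_next_wave(waves, start):
    """Round-robin arbiter that skips waves flagged as not ready."""
    for i in range(len(waves)):
        wave = waves[(start + i) % len(waves)]
        if wave.ready():
            return wave
    return None  # nothing ready this cycle


waves = [Wavefront(i) for i in range(10)]  # 10 wave slots per SIMD
waves[0].issue_lds_read()
waves[0].s_waitcnt(lgkmcnt=0)              # wave 0 now blocks on its counter...
print(pick_next_wave(waves, 0).wave_id)    # ...so the arbiter issues from wave 1
waves[0].lds_data_returned()
print(pick_next_wave(waves, 0).wave_id)    # counter back to 0, wave 0 issues again
```

The point being that a wave blocked on S_WAITCNT never leaves its slot; the arbiter just passes over it until the counter satisfies the wait.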
None of the counters other than VALU_CNT, if it were being used, would have bearing on the scenarios in table 4.2.
Even if they were used, these scenarios involve instructions that would be considered complete; it's just that a few side effects, like flag changes, don't propagate as quickly as result forwarding.
Presumably you're referring to bank conflicts during LDS access. LGKM_CNT is only decremented once the variable latency caused by the bank conflict has fully passed.
I'm under the impression that LGKM_CNT is not global to the entire CU.
Rather, this is tracking the issued instructions for a given wavefront, and that wavefront can't account for what the other 39 wavefronts in the CU might have in flight for the LDS at the time.
The LDS-direct read is also not listed as one of the things tracked by LGKM_CNT, which may make S_WAITCNT as restrictive for LDS, and for any LDS-direct reads feeding a vector instruction, as it is for scalar cache reads.
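To make the per-wavefront point concrete, a tiny follow-on to the sketch above (the 40-wave figure is just the CU's wave capacity; the rest is made up for illustration):

```python
# Each wave slot keeps its own LGKM counter; S_WAITCNT only consults the
# counter of the wave that executes it, not anything CU-wide.

lgkm_cnt = {wave: 0 for wave in range(40)}  # one counter per wave on the CU

lgkm_cnt[7] += 3      # wave 7 has three LDS/constant ops in flight
# Wave 12 executes s_waitcnt lgkmcnt(0): it checks only its own counter,
# so it is satisfied immediately despite wave 7's outstanding LDS traffic.
assert lgkm_cnt[12] == 0
```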
I dare say a good way to think of these S_WAITCNT instructions is that they are protecting the VGPRs. That's all they do: ensure that VGPR contents are always coherent. With that model in mind, it becomes clear that table 4.2 consists entirely of non-VGPR operands for shader instructions and expresses dependency constraints that don't relate to VGPRs.
If VGPRs were all that mattered, S_WAITCNT wouldn't be so restrictive with scalar memory reads.
Table 4.2 contains a bunch of corner cases where side effects and hardware state updates fall outside the regular data forwarding paths. The rest of the cases rely on the 4-cycle gap between instruction issues to let results propagate.
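Roughly the arithmetic I have in mind for the forwarding cases, with both figures assumed for the sake of illustration rather than quoted from the manual:

```python
# If a wave can issue at most once every 4 cycles and a plain VALU result is
# written back within that same window, the next dependent instruction from
# the same wave never sees a stale VGPR without any explicit wait.

ISSUE_INTERVAL = 4   # minimum cycles between issues from one wave (assumed)
VALU_LATENCY = 4     # cycles from issue to result writeback (assumed)

def needs_explicit_wait(producer_issue, consumer_issue):
    return consumer_issue < producer_issue + VALU_LATENCY

print(needs_explicit_wait(0, ISSUE_INTERVAL))  # False: the result has arrived
```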
I still can't think what VALU_CNT could be used for, though.
With the current ratios in the architecture, possibly very little.
I don't think it means anything. Bits 30/31 are also not mentioned. This is just stuff documented for debugging as far as I can tell.
It's two reserved bits right in the middle of the state encoding. The relationship of each field to the ones above and below is pretty well established.
If during the SI design stage they hadn't yet decided on the maximum number of SIMDs, they probably had to allocate bits in advance for a product that hadn't been finalized.
Hopefully AMD will work out how to get past a mere two shader engines and to resolve the coherency problems that multiple shader engines create. If NVidia can tackle that problem there's no reason why AMD can't.
There's "can't" and then there's "won't", however I am hopeful that something at that level is going to evolve more than we've seen over the last several generations.
Another GPU generation will have to justify its existence on 28nm, with Tahiti already gobbling up much of the TDP room before a successor has the chance to improve on it.
Adding a few more CUs might not cut it.