I'm curious what form the documentation for the GPU architecture will take now that the goal is to be approachable as a compute architecture as well. Intel's GPU documentation has been available in very dense form in the past, but it didn't seem to catch on as a topic of discussion to the extent that the other GPUs have.
edit: This is probably wandering too far afield from Xe, and I bear much of the responsibility for that.
To continue:
NVidia's original scoreboarding was costly partly because it was handling varying operand fetch latencies, not just instruction-dependency chains (register read-after-write).
Operand fetch latencies were complicated because the register file was very loosely organised: register allocations did not map simply onto the banks (the banks-versus-register-allocation problem, "vertical and horizontal allocations"), so register fetch latencies varied. The effect was that instruction-level scoreboarding had to track many more data points than we see with contemporary GPUs, making it more of an operand scoreboard.
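To make the banking point concrete, here's a toy model of how operand fetch time can vary with where the sources happen to land. The bank count and the register-to-bank mapping are my own assumptions for illustration, not the actual G80/Fermi layout:

```python
NUM_BANKS = 4

def bank_of(reg_id):
    # Assumed simple modulo mapping of registers to banks.
    return reg_id % NUM_BANKS

def operand_fetch_cycles(src_regs):
    # Each bank can serve one read per cycle, so sources that collide in the
    # same bank have to be read on successive cycles.
    reads_per_bank = {}
    for r in src_regs:
        b = bank_of(r)
        reads_per_bank[b] = reads_per_bank.get(b, 0) + 1
    return max(reads_per_bank.values())

print(operand_fetch_cycles([0, 1, 2]))   # banks 0,1,2 -> 1 cycle
print(operand_fetch_cycles([0, 4, 8]))   # all land in bank 0 -> 3 cycles
```

With that kind of variability, the scoreboard ends up tracking per-operand progress rather than just per-register readiness.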
Another factor mentioned in passing after the transition to the explicit encoding was that Fermi's architecture had reg/mem operations, where an instruction's source operands could come from either a register or a memory location, so the more intensive tracking needed for the latter may have bled into the tracking for the former. Later ISAs adopted a load-store model that would have made this unnecessary.
Even with the explicit wait cycles, I was under the impression that variable register fetch latency could still occur with bank conflicts. I thought the explicit delay tracking mainly dispensed with tracking the execution status of instructions whose outputs would be sourced by later operations, leaving a simpler set of status flags for register IDs with a short lifetime. The hardware pipeline could also drop stages devoted to the run-time analysis of checking multiple reads against multiple sources, if the explicit tracking bits can direct references to a specific location.
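As a rough sketch of the distinction I mean, assuming a fixed ALU result latency the compiler knows up front (the instruction names, the 5-cycle figure, and the stall encoding are made up for illustration, not any vendor's actual scheme):

```python
def run_with_scoreboard(program, raw_latency=5):
    # Style 1: hardware records when each destination register becomes readable,
    # and every issue scans its sources against that table at run time.
    ready_at, cycle = {}, 0
    for instr in program:
        while any(ready_at.get(s, 0) > cycle for s in instr["srcs"]):
            cycle += 1
        print(f"cycle {cycle}: issue {instr['op']}")
        ready_at[instr["dst"]] = cycle + raw_latency
        cycle += 1

def run_with_encoded_stalls(program):
    # Style 2: the instruction carries a compiler-computed stall count, so the
    # hardware only needs a countdown, not a per-register lookup.
    cycle = 0
    for instr in program:
        print(f"cycle {cycle}: issue {instr['op']}")
        cycle += 1 + instr["stall"]

# RAW dependence through v0; stall=4 reproduces the same 5-cycle producer/consumer
# gap the scoreboard version discovers at run time.
program = [
    {"op": "v_mul v0, v1, v2", "dst": "v0", "srcs": ["v1", "v2"], "stall": 4},
    {"op": "v_add v3, v0, v4", "dst": "v3", "srcs": ["v0", "v4"], "stall": 0},
]
run_with_scoreboard(program)
run_with_encoded_stalls(program)
```

Both versions issue on the same cycles; the difference is how much state and comparison logic the issue stage has to carry to get there.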
GFX10 was flagged by LLVM as having a banked vector register file, but I haven't seen it mentioned as such by AMD's presentations.
"Interlock" in RDNA sounds simply like a new case of waitcnt.
An interlock would be capable of automatically detecting a specific dependence. AMD's method allows for a much broader time window and number of outstanding operations, particularly in light of how many wavefronts would each need their own tracking, but at the cost of precision. A non-zero counter indicates something hasn't resolved yet, but which one of potentially dozens of operations that is wouldn't be known.
Cases where the count must be zero would likely mark architecturally visible points where the lack of interlocking is critical, such as how often scalar memory operations are recommended to be used with a waitcnt of 0.
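A toy model of how I read the counter scheme (the names mirror vmcnt/s_waitcnt, but the single counter and strictly in-order completion are my simplifications):

```python
class Wave:
    def __init__(self):
        self.vmcnt = 0        # outstanding vector-memory ops: all the hardware keeps
        self.in_flight = []   # which dst registers those ops will write; shown here
                              # for illustration, the counter itself records none of this

    def issue_load(self, dst):
        self.vmcnt += 1
        self.in_flight.append(dst)

    def retire_one(self):
        # Memory returns one result; the wave just sees the counter tick down.
        self.vmcnt -= 1
        return self.in_flight.pop(0)

    def waitcnt_satisfied(self, n):
        # s_waitcnt vmcnt(n): proceed once at most n ops remain outstanding.
        # n == 0 is the blunt "everything has resolved" case; any other value
        # tells you how many are left, not which specific ones.
        return self.vmcnt <= n

w = Wave()
for dst in ("v0", "v1", "v2"):
    w.issue_load(dst)
print(w.waitcnt_satisfied(0))   # False: three loads still in flight
w.retire_one(); w.retire_one()
print(w.waitcnt_satisfied(1))   # True: one remains, its identity not tracked by the counter
```

That trade is cheap per wavefront and scales to many outstanding operations, which seems to be the point, at the cost of only ever knowing a count.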
Some of the hazards/bugs from the following show microarchitectural gaps that RDNA still has when it comes to detecting dependences, frequently at points similar to those the GCN ISA listed as needing NOPs or wait states (EXEC dependences, scalar/vector sourcing, etc.):
https://gitlab.freedesktop.org/mesa...18f4a3c8abc86814143bf/src/amd/compiler/README
One interesting mention is the instruction s_waitcnt_depctr, which does not appear in the ISA doc.
It's indicated as a fix for an unhandled WAR hazard on the EXEC mask, something a more cohesive pipeline would have physical checks for.
This may be a check on a more global internal counter of the number of instructions that have left the instruction buffer and still have outstanding operand fetches. Using it would amount to relying on it as a sort of catch-all for missing interlocks or hazard checks.
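If that guess is right, the behaviour would be something like the following. This is purely a model of my speculation, not documented semantics for s_waitcnt_depctr:

```python
class DepCtrModel:
    # Hypothetical single counter of instructions whose operand fetches are
    # still in flight, drained as a catch-all barrier.
    def __init__(self):
        self.outstanding = 0

    def issue(self):
        # An instruction leaves the instruction buffer with reads pending.
        self.outstanding += 1

    def operand_fetch_done(self):
        self.outstanding -= 1

    def wait_depctr(self):
        # Waiting for zero guarantees no earlier instruction can still read a
        # stale EXEC or register value, papering over the missing WAR interlock
        # at the cost of draining everything in flight.
        return self.outstanding == 0

d = DepCtrModel()
d.issue(); d.issue()              # two instructions with reads outstanding
print(d.wait_depctr())            # False: not yet safe to overwrite their sources
d.operand_fetch_done(); d.operand_fetch_done()
print(d.wait_depctr())            # True: safe to write EXEC / the source registers
```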
Coincidentally, I ran across that band-aid showing up on AMD's GPUOpen site, in an article discussing the porting of the PS4 game Detroit to the PC on a 5700XT:
https://gpuopen.com/learn/porting-detroit-2/.
It also shows the compiler using s_inst_prefetch, which was noted as being capable of hanging shaders and wasn't documented in the original release of the RDNA ISA doc. It has apparently existed officially at least since the Feb 2020 version.
When the RDNA shader compiler emits a clause, the implication is that a stall caused by a register read-after-write must not be filled by switching threads. If a clause is not used, the GPU can switch threads to fill the stall.
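A minimal sketch of that scheduling implication, with made-up stall counts and a two-wave scheduler that doesn't pretend to be RDNA's actual arbitration:

```python
def issue_timeline(stall_cycles, in_clause):
    # Which wave gets the issue slot each cycle, for a producer/consumer pair
    # in wave0 with a RAW stall between them.
    timeline = ["wave0"]                       # producer issues
    for _ in range(stall_cycles):
        # Inside a clause wave0 keeps the slot and the cycles go idle;
        # otherwise the scheduler switches to another wave to fill the gap.
        timeline.append("idle" if in_clause else "wave1")
    timeline.append("wave0")                   # consumer issues
    return timeline

print(issue_timeline(3, in_clause=True))    # ['wave0', 'idle', 'idle', 'idle', 'wave0']
print(issue_timeline(3, in_clause=False))   # ['wave0', 'wave1', 'wave1', 'wave1', 'wave0']
```

The clause buys back-to-back access to a resource for one wave at the price of leaving those bubbles unfilled.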
Clause terminology showed up in Vega, briefly, in relation to scalar memory ops. The other AMD reference would be its VLIW scheduling days.
The weaker implicit version in Vega (now called groups?) suggests that a major motivation for bringing back an optional form of the VLIW-era method is to get better behavior from the memory pipeline. It may also feed into the introduction of the ordered memory mode. I'm less clear on the benefit of the VALU clauses, but perhaps there is some wavefront bring-up that is ALU-heavy and worth optimizing for at the expense of SIMD utilization.