Huh, curious that this approach ended up being efficient at all. I mean, this implies that the M3 is backing the register file with a set-associative cache rather than a simple adder + mux. There's definitely a trade-off there between better occupancy and increased complexity in by far the hottest path, considering that you usually need multiple registers per instruction and thread. I can't see that paying off in terms of transistors per cache line.
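To make the complexity concern concrete, here's a toy sketch (entirely illustrative; the sizes, names, and organization are made up, not Apple's actual design) contrasting the lookup work in a directly indexed, banked register file with a set-associative operand cache. The banked path is just an address split; the associative path needs a tag comparator per way, per operand, per thread:

```python
BANKS = 4
REGS_PER_BANK = 64

def banked_lookup(storage, reg):
    """Direct indexing: bank select + offset, essentially an adder + mux."""
    bank = reg % BANKS
    offset = reg // BANKS
    return storage[bank][offset]

WAYS = 4
SETS = 16

def associative_lookup(cache, reg):
    """Set-associative: index a set, then compare the tag in every way.
    Each access pays up to WAYS tag compares per operand, and a typical
    instruction reads several operands per thread."""
    s = reg % SETS
    for tag, value in cache[s]:        # one comparator per way
        if tag == reg:
            return value
    return None                        # miss: data must be fetched first

# usage
storage = [[0] * REGS_PER_BANK for _ in range(BANKS)]
storage[2][1] = 42                     # register 6 -> bank 2, offset 1
assert banked_lookup(storage, 6) == 42

cache = [[] for _ in range(SETS)]
cache[6 % SETS].append((6, 42))
assert associative_lookup(cache, 6) == 42
assert associative_lookup(cache, 7) is None
```

In hardware the comparators run in parallel rather than in a loop, but the transistor cost per cache line is exactly those extra tags and comparators.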
Is an associative cache really that much more expensive than a traditional banked register file? My understanding is that they still have operand caches for recently used registers. The rest seems functionally similar to the usual operand collectors, and is likely done at the pre-scheduling stage. There are enough waves in flight to hide cache access latency, and each instruction can be delayed until its data has been collected. Cache misses can probably be handled efficiently as well: stall the affected instructions until you fall below a certain occupancy threshold, then invoke an interrupt that shuffles the data around. Sure, it won't be fast, but still likely much better than a round trip to the thread scheduler.
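The stall-then-shuffle policy I'm picturing could be sketched like this (pure speculation on my part, not a documented mechanism; the threshold value and all names are invented for illustration):

```python
OCCUPANCY_THRESHOLD = 0.5  # arbitrary value for the sketch

def schedule(waves, cache):
    """Return the waves whose operands are all cached; service misses lazily.
    Waves with a missing operand simply stall, and only once the runnable
    fraction drops below the threshold do we pay for the bulk data shuffle
    (the 'interrupt' above), instead of handling every miss eagerly."""
    runnable = [w for w in waves if all(r in cache for r in w["operands"])]
    stalled = [w for w in waves if w not in runnable]
    if waves and len(runnable) / len(waves) < OCCUPANCY_THRESHOLD:
        for w in stalled:
            cache.update(w["operands"])   # slow bulk fill, done rarely
    return runnable

# usage
cache = {0, 1, 2}
waves = [{"id": 0, "operands": [0, 1]},
         {"id": 1, "operands": [2, 9]},
         {"id": 2, "operands": [8, 7]}]
first = schedule(waves, cache)    # only wave 0 runs; occupancy 1/3 triggers a fill
second = schedule(waves, cache)   # after the fill, all three waves are runnable
assert [w["id"] for w in first] == [0]
assert len(second) == 3
```

The point of the threshold is amortization: as long as enough waves are runnable to keep the ALUs busy, misses cost nothing, and the expensive shuffle only happens when occupancy actually sags.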