All this does raise an interesting question: I don't think we have a very good idea of how these things are managed. Getting through multiple layers of marketing BS to the actual technical details can be incredibly tricky. My understanding of what Apple does, at least, is that they virtualize everything, even register access. So registers are fundamentally backed by system memory and can be cached closer to the SIMDs to improve performance (here is the relevant patent: CACHE CONTROL TO PRESERVE REGISTER DATA). This seems to suggest that you can in principle launch as many waves as you want; it's just that in some instances performance will suck because you'll run out of cache. I have no idea how they manage that. There could be a watchdog that monitors cache utilization and occupancy and suspends/launches waves to optimize things, or it could be a more primitive load-balancing system that operates at the driver level. It is also not at all clear what AMD does. It doesn't seem like their system is fully automatic.
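
To make the "you'll run out of cache" point a bit more concrete, here is a toy model in Python. To be clear, this is not how Apple's (or anyone's) hardware actually works. The LRU replacement policy, the round-robin wave scheduling, the idea that each wave touches all of its registers every time it runs, and all the names (`register_cache_hit_rate`, `cache_slots`, etc.) are assumptions I made up for illustration. It only shows why a memory-backed register file with a finite cache could let you launch any number of waves while still having a performance cliff.

```python
from collections import OrderedDict


def register_cache_hit_rate(num_waves, regs_per_wave, cache_slots, rounds=50):
    """Toy model (hypothetical): registers live in memory, fronted by an LRU cache.

    Waves are scheduled round-robin, and each wave touches every one of its
    registers when it runs. Returns the fraction of register accesses that
    hit the cache.
    """
    cache = OrderedDict()  # keys are (wave, register) pairs, ordered by recency
    hits = 0
    accesses = 0
    for _ in range(rounds):
        for wave in range(num_waves):
            for reg in range(regs_per_wave):
                key = (wave, reg)
                accesses += 1
                if key in cache:
                    hits += 1
                    cache.move_to_end(key)  # mark as most recently used
                else:
                    cache[key] = True  # miss: fetch from backing memory
                    if len(cache) > cache_slots:
                        cache.popitem(last=False)  # evict least recently used
    return hits / accesses


if __name__ == "__main__":
    # Assumed numbers: 32 registers per wave, 128 cache slots.
    for waves in (2, 4, 8):
        rate = register_cache_hit_rate(waves, regs_per_wave=32, cache_slots=128)
        print(f"{waves} waves: hit rate {rate:.2f}")
```

In this model, nothing stops you from launching 8 waves, but once the combined register footprint exceeds the cache, the hit rate collapses. A real scheduler (watchdog or driver-level) would presumably try to keep the number of resident waves below that point.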