I'm not disputing that the HWS lacks dispatch, only pointing out that the hardware performing that function is interchangeable. There are four blocks as I understand it, each existing as 1 HWS, 2 ACEs, or something else entirely.
I don't believe either unit has any involvement with instruction caches beyond holding a single pointer to the kernel for reference. Instruction fetching is left to the CUs or some other hardware block.
These blocks are very simple as I understand it: a handful of registers and some basic math capabilities. They're simple microcontrollers that may be working with each other, with no direct wiring to the CUs or anything. The CUs, on the other hand, will need some direction on how to communicate with the units, as a firmware update could relocate registers. My thinking is that when a CU needs work it pings an ACE, the HWS evaluates metrics from that CU and the in-progress kernels on the ACE, then the ACE dispatches a wave based on that result, accounting for priority, age, dependencies, etc. I could be completely wrong on that, but it seems likely they are working together. The HWS looks like it was added late to extend some capability that didn't exist in early GCN iterations. For that reason it shouldn't be critical, but it would have added better async handling. GCN1 was simple round robin across queues as I understand it. Some of the console guys may have a better grasp on that original behavior, as it should be the XB1/PS4 method.
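
To make the flow I'm guessing at a bit more concrete, here's a toy Python sketch. None of this is real hardware behavior or any AMD interface: `Kernel`, `hws_pick`, `ace_dispatch` and `round_robin` are names I made up, the "higher priority first, then oldest" ordering is an assumption, and treating a kernel as finished once its last wave is dispatched is a simplification.

```python
from dataclasses import dataclass, field


@dataclass
class Kernel:
    name: str
    priority: int      # assumed: higher value is served first
    age: int           # assumed: how long the kernel has been waiting
    waves_left: int    # waves still to be dispatched
    deps: set = field(default_factory=set)  # kernels that must finish first


def hws_pick(kernels, finished):
    """Hypothetical HWS step: pick the best ready kernel, or None."""
    ready = [k for k in kernels
             if k.waves_left > 0 and k.deps <= finished]
    if not ready:
        return None
    # Assumed ordering: priority first, age as the tie-breaker.
    return max(ready, key=lambda k: (k.priority, k.age))


def ace_dispatch(kernels, finished):
    """Hypothetical ACE step: a CU asked for work, so dispatch one wave."""
    kernel = hws_pick(kernels, finished)
    if kernel is None:
        return None
    kernel.waves_left -= 1
    if kernel.waves_left == 0:
        # Simplification: dispatching the last wave counts as completion.
        finished.add(kernel.name)
    return kernel.name


def round_robin(queues):
    """Guess at GCN1 behavior: cycle across queues, skipping empty ones."""
    order = []
    while any(queues):
        for queue in queues:
            if queue:
                order.append(queue.pop(0))
    return order
```

Each call to `ace_dispatch` stands in for one CU ping. The difference from `round_robin` is that the pick depends on per-kernel state rather than on queue position alone.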