Pretty much all three processors must work in unison to do asynchronicity.
Please don't use the term "asynchronous" when what you really mean is simultaneous/parallel execution.
Asynchronicity is not an attribute of the hardware. It's an attribute of the API.
It only means that the order of execution is not defined implicitly by the invocation pattern, but is instead modeled explicitly through signals and fences.
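To make that concrete, here's a minimal D3D12-flavored sketch (the helper name and all the queue/list parameters are made up, error handling omitted): the compute queue is ordered after the graphics queue purely through a fence value. The two ExecuteCommandLists calls by themselves imply no ordering across queues.

```cpp
#include <d3d12.h>
#include <wrl/client.h>
using Microsoft::WRL::ComPtr;

// Hypothetical helper: make computeQueue start only after gfxQueue's
// submitted work has completed. The dependency lives entirely in the
// fence value; submission order alone guarantees nothing across queues.
void SubmitWithDependency(ID3D12Device* device,
                          ID3D12CommandQueue* gfxQueue,
                          ID3D12CommandQueue* computeQueue,
                          ID3D12CommandList* const* gfxLists, UINT numGfx,
                          ID3D12CommandList* const* computeLists, UINT numCompute)
{
    ComPtr<ID3D12Fence> fence;
    device->CreateFence(0, D3D12_FENCE_FLAG_NONE, IID_PPV_ARGS(&fence));

    gfxQueue->ExecuteCommandLists(numGfx, gfxLists);
    gfxQueue->Signal(fence.Get(), 1);        // raise fence when gfx work is done

    computeQueue->Wait(fence.Get(), 1);      // GPU-side wait, no CPU stall
    computeQueue->ExecuteCommandLists(numCompute, computeLists);
}
```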
Asynchronicity can be implemented either with cooperative scheduling on fences, or with a sufficient number of hardware monitors when opting for simultaneous execution or low-latency scheduling.
In the former case, the hardware doesn't need any support for it at all.
Even GCN "degrades" to cooperative scheduling if you exceed the number of monitored queues, though I have yet to see a legit real-life example where that actually happened...
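For the cooperative-scheduling case, here's a toy model of what a driver could do in software (all names are invented, a real driver is obviously far more involved): work is only handed to the single hardware queue once its fence dependency has completed, so the hardware never has to monitor anything.

```cpp
#include <cstdint>
#include <deque>
#include <functional>
#include <vector>

// Toy sketch of cooperative scheduling on fences: one hardware queue,
// many software queues. All names here are made up for illustration.
struct Submission {
    uint64_t waitValue;            // fence value this work depends on
    std::function<void()> work;    // stand-in for a command buffer
};

struct SoftwareScheduler {
    uint64_t fenceCompleted = 0;   // last fence value the GPU signaled
    std::vector<std::deque<Submission>> queues;

    // Called whenever the (single) hardware queue signals a fence.
    void OnFenceSignaled(uint64_t value) {
        fenceCompleted = value;
        Pump();
    }

    // Walk the software queues and submit only work whose dependency
    // is already satisfied. Everything else stays parked in software,
    // so no hardware monitors are needed at all.
    void Pump() {
        for (auto& q : queues) {
            while (!q.empty() && q.front().waitValue <= fenceCompleted) {
                q.front().work();  // hand off to the real hardware queue
                q.pop_front();
            }
        }
    }
};
```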
Well, that wasn't what I meant. I actually meant the opposite: they are not underutilized by more than 10% if things are done right, which they seem to be. And I haven't seen anything that would show that the hardware is crap, the drivers are crap, or the software is crap.
Doing things "right" isn't easy. At least if you define "right" as achieving constant utilization of all possible bottlenecks, while also keeping the working set below cache sizes and the like.
This has been said a couple of times in this and other threads. The current design of the render paths is still a straightforward evolution of the old fixed-function setup, where you treat rendering as a set of operations, each applied sequentially to the whole frame. We have yet to see a widespread move to tile-based renderers, and a departure from overly expensive full-screen-space effects.
As it stands, you just can't achieve an even/constant load on all subsystems of the GPU.