On the chips I'm talking about (i.e. almost all SoCs these days, up to ~50W or so), you are always running at max TDP when running a game. In other words, you are power limited, not hardware/area limited per se. Any power that you free up is immediately applied to boosting frequencies. Thus idle cycles are not entirely lost, as the conventional wisdom would have it; the power they would have consumed is re-purposed to run the non-idle units faster. Obviously this is a high-level description and there are various levels of efficiency in this sort of process, but the general point stands: on these chips you are optimizing for *power efficiency* to get higher performance, not strictly trying to fill every idle piece of hardware with work, because the chip cannot actually sustain that at max frequencies. And the frequency range for the GPU, depending on available TDP, is veeeery large.
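To make the "freed power turns into frequency" point concrete, here's a toy back-of-the-envelope model. It is not a real power model for any specific chip; it just assumes the usual P ≈ C·V²·f approximation with voltage scaling roughly with frequency (so dynamic power goes roughly with f³), and all the numbers are purely illustrative:

```cpp
// Toy model of a power-limited (TDP-capped) chip. Assumes dynamic power ~ f^3
// (P ~ C * V^2 * f with V scaling roughly linearly with f over the DVFS range).
// All numbers are illustrative, not measurements of any real SoC.
#include <cmath>
#include <cstdio>

int main() {
    const double tdp_watts     = 15.0;  // fixed power budget for the whole chip
    const double base_freq_ghz = 1.0;   // GPU frequency when every unit is busy
    const double idle_fraction = 0.20;  // fraction of dynamic power freed by idling units

    // Power still drawn by the busy units at base frequency, once the rest goes idle.
    const double busy_power   = tdp_watts * (1.0 - idle_fraction);
    // The busy units boost until they hit the TDP cap again: f ~ P^(1/3).
    const double boost_factor = std::cbrt(tdp_watts / busy_power);

    std::printf("everything busy: %.2f GHz at the %.1f W cap\n", base_freq_ghz, tdp_watts);
    std::printf("with %.0f%% of power freed: roughly %.2f GHz at the same cap\n",
                idle_fraction * 100.0, base_freq_ghz * boost_factor);
    return 0;
}
```

In this simplistic model, freeing 20% of the power budget buys only roughly an 8% frequency boost, which is part of why the efficiency of that re-purposing process matters.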
A lot of that is unavoidable given that the consoles are heavily biased towards decent GPUs but pretty weak CPUs, especially for throughput-intensive tasks. On PCs, the "best" and lowest-latency place to do "async compute" is often the CPU (even with a discrete card). The obvious exception is if you need texture filtering or similar hardware, which would likely apply in a lot of the GI cases.
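To illustrate the "do it on the CPU" option, here's a minimal sketch of overlapping a throughput job with GPU rendering using an ordinary worker thread. `submit_gpu_frame`, `wait_for_gpu`, `update_light_grid` and `LightGrid` are all made-up placeholders standing in for whatever your engine or graphics API actually does, not real library calls:

```cpp
#include <future>
#include <vector>

// Placeholder data and functions; substitute your engine's real types/calls.
struct LightGrid { std::vector<float> cells; };

void submit_gpu_frame() { /* record + submit the GPU rendering work (stub) */ }
void wait_for_gpu()     { /* block until the GPU frame is done (stub) */ }

// Stand-in for a throughput-heavy job (e.g. a GI/light-grid update) that does
// not need texture filtering and so can live on the CPU.
void update_light_grid(LightGrid& grid) {
    for (float& c : grid.cells) c = c * 0.5f + 0.1f;
}

void render_frame(LightGrid& grid) {
    // Kick the compute-style job to a CPU worker instead of a GPU async queue...
    auto cpu_job = std::async(std::launch::async, [&grid] { update_light_grid(grid); });

    submit_gpu_frame();   // ...and let the GPU render while it runs.

    cpu_job.wait();       // results are ready before the next frame consumes them
    wait_for_gpu();
}

int main() {
    LightGrid grid{std::vector<float>(1024, 1.0f)};
    render_frame(grid);
    return 0;
}
```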
But yeah, power questions aside, I would argue that GCN needs async compute for efficiency more than other architectures do, which isn't really a contentious statement (compare GCN's monstrous theoretical throughput numbers to its performance in practice vs. other architectures). That's neither a good nor a bad thing; it's just one design point, and my point is that conclusions drawn about how to get the best performance out of it do not necessarily apply to the same extent to other architectures.