I don't remember any games where AC can be toggled showing any benefit on Maxwell and Pascal.
I think Doom showed an improvement, IIRC.
Maxwell does support Async perfectly fine, it's just that it can't dynamically load balance its Compute and Graphics queues on the fly to benefit from it.
Do you know how this has improved with Volta / Turing?
Assumption is, with the new 'fine grained scheduling' it should be good now?
As far as I understand, Maxwell needs to 'divide' the SMs before the work starts, e.g. task A gets 20% of the GPU, task B 80%. If one task finishes earlier, there is no way to utilize the SMs that became idle. (Not sure how accurate this is technically.)
GCN can dynamically balance with its ACEs, and it can even run multiple shaders on the same CU.
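To put rough numbers on the difference between a fixed up-front split and dynamic balancing, here is a toy Python model. It is only a sketch under strong assumptions: work is measured in made-up "unit-milliseconds", both tasks scale perfectly across units, and the function names and figures are hypothetical, not measurements from any real GPU.

```python
def static_split_time(work_a, work_b, units_a, total_units):
    """Time to finish both tasks when units are partitioned before the work starts.

    Units assigned to the task that finishes first stay idle (assumed model).
    """
    units_b = total_units - units_a
    return max(work_a / units_a, work_b / units_b)


def dynamic_balance_time(work_a, work_b, total_units):
    """Idealized time when idle units immediately pick up any remaining work."""
    return (work_a + work_b) / total_units


if __name__ == "__main__":
    # Hypothetical case: a 20% / 80% split that turns out to be a bad guess,
    # because both tasks actually carry the same amount of work.
    print(static_split_time(40, 40, units_a=2, total_units=10))   # 20.0
    print(dynamic_balance_time(40, 40, total_units=10))           # 8.0
```

In this model the static split is only as good as the guess about the workload ratio, while the dynamic case does not depend on any guess at all.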
"You could use 10 compute queues if you wanted to, but that won't help increase performance as the internet seems to be convinced these days, it will actually hurt performance even on GCN."
Actually, in Vulkan the number of queues is fixed, and for GCN I get one GFX+compute queue and two pure compute queues, so I can only enqueue 3 different tasks concurrently. Also, only the GFX queue offers full compute performance; the other two seem halved even if there is no GFX load, and priorities given through the API seem to be ignored. This seems to be what AMD considered practical. Very early drivers had more queues, IIRC.
I did some testing with small workloads, which is where I see the most benefit, because one task alone cannot saturate the GPU. Results on GCN were close to ideal. (Forgot to test on Maxwell.)
But there is one more application of AC that is rarely mentioned because it happens automatically:
If you enqueue multiple tasks to just one queue, but they have no dependencies on each other (no barriers in between), then GCN can and does run those tasks async. Likely also on DX11.
This is very powerful because it avoids the disadvantages of using multiple queues, which are: additional sync necessary across queues, and the need to divide command lists to feed multiple queues. (Big overhead that can cut the benefit, which hurts especially on small tasks where we need it the most.)
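The single-queue case can be illustrated with another toy Python model. It compares a queue with a barrier after every task against the same queue with no barriers, where independent small tasks may overlap. This is a sketch and not how any driver or hardware scheduler actually works: the "units" are a hypothetical stand-in for CUs, each task is described by an assumed (units needed, duration) pair, and tasks start strictly in submission order.

```python
import heapq


def serialized_time(tasks):
    """A barrier after every task: each one runs alone, so durations add up."""
    return sum(duration for _units, duration in tasks)


def overlapped_time(tasks, total_units):
    """No barriers: start tasks in submission order as soon as enough units are free."""
    now = 0.0
    free = total_units
    running = []  # min-heap of (finish_time, units_held)
    for units, duration in tasks:
        if units > total_units:
            raise ValueError("task needs more units than the GPU has")
        # Wait for earlier tasks to retire until this one fits.
        while free < units:
            finish, released = heapq.heappop(running)
            now = max(now, finish)
            free += released
        heapq.heappush(running, (now + duration, units))
        free -= units
    # The queue is drained when the last running task finishes.
    return max([finish for finish, _units in running], default=now)


if __name__ == "__main__":
    # Four small tasks, each able to fill only 2 of 8 hypothetical units.
    small_tasks = [(2, 1.0)] * 4
    print(serialized_time(small_tasks))     # 4.0
    print(overlapped_time(small_tasks, 8))  # 1.0
```

The gain comes entirely from the tasks being too small to fill the GPU on their own, which matches where I saw the most benefit in practice.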
I see some potential to improve low level APIs here. Again: We need the ability to generate commands dynamically directly on the GPU, including those damn barriers that are currently missing from DX ExecuteIndirect, VK Conditional Draw and even NV device generated command buffers.