The old "CUDA has access to command queues which should be exposed as compute queues in DX12 rather than doing everything on the GPC" complaint still appears to be valid, though. I've not seen any indication that they've fixed this yet.
Fair enough. Hopefully that capability extends to DX12 as well.
Nobody said that it has to be the most efficient or fastest implementation in existence. Similarly, nobody said that enabling async compute has to be faster than not enabling it: if a particular implementation is such that it can't find inefficiencies to exploit, then so be it.
Yeah, there seems to be a general sentiment that async is some sort of silver bullet that provides benefits in every situation. It's just another tool that should only be applied when needed. It's very likely that optimizations done with GCN in mind will not work as well on NVIDIA's hardware, and vice versa.