You can easily turn this argument around. There is also some inherent (hardware) "overhead" in creating an architecture that provides high performance with fewer threads.
I think there is room for a lot of nuance here.
There's a distinction between a more workload- or algorithm-driven form of overhead and hardware overhead. The hardware is to a significant extent engineered to keep its overhead a "so what?" kind of cost: it is only permitted so many nanoseconds, so many mm², or so much energy out of a budget that scales somewhat fitfully.
The "overhead" for high performance in one thread is a known area of seriously diminishing returns.
However, the debate between "fewer" and "more" threads is much less clear cut in the middle, since these GPUs are heavily SMT through much of their processing. There are very clear downsides if you wander into "too many": caches, interconnects, memory controllers, and control can thrash or congest in ways where serialization turns out to be preferable. There are also diminishing returns on how much the parallel resources contribute to the hardware footprint (if the front end is highly parallel, there is generally a back end broad enough to support it).
The SMT analogy taken from a CPU context breaks down because so much of this is vastly parallel in the back end while the front end pipelines so deeply. Internally it becomes a question of how well two different resource types are load-balanced, and how much each one contributes to the overall scalar component, if you try to apply Amdahl's law to parallel systems that are, aside from specific points, very close in overall parallelism.
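To make that concrete with a purely illustrative set of numbers (the widths and serial fraction below are assumptions, not measurements): once both designs are this wide, Amdahl's law says the serial/scalar fraction dominates and differences in parallel width barely register.

```latex
% Amdahl's law; s = serial fraction, N = parallel width (illustrative numbers only).
S(N) = \frac{1}{\, s + \frac{1 - s}{N} \,}
\qquad
s = 0.05:\;\; S(2048) \approx 19.8,\quad S(4096) \approx 19.9,\quad \lim_{N \to \infty} S(N) = 20
```

Doubling the parallel width moves the result by about half a percent; shaving the scalar component is what actually moves it.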
So you're saying async is like a fix for already-existing problems that Nvidia doesn't have, the "bubbles". But if that's the case, why would Nvidia GPUs perform worse with async on than they do with it off?
Some of the tweets and back-and-forth over AOTS seem to indicate that there is an additional device check in the game, besides the async flag, that is effectively disabling it for Nvidia anyway.
The game is also rather variable, although the minor loss seems to be mostly consistent for whatever reason.
At least some of the 1080 results actually made it a wash or vacillated between tiny losses and gains, and one response about this indicated that Pascal might actually have architectural quirks that benefit from some of the changes in the game's behavior with async on, despite the device check.
However, the messaging on this has been inconsistent.
If there is a demerit for AOTS as a DX12 benchmark, or as an experimental tool in my opinion, it's this sort of potential non-orthogonality and the inability to really control for the factors the knobs are labeled for. Some of the confused discussion about it also makes me uncertain how the different paths are structured, and whether they are comparable. Having a flag for async that can be overridden by the software is one thing, but that it might be overridden imperfectly makes the innards seem a bit "leaky" for drawing conclusions, particularly if we find out that different vendors/chips behave unexpectedly in different ways (and they have).
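For what it's worth, here is a minimal sketch of what that kind of application-side override could look like. This is not AOTS's actual code, only an assumed pattern: the user-facing async toggle is gated by a DXGI vendor-ID check, and on the "disabled" path compute work simply goes down the same direct queue as graphics.

```cpp
// Hypothetical sketch of an app-side "async compute" toggle gated by a vendor check.
// NOT the AOTS implementation; it only illustrates how a device check could silently
// override the user-facing flag.
#include <d3d12.h>
#include <dxgi1_4.h>
#include <wrl/client.h>
using Microsoft::WRL::ComPtr;

bool IsNvidiaAdapter(IDXGIAdapter1* adapter)
{
    DXGI_ADAPTER_DESC1 desc = {};
    adapter->GetDesc1(&desc);
    return desc.VendorId == 0x10DE;          // PCI vendor ID for Nvidia
}

// Returns the queue compute work should be recorded against.
// If async is requested and not vetoed, create a dedicated compute queue so compute
// can overlap with graphics; otherwise fall back to the existing direct queue.
ComPtr<ID3D12CommandQueue> PickComputeQueue(ID3D12Device* device,
                                            ID3D12CommandQueue* directQueue,
                                            IDXGIAdapter1* adapter,
                                            bool asyncRequestedByUser)
{
    const bool vetoedByDeviceCheck = IsNvidiaAdapter(adapter);   // the "extra" check
    if (!asyncRequestedByUser || vetoedByDeviceCheck)
        return directQueue;                  // async flag effectively ignored

    D3D12_COMMAND_QUEUE_DESC desc = {};
    desc.Type = D3D12_COMMAND_LIST_TYPE_COMPUTE;
    ComPtr<ID3D12CommandQueue> computeQueue;
    if (FAILED(device->CreateCommandQueue(&desc, IID_PPV_ARGS(&computeQueue))))
        return directQueue;                  // fall back rather than fail
    return computeQueue;
}
```

The only point of the sketch is that with a structure like this, the "async on" run on a vetoed card could still go through whatever extra scheduling or synchronization the async path sets up elsewhere while never actually using a second queue, which could be one way to end up with the small, fairly consistent losses people are reporting.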