Maybe someone with more knowledge of Nvidia hardware should answer your question. I am just pointing out potential reasons based on the information available.

Does that mean the async efficiency is related to the MIMD "width" of the whole device? It could be one of the reasons Nvidia chopped down the multiprocessors in Pascal, aside from relieving the register pressure.
If you assume that a single Pascal SM cannot run mixed graphics + compute, then splitting the MPs should improve the granularity at which the GPU can be divided between the two workloads. Compute and graphics might also share some higher-level (more global) resources. Nvidia has quite sophisticated load balancing in their geometry processing. Distributed geometry data needs to be stored somewhere (the SM L1, at least, is partially pinned for graphics work; see this presentation: http://on-demand.gputechconf.com/gtc/2016/video/S6138.html). Also, Nvidia doesn't have separate ROP caches (AMD still does). Some portion of their L2 needs to serve ROPs when rendering graphics. This might be transparent (just another client of the cache) or might be statically pinned based on the GPU state. I don't know.
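To make the granularity point concrete, here is a toy sketch in Python. It only models the assumption stated above (each SM runs either graphics or compute, never both). The function name `best_split` and the lane counts are hypothetical and chosen for illustration; they are not a description of how any real GPU schedules work.

```python
def best_split(total_lanes, lanes_per_sm, compute_share):
    """Closest whole-SM split for a desired compute share.

    Assumption (from the post, not a documented fact): an SM is
    assigned entirely to graphics or entirely to compute.
    Returns (number of compute SMs, absolute error in achieved share).
    """
    num_sms = total_lanes // lanes_per_sm
    # Try every possible count of compute SMs and keep the closest one.
    compute_sms = min(
        range(num_sms + 1),
        key=lambda k: abs(k / num_sms - compute_share),
    )
    error = abs(compute_sms / num_sms - compute_share)
    return compute_sms, error


# Hypothetical device with 2560 lanes that wants 12.5% of them on compute.
wide_sms, wide_err = best_split(2560, 128, 0.125)      # 20 large SMs
narrow_sms, narrow_err = best_split(2560, 64, 0.125)   # 40 smaller SMs
print(wide_sms, wide_err)
print(narrow_sms, narrow_err)
```

With 20 large SMs the closest achievable shares are 10% or 15%, so the split misses the target by about 2.5 percentage points. With 40 smaller SMs the same target is hit exactly (5 of 40). That is all "improved granularity" means here; whether the hardware actually behaves this way is the open question.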