Can you give us a few ideas of which fixed-function elements of the GPU might be holding back CU effectiveness? And whether they scale up in bigger GPUs, or would pose an even greater limitation in those cases?
The fixed-function portions are decoupled from the CUs, and usually go through some sort of specialized bus or memory to interface with them.
From the CU array's point of view, they are just another client or a resource to negotiate with, and how much work they can do with that finite capacity is a limit. The low-level details of how much they can send/receive are not discussed much.
One known value is the number of ACEs versus graphics command processors. Each front end can negotiate for one new wavefront per cycle, so at least in that regard there's a difference in raw numbers. Without prioritization, the compute side could negotiate at a higher aggregate rate than the command processor. The command processor could be a limit, although in practice it doesn't seem to be much of one ahead of all the other bottlenecks.
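The raw-numbers asymmetry can be sketched with a toy model. The counts here are purely illustrative (later GCN parts shipped with eight ACEs alongside a single graphics command processor, but the exact figures vary by chip), and the one-launch-per-cycle rate is the assumption stated above, not a measured value.

```python
# Toy model of front-end wavefront launch negotiation.
# Assumption (illustrative): 1 graphics command processor, 8 ACEs,
# each front end able to negotiate one new wavefront per cycle.

GFX_COMMAND_PROCESSORS = 1
ACES = 8
LAUNCHES_PER_FRONT_END_PER_CYCLE = 1

def peak_launch_rate(front_ends, per_cycle=LAUNCHES_PER_FRONT_END_PER_CYCLE):
    """Upper bound on wavefronts negotiated per cycle by a set of front ends."""
    return front_ends * per_cycle

graphics_rate = peak_launch_rate(GFX_COMMAND_PROCESSORS)
compute_rate = peak_launch_rate(ACES)

# Without prioritization, the compute front ends could in aggregate
# negotiate launches at several times the graphics rate.
print(compute_rate / graphics_rate)  # → 8.0
```

The point of the model is only that the compute side's aggregate negotiation rate scales with the number of ACEs, while the graphics path is gated by its lone command processor.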
The ROPs can still be a limit, which is why there are preferred buffer formats and tiling considerations. There's an export bus from the CUs to the ROPs, so there is contention both on the bus and in the buffering stages. A wavefront will stall at the export phase if the ROPs are fully committed or the bus is in use.
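The stall behavior is just serialization on a shared resource. A minimal sketch, assuming a single shared export path that accepts one wavefront's export per cycle (an illustrative simplification, not the real bus width):

```python
# Toy timeline of export contention: wavefronts become ready to export
# at some cycle, but a shared export path only accepts one per cycle.
def simulate_exports(ready_cycles, bus_busy_until=0):
    """ready_cycles: cycle at which each wavefront reaches its export phase.
    Returns (start_cycle, stall_cycles) per wavefront, in service order."""
    results = []
    for ready in sorted(ready_cycles):
        start = max(ready, bus_busy_until)   # wait while the path is in use
        results.append((start, start - ready))
        bus_busy_until = start + 1           # path occupied for one cycle
    return results

# Three wavefronts all ready at cycle 10: the second and third stall.
print(simulate_exports([10, 10, 10]))  # → [(10, 0), (11, 1), (12, 2)]
```

Same idea at hardware scale: the more wavefronts finish pixel work in the same window, the more of them sit idle at export rather than doing ALU work.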
The global data share is heavily relied upon for wavefront launch, and conflicts there were profiled as one reason why the graphics portion couldn't always utilize the ALUs in earlier architectures.
The caveat there is that this was done for pre-GCN GPUs and I've not seen a similar article since.
The counter-caveat is that AMD hasn't done that much to change that portion of the GPU or its glass jaws, so it's quite possible similar problems persist.
However, in that case, the GDS is used by the compute front ends as well for wavefront launch, so problems may hit both sides of the divide.
Primitive setup and rasterizer output are a limit, and AMD has not done as much as the competition to improve them for generations.
Tessellation or geometry handling seems to be a sore point. Rendered triangle output during tessellation for AMD GPUs going all the way up to Hawaii is a pretty serious point of non-scaling.
It's probably not a coincidence that Cerny cited the use of compute shaders to do a better job of culling geometry as an example of the PS4's architectural features.
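The core of that kind of compute-based culling is a cheap per-triangle test run in a data-parallel pass, so backfacing and degenerate triangles never reach the fixed-function setup/rasterizer bottleneck. A minimal sketch of the test itself (plain Python standing in for a compute shader; the names and the counter-clockwise front-face convention are illustrative):

```python
# Per-triangle culling test as it might run in a compute pre-pass:
# reject backfacing and zero-area triangles before primitive setup.

def signed_area(tri):
    """tri: three (x, y) projected vertices. Positive for counter-clockwise
    winding, negative for clockwise, zero for degenerate triangles."""
    (x0, y0), (x1, y1), (x2, y2) = tri
    return (x1 - x0) * (y2 - y0) - (y1 - y0) * (x2 - x0)

def cull(triangles):
    """Keep only front-facing, non-degenerate triangles (CCW convention)."""
    return [t for t in triangles if signed_area(t) > 0]

tris = [
    [(0, 0), (1, 0), (0, 1)],  # counter-clockwise: kept
    [(0, 0), (0, 1), (1, 0)],  # clockwise (backfacing): culled
    [(0, 0), (1, 1), (2, 2)],  # collinear (zero area): culled
]
print(len(cull(tris)))  # → 1
```

On the GPU, each lane would evaluate one triangle and the survivors would be compacted into a new index buffer, so the fixed-function front end only ever sees triangles that can contribute pixels.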
There may be miscellaneous limits like constant updates and other fixed-function requirements. The units involved may have restrictions on what memory they can and cannot address, and the command processor and other such units tend to sit on a common, lesser link to the L2, or can't hit the cache at all.
The ROPs don't touch the L2, but they have direct links to the memory controllers and their own caches.