Why not then simply add 50% more CUs? That should be more power-efficient than making each CU 50% fatter. Fewer shared resources = good for power efficiency.
I agree that would be a conceptually straightforward way of increasing throughput. I was coming from the context of recent Vega rumors that had already given a CU count as a constraint.
In that scenario, GCN as we know it is at its CU maximum. Conceptually, raising that limit is a minor tweak, although items with chip-crossing interconnects (the L1-L2 network, shared scalar L1 caches, instruction fetch blocks, and message/export paths) would need to expand their routing and addressing capability to service more clients if that were the only change.
Some cases would benefit from fatter CUs: large workgroups, or workgroups in the long tail of execution before a barrier, only speed up with quicker or fatter CUs, since a single workgroup cannot split across multiple CUs. Some levels of clock gating benefit from high-speed information within a CU, while others might benefit from multiple CUs with more aggressive coarse gating.
Shutting down the extra CUs could give power to boost the remaining CUs.
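As a rough back-of-the-envelope sketch of that trade (a toy model, not measured data: it assumes dynamic power scales as f·V² and that voltage has to rise roughly linearly with frequency near the top of the range, so power grows roughly with f³):

```python
# Toy model: dynamic power ~ C * f * V^2, with V roughly proportional to f
# near the top of the frequency range, so power grows roughly as f^3.
# All numbers are illustrative assumptions, not measurements.

def boosted_clock(base_clock_mhz, active_fraction):
    """Clock the remaining CUs could reach if the whole power budget is
    redirected to the active fraction, under the f^3 approximation."""
    return base_clock_mhz * (1.0 / active_fraction) ** (1.0 / 3.0)

# Power-gate a third of the CUs and spend the savings on frequency:
print(round(boosted_clock(1000, 2 / 3)))  # ~1145 MHz
```

The cube-root relationship is why this trade tends to look unimpressive: gating a third of the CUs buys only about 14% more clock for the rest in this model.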
What makes me uncertain at this point with Polaris is that it is another data point in a string spanning multiple years, generations, fabs, and nodes, where GCN seems to show something consistent, unforgiving, and physical about the architecture's clock headroom and efficiency past its target range. Perhaps this next time the headroom will appear, although I am not sure this eagerness to reach for such speeds is optimal long-term for either vendor.
If there were ever an analysis or post-mortem of the architecture under the hood, I would be delighted to see it.
The thoughts below are more of a rambling tangent:
There are some things about GCN that, from a CPU standpoint, look rather heavy: a specialized encoding with some non-streamlined operand and execution behavior, broad hardware resources, ways to route data between the lanes of a decently wide unit, a rather switch-happy multithreading policy, inconsistently handled memory spaces, and a hardwired 4-cycle latency. On the CPU side, there would be no question why a processor whose pipelined FPU has 4-cycle latency doesn't clock as high as one with 6 or more. That's not to say there couldn't be other critical paths, or other pipelines not exposed to the ISA that are themselves too short for the amount of work they do per stage to be clocked high.
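The latency-versus-clock point can be sketched with a toy timing model (all numbers are illustrative assumptions): a fixed amount of FPU logic split across more pipeline stages means less logic per cycle, so a shorter cycle time, minus diminishing returns from flop overhead.

```python
# Toy timing model: a fixed amount of combinational logic (in ns) split
# across N pipeline stages. Cycle time is the per-stage logic delay plus a
# fixed latch/flop overhead, so deeper pipelines clock higher, with
# diminishing returns. Numbers are made up for illustration.

def max_clock_ghz(total_logic_delay_ns, stages, flop_overhead_ns=0.05):
    cycle_time_ns = total_logic_delay_ns / stages + flop_overhead_ns
    return 1.0 / cycle_time_ns

FPU_LOGIC_NS = 2.0  # assumed total delay of the same FPU datapath

print(round(max_clock_ghz(FPU_LOGIC_NS, 4), 2))  # 4-stage: ~1.82 GHz
print(round(max_clock_ghz(FPU_LOGIC_NS, 6), 2))  # 6-stage: ~2.61 GHz
```

Same datapath, same work; the 6-stage version simply has less logic between flops per cycle, which is the unremarkable CPU-world answer to why a 4-cycle design tops out lower.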
Nvidia transitioned to a more streamlined ISA, did not pursue very short forwarding latency as aggressively, and, among other things, has been moving its in-house custom controller to RISC-V, an ISA philosophically focused on straightforward, performant implementation.
It's not mutually exclusive with increasing SIMD count, either. One way to do it would be to stretch the 4-cycle cadence so the new SIMDs fit into the issue loop, which effectively extends the pipeline and so could allow a higher clock. That might not play nicely with how everything ties together in GCN (SIMD count sets the cadence, cadence times SIMD width gives the batch/cache/export size, etc.). However, it may be that this elegance raises the cost of changing any one element, keeping implementations at a local minimum when there are examples elsewhere that appear more optimal globally.
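How tightly those knobs are coupled can be shown with a few lines. The first call reflects public GCN parameters (4 SIMDs of 16 lanes per CU, round-robin issue, wave64 over 4 cycles); the 6-SIMD variant is purely hypothetical.

```python
# In GCN, each CU has 4 SIMDs of 16 lanes; the scheduler round-robins over
# them, issuing to one SIMD per cycle, so each SIMD gets a new instruction
# every 4 cycles. A 64-wide wavefront then drains through a 16-lane SIMD in
# exactly those 4 cycles, hiding the 4-cycle ALU latency.

SIMD_WIDTH = 16

def gcn_parameters(simds_per_cu):
    cadence = simds_per_cu            # issue-loop length = SIMD count
    wave_size = SIMD_WIDTH * cadence  # batch size that keeps the loop full
    return cadence, wave_size

print(gcn_parameters(4))  # shipping GCN: (4, 64)
# Hypothetical 6-SIMD CU: the cadence stretches to 6 and the natural batch
# size (and everything keyed to it: caches, export, etc.) grows to 96.
print(gcn_parameters(6))  # (6, 96)
```

That second line is the local-minimum problem in miniature: touch the SIMD count and the wavefront size, and everything sized around it, wants to move too.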