This topic is not going to gain any traction or serve any purpose until those asking questions can really clarify them. I'm thinking that the real question here is, "what did Cerny mean by 'balance'?" and it's that single talking point that's spawned this rather odd view of PS4's graphics processor.
Cerny went so far as to say the 14+4 slide was not part of their evangelization, which may mean that one slide was not meant to be universally applicable, or that the slide could have misstated some kind of design rule of thumb.
One conservative interpretation is that the CU count is sized such that the PS4 was projected to have a high probability of having a sufficiently large number of free wavefront issue cycles when running hypothetical present and future workloads in a target time frame.
There's a bunch of caveats:
1) It's heavily dependent on statistical analysis of workloads and on extrapolating them to future needs.
Sony could have projected a set of representative use cases for the fixed-function portions of the graphics pipeline, seen what number of CUs it could probably get away with without impacting their performance too much, and then opted to add a few more.
The GPU hardware is very good at allocating CUs, and it's not a heroic feat to burden all 18 CUs with graphics work. The GDC slide alone shows major parts of a frame where all 18 get used.
However, at least some of the time and with modest effort, some cycle time can be freed up.
Since this is forward-looking, only time will tell how accurate their predictions will be.
2) The GPU's granularity for injecting compute kernels into the overall workload is still somewhat coarse. Other analyses of AMD's architecture and the little slivers of unallocated compute in the GDC slide point to this. In that case, it helps to have extra margin so that there is enough contiguous spare capacity, instead of it being fragmented so heavily across the whole frame that the overhead of splitting a kernel over many wavefronts and over long periods of time begins to dominate (see the rough sketch below).
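To make that tradeoff concrete, here's a toy model, not taken from any real profiling data; the overhead figure and gap sizes are made-up assumptions, and only the shape of the comparison matters:

```python
# Toy model: the same total idle time is worth less for async compute when it's
# fragmented, because each slice of a kernel pays a fixed scheduling/launch cost.
# OVERHEAD_MS and the gap lists are made-up, illustrative values.

OVERHEAD_MS = 0.05  # assumed fixed cost per slice (launch, sync, etc.)

def useful_compute_ms(idle_gaps_ms):
    """Idle time left over after each gap pays the per-slice overhead."""
    return sum(max(0.0, gap - OVERHEAD_MS) for gap in idle_gaps_ms)

# Same 2.0 ms of total spare capacity in both cases:
contiguous = [2.0]        # one large block late in the frame
fragmented = [0.08] * 25  # 25 tiny slivers scattered across the frame

print(useful_compute_ms(contiguous))  # ~1.95 ms of useful compute
print(useful_compute_ms(fragmented))  # ~0.75 ms once every sliver pays the overhead
```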
AMD offers a decently balanced part based on their GPU options, with some feedback from Sony about what tweaks they'd like. 32 ROPs is one obvious decision: we have Sebbbi telling us that's basically overkill and that compute is more important going forwards, so that choice seems imbalanced.
This might depend on Sony's long-term goals and projected workloads, which may not be the same as what the developer of any specific engine would be targeting.
There may be internal quirks to the architecture where doubling up on ROPs can help reduce periods of contention when CUs need to use the export bus to send data to the ROPs. That can buy small slivers of time where the CUs are able to export pixels faster or where they aren't waiting on other CUs to do the same.
Sony's desire to push a 3D headset might figure into that. The 1000 foot view has the ALU and pixel output side growing more in relation to geometry and the graphics front end, and the GDC graph somewhat aligns with it since the time frame is the all-important 16ms.
For a single camera and a 33ms budget, the design might not seem as balanced as it does when the architecture has to fit its work into far fewer milliseconds in stereoscopic mode.
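Just to spell the budget arithmetic out; the resolutions and refresh rates here are illustrative assumptions, not claims about any particular headset:

```python
# Frame-budget arithmetic behind the 16 ms vs 33 ms point. The resolutions and
# refresh rates are illustrative assumptions, not specs for any actual headset.

def frame_budget_ms(refresh_hz):
    return 1000.0 / refresh_hz

def pixel_rate(width, height, views, refresh_hz):
    """Pixels per second the back end has to sustain."""
    return width * height * views * refresh_hz

print(frame_budget_ms(30))  # ~33.3 ms for a single-camera 30 Hz title
print(frame_budget_ms(60))  # ~16.7 ms once the target is 60 Hz

print(pixel_rate(1920, 1080, 1, 30))  # ~62 Mpix/s, single camera at 30 Hz
print(pixel_rate(960, 1080, 2, 60))   # ~124 Mpix/s for a split stereo view at 60 Hz
```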
Probably the only balancing consideration that came into it was what CPU cores to have. Sony could have gone Piledriver, or maybe held off for Steamroller, but that would have taken up more die space, meaning less room for CUs or a higher cost.
They might have dodged a bullet there. Given Kaveri's incomplete launch lineup, the PS4 might not have launched yet, and it probably would have been dependent on Globalfoundries' specialized 28nm APU-focused process--restricting production to a single fabrication partner (that AMD at one point paid hundreds of millions of dollars not to use).
Steamroller has an order of magnitude more process-specific custom macros built into it compared to Jaguar, so the chances of seeing a Steamroller PS4 come from anywhere but GF seem small.
For example, normally the bandwidth of the L1, LDS, and registers (in every CU) is tailored to the total number of CUs in the GPU (going by the graph). What happens if the bandwidth of the L1, LDS, and registers (for each CU) and of the L2 is tailored for a lower number of CUs than the GPU actually has?
Register, L1, and LDS bandwidth are fixed per CU. Any aggregate graph of them is going to scale with the number of CUs and their clock.
L2 bandwidth will scale with the number of memory channels and the GPU clock.
It's also helpful to use graphs with many data points that are normalized to the same number of CUs--with the added point that normalizing design bandwidths to some arbitrary number of CUs is not particularly meaningful.
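As a back-of-the-envelope sketch of why those aggregate numbers scale that way; the bytes-per-clock figures are assumptions for illustration, and the point is only what each number scales with:

```python
# Back-of-the-envelope for how those aggregate bandwidth graphs scale. The
# bytes-per-clock figures are assumptions for illustration, not published specs.

def aggregate_cu_bandwidth_gbs(num_cus, clock_ghz, bytes_per_clock_per_cu):
    # Register/L1/LDS paths are a fixed width per CU, so the aggregate figure
    # scales with CU count and clock.
    return num_cus * bytes_per_clock_per_cu * clock_ghz

def l2_bandwidth_gbs(num_slices, clock_ghz, bytes_per_clock_per_slice):
    # L2 bandwidth scales with the number of slices (tied to memory channels)
    # and the GPU clock, not with the CU count.
    return num_slices * bytes_per_clock_per_slice * clock_ghz

# Hypothetical 18-CU part at 0.8 GHz, assuming 64 B/clk of L1 per CU:
print(aggregate_cu_bandwidth_gbs(18, 0.8, 64))  # 921.6 GB/s aggregate
# The per-CU share (64 B/clk * 0.8 GHz = 51.2 GB/s) doesn't change if CUs are
# added or removed; only the aggregate moves.
print(l2_bandwidth_gbs(8, 0.8, 64))             # 409.6 GB/s for 8 assumed slices
```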
I don't think balance was the only reason. Power consumption, price, and the heat to be dissipated by the cooling system also have to be put into the equation.
Another point that was raised was that clamshell mode impinges on the top clock a given bin can reach. Clamshell makes GDDR5 devices share various signal lines, which is not as clean electrically.
DRAM capacity doesn't really have much effect on power consumption, and clamshell mode means the power-hungry interface isn't growing.