The 4-way ALU clusters are part of a 16-wide SIMD.
A 16-ALU cluster would have 256 units in the SIMD.
It would also be more challenging to connect 16 units to the per-cluster register file.
What if they keep the same SIMD size? So four 16-ALU clusters, for 64 units per SIMD.
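To put numbers on the two options, here is a quick back-of-the-envelope sketch in Python. It assumes a SIMD is just a number of identical clusters multiplied by the cluster width; the function names are made up for illustration and do not come from any vendor documentation.

```python
# Sketch of the SIMD sizing arithmetic discussed above.
# Assumption: a SIMD is modelled as `lanes` identical clusters,
# each `cluster_width` ALUs wide. All names here are hypothetical.

LANES_PER_SIMD = 16  # the "16-wide SIMD" from the post


def alus_per_simd(cluster_width, lanes=LANES_PER_SIMD):
    """Total ALUs in one SIMD."""
    return cluster_width * lanes


def clusters_for_budget(total_alus, cluster_width):
    """Clusters per SIMD if the per-SIMD ALU count is held fixed."""
    return total_alus // cluster_width


current = alus_per_simd(4)                 # 4-way clusters, 16 lanes
scaled = alus_per_simd(16)                 # 16-way clusters, 16 lanes
kept = clusters_for_budget(current, 16)    # same ALU budget, 16-way clusters

print(current, scaled, kept)
```

This prints 64, 256 and 4: today's 64 ALUs per SIMD, the 256 you get by widening every cluster, and the four clusters you end up with if the 64-ALU budget is kept.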
Due to the elimination of redundant transistors, SIMDs should be smaller.
As for AMD, well, they reduced the width to 4 rather than increasing it beyond 5. I think that may be a clue as to the current direction of shader workloads. AMD said that the average lane utilisation was in fact 3.5 out of 5, I believe, which shows that 5-wide was indeed far too wide.
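As a rough illustration of what that 3.5 figure means, the snippet below turns it into a slot utilisation percentage. The 4-wide number assumes the compiler could still fill 3.5 slots on average after the change, which is optimistic and is my assumption, not something AMD stated.

```python
# Rough utilisation arithmetic for the 3.5-slots figure quoted above.
# Assumption: the same 3.5 average is reused for the 4-wide case,
# which is an optimistic upper bound rather than a measured number.

def slot_utilisation(avg_slots_used, issue_width):
    """Fraction of issue slots doing useful work."""
    return avg_slots_used / issue_width


vliw5 = slot_utilisation(3.5, 5)
vliw4 = slot_utilisation(3.5, 4)

print(f"5-wide: {vliw5:.1%}, 4-wide: {vliw4:.1%}")
```

That works out to 70.0% for the 5-wide design and at most 87.5% for the 4-wide one.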
In the PC space you need an architecture that performs at its "best" at launch or within a meaningful lifespan (one year?). In the console space you need a more future-oriented architecture. And the trend is toward increasing shader workloads... and compute. A future-oriented architecture in the console space would probably have a much higher ALU:texture ratio, and maybe drop some fixed-function hardware (well, due to the failure of Larrabee, we will have to stick with TMUs, but we might get rid of ROPs at this point and maybe also the fixed-function tessellation unit).
For example:
32 SIMDs, each 64-wide (4*16-ALU clusters)
64 TMUs
128-bit MC
32 to 64 MB of L3 cache on die.
At 28nm, it shouldn't be much over 200-250 mm^2.
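For reference, the totals implied by that example configuration can be worked out in a few lines. This only multiplies the figures listed above; the variable names are mine, and nothing here estimates die area.

```python
# Totals implied by the example configuration above.
# Only the listed figures are used; variable names are hypothetical.

SIMDS = 32
ALUS_PER_SIMD = 4 * 16   # four 16-ALU clusters per SIMD
TMUS = 64

total_alus = SIMDS * ALUS_PER_SIMD
alus_per_tmu = total_alus / TMUS

print(f"{total_alus} ALUs, {alus_per_tmu:.0f} ALUs per TMU")
```

So the proposal comes to 2048 ALUs and an ALU:TMU ratio of 32:1.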
I think XDR2 has a higher cost.
What kind of bandwidth would be available with on-die RAM at this point, and is it needed above and beyond the 200-250 GB/sec that a shared memory pool would offer?
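To get a feel for what 200-250 GB/sec asks of external memory, here is a small sketch using the usual peak-bandwidth formula (bus width in bytes times per-pin data rate). The 128-bit width is borrowed from the example configuration purely for illustration; the actual bus width of such a shared pool is not stated anywhere in this thread.

```python
# Peak memory bandwidth arithmetic.
# Assumption: a 128-bit interface, taken from the example config for
# illustration only. Function names are hypothetical.

def bandwidth_gb_s(bus_bits, gbit_per_pin):
    """Peak bandwidth in GB/s for a bus of `bus_bits` data pins."""
    return bus_bits / 8 * gbit_per_pin


def pin_rate_needed(bus_bits, target_gb_s):
    """Per-pin data rate (Gbit/s) needed to reach a target bandwidth."""
    return target_gb_s * 8 / bus_bits


low = pin_rate_needed(128, 200)
high = pin_rate_needed(128, 250)

print(f"{low} to {high} Gbit/s per pin")
```

On a 128-bit bus that is 12.5 to 15.625 Gbit/s per pin; a wider bus lowers the per-pin rate proportionally.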
An internal crossbar can have a bandwidth as high as 1 terabyte/s.
BTW: Tim Sweeney (Epic) said that for a real leap in game technology, a huge step in bandwidth is needed... on the order of terabytes per second. Since there is no news about the Terabyte Bandwidth Initiative from Rambus, the only way to achieve this is with a large cache.
And someone from DICE, in one of his presentations, said that it's time to move to 16-way ALUs.