Thanks for all the details, Jawed.
I still have my reservations, though. Why, for example, would AMD redesign their SIMD cores when they could have kept their 20:1 ratio and saved die space plus R&D costs?
4 clusters, 320 ALU lanes, 16 TUs? I reckon that would save about 5% of the die space - based on about 17.5% of each SIMD being control (complete guess).
The problem, I guess, is that RV7xx has half RV6xx's performance, per TU, for fp16 filtering. But I still have no idea what proportion of typical rendering time in current games is fp16 filtering.
Especially since RV730 has so little bandwidth compared to HD4800 and only 16 interpolators.
HD4850 is definitely short on bandwidth, while HD4870 has too much.
16 interpolators is still a 2:1 interpolator:fragment ratio, like RV770:
http://forum.beyond3d.com/showpost.php?p=1193433&postcount=184
I still think that this move was somehow a kind of field test for things to come - smaller branches, more data fed into shaders... maybe interesting stuff for D3D11, and since TMUs aren't that costly - as you've also hinted at...
I'm torn over the smaller branches thing - I presume you mean lower branching-divergence penalty. Lower is clearly useful, but using the control overhead I mentioned earlier, a 2:1 (what you would call 10:1) ratio makes the ALUs (ALUs+control, excluding TUs) about 22% larger (resulting in that 5% penalty for the die as a whole).
With nested branching there's an explosion in the divergence penalty. Sadly GPUSA is still broken for calculating the throughput of complex shaders with nested branching (well, with the shaders I've tried, anyway, e.g. Steep Parallax Mapping), so I haven't had a chance to play with it and see what kind of effect nesting has on performance compared with ALU:TEX. Also, is nesting really relevant at this time (as a proportion of frame rendering time)?
Without nesting it's then a matter of sheer throughput of a 4:1 configuration compared with a 2:1 configuration. If the former has ~20% higher throughput for the same SIMD area, and shaders with DB are still relatively rare...
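To put a rough shape on the nesting "explosion", here is a toy model, not a measurement. It assumes fully nested if/else branches (2^depth leaf paths), every fragment picking a leaf uniformly and independently, and a batch having to execute every leaf that at least one of its fragments lands on. Real shaders are far more coherent than that, so treat it as a worst-case-ish sketch. The function name and the batch sizes of 32 and 64 are chosen only for illustration.

```python
def expected_paths(batch_size: int, depth: int) -> float:
    """Expected number of distinct leaf paths one batch must execute.

    Toy assumptions (not from the post): `depth` fully nested if/else
    levels give 2**depth leaves, and each fragment independently picks
    a leaf with equal probability.
    """
    leaves = 2 ** depth
    # Probability that no fragment in the batch lands on a given leaf.
    p_leaf_empty = (1.0 - 1.0 / leaves) ** batch_size
    # A leaf costs execution time only if at least one fragment uses it.
    return leaves * (1.0 - p_leaf_empty)


if __name__ == "__main__":
    for depth in range(1, 7):
        small = expected_paths(32, depth)
        large = expected_paths(64, depth)
        print(f"depth {depth}: {small:5.1f} paths (batch 32), "
              f"{large:5.1f} paths (batch 64)")
```

Under these assumptions the path count roughly doubles per nesting level until the batch size starts to cap it, and only at deeper nesting does the smaller batch run noticeably fewer paths.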
A 2:1 version of RV770, with ~1 TFLOP, would have had 16 clusters, which would make 64 TUs.
It wouldn't have been hugely bigger, though, around 270mm², I guess.
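The cluster counts being thrown around can be checked with a few lines of arithmetic. This sketch assumes things the post doesn't spell out: 5 ALUs per lane, 4 TUs per cluster, one MAD (2 flops) per ALU per clock, and a 750MHz core clock as on HD4870. The helper name `config` is made up for the example.

```python
ALUS_PER_LANE = 5         # assumption: 5 ALUs per lane
TUS_PER_CLUSTER = 4       # assumption: 4 texture units per cluster
FLOPS_PER_ALU_CLOCK = 2   # assumption: one MAD per ALU per clock
CLOCK_GHZ = 0.75          # assumption: HD4870-like core clock


def config(clusters: int, lanes_per_cluster: int) -> dict:
    """Derive ALU count, TU count, ALU:TU ratio and peak GFLOPS."""
    alus = clusters * lanes_per_cluster * ALUS_PER_LANE
    tus = clusters * TUS_PER_CLUSTER
    return {
        "alus": alus,
        "tus": tus,
        "alu_per_tu": alus // tus,
        "lanes_per_tu": lanes_per_cluster // TUS_PER_CLUSTER,
        "gflops": alus * FLOPS_PER_ALU_CLOCK * CLOCK_GHZ,
    }


if __name__ == "__main__":
    print("RV770-style 4:1, 10 clusters:", config(10, 16))
    print("RV770 at 2:1, 16 clusters:   ", config(16, 8))
    print("RV730-style 2:1, 8 clusters: ", config(8, 8))
    print("RV730 at 4:1, 4 clusters:    ", config(4, 16))
```

With those assumptions, 16 clusters at 2:1 gives 640 ALUs and 64 TUs at a bit under 1 TFLOP, and 4 clusters at 4:1 gives the 320 ALUs and 16 TUs mentioned earlier.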
--
The two smallest ones I've measured by hand were about 104mm²: RV380 and RV515. IIRC the latter was the last low-end GPU from AMD (then ATi) with a 128-bit memory interface. On the NV side of things it was G96-300 (55nm AFAIK), measuring in at about 118mm².
So RV730 is about 40mm² "over-sized" for its bandwidth - though that doesn't account for power. So in comparison to that, the ~5% increase in die area for using 2:1 clusters, instead of 4:1 clusters like RV770, seems relatively tame.
Jawed