Oh, finally an easy question, thanks!
APUs are about accelerated processing; they're not about making discrete 3D accelerators obsolete for gaming. So it really is the OpenCL part that matters: robustness, predictable performance and whatever else AMD thought was worth improving in Cayman over Cypress. Don't know - maybe even the RBEs are making more effective use of bandwidth, which would be a good thing for bandwidth-starved APUs.
Cayman is marginal as an "improvement" in these regards, though - a tweak. OpenCL isn't really better served by VLIW-4 in the way that AMD is advertising GCN will serve it. That's a complete change.
Except in some games, where it doesn't matter anyway, asymmetric VLIW5 doesn't seem to compare favourably with the symmetric VLIW4 approach in Cayman. And the vector lanes in GCN have identical functionality too - so having that kind of animal out in the field probably helps the driver team identify some tricks before either GCN or Trinity launches.
Are you expecting each of the SIMD-16s to operate with sets of 4 lanes collaborating on a transcendental result, i.e. 4x transcendentals per clock (though in practice 3 lanes out of 4, like Cayman)? And 4 lanes collaborating on DP?
---
For DP-MAD that's 6 VGPRs that need to be read in 4 cycles (assuming 4-cycle dependent-instruction issue), since each of the three 64-bit operands takes 2 VGPRs. I don't think that's possible. On top of that, for both transcendentals and DP, the 1:1 mapping of 128-bit portions of the VGPRs to ALU lanes would fall apart, making operand collection considerably more costly.
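Spelling the operand counts out (my own arithmetic, just assuming a plain d = a*b + c with all three sources held in VGPRs):

```latex
% single-precision MAD: three 32-bit sources, one VGPR each
\[ 3 \times 1 = 3 \ \text{VGPR reads per lane, per instruction} \]
% double-precision MAD: three 64-bit sources, two VGPRs each
\[ 3 \times 2 = 6 \ \text{VGPR reads per lane, per instruction} \]
```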
One argument you could make is that for DP it's really 6 VGPRs in 16 cycles (or 8 cycles if DP is half-rate). Assuming there's a bypass network and more than 4 cycles of GPR fetch time before ALU execution (e.g. 8 cycles), that timing flexibility could save the day. Not sure...
Slide 18 of the GCN presentation says that 64-bit registers are formed from adjacent VGPRs. This might be nothing more than a technique to minimise fragmentation problems in register allocation, or it might refer to an RF burst operation that solves the "6 VGPRs in 4 cycles" problem (seems unlikely, though).
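To illustrate what I mean by pairing, here's a sketch of how one lane's 64-bit operand might be assembled from two adjacent 32-bit VGPRs. Only the "adjacent VGPRs" wording is from slide 18; the low/high ordering and any alignment rule are my assumptions.

```c
#include <stdint.h>

/* One lane's view of its 32-bit VGPR file (size is arbitrary here). */
static uint32_t vgpr[256];

/* Assemble the 64-bit value held in the adjacent pair v[n], v[n+1].
 * Slide 18 only says "adjacent"; whether n must be even is unknown. */
static uint64_t read_vgpr_pair(unsigned n)
{
    uint32_t lo = vgpr[n];        /* v[n]   : low 32 bits  */
    uint32_t hi = vgpr[n + 1];    /* v[n+1] : high 32 bits */
    return ((uint64_t)hi << 32) | lo;
}

int main(void)
{
    vgpr[10] = 0x89ABCDEFu;   /* low half  */
    vgpr[11] = 0x01234567u;   /* high half */

    /* A DP MAD d = a*b + c would need three such pairs read per lane,
     * i.e. the 6 VGPR reads discussed above. */
    return read_vgpr_pair(10) == 0x0123456789ABCDEFull ? 0 : 1;
}
```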
---
The 4-lane DP (and other operations, such as dual-dependent MUL with ADD) is founded on the dot-product functionality of the VLIW configuration. Dot-product requires cross-lane communication in the VLIW configuration, and it's that communication that makes those other instructions possible.
In GCN I expect dot-product is just a macro for MAD, MAD, MAD and ADD, like in NVidia. I expect transcendentals and DP are macros too, taking multiple cycles, with no lane-ganging.
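Something like this per lane, serial and with no cross-lane traffic (my own sketch of the kind of expansion I mean; whether it ends up MUL/MAD/MAD/MAD or MAD/MAD/MAD/ADD doesn't change the point):

```c
#include <stdio.h>

typedef struct { float x, y, z, w; } float4;

/* dot4 expanded into a per-lane serial MUL/MAD chain -- the kind of
 * macro expansion speculated on above, with no lane-ganging. */
static float dot4(float4 a, float4 b)
{
    float t = a.x * b.x;   /* MUL */
    t = a.y * b.y + t;     /* MAD */
    t = a.z * b.z + t;     /* MAD */
    t = a.w * b.w + t;     /* MAD */
    return t;
}

int main(void)
{
    float4 a = {1, 2, 3, 4}, b = {5, 6, 7, 8};
    printf("%f\n", dot4(a, b));   /* 70.000000 */
    return 0;
}
```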
---
I can't think of any other clues about the actual operation, but overall I'm sceptical about ganged operation.
Edit: The OpenCL driver also has a preferred vector width of 4.
Same for Cypress and Cayman. We'll have to wait and see what it is for GCN. But GCN is going to be register-limited in comparison with VLIW, so it'll probably be 1.
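For reference, this is the query I mean (standard OpenCL host code; error checking and device selection kept minimal, and I'm only showing the float width):

```c
#include <stdio.h>
#include <CL/cl.h>

int main(void)
{
    cl_platform_id platform;
    cl_device_id device;
    cl_uint width = 0;

    clGetPlatformIDs(1, &platform, NULL);
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL);

    /* The "preferred vector width" the driver reports -- 4 for float on
     * Cypress/Cayman, per the discussion above. */
    clGetDeviceInfo(device, CL_DEVICE_PREFERRED_VECTOR_WIDTH_FLOAT,
                    sizeof(width), &width, NULL);

    printf("preferred float vector width: %u\n", width);
    return 0;
}
```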