I'm wondering if there is any evidence for 16-wide SIMD in the cluster (as opposed to, say, 2 banks of 8 or 4 banks of 4)? In theory, splitting into smaller pieces would cost more in scheduling transistors, but it would provide more flexibility for a single-cluster chip (allowing vertex and pixel programs to run concurrently, rather than requiring some kind of interrupt-driven time-slicing mechanism), map more naturally to the smaller set of texture units (4 or 8, take your pick), and match the scheduling factor of R580 and Xenos more closely.
I don't know why the latter two chips schedule batches at 4x SIMD width, but that appears to be the case. If nVidia requires the same relative factor, that would argue for 8-SIMD. 8-SIMD would also match the pictorial grouping of 8 stream processors in the tech brief, and the number of texture units in the cluster.
That's hardly evidence of anything either, but I'm curious what evidence we might have for 16, other than that it seems obvious the units in the cluster are all SIMD scheduled/dispatched.
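For what it's worth, here is a quick Python sketch of the arithmetic behind the three options. It is only a back-of-the-envelope illustration: the 16 lanes per cluster, the 4x batch factor (what R580 and Xenos appear to use), and the 4-or-8 texture unit guesses all come from the speculation above, not from any confirmed spec, and the names (`Partition`, `partitions`, `BATCH_FACTOR`) are made up for this example.

```python
from dataclasses import dataclass

# Assumptions taken from the post, not from any vendor documentation:
LANES_PER_CLUSTER = 16         # processing units per cluster
BATCH_FACTOR = 4               # batch size = 4x SIMD width (apparent R580/Xenos ratio)
TEXTURE_UNIT_GUESSES = (4, 8)  # "4 or 8, take your pick"


@dataclass(frozen=True)
class Partition:
    banks: int                   # independently scheduled SIMD banks per cluster
    simd_width: int              # lanes per bank
    batch_size: int              # elements per batch under the assumed 4x factor
    matches_texture_units: bool  # does the width equal one of the texture unit guesses?


def partitions(lanes=LANES_PER_CLUSTER, widths=(16, 8, 4)):
    """Enumerate ways to split a cluster's lanes into equal-width SIMD banks."""
    result = []
    for width in widths:
        if lanes % width != 0:
            continue  # skip widths that do not divide the cluster evenly
        result.append(
            Partition(
                banks=lanes // width,
                simd_width=width,
                batch_size=BATCH_FACTOR * width,
                matches_texture_units=width in TEXTURE_UNIT_GUESSES,
            )
        )
    return result


for p in partitions():
    print(
        f"{p.banks} x {p.simd_width}-wide: batch of {p.batch_size}, "
        f"matches texture unit count: {p.matches_texture_units}"
    )
```

Under those assumptions, only the 8-wide and 4-wide splits line up with a texture unit count, and more banks means more independent schedulers to pay for, which is the transistor trade-off mentioned earlier.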
I think I'll hold off on what that might mean (either now, or in future revs) for handling dispatches of vector instructions where the entire SIMD width is predicated off. I'm already well out on a limb ;-)