You know, you're right. For some reason I was thinking the GPRs couldn't feed them at peak for independent ALU ops, but in retrospect I don't really know why I thought that. So the only problem is the increased granularity for branch coherence.

More limited? Could you give an example of what a vec4 could do that a VLIW with the same 4 instruction slots couldn't?
The compiler would have to be really careful about how it creates the VLIW instructions from scalar threads, as far as LDS accesses are concerned, so that it doesn't create bank conflicts.
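
To make the bank conflict concern concrete, here is a rough sketch of the check a compiler would effectively need to do when it packs LDS accesses from several scalar threads into one bundle. The bank count (32), the bank width (4 bytes), and the rule that reads of the same word are broadcast rather than serialized are all assumptions for illustration, not figures for any particular chip, and the function names are made up.

```python
from collections import defaultdict

NUM_BANKS = 32   # assumed number of LDS banks (illustrative only)
BANK_WIDTH = 4   # assumed bytes per bank word (illustrative only)


def bank_of(byte_address):
    """Return the bank a byte address maps to under simple word interleaving."""
    return (byte_address // BANK_WIDTH) % NUM_BANKS


def max_conflict_degree(byte_addresses):
    """Return the largest number of distinct words that hit a single bank.

    1 means the accesses are conflict free; N means the worst bank has to
    be visited N times. Accesses to the same word are assumed to be
    broadcast, so they count once.
    """
    words_per_bank = defaultdict(set)
    for addr in byte_addresses:
        words_per_bank[bank_of(addr)].add(addr // BANK_WIDTH)
    return max((len(words) for words in words_per_bank.values()), default=0)


# Four scalar threads packed into one bundle, each reading consecutive words:
packed_consecutive = [0, 4, 8, 12]
# The same four threads, but each one strides by NUM_BANKS words:
packed_strided = [0, 128, 256, 384]

print(max_conflict_degree(packed_consecutive))
print(max_conflict_degree(packed_strided))
```

Under those assumptions the consecutive case touches four different banks, while the strided case lands all four accesses in bank 0, which is the kind of packing the compiler would have to avoid or pay for with serialization.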