Who says that NVIDIA will stay at a size of 32?
As far as I have heard, they are improving dynamic branching with some kind of undisclosed scheme that avoids pipeline bubbles while still keeping the Vec8 ALUs. So the warp size should still be 32, but branch divergence should perform much better than on G80/GT200.
I don't know whether such a scheme is also viable on AMD's VLIW architecture, because there you have up to 5 different dependencies per instruction.
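To make the divergence point concrete: with a 32-wide warp executing in lockstep, a divergent if/else is serialized, with inactive lanes masked off during the path they didn't take. Here's a toy Python cost model of that behaviour (the serialization rule matches how G80-class SIMT is usually described; the function name and cycle numbers are made up for illustration):

```python
# Toy model of SIMT branch divergence cost. Assumption: classic
# G80-style serialization, where a warp runs both sides of a
# divergent branch with non-participating lanes masked (bubbles).

WARP_SIZE = 32  # threads per warp on G80/GT200

def divergent_cost(taken_mask, cost_if, cost_else):
    """Cycles a warp spends on an if/else, given which lanes take it.

    taken_mask: list of WARP_SIZE bools, True where the lane
    takes the 'if' path.
    """
    any_if = any(taken_mask)            # at least one lane takes 'if'
    any_else = not all(taken_mask)      # at least one lane takes 'else'
    # Uniform branch: only one path is executed.
    # Divergent branch: both paths run back to back.
    return cost_if * any_if + cost_else * any_else

# All 32 lanes agree -> pay for one path only.
print(divergent_cost([True] * 32, 10, 10))                 # 10
# Lanes split 16/16 -> both paths are serialized.
print(divergent_cost([True] * 16 + [False] * 16, 10, 10))  # 20
```

The scheme being speculated about above would presumably shrink that worst case, whereas here a single disagreeing lane already doubles the cost for the whole warp.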
Wouldn't it be possible to just use large enough FIFOs between the ALUs and the ROPs? The only way round this that I can think of is two parallel rasterisers, each 16 wide, each feeding half the clusters. Then it'd be similar to R580, where the 16-wide rasteriser built batches of 48.