Mintmaster
Veteran
"It would need 4x as many branch units."
Read my post again. Branching is done at the same rate as before (I'm assuming that current hardware does it once every instruction group when I say my method will branch once every four scalar instructions).
"Breaking up the groups of 4 would bring up register file concerns. Either this will quadruple the register file to maintain the same amount of registers, or it will split up the weird quad registers and potentially simplify the register access restrictions."
I'm pretty sure that my method doesn't really change things at all here. The only minor issue is that in an 8 cycle period, the total number of possible locations that need to be accessed from the register file is four times larger with my method. Actual transfer rate will be the same, and the size of the register file is the same, too.
"The trans units would have a problem, though, since their register accesses piggyback on the datapaths of the slim ALUs, and breaking the clusters will leave them orphaned or picking between a lot of lanes."
I'm not going to break the clusters, though. The T units will put into their pipeline 16 pixels of the same batch that the ALUs will. The ALUs go round robin on 8 batches, and each batch will stay active for at least four visits (32 cycles) to allow the T-units and branch units to finish up.
(FYI, by active I mean that they have data going through the ALU stages. There are plenty more batches in flight, put on hold either for texture fetches or simply waiting their turn.)
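To make the timing concrete, here is a rough Python sketch of the round-robin arithmetic described above. The 8 batches and the minimum of four visits come from the post. The one-cycle-per-visit figure is only inferred from "four visits (32 cycles)", and the function names are made up for illustration. It is not a model of any real hardware scheduler.

```python
NUM_BATCHES = 8        # batches the ALUs rotate through (from the post)
MIN_VISITS = 4         # minimum visits a batch stays active (from the post)
CYCLES_PER_VISIT = 1   # assumption: inferred from 4 visits * 8 batches = 32 cycles


def visit_cycles(batch, visits=MIN_VISITS):
    """Return the cycles at which `batch` gets its ALU slot (hypothetical helper)."""
    rotation = NUM_BATCHES * CYCLES_PER_VISIT  # length of one full round robin
    return [batch * CYCLES_PER_VISIT + v * rotation for v in range(visits)]


def active_window(visits=MIN_VISITS):
    """Return how many cycles a batch stays active for the given number of visits."""
    return visits * NUM_BATCHES * CYCLES_PER_VISIT


if __name__ == "__main__":
    for b in range(NUM_BATCHES):
        print(f"batch {b}: ALU slots at cycles {visit_cycles(b)}")
    print("minimum active window:", active_window(), "cycles")
```

Under these assumptions, each batch comes back to the ALUs every 8 cycles, which is the gap the T-units and branch units get to finish their work before the batch's next instruction group.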