If my only desire is to keep signals from having to cross ever-widening expanses of die space, I'd say yes. If we start from the point of view of sloth, we don't want our signals working any harder than they have to.
Hmm, that's like arguing that only a few clusters nearest the RBEs should do MSAA resolve.
As the ALU numbers rise, the proportional cost of leaning on one set of SIMDs over others would actually fall.
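To put rough numbers on that, here is a back-of-the-envelope sketch in Python. It assumes a fixed group of SIMDs near the RBEs takes on the extra work and that all SIMDs are otherwise equal. The function name and the SIMD counts are made up for illustration and are not taken from any real chip.

```python
def leaned_on_share(near_simds: int, total_simds: int) -> float:
    """Fraction of the whole SIMD array that sits in the 'near the RBEs' group."""
    if not 0 < near_simds <= total_simds:
        raise ValueError("need 0 < near_simds <= total_simds")
    return near_simds / total_simds

# Hypothetical designs: the near group stays at 4 SIMDs while the array grows.
for total in (4, 8, 16, 32):
    share = leaned_on_share(4, total)
    print(f"{total:2d} SIMDs: {share:.1%} of the array is leaned on")
```

With the near group held constant, the share of the array being leaned on shrinks as the total count goes up, which is all the argument needs.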
Such demarcations would give the design more freedom, if we are to expect continued high ALU counts. Instead of designing a SIMD array where even the furthest units must be timed, and their interconnects specified, to accept RBE, TMU, or *insert unit here* inputs, we can draw a line at a certain number of SIMDs.
The possible benefits include more flexible clocking schemes, slower growth of the crossbar, and possibly lower-intensity hot spots or fewer power-consuming repeaters.
The diagrams are probably oversimplified for the ROP and L2 cache relationships, so I don't know how many ports lead to the L1s.
In terms of connectivity, texels would be going from cache into the register file, or into those funky, shiny and new LDSs.
Future growth in unit counts could lead to greater pressure to economize.
A slimmer crossbar may be a worthwhile tradeoff if all it means is that 10 out of 16 SIMDs (random future design speculation) don't get RBE traffic.
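As a rough sketch of what that buys, assume the RBE side of the crossbar is a plain full crossbar, so its link count is just sources times destinations. The number of RBE-side ports below is a pure assumption picked for illustration. Only the 16 SIMDs and the 10 cut off come from the speculation above.

```python
def crossbar_links(sources: int, simds_reached: int) -> int:
    """Point-to-point links in a full crossbar between sources and the SIMDs they reach."""
    return sources * simds_reached

RBE_PORTS = 4     # assumption: RBE-side ports, purely illustrative
TOTAL_SIMDS = 16  # from the speculative example above
CUT_OFF = 10      # SIMDs that no longer receive RBE traffic

full = crossbar_links(RBE_PORTS, TOTAL_SIMDS)
slim = crossbar_links(RBE_PORTS, TOTAL_SIMDS - CUT_OFF)
saved = 1 - slim / full

print(f"full: {full} links, slim: {slim} links, saved: {saved:.1%}")
```

Under this simple model the saving is just the fraction of SIMDs cut off, whatever port count is assumed. A real design would also have to weigh wire length, repeaters and arbitration, none of which is modeled here.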