Simply preceding rasterisation with a tile-rasteriser would solve a whole load of these kinds of problems in the dual-rasteriser of RV870. A small amount of queuing on inputs and outputs should be enough to always keep both rasterisers running at full speed, assuming "average" primitives aren't all hitting one tile, or one rasteriser's tiles.
I suppose there could be a tile-rasterizer block that AMD chose not to show in the diagram.
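To make the tile-rasteriser idea concrete, here is a minimal sketch of a coarse stage that bins primitives by screen tile and feeds two queues, one per rasteriser. Nothing here reflects how RV870 actually works: the 16-pixel tile size, the checkerboard ownership rule and the names `tiles_covered`, `owner` and `distribute` are all assumptions made for illustration.

```python
from collections import deque

TILE = 16  # hypothetical tile size in pixels


def tiles_covered(bbox, tile=TILE):
    """Yield (tx, ty) for every tile overlapped by an inclusive pixel bounding box."""
    x0, y0, x1, y1 = bbox
    for ty in range(y0 // tile, y1 // tile + 1):
        for tx in range(x0 // tile, x1 // tile + 1):
            yield tx, ty


def owner(tx, ty):
    """Assumed checkerboard mapping of tiles onto two rasterisers."""
    return (tx + ty) % 2


def distribute(bboxes):
    """Coarse stage: push one (primitive, tile) work item per covered tile
    onto the input queue of the rasteriser that owns that tile."""
    queues = (deque(), deque())
    for prim_id, bbox in enumerate(bboxes):
        for tx, ty in tiles_covered(bbox):
            queues[owner(tx, ty)].append((prim_id, tx, ty))
    return queues


# Three made-up primitive bounding boxes (x0, y0, x1, y1).
prims = [(0, 0, 31, 15), (40, 40, 44, 44), (16, 16, 47, 47)]
q0, q1 = distribute(prims)
print(len(q0), len(q1))  # work items queued for rasteriser 0 and rasteriser 1
```

A primitive that fits inside one tile, like the second one, only ever lands in one queue, which is the case where the input queuing has to absorb the imbalance.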
The evolutionary pressures were different in the past, and devoting SRAM to merely holding ROP tiles would not have been the best use, as bandwidth scaling was still ramping and SRAM is not the most dense form of memory.
Well, the rule of thumb is each doubling of cache produces 10% performance gain, and the RBEs already have colour/Z/stencil caches. Trouble is, this has absolutely no visibility outside of the IHVs. Also, if it was that easy (memory is cheap) wouldn't it already be in place?
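As a rough worked example of that rule of thumb, assuming the 10% gain compounds with each doubling (that reading is an assumption, and the cache sizes below are invented, not real RBE figures):

```python
import math


def rule_of_thumb_speedup(old_size, new_size, gain_per_doubling=0.10):
    """Speedup predicted by the '10% per doubling' rule, compounded per doubling."""
    doublings = math.log2(new_size / old_size)
    return (1 + gain_per_doubling) ** doublings


# Hypothetical: growing a cache from 16 KB to 128 KB is three doublings.
speedup = rule_of_thumb_speedup(16, 128)
print(round(speedup, 3))
```

Under that reading, eight times the SRAM buys roughly a third more performance, which is the diminishing return being argued about.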
EDRAM at finer geometries in an era where GPUs become strangled by bandwidth might tip the scales if designers want to defer abandoning the forward renderers for a generation or so.
There is a locality to off-chip memory with a NUMA setup.
The PrimSets are stored off chip until their time comes to be rasterised. What Larrabee lacks is essentially a way of connecting multiple chips and pooling their resources - the binning scheme otherwise seems fine to ride on top of such infrastructure.
Bear in mind that tiles are allocated to cores dynamically - there's no locality of bins/tiles for cores. So there'd be no particular locality in a multi-chip scheme.
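The point about dynamic allocation can be illustrated with a toy scheduler in which each tile goes to whichever core frees up first. This is only a sketch of the general idea, not Larrabee's actual scheduler; the function name `schedule` and the per-tile costs are made up.

```python
import heapq


def schedule(tile_costs, num_cores):
    """Greedy dynamic scheduling: each tile, in order, goes to the core that
    becomes free earliest. Returns a dict mapping tile index to core id."""
    cores = [(0, c) for c in range(num_cores)]  # (time when free, core id)
    heapq.heapify(cores)
    assignment = {}
    for tile, cost in enumerate(tile_costs):
        free_at, core = heapq.heappop(cores)
        assignment[tile] = core
        heapq.heappush(cores, (free_at + cost, core))
    return assignment


# Same six tiles, but the expensive tile moves from position 0 to position 1.
a = schedule([5, 1, 1, 1, 1, 1], 2)
b = schedule([1, 5, 1, 1, 1, 1], 2)
print(a[2], b[2])  # the core that ends up with tile 2 in each frame
```

Tile 2 ends up on a different core in the two runs purely because the workload changed, so a tile's data cannot be assumed to live near any particular core, or chip, from one frame to the next.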
There is likely to be a significant bandwidth and latency difference between local memory and the links between the chips.
I doubt Intel can afford a ring-bus-sized QPI link between two chips to provide the freedom of movement that the ring bus offers within one die.