Larrabee's bandwidth savings from keeping a screen tile on-chip until it is completed are something most architectures will probably move towards.
I don't know if there are any good ways to do so without extending tiling throughout the GPU architecture.
That's basically tiling for setup->rasterisation->shading->back-end.
Shading and back-end are tiled, so now it appears there's a good chance that rasterisation is tiled in RV870. Who knows...
It's interesting that Larrabee can supposedly play fast and loose with the placement of rasterisation in the software pipeline. And it's also noteworthy that PrimSets can't be constructed without doing the most coarse rasterisation of triangles. So the default rasterisation in Larrabee actually consists of two distinct rasterisation phases.
Simply preceding rasterisation with a tile-rasteriser would solve a whole load of these kinds of problems in the dual-rasteriser of RV870. A small amount of queuing on inputs and outputs should be enough to always keep both rasterisers running at full speed, assuming "average" primitives aren't all hitting one tile, or one rasteriser's tiles.
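To make the tile-rasteriser idea concrete, here's a rough sketch of a coarse pass feeding two rasteriser queues. Everything in it is an assumption for illustration: the 16-pixel tile size, the checkerboard ownership of tiles, the integer pixel coordinates and the function names are all made up, and the coarse test is just a conservative bounding-box overlap rather than whatever the hardware would really do.

```python
TILE = 16  # hypothetical tile edge in pixels, not a known RV870 figure


def tiles_touched(tri, tile=TILE):
    """Conservative coarse pass: every tile overlapped by the triangle's bounding box."""
    xs = [x for x, _ in tri]
    ys = [y for _, y in tri]
    x0, x1 = min(xs) // tile, max(xs) // tile
    y0, y1 = min(ys) // tile, max(ys) // tile
    return [(tx, ty) for ty in range(y0, y1 + 1) for tx in range(x0, x1 + 1)]


def owner(tx, ty):
    """Assumed checkerboard split of tiles between rasteriser 0 and rasteriser 1."""
    return (tx + ty) % 2


def route(triangles):
    """Queue (triangle index, tile) work items on the rasteriser that owns each tile."""
    queues = ([], [])
    for idx, tri in enumerate(triangles):
        for tile in tiles_touched(tri):
            queues[owner(*tile)].append((idx, tile))
    return queues
```

A triangle that sits inside one tile lands on a single queue, while one that straddles two neighbouring tiles produces a work item for each rasteriser. The queue lengths are what the "small amount of queuing" would have to absorb when a run of primitives all hit one rasteriser's tiles.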
The least elegant way to defer tiling would be to hope the foundry masters EDRAM and then slather a big ROP tile cache that holds the framebuffer up to some ridiculously high resolution.
A much larger cache in general would reduce bandwidth needs, as has been found in other computing realms.
It's not great, but it is an available hack that requires little more effort than allocating die space for it.
Well, the rule of thumb is each doubling of cache produces 10% performance gain, and the RBEs already have colour/Z/stencil caches. Trouble is, this has absolutely no visibility outside of the IHVs. Also, if it was that easy (memory is cheap) wouldn't it already be in place?
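Taking that rule of thumb at face value, the arithmetic can be sketched as below. The 10% per doubling is only the rule of thumb quoted above, not a measurement, and the function name is made up.

```python
import math


def speedup_from_cache(growth_factor, gain_per_doubling=0.10):
    """Compound the assumed per-doubling gain over log2(growth_factor) doublings."""
    doublings = math.log2(growth_factor)
    return (1.0 + gain_per_doubling) ** doublings
```

Under that assumption a 16x larger cache is four doublings, so about 1.1^4, or roughly 1.46x. Sixteen times the die area for under 50% more performance goes some way to explaining why it isn't already in place.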
There is still a H-Z block per rasterizer. If each rasterizer is allowed to update its local copy as the arrows in the diagram indicate, maybe the design assumes that with an even distribution to each rasterizer each local H-Z will start to approach a similar representation in high-overdraw situations.
This would lead to an incremental decrease in effectiveness for short-run situations, and then there is the chance of a long run of pair-wise overlapping triangles pathologically alternating between rasterizers.
The cap would be the RBE z-update latency.
And the problem is that this latency could be so long that it effectively means hierarchical-Z is "off". If overdraw is typically 5 (though some would argue it's lower - and it's obviously lower for early-Z based engines, such as Source) then 2 or 3 of that overdraw could be within the timespan of a set of batches of fragments that are concurrently in flight (shape of a character).
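A toy model of that latency effect, purely as a sketch: one tile, layers submitted front to back (the best case for hierarchical-Z), smaller depth meaning nearer, and a coarse Z value that only reflects layers submitted more than `latency` steps earlier. None of the numbers or names here come from real hardware.

```python
def shaded_layers(depths, latency):
    """Count layers that pass a coarse Z test whose updates lag by `latency` layers."""
    shaded = 0
    for i, z in enumerate(depths):
        # Only layers submitted more than `latency` steps ago have reached H-Z.
        visible = depths[:max(0, i - latency)]
        hz = min(visible) if visible else float("inf")
        if z < hz:
            shaded += 1  # H-Z could not reject this layer, so it gets shaded
    return shaded
```

With five front-to-back layers, zero latency shades only the first layer, a latency of two shades three of them, and a latency longer than the whole run shades all five, which is the point where hierarchical-Z is effectively "off".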
Another argument might be that hierarchical-Z is no longer important because developers do early-Z themselves and there are so many deferred engines being used (for which hierarchical-Z would seem to be too slow).
Who knows?...
I don't think these problems have been fully solved for any multi-chip GPU solution.
Not even Intel has shown a path, as Larrabee's binning scheme has been defined only for a single-chip solution.
The PrimSets are stored off-chip until their time comes to be rasterised. What Larrabee lacks is essentially the connecting of multiple chips and pooling of resources. Otherwise, the binning scheme seems fine to ride on top of such infrastructure.
The rasterizer portion of the scheme may need to have a local run on the full stream on each chip, with a quick reject of primitives that do not fall within a chip's screenspace.
Bear in mind that tiles are allocated to cores dynamically - there's no locality of bins/tiles for cores. So there'd be no particular locality in a multi-chip scheme.
On-chip solutions have much more leeway.
Sure, which is why I suggested that RV870's two rasterisers update each other continuously, resulting in minimal hierarchical-Z latency.
Cypress could send two triangles at the same time to be rasterized. Each rasterizer gets a copy of this pair, and an initial coarse rasterization stage can allow each rasterizer to decide if it will pick or punt each triangle.
Such a process would be much more expensive if crossing chips.
It's a matter of a duplicate data path and an additional coarse check if on-chip.
As MfA says, the initial tile check should be devastatingly cheap and fast. There would need to be extra buffering around the tiling-rasterisation, and it's obviously a bottleneck, but there's no reason it couldn't have twice or higher throughput to ensure the rasterisers aren't idling.
Jawed