What about the theory proposed earlier about buffering triangles in the event triangle setup outpaces scan conversion? Not a likely scenario?
Setup does a coarse rasterisation, identifying all the tiles that a triangle at least partially covers, then giving the rasteriser(s) a list of tiles and triangle data in order to rasterise. I suspect the rasteriser has a tile-centric view of rasterisation, not a triangle-centric view. That's because threads of 16 quads of fragments need to be despatched, and those need to be strictly tile-aligned (because the render target is tiled). Though I also expect it to handle triangles in strict order.
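To make the coarse rasterisation step concrete, here's a rough sketch of how setup might work out which tiles a triangle touches. It's only my guess at the general shape of it: the 8x8 tile size, the function names and the corner-against-edge rejection test are all assumptions, not anything ATI has documented. The test is conservative, so it can keep a tile the triangle only grazes, but it never drops one that's covered.

```python
from math import floor

TILE = 8  # assumed tile edge in pixels (8x8 = 64 pixels); hypothetical


def edge(a, b, p):
    """Signed area test: >= 0 when p is on the inner side of edge a->b
    for a counter-clockwise triangle."""
    return (b[0] - a[0]) * (p[1] - a[1]) - (b[1] - a[1]) * (p[0] - a[0])


def covered_tiles(v0, v1, v2, tile=TILE):
    """Return (tx, ty) for every tile the triangle may at least partially cover."""
    # Force a consistent winding so "inside" always means edge() >= 0.
    if edge(v0, v1, v2) < 0:
        v1, v2 = v2, v1
    edges = ((v0, v1), (v1, v2), (v2, v0))
    xs = (v0[0], v1[0], v2[0])
    ys = (v0[1], v1[1], v2[1])
    # Only tiles overlapping the triangle's bounding box are candidates.
    tx0, tx1 = floor(min(xs) / tile), floor(max(xs) / tile)
    ty0, ty1 = floor(min(ys) / tile), floor(max(ys) / tile)
    tiles = []
    for ty in range(ty0, ty1 + 1):
        for tx in range(tx0, tx1 + 1):
            x0, y0 = tx * tile, ty * tile
            corners = ((x0, y0), (x0 + tile, y0),
                       (x0, y0 + tile), (x0 + tile, y0 + tile))
            # Reject the tile if all four corners lie outside any single edge.
            if all(any(edge(a, b, c) >= 0 for c in corners) for a, b in edges):
                tiles.append((tx, ty))
    return tiles


small = covered_tiles((1, 1), (6, 1), (1, 6))    # fits inside one tile
large = covered_tiles((0, 0), (20, 0), (0, 20))  # spans several tiles
```

The list that comes back is the kind of thing I'd expect setup to hand to the rasteriser(s) along with the triangle data, in triangle order.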
One of the key questions that's still unanswered is whether a thread of fragments can refer to more than one triangle (e.g. 5 adjacent small triangles from a strip). I've long thought that the answer's no on ATI. But if that's so, there'd be little reason for the hierarchical-Z unit to be able to resolve quads of pixels (as I think it does), as this wouldn't save any pixel shading effort - though it would save texturing effort and RBE bandwidth if the quads are rejected. On the other hand, when two adjacent triangles share an edge, that generates two fragments per pixel along that edge - something that doesn't map to a straightforward 2D translation from pixel locations into strands in a thread. That seems problematic to me (because tile ID and pixel position within a tile would no longer be enough for RBE to know which pixel a fragment is destined for).
Can ATI's 16-fragments-per-clock rasteriser rasterise 4 triangles in one clock, if each only occupies a quad of pixels? Seems unlikely to me, as the rasteriser probably only works on 1 triangle's line equations per rasterisation-cycle.
One of the features of ATI's rasteriser is an optimisation for thin triangles. Rasterisation orientates itself to the horizontal or vertical depending on the alignment of the triangle, because the rasteriser wants to work on a minimum of 2 columns or 2 rows (since pixels need to be quad-aligned). This implies that rasterisation within a tile doesn't blindly run over all the pixels in the tile; it only processes the portion of a triangle that fits within the current tile, in its entirety, before moving on. This might only amount to, say, 27 fragments in a tile of 64 pixels, which at 16 fragments per clock would take 2 rasterisation clocks (as long as the triangle only occupies a maximum of either 4 rows or 4 columns).
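A quick way to sanity-check the 27-fragment example is to model each rasterisation clock as one 2-pixel-wide band swept across an 8-pixel tile (2x8 = 16 fragments), with the rasteriser picking whichever orientation needs fewer bands. That band model and the name `raster_clocks` are my assumptions, purely for illustration:

```python
def raster_clocks(pixels):
    """Clocks to rasterise one triangle's pixels within a single 8x8 tile.

    Assumes one quad-aligned 2x8 (or 8x2) band per clock and that the
    rasteriser picks the orientation needing fewer bands (hypothetical model).
    """
    quad_rows = {y // 2 for _, y in pixels}  # 2-row bands touched
    quad_cols = {x // 2 for x, _ in pixels}  # 2-column bands touched
    return min(len(quad_rows), len(quad_cols))


# A triangle fragment covering 4 rows of the tile, 27 pixels in total.
row_widths = [8, 8, 7, 4]
pixels = [(x, y) for y, w in enumerate(row_widths) for x in range(w)]

fragment_count = len(pixels)    # 27
clocks = raster_clocks(pixels)  # 2 horizontal bands vs 4 vertical ones
```

Under this model the 27 fragments fit in 2 clocks only because they sit within 4 rows; spread the same pixel count over 5 or more rows and columns and it would cost a third clock.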
So two rasterisers would have higher throughput than a single rasteriser with the same total rasterisation rate, if there are any triangles that don't fully occupy a rasteriser's capability on each clock. That would be quite common, generally, as a triangle (or portion of a triangle in the current tile) that's small might result in only, say, 9 fragments produced by a 32-rasteriser (wasting 23 rasterisation ops), whereas two 16-rasterisers would only waste 7 rasterisation ops on this one triangle.
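The arithmetic is easy to play with. This is a toy model only, assuming each rasteriser works on one triangle per clock and that two rasterisers are fed triangles in order, each triangle going to whichever unit frees up first (the distribution scheme is my assumption; the function names are made up):

```python
from math import ceil


def wasted_ops(fragments, width):
    """Rasterisation slots left idle when one triangle uses a `width`-wide unit."""
    clocks = ceil(fragments / width)
    return clocks * width - fragments


def clocks_single(frag_counts, width=32):
    """Total clocks for one rasteriser handling triangles strictly in order."""
    return sum(ceil(n / width) for n in frag_counts)


def clocks_dual(frag_counts, width=16):
    """Total clocks for two rasterisers; each triangle goes to the less busy unit."""
    busy = [0, 0]
    for n in frag_counts:
        unit = busy.index(min(busy))
        busy[unit] += ceil(n / width)
    return max(busy)


waste_32 = wasted_ops(9, 32)  # 23
waste_16 = wasted_ops(9, 16)  # 7

small_tris = [9, 9, 9, 9]
single = clocks_single(small_tris)  # 4 clocks on one 32-rasteriser
dual = clocks_dual(small_tris)      # 2 clocks on two 16-rasterisers
```

Note the gain vanishes once triangles fill the wider unit: a 64-fragment triangle takes 2 clocks on the 32-rasteriser but 4 on a single 16-rasteriser.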
But I suspect adjacent small triangles can't go into a single thread of 64 fragments. A lot of the time small triangles will be adjacent (and their fragments would want to share a thread), and so no effective speed-up will be seen. Triangles that are larger (e.g. >32 and <64 pixels) are going to generate a speed-up, since the chances of adjacent triangles of this size falling within the same thread fall off. But triangles of such a size don't match the expectation: "tessellation generates huge numbers of small triangles!!!"
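To put numbers on the thread side of it, here's a small sketch comparing thread counts with and without a one-triangle-per-thread limit. The limit itself is my suspicion rather than a known fact, the sketch ignores tile alignment entirely, and the function names are invented:

```python
from math import ceil

THREAD_SIZE = 64  # fragments per thread (16 quads)


def threads_one_triangle_each(frag_counts):
    """Threads needed if a thread may only hold one triangle's fragments."""
    return sum(ceil(n / THREAD_SIZE) for n in frag_counts)


def threads_shared(frag_counts):
    """Threads needed if adjacent triangles could share a thread (tile alignment ignored)."""
    return ceil(sum(frag_counts) / THREAD_SIZE)


def occupancy(frag_counts, threads):
    """Fraction of thread slots that carry a real fragment."""
    return sum(frag_counts) / (threads * THREAD_SIZE)


strip = [9] * 5  # five adjacent small triangles from a strip

separate = threads_one_triangle_each(strip)  # 5 threads
shared = threads_shared(strip)               # 1 thread
occ_separate = occupancy(strip, separate)    # 45 / 320
occ_shared = occupancy(strip, shared)        # 45 / 64
```

With five 9-fragment triangles the limited case launches five mostly empty threads, so however quickly the rasterisers produce those fragments, the shader side is where the small-triangle cost shows up.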
You could say that a thread size of 64 and a limit of 1 triangle's fragments per thread (if true) are the key limitations here. So I'm dubious that dual rasterisers were made to increase performance, per se. I think it might be a matter of practicality in instancing a block of hardware rather than re-jigging things for 32-rasterisation. I don't think the number 32 is problematic in itself (since other ATI GPUs have 4-, 8- and 12-rasterisers), merely that scaling isn't free of latency/pipelining issues across the entire width of the unit.
Jawed