http://www.beyond3d.com/forum/viewtopic.php?p=389402#389402
Anyway, each pipe has huge load balancing fifos on their inputs, that match up to the tiles that they own. Each pipe is a full MIMD and can operate on different polygons, and, in fact, can be hundreds of polygons off from others. The downside of that is memory coherence of the different pipes. Increasing tile size would improve this, but also requires larger load balancing. Our current setup seems reasonably optimal, but reviewing that, performance wise, is on the list of things to do at some point. We've artificially lowered the size of our load balancing fifos, and never notice a performance difference, so we feel, for current apps, at least, that we are well over-designed.
I interpret the "huge load balancing fifos" to imply that the queues of raster-sectioned triangles ready for the pixel shaders to work on are, collectively, "huge".
Eric sorta seems to imply that a quad owns a set of tiles: "Anyway, each pipe has huge load balancing fifos on their inputs, that match up to the tiles that they own."
This would translate into hundreds of entries per quad, with each quad having a separate FIFO. The number of tiles in each quad's FIFO depends on the overall density of triangles across tiles...
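To make the "one FIFO per quad" idea concrete, here's a rough sketch in Python. It is only my guess at the shape of the thing: the 16x16 tile size is the figure I quote further down, but NUM_QUADS and the interleaved owner_quad() rule are made up for illustration, since the real ownership pattern hasn't been published.

```python
from collections import deque

TILE_SIZE = 16   # pixels per tile edge (the 16x16 figure mentioned in this post)
NUM_QUADS = 4    # assumption: a 4-quad (16-pipe) part

def tile_of(x, y, tile_size=TILE_SIZE):
    """Return the (column, row) of the tile containing pixel (x, y)."""
    return (x // tile_size, y // tile_size)

def owner_quad(tile, num_quads=NUM_QUADS):
    """Hypothetical ownership rule: interleave tiles across quads.
    The real hardware pattern is not public; this is illustration only."""
    tx, ty = tile
    return (tx + ty) % num_quads

def enqueue(fifos, triangle_id, tiles):
    """Push one FIFO entry per covered tile onto the owning quad's FIFO."""
    for tile in tiles:
        fifos[owner_quad(tile)].append((triangle_id, tile))

# One FIFO per quad; each quad drains its own queue independently.
fifos = [deque() for _ in range(NUM_QUADS)]
enqueue(fifos, "tri0", [(0, 0), (1, 0), (0, 1)])  # a triangle spanning three tiles
enqueue(fifos, "tri1", [(2, 0)])                  # a small triangle inside one tile
depths = [len(f) for f in fifos]                  # entries waiting per quad
```

Because each quad only ever pops from its own deque, one quad can be many triangles ahead of or behind another, which is the behaviour Eric describes.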
The best-case memory access pattern arises when triangles lie entirely within a single tile. This is because only a single quad's L1 cache is consumed by these triangles' textures, rather than having these textures appear multiple times in separate quads' caches. This appears to be what Eric is referring to when he says:
I could imagine that if you did single pixel triangles in one tile over and over, that performance could drop due to tiling, but memory efficiency would shoot up, so it's unclear that performance overall would be hurt.
Since you can only rasterise triangles once you've "worked them all out" (i.e. found the edges of all the triangles and worked out depth), a fair amount of geometry work has to be completed before shading can start, hence the queue. By dividing the frame into tiles, the setup engine (working with the Hierarchical Z unit) works out the rasterisation for every triangle that falls into a given tile, and once a tile is completely rasterised it can be put into the tile queue.
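As a rough illustration of the "which tiles does this triangle fall into" step, here's a small Python sketch. It's a simplification on my part: it uses the triangle's bounding box, so it can over-report tiles, it ignores Hierarchical Z entirely, and it assumes non-negative screen coordinates. The function name is mine, not ATI's.

```python
def tiles_touched(vertices, tile_size=16):
    """Conservatively list the tiles a triangle may cover.

    vertices: three (x, y) screen-space points, assumed non-negative.
    Uses the bounding box, so some listed tiles may contain no covered pixels.
    """
    xs = [x for x, _ in vertices]
    ys = [y for _, y in vertices]
    tx0, tx1 = int(min(xs)) // tile_size, int(max(xs)) // tile_size
    ty0, ty1 = int(min(ys)) // tile_size, int(max(ys)) // tile_size
    return [(tx, ty)
            for ty in range(ty0, ty1 + 1)
            for tx in range(tx0, tx1 + 1)]

small = tiles_touched([(2, 2), (10, 3), (5, 12)])      # fits inside one tile
large = tiles_touched([(10, 10), (40, 12), (12, 20)])  # spans several tiles
```

The small triangle lands in a single tile (so a single quad), while the larger one is split across several tiles and therefore potentially several quads.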
Page 9 is quite explicit about this, now that I've had a rummage:
http://www.ati.com/products/radeonx800/RADEONX800ArchitectureWhitePaper.pdf
The Setup Engine passes each quad pipeline a tile containing part of the current triangle being rendered.
I dare say that page is quite convincing that a tile is owned by a quad. So, together with knowing that a tile is currently 16x16 pixels, it seems conclusive to me that my diagrams hold.
Additionally, if you perform a simple round-robin tile allocation using equal-sized tiles, then each card's set-up engine knows which tiles it's going to work on, without having to communicate this to the other card. So you have a simple mechanism that allows each card to shade pixels in proportion to its quad capacity.
For example, a 2-quad card will get only one third of the tiles if it's working with a 4-quad card. Neither card has to negotiate with the other over which tiles are its own.
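Here's a minimal Python sketch of how that could work without any chatter between the cards. It's my own assumption about the scheme (slots in the round-robin are handed out in proportion to quad count, and tiles are numbered linearly); card_for_tile() is a made-up name, not anything from ATI.

```python
def card_for_tile(tile_index, quads_per_card):
    """Return the index of the card that owns tile `tile_index`.

    Each card gets as many consecutive slots per round as it has quads.
    The function is deterministic, so every card can evaluate it locally
    and reach the same answer without talking to the other card.
    """
    slot = tile_index % sum(quads_per_card)
    for card, quads in enumerate(quads_per_card):
        if slot < quads:
            return card
        slot -= quads

cards = [2, 4]                                   # a 2-quad card paired with a 4-quad card
owners = [card_for_tile(i, cards) for i in range(600)]
share_small = owners.count(0) / len(owners)      # fraction of tiles on the 2-quad card
```

With these numbers the 2-quad card ends up with 200 of the 600 tiles, i.e. the one third mentioned above.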
Alternatively, in the E&S system, the level of AA required (beyond 6x) determines how many cards share a tile, each card rendering a different AA sampling pattern on the tile.
Of course, if anyone can persuade Eric to give us a more definitive insight, that would be 8)
Jawed