I think I'd call what you're describing "distributed scheduling" and not locality in the traditional sense of the word, at least with respect to the memory hierarchy.
But the point is that this isn't just about memory, it's about three asynchronous pipelines that feed off memory (or on-die memories): ALU, TMU and ROP (four if you include early-Z). These pipelines have been asynchronous since R300, with independent queues etc.
Those pipelines are basically subdividing and distributing work according to a fixed mapping function (if I read you correctly) that maps the fragment at X,Y to group N of resources. Yeah, you don't need those databits to be capable of going to any of N GPU resources (e.g. whichever one is free to do more work), but that to me is an issue of routing streamed data around the chip, not necessarily one of ensuring cache locality, memory locality or FIFO locality.
Bingo: Dave and I were discussing why you don't want to route data around the chip.
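To make that concrete, here's a minimal sketch (plain C) of the sort of fixed coordinate-to-resource mapping being described. The 16x16 tile size and the simple round-robin pattern are my own assumptions for illustration - the real R300/R420 mapping isn't public:

[code]
/* Fixed screen-space mapping: fragment (x,y) -> quad pipe.
 * Assumes 16x16-pixel tiles dealt out round-robin across 4 quad pipes;
 * illustrative only, not ATI's actual pattern. */
#include <stdio.h>

#define TILE_SIZE  16
#define NUM_PIPES  4

static unsigned pipe_for_fragment(unsigned x, unsigned y)
{
    unsigned tx = x / TILE_SIZE;      /* tile column */
    unsigned ty = y / TILE_SIZE;      /* tile row    */
    return (tx + ty) % NUM_PIPES;     /* fixed, data-independent routing */
}

int main(void)
{
    /* Every fragment inside a given tile lands on the same pipe, so there's
     * no need for a crossbar to route post-rasterisation work around the chip. */
    printf("fragment (100, 35) -> pipe %u\n", pipe_for_fragment(100, 35));
    printf("fragment (101, 36) -> pipe %u\n", pipe_for_fragment(101, 36));
    return 0;
}
[/code]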
IMHO, what's important is that the "kernel" of data being passed around is "local" and not dependent on other packets of data, not that the packets themselves are scheduled by a fixed distribution. Seems to me to be just like the arguments over other network architectures: time division/reservation vs collision-detect/queue. There are arguments pro/con to each. Sounds to me like the Rxxx argument is based on saving transistors, optimizing chip layout, and avoiding more complex data routing.
Well, without incredibly advanced simulators we're ultimately in the dark about this kind of architecture versus one that schedules all fragments/pixels uniformly.
These GPUs appear to have a single-tier cache architecture: L1 solely at the fragment level. And I guess the colour buffer and z/stencil buffer caches are also simpler to implement.
I don't necessarily think that dividing the screen up into W x W chunks and mapping each chunk to a specific resource based on its coordinates, versus putting the chunks into multiple work queues and letting chip resources dequeue work as needed, will ultimately determine performance. It's just a choice of scheduling algorithm, and both master-worker and deterministic scheduling have their tradeoffs.
Maybe the relevant patents provide some metrics/motivations that are convincing...
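For what it's worth, here's a toy contrast of those two scheduling choices, just to pin down the terminology - neither function is claimed to reflect any actual hardware:

[code]
/* Deterministic vs master-worker scheduling of tiles, as a toy contrast.
 * Purely illustrative; neither reflects any vendor's actual design. */
#include <stdatomic.h>

#define NUM_PIPES 4

/* Deterministic: the tile's coordinates alone decide which pipe shades it,
 * so no arbitration and no routing of work to arbitrary pipes. */
static unsigned schedule_static(unsigned tile_x, unsigned tile_y)
{
    return (tile_x + tile_y) % NUM_PIPES;
}

/* Master-worker: whichever pipe is free pulls the next tile off a shared
 * queue. Load balances better in pathological cases, but the work now has
 * to be routed to whichever pipe happened to grab it. */
static _Atomic unsigned next_tile;

static unsigned schedule_dynamic(unsigned total_tiles)
{
    unsigned t = atomic_fetch_add(&next_tile, 1);
    return (t < total_tiles) ? t : (unsigned)-1;   /* (unsigned)-1 => no work left */
}
[/code]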
All we can say is that as the flexibility (programmability) of shader pipelines increases, along with the sheer count of them, the desire for efficient fine-grained scheduling increases. Some kind of distributed scheduling becomes more and more important. R300 etc. packetise by screen-space. G80 may well packetise by batch-ID. Who knows?...
I am speaking from ignorance of the details of the mapping, but it seems to me that any screen-space mapping would also have pathological cases where (depending on the "pattern" used in the map) you could get an uneven distribution of work. Of course, one would try to design it so that the statistical majority of cases ends up with a uniform distribution (e.g. a hash function with avalanche behaviour on the coordinates).
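A sketch of what I mean by hashing the coordinates - this uses MurmurHash3's 32-bit finaliser purely as an example of an avalanching mix; I'm not suggesting any shipping GPU does this:

[code]
/* Hash-based tile distribution: mix the tile coordinates through an
 * avalanching integer finaliser (MurmurHash3's) so adjacent tiles are
 * spread statistically evenly across pipes. Illustrative only. */
#include <stdint.h>

#define NUM_PIPES 4

static uint32_t mix32(uint32_t h)
{
    h ^= h >> 16;
    h *= 0x85ebca6bu;
    h ^= h >> 13;
    h *= 0xc2b2ae35u;
    h ^= h >> 16;
    return h;
}

static unsigned pipe_for_tile_hashed(uint32_t tile_x, uint32_t tile_y)
{
    /* Pack the coordinates, then rely on avalanche: flipping one bit of
     * either coordinate can flip any bit of the hash. */
    return mix32((tile_y << 16) | (tile_x & 0xffffu)) % NUM_PIPES;
}
[/code]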
The tiles are small. I've never seen dimensions stated, but they appear to be nominally 16x16 pixels - this size then determines the batch size. Since R420, the tiles can vary in size. Bigger tiles increase cache coherency. Smaller tiles reduce the number of "null" pixels that end up being uselessly shaded when a triangle doesn't entirely cover the tile (obviously that happens quite a lot, but it falls off as screen resolution is increased). I guess that R300 etc. can only shade one triangle per tile at any one time, so a four "quad pipeline" GPU such as R420 or R580 can pixel-shade up to four triangles simultaneously. Plainly, one triangle can cover hundreds of tiles.
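A quick back-of-the-envelope on the tile-size trade-off (tiles_touched is a made-up helper, and a bounding-box count obviously overstates coverage for thin triangles):

[code]
/* Count how many W x W tiles a triangle's screen-space bounding box touches.
 * Bigger W -> fewer tiles (better coherency); smaller W -> fewer "null"
 * pixels shaded in partially covered tiles. Hypothetical helper, not hardware. */
static unsigned tiles_touched(unsigned min_x, unsigned min_y,
                              unsigned max_x, unsigned max_y,
                              unsigned tile_w)
{
    unsigned tx0 = min_x / tile_w, tx1 = max_x / tile_w;
    unsigned ty0 = min_y / tile_w, ty1 = max_y / tile_w;
    return (tx1 - tx0 + 1) * (ty1 - ty0 + 1);
}

/* e.g. a triangle with bounding box (0,0)-(199,99):
 *   tiles_touched(0, 0, 199, 99, 16) = 13 * 7 = 91 tiles at 16x16
 *   tiles_touched(0, 0, 199, 99, 32) =  7 * 4 = 28 tiles at 32x32 */
[/code]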
I just think that the "tile" terminology is confusing. We have two existing "tile" nomenclatures out there, both of which refer to rasterization order (tile-based deferred renderers, and tiled scan conversion), so it's confusing to start talking about "tiled locality on the physical layout of the chip".
There are plenty of other tilings in computer graphics, e.g. textures are tiled across memory to maximise bandwidth utilisation.
You can even get non-rectangular tilings, such as this hexagonal render-target/texture tiling:
http://www.graphicshardware.org/previous/www_2005/presentations/bando-hexagonal-gh05.pdf
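And for the rectangular texture case above, the usual trick looks something like this generic block-linear addressing - small blocks of texels stored contiguously so a bilinear fetch tends to hit one DRAM burst. Block size and layout vary per GPU; this is just the general idea:

[code]
/* Generic block-linear texture addressing: 4x4-texel blocks stored
 * contiguously in memory. Assumes tex_width is a multiple of BLOCK.
 * Layout details differ per GPU; illustrative only. */
#include <stddef.h>

#define BLOCK 4                                /* 4x4 texels per block */

static size_t tiled_texel_offset(unsigned x, unsigned y,
                                 unsigned tex_width, size_t texel_bytes)
{
    unsigned blocks_per_row = tex_width / BLOCK;
    unsigned bx = x / BLOCK, by = y / BLOCK;   /* which block          */
    unsigned lx = x % BLOCK, ly = y % BLOCK;   /* texel inside block   */
    size_t block_index  = (size_t)by * blocks_per_row + bx;
    size_t texel_in_blk = (size_t)ly * BLOCK + lx;
    return (block_index * BLOCK * BLOCK + texel_in_blk) * texel_bytes;
}
[/code]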
The fact is that the distributed scheduling in R300 etc. is based upon screen-space tiling. It results in a physical locality of fragment/pixel processing which affects not just one type of memory access, but the entire workload post-rasterisation.
Jawed