The R3xx and the R4xx have a rather interesting way of tiling things. Our setup unit sorts primitives into tiles, based on their area coverage. Some primitives fall into one tile, some into a few, some cover lots. Each of our backend pixel pipes is given tiles of work. The tiles themselves are programmable in size (well, powers of two), but, surprisingly, we haven't found that changing their size changes performance that much (within reason). That's most likely because, with high-res displays, most primitives are large. There is a performance sweet spot at 16, and that hasn't changed in some time. Even the current X800 uses 16, though I think we need to revisit that at some point in the future. Different tile sizes might benefit things on a per-application basis; it's on our long list of things to investigate.
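To make the binning concrete, here is a rough sketch of how a primitive's screen-space bounding box gets carved into 16x16 tiles and how each tile could be handed to a pipe. The four-pipe count and the checkerboard tile-to-pipe mapping are assumptions for illustration, not the actual R3xx/R4xx scheme.

```python
# Rough sketch: bin a primitive's bounding box into 16x16 tiles and map each
# tile to a pixel pipe. The 4-pipe count and the 2x2 checkerboard mapping are
# illustrative assumptions, not the actual hardware scheme.

TILE_SIZE = 16   # pixels per tile edge (a power of two, per the post)
NUM_PIPES = 4    # hypothetical number of backend pixel pipes

def tiles_covered(bbox):
    """Yield (tx, ty) tile coordinates overlapped by a bounding box (x0, y0, x1, y1)."""
    x0, y0, x1, y1 = bbox
    for ty in range(y0 // TILE_SIZE, y1 // TILE_SIZE + 1):
        for tx in range(x0 // TILE_SIZE, x1 // TILE_SIZE + 1):
            yield tx, ty

def owning_pipe(tx, ty):
    """Assumed 2x2 checkerboard mapping of tiles to the 4 pipes."""
    return ((ty & 1) << 1) | (tx & 1)

# A tiny triangle lands on one pipe; a big one fans out across all of them.
for bbox in [(5, 5, 12, 12), (0, 0, 300, 200)]:
    owners = sorted({owning_pipe(tx, ty) for tx, ty in tiles_covered(bbox)})
    print(bbox, "-> pipes", owners)
```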
Anyway, each pipe has huge load-balancing FIFOs on its inputs, matched to the tiles it owns. Each pipe is fully MIMD and can operate on different polygons; in fact, one pipe can be hundreds of polygons away from another. The downside of that is reduced memory coherence between the different pipes. Increasing the tile size would improve coherence, but would also require larger load-balancing FIFOs. Our current setup seems reasonably optimal, but reviewing that, performance-wise, is on the list of things to do at some point. We've artificially lowered the size of our load-balancing FIFOs and never noticed a performance difference, so we feel that, for current apps at least, we are well over-designed.
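Below is a toy model of that effect: once work has been binned into per-pipe FIFOs, each pipe drains its own queue independently, so after the same number of cycles the pipes end up working on primitives far apart from one another. The pipe count, per-tile costs, and cycle budget are made-up numbers, and FIFO depth/backpressure isn't modelled.

```python
from collections import deque
import random

# Toy model of per-pipe load-balancing FIFOs: work is binned to pipes up
# front, and each pipe drains its own FIFO independently (MIMD), so the
# primitive each pipe is working on drifts away from the others.
# All numbers here are illustrative, not hardware figures.

NUM_PIPES = 4
random.seed(1)
fifos = [deque() for _ in range(NUM_PIPES)]

# Each entry is (primitive_id, shading_cost_in_cycles) for one tile of work.
for prim in range(2000):
    pipe = random.randrange(NUM_PIPES)       # stand-in for the tile->pipe map
    fifos[pipe].append((prim, random.randint(1, 64)))

# Run every pipe for the same cycle budget and see which primitive it is on.
current = []
for fifo in fifos:
    budget, prim = 4000, 0
    while fifo and budget > 0:
        prim, cost = fifo.popleft()
        budget -= cost
    current.append(prim)

print("primitive each pipe is on:", current)
print("spread between pipes:", max(current) - min(current), "primitives")
```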
In general, we have no issues keeping all our units busy with current polygon loads. I could imagine that if you drew single-pixel triangles into one tile over and over, performance could drop due to the tiling, but memory efficiency would shoot up, so it's unclear that overall performance would actually be hurt. The distribution of load across all these tiles is pretty much ideal for all the cases we've tested. Super tiling is built on top of this, to distribute work across multiple chips.
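As a hedged sketch, super tiling can be pictured as a second, coarser level of the same idea: large super tiles alternate between chips, and each chip's pipes then consume the 16x16 tiles inside them. The two-chip setup, the 64-pixel super tile, and both checkerboard mappings are illustrative assumptions only.

```python
# Sketch of super tiling layered on top of per-pipe tiling. The screen is
# first carved into large "super tiles" that alternate between chips; each
# chip then splits its super tiles into the 16x16 tiles its pipes consume.
# Chip count, super tile size, and the mappings are assumptions.

TILE_SIZE = 16
SUPER_TILE_SIZE = 64   # hypothetical; a multiple of TILE_SIZE
NUM_CHIPS = 2

def owning_chip(x, y):
    """Which chip renders pixel (x, y), by super tile checkerboard."""
    sx, sy = x // SUPER_TILE_SIZE, y // SUPER_TILE_SIZE
    return (sx + sy) % NUM_CHIPS

def owning_pipe(x, y):
    """Which pipe within that chip owns the pixel's 16x16 tile."""
    tx, ty = x // TILE_SIZE, y // TILE_SIZE
    return ((ty & 1) << 1) | (tx & 1)

for x, y in [(10, 10), (70, 10), (200, 130)]:
    print((x, y), "-> chip", owning_chip(x, y), "pipe", owning_pipe(x, y))
```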
As well, just like other vendors, we have advanced sequencers that distribute ALU workloads to our units, allocate registers, and sequence all the operations that need to be done, in a dynamic way. That's really a basic requirement of doing shader processing. It is rarely the issue for performance.
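One way to picture part of what that dynamic register allocation has to balance (purely as a generic illustration, with hypothetical numbers rather than R3xx/R4xx figures): the more registers a shader needs, the fewer pixels can be kept in flight to hide texture latency.

```python
# Generic illustration of the register-allocation trade-off: a fixed register
# file shared by in-flight pixels means heavier shaders leave fewer pixels
# available to cover texture fetch latency. The budget below is hypothetical.

REGISTER_FILE_ENTRIES = 256    # hypothetical per-pipe register budget

def pixels_in_flight(regs_per_pixel):
    return REGISTER_FILE_ENTRIES // regs_per_pixel

for regs in (2, 4, 8, 16):
    print(f"{regs:2d} registers/pixel -> {pixels_in_flight(regs):3d} pixels in flight")
```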
Performance in modern apps is still very much texture-fetch bound (cache efficiency, memory efficiency, filter types), as well as partially ALU/register-allocation bound. There are huge performance differences possible depending on how you deal with texturing and texture fetches. Even Shadermark, if I recall correctly, ends up being texture bound in many of its cases, and it's very hard to draw any conclusions about ALU performance from it. I know we've spent plenty of time in our compiler generating various different forms of a shader, only to discover that ALU and register counts don't matter as much as texture organization. There are no clear generalizable solutions. Work goes on.