You need to think of current GPUs as consisting of pixel pipelines that are able to hide the somewhat random latency induced by texture mapping.
The traditional approach is to make the pipeline really long (e.g. 220 clocks in NV4x and G70) so that the time taken to issue a texture mapping request and get back a result can be hidden.
A pipeline consists of stages required to:
- fetch the operands for an instruction (or for the co-issued instructions)
- swizzle and organise operands for co-issue
- execute the instruction(s) (or request a texture and, optionally, complete the instruction)
- write the results of the instruction(s) back to the register file
Requesting a texture requires:
- calculation of the required texels' addresses
- fetching the texels (which may require a fetch from memory, or they may be in cache)
- a "wait" period to allow the texels time to be fetched
- filtering
- writing the results back to the register file (and/or providing them back to the pixel pipeline for immediate processing)
So a requested texture comes back "just in time" for the pipeline to use it.
The pipeline's overall length is the sum of these stages, with the "texel fetch wait" period being designed (presumably) as some kind of "typical" worst-case. More complex texture mapping (multi-texturing and/or trilinear/anisotropic filtering) requires multiple extra fetches and filtering steps to be performed, beyond the limits of the "worst-case wait". This is where things get really fuzzy for me, but it doesn't really affect the point of what I'm saying.
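To put rough numbers on that, here's a toy sketch in Python. The individual stage costs are invented for illustration - only the ~220-clock overall figure comes from the NV4x/G70 numbers above:

```python
# Toy model of the "long pipeline" approach: the texel-fetch "wait" stages
# are simply part of the overall pipeline length.
# Individual stage counts are made up; only the ~220 total is the real figure.

ALU_STAGES = {
    "operand_fetch": 4,    # fetch operands from the register file (hypothetical)
    "swizzle_coissue": 2,  # organise operands for co-issue (hypothetical)
    "execute": 8,          # execute the instruction(s) (hypothetical)
    "writeback": 2,        # write results back to the register file (hypothetical)
}

TEX_STAGES = {
    "address_calc": 8,     # work out the required texels' addresses (hypothetical)
    "fetch": 16,           # issue the cache/memory accesses (hypothetical)
    "wait": 170,           # "typical worst-case" wait for the texels (hypothetical)
    "filter": 8,           # bilinear/trilinear/aniso filtering (hypothetical)
    "writeback": 2,        # hand the result back to the pixel pipeline (hypothetical)
}

pipeline_length = sum(ALU_STAGES.values()) + sum(TEX_STAGES.values())
print(pipeline_length)     # 220 with these made-up stage costs
```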
A pipeline normally processes 4 pixels at a time, because this makes for nice coherent accesses to memory to read texels and it makes for nice coherent computation of bilinear (or better) filtering. So right there you get the basic unit of a "batch": 4 pixels per clock x 220 stages = 880 pixels.
When a GPU is sized up to process 16 or 24 pixels at a time, you can simply multiply the number of quads that are all running the same instruction in parallel, from the original 1 quad up to 6, say. 6 quads would make a batch size of 5280 pixels. NVidia actually only went as far as 4 quads though, in NV40-45.
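Spelling that arithmetic out (pixels per quad and pipeline depth as above):

```python
# Batch size = pixels per quad x quads in lock-step x pipeline depth in clocks.
PIXELS_PER_QUAD = 4
PIPELINE_DEPTH = 220       # NV4x / G70 figure from above

def batch_size(quads):
    return PIXELS_PER_QUAD * quads * PIPELINE_DEPTH

print(batch_size(1))       # 880  - the single-quad case
print(batch_size(4))       # 3520 - NV40-45, 4 quads in lock-step
print(batch_size(6))       # 5280 - a hypothetical 6-quad design
```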
In NV47 (G70) NVidia made each of the quads independent of the others. The primary effect is that texturing in each quad can run "out of step" with its neighbours. There's a patent about issuing quads out of step with each other; the effect is to even out the load on memory and reduce the worst-case latencies when texturing. It also reduces the size of a batch, which obviously makes dynamic branching more granular (as compared with NV45, say, where a batch consists of 4 quads in lock-step, 3520 pixels).
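A trivial illustration of the out-of-step idea (the shader's TEX slot and the start offsets are invented): if every quad hits its TEX instruction on the same clock the memory system sees a burst of requests, whereas staggered quads spread them out:

```python
# Lock-step quads all issue their texture requests on the same clock;
# out-of-step quads spread the requests over time, evening out memory load.
# The TEX slot and the start offsets are made up for illustration.

SHADER_TEX_SLOT = 3        # clock within the shader at which TEX is issued (hypothetical)

def tex_request_clocks(quad_start_offsets):
    return sorted(start + SHADER_TEX_SLOT for start in quad_start_offsets)

print(tex_request_clocks([0, 0, 0, 0]))   # lock-step: [3, 3, 3, 3] - a burst of requests
print(tex_request_clocks([0, 2, 4, 6]))   # out of step: [3, 5, 7, 9] - load spread out
```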
ATI designed its pixel pipeline a bit differently. The idea is that a lot of the time a texture fetch isn't needed immediately by the pixel pipeline - instead it's needed two, three or more instructions later. So texturing is performed "asynchronously" and the pixel pipeline tries to continue processing succeeding instructions while the texture results are being produced. It doesn't always work out, which is when you get stalls.
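A sketch of that (the latency and the instruction stream are invented): the TEX unit is handed the request and the ALU pipeline keeps issuing the following instructions; the stall only happens if the result is consumed before it's ready:

```python
# "Asynchronous" texturing: keep executing succeeding ALU instructions while
# the texture result is produced; stall only when it's used too soon.
# The latency and the shader are invented for illustration.

TEX_LATENCY = 5            # clocks until the texture result is available (hypothetical)

shader = [                 # (instruction, needs the texture result?)
    ("TEX r0, t0",         False),  # issue the fetch
    ("MUL r1, r2, r3",     False),  # independent ALU work...
    ("ADD r1, r1, r4",     False),  # ...carries on in the meantime
    ("MAD r2, r1, r0, r5", True),   # first use of the texture result
]

clock = 0
tex_ready_at = 0
for instr, needs_tex in shader:
    if instr.startswith("TEX"):
        tex_ready_at = clock + TEX_LATENCY
    if needs_tex and clock < tex_ready_at:
        print(f"clock {clock}: stall for {tex_ready_at - clock} clocks waiting on the texture")
        clock = tex_ready_at
    print(f"clock {clock}: {instr}")
    clock += 1
```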
Now the size of a batch can be smaller (e.g. 64 quads = 256 pixels) because the pipeline only needs to be long enough for non-texturing work - it's now an ALU-only pipeline, in effect. The latency-hiding "wait" stages are no longer part of the overall pipeline length; instead, waiting is the responsibility of the separate TEX pipeline. (The size of a batch in ATI's R3xx GPUs was fixed, seemingly at 256, but the size of a batch in R4xx GPUs can be less or more.)
In R5xx, ATI changed the architecture so that the pixel pipeline no longer works on a single batch until all the instructions of the shader are executed (ALU or TEX). Texturing latency is now hidden not just by hopefully executing two or three succeeding instructions whilst the texture operation is performed, but by executing instructions for other pixels that aren't even in the same batch.
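A rough sketch of that scheduling idea (batch count, latency and instruction mix all invented): when the batch at the head of the queue is waiting on a texture, the pipeline issues from another batch instead of stalling:

```python
# R5xx-style latency hiding: when one batch is waiting on its texture
# result, issue ALU instructions from a different batch instead.
# The batches, latency and instruction mix are invented for illustration.

from collections import deque

TEX_LATENCY = 6                         # clocks (hypothetical)

batches = {                             # each batch is a queue of instructions
    "A": deque(["TEX", "MUL", "ADD"]),
    "B": deque(["MUL", "TEX", "MAD"]),
}
wait_until = {"A": 0, "B": 0}

clock = 0
order = deque(batches)
while any(batches.values()):
    # Pick a batch that still has work and isn't waiting on a texture.
    runnable = [b for b in order if batches[b] and clock >= wait_until[b]]
    if not runnable:
        clock += 1                      # every live batch is waiting: a genuine stall
        continue
    batch = runnable[0]
    order.rotate(-1)                    # rotate the priority so the batches take turns
    instr = batches[batch].popleft()
    print(f"clock {clock}: batch {batch}: {instr}")
    if instr == "TEX":
        wait_until[batch] = clock + TEX_LATENCY
    clock += 1
```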
The problem now is how short can you make the ALU pipeline? You still have to spend cycles fetching from the register file, organising co-issue etc. With the pipeline lengthened by simultaneously working on multiple batches, it's now a matter of how many batches can be supported. Each batch requires separate instruction decode and each batch will require a different fetch/store access-pattern in the register file (which means increased latency). So as each batch is added to the design, the complexity of the pipeline increases - extra transistors.
Additionally it's difficult to have the pipeline work on different batches on each succeeding clock. So ATI has settled on 4 clocks per batch. In R520 and RV515 this means 16 pixels in a batch: 4 clocks x 4 pixels (1 quad per clock). In R580 and RV530 three quads are processed by a pipeline simultaneously, so you have 4 clocks x 12 pixels = 48 pixels in a batch (though the texturing pipeline is only 1 quad wide, so the 3 ALU quads have to take it in turns to request texturing). That's the result of ATI's desire to create an architecture where 3 ALU instructions are processed for each TEX instruction.
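Same arithmetic as before, with the R5xx scheme:

```python
# R5xx batch size = clocks per batch x ALU quads per clock x pixels per quad.
CLOCKS_PER_BATCH = 4
PIXELS_PER_QUAD = 4

def r5xx_batch_size(alu_quads_per_clock):
    return CLOCKS_PER_BATCH * alu_quads_per_clock * PIXELS_PER_QUAD

print(r5xx_batch_size(1))   # 16 - R520 / RV515, 1 quad wide
print(r5xx_batch_size(3))   # 48 - R580 / RV530, 3 ALU quads
```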
Ultimately it's a question of spending transistors. Currently it doesn't make sense to texture in units smaller than 1 quad (coherency is lost - smaller TMUs amounting to the same overall texturing capability will incur more latency and use more transistors), so that determines the minimum width of a pixel pipeline. Then you have the turn-around time for a pipeline: the minimum number of stages in which operands can be fetched, organised, executed-upon and stored versus the number of batches you're willing to put into the pipeline. As you increase the number of batches to reduce the number of pixels per batch, you create overheads in terms of supporting these multiple contexts.
Theoretically you could create a pipeline that only spends 1 clock on each batch (instead of the current 4 in R5xx), but the complexity of the pipeline would be immense. And you'd also have to increase texture cache sizing and complexity to account for an increase in the turnover of batches requesting textures, hence much lower cache coherency. Though this should be mitigated by the batches themselves being coherently scheduled (at least some of the time).
So, overall, batch size is a compromise of hiding the latency of texture mapping versus the complexity of pipeline design to support multiple batches. The support for dynamic branching in R5xx and Xenos comes directly out of the ability to support multiple batches per pipeline - the granularity of branching is really down to the overheads incurred in supporting multiple batches.
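As a footnote on why batch size is what sets the branching granularity: all the pixels in a batch share one instruction stream, so if even one pixel takes the expensive side of a branch the whole batch runs those instructions. A toy model (the costs and the 1% figure are invented, and it assumes branch outcomes are independent rather than screen-coherent, which understates the benefit):

```python
# Toy model of dynamic branching granularity: a batch shares one instruction
# stream, so it pays for the expensive branch path if *any* of its pixels
# takes it. Costs, the 1% figure and the independence assumption are made up.

CHEAP_COST = 4              # instructions on the cheap path (hypothetical)
EXPENSIVE_COST = 40         # instructions on the expensive path (hypothetical)
EXPENSIVE_FRACTION = 0.01   # fraction of pixels taking the expensive path (hypothetical)

def expected_cost_per_pixel(batch_size):
    # Probability that at least one pixel in the batch takes the expensive path.
    p_any = 1 - (1 - EXPENSIVE_FRACTION) ** batch_size
    return CHEAP_COST + p_any * EXPENSIVE_COST

for pixels in (16, 48, 256, 3520):      # R520, R580, R3xx, NV45 lock-step (figures from above)
    print(pixels, round(expected_cost_per_pixel(pixels), 1))
```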
Jawed