Jawed, can you explain exactly why you would rather the batch size stay at 64? I really don't know much about that matter, just curious to learn why.
Code that uses dynamic branching pays a divergence penalty on any SIMD processor. Larger batches increase the chances of paying this penalty.
A simple example is a shadowing shader that wants to soften shadow edges. A pixel is either within the softening region (somewhere near the edge of the "hard shadow" before softening) or outside of that region. The shader uses an If statement to decide, which is "dynamic" because the result of this test varies across the screen. Let's say that when a pixel needs softening it takes 4x longer to compute that pixel.
Because pixels are lumped together in batches, if any one pixel in the batch needs softening then the other pixels in the batch are forced to come along for the ride - this is the penalty of SIMD. Those other pixels don't actually get softened, because the SIMD unit deactivates them while doing the softening calculations - they just sit there waiting for the slow path to finish.
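If it helps to see this in code, here's a minimal CUDA-style sketch of the idea (the kernel, the expensiveSoften helper and the shadowFactor input are all made up for illustration - this isn't anyone's actual shader):

[code]
__device__ float expensiveSoften(float s)
{
    // Stand-in for the roughly-4x-as-expensive softening work
    // (in a real shader: extra shadow-map taps, filtering, etc.).
    float acc = 0.0f;
    for (int i = 0; i < 16; ++i)
        acc += s * 0.0625f;
    return acc;
}

__global__ void shadowShade(const float* shadowFactor, float* out, int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height)
        return;

    int idx = y * width + x;
    float s = shadowFactor[idx];   // placeholder input: 0 = fully shadowed, 1 = fully lit

    float lit;
    // Dynamic branch: the outcome varies from pixel to pixel across the screen.
    if (s > 0.0f && s < 1.0f)
    {
        // "Soft" path: much more expensive.
        lit = expensiveSoften(s);
    }
    else
    {
        // "Hard" path: cheap.
        lit = s;
    }

    // If even one thread in a warp (the batch, in CUDA terms) takes the soft path,
    // the whole warp steps through the soft-path instructions; the threads that
    // didn't take the branch are masked off while that happens, so they gain
    // nothing from it but still have to wait for it.
    out[idx] = lit;
}
[/code]

CUDA calls the batch a warp, but the principle is the same whatever each vendor calls it.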
Imagine a shadow cast by a wire fence (contrived example but it'll do). If you're looking for soft areas in the shadow then you can see that with 4x4 pixel blocks (batch size 16) there'll be many blocks where there's no softening required. The "holes" in the shadow formed by the wire mesh are big enough that these small blocks fit. Blocks that touch or cross the shadow from the mesh will do softening.
When the block is bigger, say 8x8 (batch size 64), there's less chance that the holes in the shadow are big enough for these bigger blocks to squeeze in. Consequently more blocks (considered as a percentage of blocks required to cover the screen area) will run the "slow" shadow softening code.
So the overall effect of a bigger batch is that more pixels on the screen will "catch" on the "softening" test, even though the total number of pixels that need softening has not changed.
A simple way to imagine this is if the entire screen was rendered as one huge batch. Then all pixels would run the slow softening code even if only one pixel needed softening.
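To put rough numbers on the fence picture, here's a little host-side sketch (the fence pattern, band width and screen size are invented purely for illustration) that counts how many NxN blocks contain at least one pixel needing softening - i.e. how many batches end up paying for the slow path:

[code]
#include <cstdio>

// Made-up penumbra test: a band of "needs softening" pixels around the shadow
// edges of a wire mesh whose wires repeat every 32 pixels.
static bool needsSoftening(int x, int y)
{
    int px = x % 32;
    int py = y % 32;
    bool nearVerticalEdge   = (px >= 3 && px <= 8);
    bool nearHorizontalEdge = (py >= 3 && py <= 8);
    return nearVerticalEdge || nearHorizontalEdge;
}

// Fraction of blockSize x blockSize blocks that contain at least one pixel
// needing softening - i.e. the fraction of batches paying for the slow path.
static double fractionPayingSlowPath(int screenW, int screenH, int blockSize)
{
    int blocksX = screenW / blockSize;
    int blocksY = screenH / blockSize;
    int paying = 0;
    for (int by = 0; by < blocksY; ++by)
        for (int bx = 0; bx < blocksX; ++bx)
        {
            bool anySoft = false;
            for (int y = by * blockSize; y < (by + 1) * blockSize && !anySoft; ++y)
                for (int x = bx * blockSize; x < (bx + 1) * blockSize && !anySoft; ++x)
                    anySoft = needsSoftening(x, y);
            if (anySoft)
                ++paying;
        }
    return (double)paying / (blocksX * blocksY);
}

int main()
{
    // 4x4 blocks ~ batch of 16, 8x8 blocks ~ batch of 64.
    printf("4x4 blocks paying the slow path: %.1f%%\n", 100.0 * fractionPayingSlowPath(1024, 1024, 4));
    printf("8x8 blocks paying the slow path: %.1f%%\n", 100.0 * fractionPayingSlowPath(1024, 1024, 8));
    return 0;
}
[/code]

With this particular made-up pattern the 8x8 blocks come out noticeably worse than the 4x4 blocks, even though the set of pixels that actually need softening is identical in both runs.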
Ideally RV770 would have a smaller batch size, e.g. 16. But as GPUs progress, a smaller batch size costs one or more of the following (there's a back-of-envelope sketch of the arithmetic after the list):
- an increase in the number of SIMD units, e.g. from 4 to 16 - which means that the transistor cost for scheduling across all these SIMDs is higher
- or reducing the per-instruction duration of the ALU pipeline, e.g. from 4 to 2 clocks - which means that register file fetches have to be "wider but shorter" (since R6xx has to juggle operands in a buffer before they can be used by the ALUs) and that instruction execution itself has to be re-designed for fewer clock cycles (which is more difficult and may not be possible without a complete re-think)
- or filling the ALU pipeline with more batches, e.g. from 2 to 4 - which increases the scheduling cost as well as increasing the instruction decoding complexity in order to juggle these extra batches
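Here's the back-of-envelope sketch behind those options (the numbers are just my reading of the bullets above - treat them as illustrative, not a statement of how RV770 is actually built):

[code]
#include <cstdio>

int main()
{
    // Illustrative numbers only: batch size = SIMD width (pixels issued per clock)
    // x clocks each instruction occupies the SIMD, and the pipeline wants
    // 2 batches x 4 clocks = 8 clocks of work in flight to stay busy.
    int simdWidth = 16, clocksPerInstruction = 4, pipelineDepthClocks = 8;

    printf("baseline: batch size %d, batches in flight %d\n",
           simdWidth * clocksPerInstruction,
           pipelineDepthClocks / clocksPerInstruction);                 // 64, 2

    // More, narrower SIMDs (e.g. four times as many, each a quarter as wide):
    printf("narrow SIMDs: batch size %d\n",
           (simdWidth / 4) * clocksPerInstruction);                     // 16

    // Shorter per-instruction duration (4 -> 2 clocks) shrinks the batch...
    printf("2-clock instructions: batch size %d, batches in flight %d\n",
           simdWidth * (clocksPerInstruction / 2),
           pipelineDepthClocks / (clocksPerInstruction / 2));           // 32, 4
    // ...but doubles the number of batches that have to be juggled to keep
    // the pipeline full.

    return 0;
}
[/code]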
As far as I can tell NVidia has paid all of these costs, relatively speaking, in G80 (which has a batch size of 16). They mitigated the pipeline costs by having to juggle fewer operands per clock (apparently 4 scalars per clock are juggled into place per "pixel", instead of 15) and investing in the ability to use custom logic (which makes the cost/area of the ALU lower). There are corner cases for G80's dynamic branching that relate to the number of instructions that can be skipped by the If, and there are corner cases for scheduling caused by register pressure. These are other aspects of the trade-offs associated with batch-sizing and general batch scheduling.
It's looking very likely that GT200 will have a batch size of 32, for what it's worth.
Jawed