Here's a quick comparison of shader arrays and quads, based on a "12-pipe" architecture.
For this comparison I'm going to set X850Pro (507MHz core, 520MHz memory, 33.3GB/s) against RV530 (600MHz core, 700MHz memory, 22.4GB/s), simply because the former exists.
X850 Pro:
- 3 quads of fragment shader pipelines
- each shader-quad has a dedicated (quad) TMU, with its own cache
- each shader-quad is 4-way SIMD, i.e. one program counter is shared by all 4 pipelines, all executing the same instruction
- the three shader-quads each operate independently of the others, so the shader-quads operate, overall, as 3-way MIMD
- each shader-quad "owns" a tile of 256 pixels (a square of 16x16) in the backbuffer
(Speculation) The size of a tile corresponds to the batch size for the architecture (256 fragments). Batches are used to hide texture latency: each instruction that is completed on the entire batch of 256 fragments hides 64 cycles of texture latency, since the batch is 64 quads of fragments - in other words, one batch executes in 64 phases per instruction. Texture latency can only be completely hidden if multiple instructions are executed in the shader.
This page appears to indicate that 6 instructions hide a single texture instruction's latency:
GPGPU bench results for single texture fetch in X800XT
The instruction count appears to include the texture instruction itself, so breakeven comes at 6 instructions in total. It's worth noting, though, that 2 texture fetches take 8 instructions to hide, so the overall average is 4 instructions per texture fetch for 2 or more texture fetches.
So in X800XT I'm guessing that the average texture fetch requires 256 cycles to hide (4 instructions, each taking 64 phases over the batch). The X850Pro should have similar texture fetch latency (slightly higher-clocked memory).
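The arithmetic behind that 256-cycle guess can be sketched in a few lines. This is only a back-of-envelope model, assuming one phase per clock and one quad (4 fragments) processed per phase; the function names are mine, not anything from ATI.

```python
def phases_per_instruction(batch_fragments, simd_width):
    """Clocks needed to run one instruction over a whole batch,
    assuming simd_width fragments are processed per clock."""
    return -(-batch_fragments // simd_width)  # ceiling division


def latency_hidden(batch_fragments, simd_width, instructions_per_fetch):
    """Cycles of texture latency covered while the batch works through
    instructions_per_fetch instructions (assumed model, not measured)."""
    return phases_per_instruction(batch_fragments, simd_width) * instructions_per_fetch


# X800XT/X850Pro shader-quad: 256-fragment batch, 4-way SIMD
print(phases_per_instruction(256, 4))  # 64 phases per instruction
print(latency_hidden(256, 4, 4))       # 256 cycles at 4 instructions per fetch
```

With the 6-instruction single-fetch case the same model gives 384 cycles, so the 256 figure really is the "2 or more fetches" average.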
RV530:
- 1 array of 12 pipelines
- TMU configuration might be:
  - 1 TMU per X pipelines, or
  - a TMU array, e.g. 8 TMUs, shared by all pipelines
- the shader array shares a single program counter, making it 12-way SIMD
(Speculation) The batch size may be a small multiple of the array size, e.g. 24 fragments. The only ways to hide (say) 256 cycles of latency are to have more instructions, on average, per texture operation in each batch (at 2 phases per batch, that's 128 instructions per texture op) or to interleave multiple triangles' threads (batches). The former simply isn't practical, so to support small batches, RV530 would need to be able to interleave batches for fragment shading.
Xenos is able to interleave batches in each shader array on successive clock cycles. Unfortunately we don't know how many batches Xenos can maintain at any one time. I'm going to assume that Xenos uses 32-fragment batches (i.e. 2 phases per batch) when pixel shading, so to hide 256 cycles of texture latency it would need 32 batches, each averaging 4 instructions per texture op. (32 batches x 4 instructions per batch x 2 phases = 256 cycles).
If a batch were, say, 64 fragments (4 phases), then 16 batches would need to be active at one time, etc. Either way, I'm suggesting that 1024 fragments could be in flight at one time. X800XT, with four quads, each working on a 256-pixel tile, also has 1024 fragments in flight at one time.
So, assuming that RV530 has a multiple-batch scheduler like Xenos's, it would take 32 batches (each of 24 fragments in 2 phases), each with an average of 4 instructions per texture op (i.e. 3 ALU ops and 1 TMU op), to hide 256 cycles of texture latency. This would correspond to 768 fragments in flight at one time.
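The batch counts above all come from the same sum, which can be sketched as below. The 16-wide array for Xenos is just what "32 fragments in 2 phases" implies, and the 256-cycle latency and 4 instructions per texture op are the guesses used throughout this post, so treat every number as an assumption.

```python
from math import ceil


def batches_needed(latency_cycles, batch_fragments, simd_width, instr_per_fetch):
    """Batches that must be interleaved so that, between issuing a texture
    fetch and needing its result, the array always has other work to do."""
    phases = ceil(batch_fragments / simd_width)   # clocks per instruction
    cycles_per_batch = phases * instr_per_fetch   # work each batch contributes
    return ceil(latency_cycles / cycles_per_batch)


LATENCY = 256        # assumed texture latency, in cycles
INSTR_PER_FETCH = 4  # assumed average (3 ALU ops + 1 TMU op)

# (batch size in fragments, SIMD width of the array) - speculative configs
configs = {
    "Xenos, 32-fragment batches": (32, 16),
    "Xenos, 64-fragment batches": (64, 16),
    "RV530, 24-fragment batches": (24, 12),
}

for name, (batch, width) in configs.items():
    n = batches_needed(LATENCY, batch, width, INSTR_PER_FETCH)
    print(f"{name}: {n} batches, {n * batch} fragments in flight")
```

That gives 32 batches/1024 fragments and 16 batches/1024 fragments for the two Xenos cases, and 32 batches/768 fragments for RV530, matching the figures above.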
So, does RV530 make use of a multiple-batch scheduler, like Xenos?
If so, it would imply that with such a small batch size (24) all forms of dynamic branching (loops, if...then...else) become fairly practical. What makes dynamic branching in NV40's or G70's fragment shader architecture worthless is the extremely high cost if only 1 fragment follows the worst-case execution path: it forces all the other fragments in the batch to follow the same, slow, execution path.
So a smaller batch, if it's possible to implement by using multiple-batch scheduling (like Xenos), would make the worst-case execution path far less costly overall.
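To put a rough number on that, here's a toy model. It assumes (unrealistically) that each fragment independently takes the slow path with probability p, and that a batch pays the slow-path cost if any one of its fragments does. Real fragments are spatially correlated, and the cycle costs in the example are invented, so this only shows the trend.

```python
def slow_batch_fraction(p_slow, batch_fragments):
    """Fraction of batches dragged onto the slow path, assuming each
    fragment independently takes it with probability p_slow."""
    return 1.0 - (1.0 - p_slow) ** batch_fragments


def expected_cost(p_slow, batch_fragments, fast_cycles, slow_cycles):
    """Average per-batch shader cost under the same toy assumption."""
    frac = slow_batch_fraction(p_slow, batch_fragments)
    return fast_cycles + frac * (slow_cycles - fast_cycles)


# Hypothetical shader: 10 cycles on the fast path, 100 on the slow path,
# with 1% of fragments needing the slow path.
for batch in (24, 256, 1024):
    frac = slow_batch_fraction(0.01, batch)
    cost = expected_cost(0.01, batch, 10, 100)
    print(f"batch of {batch}: {frac:.1%} of batches slow, ~{cost:.0f} cycles average")
```

The bigger the batch, the closer every batch gets to paying the worst-case cost, which is the problem described above.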
(Throughout this comparison, I've stuck to "256-cycle latency" for texturing. This depends on the clock rate and architecture, so in RV530 the latency could be much longer, for example. Naturally, if I've got the latency wildly wrong, e.g. it should be 512 cycles, I believe the analysis will stand simply by changing the phase count - and the overall concept that Xenos and RV530 both schedule multiple small batches still holds.)
Jawed