TurnDragoZeroV2G said:
Well, they did move 20-25 million transistors of logic to the daughter die (and there's only half as many ROPs, which go for 4 samples but ditch the programmable pattern for a fixed 4-sample pattern, too, right?). And doesn't Xenos kinda lack a need for anything resembling Avivo, which would allow for some savings there?
Yep. Even counting all that, Xenos seems to have a "low" transistor count. And don't forget it actually has 64 shader pipes (any 16 of the 64 are given up for redundancy to improve yield).
I think the "simpler" shader ALU organisation of Xenos is probably a big part of it. Not only does that cut out the extra ALU (which is supposedly just capable of ADD) but it also cuts out a heap of complex issue/decode circuitry that has to work out if the ADD can be dual-issued.
Additionally, Xenos saves transistors over conventional PC GPUs by lumping batches into 16-wide phases. The batch size in R520 is 16, but in four phases - on each phase a quad of pixels is processed (all running the same instruction). On Xenos a batch size is 64 pixels, again in four phases. By making the phases wider like this, you use fewer transistors on the instruction fetch/issue/decode block - since you now have one of these blocks for every 16 pipes, whereas in PC GPUs you have one of these blocks for every 4 pipes. So Xenos has the same number of these blocks as R520 does - but Xenos has four times as many pipes (ignoring R520's vertex pipes for a second). That's a big transistor saving over what you might expect.
Similarly, Xenos's texture pipes are treated as 16-wide, instead of four 4-wide quads. This means Xenos has one quarter of the texture-pipe decode logic (though I imagine that it's nowhere near as complex as the fetch/issue/decode logic required in the shader pipeline).
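To put rough numbers on the block counts above, here's a minimal Python sketch. The helper name `control_blocks` is made up for illustration, and the pipe counts are just the figures quoted in this post (16 pixel pipes for R520, 64 shader pipes and 16 texture pipes for Xenos), not anything taken from ATI documentation.

```python
def control_blocks(pipes, phase_width):
    """Number of fetch/issue/decode blocks if one block drives `phase_width` pipes."""
    if pipes % phase_width:
        raise ValueError("pipes must be a whole multiple of phase_width")
    return pipes // phase_width

# Shader pipes, using the figures quoted in the post (assumptions, not official specs).
r520_shader = control_blocks(pipes=16, phase_width=4)      # quad-wide phases
xenos_shader = control_blocks(pipes=64, phase_width=16)    # 16-wide phases
xenos_if_quads = control_blocks(pipes=64, phase_width=4)   # hypothetical quad-based Xenos

# Texture pipes: one 16-wide group versus four 4-wide quads.
xenos_texture = control_blocks(pipes=16, phase_width=16)
quad_texture = control_blocks(pipes=16, phase_width=4)

print("shader blocks:", r520_shader, xenos_shader, xenos_if_quads)
print("texture blocks:", xenos_texture, quad_texture)
```

On these figures a quad-organised Xenos would need four times the shader control blocks that the 16-wide arrangement does, which is the saving I'm getting at.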
Xenos is supposed to work on batches of 8x8 pixels. But, it seems like it could actually batch up 2x2 pixels from any triangles in the scene, potentially, into the same batch.
Dave's article was written before it was known Xenos uses 64-sized batches.
The R520 article's picture illustrating dynamic branching implies that all the pixels in the batch are adjacent to each other and that the smaller batch size is what allows it to achieve greater efficiency. But if Xenos is batching up pixel quads from multiple triangles, wouldn't this be the equivalent of making some batches out of a few 2x2 blocks in the shadow, some in the "grey," and then some in the full light? In which case, DB efficiency would essentially go down the toilet?
Yes, Xenos will suffer lower DB efficiency than R520. R520 is a curiosity in this respect, because it's expected that all future ATI GPUs will increase the ALU:texture op ratio. In R520 it's 1:1. In R580 it's 3:1. So in R580, the batch size becomes 48, instead of R520's 16. So R580 (like Xenos) suffers a shortfall in DB efficiency compared with R520.
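A toy way to see why bigger batches hurt dynamic branching: count how many batches straddle a shadow edge, since those are the ones that have to run both sides of the branch. This is only a sketch under loud assumptions - batches are treated as square screen tiles (4x4 for a 16-pixel batch, 8x8 for a 64-pixel batch), the shadow edge is a made-up diagonal across a 64x64 region, and `mixed_fraction` is a name I've invented. Real hardware may not tile batches this way at all.

```python
def mixed_fraction(tile, size=64):
    """Fraction of tile-shaped batches that contain both shadowed and lit pixels."""
    def in_shadow(x, y):
        # Hypothetical diagonal shadow edge across the region.
        return x + y < size

    mixed = 0
    total = 0
    for ty in range(0, size, tile):
        for tx in range(0, size, tile):
            # Collect the distinct branch outcomes seen inside this tile.
            states = {in_shadow(x, y)
                      for y in range(ty, ty + tile)
                      for x in range(tx, tx + tile)}
            total += 1
            if len(states) == 2:
                mixed += 1  # batch must execute both branch paths
    return mixed / total

print("16-pixel batches (4x4):", mixed_fraction(4))
print("64-pixel batches (8x8):", mixed_fraction(8))
```

In this particular setup the 8x8 tiling leaves 12.5% of batches mixed against 6.25% for 4x4, so twice the fraction of batches pays for both paths, and each of those batches is four times bigger.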
Is my understanding of this batching totally off, or is there a lot more logic that goes into properly batching things together (or is it that all the pixels in the 64-pixel batch are adjacent to one another, even if they lie on multiple triangles...)
A batch is formed of pixels that all have the same shader state. A shader state is defined by the need to run the same shader program. As far as I can tell, in ATI hardware this means the pixels must all come from the same triangle.
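If the same-triangle rule holds, batch formation could be sketched roughly like this. Everything here is an assumption for illustration: the names `form_batches` and `occupancy` are invented, quads are identified only by a triangle id, and I'm assuming a partly filled batch is simply issued with empty slots, which I can't confirm from anything I've read.

```python
from collections import defaultdict

BATCH_QUADS = 16  # 64 pixels per batch / 4 pixels per 2x2 quad

def form_batches(quads):
    """Group quads into batches, never mixing triangles within one batch.

    `quads` is an iterable of triangle ids, one entry per 2x2 pixel quad.
    Returns a list of (triangle_id, quads_used) tuples.
    """
    per_triangle = defaultdict(int)
    for tri in quads:
        per_triangle[tri] += 1

    batches = []
    for tri, count in per_triangle.items():
        while count > 0:
            used = min(count, BATCH_QUADS)
            batches.append((tri, used))  # a short batch leaves slots idle
            count -= used
    return batches

def occupancy(batches):
    """Fraction of quad slots doing useful work across all batches."""
    return sum(used for _, used in batches) / (len(batches) * BATCH_QUADS)

quads = ["A"] * 40 + ["B"] * 3  # one large triangle, one tiny one
batches = form_batches(quads)
print(batches)
print(occupancy(batches))
```

The point of the sketch is just that small triangles would leave most of a 64-pixel batch empty under this rule, which is the cost of keeping one shader state per batch.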
In vertex processing, the shader state effectively relates to vertex batches - all the vertices must be in the same batch. Since vertex batches are normally hundreds to tens of thousands (or more) in size, that's not a problem.
But it's arguable that dynamic branching in vertex shader programs is going to suffer a lot from the inefficiency of running in batches of 64. On the other hand the tessellation (creation, destruction or shifting) of vertices that Xenos supports may make this moot.
From the XFest documents I have, Xenos has extra tricks up its sleeve to do with dynamic branching. These are instructions that allow the dev to program the sequencer (the control block in Xenos that organises shader execution at the batch level). A simple example might be to jump over portions of code in the shader, or to loop over a portion of code - doing so for all 64 objects in the batch, if they all match the same condition (or, if any one matches a condition). It's all a bit hairy, to be honest - I haven't worked out how that would be used.
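My best guess at the general idea, as a Python toy rather than anything resembling the real instruction set: test a condition across the whole batch, and only when every object agrees let the batch jump over code as a unit. The function `run_batch` and its arguments are hypothetical, and the "divergent" fallback (pay for both paths, select per object) is my assumption about what happens when the batch disagrees.

```python
def run_batch(values, condition, expensive, cheap):
    """Toy model of batch-level control flow over one batch of objects.

    Returns (results, mode), where mode records which path the batch took.
    """
    flags = [condition(v) for v in values]

    if not any(flags):
        # No object needs the expensive code: the whole batch jumps over it.
        return [cheap(v) for v in values], "skipped"

    if all(flags):
        # Every object agrees: the batch runs only the expensive code.
        return [expensive(v) for v in values], "uniform"

    # Mixed batch: both paths are evaluated and the result is picked per object.
    both = [(expensive(v), cheap(v)) for v in values]
    return [e if f else c for (e, c), f in zip(both, flags)], "divergent"

lit = lambda v: v > 0          # stand-in for "pixel is lit"
shade = lambda v: v * v + 1    # stand-in for a long lighting calculation
ambient = lambda v: 0          # stand-in for the trivial path

_, dark_mode = run_batch([0] * 64, lit, shade, ambient)
_, lit_mode = run_batch([2] * 64, lit, shade, ambient)
mixed_results, mixed_mode = run_batch([0, 2] * 32, lit, shade, ambient)
print(dark_mode, lit_mode, mixed_mode)
```

The "all" and "any" tests in the sketch correspond to the two conditions mentioned above (all 64 match, or any one matches); how the real sequencer instructions expose them is exactly the part I haven't worked out.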
That's where real dev comments are going to be needed. Huge chunks of this document go right over my head.
Jawed