- psurge, there's Dave posting just ahead of you to say just how confusing this architecture is!
Going back to the patent and thinking about the command thread queues (one for pixel shaders, the other for vertex shaders): each thread is described as having a status recorded against it in the queue, e.g. "needs texturing" or "needs ALU". This is what drives thread switching: finding threads that are ready to execute and prioritising them by age, or shader length (or something).
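Purely as a sketch of how I imagine that arbiter working - all the names and statuses here are mine, not the patent's:

```python
# Hypothetical model of the command-queue scheduling: each thread carries
# a status ("needs ALU", "needs texturing", "waiting"), and the arbiter
# picks the oldest thread whose status matches the unit that's free.
from dataclasses import dataclass

@dataclass
class Thread:
    tid: int
    age: int          # cycles spent sitting in the queue
    status: str       # e.g. "needs ALU", "needs texturing", "waiting"

def pick_next(queue, unit):
    """Return the oldest thread that's ready for the given unit, or None."""
    wanted = "needs ALU" if unit == "ALU" else "needs texturing"
    ready = [t for t in queue if t.status == wanted]
    return max(ready, key=lambda t: t.age) if ready else None

queue = [
    Thread(0, age=5, status="waiting"),           # texels still in flight
    Thread(1, age=9, status="needs ALU"),
    Thread(2, age=3, status="needs ALU"),
    Thread(3, age=7, status="needs texturing"),
]
print(pick_next(queue, "ALU").tid)   # thread 1: the oldest ALU-ready thread
```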
In general, when you start a group of objects together and run the same code on them, they'll all run in sync with each other (dynamic branches being the exception). So each time you pull a single thread out of a command queue to execute it (ALU or texturing), you're actually pulling out 4 or 16 or 48 objects, all with a common (coherent) state, at least partially so.
The patent also talked about thread interleaving in the ALU pipeline. This may turn out to be a reference to the interleaving of vertex and pixel threads. Or...
Your idea about quad serialisation is one that I've pondered a number of times, since quad-organisation is primarily a mechanism to optimise texturing, as far as I can tell - there's no specific reason to run the pixels of a quad across multiple pipes when they could simply follow one after the other.
In this scenario you don't process twelve pixel quads (48 pixels total), you process 48 pixels at a time without thinking in "quads". Then you repeat over and over, until you've exhausted all the pixels for the current triangle/shader.
By swapping round-robin between threads (one thread per group), you execute one instruction per group of objects.
So if you have 96 pixels to be shaded then you split them into 6 groups of 16, say. Each group of 16 pixels is submitted one after the other for execution, but with thread interleaving. So if your shader looks like:
InstrX
InstrY
InstrZ
Execution of the pixel groups (threads A to F, each of 16 pixels) looks like:
Code:
InstrX:
A1 A2 A3 ... A15 A16
B1 B2 B3 ... B15 B16
...
F1 F2 F3 ... F15 F16
InstrY:
A1 A2 A3 ... A15 A16
B1 B2 B3 ... B15 B16
...
F1 F2 F3 ... F15 F16
InstrZ:
A1 A2 A3 ... A15 A16
B1 B2 B3 ... B15 B16
...
F1 F2 F3 ... F15 F16
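The schedule above can be spelled out as a toy generator - entirely my own framing, just to make the interleaving explicit:

```python
# Toy reconstruction of the interleaving: 96 pixels split into 6 threads
# (A..F) of 16 pixels each, with every thread running an instruction
# before any thread moves on to the next instruction.
def interleave(instructions, threads, pixels_per_thread):
    schedule = []
    for instr in instructions:
        for t in threads:
            for p in range(1, pixels_per_thread + 1):
                schedule.append((instr, f"{t}{p}"))
    return schedule

sched = interleave(["InstrX", "InstrY", "InstrZ"], "ABCDEF", 16)
assert len(sched) == 3 * 6 * 16           # 288 issue slots for 96 pixels
assert sched[0]  == ("InstrX", "A1")
assert sched[16] == ("InstrX", "B1")      # next thread, same instruction
assert sched[96] == ("InstrY", "A1")      # all 96 pixels done before InstrY
```

One decode of InstrX serves all 96 pixel slots before InstrY is ever looked at - which is the whole point about decode latency below.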
This way the instruction decode latency is minimal, which makes the instruction execution pipeline extremely short (you only need to decode once before running off 96 pixels!).
You only process those 96 pixels when the texels have been produced. Ideally you'd blat the TMUs for the maximum coverage of texels with the minimum of requests. Quads are no longer a useful way of thinking of texels. Too fine-grained.
When another texture operation is required, then you swap out this wodge of 96 pixels and attack some other group. And keep on attacking groups until our 96 pixels' texels are ready.
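That swap-on-texture idea in rough Python - again a made-up model, with a made-up fixed texture latency, just to show the round-robin hiding the stall:

```python
# Sketch of hiding texture latency by swapping whole pixel groups:
# when the active group hits a texture op, park it until its texels
# arrive, and keep issuing ALU work from the other groups meanwhile.
from collections import deque

def run(groups, tex_latency):
    """groups: dict of name -> list of ops ('alu' or 'tex')."""
    ready = deque(groups)             # groups ready to execute
    waiting = {}                      # name -> cycles until texels arrive
    trace, cycle = [], 0
    while ready or waiting:
        # age outstanding texture requests; re-ready any group that's done
        for name in list(waiting):
            waiting[name] -= 1
            if waiting[name] <= 0:
                del waiting[name]
                ready.append(name)
        if ready:
            name = ready.popleft()
            op = groups[name].pop(0)
            trace.append((cycle, name, op))
            if groups[name]:
                if op == "tex":
                    waiting[name] = tex_latency   # park the group
                else:
                    ready.append(name)            # round-robin onwards
        cycle += 1
    return trace

groups = {"G0": ["alu", "tex", "alu"], "G1": ["alu", "alu", "alu"]}
trace = run(groups, tex_latency=3)
# while G0 waits on its texels, G1's ALU work fills the gap
```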
My worry is all the small triangles - what I've described seems to be good at large areas of the same shader. The only defence would appear to be that lots of contiguous triangles across a surface will be running the same shader - so ideally you want to guarantee that you generate the vertices for contiguous surfaces serially, to maximise texturing coherency. Which is prolly where higher order surfaces or tessellation come in (he says, knowing nothing about either!).
Anyway, that's my mad variation on your ideas.
Can't wait to find out the truth! Exciting stuff.
Jawed