Xbox 360 GPU explained... or so

RoOoBo said:
I can't find the patent in the european patent site where you can get them as pdfs rather than tiffs.

Fek, if an application is really using fragment shaders, hardly any stage below the shader stages is going to be fully used; in fact utilizations of 10% or less are more likely. A modern graphics application is likely to be either shader limited or bandwidth limited. In the shader-limited case, prioritisation of fragments creates stalls in the shader while prioritisation of vertices doesn't produce stalls in the shader. So the second is better.

I can't understand why prioritisation of fragments is a bad thing in a fragment-limited situation. It looks like the opposite to me: in a fragment-limited scenario you likely want to process as many fragments as you can, so the queue doesn't fill up and stall. Can you elaborate on this point, please?
 
Well the patent is a nice simple read with no surprises, confirming that a pixel-driven (rather than vertex driven) load balancing scheme is used.

Jawed
 
Jawed said:
Well the patent is a nice simple read with no surprises, confirming that a pixel-driven (rather than vertex driven) load balancing scheme is used
Umh..where?
 
Jawed said:
I'm not going to hold your hand.
LOL! Jawed, we're here to discuss; if you don't have anything to add, don't even bother to reply. Whoever makes a statement also has the burden of proof. C'mon, cut & paste, it's not that difficult.
I found this:
The unified shader 62 has ability to simultaneously perform vertex manipulation operations and pixel manipulation operations at various degrees of completion by being able to freely switch between such programs or instructions, maintained in the instruction store 98, very quickly. In application, vertex data to be processed is transmitted into the general purpose register block 92 from multiplexer 66. The instruction store 98 then passes the corresponding control signals to the processor 96 on line 01 to perform such vertex operations. However, if the general purpose register block 92 does not have enough available space therein to store the incoming vertex data, such information will not be transmitted as the arbitration scheme of the arbiter 64 is not satisfied. In this manner, any pixel calculation operations that are to be, or are currently being, performed by the processor 96 are continued, based on the instructions maintained in the instruction store 98, until enough registers within the general purpose register block 92 become available. Thus, through the sharing of resources within the unified shader 62, processing of image data is enhanced as there is no down time associated with the processor

When I first wrote in this forum about a very dumb vertex-driven arbiter, I wrote something very similar to the quote above:
nAo said:
A stupid arbiter would just transform vertices until an internal buffer that stores transformed vertices is full, then the arbiter switches the ALUs to processing pixels and so on, autobalancing pixel and vertex throughput.
I hope hw is a bit smarter than that ;)

Pixel-driven what? :p
 
What you've described earlier as dumb and what the patent describes aren't the same, naturally.

As the unified shader completes the transformation of triangles the rasterised pixels that result, by necessity, have priority over vertex data, if the register block (aka "command thread queue") is short on spare capacity.

The sequencer will, obviously, take cognisance of the respective vertex and pixel shading loads, too.

When you've got a fixed-size FIFO you can't push data in unless you've taken data out. In the case of the unified shader a vertex command thread (for a triangle) and a pixel command thread (for a quad or a larger group?) each consume one slot in the FIFO.

Since the relationship between vertices and pixels is one to many, i.e. one input triangle generates lots of quads, you have no choice but to prioritise pixel shading.

It's not difficult to understand. There may be occasions when there are no vertex command threads in the register block, but in general the balance of the two will ebb and flow - determined largely by z-testing and the size of the triangles.

Plainly the complexity of the shaders for either vertices or pixels will affect the balance, too.
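For concreteness, here's a minimal sketch of that arbitration scheme in Python (purely illustrative; the slot count, thread granularity and names are my assumptions, not Xenos internals): resident pixel work keeps running, and new vertex data is only admitted when the register block has spare slots.

```python
from collections import deque

REGISTER_SLOTS = 64  # assumed capacity of the register block ("command thread queue")

class UnifiedShaderArbiter:
    """Toy model of the patent's scheme: pixel command threads already in the
    register block keep the ALUs busy; vertex data is admitted only when
    there is spare capacity."""

    def __init__(self):
        self.register_block = deque()  # resident command threads

    def free_slots(self):
        return REGISTER_SLOTS - len(self.register_block)

    def submit_vertex(self, vertex_thread):
        # The arbiter refuses incoming vertex data when the block is full,
        # leaving the processor to carry on with its pixel work instead.
        if self.free_slots() > 0:
            self.register_block.append(("vertex", vertex_thread))
            return True
        return False

    def submit_pixels(self, pixel_threads):
        # Pixels rasterised from already-transformed triangles go in ahead of
        # any new vertex data, as far as space allows.
        accepted = []
        while pixel_threads and self.free_slots() > 0:
            accepted.append(pixel_threads.pop(0))
            self.register_block.append(("pixel", accepted[-1]))
        return accepted

    def retire(self, count=1):
        # Completely shading a pixel (or culling a triangle) frees slots,
        # which is what eventually lets new vertex threads back in.
        for _ in range(min(count, len(self.register_block))):
            self.register_block.popleft()
```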

Oh, btw, thanks for the cut'n'paste, you saved me the rummage ;)

Jawed
 
Jawed said:
What you've described earlier as dumb and what the patent describes aren't the same, naturally.

As the unified shader completes the transformation of triangles the rasterised pixels that result, by necessity, have priority over vertex data, if the register block (aka "command thread queue") is short on spare capacity.

I agree with you on this. That's exactly the same thing I understood from reading that passage. And that's more or less along the same lines of reasoning we heard from ATI.
 
Jawed said:
What you've described earlier as dumb and what the patent describes aren't the same, naturally.
In fact I said they're similar, not the same.
I wrote about transformed vertices buffer, the patent talks about incoming vertices buffer.
It's a pretty natural thing to do since if you don't transform vertices first you don't have any pixels to shade.

As the unified shader completes the transformation of triangles the rasterised pixels that result, by necessity, have priority over vertex data, if the register block (aka "command thread queue") is short on spare capacity.
Pixels have priority over vertices only when there's no more space for incoming vertices. You're very good at twisting arguments, nice ;)
Once there is more space in the on-chip buffers (the shaded-pixels buffer included) the hw starts to fetch new vertices, and so on.

It's not difficult to understand.
In fact it isn't, but you still haven't grasped it :)
 
Jawed said:
When you've got a fixed-size FIFO you can't push data in unless you've taken data out. In the case of the unified shader a vertex command thread (for a triangle) and a pixel command thread (for a quad or a larger group?) each consume one slot in the FIFO.
I just want to revise this.

The register block uses one slot per vertex or pixel, since the register values for individual vertices (as part of a triangle) or pixels (as part of a quad) will differ; e.g. one pixel in a quad might have black stored in R0 whilst another pixel in the same quad might have white stored in R0.

So plainly each rasterised triangle can generate an avalanche of pixels, easily capable of entirely filling the register block.

Obviously, if there are another 47 unified shaders to share the workload, then the avalanche can be mitigated somewhat. It all depends on how many of the unified shaders are working on triangles...
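As a back-of-the-envelope illustration of that pressure (all numbers below are invented for the example, not real Xenos figures):

```python
REGISTER_SLOTS_PER_SHADER = 64   # assumed register block capacity per unified shader
SHADER_COUNT = 48                # this shader plus the other 47

VERTS_PER_TRIANGLE = 3           # slots freed once a triangle's vertices are shaded
PIXELS_PER_TRIANGLE = 1600       # e.g. a triangle covering roughly 40x40 pixels on screen

# One slot per vertex or per pixel, so a single rasterised triangle can
# demand hundreds of times more slots than it released.
demand_ratio = PIXELS_PER_TRIANGLE / VERTS_PER_TRIANGLE
blocks_filled = PIXELS_PER_TRIANGLE / REGISTER_SLOTS_PER_SHADER

print(f"pixel slots demanded per vertex slot freed: {demand_ratio:.0f}x")
print(f"register blocks one triangle's pixels could fill: {blocks_filled:.0f} of {SHADER_COUNT}")
```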

Jawed
 
But what you're saying, fek, doesn't match up with what Dave said:

Dave said:
At the basic level it analyses the sizes of the vertex and pixel buffers and tries to apportion the load so that they are equalised as much as possible (dependent on the program load). It can also receive hints from the OS and application so the programmer can give a bias to the load balancer, increasing the priority of load types.

This seems to suggest that the arbiter never lets either buffer become empty, which goes against the "run the pixel shaders until no more pixels in queue" idea you claimed, which I pointed out a while ago might cause pipeline hiccups.

A "load balancer" which doesn't "balance" load isn't really a load balanced.
 
nAo said:
I wrote about transformed vertices buffer, the patent talks about incoming vertices buffer.

Which are two quite different concepts.

It's a pretty natural thing to do since if you don't transform vertices first you don't have any pixels to shade.

If there's no fragment to process, vertices will be processed to produce new fragments. That's easy.
 
DemoCoder said:
But what you're saying fek doesn't match up with what Dave said

Dave and I discussed the incongruence in PM. It turns out we had input from ATI at two different levels. We can't disclose the specifics, but it looks like the high-level load balancing algorithm is to "always" give precedence to pixels (which is natural), although my "always" would have been more accurate as "most of the time"; what Dave wrote happens at a lower level. This is only our guess, but it seems somewhat confirmed by the patent.

I hope this clarifies my first, rather imprecise and ambiguous, claim.
 
fek said:
If there's no fragment to process, vertices will be processed to produce new fragments. That's easy.
You will have a lot of pipeline bubbles acting this way; you have to fetch (and transform) new vertices way before you are out of fragments to shade if you don't want your ALUs to stall waiting for (see the sketch below):
1) new vertices to transform (fetch them in advance!)
2) new pixels to shade that the rasterizer still has to walk (transform vertices and assemble primitives in advance!)
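A minimal sketch of that look-ahead (the watermark and structure names are invented for illustration): vertex work is issued before the fragment queue runs dry, so the rasteriser always has primitives to walk.

```python
PRIMITIVE_LOW_WATERMARK = 2  # assumed: keep at least this many assembled primitives on hand

def next_alu_batch(fragment_queue, primitive_pool, vertex_fifo):
    """Pick the next batch for the shader ALUs.

    fragment_queue: rasterised fragments waiting to be shaded
    primitive_pool: transformed, assembled primitives not yet walked by the rasteriser
    vertex_fifo:    untransformed vertices already fetched from memory
    """
    # Transform vertices *before* the fragment supply dries up, instead of
    # waiting until there are zero fragments left (which would bubble the ALUs).
    if len(primitive_pool) < PRIMITIVE_LOW_WATERMARK and vertex_fifo:
        return "vertex"
    if fragment_queue:
        return "pixel"
    return "vertex" if vertex_fifo else "idle"
```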
 
nAo said:
1) new vertices to transform (fetch them in advance!)

New vertices are fetched in advance (but not processed, according to the patent) and stored in the queue.

Bubbles or not, in the very rare case there's no fragment to process, vertices must be processed to produce new fragments. This situation is better than processing vertices first; each primitive would likely produce many more fragments than there is space to store, and that would often bring the GPU to a big stall. This is clearly not a smart way to drive the pipeline array.

2) new pixels to shade that the rasterizer still has to walk (transform vertices and assemble primitives in advance!)

Can you make this sentence clearer please?
 
nAo - when a triangle is z-tested to be visible (or partially visible) the pixels consequently generated by the rasteriser will consume way more register block slots than the triangle itself frees up.

The triangle will have used 3 slots. The resulting pixels might consume 500 slots.

Since triangles and pixels pass through the same bottleneck, you have to let through the pixels first, because they're ahead of any new triangles. They're ahead simply because pixel command threads are older than any newly generated vertex command threads you might care to process.

Vertex command threads (one triangle) A, B and C generate pixels a1, a2 ... a500. Vertex command threads D, E and F, which have just come along from the CPU, can't be processed on this unified shader yet because unified shaders 1 to 7 are stuffed full of the a1, a2 ... a500 pixels. Unified shader number 8 does have some spare capacity, so D, E and F can go there.
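To make the D, E, F part concrete, a toy placement routine might look like this (the array bookkeeping and slot counts are invented for the example):

```python
def place_vertex_threads(shader_arrays, threads):
    """shader_arrays: list of dicts like {"name": str, "free_slots": int, "queue": list}.
    New vertex command threads go to any array with spare register-block slots;
    arrays stuffed with pixel threads (a1 ... a500 above) are skipped."""
    placed = []
    for thread in threads:
        target = next((a for a in shader_arrays if a["free_slots"] > 0), None)
        if target is None:
            break  # everything is full; remaining threads wait upstream
        target["queue"].append(thread)
        target["free_slots"] -= 1
        placed.append((thread, target["name"]))
    return placed

arrays = [{"name": f"US{i}", "free_slots": 0, "queue": []} for i in range(1, 8)]
arrays.append({"name": "US8", "free_slots": 16, "queue": []})
print(place_vertex_threads(arrays, ["D", "E", "F"]))  # all three land on US8
```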

As I said, if you don't empty the FIFO (register block) you can't push data in. Emptying the FIFO consists of completely shading a pixel (or rejecting a triangle as entirely hidden).

Naturally the ideal way to run Xenos is with all the register blocks "not quite full". Unless the scene being drawn is heavily vertex fetch limited, the register blocks are going to spend the majority of their time stuffed with a majority of pixel data.

Clearly a great example of vertex fetch limited rendering is a stencil shadow rendering pass. Xenos should be very good at those...

Jawed
 
fek said:
nAo said:
I wrote about transformed vertices buffer, the patent talks about incoming vertices buffer.

Which are two quite different concepts.
Since there's a 1:1 mapping between untransformed and transformed vertices, they're not that different as balancing metrics, even if it's not the same thing, since the untransformed and transformed vertex buffer sizes can differ.
 
nAo said:
fek said:
nAo said:
I wrote about transformed vertices buffer, the patent talks about incoming vertices buffer.

Which are two quite different concepts.
Since there's a 1:1 mapping between untransformed and transformed vertices, they're not that different as balancing metrics, even if it's not the same thing, since the untransformed and transformed vertex buffer sizes can differ.

We are discussing which element is processed (transformed, as you say) first. In this context "input vertices" and "transformed vertices" are two completely different concepts, although there's a 1:1 mapping between them. An input vertex has clearly not been processed (transformed) yet.
 
Jawed said:
Naturally the ideal way to run Xenos is with all the register blocks "not quite full". Unless the scene being drawn is heavily vertex fetch limited, the register blocks are going to spend the majority of their time stuffed with a majority of pixel data.

Which is by far the most common scenario and the one to optimize for.
 
fek said:
Bubbles or not, in the very rare case there's no fragment to process, vertices must be processed to produce new fragments.
If the hw worked as Jawed described, this case would be quite common, not rare! ;)
In fact, every time a fragment batch is shaded, a bubble would happen if the hw has not processed some vertices in advance.
This situation is better than processing vertices first; each primitive would likely produce many more fragments than there is space to store, and that would often bring the GPU to a big stall. This is clearly not a smart way to drive the pipeline array.
No one is saying the hw has to transform ALL the vertices before starting to work on pixels; what I'm saying is that, in order to avoid pipeline bubbles, you'd want to allocate some ALU slots to vertices way before the previous fragment batch is completely shaded.

Can you make this sentence clearer please?
Once vertices are transformed, new primitives are assembled.
The rasterizer takes the new primitives and extracts (walks) the fragments to shade.
 