Xbox 360 GPU explained... or so

fek said:
No, the prioritisation algorithm will be pretty simple: as long as there are pixels in queue, pixels will be processed; when there are no more pixels, vertices will be processed to create more pixels to process.

Hmm, sounds interesting. On one side that creates a stall (the time between "no more pixels left" and "new pixels ready", and the buffering involved), and on the other side nothing is ever idle (ideal case). I wonder if that's better or worse for efficiency than the traditional architectures, or to put it this way: in which situation is one better than the other?
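
A minimal sketch of the policy quoted above, assuming a toy scheduler with one queue per work type (the names and queue model are mine, not ATI's):

from collections import deque

# Toy model of the quoted policy: pixels have absolute priority, and vertex work
# only runs to create more pixels. Queue contents are opaque placeholders.
pixel_queue = deque()
vertex_queue = deque()

def pick_next_work():
    if pixel_queue:
        return ("pixel", pixel_queue.popleft())
    if vertex_queue:
        return ("vertex", vertex_queue.popleft())
    return None  # nothing ready: the stall window discussed below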
 
_xxx_ said:
fek said:
No, the prioritisation algorithm will be pretty simple: as long as there are pixels in queue, pixels will be processed; when there are no more pixels, vertices will be processed to create more pixels to process.

Hmm, sounds interesting. On one side that creates a stall (the time between "no more pixels left" and "new pixels ready", and the buffering involved), and on the other side nothing is ever idle (ideal case). I wonder if that's better or worse for efficiency than the traditional architectures, or to put it this way: in which situation is one better than the other?

Dave's description is more accurate than the simplified one I gave you.

Traditional architectures should theoretically be better when the actual balance between vertices and pixels is close to the balance hardcoded in the hardware: the closer it is, the more efficient traditional architectures are.

A unified shader architecture might be more efficient when the balance changes wildly over time. An interesting case might be deferred rendering engines, where a first stage prepares geometrical data for each pixel and is heavily vertex biased, followed by a series of post-processing stages that apply the lighting models and are heavily fragment biased.
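
As a back-of-the-envelope illustration of that point (the unit counts and per-pass work figures below are invented for the example, not R500 numbers), a fixed vertex/pixel split only wins when the workload matches it, while a unified pool tracks whatever mix each pass has:

def fixed_split_time(vertex_work, pixel_work, v_units=8, p_units=16):
    # A hardwired split finishes when the slower of the two pools is done.
    return max(vertex_work / v_units, pixel_work / p_units)

def unified_time(vertex_work, pixel_work, units=24):
    # An ideal unified array spreads the combined work over every unit.
    return (vertex_work + pixel_work) / units

# Deferred-rendering style workload: a vertex-biased geometry pass followed by a
# fragment-biased lighting/post-processing pass (arbitrary work units).
passes = {"geometry": (2400, 1600), "lighting": (100, 4700), "balanced": (800, 1600)}
for name, (v, p) in passes.items():
    print(name, fixed_split_time(v, p), unified_time(v, p))
# balanced: both give 100.0; geometry and lighting: the unified pool finishes sooner.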
 
_xxx_ said:
fek said:
No, the prioritisation algorithm will be pretty simple: as long as there are pixels in queue, pixels will be processed; when there are no more pixels, vertices will be processed to create more pixels to process.

Hmm, sounds interesting. On one side that creates a stall (the time between "no more pixels left" and "new pixels ready", and the buffering involved), and on the other side nothing is ever idle (ideal case). I wonder if that's better or worse for efficiency than the traditional architectures, or to put it this way: in which situation is one better than the other?

In my opinion that is the wrong approach. You should try to keep the triangle queue filled rather than the fragment queue empty to avoid stalls. So vertices should have priority and be executed as soon as all the resources they require are available.

What Dave is saying is that rather than just implementing this fully dynamic but simplistic approach, ATI has implemented a mechanism to assign a weight/priority to vertex and fragment threads. That weight may be dynamically calculated by the GPU and/or statically calculated for each batch/frame by the library/driver from GPU feedback statistics.
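
A rough sketch of what such a weighted arbiter could look like (the credit scheme, weights and thread types here are my own guess at the idea, not ATI's actual mechanism):

from collections import deque

class WeightedArbiter:
    """Pick vertex or fragment threads in proportion to assigned weights."""

    def __init__(self, vertex_weight=1, fragment_weight=3):
        self.weights = {"vertex": vertex_weight, "fragment": fragment_weight}
        self.credit = {"vertex": 0, "fragment": 0}
        self.queues = {"vertex": deque(), "fragment": deque()}

    def set_weights(self, vertex_weight, fragment_weight):
        # Could be updated per batch/frame by the driver from GPU feedback stats.
        self.weights = {"vertex": vertex_weight, "fragment": fragment_weight}

    def pick(self):
        ready = [t for t in self.queues if self.queues[t]]
        if not ready:
            return None
        for t in ready:                       # smooth weighted round-robin
            self.credit[t] += self.weights[t]
        chosen = max(ready, key=self.credit.get)
        self.credit[chosen] -= sum(self.weights[t] for t in ready)
        return chosen, self.queues[chosen].popleft()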
 
RoOoBo said:
In my opinion that is the wrong approach. You should try to keep the triangle queue filled rather than the fragment queue empty to avoid stalls. So vertices should have priority and be executed as soon as all the resources they require are available.

I understand your point of view, but I think it's the other way around. The reason is that there are many more pixels in flight than vertices, and you want to process them as soon as you can. Once all pixels are gone, vertices can produce more primitives to rasterize, producing more pixels to fill up the queue again for the next cycle. At least, this was (more or less) the line of reasoning ATI gave us.
 
DaveBaumann said:
The key thing is to best choose a processing size that fits with the inherent latency of the ALUs.

OK I'm going to assume from this that the ALUs are not pipelined at all, and that the inherent latency is due to instruction decode/setup. This would reinforce the idea of quad-serialisation, since the ALU would suffer decode latency once per quad, per instruction.

On this basis it pays to string lots of quads through one ALU, as each quad's share of the decode latency is inversely proportional to the number of quads.

But you don't want to share decode logic across all the ALUs, otherwise you have all the ALUs forced to run the same instruction. Most triangles would be far too small for that, and it would make execution of vertex shaders incredibly granular (unrealistically large batches).

So it boils down to a trade-off that's similar to the trade-off that ATI makes in the R300-based architectures for sizing the quad-tiles, currently 16x16 pixels in size. That trade-off seems to be driven by texture cache size versus triangle size and overdraw.

So the group size determines decode latency for the ALUs, but it also determines the granularity of triangle T&L/shading or per-triangle pixel-shading.

Large granularity is undesirable, but fine granularity creates a huge overhead, both in terms of instruction decode logic and increased latency.

All presuming that there is no ALU pipelining...

Jawed
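
To put rough numbers on that amortisation argument (the 8-cycle decode cost and the quad counts below are arbitrary, chosen only to show the shape of the trade-off):

def decode_overhead_per_quad(decode_cycles, quads_per_group):
    # One decode is paid per group per instruction and shared by every quad in it,
    # so each quad's share falls as 1 / quads_per_group.
    return decode_cycles / quads_per_group

for quads in (1, 4, 16, 64):
    print(quads, decode_overhead_per_quad(8, quads))
# 1 -> 8.0 cycles per quad per instruction, 4 -> 2.0, 16 -> 0.5, 64 -> 0.125,
# but a 64-quad group is 256 pixels of granularity per shading batch.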
 
fek said:
I understand your point of view, but I think it's the other way around. The reason is that there are many more pixels in flight than vertices, and you want to process them as soon as you can. Once all pixels are gone, vertices can produce more primitives to rasterize, producing more pixels to fill up the queue again for the next cycle. At least, this was (more or less) the line of reasoning ATI gave us.

I don't understand how that works. If you don't have shaded vertices that can be rasterized into fragments, you don't have fragments to shade. The prioritization of fragments implies that there may be stalls when all the queued fragments have been processed and new vertices have to be shaded. It may take many cycles until the new vertices are assembled into triangles, the triangles rasterized, and the new fragments reach the shader. If the vertex programs are long, that stall may be hundreds of cycles long.

There is no additional work performed when vertices are prioritized, as all those vertices have to be shaded. The rendering of a batch ends after the last fragment from the last triangle assembled from the last vertices is fully processed, so there is no way you can finish the batch before the last vertices are shaded.

If the batch is fragment limited, the vertex queues will fill fast and the shader will be executing fragments most of the time. Only briefly, as groups of shaded vertices are consumed and the post-vertex-shading queues empty, will the shaders execute new vertex threads.

The only problem I see is that if the size of a vertex group (the processing unit for a shader) is too large, there will be a longer delay until the first fragments reach the z and color stages. If the overhead of switching from vertex shading to fragment shading (remember there are two shader programs and two different sets of shader state involved) is so large that it implies large vertex groups, that may be a problem. However, I don't see how executing fragments first solves that problem, as you still have to process vertices in those large groups and wait until the first vertices in the group generate new fragments.

If the batch is vertex limited, the fragment stages are going to be underutilised whatever you do, and fragments will only execute when the vertex queues are full. If the fragment queues become full while the vertex queue is still being filled (unlikely, as the fragment queues are quite a bit larger), it won't matter, as the render time will be determined by the number of vertices to shade and the temporary burst of fragments will be hidden by other vertices generating fewer fragments.
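
A minimal sketch of the vertex-first policy being argued for here, assuming the "resources they require" boil down to space in the post-vertex-shading queue (all names and the capacity are illustrative):

from collections import deque

VERTEX_OUT_CAPACITY = 64            # illustrative post-vertex-shading queue size

vertex_in, vertex_out, fragment_in = deque(), deque(), deque()

def pick_next_work():
    # Vertices run as soon as their output queue has room; fragments soak up the
    # remaining cycles, so the shader only idles when neither kind of work exists.
    if vertex_in and len(vertex_out) < VERTEX_OUT_CAPACITY:
        return ("vertex", vertex_in.popleft())
    if fragment_in:
        return ("fragment", fragment_in.popleft())
    return None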
 
Jawed said:
But you don't want to share decode logic across all the ALUs, otherwise you have all the ALUs forced to run the same instruction. Most triangles would be far too small for that, and it would make execution of vertex shaders incredibly granular (unrealistically large batches).

What is the problem with fragments of different triangles fetching/decoding/executing the same instruction? All the triangles in a batch execute the same program. For the shader unit it doesn't matter much whether a fragment comes from one triangle or another. Well, unless the attribute interpolation is performed in the same shader unit, in which case it requires access to per-triangle attribute data ... But that would be something like a special register move/load instruction that should be handled in a special way.

And batches actually must be quite large in terms of fragments (and relatively large in terms of triangles) if you don't want to be state-change limited, which is quite a silly way to waste GPU performance unless it's unavoidable. All those PowerPoints about GPU optimization asking for large batches exist for a reason.

Also, I wonder if the decode stage is really that complex in a shader unit, as they could still be mostly microcoded. As in a DSP, the scheduling and data dependency checks could all be performed by the compiler in the library/driver. An instruction fetch, depending on how the shader architecture is organised, could be reused for dozens of inputs. That is what a vector processor is about: a single instruction (a single fetch) executed over dozens of inputs, reducing any need for the complex and expensive fetch units found in general-purpose CPUs.
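
A toy illustration of that "one fetch, many inputs" point, treating a group of fragments from arbitrary triangles as one vector (the instruction format, register names and interpolation handling are invented for the example):

def execute(instruction, fragment_group, triangle_attributes):
    # A single fetched/decoded instruction is applied to every fragment in the
    # group, regardless of which triangle each fragment came from.
    op, dst, src_a, src_b = instruction
    for frag in fragment_group:               # in hardware: parallel SIMD lanes
        if op == "MUL":
            frag[dst] = frag[src_a] * frag[src_b]
        elif op == "INTERP":
            # the special case noted above: needs per-triangle attribute data,
            # handled here as a special load rather than ordinary ALU work
            frag[dst] = triangle_attributes[frag["tri_id"]][src_a]
    return fragment_group

frags = [{"tri_id": 0, "r0": 2.0, "r1": 3.0}, {"tri_id": 7, "r0": 4.0, "r1": 0.5}]
execute(("MUL", "r2", "r0", "r1"), frags, triangle_attributes={})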
 
DaveBaumann said:
The engineers nearly groaned when I asked for an explanation of the load balancer - in conceptual terms it's simple, in real terms it seems fairly complex logic. At the basic level it analyses the sizes of the vertex and pixel buffers and tries to apportion the load so that they are equalised as much as possible (dependent on the program load).

In a perfect world the load balancer logic should be fairly simple; I believe the extra complexity mostly arises from every variable whose value can't be predicted. R500 in a way already addresses this problem by making frame buffer reads/writes completely predictable, but texture sampling latencies still depend on a shared bus. Dynamic branching in vertex and pixel shaders would make scheduling even more complex, as it can shorten or lengthen shader execution time in unpredictable ways.

DaveBaumann said:
It can also receive hints from the OS and application so the programmer can give a bias to the load balancer, increasing the priority of load types.

I remember we already speculated about this 'feature' a year ago or so.. ;)
However, this fact might tell us that autobalancing is not the best choice in every case. What if the balancing policy reserves/gives some ALU cycles to vertex processing even when there's no need at all? Let's say we are drawing a full-screen quad and we're applying some convolution filter: we would like to assign all the ALUs to pixel processing..
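
For what it's worth, the conceptual version of that balancer might look something like this (the queue-occupancy formula and the hint range are my guesses, not the real R500 logic):

def vertex_share(vertex_queue_len, pixel_queue_len, app_bias=0.0):
    """Fraction of shader capacity to aim at vertex work.

    app_bias in [-1, 1]: negative favours pixels, positive favours vertices, so a
    full-screen convolution pass could request a bias close to -1.
    """
    total = vertex_queue_len + pixel_queue_len
    share = 0.5 if total == 0 else vertex_queue_len / total   # equalise buffers
    share += 0.5 * app_bias                                   # programmer hint
    return min(1.0, max(0.0, share))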
 
The balancer is also going to be working in conjunction with the sequencers - the balancer can't actually allocate ALUs, that's the sequencers' job, so if there are no VS programs in for the sequencers then the ALUs will all be executing pixel processing jobs.
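
So the split might be pictured roughly like this (a sketch of the description above only, not the actual hardware): the balancer just sets a target share, and the sequencers can only hand ALUs work that actually exists.

def allocate_alus(num_alus, has_vertex_threads, has_pixel_threads, vertex_share):
    # With no VS programs in flight the target share is moot: everything runs pixels.
    if not has_vertex_threads:
        return {"vertex": 0, "pixel": num_alus}
    if not has_pixel_threads:
        return {"vertex": num_alus, "pixel": 0}
    n_vertex = min(num_alus, max(0, round(num_alus * vertex_share)))
    return {"vertex": n_vertex, "pixel": num_alus - n_vertex}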
 
RoOoBo said:
I don't understand how that works. If you don't have shaded vertices that can be rasterized into fragments, you don't have fragments to shade. The prioritization of fragments implies that there may be stalls when all the queued fragments have been processed and new vertices have to be shaded. It may take many cycles until the new vertices are assembled into triangles, the triangles rasterized, and the new fragments reach the shader. If the vertex programs are long, that stall may be hundreds of cycles long.

This kind of stall might happen but it's an infrequent situation. You want to optimise the most frequent situation though.

There is no additional work performed when vertices are prioritized, as all those vertices have to be shaded. The rendering of a batch ends after the last fragment from the last triangle assembled from the last vertices is fully processed, so there is no way you can finish the batch before the last vertices are shaded.

This policy would lead to two problems:

- the first is with complex vertices holding up fragments from going through, starving the rest of the pipeline (this isn't such a big problem anyway)

- you very quickly and often fill up the pixel queues by rasterizing lots of triangles and not processing their fragments, because you are busy processing all vertices first; this is a stall, and when one triangle is producing 20 pixels on average this is very likely to happen and something you want to avoid; when the pixel queue is filled up, you must cancel all vertex processing, process some pixels, go back to vertices, fill up, stall, and so on

However, I don't see how executing fragments first solves that problem, as you still have to process vertices in those large groups and wait until the first vertices in the group generate new fragments.

I think this kind of stall is unavoidable but it's more desirable than starving the whole pipeline or having to flush vertices.

If the batch is vertex limited, the fragment stages are going to be underutilised whatever you do, and fragments will only execute when the vertex queues are full. If the fragment queues become full while the vertex queue is still being filled (unlikely, as the fragment queues are quite a bit larger), it won't matter, as the render time will be determined by the number of vertices to shade and the temporary burst of fragments will be hidden by other vertices generating fewer fragments.

Yes, that's true, but batches tend to be more often fragment limited than vertex limited.

I think there's no perfect solution for every case, but this solution works best in the average case and it's easy to implement. As Dave pointed out, this is a high-level view of things; the actual low-level implementation might be slightly different and use some logic to prioritize vertices in the corner cases where it makes more sense.
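
Putting the two positions side by side, both seem to agree on the forced moves at the extremes and only differ on the preference in between; a sketch of that combined behaviour (the capacity threshold and names are mine):

def pick(pixels_queued, vertices_queued, pixel_queue_capacity, prefer_pixels=True):
    if pixels_queued == 0:
        # the stall case RoOoBo describes: only vertex work can refill the pipe
        return "vertex" if vertices_queued else None
    if pixels_queued >= pixel_queue_capacity or vertices_queued == 0:
        # the back-pressure case fek describes: a full pixel queue forces pixel work
        return "pixel"
    return "pixel" if prefer_pixels else "vertex"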
 
DaveBaumann said:
The balancer is also going to be working in conjunction with the sequencers - the balancer can't actually allocate ALUs, that's the sequencers' job, so if there are no VS programs in for the sequencers then the ALUs will all be executing pixel processing jobs.
Nice. Everyone is overspeculating..Dave..give us all the stuff you learnt ASAP! please :)
 
I can't find the patent on the European patent site, where you can get them as PDFs rather than TIFFs.

Fek, if an application is really using fragment shaders, hardly any stage below the shader stages is going to be fully used; in fact, utilizations of 10% or less are more likely. Modern graphics applications are likely to be either shader limited or bandwidth limited. In the shader-limited case the prioritization of fragments creates stalls in the shader, while the prioritization of vertices doesn't produce stalls in the shader. So the second is better.

If the vertex and triangle queues are full, the fragment queues (pre-shading) will only be empty if the triangles don't generate any valid fragments, which means there can't be stalls for lack of fragments to process, because there are no fragments to process. In fact, as the hierarchical Z test and the early Z and stencil stages sit before fragment shading, you want those queues to be as full as possible, and that can only happen if the vertex and triangle queues are filled.
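
A small sketch of that last point, assuming a generator-style pipeline where hierarchical/early Z sits in front of the shader input queue (the function names are placeholders):

def surviving_fragments(triangles, rasterize, early_z_test):
    # The pre-shading fragment queue can only run dry if the rasterized triangles
    # produce no fragments that survive hierarchical Z / early Z and stencil,
    # in which case there is simply nothing left to shade.
    for tri in triangles:
        for frag in rasterize(tri):
            if early_z_test(frag):
                yield frag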
 
RoOoBo said:
I can't find the patent on the European patent site, where you can get them as PDFs rather than TIFFs.
It will appear in a couple of weeks..
 