Xbox 360 GPU explained... or so

Jawed said:
rwolf said:
Here is a nasty limitation.

All 48 of the ALUs are able to perform operations on either pixel or vertex data. All 48 have to be doing the same thing during the same clock cycle (pixel or vertex operations), but this can alternate from clock to clock. One cycle, all 48 ALUs can be crunching vertex data, the next, they can all be doing pixel ops, but they cannot be split in the same clock cycle.

Yep, it sounds shit to me. It makes me wonder if dynamic branching is ever going to bring improved performance. Seems unlikely.

Jawed
How does this relate to dynamic branching. Would not vertex branches be to other vertex processing only (and same for pixel shading)?
 
rwolf said:
It is not about feeding the GPU, but utilizing the resources in the GPU effectively. The GeForce FX had terrible performance because it couldn't utilize all its processing power. The ALUs in the GPU were starved, and it wasn't until Nvidia wrote a decent shader compiler that the performance improved.

ATI has developed a chip that dynamically does what the shader compiler was doing and keeps the ALUs busy all the time.

The GFX's performance troubles are orthogonal to the issue of unified shading. You could have a unified shader architecture and still not have enough buffer space to hold the context of all the threads you've got going (temporary register space, etc), and thus you'd stall. Unified shading wouldn't fix that. The GFX's problem was that it ran out of resources and the penalty for a resource overrun was very bad.

In the case of the arbiter design, we are talking about the arbitration algorithm and what the best policy is. Do all of one thing until you block, then switch to the other tasks (geometry shading, vertex shading, pixel shading)? Or don't wait until the FIFOs are completely full, but switch when other pipelines are almost out of work (say, when the pixel shading pipeline is 90% free), so that you prevent the workload from drying up?

The issue I'm getting at is trying to prevent bubbles from entering the pipeline. It seems to me that switching tasks only when you can't stuff any more work into the buffers could mean a hiccup. Unless you assume that when a vertex shader finishes, the pixel shading ALUs can start work on that item in the very next cycle (leaving aside triangle setup, et al), there is a big gap in between, and you'd never want to wait until, say, the vertex shaders have "nothing more they can do" before you switch to PS, and vice versa.
 
nelg said:
How does this relate to dynamic branching. Would not vertex branches be to other vertex processing only (and same for pixel shading)?
Yes, sure.

I'm curious about the number of instructions that each object (pixel or vertex) has executed for it.

If a dynamic loop would execute 2 times for one pixel but 5 times for another you would hope that the first pixel would exit the loop early and continue with another part of the shader code.

What we're hearing is that all pixels in a group will run the loop the same number of times (the maximum is determined by whichever pixel takes longest). Pixels like the first one aren't affected by the loop after the second time round, but they still have to go through the motions until all pixels have finished the loop.

We don't know how big the group is. Is it all 48 objects, or 16 or 4?...

The larger the group, the lower the chances that you'll gain any performance from using dynamic branching.
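The loop-divergence cost described above can be sketched in a few lines (a toy model, not anything confirmed about the hardware): every object in a SIMD group steps through the loop together, so the group pays for the slowest object's iteration count.

```python
# Toy model: cost of a dynamic loop on a SIMD group. All objects in a
# group run the loop in lock-step, so the group pays for the slowest
# object's iteration count.

def group_loop_cycles(iterations_per_object, group_size):
    """Split objects into SIMD groups; each group runs the loop
    max(iterations) times, regardless of per-object needs."""
    total = 0
    for start in range(0, len(iterations_per_object), group_size):
        group = iterations_per_object[start:start + group_size]
        total += max(group)  # every lane runs until the slowest finishes
    return total

# One pixel needs 5 iterations, the rest only 2.
work = [2, 5, 2, 2]
print(group_loop_cycles(work, group_size=4))   # one group of 4 -> 5
print(group_loop_cycles(work, group_size=1))   # fully independent -> 11
```

With a group of 4 the whole group loops 5 times; with groups of 1 only 11 total iterations are needed, which is why larger groups make dynamic branching less likely to pay off.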

(Hope that's what you're asking!)

Jawed
 
rwolf said:
Here is a nasty limitation.

All 48 of the ALUs are able to perform operations on either pixel or vertex data. All 48 have to be doing the same thing during the same clock cycle (pixel or vertex operations), but this can alternate from clock to clock. One cycle, all 48 ALUs can be crunching vertex data, the next, they can all be doing pixel ops, but they cannot be split in the same clock cycle.

This is not actually the case. I'm beginning to see where the confusion arises from though...
 
Jawed - based on these statements (rumors?):
- a thread consists of 64 pixels or verts
- there are 3 banks of 16 processing elements (a bank is 16-way SIMD)

here are my guesses :

- each of the 3 banks of 16 processing elements has a different program counter
- each processing element inside a bank runs the same instruction on a given cycle
- each processing element takes care of 4 out of 64 pixels/verts in a thread and each instruction gets run for 4 consecutive cycles. This provides execution unit latency hiding and maintains organization of pixels into quads.
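These guesses can be sketched as a lane mapping (the exact assignment of objects to PEs is my own assumption, not confirmed anywhere): a bank of 16 PEs covers a 64-object thread by running each instruction for 4 consecutive cycles, with PE p owning objects 4p..4p+3, i.e. one quad per PE.

```python
# Hypothetical lane mapping for one bank: 16 PEs, 64-object thread,
# each instruction issued for 4 consecutive cycles. PE p handles the
# quad of objects 4p..4p+3, one object per cycle, which keeps a quad
# together on a single PE.

PES = 16
THREAD_SIZE = 64
CYCLES_PER_INSTR = THREAD_SIZE // PES  # 4

schedule = {}  # (cycle, pe) -> object index within the thread
for cycle in range(CYCLES_PER_INSTR):
    for pe in range(PES):
        schedule[(cycle, pe)] = pe * CYCLES_PER_INSTR + cycle

# Every object is touched exactly once per instruction issue:
assert sorted(schedule.values()) == list(range(THREAD_SIZE))
# PE 0 processes the first quad (objects 0-3) over the 4 cycles:
assert [schedule[(c, 0)] for c in range(4)] == [0, 1, 2, 3]
```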

If this is true, dynamic branching performance should still be much better than NV4x pixel shaders (but possibly not as good as NV4x vertex shaders - this will be interesting to find out)

On the other hand, maybe each processing element tracks a PC for each of its assigned pixels or verts. Even if all PEs in a bank get the same instruction, an individual PE could still search through its assigned pixels/verts in an attempt to find ones which continue with the given instruction. It would be a form of out-of-order execution - instructions get processed in-order, but data gets processed out-of-order.

:? I have the feeling that without being an NDAed developer, we're not going to find out the interesting stuff until it's no longer interesting.

Also - on that block diagram - doesn't it sort of look like each bank passes stuff to the next bank? Also - the shader interpolators are connected only to the first bank... dunno what that would mean
 
psurge - I've just started transcribing a conference call with a few of the lead architects. Your initial three guesses are very good. Basically, ignore the diagram for any kind of flow, it's not at all accurate - the three SIMD engines are not in any way dependent on one another.
 
:LOL: - psurge, there's Dave posting just ahead of you to say just how confusing this architecture is!

Going back to the patent and thinking about the command thread queues (one for pixel shaders, the other for vertex shaders), each thread is described as having a status recorded against it, in the queue. e.g. "needs texturing" "needs ALU". This is what drives thread switching, finding threads that are ready to execute and prioritising by age, or shader length (or something).

In general, when you start a group of objects together and run the same code on them, they'll all run in sync with each other (dynamic branches being the exception). So each time you pull a single thread out of a command queue to execute it (ALU or texturing), you're actually pulling out 4 or 16 or 48 objects, all with a common (coherent) state, at least partially so.

The patent also talked about thread interleaving in the ALU pipeline. This may turn out to be a reference to the interleaving of vertex and pixel threads. Or...

Your idea about quad serialisation is one that I've pondered a number of times, since quad-organisation is primarily a mechanism to optimise texturing, as far as I can tell - there's no specific reason to run the pixels of a quad across multiple pipes when they could simply follow one after the other.

In this scenario you don't process twelve pixel quads (48 pixels total), you process 48 pixels at a time without thinking in "quads". Then you repeat over and over, until you've exhausted all the pixels for the current triangle/shader.

By swapping round-robin between threads (one thread per group), you execute one instruction per group of objects.

So if you have 96 pixels to be shaded then you split them into 6 groups of 16, say. Each group of 16 pixels is submitted one after the other for execution, but with thread interleaving. So if your shader looks like:

InstrX
InstrY
InstrZ

Execution of the pixel groups (threads A to F, each of 16 pixels) looks like:
Code:
InstrX:
A1 A2 A3 ... A15 A16
B1 B2 B3 ... B15 B16
...
F1 F2 F3 ... F15 F16

InstrY:
A1 A2 A3 ... A15 A16
B1 B2 B3 ... B15 B16
...
F1 F2 F3 ... F15 F16

InstrZ:
A1 A2 A3 ... A15 A16
B1 B2 B3 ... B15 B16
...
F1 F2 F3 ... F15 F16

This way the instruction decode latency is minimal, which makes the instruction execution pipeline extremely short (you only need to decode once before running off 96 pixels!).
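The interleaving described above can be written out as a generator (a sketch of the idea only; the real issue order is unknown): decode each instruction once, then sweep it across every group before moving on.

```python
# Sketch of the interleaving described above: threads A-F of 16 pixels
# each; one instruction is decoded once, then run across all groups
# before the next instruction is touched.

def interleaved_order(instructions, threads, group_size):
    """Yield (instruction, thread, pixel) tuples in issue order."""
    for instr in instructions:          # decode each instruction once...
        for t in threads:               # ...then sweep it over every group
            for px in range(1, group_size + 1):
                yield (instr, t, px)

order = list(interleaved_order(["InstrX", "InstrY", "InstrZ"],
                               ["A", "B", "C", "D", "E", "F"], 16))
assert order[0] == ("InstrX", "A", 1)
assert order[16] == ("InstrX", "B", 1)   # next group, same instruction
assert len(order) == 3 * 6 * 16          # 288 issue slots for 96 pixels
```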

You only process those 96 pixels when the texels have been produced. Ideally you'd blat the TMUs for the maximum coverage of texels with the minimum of requests. Quads are no longer a useful way of thinking of texels. Too fine-grained.

When another texture operation is required, then you swap out this wodge of 96 pixels and attack some other group. And keep on attacking groups until our 96 pixels' texels are ready.

My worry is all the small triangles - what I've described seems to be good at large areas of the same shader. The only defence would appear to be that lots of contiguous triangles across a surface will be running the same shader - so ideally you want to guarantee that you generate the vertices for contiguous surfaces serially, to maximise texturing coherency. Which is probably where higher order surfaces or tessellation come in (he says, knowing nothing about either!).

Anyway, that's my mad variation on your ideas.

Can't wait to find out the truth! Exciting stuff.

Jawed
 
Of course I've blithely gone on about 96 pixels, when 64 pixels (groups A, B, C and D) would match all the rumours and the leak, and would, of course, form up to make a quad, serialised.

Also, I've just realised this makes for incredibly fast branching (when prediction fails)!

Because the pipeline for any one thread only holds one instruction (!), the latency for branching is effectively only 1 cycle (the branch test itself). If the branch prediction fails, the thread comes back with whatever the next instruction should be - there's no pipeline flush required as the pipeline is too short to hold the next instruction for this thread.
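A toy model of that argument (purely illustrative, not from the patent): the flush cost of a misprediction is the number of the thread's own instructions in flight past the branch, which the interleaving drives to zero.

```python
# Toy model: on a misprediction a thread must discard every one of its
# own instructions fetched past the branch. With the interleaving
# described above, at most one instruction per thread is in flight, so
# there is nothing to flush.

def flushed_on_mispredict(in_flight_per_thread):
    """Instructions of the mispredicting thread thrown away: everything
    fetched past the branch itself."""
    return max(0, in_flight_per_thread - 1)

assert flushed_on_mispredict(1) == 0   # interleaved: branch costs ~1 cycle
assert flushed_on_mispredict(8) == 7   # deep conventional pipeline: flush
```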

Against this you have the fact that 16 pixels in a group (say) are joined together, which means that there could be quite a severe cost associated with one pixel that wants to loop five times, whilst the rest of the group (15 pixels) only wants to loop two times.

Anyway, we'll soon see how big a group is...

Jawed
 
psurge said:
here are my guesses :

- each of the 3 banks of 16 processing elements has a different program counter
- each processing element inside a bank runs the same instruction on a given cycle
- each processing element takes care of 4 out of 64 pixels/verts in a thread and each instruction gets run for 4 consecutive cycles. This provides execution unit latency hiding and maintains organization of pixels into quads.


This works well as long as you have 2 or fewer branches. Once you have 3 or more branches, you need 3 or more program counters. Fortunately, that probably won't be a common occurrence.
 
Jawed - some more thoughts. The "out-of-order" approaches we are discussing are quite complicated - the quads exist not only for texture cache optimization, but for texture LOD optimization (I think each pixel in a quad gets the same texture LOD as its neighbours) and Z compression. So I would think that at the very least you'd have to retire quads (or something like them) in-order. If you are able to check out arbitrary groups of pixels or verts, you have to deal with syncing everything back up at the end, since output order is critical for verts and API-specified for pixels... This would make scheduling much more difficult, since you'd have to make sure that a particular pixel didn't get "left behind" for too long just because there was other work available...
 
psurge, what about moving quads between the 3 "banks"? For example, suppose you were able to figure out (via some sort of prediction) that a particular quad always takes one branch and should therefore be in Bank A (which always takes the branch), but somehow that poor ole quad ended up in Bank B (which doesn't take the branch). Then of course that quad ends up executing extra code, and its "spot" in Bank B could have been given up to another quad that's better suited.
 
DemoCoder -

I was thinking something similar: You only run a small subsequence of the shader program on each quad in a group (the body of a loop, or the body of an if statement for example). You allow a set of groups to accumulate results into 2 sets of child groups (1 for each side of a branch). You make child groups available to the scheduler as soon as they are filled up, will no longer grow, or some deadline expires. Each bank would send results to both of the other banks...

So for example, if most quads take a branch, you would accumulate the ones that didn't over some limited amount of time or # of input groups, then schedule as many as you can en-masse...

I think Dave's comment pretty much shoots all this speculation down.

BTW - I'm equally excited about the RSX pipeline architecture, maybe because info on that is non-existent. I figure that just because it can't process vertices doesn't mean that it won't do something novel with regards to branching...
 
psurge said:
Jawed - based on these statements (rumors?):
- a thread consists of 64 pixels or verts
- there are 3 banks of 16 processing elements (a bank is 16-way SIMD)

here are my guesses :

- each of the 3 banks of 16 processing elements has a different program counter
- each processing element inside a bank runs the same instruction on a given cycle
- each processing element takes care of 4 out of 64 pixels/verts in a thread and each instruction gets run for 4 consecutive cycles. This provides execution unit latency hiding and maintains organization of pixels into quads.

What Dave said was that these guesses by psurge were very good .. so that would be almost right, if not right.

DaveBaumann said:
psurge said:
Also - on that block diagram - doesn't it sort of look like each bank passes stuff to the next bank? Also - the shader interpolators are connected only to the first bank... dunno what that would mean

Basically, ignore the diagram for any kind of flow, its not at all accurate - the three SIMD engines are not in any way dependant on one another.

What Dave then said was that the diagram's flow was inaccurate .. even if it is depicted like that.

I take it, it is this diagram?

block.gif
 
psurge said:
Jawed - based on these statements (rumors?):
- a thread consists of 64 pixels or verts
- there are 3 banks of 16 processing elements (a bank is 16-way SIMD)

here are my guesses :

- each of the 3 banks of 16 processing elements has a different program counter
- each processing element inside a bank runs the same instruction on a given cycle
- each processing element takes care of 4 out of 64 pixels/verts in a thread and each instruction gets run for 4 consecutive cycles. This provides execution unit latency hiding and maintains organization of pixels into quads.

That sounds like three (true) vector processors with the shader main loop (per shader primitive, either vertex or fragment) unrolled to the proper vector size (16 or 48?). The last point would mean that the vector ALUs aren't true SIMD ALUs; instead SIMD instructions are also unrolled and converted into up to four consecutive instructions. That would actually be quite a bit more optimal for a vector architecture, as it would exploit all those scalar instructions and SIMD instructions with a result mask of fewer than three/four components.

On the topic of branches and latency hiding: if switching threads were as cheap as updating the PC (that is, it could be done each cycle) and you had enough threads (something like half a dozen to a dozen for ALU and branch instructions, quite a few more for hiding memory instructions), there shouldn't be any penalty because of data or control dependencies. Shading is such a parallel task that anything other than continuing with the next primitive could be more complex to implement and quite a bit less efficient.

In vector architectures most branching is implemented with vector masks, and I'm not sure how they handle true branches (they obviously support them), but they should be relatively expensive if they happen with small data sets (smaller than a multiple of the vector length). For shading, true branching at the quad level doesn't make sense in my opinion; masking will always be efficient enough and much easier to handle for all kinds of special situations. You have to remember that even if shader masking at the quad level adds inefficiency, it shouldn't be worse than the inherent inefficiency of quads. Up to three of the fragments in a quad could already be fully masked by the early Z and stencil tests.
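The vector-mask (predication) approach can be sketched in a few lines (illustrative only; the function names here are mine): both sides of a branch execute over the whole group, and a per-lane mask selects which result each lane keeps.

```python
# Sketch of branching via vector masks (predication): both sides of an
# if/else run over every lane, and a per-lane mask picks which result
# is kept. No lane ever takes a "true" branch.

def predicated_if(cond, then_fn, else_fn, values):
    mask = [cond(v) for v in values]
    then_r = [then_fn(v) for v in values]   # all lanes run the "then" side
    else_r = [else_fn(v) for v in values]   # ...and all run the "else" side
    return [t if m else e for m, t, e in zip(mask, then_r, else_r)]

out = predicated_if(lambda v: v > 0,    # the "branch" condition
                    lambda v: v * 2,    # taken path
                    lambda v: -v,       # not-taken path
                    [3, -1, 0, 5])
print(out)  # [6, 1, 0, 10]
```

The cost is that every lane pays for both paths, which is exactly the inefficiency being weighed against true branching above.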

I don't think that the arbiter between vertex and fragment threads requires any kind of complex policy, but I haven't analysed the question in detail. I tried giving priority to vertices, so that the vertex-triangle pipes are always filled and providing work for the fragment pipe, and it worked well. The workload balance between batches (vertex-limited vs fragment-limited batches) inside a frame meant that configurations with more than 2 (quad) unified shader units had slightly better performance than the equivalent architecture with 4 vertex shaders and the same number of (quad) fragment shader units.
 
I agree, all the objects in a group will run the code to completion, with a "vector mask" (predication) on instructions to determine whether a given pixel or vertex is affected by the instruction.

I don't expect the GPU to try to re-sequence pixels/vertices into separate groups, as dynamic branching occurs. The fragmentation is practically infinite and it'll run away into a mess.

The key thing, ultimately, is to choose the group size that typically provides the best compromise between the bulk of code that has no dynamic branches in it and the odd occasions when only some pixels/vertices will run portions of the code or iterations of a loop.

Obviously there are times when a dynamic branch affects all pixels/vertices equally, anyway - e.g. if the player is under water and there's an extra path in the shader for under-water rendering.

Jawed
 
The key thing is to best choose a processing size that fits with the inherent latency of the ALUs.
 
nAo said:
A stupid arbiter would just transform vertices until an internal buffer that stores transformed vertices is full, then the arbiter switches the ALUs to process pixels and so on, auto-balancing pixel and vertex throughput.
I hope hw is a bit smarter than that ;)

It'll be slow because you have to sort the whole stuff and buffer everything before you send it into the "pipelines". Buffering == bad.
I think you are not completely grasping what unified ALUs are doing, you don't need to sort/buffer anything.

Well I said I don't really grasp it ;)

Thx for the bit of explanation, I didn't think of it that way. :)
 
_xxx_ said:
You mean like, say, we have 25% pixel shading compared to 75% vertex shading workload in a scene and so the GPU would do n cycles pixel shading and 3xn cycles vertex shading?

No, the prioritisation algorithm will be pretty simple: as long as there are pixels in queue, pixels will be processed, when there are no more pixels, vertices will be processed to create more pixels to process.
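That policy is simple enough to write down directly (a conceptual sketch, with made-up queue contents): drain the pixel queue first, and only shade vertices, which eventually produce more pixels, when it is empty.

```python
# Sketch of the pixels-first policy described above: as long as there
# are pixels queued, process pixels; otherwise process vertices to
# create more pixels. Queue contents are stand-ins, not real workloads.

from collections import deque

def arbiter_step(pixel_q, vertex_q):
    """Pick what the unified ALUs do this 'cycle'."""
    if pixel_q:
        return ("pixel", pixel_q.popleft())
    if vertex_q:
        return ("vertex", vertex_q.popleft())
    return ("idle", None)

pixels = deque(["p0", "p1"])
verts = deque(["v0"])
trace = [arbiter_step(pixels, verts)[0] for _ in range(4)]
print(trace)  # ['pixel', 'pixel', 'vertex', 'idle']
```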
 
The engineers nearly groaned when I asked for an explanation of the load balancer - in conceptual terms it's simple, in real terms it seems to be fairly complex logic. At the basic level it analyses the sizes of the vertex and pixel buffers and tries to apportion the load so that they are equalised as much as possible (dependent on the program load). It can also receive hints from the OS and application, so the programmer can give a bias to the load balancer, increasing the priority of load types.
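At the conceptual level Dave describes, the balancer might look something like this (a loose sketch under my own assumptions; the real logic is said to be far more complex): pick the work type whose input buffer is fuller, scaled by an application-supplied bias.

```python
# Loose sketch of the conceptual load balancer: schedule the workload
# whose input buffer is fuller, with an application hint biasing the
# choice. Function and parameter names are illustrative only.

def choose_work(vertex_buf_fill, pixel_buf_fill, vertex_bias=1.0):
    """Return which workload to schedule, given buffer occupancies in
    [0, 1] and a hint biasing vertex work."""
    if vertex_buf_fill * vertex_bias >= pixel_buf_fill:
        return "vertex"
    return "pixel"

print(choose_work(0.8, 0.3))                   # vertex queue fuller -> 'vertex'
print(choose_work(0.4, 0.6))                   # pixel queue fuller -> 'pixel'
print(choose_work(0.4, 0.6, vertex_bias=2.0))  # hint flips it -> 'vertex'
```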
 