xbox360 gpu explained...or so

Discussion in 'Architecture and Products' started by tEd, May 19, 2005.

  1. _xxx_

    Banned

    Joined:
    Aug 3, 2004
    Messages:
    5,008
    Likes Received:
    86
    Location:
    Stuttgart, Germany
    Hmm, sounds interesting. On one side that creates a stall (the time between "no more pixels left" and "new pixels ready", plus the buffering involved), and on the other side nothing is ever idle (in the ideal case). I wonder whether that's better or worse for efficiency than the traditional architectures, or to put it another way: in which situations is one better than the other?
     
  2. fek

    fek
    Newcomer

    Joined:
    Oct 27, 2003
    Messages:
    20
    Likes Received:
    0
    Location:
    Guildford, England
    Dave's description is more accurate than the simplified one I gave you.

    Traditional architectures should theoretically be better when the actual balance between vertices and pixels is close to the balance hardcoded in the hardware. The closer it is, the more efficient traditional architectures are.

    A unified shader architecture might be more efficient when the balance changes wildly over time. An interesting situation is deferred rendering engines, where a first stage prepares geometrical data for each pixel and is heavily vertex biased, followed by a series of post-processing stages that apply the lighting models and are heavily fragment biased.
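    The trade-off described above can be put into toy numbers. A minimal sketch, assuming a simple one-op-per-ALU-per-cycle model; the ALU counts and op counts are invented for illustration, not Xenos figures:

```python
# Toy model of fixed-split vs unified shading (all numbers assumed):
def fixed_split_cycles(vertex_ops, fragment_ops, vertex_alus, fragment_alus):
    # each hardcoded pool works independently; the slower pool bounds the batch
    return max(vertex_ops / vertex_alus, fragment_ops / fragment_alus)

def unified_cycles(vertex_ops, fragment_ops, total_alus):
    # one shared pool: only the total amount of work matters
    return (vertex_ops + fragment_ops) / total_alus

# a heavily fragment-biased pass (e.g. deferred lighting) on a
# hypothetical 8-ALU part split 2 vertex / 6 fragment:
print(fixed_split_cycles(40, 6000, 2, 6))   # 1000.0 - fragment pool is the bottleneck
print(unified_cycles(40, 6000, 8))          # 755.0  - shared pool soaks up the bias
```

    The further the workload drifts from the hardcoded split, the larger the gap between the two numbers gets.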
     
  3. RoOoBo

    Regular

    Joined:
    Jun 12, 2002
    Messages:
    308
    Likes Received:
    31
    In my opinion that is the wrong approach. You should try to keep the triangle queue filled rather than the fragment queue empty to avoid stalls. So vertices should have priority and be executed as soon as all the resources they require are available.

    What Dave is saying is that, rather than just implementing this fully dynamic but simplistic approach, ATI has implemented a mechanism to assign a weight/priority to vertex and fragment threads. That weight may be dynamically calculated by the GPU and/or statically calculated for each batch/frame by the library/driver from GPU feedback statistics.
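    A minimal sketch of such a weighted arbitration scheme, assuming a simple two-queue model; the function, weights, and queue contents are hypothetical, not ATI's actual mechanism:

```python
from collections import deque

def pick_thread(vertex_q, fragment_q, vertex_weight=2, fragment_weight=1):
    """Pop the next thread from whichever non-empty queue wins on weight."""
    if vertex_q and (not fragment_q or vertex_weight >= fragment_weight):
        return ("vertex", vertex_q.popleft())
    if fragment_q:
        return ("fragment", fragment_q.popleft())
    return None  # both queues empty: the shader would idle

vq = deque(["v0", "v1"])
fq = deque(["f0", "f1", "f2"])
order = [pick_thread(vq, fq)[0] for _ in range(5)]
# vertices drain first because their weight is higher; a driver could
# flip the weights per batch based on GPU feedback statistics
```

    Swapping the default weights flips the policy to fragments-first, which is exactly the knob the driver/GPU would be turning.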
     
  4. fek

    fek
    Newcomer

    Joined:
    Oct 27, 2003
    Messages:
    20
    Likes Received:
    0
    Location:
    Guildford, England
    I understand your point of view, but I think it's the other way around. The reasoning being that there are many more pixels in flight than vertices, and you want to process them as soon as you can. Once all the pixels are gone, vertices can produce more primitives to rasterize, which produce more pixels to fill up the queue again for the next cycle. At least, this was (more or less) the line of reasoning ATI gave us.
     
  5. Unknown Soldier

    Veteran

    Joined:
    Jul 28, 2002
    Messages:
    4,047
    Likes Received:
    1,670
    I thought it would have to be prioritised; there was no other explanation.

    btw. fek .. how long till B&W2 ??

    :D

    US
     
  6. fek

    fek
    Newcomer

    Joined:
    Oct 27, 2003
    Messages:
    20
    Likes Received:
    0
    Location:
    Guildford, England
    When it's done :D
     
  7. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    11,716
    Likes Received:
    2,137
    Location:
    London
    OK I'm going to assume from this that the ALUs are not pipelined at all, and that the inherent latency is due to instruction decode/setup. This would reinforce the idea of quad-serialisation, since the ALU would suffer decode latency once per quad, per instruction.

    On this basis it pays to string lots of quads through one ALU, as the aggregate latency is proportional to the inverse of the number of quads.

    But if you share decode logic across the ALUs, you force all the ALUs to run the same instruction. Most triangles would be far too small for that, and it would make execution of vertex shaders incredibly granular (unrealistically large batches).

    So it boils down to a trade-off that's similar to the trade-off that ATI makes in the R300-based architectures for sizing the quad-tiles, currently 16x16 pixels in size. That trade-off seems to be driven by texture cache size versus triangle size and overdraw.

    So the group size determines decode latency for the ALUs, but it also determines the granularity of triangle T&L/shading or per-triangle pixel-shading.

    Large granularity is undesirable, but fine granularity creates a huge overhead, both in terms of instruction decode logic and increased latency.

    All presuming that there is no ALU pipelining...
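    The amortisation argument above can be checked with back-of-envelope numbers; the cycle count here is an assumption for illustration, not a known R500 figure:

```python
DECODE_CYCLES = 4   # assumed decode/setup latency per instruction issue

def decode_overhead_per_quad(num_quads):
    """One decode is shared by every quad strung through the ALU,
    so the per-quad cost falls as the inverse of the group size."""
    return DECODE_CYCLES / num_quads

# quadrupling the group size cuts per-quad decode overhead to a quarter
print(decode_overhead_per_quad(4))    # 1.0 cycle per quad
print(decode_overhead_per_quad(16))   # 0.25 cycles per quad
```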

    Jawed
     
  8. RoOoBo

    Regular

    Joined:
    Jun 12, 2002
    Messages:
    308
    Likes Received:
    31
    I don't understand how that works. If you don't have shaded vertices that can be rasterized into fragments, you don't have fragments to shade. Prioritizing fragments implies that there may be stalls when all the queued fragments have been processed and new vertices have to be shaded: it may take many cycles until the new vertices are assembled into triangles, the triangles are rasterized, and the new fragments reach the shader. If the vertex programs are long, that stall may be hundreds of cycles.

    There is no additional work performed when vertices are prioritized, as all those vertices have to be shaded anyway. The rendering of a batch ends after the last fragment from the last triangle assembled from the last vertices is fully processed, so there is no way you can finish the batch before the last vertices are shaded.

    If the batch is fragment limited, the vertex queues will fill fast and the shader will be executing fragments most of the time. Only briefly, as groups of shaded vertices are consumed and the post-vertex-shading queues emptied, will the shaders execute new vertex threads.

    The only problem I see is that if the size of a vertex group (the processing unit for a shader) is too large, there will be a longer delay until the first fragments reach the z and color stages. If the overhead of switching from vertex shading to fragment shading (remember there are two shader programs and two different sets of shader state involved) is so large that it forces large vertex groups, that may be a problem. However, I don't see how executing fragments first solves that problem, as you still have to process vertices in those large groups and wait until the first vertices in the group generate new fragments.

    If the batch is vertex limited, the fragment stages are going to be underutilized whatever you do, and fragments will only execute when the vertex queues are full. If the fragment queues become full while the vertex queue is still being filled (unlikely, as the fragment queues are quite a bit larger), it won't matter: the render time will be determined by the number of vertices to shade, and the temporary burst of fragments will be hidden by other vertices generating fewer fragments.
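    For scale, the refill latency behind the stall in the first paragraph can be totted up; every latency below is an assumption chosen only to show how the "hundreds of cycles" figure arises:

```python
# Rough size of the stall when the fragment queue empties with no
# shaded vertices banked ahead of time (all latencies assumed):
VERTEX_SHADER_CYCLES = 100   # a long vertex program
ASSEMBLY_CYCLES = 10         # primitive assembly
RASTER_CYCLES = 10           # rasterization until the first fragments emerge

def frag_queue_refill_latency():
    """Cycles from 'fragment queue empty' to 'new fragments ready'."""
    return VERTEX_SHADER_CYCLES + ASSEMBLY_CYCLES + RASTER_CYCLES

print(frag_queue_refill_latency())   # 120 cycles in this toy setup
```

    Keeping the vertex and triangle queues filled hides exactly this chain of latencies.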
     
  9. RoOoBo

    Regular

    Joined:
    Jun 12, 2002
    Messages:
    308
    Likes Received:
    31
    What is the problem with fragments of different triangles fetching/decoding/executing the same instruction? All the triangles in a batch execute the same program, and to the shader unit it doesn't matter much which triangle a fragment comes from. Well, unless attribute interpolation is performed in the same shader unit, in which case it needs to access per-triangle attribute data ... but that would be something like a special register move/load instruction that should be handled in a special way.

    And batches must actually be quite large in terms of fragments (and relatively large in terms of triangles) if you don't want to be state-change limited, which is a quite silly way to waste GPU performance unless it is unavoidable. All those PowerPoints about GPU optimization asking for large batches exist for a reason.

    Also, I wonder if the decode stage is really that complex in a shader unit, as they could still be mostly microcoded. As in a DSP, the scheduling and data dependency checks could all be performed by the compiler in the library/driver. An instruction fetch, depending on how the shader architecture works, could be reused for dozens of inputs. That is what a vector processor is about: a single instruction (a single fetch) executed over dozens of inputs, reducing the need for the complex and expensive fetch units found in general-purpose CPUs.
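    The fetch-reuse point can be quantified; the shader length, fragment count, and group width below are made-up numbers for illustration:

```python
def fetches_needed(num_fragments, num_instructions, group_size):
    """One instruction fetch serves a whole group of fragments,
    so total fetches = groups x instructions."""
    groups = -(-num_fragments // group_size)   # ceiling division
    return groups * num_instructions

# a 64-instruction shader over 4096 fragments
print(fetches_needed(4096, 64, 1))    # 262144 fetches if done per fragment
print(fetches_needed(4096, 64, 64))   # 4096 fetches with 64-wide groups
```

    The wider the group, the cheaper the fetch/decode hardware can afford to be per ALU, which is exactly the vector-processor argument.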
     
  10. nAo

    nAo Nutella Nutellae
    Veteran

    Joined:
    Feb 6, 2002
    Messages:
    4,400
    Likes Received:
    440
    Location:
    San Francisco
    In a perfect world the load-balancer logic should be fairly simple; I believe the extra complexity mostly arises from every variable whose value can't be predicted. R500 in some way already fixes part of this by making frame-buffer reads/writes completely predictable, but texture sampling latencies still depend on a shared bus.
    Dynamic branching in vertex and pixel shaders would make scheduling even more complex, as it can shorten or lengthen shader execution time in unpredictable ways.
    I remember we already speculated about this 'feature' a year ago or so.. ;)
    However, this might tell us that autobalancing is not the best choice in every case.
    What if the balancing policy reserves/gives some ALU cycles to vertex processing even when there's no need at all?
    Let's say we are drawing a full-screen quad and applying some convolution filter; we would like to assign all the ALUs to pixel processing..
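    The full-screen-quad worry in toy numbers; the reserved share here is an invented example, not anything claimed about R500's balancer:

```python
def wasted_alu_fraction(reserved_vertex_share, vertex_share_needed):
    """ALU cycles idled by a static reservation the workload can't use."""
    return max(0.0, reserved_vertex_share - vertex_share_needed)

# a convolution pass over a full-screen quad needs ~no vertex work, so
# statically reserving 25% of cycles for vertices idles ~25% of the ALUs
print(wasted_alu_fraction(0.25, 0.0))   # 0.25
```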
     
  11. Dave Baumann

    Dave Baumann Gamerscore Wh...
    Moderator Legend

    Joined:
    Jan 29, 2002
    Messages:
    14,090
    Likes Received:
    694
    Location:
    O Canada!
    The balancer is also going to be working in conjunction with the sequencers - the balancer can't actually allocate ALUs, that's the sequencers' job - so if there are no VS programs in for the sequencers then the ALUs will all be executing pixel-processing jobs.
     
  12. fek

    fek
    Newcomer

    Joined:
    Oct 27, 2003
    Messages:
    20
    Likes Received:
    0
    Location:
    Guildford, England
    This kind of stall might happen, but it's an infrequent situation, and you want to optimise for the most frequent situation.

    This policy would lead to two problems:

    - the first is complex vertices holding up the fragments behind them, starving the rest of the pipeline (this isn't such a big problem anyway)

    - the second is that you fill up the pixel queues very quickly and often, by rasterizing lots of triangles without processing their fragments, because you are busy processing all the vertices first; this is a stall, and when one triangle produces 20 pixels on average it is very likely to happen and something you want to avoid; once the pixel queue is full, you must cancel all vertex processing, process some pixels, go back to vertices, fill up, stall, and so on

    I think this kind of stall is unavoidable, but it's more desirable than starving the whole pipeline or having to flush vertices.

    Yes, that's true, but batches tend to be fragment limited more often than vertex limited.

    I think there's no perfect solution for every case, but this solution works best in the average case and is easy to implement. As Dave pointed out, this is a high-level view of things; the actual low-level implementation might be slightly different and use some logic to prioritise vertices in corner cases where it makes more sense.
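    The second problem above can be put in numbers; the pixels-per-triangle figure is from the post, while the queue capacity is an assumed value:

```python
PIXELS_PER_TRIANGLE = 20    # average from the scenario above
FRAG_QUEUE_CAPACITY = 128   # assumed pixel queue depth

# shading vertices back-to-back fills the pixel queue after only a few
# triangles, forcing the fill-drain-fill cycle described above:
triangles_before_full = FRAG_QUEUE_CAPACITY // PIXELS_PER_TRIANGLE
print(triangles_before_full)   # 6 triangles, then vertex work must pause
```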
     
  13. nAo

    nAo Nutella Nutellae
    Veteran

    Joined:
    Feb 6, 2002
    Messages:
    4,400
    Likes Received:
    440
    Location:
    San Francisco
    Nice. Everyone is overspeculating..Dave..give us all the stuff you learnt ASAP! please :)
     
  14. Demirug

    Veteran

    Joined:
    Dec 8, 2002
    Messages:
    1,326
    Likes Received:
    69
    Maybe this can help a little bit to understand it.
     
  15. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    11,716
    Likes Received:
    2,137
    Location:
    London
    Demirug, that's prolly made my day. :D

    Now I wonder how long it's going to take to decode...

    Jawed
     
  16. Unknown Soldier

    Veteran

    Joined:
    Jul 28, 2002
    Messages:
    4,047
    Likes Received:
    1,670
    /me smacks head .. guess I deserved that biatch :D

    ;)

    fanks man

    US
     
  17. Unknown Soldier

    Veteran

    Joined:
    Jul 28, 2002
    Messages:
    4,047
    Likes Received:
    1,670
    Fek(no not you fek :D) .. I can't see the images

    US
     
  18. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    11,716
    Likes Received:
    2,137
    Location:
    London
    You need to download the special TIFF applet. Instructions are on the USPTO site...

    Jawed
     
  19. RoOoBo

    Regular

    Joined:
    Jun 12, 2002
    Messages:
    308
    Likes Received:
    31
    I can't find the patent on the European patent site, where you can get them as PDFs rather than TIFFs.

    Fek, if an application is really using fragment shaders, hardly any stage below the shader stages is going to be fully used; in fact utilizations of 10% or less are more likely. Modern graphics applications are likely to be either shader limited or bandwidth limited. In the shader-limited case, prioritizing fragments creates stalls in the shader while prioritizing vertices doesn't, so the second is better.

    If the vertex and triangle queues are full, the (pre-shading) fragment queues will only be empty if the triangles don't generate any valid fragments, which means there can't be stalls from a lack of fragments to process, because there are no fragments to process. In fact, as the hierarchical Z test and the early Z and stencil stages sit before fragment shading, you want those queues to be as full as possible, and that can only happen if the vertex and triangle queues are kept filled.
     
  20. nAo

    nAo Nutella Nutellae
    Veteran

    Joined:
    Feb 6, 2002
    Messages:
    4,400
    Likes Received:
    440
    Location:
    San Francisco
    it will appear in a couple of weeks..
     