Anandtech - Inside the XBox 360 article

Discussion in 'Console Technology' started by therealskywolf, Nov 16, 2005.

  1. Bill

    Banned

    Joined:
    Oct 6, 2005
    Messages:
    388
    Likes Received:
    3
    Another thing that has come to my attention is this:

    http://www.beyond3d.com/reviews/ati/r520/index.php?p=03

    The ALU breakdown in an ATI GPU pipe.

    Note the first ALU has reduced functionality.

    X360 gets 48 of the big ALU.

    This could make it more powerful than its ALU count suggests. Each Xenos ALU might be, say, 66% of a traditional ATI pipe instead of 50%.

    In the ALU 1 and ALU 2 breakdowns on the linked page, X360 gets 48 ALU 2s.

    The implications are obvious. I've always figured two ALUs per pipe, so X360 is 24 pipes, but they have to be used for both functions, whereas Nvidia has 24 pixel plus 8 vertex pipes.

    But if they're say, 66% of a pipe, it's more like Xenos is 32 pipes..

    The only thing I can't reconcile is how they kept the transistor count so low if that's the case, but I'm working on that.
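
    A quick sketch of the arithmetic in this post, for anyone who wants to plug in their own guess. The 0.50 and 0.66 weights are just the guesses above, not measured figures, and `pipe_equivalents` is a made-up helper name.

```python
# Back-of-envelope only: the weights are guesses from the post, not measurements.
XENOS_ALUS = 48


def pipe_equivalents(alu_count: int, alu_weight: float) -> float:
    """Convert an ALU count into 'traditional pipe' equivalents."""
    return alu_count * alu_weight


for weight in (0.50, 0.66):
    equiv = pipe_equivalents(XENOS_ALUS, weight)
    print(f"ALU worth {weight:.2f} of a pipe -> about {equiv:.1f} pipes")
```

    At a weight of 0.50 this gives the 24 pipes mentioned above, and at 0.66 it lands just under 32.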
     
    #41 Bill, Nov 17, 2005
    Last edited by a moderator: Nov 17, 2005
  2. Arun

    Arun Unknown.
    Legend

    Joined:
    Aug 28, 2002
    Messages:
    5,023
    Likes Received:
    302
    Location:
    UK
    The simple reason why he says that R500's pipes are 0.5x R420 pipes is that the number of ALUs is different, and if I remember the R500 design properly (which I might not; it's been a while since I read B3D's article), that seems like a VERY rough guess based on the number of ALUs and not the actual kind of ALU or anything else.
    Furthermore, comparing the R500 to the R420 and not the R520 is most original, because from that POV it's the same damn thing. Also, no comparison with R520/R420 would properly show just how nice the dynamic branching & vertex texturing are on the Xenos.

    As for Vec3/Vec4: there definitely are Vec4 operations in Pixel Shaders, just not quite as many as in the Vertex Shader, where you very often work on x, y, z and w at the same time. Also, the Vertex Shader tends to have slightly more scalar (x, y, z, OR w) operations, relatively speaking, but the PS can also have a bunch; that depends completely on what kind of thing it does at any given time.
    I'm indeed not quite convinced by Vec4+Scalar for PS, but it can make things slightly faster than Vec3+Scalar. Also, Vec2+Vec2 is a horrible worst-case scenario for Vec4+Scalar, but that kind of operation is extraordinarily rare anyway, at least for the time being.


    Uttar
    EDIT: But remember about R520 vs R500 clocks on the same process: 625MHz vs 500MHz.
     
    #42 Arun, Nov 17, 2005
    Last edited by a moderator: Nov 17, 2005
  3. The GameMaster

    Newcomer

    Joined:
    Feb 9, 2005
    Messages:
    109
    Likes Received:
    1
    Yeap... by my observations, the ATI Radeon R4xx and R5xx series have two shader units PER pixel pipeline, but only ONE of those shader units is fully featured while the other one is a cut-down version (half unit?)... it is not known exactly what the differences between the two are, except that the second one is not fully featured. I would surmise (and I am not totally sure) that the cut-down shader unit is used for normalization, much like how nVidia's shader units have those various "MiniALUs" for similar purposes. Looking at the claimed shader operations per second for the two video cards (x850XT and the Geforce 6800 Ultra as a comparison) seems to confirm this, as the Geforce 6800 Ultra is claimed to perform 51.2 billion shader operations per second (at 400MHz) versus the ATI Radeon x850XT's claimed 43 billion shader operations per second (at 540MHz). So yes... only the second shader unit in the Radeon R42x/R52x really counts towards the shader performance in the Radeon cards. The Geforce cards (NV4x/NV5x) have two fully featured shader units, though the FIRST shader unit is coupled to a texture unit, and if a texture operation is done you lose that first shader unit, while the texture units in the Radeon cards are not coupled this way... so it kinda evens out between the two video cards.
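
    To put those claimed figures on an equal footing, here is a rough sketch that divides them by clock speed. The ops/sec and clock numbers are the ones quoted above; the 16 pixel pipelines per card is my assumption and is not stated in the post, and `ops_per_clock` is just an illustrative helper.

```python
# Claimed figures from the post: (shader ops per second, core clock in Hz).
CARDS = {
    "GeForce 6800 Ultra": (51.2e9, 400e6),
    "Radeon X850 XT": (43e9, 540e6),
}

PIXEL_PIPES = 16  # assumption: 16 pixel pipelines on each card


def ops_per_clock(ops_per_second: float, clock_hz: float) -> float:
    """Normalize a claimed ops/sec figure to operations per clock cycle."""
    return ops_per_second / clock_hz


for name, (ops, clock) in CARDS.items():
    per_clock = ops_per_clock(ops, clock)
    per_pipe = per_clock / PIXEL_PIPES
    print(f"{name}: {per_clock:.1f} ops/clock, {per_pipe:.2f} per pipe per clock")
```

    On these numbers the 6800 Ultra comes out at 128 claimed ops per clock and the X850 XT at roughly 80, which is the gap the post is pointing at. How each vendor counts an "operation" is not something this sketch can settle.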

    As for XENOS... each of the pipelines has fully featured shader units and they are designed to handle 5 components (Vec4+Scalar) instead of the 4 components (Vec3+scalar) that the Radeon cards handled in the pixel shader units. According to other articles those shader units also possess additional logic to handle the normalization.

    Snipped from Dave's article...
    So yes... I tend to believe that XENOS is more akin to a 32-36 pixel pipeline Radeon R520 (as that is the only other ATI GPU that relies on increased ALU utilization from heavy multithreading) rather than an R420, as the shader units in XENOS and the x1000 series GPUs achieve a greater utilization rate compared to the R420/NV4x/NV5x shader units.

    It is REALLY hard to say for sure without a more absolute comparison... in which case we will have to wait until ATI's and nVidia's next video cards at the end of NEXT year (R6xx/NV6x) before we can really do some comparisons.
     
  4. TurnDragoZeroV2G

    Regular

    Joined:
    Nov 14, 2005
    Messages:
    583
    Likes Received:
    23
    Location:
    Who knows...
    Well, they did move 20-25 million transistors of logic to the daughter die (and there's only half as many ROPs, which go for 4 samples but ditch programmable pattern for fixed 4-sample pattern, too, right?). And doesn't Xenos kinda lack a need for anything resembling Avivo, which would allow for some savings there?

    On the subject of Xenos' dynamic branching, I have a question there. I've lurked around long enough that I've read a fair majority (or maybe not, heh) of threads on Xenos, but I don't know if this has ever been addressed (simple yes or no would send me packing into search-wonderland... assuming this isn't a very large/huge misunderstanding on my part instead): From Dave's article:

    Xenos is supposed to work on batches of 8x8 pixels. But, it seems like it could actually batch up 2x2 pixels from any triangles in the scene, potentially, into the same batch. The r520 article's picture illustrating dynamic branching implies that all the pixels in the batch are adjacent to each other and that the smaller batch size is what allows it to achieve greater efficiency. But if Xenos is batching up pixel quads from multiple triangles, wouldn't this be the equivalent of making some batches out of a few 2x2 blocks in the shadow, some in the "grey," and then some in the full light? In which case, DB efficiency would essentially go down the toilet? Is my understanding of this batching totally off, or is there a lot more logic that goes into properly batching things together (or is it that all the pixels in the 64-pixel batch are adjacent to one another, even if they lie on multiple triangles... :???: )
     
  5. Bill

    Banned

    Joined:
    Oct 6, 2005
    Messages:
    388
    Likes Received:
    3
    I just don't see how they kept the transistor count on Xenos so light.

    It obviously has more ALUs than current ATI designs.

    From R520's 321m, yeah, you can knock maybe 30 off for the controller and extra ROPs..

    And yeah, maybe 20 for Avivo. No idea really.

    One thing is, R520 has a HUGE memory controller area. I don't see how Xenos can multi-thread as well with far fewer transistors, though.
     
    #45 Bill, Nov 17, 2005
    Last edited by a moderator: Nov 17, 2005
  6. The GameMaster

    Newcomer

    Joined:
    Feb 9, 2005
    Messages:
    109
    Likes Received:
    1
    For reference... the R520 pixel pipelines are identical to the pixel pipelines from the R420 in that each pixel pipeline has one fully featured shader unit and one "Half Unit"... except that the texture units were removed from the pixel pipelines. I do not know if they improved the shader units like nVidia did with the NV5x, as in allowing the shader units to handle 5 component operations (Vec4+scalar) instead of the 4 component (Vec3+scalar) that the R420 pixel shader units handled.

    Because XENOS does not have that massive memory controller (Ring Bus) that the R520 has (which, if I remember hearing from ATI correctly, is actually taking up the most space on the R520 currently)... the actual threading processor does not take that many transistors. I am still not clear on exactly how many threads each array can process at once as that number is under NDA. At least on the Radeon x850 each pixel quad (4 pixel pipelines) consumed roughly 20-25 million transistors, with equal or fewer than that on the R520 as the texture units were removed from the pixel pipelines, and roughly 30-35 million transistors per pixel quad (4 pixel pipelines) on the NV4x/NV5x (slightly more on the NV5x due to increased logic). The R520 is unusual in the sense that the shader core is actually in the minority in terms of transistor usage, but considering ATI will be reusing this controller for the next several GPU incarnations I can see why... this memory controller was designed for the future.

    On the subject of dynamic branching on XENOS... indications are that it is present, but I don't have any solid information on that.
     
    #46 The GameMaster, Nov 17, 2005
    Last edited by a moderator: Nov 17, 2005
  7. TurnDragoZeroV2G

    Regular

    Joined:
    Nov 14, 2005
    Messages:
    583
    Likes Received:
    23
    Location:
    Who knows...
    I have no reason to doubt it has dynamic branching (though perhaps there's the question of whether it has a branch execution unit like r520?). My question is more about: does the way Xenos works make it even less efficient than its batch size might indicate?

    But, while we're offtopic:

    Here (some of) Xenos' capabilities are thrown on the table

    Under the dynamic flow control depth, Xenos has listed "4 for loops/calls, 2^23 if nesting." What differentiates the two, and how exactly does this compare to the flat-out "24" of SM3.0? Is this a case of Xenos falling short of or exceeding SM3.0 spec?
     
  8. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    11,716
    Likes Received:
    2,137
    Location:
    London
    Yep. Even counting all that, Xenos seems to have a "low" transistor count. And not to forget it actually has 64 shader pipes (any 16 from the 64 are given up for redundancy to improve yield).

    I think the "simpler" shader ALU organisation of Xenos is prolly a big part of it. Not only does that cut out the ALU (which is supposedly just capable of ADD) but it also cuts out a heap of complex issue/decode circuitry that has to work out if the ADD can be dual-issued.

    Additionally, Xenos saves transistors over conventional PC GPUs by lumping batches into 16-wide phases. The batch size in R520 is 16, but in four phases - on each phase a quad of pixels is processed (all running the same instruction). On Xenos a batch size is 64 pixels, again in four phases. By making the phases wider, like this, you use fewer transistors on the instruction fetch/issue/decode block - since you now have one of these blocks for each 16 pipes. Whereas in PC GPUs, you have one of these blocks for each 4 pipes. So Xenos has the same number of these blocks as R520 does - but Xenos has four times as many pipes (ignoring R520's vertex pipes for a second). That's a big transistor saving over what you might expect.
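
    The block-count argument above can be sketched in a few lines. The pipe counts, phase widths and the four-phase batch are the figures from this post; `control_blocks` and `batch_size` are illustrative names, not anything from ATI documentation.

```python
PHASES_PER_BATCH = 4  # both chips run a batch in four phases, per the post


def control_blocks(pipes: int, phase_width: int) -> int:
    """One fetch/issue/decode block per group of `phase_width` pipes."""
    if pipes % phase_width != 0:
        raise ValueError("pipes must divide evenly into phase-wide groups")
    return pipes // phase_width


def batch_size(phase_width: int, phases: int = PHASES_PER_BATCH) -> int:
    """Pixels per batch: one phase-wide group processed per phase."""
    return phase_width * phases


# (name, shader pipes, phase width), using the numbers quoted in the post
for name, pipes, width in (("R520", 16, 4), ("Xenos", 64, 16)):
    print(
        f"{name}: {pipes} pipes, batch of {batch_size(width)}, "
        f"{control_blocks(pipes, width)} fetch/issue/decode blocks"
    )
```

    Both come out at four control blocks, with Xenos driving four times as many pipes from them.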

    Similarly, Xenos's texture pipes are treated as 16-wide, instead of four 4-wide quads. This means Xenos has one quarter of the texture-pipe decode logic (though I imagine that it's nowhere near as complex as the fetch/issue/decode logic required in the shader pipeline).

    Dave's article was written before it was known Xenos uses 64-sized batches.

    Yes, Xenos will suffer lower DB efficiency than R520. R520 is a curiosity in this respect, because it's expected that all future ATI GPUs will increase the ALU:texture op ratio. In R520 it's 1:1. In R580 it's 3:1. So in R580, the batch size becomes 48, instead of R520's 16. So R580 (like Xenos) suffers a shortfall in DB efficiency compared with R520.

    A batch is formed of pixels that all have the same shader state. A shader state is defined by the need to run the same shader program. As far as I can tell, in ATI hardware this means the pixels must all come from the same triangle.

    In vertex processing, the shader state effectively relates to vertex batches - all the vertices must be in the same batch. Since vertex batches are normally hundreds to tens of thousands (or more) in size, that's not a problem.

    But it's arguable that dynamic branching in vertex shader programs is going to suffer a lot from the inefficiency of running in batches of 64. On the other hand the tessellation (creation, destruction or shifting) of vertices that Xenos supports may make this moot.

    From the XFest documents I have, Xenos has extra tricks up its sleeve to do with dynamic branching. These are instructions that allow the dev to program the sequencer (the control block in Xenos that organises shader execution at the batch level). A simple example might be to jump over portions of code in the shader, or to loop over a portion of code - doing so for all 64 objects in the batch, if they all match the same condition (or, if any one matches a condition). It's all a bit hairy, to be honest - I haven't worked out how that would be used :oops: That's where real dev comments are going to be needed. Huge chunks of this document go right over my head.
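
    As a purely conceptual sketch of the "all match" / "any one matches" idea, and nothing to do with the real sequencer microcode (which I can't show), batch-level control flow might look something like this. Every name here is hypothetical.

```python
from typing import Callable, Sequence

BATCH_SIZE = 64  # objects per batch, per the post


def batch_all(values: Sequence[float], predicate: Callable[[float], bool]) -> bool:
    """True only if every object in the batch satisfies the condition."""
    return all(predicate(v) for v in values)


def batch_any(values: Sequence[float], predicate: Callable[[float], bool]) -> bool:
    """True if at least one object in the batch satisfies the condition."""
    return any(predicate(v) for v in values)


def run_batch(light: Sequence[float]) -> str:
    """Jump over the expensive section only when the whole batch agrees."""
    if batch_all(light, lambda v: v <= 0.0):
        return "skipped"  # every object is unlit, so the jump is safe
    return "executed"  # at least one object needs the code, so all 64 run it


print(run_batch([0.0] * BATCH_SIZE))
print(run_batch([0.0] * (BATCH_SIZE - 1) + [1.0]))
```

    The second call shows the catch: a single lit object out of 64 is enough to stop the batch from taking the jump.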

    Jawed
     
  9. TurnDragoZeroV2G

    Regular

    Joined:
    Nov 14, 2005
    Messages:
    583
    Likes Received:
    23
    Location:
    Who knows...
    Well, that was what I was basically curious about. If that part in Dave's article still holds true, then it would appear there's no guarantee that the batch will consist of adjacent pixels. Taking the shadow example again, if the pixels in the shadow, partial shade, etc. all execute the same shader (with dynamic branching), then they're all potential choices for batches. In which case, the batcher might take 6 2x2 quads that are mostly in the partial shade, and 10 2x2 quads completely in the complete shadow. And now they all have to execute both/all the branches anyway. And if enough batches are created in this manner, then it's completely inefficient. But, I'm probably just not thinking properly.
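
    A toy model of the worry above: if a batch has to run every branch that any of its quads takes, a mixed batch pays for all of them. The instruction counts per branch are invented for illustration, and the 6/10 split is the example from this post.

```python
# Hypothetical instruction counts for each branch of the shadow shader.
BRANCH_COST = {"shadow": 2, "partial": 20, "lit": 10}


def batch_cost(quad_branches: list[str]) -> int:
    """A batch executes every distinct branch taken by any quad in it."""
    return sum(BRANCH_COST[branch] for branch in set(quad_branches))


mixed = ["partial"] * 6 + ["shadow"] * 10  # the 6 + 10 quad example
coherent = ["shadow"] * 16  # 16 adjacent quads, all in shadow

print("mixed batch cost:   ", batch_cost(mixed))
print("coherent batch cost:", batch_cost(coherent))
```

    With these made-up costs the mixed batch runs 22 instructions for every quad, while the all-shadow batch runs 2.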

    Though, I suppose the ideal situation would have a branch detected, and then the pixel quads that take one branch would be rebatched with other pixels executing the same instruction/shader code. Meanwhile, the ones that take the opposing branch would be rebatched the same way. And the other 64 available threads would fill in the time while that data gets put into new batches/threads. But, that's just a round-about fix for having batches larger than 1 in the first place, so it wouldn't necessarily be needed later on down the line when somebody could put it in. Eh, just crazy-talk on my ignorant part, being bored and all at the moment. :???:
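
    The rebatching idea can be sketched as a regrouping step. This is only the thought experiment from this post, not something any of the hardware discussed is known to do; the branch costs and helper names are made up.

```python
from collections import defaultdict

QUADS_PER_BATCH = 16  # 16 quads of 2x2 pixels = 64 pixels
BRANCH_COST = {"shadow": 2, "lit": 10}  # hypothetical instruction counts


def chunk(items: list[str], size: int) -> list[list[str]]:
    """Split a list into consecutive batches of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def total_cost(batches: list[list[str]]) -> int:
    """Each batch pays for every distinct branch its quads take."""
    return sum(sum(BRANCH_COST[b] for b in set(batch)) for batch in batches)


def rebatch(quads: list[str]) -> list[list[str]]:
    """Regroup quads so that each new batch takes a single branch."""
    by_branch: dict[str, list[str]] = defaultdict(list)
    for branch in quads:
        by_branch[branch].append(branch)
    batches: list[list[str]] = []
    for group in by_branch.values():
        batches.extend(chunk(group, QUADS_PER_BATCH))
    return batches


quads = ["shadow", "lit"] * 16  # worst case: branches alternate quad by quad
print("as submitted:", total_cost(chunk(quads, QUADS_PER_BATCH)))
print("rebatched:   ", total_cost(rebatch(quads)))
```

    In this worst case the regrouped batches cost half as much, but the sketch ignores whatever it would cost to shuffle the data around.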

    Still curious about the flow control, though.
     
  10. AzBat

    AzBat Agent of the Bat
    Legend

    Joined:
    Apr 1, 2002
    Messages:
    7,749
    Likes Received:
    4,847
    Location:
    Alma, AR
    Anand released a new article with more details about the Xbox 360 motherboard...

    "Inside Microsoft's Xbox 360 - A Tour of the 360's Motherboard"
    http://anandtech.com/systems/showdoc.aspx?i=2611

    Also, here's another site that took apart the Xbox 360 and controller. Has pictures and discusses things like the video connector and the TSOP.

    "The Soft Life"
    http://softlife.blogspot.com/


    BTW,

    Anand mentions this...

    He later theorizes it might have something to do with Ethernet or an audio codec. Personally I believe it's the encoder from Microsoft's WebTV division. I thought all the Ethernet stuff was on the Southbridge?

    I also find it kind of funny that Microsoft would include another TSOP. If it wasn't for the TSOP on the original Xbox, we probably wouldn't have seen the Xbox mod community get so big so fast.

    Tommy McClain
     
  11. scooby_dooby

    Legend

    Joined:
    May 28, 2005
    Messages:
    8,563
    Likes Received:
    145
    Location:
    E-town, Alberta
    Funny indeed :twisted:
     
  12. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    11,716
    Likes Received:
    2,137
    Location:
    London
    That's what it amounts to, yes.

    But older GPUs with DB, i.e. NV40 and G70, are restricted to batch sizes in the hundreds and thousands of pixels!

    G70 has a batch size in the region of 800 pixels - though I can't explain how that batch size is derived.

    It's near-enough useless for any kind of shader that's attempting to perform per-pixel control flow.

    Jawed
     
  13. scificube

    Regular

    Joined:
    Feb 9, 2005
    Messages:
    836
    Likes Received:
    9
    It's not looking good for DB usage at all now :(
     
  14. TurnDragoZeroV2G

    Regular

    Joined:
    Nov 14, 2005
    Messages:
    583
    Likes Received:
    23
    Location:
    Who knows...
    If all the pixels in the batch are adjacent, (i.e., 8x8 square, rather than 16 2x2 squares from any triangles in the scene with the same state), then DB won't really be less efficient than what's already here in RV530 (which should be like R580, 4x12).

    Which is probably the case; I was just curious if the difference in working with triangles meant that Xenos had a more intelligent batching system that might change DB efficiency (I was looking at the extreme end, towards it being less efficient, but there's the opposite extreme as well).
     
  15. Luminescent

    Veteran

    Joined:
    Aug 4, 2002
    Messages:
    1,036
    Likes Received:
    0
    Location:
    Miami, Fl
    Would it be possible to conclude that such tricks position Xenos' DB performance closer to R520 than previously speculated? Could it be that R520 offers just as much granularity in its batch sequencer, but that access to its functionality will be more the responsibility of ATI's driver team as opposed to devs?
     
    #55 Luminescent, Nov 18, 2005
    Last edited by a moderator: Nov 18, 2005
  16. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    11,716
    Likes Received:
    2,137
    Location:
    London
    I dunno, really. I haven't found any examples, and I'm not sure what kind of meaning you can attach to a random selection of 64 pixels.

    R520 should be able to perform DB at the vertex level in vertex shader programs - something that Xenos can't do (since it operates on 64 vertices).

    Sequencer programming is something you have to do in microcode. Seems like the kind of thing that's gonna take a while to get to grips with.

    Jawed
     
  17. Luminescent

    Veteran

    Joined:
    Aug 4, 2002
    Messages:
    1,036
    Likes Received:
    0
    Location:
    Miami, Fl
    From what I heard, R520's vertex shader units still operate under a single SIMD issuing node, so I'm guessing dynamic branching is still an issue if only a small portion of vertices in a batch require DB. There may be branch prediction logic, though.
     
  18. Bill

    Banned

    Joined:
    Oct 6, 2005
    Messages:
    388
    Likes Received:
    3
    While I'd love to believe, and enjoy reading, Jawed's propaganda..

    I have issues with, if ATI could pack 64 pipes into 230m transistors, why they didn't do it in desktops.

    In fact, it's a ludicrous number, considering they have 16 in a 321m transistor R520.
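
    For what it's worth, here is the crude ratio behind that "ludicrous" reaction. The transistor and pipe counts are the ones quoted in this thread; dividing whole-chip transistor counts by pipe count lumps in the memory controller, ROPs and everything else, so treat it as a rough illustration only.

```python
# (total transistors in millions, shader pipes), figures as quoted in the thread
CHIPS = {
    "Xenos (64 pipes, as claimed above)": (230, 64),
    "R520": (321, 16),
}


def transistors_per_pipe(total_millions: float, pipes: int) -> float:
    """Whole-chip transistor count divided by pipe count (very crude)."""
    return total_millions / pipes


for name, (total, pipes) in CHIPS.items():
    per_pipe = transistors_per_pipe(total, pipes)
    print(f"{name}: about {per_pipe:.1f}m transistors per pipe")
```

    That works out to roughly 3.6m per pipe against roughly 20m per pipe.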

    Or for example, if the second ALU provides few benefits at great transistor cost, again, why is it in ATI desktop parts?
    Doesn't make any sense..
    And all this batch stuff is naturally over my head..

    At this point, I'd much rather MS had waited until spring to launch, and used R580. That likely would have assured total superiority over PS3 if what we hear of R580 is true (48 pipes).

    Or really, if they had just scrapped the EDRAM, they would have been fine. They used nearly 340 million transistors, which is a very large chip.

    Or I think, they could have gone with dual Xenos, and made up the cost any number of ways, for that matter. If analysts speculate each GPU costs $100, they could have raised the price $100, and still been at a respectable 399/499. After all, Blu Ray alone will likely cost Sony $100 initially. So they might have been at cost parity with Sony at worst with Dual Xenos. Not to mention Cell is much more expensive than its Xbox counterpart. Hell, without the EDRAM, Xenos is a small chip anyway. They could have included two, and with the way they are doing this, this go around, they likely still would have been hugely better off costwise than they were with the first Xbox. After all, 65nm is right around the corner.

    It remains to be seen, but I think MS botched the X360..and I think they had victory in all phases in their hands, too..
     
    #58 Bill, Nov 19, 2005
    Last edited by a moderator: Nov 19, 2005
  19. TurnDragoZeroV2G

    Regular

    Joined:
    Nov 14, 2005
    Messages:
    583
    Likes Received:
    23
    Location:
    Who knows...
    Perhaps this is the answer:
    http://www.beyond3d.com/reviews/ati/r520/index.php?p=03
     
  20. Bill

    Banned

    Joined:
    Oct 6, 2005
    Messages:
    388
    Likes Received:
    3
    That's the explanation I think of, too.

    But it makes one wonder why ATI wouldn't have been smart enough to build a better pipeline to start with.

    Or why they wouldn't rewrite the compiler, if it would save them massive silicon costs.

    Anyway, do we assume MS wrote the compiler for Xenos? And therefore the cost was on them?
     