ATI Xenos: XBOX 360 Graphics Demystified

Discussion in 'Beyond3D Articles' started by Dave Baumann, Jun 12, 2005.

  1. Simon F

    Simon F Tea maker
    Moderator Veteran

    Joined:
    Feb 8, 2002
    Location:
    In the Island of Sodor, where the steam trains lie
    That approach could be very expensive if the application was doing a lot of instancing or dynamically manufacturing extra sets of texture coordinates.
    Furthermore, it might be pointless saving geometry that ended up being obscured by later geometry.
     
  2. richardpfeil

    Newcomer

    Joined:
    Jun 22, 2005
    Simon: You are absolutely right that instancing would be expensive. Procedurally created geometry and terrain height-field mapping would also be a problem. It's definitely pointless to store obscured geometry, but a front-to-back sort should be able to cull at least 60% of the polys that are occluded in the first pass (I bet 90% is obtainable). Modern GPUs already do a good job at this.
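A minimal sketch of why front-to-back ordering helps, assuming a single pixel covered by several opaque surfaces and a plain per-pixel "less than" depth test that rejects a fragment before it is shaded. This is not a model of Xenos's hierarchical Z; the function name `shaded_fragments` and the example depths are hypothetical.

```python
def shaded_fragments(depths):
    """Count fragments shaded at one pixel when drawn in the given order.

    Assumes an early depth test with a LESS comparison: a fragment is shaded
    only if it is nearer than everything drawn at this pixel so far.
    """
    nearest = float("inf")  # depth buffer cleared to "far"
    shaded = 0
    for d in depths:
        if d < nearest:   # passes the depth test, so it gets shaded
            nearest = d
            shaded += 1
        # otherwise it is occluded and rejected before shading
    return shaded


layers = [0.9, 0.2, 0.5, 0.7, 0.1]  # example depths, smaller = nearer
print(shaded_fragments(sorted(layers, reverse=True)))  # back to front: 5
print(shaded_fragments(layers))                         # arbitrary order: 3
print(shaded_fragments(sorted(layers)))                 # front to back: 1
```

Real engines sort per object or per draw rather than per pixel, so they only approach the one-fragment ideal, which is why the post hedges with "at least 60%" rather than claiming every occluded polygon is culled.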

    Rockster: Until vertex shading is completed, the position of the vertex in screen space is unknown. I don't see how the driver (a bit of a misnomer for a console) can do the tagging. That used to work on PCs before hardware T&L: all vertex setup happened on the CPU, and the driver had access to the vertices' positions in screen space. Tiling solutions died once the driver gave up control of the vertices. The question is, how did ATI resurrect it?
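To illustrate the point about screen-space positions: binning geometry into screen tiles needs post-transform coordinates, which only exist after vertex shading. A minimal sketch, assuming triangles already in pixel coordinates and a conservative bounding-box test; the function name, the tile sizes and the two-band split of a 1280x720 target are illustrative assumptions, not how ATI's hardware or driver actually does it.

```python
def tiles_touched(tri, tile_w, tile_h, screen_w, screen_h):
    """Return (tx, ty) indices of every tile overlapped by the triangle's
    screen-space bounding box (conservative: may include tiles the triangle
    itself misses). `tri` holds three post-transform (x, y) pixel positions."""
    xs = [x for x, _ in tri]
    ys = [y for _, y in tri]
    # Clamp the bounding box to the render target.
    x0, x1 = max(0, min(xs)), min(screen_w - 1, max(xs))
    y0, y1 = max(0, min(ys)), min(screen_h - 1, max(ys))
    if x0 > x1 or y0 > y1:
        return []  # entirely off-screen, so it lands in no tile
    tx0, tx1 = int(x0 // tile_w), int(x1 // tile_w)
    ty0, ty1 = int(y0 // tile_h), int(y1 // tile_h)
    return [(tx, ty) for ty in range(ty0, ty1 + 1) for tx in range(tx0, tx1 + 1)]


# Hypothetical split: a 1280x720 target cut into two horizontal bands.
print(tiles_touched([(100, 300), (200, 400), (150, 350)], 1280, 360, 1280, 720))
# [(0, 0), (0, 1)]  (straddles both bands)
print(tiles_touched([(10, 10), (50, 10), (30, 40)], 1280, 360, 1280, 720))
# [(0, 0)]
```

The bounding-box test is the simplest conservative choice; an exact triangle/tile overlap test would bin fewer triangles at the cost of more arithmetic.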
     
  3. tEd

    tEd Casual Member
    Veteran

    Joined:
    Feb 6, 2002
    Location:
    switzerland
    @wavey

    Are there any improvements in the hierarchical Z featured in Xenos over previous implementations? Does it still stop working when certain Z/stencil operations are used, or has this been improved?
     
  4. IgnorancePersonified

    Regular

    Joined:
    Apr 12, 2004
    Messages:
    778
    Likes Received:
    18
    Location:
    Sunny Canberra
    Yeah, that's sort of what I thought, arrrse. Big changes afoot!

    Maybe ATI could integrate a CPU into their northbridge here and be done with it :lol:
     
  5. vblh

    Newcomer

    Joined:
    Apr 14, 2005
    I posted this question in the console forum but, as many of you know, it is now locked. I hope that someone at least finds the time to answer, as I've read Dave's article on Xenos many times but there are still things that I don't quite understand, being a newbie and all.

    Since Xenos has 48 ALU pipes grouped 16 x 16 x 16, and they can do both pixel and vertex shading, I was wondering if ATI designed Xenos in such a way that a programmer would have the option to do the following:
    A: Use each group of 16 to do a frame each.
    B: Use two of the groups of 16 to do a frame and let the third do any additional effects.
    C: Let all 48 do a frame at a time.

    Secondly, if Xenos can in fact do either A or B, would you be able to make an educated guess as to how efficient the unified shader architecture really is?

    And lastly, how did ATI decide that 48 ALU pipes would be fine for this particular application?
     
  6. Mariner

    Veteran

    Joined:
    Feb 6, 2002
    IIRC, for each clock the 48 ALUs can do either Pixel or Vertex processing. They can't be split to do both Pixel and Vertex at the same time.

    At least, I'm pretty sure that's the information we have been provided with.
     
  7. Dave Baumann

    Dave Baumann Gamerscore Wh...
    Moderator Legend

    Joined:
    Jan 29, 2002
    Location:
    O Canada!
    No, they are 3 MIMD engines, so at any one point in time they will be processing three entirely separate threads, each of which can be of any program type. Each engine contains 16 SIMD processors, which execute the same instruction on data from a single thread.
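A toy model of the arrangement described above, written only to show the scheduling idea: three engines each pick up an independent thread of any type (MIMD), and within an engine one instruction is applied to all 16 lanes at once (SIMD). The `Thread` class and `run` scheduler are hypothetical; the real hardware issues each instruction over several cycles for larger batches and interleaves threads to hide latency, none of which is modelled here.

```python
from dataclasses import dataclass
from typing import Callable, List

LANES = 16   # SIMD processors per engine, per the post
ENGINES = 3  # independent MIMD engines, per the post


@dataclass
class Thread:
    kind: str                                # "vertex" or "pixel"
    program: List[Callable[[float], float]]  # one callable per instruction (non-empty)
    data: List[float]                        # one value per lane
    pc: int = 0                              # index of the next instruction


def run(threads):
    """Each step, every engine with work runs its own thread's next
    instruction on all of that thread's lanes. Returns, per step, the
    (engine, thread kind) pairs that executed."""
    pending = list(threads)
    active = [None] * ENGINES
    trace = []
    while pending or any(t is not None for t in active):
        # Idle engines grab the next waiting thread, whatever its type.
        for e in range(ENGINES):
            if active[e] is None and pending:
                active[e] = pending.pop(0)
        step = []
        for e, t in enumerate(active):
            if t is None:
                continue
            op = t.program[t.pc]
            t.data = [op(x) for x in t.data]  # same instruction, every lane
            t.pc += 1
            step.append((e, t.kind))
            if t.pc == len(t.program):
                active[e] = None  # thread finished; engine is free again
        trace.append(step)
    return trace


vs = Thread("vertex", [lambda x: x + 1, lambda x: x * 2], [float(i) for i in range(LANES)])
ps1 = Thread("pixel", [lambda x: -x], [1.0] * LANES)
ps2 = Thread("pixel", [lambda x: x * 0.5] * 3, [8.0] * LANES)
for step in run([vs, ps1, ps2]):
    print(step)
# [(0, 'vertex'), (1, 'pixel'), (2, 'pixel')]
# [(0, 'vertex'), (2, 'pixel')]
# [(2, 'pixel')]
```

The first step shows the point Mariner's reading rules out: in this model, one engine runs vertex work while the other two run pixel work in the same step.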
     
  8. vblh

    Newcomer

    Joined:
    Apr 14, 2005
    Messages:
    22
    Likes Received:
    2
    Thanks Dave. :D So since what I asked is possible, which of my suggested options would you prefer?
     
  9. nelg

    Veteran

    Joined:
    Jan 26, 2003
    Location:
    Toronto
    Dave,

    In your article you mentioned that final clocks were undecided. Seeing that the R520 in the X1800 XL clocks at 500MHz and the latest revision had an instant 160MHz gain, it seems that the C1 may have a lot of headroom. Has there been any confirmation on the final clocks? And since I dug up this old thread, let me ask:

    How would C1 scale?

    What is the relationship, in number of transistors, between the portions of the chip devoted to scheduling and control logic and those devoted to the ALU arrays and texture units, and how do they scale? Does this give it an advantage over non-unified (non-USA) designs, in that the transistor count will not increase as rapidly?
     
  10. pakpassion

    Banned

    Joined:
    Oct 7, 2005
    Hey Dave:

    http://techon.nikkeibp.co.jp/article/NEWS/20051005/109392/20051005protecfig1.jpg

    NEC posted the above picture. Is this right? They are saying that the eDRAM-to-GPU connection is not 32 GB/s, as in the article, but 22.4 GB/s.

    Here is the article accompanying the picture:

    NEC Electronics exhibited the SiP (System in Package) developed for Microsoft's Xbox 360 next-generation home game console at PROTEC JAPAN 2005, held at Makuhari Messe from October 5, 2005.

    It contains a DRAM-embedded LSI made by NEC Electronics and a graphics LSI made by TSMC in Taiwan in one package. The bare LSI chips are laid side by side in the package; specifically, the graphics LSI and the DRAM-embedded LSI are flip-chip connected to interposers in the package. NEC Electronics adopted flip chip to improve the data transfer speed between the chips: in the Xbox 360, the maximum data transfer speed required between the DRAM-embedded LSI and the graphics LSI is rather high, at 22.4GB/sec. According to NEC Electronics, to achieve this transfer speed they passed up wire bonding, which is popular for in-package wiring, because it results in too much wiring inductance. The work of assembling them into the SiP is done by Microsoft.
     
  11. DarkRage

    Newcomer

    Joined:
    Jul 25, 2005
    Location:
    Spain
    OK, probably a stupid question, and probably nobody is going to read it as this thread is almost dead, but let's try it before creating a whole new thread:

    As Xenos works with groups of 16 ALUs executing the same instruction of the same shader, what happens when we have a small triangle?
    If we have a triangle with just 3 pixels, are 13 ALUs stalling?
    If we have many small triangles, can the ALUs work on all of them? For example, if we have 20 triangles with the same shader, each with 3 pixels (60 pixels in total), could Xenos assign 16 ALUs (one bank) to the first 16 pixels even if they are not next to each other? Or would we have 3 ALUs working and 13 ALUs waiting?

    So, basically, do those banks of 16 ALUs work like a typical quad? That would be a massive waste of resources IMO.

    Sorry, I've got myself in a muddle with this.
     
  12. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    11,714
    Likes Received:
    2,135
    Location:
    London
    On the face of it, the wastage from single small triangles appears to be what's happening.

    It's worse because a thread (batch) is actually 4 phases of 16 = 64 pixels in size.

    This kind of thing appears to be a common limitation with all GPUs - other GPUs work with even larger groups of pixels (256, 1000, 4000 - roughly speaking).

    NVidia GPUs appear to support 20 triangles:

    http://www.beyond3d.com/forum/showthread.php?p=597388#post597388

    which ameliorates the problem, somewhat - although they're dealing with pixel counts in the thousands (NV40 and older, though G70 is more like 800-1000 apparently).

    Jawed
     
  13. Dave Baumann

    Dave Baumann Gamerscore Wh...
    Moderator Legend

    Joined:
    Jan 29, 2002
    Location:
    O Canada!
    Geometry and pixel data are batched according to the same state. If you have lots of small objects from different commands then you'd get wastage; if a command generates larger batches then it's minimised.
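A back-of-the-envelope sketch of the effect described above, under a deliberately simple model: fragments from the same draw command (same state) are packed together into batches of 64 (the 4 phases of 16 mentioned earlier in the thread), and a new command always starts a new batch. The function `pack` and the example inputs are illustrative, reusing DarkRage's twenty 3-pixel triangles.

```python
import math

BATCH = 64  # fragments per batch (4 phases x 16 lanes), per the thread


def pack(draws):
    """draws: list of draw commands, each a list of per-triangle fragment counts.

    Toy model: fragments from one draw (same state) share batches; a new draw
    always starts a new batch. Returns (batches, fraction of slots used).
    """
    total = sum(sum(d) for d in draws)
    batches = sum(math.ceil(sum(d) / BATCH) for d in draws)
    return batches, (total / (batches * BATCH) if batches else 0.0)


one_draw = [[3] * 20]                   # 20 three-pixel triangles in one command
many_draws = [[3] for _ in range(20)]   # the same triangles, one command each
print(pack(one_draw))    # (1, 0.9375): 60 of 64 slots used
print(pack(many_draws))  # (20, 0.046875): 60 of 1280 slots used
```

Under this model the same 60 pixels cost one batch or twenty depending only on how they are submitted, which is the distinction drawn between small objects from different commands and larger batches from one command.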
     
  14. Rys

    Rys Graphics @ AMD
    Moderator Veteran Alpha

    Joined:
    Oct 9, 2003
    Location:
    Beyond3D HQ
    I have wondered in recent weeks about batches of fewer than 64 fragments on Xenos, but I was too embarrassed to ask Wavey :lol:

    I didn't understand what happened when you got batch sizes that small and whether you'd just waste execution units.

    I wondered whether, if there's only a single quad of fragments to process (or at least fewer than 64), the hardware would process them anyway or buffer them until more fragments arrived.

    When batches get that tiny, you just take the efficiency hit, because it will rarely happen that way in real-world usage when using Xenos to draw games. So now I know.
     
  15. 3dcgi

    Veteran Subscriber

    Joined:
    Feb 7, 2002
    There's CPU overhead and GPU register programming overhead involved with changing state so you'd probably never notice the inefficiency of the ALUs for batches that are super small.
     
  16. DarkRage

    Newcomer

    Joined:
    Jul 25, 2005
    Location:
    Spain
    Thanks all for your answers.

    So, compared to G70 and R520 from a theoretical point of view, can we expect better efficiency in Xenos for small triangles?
     
  17. RobertR1

    RobertR1 Pro
    Legend

    Joined:
    Nov 2, 2005
    Messages:
    5,852
    Likes Received:
    1,297
    Dave going by this line in the conclusion:

    "given that most of the first generation titles will not have been developed on the final hardware."

    Does that indicate that the current titles are nothing more than PC ports, developed on PC hardware with the DX9 API and then optimized to run on the Xbox 360? If so, which titles are you aware of that will be developed directly on the final hardware and optimized well, thus demonstrating some of the true capabilities of the 360?

    I personally do not have a good idea of exactly what I am looking at when playing, so I was hoping you could shed some light on this issue. Am I just seeing a PC game on the Xbox, or am I seeing a true Xbox 360 game when I'm playing PGR3 and Madden 06?
     
  18. pc999

    Veteran

    Joined:
    Mar 13, 2004
    Location:
    Portugal
    Just one question: does anyone have any more info on the tessellator function? The only info I can find is in this PDF.

    Does anyone know how much performance it costs? Can it be used as a primary rendering mode and for LOD, always giving the ideal minimum/maximum detail based on the distance from the camera? Is it hard to implement?

    Anyway, any info is good. :smile:
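One common way to get the distance-based LOD the question describes is to compute a tessellation factor per patch from its distance to the camera. A minimal sketch, assuming a linear falloff between two hypothetical distances `near` and `far`; `max_factor` is a placeholder, not Xenos's actual tessellation limit, and the sketch says nothing about the hardware's performance cost.

```python
def tess_factor(distance, near, far, max_factor, min_factor=1.0):
    """Linear LOD: max_factor at or inside `near`, min_factor at or beyond
    `far`, linearly interpolated in between."""
    if far <= near:
        raise ValueError("far must be greater than near")
    t = (distance - near) / (far - near)
    t = min(max(t, 0.0), 1.0)  # clamp to [0, 1]
    return max_factor + (min_factor - max_factor) * t


for d in (5.0, 60.0, 200.0):
    print(d, tess_factor(d, near=10.0, far=110.0, max_factor=15.0))
# 5.0 15.0
# 60.0 8.0
# 200.0 1.0
```

Adjacent patches at different distances can end up with different factors, which in practice needs care along shared edges to avoid cracks; that is a general tessellation concern rather than anything specific to Xenos.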
     
  19. thax

    Newcomer

    Joined:
    Feb 1, 2006
    Does anyone have a final transistor count for the NEC daughter die? I am looking for a breakdown of transistors devoted to logic and to the eDRAM.

    I assume that 83.88M transistors are devoted to the eDRAM. I have read in different press releases that the total transistor count is either 90M, 100M, just over 100M, 105M or 150M.

    The last update by Dave Baumann indicates a figure of 105M; however, it was made to the main page and has since been overwritten, and the main article was not updated:
    Google Cache

    The most popular figure right now is 90M transistors, but that would mean the chip has only about 6M transistors devoted to logic.
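The arithmetic behind these figures: the 83.88M number corresponds to 10 MB of eDRAM at one transistor per bit (10 x 1024 x 1024 x 8 = 83,886,080 bits). A minimal sketch, assuming, as the post does, exactly one transistor per bit and ignoring sense amplifiers, decoders and redundancy, so the "logic" figures are rough upper bounds on what the quoted totals leave over; the names `EDRAM_BITS` and `logic_budget` are hypothetical.

```python
EDRAM_BYTES = 10 * 1024 * 1024  # 10 MB of eDRAM on the daughter die
EDRAM_BITS = EDRAM_BYTES * 8    # 83,886,080 bits, one transistor each (assumed)


def logic_budget(total_transistors):
    """Transistors left for logic if every non-storage transistor were logic."""
    return total_transistors - EDRAM_BITS


for total in (90_000_000, 100_000_000, 105_000_000, 150_000_000):
    print(f"{total // 1_000_000}M total -> {logic_budget(total) / 1e6:.2f}M left for logic")
# 90M total -> 6.11M left for logic
# 100M total -> 16.11M left for logic
# 105M total -> 21.11M left for logic
# 150M total -> 66.11M left for logic
```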
     
  20. Dave Baumann

    Dave Baumann Gamerscore Wh...
    Moderator Legend

    Joined:
    Jan 29, 2002
    Messages:
    14,090
    Likes Received:
    694
    Location:
    O Canada!
    105M was the latest from ATI's engineering.
     

    Beyond3D is proudly published by GPU Tools Ltd.