ATI Xenos: XBOX 360 Graphics Demystified

Discussion in 'Beyond3D Articles' started by Dave Baumann, Jun 12, 2005.

  1. TheAlSpark

    TheAlSpark Moderator
    Moderator Legend

    Joined:
    Feb 29, 2004
    Messages:
    22,146
    Likes Received:
    8,533
    Location:
    ಠ_ಠ
    What is the maximum theoretical pixel fillrate of the X850XT at 4xMSAA? Just divide by 4?
     
  2. Rockster

    Regular

    Joined:
    Nov 5, 2003
    Messages:
    973
    Likes Received:
    129
    Location:
    On my rock
    Take the total bandwidth and divide it by 8 bytes per pixel (color + z). So max theoretical is 37.8GB/sec / 8 or approximately 4.7Gpixels/sec.
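
    The rule of thumb above can be written out as a small sketch. The function name is made up for illustration; the 37.8 GB/s and 8 bytes per pixel (4 bytes color + 4 bytes Z) are the figures from the post.

```python
def bandwidth_limited_fillrate(bandwidth_gb_s: float, bytes_per_pixel: float) -> float:
    """Peak fill rate in Gpixels/s that a given memory bandwidth can sustain."""
    return bandwidth_gb_s / bytes_per_pixel


# X850XT figures quoted in the post: 37.8 GB/s, 8 bytes per pixel (color + Z).
x850xt = bandwidth_limited_fillrate(37.8, 8)
print(f"{x850xt:.3f} Gpixels/s")
```

    This ignores compression and any traffic other than color and Z writes, so it is an upper bound on the bandwidth side only.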
     
  3. Reverend

    Banned

    Joined:
    Jan 31, 2002
    Messages:
    3,266
    Likes Received:
    24
    Rather simplistic question: can we know if some XB360 games will have an "Anti-Aliasing" checkbox in one of the menus (with "2x" and "4x" options at that), or will some fillrate-friendly games already have AA hardcoded?

    Also, if I develop a game now that I want to make available for both the XB360 and the PS3, given the differences between the two, how hard would it be, and would there really need to be two separate development teams? I'll probably shoot this question off to a few developers (throwing in the obvious technical differences between the two consoles) some time later, but comments from you guys are welcome.

    Finally, does anyone know how much MS paid ATI?
     
  4. nAo

    nAo Nutella Nutellae
    Veteran

    Joined:
    Feb 6, 2002
    Messages:
    4,400
    Likes Received:
    440
    Location:
    San Francisco
    4X MSAA is virtually free on Xenos so it wouldn't make much sense to deactivate it for the main render target.
    MS might not even let developers deactivate it, making it a technical requirement for publishing a game.
    A game that doesn't really push the envelope wouldn't be too hard for a single team to develop, IMHO.
     
  5. jvd

    jvd
    Banned

    Joined:
    Feb 13, 2002
    Messages:
    12,724
    Likes Received:
    9
    Location:
    new jersey
    I would think MS would require 2x FSAA for a game to be released.
     
  6. pjbliverpool

    pjbliverpool B3D Scallywag
    Legend

    Joined:
    May 8, 2005
    Messages:
    9,237
    Likes Received:
    4,260
    Location:
    Guess...
    So are we saying that X360 ROPs can handle 4 samples per pixel, and that's why 4x FSAA is free in terms of fill rate? And of course it's free in terms of memory bandwidth because of the eDRAM?

    So why can't RSX support 4 samples per pixel without a fill rate hit as well? Surely this is something that would be very important on a PC-like card which focuses on things like FSAA?

    Or is it because it has more ROPs and hence doesn't need them to handle as many samples?
     
  7. Rockster

    Regular

    Joined:
    Nov 5, 2003
    Messages:
    973
    Likes Received:
    129
    Location:
    On my rock
    It could, but that would be useless since the memory bandwidth could not support it. ROP count is largely irrelevant: even though RSX may have twice as many, the eDRAM will likely ensure that Xenos can sustain the greater fill rate. The question would be whether, if the PS3 wanted to target 4xAA, they could save die space by using 8 ROPs with single-cycle 4xAA vs. 16 ROPs with 2xAA.
     
  8. nAo

    nAo Nutella Nutellae
    Veteran

    Joined:
    Feb 6, 2002
    Messages:
    4,400
    Likes Received:
    440
    Location:
    San Francisco
    It takes multiple cycles for an NV40 ROP to handle 4x MSAA (2 cycles AFAIK), whilst Xenos ROPs can handle 4x MSAA in one cycle.
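
    A rough sketch of what the cycle counts mean for peak ROP throughput. The helper name is invented here. The 8-ROP and 500 MHz Xenos figures are assumptions consistent with numbers quoted elsewhere in this thread, and the 16-ROP part at the same clock is purely hypothetical, chosen to mirror the "8 ROPs single cycle vs. 16 ROPs" comparison raised earlier.

```python
def rop_pixel_rate(rops: int, clock_mhz: float, cycles_per_pixel: int) -> float:
    """Peak ROP output in Gpixels/s when each ROP needs cycles_per_pixel clocks."""
    return rops * clock_mhz / cycles_per_pixel / 1000.0


# Assumed Xenos-like part: 8 ROPs, 500 MHz, 4x MSAA in one cycle.
single_cycle = rop_pixel_rate(8, 500, 1)

# Hypothetical part with twice the ROPs at the same clock, needing 2 cycles for 4x MSAA.
two_cycle = rop_pixel_rate(16, 500, 2)

print(single_cycle, two_cycle)
```

    Under these assumptions the two designs reach the same 4x MSAA pixel rate, which is why the ROP count alone says little.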
     
  9. Xmas

    Xmas Porous
    Veteran Subscriber

    Joined:
    Feb 6, 2002
    Messages:
    3,344
    Likes Received:
    176
    Location:
    On the path to wisdom
    Z compression is better than that. The lowest you get is 5 bytes per pixel for X800 AFAIK. And you don't always need to write Z.
     
  10. Monty

    Regular

    Joined:
    Aug 9, 2004
    Messages:
    259
    Likes Received:
    2
    Location:
    UK
    I tried using this equation for the framebuffer usage but I can't seem to get the same answers that are in the article, e.g.

    640x480 = 307200 pixels
    307200 pixels * (32+32) = 19660800 bits
    19660800 / 8 = 2457600 - bits to bytes
    2457600 / 1000000 = 2.4576MB - bytes to megabytes

    edit - lol, doesn't matter, gotta divide by 1024 not 1000 - heat is getting to me over here.
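
    The same arithmetic as a short sketch, using 1024-based megabytes as the edit suggests. The function name and the samples parameter (for multisampled buffers) are assumptions added for illustration, not something quoted from the article.

```python
def framebuffer_mib(width: int, height: int, color_bits: int = 32,
                    z_bits: int = 32, samples: int = 1) -> float:
    """Framebuffer size in MiB: pixels * samples * (color + Z bits), converted to bytes."""
    bits = width * height * samples * (color_bits + z_bits)
    return bits / 8 / (1024 * 1024)


print(framebuffer_mib(640, 480))              # no AA
print(framebuffer_mib(1280, 720, samples=4))  # 720p with 4 samples per pixel
```

    Dividing by 1,000,000 instead gives the 2.4576 figure from the post, which is where the mismatch came from.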
     
  11. Pete

    Pete Moderate Nuisance
    Moderator Legend

    Joined:
    Feb 7, 2002
    Messages:
    5,777
    Likes Received:
    1,814
    Is it cheaper to slap on some eDRAM (assuming it's both higher yield and easier to move to smaller processes) than to spend the money on architecting and fabricating a more complex GPU (Xbox 360, PS3)? I'm thinking in terms of the PS2's GS and its 4MB eDRAM, which was likened to a Voodoo 2 on steroids and still produces decent visuals (considering its memory limitations vs. Xbox).
     
  12. Rockster

    Regular

    Joined:
    Nov 5, 2003
    Messages:
    973
    Likes Received:
    129
    Location:
    On my rock
    I haven't seen a benchmark that bears out better than 6 bytes per pixel, but the point I was trying to make is that it's a bandwidth limit, not a ROP limit. People tend to get caught up with the number of ROPs and forget the important part. Didn't want to see any more posts with RSX listed at 8.8GP/sec when the max theoretical is somewhere between 3 and 4.
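
    A sketch of how the two RSX numbers can be reconciled. The 16 ROPs, 550 MHz clock and 22.4 GB/s local memory bandwidth are the publicly announced figures of the time and are assumptions here, since the thread itself does not state them. The bytes-per-pixel values are the ones debated above.

```python
ROPS = 16               # assumed RSX ROP count
CLOCK_HZ = 550e6        # assumed RSX core clock
BANDWIDTH_GB_S = 22.4   # assumed RSX local memory bandwidth

# Peak rate if the ROPs were the only limit.
rop_peak_gpix = ROPS * CLOCK_HZ / 1e9


def bandwidth_limit_gpix(bandwidth_gb_s: float, bytes_per_pixel: float) -> float:
    """Fill rate in Gpixels/s that the memory bandwidth alone can sustain."""
    return bandwidth_gb_s / bytes_per_pixel


print(f"ROP peak: {rop_peak_gpix:.1f} Gpixels/s")
for bpp in (5, 6, 8):
    print(f"{bpp} bytes/pixel: {bandwidth_limit_gpix(BANDWIDTH_GB_S, bpp):.2f} Gpixels/s")
```

    With 6 bytes per pixel the bandwidth limit lands between 3 and 4 Gpixels/s, well under the ROP peak.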
     
  13. Xmas

    Xmas Porous
    Veteran Subscriber

    Joined:
    Feb 6, 2002
    Messages:
    3,344
    Likes Received:
    176
    Location:
    On the path to wisdom
    ATI claims up to 24:1 compression for Z (with 6xMSAA, I guess).
    With RSX, we still don't know about its ROP architecture. Because of the two different interfaces, they could well have made some changes there. But peak ROP performance indeed hardly matters for anything but Z-only passes.
     
  14. Megadrive1988

    Veteran

    Joined:
    May 30, 2002
    Messages:
    4,723
    Likes Received:
    242
    observation ~ question:

    Gamecube's Flipper GPU can do single-cycle trilinear filtering, right?
    648M pixels/sec - that is with trilinear on, at least some form of it.

    but Xenos can only do single-cycle bilinear filtering, correct? it would take another cycle to do trilinear filtering w/ loopback. it can still do trilinear in a single pass, but not a single cycle.

    even though there is a large difference in fillrate between Gamecube-Flipper and Xbox 360-Xenos, it seems Flipper was optimised with trilinear filtering and Xenos optimised for bilinear filtering.

    ok now someone with real graphics knowledge show me where I am wrong.
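
    For reference, the rates being compared can be sketched as units * clock / cycles per filtered sample. The unit counts and clocks below are assumptions not stated in the thread: 4 pipelines at 162 MHz for Flipper (which reproduces the 648M figure above) and 16 texture units at 500 MHz for Xenos. The one-cycle and two-cycle costs follow the post's own premise.

```python
def filtered_rate_msamples(units: int, clock_mhz: float, cycles_per_sample: int) -> float:
    """Filtered samples per second, in millions, for a given per-sample cycle cost."""
    return units * clock_mhz / cycles_per_sample


# Assumed Flipper figures: 4 pipelines, 162 MHz, trilinear in one cycle.
flipper_trilinear = filtered_rate_msamples(4, 162, 1)

# Assumed Xenos figures: 16 texture units, 500 MHz.
xenos_bilinear = filtered_rate_msamples(16, 500, 1)
xenos_trilinear = filtered_rate_msamples(16, 500, 2)  # two cycles via loopback

print(flipper_trilinear, xenos_bilinear, xenos_trilinear)
```

    Under these assumptions the two-cycle trilinear cost halves the Xenos rate but still leaves it several times above Flipper in absolute terms.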
     
  15. X-AleX

    Newcomer

    Joined:
    May 20, 2005
    Messages:
    75
    Likes Received:
    14

    That seems odd, if true. :evil:
     
  16. Rockster

    Regular

    Joined:
    Nov 5, 2003
    Messages:
    973
    Likes Received:
    129
    Location:
    On my rock
    Once again, this is a bandwidth issue. Texture samples in the GameCube are made from eDRAM (1MB texture buffer), which allows for the additional samples per clock required for trilinear. That particular chip, since it was designed around the start of the shader era, tackled the problem of programmability through extensive/exotic texture use as opposed to shaders.
     
  17. richardpfeil

    Newcomer

    Joined:
    Jun 22, 2005
    Messages:
    34
    Likes Received:
    0
    There seem to be some things happening here that are not being explicitly stated. Tiling means either processing the vertices multiple times (once for each tile) or binning and deferring rendering. The article states "During the Z only rendering pass the max extents within the screen space of each object is calculated and saved in order to alleviate the necessity for calculation of the geometry multiple times." The key word here is 'Saved'. Where is this information saved? The answer seems to be MEMEXPORT.

    This is purely speculation on my part, but it seems to make some sense. Here's the process...

    Send all geometry to the GPU.
    - All 48 shaders processing vertices.
    - The results of vertex shading go to two places, rasterization and MEMEXPORT.
    - Rasterization generates Z values, out to ROPs.
    - Raster also calculates tile hits, and attaches info to shading results in MEMEXPORT.
    - MEMEXPORT queue written to main memory.

    Render each tile.
    - All 48 shaders processing pixels.
    - Set up tile.
    - MEMEXPORT data marked for this tile sent back into GPU.
    - This data goes directly back into the Rasterizer.
    - Raster results sent to shaders.
    - Shader results to ROPs.

    If I'm right it brings up some interesting questions...
    - Can the ROPs work in double color, as well as double Z modes?
    - How much memory would the exported vertex shader results take up? (My guess, 64 bytes * numVerts or 64MB for roughly a million polygons)
    - What is the bandwidth cost?

    Bottlenecks when running this way...
    - 48 Vec4 + Scalar per clock, only one triangle to rasterizer per clock. (Ouch! But that's not an unheard of vertex shader length)
    - 48 Vec4 + Scalar per clock, 8 ROPs per clock. (Not too bad, just need 6 Vec4/Scalar pairs in your pixel shader)
    - Triangles smaller than 8 pixels will starve the ROPs. (True in any case)
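
    The back-of-envelope numbers in this post can be checked with a short sketch. The helper name is invented, and the 64 bytes per vertex is the poster's own guess rather than a documented figure.

```python
def exported_vertex_bytes(num_verts: int, bytes_per_vert: int = 64) -> int:
    """Memory needed to store shaded vertices, using the post's guessed 64 bytes each."""
    return num_verts * bytes_per_vert


one_million = exported_vertex_bytes(1_000_000)
print(one_million / 1e6, "MB (decimal)")
print(round(one_million / (1024 * 1024), 2), "MiB")

# Shader ALUs per ROP: how many Vec4 + Scalar instructions a pixel shader
# needs before the 8 ROPs, rather than the 48 ALUs, become the limit.
alus, rops = 48, 8
instructions_to_balance = alus // rops
print(instructions_to_balance)
```

    The estimate treats vertex count and polygon count as roughly equal, as the post does.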
     
  18. Rockster

    Regular

    Joined:
    Nov 5, 2003
    Messages:
    973
    Likes Received:
    129
    Location:
    On my rock
    I think you're misunderstanding the process. The driver simply tags each vertex fetch command with the tiles that command affects. All those commands are stored by the driver, since it handles both the Z-only and the rendering passes. So it sends all geometry to update hier-Z and is told which tiles each command affects. Then, when rendering tile 1 for example, the driver only submits the commands which affect that tile. Some objects will cross tile boundaries, and those will need their geometry processed multiple times. The Z-only rate with 4xAA is 64 Z samples per clock, or 32 Gzixels/sec.
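
    A minimal sketch of the tagging scheme described in this post, not the actual driver. The class, field and command names are hypothetical, and the tile masks are filled in by hand where a real Z-only pass would produce them. The 500 MHz clock is inferred from the post's own 64 samples per clock and 32 G per second.

```python
from dataclasses import dataclass


@dataclass
class DrawCommand:
    name: str
    tile_mask: int = 0  # bit i set means the command touches tile i


def commands_for_tile(commands: list, tile: int) -> list:
    """Return only the commands tagged as affecting the given tile."""
    return [c for c in commands if c.tile_mask & (1 << tile)]


# Hypothetical frame with 3 tiles; masks stand in for the Z-only pass results.
commands = [
    DrawCommand("sky", 0b001),
    DrawCommand("character", 0b011),  # crosses a tile boundary, so submitted twice
    DrawCommand("floor", 0b110),
]

for tile in range(3):
    print(tile, [c.name for c in commands_for_tile(commands, tile)])

# Z-only rate from the post: 64 Z samples per clock at an inferred 500 MHz.
z_rate_gsamples = 64 * 500e6 / 1e9
print(z_rate_gsamples)
```

    Commands whose mask has more than one bit set are the "objects that cross tile boundaries" and are the only ones whose geometry is processed more than once.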
     
  19. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    11,716
    Likes Received:
    2,137
    Location:
    London
    The identity of the tile(s) containing the triangle still needs to be stored somewhere. If you have 3 tiles, you need to know which of the three tiles a triangle intersects - e.g. a tile coverage mask, with batches of triangles' masks compressed in some meaningful way.

    By making the z-only pre-pass perform transform, lighting and shading of vertices, you are left with a reduced set of vertices in screen space, rather than world or object space. They're all fully lit and should only need rasterising.

    Jawed
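
    One way to picture the coverage mask mentioned above, as a sketch only. It assumes the screen is split into equal horizontal strips and that only the vertical extent of a triangle or batch matters. The actual tile layout and mask encoding are not given in this thread, and the function name is made up.

```python
def tile_coverage_mask(y_min: int, y_max: int,
                       screen_height: int = 720, num_tiles: int = 3) -> int:
    """Bitmask of horizontal-strip tiles overlapped by the extent [y_min, y_max]."""
    tile_h = -(-screen_height // num_tiles)  # ceiling division
    first = max(0, y_min // tile_h)
    last = min(num_tiles - 1, y_max // tile_h)
    mask = 0
    for t in range(first, last + 1):
        mask |= 1 << t
    return mask


print(bin(tile_coverage_mask(100, 300)))  # spans the first two strips
print(bin(tile_coverage_mask(500, 700)))  # bottom strip only
print(bin(tile_coverage_mask(0, 719)))    # covers every strip
```

    With only a few tiles such a mask fits comfortably in a single byte.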
     
  20. nAo

    nAo Nutella Nutellae
    Veteran

    Joined:
    Feb 6, 2002
    Messages:
    4,400
    Likes Received:
    440
    Location:
    San Francisco
    Xenos doesn't tag triangles, it tags primitive batches.
    It needs to reserve just one or two more bytes in the command buffer to save the tags, since you only have a few tiles.
    It depends; developers can do lighting in the first pass or in any subsequent pass.
     