XBox2 specs (including graphics part) on Xbit

Discussion in 'Architecture and Products' started by Rys, Apr 26, 2004.

  1. pc999

    Veteran

    Joined:
    Mar 13, 2004
    Messages:
    3,628
    Likes Received:
    31
    Location:
    Portugal
    A rational, empirical estimation ...
     
  2. Megadrive1988

    Veteran

    Joined:
    May 30, 2002
    Messages:
    4,638
    Likes Received:
    148
    no because CELL is an architecture, not a specific chip. there will be many different processors based on CELL.

    it's like asking:
    'can anyone estimate the power of this in comparison to X86, MIPS, SuperH, ARM, etc.?'

    what you might want to ask is:

    'how will this (these Xbox 2 specs) compare to the PS3's Cell-based chipset?'
     
  3. KimB

    Legend

    Joined:
    May 28, 2002
    Messages:
    12,902
    Likes Received:
    218
    Location:
    Seattle, WA
    I suppose I should have been more specific about tense.

    Anyway, my point was that it's not clear to me whether that file was from before or after "spec lockdown." As a side note: if the console is set to be released by the end of 2005, then spec lockdown may be right about now.
     
  4. aaronspink

    Veteran

    Joined:
    Jun 20, 2003
    Messages:
    2,641
    Likes Received:
    64
    What if they have a copy-on-write capability? They certainly have enough bandwidth into main memory to support the full-screen write bandwidth, and it shouldn't be too hard to implement copy-on-write or a trickle-write system to copy the final frame out.

    In this way you get the bandwidth of the embedded frame buffer without requiring that the frame buffer be sized to handle double-buffering. It appears that they should be able to fit a 1280x720 4xFP16 RGBA + 24b Z frame into ~10MB of embedded DRAM. If you have an additional bit per pixel for a frame ID, copy-on-write plus trickle should allow you to get the ~300 MB/s into main memory without too many issues.

    Maybe add another bit to indicate an FSAA compression failure spilled to main memory, which in general should be rare. If so, the effective bandwidth would be extremely high. I would assume that they texture solely out of main memory. The big question is whether they can use the embedded DRAM for off-screen buffers for effects or whether those go directly to main memory, though it probably won't be a big performance hit to do the off-screen buffers in main memory if it really has the 22+ GB/s of bandwidth.
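
    As a back-of-the-envelope check on those figures (illustrative arithmetic only, assuming one stored sample per pixel as described above):

    ```python
    # Sanity-check of the "~10 MB" eDRAM figure: FP16 RGBA colour plus a
    # 24-bit Z value, one stored sample per pixel (extra AA samples
    # compressed or spilled to main memory).
    width, height = 1280, 720
    color_bytes = 4 * 2          # RGBA, 2 bytes per FP16 channel
    z_bytes = 3                  # 24-bit depth
    pixels = width * height      # 921,600

    framebuffer_bytes = pixels * (color_bytes + z_bytes)
    print(framebuffer_bytes / 1e6)        # ~10.1 MB, matching the ~10 MB estimate

    # Trickling the whole buffer out once per frame at 30 fps:
    print(framebuffer_bytes * 30 / 1e6)   # ~304 MB/s, in the ballpark of the ~300 MB/s quoted
    ```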

    Aaron Spink
    speaking for myself inc.
     
  5. DemoCoder

    Veteran

    Joined:
    Feb 9, 2002
    Messages:
    4,733
    Likes Received:
    81
    Location:
    California
    I'm not clear how copy-on-write is supposed to solve the problem. Copy-on-write saves space when, say, you fork() a process: all memory is shared until you write to a page, and then that page is copied. That saves you the trouble of wasting 2x the memory immediately when fork()ing.

    But you still need enough eDRAM to store one complete copy, and it appears to me that there isn't enough to store one HDTV 4xFSAA HDR frame. Are you suggesting that only areas that fail 4:1 compression get "spilled" to main memory? Even with virtualized FB memory, it still seems like it would be a huge hit to performance, since the page-ins and page-outs will happen at main memory speeds.

    Virtual memory works so long as you only require a portion of the VM space for your inner loop or hotspot. Virtual memory on a CPU, for example, would not work very well if every app in the system had to touch every part of its code, since you'd be bogged down in page faults. It works because in the vast majority of apps, only a small portion of the code needs to be in main memory. (e.g. what percentage of the Emacs code base is needed to edit a text file?)

    Ditto for texture virtualization, if you have a huge texture atlas, but only need a portion of the pixels for any given frame.


    With the FB, it is much more likely that a huge number of pixels will be touched more than once, and therefore, the entire FB will be paged in/out at some point, limiting you to the main memory bandwidth.
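
    A toy illustration of that working-set argument (nothing hardware-specific, just an LRU cache over synthetic page accesses): a small hot working set is served with almost no misses, while touching every page each pass degenerates to backing-store speed.

    ```python
    from collections import OrderedDict
    import random

    def misses(accesses, cache_pages):
        """Count misses for an LRU cache holding `cache_pages` pages."""
        cache, miss = OrderedDict(), 0
        for page in accesses:
            if page in cache:
                cache.move_to_end(page)
            else:
                miss += 1
                cache[page] = True
                if len(cache) > cache_pages:
                    cache.popitem(last=False)   # evict least recently used
        return miss

    total_pages, cache_pages = 1000, 100
    hot = random.choices(range(50), k=100_000)                   # hotspot: 50 pages reused heavily
    everything = random.choices(range(total_pages), k=100_000)   # every page touched (FB-like)

    print(misses(hot, cache_pages))         # ~50: only compulsory misses
    print(misses(everything, cache_pages))  # ~90,000: constant paging at main-memory speed
    ```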
     
  6. MfA

    MfA
    Legend

    Joined:
    Feb 6, 2002
    Messages:
    6,807
    Likes Received:
    473
    You could always tile (that said, I don't think this will be doing 4x at HDTV resolutions).
     
  7. DemoCoder

    Veteran

    Joined:
    Feb 9, 2002
    Messages:
    4,733
    Likes Received:
    81
    Location:
    California
    Yeah, you need to do the "inner loop" a tile at a time and only write a tile out to main memory once it will never be touched again during that render pass. If it does get touched again in the same pass, that means reading the spilled tiles back into eDRAM.
     
  8. bloodbob

    bloodbob Trollipop
    Veteran

    Joined:
    May 23, 2003
    Messages:
    1,630
    Likes Received:
    27
    Location:
    Australia
    Is it just me, or does the external video scaler seem odd? Wouldn't it be better to render at the native resolution rather than have a separate chip doing the scaling???
     
  9. KimB

    Legend

    Joined:
    May 28, 2002
    Messages:
    12,902
    Likes Received:
    218
    Location:
    Seattle, WA
    I think this would be the best way to make use of relatively small amounts of eDRAM. You could have a FIFO "tile cache" that would automatically demote the last-written tile to the end of the list (such that the least recently written tile is always the first to be output).

    The main question would be when to output tiles (reading is obvious: whenever a tile is needed that's not in cache). So, what I would do is speculatively predict when there will be extra available memory bandwidth that could be used to output a tile, and thus attempt not to starve other consumers of memory bandwidth (vertex buffers, textures, etc.) while at the same time attempting to keep the tile cache from filling.

    One could even not immediately erase a tile from cache once it's written out, but instead keep it around until the cache does fill. Then that tile can be trivially deleted (since it's known to be consistent with the copy in main memory).

    This would be a significant performance benefit whenever there are effects like blending that cover the same rough region of the screen a few times before covering much else of the screen. There would be a problem, however, with effects that cover a large area of the screen, but selectively choose not to write to many of those pixels (for example, an alpha test that covers a large portion of the screen).

    Other than that, it would be a significant benefit in memory granularity: memory bandwidth usage is always more efficient if done in larger chunks (toward that end it may be useful to not only cache the framebuffer, but also textures, vertex buffers, etc.). This means that even relatively normal rendering would benefit a fair amount when using this method.

    Edit: one could also have an option for reading in the z-buffer and using a write mask for color buffer outputs to prevent re-reading of the color buffer when it is unnecessary to do so.
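
    A minimal sketch of that sort of write-back tile cache (purely illustrative; the class, the 64-tile capacity and the trickle-out hook are my own assumptions, not anything from the diagram):

    ```python
    from collections import OrderedDict

    TILE_CAPACITY = 64  # hypothetical number of tiles the eDRAM can hold

    class TileCache:
        """Write-back tile cache in front of a main-memory framebuffer.
        Recently touched tiles stay resident; the least recently touched
        dirty tile is the first candidate to be written out."""

        def __init__(self, capacity=TILE_CAPACITY):
            self.capacity = capacity
            self.tiles = OrderedDict()  # tile_id -> dirty flag; front = oldest

        def touch(self, tile_id, write=True):
            """Called whenever the GPU reads or writes a pixel in this tile."""
            if tile_id not in self.tiles:
                if len(self.tiles) >= self.capacity:
                    self._evict_oldest()
                # a real implementation would read the tile in from main memory here
                self.tiles[tile_id] = False
            if write:
                self.tiles[tile_id] = True      # tile now differs from main memory
            self.tiles.move_to_end(tile_id)     # demote to the end of the list

        def _evict_oldest(self):
            tile_id, dirty = self.tiles.popitem(last=False)
            if dirty:
                self._write_back(tile_id)

        def idle_flush(self):
            """Trickle one dirty tile out when spare main-memory bandwidth is
            predicted, but keep it resident so a later hit needs no re-read."""
            for tile_id, dirty in self.tiles.items():
                if dirty:
                    self._write_back(tile_id)
                    self.tiles[tile_id] = False  # clean: trivially evictable later
                    return

        def _write_back(self, tile_id):
            pass  # stand-in for copying the tile to the main-memory framebuffer
    ```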
     
  10. DeathKnight

    Regular

    Joined:
    Jun 19, 2002
    Messages:
    744
    Likes Received:
    4
    Location:
    Cincinnati, OH
    I'd see it being put to better use for DVD playback.
     
  11. aaronspink

    Veteran

    Joined:
    Jun 20, 2003
    Messages:
    2,641
    Likes Received:
    64
    Copy-on-write is used in a variety of areas in computer architecture besides OSes. For instance, storage subsystems also use copy-on-write to implement virtualization and snapshotting, and it has been used in some cluster interconnects as well.

    The basic concept is that it saves you from having to do a large block copy immediately. You still have to have the space available, and you will still want a background trickle process to effect the replication, but it saves you the immediate cost of delaying other things while you do a large, time-consuming copy.

    Why have enough capacity in your costly custom memory to actually store 4xFSAA at 4xFP16+Z if the vast majority of cases will only require a fraction of that memory?

    Why not make the high-speed memory the primary buffer for the primary sample, and only use the higher-latency, lower-bandwidth memory when you actually require the additional samples?

    You never need to do the fill, only the spill. You only spill in the case where the additional FSAA samples will not compress into the value already stored. I.e. the embedded memory only ever contains the first AA sample; any other samples (if they are required) are stored in main memory. If the assumption is that in general you get the 4:1 compression of the AA samples, you get most of the bandwidth/latency advantage of the embedded memory, but you don't have to size the embedded memory to cover the full FSAA memory footprint.

    Virtual memory is the wrong analogy. Think more of a storage subsystem. Say you bought the storage subsystem to work as the data store for a database. You would therefore need to support a high number of I/Os per second, requiring you to buy lots and lots of expensive 15K RPM 36 GB disks (which are the most expensive per MB, generally 3-5x). But you also need to do nightly backups of the database in a consistent manner, which requires the disk subsystem to support doing a "snapshot" of the data on the disk (without a snapshot the data could get overwritten in places, resulting in an inconsistent restore, and the backup is pointless).

    Do you go out and buy another complete set of high cost disks to support this snapshot or do you buy large capacity, low performance drives that are cheap? Most of the time you won't need the additional disk drives because the data blocks don't get overwritten that fast. But when they do, you need the copy-on-write function to copy the data to the "snapshot" array.

    I believe that this is a better analogy to what I am proposing. Most of the time you don't need the extra storage space for the additional anti-alias samples because of the compression. In addition, you don't need to copy the whole of the back buffer to the front/display buffer in one atomic operation. By using the main memory as a one-way backing store for additional AA samples and the final display frame, you can significantly reduce the memory required in the embedded DRAM. You never fill the embedded DRAM from main memory, only spill/copy the embedded DRAM to main memory.

    As I stated earlier, I don't think that they are using the embedded memory for texturing, instead using the high bandwidth that they have directly to the main memory for texture fetch.


    But what if you had enough memory in the embedded DRAM to hold the entire 1280x720 4xFP16+Z frame at one sample per pixel? You would never need to fill it from main memory, just spill from it to main memory when required: for additional AA samples that won't compress, or when doing an FB flip (copying the embedded-DRAM-based back buffer to the main-memory-based front/display buffer).
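
    A sketch of the spill-only bookkeeping described above (my own illustration; the flag names and the all-samples-equal compression test are assumptions, not taken from the diagram or any real hardware):

    ```python
    # Per-pixel record kept in eDRAM: the first AA sample plus the two flag
    # bits suggested above (one for "already trickled out", one for
    # "extra AA samples spilled to main memory").
    from dataclasses import dataclass

    @dataclass
    class Pixel:
        base_sample: tuple = (0.0, 0.0, 0.0, 1.0)   # FP16 RGBA, first sample
        depth: float = 1.0
        spilled: bool = False        # extra samples live in main memory
        trickled: bool = False       # already copied to the display buffer this frame

    spill_buffer = {}   # stand-in for the main-memory store of uncompressed AA samples

    def resolve_aa_write(pixel, samples, pixel_addr):
        """Store 4 AA samples for one pixel: keep them 'compressed' (all equal)
        in eDRAM when possible, otherwise spill the extras to main memory
        (the rare case)."""
        if all(s == samples[0] for s in samples):
            # 4:1 compression succeeds: one stored value represents all samples
            pixel.base_sample = samples[0]
            pixel.spilled = False
            spill_buffer.pop(pixel_addr, None)
        else:
            # compression failure: first sample stays in eDRAM, the rest go out
            pixel.base_sample = samples[0]
            pixel.spilled = True
            spill_buffer[pixel_addr] = samples[1:]
        pixel.trickled = False      # pixel changed since the last copy-out
    ```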

    Aaron Spink
    speaking for myself inc.
     
  12. bbot

    Regular

    Joined:
    Apr 20, 2002
    Messages:
    715
    Likes Received:
    5
    Why does everyone assume that three separate CPUs are being used? Isn't it more likely, from looking at the diagram, that one tri-core CPU is being used?
     
  13. ERP

    ERP Moderator
    Moderator Veteran

    Joined:
    Feb 11, 2002
    Messages:
    3,669
    Likes Received:
    49
    Location:
    Redmond, WA
    What does it matter?
    Obviously from MS's standpoint a single package is a cheaper solution, but from a programming standpoint 1 tri-core CPU is the same as 3 separate CPUs.
     
  14. Ardrid

    Joined:
    Feb 27, 2004
    Messages:
    165
    Likes Received:
    1
    The assumption comes from the idea that MS was/is rumored to be using 3 PowerPC 976s (dual-core) in the next Xbox.
     
  15. nAo

    nAo Nutella Nutellae
    Veteran

    Joined:
    Feb 6, 2002
    Messages:
    4,325
    Likes Received:
    93
    Location:
    San Francisco
    TBR in this case... since I read this patent, I believe Sony is going to do the same on their Cell-based visualizer.

    ciao,
    Marco
     
  16. snakejoe

    Newcomer

    Joined:
    Apr 27, 2004
    Messages:
    51
    Likes Received:
    3
    More information from the original thread:
    http://bbs.gzeasy.com/index.php?showtopic=149175

     
  17. Fafalada

    Veteran

    Joined:
    Feb 8, 2002
    Messages:
    2,773
    Likes Received:
    49
    Actually, looking closer, the diagram pretty clearly states that the 33GB/s read is for memory (22) + L2 cache (11) combined, not eDRAM.
    I suspect this is like many of the current IBM CPUs where you can configure part of the L2/L3 cache to work as scratchpad memory (usually half of the cache can be switched to that mode).
    A 512KB streaming buffer at the speed of the L2 cache sounds pretty yummy to work with for fast CPU-GPU interaction. It also makes the whole idea of chiefly using the CPU for vertex shading a much more interesting possibility.

    Well, all the little pluses would likely indicate parts that are possibly subject to change, no :p? Most of the numbers with a '+' next to them seem linked to clock speeds anyhow, which are usually not locked down until the last moment.
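
    A rough feasibility check of that CPU-vertex-shading idea (all numbers here are my own assumptions, not from the diagram):

    ```python
    # Streaming CPU-shaded vertices to the GPU through a 512 KB scratchpad
    # running at L2 speed. Vertex size, vertex count and frame rate are
    # purely illustrative assumptions.
    buffer_bytes = 512 * 1024
    vertex_bytes = 32                     # assumed post-transform vertex size
    verts_per_frame = 1_000_000           # assumed scene complexity
    fps = 60

    stream_bytes_per_sec = verts_per_frame * vertex_bytes * fps
    print(stream_bytes_per_sec / 1e9)     # ~1.9 GB/s, well under the ~11 GB/s L2 figure

    refills_per_frame = verts_per_frame * vertex_bytes / buffer_bytes
    print(refills_per_frame)              # ~61 refills of the 512 KB buffer per frame
    ```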
     
  18. JohnH

    Regular

    Joined:
    Mar 18, 2002
    Messages:
    586
    Likes Received:
    2
    Location:
    UK
    Ignoring the problem of random access on a per-pixel basis when variable compression is applied (as it's not strictly variable here), I would debate the conclusion that the majority of pixels compress, given an architecture with what is probably a very high poly throughput, i.e. lots of triangle edges may be present...
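
    A crude way to put numbers on that objection (a toy estimate; the triangle counts and the equilateral-triangle perimeter approximation are my own assumptions):

    ```python
    import math

    # Toy estimate of the fraction of pixels that land on triangle edges,
    # where AA samples differ and 4:1 compression would fail.
    screen_pixels = 1280 * 720

    for triangles in (10_000, 50_000, 250_000):     # assumed visible triangles per frame
        area = screen_pixels / triangles            # average screen-space area per triangle
        perimeter = 4.56 * math.sqrt(area)          # edge pixels of a compact triangle
        edge_fraction = min(1.0, perimeter / area)  # share of a triangle's pixels on an edge
        print(triangles, edge_fraction)

    # 10,000 triangles  -> roughly half the pixels are edge pixels
    # 50,000+ triangles -> essentially every pixel is an edge pixel
    ```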

    John.

     
  19. Megadrive1988

    Veteran

    Joined:
    May 30, 2002
    Messages:
    4,638
    Likes Received:
    148
    This diagram is said to be old, but there's no source to back that up.



    DemoCoder, how much eDRAM would you consider satisfactory:
    16 MB, 20 MB, 24 MB, 32 MB, 40 MB, 48 MB ?
     
  20. pc999

    Veteran

    Joined:
    Mar 13, 2004
    Messages:
    3,628
    Likes Received:
    31
    Location:
    Portugal
    Thanks for the correction :)
     