XBox2 specs (including graphics part) on Xbit

From this spec, can anyone estimate the power of this in comparison to CELL?

no because CELL is an architecture not a specific chip. there will be many different processors based on CELL.

it's like asking:
'can anyone estimate the power of this in comparison to X86, MIPS, SuperH, ARM, etc.'

what you might want to ask is:

'how will this (these Xbox 2 specs) compare to the PS3's Cell-based chipset'
 
Vince said:
Chalnoth said:
Right. I was trying to state that anything that is currently in development may well differ from that plan. As an example, depending upon engineering concerns, the eDRAM may be reduced or dropped altogether.
Right. And Joe's still right. The designs have basically been locked down and the specs have been in developers' hands for several months now. Any change, especially one as large as dropping the eDRAM, would constitute a major failure. These things are pretty static; Sony has two years of R&D on MS prior to today, and possibly a year on them after today - so there is a lot of set-piece thinking when designing a console.
I suppose I should have been more specific about tense.

Anyway, my point was that it's not clear to me whether that file was from before or after "spec lockdown." As a side note: if the console is set to be released by the end of 2005, then spec lockdown may be right about now.
 
DemoCoder said:
The MSAA buffer *is* the backbuffer.

Anyway, "spilling" the backbuffer into main memory seems to defeat most of the purpose of keeping it in eDRAM. Keeping Z in eDRAM would make sense, since it needs to be read and written to many times. But if you're doing alpha blending, once you spill to main memory, you're performance is effectively bottlenecked by the read/write rate of main memory. When rendering, sure, you'd have a huge FB bandwidth to write to, but then, that FB has to be flushed to main memory at a certain point, and you'll have a stall, since it can't be flushed faster than the pipelines are filling up the edram.

"spilling" buffers only really gains you big boosts if you have a TBDR.

What if they have a copy-on-write capability? They certainly have enough bandwidth into the main memory to support the full-screen write bandwidth, and it shouldn't be too hard to implement copy-on-write or a trickle-write system to copy the final frame out.

In this way you get the bandwidth of the embedded frame buffer without the requirement that the frame buffer be sized such that it can handle double-buffering requirements. It appears that they should be able to fit a 1280x720 4xFP16 RGBA + 24b Z into ~10MB of embedded DRAM. If you have an additional bit per pixel for a frame ID, copy-on-write plus trickle should allow you to get the ~300 MB/s into the main memory without too many issues.

Maybe add another bit to indicate an FSAA compression failure (a spill to main memory), which in general should be rare. If so, the effective bandwidth would be extremely high. I would assume that they texture solely out of the main memory. The big question is whether they can use the embedded DRAM for off-screen buffers for effects, or go directly to main memory, though it probably won't be a big performance hit to do the off-screen buffers in main memory if it really has the 22+ GB/s of bandwidth.

Aaron Spink
speaking for myself inc.
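For reference, the ~10 MB footprint and the trickle-out rate aaronspink describes do roughly check out under one plausible packing. The flag bits and the display-buffer format in the sketch below are my assumptions, not a known spec; the copy-out rate lands in the same ballpark as his ~300 MB/s figure.

```cpp
// Quick check of the ~10 MB eDRAM footprint described above: 1280x720,
// FP16 RGBA colour (4 x 16-bit channels), 24-bit Z, plus a frame-ID bit and
// a compression-failure bit per pixel.  The extra flag bits and the exact
// packing are my reading of the proposal, not a known spec.
#include <cstdio>

int main() {
    const long long pixels   = 1280LL * 720LL;
    const int bits_per_pixel = 8 * (8 + 3) + 2;      // colour + Z + 2 flag bits = 90
    const double edram_bytes = pixels * bits_per_pixel / 8.0;
    std::printf("eDRAM footprint: %.2f MB\n", edram_bytes / 1e6);       // ~10.4 MB

    // Trickling the finished frame out to a 4-byte-per-pixel display buffer:
    const double front_bytes = pixels * 4.0;
    std::printf("front-buffer copy: %.1f MB/frame, %.0f MB/s at 60 fps\n",
                front_bytes / 1e6, front_bytes * 60.0 / 1e6);           // ~221 MB/s
    return 0;
}
```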
 
I'm not clear how copy-on-write is supposed to solve the problem. Copy-on-write saves space when, say, you fork() a process: all memory is shared until you write to a page, and then that page is copied. That saves you the trouble of wasting 2x the memory immediately when fork()ing.

But you still need enough eDRAM to store one complete copy, and it appears to me that there isn't enough to store one HDTV 4xFSAA HDR frame. Are you suggesting that only areas that fail 4:1 compression get "spilled" to main memory? Even with virtualized FB memory, it still seems like it would be a huge hit to performance, since the page-ins and page-outs will happen at main memory speeds.

Virtual memory works so long as you only require a portion of the VM space for your inner loop or hotspot. Virtual memory, on a CPU for example, would not work very well if every app in the system had to touch every part of its code, since you'd be bogged down in page faults. It works because in the vast majority of apps, only a small portion of the code needs to be in main memory. (e.g. what percentage of the Emacs code base is needed to edit a text file?)

Ditto for texture virtualization, if you have a huge texture atlas, but only need a portion of the pixels for any given frame.


With the FB, it is much more likely that a huge number of pixels will be touched more than once, and therefore, the entire FB will be paged in/out at some point, limiting you to the main memory bandwidth.
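To put rough numbers on "there isn't enough": under the usual assumptions (four stored samples per pixel, FP16 colour, 32-bit Z - my assumptions, though they match the 44 MB figure quoted later in the thread), the uncompressed frame is several times the eDRAM, and paging most of it through main memory every frame eats a meaningful slice of the 22 GB/s bus.

```cpp
// Back-of-the-envelope numbers behind the objection above: the worst-case
// (uncompressed) 1280x720, 4xFSAA, FP16+Z frame, and the main-memory traffic
// if most of it ends up being paged through main memory each frame.
// The 22 GB/s figure comes from the leaked diagram; the rest are assumptions.
#include <cstdio>

int main() {
    const double pixels  = 1280.0 * 720.0;
    const double samples = 4.0;               // 4x FSAA, every sample stored
    const double bpp     = 8.0 + 4.0;         // FP16 RGBA + 32-bit Z, per sample
    const double frame   = pixels * samples * bpp;

    std::printf("uncompressed frame: %.1f MB\n", frame / 1e6);   // ~44 MB vs 10 MB eDRAM

    // If most of it gets paged in and out once per frame at 60 fps:
    const double traffic = 2.0 * frame * 60.0;
    std::printf("page in + out at 60 fps: %.1f GB/s of a 22 GB/s bus\n", traffic / 1e9);
    return 0;
}
```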
 
Yeah, you need to do the "inner loop" on a tile at a time and finally write it out to main memory when it will never be touched again during that render pass. If it gets touched again in the same pass, that means reading the spilled tiles back into eDRAM.
 
Is it just me, or does the external video scaler seem odd? Wouldn't it be better to render the stuff at the native resolution rather than have a separate chip doing the scaling?
 
DemoCoder said:
Yeah, you need to do the "inner loop" on a tile at a time and finally write it out to main memory when it will never be touched again during that render pass. If it gets touched again in the same pass, that means reading the spilled tiles back into eDRAM.
I think this would be the best way to make use of relatively small amounts of eDRAM. You could have a FIFO "tile cache" that would automatically demote the most recently written tile to the end of the list (such that the least recently written tile is always the first to be output).

The main question would be when to output tiles (reading is obvious: whenever a tile is needed that's not in cache). So, what I would do is speculatively predict when there will be extra available memory bandwidth that could be used to output a tile, and thus attempt to not starve other applications (vertex buffers, textures, etc.) of memory bandwidth while at the same time attempting to keep the tile cache from filling.

One could even not immediately erase a tile from the cache once it's written out, but instead keep it around until the cache does fill. Then that tile can be trivially discarded (since it's known to be consistent with the copy in main memory).

This would be a significant performance benefit whenever there are effects like blending that cover the same rough region of the screen a few times before covering much else of the screen. There would be a problem, however, with effects that cover a large area of the screen, but selectively choose not to write to many of those pixels (for example, an alpha test that covers a large portion of the screen).

Other than that, it would be a significant benefit in memory granularity: memory bandwidth usage is always more efficient if done in larger chunks (toward that end it may be useful to not only cache the framebuffer, but also textures, vertex buffers, etc.). This means that even relatively normal rendering would benefit a fair amount when using this method.

Edit: one could also have an option for reading in the z-buffer and using a write mask for color buffer outputs to prevent re-reading of the color buffer when it is unnecessary to do so.
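A rough sketch of the kind of write-back tile cache described above. Tile size, capacity, the eviction order and the "flush when the bus has spare bandwidth" heuristic are all illustrative assumptions, not anything from the leak.

```cpp
// Illustrative sketch of a small write-back tile cache in front of main
// memory, along the lines described above.  Tile size, capacity and the
// "flush when the bus looks idle" heuristic are assumptions, not real hardware.
#include <cstdint>
#include <list>
#include <unordered_map>
#include <vector>

struct Tile {
    std::vector<uint8_t> texels;  // e.g. 32x32 pixels of packed colour+Z
    bool dirty = false;           // written since last copy to main memory
};

class TileCache {
public:
    explicit TileCache(size_t maxTiles) : maxTiles_(maxTiles) {}

    // Called by the ROPs: fetch a tile for read/modify/write.
    Tile& touch(uint32_t tileId, bool willWrite) {
        auto it = map_.find(tileId);
        if (it == map_.end()) {
            if (lru_.size() >= maxTiles_) evictOne();
            lru_.push_front(tileId);
            it = map_.emplace(tileId, Entry{Tile{}, lru_.begin()}).first;
            readFromMainMemory(tileId, it->second.tile);  // page-in at main-mem speed
        } else {
            // Most recently touched tile moves to the front (last to be evicted).
            lru_.splice(lru_.begin(), lru_, it->second.pos);
        }
        if (willWrite) it->second.tile.dirty = true;
        return it->second.tile;
    }

    // Called when the memory controller reports spare bandwidth: write back
    // the least recently touched dirty tile, but keep it resident so a later
    // eviction of that (now clean) tile is free.
    void opportunisticFlush() {
        for (auto id = lru_.rbegin(); id != lru_.rend(); ++id) {
            Tile& t = map_[*id].tile;
            if (t.dirty) { writeToMainMemory(*id, t); t.dirty = false; return; }
        }
    }

private:
    struct Entry { Tile tile; std::list<uint32_t>::iterator pos; };

    void evictOne() {
        uint32_t victim = lru_.back();               // least recently touched
        Tile& t = map_[victim].tile;
        if (t.dirty) writeToMainMemory(victim, t);   // forced write-back = stall
        map_.erase(victim);
        lru_.pop_back();
    }

    // Stand-ins for the actual eDRAM <-> main memory copies.
    void readFromMainMemory(uint32_t, Tile&) {}
    void writeToMainMemory(uint32_t, const Tile&) {}

    size_t maxTiles_;
    std::list<uint32_t> lru_;                        // front = most recent
    std::unordered_map<uint32_t, Entry> map_;
};
```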
 
bloodbob said:
Is it just me, or does the external video scaler seem odd? Wouldn't it be better to render the stuff at the native resolution rather than have a separate chip doing the scaling?
I'd see it being put to better use for DVD playback.
 
DemoCoder said:
I'm not clear how copy-on-write is supposed to solve the problem. Copy-on-write saves space when, say, you fork() a process: all memory is shared until you write to a page, and then that page is copied. That saves you the trouble of wasting 2x the memory immediately when fork()ing.

copy-on-write is used in a variety of areas in computer architecture besides OS and related areas. For instance, storage subsystems also use copy-on-write to implement virtualization and snapshotting. It has also been used in some cluster interconnects as well.

The basic concept is that it saves you from having to do a large block copy immediately. You still have to have the space available, and you will still want to have a background trickle process to effect the replication, but it saves you the immediate cost of delaying other things while you do a large, time-consuming copy.

But you still need enough eDRAM to store one complete copy, and it appears to me that there isn't enough to store one HDTV 4xFSAA HDR frame. Are you suggesting that only areas that fail 4:1 compression get "spilled" to main memory? Even with virtualized FB memory, it still seems like it would be a huge hit to performance, since the page-ins and page-outs will happen at main memory speeds.

Why have enough capacity in your costly custom memory to actually store 4xFSAA at 4xFP16+Z if the vast majority of cases will only require a fraction of that memory?

Why not make the high-speed memory the primary buffer for the primary sample, and only use the higher-latency, lower-bandwidth memory when you actually require the additional samples?

You never need to do the fill. Only spill. You only spill in the case where the additional FSAA samples will not compress into the value already stored; i.e. the embedded memory only ever contains the first AA sample, and any other samples (if they are required) are stored in the main memory. If the assumption is that in general you get the 4:1 compression of the AA samples, you get most of the bandwidth/latency advantage of the embedded memory, but you don't have to size the embedded memory to cover the full FSAA memory footprint.
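As a minimal sketch of that write path (my reading of the proposal; all names and structures below are invented): the eDRAM record holds sample 0 plus a flag, and only pixels whose four samples differ get their extra samples parked in main memory.

```cpp
// Sketch of a "spill only on compression failure" AA write path.  The eDRAM
// record holds the first sample plus one flag bit; extra samples only exist
// in main memory for pixels where the 4 samples are not identical.
#include <array>
#include <cstdint>
#include <unordered_map>
#include <vector>

struct Sample { uint64_t colorFp16x4; uint32_t z24; };

struct EdramPixel {
    Sample first;           // sample 0, always resident in eDRAM
    bool   spilled = false; // "compression failure" bit
};

class AaTarget {
public:
    AaTarget(int w, int h) : width_(w), edram_(size_t(w) * h) {}

    void writePixelSamples(int x, int y, const std::array<Sample, 4>& s) {
        EdramPixel& px = edram_[size_t(y) * width_ + x];
        bool allEqual = true;
        for (int i = 1; i < 4; ++i)
            if (s[i].colorFp16x4 != s[0].colorFp16x4 || s[i].z24 != s[0].z24)
                allEqual = false;

        px.first = s[0];
        if (allEqual) {
            // Common case: 4:1 "compression" succeeds, nothing leaves eDRAM.
            px.spilled = false;
        } else {
            // Rare case: spill samples 1..3 to the main-memory side buffer.
            px.spilled = true;
            auto& extra = spillBuffer_[key(x, y)];
            extra.assign(s.begin() + 1, s.end());
        }
    }

private:
    static uint64_t key(int x, int y) { return (uint64_t(y) << 32) | uint32_t(x); }

    int width_;
    std::vector<EdramPixel> edram_;                                   // ~10 MB on-die
    std::unordered_map<uint64_t, std::vector<Sample>> spillBuffer_;   // main memory
};
```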

Virtual memory works so long as you only require a portion of the VM space for your inner loop or hotspot. Virtual memory, on a CPU for example, would not work very well if every app in the system had to touch every part of its code, since you'd be bogged down in page faults. It works because in the vast majority of apps, only a small portion of the code needs to be in main memory. (e.g. what percentage of the Emacs code base is needed to edit a text file?)

Virtual memory is the wrong analogy. Think more of a storage subsystem. Say you bought the storage subsystem to work as the data store for a database. You would therefore need to support a high number of I/Os per second, requiring you to buy lots and lots of expensive 15K RPM 36 GB disks (which are the most expensive per MB, generally 3-5x). But you also need to do nightly backups of the database in a consistent manner, which requires the disk subsystem to support doing a "snapshot" of the data on the disk (without a snapshot, the data could get overwritten in places, resulting in an inconsistent restore and making the backup pointless).

Do you go out and buy another complete set of high cost disks to support this snapshot or do you buy large capacity, low performance drives that are cheap? Most of the time you won't need the additional disk drives because the data blocks don't get overwritten that fast. But when they do, you need the copy-on-write function to copy the data to the "snapshot" array.

I believe that this is a better analogy to what I am proposing. Most of the time you don't need the extra storage space for the additional anti-alias samples, because of the compression. In addition, you don't need to copy the whole of the back buffer to the front/display buffer in one atomic operation. By using the main memory as a one-way backing store for the additional AA samples and the final display frame, you can significantly reduce the memory required in the embedded DRAM. You never fill the embedded DRAM from main memory, only spill/copy the embedded DRAM to main memory.
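For anyone less familiar with the storage-side term, here is a schematic of the snapshot copy-on-write being used as the analogy (not any particular product): the fast array keeps serving live data, and a block is copied to the cheap array only the first time it is overwritten after the snapshot.

```cpp
// Schematic copy-on-write snapshot, as in the storage analogy above: the fast
// array keeps serving live data, and a block is copied to the cheap "snapshot"
// array only the first time it is overwritten after the snapshot was taken.
#include <cstdint>
#include <unordered_map>
#include <vector>

using Block = std::vector<uint8_t>;

class SnapshottedVolume {
public:
    explicit SnapshottedVolume(size_t blocks) : live_(blocks) {}

    void takeSnapshot() { snapshotActive_ = true; snapshot_.clear(); }

    void writeBlock(size_t lba, const Block& data) {
        // Copy-on-write: preserve the old contents on the cheap array the
        // first time this block is touched after the snapshot.
        if (snapshotActive_ && snapshot_.find(lba) == snapshot_.end())
            snapshot_[lba] = live_[lba];
        live_[lba] = data;            // normal write to the fast array
    }

    // The backup reads the snapshot view: old copy if preserved, else live.
    const Block& readSnapshot(size_t lba) const {
        auto it = snapshot_.find(lba);
        return it != snapshot_.end() ? it->second : live_[lba];
    }

private:
    std::vector<Block> live_;                      // expensive 15K RPM array
    std::unordered_map<size_t, Block> snapshot_;   // cheap, big, slow drives
    bool snapshotActive_ = false;
};
```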

Ditto for texture virtualization, if you have a huge texture atlas, but only need a portion of the pixels for any given frame.

As I stated earlier, I don't think that they are using the embedded memory for texturing, instead using the high bandwidth that they have directly to the main memory for texture fetch.


With the FB, it is much more likely that a huge number of pixels will be touched more than once, and therefore, the entire FB will be paged in/out at some point, limiting you to the main memory bandwidth.

But what if you had enough memory in the embedded DRAM to hold the entire 1280x720 4xFP16+Z frame? You would never need to fill it from main memory, just spill from it when required - for additional AA samples that won't compress, or when doing an FB flip (copying the embedded-DRAM back buffer to the main-memory front/display buffer).

Aaron Spink
speaking for myself inc.
 
Why does everyone assume that three separate cpus are being used? Isn't it more likely, from looking at the diagram, that one tri-core cpu is being used?
 
bbot said:
Why does everyone assume that three separate cpus are being used? Isn't it more likely, from looking at the diagram, that one tri-core cpu is being used?

What does it matter?
Obviously from MS's standpoint a single package is a cheaper solution, but from a programming standpoint one tri-core CPU is the same as three separate CPUs.
 
bbot said:
Why does everyone assume that three separate cpus are being used? Isn't it more likely, from looking at the diagram, that one tri-core cpu is being used?

The assumption comes from the idea that MS was/is rumored to be using 3 PowerPC 976s (dual-core) in the next Xbox.
 
DemoCoder said:
Yeah, you need to do the "inner loop" on a tile at a time and finally write it out to main memory when it will never be touched again during that render pass. If it gets touched again in the same pass, that means reading the spilled tiles back into eDRAM.
TBR in this case.. since I read this patent, I believe Sony is going to do the same on their Cell-based Visualizer.

ciao,
Marco
 
more information from original thread
http://bbs.gzeasy.com/index.php?showtopic=149175

CPU:
3 identical cores, 64-bit PowerPC at 3.5 GHz+
total 6 HW threads (AI, Rendering, Generation/Loading/Unpacking, Collision/Physics, Audio, Graphics)
32 KB 2-way I-cache, 32 KB 4-way D-cache
1 MB 8-way shared L2 cache
84+ GFLOPS in one chip

GPU:
The GPU can read/write all the system memory, including the CPU L2 Cache.
shader 3.0+, 4096 instr, dynamic branching, unified VS & PS
memory write from shader code

48 ALUs (shader processors), each at 500 MHz
466 GFLOPS

Free 4xFSAA

Frame buffer format:
INT, 2:10:10:10, 8:8:8:8, 16:16:16:16
FP, 11f:11f:10f, 16f:16f:16f:16f, 32f:32f:32f:32f

Pixel rate 4.0 BPix/S
Triangle generate 500 MTri/s
PixelShader 48 Ginstr/s
VertShader 48 Ginstr/s
Tex rate 8 Gpix/s
Tex Bandwidth 22GB/s
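The 84+ GFLOPS CPU figure is at least consistent with a simple peak-rate calculation, if one assumes each core can retire a 4-wide vector multiply-add per cycle; that per-core rate is my assumption, not part of the leak.

```cpp
// Peak-FLOPS sanity check for the "3 cores at 3.5 GHz = 84+ GFLOPS" line.
// The 8 flops/cycle/core (4-wide vector multiply-add) is an assumption.
#include <cstdio>

int main() {
    const double cores = 3.0;
    const double clock_hz = 3.5e9;
    const double flops_per_cycle = 8.0;   // 4-wide FMADD = 4 mul + 4 add
    std::printf("CPU peak: %.0f GFLOPS\n", cores * clock_hz * flops_per_cycle / 1e9);
    return 0;
}
```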
 
aaronspink said:
From the diagram, it certainly seems that the GPU uses both the console main memory and the embedded memory together (33 GB/s read stream and 22 GB/s write stream out of the GPU to CPU/main memory).
Actually, looking closer, the diagram pretty clearly states that the 33 GB/s read is for memory (22) + L2 cache (11) combined, not eDRAM.
I suspect this is like many of the current IBM CPUs where you can configure parts of the L2/L3 cache to work as scratchpad memory (usually half of the cache can be switched to that mode).
A 512 KB streaming buffer running at the speed of the L2 cache sounds pretty yummy to work with for fast CPU-GPU interaction (a rough sketch of what that might look like is below). It makes the whole idea of chiefly using the CPU for vertex shading much more of an interesting possibility too.

Chalnoth said:
Anyway, my point was that it's not clear to me whether that file was from before or after "spec lockdown." As a side note: if the console is set to be released by the end of 2005, then spec lockdown may be right about now.
Well, all the little pluses would likely indicate parts that are possibly subject to change, no :p? Most of the numbers with a '+' next to them seem linked to clock speeds anyhow, which are usually not locked down until the last moment.
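As for the 512 KB streaming buffer mentioned above, CPU-to-GPU streaming through a locked L2 region could look something like a simple ring buffer. This is purely speculative, with all sizes and synchronisation details invented for illustration.

```cpp
// Speculative sketch of using a 512 KB locked region of L2 as a CPU->GPU
// streaming ring buffer.  Addresses, sizes and the synchronisation scheme
// are all invented for illustration.
#include <atomic>
#include <cstddef>
#include <cstdint>

class VertexRing {
public:
    VertexRing(uint8_t* lockedL2, size_t bytes) : buf_(lockedL2), size_(bytes) {}

    // CPU side: copy a packet of post-transform vertices into the ring.
    bool push(const void* data, size_t len) {
        size_t head = head_.load(std::memory_order_relaxed);
        size_t tail = tail_.load(std::memory_order_acquire);
        if (size_ - (head - tail) < len) return false;   // GPU hasn't caught up
        for (size_t i = 0; i < len; ++i)
            buf_[(head + i) % size_] = static_cast<const uint8_t*>(data)[i];
        head_.store(head + len, std::memory_order_release);
        return true;
    }

    // GPU side (modelled here as another thread): consume len bytes.
    bool pop(void* out, size_t len) {
        size_t tail = tail_.load(std::memory_order_relaxed);
        size_t head = head_.load(std::memory_order_acquire);
        if (head - tail < len) return false;             // nothing ready yet
        for (size_t i = 0; i < len; ++i)
            static_cast<uint8_t*>(out)[i] = buf_[(tail + i) % size_];
        tail_.store(tail + len, std::memory_order_release);
        return true;
    }

private:
    uint8_t* buf_;
    size_t size_;
    std::atomic<size_t> head_{0}, tail_{0};   // monotonically increasing offsets
};
```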
 
aaronspink said:
DemoCoder said:
With only 10mb of EDRAM, FSAA and HDTV resolutions are not supportable. 1280*720*8bpp*4xfsaa = 29mb (yes, compression can be used, but you cannot depend on a guaranteed compression ratio, you have to allocate a buffer the size of the worst case). If you use 64-bit FB for HDR, it's worse: 1280*720*12bpp*4x = 44mb. Of course, it's all even worse for 1080i.

From the diagram, it certainly seems that the GPU uses both the console main memory and the embedded memory together (33 GB/s read stream and 22 GB/s write stream out of the GPU to CPU/main memory). If this is true, you could use the embedded DRAM as the primary buffer in the case of AA if you have confidence that the majority of pixels will compress. You would then add the additional buffers in the main memory to hold the additional samples for those pixels that did not compress.


Aaron Spink
speaking for myself inc.

Ignoring the problem of random access on a per-pixel basis when variable compression is applied (as it's not strictly variable here), I would debate the conclusion that the majority of pixels compress, given an architecture with what is probably a very high poly throughput, i.e. lots of triangle edges may be present...

John.
 
This diagram is said to be old, but there's no source to back that up.



DemoCoder, how much eDRAM would you consider satisfactory:
16 MB, 20 MB, 24 MB, 32 MB, 40 MB, 48 MB ?
 
Megadrive1988 said:
From this spec, can anyone estimate the power of this in comparison to CELL?

no because CELL is an architecture not a specific chip. there will be many different processors based on CELL.

it's like asking:
'can anyone estimate the power of this in comparison to X86, MIPS, SuperH, ARM, etc.'

what you might want to ask is:

'how will this (these Xbox 2 specs) compare to the PS3's Cell-based chipset'

Thanks for the correction :)
 