The most Detailed Tech Information on the Xbox360 yet

Shifty Geezer said:
What about rendering to textures and the like? I envisage post-processing and more advanced rendering techniques being somewhat restricted by XENOS sharing the system-wide bandwidth. Or is this just not seen as heavy-use requirement and able to be absorbed into the existing rendering pipeline?

But a 720p full-frame copy at a rate of 120 frames per second (stencil shadow + final frame) only consumes 332MB/s, 1.5% of available bandwidth.
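(For reference, a minimal sketch of that arithmetic; the 3 bytes per pixel is an assumption chosen because it reproduces the 332MB/s figure, and a 32-bit buffer comes out around 422MB/s, still under 2%.)

```python
# Back-of-envelope check of the full-frame copy figure above.
width, height = 1280, 720
bytes_per_pixel = 3               # assumed 24-bit colour; 32-bit gives ~422 MB/s
copies_per_second = 120           # stencil-shadow pass + final frame at 60 fps

mb_per_second = width * height * bytes_per_pixel * copies_per_second / 1e6
system_bw_bytes = 22.4e9          # 22.4 GB/s of system bandwidth

print(f"{mb_per_second:.0f} MB/s, "
      f"{100 * mb_per_second * 1e6 / system_bw_bytes:.1f}% of system bandwidth")
# -> 332 MB/s, 1.5% of system bandwidth
```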

Why are people getting so excited about a full-frame copy from back-buffer to front or a texture?

Jawed
 
ERP said:
Interesting observation of the day 640x480x4xAA just fits in the EDRAM.

That's neat. The calculations I've done over the past months suggested they would include enough eDRAM to do 480p at 4xAA: 10MB. Seems they might have planned that from the start, but later in the process went ahead and made 720p at 2xAA the default, even though it has to move multiple tiles per frame to do so.
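(A quick sketch of that footprint calculation, assuming 4 bytes of colour plus 4 bytes of Z/stencil per AA sample, which is what makes the 10MB figure line up.)

```python
# Quick footprint check, assuming 4 bytes colour + 4 bytes Z/stencil per
# AA sample (an assumption; it's what makes the 10MB figure work out).
def edram_footprint_mb(width, height, aa_samples, bytes_per_sample=8):
    return width * height * aa_samples * bytes_per_sample / (1024 * 1024)

print(edram_footprint_mb(640, 480, 4))    # ~9.4 MB: fits in one 10MB tile
print(edram_footprint_mb(1280, 720, 2))   # ~14.1 MB: the 720p 2xAA default needs two tiles
```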

I'm still not sure how 4:3 standard TVs will handle Xbox360 games. Does it display full screen or letterboxed? If full screen, is it formatted to fit the screen, with the sides cut off, or is some kind of intelligent pan-and-scan implemented in the game?

Either way, regular TV viewers might be in for a surprise to see how good it looks even with half the scan lines.

Tommy McClain
 
DemoCoder said:
It's not just bandwidth, it's also latency.

Well, if the EDRAM allowed you to complete the render 4x faster (say) than if the render was done directly against 22.4GB/s system memory, then I think you've come out way ahead.

Jawed
 
It can't empty the framebuffer until it is done rendering that tile, and while it is transferring the framebuffer out to main RAM, it cannot overlap that with rendering the next frame, because all of the RAM is presumably used, unless you want to make the tiles even smaller!

That translates into waiting.
 
It's clear that the tiles will need to be very small. Especially since it contains upsampled data. Microsoft is finally bringing out Talisman!
 
I'm not saying there's no wait. I'm saying that the wait is far outweighed by the speed of rendering. If you can render a target in 2ms using EDRAM, instead of 8ms using 22.4GB/s memory, then the 1ms cost of the copy isn't worth complaining about.

Obviously I'm guessing with these figures. Perhaps you ought to put some figures of your own together to support your argument.

Jawed
 
Rockster said:
It's clear that the tiles will need to be very small. Especially since it contains upsampled data. Microsoft is finally bringing out Talisman!

At 4xAA, an upsampled pixel consumes 32 bytes, so a 720p frame comes to roughly 28MB: 3 tiles of 10MB each.
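(Sketch of that tile count, with the per-sample byte split assumed to be 4 bytes colour + 4 bytes Z.)

```python
import math

# Sketch of the tile count: 4 samples x (4 bytes colour + 4 bytes Z) = 32 bytes
# per upsampled pixel (the per-sample split is an assumption consistent with
# the 32-byte figure above).
frame_bytes = 1280 * 720 * 4 * (4 + 4)
edram_bytes = 10 * 1024 * 1024

print(frame_bytes / (1024 * 1024))            # ~28.1 MB for the whole 720p frame
print(math.ceil(frame_bytes / edram_bytes))   # 3 tiles of 10MB
```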

Jawed
 
I am totally ignorant in this area...

Why does the eDRAM have to be filled before it is posted to the GDDR3 RAM pool? e.g. the numbers Jaws just threw out are three 10MB tiles. If latency is an issue, could it not just do the same work in 6 tiles? After it finishes one tile and begins the next in the eDRAM, the smaller tile would just be transferred. That probably makes no sense since I have no clue what I am talking about... it just seems like there should be ways to work around this. 256GB/s between the daughter-die logic and the eDRAM, and something like 32GB/s write / 16GB/s read between the daughter logic and the shader core, seems like a ton of bandwidth to exploit. Even if size is an issue, you can post the eDRAM buffer to the main memory so quickly it would not seem size would be an issue.

It seems like the eDRAM is really fast, compensating for the lack of system bandwidth. It would be hard for me to imagine ATI, NV, Sony, or any other tech company going with a design that reduces logic transistors for eDRAM if that eDRAM offers no benefit. We saw eDRAM in the GS and Flipper and it seemed to be a big plus for those systems.

As a couple of people have noted, ATI has a few other surprises yet. We learned a bit more about the eDRAM and FP10 today, but we still have yet to get into some of the more technical stuff on how it is doing all this. I would be surprised if there were not some ingenious ways to maximize the design.
 
No one is saying eDRAM isn't a benefit, just that there is clearly a cost to tiling, otherwise, why even use 10MB? Why not just stick 2MB of eDRAM on there and tile out the wazoo! Hell, they could have squeezed 2MB on the R500 core itself and just used tiling.

As for how the tiling works in HW, I'm at a loss. My best guess is this: all post-transformed screen-space data is saved and reprocessed for each tile. Primitives that span a tile boundary can be clipped by the GPU. Primitives that lie only in one tile or the other are "filtered" and not resent to the GPU.

The question is, what does the filtering? The CPU, or HW?
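(Here's a toy sketch of that filtering idea, purely hypothetical and not a description of the actual hardware: keep the post-transform triangles around and, per tile, resubmit only the ones whose screen-space bounds overlap it.)

```python
# Toy illustration of the "filtering" guess above. Purely hypothetical;
# not a description of Xenos hardware.
def overlaps(tri, tile):
    """tri: three (x, y) vertices; tile: (left, top, right, bottom)."""
    xs = [v[0] for v in tri]
    ys = [v[1] for v in tri]
    left, top, right, bottom = tile
    return not (max(xs) < left or min(xs) > right or
                max(ys) < top or min(ys) > bottom)

def triangles_for_tile(triangles, tile):
    # Triangles entirely outside the tile are filtered out; triangles that
    # span the tile boundary get sent anyway and clipped during rasterization.
    return [t for t in triangles if overlaps(t, tile)]

tiles = [(0, 0, 1279, 359), (0, 360, 1279, 719)]    # two horizontal 720p tiles
tris = [((10, 10), (50, 10), (30, 40)),             # lies in the top tile only
        ((100, 350), (160, 380), (120, 400))]       # spans the tile boundary
for tile in tiles:
    print(tile, len(triangles_for_tile(tris, tile)))
```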
 
I asked this in another thread, but most of you are not there and this is a more technical discussion, so here it goes.

The trade off seems to be MS went with 10MB of eDRAM and 23GB/s of system bandwidth. Sony went with 48GB/s for the entire system (~ 23GB/s + 15GB/s for the GPU).

1. Was the eDRAM a good tradeoff? Will the eDRAM save 25GB/s or more bandwidth? It looks like it needs to save at least this much bandwidth to be comparable to other options (like #3 below)

2. Do we really expect the RSX, with its memory configuration, to be able to do 1080p with 4x AA, HDR, and other memory bandwidth intensive tasks? How about 1080i with those same settings?

3. Would MS have been better off dumping the eDRAM and going with 256bit memory and having a single, fast, 46GB/s memory pool?

4. Was taking a 70M transistor cut in shader logic worth the eDRAM?

Basically it looks like MS loses 70M logic transistors and 256bit memory (and thus the system bandwidth) by going for eDRAM.

Those seem like pretty big tradeoffs. I want to know the advantages! ;) Specifically, how much "savings" the framebuffer is offering, because it looks like those savings have come at the cost of logic, and possibly of going with a higher-bandwidth memory architecture.
 
DemoCoder said:
No one is saying eDRAM isn't a benefit, just that there is clearly a cost to tiling, otherwise, why even use 10MB? Why not just stick 2MB of eDRAM on there and tile out the wazoo! Hell, they could have squeezed 2MB on the R500 core itself and just used tiling.

This is a damn good question. As has been observed already, a 480p frame fits within 1 tile, so maybe that's why 10MB was originally chosen.

Maybe it's like the tiling on R300 and later architectures (where each quad owns a tile of 16x16 pixels). In this case, ATI's experiments showed that 16x16 is optimal.

So whilst 10MB seems like a pretty strange amount, we're stuck not knowing what the constraints are.

As for how the tiling works in HW, I'm at a loss. My best guess is this: all post-transformed screen-space data is saved and reprocessed for each tile. Primitives that span a tile boundary can be clipped by the GPU. Primitives that lie only in one tile or the other are "filtered" and not resent to the GPU.

The question is, what does the filtering? The CPU, or HW?

The implications, thus far, have been that the GPU is doing the clipping. It seems to me that any tile-clipping is just like viewport clipping. Since viewport clipping comes very early it would be possible to predicate the raw vertex/triangle data (20 bytes per vertex?) but that's such a lot of data that it would have to go to system RAM.
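(For a rough sense of scale, with every figure below an assumption rather than a spec, a couple of million triangles at that vertex size is already way beyond anything that could be kept on-chip.)

```python
# Rough sense of scale only; every figure here is an assumption, not a spec.
triangles_per_frame = 2_000_000   # illustrative per-frame triangle count
verts_per_triangle = 3            # worst case: no vertex sharing
bytes_per_vertex = 20             # the "20 bytes per vertex?" guess above

total_mb = triangles_per_frame * verts_per_triangle * bytes_per_vertex / (1024 * 1024)
print(f"~{total_mb:.0f} MB of saved triangle data per frame")   # ~114 MB
```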

Obviously triangles that cover multiple tiles are a special case. And there'll be a lot of them...

Jawed
 
DemoCoder said:
10:10:10:2 is not an HDR format IMHO. HDR means "HIGH dynamic range", and 10:10:10:2's dynamic range is only 4x that of 8:8:8:8.

How's that?

Say that FP10 is 7e3 bits per component.

1e0 = 1
127e7 = 16,129

That's 63x the dynamic range of 8:8:8:8, at the same speed. The tradeoff is a loss of accuracy, but it's up to the developer what they want to use. Yes, it supports blending.
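(A hedged sketch of where a ~63x figure can come from, assuming a simple unbiased mantissa times 2^exponent encoding; the real FP10 encoding details, bias and so on, aren't given here, so the exact maximum is only illustrative.)

```python
# Hedged sketch: compare maximum representable component values, assuming a
# simple unbiased value = mantissa * 2**exponent for the 7e3 format. The real
# Xenos encoding (bias, denormals) isn't specified here; this only shows
# where a ~63x ratio comes from.
fp10_max = 127 * 2 ** 7      # 7-bit mantissa = 127, 3-bit exponent = 7
fx8_max = 255                # 8-bit fixed-point component

print(fp10_max, round(fp10_max / fx8_max, 1))   # 16256, ~63.7
```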

True, it's not as good as FP16, but it's twice as fast. Depending on what you're doing, that's good enough.

The tile size can be set by the developer; it's up to them how they want to tune it.
 
DemoCoder said:
It can't empty the framebuffer until it is done rendering that tile, and while it is transferring the framebuffer out to main RAM, it cannot overlap that with rendering the next frame, because all of the RAM is presumably used, unless you want to make the tiles even smaller!

That translates into waiting.

But since they are flushing fully rendered frames, they only need to flush the frame buffer (Z buffer is cleared)

2MB / 22GB/s ~ 0.1ms, or less than 0.6% of a frame (assuming 60 fps) per tile flush. One flush for 640x480, two for 1280x720, five for 1920x1080 (assuming 10:10:10-bit pixels and 4x FSAA).
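(Spelling that out, using the 2MB-per-flush and ~22GB/s figures above.)

```python
# Gubbi's flush-time figure spelled out, using the 2MB flush size and
# ~22GB/s bus figure from the post above.
flush_bytes = 2 * 1024 * 1024
bus_bytes_per_s = 22e9
frame_time_s = 1 / 60

flush_time_s = flush_bytes / bus_bytes_per_s
print(f"{flush_time_s * 1000:.2f} ms per flush, "
      f"{100 * flush_time_s / frame_time_s:.2f}% of a 60fps frame")
# -> ~0.10 ms per flush, ~0.57% of a 60fps frame
```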

That is so low, that they probably haven't bothered to do something clever.

Cheers
Gubbi
 
therealskywolf said:
Have you guys checked the thread about Xenos? E3, closed doors..... it even has pictures......

Tried to read it... but gave up. Auto-translated Japanese sucks. Reading stuff like:
In other words, increasing Texture Pipe random, if speed of memory is not sufficient, there is no meaning. In case of memory zone - Xbox 360 22.4GB/sec - with thinking of best balance, it seems that becomes this constitution.

Results in instant headache.

Cheers
Gubbi
 
aaaaa00 said:
DemoCoder said:
10:10:10:2 is not an HDR format IMHO. HDR means "HIGH dynamic range", and 10:10:10:2's dynamic range is only 4x that of 8:8:8:8.

How's that?

Say that FP10 is 7e3 bits per component.

1e0 = 1
127e7 = 16,129

That's 63x the dynamic range of 8:8:8:8, at the same speed. The tradeoff is a loss of accuracy, but it's up to the developer what they want to use. Yes, it supports blending.

Well, I assumed it was an "FX10" format, not an "FP10" format since I wouldn't believe they'd go to less than 8 bits precision. I thought maybe they'd try something funky like 8e2. Am I to assume your info is legit? Even so, average real world scenes exhibit dynamic ranges of 100,000:1. With only 7 bits for mantissa, you're going to accumulate significant error during blends.

True, it's not as good as FP16, but it's twice as fast. Depending on what you're doing, that's good enough.

But it's also going to have more artifacts. Especially at 7-bits, each additional bit has a greater impact. I don't think this format is appropriate for next-gen HDR titles, since the whole point of HDR in many cases is to *avoid* artifacts imposed by LDR. You're just trading one set of artifacting for another. It only makes sense if you've maxed out performance and want to switch it on "for free" (at the cost of artifacts). But I'd prefer to burn the fillrate/bandwidth on a real FP16 backbuffer.
 
Gubbi, if tiling is so cheap, why didn't ATI use *less* eDRAM, boost yields, lower costs, increase margins, and possibly even put so little that it could fit on the main R500 core?

Clearly, ATI analyzed the issue and tried to put in enough to hold 640x480x4xFSAA. There must have been a reason for their decision.
 
Acert93 said:
I asked this in another thread, but most of you are not there and this is a more technical discussion, so here it goes.

The trade off seems to be MS went with 10MB of eDRAM and 23GB/s of system bandwidth. Sony went with 48GB/s for the entire system (~ 23GB/s + 15GB/s for the GPU).

Which is exactly the same bandwidth that X850XTPE has. And X850XTPE is considered bandwidth limited. RSX's theoretical fill-rate, even using 16 ROPS, will be higher than X850XTPE's, because fill-rate is determined by ROP-count and memory clock (i.e. 16*700=11.2GP/s). But if X850XTPE is already limited, how the hell is RSX going to be better?... In HL-2 X850XTPE is bandwidth limited at 800x600 when you turn on AA.

In B3D's test for AA performance:

http://www.beyond3d.com/reviews/ati/r480/index.php?p=12

X850XTPE is losing 45% with 4XAA. ATI is reporting a loss of 5% for R500 using slightly more pixels. That's the best data we have right now...

1. Was the eDRAM a good tradeoff? Will the eDRAM save 25GB/s or more bandwidth? It looks like it needs to save at least this much bandwidth to be comparable to other options (like #3 below)

As the triangle count increases in games, you need more and more bandwidth. Hierarchical-Z will save writing an awful lot of those triangles into the frame buffer, but IMR pre-supposes overdraw. Next-gen games are going to use more frame-buffer bandwidth. There's no escape.

I wish there were some breakdowns of this stuff, e.g. using Quake 2, UT2k3, HL-2 as generational models for performance...

It's extremely hard to quantify the benefits of EDRAM right now...

2. Do we really expect the RSX, with its memory configuration, to be able to do 1080p with 4x AA, HDR, and other memory bandwidth intensive tasks? How about 1080i with those same settings?

Well 1024x768 with FP16 HDR no AA is the playable limit (60fps) for NV40, and that's 85% of the pixels in 1280x720. 1080p 4xAA HDR looks doomed to me.

3. Would MS have been better off dumping the eDRAM and going with 256bit memory and having a single, fast, 46GB/s memory pool?

That's prolly on the limit of being fast enough and doesn't provide any headroom for HDR or multiple render targets (whether for stencil shadowing or motion blur, etc.).

4. Was taking a 70M transistor cut in shader logic worth the eDRAM?

Basically it looks like MS loses 70M logic transistors and 256bit memory (and thus the system bandwidth) by going for eDRAM.

I believe that ATI has designed a set of balanced bandwidths - a holistic design. It may well be true that they've targeted, say, 2 million polys on a 640x480 resolution screen, hence the balance has been lost somewhat with a 720p target. But overall the design speaks of "what's the best way to use 300 million transistors?", rather than "hey, we've got 300 million transistors, we can increase this bit by 50%, yay!".

I think the XB360-specific API that ATI's designed for is also a big factor. This is WGF2.0-lite. This is where GPUs are headed anyway...

Those seem like pretty big tradeoffs. I want to know the advantages! ;) Specifically, how much "savings" the framebuffer is offering, because it looks like those savings have come at the cost of logic, and possibly of going with a higher-bandwidth memory architecture.

The EDRAM chip's internal bandwidth is very much a real-world win. Blending/filtering and z-testing all consume lots of bandwidth. You might only be consuming 32GB/s of bandwidth in sending fragment data to the ROP, but the ROP itself consumes many times that in order to perform all its duties - or it would if you let it :)
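(To put an entirely illustrative number on that amplification, with an assumed fill rate and assumed per-sample costs, not specs:)

```python
# Entirely illustrative numbers (assumptions, not specs): what a blended,
# Z-tested 4xAA pixel costs behind the ROPs in colour/Z read+write traffic.
pixels_per_second = 4e9               # assumed pixel fill rate
samples_per_pixel = 4                 # 4xAA
bytes_per_sample = (4 + 4) + (4 + 4)  # colour read+write, Z read+write

rop_traffic_gb_s = pixels_per_second * samples_per_pixel * bytes_per_sample / 1e9
print(rop_traffic_gb_s)               # 256.0 GB/s of traffic behind the ROPs
```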

By keeping that bandwidth consumption away from the 22.4 GB/s of system memory bandwidth, which is primarily where texture data (or rendered texture data) lives, you've freed-up a lot of the plain texturing bandwidth, as well as the bandwidth the CPU needs.

Obviously XB360's CPU bandwidth is limited because it actually only has 10.8GB/s of read bandwidth available to it (half of Cell - which is, perhaps, a good match for the respective FLOPS ratings?...). How much do physics, AI and world geometry consume? Good question.

Jawed
 
Jawed said:
In HL-2 X850XTPE is bandwidth limited at 800x600 when you turn on AA.

In B3D's test for AA performance:

http://www.beyond3d.com/reviews/ati/r480/index.php?p=12

X850XTPE is losing 45% with 4XAA.

Aren't you forgetting that every 2 samples require an extra clock, so going from 2x to 4x cuts fillrate in half, regardless of the bandwidth? The X850XTPE could have infinite bandwidth, and 4xFSAA would still lose ~50%. The RSX could win if its ROPs have been upgraded to write more AA samples per clock (doubtful), or if it has more ROPs than the X850XTPE. Even though they'd be bandwidth limited, more ROPs would mean less of a fillrate hit for 4xFSAA.
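(A quick sketch of that clock-cost argument; the ROP count, clock and samples-per-clock below are illustrative assumptions, not confirmed specs for any of these parts.)

```python
import math

# Sketch of the clocks-per-pixel argument: if a ROP writes 2 AA samples per
# clock, 4xAA needs two clocks per pixel whatever the bandwidth. ROP count,
# clock and samples-per-clock are illustrative assumptions, not real specs.
def pixel_rate_gp_s(rops, core_clock_mhz, aa_samples, samples_per_clock=2):
    clocks_per_pixel = math.ceil(aa_samples / samples_per_clock)
    return rops * core_clock_mhz * 1e6 / clocks_per_pixel / 1e9

print(pixel_rate_gp_s(16, 540, aa_samples=2))   # 8.64 GP/s at 2xAA
print(pixel_rate_gp_s(16, 540, aa_samples=4))   # 4.32 GP/s at 4xAA: half
```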


Well 1024x768 with FP16 HDR no AA is the playable limit (60fps) for NV40, and that's 85% of the pixels in 1280x720. 1080p 4xAA HDR looks doomed to me.

Playable limit on what game? I can run Counter-Strike: Source @ 1280x1024 with 2x FSAA on a GeForce 6600 above 60fps on most maps.
 