Complete Details on Xenos from E3 private showing!

Laa-Yosh said:
(Z/stencil rendering, alpha blended polygon rendering, overdraw, etc.) into a far smaller amount of bandwidth, thus it is likely that it'll become bottlenecked by it.
So all this stuff is happening on the eDRAM, by the logic on the eDRAM die? What we're saying is that ATI moved the most common and bandwidth-intensive aspects of rendering onto a small partner chip, with lots of local storage and enough logic to do this work at high speed? They've taken this logic AWAY from the main GPU and all its shaders, to localise it in the fastest area. The eDRAM isn't enough room for a full high-res backbuffer, so the scene is divided into tiles, sent at 35 GB/s to the eDRAM, where the logic does its fast thing, and then the front buffer is composed from several writes from the eDRAM.

Have I finally cracked it?! :?
 
Shifty Geezer said:
Laa-Yosh said:
(Z/stencil rendering, alpha blended polygon rendering, overdraw, etc.) into a far smaller amount of bandwidth, thus it is likely that it'll become bottlenecked by it.
So all this stuff is happening on the eDRAM, by the logic on the eDRAM die? What we're saying is that ATI moved the most common and bandwidth-intensive aspects of rendering onto a small partner chip, with lots of local storage and enough logic to do this work at high speed? They've taken this logic AWAY from the main GPU and all its shaders, to localise it in the fastest area. The eDRAM isn't enough room for a full high-res backbuffer, so the scene is divided into tiles, sent at 35 GB/s to the eDRAM, where the logic does its fast thing, and then the front buffer is composed from several writes from the eDRAM.

Have I finally cracked it?! :?

George, I think he's got it! :D

From what I read, yeah, that's it. Those 192 FPUs may be the "hard work" MS was talking about doing *shrug*.


Question: Is the daughter die able to write tiles directly to memory, or does the main die have to assemble the tiles into a frame before posting to memory?

Also, how many tiles is that? 192 tiles (one for each FPU)? Or something like a batch, where 4 groups of 48 FPUs (= 192) are dedicated to backbuffer work every frame?

I'm doing my best to keep up with you guys, so I'm asking a lot of questions. This isn't my normal field of expertise, but it is so damned interesting that I really want to learn how it works :oops:
 
Laa-Yosh said:
RSX on the other hand has to fit all its backbuffer operations (Z/stencil rendering, alpha blended polygon rendering, overdraw, etc.) into a far smaller amount of bandwidth, thus it is likely that it'll become bottlenecked by it.

It'll be bottlenecked, but not crippled. The system can use both XDR and GDDR, so it has over 40 GB/s of bandwidth, and if you count 4:1 compression in MSAA modes, you wind up with 100+ GB/s. Hier-Z saves on Z reads/clears/writes as well. Given that a 6800 Ultra with 35 GB/s can hit 4500+ Mpix/s today, an RSX should be able to hit about 3 Gpix/s. But in stencil/Z mode, it can go over 15 Gpix/s easily.
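Quick back-of-envelope on those fill-rate figures. The 8 bytes per pixel (32-bit colour write + 32-bit Z) is my assumption, ignoring compression and texture traffic:

```python
BYTES_PER_PIXEL = 8  # assumed: 4 B colour write + 4 B Z, no compression

def fill_rate_mpix(bandwidth_gb_s):
    """Peak Mpix/s that a given memory bandwidth could sustain."""
    return bandwidth_gb_s * 1e9 / BYTES_PER_PIXEL / 1e6

print(fill_rate_mpix(35))    # ~4375 Mpix/s -- close to the 6800 Ultra's 4500+
print(fill_rate_mpix(22.4))  # ~2800 Mpix/s -- i.e. "about 3 Gpix/s" on GDDR3 alone
```

So the "about 3 Gpix/s" figure falls out of the GDDR3 bandwidth by itself, before counting the XDR side.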


I'd also emphasize the Xenos's ability to read/write the CPU L2 cache. This should allow games to stream content like this:

Compressed stuff in main memory -> uncompressing on CPU into L2 cache -> process on Xenos into rendered image in EDRAM -> output to the framebuffer

And why can't CELL+RSX do this?

Compressed stuff in main memory -> DMA'ed to SPE SRAM + decompress in SPE -> process on RSX via FlexIO -> rendered image in GDDR3 -> RAMDAC/TMDS.

This method utilizes a very small amount of the main system memory bandwidth, especially compared to doing the same on PS3. For example, you could do a lot of procedural textures and geometry this way.

The equivalent method on CELL also uses a small amount of main memory bandwidth, and with the SPEs' SRAM you have total control, so you're less likely to hit cache hiccups than with multithreading going on (6 threads sharing the L2). What's the L2 cache bandwidth of the XB360 compared to the collective bandwidth of the 7 SPEs' SRAM caches?
 
Okay, that's tile rendering, huh? Small bits in FAST RAM, because lots of FAST RAM is too expensive.

Could PS3 manage this too? RSX outputs data into a tiny tile to fit into a SPE's LS. SPE does the fast mundane work and chucks it back out? Presumably the tile would have to be TINY to work, and moving them all would be trouble.
 
Shifty Geezer said:
Okay, that's tile rendering huh? Small bits in FAST ram, because lots of FAST ram is too expensive.

Could PS3 manage this too? RSX outputs data into a tiny tile to fit into a SPE's LS. SPE does the fast mundane work and chucks it back out? Presumably the tile would have to be TINY to work, and moving them all would be trouble.

Not very useful. The SPEs combined have less than 2MB.
 
That's why I said TINY! What would be the minimum RAM requirement for a decent tile size, then? Xenos's 10 MB is good for a 480p tile, but could you get away with, say, a 256x256 tile? 64x64 even? Presumably it's a balancing act between the gains of processing in fast local storage and the overheads of fragmenting into tiles.
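For a sense of scale, here's the sort of arithmetic I'm doing (the 8 bytes per sample for colour + Z is an assumed figure, and I'm assuming the AA samples live in the tile like on Xenos):

```python
def tile_bytes(width, height, aa_samples=4, bytes_per_sample=8):
    """Backbuffer footprint of one tile: colour + Z for every AA sample."""
    return width * height * aa_samples * bytes_per_sample

print(tile_bytes(256, 256) // 1024)  # 2048 KB -- way over a 256 KB SPE LS
print(tile_bytes(64, 64) // 1024)    # 128 KB -- could fit, leaving room for code/data
```

So with 4xAA even a 256x256 tile is an order of magnitude too big for one local store, which suggests the tiles really would have to be TINY.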
 
The smaller amount of SPE memory really just means more tiling though, right? I doubt the extra tiling would outweigh the savings of doing the rest of the work using internal cell bandwidth versus external memory bandwidth (?)
 
Xenos will probably render in relatively large tiles, 2-8 per frame. They have 10MB to fit into, which is quite a lot. ERP (?) wrote that they can fit 640*480 with 4xAA into it, which means about 3 tiles for 1280*720 with 4xAA?
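The "about 3 tiles" guess checks out if you assume 8 bytes per AA sample (colour + Z), which is my guess at the footprint:

```python
import math

EDRAM = 10 * 1024 * 1024  # the daughter die's 10 MB of eDRAM

def tiles_needed(width, height, aa_samples, bytes_per_sample=8):
    """How many eDRAM-sized tiles a frame needs at a given AA level."""
    frame = width * height * aa_samples * bytes_per_sample
    return math.ceil(frame / EDRAM)

print(tiles_needed(640, 480, 4))   # 1 -- 480p 4xAA fits in a single pass
print(tiles_needed(1280, 720, 4))  # 3 -- matches the "about 3 tiles" estimate
```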
 
DemoCoder said:
Laa-Yosh said:
I'd also emphasize the Xenos's ability to read/write the CPU L2 cache. This should allow games to stream content like this:

Compressed stuff in main memory -> uncompressing on CPU into L2 cache -> process on Xenos into rendered image in EDRAM -> output to the framebuffer

And why can't CELL+RSX do this?

Compressed stuff in main memory -> DMA'ed to SPE SRAM + decompress in SPE -> process on RSX via FlexIO -> rendered image in GDDR3 -> RAMDAC/TMDS.

(Please correct me if I'm wrong, but) I don't think the RSX can read directly from the SPE RAM. I think the best the PS3 can do is:

Compressed stuff in main memory -> DMA'ed to SPE SRAM + decompress in SPE -> DMA to GDDR3 RAM -> read from GDDR3 RAM -> rendered image in GDDR3 -> RAMDAC/TMDS.

So 1 extra DMA step for the uncompressed data, which means almost twice the total bandwidth requirements.
 
RSX is hooked up to the FlexIO bus which is shared by everything on the system, hence it should be possible for SPEs to "write" to the RSX via DMA or polled I/O just like the CPU in your desktop can "write" to registers on the PCI bus. Supposedly, SPEs can read each other's SRAM and PPE can read/write it. It's all virtualized.
 
DemoCoder said:
RSX is hooked up to the FlexIO bus which is shared by everything on the system, hence it should be possible for SPEs to "write" to the RSX via DMA or polled I/O just like the CPU in your desktop can "write" to registers on the PCI bus. Supposedly, SPEs can read each other's SRAM and PPE can read/write it. It's all virtualized.

Honda & Ken Kutaragi interview

For example, RSX is not a variant of nVIDIA's PC chip. CELL and RSX have a close relationship, and both can access the main memory and the VRAM transparently. CELL can access the VRAM just like the main memory, and RSX can use the main memory as a frame buffer. They are just separated by their main usage and do not really have a distinction.

This architecture was designed to kill wasteful data copying and calculation between CELL and RSX. RSX can directly refer to a result simulated by CELL, and CELL can directly refer to the shape of something RSX has added shading to (note: CELL and RSX have independent bidirectional bandwidths, so there is no contention). No matter how beautiful the rendering or how complicated the shading, that's impossible with shared memory.
 
You know, this has me wondering something... It was brought up on another site that the CPU might have direct access to the GDDR3 via the FlexIO interface, and the GPU direct access to the XDR, also via FlexIO... Is this possible? I'm wondering if perhaps the GPU can go directly to the XDR without going through the CPU, and if that is the case, that's a whole lot more bandwidth than it would have had by going through the CPU (since the FlexIO bus is around 75 GB/s).
 
It would make little sense to have the FlexIO ring-like bus and DMA controller, but require the CPU to copy data around like a Pentium on AGP.
 
Shifty Geezer said:
PC-Engine said:
If we can count the 48GB/s bandwidth of the GS in PS2, then why can't we count this 256GB/s of bandwidth of the eDRAM?
If we can count this 'on chip' bandwidth as part of a 'system aggregate' bandwidth like MS did, does that not mean that the PS3's 'system aggregate' should also include the bandwidth between the 7 SPEs' logic and local storage + the PPE and its cache + the Level 2 cache on Cell?

Maybe they should come out with a new measure - electrons in motion/square mm/second? :rolleyes:

Seriously, why is this talked of as bandwidth (unless it's just marketing)? The true bandwidth between the GPU and the eDRAM module is 256 Gbit/s, right? 32 GB/s? So anything with a 256 GB/s number is phoney?


once again, as I understand it:

the bandwidth between the parent GPU die and the daughter die is 32 gigabytes per second, aka 256 gigabits per second, for read bandwidth only (48 gigabytes per second total for read+write).


the bandwidth between the eDRAM on the daughter die and the logic on the daughter die is 256 gigabytes per second, aka roughly 2 terabits per second.
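Since the bits-vs-bytes mixup keeps coming up in this thread, the conversion is just a factor of 8:

```python
def gbit_to_gbyte_per_s(gbit):
    """Convert gigabits per second to gigabytes per second."""
    return gbit / 8

print(gbit_to_gbyte_per_s(256))  # 32.0 -- the parent->daughter read figure in GB/s
print(256 * 8)                   # 2048 Gbit/s, i.e. ~2 Tbit/s, for the eDRAM-internal path
```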
 
@ MD,

That is how most are understanding it.


@ Shifty, while it may not be "system bandwidth", it is very legitimate to take this bandwidth into consideration. E.g. on the RSX the backbuffer WILL be using the system memory's bandwidth, while on the Xbox 360 the eDRAM isolates all the high-bandwidth backbuffer work (AA samples, Z, alpha, blending) into the eDRAM.

On the RSX, if you use 10GB/s for the backbuffer on the GDDR3 memory (let's ignore the XDR for the moment to make this easy), you have done two things: first, you have almost cut your GDDR3 bandwidth in half; second, you are running the risk of treating the GDDR3 pool as a framebuffer. If you use ~60MB of the GDDR3's space but most of its bandwidth for a framebuffer--well, that is one expensive framebuffer!

The eDRAM isolates all those high-bandwidth tasks away from the general memory pool, and since the bandwidth won't be wasted on the backbuffer, you have more access to the memory contents (instead of a 256MB pool of memory that is underutilized because you are saturating the bandwidth with your buffer).

Basically, the eDRAM gives real savings. The bandwidth the eDRAM absorbs on the 360 is real bandwidth that has to be spent on the PS3.

That being said, I agree that the entire 256GB/s should not have been added in arbitrarily. Instead, I would have liked to see a comparison of the typical needs of 1080i @ 60fps with HDR, 4x AA, and so forth: look at what the bandwidth needs would be and compare them side by side. That would have been fair, in my opinion.

I am not too worried about the PS3 though. The RSX has almost 38GB/s of bandwidth to use and access to all 512MB of memory. While there may need to be some tradeoffs at times (4x AA with HDR at 1080p seems like a system killer to me) overall I do not see the PS3 having issues. The eDRAM on the Xbox 360 was a way to keep the UMA free from the backbuffer bandwidth needs (and thus they were able to go with cheaper 128bit memory) and a neat way to give some nice effects, like 4x AA, almost free.

Different methods, different philosophies, similar results. But the bandwidth the eDRAM saves is VERY real. So both sides are wrong: Sony is wrong for just counting the bandwidth as if it is apples-to-apples; MS was wrong for adding the BW together.

Instead, a fair and honest way to really look at it, would have been to look at what the backbuffer savings are. Whether it be 1GB/s or 30GB/s does not really matter, but knowing what that savings is tells us a lot more about the system bandwidth in general.

(EDIT: Just an example of why I think that is more fair: if a "game" uses 15GB/s of backbuffer bandwidth, that leaves the RSX with 23GB/s of bandwidth. If the R500 can do that 15GB/s of backbuffer in the eDRAM, that leaves it 23GB/s also [really stupid numbers, because the CPU pools are different... but ignore that for this stupid example]. In the scenario I just gave, both systems are left with the same amount of remaining system bandwidth. The question of course is a game-by-game one, but games with features that require a large amount of framebuffer bandwidth will benefit from the eDRAM. So the bandwidth and savings are real... the question is how to do a fair apples-to-apples comparison. So far I have not seen one from either side.)
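To make the "compare typical needs side by side" idea concrete, here's a sketch of how I'd estimate per-mode backbuffer traffic. The overdraw factor and the 8 bytes per sample are made-up illustrative values, not measured ones:

```python
def backbuffer_gb_s(width, height, fps, aa_samples, overdraw=2.5,
                    bytes_per_sample=8):
    """Rough backbuffer write traffic for one rendering mode, in GB/s."""
    per_frame = width * height * aa_samples * bytes_per_sample * overdraw
    return per_frame * fps / 1e9

# 720p @ 60 fps with 4x AA and an assumed 2.5x overdraw:
print(round(backbuffer_gb_s(1280, 720, 60, 4), 1))  # ~4.4 GB/s of writes alone
```

Z traffic, blend reads, and clears would multiply that figure, which is how you get into the 10-15GB/s territory used in the example above. Run the same numbers for each mode on each machine and you'd have the apples-to-apples table I'm asking for.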
 
Acert this is for you:

"HDR, alpha-blending, and anti-aliasing require even more memory bandwidth. This is why Xbox 360 has 256 GB/s bandwidth reserved just for the frame buffer. This allows the Xbox 360 GPU to do Z testing, HDR, and alpha blended color rendering with 4X MSAA at full rate and still have the entire main bus bandwidth of 22.4 GB/s left over for textures and vertices."

from that IGN article by Douglass Perry...
 
blakjedi, thanks. I will just paste that next time! Quick and dirty, no explaining, no examples, but it hits it right on the head :p
 
PS3 is not going to have eDRAM.

The Hollywood GPU in Revolution may. Flipper (the GPU in the GCN made by ArtX, now owned by ATI) had eDRAM so it looks likely.

Just FYI, it's already confirmed that Hollywood will definitely use eDRAM (1T-SRAM-Q).
 
StarFox said:
PS3 is not going to have eDRAM.

The Hollywood GPU in Revolution may. Flipper (the GPU in the GCN made by ArtX, now owned by ATI) had eDRAM so it looks likely.

Just FYI, it's already confirmed that Hollywood will definitely use eDRAM (1T-SRAM-Q).

Cool, thanks! Hopefully they can build on the R500 eDRAM design.
 
Acert93 said:
@ MD,

Instead, a fair and honest way to really look at it, would have been to look at what the backbuffer savings are. Whether it be 1GB/s or 30GB/s does not really matter, but knowing what that savings is tells us a lot more about the system bandwidth in general.

What's wrong with adding the bandwidth, if you would have needed it anyway had it not been for the embedded memory?

The embedded memory just killed a bottleneck, and the competitor does not have it, so why not state its advantage when it's so huge? Yes, that advantage is bandwidth.

And overall, I recall Sony hyping numbers that make less sense and will show up less in final game quality than this feature of the R500 does.
 