Observations, thoughts and questions about X360 and PS3

Titanio said:
I think it only shows the XDR bandwidth on one, bi-directional arrow, because that figure is split evenly going up and down. If it was 25.6GB/s both ways, we'd be talking about 51.2GB/s of bandwidth to XDR ;)

And that is an interesting quote from MPR, I had missed that too. So basically if one thread is blocked, it's like a 3.2GHz PPE, but if neither thread is blocked, it's like two 1.6GHz PPEs? Interesting..

In reality, the split would be arbitrary I guess, depending on the blocking behaviour of the threads. One might get 2.5 billion cycles, the other 0.7 billion, etc.

That might also explain the Crytek guy's comments if Xenon is different in this respect?

I think I misunderstood what you're saying, or rather what I was thinking: I thought of the main RAM as a regular FSB, as in a PC. The RSX is going to be really bandwidth starved. I mean, it has as much bandwidth as my 6800 (256-bit, 350MHz DDR), which is starved. Add what (?) 4-5 times the fillrate and it's not looking good.
 
overclocked said:
I think I misunderstood what you're saying, or rather what I was thinking: I thought of the main RAM as a regular FSB, as in a PC. The RSX is going to be really bandwidth starved. I mean, it has as much bandwidth as my 6800 (256-bit, 350MHz DDR), which is starved. Add what (?) 4-5 times the fillrate and it's not looking good.

I'm not really sure what you mean, I wasn't really commenting on RSX's bandwidth situation in the post you quoted..?

Also, vs a 6800 I thought RSX's fillrate might be more like 2x? (6.4Gigapixels to 13.2 - extrapolated from G70 figures? Although that'd depend on what 6800 you were talking about). But we don't really know.

I'm not sure if you're going to be consuming your fillrate though, it wouldn't even be possible to use it fully without compression at least (?) I think it may be a case of doing more per pixel than firing out more pixels too.

You can probably consider RSX's bandwidth situation to be 48GB/s - CPU consumption - framebuffer consumption. The latter has been a point of debate, I don't think we ever really figured out what it would look like, but of course, it's gonna vary from game to game anyway. You can try thinking about it yourself - a 720p FP16 frame is 7.03MB, so I guess your variables after that would be how many times you're reading it in and writing it out per sec (your overdraw, your framerate, how many times on average you'll be reading pixels back). You might want to consider the z-buffer too, and you may want to consider color and z compression (which NVidia claims can be 4:1, or maybe better now, I don't know, but they don't go into a lot of detail about it). Take that out, take out CPU usage, and that's how much you'll have left for texture/vertex reads etc. But this is still all speculative until we have final detail on RSX.
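To make that back-of-the-envelope budget concrete, here is a minimal Python sketch. Only the 48GB/s total and the 720p FP16 frame come from the post above; the overdraw, framerate, reads-per-write, CPU share and compression values are placeholder assumptions chosen for illustration, not known RSX figures. Note that one 720p FP16 buffer is 7,372,800 bytes, which is 7.37MB in decimal units or 7.03MB if 1MB is taken as 2^20 bytes.

```python
# Back-of-the-envelope framebuffer bandwidth budget.
# Workload numbers (overdraw, fps, CPU share, compression) are ASSUMPTIONS.

WIDTH, HEIGHT = 1280, 720      # 720p
BYTES_PER_PIXEL_FP16 = 8       # 4 channels x 16 bits
BYTES_PER_Z = 4                # 32-bit depth buffer (assumed format)


def frame_bytes(bytes_per_pixel):
    """Size in bytes of one full-screen buffer."""
    return WIDTH * HEIGHT * bytes_per_pixel


def framebuffer_traffic(fps, overdraw, reads_per_write=1.0, compression=1.0):
    """Estimated colour + Z traffic in bytes per second.

    Each pixel is written `overdraw` times per frame, each write is
    accompanied by `reads_per_write` reads, and `compression` divides
    the total (4.0 would model a 4:1 ratio).
    """
    per_pass = frame_bytes(BYTES_PER_PIXEL_FP16) + frame_bytes(BYTES_PER_Z)
    touches = overdraw * (1.0 + reads_per_write)
    return per_pass * touches * fps / compression


def remaining_for_textures(total, cpu, framebuffer):
    """Bandwidth left for texture/vertex reads, in bytes per second."""
    return total - cpu - framebuffer


colour = frame_bytes(BYTES_PER_PIXEL_FP16)
print(f"720p FP16 frame: {colour / 1e6:.2f} MB ({colour / 2**20:.2f} MiB)")

fb = framebuffer_traffic(fps=60, overdraw=3)            # assumed workload
left = remaining_for_textures(48e9, cpu=6e9, framebuffer=fb)
print(f"framebuffer: {fb / 1e9:.2f} GB/s, left over: {left / 1e9:.2f} GB/s")
```

Changing `compression` or `overdraw` shows how quickly the leftover figure moves, which is why the answer will vary from game to game.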
 
Titanio said:
I hope he's right when it comes to the final system!
me too ;)
edit - about cell<->rsx, doesn't rsx write to cell at 15GB/s and read from it at 20GB/s? That's enough to allow it to consume XDR's bandwidth entirely if it wanted (XDR is 12.8GB/s read, 12.8GB/s write),
I'm not 100% sure about that but I don't think XDR bandwidth is split, you can have full bw when you write to mem and when you read from mem, obviously not at the same time.
In reality it will be "up to" that figure depending on CPU usage and how much cpu<->gpu bandwidth you require for things other than direct memory transactions....?
:???::oops::devilish:
 
nAo said:
I'm not 100% sure about that but I don't think XDR bandwidth is split, you can have full bw when you write to mem and when you read from mem, obviously not at the same time.


Ahh..interesting. I didn't know that at all, I just assumed it was an even split. So basically you can read and write any amount as long as combined you don't exceed 25.6GB/s? That's more flexible than I thought.
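The difference between the two readings of the XDR figure can be sketched in a few lines of Python. This is only a toy model of the two interpretations discussed here (a fixed 12.8/12.8 split versus one 25.6GB/s pool usable in either direction); the function names are made up for illustration and it says nothing about how the memory controller actually arbitrates.

```python
XDR_TOTAL = 25.6  # GB/s, peak figure quoted in the thread


def fits_shared(read, write, total=XDR_TOTAL):
    """Shared-pool model: any read/write mix is fine up to the total."""
    return read >= 0 and write >= 0 and read + write <= total


def fits_split(read, write, total=XDR_TOTAL):
    """Fixed-split model: each direction is capped at half the total."""
    half = total / 2
    return 0 <= read <= half and 0 <= write <= half


# A read-heavy mix is only possible under the shared-pool reading.
print(fits_shared(20.0, 5.0))   # True
print(fits_split(20.0, 5.0))    # False
```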

nAo said:
??? oops twisted

:p I don't know what this means! But I guess I won't be finding out today..:)
 
Titanio said:
I'm not really sure what you mean, I wasn't really commenting on RSX's bandwidth situation in the post you quoted..?

Also, vs a 6800 I thought RSX's fillrate might be more like 2x? (6.4Gigapixels to 13.2 - extrapolated from G70 figures? Although that'd depend on what 6800 you were talking about). But we don't really know.

I'm not sure if you're going to be consuming your fillrate though, it wouldn't even be possible to use it fully without compression at least (?) I think it may be a case of doing more per pixel than firing out more pixels too.

You can probably consider RSX's bandwidth situation to be 48GB/s - CPU consumption - framebuffer consumption. The latter has been a point of debate, I don't think we ever really figured out what it would look like, but of course, it's gonna vary from game to game anyway. You can try thinking about it yourself - a 720p FP16 frame is 7.37MB, so I guess your variables after that would be how many times you're reading it in and writing it out per sec (your overdraw, your framerate, how many times on average you'll be reading pixels back). You might want to consider the z-buffer too, and you may want to consider color and z compression (which NVidia claims can be 4:1, or maybe better now, I don't know, but they don't go into a lot of detail about it). Take that out, take out CPU usage, and that's how much you'll have left for texture/vertex reads etc. But this is still all speculative until we have final detail on RSX.

You're right, but I was focusing on the bandwidth issue; I should have made that clearer.
As for the cards, I just made a simple comparison against a vanilla 6800 (325MHz core), again for the bandwidth comparison. A stock 6800NU has 3900 Mtexels/Mpixels per second, and assuming RSX has 13200 plus the more advanced ALU config in G70, you should have maybe 5x the shader fillrate at least, with still the same bandwidth as the former. All I'm saying is that, based on what we "know", this seems like a pretty big bottleneck.

I haven't read anything about XDR, so I'm not qualified to say, but if it is what I thought at first, you have 25.6GB/s that's available in any direction you want: read 20GB/s and write 5GB/s, or however you like it best. That would help with bandwidth restrictions, instead of having this "fixed" 12.8 up and 12.8 down.

I agree with the other things you're saying that will help bandwidth, but it would be good to know for certain how the XDR works, so go and find a quote/link now, Titanio :)

Edit
It was actually the split you mentioned about the XDR that made my whole understanding of PS3 go down...Hehe
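The bandwidth-versus-fillrate comparison in this post can be worked through with a short Python sketch. The 256-bit/350MHz DDR bus and the 3900 and 13200 fillrate figures are taken from the posts above (read here as Mpixels/s, which is an assumption about what was meant); the RSX number in particular is the thread's extrapolation, not a confirmed spec.

```python
def ddr_bandwidth_gb_s(bus_bits, clock_mhz):
    """Peak bandwidth of a DDR bus in GB/s (two transfers per clock)."""
    return bus_bits / 8 * clock_mhz * 2 / 1000


def bytes_per_pixel(bandwidth_gb_s, fillrate_mpix_s):
    """Memory bandwidth available per pixel of peak fillrate."""
    return bandwidth_gb_s * 1000 / fillrate_mpix_s


bw = ddr_bandwidth_gb_s(256, 350)           # the 6800 quoted in the thread
print(f"6800 bandwidth: {bw:.1f} GB/s")

for name, fill in (("6800NU", 3900), ("RSX (speculated)", 13200)):
    print(f"{name}: {bytes_per_pixel(bw, fill):.2f} bytes per peak pixel")
```

With the same bandwidth on both sides, the bytes available per peak pixel shrink in proportion to the fillrate increase, which is the bottleneck being described.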
 
overclocked said:
As for the cards, I just made a simple comparison against a vanilla 6800 (325MHz core), again for the bandwidth comparison. A stock 6800NU has 3900 Mtexels/Mpixels per second, and assuming RSX has 13200 plus the more advanced ALU config in G70, you should have maybe 5x the shader fillrate at least, with still the same bandwidth as the former. All I'm saying is that, based on what we "know", this seems like a pretty big bottleneck.

It has the same amount of BW on the VRAM side alone, but if it's pulling from and/or pushing to XDR, that increases available BW. I doubt your fillrate would be going that high though to be honest.

overclocked said:
I haven't read anything about XDR, so I'm not qualified to say, but if it is what I thought at first, you have 25.6GB/s that's available in any direction you want: read 20GB/s and write 5GB/s, or however you like it best. That would help with bandwidth restrictions, instead of having this "fixed" 12.8 up and 12.8 down.

Yeah, my bad, it's 25.6GB/s in an arbitrary split seemingly. Just to be clear, I assume GDDR3 is the same?
 
Titanio said:
It has the same amount of BW on the VRAM side alone, but if it's pulling from and/or pushing to XDR, that increases available BW.

Yes, I know that, so we're actually on the same side here about using a portion of the XDR as video memory.
 
But really you don't know what version of Cell is being used, what RSX is and how Cell <-> RSX works

Well, is that good or bad? ;) Most of us think it's just a higher clocked G70 (although transistor count is higher...)
 
overclocked said:
Yes, I know that, so we're actually on the same side here about using a portion of the XDR as video memory.

Yeah, absolutely. I was actually just thinking of one theoretical setup, placing the framebuffer entirely in XDR. My understanding may fall down here, but I figured if you threw 6GB/s at the CPU, you could then access (read or write) an entire 720p 64-bit frame + 32-bit z-buffer ~2000 times a sec (or about 80 times per frame at 30fps, 40 times per frame at 60fps), and leave the GDDR3 bandwidth completely untouched for texture/vertex access. Aggregate chip-to-chip bandwidth would drop significantly, to about 15GB/s, but under the same model with X360, it'd be no worse off from that perspective (6GB/s for the CPU would bring X360's chip-to-chip total bw down around 15GB/s too).

It really comes down to two things as far as I can tell though - how Nvidia's color/z compression works (which I'm not attempting to factor in yet, and which could make a decent difference), and how much you can do to reduce the depth complexity of your scene. It could be very worthwhile spending some cycles on Cell on good occlusion culling.
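The framebuffer-in-XDR scenario can be sketched in Python as follows. It assumes decimal gigabytes, that all XDR bandwidth not given to the CPU is available to the framebuffer, and no compression; the function name and defaults are illustrative only. Under those assumptions the result comes out somewhat below the rough figures quoted above, so treat both as order-of-magnitude estimates that depend on the unit convention and on which link is taken as the limit.

```python
XDR_GB_S = 25.6  # peak XDR bandwidth quoted in the thread


def full_frame_accesses_per_second(cpu_gb_s, colour_bytes=8, z_bytes=4,
                                   width=1280, height=720, xdr_gb_s=XDR_GB_S):
    """How many times per second a whole colour + Z frame could be touched.

    Assumes everything not used by the CPU goes to the framebuffer
    and that 1 GB = 1e9 bytes.
    """
    bytes_per_access = width * height * (colour_bytes + z_bytes)
    return (xdr_gb_s - cpu_gb_s) * 1e9 / bytes_per_access


per_second = full_frame_accesses_per_second(cpu_gb_s=6.0)
for fps in (30, 60):
    print(f"{fps} fps: about {per_second / fps:.0f} full-frame accesses per frame")
```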
 
I wish I had seen this MPR before I said anything...and had that diagram in front of me. Can someone post a link to the MPR? I'd certainly like to read it. It would be to everyone's benefit, as it's obvious what happens when you're flying blind LOL! anyway...

As far as RSX's bandwidth goes...is this how RSX looks to have access to the XDR pool now:

RSX requests->FLEXIO->EIB->XDRAM controller->XDR Ram->XDRAM controller->EIB->FLEXIO->RSX

2 things would be true if this is right:

1. Even though the XDRAM controller can read/write 25.6GB/s (not at the same time?), having to move data back through FLEXIO effectively sets the max read at 20GB/s and the max write at 15GB/s for RSX to and from the XDR.

2. Unless there's a separate link explicitly for the purpose (unlikely, right?), data must flow over Cell's EIB in these situations, which could have a bad effect if there is so much clutter on the EIB that the cores cannot communicate as freely as one would desire. I can't remember how fast the EIB is...so it may be fast enough that this is not likely.

The idea is that RSX requests can piggyback on Cell requests to XDR, and when those requests return they eat into the bandwidth Cell has to send RSX the data it was working on (the 20GB/s out to RSX from FLEXIO). Requests to XDR wouldn't really eat into RSX's bandwidth to write to Cell, because requests are trivially small, so RSX could still write to Cell with 99.99...99% of its bandwidth.

Does this make sense, guys? If so, would RSX accessing the XDR RAM be a problem for Cell's EIB?
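The first point above amounts to saying that a path is only as fast as its slowest hop. A minimal Python sketch of that idea follows; the FlexIO and XDR figures are the ones quoted in the thread, while the EIB value is purely a placeholder assumption so the example runs (the post itself does not give one).

```python
# Hypothetical model: sustained rate along a path is capped by its slowest hop.
FLEXIO_CELL_TO_RSX = 20.0   # GB/s, quoted in the thread
FLEXIO_RSX_TO_CELL = 15.0   # GB/s, quoted in the thread
XDR_PEAK = 25.6             # GB/s, quoted in the thread
EIB_ASSUMED = 200.0         # GB/s, PLACEHOLDER assumption, not a known figure


def path_limit(*hops_gb_s):
    """Upper bound on throughput across a chain of links."""
    return min(hops_gb_s)


# RSX reading from XDR: XDR -> EIB -> FlexIO -> RSX
rsx_read_xdr = path_limit(XDR_PEAK, EIB_ASSUMED, FLEXIO_CELL_TO_RSX)
# RSX writing to XDR: RSX -> FlexIO -> EIB -> XDR
rsx_write_xdr = path_limit(FLEXIO_RSX_TO_CELL, EIB_ASSUMED, XDR_PEAK)

print(f"RSX read from XDR: up to {rsx_read_xdr} GB/s")
print(f"RSX write to XDR:  up to {rsx_write_xdr} GB/s")
```

The EIB only becomes the limiting hop in this model if its usable bandwidth falls below the FlexIO figures, which is the concern raised in the second point.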

As far as how threads are handled...

I think like Shifty does, and thread execution being interleaved finally makes DeanoC's comments that the PPE is like two 1.6GHz processors make perfect sense. It also makes clear the comments to the effect that Xenon is more efficient than Cell while Cell is more powerful. If I understand things correctly, when all threads are being put to use the situation is like this:

Xenon: like six processors running at 1.6GHz, due to quick switching between HW contexts. Because of those HW contexts, Xenon can also efficiently maintain this situation when a thread blocks for whatever reason.

Cell: The PPE is like a core on Xenon: it's like two 1.6GHz processors when two threads are on the PPE. The SPUs should be looked at as seven 3.2GHz processors. However, since an SPU has no support for a second HW context, a blocked thread is much more costly there than on the PPE or a core on Xenon. I suppose this is where comments to the effect of there being no way to hide latency etc. come from.

Ok...I'm still not in la la land am I?
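The "one 3.2GHz core or two 1.6GHz cores" picture can be expressed as a toy Python model. Everything here is an assumption made for illustration: two hardware threads share one issue slot per cycle, they alternate when both are ready, and each is blocked independently for some fraction of the time. It is not a description of how the PPE or a Xenon core actually schedules instructions.

```python
CLOCK_GHZ = 3.2  # core clock quoted in the thread


def effective_ghz(blocked_a, blocked_b, clock=CLOCK_GHZ):
    """Toy estimate of cycles per second each hardware thread receives.

    blocked_a / blocked_b are the fractions of time (0..1) each thread is
    stalled. A ready thread gets the whole slot if the other is stalled,
    and half of it if both are ready.
    """
    ready_a, ready_b = 1.0 - blocked_a, 1.0 - blocked_b
    share_a = ready_a * blocked_b + 0.5 * ready_a * ready_b
    share_b = ready_b * blocked_a + 0.5 * ready_a * ready_b
    return clock * share_a, clock * share_b


print(effective_ghz(0.0, 1.0))   # one thread fully blocked
print(effective_ghz(0.0, 0.0))   # neither thread blocked
print(effective_ghz(0.2, 0.6))   # uneven, "arbitrary" split
```

In this model a core with a single hardware context would simply lose the cycles spent blocked, which is one way to read the comments about latency hiding.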
 
Titanio said:
Yeah, absolutely. I was actually just thinking of one theoretical setup, placing the framebuffer entirely in XDR. My understanding may fall down here, but I figured if you threw 6GB/s at the CPU, you could then access (read or write) an entire 720p 64-bit frame + 32-bit z-buffer ~2000 times a sec (or about 80 times per frame at 30fps, 40 times per frame at 60fps), and leave the GDDR3 bandwidth completely untouched for texture/vertex access. Aggregate chip-to-chip bandwidth would drop significantly, to about 15GB/s, but under the same model with X360, it'd be no worse off from that perspective.

It really comes down to two things as far as I can tell though - how Nvidia's color/z compression works (which I'm not attempting to factor in yet, and which could make a decent difference), and how much you can do to reduce the depth complexity of your scene. It could be very worthwhile spending some cycles on Cell on good occlusion culling.

Stuff like this is what I was talking about in my original post...though I lack the insight to go as far as you have with this. This I find very interesting. Things like this would be difficult on the X360 because Xenos's framebuffer is in the eDRAM, unless the intelligent memory is already set up to handle this kind of stuff.
 
scificube said:
Things like this would be difficult on the X360 because Xenos's framebuffer is in the eDRAM, unless the intelligent memory is already set up to handle this kind of stuff.

But you wouldn't want the framebuffer to be anywhere else on X360, that's what the eDRAM is for, and it has plenty of bandwidth for it. PS3 has more "general" bandwidth, so to speak, but you have to accommodate the framebuffer with it too. My point is that depending on your CPU and framebuffer needs, you may still end up having more BW for texture/vertex reads in PS3 anyway.

It's hard to compare, though. CPU consumption is probably going to be different on both systems, and that presents some interesting issues too (for example, I think Cell could burn through data a lot quicker than Xenon because it has much more execution logic to feed, but on the other hand, Xenon's cache may be pretty "busy" swapping things in and out of memory).

scificube said:
2. Unless there's a separate link explicitly for the purpose (unlikely, right?), data must flow over Cell's EIB in these situations, which could have a bad effect if there is so much clutter on the EIB that the cores cannot communicate as freely as one would desire. I can't remember how fast the EIB is...so it may be fast enough that this is not likely.

I don't have exact figures, but IIRC it's roughly in the 200GB/s to 300GB/s range. There's plenty of BW there, I don't think there'd be much of a problem shuttling data from FlexIO to the XDR interface.
 
Titanio:

What I meant to say is that it would be difficult for Xenon to act in the same manner on the frame buffer in the eDRAM as Cell could in main memory.

I certainly do think the frame buffer is in an excellent place in Xenos's e-Dram. I realize the daughter die will blast through tasks traditionally attributed to working with the frame buffer.

It was a flexibility thing not a knock or anything.

edit:

If bandwidth around the EIB is indeed in the 200-300GB/s range, I am of course inclined to agree with you that shuttling data from XDR through it to RSX wouldn't seem a problem.
 
scificube said:
Titanio:

What I meant to say is that it would be difficult for Xenon to act in the same manner on the frame buffer in the eDRAM as Cell could in main memory.

I certainly do think the frame buffer is in an excellent place in Xenos's e-Dram. I realize the daughter die will blast through tasks traditionally attributed to working with the frame buffer.

It was a flexibility thing not a knock or anything.

Gotcha, I wasn't really thinking in those terms, but putting the framebuffer in XDR would indeed leave it very close to Cell if you wanted to do something with it there..

scificube said:
If bandwidth around the EIB is indeed in the 200-300GB/s range, I am of course inclined to agree with you that shuttling data from XDR through it to RSX wouldn't seem a problem.

Trying to figure out something more exact - It's 96 bytes per cycle, so if that's every cycle, it'd be about 286GB/s I think. But I thought I might have read somewhere that the EIB was clocked at half the rate of the chip, so in that case it'd be ~143GB/s. Either way, I think it should be enough to keep everything fed.
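The EIB arithmetic is easy to check in Python. The 96 bytes per cycle and the possible half-rate clock are the figures from the post; the 3.2GHz chip clock is the one used elsewhere in the thread. The sketch also shows that the 286 and 143 figures correspond to counting a gigabyte as 2^30 bytes; in decimal units the same rates are 307.2GB/s and 153.6GB/s.

```python
def eib_bandwidth(bytes_per_cycle=96, clock_hz=3.2e9, clock_divider=1):
    """Peak EIB bandwidth in bytes per second.

    clock_divider=2 models the bus running at half the chip clock,
    which the post mentions as a possibility rather than a known fact.
    """
    return bytes_per_cycle * clock_hz / clock_divider


for divider in (1, 2):
    bw = eib_bandwidth(clock_divider=divider)
    print(f"divider {divider}: {bw / 1e9:.1f} GB/s decimal, "
          f"{bw / 2**30:.1f} GB/s binary")
```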
 
Titanio said:
Trying to figure out something more exact - It's 96 bytes per cycle, so if that's every cycle, it'd be about 286GB/s I think. But I thought I might have read somewhere that the EIB was clocked at half the rate of the chip, so in that case it'd be ~143GB/s. Either way, I think it should be enough to keep everything fed.

Understood and agreed.
 
scificube said:
Understood and agreed.

Not saying this is or isn't possible but.....

You really have to look beyond bandwidth for these things. It's a requirement obviously, but for it to be practical to put the framebuffer in XDRam the destination blender in RSX would have to be able to absorb the additional latency. And that's simply not a known quantity.

As Deano said earlier, the speculation in these threads is interesting, but you're speculating with far from complete knowledge, and there is a tendency to get fixated on one or two technical numbers, coupled with quotes taken out of context.
 
I understood and value what DeanoC said. I'm only looking at possibilities. I don't think anyone is making predictions...just working out what they can understand and sharing ideas.

I expect to be wrong. Maybe not so wrong as I was yesterday but I expect to be wrong and talking about it with B3D members is the best way I can think of checking myself.

I certainly don't know where else to go to talk about this stuff where green guys like me can at least be given the chance to speak.

With that said I want to toss another idea out there...

Maybe RSX shouldn't put its frame buffer in the XDR; instead, a sort of shadow copy of it, containing only the data Cell could use, would live there. The real frame buffer would reside in the GDDR3 memory pool. This way Cell and RSX could work concurrently on unrelated tasks associated with the frame buffer, and Cell could deliver its results to RSX in a sort of "just in time" manner for the final blend/tasks. Cell could be doing something else while RSX is doing z-tests or something else. This may be a way to combat the latency.
 
ERP said:
And that's simply not a known quantity..
Unfortunately there are too many unknown quantities/features at this time..
I don't think we have the full picture yet.
 