From PS3 architecture to comparing PS3 RSX and PC G70

Yes, the Xenon CPU and GPU communicate over a 22 GB/s bus, and that BW is shared between CPU work, GPU work, and data passed between the two devices.
 
Rambus is to host this year's Rambus Developers Forum Japan on July 7/8, and one of its keynote speakers is Masakazu Suzuoki, SCE Microprocessor Division; the title is "Decision Process of CELL - Bandwidth dominates architecture". Apparently it's not too far-fetched that PS3 is architected with a similar intention.

It's 7/12/05, so does anyone have information about the Rambus Developers Forum in Japan?
 
mckmas8808 said:
Rambus is to host this year's Rambus Developers Forum Japan on July 7/8, and one of its keynote speakers is Masakazu Suzuoki, SCE Microprocessor Division; the title is "Decision Process of CELL - Bandwidth dominates architecture". Apparently it's not too far-fetched that PS3 is architected with a similar intention.

It's 7/12/05, so does anyone have information about the Rambus Developers Forum in Japan?

Well here's an article on XDR-2 from the conference:

link

But as far as Cell info goes, I haven't seen anything - in fact I didn't even know it was a subset of the conference. Maybe info will begin to leak from it in the next couple of days or something.
 
Here are the presentation slides for "Decision Process of Cell - Bandwidth Dominates Architecture" by Masakazu Suzuoki, Sony Computer Entertainment America, and other sessions.
http://www.rambus.co.jp/events/rdfjapan2005.aspx
http://www.rambus.co.jp/events/Main1_2_SCE_Suzuoki.pdf

Interesting tidbits:

Which will be dominant?

•Single fat core
•Single fat core with multi-threading
•Array of simple small cores (multi-core)

Which is better?

•It depends on application
–Data mining
–Recognition
–Synthesis
•Single fat core is still important for:
–So-called “General Purpose (GP)” processing
–Do GP apps require a high-performance processor?

•Sony thinks multi-core is a good solution for multi-media applications.
•It also fits our model of distributed computing

Approach in Cell development

•Write an actual program, then evaluate
1.Extract hot-spot code fragment
2.Write specific compiler
3.Compile the hot-spot fragment
4.Check result
5.Feedback to H/W design
6.Feedback to compiler design
•Think through the compiler, not the ISA directly.

Benchmark Profile

•1 SPU
•Frame Buffer
–Resolution: 1024x1024
–Color: RGB = 8:8:8 bit
•Data Software Cache
–Texture Tile: 16x16x4 RGBA (single float x 4)
–Frame Tile: 16x16x4 RGBA (single float x 4)
–Vertex: dynamic double buffer (size varies)
•Operation
–One overlay per object
•Newton Equation
•Collision
•Translation & Lighting
•Pseudo drop shadow (no self-shadowing)
•Bilinear texturing
•Rasterization
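
A rough sizing sketch of that data set (my own arithmetic, not from the slides): the full frame buffer is far too big for an SPU's 256 KB local store, while the 16x16 float-RGBA tiles are only 4 KB each, which is what makes the software-cache approach workable.

Code:
/* Back-of-envelope sizing for the benchmark data set above
 * (my arithmetic, not from the presentation). */
#include <stdio.h>

int main(void)
{
    /* Frame buffer: 1024x1024, RGB 8:8:8 -> 3 bytes per pixel */
    const unsigned long frame_bytes = 1024UL * 1024UL * 3UL;

    /* Texture/frame tile: 16x16 pixels, RGBA as 4 single floats */
    const unsigned long tile_bytes = 16UL * 16UL * 4UL * sizeof(float);

    printf("full frame buffer : %lu KB\n", frame_bytes / 1024);         /* 3072 KB */
    printf("one 16x16 tile    : %lu bytes\n", tile_bytes);              /* 4096    */
    printf("tiles per 128 KB  : %lu\n", (128UL * 1024UL) / tile_bytes); /* 32      */
    return 0;
}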

Local Access Profile

•Depends on data set / application
•Depends on program structure
–Software cache/double buffering strategy
–Program overlay structure
•Reasonable hot spot is 64-128 KB
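
To make the "software cache / double buffering strategy" bullet concrete, here is a minimal sketch of the usual SPU pattern, assuming the Cell SDK's MFC intrinsics from spu_mfcio.h; process_tile() and the exact tile layout are placeholders of mine, not anything taken from the slides.

Code:
/* Minimal double-buffering sketch for streaming 4 KB tiles through an
 * SPU's local store. process_tile() is a hypothetical placeholder. */
#include <spu_mfcio.h>

#define TILE_FLOATS (16 * 16 * 4)                 /* one RGBA float tile */
#define TILE_BYTES  (TILE_FLOATS * sizeof(float)) /* 4 KB                */

extern void process_tile(float *tile);            /* hypothetical work   */

static volatile float buf[2][TILE_FLOATS] __attribute__((aligned(128)));

void stream_tiles(unsigned long long ea, unsigned int n_tiles)
{
    unsigned int cur = 0, i;

    /* Kick off the first DMA into buffer 0, tagged with the buffer index. */
    mfc_get(buf[cur], ea, TILE_BYTES, cur, 0, 0);

    for (i = 0; i < n_tiles; i++) {
        unsigned int next = cur ^ 1;

        /* Prefetch tile i+1 into the other buffer while tile i is in flight. */
        if (i + 1 < n_tiles)
            mfc_get(buf[next], ea + (i + 1) * TILE_BYTES, TILE_BYTES, next, 0, 0);

        /* Block only on the DMA tag that belongs to the current buffer. */
        mfc_write_tag_mask(1 << cur);
        mfc_read_tag_status_all();

        process_tile((float *)buf[cur]);
        cur = next;
    }
}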

Memory Access Profile

•Highly depends on application and data set
•Tends to spread to a wide address space
–More than 4 MB, usually an 8-16 MB data set
–Most data is referred to one time only (not re-visited)
–Hard to keep resident in the on-chip memory

Memory Access Bandwidth

•Depends on application
–Sustained 1 GB/s (read, per 1 SPE)
–Peak 4 GB/s (read, per 1 SPE)
•Tends to increase after S/W optimization
–Code is optimized but access is same
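
A quick back-of-envelope on those per-SPE figures (my own arithmetic; the 25.6 GB/s XDR number and the 7 active SPEs in the PS3's Cell are widely reported figures, not from these slides): a handful of SPEs running near their peak rate can already saturate main memory, which is presumably the point of "bandwidth dominates architecture".

Code:
/* How the per-SPE bandwidth figures above add up against main memory.
 * The 25.6 GB/s XDR figure and the 7 active SPEs are assumptions of
 * mine, not taken from the presentation. */
#include <stdio.h>

int main(void)
{
    const double sustained_per_spe = 1.0;   /* GB/s, from the slide       */
    const double peak_per_spe      = 4.0;   /* GB/s, from the slide       */
    const int    active_spes       = 7;     /* PS3 Cell (assumed)         */
    const double xdr_bw            = 25.6;  /* GB/s main memory (assumed) */

    printf("sustained, all SPEs: %4.1f GB/s\n", sustained_per_spe * active_spes);
    printf("peak, all SPEs     : %4.1f GB/s (XDR provides %.1f GB/s)\n",
           peak_per_spe * active_spes, xdr_bw);
    return 0;
}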

Observation
•Local access
–32-128KB, 256KB enough
•Global access
–More than 8MB
–Up to 4 GB/s for most multimedia applications
•But never underestimate
•Pure HPC benchmark requires more
–Access behavior significantly differs:
•Instruction
•Streaming data
•Linked (indirectly referred) data
•Resident table
•Hard to utilize 2-4MB L2 cache layer
•Larger L1 (Local Store) + high-B/W memory
 
hmmm
[attached slide image: aj.JPG]
 
Hang on - what's that Cell to Associated Chip number? 35 GB/s write, 25 GB/s read? That's to RSX? If so, 60 GB/s total bandwidth?! How much reading are they intending to do?! Seems the integration between RSX and Cell may be even closer than expected, which ties in with KK and Kirk's comments.

Curiouser and curiouser...
 
Thank you One, I searched on Google a couple of days before posting, so the PDF doesn't seem to have been ready at that time.
If somebody can make the whole PDF clearer, they're welcome.
It seems there is the beginning of an answer about the GP performance behaviour of Cell and the SPUs?
25% of the die size is dedicated to bandwidth, huge!
They talk about latency, what's up with that?
It seems Cell is at its best in a heavily multithreaded environment (more than 8).
I'm far too ignorant to understand the PDF, please make things more accessible.
 
Shifty Geezer said:
Hang on - what's that Cell to Associated Chip number? 35 GB/s write, 25 GB/s read? That's to RSX? If so, 60 GB/s total bandwidth?! How much reading are they intending to do?! Seems the integration between RSX and Cell may be even closer than expected, which ties in with KK and Kirk's comments.

Curiouser and curiouser...

AFAIK the FlexIO link is divided into a (fixed number of) cache-coherent and non-coherent links. Sony used all the coherent links (15 and 20 GB/s if I remember right) to connect to RSX and only a few non-coherent ones (speed is 2 GB/s) for the Southbridge / IO

liolio said:
25% of the die size is dedicated to bandwidth, huge!
They said they would need 4 MB of L2 cache to hit the sweet spot, but went for more bandwidth instead.
If you add the L2 cache to that, it's around 1/3 (just eyeballing it). Compared to x86 CPUs, which nowadays are 50+% L2 cache, this looks less dramatic.
 
Npl said:
AFAIK the FlexIO link is divided into a (fixed number of) cache-coherent and non-coherent links. Sony used all the coherent links (15 and 20 GB/s if I remember right) to connect to RSX and only a few non-coherent ones (speed is 2 GB/s) for the Southbridge / IO.
So the quoted numbers above (35+25) are a combination of cache-coherent and non-coherent BWs? Any details on how much of each and how the FlexIO is divvied up? Also, FlexIO is 70 GB/s, is it not, so what's the other 10 GB/s doing?
 
Shifty Geezer said:
So the quoted numbers above (35+25) are a combination of cache-coherent and non-coherent BWs? Any details on how much of each and how the FlexIO is divvied up? Also, FlexIO is 70 GB/s, is it not, so what's the other 10 GB/s doing?

http://www.realworldtech.com/page.cfm?ArticleID=RWT021005084318&p=11

You're right, the quoted numbers were 75 GB/s. Probably a result of the different clock speeds, 4 GHz vs 3.2 GHz? (That would scale linearly to 60 GB/s.)
I read about the number of coherent/non-coherent lanes months ago, dunno where and dunno the details. I think it was an interview with a Rambus guy.
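
A minimal sketch of that linear-scaling sum (assuming the quoted I/O bandwidth scales 1:1 with the core clock, which is only my back-of-envelope assumption, not anything official):

Code:
/* Linear scaling of the 75 GB/s FlexIO figure from a 4 GHz part down to
 * the PS3 clock. The 1:1 scaling with core clock is an assumption. */
#include <stdio.h>

int main(void)
{
    const double bw_at_4ghz = 75.0;  /* GB/s quoted for the 4 GHz Cell     */
    const double bw_quoted  = 60.0;  /* GB/s in the Suzuoki slides (35+25) */
    const double ref_clock  = 4.0;   /* GHz */

    /* 75 * 3.2/4 = 60, or run it backwards to recover the clock: */
    printf("implied clock: %.1f GHz\n", ref_clock * bw_quoted / bw_at_4ghz);
    return 0;
}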
 
You're right, the quoted numbers were 75 GB/s. Probably a result of the different clock speeds, 4 GHz vs 3.2 GHz? (That would scale linearly to 60 GB/s.)

Knowing the numbers 75 GB/s and 60 GB/s, is there any possibility to calculate the clock speed for the PS3? :?:
 
I read about the number of coherent/non-coherent lanes months ago
But even if this is true - how much do we really NEED cache coherency with GPU reads/writes? I mean, sure, there might be some situations where it could potentially come in handy - but there are far more cases where I do not want cacheable GPU memory accesses at all.

Could this mean that the non-coherent part of FlexIO is disabled to improve yields? Because keeping it there sitting idle where it could be useful just doesn't make sense to me.
 
Code:
Benchmark Profile
•1 SPU
•Frame Buffer
–Resolution: 1024x1024
–Color: RGB = 8:8:8 bit
•Data Software Cache
–Texture Tile: 16x16x4 RGBA (single float x 4)
–Frame Tile: 16x16x4 RGBA (single float x 4)
–Vertex: dynamic double buffer (size varies)
•Operation
–One overlay per object
•Newton Equation
•Collision
•Translation & Lighting
•Pseudo drop shadow (no self-shadowing)
•Bilinear texturing
•Rasterization
Where are the benchmark results? :devilish:
 
Fafalada said:
But even if this is true - how much do we really NEED cache coherency with GPU reads/writes? I mean, sure, there might be some situations where it could potentially come in handy - but there are far more cases where I do not want cacheable GPU memory accesses at all.

Since Sony touts the interoperability between Cell and RSX, it only makes sense to make the link cache coherent. Surely there are lots of occasions where that ain't needed, but I can't imagine why it could hurt? (Other than possibly being slower than a non-coherent solution, but I mean from a software perspective.)

Fafalada said:
Could this mean that the non-coherent part of FlexIO is disabled to improve yields? Because keeping it there sitting idle where it could be useful just doesn't make sense to me.

The Southbridge is using a few non-coherent links. Giving it much more than 2 GB/s for IO and sound is pretty pointless, I think. The question is, could both non-coherent and coherent links be used for Cell<-->RSX? (My gut says no.)

Maybe Sony will surprise us and add a PPU to the remaining links :LOL:
 
nAo said:
Code:
Benchmark Profile
•1 SPU
•Frame Buffer
–Resolution: 1024x1024
–Color: RGB = 8:8:8 bit
•Data Software Cache
–Texture Tile: 16x16x4 RGBA (single float x 4)
–Frame Tile: 16x16x4 RGBA (single float x 4)
–Vertex: dynamic double buffer (size varies)
•Operation
–One overlay per object
•Newton Equation
•Collision
•Translation & Lighting
•Pseudo drop shadow (no self-shadowing)
•Bilinear texturing
•Rasterization
Where are the benchmark results? :devilish:

Haha, indeed, we need some data on that :D

edit - heh, and there was a movie too. Wonder if we saw it before, and if so what it might have been? Looks like it could be a simpler demo, perhaps, especially if it was just for one SPE. Was this conference open to press? Someone's gotta have pics or more details..

edit 2 - looking over the pdf, it seems he was laying out the case for their design decisions and their reasoning. Also, some of it is quite complex, or at least difficult (for me, and others I guess!) to interpret. Anyone care to interpret the highlights?
 
Npl said:
AFAIK the FlexIO link is divided into a (fixed number of) cache-coherent and non-coherent links. Sony used all the coherent links (15 and 20 GB/s if I remember right) to connect to RSX and only a few non-coherent ones (speed is 2 GB/s) for the Southbridge / IO

Why not use all the BW available?

BTW, the Southbridge BW is 5 GB/s (2.5/2.5)
 
Out of the 60 GB/s available, 35 is for RSX, and say 5 for IO. That leaves 20 GB/s unaccounted for. This could be reserved for inter-Cell functionality, or maybe they just couldn't afford the bandwidth elsewhere to use it all and it's leftover excess.
 
Also, some of it is quite complex, or at least difficult (for me, and others I guess!) to interpret. Anyone care to interpret the highlights?

Too hard for you, Titanio? Well, we are doomed. You are usually the guy to interpret this kind of stuff. I really want to know what that stuff meant. :cry:
 
mckmas8808 said:
Also, some of it is quite complex, or at least difficult (for me, and others I guess!) to interpret. Anyone care to interpret the highlights?

Too hard for you, Titanio? Well, we are doomed. You are usually the guy to interpret this kind of stuff. I really want to know what that stuff meant. :cry:

Lol, I'm flattered you think so, but I'm as much a learner as most here I think. Possibly doesn't help that my adobe reader is corrupting about half the pages ;)

Most of it is actually pretty understandable, I guess; a lot of it has been discussed before, and some of the finer detail and technical data on latencies and memory access etc. are new. The most interesting part to me was their apparent development methodology of actually writing programs, examining their behaviour and characteristics, and designing the chip around this - so it seems it was quite "real world" driven. Obviously the demo is of particular interest... nice to see one SPE juggle so many tasks at once - more detail would be nice. It struck me as refreshing to see the whole thing, so to speak, running on just one SPE (apparently), given the perhaps more common approach of thinking of them in terms of specific tasks at a time, or setting SPEs aside specifically for something. Well, for me at least it was a nice refresher.

I guess I'm interested to hear what others took away from it :)
 