Yes, the Xenon CPU and GPU communicate over a 22 GB/s bus, and that BW is shared between CPU work, GPU work, and sharing data between devices.
Rambus is to host this year's Rambus Developers Forum Japan on July 7/8, and one of its keynote speakers is Masakazu Suzuoki, SCE Microprocessor Division. The title is "Decision Process of CELL - Bandwidth dominates architecture". Apparently it's not too far-fetched that PS3 is architected with a similar intention.
mckmas8808 said: Rambus is to host this year's Rambus Developers Forum Japan on July 7/8, and one of its keynote speakers is Masakazu Suzuoki, SCE Microprocessor Division. The title is "Decision Process of CELL - Bandwidth dominates architecture". Apparently it's not too far-fetched that PS3 is architected with a similar intention.
It's 7/12/05, so does anyone have information about the Rambus Developers Forum in Japan?
Which will be dominant?
•Single fat core
•Single fat core with multi-threading
•Array of simple small cores (multi-core)
Which is better?
•It depends on application
–Data mining
–Recognition
–Synthesis
•Single fat core is still important for:
–So-called “General Purpose (GP)” processing
–Do GP apps require a high-performance processor?
•Sony thinks multi-core is a good solution for multimedia applications.
•It also fits our model of distributed computing
Approach in Cell development
•Write an actual program, then evaluate
1.Extract hot-spot code fragment
2.Write specific compiler
3.Compile the hot-spot fragment
4.Check result
5.Feedback to H/W design
6.Feedback to compiler design
•Think through the compiler, not the ISA directly.
Benchmark Profile
•1 SPU
•Frame Buffer
–Resolution: 1024x1024
–Color: RGB = 8:8:8 bit
•Data Software Cache
–Texture tile: 16x16x4 RGBA (single float x 4)
–Frame tile: 16x16x4 RGBA (single float x 4)
–Vertex: dynamic double buffer (size varies)
•Operation
–One overlay per object
•Newton Equation
•Collision
•Translation & Lighting
•Pseudo drop shadow (no self-shadowing)
•Bilinear texturing
•Rasterization
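For a sense of scale, the buffer sizes in the profile above can be worked out with a little arithmetic. This is only a sketch: it assumes "16x16x4" means 16x16 texels with four single-precision (4-byte) channels each, and that the frame buffer is packed 24-bit RGB. The slides do not spell out either point, and the function names are made up for illustration.

```python
BYTES_PER_FLOAT = 4  # single-precision float (assumed from "single float x 4")


def tile_bytes(width=16, height=16, channels=4, bytes_per_channel=BYTES_PER_FLOAT):
    """Size of one tile, assuming width x height texels with `channels` floats each."""
    return width * height * channels * bytes_per_channel


def framebuffer_bytes(width=1024, height=1024, bits_per_pixel=24):
    """Size of a packed RGB 8:8:8 frame buffer (packing is an assumption)."""
    return width * height * bits_per_pixel // 8


def tiles_per_frame(width=1024, height=1024, tile=16):
    """Number of tile x tile blocks needed to cover the frame."""
    return (width // tile) * (height // tile)


if __name__ == "__main__":
    print(tile_bytes())         # 4096 bytes per tile
    print(framebuffer_bytes())  # 3145728 bytes for the whole frame
    print(tiles_per_frame())    # 4096 tiles to cover 1024x1024
```

Under those assumptions a tile is 4 KB, so a handful of texture and frame tiles fit comfortably in a local store of a few hundred KB, while the full frame (3 MB) does not.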
Local Access Profile
•Depends on data set / application
•Depends on program structure
–Software cache/double buffering strategy
–Program overlay structure
•Reasonable hot spot is 64-128KB
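The "double buffering strategy" mentioned above can be illustrated roughly as follows. This is a generic sketch in Python, not SPE code: `fetch` is a hypothetical stand-in for a DMA transfer from main memory into local store, and a one-worker thread pool plays the role of the DMA engine so the next chunk is loaded while the current one is processed.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch(data, start, size):
    """Hypothetical stand-in for a DMA transfer into a local buffer."""
    return data[start:start + size]


def process_double_buffered(data, chunk_size, work):
    """Apply `work` to each chunk while the next chunk is fetched in the background."""
    results = []
    with ThreadPoolExecutor(max_workers=1) as dma:
        # Start filling the first buffer before any work begins.
        pending = dma.submit(fetch, data, 0, chunk_size)
        offset = 0
        while offset < len(data):
            current = pending.result()  # wait until this buffer is full
            offset += chunk_size
            if offset < len(data):
                # Start filling the other buffer while we work on `current`.
                pending = dma.submit(fetch, data, offset, chunk_size)
            results.append(work(current))
    return results


if __name__ == "__main__":
    print(process_double_buffered(list(range(10)), 4, sum))
```

The point of the pattern is that transfer time overlaps with compute time, which is why the size of the "hot spot" that must stay resident matters more than the total data set size.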
Memory Access Profile
•Highly depends on application and data set
•Tends to spread over a wide address space
–More than 4MB, usually 8-16MB data set
–Most data is referenced one time only (not revisited)
–Hard to keep resident in on-chip memory
Memory Access Bandwidth
•Depends on application
–Sustained 1 GB/s (read: per SPE)
–Peak 4 GB/s (read: per SPE)
•Tends to increase after S/W optimization
–Code is optimized (runs faster) but the memory accesses stay the same
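The point about bandwidth rising after software optimization is simple arithmetic: if the same bytes are touched in less time, the required GB/s goes up. A rough sketch, with made-up workload numbers and hypothetical function names; the default of 8 SPEs is an assumption about the chip, not something stated in these slides.

```python
def bandwidth_gbs(bytes_accessed, seconds):
    """Required bandwidth in GB/s for moving `bytes_accessed` in `seconds`."""
    return bytes_accessed / seconds / 1e9


def aggregate_gbs(per_spe_gbs, n_spes=8):
    """Total demand if every SPE pulls `per_spe_gbs` (n_spes=8 is an assumption)."""
    return per_spe_gbs * n_spes


if __name__ == "__main__":
    data = 1e9  # illustrative: 1 GB touched by the kernel
    before = bandwidth_gbs(data, 1.0)   # unoptimized: takes 1 s
    after = bandwidth_gbs(data, 0.25)   # hypothetical 4x faster code, same accesses
    print(before, after)
    print(aggregate_gbs(1.0), aggregate_gbs(4.0))  # sustained vs. peak per-SPE figures
```

With the per-SPE figures from the slide, the aggregate demand would range from 8 GB/s (sustained) to 32 GB/s (peak) under the 8-SPE assumption.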
Observation
•Local access
–32-128KB; 256KB is enough
•Global access
–More than 8MB
–Up to 4 GB/s for most multimedia applications
•But never underestimate
•Pure HPC benchmark requires more
–Access behavior significantly differs:
•Instruction
•Streaming data
•Linked (indirectly referenced) data
•Resident table
•Hard to utilize 2-4MB L2 cache layer
•Larger L1 (Local Store) + high-B/W memory
Shifty Geezer said: Hang on - what's that Cell to Associated Chip number? 35 GB/s write, 25 GB/s read? That's to RSX? If so, 60 GB/s total bandwidth?! How much reading are they intending to do?! Seems the integration between RSX and Cell may be even closer than expected, which ties in with KK and Kirk's comments.
Curiouser and curiouser...
They said they would need a 4MB L2 cache to hit the sweet spot, but went for more bandwidth instead.
liolio said: 25% of the die size is dedicated to bandwidth, huge!
So the quoted numbers above (35+25) are a combination of cache-coherent and non-coherent BWs? Any details on how much of each and how the FlexIO is divvied up? Also, FlexIO is 70 GB/s, is it not, so what's the other 10 GB/s doing?
Npl said: AFAIK the FlexIO link is divided into a fixed number of cache-coherent and non-coherent links. Sony used all the coherent links (15 and 20 GB/s, if I remember right) to connect to RSX and only a few non-coherent ones (speed is 2 GB/s) to the Southbridge / IO.
Shifty Geezer said: So the quoted numbers above (35+25) are a combination of cache-coherent and non-coherent BWs? Any details on how much of each and how the FlexIO is divvied up? Also, FlexIO is 70 GB/s, is it not, so what's the other 10 GB/s doing?
You're right, the quoted numbers were 75 GB/s. Probably a result of different clock speeds, 4 GHz vs. 3.2 GHz? (Scaled linearly, that would come to 60 GB/s.)
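The linear-scaling guess is easy to check. A tiny sketch, assuming link bandwidth scales in direct proportion to clock speed (which is the poster's assumption, not a confirmed spec); the function name is made up.

```python
def scale_bandwidth(bw_gbs, from_ghz, to_ghz):
    """Scale a bandwidth figure linearly with clock speed."""
    return bw_gbs * to_ghz / from_ghz


if __name__ == "__main__":
    scaled = scale_bandwidth(75.0, from_ghz=4.0, to_ghz=3.2)
    print(round(scaled, 1))  # compare with the 35 + 25 figure quoted earlier
    print(35 + 25)
```

Under that assumption, 75 GB/s at 4 GHz comes out at about 60 GB/s at 3.2 GHz, which lines up with the 35 GB/s write plus 25 GB/s read numbers.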
But even if this is true, how much do we really NEED cache coherency with GPU reads/writes? I mean, sure, there might be some situations where it could potentially come in handy, but there are far more cases where I do not want cacheable GPU memory accesses at all.
I read about the number of coherent/non-coherent lines months ago.
Fafalada said: But even if this is true, how much do we really NEED cache coherency with GPU reads/writes? I mean, sure, there might be some situations where it could potentially come in handy, but there are far more cases where I do not want cacheable GPU memory accesses at all.
Fafalada said: Could this mean that the non-coherent part of FlexIO is disabled to improve yields? Because keeping it there sitting idle where it could be useful just doesn't make sense to me.
nAo said: Where are benchmark results?
Code:
Benchmark Profile
•1 SPU
•Frame Buffer
–Resolution: 1024x1024
–Color: RGB = 8:8:8 bit
•Data Software Cache
–Texture tile: 16x16x4 RGBA (single float x 4)
–Frame tile: 16x16x4 RGBA (single float x 4)
–Vertex: dynamic double buffer (size varies)
•Operation
–One overlay per object
•Newton Equation
•Collision
•Translation & Lighting
•Pseudo drop shadow (no self-shadowing)
•Bilinear texturing
•Rasterization
Npl said: AFAIK the FlexIO link is divided into a fixed number of cache-coherent and non-coherent links. Sony used all the coherent links (15 and 20 GB/s, if I remember right) to connect to RSX and only a few non-coherent ones (speed is 2 GB/s) to the Southbridge / IO.
Also, some of it is quite complex, or at least difficult (for me, and others I guess!) to interpret. Anyone care to interpret the highlights?
mckmas8808 said: Also, some of it is quite complex, or at least difficult (for me, and others I guess!) to interpret. Anyone care to interpret the highlights?
Too hard for you, Titanio? Well, we are doomed. You are usually the guy to interpret this kind of stuff. I really want to know what that stuff meant.