Real PS3 Architecture

vers

pic1883138.jpg


its true...
 
made up...

The Broadband Engine with no direct access to the e-DRAM will get nowhere near 1 TFLOPS. Not even in your wildest dreams...

That diagram limits the Broadband Engine to 102 GB/s memory bandwidth to the e-DRAM...

The 1,024 bits PE bus would then be clocked at ~800 MHz ( what is the sense in going at 2x the speed the Redwood bus provides to the e-DRAM ? ), which is 1/5th of the BE's clock speed of 4 GHz...
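The ~102 GB/s and 800 MHz figures above can be cross-checked with a few lines of arithmetic. This is only a sketch: the helper name `bus_bandwidth_gb_s` is made up for illustration, it assumes one transfer per clock, and it counts 1 GB as 10^9 bytes.

```python
def bus_bandwidth_gb_s(width_bits, clock_hz, transfers_per_clock=1):
    """Peak bandwidth in GB/s (1 GB = 10**9 bytes). Hypothetical helper."""
    return width_bits / 8 * clock_hz * transfers_per_clock / 1e9

BE_CLOCK_HZ = 4e9                   # BE clock quoted in the post
PE_BUS_WIDTH_BITS = 1024            # PE bus width from the diagram
pe_bus_clock_hz = BE_CLOCK_HZ / 5   # 1/5th of the BE clock, i.e. 800 MHz

pe_bus_gb_s = bus_bandwidth_gb_s(PE_BUS_WIDTH_BITS, pe_bus_clock_hz)
print(f"{pe_bus_clock_hz / 1e6:.0f} MHz x {PE_BUS_WIDTH_BITS // 8} bytes "
      f"= {pe_bus_gb_s:.1f} GB/s")
```

So a 1,024 bits bus only needs about 800 MHz to match the 102.4 GB/s the diagram gives the e-DRAM link.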

The latency to the e-DRAM would not be pretty... not at all...

BE.PNG


quite a different design from this one... poor Broadband Engine left without e-DRAM :(


Sure the author of that schematic took some ideas from the news of Redwood and Yellowstone and added it to other ideas and tadaah, PS3 "real" schematic :)
 
That pic has some basis from the patent.

There is a similar pic in the patent, and note: similar.
 
this seems to be a mix of PS2 architecture ( GIF 2 ? only e-DRAM on the Visualizer ? ) + Cell...

...

and btw, one of several mistakes, single CRTC and no Image Cache in the Visualizer PE's ;)
 
V3... that processor has SOME ideas from the patent... the rest is PS2 with components replaced including RDRAM, GIF, I/O ASIC, etc...
 
hmm, looking at that scheme, wasn't Local Scratchpad (i assume that's what LS stands for) supposed to be shared among all APUs of a single pool (a column on this pic)?
 
another thing, that 1.3 billion polys/sec is really pathetic, especially since that would mean flat shaded, theoretical polys. this PS3 is not even 100x PS2 performance in that area :)
 
Apart from the performance of CPU and the amount of memory that is about what I always imagined the PS3 would look like (I expect about half the memory, both external and for graphics, and quarter of peak performance for the CPU).

Oh and since the R&D for Redwood and Yellowstone is far from finished it is a bit optimistic to put their clocks at the top end of what Rambus expects will be possible.
 
You can change the color of the busses of YOUR drawing my friend...

So the PEs of the BE are connected with 1,024 bits busses ( 1 for each PE ) to the memory manager and then the bus that connects to the DRAM ( which the APU needs reference to... ) with a 128 bits bus ( 102+ GB/s ) ?

1,024 bits / 8bits/byte = 128 bytes

Only clocking this bus at 800 MHz ( 400 MHz DDR )... Prescott has an external FSB reaching 800 MHz in QDR mode :( and with an on-chip bus we rely on an 800 MHz one when all the logic is running 5x faster ? That is calling for a bottleneck... the L2 bus on the Pentium 4 is a 256 bits bus running at 3 GHz that delivers 96 GB/s, and we are talking about a processor with much more emphasis on massively parallel computation [execution units running at 4 GHz] with 100+ GB/s of bandwidth ? Remember Cell is cache-less and there is heavy emphasis on the use of e-DRAM because it is the only way to circumvent the external memory bottleneck...
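One way to put the Pentium 4 comparison in numbers is bytes delivered per core clock cycle. A rough sketch, assuming the figures quoted in the post (256 bits L2 bus at 3 GHz for the P4, 1,024 bits PE bus at 800 MHz feeding 4 GHz logic); the function name is hypothetical.

```python
def bytes_per_core_cycle(bus_bits, bus_clock_hz, core_clock_hz):
    """Bytes the bus can deliver during one core clock cycle (peak)."""
    return bus_bits / 8 * bus_clock_hz / core_clock_hz

# Pentium 4: L2 bus runs at core speed (3 GHz taken from the post)
p4_l2 = bytes_per_core_cycle(256, 3e9, 3e9)

# BE in the diagram: 1,024 bits bus at 800 MHz, execution units at 4 GHz
be_pe_bus = bytes_per_core_cycle(1024, 800e6, 4e9)

print(f"P4 L2 bus : {p4_l2:.1f} bytes per core cycle")
print(f"BE PE bus : {be_pe_bus:.1f} bytes per core cycle")
```

By this measure the far wider BE gets fewer bytes per cycle than a single Pentium 4 core gets from its L2.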

This set-up would place the BE back at the mercy of the external/off-chip memory and it would slow down APU to APU communication ( the Cell paradigm is based heavily on message passing with apulets/software cells made of both program and data... things need to travel around fast and efficiently )...

The GS would end up using most of that e-DRAM for its own needs ( between frame-buffers, Z-buffer, texture storage, etc... very little space would be left for data the Broadband Engine would need ) and each APU only has 128 KB of Local Storage... that is not that much if we have to rely on slow external memory ( embedding DRAM on the CPU was one of the keys to both Blue Gene cellular computing and the version of Cell presented in that patent )...

The BE would have twice the execution units of the Visualizer and we leave it without e-DRAM ?

The Pentium 4 at 3.06 GHz burns through 96+ GB/s of data and we expect the BE to be happy with 102 GB/s ?

Do not get me wrong, that is fast for external memory but judging the computing caliber of Cell, that is nowhere near enough... Try to take P4's cache off and give it a 50 GB/s main memory... compared to your PC2700 that is a GREAT jump... incredible... but do not expect the performance of the Pentium 4 to be that great...

In your diagram, even if we clocked the bus at 1-2 GHz we would only make the Redwood bus ( and the Yellowstone memory interface ) more of a bottleneck... as we would have much less bandwidth than what we need and the BE would stall like crazy...

Besides the external memory controller ( for Yellowstone ) was supposed to be part of the I/O ASIC, not the BE... the BE is supposed to have memory controllers for the customized e-DRAM as that is the source of data for the LS's...

According to the patent the memory hierarchy was:

Registers

Local Storage

e-DRAM

External Memory

Optical Disc

( faster to slower )...

Cutting the e-DRAM step for the BE ( it would not be feasible for the BE to share the Visualizer's e-DRAM as much as it needs to ) would be the stupidest way to cut manufacturing costs, as it would kill the performance of the processor in many ways...

Inter APU communication would be troublesome with the 1,024 bits bus clocked at 800 MHz, unless it uses multiple data rate techniques ( like Redwood does ) to reach a higher effective frequency than that...

APUs work at 4 GHz... cycle time is then 0.25 ns

800 MHz means a cycle time of 1.25 ns... 1 bus tick is equivalent to 5 APU cycles...

and in 1 bus clock we transmit 128 bytes... so to transmit, let's say, 5 KB of data from one APU to another ( same PE ) we would take 40 bus cycles...
and a whole 128 KB packet from memory would take 1,024 bus cycles ( 64 KB would take 512 bus cycles... )...

The latency of a 64 KB apulet as seen by the APU would then be 2,560 APU cycles... 5,120 cycles would be for 128 KB... and remember it is 1,024 bits wide, but it is still a bus and when it is used by one device, all the others can twiddle their thumbs while waiting for a memory transfer...
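The cycle counts above follow directly from the bus width and the 5:1 clock ratio. A small sketch, assuming a 128 bytes transfer per bus clock, 1 KB = 1,024 bytes, and no arbitration or setup overhead (so these are best-case numbers); the names are hypothetical.

```python
BUS_BYTES_PER_CYCLE = 1024 // 8     # 1,024 bits bus -> 128 bytes per bus clock
BUS_CLOCK_MHZ = 800
APU_CLOCK_MHZ = 4000
APU_CYCLES_PER_BUS_CYCLE = APU_CLOCK_MHZ // BUS_CLOCK_MHZ   # 5

def transfer_cycles(size_bytes):
    """Return (bus cycles, APU cycles) for a best-case transfer."""
    bus_cycles = -(-size_bytes // BUS_BYTES_PER_CYCLE)   # ceiling division
    return bus_cycles, bus_cycles * APU_CYCLES_PER_BUS_CYCLE

for size_kb in (5, 64, 128):
    bus_cycles, apu_cycles = transfer_cycles(size_kb * 1024)
    print(f"{size_kb:>3} KB: {bus_cycles:>5} bus cycles = {apu_cycles:>5} APU cycles")
```

Any contention on the shared bus would only add to these figures.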

We are talking about feeding 16 APUs ( each with 4 FP Units and 4 Integer Units ) and you rely on external memory ?

Each PE bus could be running at 1-2 GHz... yielding 128-256 GB/s, and we are feeding 4 parallel PEs... each of them would, in an ideal world, like those 128-256 GB/s... but realistically each would get less than 1/4th of the main memory bandwidth ( even in your current drawing that 102.4 GB/s becomes ~25 GB/s for each PE )...
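To make the mismatch concrete, here is a quick sketch comparing what each PE bus could carry with an even split of the external bandwidth. It assumes 4 PEs sharing 102.4 GB/s equally, which is the best case; the names are illustrative only.

```python
PE_BUS_BYTES = 128        # 1,024 bits PE bus
PE_COUNT = 4              # PEs in the BE, as in the post
EXTERNAL_GB_S = 102.4     # main memory bandwidth in the diagram

def pe_bus_gb_s(clock_ghz):
    """Peak bandwidth of one PE bus at the given clock."""
    return PE_BUS_BYTES * clock_ghz

per_pe_external_gb_s = EXTERNAL_GB_S / PE_COUNT   # even split, best case

for clock_ghz in (1, 2):
    wanted = pe_bus_gb_s(clock_ghz)
    print(f"PE bus at {clock_ghz} GHz could carry {wanted} GB/s, "
          f"main memory offers {per_pe_external_gb_s:.1f} GB/s per PE "
          f"({per_pe_external_gb_s / wanted:.0%})")
```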

aah... I am going sooooo ballistic with this that I am almost not thinking straight anymore...

A comment about Yellowstone...

You present a dual channel solution with 32 bits each for data...

and you say 102.4 GB/s...

we know Yellowstone uses an on-chip clock that is 4x the off-chip clock and data is transferred on both edges of the on-chip clock... achieving ODR... for each pin we can transmit 2 bits per on-chip clock cycle, i.e. 2 bits * ( base clock x 4 )...

Let's think about the dual channels as a single 64 bits bus...

We can then transmit 128 bits per "fast clock cycle" ( 64 bits on each edge )...

the effective data rate must then be 12.8 GHz ( to achieve ~102.4 GB/s ), i.e. a 6.4 GHz fast clock with DDR data transmission...

12.8 GHz * 8 bytes = 102.4 GB/s ( using both channels at the same time in DDR mode )

The external clock should then be 1.60 GHz... that is quite a high clock frequency for an off-chip bus ( Pentium 4 Prescott's FSB has a 200 MHz base clock brought to 800 MHz using QDR data transfer... you are asking an off-chip bus in PS3 to be clocked at 8x the base clock of the Pentium 4's FSB [no use of QDR/ODR/DDR here: that is already taken into account once the signal is on the chip, as you can read in the Yellowstone documents] )...
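Working backwards from the 102.4 GB/s target gives the same numbers. This is a sketch under the assumptions stated in the post: 64 data bits in total, a 4x PLL multiplier, and data on both clock edges; the function names are made up for illustration.

```python
def required_signaling_ghz(target_gb_s, data_bits):
    """Per-pin data rate needed to reach the target bandwidth."""
    return target_gb_s / (data_bits / 8)

def external_clock_ghz(signaling_ghz, pll_multiplier=4, edges=2):
    """Off-chip clock implied by an ODR scheme (PLL x4, both edges)."""
    return signaling_ghz / (pll_multiplier * edges)

signaling = required_signaling_ghz(102.4, data_bits=64)
clock = external_clock_ghz(signaling)

print(f"signaling rate : {signaling:.1f} GHz")
print(f"external clock : {clock:.2f} GHz")
```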



Broadband Engine ( with all the APUs it has ) with no e-DRAM == not so exciting performance...

and that 1.60 GHz external Yellowstone bus ( 13 GHz on the memory chip and the memory manager ) is going to cost you... in the long run more than having e-DRAM on the Broadband Engine...

a 25-50 GB/s Yellowstone Interface would be more than enough if Cell had its own e-DRAM with the fat on-chip 1,024 bits bus...

Assuming only 64 bits with two channels that could be done with a 400-800 MHz external clock...
 
Marco... that diagram assumes 1.6 GHz clock for Yellowstone ( 12.8 ODR ) and a 64 bits wide bus...

And no e-DRAM on the Broadband Engine doesn't sound like the best idea IMHO...
 
Take what panajev says, and fix your picture, then post it on another forum to get folks excited :)

Also change the bus from 128 to 256 bit...128 is soo 2001.

Change the names from GiF2 to something better sounding.

Speng.
 
The rate Rambus specifies for Yellowstone/Redwood is for the signalling, not the clock ... 12.8 GHz isn't even on the map, that is fairy tale country for that timeframe (Yellowstone has 2 separate data busses for each connection, so I was assuming it was 4 times 32 with a 6.4 GHz signalling speed ... which is the upper end I was talking about).

Speng, Yellowstone and Redwood use differential signalling ... a 128 bit bus using them uses the same number of pins as a traditional 256 bit bus, and given that the single chip needs the pins for both the Yellowstone and the Redwood connection, the chip has the equivalent number of pins to what a 512 bit bus would need on a graphics chip today. Which seems on the optimistic side of realistic, so I guess still not enough for the likes of yall ;)
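The pin argument can be written out as a tiny sketch. It assumes two pins per data bit for differential signalling, one pin per bit for a traditional single-ended bus, that both the Yellowstone and the Redwood links are 128 bits wide, and it counts data pins only (no clock, address, or power pins).

```python
def data_pins(bus_bits, differential):
    """Data pins only: differential signalling needs a pair per bit."""
    return bus_bits * (2 if differential else 1)

yellowstone_pins = data_pins(128, differential=True)
redwood_pins = data_pins(128, differential=True)
total_pins = yellowstone_pins + redwood_pins

print(f"one 128 bit differential link : {yellowstone_pins} data pins")
print(f"Yellowstone + Redwood         : {total_pins} data pins")
print(f"traditional 512 bit bus       : {data_pins(512, differential=False)} data pins")
```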
 
According to the patent the memory hierarchy was:

Registers

Local Storage

e-DRAM

External Memory

Optical Disc

( faster to slower )...

Which bit of the patent explains the optical disc? I must have missed it.
Can you point it out to me, please :?:
 
V3... I imagine that the patent does not mention controllers in specific... should we assume it does not use 'em ? ;)

Optical Disk/HDD practically similar speed ( I was assuming no big HDD between RAM and Blu-Ray. but you could put one, maybe a tiny one acting as a buffer/temporary storage )...

You could insert the HDD in between main memory and optical disk ( Blu_ray ) if you think PS3 should have one...

The Optical disc was not clearly specified in the patent, I thought about the technology in the context of PS3 which will have an optical disc...

Mfa,

Yellowstone operates at Octal Data Rates (ODR), transferring 8 bits of data per clock. ODR enables 3.2GHz data rates with a 400MHz clock and provides a scalable path to over 6.4GHz as bandwidth needs increase.

The lower speed 400MHz system clock is routed on the PCB between chips. On-chip, the 400MHz clock is multiplied -- up to 1.6GHz with a PLL. This effective 1.6GHz clock is subsequently used to transmit and receive data on both clock edges, resulting in 3.2GHz data rates. The 1:8 relationship between clock and data rates results in Octal Data Rate (ODR) operation.

6.4 GHz signaling assumes an 800 MHz clock, not 800 MHz signaling ( at least... it appears that the PLL can multiply the clock 4x... if you could multiply the base clock up to 8x with the PLL you could get away with a 400 MHz base clock )...

800 MHz * 4 ( PLL ) * 2 ( DDR ) = 6.4 GHz

6.4 GHz * 16 bytes = 102.4 GB/s

With e-DRAM 51.2 GB/s is acceptable and we could either use a 4x32 for combined data "bus" width with 3.2 GHz signaling ( 400 MHz base clock ) or 2x32 data lines with 6.4 GHz signaling ( 800 MHz base clock )... the best one price-wise would win...

His first design assumed to use 12.8 GHz signaling as he was achieving 102.4 GB/s with 2x32 data lines which meant a 1.6 GHz base clock ( not signaling rate, system clock... )...

1.6 GHz * 4 ( PLL ) * 2 = 12.8 GHz signaling rate...
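Going the other way, from base clock to bandwidth, the configurations discussed in this thread can be tabulated. A sketch assuming the ODR scheme from the Rambus text (PLL x4, both edges) and 32 data bits per channel; the function names and the list of configurations are only illustrative.

```python
def odr_signaling_ghz(base_clock_mhz, pll_multiplier=4, edges=2):
    """Per-pin data rate for a given off-chip base clock."""
    return base_clock_mhz * pll_multiplier * edges / 1000

def bandwidth_gb_s(base_clock_mhz, channels, bits_per_channel=32):
    """Peak bandwidth of `channels` data channels at that base clock."""
    return odr_signaling_ghz(base_clock_mhz) * channels * bits_per_channel / 8

# (base clock in MHz, number of 32 bits data channels)
configs = [(400, 4), (800, 2), (800, 4), (1600, 2)]

for base_clock_mhz, channels in configs:
    print(f"{channels}x32 at {base_clock_mhz:>4} MHz base clock: "
          f"{odr_signaling_ghz(base_clock_mhz):>4.1f} GHz signaling, "
          f"{bandwidth_gb_s(base_clock_mhz, channels):>5.1f} GB/s")
```

The two 51.2 GB/s rows are the cheaper options mentioned above, and the last row is the 12.8 GHz case from the first diagram.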
 
ok your diagram is getting better :)

still, 256 + 256 = 512 MB, whereas your schematic lists 1,024 MB of RDRAM with two 256 MB modules... 256-512 MB of RDRAM is ok ( 256 MB would be IMHO )...

Which seems on the optimistic side of realistic, so I guess still not enough for the likes of yall ;)

no, a 128 bits bus is "enough" for me... heck I'd be happy with a 64 bits bus and 6.4 GHz signaling or a 128 bits bus and 3.2 GHz signaling... ( 50 GB/s from e-DRAM to external RAM is not as good as 100 GB/s, but it is still good and after all it would be >12x the speed of a 4.2 GB/s PC1066 RDRAM... )...







another note on the diagram...

the e-DRAM will probably be clocked at 1/2 or 1/4 of the CPU clock ( or use comparable signaling... of course nobody is excluding the e-DRAM running with QDR/ODR signaling [it would be useful since clocking e-DRAM high is not easy] )...

128 bytes * 1.8-0.9 GHz = 230.4-115.2 GB/s...
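As a rough check of that last line: 1.8 GHz and 0.9 GHz are 1/2 and 1/4 of a 3.6 GHz CPU clock, which is inferred here from the numbers rather than stated in the post. The sketch below reproduces the figures under that assumption and shows what the 4 GHz clock used earlier in the thread would give; the names are hypothetical.

```python
EDRAM_BUS_BYTES = 128   # 1,024 bits e-DRAM bus

def edram_gb_s(cpu_clock_ghz, divider):
    """Peak e-DRAM bandwidth with the e-DRAM at cpu_clock / divider."""
    return EDRAM_BUS_BYTES * cpu_clock_ghz / divider

for cpu_clock_ghz in (3.6, 4.0):      # 3.6 GHz is an inferred assumption
    for divider in (2, 4):
        print(f"CPU {cpu_clock_ghz} GHz, e-DRAM at 1/{divider}: "
              f"{edram_gb_s(cpu_clock_ghz, divider):.1f} GB/s")
```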
 