Panajev2001a
Veteran
The Cell CPU would still benefit from very fast interconnects with the other chips in the system, like Redwood, and from fast external RAM... it is true that we have a good amount of e-DRAM on the CPU, but we do not want to be bottlenecked too much by all the other system peripherals... and those peripherals need to receive the results of Cell's calculations fast...
Just because Cell has e-DRAM on the CPU doesn't mean that the rest of the system is not going to need it... and we also need a fast bus to initially fill that chunk of e-DRAM and keep it full...
Cell's internal bandwidth and processing capabilities are well over 50 GB/s, so we are not forced to make the external bus system follow Cell's high internal speed ( 1,024-bit bus )...
Latency considerations for the external RAM should also be relaxed ( as is the bandwidth concern... )... we are not feeding a 1-2 MB cache on the CPU, we are feeding 64 MB of e-DRAM ( whose use can be optimized at compile time, it is part of the ISA ) and, most importantly, the LS's of the APUs, from which execution takes place...
Higher external latency allows us to ramp up the clock speed of the memory bus, and since the bus from memory to the Cell CPU would not be THAT long and PS3 is a custom design ( motherboard included ) there is space for a moderately wide bus.
Yellowstone can transmit 1 byte per clock on each pin ( ODR ): the real clock speed is 400 MHz...
You route only 400 MHz on the PCB, the PLL multiplication occurs on the memory chip... the chips communicate with each other using the 400 MHz signal...
Yellowstone can be used to connect a processor ( in the next diagram it is used to connect the DRAM to the GPU... ) to external memory...
As you can also read at RAMBUS' website, Redwood is a parallel bus technology that connects different chips together ( like CPU with Northbridge, Northbridge with Southbridge, etc... ) while, again, Yellowstone is a memory interface and connects the memory with the CPU/Graphics processor...
Cell could pack in a Yellowstone memory controller to talk with external RAM and could be connected with Redwood to other chips...
In a PS3 design ( tell me what you think ) we could also have a Northbridge that connects to external RAM using Yellowstone and then has two buses: one to the Broadband Engine and one to the Visualizer.
I have a better idea ( hopefully ):
We could connect the Broadband Engine and the Visualizer together with a Redwood bus ( Cell patent: bus 608 ) and have the I/O ASIC pack the Yellowstone memory interface ( makes sense... both Visualizer and Broadband Engine have e-DRAM to work on locally ), which would connect the I/O ASIC to the external memory...
This sort of makes sense ( also Sony licensed both Yellowstone and Redwood )
The 400 MHz clock gets multiplied by a PLL reaching a total of 1.6 GHz and we operate at DDR on that clock ( both edges of the clock see a transfer )...
we basically have 3.2 Gbits/(s*pin)... or 400 MB/(s*pin)
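The per-pin figure above can be checked with a few lines of Python. This is only a sketch of the arithmetic in this post: the function name and its parameters are made up for illustration, and the defaults ( x4 PLL, 2 transfers per multiplied clock ) are the values quoted for Yellowstone here.

```python
def per_pin_rate_mb_s(base_clock_mhz: float,
                      pll_multiplier: int = 4,
                      edges_per_cycle: int = 2) -> float:
    """Peak data rate of a single data pin, in MB/s (hypothetical helper)."""
    # 4x PLL multiplication times 2 clock edges = 8 bits per base clock (ODR)
    bits_per_base_clock = pll_multiplier * edges_per_cycle
    # MHz times bits per clock gives Mbit/s on one pin
    mbit_per_s = base_clock_mhz * bits_per_base_clock
    # 8 bits per byte
    return mbit_per_s / 8


print(per_pin_rate_mb_s(400))  # 400.0 MB/s per pin, i.e. 3.2 Gbit/s per pin
print(per_pin_rate_mb_s(800))  # 800.0 MB/s per pin
```

With 8 bits moved per base clock, the MB/s per pin comes out numerically equal to the base clock in MHz.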
50 GB/s = ~50,000 MB/s
50,000 / 400 = 125 pins, hence we need a 128-bit data bus... if we could raise the base frequency to 800 MHz ( could be more expensive than just using more traces for the data bus ) then we would reach 800 MB/(s*pin), thus we would need only a 64-bit bus...
If 25 GB/s is good enough for us we can deal with a 400 MHz base clock and a 64-bit data bus...
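The bus-width sizing for these three cases can be sketched as follows. The helper name is hypothetical, and rounding the pin count up to the next power of two is my assumption about how one would pick a practical bus width; it is not something the Rambus material states.

```python
import math


def required_bus_width(target_mb_s: float, per_pin_mb_s: float) -> int:
    """Smallest power-of-two data bus width (in bits) that meets the target."""
    # Minimum number of data pins needed to reach the target bandwidth
    min_pins = math.ceil(target_mb_s / per_pin_mb_s)
    # Assumption: round up to the next power of two for a practical width
    return 1 << (min_pins - 1).bit_length()


# (target bandwidth in MB/s, per-pin rate in MB/s) for the cases discussed
cases = [(50_000, 400), (50_000, 800), (25_000, 400)]
for target, per_pin in cases:
    width = required_bus_width(target, per_pin)
    print(f"{target} MB/s at {per_pin} MB/s per pin -> "
          f"{width}-bit bus, peak {width * per_pin} MB/s")
```

Under these assumptions the 128-bit bus at a 400 MHz base clock peaks at 51,200 MB/s, and the 64-bit bus at the same clock peaks at 25,600 MB/s.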
We will not have many memory modules and I do not expect the distance from memory module to memory controller to be really long, it is not a PC motherboard: a 128-bit bus running at a 400 MHz base clock ( x4 thanks to the PLL, and then add in DDR transfers ) should be doable...
We can use the 400 MHz base clock and transfer 4 phase-shifted clock signals that, on the chip itself, the PLL would sort of "pack/fuse" into a single 4x faster clock...
We encode more data in a 400 MHz signal, and the way we retrieve it is to use the PLL to multiply it by 4x and sample it on both edges of the clock, achieving an effective frequency of 3.2 GHz ( I hope I am not too tired and that I am not talking gibberish )... I think we can encode the signal we need in the 400 MHz clock...
Yellowstone operates at Octal Data Rates (ODR), transferring 8 bits of data per clock. ODR enables 3.2GHz data rates with a 400MHz clock and provides a scalable path to over 6.4GHz as bandwidth needs increase.
The lower speed 400MHz system clock is routed on the PCB between chips. On-chip, the 400MHz clock is multiplied up to 1.6GHz with a PLL. This effective 1.6GHz clock is subsequently used to transmit and receive data on both clock edges, resulting in 3.2GHz data rates. The 1:8 relationship between clock and data rates results in Octal Data Rate (ODR) operation.
It says it clearly, only on the chip we use the PLL for "clock multiplication", chip to chip the clock signal routed is 400 MHz...