You can change the color of the busses of YOUR drawing my friend...
So the PEs of the BE are connected with 1,024-bit busses ( 1 for each PE ) to the memory manager, and then the memory manager connects to the DRAM ( which the APUs need to reference... ) with a 128-bit bus ( 102+ GB/s ) ?
1,024 bits / 8bits/byte = 128 bytes
And that is only clocking this bus at 800 MHz ( 400 MHz DDR )... Prescott has an external FSB reaching 800 MHz in QDR mode,
and for an on-chip bus we rely on an 800 MHz one when all the logic is running 5x faster ? That is calling for a bottleneck... The L2 bus on the Pentium 4 is a 256-bit bus running at 3 GHz and it delivers 96 GB/s, while here we are talking about a processor with much more emphasis on massively parallel computation [execution units running at 4 GHz] with only 100+ GB/s of bandwidth ? Remember, Cell is cache-less and there is heavy emphasis on the use of e-DRAM because it is the only way to circumvent the external memory bottleneck.
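To make the comparison concrete, here is a rough sketch of the peak-bandwidth arithmetic ( bus width times clock ). The function name and parameters are mine, purely for illustration, and it ignores arbitration and protocol overhead, so these are best-case numbers:

```python
def bus_bandwidth_gb_s(width_bits: int, clock_hz: float,
                       transfers_per_clock: int = 1) -> float:
    """Peak bandwidth in GB/s (1 GB = 1e9 bytes) for an idealized bus."""
    bytes_per_transfer = width_bits / 8
    return bytes_per_transfer * clock_hz * transfers_per_clock / 1e9


# The 1,024-bit PE bus at an effective 800 MHz, as in the drawing.
pe_bus = bus_bandwidth_gb_s(1024, 800e6)

# The Pentium 4 L2 bus: 256 bits at 3 GHz.
p4_l2 = bus_bandwidth_gb_s(256, 3e9)

print(f"PE bus: {pe_bus:.1f} GB/s, P4 L2 bus: {p4_l2:.1f} GB/s")
```

So the wide bus in the drawing only barely beats what a single Pentium 4 core already gets from its L2.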
This set-up would place the BE back at the mercy of the external/off-chip memory and it would slow down APU to APU communication ( the Cell paradigm is based heavily on message passing with apulets/software cells made of both program and data... things need to travel around fast and efficiently )...
The GS would end up using most of that e-DRAM for its own needs ( between frame-buffers, Z-buffer, texture storage, etc... very little space would be left for data the Broadband Engine would need ) and each APU only has 128 KB of Local Storage... that is not that much if we have to rely on slow external memory ( embedding DRAM on the CPU was one of the keys to both Blue Gene cellular computing and the version of Cell presented in that patent )...
The BE would have twice the execution units of the Visualizer and we leave it without e-DRAM ?
The 3.06 GHz Pentium 4 burns through 96+ GB/s of data and we expect the BE to be happy with 102 GB/s ?
Do not get me wrong, that is fast for external memory, but judging by the computing caliber of Cell, that is nowhere near enough... Try taking the P4's cache away and giving it a 50 GB/s main memory... compared to your PC2700 that is a GREAT jump... incredible... but do not expect the performance of the Pentium 4 to be that great...
In your diagram, even if we clocked the on-chip bus at 1-2 GHz we would only make the Redwood bus ( and the Yellowstone memory interface ) more of a bottleneck... as we would have much less bandwidth than what we need and the BE would stall like crazy...
Besides, the external memory controller ( for Yellowstone ) was supposed to be part of the I/O ASIC, not the BE... the BE is supposed to have memory controllers for the customized e-DRAM, as that is the source of data for the LS's...
According to the patent the memory hierarchy was:
Registers
Local Storage
e-DRAM
External Memory
Optical Disc
( faster to slower )...
Cutting the e-DRAM step for the BE ( it would not be feasible for the BE to share the Visualizer's e-DRAM as well as it needs to ) would be the stupidest way to cut manufacturing costs, as it would kill the performance of the processor in many ways...
Inter-APU communication would be troublesome with the 1,024-bit bus clocked at 800 MHz, unless it uses multiple data rate techniques ( like Redwood does ) to boast a higher frequency than that...
APUs work at 4 GHz... cycle time is then 0.25 ns
800 MHz means a cycle time of 1.25 ns... 1 bus tick is equivalent to 5 APU cycles...
and in 1 bus clock we transmit 128 bytes... so to transmit, let's say, 5 KB of data from one APU to the other ( same PE ) we would take 40 bus cycles...
and a whole 128 KB packet from memory would take 1,024 bus cycles ( 64 KB would take 512 bus cycles... )...
The latency of a 64 KB apulet as seen by the APU would then be 2,560 APU cycles... for 128 KB it would be 5,120 APU cycles... and remember it is 1,024 bits wide, but it is still a bus, and when it is used by one device all the others can twiddle their thumbs while waiting for a memory transfer...
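If you want to check the cycle counts yourself, here is a minimal sketch. The constant and function names are my own, and it assumes an uncontended bus with no setup latency, so real numbers would only be worse:

```python
import math

APU_CLOCK_MHZ = 4000          # APUs at 4 GHz
BUS_CLOCK_MHZ = 800           # PE bus at 800 MHz
BUS_WIDTH_BYTES = 1024 // 8   # 1,024 bits = 128 bytes per bus clock
KB = 1024

# One bus tick lasts this many APU cycles (4000 / 800 = 5).
APU_CYCLES_PER_BUS_TICK = APU_CLOCK_MHZ // BUS_CLOCK_MHZ


def bus_cycles(payload_bytes: int) -> int:
    """Bus clocks needed to move the payload, 128 bytes per clock."""
    return math.ceil(payload_bytes / BUS_WIDTH_BYTES)


def apu_cycles(payload_bytes: int) -> int:
    """The same transfer time expressed in APU clock cycles."""
    return bus_cycles(payload_bytes) * APU_CYCLES_PER_BUS_TICK


for size_kb in (5, 64, 128):
    n = size_kb * KB
    print(f"{size_kb:>3} KB: {bus_cycles(n)} bus cycles, {apu_cycles(n)} APU cycles")
```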
We are talking about feeding 16 APUs ( each with 4 FP Units and 4 Integer Units ) and you rely on external memory ?
Each PE bus could be running at 1-2 GHz... yielding 128-256 GB/s, and we are feeding 4 parallel PEs... each of them would, in an ideal world, like those 128-256 GB/s... but realistically they would get less than 1/4th of that from main memory ( even in your current drawing that 102.4 GB/s becomes ~25.6 GB/s for each PE )...
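A quick sketch of supply versus demand per PE, under my own simplifying assumption that the four PEs split the external bandwidth evenly ( the function names are just for illustration ):

```python
def pe_bus_gb_s(clock_ghz: float, width_bits: int = 1024) -> float:
    """Peak bandwidth one PE bus could move at the given clock."""
    return width_bits / 8 * clock_ghz


def per_pe_share_gb_s(shared_gb_s: float, num_pes: int = 4) -> float:
    """Even split of the shared main-memory bandwidth across PEs."""
    return shared_gb_s / num_pes


share = per_pe_share_gb_s(102.4)
for clock_ghz in (1.0, 2.0):
    demand = pe_bus_gb_s(clock_ghz)
    print(f"PE bus at {clock_ghz:.0f} GHz could move {demand:.0f} GB/s, "
          f"main memory supplies {share:.1f} GB/s "
          f"({share / demand:.0%} of that)")
```

Even at the low end of that range, each PE gets only a fifth of what its own bus could carry.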
aah... I am going sooooo ballistic with this that I am almost not thinking straight anymore...
A comment about Yellowstone...
You present a dual channel solution with 32 bits each for data...
and you say 102.4 GB/s...
we know Yellowstone uses an on-chip clock that is 4x the off-chip clock, and we sample the on-chip clock on both edges... achieving ODR... for each pin we can transmit 2 bits * ( base clock x 4 )...
Let's think about the dual channels as a single 64-bit bus...
We can then transmit 128 bits per "fast clock cycle" ( 64 bits on each edge )...
our effective data rate must then be ( to achieve ~102.4 GB/s ) 12.8 GHz per pin, which means a 6.4 GHz fast clock once you consider DDR data transmission...
12.8 GHz effective * 8 bytes = 102.4 GB/s ( using both channels at the same time in DDR mode )
The external clock should then be 1.60 GHz... that is quite a high clock frequency for an off-chip bus ( Pentium 4 Prescott's FSB has a 200 MHz base clock brought to 800 MHz using QDR data transfer... you are asking an off-chip bus in PS3 to be clocked at 8x the base clock of the Pentium 4's FSB [ no QDR/ODR/DDR counted here: that was already taken into account once the signal is on the chip, as you can read in the Yellowstone documents ] )...
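Here is the same derivation as a small sketch, reading the 12.8 figure as the effective per-pin data rate. The function and parameter names are mine, and the 4x multiplier with both-edge sampling is just the ODR scheme as I described it above, not something I am quoting from a datasheet:

```python
def yellowstone_clocks(target_gb_s: float, width_bits: int = 64,
                       clock_multiplier: int = 4, edges_per_clock: int = 2):
    """Return (effective rate, fast on-chip clock, external base clock) in GHz.

    Assumes every pin moves one bit per edge of the fast clock and that the
    fast clock is clock_multiplier times the external clock.
    """
    bytes_per_transfer = width_bits / 8
    effective_ghz = target_gb_s / bytes_per_transfer    # transfers per ns
    fast_clock_ghz = effective_ghz / edges_per_clock    # both edges used
    base_clock_ghz = fast_clock_ghz / clock_multiplier  # off-chip clock
    return effective_ghz, fast_clock_ghz, base_clock_ghz


effective, fast, base = yellowstone_clocks(102.4)
P4_FSB_BASE_GHZ = 0.2  # Prescott FSB base clock, 200 MHz

print(f"effective {effective:.1f} GHz, fast clock {fast:.1f} GHz, "
      f"external clock {base:.2f} GHz "
      f"({base / P4_FSB_BASE_GHZ:.0f}x the P4 FSB base clock)")
```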
Broadband Engine ( with all the APUs it has ) with no e-DRAM == not so exciting performance...
and that 1.60 GHz external Yellowstone bus ( ~13 GHz effective data rate on the memory chip and the memory manager ) is going to cost you... in the long run more than having e-DRAM on the Broadband Engine...
a 25-50 GB/s Yellowstone interface would be more than enough if Cell had its own e-DRAM with the fat on-chip 1,024-bit bus...
Assuming only 64 bits across two channels, that could be done with a 400-800 MHz external clock...