Okay, as promised here goes...
Above is the PS2's EE block diagram showing the major components.
Above is an image of the combined cores of the PS2's EE + GS as manufactured for the PSX. It is made on a hybrid 90/130 nm process and the die size is 86 mm^2. The image is placed on a grid so that we can estimate unit areas for the different components.
Okay, I hope to show that the Broadband Engine (BE) as shown in prior posts is feasible, using the PSX core as a reference. The main components of the BE are:
32 APUs, 4 PUs, 4 DMACs, 64 MB eDRAM and L1, L2 and possibly L3 cache for the PUs.
Before I start, I'm going to mention some assumptions. The PS2's CPU, the EE, was introduced at 250 nm with a die area of 240 mm^2. The PS2's Graphics Synthesizer (GS) was introduced at 250 nm with a die area of 279 mm^2. Shortly after the PS2's release Sony went to 180 nm. PS3 will likely debut at 65 nm and move to 45 nm shortly afterwards. It is likely that they will again introduce the CPU and GPU of PS3 at around 200-300 mm^2 die size. I'll treat 300 mm^2 as the absolute upper limit, and use it as the basis for showing that the BE is feasible at 65 nm.
Also note that by taking these unit areas from the PSX core, we're also inheriting the areas of datapaths etc., which would scale with our calculations.
Also the PSX core is a hybrid 90/130 nm process. It is not fully 90 nm.
Source: EE Times. I'll assume it was at 90 nm (a pessimistic assumption for my calculations, as will be revealed later...)
I'll start with the 32 APUs
I'm going to use the VU1 as a guide to show the feasibility of the number of APUs in the Broadband Engine.
Above shows the basic components of the APU: 128 registers of 128 bits each, 128 KB of SRAM local memory, 4 FMACs and 4 IUs.
Above is the VU1 core in the EE: 32 registers of 128 bits each, 32 KB of local memory, 5 FMACs, 2 FDIVs and other units etc.
The APU and the VU1 are comparable in terms of the number of execution units (VU1 has more), but the APUs have a larger register file and local memory. However, this IBM patent, "Processor implementation having unified scalar and SIMD datapath", describes the APUs with shared datapaths as a space-saving and power-saving feature. I'll base my estimate on the assumption that an APU is 1.5 times larger than the VU1.
1.5 VU1 ~ APU in terms of die area.
The PSX die is 82*46 units in the diagram ~ 3772 square unit area
The VU1 core is 11*27 units in the diagram ~ 297 square unit area
APU = 1.5 * 297 ~ 446 square unit area
APU as a % of PSX core = 446/3772 * 100 ~ 11.81 % of PSX core.
PSX core = 86 mm^2
APU area = 11.81/100 * 86 ~ 10.16 mm^2, and remember this would be at the 90 nm process.
The area gained by dropping to 65 nm = (90/65)^2 ~ 1.92 times more area available, assuming tools scale accordingly.
Therefore, the equivalent area for APU at 65 nm = 10.16/1.92 ~ 5.3 mm^2.
We need 32 APUs = 5.3 * 32 ~ 170 mm^2 (remember we have 300 mm^2 available)
Let's move on to the 64 MB of eDRAM
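For anyone who wants to re-run the APU numbers, here is a minimal Python sketch of the steps above. The grid measurements, the 86 mm^2 die size and the 1.5x APU/VU1 ratio are the figures from this post; the function and variable names are just my own labels, and the shrink is the same ideal (linear ratio squared) scaling assumed above. Because it doesn't round at each step, the results can differ from the hand-worked figures in the last digit.

```python
# Figures read off the gridded PSX die image (taken from the post).
PSX_GRID_UNITS = 82 * 46   # whole PSX die, in grid squares
VU1_GRID_UNITS = 11 * 27   # VU1 block, in grid squares
PSX_DIE_MM2 = 86.0         # quoted PSX die size
APU_VS_VU1 = 1.5           # ASSUMPTION from the post: APU is 1.5x a VU1


def shrink_factor(old_nm, new_nm):
    """Ideal area gain from a process shrink: linear ratio squared."""
    return (old_nm / new_nm) ** 2


def apu_area_mm2(node_from=90, node_to=65):
    """Estimated area of one APU after shrinking from node_from to node_to."""
    fraction_of_die = APU_VS_VU1 * VU1_GRID_UNITS / PSX_GRID_UNITS
    area_at_source_node = fraction_of_die * PSX_DIE_MM2
    return area_at_source_node / shrink_factor(node_from, node_to)


one_apu = apu_area_mm2()
all_apus = 32 * one_apu
print(f"one APU at 65 nm: {one_apu:.2f} mm^2")
print(f"32 APUs at 65 nm: {all_apus:.0f} mm^2")
```

Changing APU_VS_VU1 is the quickest way to see how sensitive the whole estimate is to that one guess.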
Looking at the diagram, each yellow block on the GS side of the PSX represents 1 MB of eDRAM.
1 MB eDRAM = 15*10~ 150 square units.
4 MB eDRAM = 150*4 ~ 600 square units
4MB eDRAM as % of PSX core = 600/3772 *100 ~ 15.91 % of PSX core
PSX core 86 mm^2
4 MB eDRAM = 15.91/100 * 86 ~ 13.68 mm^2 at 90 nm process
The area gained by dropping to 65 nm = (90/65)^2 ~ 1.92 times more area available, assuming tools scale accordingly.
Area of 4MB eDRAM at 65 nm = 13.68/1.92 ~ 7.12 mm^2
We need 64 MB eDRAM = 64/4 * 7.12 ~ 114 mm^2 at 65 nm
Area at 65 nm of 32 APUs and 64 MB eDRAM = 170 + 114 = 284 mm^2 ( out of 300 mm^2)
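The eDRAM estimate follows the same recipe, so here it is as a small Python sketch. The 15*10 grid squares per MB, the 82*46 die grid and the 86 mm^2 die size are the figures from this post; the names are my own, and ideal (90/65)^2 scaling is assumed as before. Without the intermediate rounding, the 4 MB figure comes out a touch above the hand-worked 7.12 mm^2, but the 64 MB total still rounds to the same value.

```python
# Figures read off the gridded PSX die image (taken from the post).
PSX_GRID_UNITS = 82 * 46         # whole PSX die, in grid squares
PSX_DIE_MM2 = 86.0               # quoted PSX die size
EDRAM_1MB_GRID_UNITS = 15 * 10   # one yellow 1 MB eDRAM block


def edram_area_mm2(megabytes, node_from=90, node_to=65):
    """Estimated eDRAM area after an ideal shrink from node_from to node_to."""
    grid_units = megabytes * EDRAM_1MB_GRID_UNITS
    area_at_source_node = grid_units / PSX_GRID_UNITS * PSX_DIE_MM2
    return area_at_source_node / (node_from / node_to) ** 2


edram_4mb = edram_area_mm2(4)
edram_64mb = edram_area_mm2(64)
print(f" 4 MB eDRAM at 65 nm: {edram_4mb:.2f} mm^2")
print(f"64 MB eDRAM at 65 nm: {edram_64mb:.0f} mm^2")
```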
Let's move on to the PUs with L1 and L2 cache
The PU will really be a glorified core to schedule Apulets (software cells) to the APUs. As such, it doesn't need to be anything fancy, IMO.
If we assume a PU has 32 KB of L1 cache and 128 KB of L2 cache, and will probably need fewer execution units than an APU, we can approximate a PU's area as equivalent to an APU's, if not less.
PU at 65 nm = APU = 5.3 mm^2
We need 4 PUs, 5.3 * 4 ~ 21 mm^2 at 65 nm
Area of 32 APUs and 64MB eDRAM and 4 PUs = 170 + 114 + 21 ~ 305 mm^2 ( we had 300 mm^2 available, but I'll explain shortly...)
Let's move on to the DMACs and L3 cache
Let's recap,
We've used 305 mm^2 out of an available 300 mm^2. We've scaled down from 90 nm to 65 nm. We've accounted for,
32 APUs, 4 PUs and 64 MB eDRAM.
We need to add the DMACs and if we're lucky L3 cache for the PUs.
Comparing the PS2 block diagram and the PSX diagram, the DMA functional units take up less space than the VU1 unit. But let's allow for extra complexity in the BE's DMACs and say each is equivalent in size to an APU.
DMAC at 65nm = APU = 5.3 mm^2
We need 4 DMACs = 5.3*4 ~ 21 mm^2
We now have used = 305 + 21 ~ 326 mm^2
And finally, shall we add some L3 cache for the PUs for good luck?
Well, L3 cache will be more complex than eDRAM, so let's assume L3 takes 2 times the area of the equivalent eDRAM.
4 MB eDRAM at 65 nm = 7.12 mm^2
4 MB L3 cache at 65 nm = 7.12 * 2 ~ 14 mm^2, we will distribute the 4 MB between the 4 PUs, so each PU has 1 MB L3 cache.
We now have used = 326 + 14 ~ 340 mm^2 at 65 nm, and our goal was 300 mm^2. We used a process drop from 90 nm to 65 nm.
Well, can you get dies as large as 340 mm^2? Now's a good time to show a 389 mm^2 die from IBM, the Power5 core below at 130 nm!
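Here is the running tally as a short Python sketch, so the budget is easy to re-check or tweak. It starts from the rounded per-unit figures worked out above (5.3 mm^2 per APU, 7.12 mm^2 per 4 MB of eDRAM), plus the post's assumptions that a PU and a DMAC each cost one APU and that L3 costs twice the equivalent eDRAM. The variable names are my own.

```python
APU_MM2 = 5.3          # one APU at 65 nm (rounded figure from the post)
EDRAM_4MB_MM2 = 7.12   # 4 MB eDRAM at 65 nm (rounded figure from the post)
BUDGET_MM2 = 300.0     # upper limit assumed at the start of the post

components = [
    ("32 APUs", 32 * APU_MM2),
    ("64 MB eDRAM", (64 / 4) * EDRAM_4MB_MM2),
    ("4 PUs", 4 * APU_MM2),             # ASSUMPTION: PU area = APU area
    ("4 DMACs", 4 * APU_MM2),           # ASSUMPTION: DMAC area = APU area
    ("4 MB L3 cache", 2 * EDRAM_4MB_MM2),  # ASSUMPTION: L3 = 2x eDRAM area
]

total = 0.0
for name, area in components:
    total += area
    print(f"{name:<14} {area:7.1f} mm^2   running total {total:7.1f} mm^2")

print(f"over budget by {total - BUDGET_MM2:.0f} mm^2")
```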
Source
Okay, so the BE comes to 340 mm^2 at 65 nm. Recall we dropped from 90 to 65 nm using the PSX core as reference, and that I assumed the PSX was fully 90 nm when it is really a hybrid 90/130 nm process. I'll calculate the two extremes.
From above, a drop from 90 to 65 nm gave us = (90/65)^2~ 1.92 area increase which we've included in our calculations.
If the PSX die were instead entirely 130 nm, there would be an extra drop from 130 to 90 nm = (130/90)^2 ~ 2.09 area increase, and if we factor that into our calculations,
BE at 65nm assuming PSX die at 130nm and 86 mm^2 area = 340/2.09 ~ 163 mm^2
BE at 65nm assuming PSX die at 90nm and 86 mm^2 area = 340 mm^2
So the BE would range from 163 mm^2 to 340 mm^2 using a hybrid PSX core at 90/130 nm as reference. Assuming they use a full 65nm process, and taking the midpoint of the two extremes,
The average PS3 BE at 65nm = (340 + 163)/2 ~ 251 mm^2
The PS2 EE at 250nm = 240 mm^2
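The two extremes and their midpoint can be sketched in a few lines of Python. The 340 mm^2 starting point is the total from above; treating the hybrid process as lying somewhere between "all 90 nm" and "all 130 nm", and simply averaging the two, is this post's simplification rather than anything rigorous. The names are my own.

```python
def shrink_factor(old_nm, new_nm):
    """Ideal area gain from a process shrink: linear ratio squared."""
    return (old_nm / new_nm) ** 2


# Total BE area at 65 nm if the PSX reference die is treated as fully 90 nm.
be_if_psx_is_90nm = 340.0

# If the PSX die were fully 130 nm, the same blocks shrink by one more step.
be_if_psx_is_130nm = be_if_psx_is_90nm / shrink_factor(130, 90)

# ASSUMPTION from the post: take the midpoint for the hybrid 90/130 nm case.
be_midpoint = (be_if_psx_is_90nm + be_if_psx_is_130nm) / 2

print(f"pessimistic (PSX = 90 nm):  {be_if_psx_is_90nm:.0f} mm^2")
print(f"optimistic  (PSX = 130 nm): {be_if_psx_is_130nm:.0f} mm^2")
print(f"midpoint:                   {be_midpoint:.0f} mm^2")
```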
The BE includes:
32 APUs with a total of 4MB SRAM local storage,
4 PUs with a total of 128 KB L1 cache, 512 KB of L2 cache and 4 MB of L3 cache,
and 64MB of eDRAM
on a die size of 251 mm^2 at 65 nm !!!
and that's not including the rumoured use of capacitorless eDRAM to save even more space!
QED
* Runs away into the sunset...*