CELL Patents (J Kahle): APU, PU, DMAC, Cache interactions?

Fafalada said:
gubbi said:
If they really apply to CELL, I can see snooping used in small scale CELL systems and directories used in large scale systems.
It could sort of make sense though couldn't it? PE only needs to scale up to 8 APUs, while external communication is expected to scale much higher.

Yeah, that definitely sounds right: optimise L2 for the vertical APUs and optimise L3 for the horizontal PUs. :p

Fafalada said:
Jaws, don't you think you're going a bit overboard with all the cache? If there'll be THAT much eDram don't expect large or lots of caches. I somehow doubt we'll see the massive eDram pool though.

The second patent mentions the directory as L3 cache. You're lucky I didn't add L4 cache! :D Anyway I didn't supply figures for L1, L2, L3 caches, they may not have to be that large.

The way I look at it, but correct me if I'm wrong, is that the L2 is feeding the LS of all 8 APUs, so it's gonna run dry, so the L3 is there to top it up.

Also, as the Apulets contain both data/program and instructions, the way I see it is that local variables are stored amongst the LS of the APUs in a PE and the L2 cache of the PU, and global variables are stored amongst PEs in the L3 caches of the PUs. So local-type Apulets get scheduled vertically along a PE and global-type Apulets get scheduled horizontally across PEs. Wouldn't that hide latencies with the above patents? Or is it the other way round! :?
 

DeanoC said:
.....
The fact these are IBM patents is fairly important, apart from DMAC this describes 'another' system pretty well as well.

BTW Latencies are still stupidly high even with lots of cache.

So these patents describe, in a similar fashion, the Xe GPU's SIMD local memories / texture memories accessing the Xe CPU's L2 cache?

In your opinion what would you change so that latencies aren't stupidly high? :p

On a side note, if these patents do apply to both Xe and Cell, are MS and Sony both privy to any patents/IPs that are discovered in the STI R&D lab at Austin, Texas?
 

Jaws said:
So these patents describe, in a similar fashion, the Xe GPU's SIMD local memories / texture memories accessing the Xe CPU's L2 cache?
An embedded multi-core CPU has to have a flexible cache architecture. You need to synchronize the cores, ideally have scratchpad-like functionality, and allow access from remote processors/GPUs.

I'm not saying it's exactly the same, but if these patents are granted, both CPUs would probably infringe.

Jaws said:
In your opinion what would you change so that latencies aren't stupidly high? :p
Read/write linearly. Even on-chip RAM has double-figure latency at high clock speeds. Currently there are no solutions on the horizon.
 
If you know which pieces of data will be needed next, they can be preread and put in scratchpad memory. SRAM L1 still has single-figure access time (in clocks) even in the fastest chips today, and so should an SRAM scratchpad. It might even be faster, since the memory doesn't have to be 'searched' for the requested address the way the set associativity and 'row' layout of cache memories require.
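The prefetch-into-scratchpad idea Guden describes is basically double buffering: fetch the next block of data while working on the current one. A toy Python sketch of the access pattern (purely illustrative — the chunk sizes and the "DMA" copy are stand-ins of mine, not anything from the patents):

```python
# Illustrative double buffering: stage the next chunk into one scratchpad
# buffer while computing on the other, so the fetch latency is hidden.

def process(chunk):
    # Stand-in for real work on data already resident in the scratchpad.
    return sum(chunk)

def double_buffered_sum(main_memory, chunk_size):
    chunks = [main_memory[i:i + chunk_size]
              for i in range(0, len(main_memory), chunk_size)]
    scratch = [None, None]      # two scratchpad buffers
    scratch[0] = chunks[0]      # initial prefetch into buffer 0
    total = 0
    for i, _ in enumerate(chunks):
        nxt = (i + 1) % 2
        if i + 1 < len(chunks):
            scratch[nxt] = chunks[i + 1]   # "DMA" the next chunk in
        total += process(scratch[i % 2])   # compute on the current buffer
    return total

print(double_buffered_sum(list(range(16)), 4))  # → 120
```

In a real design the prefetch and the compute would overlap in time; here they just alternate buffers to show the pattern.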
 
Guden Oden said:
If you know which pieces of data will be needed next, they can be preread and put in scratchpad memory. SRAM L1 still has single-figure access time (in clocks) even in the fastest chips today, and so should an SRAM scratchpad. It might even be faster, since the memory doesn't have to be 'searched' for the requested address the way the set associativity and 'row' layout of cache memories require.

But you'd be lucky to get 128 KB of fast SRAM total at >1 GHz in a chip today. Are you suggesting that each APU will get 128 KB of fast SRAM? :oops: (For 32 APUs, that's a cool 4 MB of >1 GHz SRAM; Intel and IBM would KILL for that :) )

It seems likely that they will either have a small amount, say 8-16 KB, of very fast RAM (either L1 cache or scratchpad) backed up by L2 cache, like normal processors, or a largish amount (say 128 KB) of slower (L2-speed) RAM with lots of registers.

Note:
Even the Power5's huge caches are mostly L2 and L3. It's 'only' 64 KB of L1 per core. Its cache-monster configuration of 144 MB of cache has less than 1 MB of L1 cache.
 
I do not expect the APUs to have less than 128 KB of Local Storage.

I am not expecting any more than 8 APUs at 4 GHz (1 PE) or 16 APUs (2 PEs) at 2 GHz for the CPU.

Vince come on, do not growl ;).
 
Deano,

I think that having several isolated SRAM pools (a little for each APU) is easier than a single monolithic L1 pool shared by a lot of units, like in a Power5.

You need, for the SRAM to be accessed quickly, to make sure it is as close as possible to the units which will fetch from it or write to it: 128 KB is small enough that it can fit very well inside each APU.

Each local storage is not a cache: it is the RAM of the APU (for all we know so far).

They might add a small block of cache in the APU, but so far the LS is not proper cache.

That means that a good amount of complex logic ( cache tags, etc... ) is not needed.
 
DeanoC said:
Guden Oden said:
If you know which pieces of data will be needed next, they can be preread and put in scratchpad memory. SRAM L1 still has single-figure access time (in clocks) even in the fastest chips today, and so should an SRAM scratchpad. It might even be faster, since the memory doesn't have to be 'searched' for the requested address the way the set associativity and 'row' layout of cache memories require.

But you'd be lucky to get 128 KB of fast SRAM total at >1 GHz in a chip today. Are you suggesting that each APU will get 128 KB of fast SRAM? :oops: (For 32 APUs, that's a cool 4 MB of >1 GHz SRAM; Intel and IBM would KILL for that :) )

Shurely some mistake, Deano. Did you mistake KB for MB?

The Athlon has 128 KB of level 1 cache (split between I and D caches, yes) clocking at 2.4 GHz (0.416 ns cycle time), with a load-to-use penalty of 3 cycles (1.25 ns random access latency).

Stuffing lots of RAM on chip seems to be a good way to boost performance in future CPUs.

Cheers
Gubbi
 
DeanoC said:
But you'd be lucky to get 128 KB of fast SRAM total at >1 GHz in a chip today

Not sure what you mean. SRAM is not the limiting factor in chips today, actually it should be anything BUT the limiting factor; logic seems to be much more problematic.

are you suggesting that each APU will get 128 KB of fast SRAM?

That's the word if the patents are to be believed. Then again, they might not be worth more than the paper they're written on. Who knows, we gotta wait and see.
 
Panajev2001a said:
I do not expect the APUs to have less than 128 KB of Local Storage.

I am not expecting any more than 8 APUs at 4 GHz (1 PE) or 16 APUs (2 PEs) at 2 GHz for the CPU.

Vince come on, do not growl ;).

Are we now doubting that the BroadBand Engine is the CPU in PS3 :?: This has to be the most fickle subject in the history of fickle subjects! :? :D

Okay, what's wrong with the BE now :?:

Cell-mem.jpg
 
Jaws said:
Panajev2001a said:
I do not expect the APUs to have less than 128 KB of Local Storage.

I am not expecting any more than 8 APUs at 4 GHz (1 PE) or 16 APUs (2 PEs) at 2 GHz for the CPU.

Vince come on, do not growl ;).

Are we now doubting that the BroadBand Engine is the CPU in PS3 :?: This has to be the most fickle subject in the history of fickle subjects! :? :D

Okay, what's wrong with the BE now :?:

Cell-mem.jpg

No, the Broadband Engine is still the CPU: what is the Broadband Engine, though?

;).
 
Panajev2001a said:
No, the Broadband Engine is still the CPU: what is the broadband engine though ? ;)

Hmmm, we all know what the Suzuoki definition of a broadband engine is ! ;)

For those not familiar, it's the above diagram: 4 PEs, with 32 APUs, 4 PUs and 64 MB of eDRAM...

Okay, I didn't want to do this but hey, I'll attempt to show that this order of magnitude is possible! :D ...the following is key,

gp.gif


Source...

The above is the EE+GS die for the PSX, manufactured on a hybrid 90/130 nm process; the die size is 86 mm^2.

Okay, if anyone can find a higher res image of that die, I'd be extremely grateful. Also if anyone can translate that Japanese site then I'd also be grateful. Babel Fish is not being friendly! :( ...

I'll be back to finish the attempt... ;)
 
What's really interesting in that die shot is that the eDRAM part of the GS side of the die has shrunk a lot more than the logic part. Look at an image of the original GS die and eDRAM takes up something like 80% of the area.

Cell will - at least according to rumors - use capacitor-less eDRAM which will be much denser still. Maybe a substantial amount like 32 or even 64MB won't be such a pipe-dream after all. ;)
 
Guden Oden said:
What's really interesting in that die shot is that the eDRAM part of the GS side of the die has shrunk a lot more than the logic part. Look at an image of the original GS die and eDRAM takes up something like 80% of the area.

And in that picture, the eDRAM is built on 130nm, only the logic of the SoC uses the Low-K 90nm process.

I don't know if it still holds, but as of early ~2003, Sony planned to start with sSOI 65nm logic and mix-load the eDRAM. The capacitor-less [FBC] eDRAM is built on 45nm due out in late 2005.
 
Okay, as promised here goes...

PS2-block.jpg


Above is the PS2's EE block diagram showing the major components.


EE-GS.jpg


Above is an image of the combined cores of the PS2's EE + GS as manufactured for the PSX. It is manufactured on a hybrid 90/130 nm process and the die size is 86 mm^2. The image is placed on a grid so that we can estimate unit areas for the different components.

Okay, I hope to show that the BroadbandEngine as shown in prior posts is feasible using the PSX core. The main components of the BE are;

32 APUs, 4 PUs, 4 DMACs, 64 MB eDRAM and L1, L2 and possibly L3 cache for the PUs.

Before I start, I'm going to mention some assumptions. The PS2's CPU, the EE, was introduced at 250 nm with a die area of 240 mm^2. The PS2's Graphics Synthesizer (GS) was introduced at 250 nm with a die area of 279 mm^2. Shortly after the PS2's release, Sony went to 180 nm. PS3 will likely debut at 65 nm and move shortly afterwards to 45 nm. It is likely that they will again introduce the CPU and GPU of PS3 at around a 200-300 mm^2 die size. I will take 300 mm^2 as the absolute upper limit on die size, and it will be the basis for showing that the BE is feasible at 65 nm.

Also note that by taking these unit areas from the PSX core, we're also inheriting the areas of datapaths etc., which would scale with our calculations.

Also, the PSX core is a hybrid 90/130 nm process; it is not fully 90 nm. Source: EETimes. I'll assume it was at 90 nm (a pessimistic assumption for my calculations, as will be revealed later...).

I'll start with the 32 APUs

I'm going to use the VU1 as a guide to show the feasibility of the number of APUs in the BroadBand Engine.

APU.jpg


Above shows the basic components of the APU, the registers are 128*128 bit, the local memory is 128 KB of SRAM, 4 FMACs and 4 IUs.

vu1.jpg



Above, the VU1 core in the EE: there are 32*128-bit registers, 32 KB of local memory, 5 FMACs, 2 FDIVs and other units etc.

The APU and the VU1 are comparable in terms of the number of execution units (the VU1 has more), but the APUs have larger registers and local memories. However, this IBM patent, Processor implementation having unified scalar and SIMD datapath, describes the APUs with shared datapaths for space-saving and power-saving features. I'll base my calculations on the assumption that the APUs are 1.5 times larger than the VU1.

1.5 VU1 ~ APU in terms of die area.

The PSX die is 82*46 units in the diagram ~ 3772 square units

The VU1 core is 11*27 units in the diagram ~ 297 square units

APU = 1.5 * 297 ~ 446 square units

APU as a % of PSX core = 446/3772 * 100 ~ 11.81% of the PSX core

PSX core = 86 mm^2

APU area = 11.81/100 * 86 ~ 10.16 mm^2, and remember this would be at the 90 nm process.

The area gained by dropping to 65 nm = (90/65)^2 ~ 1.92 times more area available, assuming the tools scale accordingly.

Therefore, the equivalent area for an APU at 65 nm = 10.16/1.92 ~ 5.3 mm^2.

We need 32 APUs = 5.3 * 32 ~ 170 mm^2 (remember we have 300 mm^2 available).
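For anyone who wants to check the arithmetic, here is the APU estimate as a few lines of Python. All the inputs — the grid counts, the 86 mm^2 die, the 1.5x VU1 factor and the 90 nm assumption — are the guesses from the working above, not hard figures:

```python
# Sanity-check of the APU area estimate, using the figures above.
psx_units = 82 * 46                  # whole PSX die on the grid (3772)
vu1_units = 11 * 27                  # VU1 core on the grid (297)
apu_units = 1.5 * vu1_units          # assumed: one APU ~ 1.5x a VU1

apu_mm2_90 = round(apu_units / psx_units * 86.0, 2)  # PSX die = 86 mm^2
apu_mm2_65 = round(apu_mm2_90 / 1.92, 1)             # (90/65)^2 ~ 1.92 shrink
print(apu_mm2_65, round(32 * apu_mm2_65))            # → 5.3 170
```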


Let's move on to the 64 MB of eDRAM

Looking at the diagram, each yellow core on the GS side of the PSX represents 1 MB,

1 MB eDRAM = 15*10 ~ 150 square units.
4 MB eDRAM = 150*4 ~ 600 square units

4 MB eDRAM as a % of PSX core = 600/3772 * 100 ~ 15.91% of the PSX core

PSX core = 86 mm^2

4 MB eDRAM = 15.91/100 * 86 ~ 13.68 mm^2 at the 90 nm process

The area gained by dropping to 65 nm = (90/65)^2 ~ 1.92 times more area available, assuming the tools scale accordingly.

Area of 4 MB eDRAM at 65 nm = 13.68/1.92 ~ 7.12 mm^2

We need 64 MB of eDRAM = 64/4 * 7.12 ~ 114 mm^2 at 65 nm
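The same check works for the eDRAM figures (inputs again are the grid counts and the assumed 1.92x shrink from the working above):

```python
# Sanity-check of the eDRAM area estimate, using the figures above.
psx_units = 82 * 46                                        # 3772 grid squares
edram_4mb_90 = round(4 * (15 * 10) / psx_units * 86.0, 2)  # ~13.68 mm^2 at 90 nm
edram_4mb_65 = edram_4mb_90 / 1.92                         # ~7.12 mm^2 at 65 nm
print(round(16 * edram_4mb_65))                            # 64 MB total → 114
```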

Area at 65 nm of 32 APUs and 64 MB eDRAM = 170 + 114 = 284 mm^2 ( out of 300 mm^2)


Let's move on to the PUs with L1 and L2 cache

The PU will really be a glorified core to schedule Apulets (software cells) to the APUs, and as such doesn't need to be anything fancy, IMO.

If we assume a PU has 32 KB of L1 cache and 128 KB of L2 cache, and will probably need fewer execution units than an APU, we can approximate a PU's area as equivalent to an APU's, if not less.

PU at 65 nm = APU = 5.3 mm^2

We need 4 PUs = 5.3 * 4 ~ 21 mm^2 at 65 nm

Area of 32 APUs, 64 MB eDRAM and 4 PUs = 170 + 114 + 21 ~ 305 mm^2 (we had 300 mm^2 available, but I'll explain shortly...)


Let's move on to the DMACs and L3 cache

Let's recap,

We've used 305 mm^2 out of an available 300 mm^2. We've scaled down from 90 nm to 65 nm. We've accounted for,

32 APUs, 4 PUs and 64 MB eDRAM.

We need to add the DMACs and if we're lucky L3 cache for the PUs.

Comparing the PS2 block diagram and the PSX diagram, the DMA functional units take up less space than the VU1 unit. But let's add extra complexity for the BE's DMACs and say they are equivalent in size to an APU.

DMAC at 65nm = APU = 5.3 mm^2

We need 4 DMACs = 5.3*4 ~ 21 mm^2

We now have used = 305 + 21 ~ 326 mm^2

And finally, shall we add some L3 cache for the PUs, for good luck! :D

Well, L3 cache will be more complex than eDRAM, so let's assume L3 will be 2 times the area of the equivalent eDRAM.

4 MB eDRAM at 65 nm = 7.12 mm^2

4 MB L3 cache at 65 nm = 7.12 * 2 ~ 14 mm^2; we will distribute the 4 MB between the 4 PUs, so each PU has 1 MB of L3 cache.

We have now used 326 + 14 ~ 340 mm^2 at 65 nm, and our goal was 300 mm^2. We used a process drop from 90 nm to 65 nm.
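Tallying the five components gives the headline figure (all areas are the rough estimates from the steps above):

```python
# Estimated BE component areas at 65 nm, in mm^2, per the working above.
be_components = {"32 APUs": 170, "64 MB eDRAM": 114,
                 "4 PUs": 21, "4 DMACs": 21, "4 MB L3": 14}
print(sum(be_components.values()))  # → 340
```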

Well, can you get dies as large as 340 mm^2? Now's a good time to show a 389 mm^2 die from IBM: the Power5 core below, at 130 nm!


nove3.jpg


Source

Okay, the BE comes to 340 mm^2 at 65 nm. Recall we dropped from 90 to 65 nm using the PSX core as reference. Also recall we assumed a hybrid 90/130 nm process for the PSX. I'll calculate the two extremes.

From above, a drop from 90 to 65 nm gave us = (90/65)^2~ 1.92 area increase which we've included in our calculations.

If we drop from 130 to 90 nm, we get (130/90)^2 ~ 2.09 times the area, and if we factor that into our calculations,

BE at 65nm assuming PSX die at 130nm and 86 mm^2 area = 340/2.09 ~ 163 mm^2

BE at 65nm assuming PSX die at 90nm and 86 mm^2 area = 340 mm^2

So the BE would range from 163 mm^2 to 340 mm^2, using the hybrid 90/130 nm PSX core as reference. Assuming they use a full 65 nm process,

The average PS3 BE at 65nm = (340+ 163)/2 ~ 251 mm2

The PS2 EE at 250nm = 240 mm2
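The two process-node assumptions bracket the estimate; a quick check of the range (the 340 mm^2 input is the all-90-nm case worked out above):

```python
# Bracket the BE estimate between the two PSX process-node assumptions.
be_if_psx_90nm = 340                                        # mm^2 at 65 nm
be_if_psx_130nm = round(be_if_psx_90nm / (130 / 90) ** 2)   # extra ~2.09x shrink
print(be_if_psx_130nm, (be_if_psx_90nm + be_if_psx_130nm) // 2)  # → 163 251
```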

The BE includes;

32 APUs with a total of 4MB SRAM local storage,

4 PUs with a total of 128 KB L1 cache, 512 KB of L2 cache and 4 MB of L3 cache,

and 64MB of eDRAM

on a die size of 251 mm2 at 65 nm !!!
:D

and that's not including the rumoured capacitorless eDRAM to save even more space! :D

QED

* Runs away into the sunset...*
 
Vince said:
....
And in that picture, the eDRAM is built on 130nm, only the logic of the SoC uses the Low-K 90nm process.
....

Thanks Vince for that extra info! :p

BE at 65nm with PSX at 90nm = APU + eDRAM + PU + DMAC + L3 Cache = 170 + 114 + 21 + 21 + 14 = 340 mm2

The eDRAM drops from 130 nm rather than 90 nm, so it shrinks by a further (130/90)^2 ~ 2.09: 114/2.09 ~ 55 mm^2

Therefore BE at 65nm = 170 + 55 + 21 + 21+ 14 ~ 281 mm2
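With Vince's correction (the eDRAM macros were on 130 nm, not 90 nm), the eDRAM term picks up the extra (130/90)^2 ~ 2.09 shrink on top of the earlier figures:

```python
# Revised BE total: apply the extra 130 nm -> 90 nm factor to the eDRAM term.
edram_65 = round(114 / (130 / 90) ** 2)   # 114 mm^2 / ~2.09 → 55 mm^2
print(170 + edram_65 + 21 + 21 + 14)      # revised BE total → 281
```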

Cell-mem.jpg


The BE includes;

32 APUs with a total of 4MB SRAM local storage,

4 PUs with a total of 128 KB L1 cache, 512 KB of L2 cache and 4 MB of L3 cache,

and 64MB of eDRAM

on a die size of 281 mm2 at 65 nm !!! :D


and that's still not including the rumoured capacitorless eDRAM to save even more space!

QED

* Runs away into the sunset again...*
 
You missed all the routing logic (buses, etc.), the control logic, etc., and forgot that eDRAM scales better than linearly (a good thing).

You forgot heat: that beast would be very hot, especially at 4 GHz.

Still, we will see.

Until I hear Sony saying the contrary I will go by the birds and say 256 GFLOPS.

I'd much rather have a single PE running at 4 GHz than 4 PEs running at 1 GHz, btw.
 
Panajev2001a said:
You missed all the routing logic (buses, etc.), the control logic, etc., and forgot that eDRAM scales better than linearly (a good thing).

I did mention those in my assumptions, i.e.

Jaws said:
Also note that by taking these unit areas from the PSX core, we're also inheriting the areas of datapaths etc., which would scale with our calculations.


Panajev2001a said:
You forgot heat: that beast would be very hot, especially at 4 GHz.

I wasn't disputing any clock rates, and therefore heat. I was merely trying to show that the BE is feasible at 65 nm. But yes, that's a very important point. The BE could be nicknamed the CE... CombustionEngine! :p

Panajev2001a said:
Still, we will see.

Hopefully we'll get an indication with the Cell workstations in Q4 2004 and the PS3 revelations in March 2005.

Panajev2001a said:
Until I hear Sony saying the contrary I will go by the birds and say 256 GFLOPS.

If the BE clocks between 1-4 GHz, that still gives scope for 256-1024 GFLOPS. Hope they can use liquid nitrogen for cooling! :p
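The 256-1024 GFLOPS range follows from the patent's 4 FMACs per APU if you count a fused multiply-add as 2 FLOPs — that per-cycle figure is my reading, since it's what makes the oft-quoted 256 GFLOPS at 1 GHz come out:

```python
# Peak-FLOPS bookkeeping for the 32-APU BE (FLOPs/cycle figure is assumed).
apus = 32
flops_per_cycle = 4 * 2     # 4 FMACs x 2 FLOPs per multiply-add (assumed)
for ghz in (1, 4):
    print(ghz, "GHz:", apus * flops_per_cycle * ghz, "GFLOPS")  # 256 / 1024
```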

Panajev2001a said:
I'd much rather have a single PE running at 4 GHz than 4 PEs running at 1 GHz, btw.

Scared of multi-processors! ;)
 
Jaws said:
Scared of multi-processors!

No, I want better cache for the APUs and the PU.

I want more general purpose performance: remember that according to that patent, the APUs have 1/4 of their peak throughput (SIMD/vector processing) when doing scalar operations.

I want the PU to be easier to program and to be able to better direct the APUs.
 