How large will Cell be ?

...

PSX2OAC @ 90 nm : 55 million transistors @ 86 mm2
Dothan @ 90 nm : 150+ million transistors @ 88 mm2

Don't run away from the truth, SCEI fabs are WORSE, not better, than Intel's fabs.
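
To put numbers on the comparison, here is a minimal sketch that works out the transistor density implied by the two figures quoted above. The transistor counts and die areas are taken straight from the post; "150+" is treated as exactly 150 million, so the resulting ratio is a lower bound under that assumption.

```python
# Transistor counts and die areas as quoted in the post (not independently verified).
chips = {
    "PSX2OAC": (55e6, 86.0),   # (transistors, die area in mm^2)
    "Dothan": (150e6, 88.0),   # "150+" taken as exactly 150 million
}

# Density in transistors per mm^2 for each chip.
density = {name: count / area for name, (count, area) in chips.items()}

# How many times denser Dothan is than PSX2OAC on these figures (about 2.7x).
ratio = density["Dothan"] / density["PSX2OAC"]

for name, value in density.items():
    print(f"{name}: {value / 1e6:.2f} million transistors per mm^2")
print(f"Dothan / PSX2OAC density ratio: {ratio:.2f}")
```

The raw ratio says nothing about what kind of transistors are being counted (logic, SRAM, e-DRAM), which is the point the replies argue over.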
 
Re: ...

DeadmeatGA said:
Are you going insane ?
That's what I would like to ask you.

I am not the person who showed such reading comprehension problems.

1.) You can have a slow clock for the e-DRAM and higher for the APUs.
Show me one example of eDRAM chip ticking over 1 Ghz.

Ok, let's re-read what I said:

You can have a slow clock for the e-DRAM and higher for the APUs.

Slow for the e-DRAM... higher for the APUs.

Your answer:

Show me one example of eDRAM chip ticking over 1 Ghz.

I rest my case.

2.) The PUs will not be G4s: the PU does not even need a Vector Unit like the Altivec Unit used in the G4, do you even think before you type lately ?
How much do you save from G4 by eliminating AltiVec? Not much. The fact is that a very powerful CPU is needed to service 8 data-hungry APUs so SCEI cannot afford to put a simple CPU in there. Whatever it is, it should be desktop grade.

I think that even a 2 GHz ARM11 core could do a good enough job: shrink it to 65 nm and add a 64-128 KB L2 Cache ( in addition to the 32 KB L1 Cache ).

The PUs do not have to do the bulk of the processing: a complex super-scalar core ( more than 2-way super-scalar ) with OOOe and complex branch prediction is not needed.

You could have said they needed a POWER5 in each PU, but I am sure you are saving this argument for a later date.

3.) About the APU's size... if you EVEN READ my post before pressing the submit button you would have read that I have set 22 mm^2 for all the SRAM ( 4 MB ) according to CMOS5 SRAM cell size provided
You are forgetting the spaces between the SRAM cell and the bus wire...

You are forgetting that the real number was ~20 mm^2 and I added 2 mm^2 thinking about things like that too.

You really seem to be trying to artificially bloat the size of the APUs.
No I am not, APU will be significantly bigger than VU1 fabbed on the same process, simply because it is far more complex and has a lot more in it. Since VU1 is estimated to be around 4~5 mm2 at 65 nm, APU will be at least double to triple the size of VU1.

I see how you calculated APUs' size and how you refused my figures: double to triple the size of VU1 ( btw, you should also take away the EFU though, to be honest... I bet you "forgot" about the extra FMAC and FDIV, didn't you ? ;) )

Use your common sense. PSX2OAC and PSP tell you that SCEI's fab technology is not any better than its rivals' (in fact Dothan has three times the transistor density of PSX2OAC), so why are you expecting them to cram half a billion transistors on a die and have it clock at 4 GHz while burning little power? SCEI can't do what Intel and IBM can't do. It is as simple as that.

1.) Often two chips produced by the same company can show quite different transistor density figures. What does that tell you ? Well to you, while talking about SCE, it says nothing, but that is to be expected.

2.) I expect SCE + Toshiba + IBM to do what IBM can't do ( SCE + Toshiba + IBM > IBM ) and basic math should provide an easy plausibility argument for the inequality I provided.

3.) I suggest you read about multiple clock domains, after you finish reading about the different clock domains on the Pentium 4.
 
Re: ...

DeadmeatGA said:
PSX2OAC @ 90 nm : 55 million transistors @ 86 mm2
Dothan @ 90 nm : 150+ million transistors @ 88 mm2

Don't run away from the truth, SCEI fabs are WORSE, not better, than Intel's fabs.

Hey, I am not the one running with my fingers in my ears and screaming on top of my lungs "It ain't TRUEEEE, DEATH TO SCE AND KUTARAGI-SAMAAAAA".

Your arguments are way too biased by your LOATHING of SCE: if you just hated SCE it would be ok ( there are people who love SCE and can still use their brains ), but you passed to the next step, you LOATHE SCE and anyone can see it.

Keep stretching reality thin with half-truths and your quest of teaching the ignorant fanboys will not proceed very fast.
 
Deadmeat said:
How much do you save from G4 by eliminating AltiVec? Not much. The fact is that a very powerful CPU is needed to service 8 data-hungry APUs so SCEI cannot afford to put a simple CPU in there. Whatever it is, it should be desktop grade.
Eh, it has been suggested APUs have the ability to feed themselves so to speak, being able to start DMA from their side (I actually think this is something necessary for things to work well but I could be wrong of course).
Either way I don't see the PU being desktop grade or whatever being all that much of a fact as you suggest.
 
Hey Panajev,

The Register had a diagram of that SPARC64 VI; if your thread wasn't locked I would post it there, so instead I'll post it here.

sparcthree.gif
 
Re: ...

DeadmeatGA said:
PSX2OAC @ 90 nm : 55 million transistors @ 86 mm2
Dothan @ 90 nm : 150+ million transistors @ 88 mm2

Don't run away from the truth, SCEI fabs are WORSE, not better, than Intel's fabs.

Apples and oranges.

A DRAM cell is larger than a single SRAM transistor (because you need a capacitor, either trenched or stacked). SRAM, however, uses 6 transistors per bit while DRAM uses one.

Cheers
Gubbi
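
A rough sketch of why raw transistor counts are apples and oranges here, assuming the 6-transistors-per-bit SRAM and 1-transistor-per-bit DRAM figures from the post. It counts storage cells only; decoders, sense amplifiers and redundancy are ignored, and the 4 MB capacity is picked simply because that is the size discussed in this thread.

```python
BITS_PER_MB = 8 * 1024 * 1024

def cell_transistors(megabytes, transistors_per_bit):
    """Transistors in the storage cells alone for a memory of the given size."""
    return megabytes * BITS_PER_MB * transistors_per_bit

# Same 4 MB capacity, two cell types.
sram_transistors = cell_transistors(4, 6)   # 6T SRAM cell
dram_transistors = cell_transistors(4, 1)   # 1T DRAM cell (plus a capacitor, not counted)

print(f"4 MB SRAM: {sram_transistors / 1e6:.1f} million transistors")
print(f"4 MB DRAM: {dram_transistors / 1e6:.1f} million transistors")
```

So a chip that spends its area on e-DRAM reports far fewer transistors per stored bit than one that spends it on SRAM cache, whatever the quality of the fab.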
 
...

Eh, it has been suggested APUs have the ability to feed themselves so to speak, being able to start DMA from their side (I actually think this is something necessary for things to work well but I could be wrong of course).
Either way I don't see the PU being desktop grade or whatever being all that much of a fact as you suggest.
The basic principle behind IBM's cellular architecture is the division of computation and I/O; BlueGene/L has two processors per node, one doing the computing and the other running Linux to perform I/O chores for the compute processor, this includes file I/O as well as message passing.

This design principle is carried over to CELL and you again see a clear division between processors, the APUs doing the number crunching and the PPC running Linux and some form of message passing mechanism to service the APUs. In essence, end developers don't deal with the PPC core directly; all their process source code will be compiled to APU code and developers use OS calls to pass messages to each other. So yes, the burden on the PPC core is substantial; while it doesn't run any user process code it still has to handle I/O and message passing.

the register had a diagram of that SPARC64 VI, if your thread wasn't locked I would post it there, so instead I'll post it here.
The L2 cache eats up 500 million transistors, leaving 170 million transistors to form two SPARC cores. Look how the 170 million transistor processor block is equal in size to the 500 million transistor L2 cache block; the transistor density of the processor block itself is distorted by the inclusion of the L1 cache.

For architectures with a large number of logic gates and few SRAM gates like CELL, a low transistor density and a massive die size are inevitable. Even two PUs are a tough fit on 250 mm2 EE3.
 
Re: ...

Gubbi said:
DeadmeatGA said:
PSX2OAC @ 90 nm : 55 million transistors @ 86 mm2
Dothan @ 90 nm : 150+ million transistors @ 88 mm2

Don't run away from the truth, SCEI fabs are WORSE, not better, than Intel's fabs.

Apples and oranges.

A DRAM cell is larger than a single SRAM transistor (because you need a capacitor, either trenched or stacked). SRAM, however, uses 6 transistors per bit while DRAM uses one.

Cheers
Gubbi

True Gubbi, plus I have never stated they had record size SRAM and e-DRAM cells in 90 nm...

However, the process node I expect is the SCE and Toshiba 45 nm SOI one: capacitor-less e-DRAM :)
 
Re: ...

DeadmeatGA said:
This design principle is carried over to CELL and you again see a clear division between processors, the APUs doing the number crunching and the PPC running Linux and some form of message passing mechanism to service the APUs. In essence, end developers don't deal with the PPC core directly; all their process source code will be compiled to APU code and developers use OS calls to pass messages to each other.

I don't think the message passing is exposed to developers. The tools/compilers are of course going to generate code that sends messages (CELLs) around, but developers are not likely to be bothered with this.

DeadmeatGA said:
So yes, the burden on the PPC core is substantial; while it doesn't run any user process code it still has to handle I/O and message passing.
All it has to do is lock regions and set up DMA channels to move blocks of data. My guess is that this has been vigorously simulated to ensure that this core is well capable of the task without being overkill.

DeadmeatGA said:
For architectures with large number of logic gates and few SRAM gates like CELL, a low transistor density and a massive die size is inevitable. Even two PUs are a tough fit on 250 mm2 EE3.

The entire reason to include eDRAM on the die is in order to optimize capacity vs. speed. This happens for all modern architectures. The Itanium 2 has three levels of cache, lvl 1 optimized for latency, lvl 2 optimized for bandwidth and lvl 3 optimized for capacity (size).

This is exactly the same rationale that is behind the structure of CELL. Local memory ensures low latency, high bandwidth operation of each APU, while the central eDRAM provides massive capacity and (very) decent bandwidth.

Cheers
Gubbi
 
Re: ...

Panajev2001a said:
<snip>
However, the process node I expect is the SCE and Toshiba 45 nm SOI one: capacitor-less e-DRAM :)

Unless they are going to use MRAM there'll be a capacitor in there. :)

In SOI it might very well be embedded in the SOI layer, but it will be in there (and influence size/speed/cost of the entire die).

Cheers
Gubbi
 
Re: ...

DeadmeatGA said:
Eh, it has been suggested APUs have the ability to feed themselves so to speak, being able to start DMA from their side (I actually think this is something necessary for things to work well but I could be wrong of course).
Either way I don't see the PU being desktop grade or whatever being all that much of a fact as you suggest.
The basic principle behind IBM's cellular architecture is the division of computation and I/O; BlueGene/L has two processors per node, one doing the computing and the other running Linux to perform I/O chores for the compute processor, this includes file I/O as well as message passing.

This design principle is carried over to CELL and you again see a clear division between processors, the APUs doing the number crunching and the PPC running Linux and some form of message passing mechanism to service the APUs. In essence, end developers don't deal with the PPC core directly; all their process source code will be compiled to APU code and developers use OS calls to pass messages to each other. So yes, the burden on the PPC core is substantial; while it doesn't run any user process code it still has to handle I/O and message passing.

the register had a diagram of that SPARC64 VI, if your thread wasn't locked I would post it there, so instead I'll post it here.
L2 cache eats up 500 million transistors, leaving 170 million transistors to form two SPARC cores. Look how 170 million transistor processor block is equal in size to the 500 million transistor L2 cache block, the transistor density of processor block portion itself is distorted because of the inclusion of L1 cache.

For architectures with large number of logic gates and few SRAM gates like CELL, a low transistor density and a massive die size is inevitable. Even two PUs are a tough fit on 250 mm2 EE3.

1.) they can go up to 300 mm^2 and a little beyond if they so choose: I thought any point I made in regard to that before were read by you, but I thought wrong.

2.) 4 MB of SRAM based Local Storages is not what I call few SRAM gates: even in your disastrous scenario we would have 2 MB of SRAM based LS.

3.) If you think that 2 PEs without ANY e-DRAM are going to be over 250 mm^2, I do not know what to tell you... wait I do know what to tell you: your estimates of PUs' size and APUs' size are way off ( we discussed it about the APU and still you keep going, without acknowledging what we just said, and keep up with your calculations even if the underlying assumptions had been shaken quite a bit ).

4.) CELL is a bit more than a quick-day re-design of Blue-Gene, just to let you know: the patents it draws from, the great guys IBM assigned up-front to start the work on the CELL ISA from "scratch" ( of course Blue-Gene and the whole Cellular architecture concept were taken into account ), etc. are not signs that IBM basically sold Blue-Gene to SCE and SCE started adding VUs. But once you form an opinion ( amazing how it is always against SCE ? You cannot say anything positive even about PSP and in theory, by what you said in the past, you should be drooling over it ), you will not listen to anybody.

5.) Blue-Gene does not need G4 class chips: using a PPC core can mean lots of things: also those processors were clocked at 0.5-1 GHz IIRC: I am talking about 2 GHz PUs and IMHO even a 2 GHz ARM11 core with 64-128 KB L2 could do the job of a PU in much less space than what you are speculating a PU will take.
 
Re: ...

Gubbi said:
Panajev2001a said:
<snip>
However, the process node I expect is the SCE and Toshiba 45 nm SOI one: capacitor-less e-DRAM :)

Unless they are going to use MRAM there'll be a capacitor in there. :)

In SOI it might very well be embedded in the SOI layer, but it will be in there (and influence size/speed/cost of the entire die).

Cheers
Gubbi

There is no "real" capacitor, embedded or not... between the SOI layer and the gate we can form enough capacitance ( damn... here we have that word again.... ) to store charge/data: yes, we have a "sort" of capacitor( :p ), but the size of the DRAM cell is greatly reduced as the next diagram shows.

img1302.gif


The size and cost of the DRAM cell should be reduced, especially its size allowing for greater e-DRAM density: the over-all cost of the e-DRAM based chip would still fall down because of the smaller area taken by the e-DRAM portion which could lead to an over-all smaller chip, even if the cost of that MOS transistor has grown in itself.

We can make a case for the over-all price drop for the whole chip off-setting the increase of cost per e-DRAM cell, or we can make the contrary case of course, as this balance varies from scenario to scenario.

As far as the clock-speed is concerned: multiple clock-domains.

The EE+GS@90 nm has 4 MB of e-DRAM clocked at 150 MHz and the EE portion of the chip runs mostly at 300 MHz.

What I envision for the PlayStation 3 CELL CPU chip is something similar: e-DRAM running at 1/2 of the PUs' clock ( 2 GHz / 2 = 1 GHz ) which also means 1/4 of the APU's clock ( APUs would be clocked at 2x the PUs' clock ).

Maybe the e-DRAM could be even clocked at 1/8th of APUs' clock ( 4 GHz / 8 = 500 MHz ), but we could use DDR signalling.

I think 1 GHz for the e-DRAM is obtainable though, but they have more than one way open to them.
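
The clock-domain arithmetic above can be sketched as follows. The 2 GHz PU clock, the 2x APU multiplier and the dividers are the speculative figures from this post, not announced specifications, and `divided_clock` is just an illustrative helper.

```python
# Speculative clocks from the post, in MHz.
PU_CLOCK_MHZ = 2000
APU_CLOCK_MHZ = 2 * PU_CLOCK_MHZ   # APUs assumed to run at 2x the PU clock

def divided_clock(source_mhz, divider):
    """Clock of a slower domain derived from a faster one by an integer divider."""
    return source_mhz / divider

# Option 1: e-DRAM at 1/2 of the PU clock, which is also 1/4 of the APU clock.
edram_fast_mhz = divided_clock(PU_CLOCK_MHZ, 2)

# Option 2: e-DRAM at 1/8 of the APU clock, with DDR signalling to double the transfer rate.
edram_slow_mhz = divided_clock(APU_CLOCK_MHZ, 8)
edram_slow_transfers_mt_s = 2 * edram_slow_mhz

print(f"APU clock: {APU_CLOCK_MHZ} MHz")
print(f"e-DRAM option 1: {edram_fast_mhz:.0f} MHz")
print(f"e-DRAM option 2: {edram_slow_mhz:.0f} MHz, {edram_slow_transfers_mt_s:.0f} MT/s with DDR")
```

Both options end up at the same transfer rate per pin; the second simply gets there with a slower e-DRAM core clock.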
 
...

To Gubbi

All it has to do is to lock regions and setup DMA channels to move blocks of data. My guess is that this has been vigorously simulated in order to ensure that this core is well capable of this task while not overkill.
The problem is that message passing has to work consistently regardless of the location of target process, be it on the same system or another system located half-way across the globe.

The entire reason to include eDRAM on the die is in order to optimize capacity vs. speed.
The eDRAM forming process itself hurts clockspeed.

To Panajev

1.) they can go up to 300 mm^2 and a little beyond if they so choose
SCEI can't. Past 100 mm2, the yield rate drops to 1/4th every time die size doubles. At 300 mm2, SCEI can't even fabricate enough working chips to make a launch.

: I thought any point I made in regard to that before were read by you, but I thought wrong.
When your point is outright wrong, I do not need to keep that in my memory.

2.) 4 MB of SRAM based Local Storages is not what I call few SRAM gates
2 MB, Damn it! Not 4 MB!

your estimates of PUs' size and APUs' size are way off
Well, I calculate my numbers from SCEI's actual product. I don't know how you come up with your numbers.

4.) CELL is a bit more than a quick-day re-design of Blue-Gene
I am not implying that, I am simply stating that it borrows heavily from the BlueGene concept.

5.) Blue-Gene does not need G4 class chips
Of course not, since each I/O engine needs to serve only one compute engine. This is not the case with CELL.
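
For reference, the yield rule of thumb stated in this post ("past 100 mm2, the yield rate drops to 1/4th every time die size doubles") can be written out as below. This is a sketch of the claim taken at face value, not a real yield model; expressing the result relative to the yield at 100 mm2 and the name `relative_yield` are assumptions made for illustration.

```python
import math

def relative_yield(die_area_mm2, factor_per_doubling=0.25, knee_mm2=100.0):
    """Yield relative to a 100 mm^2 die under the claimed rule of thumb."""
    if die_area_mm2 <= knee_mm2:
        return 1.0
    doublings = math.log2(die_area_mm2 / knee_mm2)
    return factor_per_doubling ** doublings

for area in (100, 200, 224, 300):
    print(f"{area} mm^2: {relative_yield(area):.3f} of the 100 mm^2 yield")
```

Under this rule a 300 mm2 die yields one ninth of what a 100 mm2 die does, and a 224 mm2 die about one fifth.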
 
SCE's fabs are better than when they were working with PSOne and when they started working on PlayStation 2 and your inductive approach is failing to measure up with this evidence.

PlayStation 2 launched with a 279 mm^2 GS and the 180 nm EE was 224 mm^2.

They were using 200 mm Wafers at best.

I will use some fuzzy math too since you love to use it to attack SCE.

You must accept that yields on the 180 nm EE were not bad and Sony did use that CPU for a while so it was viable.

Everything else except area of the chip staying equal going from 200 mm Wafers to 300 mm Wafers we could in theory have 1.5x bigger CPUs compared to the 180 nm EE.

224 * 1.5 = 336.

If I say 280-290 mm^2, this would mean about 83-86% of the potential area increase.

Also, my point about the issue regarding your APU's size estimate compared to my estimates was that you were taking into account the SRAM for the APU while I was already taking that into account separately: 20 mm^2 + 2 mm^2 were set aside for 4 MB of SRAM.

I can almost take your numbers for the APUs, but I have to take back those 22 mm^2 so our numbers do not change much.

Of course not, since each I/O engine needs to serve only one compute engine. This is not the case with CELL.

And perhaps you are underestimating what a 2 GHz ARM11 core with 64-128 KB of L2 Cache ( added to 32 KB of L1 Cache ) could do: I think it would be fast enough to orchestrate 8 APUs.
 
Panajev2001a said:
Everything else except area of the chip staying equal going from 200 mm Wafers to 300 mm Wafers we could in theory have 1.5x bigger CPUs compared to the 180 nm EE.

224 * 1.5 = 336.

it's 1.5*1.5 = 2.25x bigger surface
 
Quaz51 said:
Panajev2001a said:
Everything else except area of the chip staying equal going from 200 mm Wafers to 300 mm Wafers we could in theory have 1.5x bigger CPUs compared to the 180 nm EE.

224 * 1.5 = 336.

it's 1.5*1.5 = 2.25x bigger surface

Of course... duh.... dumb me who just divided 300 by 200, I forgot that I was dividing diameters and not surfaces :(

(221,841 (mm^2)) / (98,596 (mm^2)) = 2.25

I do not think that your correction worsens my argument though ;)

224 mm^2 * 2.25 = 504 mm^2

A 290-300 mm^2 chip would only be 57.5-59.5% of the theoretical increase they could get and seems feasible taking into account complications which might occur at a chip size of this magnitude.
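
The scaling above can be checked with a short sketch, assuming plain circular wafers with no edge exclusion. Note that the circle formula gives roughly 31,416 mm^2 and 70,686 mm^2 for the two wafers; only the 2.25 ratio between them matters for the argument.

```python
import math

def wafer_area_mm2(diameter_mm):
    """Area of a circular wafer, ignoring edge exclusion and flats/notches."""
    return math.pi * (diameter_mm / 2) ** 2

area_200 = wafer_area_mm2(200)
area_300 = wafer_area_mm2(300)
area_ratio = area_300 / area_200

# Die size that keeps the same number of dies per wafer as the 224 mm^2 EE.
EE_180NM_MM2 = 224
scaled_die_mm2 = EE_180NM_MM2 * area_ratio

print(f"200 mm wafer: {area_200:,.0f} mm^2, 300 mm wafer: {area_300:,.0f} mm^2")
print(f"area ratio: {area_ratio:.2f}, scaled die: {scaled_die_mm2:.0f} mm^2")
for die in (290, 300):
    print(f"{die} mm^2 is {die / scaled_die_mm2:.1%} of the scaled die size")
```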

If we take a 300 mm^2 chip and 300 mm Wafers we could make ( if the Wafer had a rectangular shape ) around 739 chips... Let's say 600 to take a bit into account the fact the Wafer is circular and we do not use the surface very well ( I am making the case for approximately 139 * 300 mm^2 of Wafer's surface not to be utilizable ).

If we take the 180 nm EE measuring 224 mm^2 ( which is not asking you to start counting transistor density [btw, Kutaragi's official PDFs state 10.5 MTransistors not 13 MTransistors] ) and 200 mm Wafers we could make around 440 chips ( if the Wafer had a rectangular shape ).

Let's only trim it to 400 chips to skew the comparison a little in favor of the 200 mm Wafer.

This is still giving us 200 more chips produced with the 300 mm Wafers and the chip size increased to 300 mm^2.

Even if shit happens and the yields drop badly to only 405 chips ( yields of about 67% ) while assuming all 400 chips manufactured on the 200 mm Wafers ( 224 mm^2 chips ) are working up to specs ( again we are trying to skew the comparison against the 300 mm Wafers and 300 mm^2 chips argument ), this is still good news.

If we want to be more negative, let's at least assume the 224 mm^2 chips on 200 mm Wafers will have yields around 70% ( the 180 nm process is not at its peak [think year 1999] ) which means about 280 chips.

Let's say that with the 300 mm^2 chips on the 300 mm Wafers we only get 282 working chips out of 600.

This means yields around 47% and while this in itself is low, all things considered we still get the same number of chips + 2 as we did on 200 mm Wafers and 224 mm^2 chips.
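
The break-even reasoning can be sketched like this, using the per-wafer chip counts assumed in this post ( 600 candidates on a 300 mm Wafer, 400 on a 200 mm Wafer ). Those counts are the post's own rough estimates, and the 70% baseline yield is a guess, so treat the output as an illustration of the argument rather than real fab data.

```python
# Candidate chips per wafer, as assumed in the post.
GROSS_300MM_WAFER = 600   # 300 mm^2 chips on a 300 mm Wafer
GROSS_200MM_WAFER = 400   # 224 mm^2 chips on a 200 mm Wafer

def good_chips(gross_chips, yield_fraction):
    """Working chips per wafer for a given yield."""
    return gross_chips * yield_fraction

# Baseline: the 180 nm EE at an assumed 70% yield.
baseline_good = good_chips(GROSS_200MM_WAFER, 0.70)

# Yield the bigger chip needs on the bigger wafer to match that output.
break_even_yield = baseline_good / GROSS_300MM_WAFER

print(f"baseline working chips per 200 mm Wafer: {baseline_good:.0f}")
print(f"break-even yield on 300 mm Wafers: {break_even_yield:.1%}")
print(f"282 working chips out of 600 is a yield of {282 / GROSS_300MM_WAFER:.1%}")
```

This only compares chips per wafer; it says nothing about what each wafer costs to process.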
 
Cost per mm2 goes up significantly every generation ... 300 mm is just a blip on the trend, I doubt it even puts 65 nm at parity with 180 nm.
 
MfA said:
Cost per mm2 goes up significantly every generation ... 300 mm is just a blip on the trend, I doubt it even puts 65 nm at parity with 180 nm.

You are taking into account the costs related to the process and this is fair, but my argument still ponders the question: if a 224 mm^2 chip was economically viable with 200 mm Wafers is a 280-300 mm^2 chip viable with 300 mm Wafers ?

Also, with the 45 nm node coming around 2007 ( Q1-Q2 2007 for Toshiba and SCE ), which in some ways is simpler than the solution Toshiba currently chose ( mixed SOI [logic] and bulk-CMOS [e-DRAM] ), the costs due to the chip size will fall quite a bit: the 45 nm node should use a full SOI solution for both logic and e-DRAM. As I posted above, the e-DRAM cell will be significantly smaller in surface area terms, allowing for an even greater reduction of the chip size than what shrinking from 65 nm to 45 nm would normally bring you.

It means holding on for a year basically: I expect a Q4 2005 launch for PlayStation 3.

This is a point in favour of the integration of e-DRAM instead of saving 20-30 mm^2 by leaving it off.

It would also reduce the costs for the XDR/Yellowstone solution: we would not need a 128 bits Memory Controller with 256 data pins ( due to differential signalling for people who just joined in the discussion now ) and/or an 800 MHz base clock for the memory either ( base clock = clock before the 4x PLL multiplication and DDR signalling taken into account ).

The XDR related costs ( we also need a more expensive PCB solution ) would not be reduced much by shrinking the chips to 45 nm.
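
To make the pin and bandwidth trade-off concrete, here is a small sketch using the figures as described in this post: two pins per data bit for differential signalling, and a per-pin rate of base clock x 4 ( PLL ) x 2 ( DDR ). The multipliers and the narrower/slower alternatives are assumptions for illustration, not confirmed XDR specifications.

```python
def data_pins(bus_width_bits):
    """Differential signalling uses two pins per data bit."""
    return 2 * bus_width_bits

def per_pin_rate_mbit_s(base_clock_mhz, pll_multiplier=4, transfers_per_clock=2):
    """Per-pin data rate: base clock, times the PLL multiplier, times DDR."""
    return base_clock_mhz * pll_multiplier * transfers_per_clock

def bandwidth_gb_s(bus_width_bits, base_clock_mhz):
    """Peak bandwidth in GB/s (1 GB = 1000 MB) for the whole interface."""
    return bus_width_bits * per_pin_rate_mbit_s(base_clock_mhz) / 8 / 1000

# The expensive configuration from the post, and two cheaper alternatives.
for width, clock in ((128, 800), (128, 400), (64, 400)):
    print(f"{width}-bit @ {clock} MHz base: {data_pins(width)} data pins, "
          f"{bandwidth_gb_s(width, clock):.1f} GB/s peak")
```

Halving either the bus width or the base clock halves the peak external bandwidth, which is the gap the on-die e-DRAM would be expected to cover.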
 
As I said before, the argument doesn't hold much weight since they had a shrink shortly after releasing the PS2 too (and the density on their first generation chips was damn lousy too). Basically it is all guesswork and there are too many arguments for either side to make a reasonable guess.

Vince asked before why I make my guesses based on such hugely simplified assumptions ... well it is because putting a large string of more well founded assumptions together doesn't really improve the guess IMO. Sure, taken individually they might be reasonable, but when you string them together the level of uncertainty still balloons rapidly. As long as I'm making a wild guess I might as well save myself some work ...
 