CELL configuration revisited....

DeadmeatGA · Nov 8, 2003

Kutaragi's presentation PDF has given me a lot to think about regarding the true nature of the CELL architecture, and has made me reconsider some of my previous estimates based on the Suzuoki patent application.

All my estimates were based on the Suzuoki patent application, which specified that APUs be "preferably rated at 32 GFLOPS", which made everyone jump and scream "4 GHz!!!". I built my estimate around this clock speed, expecting the pipeline to be stretched beyond 20 stages and to consume something like 200 watts to support such a high clock speed. That in turn led me to conclude that SCEI could include no more than 2 PE cores even at the massive die size of 250 mm2, and would still be unable to reach 4 GHz in a consumer-level application; more like 2 GHz, to be realistic.

What changed my mind was Kutaragi's presentation of the 2-teraflop rack; Kutaragi intended to reach 2 teraflops by putting 64 chips in a rack, and you CANNOT POSSIBLY PUT 64 CHIPS EACH BURNING 200 watts in a rack!!! (That would be about 13 kW per rack, my friend.)
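The rack figure is easy to check. The sketch below only multiplies per-chip power by the chip count from the post; it ignores cooling, memory and power-supply losses, which is an assumption of the sketch and not something the presentation states. The function name is made up for illustration.

```python
# Back-of-the-envelope rack power check (chip power only; cooling, memory
# and power-supply losses are ignored, which is an assumption of this sketch).

CHIPS_PER_RACK = 64  # chip count per rack, as described in the post


def rack_power_kw(watts_per_chip, chips=CHIPS_PER_RACK):
    """Total chip power for one rack, in kilowatts."""
    return watts_per_chip * chips / 1000.0


if __name__ == "__main__":
    # The 200 W "hyperpipelined" hypothesis versus the 10~15 W one.
    print(f"200 W chips:   {rack_power_kw(200):.1f} kW per rack")
    print(f"10-15 W chips: {rack_power_kw(10):.2f} to {rack_power_kw(15):.2f} kW per rack")
```

At 200 W per chip the rack lands at 12.8 kW, which the post rounds to 13 kW; at 10~15 W per chip it stays under 1 kW.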

Let us go back and rethink the original motivation behind IBM's cellular architecture: it was an attempt to extract big compute power by summing up lots of simple, inexpensive processors running at a relatively slow clock speed and burning little power. Blue Gene/L, on which CELL is based, in fact uses a couple of PPC440s packed into one die. Now, why would Kutaragi suddenly decide to go against the very philosophy behind cellular computing and build his dream chip around a massive hyperpipelined processor burning 200 watts to clock at 4 GHz?? It didn't make sense.

Now, if we understand that Suzuoki's APU GFLOPS rating was a "preference", in other words a long-term goal and not the rating of what's to come next year, then everything starts falling into place. CELL, like Blue Gene/L, would be built around something simple like a PPC440, to which 8 recycled EE2-style VUs are attached. Such a PE won't take hundreds of millions of transistors to build because it is very simple; I expect such a PE to be no larger than PSX2OAC. You can in fact cram 4 of these on a die @ 65 nm. Such a device should be able to peak around 1 GHz at a reasonable power consumption of, say, 10~15 watts. The peak FLOPS rating will be around 250 GFLOPS per chip. A very respectable number indeed.
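The 250 GFLOPS figure can be reproduced with a few multiplications. This is only a sketch: it assumes each of the 8 VUs per PE behaves like one patent-style APU, and it takes the per-cycle throughput implied by reading "32 GFLOPS" as "4 GHz" (32 / 4 = 8 FLOPs per cycle). Neither assumption is spelled out in the post, and the function names are hypothetical.

```python
# Peak-FLOPS arithmetic for the configurations discussed in the post.
# ASSUMPTION: one "EE2-style VU" == one APU delivering 8 FLOPs per cycle,
# which is what "32 GFLOPS at 4 GHz" works out to.


def flops_per_cycle(rated_gflops=32.0, clock_ghz=4.0):
    """Per-APU FLOPs per cycle implied by a GFLOPS rating at a given clock."""
    return rated_gflops / clock_ghz


def peak_gflops(pes, apus_per_pe, clock_ghz, per_cycle=8.0):
    """Peak GFLOPS of a chip: PEs x APUs per PE x FLOPs per cycle x GHz."""
    return pes * apus_per_pe * per_cycle * clock_ghz


if __name__ == "__main__":
    per_cycle = flops_per_cycle()
    # 4 short-pipe PEs at 1 GHz (the revised estimate)
    print(peak_gflops(4, 8, 1.0, per_cycle))
    # 2 deep-pipe PEs at the "realistic" 2 GHz (the earlier estimate)
    print(peak_gflops(2, 8, 2.0, per_cycle))
```

Under these assumptions both configurations come out at 256 GFLOPS, so the trade is between clock speed and unit count rather than peak throughput.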

So which sounds more realistic to you: a 20+ stage hyperpipelined processor chip burning 200 watts at 4 GHz, or a 7-stage pipelined processor chip burning 10~15 watts @ 1 GHz???

Megadrive1988 · Nov 8, 2003

OK then, let's pack 16 of these PEs in PS3 across 4 dies and get 4 teraflops.

PS3's 'CPU' will be spread across 4 dies, each with 4 PEs. So that's 128 APUs for the 'CPU': 4 TFLOPS.

Then the 'GPU' will be 4 dies, each with 4 VSs. Another 64 APUs, plus 16 Pixel Engines. 2 more TFLOPS.

There's your 6 TFLOPS and '1000x more power than PS2'.

(half serious)

IBL
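The half-serious tally above adds up if every APU is counted at the patent's "preferred" 32 GFLOPS. A sketch of the sum, assuming 8 APUs per PE and 4 APUs per VS (the latter inferred from "4 dies, each with 4 VSs" giving 64 APUs), with the Pixel Engines left out because the post gives no FLOPS figure for them:

```python
# Tally of the "6 TFLOPS" configuration.
# ASSUMPTIONS: 8 APUs per PE, 4 APUs per VS, 32 GFLOPS per APU.

APU_GFLOPS = 32.0


def total_apus(dies, units_per_die, apus_per_unit):
    """Number of APUs across all dies."""
    return dies * units_per_die * apus_per_unit


def tflops(apus, gflops_per_apu=APU_GFLOPS):
    """Aggregate peak in TFLOPS for a given APU count."""
    return apus * gflops_per_apu / 1000.0


if __name__ == "__main__":
    cpu_apus = total_apus(dies=4, units_per_die=4, apus_per_unit=8)
    gpu_apus = total_apus(dies=4, units_per_die=4, apus_per_unit=4)
    print(cpu_apus, tflops(cpu_apus))
    print(gpu_apus, tflops(gpu_apus))
    print(tflops(cpu_apus + gpu_apus))
```

That gives 128 and 64 APUs, or roughly 4.1 + 2.0 = 6.1 TFLOPS peak.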

Panajev2001a · Nov 8, 2003

DeadmeatGA said:
Kutaragi's presentation PDF has given me a lot to think about regarding the true nature of the CELL architecture, and has made me reconsider some of my previous estimates based on the Suzuoki patent application.

All my estimates were based on the Suzuoki patent application, which specified that APUs be "preferably rated at 32 GFLOPS", which made everyone jump and scream "4 GHz!!!". I built my estimate around this clock speed, expecting the pipeline to be stretched beyond 20 stages and to consume something like 200 watts to support such a high clock speed. That in turn led me to conclude that SCEI could include no more than 2 PE cores even at the massive die size of 250 mm2, and would still be unable to reach 4 GHz in a consumer-level application; more like 2 GHz, to be realistic.

What changed my mind was Kutaragi's presentation of the 2-teraflop rack; Kutaragi intended to reach 2 teraflops by putting 64 chips in a rack, and you CANNOT POSSIBLY PUT 64 CHIPS EACH BURNING 200 watts in a rack!!! (That would be about 13 kW per rack, my friend.)

Let us go back and rethink the original motivation behind IBM's cellular architecture: it was an attempt to extract big compute power by summing up lots of simple, inexpensive processors running at a relatively slow clock speed and burning little power. Blue Gene/L, on which CELL is based, in fact uses a couple of PPC440s packed into one die. Now, why would Kutaragi suddenly decide to go against the very philosophy behind cellular computing and build his dream chip around a massive hyperpipelined processor burning 200 watts to clock at 4 GHz?? It didn't make sense.

Now, if we understand that Suzuoki's APU GFLOPS rating was a "preference", in other words a long-term goal and not the rating of what's to come next year, then everything starts falling into place. CELL, like Blue Gene/L, would be built around something simple like a PPC440, to which 8 recycled EE2-style VUs are attached. Such a PE won't take hundreds of millions of transistors to build because it is very simple; I expect such a PE to be no larger than PSX2OAC. You can in fact cram 4 of these on a die @ 65 nm. Such a device should be able to peak around 1 GHz at a reasonable power consumption of, say, 10~15 watts. The peak FLOPS rating will be around 250 GFLOPS per chip. A very respectable number indeed.

So which sounds more realistic to you: a 20+ stage hyperpipelined processor chip burning 200 watts at 4 GHz, or a 7-stage pipelined processor chip burning 10~15 watts @ 1 GHz???

I think they can push more than 10-15 Watts.

EE+GS@90 nm burns 8 Watts: I think that they can safely burn up to 45+ Watts.

This means that 2.5 GHz should be possible ( given the use of SOI and a bit more work on the chip's pipelining ) given your estimation: this means 2.5 x the speed compared to the 1 GHz estimate you had.

250 GFLOPS * 2.5 = 625 GFLOPS

Respectable indeed.

Still, I want to highlight something you said:

CELL, like Blue Gene/L, would be built around something simple like a PPC440, to which 8 recycled EE2-style VUs are attached. Such a PE won't take hundreds of millions of transistors to build because it is very simple; I expect such a PE to be no larger than PSX2OAC. You can in fact cram 4 of these on a die @ 65 nm.

I think that your 8 recycled EE2-style VUs are what Suzuoki's patent and most of all IBM's patent describe as APUs: SIMD/Scalar Processors capable of executing 1 FP/FX Vector Instruction or 1 FP/FX Scalar Instruction per cycle.

I also think that you now agree as well that the PU is not a monster the size of a G4, but is quite a bit smaller than that.

I see you also admit the possibility of having 4 PEs in the same die using 65 nm manufacturing technology.

So, architecturally we have come full circle regarding CELL design.

I think we can agree on 2-2.5 GHz and 500-625 GFLOPS of performance being a possible target with a power consumption estimate of around 45-50 Watts ( your estimate was 10-15 Watts with the chip clocked at 1 GHz ).

I know I will not get something like "maybe on 'this' you guys were right", but you dropping the "PU HAS to be an EV7-class processor" line ( yeah, yeah, you said G4 class ) is enough.

Before you reply: no, the e-DRAM, if present, would still be clocked at 1 GHz maximum ( or 500 MHz using DDR signalling ). If they pushed hard for a 128-bit XDR interface with an 800 MHz base clock speed, they could avoid using e-DRAM, as they would have 100 GB/s with XDR already. Still, there are other advantages that call for e-DRAM, namely Toshiba's 45 nm SOI technology, which allows capacitor-less e-DRAM.
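The 100 GB/s figure for XDR follows from bus width times transfer rate. The sketch below assumes XDR's octal data rate (8 transfers per base clock per pin), which the post does not state explicitly; the helper name is made up for illustration.

```python
# Peak memory bandwidth from bus width and signalling rate.
# ASSUMPTION: XDR moves 8 bits per pin per base clock (octal data rate);
# DDR moves 2. The 128-bit / 800 MHz figures come from the post.


def bandwidth_gb_s(bus_bits, base_clock_mhz, transfers_per_clock):
    """Peak bandwidth in GB/s (1 GB = 1e9 bytes)."""
    bits_per_second = bus_bits * base_clock_mhz * 1e6 * transfers_per_clock
    return bits_per_second / 8 / 1e9


if __name__ == "__main__":
    # 128-bit XDR at an 800 MHz base clock
    print(f"{bandwidth_gb_s(128, 800, 8):.1f} GB/s")
```

With those inputs the result is 102.4 GB/s, which matches the "100 GB/s with XDR already" estimate.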

nondescript · Nov 8, 2003

All my estimates were based on the Suzuoki patent application, which specified that APUs be "preferably rated at 32 GFLOPS", which made everyone jump and scream "4 GHz!!!".

As far as I know, I sure didn't see "everyone jump up and scream '4 GHz!!!'". Believe whatever you want about CELL, but this characterization is obviously false.

I built my estimate around this clock speed, expecting the pipeline to be stretched beyond 20 stages and to consume something like 200 watts to support such a high clock speed. That in turn led me to conclude that SCEI could include no more than 2 PE cores even at the massive die size of 250 mm2, and would still be unable to reach 4 GHz in a consumer-level application; more like 2 GHz, to be realistic.

I haven't been following the slugging contests between Pana and you, but your "estimations" are not exactly authoritative. Do your estimations come from this thread?

I do agree that CELL will have a clock speed of more like 2 GHz than 4 GHz, as I stated here. I also don't think that CELL will require the deep pipelining that the P4 has... it doesn't require the massive decoding and scheduling infrastructure needed to deal with memory bottlenecks, if claims of 16-32 megs of eDRAM on-chip are true. It is also dealing with more inherently parallel code, for physics and graphics, which exhibits strong data locality. A fuller treatment of this line of argument can be found here.

What changed my mind was Kutaragi's presentation of the 2-teraflop rack; Kutaragi intended to reach 2 teraflops by putting 64 chips in a rack, and you CANNOT POSSIBLY PUT 64 CHIPS EACH BURNING 200 watts in a rack!!! (That would be about 13 kW per rack, my friend.)

Let us go back and rethink the original motivation behind IBM's cellular architecture: it was an attempt to extract big compute power by summing up lots of simple, inexpensive processors running at a relatively slow clock speed and burning little power. Blue Gene/L, on which CELL is based, in fact uses a couple of PPC440s packed into one die. Now, why would Kutaragi suddenly decide to go against the very philosophy behind cellular computing and build his dream chip around a massive hyperpipelined processor burning 200 watts to clock at 4 GHz?? It didn't make sense.

Yup, doesn't make sense. We're not gonna see deeply pipelined PUs (or "hyperpipelined", as you like to call it) in CELL. I don't recall anyone other than you hypothesizing a hyperpipelined processor tho...

Now, if we understand that Suzuoki's APU GFLOPS rating was a "preference", in other words a long-term goal and not the rating of what's to come next year, then everything starts falling into place. CELL, like Blue Gene/L, would be built around something simple like a PPC440, to which 8 recycled EE2-style VUs are attached. Such a PE won't take hundreds of millions of transistors to build because it is very simple; I expect such a PE to be no larger than PSX2OAC. You can in fact cram 4 of these on a die @ 65 nm. Such a device should be able to peak around 1 GHz at a reasonable power consumption of, say, 10~15 watts. The peak FLOPS rating will be around 250 GFLOPS per chip. A very respectable number indeed.

Sounds good to me. I envision about the same thing: just as the CELL patent states, we've got more general-purpose PPC-like cores doing the coordinating, and simple APUs crunching out those FMACs. Isn't this exactly the computing model we've been speculating on for months now?

I don't understand your insistence that CELL be < 1 TFLOPS. Personally, I don't think it matters whether CELL is 250 GFLOPS or 1 TFLOPS. Either way, it's just a peak performance numbers game. But if you're willing to accept the possibility of a 250 GFLOPS chip, I don't see why you think 1 TFLOPS is impossible.

So which sounds more realistic to you: a 20+ stage hyperpipelined processor chip burning 200 watts at 4 GHz, or a 7-stage pipelined processor chip burning 10~15 watts @ 1 GHz???

Again, show me where the usual suspects (Pana, Paul, Vince) were hypothesising "a 20+ stage hyperpipelined processor chip burning 200 watts at 4 GHz." This is hardly a revisitation of CELL... it's more like a confirmation of the basic consensus of the board.

nondescript · Nov 8, 2003

Panajev2001a said:
So, architecturally we have come full circle regarding CELL design.

Exactly.

(Damn, you people type fast... it's always a little disconcerting when, between pressing the "reply" and hitting the "submit" button, two people have already replied before me.)

Paul · Nov 8, 2003

I'll restate what I said in another post.

PS3's version of Cell will top out at a peak of around 256 GFLOPS, whilst the GPU (aka GS3, or VS x 4, whatever you want to call it) will peak at around 128 GFLOPS.

300+ GFLOPS in total; simply staggering computing power we are talking about here.

With embedded DRAM feeding this directly, I would say it wouldn't be impossible to sustain around 200 GFLOPS.

PS3 Cell clock speed will top out at, my guess, 2-2.5 GHz, with the GPU coming in at maybe close to a GHz, MAYBE reaching that GHz threshold.

Don't take 300 GFLOPS lightly in a two-chip scenario; this is some MASSIVE COMPUTING POWER, and is not something that should be tossed around lightly.

Maybe I'm being conservative, maybe not.
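For reference, the figures in this post can be turned into a sustained-to-peak ratio. A small sketch using only the numbers quoted above (256 GFLOPS CPU, 128 GFLOPS GPU, about 200 GFLOPS sustained); the function name is hypothetical:

```python
# Sustained-to-peak ratio implied by the two-chip estimate in the post.


def sustained_fraction(sustained_gflops, *peak_gflops):
    """Sustained throughput as a fraction of the summed peak figures."""
    return sustained_gflops / sum(peak_gflops)


if __name__ == "__main__":
    total_peak = 256 + 128
    frac = sustained_fraction(200, 256, 128)
    print(f"peak {total_peak} GFLOPS, sustained fraction {frac:.0%}")
```

Sustaining 200 out of 384 GFLOPS would mean running at a little over half of peak.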

DeadmeatGA · Nov 8, 2003

...

This means that 2.5 GHz should be possible

Nope. You just can't rev very fast with a short-pipe design. 1 GHz should be the upper limit. Even 12~13 stage designs like Dothan and Opteron have trouble reaching 2 GHz at an acceptable yield rate; only 20-stage designs like Pentium 4 break past 2 GHz at an acceptable yield rate. In other words, you would be trading higher clock speed for more units packed in.

( given the use of SOI and a bit more work on the chip's pipelining ) given your estimation: this means 2.5 x the speed compared to the 1 GHz estimate you had.

You talk as if reaching 2 GHz is easy; it is not. Only Pentium 4 does so reliably, while all others struggle somewhere around 1~2 GHz (Power4+, Itanium, Opteron, Efficeon, UltraSPARC IV, Banias, etc.).

I also think that you now agree as well that the PU is not a monster the size of a G4, but is quite a bit smaller than that.

And doesn't clock very fast.

I see you also admit the possibility of having 4 PEs in the same die using 65 nm manufacturing technology.

You can either pack 2 deeply piped PEs or 4 short piped PEs on a die, but the peak output would be about the same @ 250 GFLOPS. What changes is the power consumption.

I don't see why you think 1 TFLOPS is impossible.

Cost considerations and the laws of physics do not allow such a device. If SCEI did manage to build a 1-teraflop device that burns 10 watts, it would put workstation processor vendors out of business, since its performance/price ratio would be like 100x better than any existing architecture.

DeadmeatGA · Nov 8, 2003

...

Here is now my new position on CELL.

It might indeed pack 4 PEs per die, but its clock limit will be around 1 Ghz. This is the maximum you can reliably go with a non-superpipelined design.

It makes me wonder how CELL will compare to the Xbox2 CPU, which appears to be going with a Power5+ style dual-core device. Obviously CELL blows the Xbox2 CPU away in peak FLOPS (256 GFLOPS vs. around 32~40 GFLOPS), but the sustained FLOPS figure is about the same. (CELL will sustain somewhere around 15% of its peak FLOPS rating, whereas the POWER series sustains a very high fraction in real-world code.)
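The "about the same" claim rests on one multiplication. A sketch using the post's own figures (256 GFLOPS peak, a 15% sustained fraction, and a 32~40 GFLOPS peak for the Xbox2 CPU); the 15% figure is the poster's assumption, not a measurement:

```python
# Sustained throughput implied by a peak rating and a sustained fraction.


def sustained_gflops(peak_gflops, fraction):
    """Sustained GFLOPS given a peak rating and a sustained fraction (0..1)."""
    return peak_gflops * fraction


if __name__ == "__main__":
    cell = sustained_gflops(256, 0.15)
    print(f"CELL at 15% of peak: {cell:.1f} GFLOPS")
    # Compare against the Xbox2 CPU's quoted *peak* range.
    print("inside 32-40 GFLOPS range:", 32 <= cell <= 40)
```

At 15% of peak, CELL would sustain about 38.4 GFLOPS, which only matches the Xbox2 CPU if that chip sustains close to 100% of its 32~40 GFLOPS peak.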

randycat99 · Nov 8, 2003

Re: ...

DeadmeatGA said:
This means that 2.5 GHz should be possible

Nope. You just can't rev very fast with a short pipe design. 1 Ghz should be the upper limit.

It's a limit using current die process. However, we are talking about something to be produced on the .06 process, not .13.

Even 12~13 stage designs like Dothan and Opteron have trouble reaching 2 GHz at an acceptable yield rate; only 20-stage designs like Pentium 4 break past 2 GHz at an acceptable yield rate.

If what you are suggesting was really a hard limit, regardless of die process, then the P4 would still be stuck at 1.6 GHz (and G4's would be stuck at 500 MHz). Clearly, this is not the case.

You talk as if reaching 2 GHz is easy; it is not. Only Pentium 4 does so reliably, while all others struggle somewhere around 1~2 GHz (Power4+, Itanium, Opteron, Efficeon, UltraSPARC IV, Banias, etc.).

If 1 GHz is possible now, then 2 GHz is certainly within the realm of possibility in 2005.

DeadmeatGA · Nov 8, 2003

...

It's a limit using current die process. However, we are talking about something to be produced on the .06 process, not .13.

Clockspeed is a design feature, not a process feature.

If what you are suggesting was really a hard limit, regardless of die process, then the P4 would still be stuck at 1.6 GHz (and G4's would be stuck at 500 MHz). Clearly, this is not the case.

P4 WAS designed to clock high. G4 tops out at 1.33 GHz, but the yield rate is low at that clock speed and most parts average 1 GHz.

The best indicator of clock speed vs. pipeline length is XScale, which manages 1 GHz @ 7-stage pipes. And no one has better fab capability than Intel does.

Paul · Nov 8, 2003

Why are you banking on 10 watts? Don't be surprised when you see PS3 Cell at around 30+.

(CELL will sustain somewhere around 15% of its peak FLOPS rating whereas POWER-series sustains very high in real world code)

I'm sorry? How is it that you know the specifics of what Cell will be able to sustain? Do you have some exclusive insider information not yet released on The Inquirer?

Did you forget about embedded DRAM? Or do you think that Cell's engineers are going to build a chip with some insane floating-point throughput and forget about the e-DRAM, which is crucial for pushing data around and maintaining some type of decent numbers in a system such as a PS3?

DeadmeatGA · Nov 8, 2003

...

I'm sorry? How is it that you know the specifics of what Cell will be able to sustain?

10~15% of theoretical peak is what message passing supercomputers sustain.

Did you forget about Embedded DRAM?

I am not expecting any eDRAM.

randycat99 · Nov 8, 2003

Re: ...

DeadmeatGA said:
(CELL will sustain somewhere around 15% of its peak FLOPS rating whereas POWER-series sustains very high in real world code)

Who even knows how you came up with your 15% figure.

Given the size and bandwidth of local storage and the bandwidth to main memory in the Cell compared to the Power series, real-world numbers could end up quite comparable, if not largely in favor of the Cell.

Paul · Nov 8, 2003

10~15% of theoretical peak is what message passing supercomputers sustain.

Then again, efficiency doesn't matter when it comes to supercomputing. It's not an issue of concern. 10-15% will be unacceptable when it comes to a video game console, and I'm sure STI knows this.

I am not expecting any eDRAM.

SCEI does.

Tahir2 · Nov 8, 2003

The AthlonXP reliably, and by quite a margin, beats the 'magical' 2 GHz. This is the Thoroughbred core, and even the top-end Bartons beat 2 GHz.

Intel with its 20-stager is not the only guy who beats 2 GHz. 3 GHz is a different story.

Yet come on, Deadmeat, this analogy is useless, as the AthlonXP is on a .13 micron process and doesn't even use SOI. My AthlonXP 2500 is running at 2.1 GHz reliably. It goes up to 2.4 GHz (not so reliably).

You want to put another obstacle in the way of Cell so now it is clockspeed. This is IMHO the LEAST worry Sony has. You can be pretty irrational sometimes - did Cell eat your hamster (too)???

There is no need for you to put obstacles in the way of Cell - you don't matter to Cell - you have no bearing whatsoever on what Cell will actually be - Cell will continue to exist even if you do not exist, Cell is not reliant on you making it feasible or unfeasible with your posts, Cell will not cure rabies and dementia at the same time, it is not the next holy grail, nor is it the answer to world peace. It's just a bit of fun which will make some people seriously rich, others cry, others poor and others seriously happy.

There will be lots of people in the world that have never heard of Cell, those that will never hear of Cell, those that think a Cell is a miracle, and those that belong in a Cell.

However, your posts are fun, if a little tiring - I think in the long run and for the benefit of mankind as a whole you are the latter - you belong in a Cell. Nicely padded too, with lots of cache.

Panajev2001a · Nov 8, 2003

Deadmeat, going by what you said, seeing a 2003 Athlon 64 FX with pseudo-standard cooling solution ( go to Ace's Hardware ) running at 2.8 GHz makes me smile.

Sure, you are partially right saying that clock-speed is a design feature as when you design the processor's circuitry you have to have a target manufacturing process in mind.

A PlayStation 3 CELL CPU targeted for 90 nm would push for a clock-speed definitely lower than if the CPU designers had 65 nm SOI and a target of 45-50 Watts in mind.

I think we can push it at around 2 GHz if we allow power consumption to rise from your 10-15 Watts to 45-50 Watts.

randycat99 · Nov 8, 2003

Re: ...

DeadmeatGA said:
It's a limit using current die process. However, we are talking about something to be produced on the .06 process, not .13.

Clockspeed is a design feature, not a process feature.

The ultimate clockspeed you get relies on both, however.

P4 WAS designed to clock high.

...and it did, relative to the PIII.

You just got done saying it is only a design feature. If the design hasn't changed, then you would still have a 1.6 GHz P4 today. Fortunately, we are all aware that process has changed over the lifespan of P4, hence we have observed a scaling from 1.6 to 3.0+. It works like that for all processors, in general (but not necessarily the same degree of scaling, of course).

Panajev2001a · Nov 8, 2003

Re: ...

randycat99 said:
DeadmeatGA said:

It's a limit using current die process. However, we are talking about something to be produced on the .06 process, not .13.

Clockspeed is a design feature, not a process feature.

The ultimate clockspeed you get relies on both, however.

P4 WAS designed to clock high.

...and it did, relative to the PIII.

You just got done saying it is only a design feature. If the design hasn't changed, then you would still have a 1.6 GHz P4 today. Fortunately, we are all aware that process has changed over the lifespan of P4, hence we have observed a scaling from 1.6 to 3.0+. It works like that for all processors, in general (but not necessarily the same degree of scaling, of course).

He has some good ideas in there though, as the drive stages in Pentium 4's pipeline were designed with such clock scalability in mind, but he cannot eliminate the influence of manufacturing processes on clock scalability.

Still, current 3.2 GHz Pentium 4 processors already run significant portions of the CPU at 6.4 GHz.

Saem · Nov 8, 2003

The thing to note is that clock speed is based on design. Other physical factors can stop the design from reaching its maximum operational frequency; this, however, doesn't mean that the design isn't ultimately the actual limiting factor. Transistor switching speed is very high; having them switch synchronously is the bitch that design must smack.

At the really low-micron lithography processes (such as the one Cell is expected to utilise) you'll have wire delays likely being the bigger issue over stages, or at least that's my take on the matter. Additionally, there is the current necessary to get over the temperature-induced high resistance for propagating a signal -- there is a technical name for this and I read an article on it a while back on EE Times; I can't remember it ATM.

That said, Deadmeat, I think your clock speed assessments need some revision.

nondescript · Nov 8, 2003

Clockspeed is a design feature, not a process feature.

Half-right, I think. Clockspeed is constrained by the switching and hold times of the transistor (a physical property of the transistor that is determined by processing techniques. I just had a midterm on this...), and it is constrained by the actual logic in the IC, since the output of each stage must stabilize before the next clock cycle (a design feature). It's half-and-half. Put in other words, a better process will yield transistors with superior performance, which allows each stage to complete faster, which allows a higher clock. Shorter stages will also complete more quickly, allowing a higher clock. Higher clockspeed is attained both through process and design.
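The process-and-design point can be illustrated with a toy timing model: the cycle time is the logic delay per stage plus a fixed per-stage latch overhead, and a better process scales both. This is a sketch only. Every number in it is invented so that the 7-stage case lands at 1 GHz, and real designs are also limited by wire delay, power and yield, none of which is modelled.

```python
# Toy model: cycle time = (logic delay / stages + latch overhead) * process scale.
# ASSUMPTIONS: logic_ns, overhead_ns and process_scale are illustrative values,
# not measurements of any real chip.


def max_clock_ghz(stages, logic_ns=6.3, overhead_ns=0.1, process_scale=1.0):
    """Upper bound on clock (GHz) for a pipeline with the given stage count."""
    if stages < 1:
        raise ValueError("stages must be at least 1")
    cycle_ns = (logic_ns / stages + overhead_ns) * process_scale
    return 1.0 / cycle_ns


if __name__ == "__main__":
    print(f"7 stages, baseline process:  {max_clock_ghz(7):.2f} GHz")
    print(f"20 stages, baseline process: {max_clock_ghz(20):.2f} GHz")
    # Same 7-stage design on a process whose gates are twice as fast.
    print(f"7 stages, 2x faster process: {max_clock_ghz(7, process_scale=0.5):.2f} GHz")
```

In this model a deeper pipeline and a faster process are two independent ways to shorten the cycle, which is the "half-and-half" argument in the post.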

Cost considerations and the laws of physics do not allow such a device. If SCEI did manage to build a 1-teraflop device that burns 10 watts, it would put workstation processor vendors out of business, since its performance/price ratio would be like 100x better than any existing architecture.

You gotta stop putting words in other people's mouths. Not "everyone" was "jumping and screaming '4 GHz' ", and I certainly didn't say anything about a "1 teraflop device that burns 10 watts." I said that if you think 250 GFlops is possible, why do you think 1 Tflops is impossible.

Cost considerations? OK, I can buy that. Laws of physics? Ridiculous. We're still at least 10 orders of magnitude from the entropy heat limit set by Boltzmann's law. UIUC has demonstrated a 509 GHz transistor. Physics is not the constraint here - at least, not yet.

Here is now my new position on CELL.

It might indeed pack 4 PEs per die, but its clock limit will be around 1 Ghz. This is the maximum you can reliably go with a non-superpipelined design.

It makes me wonder how CELL will compare to the Xbox2 CPU, which appears to be going with a Power5+ style dual-core device. Obviously CELL blows the Xbox2 CPU away in peak FLOPS (256 GFLOPS vs. around 32~40 GFLOPS), but the sustained FLOPS figure is about the same. (CELL will sustain somewhere around 15% of its peak FLOPS rating, whereas the POWER series sustains a very high fraction in real-world code.)

Alright, that sounds reasonable - I'll meet you halfway. For all I know, you could be right - CELL at 1 GHz, 250 GFlops, no eDRAM.

I disagree that 1 GHz is the maximum you can go without a deep pipeline (given that even lowly DSP chips such as TI's C6400 series can hit 1GHz with virtually no pipelining). But these are comparatively small issues. If you're willing to accept the basic CELL architecture as a viable one, that's good enough for me.

As for peak-to-sustained Flops ratio, here's a quick guesstimate to give us ballpark numbers. The logic is kinda flaky, but again, just as a quick-and-dirty comparison.

The G5 supercomputer they're building for Virginia Tech has a theoretical peak performance of 17.6 Tflops, and they hit 9.56 Tflops, or about 54%. That's impressive for off-the-shelf chips that were never designed to be clustered for this kind of supercomputing.

On the other hand, the Earth Simulator, whose chips are designed from the ground up to be supercomputer processing chips, hits 86% of peak performance. That's 14% wasted vs 46% wasted.
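The peak-to-sustained comparison is a single division per machine. A sketch using only the figures quoted in the post (17.6 and 9.56 Tflops for the Virginia Tech cluster, 86% for the Earth Simulator), plus a purely hypothetical 256 GFLOPS CELL to show what each ratio would mean if it carried over:

```python
# Efficiency = sustained / peak, applied to the figures quoted in the post.
# ASSUMPTION: carrying a cluster's ratio over to a single CELL chip is only
# an analogy, as the post itself says.


def efficiency(sustained, peak):
    """Fraction of peak performance actually sustained."""
    return sustained / peak


if __name__ == "__main__":
    g5 = efficiency(9.56, 17.6)
    earth_simulator = 0.86  # quoted directly; no raw figures given in the post
    print(f"G5 cluster: {g5:.1%} of peak, {1 - g5:.1%} wasted")
    for label, frac in [("15%", 0.15), ("G5-like", g5), ("ES-like", earth_simulator)]:
        print(f"256 GFLOPS CELL at {label}: {256 * frac:.0f} GFLOPS sustained")
```

The spread between the 15% and 86% cases is large, which is why the sustained-to-peak ratio matters more to this argument than the peak number.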

I know this is a weak comparison, but, by analogy, I think CELL, which was designed from the start to be massively scalable, probably would have a better peak-to-sustained ratio than the Power5, which has an inherently PC-centric lineage.
