Clearspeed announces CELL-like processor

Re: ...

DeadmeatGA said:
Well, I got the die size data of CS301.

41 million transistors take up 72 square mm using an IBM 0.13 micron silicon-on-insulator process
For CELL fans, don't cheer yet, because CS301 is a SIMD processor; only one instruction decoder, one control unit and one instruction cache shared among 64 FPUs, and it is not heavily pipelined to support a higher clock. EE3, on the other hand, is an 18-way MIMD (2 PPC cores + 16 active VUs + 2 spare VUs) plus 2 MB of SRAM cache, so it will no doubt be massive in die size and give a poor yield, plus a programming model that makes CS301 programming look like kiddie stuff.

Sorry, Kutaragi's dream of 1 TFLOPS per chip still has to wait until 2007 or 2008, and it will still cost a bucketload of cash....

Even keeping their current implementation, but going for a 280 mm^2 die size and a 65 nm manufacturing process ( which would shrink the chip to about 1/4 of its current 130 nm size: the EE and GS at 180 nm took ~4.35x more space than the EE and GS combined at 90 nm, so I am not going for the best-case scenario ), they should be able to fit 15-16 of those chips on a single die.

This would mean ~400 GFLOPS at 200 MHz.

Bump the clock-speed of the chip to 500 MHz and you get 1 TFLOPS.

You use F.U.D.-dy math, but that allows me to do the same ;)

You do not think 65 nm can make the processor around 4x smaller ?

Fine, let's assume it makes it only 2x smaller.

We could fit only 8 of those CPUs in a single 280 mm^2 die, obtaining 200 GFLOPS at 200 MHz.

Clock the puppy at 1 GHz and again you obtain 1 TFLOPS :p

1 GHz = 5 * 200 MHz

5 * 200 GFLOPS = 1 TFLOPS
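
To make the napkin math explicit, here is a quick Python sketch ( the 280 mm^2 budget at 65 nm and the 4x / 2x shrink factors are the assumptions from above, not anything from Clearspeed ):

Code:
# Back-of-the-envelope scaling of the CS301 figures quoted in this thread.
# Assumptions (mine, from the posts above): a 280 mm^2 die budget at 65 nm
# and either a ~4x or a ~2x area shrink from the 130 nm implementation.
CS301_AREA_MM2 = 72        # at 0.13 micron
CS301_GFLOPS   = 25        # peak at 200 MHz
CS301_MHZ      = 200
DIE_BUDGET_MM2 = 280

def scaled_tflops(shrink, clock_mhz):
    """Peak TFLOPS from tiling shrunk CS301 cores onto one die."""
    cores = round(DIE_BUDGET_MM2 / (CS301_AREA_MM2 / shrink))
    return cores, cores * CS301_GFLOPS * (clock_mhz / CS301_MHZ) / 1000

print(scaled_tflops(4, 500))    # (16, 1.0) -> 16 cores, 1 TFLOPS at 500 MHz
print(scaled_tflops(2, 1000))   # (8, 1.0)  -> 8 cores, 1 TFLOPS at 1 GHz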

Brimstone, the IRAM idea also reminds me a bit of Mitsubishi's 3DRAM, but doesn't it remind you of CELL as well ? ( no pun intended with the rhyme ).

Embedding DRAM and logic ( CPU + embedded DRAM ) is something that has been present in IBM's R&D labs for quite a while ( Prof. Nair's papers are a good indication of this ).
 
Re: ...

DeadmeatGA said:
Well, I got the die size data of CS301.

41 million transistors take up 72 square mm using an IBM 0.13 micron silicon-on-insulator process

Thanks for the die specs. (Link please?) I personally think 25 GFLOPS with 41 million transistors is a good sign that CELL doesn't need to be huge. You obviously disagree; I'm not trying to convince you.

DeadmeatGA said:
For CELL fans, don't cheer yet, because CS301 is a SIMD processor; only one instruction decoder, one control unit and one instruction cache shared among 64 FPUs, and it is not heavily pipelined to support a higher clock.

Bingo. Didn't think about pipelining when I was doing those napkin calculations. But I think the impact will be small. Part of the reason the P4 is so huge is that it has something like 22 stages (I remember hearing that somewhere, might be wrong). But the P4 isn't RISC, so it needs elaborate decoding, prefetch, and multithreading structures in an effort to hide memory latency and bandwidth constraints. CELL, with memory and logic sitting so close together, perhaps even on the same clock, won't need nearly as many stages. IMHO.

But you're right about the Clearspeed CPU not being able to clock directly up to 2 GHz.
 
...

nondescript

Let me illustrate my point.

A Radical architecture : CS301

- 42 million transistors
- 72 mm2
- 200 Mhz
- 25 GFLOPS peak
- 0.13 micron IBM SOI process

A traditional architecture : PPC970

- 52 million transistors
- 121 mm2
- 2 Ghz
- 16 GFLOPS peak
- 0.13 micron IBM process

Yes, CS301 appears a bit more efficient on paper, but it does not blow away its traditionally architected competitor fabbed on the same process, and it certainly does not justify the high software-rewriting cost.
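
For reference, the per-transistor and per-area ratios from those figures ( just a quick sketch of the numbers as listed ):

Code:
# Peak-FLOPS efficiency computed from the figures listed above.
chips = {
    "CS301":  {"mtrans": 42, "mm2": 72,  "gflops": 25},
    "PPC970": {"mtrans": 52, "mm2": 121, "gflops": 16},
}
for name, c in chips.items():
    print(f"{name}: {c['gflops'] / c['mtrans']:.2f} GFLOPS/Mtransistor, "
          f"{c['gflops'] / c['mm2']:.2f} GFLOPS/mm^2")
# CS301:  0.60 GFLOPS/Mtransistor, 0.35 GFLOPS/mm^2
# PPC970: 0.31 GFLOPS/Mtransistor, 0.13 GFLOPS/mm^2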

Some Sony fans here have the illusion that IBM's SOI process somehow magically allows SCEI to pack half a billion transistors onto a chip and have any design clock at 4 GHz while burning little power. CS301 is your proof that things don't happen like that.

High clock speed is a design feature and not a process feature. For EE3 to clock at 4 Ghz, SCEI would have to superpipeline the PPC core and VU to 20+ stages at the cost of massive transistor count bloat.
 
Re: ...

DeadmeatGA said:
For EE3 to clock at 4 Ghz, SCEI would have to superpipeline the PPC core and VU to 20+ stages at the cost of massive transistor count bloat.

...using the same .13 micron process, perhaps. What about at .06? It seems they are on-track if the PPC970 is to pull 2 GHz. Does your statement imply that the P4 will be stuck forever at 2.8 GHz (or whatever clock it is at this month), regardless of die process? ...or should it have been stuck at 1.6 GHz, for that matter?...
 
To me, 1 TFLOPS seems too easy to achieve now. I think Sony should revise its estimate ;)

About Clearspeed, what are its cooling requirements, in any case? Of course, we need to know how many watts it puts out first.

The only thing in Sony's, MS's, and Nintendo's way is heat dissipation rather than outright speed. Curing the heat problem is more expensive than making things 'go faster' at this time.
 
Re: ...

DeadmeatGA said:
nondescript

Let me illustrate my point.

A Radical architecture : CS301

- 42 million transistors
- 72 mm2
- 200 Mhz
- 25 GFLOPS peak
- 0.13 micron IBM SOI process

A traditional architecture : PPC970

- 52 million transistors
- 121 mm2
- 2 Ghz
- 16 GFLOPS peak
- 0.13 micron IBM process

Yes, CS301 appears a bit more efficient on paper, but it does not blow away its traditionally architected competitor fabbed on the same process, and it certainly does not justify the high software-rewriting cost.

Some Sony fans here have the illusion that IBM's SOI process somehow magically allows SCEI to pack half a billion transistors onto a chip and have any design clock at 4 GHz while burning little power. CS301 is your proof that things don't happen like that.

High clock speed is a design feature and not a process feature. For EE3 to clock at 4 Ghz, SCEI would have to superpipeline the PPC core and VU to 20+ stages at the cost of massive transistor count bloat.

Hey DMGA, I'll respond in full to your post tomorrow - kinda busy today - but yeah, I understand where you're coming from.

I think the key issue here is what you think the reason for the massive pipelining is - I believe that it is to compensate for memory latency/bandwidth constraints, and it becomes more critical as clock rates increase, not because of the clock rate itself, but because of the CPU Hz/memory Hz ratio - the CPU has to wait more cycles for each read from memory, making prefetch and other such things more critical. But if CELL's memory is sufficiently fast, then there is no need for such elaborate pipelining. Which, again, is why eDRAM is so important.
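
A toy illustration of that ratio argument ( the 50 ns latency figure is made up purely to show the effect ):

Code:
# Toy illustration: cycles lost per memory access as the CPU clock
# outpaces the memory. The 50 ns latency is a made-up placeholder.
MEMORY_LATENCY_NS = 50

for cpu_ghz in (0.2, 1.0, 2.0, 4.0):
    stall_cycles = MEMORY_LATENCY_NS * cpu_ghz   # ns * (cycles per ns)
    print(f"{cpu_ghz} GHz core: ~{stall_cycles:.0f} cycles per access")
# 0.2 GHz -> ~10 cycles, 4 GHz -> ~200 cycles: the faster the core, the
# more cycles each access costs, hence deep buffers, prefetch and OOO.
# On-die eDRAM attacks the latency term itself, shrinking the penalty.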

That's it for today.
 
Re: ...

randycat99 said:
DeadmeatGA said:
For EE3 to clock at 4 Ghz, SCEI would have to superpipeline the PPC core and VU to 20+ stages at the cost of massive transistor count bloat.

...using the same .13 micron process, perhaps. What about at .06? It seems they are on-track if the PPC970 is to pull 2 GHz. Does your statement imply that the P4 will be stuck forever at 2.8 GHz (or whatever clock it is at this month), regardless of die process? ...or should it have been stuck at 1.6 GHz, for that matter?...

These things usually go about 30-50% faster on every successive process. However, this assumes no change in architecture and that the new process works just as well as the previous one, which won't be the case as we move to 90 nm and 65 nm. So a PPC970 may reach 4 GHz by 65 nm, but then again it may not. I'd strongly doubt it, though, due to heat, power, and cost issues.
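
Compounding that 30-50% per node gives a rough range ( a naive extrapolation that ignores the heat and power caveats just mentioned ):

Code:
# Naive compounding of a 30-50% clock gain per process node, starting
# from a 2 GHz PPC970 at 130 nm. Ignores power and heat limits entirely.
base_ghz = 2.0
for nodes, label in ((1, "90 nm"), (2, "65 nm")):
    low, high = base_ghz * 1.3 ** nodes, base_ghz * 1.5 ** nodes
    print(f"{label}: {low:.1f} - {high:.1f} GHz")
# 90 nm: 2.6 - 3.0 GHz
# 65 nm: 3.4 - 4.5 GHz  -> 4 GHz sits at the optimistic end of the range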
 
Some Sony fans here have the illusion that IBM's SOI process somehow magically allows SCEI to pack half a billion transistors onto a chip and have any design clock at 4 GHz while burning little power. CS301 is your proof that things don't happen like that.

No...

Panajev2001a said:
I see them opting for this kind of configuration: 1 GHz for busses ( or 500 MHz DDR signaling ), 1 GHz for the e-DRAM ( or 500 MHz DDR signaling ), 4 GHz for the Register file, 4 GHz for the FP/FX Units, 2-4 GHz for the SRAM based LS ( 2 GHz would mean that there would be a bit more of latency involved with LS reads and writes from the Register file: 1 LS clock would be 2 FP/FX Units clocks so there would be an extra latency cycle for LOAD/STORE instructions which touch the LS, which is acceptable ) and 1-2 GHz for the PUs.

This way you see that we do not need the whole chip running at 4 GHz, and this has a direct impact on the chip's overall power consumption.

Having multiple clock domains on the chip is not impossible ( Intel has been doing it for years and they are not the only ones ) and is one of the tricks I can see the STI guys pulling off ( plus tons of other tricks I am not even thinking about at the moment ) :)
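
To sketch why those clock domains matter for power ( dynamic power goes roughly as C*V^2*f, so at fixed voltage it scales about linearly with frequency; the capacitance split below is invented purely for illustration ):

Code:
# Rough sketch: relative dynamic power of per-domain clocking vs. running
# the whole chip at 4 GHz. Dynamic power ~ C * V^2 * f, so with C and V
# fixed it scales roughly linearly with f. The fractions are invented.
domains = {                              # (fraction of chip, clock in GHz)
    "FP/FX units + register file": (0.35, 4.0),
    "SRAM local store":            (0.25, 2.0),
    "PUs":                         (0.20, 2.0),
    "e-DRAM + busses":             (0.20, 1.0),
}
relative = sum(frac * ghz for frac, ghz in domains.values()) / 4.0
print(f"~{relative:.0%} of the power of a flat 4 GHz design")   # ~62%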

What you showed me with your analysis of this new processor is that a 65 nm implementation ( 65 nm is supposed to bring 500+ million transistors, judging from how far companies like ATI can push 150 nm: if they get 100+ Mtransistors with 150 nm technology, then with 65 nm technology and a ~280 mm^2 CPU [more or less] you cannot pack 500+ Mtransistors ? ), modified a bit for better pipelining and clocked at 0.5-1 GHz, could yield 1 TFLOPS ( depending on whether you think that in 65 nm they could fit 8 or 16 cores ).
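
The area scaling behind that 500+ million figure, roughly ( ideal shrink, which real designs never quite reach ):

Code:
# Ideal-shrink estimate behind the "500+ million transistors at 65 nm"
# figure: transistor density scales roughly as (old_node / new_node)^2.
transistors_150nm = 100e6          # what ATI-class designs pack at 150 nm
density_gain = (150 / 65) ** 2     # ~5.3x
print(f"~{transistors_150nm * density_gain / 1e6:.0f} million transistors")
# ~533 million in a comparable area -- optimistic, since wiring, SRAM and
# I/O do not shrink perfectly, but the order of magnitude holds.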
 
Heat is not a curable problem. SOI helps a little and it controls the spectre of leakage current ... but if you want max performance out of your mm^2 of die you are going to burn a lot of energy, much more than a desktop processor of the same size and at the same clock which "wastes" most of its silicon. This gets worse with every generation, in fact.

Heat generation is a problem which needs to be tempered, but it cannot be solved ... the solutions have to be sought in better heat removal and dispersion.

Zalman's fully passively cooled case for high-performance processors is a nice example of just what is possible with heatpipes these days ...
 
Another thing, nonamer: the PUs in CELL will be nowhere near the complexity of a highly parallel ( 5 instructions per clock ) OOOe machine like the PPC970 is.

They will probably be fully pipelined, scalar, single- or dual-issue RISC processors with a very tight ISA, and they will probably be clocked at 1-2 GHz.

Deadmeat's frequent comparisons with clock scaling of the PPC970 are fallacious and motivated towards spreading more and more F.U.D.
 
MfA said:
Heat is not a curable problem. SOI helps a little and it controls the spectre of leakage current ... but if you want max performance out of your mm^2 of die you are going to burn a lot of energy, much more than a desktop processor of the same size and at the same clock which "wastes" most of its silicon. This gets worse with every generation, in fact.

Heat generation is a problem which needs to be tempered, but it cannot be solved ... the solutions have to be sought in better heat removal and dispersion.

Zalman's fully passively cooled case for high-performance processors is a nice example of just what is possible with heatpipes these days ...

I think PlayStation 3 designers are currently hard at work on cooling mechanisms for the machine's main processors.
 
Re: ...

Panajev2001a said:
Brimstone, the IRAM idea also reminds me a bit of Mitsubishi's 3DRAM, but doesn't it remind you of CELL as well ? ( no pun intended with the rhyme ).

Embedding DRAM and logic ( CPU + embedded DRAM ) is something that has been present in IBM's R&D labs for quite a while ( Prof. Nair papers are a good indication of this ).

Yes. On page 17 of this PDF, Mitsubishi 3DRAM is classified as being related.


http://iram.cs.berkeley.edu/papers/IRAM.micro.pdf

Why Micron refers to the DRAM on its Yukon processor as "active memory" instead of embedded RAM, I'm not sure.
 
Re: ...

nondescript said:
DeadmeatGA said:
nondescript

Let me illustrate my point.

A Radical architecture : CS301

- 42 million transistors
- 72 mm2
- 200 Mhz
- 25 GFLOPS peak
- 0.13 micron IBM SOI process

A traditional architecture : PPC970

- 52 million transistors
- 121 mm2
- 2 Ghz
- 16 GFLOPS peak
- 0.13 micron IBM process

[...]

High clock speed is a design feature and not a process feature. For EE3 to clock at 4 Ghz, SCEI would have to superpipeline the PPC core and VU to 20+ stages at the cost of massive transistor count bloat.

Hey DMGA, I'll respond in full to your post tomorrow - kinda busy today - but yeah, I understand where you're coming from.
Here we go:
Let's look at another architecture - the P4 Northwood core - something the Xbox CPU is much more likely to look like, and much more similar to a traditional PC architecture.

Pentium 4 Northwood 3.06 GHz

- 0.13 micron
- 146 mm^2
- 55 million transistors
- 6.12 GFLOPS peak

Instead of describing it myself, here's a link to Hannibal's excellent and accessible description of both the G4e (PPC) and P4.
http://arstechnica.com/cpu/01q2/p4andg4e/p4andg4e-2.html
Key points (I cut out some stuff to keep the length reasonable):
The G4e breaks the classic, four-stage pipeline into seven stages in order to allow it to run at increased clock speeds on the same manufacturing process. Less work is done in each of the G4e's shorter stages but each stage takes less time to complete. Since each stage always lasts exactly one clock cycle, shorter pipeline stages mean shorter clock cycles and higher clock frequencies. The P4, with a whopping 20 stages in its basic pipeline, takes this tactic to the extreme.

As we'll see, the Pentium 4 makes quite a few sacrifices for clock speed, and although Intel tries to spin it differently, an extraordinarily deep pipeline is one of those sacrifices.

The drastic difference in pipeline depth between the G4e and the P4 actually reflects some very important differences in the design philosophies and goals of the two processors. The G4e's approach can be summarized as "wide and shallow." Its designers added more functional units to its back end for executing instructions, and its front end tries to fill up all these units by issuing instructions to each functional unit in parallel. In order to extract the maximum amount of instruction-level parallelism (ILP) from the (linear) instruction stream the G4e's front end first moves a small batch of instructions onto the chip. Then, its out-of-order (OOO) execution logic examines them for interdependencies, spreads them out to execute in parallel, and then pushes them through the execution engine's nine functional units. Each of the G4e's functional units has a fairly short pipeline, so the instructions take very few cycles to move through and finish executing.

At any given moment the G4e can have up to 16 instructions spread throughout the chip in various stages of execution simultaneously. The end result is that the G4e focuses on getting a small number of instructions onto the chip at once, spreading them out widely to execute in parallel, and then getting them off the chip in as few cycles as possible.

The P4 takes a "narrow and deep" approach to moving through the instruction stream. It has fewer functional units, but each of these units has a deeper, faster pipeline. The fact that each functional unit has a very deep pipeline means that each unit has a large number of available execution slots and can thus work on quite a few instructions at once.

It's important to note that in order to keep the P4's fast, deeply pipelined functional units full, the machine's front end needs deep buffers that can hold and schedule an enormous number of instructions. The P4 can have up to 126 instructions in various stages of execution simultaneously. This way, the processor can have many more instructions on-chip for the out-of-order execution logic to examine for dependencies and then rearrange to be rapidly fired to the execution units.

As we can see, in either case, the bulk of the pipeline is not the execution itself, but the stages before and after - in both cases, traditional processors devote significant transistor budgets to speculative processing, examining interdependencies, and other structures to support parallel processing of essentially serial code. In Hannibal's words, "to extract the maximum amount of instruction-level parallelism (ILP) from the (linear) instruction stream."

In the P4, five stages are used to fetch instructions from the L1 cache and the Branch Processing Unit, including one stage that does nothing but wait for signals to arrive. Seven stages are used to process and direct out-of-order execution. Two more stages to actually send the signal to the processing units. Two more to get into the registers of the execution units. One stage to execute. Three more stages to verify and reorder the results. 20 stages total.
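
Tallying that up, grouping the stages exactly as described above:

Code:
# The 20-stage Northwood pipeline, grouped as in the description above.
p4_stages = {
    "fetch from the L1 cache / BPU (incl. one wait-only stage)": 5,
    "process and direct out-of-order execution":                 7,
    "send the signals to the execution units":                   2,
    "get into the execution units' registers":                   2,
    "execute":                                                   1,
    "verify and reorder the results":                            3,
}
print(sum(p4_stages.values()), "stages total")   # 20 stages total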

CELL will need some pipelining, since the cores will be running much faster (2x-8x) than the memory, but it will be nothing like the P4 - the jobs are inherently different. Graphics and physics processing are highly and easily parallel. Just look at the design of a modern GPU. My guess is the PPC cores will handle the inherently linear code, using the process described above, while the bulk of the graphics and physics processing will be done by a highly parallel architecture similar to the Clearspeed chip - which, as I hope I've shown, does not need such elaborate pipelining.

Last bit:

I agree that 1Tflops will be very hard to achieve. 1Tflops is a number that shouldn't be tossed around lightly, and I'm not about to stick my neck out and say CELL will be 1Tflops, or that the PS3 will be 1Tflops. But I think it is possible, and I don't think it is an overestimation to say that STI is capable of a 1Tflops processor.
 
According to the IBM patents, APUs can do the following:

1 FP/FX Vector Instruction/cycle: 8 ops/cycle is the peak figure ( Vector MADD ).

1 FP/FX Scalar Instruction/cycle: 2 ops/cycle is the peak figure ( MADD ).

The APU can process Vector or Scalar instructions.
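
Translating those per-cycle figures into peak rates ( the clocks below are just the ones being floated in this thread, not anything IBM has stated ):

Code:
# Peak throughput implied by the patent figures above: 8 ops/cycle for a
# vector MADD. The clock speeds are the ones discussed in this thread.
VECTOR_OPS_PER_CYCLE = 8
for ghz in (1, 2, 4):
    print(f"one APU at {ghz} GHz: {VECTOR_OPS_PER_CYCLE * ghz} GFLOPS peak")
# 1 GHz -> 8 GFLOPS, 2 GHz -> 16 GFLOPS, 4 GHz -> 32 GFLOPS per APU;
# reaching 1 TFLOPS is then a question of how many APUs fit on one die.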

including one stage that does nothing but wait for signals to arrive.

There are two stages like that; they are called DRIVE stages.
 
Most processors will have DRIVE stages at the 0.065 µm node, simply because wire delay will become more and more dominant.

Cheers
Gubbi
 
Whatever issues PS3 may have, I doubt a lack of FP power will be one of them. It could be a fraction of a teraflop and still have enormous power (imagine 500 gigaflops, like that's NOT fast??). Some more significant issues:

- good compilers/developer tools (although unfriendly hardware hasn't hurt PS2 at all)
- image quality (good AA is needed, especially at HDTV resolution)
- vertex lighting, or realistic lighting? (cg quality?)
- proficiency at running shaders
- ability to texture (displacement maps, procedural noise)
- better animation/physics/collision (a software issue, the resources for it are there)

PS3 can't just offer PS2 games with 1000x more polygons; an overall increase in realism is needed.
 
You're right, Josiah. It can't be a PS2, where the main difference was just pure polygon power.

However, I am sure that Sony has learned from their mistakes (PSP is showing this), and PS3 will be chock-full of features, not just the ability to push a billion polygons around.

On the shader (effects) side of things, Sony's GPU will have its own APUs, meaning that developers can program their own effects (think per-pixel lighting) without being stuck with what's locked in hardware.

Kutaragi has also said that there will be developer libraries and an API for PS3; I am certain that there was no graphics API for PS2.
 