Does the Cell processor have a chance?


stof
25-Nov-2007, 04:00
I occasionally hear hype about the Cell processor, but I wonder if it really has a chance. It seems to have fatal flaws.

The upside of the Cell is that it has 200 GFLOPS peak performance per chip. This performance number comes from each SPU running at 3.2 GHz, able to perform 4 multiplies & 4 adds simultaneously, which is 25 GFLOPS per SPU, times the 8 SPUs on the chip.
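As a rough sanity check on that figure, the arithmetic can be sketched in a few lines of Python. The function name and its parameters are made up for illustration; the inputs are just the numbers quoted above.

```python
def peak_gflops(clock_ghz, flops_per_cycle, units):
    """Theoretical peak in GFLOPS: clock (GHz) x flops per cycle x number of units."""
    return clock_ghz * flops_per_cycle * units

# One SPU: 4 multiplies + 4 adds per cycle at 3.2 GHz.
per_spu = peak_gflops(3.2, 4 + 4, 1)
# Whole chip: 8 SPUs.
per_chip = peak_gflops(3.2, 4 + 4, 8)

print(f"per SPU:  {per_spu:.1f} GFLOPS")
print(f"per chip: {per_chip:.1f} GFLOPS")
```

That works out to 25.6 GFLOPS per SPU and 204.8 GFLOPS per chip, so the 25 and 200 above are the rounded-down versions. This is a theoretical ceiling only and says nothing about what real code sustains.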

I wonder if you can really get to 50% of peak performance.

A single modern Intel core running at 3 GHz can do 4 multiplies or adds at the same time, which is 12 GFLOPS. It's not that hard to get to peak performance of an x86-64 CPU (http://forums.guru3d.com/showthread.php?p=2509851#).
You can put 8 of those cores in a cheap box, and that's roughly 100 GFLOPS too.
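Spelled out, the comparison I have in mind looks roughly like this. It is only a back-of-the-envelope sketch; the constant names are invented for the example and the values are the ones from this post.

```python
CLOCK_GHZ = 3.0        # Intel core clock quoted in the post
FLOPS_PER_CYCLE = 4    # "4 multiplies or adds at the same time"
CORES_PER_BOX = 8      # "8 of them in a cheap box"
CELL_PEAK_GFLOPS = 200.0  # rounded Cell peak from earlier in the post

per_core = CLOCK_GHZ * FLOPS_PER_CYCLE
per_box = per_core * CORES_PER_BOX
cell_at_half_peak = 0.5 * CELL_PEAK_GFLOPS  # the 50% utilization I am asking about

print(f"x86 per core:        {per_core:.0f} GFLOPS")
print(f"x86 8-core box:      {per_box:.0f} GFLOPS")
print(f"Cell at 50% of peak: {cell_at_half_peak:.0f} GFLOPS")
```

So the 8-core box peaks at 96 GFLOPS, which is close to a Cell running at half of its peak. Both sides are peak numbers, so the real question is which one is easier to keep busy.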

A major problem with the Cell is that it uses expensive XDR memory and you can only put 2 GB on a node. That is very limiting. A Cell blade is very expensive, ~$10,000. And Cell isn't improving as fast as Intel/AMD are.

So, the Cell doesn't look that great with price/performance, it has limited memory, it has little software infrastructure, and uncertainty with its future.

Does it have a chance?

Vitaly Vidmirov
27-Nov-2007, 21:41
I wonder if you can really get to 50% of peak performance.
It is possible to get 99% of peak performance on certain tasks like matrix multiply.

and uncertainty with its future.
So what did you expect? CELL in place of x86?
x86 is not the most popular processor in the world, anyway.

3dilettante
27-Nov-2007, 22:11
I occasionally hear hype about the Cell processor, but I wonder if it really has a chance. It seems to have fatal flaws.

The upside of the Cell is that it has 200 GFLOPS peak performance per chip. This performance number comes from each SPU running at 3.2 GHz, able to perform 4 multiplies & 4 adds simultaneously, which is 25 GFLOPS per SPU, times the 8 SPUs on the chip.

I wonder if you can really get to 50% of peak performance.

This is highly dependent on workload, and for many problem types and system sizes, 50% utilization would be something any architecture would kill for.


A single modern Intel core running at 3 GHz can do 4 multiplies or adds at the same time, which is 12 GFLOPS. It's not that hard to get to peak performance of an x86-64 CPU.
You can put 8 of those cores in a cheap box, and that's roughly 100 GFLOPS too.

It's also not too hard to make x86 run below peak. Even on Linpack, the going rate is something like 80% peak, and Linpack is a standard benchmark everyone targets.
Since memory latency and bandwidth have become so important, the greater control Cell has over memory access is in many areas far superior to the current broadcast coherency schemes of x86 chips.

It should also be noted that the x86 system that can hit 100 GFLOPS does so with two chips with TDPs of 120W.
That's several times the TDP of one Cell.
Power concerns are going to be dominant from now on, as power now constrains clock speeds, system footprint, and operating costs for a system.


A major problem with the Cell is that it uses expensive XDR memory and you can only put 2 GB on a node. That is very limiting. A Cell blade is very expensive, ~$10,000. And Cell isn't improving as fast as Intel/AMD are.

A valid point, which is why the HPC variant of Cell uses DDR2.
I'll cover the volume and price considerations at the end of this.


So, the Cell doesn't look that great with price/performance, it has limited memory, it has little software infrastructure, and uncertainty with its future.

Does it have a chance?
Does it have a chance in what field?

The desktop? Basically none.
Future game consoles? Maybe one of them.
HPC? Probably the best chance it has for creating a niche, much in the way Blue Gene's processors have their own small space.
Other fields? Maybe something here or there, but the support isn't all that enthusiastic.

The primary reason for doubt is that Cell so far has not realized the volume that commodity x86 has attained.
Given market trends and costs, this may prove telling.
The more likely outcome is that future x86 chips are going to copy most of what makes Cell perform so well, leaving Cell with little to offer.

pjbliverpool
27-Nov-2007, 23:45
It should also be noted that the x86 system that can hit 100 GFLOPS does so with two chips with TDPs of 120W.

I thought Core2 could perform 4 double precision but 8 single precision operations per cycle (per core, that is)?

3dilettante
27-Nov-2007, 23:55
I was going by the DP throughput of a two-socket Yorkfield system which is roughly 100 GFLOPS, while the HPC Cell with enhanced DP throughput also tops out at ~100 DP GFLOPS.

edit:
Cell would also have double the SP throughput over DP for the HPC version.

pjbliverpool
27-Nov-2007, 23:59
I was going by the DP throughput of a two-socket Yorkfield system which is roughly 100 GFLOPS, while the HPC Cell with enhanced DP throughput also tops out at ~100 DP GFLOPS.

Ah cool. Just wanted to make sure I wasn't mistaken. So the HPC Cell pretty much doubles Yorkfield's peak throughput in either SP or DP.

I wonder if we'll see a new, beefier Cell before Nehalem arrives. I expect so but it would be strange to see a single socket x86 matching or exceeding Cell in peak floating point.

stof
28-Nov-2007, 00:10
Where can I find more information on Cell boards with DDR2? The IBM web site doesn't have any.

The Core2 can do only 4 single precision operations per cycle and 2 double precision. It can't do simultaneous multiply & add, like the SPE on the Cell. But, it's hard to keep simultaneous multiply & adds busy.

Carl B
28-Nov-2007, 01:15
Where can I find more information on Cell boards with DDR2? The IBM web site doesn't have any.

What, are you going to buy some?

Anyway this is the thread that would probably be your best introduction to the DDR2/HPC Cell: http://forum.beyond3d.com/showthread.php?t=40661

I'll mention also that it's this version of Cell that's going to go into Roadrunner. It's not available to the 'general' public right now, but as time goes on I'm sure you'll see it pop up. As to the original point of the thread, frankly I think Cell has done very well for itself considering it's a new architecture.

pjbliverpool
28-Nov-2007, 17:22
Where can I find more information on Cell boards with DDR2? The IBM web site doesn't have any.

The Core2 can do only 4 single precision operations per cycle and 2 double precision. It can't do simultaneous multiply & add, like the SPE on the Cell. But, it's hard to keep simultaneous multiply & adds busy.

According to this Core2 is capable of 8 SP operations per cycle:

http://www.behardware.com/articles/623-5/intel-core-2-duo-test.html

"Core uses two floating point calculation units, one dedicated to addition and the other to multiplication and division. Theoretical calculation capacity is 2 x87 instructions per cycle and 2 SSE 128 bit floating point instructions per cycle (that is 8 operations on 32 bit simple precision floating points, or 4 operations for double precision 64 bit floating points). Core is, in theory, two times faster for this type of instruction than Mobile, Netburst and K8."

That would result in a theoretical peak of 96 GFLOPs for the fastest single socket CPU.
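For anyone who wants to check the numbers, here is a small sketch of how the competing per-cycle claims in this thread translate into a per-socket peak. The 3.0 GHz clock and the 4 cores are my assumptions about the part in question (they are what makes the 96 figure come out); the helper name is made up for the example.

```python
def peak_gflops(clock_ghz, flops_per_cycle, cores):
    """Theoretical peak in GFLOPS for one socket."""
    return clock_ghz * flops_per_cycle * cores

CLOCK_GHZ = 3.0  # assumption: clock of the fastest quad-core part
CORES = 4        # assumption: quad-core, single socket

# Flops per cycle per core, according to each claim made in this thread.
claims = {
    "SP, 8 per cycle (behardware quote)": 8,
    "SP, 4 per cycle (stof's figure)": 4,
    "DP, 4 per cycle (behardware quote)": 4,
    "DP, 2 per cycle (stof's figure)": 2,
}

for label, flops_per_cycle in claims.items():
    print(f"{label}: {peak_gflops(CLOCK_GHZ, flops_per_cycle, CORES):.0f} GFLOPS")
```

Under those assumptions the article's figures give 96 GFLOPS single precision and 48 GFLOPS double precision, while the lower per-cycle figures give exactly half of each.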

stof
28-Nov-2007, 21:50
I am an HPC software developer. My software is used on about $100 million of hardware. It's pretty important for new hardware to recruit HPC software developers.

I need to be careful about what I invest my time in. With the high cost of Cell boards, the limited memory, and the limited install base, I don't have confidence Cell will become mainstream for commercial HPC ($500K-$10 million clusters). I agree with the comment above that "the more likely outcome is that future x86 chips are going to copy most of what makes Cell perform so well, leaving Cell with little to offer." The x86 chips will probably do it at a lower price and with better software infrastructure.

And while I don't want to divert this discussion onto Intel hardware, the above Intel information is misleading. Yes, the Intel chips can work on a SIMD multiply and add at the same time, but they take more than a clock cycle. You can submit an SSE multiply but it takes 5 cycles to complete. 1 clock cycle after the submit, you can submit another SSE instruction, such as an SSE add, and they will work at the same time, but you won't get 8 flop throughput per cycle. You can only submit one SSE instruction at a time.

patsu
28-Nov-2007, 22:45
stof, what kind of HPC software? Is it media related or scientific computing?

3dilettante
28-Nov-2007, 23:22
And while I don't want to divert this discussion onto Intel hardware, the above Intel information is misleading. Yes, the Intel chips can work on a SIMD multiply and add at the same time, but they take more than a clock cycle. You can submit an SSE multiply but it takes 5 cycles to complete. 1 clock cycle after the submit, you can submit another SSE instruction, such as an SSE add, and they will work at the same time, but you won't get 8 flop throughput per cycle. You can only submit one SSE instruction at a time.

I read that the FP multiplier has a throughput of 1 per cycle and a latency of 4. Only 80-bit FP multiply has a throughput of less than 1 per cycle.

Core2 also has SSE units on 3 issue ports, 1 port for FADD, 1 port for FMUL, and 1 port for other ops.

The peak number would seem to hold unless you can't find any non-dependent multiplies.
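To illustrate why latency and throughput are different things here, a toy model of a single fully pipelined unit is sketched below. It is a deliberate simplification and not a model of the real Core2 pipeline: it assumes one unit that can start one operation per cycle, and it ignores the separate add port, decode, and memory. The function name is invented for the example.

```python
def cycles_to_finish(n_ops, latency, dependent):
    """Cycle at which the last result is ready on a pipelined unit
    that can start at most one new operation per cycle."""
    issue_cycle = 0  # earliest cycle the next op may start
    ready = 0        # cycle at which the most recent result is available
    for _ in range(n_ops):
        if dependent:
            # Each op needs the previous result, so it must wait for it.
            issue_cycle = max(issue_cycle, ready)
        ready = issue_cycle + latency
        issue_cycle += 1
    return ready

N = 100
for latency in (4, 5):
    indep = cycles_to_finish(N, latency, dependent=False)
    chain = cycles_to_finish(N, latency, dependent=True)
    print(f"latency {latency}: independent {indep} cycles "
          f"({N / indep:.2f} ops/cycle), dependent chain {chain} cycles "
          f"({N / chain:.2f} ops/cycle)")
```

In this model 100 independent multiplies with a latency of 4 finish in 103 cycles, close to one per cycle, while 100 multiplies that each depend on the previous one take 400 cycles. The latency figure barely matters as long as there are enough independent operations in flight, which is the point about non-dependent multiplies.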

Nite_Hawk
29-Nov-2007, 16:26
I am an HPC software developer. My software is used on about $100 million of hardware. It's pretty important for new hardware to recruit HPC software developers.

I need to be careful about what I invest my time in. With the high cost of Cell boards, the limited memory, and the limited install base, I don't have confidence Cell will become mainstream for commercial HPC ($500K-$10 million clusters). I agree with the comment above that "the more likely outcome is that future x86 chips are going to copy most of what makes Cell perform so well, leaving Cell with little to offer." The x86 chips will probably do it at a lower price and with better software infrastructure.

And while I don't want to divert this discussion onto Intel hardware, the above Intel information is misleading. Yes, the Intel chips can work on a SIMD multiply and add at the same time, but they take more than a clock cycle. You can submit an SSE multiply but it takes 5 cycles to complete. 1 clock cycle after the submit, you can submit another SSE instruction, such as an SSE add, and they will work at the same time, but you won't get 8 flop throughput per cycle. You can only submit one SSE instruction at a time.

Hi Stof,

I'm a developer at the Minnesota Supercomputing Institute. Similar feelings about Cell. I really wish they would make development hardware cheaper to attract more attention. $10k isn't that much in the grand scheme of things, but it's not exactly throw away money either. Cell is a popular topic around here (MSI) mostly because it's neat and exotic. There are few people here that are actually doing any real work on them.

Nite_Hawk

Shifty Geezer
29-Nov-2007, 19:20
For the sake of experimentation, isn't PS3 a suitable introduction to try things out and gauge performance? IBM's libraries support distributed processing over networked PS3s, right? So you could get 2 or 3 for a grand or so, try out some algorithms, and see how well you think it manages. Less if you know a few PS3-owning mates who wouldn't mind lending you their PS3s to run a bit of Linux code on!

Mmmkay
29-Nov-2007, 20:22
There's little interest at RAL, given its commodity-focused HPC efforts. DP performance of the PS3 is just not worth it, and the eventual HPC Cell products will be out of reach. And that's setting aside the problems with how RAL operates in terms of library and application support. In fact it's probably the latter which has more influence. Neat and exotic just isn't in the language.

Arwin
29-Nov-2007, 21:23
Hi Stof,

I'm a developer at the Minnesota Supercomputing Institute. Similar feelings about Cell. I really wish they would make development hardware cheaper to attract more attention. $10k isn't that much in the grand scheme of things, but it's not exactly throw away money either. Cell is a popular topic around here (MSI) mostly because it's neat and exotic. There are few people here that are actually doing any real work on them.

Nite_Hawk

As Shifty said, precisely what is making the Cell a popular chip in this area is the possibility to just buy that 399 PS3, install Linux on it and get going with the SDKs and excellent documentation. And you can even see examples out there already from people stacking several PS3s too.

seebs
30-Nov-2007, 02:15
It's a good testbed, I think. I did a bunch of stuff on cell simulators early on, and the PS3's faster, even if it's not quite the same.

I was going to get one of the actual dev systems, but I never got so much as a call back when I tried to contact the nice folks at Mercury. Apparently, they're WAY too busy with important things to even bother to tell me that they don't want my business. :p

Arwin
30-Nov-2007, 09:25
Apparently, they're WAY too busy with important things to even bother to tell me that they don't want my business. :p

That's a shame. On the other hand, I guess that also partly answers the thread title. :D

seebs
30-Nov-2007, 09:29
Well, to be fair, I'm just some guy. I wasn't even affiliated with a company -- I just wanted a cell blade system because I do a lot of technical writing, and I could have taken it as a deductible expense, and PROBABLY paid for it with work eventually.

But I'm just one guy, there's no company involved, so I assume they just figured there wasn't enough business there to justify the effort. It's not as though, if I wrote a lot of articles about it, I'd come back and buy fifty or a hundred more.

Arwin
30-Nov-2007, 09:46
Probably not, but if they were genuinely bored (i.e. not at 100%+ work capacity), my guess would have been that they'd have gladly sold you one, precisely because you do write articles about it. That's just speculation on my part though.

seebs
30-Nov-2007, 09:58
It might be. One of my coworkers dealt with them in another capacity once, and apparently they tend to blow off anyone who isn't likely to directly buy a LOT of hardware. I figure there's no reason for them to check that, out of a hundred people who said "I want to write about this", one particular guy might be a moderately successful writer whose articles might get read, when most of them are just dead blogs. :)

Still, it's sort of a shame. I really want one of those to mess around with. What Cell programming I've done has been neat, but I'd rather have a blade with real memory than a PS3 with 6 available SPEs and barely over 200MB to play with.

Nite_Hawk
30-Nov-2007, 14:38
It might be. One of my coworkers dealt with them in another capacity once, and apparently they tend to blow off anyone who isn't likely to directly buy a LOT of hardware. I figure there's no reason for them to check that, out of a hundred people who said "I want to write about this", one particular guy might be a moderately successful writer whose articles might get read, when most of them are just dead blogs. :)

Still, it's sort of a shame. I really want one of those to mess around with. What Cell programming I've done has been neat, but I'd rather have a blade with real memory than a PS3 with 6 available SPEs and barely over 200MB to play with.

That's pretty much our problem too. We do have people doing development on PS3s, but it's even more of a niche than cell development in general. At least with a cell blade we'd have a small chance of getting it in our data center and making it a general resource for MSI users. There's no chance of that with PS3s.

Nite_Hawk

patsu
30-Nov-2007, 18:28
Where are you guys located? I know of institutions with donated Cell blades to encourage R&D activities.

EDIT: Oh... in Minnesota. Have you approached the schools for some value exchange (write about their programs in exchange for use of Cell and access to whoever is working on the Cell)? I also know of an overseas location that allows companies to use their grid network and Cell blades for free (some strings attached).

seebs
30-Nov-2007, 18:31
I'm in Minnesota, just like it says in the post. :)

The thing is, I'm not an "institution". I'm some guy. If I got a Cell system, it'd probably be in the basement about five or ten feet from the dryer. This is not an environment conducive to sales people drooling over the future sales prospects. :)

Vitaly Vidmirov
30-Nov-2007, 20:54
seebs
it'd probably be in the basement
A garage is probably a better place. Some great things started their lives in a garage ;)

ADEX
30-Nov-2007, 21:06
Weird, I posted in this thread the other day but my post appears not to have made it...


Anyway, I think you've hit the nail on the head. IBM and Mercury don't sell stuff to end users, they only sell to other big corps. This looks bad for end users, but if they're selling Cell it is a good thing as it gives them a base to work from.

The first Cell was only really for the PS3 so it's not surprising they're not pushing it elsewhere much (other than HPC where it fits nicely).

The second gen should change things: the real HPC chip will come along and will appear in blades and possibly even workstations, but this is IBM so they won't be cheap...

Toshiba's SpursEngine chip is probably more interesting for end users as they appear to want to put them in laptops. If they sell then maybe other companies will do the same and...

As for evaluating the performance, you need to read the academic papers. Cell is typically 10x (or more) faster than a traditional core, even on problems which appear "Cell hostile".
The only real problem with it is that you have to program it specifically to get that level of performance. You can't just take existing code and expect a free speedup.

seebs
30-Nov-2007, 21:11
If I had a LOT of them, I could put them in the garage. One or two, they couldn't keep it warm enough and the cold would kill them through condensation.

Nite_Hawk
30-Nov-2007, 22:51
If I had a LOT of them, I could put them in the garage. One or two, they couldn't keep it warm enough and the cold would kill them through condensation.

Yeah, it's pretty cold this weekend. Ready for the snow? :P

Btw, where in Minnesota are you? I'm amazed there's someone else on these boards besides Geo from around here. :P

Nite_Hawk

seebs
30-Nov-2007, 23:20
Northfield, these days. Used to be in Saint Paul.

BTW, are you sure you're from Minnesota? The phrase "ready for the snow" sounds vaguely ungrammatical to me. :)

Elvedin
29-Dec-2007, 00:13
If I had a LOT of them, I could put them in the garage. One or two, they couldn't keep it warm enough and the cold would kill them through condensation.

If you buy enough for IBM to sell to you, they would certainly be able to keep your garage warm.

Does the Cell processor have a chance? Maybe. http://www.lanl.gov/news/index.php/fuseaction/nb.story/story_id/12129/nb_date/2007-12-10

Frank
11-Jan-2008, 18:21
Does it have a chance?
It's a great upgrade to the most popular processor in existence: the ARM. And it will run the target apps without much modification, giving you time to put all that added power to good use. It's also a great upgrade for the high end stuff: servers and supercomputers (as said).

Basically anything that uses Linux or similar (BSD etc) would be a great target. But you won't see it replacing x86 on the desktop any time soon, if at all.

3dilettante
11-Jan-2008, 18:35
What exactly does Cell have in common with ARM?

There's practically no code that can run well without modification on Cell, unless you want to run it all on the PPE and waste two thirds of the chip.
The ISA change, threading, and the need to refactor algorithms are not trivial.

Cell is a terrible choice for most of the ARM market. It's big, complex, expensive, and probably leaks more current in standby than most ARMs draw under load.

Frank
11-Jan-2008, 19:16
Definitely.

I meant that if you have a black box that runs a Linux derivative and you run out of steam with the ARM, a Cell would be a good upgrade. That's a small part of that whole market, but still a serious number of boxes.

And your application would most likely run after a recompile (although only on the PPU), making the transition a lot easier.

Gubbi
14-Jan-2008, 12:35
Definitely.

I meant that if you have a black box that runs a Linux derivative and you run out of steam with the ARM, a Cell would be a good upgrade. That's a small part of that whole market, but still a serious number of boxes.


If you start out using an ARM processor, you are worried about:
1. Cost
2. Power usage
3. Performance

Likely in that order. CELL is expensive and power hungry. You'd probably be much better off with a G3/G4 PPC derivative or a MIPS core. Or multiple ARM cores: the new A9 core, with dual-issue OOO execution and a 1 GHz+ operating frequency, looks promising (and all in 1.5 mm^2 at 65 nm to boot).

Edit: Alright, just echoing 3dilettante's points

Cheers