End of Cell for IBM

IBM have not stopped Cell processor development - DriverHeaven.net

We managed to get hold of an IBM spokesperson an hour ago, and they said that only one CPU development cycle is being 'halted', which is the successor to the current PowerXCell-8i CPU. IBM have said they are planning to work on other CPUs in the Cell processor 'family', and we would assume that by the time the PlayStation 4 hits market they could very well be involved on some level.
...
 
That's more interesting. So they're going to update Cell significantly, I assume, to keep up with the competition. I mean, if I remember right, the 32i was going to have improved SPUs. So I guess IBM figures the 32i iteration of Cell isn't going to cut it.
 
Another possibility is that IBM calls the PPU and its PowerPC architecture the "core" of the Cell technology :LOL:
 
It would. Lots of people were confused, and still are, about Cell. It wasn't a specific processor, but an architecture. The scope of Cell extends to multiple different cores; they wouldn't have to be tied to SPU-ISA cores. So a POWER7 with SPUs attached on a ring bus would still be Cell.

The SPUs are both the strength and the weakness of CELL. Strength in that they are what gives CELL its extraordinary computing density. Weakness in that they make CELL CPUs impossibly hard to virtualize, limiting them to a single-user (and single-application!) environment - fine for game consoles and HPC, but useless everywhere else.

Now obviously, for real-world purposes, a Cell that doesn't run existing Cell code with little more than a recompile won't convince many folk that it's still Cell! I'd expect any proper Cell to be code-compatible with current SPU code. However, replacing the PPU with a real processor is a Good Idea. If the Cell 32i was just 2 PPUs with the SPUs, it'd be worth canning IMO and replacing those with better cores.

There isn't a lot of room for CELL designers to maneuver. The size and latency of the local store are effectively part of the architecture spec now, that is, fixed.

Cheers
 
And can never be changed or updated?

The local store can't be decreased in size, since that would break existing programs.

It could be increased, but that would increase latency as well, making programs slower, especially existing ones, which expect a six-cycle latency. Also, an increase has zero benefit for existing programs (unlike caches).

The latency of the local store is a function of its size. Lowering it is out of the question; signal propagation delays increase with smaller geometries. Increasing it is bad for existing code, because code is statically scheduled by the compiler (or manually by the coder, *ugh*) to deal with the six-cycle latency.
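
To make "statically scheduled" concrete, here is a minimal sketch in plain C (hypothetical stand-in code, not actual SPU source; the six-cycle figure is the LS load latency discussed above):

```c
/* Naive loop: each multiply uses a value loaded in the same
 * iteration, so on an in-order core every use sits out the full
 * six-cycle load latency. */
void scale_naive(float *dst, const float *src, float k, int n)
{
    for (int i = 0; i < n; i++)
        dst[i] = src[i] * k;
}

/* Software-pipelined loop: the load for element i+1 is issued while
 * element i is still being multiplied, so by the time a value is
 * needed its six-cycle latency has already elapsed. */
void scale_pipelined(float *dst, const float *src, float k, int n)
{
    if (n <= 0)
        return;
    float cur = src[0];
    for (int i = 0; i < n - 1; i++) {
        float next = src[i + 1];   /* issued one iteration early      */
        dst[i] = cur * k;          /* 'cur' was loaded last iteration */
        cur = next;
    }
    dst[n - 1] = cur * k;
}
```

That spacing gets baked into the binary at compile time, which is why bumping the LS latency would make existing, carefully scheduled code stall.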

Cheers
 
It could be increased, but that would increase latency as well

Honest question - is it theoretically impossible to increase the local store without increasing latency? Isn't it possible to, say, use faster memory to compensate?
 
The local store can't be decreased in size, since that would break existing programs.

It could be increased, but that would increase latency as well, making programs slower, especially existing ones, which expect a six-cycle latency. Also, an increase has zero benefit for existing programs (unlike caches).

The latency of the local store is a function of its size. Lowering it is out of the question; signal propagation delays increase with smaller geometries. Increasing it is bad for existing code, because code is statically scheduled by the compiler (or manually by the coder, *ugh*) to deal with the six-cycle latency.

OK, pardon my ignorance, but how is it then possible that first-level caches have kept increasing in size while maintaining the same low latency?
 
By the way, the title of this topic should be amended; a question mark should be added at least: 'End of Cell for IBM?'
 
Weakness in that it makes CELL CPUs impossibly hard to virtualize, limiting them to a single user (and single application !!) environment, - fine for game console and HPC, but useless everywhere else.
Nonsense. PS3 GameOS / Linux are virtualized (including SPUs, of course).
Backing up SPU state is pretty easy.

OK, pardon my ignorance, but how is it then possible that first-level caches have kept increasing in size while maintaining the same low latency?
Increasing? The last time Intel bumped the L1 cache (32KB) in their CPUs was seven years ago. AMD has lived with 64KB for over a decade.
And do you know of any examples of a 256KB L1 cache? =)
 
Increasing? The last time Intel bumped the L1 cache (32KB) in their CPUs was seven years ago. AMD has lived with 64KB for over a decade.
And do you know of any examples of a 256KB L1 cache? =)

There were some CPUs in ancient times, the PA-RISC line, which had huge L1 caches (1MB data + 0.5MB instruction). Some even boasted 1-cycle access latency, though their clock frequencies were not fast compared to other CPUs of their generation. :)
 
The last time Intel bumped the L1 cache (32KB) in their CPUs was seven years ago. AMD has lived with 64KB for over a decade.
So they have levelled off; is that due to the effects Gubbi referred to?

And do you know of any examples of a 256KB L1 cache? =)
That was never the question. BTW, accessing the LS doesn't involve any table lookups before reaching a location either, so I'm not really impressed by the six-cycle latency, considering L1 caches do quite a bit of logic in fewer cycles.
 
So they have levelled off; is that due to the effects Gubbi referred to?


That was never the question. BTW, accessing the LS doesn't involve any table lookups before reaching a location either, so I'm not really impressed by the six-cycle latency, considering L1 caches do quite a bit of logic in fewer cycles.

All modern caches are virtually indexed. That means that the virtual address is used to initiate the load from the cache array immediately. The tags are checked in parallel and used one or a few cycles later to determine whether the data found (if any) is a hit or not.
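
A rough sketch in C of how that works (a toy model; the 32KB, 8-way, 64-byte-line parameters are assumptions, chosen so the index bits fall entirely within the 4KB page offset):

```c
#include <stdbool.h>
#include <stdint.h>

/* Toy VIPT (virtually indexed, physically tagged) L1 model:
 * 32KB, 8-way, 64B lines => 64 sets. 64 sets * 64B = 4KB per way,
 * equal to the page size, so the set index bits are page-offset
 * bits and are identical in the virtual and physical address. */
#define LINE_BITS 6                   /* 64-byte lines */
#define SET_BITS  6                   /* 64 sets       */
#define NUM_SETS  (1u << SET_BITS)
#define NUM_WAYS  8

typedef struct {
    bool     valid;
    uint64_t ptag;                    /* physical tag  */
    uint8_t  data[1 << LINE_BITS];
} line_t;

static line_t cache[NUM_SETS][NUM_WAYS];

bool lookup(uint64_t vaddr, uint64_t paddr_from_tlb, uint8_t **data_out)
{
    /* The set is selected from the virtual address immediately,
     * before translation has finished... */
    line_t *set = cache[(vaddr >> LINE_BITS) & (NUM_SETS - 1)];

    /* ...while the TLB produces the physical address in parallel;
     * the tag compare happens a cycle or two later. */
    uint64_t ptag = paddr_from_tlb >> (LINE_BITS + SET_BITS);
    for (int w = 0; w < NUM_WAYS; w++) {
        if (set[w].valid && set[w].ptag == ptag) {
            *data_out = set[w].data;
            return true;              /* hit  */
        }
    }
    return false;                     /* miss */
}
```

Incidentally, this is arguably part of why Intel has sat at 32KB/8-way for so long: growing the cache without adding ways would push index bits above the page offset and complicate this trick.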

Cheers
 
Easy as in "a few lines of code", yes. Easy as in "only takes a few cycles", no.
Since when does virtualized context switching take "a few cycles"? Does it require an L2 cache flush?
90us or so to back up an SPU context is OK. That's ~5K switches (save+restore) per second.
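
Spelling out the arithmetic behind that ~5K figure (a sketch; the 90us each way is the assumption):

```c
#include <stdio.h>

int main(void)
{
    /* ~90us to save one 256KB SPU context, and roughly the same
     * again to restore the next one */
    double save_us    = 90.0;
    double restore_us = 90.0;

    double switches_per_sec = 1e6 / (save_us + restore_us);
    printf("~%.0f full save+restore switches/sec\n", switches_per_sec); /* ~5555 */
    return 0;
}
```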
 
Dead again, per Ars :LOL:

In an interview with Heise.de, IBM's VP of Deep Computing, David Turek, confirmed that the Cell processor has reached the end of the line. Turek then put a more positive spin on the news by stating the obvious truth that heterogeneous multiprocessors, of which Cell was the first mass-market example, are here to stay, so insofar as IBM continues to produce such chips, Cell's basic concepts and ideas will live on in the company's product line.

The authors usually respond to comments regarding factuality, but none have commented on the links about IBM's clarification. Maybe Cell is dead because Ars wants it dead (conspiracy and all!). Anyhow, their take is always an interesting overview.
 
I don't know much about the math they're using, but what important calculations was CELL faster in?

As I was mentioning earlier in the thread, signal processing is one of Cell's strong suits. The article itself mentions the tasks for the cluster:

The Air Force has used the cluster to test a method of processing multiple radar images into higher resolution composite images (known as synthetic aperture radar image formation), high-def video processing, and "neuromorphic computing," or building computers with brain-like properties.

As for the supposed speed delta versus a potential GPU solution, I think it has more to do with the project being green-lighted in 2008. I think if an institution were to begin evaluating various architectures today, GPGPU would look stronger than even a year ago, simply due to OpenCL, increased FLOPS, and increased DP per card. Not that Cell hasn't recently come under the OpenCL fold as well, of course.

Even today, if we look at the price of a FireStream 9270 and consider the host system needs, on a pure cost basis (~3 PS3s, i.e. 3 nodes, per cost of a single card) I think Cell would still be a worthwhile choice in certain situations.
 
I think Jon's been off on Cell since day 1, personally... he never seemed to 'get' it. Even that article, painting the picture as if IBM 'sold' Sony on Cell, reflects a warped understanding of the chip's origins, since IBM had to be dragged to the SPE party, essentially.

But I do think that, as a branch, Cell's ball will be picked up by a different architecture over at IBM. The DriverHeaven article is just the most positive spin on what is essentially the same non-denial denial out of David Turek that all these sites are working with.

I've said it before in other threads, but I find it a bit ironic that the greatest beneficiary of the architecture may ultimately be IBM, who of course wanted something more 'standard' at the outset. Cell has given them a position quick off the line in the world of many-core architectures, the supporting tools, and plain old experience/R&D. Whatever comes next for them, I'm hoping for an interesting chip.

As an aside, I think the HPCWire article linked within Jon's makes a great sort of "memories of Cell" piece:

http://www.hpcwire.com/features/Will-Roadrunner-Be-the-Cells-Last-Hurrah-66707892.html

It's interesting to note that the top 6 supercomputers on the Green 500 are all Cell-based systems.

http://www.green500.org/lists/2009/11/top/list.php
 
Not to derail, but does anyone have a relative size comparison between a single Larrabee core and an SPE? What is the cost going to be of taking a more traditional core (sans OOOe) with a cache and latching on a honking vector unit, compared to a clean-slate SPE design?

Anyhow, it will be interesting to see how on-chip communication matures. You don't hear a lot of complaints about Cell in this regard.
 
Since when does virtualized context switching take "a few cycles"? Does it require an L2 cache flush?

A normal processor's explicit context is on the order of half a kilobyte to one kilobyte. The SPU's context is almost three decimal orders of magnitude larger.

90us or so to back up an SPU context is OK. That's ~5K switches (save+restore) per second.

It's OK because CELL is used in environments where you, on average, run only one application (a game) at a time.

For reference, on bog-standard Linux, switching time is less than one microsecond.

A typical server has around 1,000-2,000 context switches per second per core, but can go much higher (e.g. Citrix servers do). Your typical Vista desktop has around 1,000-4,000 switches/second at any given time, just web browsing, listening to streaming music, etc.

One thing is the amount of time used; another is bandwidth. 2,000 switches per SPU per second, times 512KB per context switch (256KB out, 256KB in), across the seven usable SPUs, means you would spend ~7GB/s of bandwidth just on context switches.
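
Back-of-the-envelope version of that figure (a sketch; the 2,000/s rate and the seven usable SPUs in a PS3-style Cell are the assumptions):

```c
#include <stdio.h>

int main(void)
{
    double switches_per_sec = 2000.0;            /* per SPU, server-ish load  */
    double bytes_per_switch = 2 * 256.0 * 1024;  /* 256KB LS out + 256KB in   */
    int    spus             = 7;                 /* usable SPUs in a PS3 Cell */

    double per_spu_bps = switches_per_sec * bytes_per_switch;  /* ~1 GB/s    */
    printf("per SPU: %.2f GB/s\n", per_spu_bps / 1e9);
    printf("chip:    %.2f GB/s\n", per_spu_bps * spus / 1e9);  /* ~7.3 GB/s  */
    return 0;
}
```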

Cheers
 