End of Cell for IBM

Discussion in 'Console Industry' started by Butta, Nov 20, 2009.

  1. Vitaly Vidmirov

    Newcomer

    Joined:
    Jul 9, 2007
    Messages:
    108
    Likes Received:
    10
    Location:
    Russia
    You mentioned "virtualization". I suppose an "inter-OS" switch can't be that fast.
    It doesn't matter how many SPU switches occur within a single OS per second.
    SPUs run asynchronously and do not depend on PPE thread scheduling.
    It is also not necessary to save all SPU contexts - only the ones that are in use.
    Each SPU is independent and can work with a different address space.

    upd:
    Oh, well, I see you mean SPU virtualization, but I mean OS virtualization =)
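    (For illustration only: this is roughly what that asynchronous model looks like from the PPE side with libspe2. Each SPE context gets its own small PPE thread, because spe_context_run() blocks, and the SPUs then run on their own while the PPE does other work. The SPE image name and the SPE count are placeholders, not anything from this thread.)

    #include <libspe2.h>
    #include <pthread.h>

    #define NUM_SPES 4                    /* arbitrary for the sketch */

    static void *run_spe(void *arg)
    {
        spe_context_ptr_t ctx = (spe_context_ptr_t)arg;
        unsigned int entry = SPE_DEFAULT_ENTRY;
        spe_stop_info_t stop;
        spe_context_run(ctx, &entry, 0, NULL, NULL, &stop);   /* blocks this PPE thread only */
        return NULL;
    }

    int main(void)
    {
        spe_program_handle_t *prog = spe_image_open("spu_kernel.elf");  /* placeholder image */
        spe_context_ptr_t ctx[NUM_SPES];
        pthread_t tid[NUM_SPES];

        for (int i = 0; i < NUM_SPES; i++) {
            ctx[i] = spe_context_create(0, NULL);
            spe_program_load(ctx[i], prog);
            pthread_create(&tid[i], NULL, run_spe, ctx[i]);
        }

        /* ... the PPE carries on with its own work here, independent of the SPUs ... */

        for (int i = 0; i < NUM_SPES; i++) {
            pthread_join(tid[i], NULL);
            spe_context_destroy(ctx[i]);
        }
        spe_image_close(prog);
        return 0;
    }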
     
    #141 Vitaly Vidmirov, Nov 24, 2009
    Last edited by a moderator: Nov 24, 2009
  2. patsu

    Legend

    Joined:
    Jun 25, 2005
    Messages:
    27,614
    Likes Received:
    60
    Depends on how you look at it; it may be because the existing implementation of Cell is a power-efficient design (read: the PPU is too weak for the workload you highlighted). The SPUs are there to speed up media/computationally heavy workloads and perhaps natural-interface (a la PS Eye) apps at relatively low power consumption.
     
  3. ADEX

    Newcomer

    Joined:
    Sep 11, 2005
    Messages:
    231
    Likes Received:
    10
    Location:
    Here
  4. Crossbar

    Veteran

    Joined:
    Feb 8, 2006
    Messages:
    1,821
    Likes Received:
    12
    Thanks for the information, those caches must run damn hot. :cool:
     
  5. Carl B

    Carl B Friends call me xbd
    Moderator Legend

    Joined:
    Feb 20, 2005
    Messages:
    6,266
    Likes Received:
    63
    This article is the worst of them yet, because it delivers a hodge-podge amalgam of the articles previously posted in the thread. Worst of all, if taken at face value, it confuses the reader.

    It basically goes like this: IBM says the PowerXCell 32i is canceled, media report that Cell is EOL, people contact IBM for clarification, and all IBM clarifies is that the XCell 32i is canceled, that Cell in its present form will continue to be fabbed... (and this blows my mind - was it really in question, Kotaku?)... and that the technology lessons learned will be applied to other chips.

    There has still been no word on a Cell2, PS4, or anything other than the XCell 32i. These stories are all feeding off of each other to create news where there is very little, and turning the whole thing into a game of 'telephone.'
     
  6. Weaste

    Newcomer

    Joined:
    Nov 13, 2007
    Messages:
    175
    Likes Received:
    0
    Location:
    Castellon de la Plana
    If this is the roadmap that people are talking about, then the PowerXCell 32i was always labeled as a concept, even in 2008. The act of them dropping it shouldn't really be so much of a shock. It clearly doesn't suit IBM's needs.

    [Image: Cell/PowerXCell roadmap]

    That said however, Sony were never going to use that chip in any case. Although labeled in a certain way, you can really read those three branches from top to bottom as IBM, Sony, Toshiba. The IBM stem never really said anything about what Sony would do to the architecture (and Sony are not likely to release a roadmap), yet the whole thing has turned into an Internet event.
     
  7. Weaste

    Newcomer

    Joined:
    Nov 13, 2007
    Messages:
    175
    Likes Received:
    0
    Location:
    Castellon de la Plana
    Again, I'm quite naive about these things, but are there not all sorts of ingenious jiggery-pokery tricks you could come up with to increase the size of the local store yet still retain performance?

    For example, could you not have 256k for instructions and 256k for data? Could you then not stick, say, another 512k on top of that as another layer for DMAing in and out of DRAM/XDR, which could then be copied over when needed? Would something like that not allow greater performance than constantly DMAing data in and out of a smaller space?
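    (For reference, the usual answer to that last question is software double buffering: with two buffers in the local store, the SPU fetches block N+1 while it is still computing on block N, so the small store mostly costs buffer space rather than throughput. A rough SPU-side sketch, where the chunk size, tags and process() routine are purely illustrative:)

    #include <spu_mfcio.h>
    #include <stdint.h>

    #define CHUNK 16384                               /* 16 KB, the per-transfer MFC limit */

    static char buf[2][CHUNK] __attribute__((aligned(128)));

    static void process(char *data, int n) { /* ... work on one chunk ... */ (void)data; (void)n; }

    void stream(uint64_t ea, int nchunks)
    {
        int cur = 0;
        mfc_get(buf[cur], ea, CHUNK, cur, 0, 0);      /* prime the pipeline */

        for (int i = 0; i < nchunks; i++) {
            int nxt = cur ^ 1;
            if (i + 1 < nchunks)                      /* start the next transfer early */
                mfc_get(buf[nxt], ea + (uint64_t)(i + 1) * CHUNK, CHUNK, nxt, 0, 0);

            mfc_write_tag_mask(1 << cur);             /* wait only for the current buffer */
            mfc_read_tag_status_all();

            process(buf[cur], CHUNK);                 /* compute overlaps the in-flight DMA */
            cur = nxt;
        }
    }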

    EDIT: Also, on the roadmap, it mentions an eSPE. I suppose that would mean an enhanced SPE? If more memory was not to be added, what could they possibly enhance about the SPE? They sort of imply it here.

    [Image: roadmap slide mentioning the eSPE]
     
    #147 Weaste, Nov 24, 2009
    Last edited by a moderator: Nov 24, 2009
  8. Lazy8s

    Veteran

    Joined:
    Oct 3, 2002
    Messages:
    3,100
    Likes Received:
    18
    Real innovations in processor efficiency have come from the computing space where necessity for the technology can overcome the momentum of established platforms, like in the mobile sector.

    Designs like Metagence or perhaps SuperH or ARM are among the few CPUs which might actually qualify to be called efficient, and they'd scale up far more effectively than CELL would scale down.
     
    #148 Lazy8s, Nov 25, 2009
    Last edited by a moderator: Nov 25, 2009
  9. Gubbi

    Veteran

    Joined:
    Feb 8, 2002
    Messages:
    3,528
    Likes Received:
    862
    Yeah, sorry, should have clarified. It's process virtualization I see as the problem.

    I don't like any of the process models for SPUs. They depend on either cooperative scheduling or batch processing, concepts that were abandoned 30 years ago for good reasons.

    Cheers
     
  10. Arwin

    Arwin Now Officially a Top 10 Poster
    Moderator Legend

    Joined:
    May 17, 2006
    Messages:
    17,682
    Likes Received:
    1,200
    Location:
    Maastricht, The Netherlands
    Abandoned where? Next you're going to say that in-order processing was also abandoned 30 years ago for good reasons. ;)
     
  11. thambos

    Newcomer

    Joined:
    Sep 29, 2007
    Messages:
    194
    Likes Received:
    0
    If Sony abandons the Cell architecture for the PS4, doesn't that make a lot of the effort spent learning how to optimize for Cell on the PS3 less valuable moving forward?

    I always thought sticking with Cell would be their strategy, so that the PS4 could hit the ground running (as opposed to the PS3, which has had growing pains).
     
  12. Gubbi

    Veteran

    Joined:
    Feb 8, 2002
    Messages:
    3,528
    Likes Received:
    862
    Unix and VMS changed the game in the 70s. Cooperative multitasking was last used in Windows 3.1 and Mac OS 9; try to find three nice things to say about those (then) living dinosaurs.

    Batch processing is only used in select areas (banking) where you can afford to have services offline - and it makes checkpointing vastly easier.

    Cheers
     
  13. liolio

    liolio Aquoiboniste
    Legend

    Joined:
    Jun 28, 2005
    Messages:
    5,723
    Likes Received:
    193
    Location:
    Stateless
    From reading Repi's presentations, I would say no. The teams that made the most out of Cell are likely to be the ones that will make the most of upcoming GPUs.
     
    #153 liolio, Nov 25, 2009
    Last edited by a moderator: Nov 25, 2009
  14. Arwin

    Arwin Now Officially a Top 10 Poster
    Moderator Legend

    Joined:
    May 17, 2006
    Messages:
    17,682
    Likes Received:
    1,200
    Location:
    Maastricht, The Netherlands
    I'm pretty sure you've answered my question, but I also have a very strong feeling that you don't realise it.

     
  15. Shifty Geezer

    Shifty Geezer uber-Troll!
    Moderator Legend

    Joined:
    Dec 7, 2004
    Messages:
    40,730
    Likes Received:
    11,203
    Location:
    Under my bridge
    Yes. This is why ARM features in 4 out of the top 6 supercomputers, and continues to go from strength to strength in the many-core system market. :yep2:
     
  16. patsu

    Legend

    Joined:
    Jun 25, 2005
    Messages:
    27,614
    Likes Received:
    60
    Cooperative multitasking is not a Cell architectural characteristic though. Linux runs well on the PS3 and IBM's Cell workstations after all. Heck, Roadrunner also runs Linux.

    In heterogeneous environments where power consumption is not an issue, they have dedicated chips like GPGPUs to handle the heavy workloads. In Cell, everything is packed into one chip with a fast interconnect, that's all.
     
  17. Gubbi

    Veteran

    Joined:
    Feb 8, 2002
    Messages:
    3,528
    Likes Received:
    862
    But the SPUs are used either via spufs or via SPE threads; the former is cooperative (because the SPUs are a limited resource), while the latter is basically batch processing, since SPE threads are run in simple FIFO order. The reason you don't see preemptive multitasking for SPUs is the cost of switching the SPU context, and that is an architectural characteristic.
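    (Rough numbers, for scale: a full SPU context is the 256 KB local store plus 128 x 16-byte registers plus MFC/channel state, call it ~258 KB. A pre-emptive switch has to save one context and restore another, so on the order of half a megabyte of data movement per switch, on top of first draining any in-flight DMAs. A conventional core saves at most a few kilobytes of register state; that gap is the architectural cost being referred to here.)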

    I have no problem with that, it just limits the application space where CELL is useful.

    Cheers
     
  18. patsu

    Legend

    Joined:
    Jun 25, 2005
    Messages:
    27,614
    Likes Received:
    60
    Sure, but this is also true for GPUs and many asymmetric processing systems, no? It is an acceptable and legitimate form of high-performance system design. In some cases, for sustained throughput, you want to avoid the overhead of task switching altogether anyway.

    These issues are separate from whether Cell supports a pre-emptively scheduled OS or not. The IBM Cell blade has Linux running on two Cells, for example. It does mean that the programmer has to handle SPU scheduling explicitly (in an application-specific way), just like delegating graphics tasks to a GPU in a pre-emptively scheduled OS.

    With the introduction of OpenCL, the hardware differences have been abstracted, which may popularize the programming model further. This does not look like an obsolete concept to me.
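    (As a rough illustration of that abstraction, assuming a generic OpenCL 1.0 host: the host picks a device and explicitly enqueues a kernel to it, and the same code runs whether the device underneath is a GPU or, with a Cell OpenCL implementation, the SPEs. The kernel, sizes and names below are placeholders.)

    #include <CL/cl.h>

    /* Trivial placeholder kernel; the point is only the explicit delegation of work. */
    static const char *src =
        "__kernel void scale(__global float *x, float a) {"
        "    size_t i = get_global_id(0);"
        "    x[i] *= a;"
        "}";

    int main(void)
    {
        cl_platform_id plat;
        cl_device_id dev;
        clGetPlatformIDs(1, &plat, NULL);
        clGetDeviceIDs(plat, CL_DEVICE_TYPE_DEFAULT, 1, &dev, NULL);  /* GPU, SPEs, ... */

        cl_context ctx = clCreateContext(NULL, 1, &dev, NULL, NULL, NULL);
        cl_command_queue q = clCreateCommandQueue(ctx, dev, 0, NULL);

        cl_program prog = clCreateProgramWithSource(ctx, 1, &src, NULL, NULL);
        clBuildProgram(prog, 1, &dev, NULL, NULL, NULL);
        cl_kernel k = clCreateKernel(prog, "scale", NULL);

        float data[1024] = {0};
        float a = 2.0f;
        cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE | CL_MEM_COPY_HOST_PTR,
                                    sizeof(data), data, NULL);
        clSetKernelArg(k, 0, sizeof(buf), &buf);
        clSetKernelArg(k, 1, sizeof(a), &a);

        size_t global = 1024;
        clEnqueueNDRangeKernel(q, k, 1, NULL, &global, NULL, 0, NULL, NULL);   /* explicit hand-off */
        clEnqueueReadBuffer(q, buf, CL_TRUE, 0, sizeof(data), data, 0, NULL, NULL);

        clReleaseMemObject(buf); clReleaseKernel(k); clReleaseProgram(prog);
        clReleaseCommandQueue(q); clReleaseContext(ctx);
        return 0;
    }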
     
  19. Gubbi

    Veteran

    Joined:
    Feb 8, 2002
    Messages:
    3,528
    Likes Received:
    862
    Absolutely, I'm not disputing that at all. I've stated all along that CELL makes sense in a console or for HPC.

    I was merely pointing out why CELL has problems breaking out of those niches, and therefore enjoys less than stellar commercial success (the context of this thread).

    Cheers
     
  20. Shifty Geezer

    Shifty Geezer uber-Troll!
    Moderator Legend

    Joined:
    Dec 7, 2004
    Messages:
    40,730
    Likes Received:
    11,203
    Location:
    Under my bridge
    Isn't this more a problem of OS design than Cell's design? Take your Windows 3.1 example. Cooperative multitasking is ancient history, right? But if we had Cell back then, 8 hardware cores would have meant 8 active processes and no need to task switch given the workloads of the period. A 32-core Cell could handle 32 tasks simultaneously.

    How much need is there to context switch with a standard OS workload? Obviously with Windows as it is, loads, because the system is fractured into zillions of little processes. But even then, plenty could be batch processed comfortably. In terms of user experience, you'd need a core for media playback, another for the web browser, another for the word processor, another for printing... and given how people use computers, there's no need for simultaneous processing beyond the number of cores we'll have available.

    So design an OS more efficiently to collect and process various functions into a set of fast streams (we don't need to worry about interrupting one process to switch to another if it can be completed fast enough not to affect the system), and SPUs/other simpler cores will be just fine while offering huge processing throughput for demanding tasks.
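    (A rough modern analogue of that idea, purely as illustration: on Linux you can pin a workload to its own core with sched_setaffinity() so it never competes for that core, which is how "a core per workload" is usually approximated today. The core number and the workloads are placeholders, not anything proposed in the thread.)

    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>

    /* Dedicate the calling process to one CPU, approximating the
     * "one core per workload" layout sketched above. */
    static int pin_to_core(int core)
    {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(core, &set);
        return sched_setaffinity(0, sizeof(set), &set);   /* 0 = this process */
    }

    int main(void)
    {
        if (pin_to_core(3) != 0)                          /* core number is arbitrary */
            perror("sched_setaffinity");
        /* ... run the media player / browser / word processor on its own core ... */
        return 0;
    }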
     