Was Cell any good? *spawn

If I want to just buy a single CPU today that has significant multi-core power, even a quad-core i7 is significantly more expensive (and I don't doubt much more power-hungry) than any current-gen console, including the most complete PS3 package you can find. And GPUs that are as good as the Cell for GPGPU-style tasks may well exist, but finding one that is as powerful at the same wattage the Cell currently uses isn't as easy as you might think. Today. In 2012.
Uhh... trolling much? Or do you actually believe this? Especially for GPUs... they are massively more powerful using a fraction of the power. I'd even guess i7s are in the same throughput range, with a far more useful memory hierarchy behind them.

As far as "reaching peak throughput" goes, that's a meaningless notion without considering a specific problem. And in those cases Cell is like a GPU... it looks good on the brute force algorithm with respect to throughput, but the efficient algorithm (read, recursive/irregular in most cases) on a more flexible processor will crush it.
 
aaronspink

The problem is that the SPEs are basically limited to the same or less programmable functionality than a modern GPU core while being much larger and more power hungry.
Hmm, can you generate the code from your compute shader and run it immediately without a glitch?

Still, this "power hungry" CELL was unreachable by other CPUs a few years ago and was the #1 CPU on the Green500 list.
http://www.realworldtech.com/compute-efficiency-2012/
It took Intel (and IBM :D) 5 years to come close to CELL in performance. Yeah, this is a bad architecture. LOL
I mean SP of course, but GPUs are not a DP powerhouse either :smile:

And they significantly lack the flexibility of a modern SIMD CPU core while being not much smaller.
So how can a GP core that is actually more flexible than any "classic" CPU core be less programmable than a GPU?
You can use cache where needed, with any cache policy, and prefetch the data you need in advance. None of this is possible on "classic" CPUs.
Program flow is "scalar" on the SPE; I'm not sure how else to describe it.
GPUs are SPMD machines and waste resources with their wide warps/wavefronts.

The SPE is predictable and can entirely hide data transfer overhead on streaming workloads.
The "advanced" prefetchers that Intel uses are a joke compared to explicitly programmed DMA.
An efficient ISA plus DMA allows 90%+ compute and bandwidth utilisation on a wide range of workloads.
The unified register file and lack of flags eliminate a lot of the stalls and issues that PPC/ARM/x86 hit when moving data between different computation units.
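Roughly, the double-buffered streaming pattern looks like this; a minimal sketch assuming the standard spu_mfcio.h MFC calls, a made-up process_block() kernel and illustrative buffer/tag choices, not code from any real project:

Code:
#include <spu_mfcio.h>
#include <stdint.h>

#define BLOCK 16384  /* 16 KB, the maximum size of a single DMA transfer */

static char buf[2][BLOCK] __attribute__((aligned(128)));

/* hypothetical user kernel that works entirely in local store */
extern void process_block(char *data, unsigned size);

void stream_process(uint64_t ea, unsigned nblocks)
{
    unsigned cur = 0;

    /* kick off the first transfer */
    mfc_get(buf[cur], ea, BLOCK, cur, 0, 0);

    for (unsigned i = 0; i < nblocks; ++i) {
        unsigned next = cur ^ 1;

        /* start fetching block i+1 while we compute on block i */
        if (i + 1 < nblocks)
            mfc_get(buf[next], ea + (uint64_t)(i + 1) * BLOCK, BLOCK, next, 0, 0);

        /* wait only for the buffer we are about to touch */
        mfc_write_tag_mask(1 << cur);
        mfc_read_tag_status_all();

        process_block(buf[cur], BLOCK);  /* compute overlaps the in-flight DMA */

        cur = next;
    }
}

As long as the compute loop is long enough to cover the transfer latency, the MFC keeps the local store fed and the SPU never waits on memory, which is the whole point of explicit DMA versus a hardware prefetcher guessing the access pattern.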

You still want the capability to do coherent transaction/message passing in order to co-ordinate and interact. But this isn't how the SPEs were architected.
The SPE has message-passing channels as well as the SL1 cache for fast inter-SPE communication via atomic DMA.
DMA traffic is cache coherent, so it is perfectly synchronised with the PPU. Coherency is not required for the actual data processing because each core works on its own separate set of data.
Temporary data is also irrelevant to the other cores; "classic" CPUs just waste energy here on useless work.
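For a small taste of those channels, the SPU-side mailbox interface in spu_mfcio.h covers the simple PPU-to-SPU signalling case; the "command word in, completion word out" protocol here is invented for the example:

Code:
#include <spu_mfcio.h>
#include <stdint.h>

/* hypothetical handshake: the PPU drops a command word into our inbound
   mailbox, we do the work, then post a completion word back */
void wait_and_ack(void)
{
    uint32_t cmd = spu_read_in_mbox();  /* stalls until the PPU writes */

    /* ... act on cmd, e.g. kick a DMA and run a kernel ... */
    (void)cmd;

    spu_write_out_mbox(1);              /* tell the PPU we are done */
}

Heavier producer/consumer schemes are normally built on atomic DMA (getllar/putllc) instead, but the mailboxes are enough for basic coordination.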

I've read all the documentation as well as discussed the architecture with many who have programmed it. One doesn't need to program an architecture in order to understand its issues and shortcomings.
As if someone who read a book about sex somehow became less of a virgin.
BTW, sex and the SPU have at least two similarities :cool:

Where you see issues, I see opportunities. CELL is the #1 most brilliant CPU design for me. You can color me a CELL fanboy.
I don't know about HPC, but in games the particular problem can usually be reshaped to map effectively onto the underlying hardware.

Andrew Lauritzen
Having a unified address space is barely useful since it's not even cached... not even for read-only data (like GPUs).
1) You think of the GPU as a set of CPU-initiated kernels. IMO the GPU should act like a separate CPU core, pick up data by itself and traverse CPU data structures directly without additional fuss from the CPU.
Discrete GPUs are a dead end because of the perpetual interconnect bottleneck.

2) It is not cached yet on PC GPUs because GPUs are still in their infancy.
The ARM Mali T6xx GPU has a unified address space and allows cache coherency.
Next-gen APUs will have the same functionality one day.

You don't want to be pointer-chasing on SPUs (or GPUs) anyways
Why not? As long as it is not on the critical path.
In the Local Store, pointer-chasing is free. Pointer-chasing through immediate DMA is not, but it is less of a problem than the anemic PPU.

I really want to give Cell the benefit of the doubt but as I mentioned retrospectively the memory hierarchy choices they made are just too crippling.
The PPU, with its "proper" memory hierarchy, can hardly achieve 10% of the memory bandwidth under Linux without manual prefetching.

Thus I agree with Aaron... either a CPU or a GPU is more efficient at the vast majority of interesting workloads.
That depends on the CPU core. The SPU can be several times faster than the PPU on the same unoptimised scalar code.
The problem with the SPU is the lack of an adequate programming environment and compiler; GCC completely sucks at this task. SPU programming is still a serious waste of time, but at least it is fun :cool:

Furthermore it came out around the same time that G80 did, which really is where the GPUs begun to put the nails in the coffin.
And it was a gigantic GPU with three times more transistors than CELL.
CELL is not a GPU. So why not compare GPUs to x86? Did GPUs put the nails in the x86 coffin?
The fact that CELL is used as an RSX accelerator is a consequence of less-than-required GPU performance.

so I think it's pretty fair to conclude that Cell is going to look really bad in that comparison today.
7-billion-transistor 2012 GPUs vs. a 230-million-transistor 2005 CELL? :cool:

I'm writing this after a sleepless night of SPU coding on the weekend =)
 
Thinking almost to myself here: while I do expect proper counter-arguments to follow, I believe both of you, archangel and Vladimir, have brought me an almost philosophical moment in this discussion, and you have to appreciate for yourself the wisdom in some pretty simplistic statements: winners write history.

That's scary, ain't it? And it's as simple as that: even if you win this argument, it would not change anything.
:arrow: time for bed...
 
the efficient algorithm (read, recursive/irregular in most cases) on a more flexible processor will crush it.

I'd love to hear (honestly) about recursive/irregular algorithms actually being the faster ones. As far as I've learnt, by far the most efficient method of doing almost anything these days is streaming data along a factory line of modifier code.
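To make the "factory line" idea concrete, here is a generic sketch in C; the struct and the two modifier stages are invented for the example and aren't from any particular engine:

Code:
#include <stddef.h>

/* structure-of-arrays layout: each "modifier" walks the arrays linearly */
typedef struct {
    float *pos_x, *pos_y, *pos_z;
    float *vel_x, *vel_y, *vel_z;
    size_t count;
} Particles;

/* stage 1: integrate positions */
static void integrate(Particles *p, float dt)
{
    for (size_t i = 0; i < p->count; ++i) {
        p->pos_x[i] += p->vel_x[i] * dt;
        p->pos_y[i] += p->vel_y[i] * dt;
        p->pos_z[i] += p->vel_z[i] * dt;
    }
}

/* stage 2: apply drag */
static void apply_drag(Particles *p, float drag)
{
    for (size_t i = 0; i < p->count; ++i) {
        p->vel_x[i] *= drag;
        p->vel_y[i] *= drag;
        p->vel_z[i] *= drag;
    }
}

/* the "factory line": the same flat data flows through one modifier after another */
void update_particles(Particles *p, float dt)
{
    integrate(p, dt);
    apply_drag(p, 0.99f);
}

Every stage is a predictable linear sweep over flat data, which is exactly what prefetchers, DMA engines and wide SIMD like.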
 
Massively parallel programming requires a different way of thinking compared to traditional sequential programming. Algorithms are different, analysis methods are different and time complexity classes are different. A GPU requires problems to be split into thousands of small work items (you need to have lots of excess items for latency-hiding purposes). This is fine for some algorithms, but many algorithms are very hard to fully parallelize. For example, a simple (single line) loop that accumulates values in an array and stores the accumulated value at each index (prefix sum) is a very hard problem to solve efficiently on a parallel architecture. Every programmer knows instantly how to program this for a CPU (or for a SPU), but there are dozens of articles written just about this one algorithm for parallel architectures (I have read a few hundred pages of academic research about good parallelization of this single-line algorithm). Simple, common algorithm steps can be very hard to do efficiently with a GPU. Libraries have started to include collections of "primitives" to help programmers get started, because even simple steps can be too hard to implement efficiently without doing lots of research first.
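For reference, this is the "single line" loop in question: an inclusive prefix sum in plain C, trivially easy in serial form, which is exactly the contrast with the parallel formulations:

Code:
/* inclusive prefix sum: out[i] = in[0] + in[1] + ... + in[i] */
void prefix_sum(const float *in, float *out, int n)
{
    float acc = 0.0f;
    for (int i = 0; i < n; ++i) {
        acc += in[i];   /* every element depends on all elements before it... */
        out[i] = acc;   /* ...which is what makes the parallel version hard */
    }
}

The element-to-element dependency is what forces the multi-pass (up-sweep/down-sweep) formulations that the GPU literature spends so many pages on.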

GPU compute was still very much in an experimental stage in 2003-2004 (when the PS3 design must have been locked in). G80 was released in 2006 and was the first "general purpose" programmable GPU. It supported Nvidia's brand new proprietary CUDA (1.0) compute API. The debuggers were horrible back then, and the support libraries were pretty much nonexistent. There weren't any cross-platform/device APIs until many years later. We had to wait until DirectX 11 (2009) to have DirectCompute, and even longer to have solid OpenCL drivers for all hardware. G80 wasn't a straight compute-oriented design like its followers (G200 and especially Fermi). It lacked many important compute-oriented features such as atomic operations. Parallel kernel support (G200, 2008) and general purpose caches (Fermi, 2010) were also introduced later on. Nvidia started focusing heavily on compute after G80: G200 had a significantly higher ratio of compute hardware to graphics hardware (texturing etc.), and Fermi continued this further. G80 wasn't yet a compute monster.

G80 compute performance was around 150-170 gigaflops (http://www.beyond3d.com/content/reviews/1/11). That's pretty much in the same ballpark as Cell's peak figures. Sony could have waited, but they likely knew that the GPU compute peak performance was similar, that GPUs required a completely new highly parallel programming model, and completely new tools and debuggers. GPGPU was in no way a proven, ready-to-use technology; it was a very high risk proposal. Cell, on the other hand, was much more like a traditional CPU. It just lacked automated caches. You have to manually prefetch everything on Xenon as well if you want to get good performance out of it, so it's not that much different for the programmer. It's actually easier to get good peak performance out of Cell's vector pipelines compared to Xenon's vector pipelines (fewer stall cases, shorter pipelines, data addressing using vector data, constant latencies, etc.). Cell wasn't alone in dropping coherent caches to scale up core counts. Intel did exactly the same in their 48 and 80 core prototype chips. Intel did however later change their design to have coherent caches (Larrabee).
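As an aside on the "manual prefetch on Xenon" point, the general shape of software-prefetched streaming code is shown below; __builtin_prefetch is used as a portable GCC/Clang stand-in rather than the actual Xbox 360 intrinsic, and the lookahead distance is an arbitrary placeholder you would tune per platform:

Code:
/* sum a large array while touching cache lines ahead of the current read */
float sum_with_prefetch(const float *data, int n)
{
    float acc = 0.0f;
    for (int i = 0; i < n; ++i) {
        if (i + 32 < n)
            __builtin_prefetch(&data[i + 32]);  /* hint: pull a future line into cache */
        acc += data[i];
    }
    return acc;
}

On an in-order core, hints like this (or explicit cache-touch instructions) can be the difference between streaming at close to full bandwidth and stalling on every cache miss.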

I don't think that things would be better if they had chosen to drop the SPUs completely, and to rely completely on GPU compute. A single (SMT) PPC core with a G80 would have been a disaster. GPU rendering is easy because you don't want to get any data back. Asynchronous operation is not a problem, since ~half a frame latency to display isn't noticeable. But if you have a GPU as your primary number cruncher, you absolutely need low latency, and for that you need multiple hardware ring buffers, parallel kernel execution and hardware context switching. Kepler was the first Nvidia GPU to include all these things. And we are still waiting for cache coherency (HSA will bring this eventually). So you would have had all the same problems that Cell (the SPUs) brought up, and then some (super long compute latencies, being forced to write highly parallel code).

Most game programmers are not capable of writing efficient massively parallel GPU compute code. Not today, and surely not in 2005 (the year the first dual core CPUs were introduced to the consumer market). The new features in GCN, HSA, Fermi and Kepler help a lot, and new tools and languages (for example C++ AMP) also help. GPU compute would be a safe bet for a console released today, but I don't think it would have been that good an idea in 2005.
Well said Vitaly...
Couldn't have said it better myself.
As far as I've learnt, by far the most efficient method of doing almost anything these days is streaming data along a factory line of modifier code.
That's true. That's usually the best way to compute things in games. For CPUs, for Cell and for GPUs.
 
The hot question here seems to be: should Sony have waited slightly longer to replace their SPU based vector processing architecture with GPU compute (G80)?
That's a good reply on the shortcomings of GPGPU, but I don't think that's quite the argument of the anti-Cell party. Their criticism is that the problems Sony wanted to address with Cell, as a small, fast compute architecture for many devices, were something of Sony's own imagining. CE devices still benefit from custom ASICs, so Cell hasn't a place there. In the PC space, extended stream processing is added to the core processors, so Cell hasn't a place there, while super parallel processing is moving to the GPU, so Cell hasn't a place there. As such, perhaps Sony shouldn't have tried to do something clever with a new CPU architecture and should instead have seen and accepted how things were going to turn out? If the lack of room for Cell was apparent, they could have forgone investment in Cell and spent their money on a completely different PS3 design, perhaps commissioning a proper custom GPU early on.

I think the biggest support for that view is that Cell hasn't moved beyond the PS3 or HPC. Toshiba flirted with the idea with a SPURSEngine-powered TV and laptop, but AFAIK their latest top-end TVs have replaced Cell with another processor. SPURS hasn't found its way into laptops or the like either to offer flexible, easy compute resources. Whether that's because the architecture is a no-go, or because the market is resistant to new ideas, I don't know, but I think the power draw of Cell is too much for it to be usable in the places where it'd be of most use. HD cameras and audio laptops and the like aren't going to feature Cell. I'd be very interested to see what a 22nm SPURSEngine could offer, though.
 
aaronspink

Hmm, can you generate the code from your compute shader and run it immediately without a glitch?

Yes you can.

Still, this "power hungry" CELL was unreachable by other CPUs a few years ago and was the #1 CPU on the Green500 list.
http://www.realworldtech.com/compute-efficiency-2012/
It took Intel (and IBM :D) 5 years to come close to CELL in performance. Yeah, this is a bad architecture. LOL
I mean SP of course, but GPUs are not a DP powerhouse either :smile:

Peak != real. Facts of life and all. What would top that chart? An FPGA composed completely of MACs!

You'll notice it died a rather quick and merciless death (PowerXCell 8i). IBM basically decided that it wasn't a viable direction and left it to die. If they had thought it was a viable direction, they could have easily expanded the number of SPUs in the design; instead they designed several different, completely new chips for that space (Power7 + the associated super router and BG).

So how can a GP core that is actually more flexible than any "classic" CPU core be less programmable than a GPU?

Because it simply isn't anywhere close to "more flexible than any "classic" CPU core". Not within a million light years of your assertion.

You can use cache where needed, with any cache policy, and prefetch the data you need in advance. None of this is possible on "classic" CPUs.

You apparently don't understand the cell architecture.

Where you see issues, I see opportunities. CELL is the #1 most brilliant CPU design for me. You can color me a CELL fanboy.

Yeah, we get that you are a fanboy and don't know what you're talking about. I'm pretty sure that was apparent from your first sentence.

That depends on the CPU core. The SPU can be several times faster than the PPU on the same unoptimised scalar code.

No, no it cannot. You have to start doing optimizations just to be able to RUN.

And it was a gigantic GPU with three times more transistors than CELL.
CELL is not a GPU. So why not compare GPUs to x86? Did GPUs put the nails in the x86 coffin?
The fact that CELL is used as an RSX accelerator is a consequence of less-than-required GPU performance.

I'm not sure you want people comparing CELL to a CPU. It does even worse in that regard: one PPU and some, at best, fussy, poorly designed coprocessors. CELL is not a good CPU. And no, GPUs have not put nails in the x86 coffin, because x86 has significant attributes that GPUs cannot touch performance- and flexibility-wise, along with the most supported toolchains in the world. Nothing is going to kill x86 at this point. The best something can hope for is that x86 doesn't steamroll right over it. In order to compete with x86 you need to be a sustained 2-10x better in delivered performance over a 10-year period. And given the x86 volumes involved, that is basically impossible.

I'm writing this after a sleepless night of SPU coding on the weekend =)

So instead of just wasting your time, you are trying to waste everyone else's as well.
 
GPU compute was still very much in an experimental stage in 2003-2004 (when the PS3 design must have been locked in). G80 was released in 2006 and was the first "general purpose" programmable GPU. It supported Nvidia's brand new proprietary CUDA (1.0) compute API. The debuggers were horrible back then, and the support libraries were pretty much nonexistent.

Cell was all those things and more.

I don't think that things would be better if they had chosen to drop the SPUs completely, and to rely completely on GPU compute. A single (SMT) PPC core with a G80 would have been a disaster. GPU rendering is easy because you don't want to get any data back. Asynchronous operation is not a problem, since ~half a frame latency to display isn't noticeable.

It isn't about GPU compute, and neither is CELL. The SPUs aren't doing GPU compute stuff, they are doing basic GPU graphics stuff (which shouldn't exactly be shocking, since that was what they were largely intended to do before Sony realized their Cell graphics solution wouldn't work). And the argument isn't a single PPU vs. CELL, it is 3-4 PPUs vs. Cell.
 
I started reading this thread at #276 and just jumped right in - that was probably my fault because clearly some themes have been re-tread, and those previous 275 posts have seemingly worked to get people pretty heated. Discussion on the merits/demerits of Cell and architecting in general are still of course very much welcome (even if worded tersely), but let's try to take it down a notch in terms of things getting personal.
 
No, no it cannot. You have to start doing optimizations just to be able to RUN.
That's not a bad thing at all. You will save a huge chunk of time later in the project if you write a proper technical design for your data access patterns before you start coding. That's true for any CPU, not just for Cell. Refactoring data access patterns later in the project usually leads to huge problems.

I spent around half a year optimizing our data access patterns in our last Xbox 360 game. Proper technical design for this would have saved a lot of time :cry:
 
I spent around half a year optimizing our data access patterns in our last Xbox 360 game. Proper technical design for this would have saved a lot of time :cry:

I know that feeling all too well...

Probably one of the best things the CELL ever did was forcing developers to think about this stuff going into production from the off, rather than write huge monolithic C++ object-based multiple-inheritance hierarchies of spaghetti code and then have to deal with the technical debt of trying to optimize it at the 11th hour to hit 30fps.

No fun at all...
 
A significant problem with the discussion at the moment is that for all the enthusiasm either side may bring to the table, there's little hard evidence. People are talking generalisations without reference to comparable test-cases, such as such-and-such code taking however long to develop and running however quickly on whatever architecture. A proper test would see a competition, with x86, Cell, and GPGPU (let's say G80 for the sake of this discussion) having to perform several diverse challenges. The time taken to write the solution and the measure of performance would give some actual reference points. The problem with that sort of challenge is that optimal code may not be the first approach on any platform, but a programmer experienced with their platform should be able to get a reasonable ballpark result for reference.
 
Can we take a look at the position that the good thing about CELL was that it forced developers to properly design programs for a highly parallel architecture?
I'm curious, especially since this thread has also included the additional claim that this led to the optimal program design strategy for other multicore solutions, such as Xenon or a hypothetical 4-PPE alternative to CELL.

In reverse order: if this strategy was optimal for all the competing architectures, why wouldn't they have eventually gotten around to the same point with or without CELL?
If saving the rest of its competitors years of time in reaching the same conclusion was the benefit, was STI's effort one of the biggest sucker bets in recent tech history?

In effect, are we praising a multibillion (edit: billion-plus, not sure after the back and forth on the fab what the final sum was) dollar investment made by STI's members whose return was that all the competition, and all the customers Sony and Toshiba never attained, realized that if they didn't mend their ways, at some point they'd run into the same problems as CELL when programming with many cores; so let's change software design and skip using CELL?

There are many design decisions made for CELL that are orthogonal to the proper design of a parallel system, such as the disparate ISAs, the rudimentary memory subsystem within each SPE, the SPE's stripped-down yet long pipeline, and the large context and commensurate switching penalties introduced by the LS.
Why did any of these things make people write better parallel programs, rather than make them say they had gotten over such wonky architectural decisions decades prior?
 
In effect, are we praising a multibillion (edit: billion-plus, not sure after the back and forth on the fab what the final sum was) dollar investment made by STI's members whose return was that all the competition, and all the customers Sony and Toshiba never attained, realized that if they didn't mend their ways, at some point they'd run into the same problems as CELL when programming with many cores; so let's change software design and skip using CELL?

I remember hearing the $400 million figure for Sony's pre-production investment in Cell. Not sure about Toshiba or IBM.
 
Yeah, the $400 million was the R&D/engineering budget; the remainder of the costs were related to the fab build-outs, though there again I think Sony footed the brunt. Nagasaki was straight-up theirs, I believe, whereas at East Fishkill they helped fund the build-out in exchange for capacity rights.
 
I remember hearing the $400 million figure for Sony's pre-production investment in Cell. Not sure about Toshiba or IBM.


I've seen that, and I've seen larger numbers bandied about that included large investments in infrastructure and gearing up for the expected flood of devices using it.
 
My tone is only aggressive when people are wearing very rose-colored glasses. Cell was an albatross around the neck of the PS3 and still is. The only thing the SPUs can do is make up some of the shortfall of the RSX, an RSX design that only exists because CELL was the wrong solution and Sony had to do an about-face because of it. If Sony had gone down a G80-based GPU design from the beginning, they would have been able to shift resources around and had a much better console.

You act as if Sony couldn't negotiate for a better GPU if they had planned for a Cell + Nvidia GPU from the beginning, which means Cell wasn't the problem. It was Sony's expectations for the then "future" and the overall design, a design which they had to change. Their plans for the GPU solution were the problem, not Cell itself. A Cell + a better GPU was perfectly viable if they had seen clearly how things would go in 2006.
 
A significant problem with the discussion at the moment is that for all the enthusiasm either side may bring to the table, there's little hard evidence. People are talking generalisations without reference to comparable test-cases, such as such-and-such code taking however long to develop and running however quickly on whatever architecture. A proper test would see a competition, with x86, Cell, and GPGPU (let's say G80 for the sake of this discussion) having to perform several diverse challenges. The time taken to write the solution and the measure of performance would give some actual reference points. The problem with that sort of challenge is that optimal code may not be the first approach on any platform, but a programmer experienced with their platform should be able to get a reasonable ballpark result for reference.

There is actually some data out there, most of it available from PRACE:

http://www.prace-project.eu/IMG/pdf/wp6_newlanugages_v1.pdf
http://www.prace-ri.eu/IMG/pdf/D6-5.pdf
http://www.prace-ri.eu/IMG/pdf/D6-6.pdf
http://www.prace-ri.eu/IMG/pdf/christadler.pdf
 
You act as if Sony couldn't negotiate for a better GPU if they had planned for a Cell + Nvidia GPU from the beginning, which means Cell wasn't the problem. It was Sony's expectations for the then "future" and the overall design, a design which they had to change. Their plans for the GPU solution were the problem, not Cell itself. A Cell + a better GPU was perfectly viable if they had seen clearly how things would go in 2006.

Cell by and large was their plan for a GPU, hence why Cell does OK at GPU tasks like geometry, anti-aliasing, etc. The only reason Sony had to scramble at the last moment is that their Cell-as-graphics theory didn't work.
 