Was Cell any good? *spawn

Cell, by and large, was their plan for a GPU, hence why Cell does OK at GPU tasks like geometry, anti-aliasing, etc. The only reason Sony had to scramble at the last moment is that their Cell-as-graphics theory didn't work.

That's actually not true - Cell wasn't going to be doing the graphics (so to speak), even though the original BE patent did cover that base. The original R&D into the PS3's graphics chip was centered around a processor named the RS, which has been the subject of a couple of discussions around here in the past. It was a spiritual successor to the GS in the PS2 and was supposedly in a similar vein of being a fill-rate monster. Die size/lagging node shrinks, and concern over tools maturity set Sony on a late-game course correction towards NVidia.
 
I know that feeling all too well...

Probably one of the best things CELL ever did was force developers to think about this stuff going into production from the off, rather than write huge monolithic C++ object-based multiple-inheritance hierarchies of spaghetti code and then have to deal with the technical debt of trying to optimize them at the 11th hour to hit 30fps.

No fun at all...
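
To make that contrast concrete, here is a minimal, hypothetical C++ sketch (types and names invented for illustration) of the two styles: a virtual-dispatch update over heap-scattered objects versus a flat, contiguous per-component pass of the kind SPU-style job systems push you towards.

Code:
#include <cstddef>
#include <vector>

// Style 1: deep hierarchy, one virtual call per object, objects scattered on the heap.
struct Entity {
    virtual ~Entity() = default;
    virtual void update(float dt) = 0;   // each concrete type overrides this
};

// Style 2: plain data packed contiguously, updated in one predictable linear pass.
struct Transform {
    float x, y, z;
    float vx, vy, vz;
};

void update_transforms(std::vector<Transform>& ts, float dt)
{
    // Linear walk over tightly packed data: cache (or DMA) friendly by construction.
    for (std::size_t i = 0; i < ts.size(); ++i) {
        ts[i].x += ts[i].vx * dt;
        ts[i].y += ts[i].vy * dt;
        ts[i].z += ts[i].vz * dt;
    }
}

The second form is also what maps naturally onto an SPE job: a contiguous block of Transforms can be staged into local store and processed without touching the rest of the object graph.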

To be fair, a lot of old-school game developers never went that way in the first place. Claiming that Cell had anything to do with people recognizing C++ bad practices is stretching the truth. Most old-school console devs allocated out of pools with data locality in mind, and largely still do, so I don't see Cell changing much there either.
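
For reference, "allocating out of pools with data locality" usually means something like this minimal, hypothetical fixed-capacity pool; real console allocators add free lists, alignment policies, and debug tracking on top.

Code:
#include <cassert>
#include <cstddef>
#include <new>

// One contiguous slab per object type, handed out in order: objects that are
// allocated together end up next to each other in memory.
template <typename T, std::size_t Capacity>
class Pool {
public:
    T* alloc() {
        assert(count_ < Capacity && "pool exhausted");
        return new (storage_ + count_++ * sizeof(T)) T();  // placement-new into the slab
    }
    void reset() { count_ = 0; }  // bulk "free"; assumes trivially destructible objects
private:
    alignas(T) unsigned char storage_[Capacity * sizeof(T)];
    std::size_t count_ = 0;
};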

Data access patterns have been critical to performance on any processor, single- or multi-core, since memory access stopped being free circa the early 1990s. Some of that was hidden by the poor caches of the PS1/N64 era, where code layout was the more obvious issue.
Once we got decent L2 caches and code layout stopped being much of a factor, data access patterns became everything to performance; this was true on Xbox long before Cell.

Cell did force developers who'd been ignoring data layout to rethink things, but I doubt it had any real impact on how the better studios developed.
 
There is actually some data out there, most of it available from PRACE:
http://www.prace-ri.eu/IMG/pdf/christadler.pdf
Thanks. We actually have some solid material to discuss now. Without reading everything thoroughly and just looking for a launch point of a discussion, I notice some pretty damning results in the PPT slides suggesting very low utilisation on Cell. eg. Slide 18, Cell achieves a 2% peak performance in a 1D FFT, whereas an 8 core Nehalem attains 30%. And in sparse vector multiplication, the Intel manages 3.31% peak performance whereas Cell struggles with only 0.04% peak. Those wanting to argue Cell's case should take a look and offer their explanations on the findings.
 
Cell did force developers who'd been ignoring data layout to rethink things, but I doubt it had any real impact on how the better studios developed.

Could not agree more. Any popular console forces (performance minded) programmers to restructure their code/algorithms to maximize performance for that given console. The notion that cell's architecture was the reason for better programming practices is nonsense. Sebbbi doesn't jump through hoops for the ps3 for the fun of it; he does it to maximize profits.

If the best selling point of an architecture is that "it forces programmers to code better!", then maybe it's time to start looking at a different architecture. :p
 
If the best selling point of an architecture is that "it forces programmers to code better!", then maybe it's time to start looking at a different architecture. :p
That would be the initial interpretation/reaction to that sentiment, but there is actually a legitimate alternative reading. If an architecture allows devs to code poorly and the result is considerable underutilisation of the hardware, then an architecture that forces good practice ensures better use of the hardware people have paid for. Of course, in the console space that happens anyway, so I don't believe Cell instigated better practice in general computing. And if a processor can let bad code run fairly efficiently without costing a lot more as a result, that's better for software development.
 
Thanks. We actually have some solid material to discuss now. Without reading everything thoroughly and just looking for a launch point of a discussion, I notice some pretty damning results in the PPT slides suggesting very low utilisation on Cell. eg. Slide 18, Cell achieves a 2% peak performance in a 1D FFT, whereas an 8 core Nehalem attains 30%. And in sparse vector multiplication, the Intel manages 3.31% peak performance whereas Cell struggles with only 0.04% peak. Those wanting to argue Cell's case should take a look and offer their explanations on the findings.

The lesson from those papers is not that Cell struggles compared to other architectures though, IMO, but that Cell struggles when gifted with quick-and-dirty code ports. Looking at FFTs specifically, we have these results from back in the day showing it comparing extremely favorably vs the OOE contemporaries of 2007:

http://www.cc.gatech.edu/~bader/papers/FFTC-HiPC2007.pdf

(pg 181 for quick visuals)

And another, more succinct effort on the same problem:

http://www.capsl.udel.edu/~jmanzano/Cell/docs/misc/GSPx_FFT_paper_legal_0115.pdf
 
The notion that cell's architecture was the reason for better programming practices is nonsense.

Practices no, but it did lead to a PS3-first multiplatform tools effort across the industry a couple of years in, which ended up paying performance dividends for said "multi" platforms that may not have taken place at all were it not for hands being forced. Now, that's not to argue this as a benefit to Cell, just some color on the evolution of dev tools in the late 2000's.
 
I still believe that the endgame for Cell was called by the mastermind behind it, i.e. IBM.

Aaron's words, "Xenos got praised for its compute density", are in my opinion really important because it's also IBM's POV.
I took a peek at the presentations Aaron linked and it's too much for me to wrap my head around... :LOL:
Still, it got me to a research paper from PRACE, especially about the Power A2.
I found nothing about performance, but that paper was interesting.

To me, with my limited understanding, it looks like the Power A2 is everything both Cell and Xenon could ever have wanted to be (I mean, as far as slabs of silicon can dream), and in essence IBM chose Xenon over Cell.
 
The lesson from those papers is not that Cell struggles compared to other architectures though, IMO, but that Cell struggles when gifted with quick-and-dirty code ports. Looking at FFTs specifically, we have these results from back in the day showing it comparing extremely favorably vs the OOE contemporaries of 2007:

http://www.cc.gatech.edu/~bader/papers/FFTC-HiPC2007.pdf
That's what I remembered, and I looked up this 2007 paper after seeing the 2009 PRACE slides. However, that was a Cell-optimised algorithm. Would the same approach benefit other architectures? And I'm also far from in a position to look at the 2009 paper and claim the results are from a 'lazy developer'. ;) It'd be far better for someone with Cell experience to weigh in on what they have seen from Cell, how that compares to the PRACE results, and whether the PRACE results are indicative of good Cell use or don't show the architecture in its best light.

To me, with my limited understanding, it looks like the Power A2 is everything both Cell and Xenon could ever have wanted to be (I mean, as far as slabs of silicon can dream), and in essence IBM chose Xenon over Cell.
Absolutely. Cell is not IBM's design. They wanted homogeneous multicore and didn't want SPEs. SPEs were all Toshiba's doing, so it's no wonder that Cell split into IBM's CPU and Toshiba's SPURSEngine.
 
The lesson from those papers is not that Cell struggles compared to other architectures though, IMO, but that Cell struggles when gifted with quick-and-dirty code ports. Looking at FFTs specifically, we have these results from back in the day showing it comparing extremely favorably vs the OOE contemporaries of 2007:

35 days spent porting and optimizing mod2f is hardly quick and dirty. I think you should take a look again. And also understand that a best-case, set-up-our-own-dataset FFT is a world apart from actually running a real dataset.

And I'll always put more trust in a comparative paper's/presentation's results vs one where the author has a vested interest in looking good.
 
35 days spent porting and optimizing mod2f is hardly quick and dirty.
Without info on how experienced each coder was with their platform, such figures aren't terribly meaningful. That is, if the Cell coder was trying his hand at Cell for the first time while the CUDA implementation was by a very seasoned GPGPU vet, then it wouldn't be a fair comparison. Whereas if all implementations were by the same person, or by equivalently experienced developers, then it would be a more accurate comparison. That is of course some of the debate with Cell (difficulty to code for), but a simple '35 days' number doesn't provide a great deal of information without some context.
 
35 days spent porting and optimizing mod2f is hardly quick and dirty. I think you should take a look again. And also understand that a best-case, set-up-our-own-dataset FFT is a world apart from actually running a real dataset.

And I'll always put more trust in a comparative paper's/presentation's results vs one where the author has a vested interest in looking good.

More trust in an apples-to-apples situation, sure, but if the question in this case specifically was: "can Cell perform well in FFT operations?," then the answer is, yes, it can. In fact it can do quite well. The institutions which set up clusters or purchased blades knew enough to know that for their purposes at least, algorithms would be adaptable and data sets understood.

Those papers may have been setting out to achieve something, but I don't think that goal undermines the results.
 
More trust in an apples-to-apples situation, sure, but if the question in this case specifically was: "can Cell perform well in FFT operations?," then the answer is, yes, it can. In fact it can do quite well. The institutions which set up clusters or purchased blades knew enough to know that for their purposes at least, algorithms would be adaptable and data sets understood.

Those papers may have been setting out to achieve something, but I don't think that goal undermines the results.

It is important to remember that PRACE are the people who build and use those machines. AFAICT, the person who did the Cell ports has significant experience with both Cell and the various programming models, as well as being able to do a Google search. If either of the papers presented would have worked, it is highly likely they would have used them.
 
It is important to remember that PRACE are the people who build and use those machines. AFAICT, the person who did the Cell ports has significant experience with both Cell and the various programming models, as well as being able to do a Google search. If either of the papers presented would have worked, it is highly likely they would have used them.
What sort of requirements could there be that would render the IBM implementations unusable?
 
I think an important difference between the two papers is that the 2009 one is by scientists, who are almost universally interested in double precision. Hence they are using the much later revision of Cell that was modified by IBM to get better DP performance, something the original Cell could only do at a small fraction of its single-precision rate, and hence very inefficiently. With only small changes they managed very respectable double-precision performance, but it was not nearly as impressive comparatively. The 2007 paper only focusses on single precision, which is what the original Cell was designed for (though it is quite good at integers as well), and that is a big reason for the difference.

Note by the way that comments just posted from the Sine Mora dev in his post-mortem thread directly contradict the idea that you can't run any generic (C) code directly on SPEs without serious refactoring. This is a common misunderstanding. Similarly, context switching isn't nearly as expensive as some people seem to think. We've had some good explanations on that topic here from, was it mike? And few people realise that SPEs can, and probably should, run the game's main loop and talk to and manage each other.
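
To illustrate the point, here is a minimal, hypothetical SPU-side sketch (the buffer size and the convention that the effective address arrives in argp are assumptions of the example). The usual pattern is simply: DMA a chunk into local store, run an ordinary C loop over it, DMA it back.

Code:
#include <spu_mfcio.h>
#include <stdint.h>

#define N 1024                                        /* 1024 floats = 4 KB chunk */
static float buf[N] __attribute__((aligned(128)));    /* local-store buffer, DMA-aligned */

int main(uint64_t speid, uint64_t argp, uint64_t envp)
{
    const unsigned int tag = 0;
    (void)speid; (void)envp;

    /* Pull the chunk from main memory (effective address passed in argp) into local store. */
    mfc_get(buf, argp, sizeof(buf), tag, 0, 0);
    mfc_write_tag_mask(1 << tag);
    mfc_read_tag_status_all();                        /* wait for the DMA to complete */

    /* From here on this is perfectly ordinary C; the kernel itself needs no refactoring. */
    for (int i = 0; i < N; ++i)
        buf[i] = buf[i] * 2.0f + 1.0f;

    /* Push the results back out to the same effective address. */
    mfc_put(buf, argp, sizeof(buf), tag, 0, 0);
    mfc_write_tag_mask(1 << tag);
    mfc_read_tag_status_all();
    return 0;
}

The refactoring cost people talk about is mostly in carving data into local-store-sized chunks, not in the C code that runs on the SPE itself.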

As for console developers from the last generation already knowing what's important before Cell, I'm pretty sure that is partly true, but I think all of them would agree there was still a lot to discover about how best to do it now. More importantly, in no generation of consoles before did we have such a big influx of game developers coming in from PC and GPU programming and being tasked with porting ready-made PC engines (it did happen in previous generations, but I'd argue the scale has changed immensely, as more PC developers were forced to move to consoles because of economic realities, where previously they could perhaps get away with outsourcing the console version of their engine).

The development of Folding@Home is also a fun one to look at in terms of what kind of workloads were optimised for which platform and how that evolved from Intel CPU to including Cell and then GPGPU.
 
It is important to remember that PRACE are the people who build and use those machines. AFAICT, the person who did the Cell ports has significant experience with both Cell and the various programming models, as well as being able to do a Google search. If either of the papers presented would have worked, it is highly likely they would have used them.

Fair points. Taking them into consideration, I delved into the provided PRACE papers, and actually the results are quite specific to the circumstances. The Cell results, when read through, offer a lot more color as to what was being tested, why the results were as they were, and what their own take on the architecture "outside the results" was.

Paper D6-5 is the more informative, I think, giving insights into the architecture as well as initial porting results and experiences on pages 41 through 49, and taking on another test case on pages 51 through 53. Honestly, the conclusions are generally in line with what I view the 'Cell-friendly' stance as being: it is a shift in thinking, the rewards are there, and the PPE rather than the SPEs is the main issue with the current implementation.

D6-6 pages 41 through 49 deal more with CellSs, the language/environment, than with Cell itself per se, but they highlight some of the challenges in optimization. Again though, it's the PPE that's called out rather than the SPEs.

Vectorize your code, keep in mind task as well as data-level parallelism, hide latency... the same topics that have been oft discussed. I don't think Cell emerged from those papers much differently than it seemed prior to their introduction.
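
On the "vectorize your code" part specifically, the SPU version of that advice is fairly mechanical for the friendly cases. A hypothetical sketch (the saxpy kernel, the alignment assumption, and the n-divisible-by-4 assumption are mine, not from the papers):

Code:
#include <spu_intrinsics.h>

/* Scalar reference: y[i] = a * x[i] + y[i] */
void saxpy_scalar(float* y, const float* x, float a, int n)
{
    for (int i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}

/* SPU version: 4 floats per fused multiply-add; data assumed 16-byte aligned,
   n assumed to be a multiple of 4 (n_vec = n / 4). */
void saxpy_spu(vector float* y, const vector float* x, float a, int n_vec)
{
    vector float va = spu_splats(a);
    for (int i = 0; i < n_vec; ++i)
        y[i] = spu_madd(va, x[i], y[i]);
}

The vectorization itself is usually the easy part; the task decomposition and latency hiding from that same list are where the real restructuring effort tends to go.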
 
If the paper recognises improvements could be made in the Cell code, why did they not implement those changes in the 35 days of development and optimisation? Is the development environment that bad?
 
As a note, some of those papers were done not that long after the Cell launch.
It's kind of a testament to how important legacy code and tools are; that's a massive x86 advantage vs. whatever else.
Some people just don't have to wait it out like game devs locked onto 2 systems for 7+ years.
By the time the Cell environment got better, it was most likely outdone by GPUs, etc.
 
If the paper recognises improvements could be made in the Cell code, why did they not implement those changes in the 35 days of development and optimisation? Is the development environment that bad?

For the FFT implementation, it was a combination of not being able to port to the SPEs in this instance and bad PPE performance. I did give the page numbers you know. :p

I'll also highlight for doc D6-6 that page 59 shows Cell as the top performer on mod2am, and page 60 affirms the relative strengths. A tweaked 2005/2006 chip beaten (barely) by an 8-core Nehalem and an NVidia C1060 is, when you consider wattage, die sizes, and transistor counts, an extremely strong showing.

Granted, the irony there is that since it's an IBM proprietary blade, the expense is surely not in its favor. But that's not pertinent to this particular angle of the discussion.
 
As a note, some of those papers were done not that long after the Cell launch.
It's kind of a testament to how important legacy code and tools are; that's a massive x86 advantage vs. whatever else.
Some people just don't have to wait it out like game devs locked onto 2 systems for 7+ years.
By the time the Cell environment got better, it was most likely outdone by GPUs, etc.

Tool-wise, Cell never really had a good set. One thing you notice pretty quickly is that anything you want to optimize kernel-wise almost always has a call/function within MKL. Cell's tools and libraries are severely lacking even today compared to GPGPU. A large part of Cell's issues stem from it being a relatively exotic architecture with very poor tools support. It is interesting to contrast Cell in that regard with the data that came out of the TACC workshop on Intel's MIC, which had large, full scientific workloads ported in days and with good scaling.
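
To make the contrast concrete, on x86 the kind of kernel you might spend weeks hand-tuning on Cell is often just a library call away. A hypothetical example using MKL's DFTI interface for an in-place 1D single-precision complex FFT (error checking omitted for brevity):

Code:
#include <mkl_dfti.h>

/* Transform n interleaved complex values (2*n floats) in place. */
void fft_inplace(float* interleaved_complex, MKL_LONG n)
{
    DFTI_DESCRIPTOR_HANDLE desc = 0;
    DftiCreateDescriptor(&desc, DFTI_SINGLE, DFTI_COMPLEX, 1, n);
    DftiCommitDescriptor(desc);
    DftiComputeForward(desc, interleaved_complex);
    DftiFreeDescriptor(&desc);
}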
 