High Performance Computing Potential of Cell

aldo

Found an interesting article on the analysis of the Cell processor's power which also discusses the potential of a double precision version of the processor dubbed Cell+:
HPCwire - According to the authors, the current implementation of Cell is most often noted for its extremely high single-precision (32-bit) floating-point performance, but the majority of scientific applications require double precision (64-bit). Although Cell's peak double precision performance is still impressive relative to its commodity peers (eight SPEs at 3.2GHz = 14.6 Gflop/s), the group quantified how modest hardware changes, which they named Cell+, could improve double precision performance.

"Overall results demonstrate the tremendous potential of the Cell architecture for scientific computations in terms of both raw performance and power efficiency," the authors wrote. While their current analysis uses hand-optimized code on a set of small scientific kernels, the results are striking. On average, Cell is eight times faster and at least eight times more power efficient than current Opteron and Itanium processors, despite the fact that Cell's peak double precision performance is fourteen times slower than its peak single precision performance. If Cell were to include at least one fully utilizable pipelined double precision floating point unit, as proposed in their Cell+ implementation, these speedups would easily double.
Berkeley's research paper in pdf format: http://www.cs.berkeley.edu/~samw/projects/cell/CF06.pdf

-aldo
 
The simplicity of the SPEs and the deterministic behavior
of the explicitly controlled memory hierarchy make Cell
amenable to performance prediction using a simple analytic
model. Using this approach, one can easily explore multiple
variations of an algorithm without the effort of programming
each variation and running on either a fully cycle-accurate
simulator or hardware.
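As a minimal sketch of what such an analytic prediction can look like (my own illustration, not the actual model from the paper), each kernel can be treated as bounded either by computation on the SPEs or by DMA traffic over the memory interface, so predicted runtime is simply the larger of the two estimates. The peak figures below (Cell's 14.6 Gflop/s DP rate and 25.6 GB/s XDR bandwidth) are taken from the discussion; the function itself is illustrative:

```python
def predict_time(flops, bytes_moved, peak_flops, peak_bw):
    """Predicted kernel runtime (seconds) as the max of the
    compute-bound and memory-bound estimates."""
    compute_time = flops / peak_flops   # time if only compute mattered
    memory_time = bytes_moved / peak_bw # time if only DMA traffic mattered
    return max(compute_time, memory_time)

# Example: a kernel doing 1 Gflop over 4 GB of traffic on Cell's
# DP peak (14.6 Gflop/s) and XDR bandwidth (25.6 GB/s) comes out
# memory-bound.
t = predict_time(flops=1e9, bytes_moved=4e9,
                 peak_flops=14.6e9, peak_bw=25.6e9)
```

Models of this shape are why the explicitly controlled local stores help: the DMA term is deterministic rather than dependent on cache behavior.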

Good stuff. Thanks for the link.
 
Berkeley pdf - The Double Precision (DP) pipeline in Cell is obviously an afterthought as video games have limited need for DP arithmetic. Certainly a redesigned pipeline would rectify the performance limitations, but would do so at a cost of additional design complexity and power consumption. We offer a more modest alternative that can reuse most of the existing circuitry. Based on our experience designing the VIRAM vector processor-in-memory chip [12], we believe these “Cell+” design modifications are considerably less complex than a redesigned pipeline, consume very little additional surface area on the chip, but show significant DP performance for scientific kernels.

In order to explore the limitations of Cell’s DP issue bandwidth, we propose an alternate design with a longer forwarding network to eliminate all but one of the stall cycles (recall the factors that limit DP throughput as described in Section 3). In this hypothetical implementation, called Cell+, each SPE would still have the single DP datapath, but would be able to dispatch one DP SIMD instruction every other cycle instead of one every 7 cycles. The Cell+ design would not stall issuing other instructions and would achieve 3.5x the DP throughput of the Cell (51.2 Gflop/s) by fully utilizing the existing DP datapath; however, it would maintain the same SP throughput, frequency, bandwidth, and power as the Cell.
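The quoted figures check out arithmetically. Using the numbers stated in the thread (8 SPEs at 3.2 GHz, a 2-way SIMD DP pipeline, and fused multiply-add counting as 2 flops per element, so 4 flops per DP instruction), the peak rates fall out of one line of arithmetic:

```python
SPES = 8
CLOCK_HZ = 3.2e9
FLOPS_PER_DP_INSTR = 2 * 2  # 2-way SIMD x fused multiply-add

def dp_peak_gflops(cycles_per_issue):
    """Peak DP rate when one DP SIMD instruction issues every
    `cycles_per_issue` cycles on every SPE."""
    return SPES * CLOCK_HZ * FLOPS_PER_DP_INSTR / cycles_per_issue / 1e9

cell = dp_peak_gflops(7)      # original Cell: one DP issue every 7 cycles
cellplus = dp_peak_gflops(2)  # Cell+: one DP issue every other cycle

print(round(cell, 1), round(cellplus, 1), round(cellplus / cell, 1))
# → 14.6 51.2 3.5
```

So the 14.6 Gflop/s, 51.2 Gflop/s, and 3.5x figures are all consequences of the same datapath, just issued more often.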
Sounds like a win-win situation for Sony.

-aldo
 
mckmas8808 said:
Wow. Looks like IBM and Sony need to get working on that Cell+ processor.


They are. STI have an entire roadmap for CELL processors beyond the current-gen CELL
(a refined current-gen CELL is going into PS3).

In addition to process shrinks to 65nm, 45nm and smaller for the current CELL,
there will be tightly connected dual-CELL processors on two dies (it seems).

Also, Mini and Micro Cell processors (PSP2 CPU?).

Then further out, there will be single-die Cell processors (or CELL+ or CELL2) with many more SPEs, and/or dual-core CELLs on a single die (PS4 CPU?).


[Attached image: 1027sce_cell_roadmap.jpg (SCE Cell processor roadmap)]
 
More specifically, as stated in IBM's own CELL architecture forums IIRC, IBM is already working on replacing the DP FPU in each SPE with a much faster pipelined version although it obviously won't make it or be needed in PLAYSTATION 3.
 
Panajev2001a said:
More specifically, as stated in IBM's own CELL architecture forums IIRC, IBM is already working on replacing the DP FPU in each SPE with a much faster pipelined version although it obviously won't make it or be needed in PLAYSTATION 3.

Are they (IBM) doing what Berkeley suggested?
 
IBM is already working on replacing the DP FPU in each SPE with a much faster pipelined version although it obviously won't make it or be needed in PLAYSTATION 3.
Yeah, they were talking about fully pipelining double precision ops (or at least "fully" to the same extent that everything else is "fully" pipelined). If they do that, that suggests they could probably bring theoretical DP FLOPS throughput to within half of the SP FLOPS.

Though it supposedly wouldn't be necessary for PS3, I can't say it wouldn't be handy here and there. Once people discover a little thing called "frequency", I can see where DP certainly wouldn't hurt.
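The "within half of SP" claim is easy to sanity-check (my own arithmetic, not from the article): per SPE at 3.2 GHz with fused multiply-add, SP is 4-wide SIMD, while a fully pipelined DP unit issuing every cycle would be 2-wide SIMD on the same registers.

```python
CLOCK_HZ = 3.2e9

# Per-SPE peak rates, counting a fused multiply-add as 2 flops per element.
sp_gflops = CLOCK_HZ * 4 * 2 / 1e9            # 4-wide SIMD SP -> 25.6
dp_pipelined_gflops = CLOCK_HZ * 2 * 2 / 1e9  # 2-wide SIMD DP -> 12.8

# Fully pipelined 2-way DP lands at exactly half of SP per SPE.
ratio = dp_pipelined_gflops / sp_gflops
```

With the 128-bit registers fixed, 2-wide is the natural DP width, so half of SP is the ceiling a pipelining-only change could reach.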
 
mckmas8808 said:
Are they (IBM) doing what Berkeley suggested?
I bet IBM is/was evaluating that and other variations.
It really depends where they wanna go though; for pure scientific uses they could pipeline/advance the DP part and cut down the SP at the same time.
For a high volume chip the Cell+ would be interesting; if the additional cost is low enough it could even go into PS3 (while making sure developers are coding for the original Cell). It would be a very versatile, high volume chip.

ShootMyMonkey said:
Yeah, they were talking about fully pipelining double precision ops (or at least "fully" to the same extent that everything else is "fully" pipelined). If they do that, that suggests they could probably bring theoretical DP FLOPS throughput to within half of the SP FLOPS.
Ain't the current DP pipe only scalar? Even if you pipeline it, you still need at least 2 cycles for 2x64 vectors; SP would be 1 cycle for 4x32 vectors - or 4 times faster in FLOPS.
 
That is what they seem to say in this doc, but IIRC, one of the old postings (referring to the postings Pana spoke of) was making it sound like the DP pipe wouldn't just be pipelined, but also 2-way SIMD, so you could get an instruction out every cycle.
 
ShootMyMonkey said:
That is what they seem to say in this doc, but IIRC, one of the old postings (referring to the postings Pana spoke of) was making it sound like the DP pipe wouldn't just be pipelined, but also 2-way SIMD, so you could get an instruction out every cycle.
Ok, I was aiming at Cell+ - just pipelining what we got now (only adding a little die space, as the doc claimed), regardless of what IBM is actually doing. Adding SIMD on top would certainly affect the die size a good deal too, too much to do as long as it ain't required (i.e. not viable for the mainstream part).

Wish I could find that pdf where the implementation and design decisions behind the current SPE are explained in detail.
 
Hmm, well I didn't see this, so I made a new topic about it. Anyway, I don't know much about what's said in that doc (I only know the basics). Exactly how much more powerful is Cell vs Xenon with this new information?
 
MadReaper said:
Hmm, well I didn't see this, so I made a new topic about it. Anyway, I don't know much about what's said in that doc (I only know the basics). Exactly how much more powerful is Cell vs Xenon with this new information?

This is not new information saying Cell in the PS3 is more powerful than expected.

The article's focus is on Cell in scientific applications. To achieve accuracy in scientific applications, you need double precision (DP) floating point numbers. Cell in its current form is 10-15x faster at crunching single precision numbers than DP.

This is not a design flaw. DP performance is not necessary for a gaming console, so the focus was only on SP performance. But the article suggests that with a little tweaking IBM could easily increase DP performance to several times what it is now. We won't see these changes until the next version of the chip, though, so it's irrelevant to your console war.
 
ShootMyMonkey said:
Though it supposedly wouldn't be necessary for PS3, I can't say it wouldn't be handy here and there.
IMO it would primarily just make lazy people lazier - I like to criticise artists about this (give them a finger and they'll tear off the whole arm), but it's human nature really.
There are programmers out there that given the chance will happily blow through DP in situations where even fixed point should suffice.
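The fixed-point alternative mentioned above can be sketched in a few lines (an illustrative sketch, not anything from the thread or the paper): a Q16.16 format stores values as integers scaled by 2**16, which is plenty of precision for many game-style quantities that people reach for double precision to handle.

```python
FRAC_BITS = 16
ONE = 1 << FRAC_BITS  # scale factor: 65536

def to_fixed(x):
    """Convert a float to Q16.16 fixed point."""
    return int(round(x * ONE))

def fixed_mul(a, b):
    """Multiply two Q16.16 numbers; the double-scaled product
    is shifted back down to Q16.16."""
    return (a * b) >> FRAC_BITS

def to_float(a):
    """Convert Q16.16 back to a float for display."""
    return a / ONE

product = fixed_mul(to_fixed(1.5), to_fixed(2.25))
print(to_float(product))  # → 3.375
```

Everything here is integer arithmetic, so it runs on any pipeline at all - no DP unit required.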

Npl said:
Ain't the current DP pipe only scalar?
No, it's 2-way SIMD, one op every 7 cycles (1.8 GFlop/s per SPE).
Hence all the talk of how boosting performance to 50% of SP could be relatively cheap (without any changes to ISA compatibility).
 
Fafalada said:
...(without any changes to ISA compatibility).
I think that is very true; they have no intention of expanding the registers to 256 bit, so 2-way DP is probably here to stay. From following the Cell developer discussion boards at IBM a while back, I'm pretty sure IBM has some sort of a Cell+ in the works. I guess it will be on 65nm, so we will have to wait a while. It will be interesting to see if they stick with XDR memory.
 
While their current analysis uses hand-optimized code on a set of small scientific kernels,

Didn't anyone else notice this?

Of course you're gonna get impressive speed-ups if you run code like this; this is true of just about every single chip out there ATM.

When they can run a compiler and have it auto-vectorize the code and get speed-ups like the article talks about, then you'll see Cell start to take off in HPC and PC applications. Until then this is gonna be a niche of a niche computing platform that very few will be able to make practical use of.
 
mesyn191 said:
Didn't anyone else notice this?

Of course you're gonna get impressive speed-ups if you run code like this; this is true of just about every single chip out there ATM.

When they can run a compiler and have it auto-vectorize the code and get speed-ups like the article talks about, then you'll see Cell start to take off in HPC and PC applications. Until then this is gonna be a niche of a niche computing platform that very few will be able to make practical use of.

When you say very few, how many companies do you see using CELL? One company, three, five?
 
mesyn191 said:
Didn't anyone else notice this?

Of course you're gonna get impressive speed-ups if you run code like this; this is true of just about every single chip out there ATM.

Someone can correct me if I'm wrong, but I was always under the impression that hand optimisation is par for the course in scientific computing. They really do try to squeeze every last drop out of hardware - and that is true for the chips they're comparing Cell against also, so the comparison is valid.
 
Titanio said:
Someone can correct me if I'm wrong, but I was always under the impression that hand optimisation is par for the course in scientific computing. They really do try to squeeze every last drop out of hardware - and that is true for the chips they're comparing Cell against also, so the comparison is valid.

You are absolutely right. HPC is not like general purpose computing. HPC uses a small number of extremely highly optimised and thoroughly tested libraries to do things like matrix operations, Fourier transformations etc. and these are used over and over again. They are all hand optimised.
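A tiny illustration of that point: in HPC the heavy lifting goes through hand-tuned libraries rather than naive loops. Here numpy's `dot` dispatches to an optimized BLAS matrix-multiply kernel, while the triple loop below is the naive code a straightforward compile would produce; both compute the same thing, and the library path is typically orders of magnitude faster at realistic sizes.

```python
import numpy as np

def naive_matmul(A, B):
    """Textbook triple-loop matrix multiply, element by element."""
    n, k = A.shape
    _, m = B.shape
    C = np.zeros((n, m))
    for i in range(n):
        for j in range(m):
            for p in range(k):
                C[i, j] += A[i, p] * B[p, j]
    return C

A = np.random.rand(32, 32)
B = np.random.rand(32, 32)

# Same numerical result; A.dot(B) goes through the hand-optimized
# BLAS kernel the post above is talking about.
assert np.allclose(naive_matmul(A, B), A.dot(B))
```

The Berkeley comparison is fair in exactly this sense: every platform in it gets the hand-tuned treatment, not just Cell.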
 