A very interesting article about CELL programming

chris1515

It is about the Breadth-First Search (BFS) algorithm on a graph. They finished optimising the algorithm, and it is now 22 times faster on CELL than the PC implementation on a 3.4 GHz Xeon HT. I remember that someone posted something about this algorithm on B3D before.

http://www.ddj.com/dept/64bit/197801624

It is very interesting: they describe every optimisation step by step, and you can view some of the code. The authors think the final code is a very good example of optimized CELL code.

It is fast, but the program is much longer on CELL: 60 lines of code for the original implementation... and 1200 lines for CELL.

On a Pentium 4 HT running at 3.4 GHz, this algorithm is able to check 24-million edges per second. On the Cell, at the end of our optimization, we achieved a performance of 538-million edges per second. This is an impressive result, but came at the price of an explosion in code complexity. While the algorithm in Listing One fits in 60 lines of source code, our final algorithm on the Cell measures 1200 lines of code.
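
The "Listing One" baseline they refer to is essentially the textbook queue-based BFS. As an illustration of why it fits in so few lines, here is a minimal sketch of my own (not the article's actual listing), using a compressed adjacency-list graph:

```c
#include <assert.h>
#include <stdlib.h>

/* Adjacency list in compressed form: the neighbours of vertex v are
   adj[idx[v]] .. adj[idx[v+1]-1].  levels[v] receives the BFS distance
   from the root, or -1 if the vertex is unreachable. */
static void bfs(int nverts, const int *idx, const int *adj,
                int root, int *levels)
{
    int *queue = (int *)malloc(nverts * sizeof *queue);
    int head = 0, tail = 0;

    for (int v = 0; v < nverts; v++)
        levels[v] = -1;

    levels[root] = 0;
    queue[tail++] = root;

    while (head < tail) {
        int v = queue[head++];
        for (int e = idx[v]; e < idx[v + 1]; e++) {
            int w = adj[e];
            if (levels[w] < 0) {            /* not visited yet */
                levels[w] = levels[v] + 1;
                queue[tail++] = w;
            }
        }
    }
    free(queue);
}
```

Note that the inner loop does almost no arithmetic: it is one dependent load per edge plus a visited check, which is exactly why the benchmark is measured in edges per second rather than FLOPS.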

They compare CELL programming to DOS programming :D and it seems the SPUs with their local store are fast:

In a way, this feels like the old DOS days, when everything had to fit in the (in)famous 640 KB. On the other hand, an SPE's local storage (256 KB) is so much larger than most L1 data caches (a Xeon has just 32 KB). This is one of the reasons why a single SPE outperforms the highest-clocked Pentium Xeon core by a factor of three on many benchmarks.

And I like the new B3D design.

EDIT:
I found it was patsu:
http://forum.beyond3d.com/showpost.php?p=901916&postcount=22

Breadth-First Graph (BFS) Exploration
-- Random memory access pattern (when exploring random graphs, which is our case)
-- Fine granularity (tens, hundreds of bytes)
-- Little opportunities to increase the communication granularity
-- Almost no computation
-- Reading/updating bitmap and data transfers
---- Difficult problem for conventional processors
---- Can’t get any worse in terms of irregular access pattern and communication granularity
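
The "reading/updating bitmap" point refers to the visited set kept as one bit per vertex: each edge check is a single-bit read-modify-write at an effectively random address, which is the fine-grained random access pattern the slides describe. A generic sketch of that operation (my own illustration, not the article's code):

```c
#include <assert.h>
#include <stdint.h>

/* Visited set as one bit per vertex.  Testing or marking a random
   vertex touches one word somewhere in a large array -- tiny
   granularity, no spatial locality when the graph is random. */
static int test_and_set(uint32_t *bitmap, int v)
{
    uint32_t mask = 1u << (v & 31);
    int was_set = (bitmap[v >> 5] & mask) != 0;
    bitmap[v >> 5] |= mask;
    return was_set;
}
```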

I find this example very interesting. It is not easy to rethink the problem for CELL. And for a problem with random access and almost no computation, I find the performance very impressive.
 
Would it not be possible to perform a similar level of very low-level optimisation on the Xeon to produce significant performance increases as well? Perhaps not to the level of Cell, but at least far less than a 22x difference?

Or does this difference mostly come from the speed of the LS compared to normal PC RAM?
 
Maybe, but the local store latency versus cache latency seems to be only one part of the Cell advantage.

And it is not an article from STI employees, so the goal is not to downplay the Xeon. And most of the time the IBM benchmarks are fair.

This work has been supported by the Laboratory Directed Research and Development Program for the Data-Intensive Computing Initiative at Pacific Northwest National Laboratory, operated by Battelle for the U.S. Department of Energy under Contract DEAC0576RL01830.
 
It's all the factors together. If you look at the slides I linked to (second last page), you can see that for this task, Cell is on par with the 128-CPU Blue Gene.
 
And now it is faster.

In this article, we present strategies we've used to make a Breadth-First Search on graphs as fast as possible on the Cell, reaching a performance that's 22 times higher than Intel's Woodcrest, comparable to a 256-processor BlueGene/L supercomputer—and all this with just a single Cell processor!
 
Good find, Chris. Impressive figures for Cell. But why compare Cell to a Pentium 4 at this point in time? Is it possible that there was no original coding done for the Pentium 4 and they simply used archival information?


Also, is it just a coincidence that their 22-times increase in performance mirrors their 20-times increase in code (60:1200)?
 
It is probably an algorithm they use at work on x86; it seems very useful. And they compare the performance results with Blue Gene and other multicores in the first document found by Patsu. It is probably just a coincidence that the code increase mirrors the performance increase.

Despite its simplicity, this algorithm is important because it is a building block of many applications in computer graphics, artificial intelligence, astrophysics, national security, genomics, robotics, and the like.
 
As is quite common with these sorts of comparisons, the speed-up is from a naive (single-threaded) C implementation to a heavily optimized CELL implementation, using very different data structures.
Looking at the complete source-code from the Dobb's site, I can tell you which version of the code I'd rather maintain... ;)

Although I'll sooner or later get around to installing Linux on my PS3 and then I'll try to optimise an interesting bit of code (maybe parts of my wavelet compression library?) for the fun of it... :)
 
BFS/DFS are fundamental algorithms to computer science used so often they're practically ubiquitous.

The key here is they are comparing the classic implementation on a P4 (simple) to a highly efficient cell implementation. Yes, you could probably optimize the P4 implementation some, but that's defeating the purpose. The whole point of a general purpose processor like the P4 with good cache, branching, and OoOE is that you can do things simply, and still get reasonable performance.

If you have to start seriously optimizing BFSes on a P4, you've kinda thrown away some of the HW that you've paid for which was to increase your coding productivity.

Most C programmers on x86 are simply not going to spend uber amounts of time on custom BFS, many will use quick and dirty obvious hand-rolled implementations or use a library version.

I think what has been shown by this example is that a) CELL is a lot of work BUT b) since you have to do the work anyway, you can get better performance if done right.

It's also shown that all of the fears of branch problems and random I/O can be overcome. Remember all the "CELL will suck at physics, traversing trees, branching randomly on collisions, etc"?
 
[maven];946234 said:
As quite common with these sorts of comparisons, the speed-up is from a naive (single-threaded) C-implementation to a heavily optimized CELL implementation, using very different data-structures.
Looking at the complete source-code from the Dobb's site, I can tell you which version of the code I'd rather maintain... ;)

Although I'll sooner or later get around to installing Linux on my PS3 and then I'll try to optimise an interesting bit of code (maybe parts of my wavelet compression library?) for the fun of it... :)

I see the CELL version is very difficult to read. :oops:

You can probably optimize the x86 version, but I'm sure you will not be as fast as CELL with the Xeon.
 
Yes, I think that is the point of the example.
 
It is definitely a coincidence. Though a funny one. ;)


Has anyone found the source code? I can't seem to find it on the site just yet. I want to see what they spent the 1200 lines on before I comment.
 
[maven];946286 said:

Thanks.


It looks like nearly half of the code is just dealing with moving data to and from the Local Stores via DMA. And a good chunk of the other half is the manual loop unrolling and SIMDification to process 16 words per iteration.
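
That DMA half typically follows the classic double-buffering pattern: while the SPU processes one chunk in its local store, the transfer of the next chunk is already in flight. A rough sketch of the shape, with plain memcpy standing in for the mfc_get DMA calls so it runs anywhere (the function name, buffer layout, and chunk size are my own, not the article's):

```c
#include <assert.h>
#include <string.h>

#define CHUNK 256  /* elements per "DMA" transfer; illustrative size */

/* Sum a large array chunk by chunk through two small on-"local-store"
   buffers, alternating between them the way an SPE alternates DMA
   targets.  On a real SPE each memcpy would be an mfc_get issued for
   chunk i+1 before chunk i is processed, so transfer and compute
   overlap; this serial stand-in only shows the buffer rotation. */
static long sum_double_buffered(const int *data, int n)
{
    int buf[2][CHUNK];
    long total = 0;
    int cur = 0;

    if (n == 0)
        return 0;

    int first = n < CHUNK ? n : CHUNK;
    memcpy(buf[cur], data, first * sizeof(int));    /* "fetch" chunk 0 */

    for (int base = 0; base < n; base += CHUNK) {
        int len = n - base < CHUNK ? n - base : CHUNK;
        int next_base = base + CHUNK;
        if (next_base < n) {                        /* start "DMA" of next chunk */
            int next_len = n - next_base < CHUNK ? n - next_base : CHUNK;
            memcpy(buf[cur ^ 1], data + next_base, next_len * sizeof(int));
        }
        for (int i = 0; i < len; i++)               /* process current chunk */
            total += buf[cur][i];
        cur ^= 1;                                   /* swap buffers */
    }
    return total;
}
```

The bookkeeping around chunk boundaries, tag management, and buffer swapping is exactly the kind of code that inflates a 60-line algorithm toward 1200 lines.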

He does say the difference between the deeply optimized implementation of step 5 and the simple straightforward version was a speedup of 3.68 times. So all that work does seem to be worth it. The Cell has such deep pockets for pulling out performance with optimization.
 
An alternative, slightly more provocative formulation would be: CELL is such a complicated architecture that getting good performance out of it takes great effort / redesign.

And on that note, I would be quite interested in the speed difference between running the straight-forward C-version on the P4 and CELL.
 