It is about the algorithm Breadth-First Search (BFS) on a graph, they finish to optimise the algorithm and it is now 22 times faster on CELL than the PC implementation on a 3,4 Ghz Xeon HT. I remember that someone post something about this algorithm on B3D before.
http://www.ddj.com/dept/64bit/197801624
It is very interesting, they describe every optimisation step by step. You can view some code. And the authors think the final code is a very good exemple of optimized CELL code.
It is fast but the program is much longer on CELL: 60 lines of code for original implementation... and 1200 lines for CELL.
They compare the CELL programmation to DOS programmation and it seems the SPU with the local store are fast:
And I like the new B3D design.
EDIT:
I find it was patsu:
http://forum.beyond3d.com/showpost.php?p=901916&postcount=22
I find this exemple very interesting. It is not easy to rethink the problem for CELL. And with a problem with random access and almost no computation, I find the performance very impressive.
http://www.ddj.com/dept/64bit/197801624
It is very interesting, they describe every optimisation step by step. You can view some code. And the authors think the final code is a very good exemple of optimized CELL code.
It is fast but the program is much longer on CELL: 60 lines of code for original implementation... and 1200 lines for CELL.
On a Pentium 4 HT running at 3.4 GHz, this algorithm is able to check 24-million edges per second. On the Cell, at the end of our optimization, we achieved a performance of 538-million edges per second. This is an impressive result, but came at the price of an explosion in code complexity. While the algorithm in Listing One fits in 60 lines of source code, our final algorithm on the Cell measures 1200 lines of code.
They compare the CELL programmation to DOS programmation and it seems the SPU with the local store are fast:
In a way, this feels like the old DOS days, when everything had to fit in the (in)famous 640 KB. On the other hand, an SPE's local storage (256 KB) is so much larger than most L1 data caches (a Xeon has just 32 KB). This is one of the reasons why a single SPE outperforms the highest-clocked Pentium Xeon core by a factor of three on many benchmarks.
And I like the new B3D design.
EDIT:
I find it was patsu:
http://forum.beyond3d.com/showpost.php?p=901916&postcount=22
Breadth-First Graph (BFS) Exploration
-- Random memory access pattern (when exploring random graphs, which is our case)
-- Fine granularity (tens, hundreds of bytes)
-- Little opportunities to increase the communication granularity
-- Almost no computation
-- Reading/updating bitmap and data transfers
---- Difficult problem for conventional processors
---- Can’t get any worse in terms of irregular access pattern and communication granularity
I find this exemple very interesting. It is not easy to rethink the problem for CELL. And with a problem with random access and almost no computation, I find the performance very impressive.
Last edited by a moderator: