The algorithm noted here is very well conceived, but one important factor was missed. Somebody touched on it earlier (that the CELL has 9 cores)...
Basically, when the Woodcrest, Pentium 4 HT or any of the other single-chip, multi-core processors perform this operation, they are fully committed to the task. The CELL PPE core, however, is only waiting for the SPEs to complete so that it can use the results. If this operation were kicked off early enough in (for example) a real-time game or physics simulation, the two hardware threads on the PPE core could happily continue to process other tasks until the results were available. Although memory bandwidth would be somewhat compromised by the frequent load/store DMAs triggered by the SPE units, there should still be plenty of bandwidth left for the main core threads to continue.
This basically means that, in the current implementation, aside from setting up the initial graph, the whole latency of the operation could be hidden by pipelining other tasks on the PPE core.
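To make the idea concrete, here is a rough sketch of the control flow I have in mind. It is not CELL code: a Python thread pool stands in for the SPEs, the calling thread stands in for the PPE, and the names spe_job and run_overlapped are made up for illustration. Python threads won't give real parallel speedup for CPU-bound work, so treat this purely as a picture of "start early, do other things, block only when the results are needed".

```python
from concurrent.futures import ThreadPoolExecutor


def spe_job(chunk):
    # Hypothetical stand-in for the work one SPE does on its slice of the graph.
    return sum(x * x for x in chunk)


def run_overlapped(chunks, other_tasks, workers=8):
    """Start the offloaded jobs, do unrelated work, then collect the results."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Kick off the offloaded work early; submit() returns immediately.
        futures = [pool.submit(spe_job, c) for c in chunks]
        # The "PPE" keeps doing other useful work instead of idling.
        side_results = [task() for task in other_tasks]
        # Block only at the point where the results are actually needed.
        offloaded = [f.result() for f in futures]
    return offloaded, side_results


chunks = [range(0, 4), range(4, 8)]
other_tasks = [lambda: "physics step", lambda: "audio mix"]
offloaded, side = run_overlapped(chunks, other_tasks)
```

If the side tasks take at least as long as the offloaded jobs, the final result() calls return without waiting, which is the sense in which the latency is "hidden".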
Another approach would be to redesign the algorithm to put the two PPE threads to work as well and, perhaps, gain some more bandwidth. This would give *true* performance figures for a fully optimized CELL implementation.