A remarkably interesting read and an excellent sum-up of this thread:
Some excerpts first:
*GPUs and the Cell/B.E. are close cousins from a hardware architecture point of view.
*They both rely on Single Instruction, Multiple Data (SIMD) parallelism, a.k.a. vector processing, and
*they both run at high clock speeds (>3GHz) and implement floating-point operations using RISC technology, achieving single-cycle execution even for complex operations like reciprocal or square-root estimates. These come in very handy for 3D transformations and distance calculations (used a lot in both 3D graphics and scientific modeling); see the small SIMD sketch after this list.
*They both manage to pack over 200 GFlops (billions of floating-point operations per second) into a single chip. They are excellent choices for applications like 3D molecular modeling, MM force-field computations, docking, scoring, flexible ligand overlay, and protein folding.
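Here's a minimal sketch of that SIMD idea, using x86 SSE intrinsics purely as a stand-in for the vector units being compared (the values and names are mine, not from the post): it computes reciprocal distances for 4 points at once with the hardware reciprocal-square-root estimate.

/* Minimal SIMD sketch (x86 SSE used as a stand-in for the vector units
 * discussed in the post): compute 1/distance for 4 points at once using
 * the hardware reciprocal square-root estimate. */
#include <stdio.h>
#include <xmmintrin.h>   /* SSE intrinsics */

int main(void)
{
    /* Coordinates of 4 points, stored "structure of arrays" style so each
     * 128-bit register holds the same component of all 4 points. */
    __m128 x = _mm_set_ps(4.0f, 3.0f, 2.0f, 1.0f);
    __m128 y = _mm_set_ps(4.0f, 0.0f, 2.0f, 0.0f);
    __m128 z = _mm_set_ps(2.0f, 4.0f, 1.0f, 0.0f);

    /* d^2 = x*x + y*y + z*z for all 4 points in 3 multiplies + 2 adds. */
    __m128 d2 = _mm_add_ps(_mm_add_ps(_mm_mul_ps(x, x), _mm_mul_ps(y, y)),
                           _mm_mul_ps(z, z));

    /* Single-instruction reciprocal square-root *estimate*, the kind of
     * operation the post says both architectures handle in one cycle. */
    __m128 rinv = _mm_rsqrt_ps(d2);

    float out[4];
    _mm_storeu_ps(out, rinv);
    for (int i = 0; i < 4; i++)
        printf("1/distance[%d] ~= %f\n", i, out[i]);
    return 0;
}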
BUT
*There are some subtle differences between the two, e.g. the Cell/B.E. supports double-precision calculations while GPUs do not (there is some work being done in that direction at Nvidia though), which makes the Cell/B.E. the only suitable choice of the two for quantum chemistry calculations.
*There is a difference in memory handling too: GPUs rely on caching just like CPUs, while the Cell/B.E. puts complete control into the hands of the programmer via direct DMA programming. This allows developers to keep "feeding the beast" with data using double-buffering techniques, so the computation never stalls on a cache miss (a double-buffering sketch appears after this list).
*Another difference is that GPUs use wider registers (256 bits), while the Cell/B.E. uses 128-bit registers but with a dual pipeline that allows two operations to execute in a single cycle. The two approaches may sound equivalent at a cursory glance, but again there is a subtle difference: 128 bits hold 4 floats, enough for a row of a 3D transformation or a point coordinate (typically extended to 4 components instead of 3 to handle perspective), so the Cell/B.E. can execute 2 different operations on them, while the GPU can only do the same operation on more data. If the purpose is to apply one operation to a lot of data, that comes down to the same thing, but a more complex series of computations on a single 3D matrix can be done twice as fast on the Cell/B.E. (see the transform sketch after this list).
*The 8 Synergistic Processor Units of the Cell/B.E. can transfer data between each other's memory over a 192GB/s bus, while the fastest GPU (GeForce 8800 Ultra) has a memory bandwidth of 103.7 GB/s and all others fall well below 100GB/s. The high-end GPUs have over 300GFlops of theoretical throughput, but due to memory-bus speed limitations and cache-miss latency, the practical throughput falls far short of that, while the Cell/B.E. has demonstrated benchmark results (e.g. for a real-time ray-tracing application) far superior to those of the G80 GPU, despite its theoretical throughput being lower than the GPU's.
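For what it's worth, here's roughly what the "feeding the beast" double-buffering pattern looks like. The dma_get()/dma_wait() helpers are hypothetical stand-ins for a real DMA interface (on the Cell/B.E. that would be the MFC calls) and are stubbed with memcpy so the sketch compiles and runs anywhere; it illustrates the pattern, it is not actual Cell code.

/* Sketch of the double-buffering pattern described above: while the core
 * computes on one local buffer, the DMA engine fills the other, so the
 * compute loop never stalls waiting for memory.
 *
 * dma_get()/dma_wait() are HYPOTHETICAL stand-ins for a real DMA interface;
 * they are stubbed with memcpy and a no-op so the sketch runs anywhere. */
#include <stdio.h>
#include <string.h>

#define CHUNK 256                 /* floats per DMA transfer */
#define N     (CHUNK * 8)         /* total elements in "main memory" */

static float main_mem[N];         /* pretend this is far-away main memory */

/* --- hypothetical DMA interface (stubs) -------------------------------- */
static void dma_get(float *local, const float *remote, int n, int tag)
{
    (void)tag;
    memcpy(local, remote, n * sizeof(float));  /* real HW would do this asynchronously */
}
static void dma_wait(int tag) { (void)tag; }   /* real HW: block until tag completes */
/* ------------------------------------------------------------------------ */

int main(void)
{
    float buf[2][CHUNK];          /* two local-store buffers */
    double sum = 0.0;

    for (int i = 0; i < N; i++) main_mem[i] = 1.0f;

    int cur = 0;
    dma_get(buf[cur], &main_mem[0], CHUNK, cur);        /* prefetch first chunk */

    for (int chunk = 0; chunk < N / CHUNK; chunk++) {
        int nxt = cur ^ 1;
        if (chunk + 1 < N / CHUNK)                      /* start fetching the next chunk */
            dma_get(buf[nxt], &main_mem[(chunk + 1) * CHUNK], CHUNK, nxt);

        dma_wait(cur);                                  /* current buffer is ready */
        for (int i = 0; i < CHUNK; i++)                 /* compute overlaps the next transfer */
            sum += buf[cur][i];

        cur = nxt;                                      /* swap buffers */
    }
    printf("sum = %.0f (expected %d)\n", sum, N);
    return 0;
}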
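And a tiny sketch of the "128 bits hold 4 floats" point, again with x86 SSE as a stand-in for a 128-bit vector register: one register carries a homogeneous point (x, y, z, w), and a 4x4 transform takes 4 multiplies and 3 adds. The matrix layout and names are mine, not from the post.

/* Sketch of "128 bits hold 4 floats": one SSE register carries a homogeneous
 * point (x, y, z, w) and a 4x4 transform is applied with 4 multiplies and
 * 3 adds. SSE stands in for the 128-bit registers discussed above. */
#include <stdio.h>
#include <xmmintrin.h>

int main(void)
{
    /* Column-major 4x4 matrix: here a translation by (10, 20, 30).
     * Note _mm_set_ps takes elements from high to low. */
    __m128 col0 = _mm_set_ps(0.0f,  0.0f,  0.0f,  1.0f);  /* (1,0,0,0) */
    __m128 col1 = _mm_set_ps(0.0f,  0.0f,  1.0f,  0.0f);  /* (0,1,0,0) */
    __m128 col2 = _mm_set_ps(0.0f,  1.0f,  0.0f,  0.0f);  /* (0,0,1,0) */
    __m128 col3 = _mm_set_ps(1.0f, 30.0f, 20.0f, 10.0f);  /* (10,20,30,1) */

    float p[4] = { 1.0f, 2.0f, 3.0f, 1.0f };  /* point, extended with w = 1 */

    /* result = x*col0 + y*col1 + z*col2 + w*col3 */
    __m128 r = _mm_mul_ps(_mm_set1_ps(p[0]), col0);
    r = _mm_add_ps(r, _mm_mul_ps(_mm_set1_ps(p[1]), col1));
    r = _mm_add_ps(r, _mm_mul_ps(_mm_set1_ps(p[2]), col2));
    r = _mm_add_ps(r, _mm_mul_ps(_mm_set1_ps(p[3]), col3));

    float out[4];
    _mm_storeu_ps(out, r);
    printf("transformed point = (%g, %g, %g, %g)\n", out[0], out[1], out[2], out[3]);
    return 0;
}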
Here's the rest of it; it even discusses cost-effectiveness:
http://www.simbiosys.ca/blog/2008/05/03/the-fast-and-the-furious-compare-cellbe-gpu-and-fpga/