To me, CELL is a general-purpose CPU with a bank of ultra-powerful DSPs, lots of localized memory, massive internal bandwidth, and massive external bandwidth.
I will describe each one in turn:
1) ultra-powerful DSPs - an extremely high clock rate for DSPs, as no DSP from any other company runs anywhere near 3.2 GHz. These specialized processors were designed to run at a high clock rate, and that was a key architectural decision in designing CELL.
2) localized memory - 256 KB of SRAM per processor, much faster than eDRAM but taking up four times the number of transistors. The cost is worth it, as you get the speed of cache in a large enough amount to run a fair-sized algorithm or algorithms. Each processor can operate at full speed within its localized memory while having ZERO impact on the rest of the system. This eliminates huge numbers of wait states, where multiple processors are contending for the same cache. Of course there is still data access contention in the system, as all these processors have to be coordinated in getting fed. This makes CELL more difficult to program than a traditional processor.
3) massive internal bandwidth - very important, especially for algorithms that need to stream data from one processor to the next. With such a large number of processors, seven of them, you have to keep them fed even when they are not doing streaming algorithmic work.
4) massive external bandwidth - 25 GB/sec to main memory, and 35 GB/sec to a GPU. Data can come in from main memory, be worked on by a bank of processors, and then, instead of being written back out to main memory, be passed directly to the GPU. This kind of throughput with so many processors, especially for any data that can be worked on in streams, would provide huge benefits. On streaming work like this, CELL could easily be 10 to 30 times more powerful than a single traditional processor.
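To put the figures in the list above side by side, here is a rough back-of-envelope sketch. It only uses numbers already quoted (3.2 GHz, seven SPEs, 256 KB each, 25 GB/sec and 35 GB/sec). The function names are made up for illustration, GB is assumed to mean 10^9 bytes, and the even split of bandwidth between SPEs is a simplifying assumption, not a description of how the hardware arbitrates.

```python
# Back-of-envelope numbers taken from the list above; nothing here is measured.
CLOCK_HZ = 3.2e9                 # SPE clock rate
NUM_SPES = 7                     # SPEs counted in this post
LOCAL_STORE_BYTES = 256 * 1024   # localized memory per SPE
MAIN_MEMORY_BW = 25e9            # bytes/sec to main memory (assumes GB = 10**9)
GPU_BW = 35e9                    # bytes/sec to the GPU (assumes GB = 10**9)


def bytes_per_cycle(bandwidth, clock_hz=CLOCK_HZ):
    """Bytes a link can move during one processor clock cycle."""
    return bandwidth / clock_hz


def local_store_fill_seconds(bandwidth, num_spes=NUM_SPES):
    """Time to refill every local store once, assuming the link is shared evenly."""
    return num_spes * LOCAL_STORE_BYTES / bandwidth


if __name__ == "__main__":
    total = bytes_per_cycle(MAIN_MEMORY_BW)
    print(f"main memory: {total:.2f} bytes/cycle in total")
    print(f"main memory: {total / NUM_SPES:.2f} bytes/cycle per SPE")
    fill_us = local_store_fill_seconds(MAIN_MEMORY_BW) * 1e6
    print(f"refill all local stores: {fill_us:.1f} microseconds")
```

Under these assumptions main memory delivers only about one byte per clock cycle per SPE, which is why it pays to do as much work as possible on data once it is sitting in the localized memory.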
I know there are pluses and minuses to this design, but in the end, many processors with localized memory can work in one of two ways: either each processor operates on its own task with ZERO impact on the processing ability of any other processor in the system (except for data contention), or each processor works on a small piece of a single problem before passing it on to the next processor. There is flexibility in the design, with the localized memory being big enough to support individual jobs, and with the massive internal and external bandwidth keeping these units fed.
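The two ways of using the processors can be sketched in a few lines. This is only a toy simulation in plain Python, not SPE code: ordinary functions stand in for the processors, everything runs one step after another rather than in parallel, and the names (`chunks`, `run_independent`, `run_pipeline`, `scale`, `offset`), the 4-byte item size, and the choice to budget half of the 256 KB for data are all assumptions made for illustration.

```python
# Illustrative simulation only: plain Python functions stand in for SPEs.
LOCAL_STORE_BYTES = 256 * 1024   # per-processor localized memory, from the post
ITEM_BYTES = 4                   # assumption: 4-byte data items


def chunks(data, budget_bytes=LOCAL_STORE_BYTES // 2):
    """Split data into pieces small enough to sit in half a local store.

    Half is an assumed budget, leaving the rest for code and a second buffer.
    """
    step = budget_bytes // ITEM_BYTES
    for start in range(0, len(data), step):
        yield data[start:start + step]


def run_independent(tasks, data):
    """Model 1: each 'processor' runs its own task on its own private copy."""
    return [task(list(data)) for task in tasks]


def run_pipeline(stages, data):
    """Model 2: each chunk passes through every stage in order, like a stream."""
    out = []
    for chunk in chunks(data):
        for stage in stages:
            chunk = stage(chunk)   # hand the result to the next 'processor'
        out.extend(chunk)
    return out


def scale(xs):
    """Example stage: double every item."""
    return [x * 2 for x in xs]


def offset(xs):
    """Example stage: add one to every item."""
    return [x + 1 for x in xs]


if __name__ == "__main__":
    data = list(range(100_000))
    results = run_independent([sum, max], data)
    streamed = run_pipeline([scale, offset], data)
    print(results, streamed[:3])
```

On real hardware the stages of the second model would all be busy at once, each on a different chunk, which is where the internal bandwidth comes in.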
While I describe the SPEs as DSPs, they are much more than DSPs, and are more akin to something between a traditional DSP and a general-purpose CPU. I only use the term DSP because DSPs are designed to be fast at math operations, just like the SPEs in CELL.
CELL is very forward-looking. One could even argue that CELL in its present incarnation exists to teach programmers about data contention and how to break up algorithms across many processors. I think PS4 will contain a single traditional processor again, but even more SPEs, probably 32 to 40 of them. Unless a breakthrough comes in preventing current leakage, I don't see the clock rate going up by much, with 3.6 GHz probably being tops.