Edge said:
Jumping through hoops? Can you define that?
Running slowly? The SPEs are anything but slow, so I really would like to hear your reasoning behind that.
My definition of jumping through hoops is having to structure or design code with properties that have no algorithmic reason to exist, purely for the sake of acceptable performance. All processors have gotchas like that, though the SPEs have more of them, being geared towards a specific kind of work.
Instruction ordering is pretty important in SPE code; otherwise its limited ability to dual-issue instructions is wasted, and with it potentially a significant fraction of its peak execution rate. The SPE issues in order to two pipelines, one for arithmetic and one for loads, stores, shuffles, and branches, so a pair of instructions can only issue together when the code happens to alternate between the two kinds.
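As a rough illustration (a minimal sketch, assuming the Cell SDK's spu-gcc toolchain and the spu_intrinsics.h vector types; the function and array names are made up), unrolling a simple loop hands the scheduler arithmetic for the even pipeline and loads/stores for the odd one, so pairs can actually dual-issue:

[code]
#include <spu_intrinsics.h>

/* y[i] = a * x[i] + y[i] over vectors of four floats.
 * The multiply-adds go down the even pipeline, the loads and stores
 * down the odd one; unrolling by two gives the in-order issue logic
 * adjacent instructions of both kinds to pair up. */
void saxpy_ls(vec_float4 *y, const vec_float4 *x, float a, int n)
{
    vec_float4 va = spu_splats(a);
    for (int i = 0; i + 1 < n; i += 2) {
        vec_float4 x0 = x[i],     y0 = y[i];
        vec_float4 x1 = x[i + 1], y1 = y[i + 1];
        y[i]     = spu_madd(va, x0, y0);
        y[i + 1] = spu_madd(va, x1, y1);
    }
}
[/code]

None of that unrolling exists for algorithmic reasons; it's there purely to keep both pipelines fed.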
The local store is of fixed size (256 KB) and it exists as its own address space, meaning tasks must stream well and keep their working set small enough to fit. This is something the programmer must always keep in mind, even though algorithmically it shouldn't matter. All processors have such wrinkles; they are just more pronounced, and exposed directly in the software model, on the SPE.
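To make that concrete, here's a minimal sketch assuming the SDK's spu_mfcio.h MFC intrinsics (CHUNK and process_chunk are made-up names): a job over a large array has to be carved into pieces that fit the local store and DMA'd in and out explicitly.

[code]
#include <stdint.h>
#include <spu_mfcio.h>

#define CHUNK 16384   /* bytes per piece; must leave room in LS for code, stack, etc. */

static char buf[CHUNK] __attribute__((aligned(128)));

extern void process_chunk(char *p, uint32_t n);   /* placeholder compute kernel */

/* Walk a large buffer in main memory (effective address ea) in
 * LS-sized pieces, pulling each piece in and pushing results back. */
void process(uint64_t ea, uint32_t total_bytes)
{
    for (uint32_t off = 0; off < total_bytes; off += CHUNK) {
        mfc_get(buf, ea + off, CHUNK, 0, 0, 0);   /* DMA main memory -> LS */
        mfc_write_tag_mask(1 << 0);
        mfc_read_tag_status_all();                /* block until it lands */

        process_chunk(buf, CHUNK);

        mfc_put(buf, ea + off, CHUNK, 0, 0, 0);   /* DMA LS -> main memory */
        mfc_write_tag_mask(1 << 0);
        mfc_read_tag_status_all();
    }
}
[/code]

The algorithm doesn't care about the 16 KB granularity at all; it exists only because of the architecture, which is the point.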
Coherency between the local store and main memory must be explicitly maintained, and future Cell variants with a larger local store will need at least a recompile before code can take advantage of it.
Memory accesses work best when they pull in large batches of data; that's the sweet spot for the EIB and the DMA engines. There are tasks that don't naturally do that, which means either a lot of performance goes to waste or the programmer has to get genuinely creative in structuring code, for the sole reason of catering to the architecture.
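The usual way to feed the DMA engines the big batches they like while keeping the SPE busy is double buffering. A rough sketch, using the same spu_mfcio.h interface as above (buffer size and the compute kernel are again made up, not tuned):

[code]
#include <stdint.h>
#include <spu_mfcio.h>

#define CHUNK 16384
static char buf[2][CHUNK] __attribute__((aligned(128)));

extern void compute(char *p, uint32_t n);   /* placeholder kernel */

/* Fetch chunk i+1 while computing on chunk i, so the large transfers
 * the EIB likes are overlapped with work instead of stalling the SPE.
 * Assumes total_bytes is a non-zero multiple of CHUNK. */
void stream(uint64_t ea, uint32_t total_bytes)
{
    uint32_t cur = 0;
    mfc_get(buf[cur], ea, CHUNK, cur, 0, 0);            /* prime buffer 0 */

    for (uint32_t off = CHUNK; off < total_bytes; off += CHUNK) {
        uint32_t nxt = cur ^ 1;
        mfc_get(buf[nxt], ea + off, CHUNK, nxt, 0, 0);  /* start next fetch */
        mfc_write_tag_mask(1 << cur);
        mfc_read_tag_status_all();                      /* wait on current */
        compute(buf[cur], CHUNK);
        cur = nxt;
    }
    mfc_write_tag_mask(1 << cur);
    mfc_read_tag_status_all();
    compute(buf[cur], CHUNK);                           /* final chunk */
}
[/code]

Again, nothing about the problem demands this structure; the architecture does.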
SPEs don't have complex branch prediction hardware, though branch hints can be inserted into the code. Dynamic prediction hardware is still, on average, better than static hints, and the SPE pipeline is a long one, so each mispredicted branch is expensive.
Most processors with a pipeline that long would have a robust predictor in place, but in the case of the SPEs the designers figured the more specialized target applications wouldn't justify the hardware cost. Nobody is going to want to run branchy code on an SPE; it can run it, but that workload isn't what the design targets.
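For what it's worth, the hints mentioned above are usually left to the compiler; with spu-gcc, one common way to feed it static prediction information is __builtin_expect (a minimal sketch, the error-check itself is a made-up example):

[code]
/* Telling the compiler this branch is almost never taken lets it lay out
 * the code and place its branch hints so the common path flows straight
 * through the long pipeline instead of risking a flush on every call. */
static inline int checked_div(int a, int b, int *out)
{
    if (__builtin_expect(b == 0, 0)) {   /* hint: rarely true */
        return -1;
    }
    *out = a / b;
    return 0;
}
[/code]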
There are tasks that are inherently branchy and difficult to predict at compile time, and the SPE will always run far below its peak in such situations. Since the SPE is not targeted at such workloads, this really isn't much of a problem.
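When the branchiness is small and data-dependent, the standard dodge is to compute both sides and select rather than branch at all; here's a minimal sketch using the spu_intrinsics.h compare/select intrinsics (the function name is made up). It only helps when both sides are cheap, which is part of why genuinely branchy code still hurts:

[code]
#include <spu_intrinsics.h>

/* Per-element max of two vectors of four floats, with no branch:
 * the compare produces a mask, the select picks the winning element. */
vec_float4 vmax(vec_float4 a, vec_float4 b)
{
    vec_uint4 a_gt_b = spu_cmpgt(a, b);   /* all-ones where a > b */
    return spu_sel(b, a, a_gt_b);         /* take a where mask set, else b */
}
[/code]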
That still covers a lot of the compute-intensive tasks out there, which run very well on an SPE, but there are tasks that exceed the bounds of the SPE's comfort zone, and their performance suffers. Future variants will do better.
Is the SPE general purpose in the sense that it runs (almost) any kind of code well? I'd say no; being general purpose would have sacrificed a lot of peak capability, and the designers had a specific target in mind.
I'd say an SPE is pretty much a universal computing machine because it can process anything another machine can, but that doesn't mean it does everything well.