Gubbi said:
I fail to see how CELL is a solution to this. The SPEs have massive contexts, and next to nothing in the way of virtualization (at least not in hardware; again, software must provide a solution).
Batch processing. No cache, but a bit of full-speed local memory, like a very large register file. So, no cache stalls. And the instruction set is (almost surely) optimized for running a small program (SIMD) on a small dataset and generating a new set as the result.
So, it is like a one-shot deal. You load it up, execute, and collect the results. And it has just enough general purpose and flow control logic to do all that by itself.
The branches and loops are (almost surely) quite specific: they are there for batch management and for implementing more complex (mathematical) functions, not for implementing any arbitrary operation that isn't directly related to the optimal functioning of the device itself.
Having dynamic branches is again mostly a way to save transistors, by not having to implement every possible (and redundant) function directly in silicon.
Gubbi said:
CELL is made to run with a fixed set of threads at any one time and you're f*cked if you need more. This works well for a game console where you're only running one big app. But it's useless in a multiprocessing OS (like a PC running Windows/Linux, whatever). It's similar (or actually even worse) in a server environment.
Again, batches. Instead of having multiple arbitrary processes run concurrently, you split the computations into batches and throw them at the first unit that is available. Collect and return the results when done. This is much better than the normal multi-threaded model, in that it eliminates the two largest problems with that model: deadlocks on data and synchronization stalls.
The reason each unit can have concurrent threads is mostly to hide large latencies. As long as they can store the state of enough threads to minimize that latency (taking context switching into account), no more is needed, because those threads will only run for a limited time, finish, and can be discarded afterwards.
It's a more general solution than the normal multi-threading, multi-processor model used by most popular operating systems. And there are many things that could benefit from that. Not servers, but most applications that need some serious horsepower.
You do have to use another kind of OS, but I think a Linux variant could run very well and fast, if the libraries are changed accordingly.
Gubbi said:
Strongly disagree.
Now you have more units that stall. And they will stall more often, because you took out all the smarts that lowered apparent latency...
... And they will flush more often, because they are sent on a wild goose chase more often by a non-existent branch predictor.
Cheers
Gubbi
Only the one that is used for system management and running the program logic. Everything else will be chopped up into objects (batches) and dispatched to the other units. You just (more or less) throw your C++ objects in a pool and forget about them.
So, what does it matter if the single Central Processing Unit that is used to control everything runs inefficiently? It's not as if there is much work for it to do in the first place, other than waiting for events, reacting to them, and keeping all the other units occupied.