The current PPU wastes a lot of cycles because of its long pipeline and frequent stalls.
I believe a reasonably clocked ARM core (1.5-2 GHz) could outrun it at a fraction of the power consumption (though VMX >> NEON).
Only recent Intel and AMD CPUs can forward data directly from the store queue to a subsequent load (and even then certain alignment conditions must be met), and I don't know whether ARM CPUs can do the same yet. So we could be getting LHS (load-hit-store) stalls on ARM as well whenever data is moved between float <-> vector <-> integer registers (or whenever just-written data is read back). Out-of-order execution would fill some of the stall cycles with later instructions, but ARM doesn't have SMT, so all of those instructions must come from the same thread.
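To make the LHS case concrete, here's a minimal sketch (my own example, not from the thread; the function name is made up) of the pattern that triggers it on the in-order PowerPC console CPUs:

```cpp
// On PPU/Xenon there is no direct move between the float/vector and
// integer register files, so the compiler routes this conversion
// through memory: convert in the FPU, store to the stack, then load
// the result back into a GPR. The load hits the still-in-flight store
// and stalls for dozens of cycles (load-hit-store).
int truncate_to_int(float f) {
    return (int)f;  // store + immediate reload -> LHS stall on in-order PPC
}
```

The same penalty applies to vector <-> GPR moves, which is why hot console code tries hard not to mix register files in inner loops.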
I have only programmed in-order ARM CPUs and have no first-hand experience with their out-of-order designs, but I doubt their second-generation out-of-order CPU would already match IBM, Intel or AMD designs that have been refined for 15-20 years. And I doubt they are even trying to match the single-threaded IPC of those monsters, since low power consumption is one of their main goals. An ARM in-order design would likely be much better than the current in-order console CPUs, but I would expect some stall cases to remain (especially when moving data between vector registers <-> general purpose registers). And a gaming console must have powerful vector instructions, and those will be used a lot.
I think SPEs would be rather redundant on a next-gen console if it has a Fermi or GCN type GPU.
They'll be just as useful as they are today. GPU compute has high latency; on a console it could be made much faster, but still not as fast as SPU jobs. And a lot of code that runs inefficiently on GPUs is a good fit for SPUs.
Agreed completely. CPUs execute more and more threads and get wider and wider vector units all the time (AVX is 256-bit, and Intel has stated it will be expanded to 1024-bit). At the same time GPUs are becoming more and more programmable (better branching, new synchronization primitives, shared memory, etc.). It will be harder and harder to fit something in between them (the Cell/SPEs have fewer and fewer use cases left where they perform best).
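As a rough illustration of what that vector width buys (a sketch of my own, assuming aligned inputs and a length that's a multiple of 8; the function name is hypothetical):

```cpp
#include <immintrin.h>

// One 256-bit AVX register holds 8 packed floats, so each vector add
// below does the work of 8 scalar adds.
void add_arrays_avx(const float* a, const float* b, float* out, int n) {
    for (int i = 0; i < n; i += 8) {        // assumes n % 8 == 0
        __m256 va = _mm256_load_ps(a + i);  // assumes 32-byte alignment
        __m256 vb = _mm256_load_ps(b + i);
        _mm256_store_ps(out + i, _mm256_add_ps(va, vb));
    }
}
```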
I don't believe we would even need a Fermi or GCN, as the current VLIW4 Radeons are already performing very well in DirectCompute. Yes, they are a bit slower than Fermi in scalar-heavy general purpose code, but they offer the best performance (and performance per watt) for highly optimized, vectorized DirectCompute code. Pretty much any recent GPU could be used as a general purpose parallel processor, so we can expect the next-gen consoles to have one.
To the people talking about >50-SPE Cells: what kind of interconnect would they use?
That's what I was wondering as well... to feed 256 SPEs you would need a lot of main memory bandwidth, and a radically faster bus between the 256 local stores and main memory.
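A quick back-of-envelope (my numbers, assuming PS3-era Cell figures: 25.6 GB/s XDR main memory, 204.8 GB/s peak EIB serving 8 SPEs):

```cpp
#include <cstdio>

int main() {
    const double xdr_gb_s  = 25.6; // PS3 XDR main memory bandwidth
    const int    spe_count = 256;
    const double per_spe   = 1.0;  // assume a modest 1 GB/s of DMA per SPE

    const double aggregate = spe_count * per_spe; // 256 GB/s
    printf("Aggregate DMA demand: %.0f GB/s (%.0fx the PS3's XDR)\n",
           aggregate, aggregate / xdr_gb_s);
    return 0;
}
```

Even at a modest 1 GB/s per SPE, that's 10x the PS3's entire main memory bandwidth, and the EIB is a ring bus, whose latency grows with the number of stops, so it wouldn't scale to 256 participants anyway.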