You know, I think perhaps better would be another class of SPE. You'd have PPE, SPE+LS, and SPE+cache+LS or somesuch as an optimized random-access processor. Nope, still can't see much point! Just shove another PPE in there! The moment you hit random access, the SPE's performance advantage is totally lost.
We might say that adding a shared L1 for the SPEs is a waste of silicon, and that is an argument we might agree with (depending on what we would have to sacrifice if we end up cutting things back).
Saying that we should add more PPEs or optimize random-access processors misses the point. PPEs are fairly HUGE and power-hungry as it is, and adding more would eat up a lot of die space. Rather than changing the actual PPE-to-SPE ratio, it would be better to optimize the PPEs themselves for both power and performance (somehow developers are publicly, on forums, suggesting that something better than PPE/PPX could be realized).
If all the SPEs were was glorified vector co-processors for the PPE it would be one thing, but they are not: we value their independence and the amount of code they can run fast (the more time developers spend writing SPE code, the more they seem to enjoy the speed at which these buggers execute whatever you throw at them).
Would it improve their performance if we could accelerate small, cacheable memory transactions, and thus general-purpose code, without changing the way developers access the SPU's register file and Local Storage?
Surely, even with a good LS-L1 cache-EIB hierarchy, performance with general-purpose code might not be as fast as on, say, a Core 2 Duo (and the worst-case scenarios might even produce an increased performance hit, since on a cache miss we would throw away some cycles). But let's not over-estimate the PPE's performance: despite its huge size and power consumption (compared to the SPEs), it does not seem that many developers are THAT happy about the kind of code that runs quickly on it, as opposed to the code that is faster or just as fast on the SPEs (I am basing this on things publicly said on forums such as this one, and IIRC something was indeed said in this forum).
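To put the hit/miss trade-off above in concrete terms, here is a rough sketch of the usual average-access-time arithmetic. The function names and every cycle count in it are made-up placeholders for illustration, not measured Cell figures; the only point is that a small cache pays off once the hit rate clears a break-even threshold set by how many extra cycles a miss throws away.

```python
def average_access_cycles(hit_rate, hit_cycles, miss_cycles):
    """Expected cycles per access for a cache with the given hit rate."""
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    return hit_rate * hit_cycles + (1.0 - hit_rate) * miss_cycles


def break_even_hit_rate(hit_cycles, miss_cycles, uncached_cycles):
    """Hit rate above which the cached path beats the uncached path.

    miss_cycles is the uncached cost plus the cycles wasted on the
    failed lookup, so it is expected to be >= uncached_cycles.
    """
    return (miss_cycles - uncached_cycles) / (miss_cycles - hit_cycles)


# Hypothetical numbers, chosen only to show the shape of the trade-off.
HIT = 6          # assumed cycles for an L1 hit
UNCACHED = 200   # assumed cycles for a small fetch with no cache at all
MISS = 210       # assumed: uncached cost plus 10 wasted lookup cycles

threshold = break_even_hit_rate(HIT, MISS, UNCACHED)
print(f"break-even hit rate: {threshold:.3f}")
print(f"cycles at 50% hits:  {average_access_cycles(0.5, HIT, MISS):.1f}")
```

With placeholder numbers like these the break-even hit rate is low, but it climbs quickly as the miss penalty grows relative to the uncached cost, which is the worst case the paragraph above worries about.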
If for a reasonable price we could buy a few percentage points of general-purpose performance for the SPEs, it might be a win. It would make them even more independent from the PPE and faster overall (especially when the developer was not able to optimize an SPU application well enough and left in some scalar/random-access-happy code that becomes a bottleneck in performance-critical areas of the SPU application). It would also make them more adaptable to all kinds of processing, while still taking advantage of the fact that on a CELL processor we have MANY SPEs and they are all pretty FAST, even though they might not be as optimal as someone could want them to be for all kinds of code.
Evidently this optional L1 cache was not implemented for the PLAYSTATION 3 CELL Broadband Engine, but the game is not over yet for PLAYSTATION 4.