Hehe. So you want to move back to cache, and try to fix the latency problem you've just introduced with hardware threading?
Latency is not a problem for the SPE. The LS is the foundation of CELL; there is just no point in SPEs without it.
Latency introduced by HW threading?
Threading is supposed to allow you to hide latency, not create a latency problem where there was none.
To go from thread to thread swiftly when a particular thread stalls for a certain period of time, having a huge context to save and restore does not help you... and increasing the LS's size only makes this problem worse.
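To put a rough number on that switching cost, here is a back-of-envelope sketch in C. The 25.6 GB/s figure is an illustrative assumption, not a measured number; the point is only that a full context swap scales linearly with LS size:

```c
/* Back-of-envelope cost of swapping an SPE-style context.
 * All figures are illustrative assumptions, not measurements. */
#include <stdio.h>

int main(void) {
    const double bw_gb_s = 25.6;            /* assumed LS<->memory bandwidth */
    const int ls_kb[] = { 256, 512, 1024 }; /* today's LS and two doublings */

    for (int i = 0; i < 3; i++) {
        double bytes = ls_kb[i] * 1024.0;
        /* a full switch streams the old context out and the new one in */
        double us = 2.0 * bytes / (bw_gb_s * 1e9) * 1e6;
        printf("LS %4d KB -> ~%5.1f us per context switch\n", ls_kb[i], us);
    }
    return 0;
}
```

Doubling the LS doubles the swap cost, which is exactly the wrong direction if you want threads to cover stalls measured in hundreds of nanoseconds.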
Is putting a lot of man-hours and transistors into pushing up the single-thread performance of the SPUs, forcing developers' hands, the best possible road to explore? Or can we move in the direction the entire industry is going (Sun with Rock and Niagara, Intel with LRB, their desktop CPU line, and IA64, and AMD along with IBM itself... look for some of their recent patents while searching for SIMD, VTE, BTE, etc.): heavily threading our cores and putting more and more cores on chip?
I think the LS is a bit of a tough roadblock for the evolution of a single SPE, so if you want to move it forward significantly, continuing to increase the LS's size is counter-productive.
If you want to argue for doubling the LS and then just adding SPEs as your heart desires, that might be one way, but I am not convinced it is the only way forward.
You will increase bandwidth over time to keep those SPEs fed, but you will also face higher and higher main RAM latency, and if you keep increasing the LS's size you will sooner or later increase the access time to the LS itself, which puts you back at square one.
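That trade-off is just Little's Law: to sustain a given bandwidth at a given latency, you need roughly bandwidth × latency bytes in flight at all times. A quick sketch with assumed numbers:

```c
/* Little's Law: in-flight data needed to sustain a target bandwidth
 * at a given main-RAM latency. All numbers are assumptions. */
#include <stdio.h>

int main(void) {
    const double latency_ns = 400.0;                 /* assumed RAM latency */
    const double bw_gb_s[]  = { 25.6, 51.2, 102.4 }; /* hypothetical BW steps */

    for (int i = 0; i < 3; i++) {
        /* GB/s (1e9 B/s) times ns (1e-9 s) gives bytes directly */
        double in_flight_kb = bw_gb_s[i] * latency_ns / 1024.0;
        printf("%6.1f GB/s @ %3.0f ns -> ~%4.0f KB in flight\n",
               bw_gb_s[i], latency_ns, in_flight_kb);
    }
    return 0;
}
```

The more bandwidth you add, the more outstanding work each SPE has to juggle, which is the argument for threading rather than for ever-larger private stores.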
Latency hiding will not become useless in a 16-32 SPE system even if you double the LS; if anything, it will become more and more critical.
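For context, this is the latency hiding SPE code already does by hand: double-buffered DMA, overlapping compute on one buffer with the transfer of the next. A minimal sketch against the Cell SDK's spu_mfcio.h; the chunk size and the process() kernel are stand-ins I made up:

```c
/* Double-buffered streaming on an SPE: hide main-RAM latency by
 * computing on one buffer while the MFC fills the other. */
#include <stdint.h>
#include <spu_mfcio.h>

#define CHUNK 16384
static volatile char buf[2][CHUNK] __attribute__((aligned(128)));

/* stand-in for the real compute kernel */
static void process(volatile char *data, unsigned size) { (void)data; (void)size; }

void stream(uint64_t ea, unsigned nchunks)
{
    unsigned cur = 0;
    mfc_get(buf[cur], ea, CHUNK, cur, 0, 0);        /* prime buffer 0 */

    for (unsigned i = 0; i < nchunks; i++) {
        unsigned next = cur ^ 1;
        if (i + 1 < nchunks)                        /* start next transfer */
            mfc_get(buf[next], ea + (uint64_t)(i + 1) * CHUNK,
                    CHUNK, next, 0, 0);

        mfc_write_tag_mask(1 << cur);               /* wait for current DMA */
        mfc_read_tag_status_all();

        process(buf[cur], CHUNK);   /* compute while next DMA is in flight */
        cur = next;
    }
}
```

With 16-32 SPEs contending for the same memory controllers, the stall each mfc_get has to cover only grows, so this pattern gets harder to write well, not easier.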
Something also has to be done to give developers a less steep learning curve.
A cache hierarchy with flexible cache locking would get you a bit of both worlds. If you want the predictability and deterministic nature of an LS, you lock down a portion of the cache (which might be a nice, substantial portion of it) and work from there, but you are not killing developers who prefer the cache model and the hardware-managed data movement/prefetching it delivers.
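As a concrete sketch of that hybrid: the cache_lock_range()/cache_unlock_range() calls below are invented names standing in for whatever way-locking facility a given core might expose; no real API is implied.

```c
/* Hypothetical "best of both worlds": pin part of the cache as a
 * deterministic scratchpad, let the rest behave as a normal cache.
 * cache_lock_range()/cache_unlock_range() are invented stand-ins. */
#include <stddef.h>

static int  cache_lock_range(void *p, size_t n)   { (void)p; (void)n; return 0; }
static void cache_unlock_range(void *p, size_t n) { (void)p; (void)n; }

#define SCRATCH_SIZE (128 * 1024)
static char scratch[SCRATCH_SIZE] __attribute__((aligned(128)));

void kernel(const float *in, float *out, size_t n)
{
    /* Pinned region: accesses behave like an LS, with deterministic
     * latency and no evictions, so DMA-style staging still works. */
    if (cache_lock_range(scratch, SCRATCH_SIZE) == 0) {
        /* ... stage hot data through scratch exactly as SPE code
         * stages data through the LS today ... */
    }

    /* Everything else goes through the ordinary cache hierarchy,
     * with hardware prefetching for developers who prefer that. */
    for (size_t i = 0; i < n; i++)
        out[i] = in[i] * 2.0f;

    cache_unlock_range(scratch, SCRATCH_SIZE);
}
```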