Over the last two years IBM has significantly improved its ability to develop high-speed ICs. Both Cell and Xenon were great opportunities to use external funding for advanced research on custom logic design, and it seems to have paid off. If POWER6 comes out in the 4-5 GHz range as advertised, with a much shorter pipeline than Cell, then it's not so unrealistic that a 65nm shrink of Cell could hit 6 GHz. That's not taking power consumption into account, obviously, which might go way up.
I'm expecting the 65nm Cell's power use to plummet (at least at 3.2 GHz).
It's currently rated at 110W at 1.1V. Going dual voltage will allow them to drop the logic voltage; even at 90nm that'd be a 30+% saving. 65nm should allow a further saving of around 30%. In total I expect the saving to be 50% or better at the same clock speed.
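To show how those two savings combine (they multiply, they don't add), here's a quick back-of-envelope sketch in Python. The function name is made up for illustration, and the 30% figures are just the estimates above, not measured numbers.

```python
def combined_saving(*savings):
    """Total fractional saving when each saving applies to what is left."""
    remaining = 1.0
    for s in savings:
        remaining *= (1.0 - s)  # each step scales the power that remains
    return 1.0 - remaining


RATED_WATTS = 110.0  # rating quoted in the post, at 1.1V

# Assumed inputs: ~30% from dual voltage, then a further ~30% from 65nm.
total = combined_saving(0.30, 0.30)
estimated_watts = RATED_WATTS * (1.0 - total)

print(f"total saving: {total:.0%}")
print(f"estimated power: {estimated_watts:.1f} W")
```

Two 30% cuts in a row leave 0.7 x 0.7 = 0.49 of the original power, so roughly a 51% saving, which is where the "50% or better" comes from.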
This should enable the Cell to run at higher frequencies in uses other than the PS3. They'll have fewer constraints on power there, so expect some pretty high frequency parts. I don't know if they'll get anything like 6 GHz, though that may be possible if they decouple the frequency of the PPE: it uses more power than an SPE, so if the PPE was clocked lower the SPEs could be clocked relatively higher.
LS Vs Cache
On the topic of the thread...
Each has relative advantages and disadvantages, but LS is higher bandwidth and lower latency than cache. L1s may be lower latency, but that's because they are small; a 256K L1 would run slower than a 256K LS.
Cache represents memory; its purpose is to pretend to the CPU that the memory is faster than it really is. A cache needs to be kept in sync with memory and with any other caches. This can be avoided on dual core chips by using larger shared caches, but that increases latency. As the number of cores increases, coherence will become a big problem as it'll increase cache latency. Expect to see complex cache arrangements like AMD's Barcelona or Sun's Rock.
LS does not represent memory, so it does not require redirection logic. It also doesn't need to be kept in sync (coherent) with other LSs, so increasing the number of SPEs will not cause LS latency to increase. LSs are also smaller than caches and use less power.
The problem with LSs is unpredictable code or data structures. A cache is better for working with this type of code, which is why the PPE uses a cache: control code is much more likely to be like this.
High compute code can use more predictable data structures, which are more LS friendly. When it does, data can be double buffered so the processor does not stall on memory. Processors without LS generally cannot do this very well (if at all), so SPEs can run this sort of code faster.
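For anyone who hasn't seen double buffering, here's a rough sketch of the control flow in Python. It's only a simulation: `fake_dma_get` is a made-up stand-in for an asynchronous DMA transfer into the LS, and in plain Python it runs sequentially, whereas on a real SPE the transfer would be in flight while the compute step runs and you'd wait for it to complete before swapping buffers.

```python
def fake_dma_get(main_memory, start, size):
    """Hypothetical stand-in for an async DMA transfer into the local store."""
    return main_memory[start:start + size]


def double_buffered_sum(main_memory, chunk_size):
    """Sum main_memory chunk by chunk using two alternating buffers."""
    buffers = [None, None]  # two LS-resident buffers
    current = 0
    total = 0

    # Prime the pipeline: fetch the first chunk before the loop starts.
    buffers[current] = fake_dma_get(main_memory, 0, chunk_size)
    offset = chunk_size

    while buffers[current]:  # an empty chunk means we ran off the end
        nxt = 1 - current
        # Start fetching the next chunk into the idle buffer...
        buffers[nxt] = fake_dma_get(main_memory, offset, chunk_size)
        offset += chunk_size
        # ...and compute on the current buffer in the meantime.
        total += sum(buffers[current])
        current = nxt  # swap roles for the next iteration

    return total


result = double_buffered_sum(list(range(10)), 4)
print(result)
```

The point is that the fetch for chunk N+1 is always issued before the work on chunk N begins, so with predictable access patterns the memory latency hides behind the computation.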
If the data cannot be made LS friendly, the LS can be made to act like a cache. This has a fairly high latency but enables more app types to work on SPEs.
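As a rough idea of what "LS acting like a cache" means, here's a toy direct-mapped, read-only software cache in Python. The class and its parameters are my own invention for illustration and are not how any real Cell library does it; a real one would also handle writes and use DMA for the line fills.

```python
class SoftwareCache:
    """Toy direct-mapped, read-only cache held in a simulated local store."""

    def __init__(self, main_memory, num_lines=4, line_size=4):
        self.main_memory = main_memory
        self.num_lines = num_lines
        self.line_size = line_size
        self.tags = [None] * num_lines   # which memory line each slot holds
        self.lines = [None] * num_lines  # the cached data itself
        self.hits = 0
        self.misses = 0

    def read(self, address):
        line_number = address // self.line_size
        slot = line_number % self.num_lines
        # The tag check runs in software on every access; that is the
        # extra latency compared with a hardware cache.
        if self.tags[slot] != line_number:
            self.misses += 1
            start = line_number * self.line_size
            # On a miss, pull the whole line in from main memory.
            self.lines[slot] = self.main_memory[start:start + self.line_size]
            self.tags[slot] = line_number
        else:
            self.hits += 1
        return self.lines[slot][address % self.line_size]


memory = list(range(100, 164))  # 64 words of pretend main memory
cache = SoftwareCache(memory)
values = [cache.read(a) for a in (0, 1, 2, 40, 0)]
print(values, cache.hits, cache.misses)
```

Every access pays for the tag lookup even on a hit, which is why this is slower than straight LS access, but it lets code with irregular access patterns run without being restructured.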
LS is like many decisions in Cell: they've traded hardware complexity for software complexity. The result is a fast processor which uses relatively little power but takes more thought (or a complete change of thought) to program.