For the desktop, certainly - most desktop applications and games will depend on single-thread performance. For other applications like servers, maybe not, at least not to the exclusion of multi-thread performance, and in those cases, given that single-thread performance is reaching its limit, multi-core is the only way to go.
The time of seemingly exponential single-threaded performance growth is over, not all growth. Incremental gains will probably continue for decades. At a bare minimum, silicon scaling should continue to ~2020.
Actually this is completely wrong. With a large core, you have to multi-task since you have a limited number of cores (unless of course you are running a single tasking OS like MS-DOS). It requires time to interrupt other processes, and for the CPU to make a context change - bigtime latency there!
I'm not saying there should only be one core, I was saying that many dozens of simple single-pipeline cores are not the best solution for a lot of problems.
The performance penalty for multi-tasking is not as bad as you think. Most threads spend much of their time idling, and with a decent OS scheduler, compute intensive threads get a bigger share of the processor's time. With several cores of any type, the cost of multitasking goes from minor to negligible in most cases.
The penalty of context switching is also implementation-dependent. The Intel Montecito core can do a context switch in about 12 cycles, and it is rare that every thread in every process needs 100% attention all the time. Some threads can afford to be sidelined if they only run every other minute.
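If anyone wants a feel for the hand-off cost themselves, here's a rough toy sketch (my own, nothing to do with Montecito): two threads take turns on a condition variable, so every iteration forces at least one hand-off. Pin it to a single core (e.g. with taskset) if you want forced context switches rather than cross-core wake-ups; either way the number includes scheduler and cache effects, so treat it as a loose upper bound, not a 12-cycle figure.

```cpp
// Toy estimate of thread hand-off cost: two threads take turns via a
// condition variable, so each iteration forces at least one hand-off.
// Build with: g++ -O2 -pthread pingpong.cpp
#include <chrono>
#include <condition_variable>
#include <iostream>
#include <mutex>
#include <thread>

int main() {
    constexpr int kIters = 100000;   // arbitrary; just enough to average out noise
    std::mutex m;
    std::condition_variable cv;
    bool ping = true;                // whose turn it is

    auto worker = [&](bool my_turn) {
        for (int i = 0; i < kIters; ++i) {
            std::unique_lock<std::mutex> lock(m);
            cv.wait(lock, [&] { return ping == my_turn; });
            ping = !my_turn;         // hand the turn to the other thread
            cv.notify_one();
        }
    };

    auto start = std::chrono::steady_clock::now();
    std::thread a(worker, true), b(worker, false);
    a.join();
    b.join();
    auto ns = std::chrono::duration_cast<std::chrono::nanoseconds>(
                  std::chrono::steady_clock::now() - start).count();

    // Roughly two hand-offs per iteration.
    std::cout << "~" << ns / (2.0 * kIters) << " ns per hand-off\n";
    return 0;
}
```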
Tasks that have overlapping data sets or memory footprints can benefit, since they can use what the other has cached or stored.
This means that in the grand scheme of things, multitasking is not the biggest factor.
Overall, the individual thread latency is shorter if it is on a core that is more robust, period.
If there are a lot of threads, reduced single-threaded performance isn't too bad, but only if there are enough threads to hide the shortfall.
What will always come back to haunt a processor is that it is not always possible to spawn as many threads as one would like.
With a large number of small cores like Cell, you can dedicate a processor to a critical task, so no context switch is required.
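For reference, on a conventional OS the mechanism being described is just core pinning. A minimal Linux sketch of the idea follows (the core index is arbitrary, and this is not how Cell actually hands work to an SPE):

```cpp
// Sketch of core pinning on Linux: the calling thread is restricted to one
// core so a critical task never has to share it. The core index is arbitrary.
// Build with: g++ -O2 -pthread pin.cpp  (pthread_setaffinity_np is a GNU extension)
#include <pthread.h>
#include <sched.h>
#include <cstdio>

bool pin_to_core(int core) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    return pthread_setaffinity_np(pthread_self(), sizeof(set), &set) == 0;
}

int main() {
    if (!pin_to_core(3)) {
        std::fprintf(stderr, "could not pin to core 3\n");
        return 1;
    }
    std::puts("critical task now owns core 3, as far as placement goes");
    return 0;
}
```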
That would be beyond overkill. I could have a chip with a quarter of the number of cores needed for that, and it would probably be several times more efficient if I allowed them to context-switch.
In a typical desktop environment, there could be over a hundred threads. Of them, probably only one or two are actually doing anything most of the time. For a server, there could be hundreds of active threads, but they'll have a bunch of associated threads that also don't do anything most of the time. Direct mapping of threads to a processor means shutting down most of the cores for most of the time.
In addition to this, it would completely destroy locality for threads that share some of their memory footprint. If tasks do share data or code, they will either need their own copies (cache/local store gets wasted) or have to snoop and broadcast results on the chip. This isn't so bad, unless the design takes the many simple core idea to an absurd degree. (More on this later)
Even CELL context switches regularly, it just tries to avoid doing it too frequently, because the SPEs aren't very good at it.
As a result, the overall performance impact of multi-tasking will be drowned out by other factors past maybe four cores (even at two, system responsiveness for a desktop environment is pretty good).
Also running instructions from local store - no cache misses. Latency should be a hell of a lot lower with a large number of small cores like Cell.
Local store has nothing to do with OOE, and it has its own drawbacks that can affect performance in certain workloads. An OO core can have local store; it doesn't really matter to the local store what kind of core it's hooked to.
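To make the local-store point concrete, here's a toy sketch of software-managed staging: data is explicitly copied into a small fixed buffer, processed there, and written back, the way an SPE would DMA tiles in and out. The buffer size and names are made up; the point is that misses become predictable, but only for access patterns that fit the staging scheme.

```cpp
// Toy illustration of a software-managed "local store": data is staged into
// a small fixed buffer, processed there, and written back -- think of the
// copies as the DMA in/out on a Cell SPE. Buffer size is made up.
#include <algorithm>
#include <cstddef>
#include <vector>

constexpr std::size_t kLocalWords = 4096;   // pretend this is all the on-chip memory

void scale_all(std::vector<float>& data, float factor) {
    float local[kLocalWords];               // stand-in for the local store
    for (std::size_t base = 0; base < data.size(); base += kLocalWords) {
        const std::size_t n = std::min(kLocalWords, data.size() - base);
        std::copy_n(data.begin() + base, n, local);    // "DMA in"
        for (std::size_t i = 0; i < n; ++i)
            local[i] *= factor;                        // compute with no miss surprises
        std::copy_n(local, n, data.begin() + base);    // "DMA out"
    }
}

int main() {
    std::vector<float> v(100000, 2.0f);
    scale_all(v, 0.5f);
    return 0;
}
```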
What begins to dominate at high numbers of cores is the cost of communications and synchronization.
Since the simple cores can't match a single heavy-duty core, they need to work together. The simpler they are, the more cores need to be working on a common problem. That means they need to talk to each other more often.
CELL has a ring bus that serves a small number of cores. The bus only offers peak bandwidth if a transfer is between immediately adjacent cores, and it imparts a significant latency penalty. It's not impossible to manage if the problem being worked on doesn't care about the latency, or if it can be easily divided up.
It gets very hard to guarantee good communication or divide up a problem well if there are 32 cores.
The more cores that are needed to match a single monolithic core, the more each core must communicate with its partners.
This means a given operation or set of operations must send a message out of its core to reach another processor, sending signals that may cross a distance as great as or greater than that of a lower-level cache access.
Depending on the way the cores are hooked together, it could take longer to pass the request along.
For tasks that are not easily parallelizable, the amount of intercommunication is higher.
Inter-core communication has a cost, and it is much higher than the internal forwarding of a large core.
The key is to divide tasks only as far as it does not cause the cost of communication to outweigh the gain in throughput. Talk is not cheap at the silicon level.
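A trivial example of the trade-off, in plain threads rather than anything Cell-specific: a parallel sum where each extra core adds another slice to start up, wait for, and merge. Past some core-count-to-input-size ratio (which is machine-dependent), that overhead eats whatever the extra cores bought you.

```cpp
// Parallel sum sketch: each extra core adds another slice to start, wait for,
// and merge. On small inputs that start-up/merge "talk" can outweigh the
// compute it was supposed to hide.
#include <algorithm>
#include <cstddef>
#include <cstdio>
#include <numeric>
#include <thread>
#include <vector>

double parallel_sum(const std::vector<double>& v, unsigned nthreads) {
    std::vector<double> partial(nthreads, 0.0);
    std::vector<std::thread> pool;
    const std::size_t chunk = (v.size() + nthreads - 1) / nthreads;

    for (unsigned t = 0; t < nthreads; ++t) {
        pool.emplace_back([&, t] {
            const std::size_t lo = std::min(v.size(), static_cast<std::size_t>(t) * chunk);
            const std::size_t hi = std::min(v.size(), lo + chunk);
            partial[t] = std::accumulate(v.begin() + lo, v.begin() + hi, 0.0);
        });
    }
    for (auto& th : pool) th.join();   // everyone has to meet up here
    // The merge (and the thread start-up above) is the communication cost;
    // it grows with core count while each core's slice shrinks.
    return std::accumulate(partial.begin(), partial.end(), 0.0);
}

int main() {
    std::vector<double> data(1000000, 1.0);
    std::printf("sum = %.0f\n", parallel_sum(data, 4));
    return 0;
}
```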
If overhead becomes prohibitive at 32 threads, having 64 cores won't do a thing. Having 32 cores that can do their jobs twice as fast, however, would make a difference.
In other cases, it doesn't matter how a task is divided, since there is some critical stretch of code that can't be split up.
If that stretch of code controls how the task is handled by the hive of cores, then every core is going to sit and wait for it to finish.
Would you rather it run on a wide OOE core that can finish it in 15 cycles, or the weaker one that takes 30?
Remember, it's not just that one core taking 15 extra cycles, it's every one of the other cores waiting on it as well.
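That's really just Amdahl's law. A quick back-of-the-envelope sketch, using a hypothetical workload that is 10% serial and the 15-vs-30-cycle idea from above (the serial stretch simply takes twice as long on the weak core):

```cpp
// Amdahl's law back-of-the-envelope: the serial stretch isn't helped by more
// cores, so making it slower hurts every core. Fractions below are made up.
#include <cstdio>

// Time for one run, with the single fast core as the baseline (= 1.0).
double run_time(double serial, double parallel, double ncores) {
    return serial + parallel / ncores;
}

int main() {
    const double serial = 0.10, parallel = 0.90;               // hypothetical 10%-serial task
    const double fast = run_time(serial, parallel, 32);        // strong core runs the serial bit
    const double slow = run_time(2.0 * serial, parallel, 32);  // weak core: serial bit takes 2x longer
    std::printf("speedup with strong serial core: %.1fx\n", 1.0 / fast);
    std::printf("speedup with weak serial core:   %.1fx\n", 1.0 / slow);
    return 0;
}
```

With 32 cores that works out to roughly 7.8x versus 4.4x, so the doubled serial time costs far more than its own share of the work.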
I'm of the opinion that it is better for a general chip solution to have a few powerful cores added to the mix.
If having 4 big cores means a chip can only have 50 other small cores while a competitor has 64 small ones, the mixed variant would appeal to more people if it could do perhaps 50% better on a wider set of workloads, despite an apparent core deficit of roughly 16% (54 cores versus 64).
ADEX said:
Intel make most of their money on servers - exactly where Terascale will be good.
It will work well for a given subset of servers, not all of them. Intel would be better off giving up ten terascale cores out of 80 if it meant one or two Conroe-type cores were there to keep the chip from giving up when things get complicated.
One of the problems with CPUs with tons of cores is cache coherence: when your core needs data, it needs to know if any other core has a copy of that data cached, and that's going to send latency through the roof, killing single-threaded performance. So Intel are looking at things like "speculative threading" in the compiler in place of OOO. They're also doing a lot of work on the software side, as that is the biggest problem.
It's not just cache coherence, it's communications overhead. Many cores means many will have to talk to each other. Speculation doesn't eliminate the problem with coherency, and there's no way Intel's going to give up on cache. That's a guaranteed drop of at least 100x in performance, period.
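A concrete, easy-to-reproduce example of coherence traffic getting in the way is false sharing, sketched below with made-up iteration counts: two threads each bump their own counter, but because the counters share a cache line, the line ping-pongs between cores on every increment.

```cpp
// False sharing sketch: both counters live on the same cache line, so the
// line ping-pongs between the two cores on every increment (pure coherence
// traffic). Adding alignas(64) to each member usually speeds this up a lot.
#include <atomic>
#include <cstdio>
#include <thread>

struct Counters {
    std::atomic<long> a{0};   // alignas(64) here would give a its own line
    std::atomic<long> b{0};
};

int main() {
    Counters c;
    auto bump = [](std::atomic<long>& x) {
        for (long i = 0; i < 20000000; ++i)
            x.fetch_add(1, std::memory_order_relaxed);
    };
    std::thread t1(bump, std::ref(c.a));
    std::thread t2(bump, std::ref(c.b));
    t1.join();
    t2.join();
    std::printf("a=%ld b=%ld\n", c.a.load(), c.b.load());
    return 0;
}
```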
Chips like Cell and Niagara may look weird today, but in 5 years' time all processors are going to look like that. OOO is the best solution for general-purpose code *today*, but once latency problems really start hitting, OOO isn't going to help, as the core will just be sitting doing nothing. It'll be hard to justify a feature which burns a lot of power but won't improve performance much.
Niagara II will double the single-threaded performance of Niagara. The magic word "parallel" can't produce work if a problem just doesn't parallelize.
Cell will likely prove very interesting going forward as local stores do not need to be kept coherent, I think this will turn out to be a major advantage.
Unless you want them to be coherent, in which case they suck. Not every workload lets each core play in its own sandbox.
It is more likely that Intel will compromise, as IBM did with Xenon, and allow software control over locking cache lines, perhaps even more control over the cache snoop protocols.