The point I was addressing was the idea that, because of chips like those from the Terascale project, Intel was going to give up on OOE.

It will happen, but mainly with increasing processor frequency, which is not rising as fast as one would wish because chip technology is hitting physical limits that have to be circumvented. A single core is better than a dual core any day, if you can get the same performance out of it as out of a dual core with the same number of transistors. Sadly, you can't. The reason multi-core designs are proliferating is the limit on single-core performance, a limit that is still expanding, slowly, but a limit that is there nonetheless.
That's not going to happen. The fact that single-threaded performance will no longer grow as rapidly as it once did doesn't mean that new designs are just going to give up on all the capabilities that led up to this point.
On average, OOE (not even the ultra-speculative stuff of the latest gen) gives about a 50% boost in performance.
Adding 40 more cores to a chip of 80 simple cores is not going to deliver the same boost, thanks to diminishing returns.
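A rough back-of-the-envelope sketch of that trade-off, using Amdahl's law: speedup(N) = 1 / (s + (1 - s) / N), where s is the fraction of the work that stays serial. The 5% serial fraction and the flat 1.5x OOE figure below are assumptions of mine for illustration, not measurements of any real chip.

/* Amdahl's law sketch: more simple cores vs. the same cores with OOE.
 * s = serial fraction (assumed), OOE gain of ~1.5x per core (assumed). */
#include <stdio.h>

static double amdahl(double s, double n)
{
    return 1.0 / (s + (1.0 - s) / n);
}

int main(void)
{
    const double s = 0.05;                    /* assumed serial fraction       */

    double simple80  = amdahl(s, 80.0);       /* 80 simple in-order cores      */
    double simple120 = amdahl(s, 120.0);      /* 120 simple in-order cores     */
    double ooe80     = 1.5 * amdahl(s, 80.0); /* 80 cores, each ~1.5x from OOE */

    printf("80 simple cores:  %.1fx\n", simple80);  /* ~16x                               */
    printf("120 simple cores: %.1fx\n", simple120); /* ~17x: +50% cores, roughly +7% perf */
    printf("80 OOE cores:     %.1fx\n", ooe80);     /* ~24x: the 1.5x helps the serial code too */
    return 0;
}

With those assumptions, going from 80 to 120 simple cores buys about 7%, while the 1.5x from OOE shows up nearly in full, because it speeds up the serial stretches as well as the parallel ones.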
I got off on a tangent; the latency I was addressing was the wall-clock time to completion for a given stream of instructions.

I am not saying multi-tasking performance has to be bad, but if you are that concerned about response latency, then you should not multi-task, because that is the biggest latency problem right there - worse than almost everything else.
High-throughput designs trade off this latency for the advantage of having more streams running simultaneously. As long as the portion of code that runs serially is kept small, the overall penalty to performance scaling is small.
However, in problems that are not perfectly parallel, the serial component begins to dominate sooner. What is troubling about going so far back as to rely on a pool of in-order scalar pipelines is that, in all but the most ideal workloads, the limited extraction of even easy ILP means that a larger portion of the code gets lumped into the serial component of Amdahl's equation than is necessary.
Either a single core takes forever to finish a stream, or it enlists the aid of other cores. That is not without cost: inter-core communication will never be faster than result forwarding, register storage, and cache access within a core.
The more cores there are, the more sacrifices that must be made to adequately link and coordinate them.
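To put a hedged number on that, here is a minimal ping-pong sketch (my own illustration; the thread layout, iteration count, and use of C11 atomics are all arbitrary choices): two threads bounce a flag back and forth through a shared cache line, which is about the cheapest inter-core communication there is, and the round trip still typically costs tens of nanoseconds, versus essentially nothing for forwarding a result to the next instruction inside one core.

/* Ping-pong microbenchmark sketch: two threads hand a flag back and forth
 * through one shared cache line and we time the round trip.
 * Build with -pthread. */
#include <stdatomic.h>
#include <pthread.h>
#include <stdio.h>
#include <time.h>

#define ITERS 1000000
static atomic_int turn = 0;   /* 0: main's turn, 1: pong's turn */

static void *pong(void *arg)
{
    (void)arg;
    for (int i = 0; i < ITERS; i++) {
        while (atomic_load_explicit(&turn, memory_order_acquire) != 1)
            ;                                   /* spin until main hands off */
        atomic_store_explicit(&turn, 0, memory_order_release);
    }
    return NULL;
}

int main(void)
{
    pthread_t t;
    struct timespec a, b;
    pthread_create(&t, NULL, pong, NULL);
    clock_gettime(CLOCK_MONOTONIC, &a);
    for (int i = 0; i < ITERS; i++) {
        atomic_store_explicit(&turn, 1, memory_order_release);
        while (atomic_load_explicit(&turn, memory_order_acquire) != 0)
            ;                                   /* spin until pong replies */
    }
    clock_gettime(CLOCK_MONOTONIC, &b);
    double ns = (b.tv_sec - a.tv_sec) * 1e9 + (b.tv_nsec - a.tv_nsec);
    printf("~%.0f ns per core-to-core round trip\n", ns / ITERS);
    pthread_join(t, NULL);
    return 0;
}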
It's not a win to say parallel execution does better on a problem when the only reason is that single-threaded performance has gone down the crapper.
If a problem is not very parallel--and a vast number are not--the serial portion (which is bigger than it needs to be because the cores are so feeble) becomes the limiter on performance almost immediately. If a problem is not embarrassingly parallel--and most problems are not--then past a certain point the sea of cores winds up waiting on some critical stream to complete, wasting time and energy.
No number of cores can change that, but with more robust cores designers can minimize how much code is forced to run serially in the first place. Even a small number of wider OO cores could catch many of these cases.
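The same equation from the other side, as a small sketch: with a serial fraction s, no core count can push the speedup past 1/s, but letting one wider OO core run the serial stretches k times faster lifts that ceiling to k/s. The s = 0.05 and k = 2 values are, again, just assumptions for illustration.

/* Amdahl ceilings: simple cores only vs. serial code on one wide OO core. */
#include <stdio.h>

int main(void)
{
    const double s = 0.05;  /* assumed serial fraction                          */
    const double k = 2.0;   /* assumed speedup of one wide OO core on serial code */

    printf("ceiling, simple cores only:        %.0fx\n", 1.0 / s); /* 20x */
    printf("ceiling, serial code on wide core: %.0fx\n", k / s);   /* 40x */
    return 0;
}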
It also helps that most real-time OSes are used in environments that don't really need great audio or video quality.

There are a number of real-time OSes out there, and none of the true RTOSes are multi-tasking OSes, for exactly this reason. With multi-tasking - say video decoding, sound, and disk I/O while running an application at the same time - you can and do get video or sound stuttering. You aren't likely to get that if you run the time-critical stuff - the video and audio - on SPEs, for example, and for that kind of use it is efficient, since you utilise the SPEs fully while you are streaming and can use them for something else once you stop.
Real-time is about stringent maintenance of deterministic behavior and fast response.
In most situations, the constraints don't need to be that harsh, which is good because tight constraints mean reduced performance.
Sound or video stuttering in most user environments can be handled by maybe four cores, so long as the IO junk can be handed to one of them.
Most heavily parallel environments don't need to be real-time either, and for large systems it would be incredibly serializing to force determinism.
Let's note that multiprocessing of any kind is a far greater source of non-deterministic behavior than wider issue, OOE, or context switching.
Locks and semaphores don't care how many cores there are. True, it helps if there's at least one, and it makes a difference if there is more than one, but they are software constructs. They don't know anything about the silicon that runs them (see the sketch a little further down).

You can of course reduce the timeslices to very short times to get better response, but if you do, your performance drops. You can also have time-critical code interrupting time-critical code, and non-reentrant code in libraries you may need to use can require spin locks or semaphores, which causes even more latency.
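Going back to the point that locks are software constructs, here is a minimal sketch (the counter, thread count, and iteration count are arbitrary choices of mine): the critical section below executes one thread at a time no matter how many cores the machine has, so piling on cores does nothing for code that has to funnel through it.

/* Lock serialization sketch: a mutex-protected counter bumped by several
 * threads.  The result is the same, and the critical section is just as
 * serial, on 1 core or 80.  Build with -pthread. */
#include <pthread.h>
#include <stdio.h>

#define NTHREADS 8
#define ITERS    100000

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static long counter = 0;

static void *worker(void *arg)
{
    (void)arg;
    for (int i = 0; i < ITERS; i++) {
        pthread_mutex_lock(&lock);   /* everyone funnels through here,    */
        counter++;                   /* whether the box has 1 core or 80  */
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

int main(void)
{
    pthread_t t[NTHREADS];
    for (int i = 0; i < NTHREADS; i++)
        pthread_create(&t[i], NULL, worker, NULL);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(t[i], NULL);
    printf("counter = %ld\n", counter);   /* always NTHREADS * ITERS */
    return 0;
}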
I suppose you could have a core for every thread in the environment, and then 99% of the time those cores would sit doing nothing, unavailable for anything else.
It's just better to aim for a reasonable solution, where the cost of context switching is minimal but hardware utilization is high.
Why not have a context switch and get over it?
With SMT or SoEMT, the cost either isn't there at all or it amounts to less than 20 cycles.
It's all about diminishing returns. It's why there aren't 1000-way associative caches. We could do it, but the difference from 18-way is almost nothing.