purpledog said:
A bit off topic, but not too much, and a newbie question as well:
Can a multi-threaded processor hide latency as well as out-of-order execution does?
Is multi-threading cheaper to build than out-of-order?
Depending on how aggressive the multithreading implementation is, it is either massively cheaper or merely much cheaper.
It's somewhat harder to gauge just how much cheaper it would be on some modern OoO processors, because a lot of the scheduling and rename hardware can be repurposed to handle multithreading.
The latency multithreading can hide is highly dependent on the workload and the needs of the user.
As a general rule of thumb, OoO can hide most load-miss latency for misses that stay on-chip; stalls that go all the way to main memory or IO cannot be fully hidden. However, OoO can pass over a load miss and find future independent load misses in the code stream, compacting what would be many hundreds of idle cycles into little more than one miss latency.
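A toy model of that "compacting" effect (my own illustrative cycle counts, not figures for any specific chip): if the OoO window can issue later independent misses while the first one is outstanding, their latencies overlap instead of adding up.

```python
# Hypothetical numbers: a miss to memory and the gap between issuing
# successive independent misses from the instruction window.
MISS_LATENCY = 300   # cycles for one miss to main memory (assumed)
ISSUE_GAP = 10       # cycles between issuing successive independent misses

def serial_stall(n_misses):
    """In-order behavior: each miss is served one after another."""
    return n_misses * MISS_LATENCY

def overlapped_stall(n_misses):
    """OoO behavior: later misses issue while the first is outstanding,
    so the total is one full miss latency plus a small gap per extra miss."""
    return MISS_LATENCY + (n_misses - 1) * ISSUE_GAP

print(serial_stall(4))      # 1200 cycles when handled one at a time
print(overlapped_stall(4))  # 330 cycles when the misses overlap
```

The point of the model is only the shape of the numbers: four serialized misses cost four full latencies, while four overlapped misses cost barely more than one.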
Helper threads have been discussed to give multithreaded chips a similar advantage, but the field is still rather young.
That's great for latency as seen by a single thread, whereas most multithreading schemes carry either a small or a significant single-threaded performance penalty.
For workloads with frequent trips to main memory (a dominant stall source) or IO, multithreading can improve overall performance if there are enough threads to fill in the gaps. Each individual thread will probably still be waiting a lot, but at least other work gets done in the meantime. For most tasks without critical latency requirements, this is great.
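A back-of-the-envelope model of "enough threads to fill in the gaps" (the cycle counts are my own assumptions, not vendor data): if each thread alternates between a burst of compute and a memory stall, the core stays busy once enough threads are interleaved.

```python
import math

def utilization(n_threads, compute_cycles, stall_cycles):
    """Fraction of core cycles doing useful work, assuming perfect
    interleaving: each thread is runnable compute/(compute+stall) of
    the time, and the core caps out at 100%."""
    return min(1.0, n_threads * compute_cycles / (compute_cycles + stall_cycles))

def threads_to_saturate(compute_cycles, stall_cycles):
    """Smallest thread count that fully hides the stalls."""
    return math.ceil((compute_cycles + stall_cycles) / compute_cycles)

# e.g. 50 cycles of work followed by a 300-cycle trip to memory:
print(utilization(1, 50, 300))        # a lone thread is busy ~14% of the time
print(threads_to_saturate(50, 300))   # 7 threads keep the pipeline full
```

Each individual thread still spends most of its time waiting, exactly as described above, but with seven of them in flight the core as a whole never idles.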
Sun's latest Niagara chip has relatively poor single-threaded performance, but in a heavily multi-threaded web server role, having many threads in flight means it tolerates heavy loads and frequent misses better than OoO chips do. Under light load its response time is usually somewhat slower, but not by much; under heavy load the situation reverses, as less-threaded chips start to overload.
Just to be clear, by multi-thread I mean the ability to switch (in hardware) to another thread (instruction stream) if the current one is stalled.
Subsidiary questions:
- How many cycles does the processor take to switch?
- Who triggers the switch: the hardware itself, if it detects a stall?
This sounds like a coarse-grained multithreading approach. The context switches are much faster than OS-level ones, but they still have a cost.
For example, the upcoming Montecito processor takes something on the order of 10-20 cycles to make a switch, because things like cache victim buffers must be written out to prevent incorrect memory behavior.
Detecting a stall can be something as simple as the cache controller signalling the instruction unit when it can't find an entry in the L2 cache. Potentially, the instruction unit can also pick up on certain instructions that are known to take a long time, such as synchronization instructions or cache-bypassing loads.
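Putting the two answers together, a switch-on-miss policy can hide the miss latency itself (assuming another thread is always ready to run) but still pays the fixed switch penalty each time. A sketch with illustrative numbers, not Montecito's actual figures:

```python
# Assumed switch cost, in the same ballpark as the 10-20 cycles
# mentioned above for draining buffers and swapping contexts.
SWITCH_PENALTY = 15

def effective_utilization(compute_cycles, switch_penalty=SWITCH_PENALTY):
    """If another thread is always ready, the miss latency is hidden by
    running that thread, but each switch still burns switch_penalty
    dead cycles before useful work resumes."""
    return compute_cycles / (compute_cycles + switch_penalty)

print(round(effective_utilization(200), 3))  # long runs between misses: ~0.93
print(round(effective_utilization(20), 3))   # frequent misses: ~0.571
```

This is why coarse-grained schemes want switches to be rare relative to their cost: with long compute runs between misses the penalty is noise, but if threads miss every couple dozen cycles the switch overhead eats a large slice of the machine.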