Cell/Xenon vs. "normal" CPUs: OS performance

SPM said:
Windows is a binary API. In other words, it offers binary compatibility, and so the binaries remain fixed. Even when a bug is found, the binary code isn't usually changed unless it is a really serious one, because some programs may use undocumented entry points or may rely on the bug to work properly; the bug just becomes a feature.

Actually, new optimizations are added to Windows all the time: pieces of the OS get rewritten for better performance, the binaries get recompiled with a newer compiler for each release, and the compiler itself gets better with every release.

However, every time Windows changes, MS has to make sure it replicates the old behavior of any APIs it changes, bug-for-bug, because of the huge array of existing software out there. There is a facility in Windows to trigger app-specific backwards-compatibility hacks as well, but no one can ever test all the possible programs that run on Windows.
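As a rough illustration of what such an app-specific compatibility hack amounts to, here is a minimal C sketch. Everything in it is invented for illustration (in real Windows the decision comes from a per-application compatibility database, not a hard-coded name check):

#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Stand-in for the per-application compatibility lookup; the name
   "oldapp.exe" is a made-up example. */
static bool app_needs_legacy_behavior(const char *exe_name) {
    return strcmp(exe_name, "oldapp.exe") == 0;
}

/* Suppose the original API mistakenly clamped a limit at 255 and some
   shipped programs came to depend on that. The fix must be gated so
   old programs keep the old, buggy behavior. */
size_t get_buffer_limit(const char *exe_name) {
    if (app_needs_legacy_behavior(exe_name))
        return 255;    /* replicate the old bug, bug-for-bug */
    return 4096;       /* corrected behavior for everyone else */
}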
 
purpledog said:
Subsidiary question:
- How many cycles does the processor take to switch?
- Who triggers the switch: the hardware itself, when it detects a stall?
Depends on the CPU. Usually thousands of cycles, though. And interrupts, raised by the OS scheduler, usually trigger a process context switch.
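For a ballpark on the software side, two processes ping-ponging a byte through a pair of pipes force a scheduler-driven switch on every hop. This is a minimal POSIX C sketch (the iteration count is arbitrary):

#include <stdio.h>
#include <time.h>
#include <unistd.h>

int main(void) {
    int p2c[2], c2p[2];
    if (pipe(p2c) || pipe(c2p)) { perror("pipe"); return 1; }

    const int iters = 100000;
    char b = 'x';

    if (fork() == 0) {                    /* child: echo each byte back */
        for (int i = 0; i < iters; i++) {
            (void)!read(p2c[0], &b, 1);
            (void)!write(c2p[1], &b, 1);
        }
        _exit(0);
    }

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < iters; i++) {     /* parent: ping, then wait for pong */
        (void)!write(p2c[1], &b, 1);
        (void)!read(c2p[0], &b, 1);
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
    printf("~%.0f ns per switch\n", ns / (2.0 * iters));  /* 2 switches per round trip */
    return 0;
}

At a few GHz, the microsecond-scale figure this typically prints corresponds to thousands of cycles per switch.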
 
purpledog said:
Subsidiary question:
- How many cycles does the processor take to switch?
- Who triggers the switch: the hardware itself, when it detects a stall?

A hardware thread context switch can happen pretty much on the next instruction for a processor that's hardware multithreaded like Xenon. There is a huge performance advantage for the hardware threads when compared with software threads, you basically get the first two threads for "free" in terms of context switch overhead.

The processor itself triggers the switch on a stall.
 
aaaaa00 said:
A hardware thread context switch can happen pretty much on the next instruction for a processor that's hardware multithreaded like Xenon. There is a huge performance advantage for the hardware threads when compared with software threads, you basically get the first two threads for "free" in terms of context switch overhead.

The processor itself triggers the switch on a stall.


Sounds like a good way to go!
Does anybody know if out-of-order execution is better than, let's say, a 4-thread hardware context (like UltraSPARC T1) in terms of minimising stalls?
 
purpledog said:
Sounds like a good way to go!
Does anybody know if out-of-order execution is better than, let's say, a 4-thread hardware context (like UltraSPARC T1) in terms of minimising stalls?

SMT is as good an efficiency booster as you write your threads to be; i.e., it relies largely on cooperation from the programmer, as opposed to the compiler or any inherent ops parallelism exploited by the out-of-orderness.

Furthermore, there's this expected relative win of (non-superscalar or narrow-superscalar) SMT, in-order designs versus wide-superscalar, non-SMT, in-order designs (say, Itaniums), stemming from the empirical rule that humans are generally better at exploiting macro-level parallelism than the compiler is at exploiting instruction-level parallelism.
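To make the distinction concrete, here is a minimal C/pthreads sketch of macro-level parallelism as a programmer would express it: the summation is split into halves, one per thread, something no amount of compiler ILP extraction would do on its own (the array size and two-way split are arbitrary choices for the example):

#include <pthread.h>
#include <stdio.h>

#define N 1000000
static double a[N];

struct span { int lo, hi; double sum; };

/* Each thread sums its own half; the parallelism is explicit in the
   program structure, not discovered by the compiler or the core. */
static void *partial_sum(void *p) {
    struct span *s = p;
    double acc = 0.0;
    for (int i = s->lo; i < s->hi; i++) acc += a[i];
    s->sum = acc;
    return NULL;
}

int main(void) {
    for (int i = 0; i < N; i++) a[i] = 1.0;
    struct span s0 = { 0, N / 2, 0.0 }, s1 = { N / 2, N, 0.0 };
    pthread_t t0, t1;
    pthread_create(&t0, NULL, partial_sum, &s0);
    pthread_create(&t1, NULL, partial_sum, &s1);
    pthread_join(t0, NULL);
    pthread_join(t1, NULL);
    printf("sum = %f\n", s0.sum + s1.sum);
    return 0;
}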
 
darkblu said:
SMT is as good an efficiency booster as you write your threads to be; i.e., it relies largely on cooperation from the programmer, as opposed to the compiler or any inherent ops parallelism exploited by the out-of-orderness.

If I understand correctly, you're saying that the advantage of OOO (out-of-order execution) is that it doesn't need multiple threads to be efficient.
I would say that since the whole industry is moving toward more and more parallelism, this doesn't seem to be a great advantage in favor of OOO.

darkblu said:
Furthermore, there's this expected relative win of (non-superscalar or narrow-superscalar) SMT, in-order designs versus wide-superscalar, non-SMT, in-order designs (say, Itaniums), stemming from the empirical rule that humans are generally better at exploiting macro-level parallelism than the compiler is at exploiting instruction-level parallelism.

And here, let me rephrase again ;) :
You're saying that OOO is generally quite a poor (micro-level) optimisation compared to what a human can do, because we have a much better (macro-level) overview of what the code is doing.

Conclusion: in a world where every single program is designed for parallelism by good-enough programmers, SMT is more efficient than OOO.
... I'm sure I'm missing something here. Anyway, I'm going to buy a book on "parallelism for dummies" right now!

Does anybody have any idea of the respective costs ($) of SMT and OOO?
Basically: how much space does each take on a chip?
 
purpledog said:
Does anybody have any idea of the respective costs ($) of SMT and OOO?
Basically: how much space does each take on a chip?
Depends on the extent of things. SMT generally implies duplicated register files and other such structures so that each context's state survives a switch. SMT often also means that more cache is needed in order to minimize eviction of data used by a thread that was temporarily supplanted. The thing is, you could skip scaling up the cache and just increase the number of SMT ways to cover more latency, which would probably cost less, but then you're depending on higher TLP from the programmer to get your throughput.

On the OOO side, how extensive is it? Just plain Tomasulo's algorithm, or current-x86-level? Are you including speculative this-and-that, or is it simply out of order? I remember someone saying that doubling our sustained IPC since the time of the 486 has cost us something like 12x the transistors.

If you're looking at logic transistors alone and comparing, say, the 4-way SMT of the Niagara cores to the massive set of ILP extraction/protection schemes in current x86 chips, then there's probably a pretty significant advantage for SMT vs. OOO (a good 15:1 advantage in die space).
 
There are some good documents by Hily and Seznec on the topic of SMT and in-order versus out-of-order execution here. The files are in PostScript (PS) format, so you will need to convert them to PDF using something like PStill if you don't have PS support on your system. I already made a conversion of one document (PI-1179), but unfortunately the file exceeds the allowable attachment size (it's 187 KB and ZIPs can only be ~100 KB), so I cannot include it here.

For those interested in a summary of the paper "Out-Of-Order Execution May Not Be Cost-Effective on Processors Featuring Simultaneous Multithreading" (PI-1179, as mentioned above): it basically comes down to the simple logic that using two (re)scheduling methods is not effective, and that SMT plus in-order execution may offer the better pay-off, since the increase in IPC from combining SMT with OoOE is not cost-effective. The general theme, however, is in the context of a general multi-threaded computing environment (like the Windows operating system, for example) and overall performance across all threads, not single-thread performance. There are some other interesting papers about branch prediction and the memory implications of using SMT.

EDIT: I had incorrectly referred to the file as PI-1197 instead of PI-1179. Changed "rescheduling methods" to "(re)scheduling methods"
 
purpledog said:
A bit off topic, but not too much. A newbie question as well:

Can a multi-threaded processor hide latency as well as out-of-order execution can?
Is multi-threading cheaper to implement than out-of-order?

Depending on how aggressive the multithreading implementation is, it is either massively cheaper or much cheaper.

It's somewhat harder to gauge just how much cheaper it would be on some modern OoO processors, because a lot of the scheduling and rename hardware can be repurposed to handle multithreading.

The latency multithreading can hide is highly dependent on the workload and the needs of the user.

As a general rule of thumb, OoO can handle most of the latency of load misses that stay on-chip. Memory and IO stalls cannot be worked around entirely. However, OoO can pass over a load miss and find future load misses in the code stream, compacting what would be multiple hundreds of idle cycles into a time equal to little more than one miss.
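The effect is easy to see in code. In the following C sketch (sizes and access patterns are arbitrary choices for the example), the pointer chase serializes its misses, while the independent loads in the gather can have several misses in flight at once, which is exactly the compaction described above:

#include <stdio.h>
#include <stdlib.h>

#define N (1 << 22)

/* Pointer chase: every load depends on the previous one, so even an
   aggressive OoO core can keep only one miss in flight at a time. */
static long chase(const long *next, long steps) {
    long i = 0;
    while (steps--) i = next[i];
    return i;
}

/* Independent loads: the four accesses per iteration don't depend on
   each other, so an OoO core can overlap their cache misses. */
static long gather(const long *a, const long *idx, long n) {
    long s = 0;
    for (long i = 0; i + 4 <= n; i += 4)
        s += a[idx[i]] + a[idx[i + 1]] + a[idx[i + 2]] + a[idx[i + 3]];
    return s;
}

int main(void) {
    long *next = malloc(N * sizeof *next);
    long *idx = malloc(N * sizeof *idx);
    if (!next || !idx) return 1;
    srand(1);
    for (long i = 0; i < N; i++) { next[i] = i; idx[i] = rand() % N; }
    for (long i = N - 1; i > 0; i--) {     /* shuffle to defeat the prefetcher */
        long j = rand() % (i + 1), t = next[i];
        next[i] = next[j]; next[j] = t;
    }
    printf("%ld %ld\n", chase(next, N), gather(next, idx, N));
    free(next); free(idx);
    return 0;
}

Timing the two functions over the same data volume typically shows the gather finishing several times faster, despite missing the cache just as often.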

Helper threads have been discussed to give multithreaded chips a similar advantage, but the field is still rather young.
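A minimal sketch of the helper-thread idea in C/pthreads (GCC's __builtin_prefetch stands in for a dedicated prefetch mechanism; the size, stride, and lack of throttling are simplifications):

#include <pthread.h>
#include <stdio.h>

#define N (1 << 24)
static int data[N];

/* Helper thread: runs ahead of the consumer, touching each cache line
   so it is (hopefully) resident by the time the main thread arrives.
   A real implementation must throttle itself so it neither falls
   behind nor evicts lines before they're used. */
static void *prefetcher(void *arg) {
    (void)arg;
    for (long i = 0; i < N; i += 16)    /* roughly one touch per 64B line */
        __builtin_prefetch(&data[i], 0, 0);
    return NULL;
}

int main(void) {
    for (long i = 0; i < N; i++) data[i] = (int)i;
    pthread_t helper;
    pthread_create(&helper, NULL, prefetcher, NULL);

    long long sum = 0;                  /* main thread: the real work */
    for (long i = 0; i < N; i++) sum += data[i];

    pthread_join(helper, NULL);
    printf("sum = %lld\n", sum);
    return 0;
}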

Such compaction is great for latency as seen by a single thread, whereas most multithreading schemes impose either a small or a significant penalty on single-threaded performance.

For workloads with frequent trips to main memory (a dominant stall source) or IO, multithreading can improve overall performance if there are enough threads to fill in the gaps. Each individual thread will probably still be waiting a lot, but at least other work gets done in the meantime. For most tasks without critical latency requirements, this is great.

Sun's latest Niagara chip has relatively poor single-threaded performance, but in a heavily multi-threaded web server role, having many threads working means it can take heavier loads and frequent misses better than OoO chips. Its response time is usually somewhat slower, but not much slower. On heavy loads, the situation is reversed as less-threaded chips start to overload.

purpledog said:
Just to be clear, by multi-thread, I'm talking about the ability to switch (in hardware) to another thread (command list) if the current one is stalled.

Subsidiary question:
- How many cycles does the processor take to switch?
- Who triggers the switch: the hardware itself, when it detects a stall?

This sounds like a coarse-grained multithreading approach. The context switches are much faster, but they are still there.

For example, the upcoming Montecito processor takes something on the order of 10-20 cycles to make a switch, because there are things like cache victim buffers that must be written out in order to prevent incorrect memory behavior.

Detecting a stall can be something as simple as the cache controller signalling the instruction unit when it can't find an entry in the L2 cache. Potentially, the instruction unit can also pick up on certain instructions that are known to take a long time, such as synchronization instructions or cache-bypassing loads.
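A toy C model of such a switch-on-miss (coarse-grained) scheme may help make the trade-off concrete. All numbers are illustrative except the switch cost, which uses the 10-20 cycle Montecito figure quoted above:

#include <stdio.h>
#include <stdlib.h>

#define NCTX 2
#define SWITCH_COST 15      /* cycles; per the Montecito figure above */
#define MISS_LATENCY 300    /* cycles until a missing context is ready again */

struct ctx { long ready_at; long work_done; };

int main(void) {
    struct ctx c[NCTX] = { { 0, 0 }, { 0, 0 } };
    long cycle = 0;
    int cur = 0;
    srand(1);
    for (int ev = 0; ev < 20; ev++) {
        long run = 50 + rand() % 200;   /* run until this context misses */
        cycle += run;
        c[cur].work_done += run;
        c[cur].ready_at = cycle + MISS_LATENCY;
        int next = (cur + 1) % NCTX;    /* hardware picks another context */
        cycle += SWITCH_COST;           /* ...paying the switch penalty */
        if (c[next].ready_at > cycle)
            cycle = c[next].ready_at;   /* nothing ready: the core stalls */
        cur = next;
    }
    for (int i = 0; i < NCTX; i++)
        printf("ctx %d: %ld cycles of useful work\n", i, c[i].work_done);
    printf("total: %ld cycles\n", cycle);
    return 0;
}

With only two contexts and 300-cycle misses the model still stalls often; adding contexts (as Niagara does) fills more of the gaps, at the cost of more replicated state.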
 