Cell/Xenon vs "normal" CPUs OS performance.

SPM

I have read quite a few articles on the Internet about how badly Cell will perform compared with "normal" CPUs like Intel or AMD processors at general tasks like running an OS and stuff like AI, because of the lack of out-of-order processing in the PPE (the same applies to Xenon). The SPEs are criticised for lack of enough cache, lack of out-of-order execution and lack of branch prediction, supposedly making them unsuitable for AI as a consequence.

The PPE and Xenon cores are conventional processors with less cache than typical, but since most processor time is spent in a small number of small, tight loops, the resulting speed reduction shouldn't be that huge and should be more than made up for by the code accelerated by the multi-core architecture of both Cell and Xenon.

Both PPE and Xenon cores are in-order. This can be made up for by using intelligent compiler technology to pre-optimise the order of execution. This won't work for binary OSes like Windows, which need to run code written for legacy processors (ix386). Windows PCs need to be able to run code written for several processors, making it impossible to use intelligent compilers to optimise code to make up for the lack of out-of-order execution. Also, the vector processing features in the multimedia extensions differ between CPUs, and so are difficult to optimise for multiple processors on Windows. However, in code compiled for a specific CPU (like the PS3 or XBox360), an intelligent optimising compiler can replace the need for out-of-order execution. This is also true for an OS like Apple OSX, which is machine specific. It is also true to some extent for OSes like Linux, which exist as source code rather than binaries and so can easily be recompiled in a machine-specific way - especially with a source distribution like Gentoo Linux (the only problem is proprietary applications for which source code is not distributed and which are not compiled for the specific CPU). It is certainly possible for a Sony Linux or a Microsoft Windows CE included with the PS3 or XBox360.

As far as the SPEs are concerned, I can't understand how code written for the SPE units can run slower than on a conventional processor, unless you take typical procedural code written for conventional processors and try to run it on an SPE (i.e. load code into SPE local memory, use it only once, and then load the next segment of code). Nobody in their right mind would do that, would they? In most cases code will be written to run within the local memory of the SPEs and make use of DMA (including scatter lists to load data randomly located in large expanses of main memory), using branch hints, clever coding and a stream-processing methodology to keep loops and frequently used data within the local store. In most cases the code will run an order of magnitude faster than on conventional processors with a machine-managed cache, simply because this enforced minimisation of instruction and data transfer to/from slow main memory can be done far more effectively if you know and manually control what the algorithm does rather than leaving it up to a dumb piece of hardware.

As for AI on the SPE, a procedural AI model where the logic state is represented by the position of the program in the code is not appropriate for the SPE, because this involves long branching. Instead, the AI state running on the SPE should be represented as boolean flags or flag array data in main memory, loaded into SPE local memory for processing. For most things, logic operations performed on boolean array tables in SPE local memory can take the place of branching and code execution position to represent AI state. In most cases it should be possible to carry out a more complex sequence of boolean operations on an array of boolean values without branching, giving the same effect as repeated operations that set one bit at a time using conventional decision branching. The code that does this will run in SPE local memory, and may stream-process a large number of AI objects, e.g. updating the status flags for a large number of characters in a game. It is possible to parse or traverse large data structures in main memory efficiently using scatter-list DMA to bring data into SPE local memory. If blocks of data can be processed together, then the SPE is more efficient than a conventional processor. If not, then it is no worse than a conventional processor. The main thing is to keep the code in local SPE memory (i.e. to stream-process the AI data); otherwise it will be less efficient than a conventional processor. The overall AI/logic status of the main program (rather than object AI) may be better suited to conventional procedural, branch-represented logic status, but this is no problem, because it can be done on the PPE, which runs the main program execution thread.
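To make the boolean-flag idea concrete, a minimal sketch might look something like this (the flag names and the 0/1 input arrays are invented purely for illustration, not taken from any real engine): the per-character status words are updated with masks and bitwise ops only, so a whole batch of characters can be streamed through SPE local store without any data-dependent branches.

#include <cstdint>
#include <cstddef>

// Invented flag bits for one game character, purely for illustration.
enum : std::uint32_t {
    FLAG_ALIVE   = 1u << 0,
    FLAG_ALERTED = 1u << 1,
    FLAG_FLEEING = 1u << 2,
    FLAG_LOW_HP  = 1u << 3,
};

// Update a batch of AI status words with no per-character branches.
// saw_player[] and hp_below_25[] are precomputed 0/1 inputs per character
// (e.g. produced by an earlier pass over position/health data).
void update_ai_flags(std::uint32_t* flags,
                     const std::uint32_t* saw_player,
                     const std::uint32_t* hp_below_25,
                     std::size_t count)
{
    for (std::size_t i = 0; i < count; ++i) {
        std::uint32_t f = flags[i];

        // Turn 0/1 conditions into all-zero / all-one masks.
        std::uint32_t alive_mask = 0u - (f & FLAG_ALIVE);   // FLAG_ALIVE is bit 0
        std::uint32_t saw_mask   = 0u - saw_player[i];
        std::uint32_t lowhp_mask = 0u - hp_below_25[i];

        // "alerted |= alive && saw_player", etc., expressed as mask logic.
        f |= FLAG_ALERTED & alive_mask & saw_mask;
        f |= FLAG_LOW_HP  & lowhp_mask;
        f |= FLAG_FLEEING & alive_mask & saw_mask & lowhp_mask;

        flags[i] = f;
    }
}

The same loop vectorises naturally, since every character is just independent bit arithmetic.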

It has also been suggested that programming Cell will be more difficult than a conventional processor. But will it? The SPE code will mostly be separate code, seen as libraries or devices called by the main game programmer. Is this any more difficult to use than any other type of programming? The fact that Cell has a single main core, with the parallelization undertaken through standardised libraries and device files, may in fact make it easier to program than the XBox360. Also, the modularisation that the SPEs enforce can actually make programming easier and cleaner by encouraging standardisation and code reuse.
 
I really hope someone engages with you on the points you raise, as you have some very interesting things to talk about. Wish it were me, but I doubt I'm capable just yet. Anyway, nice post.
 
SPM said:
I have read quite a few articles on the Internet about how badly Cell will perform compared with "normal" CPUs like Intel or AMD processors at general tasks like running an OS and stuff like AI, because of the lack of out-of-order processing in the PPE (the same applies to Xenon). The SPEs are criticised for lack of enough cache, lack of out-of-order execution and lack of branch prediction, supposedly making them unsuitable for AI as a consequence.
The lack of branch prediction greatly limits the rate at which the SPE can perform branches. This is true even if you pepper your code with branch hints. Also, if you are unable to make the branch decision until the end of your basic block, then you need to perform manual branch prediction (!), which is likely to be much slower than any hardware-based scheme.
The PPE and Xenon cores are conventional processors with less cache than typical, but since most processor time is spent in a small number of small, tight loops, the resulting speed reduction shouldn't be that huge and should be more than made up for by the code accelerated by the multi-core architecture of both Cell and Xenon.

Both PPE and Xenon cores are in-order. This can be made up for by using intelligent compiler technology to pre-optimise the order of execution. This won't work for binary OSes like Windows, which need to run code written for legacy processors (ix386). Windows PCs need to be able to run code written for several processors, making it impossible to use intelligent compilers to optimise code to make up for the lack of out-of-order execution. Also, the vector processing features in the multimedia extensions differ between CPUs, and so are difficult to optimise for multiple processors on Windows. However, in code compiled for a specific CPU (like the PS3 or XBox360), an intelligent optimising compiler can replace the need for out-of-order execution. This is also true for an OS like Apple OSX, which is machine specific. It is also true to some extent for OSes like Linux, which exist as source code rather than binaries and so can easily be recompiled in a machine-specific way - especially with a source distribution like Gentoo Linux (the only problem is proprietary applications for which source code is not distributed and which are not compiled for the specific CPU). It is certainly possible for a Sony Linux or a Microsoft Windows CE included with the PS3 or XBox360.
Yet another appeal to "intelligent compilers". Gaaaah. Look at Itanium, a well-known in-order processor architecture, designed with extremely high performance goals in mind. Its general architecture, optimization rules, etc. have been basically static for 12 or so years, and still Intel (which is widely considered to have the world's best compiler team, and billions of dollars to pour into it) isn't anywhere close to producing optimal code for the damn thing. In fact, on integer code it is frequently trashed by the out-of-order Opteron, despite the fact that Opteron is hamstrung by a horrible instruction set, low (theoretical) IPC, a small L2 cache, tiny register files, lack of predication, etc.

As for examples of where out-of-order beats the crap out of in-order:
  • Small loops: if you have a tiny loop consisting of, say, the sequence { LOAD; ADD; STORE; LOOP }, on an in-order processor you will usually get an unpleasant stall between the LOAD and the ADD, because a LOAD usually has a latency of more than 1 clock cycle. You are slapped with this penalty once per loop iteration, too. An out-of-order processor will instead fetch the loop over and over again and re-order instructions across loop iterations. That way, it can fill the gap between the LOAD and the ADD with e.g. the LOAD from the next loop iteration or the ADD/STORE from the previous loop iteration.
  • Cache misses: in an in-order processor, if you get a cache miss, you have no choice but to stall all execution until the cache miss is resolved. In an out-of-order processor, you can continue to fetch and execute instructions that don't directly depend on the cache-missing instruction. This may uncover an additional cache miss; in that case, you can start fetching the second cache line BEFORE you are done with the first one, allowing you to overlap the latencies of the two misses and increase performance dramatically. The Pentium 4 allows you to pipeline as many as 12 cache misses this way.
Neither of these effects is really available in an in-order processor. The first one can, to a limited extent, be approximated with loop unrolling and the second one with cache prefetch instructions, but these are still far from complete substitutes.
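For illustration, here is a rough sketch of those two workarounds applied to a trivial load-add-store loop (assuming a GCC/Clang-style __builtin_prefetch; how much any of this actually buys you depends on the core's load and miss latencies):

#include <cstddef>

// Unrolling + manual prefetch as a static approximation of what an
// out-of-order core does dynamically.
void add_constant(int* data, std::size_t n, int k)
{
    std::size_t i = 0;

    for (; i + 4 <= n; i += 4) {
        // Prefetch a line well ahead of use so (some of) the miss latency
        // overlaps with useful work.  Prefetch of a bad address is a no-op.
        __builtin_prefetch(&data[i + 64]);

        // Four independent loads issued back to back, so later loads can be
        // in flight while earlier results are still arriving.
        int a = data[i + 0];
        int b = data[i + 1];
        int c = data[i + 2];
        int d = data[i + 3];
        data[i + 0] = a + k;
        data[i + 1] = b + k;
        data[i + 2] = c + k;
        data[i + 3] = d + k;
    }

    // Remainder loop.
    for (; i < n; ++i)
        data[i] += k;
}

Note that this is all scheduled once, at compile time; it cannot adapt to whether a given load actually hit or missed, which is the part only the out-of-order hardware can do.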
As far as the SPEs are concerned, I can't understand how code written for the SPE units can run slower than on a conventional processor, unless you take typical procedural code written for conventional processors and try to run it on an SPE (i.e. load code into SPE local memory, use it only once, and then load the next segment of code). Nobody in their right mind would do that, would they? In most cases code will be written to run within the local memory of the SPEs and make use of DMA (including scatter lists to load data randomly located in large expanses of main memory), using branch hints, clever coding and a stream-processing methodology to keep loops and frequently used data within the local store. In most cases the code will run an order of magnitude faster than on conventional processors with a machine-managed cache, simply because this enforced minimisation of instruction and data transfer to/from slow main memory can be done far more effectively if you know and manually control what the algorithm does rather than leaving it up to a dumb piece of hardware.
There is nothing here you can really achieve with the SPE's DMA control that you cannot achieve with careful cache prefetching on a traditional CPU. You may argue that the SPE programming model makes it impossible to write stupid code in this regard, but I disagree. Most likely, people will quickly develop an abstraction layer that allows you to use the SPE local memory as a virtual cache, manually fetching data into it upon a cache miss. Once that abstraction layer is in place, the SPE is gonna suck just as much as any other CPU.
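For what it's worth, a toy version of the abstraction layer I mean could look like this - a direct-mapped software cache over a chunk of local store, where every miss pays a full blocking DMA. dma_get_blocking() is a hypothetical stand-in for the platform primitive (on Cell, roughly an mfc_get plus a tag-status wait), stubbed with memcpy here so the sketch is self-contained:

#include <cstdint>
#include <cstring>

// Hypothetical stand-in for the SPE DMA primitive; stubbed with memcpy so
// the sketch compiles outside of an actual SPE toolchain.
inline void dma_get_blocking(void* local, std::uintptr_t effective_addr, std::size_t size) {
    std::memcpy(local, reinterpret_cast<const void*>(effective_addr), size);
}

// Toy direct-mapped "software cache" over local store.  Every lookup that
// misses blocks on a full DMA transfer - which is exactly why using the
// local store this way throws most of its advantages away.
const std::size_t LINE_SIZE = 128;
const std::size_t NUM_LINES = 64;

struct SoftCache {
    std::uintptr_t tags[NUM_LINES];              // main-memory address of each resident line
    std::uint8_t   lines[NUM_LINES][LINE_SIZE];  // backing storage in local store

    SoftCache() {
        for (std::size_t i = 0; i < NUM_LINES; ++i)
            tags[i] = ~std::uintptr_t(0);        // never matches an aligned line address
    }

    std::uint8_t* lookup(std::uintptr_t ea) {
        std::uintptr_t line_ea = ea & ~std::uintptr_t(LINE_SIZE - 1);
        std::size_t    idx     = (line_ea / LINE_SIZE) % NUM_LINES;
        if (tags[idx] != line_ea) {              // "cache miss": stall on DMA
            dma_get_blocking(lines[idx], line_ea, LINE_SIZE);
            tags[idx] = line_ea;
        }
        return lines[idx] + (ea - line_ea);
    }
};

Once you hide every access behind lookup(), you have reinvented a hardware cache, minus the hardware's ability to overlap the misses.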
As for AI on the SPE, a procedural AI model where the logic state is represented by the position of the program in the code is not appropriate for the SPE, because this involves long branching. Instead, the AI state running on the SPE should be represented as boolean flags or flag array data in main memory, loaded into SPE local memory for processing. For most things, logic operations performed on boolean array tables in SPE local memory can take the place of branching and code execution position to represent AI state. In most cases it should be possible to carry out a more complex sequence of boolean operations on an array of boolean values without branching, giving the same effect as repeated operations that set one bit at a time using conventional decision branching. The code that does this will run in SPE local memory, and may stream-process a large number of AI objects, e.g. updating the status flags for a large number of characters in a game. It is possible to parse or traverse large data structures in main memory efficiently using scatter-list DMA to bring data into SPE local memory. If blocks of data can be processed together, then the SPE is more efficient than a conventional processor. If not, then it is no worse than a conventional processor. The main thing is to keep the code in local SPE memory (i.e. to stream-process the AI data); otherwise it will be less efficient than a conventional processor. The overall AI/logic status of the main program (rather than object AI) may be better suited to conventional procedural, branch-represented logic status, but this is no problem, because it can be done on the PPE, which runs the main program execution thread.
This suggests to me a very simplistic view of the data structures involved in AI calculations.
It has also been suggested that programming Cell will be more difficult than a conventional processor. But will it? The SPE code will mostly be separate code, seen as libraries or devices called by the main game programmer. Is this any more difficult to use than any other type of programming? The fact that Cell has a single main core, with the parallelization undertaken through standardised libraries and device files, may in fact make it easier to program than the XBox360.
And why would it be difficult to make a similar standardized library on the Xenon/Xbox360? It's not like the job model that Cell seems to enforce cannot be set up on the Xenon CPU as well. There may be a psychological issue where the PPE<->SPE split suggests a more obvious split between different classes of tasks than the split between the different cores of the Xenon, but that shouldn't impede any serious programmer.
Also, the modularisation that the SPEs enforce can actually make programming easier and cleaner by encouraging standardisation and code reuse.
This would mainly be a matter of programmer sloppiness.
 
SPM said:
The PPE and Xenon cores are conventional processors with less cache than typical, but since most processor time is spent in a small number of small, tight loops, the resulting speed reduction shouldn't be that huge and should be more than made up for by the code accelerated by the multi-core architecture of both Cell and Xenon.

I think what you mean to say is: small, tight loops are not a conventional processor's strong point. The resulting speed increase in this area should counter the otherwise poor performance on more typical code.

SPM said:
Both PPE and Xenon cores are in-order. This can be made up for by using intelligent compiler technology to pre-optimise the order of execution. This won't work for binary OSes like Windows, which need to run code written for legacy processors (ix386).

People have been searching for the compiler that magically makes programs run fast on in-order CPUs for decades. It still does not exist.


SPM said:
It is also true to some extent for OSes like Linux, which exist as source code rather than binaries and so can easily be recompiled in a machine-specific way - especially with a source distribution like Gentoo Linux (the only problem is proprietary applications for which source code is not distributed and which are not compiled for the specific CPU).

All OSes start out as source code and then get compiled for their targets. Windows is no different from Linux in this respect. And again, it's not reasonable to expect the compiler to make it magically faster than the target platform.


I think we have to give credit where credit is due. Today's desktop CPUs are very good at what they do.

But architectures like Cell are going to be very good in areas where traditional CPUs do not excel.
 
arjan de lumens said:
You may argue that the SPE programming model makes it impossible to write stupid code in this regard, but with that I will disagree
Yes, you can write very bad code everywhere; SPEs are not exceptions to the rule, and there's no out-of-order execution to save your ass there, as you already know.
Most likely, people will quickly develop an abstraction layer that allows you to use the SPE local memory as a virtual cache, manually fetching data into it upon a cache miss.
You can do that... but it's not likely to give you good performance in the general case.
Building an abstraction is good and saves you time, but if you're not careful using it you're screwed.
I'd prefer to write abstractions that force the inexperienced coder not to make very big and unpleasant mistakes :)
 
nAo said:
Yes, you can write very bad code everywhere; SPEs are not exceptions to the rule, and there's no out-of-order execution to save your ass there, as you already know.
You can do that... but it's not likely to give you good performance in the general case.
Building an abstraction is good and saves you time, but if you're not careful using it you're screwed.
I'd prefer to write abstractions that force the inexperienced coder not to make very big and unpleasant mistakes :)

I heard of that abstraction. I think you internally named it "having both arms tied behind your back while a senior member of the team points a gun at you, so that your sight stays focused on the monitor of a PC on which a senior programmer is coding for the SPE"... long name, but it gets the point across to newbies like us.
 
Panajev2001a said:
I heard of that abstraction. I think you internally named it "having both arms tied behind your back while a senior member of the team points a gun at you, so that your sight stays focused on the monitor of a PC on which a senior programmer is coding for the SPE"... long name, but it gets the point across to newbies like us.

A little off topic, but I don't believe Marco is saying that at all.

The example I always give of a bad abstraction (for games programming) is the STL string class. Programmers without experience tend not to even think about the implementation (isn't that the point of the abstraction?), and because of the operator overloading they expect the compiler to do things it can't.

I've seen code like the following written

#include <string>

std::string path = "c:\\aPath\\";
std::string filename = "blah.turd";

std::string name;
// other stuff
name = path + filename;

There are no fewer than 4 memory allocations in the above, when all that was needed was a static string. If the API made the allocations explicit and made it clear where function calls were actually taking place, less senior programmers would start to think about what they are doing.

Now that means your abstraction is probably less useful in the real world, but your average games programmer must be thinking about memory allocations and function call overheads.
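As a sketch of what I mean by making the allocations explicit (names invented, not from any real codebase): the caller supplies the storage, so every byte and every copy is visible at the call site, and nothing can silently hit the heap.

#include <cstddef>
#include <cstring>

// Deliberately "unfriendly" string type: the caller owns the buffer.
struct FixedString {
    char*       buf;
    std::size_t cap;
    std::size_t len;
};

inline void fs_init(FixedString& s, char* storage, std::size_t capacity) {
    s.buf = storage;
    s.cap = capacity;
    s.len = 0;
    if (capacity) s.buf[0] = '\0';
}

inline bool fs_append(FixedString& s, const char* text) {
    std::size_t n = std::strlen(text);
    if (s.len + n + 1 > s.cap) return false;   // no hidden growth; the caller must handle it
    std::memcpy(s.buf + s.len, text, n + 1);
    s.len += n;
    return true;
}

// Usage: zero heap allocations, every byte of storage visible.
//   char storage[64];
//   FixedString name;
//   fs_init(name, storage, sizeof(storage));
//   fs_append(name, "c:\\aPath\\");
//   fs_append(name, "blah.turd");

The return value forces the caller to at least acknowledge the overflow case, which is the other thing the "friendly" class hides.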
 
ERP said:
A little off topic, but I don't believe Marco is saying that at all.

Dude, ... :( I cannot have fun without either you, nAo, Faf or archie pulling me by the ear (it is starting to hurt that ear, please use the other one... I do have two of them :D) :p.

I am just being a bit silly and joking around, I think he understands that very well since he knows me (whether that is a good thing or not you have got to ask him ;)).

Still, thanks for the interesting "aside" :).

As a funny aside, after enough looking up C++-style strings on Google, I found exactly the example you gave here, but presented as an example of good coding practices (using strings the safe way) :p.

I think we need both sides of the coin: on one side it is good, for example, that with Vista MS is going with more managed code (for WPF and other parts of the OS), as it allows better security and stability at the expense of reduced performance (even though there are parts of WPF written as unmanaged code or "unsafe" managed code for performance reasons). It is a way to close more holes from attackers. But I do understand that for real-time applications such as games what you need is to know what happens at what time and in what order, since you focus on making the application as platform-oriented and as fast as possible (although you might have abstraction layers, overall you have to try to do things the way the platform likes best, not the way that might be easier...).

If I counted correctly, name = path + filename; accounts for one of those memory allocations, since the + operator returns a temporary string object which is then used by the = operator... it is easy to assume that the temporary allocation for the path+filename string object is skipped and the result goes straight into name.
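If anyone wants to check the count for themselves, a quick (and slightly crude) way is to replace the global operator new and let the program report what it did - keeping in mind that a modern string implementation with small-string optimisation may skip some of the allocations we are talking about here, whereas the older implementations in question would show all four.

#include <cstdio>
#include <cstdlib>
#include <new>
#include <string>

// Count heap allocations by replacing the global operator new/delete.
static int g_allocs = 0;

void* operator new(std::size_t n) {
    ++g_allocs;
    if (void* p = std::malloc(n)) return p;
    throw std::bad_alloc();
}
void operator delete(void* p) noexcept { std::free(p); }
void operator delete(void* p, std::size_t) noexcept { std::free(p); }

int main() {
    std::string path = "c:\\aPath\\";
    std::string filename = "blah.turd";
    std::string name;
    name = path + filename;
    std::printf("heap allocations so far: %d\n", g_allocs);
    return 0;
}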
 
ERP said:
There are no less than 4 memory allocations in the above, when all that was needed was a static string
Not to mention allocations on the heap for temporaries (even when they are dynamically sized) are just plain wrong - but we still have schools teaching that dynamic allocations are "supposed" to go on the heap (even when you control them instead of the API).

But yeah, like you and nAo said, we basically have to look for abstractions that force you to avoid the worst pitfalls - and still make things easier to use nonetheless.

Panajev said:
If I counted correctly, name = path + filename; accounts for one of those memory allocations, since the + operator returns a temporary string object which is then used by the = operator... it is easy to assume that the temporary allocation for the path+filename string object is skipped and the result goes straight into name.
Well, that would require a compiler with more domain knowledge than it can have in C++. Actually, a game-optimized string class could be written in such a way as to allow the compiler to optimize cases like that, but it would be a pretty involved metacode backend, and one questions the reasoning of spending so much effort on optimizing a freaking string class.
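A minimal illustration of what I mean (entirely hypothetical, and stripped down to two operands): operator+ returns a lightweight proxy instead of building a temporary string, and the assignment operator sizes the destination once and copies both halves straight into it. Handling longer chains (a + b + c + ...), aliasing, and so on is where the involved template backend would come in.

#include <cstring>
#include <cstddef>

class GameString;

// Lightweight proxy returned by operator+; it just remembers the operands.
struct ConcatProxy {
    const GameString& lhs;
    const GameString& rhs;
};

class GameString {
public:
    GameString() : data_(0), len_(0) {}
    explicit GameString(const char* s) : data_(0), len_(0) { assign(s, std::strlen(s)); }
    ~GameString() { delete[] data_; }

    // One exact-size allocation, two copies, no temporary string object.
    // (No aliasing handling - sketch only.)
    GameString& operator=(const ConcatProxy& p);

    const char* c_str() const { return data_ ? data_ : ""; }
    std::size_t length() const { return len_; }

private:
    GameString(const GameString&);             // copying not needed for the sketch
    GameString& operator=(const GameString&);

    void assign(const char* s, std::size_t n) {
        data_ = new char[n + 1];
        std::memcpy(data_, s, n);
        data_[n] = '\0';
        len_ = n;
    }

    char*       data_;
    std::size_t len_;
};

inline ConcatProxy operator+(const GameString& a, const GameString& b) {
    ConcatProxy p = { a, b };
    return p;
}

inline GameString& GameString::operator=(const ConcatProxy& p) {
    delete[] data_;
    len_  = p.lhs.length() + p.rhs.length();
    data_ = new char[len_ + 1];
    std::memcpy(data_, p.lhs.c_str(), p.lhs.length());
    std::memcpy(data_ + p.lhs.length(), p.rhs.c_str(), p.rhs.length());
    data_[len_] = '\0';
    return *this;
}

// Usage:
//   GameString path("c:\\aPath\\"), filename("blah.turd"), name;
//   name = path + filename;   // exactly one allocation for the result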
 
by arjan de lumens
Yet another appeal to "intelligent compilers". Gaaaah. Look at Itanium, a well-known in-order processor architecture, designed with extremely high performance goals in mind. Its general architecture, optimization rules, etc. have been basically static for 12 or so years, and still Intel (which is widely considered to have the world's best compiler team, and billions of dollars to pour into it) isn't anywhere close to producing optimal code for the damn thing. In fact, on integer code it is frequently trashed by the out-of-order Opteron, despite the fact that Opteron is hamstrung by a horrible instruction set, low (theoretical) IPC, a small L2 cache, tiny register files, lack of predication, etc.

The real problem with Itanium is not that the compiler doesn't produce fast-executing code, nor that in-order is a lot slower, but that it doesn't run ix86 code well (whereas code compiled for the Itanium runs well), while the Opteron does run 32-bit ix386 code well. Unfortunately, when most people talk about OS performance, they are talking about Windows performance, and Itanium is crap at running Windows, which a) is largely ix386 binaries, and b) runs mostly proprietary shrink-wrapped software which you can't recompile yourself (and again is mostly ix386 binaries). Actually, Itanium isn't doing that badly - other than at running Windows. It is reasonably popular on very high-end servers running Linux, although nobody runs Windows on it any more. The thing that saved the Itanium from oblivion is Linux, which comes with the source code and most of the necessary applications as source code, which the end user or a Linux distributor can recompile for the Itanium. HP has adopted Linux and dropped Windows on its Itanium workstation products, but has invested $3 billion in Itanium server products. Now why would they do that rather than investing in Opteron servers if Itanium didn't perform better at something?
http://www.eweek.com/article2/0,1895,1743088,00.asp
http://arstechnica.com/news.ars/post/20040927-4235.html
Another reason the Itanium isn't selling as well as it was supposed to is that it has competition from IBM Power chips and Sun Sparc chips. Itanium was supposed to be cheaper due to economy of scale, but because it didn't take off on Windows, it hasn't got any cost advantage, and hence Sun and IBM dropped plans to use it and stuck with their own Power and Sparc chips. HP, on the other hand, stuck with Itanium as a replacement for the Alpha chip (which also ran Windows, but was dropped and switched to Linux due to lack of demand for Windows on Alpha).

Out-of-order architectures and large caches do improve performance, but you have to compare like for like. The question is whether a single-core CPU with out-of-order execution and a large cache will run faster than a chip of the same size, like the 3-core Xenon or the 8-core Cell, with smaller caches and in-order execution. On a large file server where the CPU is not required to do a great deal of number crunching, or on a machine running Windows, I would agree that the conventional out-of-order, large-cache architecture will win. However, in other cases Xenon or Cell might do better.

Cell/Xenon will never run Windows well, but as I said, as an OS for a media-center computer based on a PS3 running Linux or an XBox360 running Windows CE, I think it will work very well. Even if you are just running office applications, a Xenon/Cell based computer might be more responsive. It may not run an office suite any faster, but do you really need to? When Office is slow to respond, it is rarely because Office itself is running slowly, but rather because some other application, OS code, or multimedia or screen rendering code (e.g. in a web browser) that is multi-tasking is hogging the CPU or hard drive access. If you can speed these up with Cell/Xenon, you speed up the system and make it more responsive, even if the office suite itself doesn't run faster.
 
by arjan de lumens
There is nothing here you can really achieve with the SPE's DMA control that you cannot achieve with careful cache prefetching on a traditional CPU. You may argue that the SPE programming model makes it impossible to write stupid code in this regard, but I disagree. Most likely, people will quickly develop an abstraction layer that allows you to use the SPE local memory as a virtual cache, manually fetching data into it upon a cache miss. Once that abstraction layer is in place, the SPE is gonna suck just as much as any other CPU.

Well, that isn't the way the SPE is supposed to be used. If you are going to do that, you might as well stick to a conventional architecture and procedural programming.

Conventional processors (with big caches) supposedly spend on average 80% of their time waiting on the cache. You can get much closer to full utilization with the Cell SPEs, and you have 7 of them for the price of one conventional processor.

The ideal, well-coded SPE program will be written to run entirely from local memory, process data in local memory in batches, and DMA in the next batch - either from main memory or from the local memory of another SPE - while the current batch is being processed. Most people suppose that there will be difficulty fitting all the data into local memory. In most cases this is not true, since scatter-list DMA will allow you to load selected data spanning large tables, trees and other data structures. You just have to program so that you minimise main memory access, and you have more control over this on the SPE than with a conventional processor, since the SPE is designed to run like this.
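To make the overlap idea concrete, here is a rough outline of the classic double-buffering pattern. dma_get/dma_wait are hypothetical stand-ins for the SPE's asynchronous DMA primitives (on Cell, roughly mfc_get() plus a wait on the DMA tag), so the signatures below are assumptions, not a real API:

#include <cstdint>
#include <cstddef>

// Hypothetical asynchronous DMA primitives (declarations only).
void dma_get(void* local_store, std::uint64_t main_mem_addr, std::uint32_t bytes, int tag);
void dma_wait(int tag);

void process_batch(float* batch, std::size_t count);   // pure local-store computation

// Double-buffered streaming: while batch N is being processed out of one
// local-store buffer, batch N+1 is already in flight into the other one,
// so (ideally) the SPE never sits idle waiting on main memory.
const std::size_t BATCH = 1024;

void stream_process(std::uint64_t src, std::size_t total_batches)
{
    static float buffer[2][BATCH];       // both buffers live in local store

    if (total_batches == 0) return;
    dma_get(buffer[0], src, BATCH * sizeof(float), /*tag=*/0);

    for (std::size_t i = 0; i < total_batches; ++i) {
        int cur = static_cast<int>(i & 1);
        int nxt = cur ^ 1;

        // Kick off the next transfer before touching the current batch.
        if (i + 1 < total_batches)
            dma_get(buffer[nxt], src + (i + 1) * BATCH * sizeof(float),
                    BATCH * sizeof(float), /*tag=*/nxt);

        dma_wait(cur);                   // make sure the current batch has landed
        process_batch(buffer[cur], BATCH);
    }
}

As long as process_batch() takes longer than one DMA transfer, the memory latency is completely hidden; if it doesn't, the loop degrades gracefully to waiting on the transfer, which is no worse than a cache miss.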

Sure, you can write very bad code for the SPE, but so what? The important thing is that good code will run very fast.
 
SPM said:
Conventional processors (with big caches) supposedly spend on average 80% of their time waiting on the cache. You can get much closer to full utilization with the Cell SPEs, and you have 7 of them for the price of one conventional processor.

A) On average, conventional processors can spend 80% of their time waiting on cache loads; however, for those same workloads, CELL would spend approximately 99% of its time waiting on DMA loads. For the types of workloads that Cell will be good at, conventional CPUs with good coding will spend a small fraction of their time blocked.

B) An SPE takes a reasonable amount of die area; from a rough eyeballing, the PPE takes the area of roughly 2 SPEs and is quite a bit more functional.

Sure, you can write very bad code for the SPE, but so what? The important thing is that good code will run very fast.

Unfortunately, good code is hard to find on any architecture.

Aaron Spink
speaking for myself inc.
 
by inefficient
All OSes start out as source code and then get compiled for their targets. Windows is no different from Linux in this respect. And again,

Windows is a binary API. In other words, it offers binary compatibility, and so the binaries remain fixed. Even when a bug is found, the binary code isn't usually changed unless it is a really serious one, because some programs may use undocumented entry points or may rely on the bug to work properly; the bug just becomes a feature.

Unix/Linux don't offer binary compatibility and exist as C code. They are constantly being freshly recompiled, and so the binaries are always changing with the compiler. It is also the reason why Unix and Linux run on so many platforms while Windows runs only on the ix86 architecture.

it's not reasonable to expect the compiler to make it magically faster than the target platform.

Compiler optimisations are very important for in-order processors, because in-order processors rely on the compiler rather than the hardware to schedule operations in a way that will execute efficiently internally. Since the internals of different processors are different, an optimisation will only work on one target processor. This is why in-order execution processors and in-order compiler optimisations don't work on Windows or on the generic ix86 architecture. There is a whole range of ix86 compatibles, all with different internals - you can only optimise for one specific architecture. Also, there is a whole lot of old ix386 code in Windows which can't be recompiled for the new target processor.

Compiler optimisation for in-order processors does work. That is why Sony is employing Transmeta to write an optimising compiler for the Cell chip.
 
aaronspink said:
A) On average, conventional processors can spend 80% of their time waiting on cache loads; however, for those same workloads, CELL would spend approximately 99% of its time waiting on DMA loads.

Why?

B) An SPE takes a reasonable amount of die area; from a rough eyeballing, the PPE takes the area of roughly 2 SPEs and is quite a bit more functional.

Actually, the PPE plus its cache is about the same size as four SPEs. Also, don't forget the PPE (and the 3 cores in Xenon) are themselves quite small. They save quite a bit of die area by leaving out the out-of-order logic and using a smaller cache.

Unfortunately, good code is hard to find on any architecture.

Maybe, but I suspect SPE code will be treated as an API rather than being written on an ad hoc basis by application programmers. The SPE code in these APIs will be carefully coded and optimised, and the average application programmer will simply call it by interacting with a virtual device file or by calling a library. Your sloppy application programmer will simply interface with a graphics or physics engine that uses the SPEs rather than program them directly.
 
SPM said:
Windows is a binary API. In other words, it offers binary compatibility, and so the binaries remain fixed. Even when a bug is found, the binary code isn't usually changed unless it is a really serious one, because some programs may use undocumented entry points or may rely on the bug to work properly; the bug just becomes a feature.
Incorrect. The Windows API is just that, an API. Bug compatibility is an orthogonal issue.

Unix/Linux don't offer binary compatibility and exist as C code. They are constantly being freshly recompiled, and so the binaries are always changing with the compiler. It is also the reason why Unix and Linux run on so many platforms while Windows runs only on the ix86 architecture.

This is also incorrect. The various versions of Unix and Linux both offer binary compatibility and exist in both C and binary. In fact, the majority of Linux is distributed as binaries, and almost all Unix versions are distributed in binary form. Windows actually runs on ARM, x86, x86-64, and IA64. They used to also build it on a daily basis on Alpha, though I'm not sure they still do. In addition, Windows has been released on MIPS and PowerPC.

Compiler optimisations are very important for in-order processors, because in-order processors rely on the compiler rather than the hardware to schedule operations in a way that will execute efficiently internally.

Both in-order and out-of-order processors rely on the compiler to properly schedule instructions.

Since the internals of different processors are different, an optimisation will only work on one target processor.
This is incorrect; there are a wide variety of optimizations that work across a wide range of processors.

This is why in-order execution processors and in-order compiler optimisations don't work on Windows or on the generic ix86 architecture.
Incorrect.


There is a whole range of ix86 compatibles, all with different internals - you can only optimise for one specific architecture. Also, there is a whole lot of old ix386 code in Windows which can't be recompiled for the new target processor.

Each version of Windows is compiled fresh for release, and you can optimise for more than one specific micro-architecture; it happens all the time.

Aaron Spink
speaking for myself inc.
 
SPM said:
The real problem with Itanium is not that the compiler doesn't produce fast-executing code, nor that in-order is a lot slower, but that it doesn't run ix86 code well (whereas code compiled for the Itanium runs well), while the Opteron does run 32-bit ix386 code well.
I was talking about the performance Itanium achieves in its NATIVE mode, with code compiled specifically for it, NOT in its x86 emulation mode! Its x86 emulation mode is IIRC about 10-15x slower and in practice never used except at boot time.
HP has adopted Linux and dropped Windows on its Itanium workstation products, but has invested $3 billion in Itanium server products. Now why would they do that rather than investing in Opteron servers if Itanium didn't perform better at something?
http://www.eweek.com/article2/0,1895,1743088,00.asp
http://arstechnica.com/news.ars/post/20040927-4235.html
Opteron systems are generally limited to about 8 processors; the place for Itanium appears if you need more than that. As far as HP is concerned, they have made enormous investments in the Itanium (most of the Itanium 2 design was indeed done by HP engineers) and killed a fairly successful CPU family (PA-RISC) in anticipation of the Itanium. It is probably at least as much a face-saving move as an attempt to choose the "best" architecture.
Out-of-order architectures and large caches do improve performance, but you have to compare like for like. The question is whether a single-core CPU with out-of-order execution and a large cache will run faster than a chip of the same size, like the 3-core Xenon or the 8-core Cell, with smaller caches and in-order execution. On a large file server where the CPU is not required to do a great deal of number crunching, or on a machine running Windows, I would agree that the conventional out-of-order, large-cache architecture will win. However, in other cases Xenon or Cell might do better.
Comparing Cell/Xenon to a single-core Opteron is hardly fair; on similar processes, Xenon is nearly as large as a dual-core Opteron (168 vs 199 mm2, from what I can find) and Cell is considerably larger (221 mm2).
Cell/Xenon will never run Windows well, but as I said, as an OS for a media-center computer based on a PS3 running Linux or an XBox360 running Windows CE, I think it will work very well. Even if you are just running office applications, a Xenon/Cell based computer might be more responsive. It may not run an office suite any faster, but do you really need to? When Office is slow to respond, it is rarely because Office itself is running slowly, but rather because some other application, OS code, or multimedia or screen rendering code (e.g. in a web browser) that is multi-tasking is hogging the CPU or hard drive access. If you can speed these up with Cell/Xenon, you speed up the system and make it more responsive, even if the office suite itself doesn't run faster.
This is basically the same thing people are already experiencing with dual-core Opteron/Athlon64/Pentium D.
 
SPM said:
Why?

Because programs with less than a 20% cache hit rate aren't very localized, aren't predictable, involve lots of indirection, etc. As an example, how fast do you think an SPE is at pointer chasing?

Actually, the PPE plus its cache is about the same size as four SPEs. Also, don't forget the PPE (and the 3 cores in Xenon) are themselves quite small. They save quite a bit of die area by leaving out the out-of-order logic and using a smaller cache.
Actually, the PPE is as stated. The cache is an orthogonal thing that is utilized by both the SPEs and the PPE and contains a lot of general coherency logic for the CELL chip as a whole. For instance, if I wanted to put another 2-3 PPEs on the CELL, the cache size/area wouldn't change in a significant manner.

Maybe, but I suspect SPE code will be treated as an API rather than being written on an ad hoc basis by application programmers. The SPE code in these APIs will be carefully coded and optimised, and the average application programmer will simply call it by interacting with a virtual device file or by calling a library. Your sloppy application programmer will simply interface with a graphics or physics engine that uses the SPEs rather than program them directly.

You're assuming that libraries are better than general programmer code; they are actually usually the same or worse.

Aaron Spink
speaking for myself inc.
 
Fafalada said:
Not to mention allocations on the heap for temporaries (even when they are dynamically sized) are just plain wrong - but we still have schools teaching that dynamic allocations are "supposed" to go on the heap (even when you control them instead of the API).

But yeah, like you and nAo said, we basically have to look for abstractions that force you to avoid the worst pitfalls - and still make things easier to use nonetheless.


Well, that would require a compiler with more domain knowledge than it can have in C++. Actually, a game-optimized string class could be written in such a way as to allow the compiler to optimize cases like that, but it would be a pretty involved metacode backend, and one questions the reasoning of spending so much effort on optimizing a freaking string class.

Because I always liked a well-done string class that even senior programmer Faf would let me use? :D
 
A bit off topic, but not too much. A newbie question as well:

Can a multi-threaded processor hide latency as well as out-of-order execution can?
Is multi-threading cheaper to implement than out-of-order execution?

Just to be clear, by multi-threaded I mean the ability to switch (in hardware) to another thread (instruction stream) if the current one is stalled.

Subsidiary questions:
- How many cycles does the processor take to switch?
- What triggers the switch: the hardware itself, if it detects that it is stalled?
 
SPM said:
The real problem with Itanium is not that the compiler doesn't produce fast-executing code, nor that in-order is a lot slower, but that it doesn't run ix86 code well (whereas code compiled for the Itanium runs well), while the Opteron does run 32-bit ix386 code well. Unfortunately, when most people talk about OS performance, they are talking about Windows performance, and Itanium is crap at running Windows, which a) is largely ix386 binaries, and b) runs mostly proprietary shrink-wrapped software which you can't recompile yourself (and again is mostly ix386 binaries). Actually, Itanium isn't doing that badly - other than at running Windows.

Just curious: how much code have you actually written and/or debugged on Itanium?
 