bbot said:
I've read that a cache slows down an in order cpu.
An in-order CPU is mostly limited in that it cannot issue a non-dependent operation to an available execution unit while the next instruction in program order has to wait.
Say a CPU has an integer ALU, a floating-point unit, a load/store unit and a vector unit. And say the integer and load/store ones take one clock cycle to complete, the floating-point one takes two, and the vector unit takes four. (It's a bit more complicated than that, but that doesn't matter.)
If you send two vector instructions to the CPU, the second one will have to wait 3 clock cycles, while the other three units are idle.
If you send a vector instruction, a floating-point one, an integer one, a second floating-point one, and then another vector instruction, the two vector instructions finish at the same times as in the previous case, but three other instructions get executed in between.
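To make the two cases above concrete, here is a rough sketch in Python. It assumes a toy model that is not any real CPU: one instruction issued per cycle, strictly in program order, units that are not pipelined, and no data dependencies between instructions. The names `LATENCY` and `run_in_order` are made up for this example.

```python
# Latencies in clock cycles, taken from the example in the post.
LATENCY = {"int": 1, "load": 1, "fp": 2, "vec": 4}

def run_in_order(program):
    """Toy single-issue, in-order model (an assumption, not a real CPU).

    program is a list of unit names. Returns a list of
    (unit, issue_cycle, finish_cycle) tuples in program order.
    """
    unit_free = {unit: 0 for unit in LATENCY}  # first cycle each unit is free
    next_issue = 0  # earliest cycle the next instruction may issue
    schedule = []
    for unit in program:
        # In-order: wait for the previous issue AND for the unit to be free.
        issue = max(next_issue, unit_free[unit])
        finish = issue + LATENCY[unit]
        unit_free[unit] = finish  # unit is busy until the result is done
        next_issue = issue + 1
        schedule.append((unit, issue, finish))
    return schedule

two_vectors = run_in_order(["vec", "vec"])
mixed = run_in_order(["vec", "fp", "int", "fp", "vec"])
```

Under these assumptions the second vector instruction issues at cycle 4 and finishes at cycle 8 in both programs; the mixed program just fits three other instructions into the cycles where the first one leaves the other units idle.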
A PPC CPU like the ones in the Xbox 360 and PS3 can execute 2 different instructions at the same time. As long as the compiler can pick different types of instructions and arrange them so that they can all be executed immediately, the CPU is used most efficiently.
To do all that as well as possible, it is best if the CPU has many registers that can be accessed directly, because a cache miss will slow things down pretty badly. And as long as it can't run other instructions while waiting for data (in-order), it stalls.
So, for in-order you want lots of registers and a good compiler. And as long as the CPU can run multiple threads at the same time, it will only stall if both threads are waiting for data. But that goes for out-of-order CPUs as well. And an out-of-order CPU will run better if it has lots of registers as well.
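A small sketch of why instruction ordering matters for a dual-issue in-order CPU. The pairing rule here is a deliberate simplification and an assumption of mine: two adjacent instructions can issue in the same cycle only if they use different unit types, and latencies and dependencies are ignored. `issue_cycles` is a made-up helper, not how any real PPC core decides.

```python
def issue_cycles(program):
    """Count issue cycles for a toy dual-issue, in-order front end.

    Assumed rule: each cycle issues the next instruction, plus the one
    after it if that one uses a different unit type.
    """
    cycles = 0
    i = 0
    while i < len(program):
        if i + 1 < len(program) and program[i + 1] != program[i]:
            i += 2  # the pair goes out together
        else:
            i += 1  # same unit type (or last instruction): issue alone
        cycles += 1
    return cycles

unscheduled = issue_cycles(["int", "int", "fp", "fp"])  # 3 cycles
scheduled = issue_cycles(["int", "fp", "int", "fp"])    # 2 cycles
```

Same four instructions, but the interleaved order pairs up every cycle. That reordering is the job the compiler has to do for an in-order CPU.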
So, a good compiler can do the same thing as an out-of-order CPU, and they both might stall when there is a cache miss, but the out-of-order one might be able to execute a few more instructions before it has to wait. And the in-order ones historically have more registers, so they don't have to access main memory as often.
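Here is a rough sketch of the "a few more instructions before it has to wait" point. It is a toy model under my own assumptions: one issue per cycle, a load that misses the cache taking 20 cycles (a made-up number), and out-of-order issue modelled as picking the oldest ready instruction from a small window. `simulate` and the `window` parameter are hypothetical names; `window=1` stands in for in-order issue.

```python
def simulate(program, window):
    """Toy issue model. program is a list of (latency, deps), where deps
    are indices of earlier instructions. Returns the cycle at which the
    last result is available."""
    n = len(program)
    done_at = [None] * n   # cycle at which each result becomes available
    issued = [False] * n
    cycle = 0
    remaining = n
    while remaining:
        # Only the oldest `window` un-issued instructions are candidates.
        pending = [i for i in range(n) if not issued[i]][:window]
        for i in pending:
            latency, deps = program[i]
            if all(done_at[d] is not None and done_at[d] <= cycle for d in deps):
                issued[i] = True
                done_at[i] = cycle + latency
                remaining -= 1
                break  # at most one issue per cycle
        cycle += 1
    return max(done_at, default=0)

MISS = 20  # assumed cache-miss latency in cycles, purely illustrative
program = [
    (MISS, []),  # 0: load that misses the cache
    (1, [0]),    # 1: uses the loaded value
    (1, []),     # 2: independent work
    (1, []),     # 3: independent work
    (1, []),     # 4: independent work
]
in_order = simulate(program, window=1)      # 24
out_of_order = simulate(program, window=4)  # 21
```

In this model the in-order run sits behind instruction 1 until the load returns, while the out-of-order run gets the three independent instructions out of the way first. Both still spend most of their time waiting on the miss.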
Which one is better? The in-order one uses fewer transistors, but is a bit slower. Toss a coin.