In-order execution and Xbox 360 CPU

It doesn't matter very much, as long as they supply a compiler that rearranges the instructions.
 
DiGuru said:
It doesn't matter very much, as long as they supply a compiler that rearranges the instructions.

In theory perhaps.
In practice in order has a significant performance impact.
 
bbot said:
I've read that a cache slows down an in-order CPU.

An in-order CPU is mostly limited in that it cannot issue a non-dependent operation to an available execution unit while the next instruction has to wait.

Like, a CPU has an integer ALU, a floating-point unit, a load/store unit and a vector unit. And say the integer and load/store ones take one clock cycle to complete, the floating-point one takes two, and the vector unit takes four. (It's a bit more complicated than that, but that doesn't matter.)

If you send two vector instructions to the CPU, the second one will have to wait 3 clock cycles, while the other three units are idle.

If you send a vector instruction, a floating point one, an integer, a second floating point, and then another vector instruction, both vector instructions will finish at the same time, while three other instructions are executed in the time in between.
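The scheduling effect above can be sketched with a toy cycle counter. The unit latencies are the hypothetical ones from this example, and `run` is just an illustrative model (one issue per cycle, stall while a unit is busy), not any real pipeline:

```python
# Toy in-order issue model: one instruction issues per cycle, but only
# if its execution unit is free (no reordering allowed).
LATENCY = {"int": 1, "ldst": 1, "fp": 2, "vec": 4}

def run(program):
    """Return the cycle at which the last instruction finishes."""
    unit_free = {u: 0 for u in LATENCY}   # cycle each unit becomes free
    cycle = 0
    for unit in program:
        # in-order: stall until this instruction's unit is available
        cycle = max(cycle, unit_free[unit])
        unit_free[unit] = cycle + LATENCY[unit]
        cycle += 1                        # at most one issue per cycle
    return max(unit_free.values())

print(run(["vec", "vec"]))                      # back-to-back vectors: done at cycle 8
print(run(["vec", "fp", "int", "fp", "vec"]))   # mixed: also done at cycle 8
```

Both sequences finish at the same cycle, but the mixed one retires three extra instructions in the same time, which is exactly the point of letting the compiler interleave instruction types.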

A PPC CPU like in the Xbox 360 and PS3 can execute 2 different instructions at the same time. As long as the compiler can pick and choose different types of instructions and arrange them in such a way that all the instructions can be executed immediately, the CPU is used most efficiently.

To be able to do all that as well as possible, it is best if the CPU has many registers that can be accessed directly, because a cache miss will slow things down pretty badly. And as long as it can't run other instructions while waiting for data (in-order), it stalls.

So, for in-order you want lots of registers and a good compiler. And as long as the CPU can run multiple threads at the same time, it will only stall if both are waiting for data. But that goes for out-of-order CPUs as well. And an out-of-order CPU will run better if it has lots of registers too.

So, a good compiler can do the same thing as an out-of-order CPU, and they both might stall when there is a cache miss, but the out-of-order one might be able to execute a few more instructions before it has to wait. And the in-order ones historically have more registers, so they don't have to access main memory as often.

Which one is better? The in-order one uses fewer transistors, but is a bit slower. Toss a coin.
 
ERP said:
DiGuru said:
It doesn't matter very much, as long as they supply a compiler that rearranges the instructions.

In theory perhaps.
In practice in order has a significant performance impact.

It depends. Mostly, the out-of-order example is an i686-class CPU, which translates all i386 instructions to micro-ops that execute on a RISC core with a totally different architecture. Like a CPU running some interpreted p-code. So, it has to do its own optimizations, because the compiler can only generate that p-code, not the actual instruction stream.

And, it uses a lot of transistors for that, which might be used for other things instead.
 
The difference between out-of-order and in-order execution may not be that huge on the Xbox 360 CPU, as there is also multi-threading in the mix. When one functional block is not used by one thread, the other thread may use it.

Actually, in-order execution may make things a lot simpler for a programmer in a multi-threaded scenario. A programmer may schedule threads according to their instruction mix (i.e. a VMX-heavy thread and an integer-heavy thread; both can run at almost full speed in this case).
 
DiGuru said:
Which one is better? The in-order one uses fewer transistors, but is a bit slower. Toss a coin.
Doesn't it also run faster when doing things that are impossible to do out of order, because of its overall larger on-die memory pool?
 
Squeak said:
DiGuru said:
Which one is better? The in-order one uses fewer transistors, but is a bit slower. Toss a coin.
Doesn't it also run faster when doing things that are impossible to do out of order, because of its overall larger on-die memory pool?

That depends on the amount of registers, yes. Like, the SPUs in a PS3 have very large register files, but no cache memory. Load them up (relatively slow), blaze through the program at full speed without any stalls, write the results back (slow again), repeat.

So, as long as you don't have to access main memory and have abundant registers, the in-order core will be much faster. But it is harder to keep it working all the time, so the overall throughput might be about the same.

It depends on what you want to do: for general purpose programs that access all of main memory, it is quite a bit slower. But for complex calculations, it's much faster.
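The "load, blaze through, write back, repeat" pattern can be put into a rough cost model. All cycle counts below are made-up assumptions just to show the shape of the trade-off; the double-buffered variant, where the next batch's transfer overlaps the current compute, is a common refinement on such designs but isn't claimed by the post itself:

```python
# Rough cost model of "load data in, compute stall-free, write back".
# All numbers are hypothetical, chosen only for illustration.
DMA_IN, COMPUTE, DMA_OUT = 50, 400, 50   # cycles per batch (assumed)

def total_cycles(batches, overlap_dma=False):
    """Cycles to process `batches` batches of work."""
    if not overlap_dma:
        # strictly sequential: load, compute, write back, repeat
        return batches * (DMA_IN + COMPUTE + DMA_OUT)
    # double-buffered: transfers for the next batch run during compute,
    # fully hidden here because DMA_IN + DMA_OUT <= COMPUTE
    return DMA_IN + batches * COMPUTE + DMA_OUT

print(total_cycles(4))                    # 2000 cycles, sequential
print(total_cycles(4, overlap_dma=True))  # 1700 cycles, transfers hidden
```

As long as the compute phase is long enough to cover the transfers, the slow loads and stores stop mattering, which is why the pattern works so well for complex calculations and so poorly for code that chases pointers all over main memory.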
 
DiGuru said:
It depends. Mostly, the out-of-order example is an i686-class CPU, which translates all i386 instructions to micro-ops that execute on a RISC core with a totally different architecture. Like a CPU running some interpreted p-code. So, it has to do its own optimizations, because the compiler can only generate that p-code, not the actual instruction stream.

And, it uses a lot of transistors for that, which might be used for other things instead.

I consider that a pretty bad explanation. First, the most commonly used i386 instructions usually map 1:1 to a micro-op. Second, compilers usually have a good idea about the CPU (class) they're generating code for (otherwise there wouldn't be a need for compiler target switches).
The big advantage of out-of-order processing is that you can better hide memory latencies and thus make more efficient use of execution units and thus increase IPC.
Example: your program wants to access some data in a memory location that is not in the cache. With an in-order CPU, that stalls the CPU until the data is fetched from main memory. An out-of-order CPU can still execute other instructions if they are independent of the instruction that waits for the data.
For in-order cpus to be efficient you either need very large caches (see Itanium) or low-latency memory (like the SPU memory in cell).
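The latency-hiding argument reduces to a back-of-the-envelope comparison. The miss latency and instruction counts below are invented for illustration:

```python
# A load misses the cache, followed by N independent single-cycle ops.
MISS_LATENCY = 100      # cycles to fetch from main memory (assumed)
INDEPENDENT_OPS = 20    # later instructions that don't need the data

# In-order: everything after the load waits for the data to arrive.
in_order = MISS_LATENCY + INDEPENDENT_OPS

# Out-of-order: the independent ops execute under the miss, assuming
# the reorder window is big enough to reach all of them.
out_of_order = max(MISS_LATENCY, INDEPENDENT_OPS)

print(in_order, out_of_order)   # 120 vs. 100 cycles
```

The gap grows with miss latency and with how much independent work sits behind the load, which is why big caches or low-latency local memory matter so much more on the in-order design.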
 
DiGuru said:
ERP said:
DiGuru said:
It doesn't matter very much, as long as they supply a compiler that rearranges the instructions.

In theory perhaps.
In practice in order has a significant performance impact.

It depends. Mostly, the out-of-order example is an i686-class CPU, which translates all i386 instructions to micro-ops that execute on a RISC core with a totally different architecture. Like a CPU running some interpreted p-code. So, it has to do its own optimizations, because the compiler can only generate that p-code, not the actual instruction stream.

And, it uses a lot of transistors for that, which might be used for other things instead.

Hmm, I thought the x86 to micro-ops translation only took about 1 million transistors, which was significant during the time of the original pentium but not particularly so anymore.
 
Cramming ever more instructions per cycle into a single core is a dead end, as long as you cannot increase memory bandwidth and clock speed indefinitely. Sure, two smaller and simpler cores are slower if you only use one of them, but that's the only option left.

And 2 simple cores can be much faster than 1 complex one, at the same clockspeed. What does it matter if they stall 15% of the time, instead of only 5% for the complex one?

It isn't IPC that rules anymore; nowadays it's instruction streams per transistor.
 
DiGuru said:
And 2 simple cores can be much faster than 1 complex one, at the same clockspeed. What does it matter if they stall 15% of the time, instead of only 5% for the complex one?

Depends how much simpler the "simple" cores are. If they have just 10-15% fewer transistors, why bother?

Reminds me of the time when everybody said CISC and x86 was a dead end and how much simpler, cheaper and more efficient the Itanium would be.
 
DiGuru said:
And 2 simple cores can be much faster than 1 complex one, at the same clockspeed. What does it matter if they stall 15% of the time, instead of only 5% for the complex one?

But how much more complex does an OOOE CPU have to be?

The PPRO used 10% of the die area for the reorder buffer and schedulers. If it buys you more than 10% performance it's a win.

Cheers
Gubbi
 
Gubbi said:
DiGuru said:
And 2 simple cores can be much faster than 1 complex one, at the same clockspeed. What does it matter if they stall 15% of the time, instead of only 5% for the complex one?

But how much more complex does an OOOE CPU have to be?

The PPRO used 10% of the die area for the reorder buffer and schedulers. If it buys you more than 10% performance it's a win.

Cheers
Gubbi

And that goes for a lot of things, until your whole die area is taken up by a single core, with no place left.

On the other hand, if you can fit an extra core by scrapping a few things like that, a single core wouldn't be faster, but together they might be.

Why doesn't Intel make 10GHz complex, single core processors, but dual 3-4GHz ones instead, if the first one would be so much better on paper? Because they can't.
 
Something else has been bothering me:

How does the PPE and the X360 CPU schedule instructions?

Do they just statically schedule:

1. Two instructions from one thread (in single thread mode)
2. Two instructions from each of two threads on alternating cycles
3. One instruction from each of two threads each cycle
4. Two instructions from one thread until stall, then switch to the second thread and run full bore.

I mean, it's an in-order core; when the core stalls, both threads would stall, right?

I can see how alternating two threads would halve throughput for each thread, essentially halving apparent latency (instruction, memory, what have you), and hence make it easier to sustain a higher throughput.

Does anybody know ?

Cheers
Gubbi
 
Gubbi said:
But how much more complex does an OOOE CPU have to be?

The PPRO used 10% of the die area for the reorder buffer and schedulers. If it buys you more than 10% performance it's a win.

Cheers
Gubbi
Exactly. And since the number of transistors used for out-of-order functionality remains more or less constant, or grows only moderately, between processor generations while the total number of transistors grows significantly, it gets cheaper every generation.
 
The MS engineers said in the Major Nelson interview that they would've preferred to choose a 10GHz processor.
 
Gubbi said:
Something else has been bothering me:

How does the PPE and the X360 CPU schedule instructions?

Do they just statically schedule:

1. Two instructions from one thread (in single thread mode)
2. Two instructions from each of two threads on alternating cycles
3. One instruction from each of two threads each cycle
4. Two instructions from one thread until stall, then switch to the second thread and run full bore.

I mean, it's an in-order core; when the core stalls, both threads would stall, right?

I can see how alternating two threads would halve throughput for each thread, essentially halving apparent latency (instruction, memory, what have you), and hence make it easier to sustain a higher throughput.

Does anybody know ?

Cheers
Gubbi

I don't know that there have been any details released on this.

But in general the reason that you have the two hardware threads is that you can do useful work when one of your resources is blocked.

As long as both threads are not waiting on the same resource or blocked waiting on different resources, one of them can keep running.

Basically, one blocked thread does not stop the other thread; otherwise, having two sequencing units would be a total waste of transistors.
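That behaviour can be sketched with a toy issue loop. All latencies here are invented, and the issue policy (pick any ready thread each cycle) is just one plausible scheme, since, as noted, the real PPE/Xenon scheduling details haven't been published:

```python
# Toy model of two hardware threads on one in-order core: each cycle the
# core issues one instruction from a thread that is ready, so a thread
# stalled on memory does not idle the whole core. Latencies are assumed.
def run(threads):
    """threads: list of instruction streams; each instruction is just
    its latency in cycles (1 = simple ALU op, 100 = load that misses
    cache). Returns the cycle at which all work has completed."""
    pc = [0] * len(threads)            # next instruction per thread
    ready_at = [0] * len(threads)      # cycle each thread unstalls
    cycle = 0
    while any(pc[t] < len(s) for t, s in enumerate(threads)):
        for t, stream in enumerate(threads):
            if pc[t] < len(stream) and ready_at[t] <= cycle:
                ready_at[t] = cycle + stream[pc[t]]
                pc[t] += 1
                break                  # one issue slot per cycle
        cycle += 1
    return max(ready_at)

miss, alu_work = [100], [1] * 10
print(run([miss]))            # alone: 100 cycles, mostly stalled
print(run([miss, alu_work]))  # the second thread's work hides under the miss
```

With only the stalling thread, the core idles for the whole miss; add the second thread and its ten ALU ops complete entirely under the miss latency, for no extra total time.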
 