In-order execution and the Xbox 360 CPU

bbot said:
The MS engineers said on the Major Nelson interview that they would've preferred to choose a 10 GHz processor.

Of course they would, who wouldn't? But that's just not realistic at this time. In fact, if clock speeds were ramping like they were expected to keep ramping a couple of years ago, I bet you'd be seeing something entirely different inside of the 360.
 
quote:

"On the Xbox 360 ideally if we could have bought one we would have preferred 10 GHz processor , unfortunately those kinda processor aren’t available so we had to break it three way and get the best performance we can."
 
DiGuru said:
Why doesn't Intel make 10GHz complex, single core processors, but dual 3-4GHz ones instead, if the first one would be so much better on paper? Because they can't.

What has this to do with simple or complex cores? It's not like the so much simpler Xbox 360 CPU achieves higher clock rates than the more complex Intel ones. I would bet that even with its additional core it would be somewhat underwhelming if benchmarked against a dual-core Intel or Athlon 64 in a comparable environment (same OS, memory, etc.).
 
N00b said:
Gubbi said:
But how much more complex does an OOOE CPU have to be?

The PPRO used 10% of the die area for the reorder buffer and schedulers. If it buys you more than 10% performance it's a win.

Cheers
Gubbi
Exactly. And since the number of transistors used for out-of-order functionality remains more or less constant or just grows moderately between processor generations while the total number of transistors grows significantly, it gets cheaper every generation.

Depends. In Northwood P4s, the schedulers were around 20-25%, in Prescott P4s around 15% of the core die area. But then again the P4s can have about three times as many instructions in flight as the PPRO core.

Intel obviously felt it was worth it. Same with AMD's Athlon: big-ass schedulers. And look who's king of the hill in SPECint performance?

Cheers
Gubbi
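The break-even arithmetic behind Gubbi's point can be sketched in a few lines. The 10% area figure is the one quoted above; the speedup numbers are purely illustrative assumptions:

```python
# Back-of-the-envelope: is spending die area on OOO hardware a win?
# The ~10% area figure is the PPro number quoted above; the speedup
# values are illustrative assumptions, not measurements.

def perf_per_area(speedup, extra_area_fraction):
    """Performance per unit die area, relative to the in-order baseline."""
    return speedup / (1.0 + extra_area_fraction)

in_order = perf_per_area(1.00, 0.00)        # baseline: 1.0
ooo_small_gain = perf_per_area(1.08, 0.10)  # 8% faster for 10% more area
ooo_big_gain = perf_per_area(1.30, 0.10)    # 30% faster for 10% more area

assert ooo_small_gain < in_order  # 8% gain for 10% area: a slight loss
assert ooo_big_gain > in_order    # 30% gain for 10% area: a clear win
```

The same per-area framing is what makes the later "spend the transistors on cache or cores instead" argument meaningful: every feature competes for the same budget.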
 
N00b said:
What has this to do with simple or complex cores? It's not like the so much simpler Xbox 360 CPU achieves higher clock rates than the more complex Intel ones. I would bet that even with its additional core it would be somewhat underwhelming if benchmarked against a dual-core Intel or Athlon 64 in a comparable environment (same OS, memory, etc.).

Yes, especially when you use a single-threaded benchmark. If you use a multi-threaded one, it might be different.

And if you have the choice between three cores, or two, but both 20% more efficient, or only one, but 50% more efficient, what do you do?
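DiGuru's hypothetical can be worked through under the (strong) assumption of a perfectly parallel workload; the efficiency numbers are the ones from the post, not measurements:

```python
# Aggregate throughput for the three hypothetical choices above,
# assuming a perfectly parallel workload (illustrative numbers only).

def total_throughput(cores, per_core_efficiency):
    return cores * per_core_efficiency

three_simple = total_throughput(3, 1.0)  # three baseline cores
two_better = total_throughput(2, 1.2)    # two cores, each 20% faster
one_best = total_throughput(1, 1.5)      # one core, 50% faster

assert three_simple > two_better > one_best

# With a purely serial workload the ranking inverts: only one core does
# work, so per-core efficiency is all that matters.
serial_perf = [1.0, 1.2, 1.5]
assert max(serial_perf) == 1.5
```

Which case applies depends entirely on how parallel the software is, which is exactly the point of contention in the posts that follow.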
 
ERP said:
Gubbi said:
Something else has been bothering me:

How does the PPE and the X360 CPU schedule instructions?

Do they just statically schedule:

1. Two instructions from one thread (in single thread mode)
2. Two instructions from each of two threads on alternating cycles
3. One instruction from each of two threads each cycle
4. Two instructions from one thread until a stall, then switch to the second thread and run full bore.

I mean, it's an in-order core; when the core stalls, both threads would stall, right?

I can see how alternating two threads would halve throughput for each thread, essentially halving apparent latency (instruction, memory, what have you), and hence make it easier to sustain a higher throughput.

Does anybody know?

Cheers
Gubbi

I don't know that there have been any details released on this.

But in general the reason that you have the two hardware threads is that you can do useful work when one of your resources is blocked.

As long as both threads are not waiting on the same resource or blocked waiting on different resources, one of them can keep running.

Right, so each thread has its own issue control.

ERP said:
Basically one thread blocking does not stop the other thread; otherwise, having two sequencing units is a total waste of transistors.

Except for the halving of apparent latencies.

Cheers
Gubbi
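A toy cycle-level sketch of ERP's point, under assumed latencies: when one hardware thread stalls on a long-latency operation, an in-order core with a second sequencing unit can keep issuing from the other thread. The `run` helper and all latencies here are hypothetical, purely to illustrate the overlap:

```python
# Toy simulator: two hardware threads on a single-issue in-order core.
# Each instruction stream is a list of latencies; a value > 1 models a
# stall (e.g. a cache miss). All numbers are illustrative assumptions.

def run(threads):
    """Cycles to retire all instructions, preferring the lowest-numbered
    ready thread each cycle (a stalled thread simply yields the slot)."""
    streams = [list(t) for t in threads]
    ready_at = [0] * len(streams)  # cycle at which each thread may issue again
    cycle = 0
    while any(streams):
        for tid, stream in enumerate(streams):
            if stream and ready_at[tid] <= cycle:
                latency = stream.pop(0)
                ready_at[tid] = cycle + latency  # stall until result is ready
                break                            # one issue slot per cycle
        cycle += 1
    return cycle

MISS = 10                 # a "memory" stall, in cycles
a = [1, 1, MISS, 1, 1]
b = [1, 1, MISS, 1, 1]

alone = run([a])          # single thread: stall cycles are simply wasted
together = run([a, b])    # two threads: one thread's stall hides the other's

assert together < 2 * alone   # better than running the two back to back
```

This matches ERP's caveat: the win only appears as long as the two threads aren't blocked on the same resource at the same time, and it also shows Gubbi's "halved apparent latency" effect when issue slots alternate.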
 
Even ignoring SMT, with narrow issue width and prepare-to-branch instructions I don't think OOOE would add much with a decent architecture-optimized compiler ... unfortunately such compilers are mythical beasts.
 
DiGuru said:
And if you have the choice between three cores, or two, but both 20% more efficient, or only one, but 50% more efficient, what do you do?
Sorry, but these numbers have nothing to do with the real world. You just picked them to "prove" your opinion. It's as if I said: and if you have the choice between three cores, or two, but both 60% more efficient, or only one, but 350% more efficient, what do you do?

How often has the death of the "complex" x86 architecture been predicted? But as far as I can tell x86 is alive and kicking, while a lot of "simple", competing architectures are dead by now.

Multi-core is just another feature, one that can be implemented by simple and by complex cores. In the end the complex cores will (again) outperform the simple ones for the same reason the complex cores have always prevailed: the transistor cost of a feature is more or less constant, but with every die shrink you get an increased transistor budget.
BTW, like every feature, multi-core has its limits. I doubt we will see more than 4 cores in mainstream, general-purpose computing for the next 10-15 years.
 
No, I don't think it's been *officially* confirmed as yet, but it's one of those things everyone knows, devs have stated, and Microsoft has done nothing to refute.
 
MfA said:
Even ignoring SMT, with narrow issue width and prepare-to-branch instructions I don't think OOOE would add much with a decent architecture-optimized compiler ... unfortunately such compilers are mythical beasts.
You mean like when Intel and HP said that the Itanium would be such a simple and easy design because the compiler would be able to do all the optimizing.

... and then ended up with a 3rd level cache? ;)

A couple of years ago I read an article about the Itanium. I barely remember it, but what I do remember is the conclusion that it is A BAD THING(tm) to rely on the compiler to keep the functional units of your processor busy. For the first-generation Itanium, reading data from main memory took about 50-100 clock cycles. So, they calculated, the compiler had to look ahead at least 300 clock cycles to feed all functional units. Pretty tough, but somehow manageable. For the second-generation Itanium with its increased clock speed, the compiler would have to look ahead *several thousand* clock cycles. That's simply not feasible. Of course the Itanium didn't live up to its theoretical performance. For the 3rd generation of Itanium, Intel bolted on the huge 3rd-level cache (among other things) to get decent performance out of it.
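The look-ahead claim can be sanity-checked with rough arithmetic: to cover an L-cycle stall on a W-wide in-order machine, the compiler must find on the order of W * L independent instructions past the load. The clock rates and memory latency below are illustrative, era-appropriate guesses, not exact figures:

```python
# Rough arithmetic behind the "look ahead thousands of cycles" claim.
# Memory latency in nanoseconds stays roughly constant, so a higher
# clock means more cycles to hide. All figures are illustrative.

ISSUE_WIDTH = 6  # Itanium: two 3-instruction bundles per cycle

def instructions_to_hide(mem_latency_ns, clock_ghz, width=ISSUE_WIDTH):
    latency_cycles = mem_latency_ns * clock_ghz  # same ns, more cycles
    return width * latency_cycles

gen1 = instructions_to_hide(mem_latency_ns=120, clock_ghz=0.8)  # Merced-era guess
gen2 = instructions_to_hide(mem_latency_ns=120, clock_ghz=1.5)  # later-gen guess

assert gen2 > gen1     # higher clock means more cycles to cover
assert gen2 > 1000     # "several thousand instructions" territory
```

Whatever the exact constants, the scaling is the problem: the schedule the compiler must produce grows with clock speed, while an OOO machine rediscovers it dynamically at run time.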
 
xbdestroya said:
No, I don't think it's been *officially* confirmed as yet, but it's one of those things everyone knows, devs have stated, and Microsoft has done nothing to refute.
Thanks.
 
N00b said:
For the second-generation Itanium with its increased clock speed, the compiler would have to look ahead *several thousand* clock cycles.

That's simply not feasible. Of course the Itanium didn't live up to it's theoretical performance. For the 3rd generation of Itanium Intel bolted on the huge 3rd-level cache (among other things) to get decent performance out of it.

The memory wall is there for all CPUs. A 3.7 GHz P4 sees about 500 cycles of main memory latency. The ROB can hold 128 instructions, or about 40-something cycles' worth. Pre-loading/prefetching, explicit vertical threading or architected multi-threading are some of the things that have to be done to overcome the latency.

- Or add more cache. Lowering average latency by adding oodles of cache makes a lot of sense, to the point that cache memory is one of the best ways to spend silicon die area today (at least for GP CPUs). So expect huge-cache CPUs on the desktop in the future.

OOOE in current CPUs can only cover on-die cache latencies, but with caches growing (and therefore their latency), the performance gain from having a self-scheduling device is significant, IMO.

Cheers
Gubbi
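Gubbi's numbers can be worked through directly; the ROB size and memory latency are the figures from the post, while the sustained issue rate is an assumption:

```python
# How many cycles of stall can a 128-entry ROB actually cover?
# ROB size and memory latency are the figures quoted in the post above;
# the sustained IPC is an assumed, roughly P4-class value.

ROB_ENTRIES = 128
SUSTAINED_IPC = 3          # assumption: ~3 instructions per cycle sustained
MEM_LATENCY_CYCLES = 500   # main memory latency seen by a 3.7 GHz P4

cycles_covered = ROB_ENTRIES / SUSTAINED_IPC  # ~42.7 cycles

assert 40 <= cycles_covered <= 45             # the "40-something" in the post
assert cycles_covered < MEM_LATENCY_CYCLES / 10
# The ROB hides well under a tenth of a full miss, hence prefetching,
# multithreading, and big caches as the remaining levers.
```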
 
MfA said:
The itanium has a much greater issue width.

But a lot lower clock rate.

The number of instructions issued per second is about the same (compared to a P4). The number of instructions in flight at any one time is about the same. Register file access latency (real time) is about the same; level-one cache latency is about the same (again, real time).

Going wider is one way to exploit instruction parallelism; going deeper (a longer pipeline) is another.

Cheers
Gubbi
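The "about the same issue rate" claim checks out with simple width-times-clock arithmetic; the widths and clock rates below are illustrative, era-appropriate values rather than exact SKUs:

```python
# Peak issue rate = issue width * clock. Figures are illustrative
# approximations (6-wide Itanium-class vs 3-wide P4-class), not SKUs.

def issue_rate_ginstr_per_s(width, clock_ghz):
    return width * clock_ghz

itanium_like = issue_rate_ginstr_per_s(width=6, clock_ghz=1.5)  # wide, slow clock
p4_like = issue_rate_ginstr_per_s(width=3, clock_ghz=3.0)       # narrow, fast clock

assert itanium_like == p4_like  # same peak: wide/slow vs narrow/deep trade-off
```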
 
Gubbi said:
MfA said:
The itanium has a much greater issue width.

But a lot lower clock rate.

The number of instructions issued per second is about the same (compared to a P4). The number of instructions in flight at any one time is about the same. Register file access latency (real time) is about the same; level-one cache latency is about the same (again, real time).

Going wider is one way to exploit instruction parallelism; going deeper (a longer pipeline) is another.

Cheers
Gubbi

And the Pentium M does neither, yet can beat both the P4 and Itanium! Well, in a few areas anyhow; I believe it loses badly in most.
 
N00b said:
Gubbi said:
But how much more complex does an OOOE CPU have to be?

The PPRO used 10% of the die area for the reorder buffer and schedulers. If it buys you more than 10% performance it's a win.

Cheers
Gubbi
Exactly. And since the number of transistors used for out-of-order functionality remains more or less constant or just grows moderately between processor generations while the total number of transistors grows significantly, it gets cheaper every generation.

Which is why Intel worked so long on a new architecture (IA-64) and put the two best CPU teams in the world (ex-EV7 and ex-EV8 guys) to work on this new architecture, the IPF line ;).
 
Alstrong said:
Fox5 said:
And the Pentium M does niether, yet can beat both the P4 and Itanium! Well, in a few areas anyhow, I believe it loses badly in most.


It was interesting that, clock for clock, it can match the Athlon 64s.

http://www.tomshardware.com/cpu/20050525/pentium4-10.html

Well, I believe it has a heck of a lot more transistors and a larger die than the Athlon 64s.
Also, Athlon 64s can take advantage of much faster memory than DDR400, while P-Ms are limited to around PC2700 max. An Athlon 64 with PC4000 RAM at low latencies gets a very nice performance boost.
Plus, I've seen many benchmarks online (typically the non-gaming ones) where the P-M gets utterly destroyed by the Athlon 64s and Pentium 4s.
 