In-order execution and the Xbox 360 CPU

bbot said:
The MS engineers said on the Major Nelson interview that they would've preferred to choose a 10 GHz processor.

Of course they would, who wouldn't? But that's just not realistic at this time. In fact, if clock speeds were ramping like they were expected to keep ramping a couple of years ago, I bet you'd be seeing something entirely different inside of the 360.
 
quote:

"On the Xbox 360 ideally if we could have bought one we would have preferred 10 GHz processor , unfortunately those kinda processor aren’t available so we had to break it three way and get the best performance we can."
 
DiGuru said:
Why doesn't Intel make 10GHz complex, single core processors, but dual 3-4GHz ones instead, if the first one would be so much better on paper? Because they can't.

What has this to do with simple or complex cores? It's not like the so much simpler Xbox 360 CPU achieves higher clock rates than the more complex Intel ones. I would bet that even with its additional core it would be somewhat underwhelming if benchmarked against a dual-core Intel or Athlon 64 in a comparable environment (same OS, memory, etc.).
 
N00b said:
Gubbi said:
But how much more complex does an OOOE CPU have to be?

The PPRO used 10% of the die area for the reorder buffer and schedulers. If it buys you more than 10% performance it's a win.

Cheers
Gubbi
Exactly. And since the number of transistors used for out-of-order functionality remains more or less constant or just grows moderately between processor generations while the total number of transistors grows significantly, it gets cheaper every generation.

Depends. In Northwood P4s, the schedulers were around 20-25%, in Prescott P4s around 15% of the core die area. But then again the P4s can have about three times as many instructions in flight as the PPRO core.

Intel obviously felt it was worth it. Same with AMD's Athlon: big-ass schedulers. And look who's king of the hill in SPECint performance?

Cheers
Gubbi
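The break-even arithmetic behind Gubbi's point can be sketched in a few lines. The 10% area figure is the one quoted above; the speedup numbers are purely illustrative assumptions:

```python
# Back-of-the-envelope: is spending die area on OOO hardware a win?
# The ~10% area figure is the PPro number quoted above; the speedup
# values are illustrative assumptions, not measurements.

def perf_per_area(speedup, extra_area_fraction):
    """Performance per unit die area, relative to the in-order baseline."""
    return speedup / (1.0 + extra_area_fraction)

in_order = perf_per_area(1.00, 0.00)        # baseline: 1.0
ooo_small_gain = perf_per_area(1.08, 0.10)  # 8% faster for 10% more area
ooo_big_gain = perf_per_area(1.30, 0.10)    # 30% faster for 10% more area

assert ooo_small_gain < in_order  # 8% gain for 10% area: a slight loss
assert ooo_big_gain > in_order    # 30% gain for 10% area: a clear win
```

The same per-area framing is what makes the later "spend the transistors on cache or cores instead" argument meaningful: every feature competes for the same budget.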
 
N00b said:
What has this to do with simple or complex cores? It's not like the so much simpler Xbox 360 CPU achieves higher clock rates than the more complex Intel ones. I would bet that even with its additional core it would be somewhat underwhelming if benchmarked against a dual-core Intel or Athlon 64 in a comparable environment (same OS, memory, etc.).

Yes, especially when you use a single-threaded benchmark. If you use a multi-threaded one, it might be different.

And if you have the choice between three cores, or two, but both 20% more efficient, or only one, but 50% more efficient, what do you do?
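DiGuru's hypothetical can be worked through under the (strong) assumption of a perfectly parallel workload; the efficiency numbers are the ones from the post, not measurements:

```python
# Aggregate throughput for the three hypothetical choices above,
# assuming a perfectly parallel workload (illustrative numbers only).

def total_throughput(cores, per_core_efficiency):
    return cores * per_core_efficiency

three_simple = total_throughput(3, 1.0)  # three baseline cores
two_better = total_throughput(2, 1.2)    # two cores, each 20% faster
one_best = total_throughput(1, 1.5)      # one core, 50% faster

assert three_simple > two_better > one_best

# With a purely serial workload the ranking inverts: only one core does
# work, so per-core efficiency is all that matters.
serial_perf = [1.0, 1.2, 1.5]
assert max(serial_perf) == 1.5
```

Which case applies depends entirely on how parallel the software is, which is exactly the point of contention in the posts that follow.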
 
ERP said:
Gubbi said:
Something else has been bothering me:

How does the PPE and the X360 CPU schedule instructions?

Do they just statically schedule:

1. Two instructions from one thread (in single thread mode)
2. Two instructions from each of two threads on alternating cycles
3. One instruction from each of two threads each cycle
4. Two instructions from one thread until a stall, then switch to the second thread and run full bore.

I mean, it's an in-order core; when the core stalls, both threads would stall, right?

I can see how alternating two threads would halve throughput for each thread, essentially halving apparent latency (instruction, memory, what have you), and hence make it easier to sustain a higher throughput.

Does anybody know?

Cheers
Gubbi

I don't know that there have been any details released on this.

But in general the reason that you have the two hardware threads is that you can do useful work when one of your resources is blocked.

As long as both threads are not waiting on the same resource or blocked waiting on different resources, one of them can keep running.

Right, so each thread has its own issue control.

ERP said:
Basically one thread blocking does not stop the other thread; otherwise, having two sequencing units is a total waste of transistors.

Except for the halving of apparent latencies.

Cheers
Gubbi
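A toy cycle-level sketch of ERP's point, under assumed latencies: when one hardware thread stalls on a long-latency operation, an in-order core with a second sequencing unit can keep issuing from the other thread. The `run` helper and all latencies here are hypothetical, purely to illustrate the overlap:

```python
# Toy simulator: two hardware threads on a single-issue in-order core.
# Each instruction stream is a list of latencies; a value > 1 models a
# stall (e.g. a cache miss). All numbers are illustrative assumptions.

def run(threads):
    """Cycles to retire all instructions, preferring the lowest-numbered
    ready thread each cycle (a stalled thread simply yields the slot)."""
    streams = [list(t) for t in threads]
    ready_at = [0] * len(streams)  # cycle at which each thread may issue again
    cycle = 0
    while any(streams):
        for tid, stream in enumerate(streams):
            if stream and ready_at[tid] <= cycle:
                latency = stream.pop(0)
                ready_at[tid] = cycle + latency  # stall until result is ready
                break                            # one issue slot per cycle
        cycle += 1
    return cycle

MISS = 10                 # a "memory" stall, in cycles
a = [1, 1, MISS, 1, 1]
b = [1, 1, MISS, 1, 1]

alone = run([a])          # single thread: stall cycles are simply wasted
together = run([a, b])    # two threads: one thread's stall hides the other's

assert together < 2 * alone   # better than running the two back to back
```

This matches ERP's caveat: the win only appears as long as the two threads aren't blocked on the same resource at the same time, and it also shows Gubbi's "halved apparent latency" effect when issue slots alternate.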
 
Even ignoring SMT, with narrow issue width and prepare-to-branch instructions I don't think OOOE would add much with a decent architecture-optimized compiler ... unfortunately such compilers are mythical beasts.
 
DiGuru said:
And if you have the choice between three cores, or two, but both 20% more efficient, or only one, but 50% more efficient, what do you do?
Sorry, but these numbers have nothing to do with the real world. You just picked them to "prove" your opinion. It's as if I said: and if you have the choice between three cores, or two, but both 60% more efficient, or only one, but 350% more efficient, what do you do?

How often has the death of the "complex" x86 architecture been predicted? But as far as I can tell x86 is alive and kicking, while a lot of "simple", competing architectures are dead by now.

Multi-core is just another feature, one that can be implemented by simple and by complex cores. In the end the complex cores will (again) outperform the simple ones for the same reason the complex cores have always prevailed: the transistor cost of a feature is more or less constant, but with every die shrink you get an increased transistor budget.
BTW, like every feature, multi-core has its limits. I doubt we will see more than 4 cores in mainstream, general-purpose computing for the next 10-15 years.
 
No, I don't think it's been *officially* confirmed as yet, but it's one of those things everyone knows, devs have stated, and Microsoft has done nothing to refute.
 
MfA said:
Even ignoring SMT, with narrow issue width and prepare-to-branch instructions I don't think OOOE would add much with a decent architecture-optimized compiler ... unfortunately such compilers are mythical beasts.
You mean like when Intel and HP said that the Itanium would be such a simple and easy design because the compiler would be able to do all the optimizing.

... and then ended up with a 3rd level cache? ;)

A couple of years ago I read an article about the Itanium. I barely remember it, but what I do remember is the conclusion that it is A BAD THING(tm) to rely on the compiler to keep the functional units of your processor busy. For the first-generation Itanium, reading data from main memory took about 50-100 clock cycles. So, they calculated, the compiler had to look ahead at least 300 clock cycles to feed all functional units. Pretty tough, but somehow manageable. For the second-generation Itanium with its increased clock speed, the compiler would have to look ahead *several thousand* clock cycles. That's simply not feasible. Of course the Itanium didn't live up to its theoretical performance. For the 3rd generation of Itanium, Intel bolted on the huge 3rd-level cache (among other things) to get decent performance out of it.
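The look-ahead claim can be sanity-checked with rough arithmetic: to cover an L-cycle stall on a W-wide in-order machine, the compiler must find on the order of W * L independent instructions past the load. The clock rates and memory latency below are illustrative, era-appropriate guesses, not exact figures:

```python
# Rough arithmetic behind the "look ahead thousands of cycles" claim.
# Memory latency in nanoseconds stays roughly constant, so a higher
# clock means more cycles to hide. All figures are illustrative.

ISSUE_WIDTH = 6  # Itanium: two 3-instruction bundles per cycle

def instructions_to_hide(mem_latency_ns, clock_ghz, width=ISSUE_WIDTH):
    latency_cycles = mem_latency_ns * clock_ghz  # same ns, more cycles
    return width * latency_cycles

gen1 = instructions_to_hide(mem_latency_ns=120, clock_ghz=0.8)  # Merced-era guess
gen2 = instructions_to_hide(mem_latency_ns=120, clock_ghz=1.5)  # later-gen guess

assert gen2 > gen1     # higher clock means more cycles to cover
assert gen2 > 1000     # "several thousand instructions" territory
```

Whatever the exact constants, the scaling is the problem: the schedule the compiler must produce grows with clock speed, while an OOO machine rediscovers it dynamically at run time.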
 
xbdestroya said:
No, I don't think it's been *officially* confirmed as yet, but it's one of those things everyone knows, devs have stated, and Microsoft has done nothing to refute.
Thanks.
 
N00b said:
For the second-generation Itanium with its increased clock speed, the compiler would have to look ahead *several thousand* clock cycles.

That's simply not feasible. Of course the Itanium didn't live up to it's theoretical performance. For the 3rd generation of Itanium Intel bolted on the huge 3rd-level cache (among other things) to get decent performance out of it.

The memory wall is there for all CPUs. A 3.7 GHz P4 sees about 500 cycles of main memory latency. The ROB can hold 128 instructions, or about 40-something cycles' worth. Pre-loading/prefetching, explicit vertical threading or architected multi-threading are some of the things that have to be done to overcome the latency.

- Or add more cache. Lowering average latency by adding oodles of cache makes a lot of sense, to the point that cache memory is one of the best ways to spend silicon die area today (at least for GP CPUs). So expect huge-cache CPUs on the desktop in the future.

OOOE in current CPUs can only cover on-die cache latencies, but with caches growing (and therefore their latency), the performance gain from having a self-scheduling device is significant, IMO.

Cheers
Gubbi
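Gubbi's numbers can be worked through directly; the ROB size and memory latency are the figures from the post, while the sustained issue rate is an assumption:

```python
# How many cycles of stall can a 128-entry ROB actually cover?
# ROB size and memory latency are the figures quoted in the post above;
# the sustained IPC is an assumed, roughly P4-class value.

ROB_ENTRIES = 128
SUSTAINED_IPC = 3          # assumption: ~3 instructions per cycle sustained
MEM_LATENCY_CYCLES = 500   # main memory latency seen by a 3.7 GHz P4

cycles_covered = ROB_ENTRIES / SUSTAINED_IPC  # ~42.7 cycles

assert 40 <= cycles_covered <= 45             # the "40-something" in the post
assert cycles_covered < MEM_LATENCY_CYCLES / 10
# The ROB hides well under a tenth of a full miss, hence prefetching,
# multithreading, and big caches as the remaining levers.
```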
 
MfA said:
The itanium has a much greater issue width.

But a lot lower clock rate.

The number of instructions issued per second is about the same (compared to a P4). The number of instructions in flight at any one time is about the same. Register file access latency (real time) is about the same; level-one cache latency is about the same (again, real time).

Going wider is one way to exploit instruction parallelism; going deeper (a longer pipeline) is another.

Cheers
Gubbi
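The "about the same issue rate" claim checks out with simple width-times-clock arithmetic; the widths and clock rates below are illustrative, era-appropriate values rather than exact SKUs:

```python
# Peak issue rate = issue width * clock. Figures are illustrative
# approximations (6-wide Itanium-class vs 3-wide P4-class), not SKUs.

def issue_rate_ginstr_per_s(width, clock_ghz):
    return width * clock_ghz

itanium_like = issue_rate_ginstr_per_s(width=6, clock_ghz=1.5)  # wide, slow clock
p4_like = issue_rate_ginstr_per_s(width=3, clock_ghz=3.0)       # narrow, fast clock

assert itanium_like == p4_like  # same peak: wide/slow vs narrow/deep trade-off
```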
 
Gubbi said:
MfA said:
The itanium has a much greater issue width.

But a lot lower clock rate.

The number of instructions issued per second is about the same (compared to a P4). The number of instructions in flight at any one time is about the same. Register file access latency (real time) is about the same; level-one cache latency is about the same (again, real time).

Going wider is one way to exploit instruction parallelism; going deeper (a longer pipeline) is another.

Cheers
Gubbi

And the Pentium M does neither, yet can beat both the P4 and Itanium! Well, in a few areas anyhow; I believe it loses badly in most.
 
N00b said:
Gubbi said:
But how much more complex does an OOOE CPU have to be?

The PPRO used 10% of the die area for the reorder buffer and schedulers. If it buys you more than 10% performance it's a win.

Cheers
Gubbi
Exactly. And since the number of transistors used for out-of-order functionality remains more or less constant or just grows moderately between processor generations while the total number of transistors grows significantly, it gets cheaper every generation.

Which is why Intel worked so long on a new architecture (IA-64) and put the two best CPU teams in the world (ex-EV7 and ex-EV8 guys) to work on this new architecture, the IPF line ;).
 
Alstrong said:
Fox5 said:
And the Pentium M does niether, yet can beat both the P4 and Itanium! Well, in a few areas anyhow, I believe it loses badly in most.


It was interesting that, clock for clock, it can match the Athlon 64s.

http://www.tomshardware.com/cpu/20050525/pentium4-10.html

Well, I believe it has a heck of a lot more transistors and a larger die than the Athlon 64s.
Also, Athlon 64s can take advantage of much faster memory than DDR400, while P-Ms are limited to around PC2700 max. An Athlon 64 with PC4000 RAM at low latencies gets a very nice performance boost.
Plus, I've seen many benchmarks online (typically the non-gaming ones) where the P-M gets utterly destroyed by the Athlon 64s and Pentium 4s.
 