In-order execution and the Xbox 360 CPU

N00b said:
MfA said:
Even ignoring SMT, with narrow issue width and prepare-to-branch instructions I don't think OOOE would add much with a decent architecture-optimized compiler ... unfortunately such compilers are mythical beasts.
You mean like Intel and HP said that Itanium would be such a simple and easy design because the compiler would be able to do all the optimizing.

... and then ended up with a 3rd level cache? ;)

A couple of years ago I read an article about the Itanium. I barely remember it, but what stuck with me is the conclusion that it is A BAD THING(tm) to rely on the compiler to keep the functional units of your processor busy. For the first-generation Itanium, reading data from main memory took about 50-100 clock cycles, so, they calculated, the compiler had to look ahead at least 300 clock cycles to feed all the functional units. Pretty tough, but somehow manageable. For the second-generation Itanium, with its increased clock speed, the compiler would have to look ahead *several thousand* clock cycles. That's simply not feasible. Of course the Itanium didn't live up to its theoretical performance. For the third generation of Itanium, Intel bolted on the huge third-level cache (among other things) to get decent performance out of it.
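The arithmetic behind that look-ahead claim can be sketched roughly like this. The numbers below are illustrative, not official Itanium figures:

```python
# Back-of-envelope: how many independent instructions must a compiler
# find to keep an in-order machine busy during a memory stall?
# issue_width and miss_latency_cycles are illustrative assumptions.

def instructions_needed(issue_width, miss_latency_cycles):
    """Independent instructions required to cover one memory stall."""
    return issue_width * miss_latency_cycles

# A 6-wide in-order core with a ~100-cycle miss needs hundreds of
# independent instructions scheduled around each load:
print(instructions_needed(6, 100))   # 600

# Raise the clock so the same DRAM is ~500 cycles away and the static
# schedule has to stretch over thousands of instructions:
print(instructions_needed(6, 500))   # 3000
```

The point being that the work the compiler must find grows linearly with memory latency in cycles, which is exactly what a faster clock against the same DRAM gives you.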

DECENT?

Itanium 2 goes quite toe-to-toe with its friends POWER4+/POWER5, and you have not seen IPF's third generation yet.

Itanium 2 is a 6-way processor with static scheduling (in-order), and it is no slouch in SPECint.

As long as they can keep fabbing 440+ mm^2 monsters for about $140 a chip (lots of SRAM on chip buys you nice redundancy), they are in business to do lots of good ;).
 
Fox5 said:
Alstrong said:
Fox5 said:
And the Pentium M does neither, yet can beat both the P4 and Itanium! Well, in a few areas anyhow; I believe it loses badly in most.


It was interesting that, clock for clock, it can match the Athlon 64s.

http://www.tomshardware.com/cpu/20050525/pentium4-10.html

Well, I believe it has a heck of a lot more transistors and a larger die than the Athlon 64s.

Well, it has more transistors, but most of them are in the 2MB L2 cache.

Die size for Dothan is 83.6 mm^2, which is smaller than the A64s and P4s, all on 90nm.

Cheers
Gubbi
 
Panajev2001a said:
N00b said:
Gubbi said:
But how much more complex does an OOOE CPU have to be?

The PPRO used 10% of the die area for the reorder buffer and schedulers. If it buys you more than 10% performance it's a win.

Cheers
Gubbi
Exactly. And since the number of transistors used for out-of-order functionality remains more or less constant, or grows only moderately, between processor generations while the total transistor count grows significantly, it gets cheaper every generation.

Which is why Intel worked so long on a new architecture (IA-64) and put the two best CPU teams in the world (ex-EV7 and ex-EV8 guys) to work on this new architecture, the IPF line ;).

But if the P4 can reach similar theoretical performance per second, does so much more cheaply, and has higher actual performance, wouldn't it be the better design?
 
ralexand said:
Wow, I had completely forgotten about the Itanium. Whatever happened to that chip?

Oh it's still around - breaking into the server market at a glacial pace.
 
Fox5 said:
Panajev2001a said:
N00b said:
Gubbi said:
But how much more complex does an OOOE CPU have to be?

The PPRO used 10% of the die area for the reorder buffer and schedulers. If it buys you more than 10% performance it's a win.

Cheers
Gubbi
Exactly. And since the number of transistors used for out-of-order functionality remains more or less constant, or grows only moderately, between processor generations while the total transistor count grows significantly, it gets cheaper every generation.

Which is why Intel worked so long on a new architecture (IA-64) and put the two best CPU teams in the world (ex-EV7 and ex-EV8 guys) to work on this new architecture, the IPF line ;).

But if the P4 can reach similar theoretical performance per second, does so much more cheaply, and has higher actual performance, wouldn't it be the better design?

The IA-32 architecture has had compiler technology and some of the wisest microprocessor designers going at it for the past 30+ years, compared to the young life of an architecture like IPF, which has lots of room to grow on an already solid base.

The war for ultra-high IPC on code with low ILP gets harder with each year that passes (extracting more and more parallelism out of code is like getting blood from a stone).

Complex, large OOOe logic stays in your critical path and buys you nothing in terms of redundancy: main RAM is getting way too far away (something like 500-1,000 cycles), and there is no way to find enough work for your OOOe logic to avoid stalls once you hit memory too often.

We need local memory (be it cache or working RAM), and once you are down to L1 cache latencies, a good ISA and good compilers let you work around them; in a heavily threaded design (IPF is going multi-core and multi-threaded too) those latencies might be mitigated even further.
 
Panajev2001a said:
Complex, large OOOe logic stays in your critical path and buys you nothing in terms of redundancy.

You pipeline your scheduling stage until it doesn't impact cycle time.

The real killer is register file read and result forwarding latency. And these are constant given the same amount of registers and execution units.

In an in-order CPU you need a large architected (visible) register file in order to statically schedule around latencies, roughly the same number of registers as you would find as rename registers in an OOOE CPU.

Exactly how do you define the critical path? Branch mispredict penalty?

I would define the critical path as the schedule-execute loop (or the issue-execute loop in the in-order case): that is, how fast we can get the next dependent instruction executed once the results it needs are ready.

If anything, an OOOE CPU has an advantage here because instructions can be pre-scheduled before the results they need are available. Both the P4 and the Athlon have two-stage scheduling: a global ROB from which instructions are issued to local schedulers, either when the instruction's operands are ready or when the global ROB knows the needed results will be produced in the local scheduler's attached exec unit (like the double-pumped ALUs in the P4 and the integer box in the Athlon).

By doing so, the local scheduler can intercept results on a local result bus sooner, resulting in better schedule-execute latency.

Cheers
Gubbi
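The in-order vs out-of-order issue difference Gubbi describes can be shown with a toy single-issue timing model. This is purely illustrative (instruction names, latencies and the one-instruction-per-cycle model are all made up; real schedulers, including the two-stage P4/Athlon scheme, are far more involved):

```python
# Toy single-issue timing model: each instruction is (name, deps, latency).
# In-order issue may only consider the oldest pending instruction;
# out-of-order issue picks the oldest instruction whose operands are ready.

def run(program, out_of_order):
    done = {}              # name -> cycle at which its result is ready
    pending = list(program)
    cycle = 0
    while pending:
        # Instructions whose operands are available this cycle.
        ready = [i for i in pending
                 if all(d in done and done[d] <= cycle for d in i[1])]
        if not out_of_order:
            # In-order: stall unless the oldest pending instruction is ready.
            ready = ready[:1] if pending and pending[0] in ready else []
        if ready:
            name, deps, lat = ready[0]
            pending.remove(ready[0])
            done[name] = cycle + lat
        cycle += 1
    return max(done.values())

prog = [
    ("load", [],       20),  # cache miss: result ready 20 cycles later
    ("use",  ["load"],  1),  # depends on the load
    ("a",    [],        1),  # independent work an OOO core can slip in
    ("b",    [],        1),
    ("c",    [],        1),
]

print(run(prog, out_of_order=False))  # 24: a, b, c wait behind the stall
print(run(prog, out_of_order=True))   # 21: a, b, c execute under the miss
```

The independent work disappears under the cache miss in the out-of-order case, which is the whole sales pitch for the machinery.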
 
Gubbi said:
Panajev2001a said:
Complex, large OOOe logic stays in your critical path and buys you nothing in terms of redundancy.

You pipeline your scheduling stage until it doesn't impact cycle time.

The real killer is register file read and result forwarding latency. And these are constant given the same amount of registers and execution units.

In an in-order CPU you need a large architected (visible) register file in order to statically schedule around latencies, roughly the same number of registers as you would find as rename registers in an OOOE CPU.

Exactly how do you define the critical path? Branch mispredict penalty?

I would define the critical path as the schedule-execute loop (or the issue-execute loop in the in-order case): that is, how fast we can get the next dependent instruction executed once the results it needs are ready.

If anything, an OOOE CPU has an advantage here because instructions can be pre-scheduled before the results they need are available. Both the P4 and the Athlon have two-stage scheduling: a global ROB from which instructions are issued to local schedulers, either when the instruction's operands are ready or when the global ROB knows the needed results will be produced in the local scheduler's attached exec unit (like the double-pumped ALUs in the P4 and the integer box in the Athlon).

By doing so, the local scheduler can intercept results on a local result bus sooner, resulting in better schedule-execute latency.

Cheers
Gubbi

Hmm, I thought the advantage of an in-order CPU was a massively higher theoretical performance than could be achieved by an out-of-order one. If there are only pluses to the out-of-order design, then why not go with it?
 
Fox5 said:
Hmm, I thought the advantage of an in-order CPU was a massively higher theoretical performance than could be achieved by an out-of-order one. If there are only pluses to the out-of-order design, then why not go with it?

The OOO apparatus takes up die space and uses power; die space and power that could be spent on execution units (like big fat SIMD ones).

For execution-unit-centric workloads (like those in consoles), trading off OOO for more instruction throughput might be the right choice.

I obviously don't think so :)

Cheers
Gubbi
 
I think ILP is overrated for the types of loads these processors must deal with. TLP is a lot better, and something like Sun's Niagara processor should be able to get much closer to its peak issue rate than an OOOE design can.
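The TLP argument is easiest to see on blocking workloads: many threads each waiting on I/O overlap almost for free. A minimal sketch (the 50 ms "request" is a made-up stand-in for a disk or network wait, not a Niagara model):

```python
# Eight "requests" that each block for 50 ms finish in roughly the time
# of one when overlapped across threads, instead of ~0.40 s serially.
import threading
import time

def handle_request():
    time.sleep(0.05)   # stand-in for an I/O wait (disk, network, DB)

start = time.time()
threads = [threading.Thread(target=handle_request) for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
overlapped = time.time() - start

print(f"8 overlapped requests took {overlapped:.2f}s (serial: ~0.40s)")
```

That latency-hiding effect, not single-thread speed, is what a many-threads-per-core design like Niagara is built around.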
 
Gubbi said:
Panajev2001a said:
Complex, large OOOe logic stays in your critical path and buys you nothing in terms of redundancy.

You pipeline your scheduling stage until it doesn't impact cycle time.

Or we leave that out and keep our pipeline shorter, which helps reduce the branch misprediction penalty (which is getting worse and worse already) as well as saving power (it is not as if deep pipelining has no cost).

The real killer is register file read and result forwarding latency. And these are constant given the same amount of registers and execution units.

Too bad the number of physical registers hardly stayed the same going from the Pentium Pro to Prescott, and we have some new execution units too (MMX and SSE/SSE2/SSE3, which the Pentium Pro did not have to carry around).

In an in-order CPU you need a large architected (visible) register file in order to statically schedule around latencies, the same amount of registers as you would find rename registers in an OOOE CPU.

A large register file which, being visible to applications, can be used MUCH better by the compiler: unrolling loops and helping to implement a saner ISA (one that maybe does not offer destructive instruction formats as your only choice ;)).

No one disputes that a very well done OOOe engine can buy you extra performance, since at run-time there are some things you are simply in a better position to handle. It is just that as processors grow, it gets more difficult to track individual instructions (some architectures are moving to tracking instruction groups instead, even in the high-performance computing arena: see POWER4/POWER4+/POWER5/PowerPC 970).

What I am arguing is that there are several fields that are going to rely on multi-processing and heavily multi-threaded applications, in which spending many transistors on supporting larger and larger instruction windows, getting more and more instructions in flight, etc. might not be your best bet. Both Intel with IPF and IBM with the XeCPU and the BPA-based CELL architecture are taking that bet.

Not only that, but call me a Roswellian for this: the acquisition of the teams that worked on the Elbrus 2K (the joke of the industry) hardware and compiler-related technology (Dr. Babayan is now an Intel fellow, IIRC) was IPF-related. IPF has also always been a whole manufacturing node behind x86 so far, and is now getting some of the best microprocessor designers in the industry (people from the EV7 and EV8 design teams) working on the next-generation IPF MPU. We will see if they transition IPF to OOOe or if they think the path HP and Intel started with EPIC was not so bad after all ;).
 
Panajev2001a said:
Gubbi said:
Panajev2001a said:
Complex, large OOOe logic stays in your critical path and buys you nothing in terms of redundancy.

You pipeline your scheduling stage until it doesn't impact cycle time.

Or we leave that out and keep our pipeline shorter, which helps reduce the branch misprediction penalty (which is getting worse and worse already) as well as saving power (it is not as if deep pipelining has no cost).

Scheduling (and register file read-out) takes two cycles on the Athlons and three cycles on Northwood P4s, out of 14-20 cycle pipelines, so hardly devastating. Deep pipelining (hyper-pipelining) really has nothing to do with OOOE; it is about pushing operating frequency higher. The SPEs in CELL have a 12-20 cycle pipeline, the PPE a longer one. So about the same as the Athlon and P4 in a similar process at a (roughly) similar operating frequency.

Panajev2001a said:
The real killer is register file read and result forwarding latency. And these are constant given the same amount of registers and execution units.

Too bad the number of physical registers hardly stayed the same going from the Pentium Pro to Prescott, and we have some new execution units too (MMX and SSE/SSE2/SSE3, which the Pentium Pro did not have to carry around).

My point was that the number of rename registers in the Athlons and P4s roughly matches the architected registers in the PPE, XeCPU and Itanium (i.e. 96-112 vs 128).

Panajev2001a said:
In an in-order CPU you need a large architected (visible) register file in order to statically schedule around latencies, the same amount of registers as you would find rename registers in an OOOE CPU.

A large register file which, being visible to applications, can be used MUCH better by the compiler: unrolling loops and helping to implement a saner ISA (one that maybe does not offer destructive instruction formats as your only choice ;)).

Loop unrolling is a kludge and a waste. It's there to work around false dependencies in in-order CPUs (false dependencies are a non-issue in OOOE). It bloats code by multiplying the space a loop takes up, lowering code density (and thereby performance). Itanium has support for rotating the register file to avoid unrolling; Power/PPC does not.
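For readers unfamiliar with the transformation being argued over, here is what 4x unrolling looks like, sketched in Python for clarity (a real compiler does this on machine code, using the extra architected registers to keep the partial sums independent; the dot-product example is mine, not from the thread):

```python
# Plain loop: one serial chain of adds through a single accumulator.
def dot(a, b):
    s = 0.0
    for i in range(len(a)):
        s += a[i] * b[i]
    return s

# 4x unrolled: four independent accumulators break the serial add
# dependency, at the cost of ~4x the loop body -- the code bloat
# (and register pressure) being debated above.
def dot_unrolled4(a, b):
    n = len(a)
    s0 = s1 = s2 = s3 = 0.0
    i = 0
    while i + 4 <= n:
        s0 += a[i]     * b[i]
        s1 += a[i + 1] * b[i + 1]
        s2 += a[i + 2] * b[i + 2]
        s3 += a[i + 3] * b[i + 3]
        i += 4
    for j in range(i, n):        # leftover elements
        s0 += a[j] * b[j]
    return s0 + s1 + s2 + s3

xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 2.0, 2.0, 2.0, 2.0]
print(dot(xs, ys), dot_unrolled4(xs, ys))   # both 30.0
```

Each accumulator needs its own register across the whole loop body, which is why the in-order camp wants a large architected register file and the OOOE camp points out renaming makes much of this unnecessary.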

Panajev2001a said:
No one disputes that a very well done OOOe engine can buy you extra performance, since at run-time there are some things you are simply in a better position to handle. It is just that as processors grow, it gets more difficult to track individual instructions (some architectures are moving to tracking instruction groups instead, even in the high-performance computing arena: see POWER4/POWER4+/POWER5/PowerPC 970).

Actually, the Athlon was first :) Its ROB is divided into three lanes, and it retires instructions when all instructions in a given slot have completed across all three lanes. Similar to the POWER4/5/970 group-retiring mechanism.

Panajev2001a said:
What I am arguing is that there are several fields that are going to rely on multi-processing and heavily multi-threaded applications, in which spending many transistors on supporting larger and larger instruction windows, getting more and more instructions in flight, etc. might not be your best bet. Both Intel with IPF and IBM with the XeCPU and the BPA-based CELL architecture are taking that bet.

Well, IPF is the most aggressive ILP-exploiting CPU out there, even if it is in-order, so a bad example IMO. But I agree that there are fields where an in-order, exec-unit-heavy core makes sense. Consoles are one.

But giving up on single-thread performance is a sure way to lose as a CPU manufacturer. Look at Sparc: the dual-core UltraSPARC IV is a joke, it gets creamed in almost all benchmarks by POWER, IPF and x86. And predictably, Sun's Sparc-related business has gone downhill over the past few years.

Panajev2001a said:
IPF has also always been a whole manufacturing node behind x86 so far, and is now getting some of the best microprocessor designers in the industry (people from the EV7 and EV8 design teams) working on the next-generation IPF MPU.

To be fair, the 2.4GHz Opteron and the 3.2GHz P4 are on par with the 1.6GHz I2 in SPECint, all on 130nm. The I2, however, has almost double the die size (372mm^2) of the others.

And both AMD and Intel have world-class designers working on their next-gen x86 solutions (e.g. AMD's Fred Weber is an ex-Alpha architect).

Cheers
Gubbi
 
Loop unrolling has its place on x86 though, because the brain-dead branch prediction gets it wrong when it really shouldn't have to.
 
Gubbi said:
But giving up on single-thread performance is a sure way to lose as a CPU manufacturer. Look at Sparc: the dual-core UltraSPARC IV is a joke, it gets creamed in almost all benchmarks by POWER, IPF and x86. And predictably, Sun's Sparc-related business has gone downhill over the past few years.

For now. But it won't last: single-thread performance has been hitting the point of diminishing returns for some time now, and there is just no way to improve it much further. And the moment multi-threaded or stream-centered execution takes off, a multi-core design like the Xbox 360's or PS3's will go to the top. Which of those architectures is better will mostly depend on the target.

I think processors in general will evolve towards a few general-purpose cores and a lot of special-purpose ones (mostly FP/vector) in the near future.
 
Gubbi said:
But giving up on single-thread performance is a sure way to lose as a CPU manufacturer. Look at Sparc: the dual-core UltraSPARC IV is a joke, it gets creamed in almost all benchmarks by POWER, IPF and x86. And predictably, Sun's Sparc-related business has gone downhill over the past few years.

Actually, it's the opposite. First of all, Sun lost out because of Linux and commodity boxes. Even if you stuck high-end x86 or POWER chips in their servers, they'd still be too expensive. Sun was interested in selling Enterprise 4500s and 10000s, not rackmounted pizza boxes. But the rest of the world discovered that commodity hardware was so cheap that horizontal scalability was possible. This became especially true after Oracle introduced efficient, high-performance clustered databases.

But Sun's fundamental business is servers, and the types of applications that run on most Sun computers are *inherently* TLP: a very high number of clients demanding data in an inherently I/O-bound way, so any system that executes tons and tons of threads more efficiently is going to run those applications better.

If Sun manages to get a Niagara processor into a pizza-box rackmounted server, and if they stick Linux on it or make it cheap, they will destroy x86 solutions in database, web-serving, and file-serving performance.

It's throughput, not latency, that matters for those apps.
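The throughput-vs-latency point can be put in Little's law form: sustainable throughput equals concurrency divided by per-request latency. The numbers below are made up for illustration:

```python
# Little's law for a server: throughput = concurrency / latency.

def throughput(concurrent_requests, latency_seconds):
    """Requests per second sustainable at a given concurrency level."""
    return concurrent_requests / latency_seconds

# A core juggling 32 threads whose requests each spend 20 ms blocked on
# I/O sustains more requests/sec than a fast core handling one request
# at a time with 5 ms latency:
many_slow = throughput(32, 0.020)   # ~1600 req/s
one_fast = throughput(1, 0.005)    # ~200 req/s
print(many_slow, one_fast)
```

Which is exactly why a chip that holds many threads in flight can win on server workloads even when each individual thread runs slower.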
 
DemoCoder said:
Gubbi said:
But giving up on single-thread performance is a sure way to lose as a CPU manufaturer. Look at Sparc. The dual core UltraSparc IV is a joke, it gets creamed in almost all benchmarks by POWER, IPF and x86. And predictably SUN's Sparc-related business has gone downhill for the past years.

Actually, it's the opposite. First of all, Sun lost out because of Linux and commodity boxes.

Small and medium sized servers are being taken over by x86, yes. This is true for all server vendors.

But even if you look at high-end servers in isolation, you'll still see Sun has lost a LOT of market share from 1999 to now, to IBM (POWER) and HP (PA-RISC and IPF).

DemoCoder said:
If Sun manages to get a Niagara processor into a pizza-box rackmounted server, and if they stick Linux on it or make it cheap, they will destroy x86 solutions in database, web-serving, and file-serving performance.

Somewhat of a bold statement, since Niagara performance is a complete unknown, no?

The eight-core Niagara will be up against four-core Opterons and Xeons. Niagara doesn't support multi-chip configurations; Opterons and Xeons do. Time will tell who'll win.

BTW, latency is important in a lot of transaction-processing workloads.

Cheers
Gubbi
 