N00b said:
MfA said: Even ignoring SMT, with narrow issue width and prepare-to-branch instructions I don't think OOOE would add much with a decent architecture-optimized compiler ... unfortunately such compilers are mythical beasts.
You mean like Intel and HP said that Itanium would be such a simple and easy design because the compiler would be able to do all the optimizing ...
... and then ended up with a 3rd level cache?
A couple of years ago I read an article about the Itanium. I barely remember it, but what stuck with me is the conclusion that it is A BAD THING(tm) to rely on the compiler to keep the functional units of your processor busy. For the first-generation Itanium, reading data from main memory took about 50-100 clock cycles. So, they calculated, the compiler had to look ahead at least 300 clock cycles to feed all the functional units. Pretty tough, but somehow manageable. For the second-generation Itanium, with its increased clock speed, the compiler would have to look ahead *several thousand* clock cycles. That's simply not feasible. Of course the Itanium didn't live up to its theoretical performance. For the 3rd generation Intel bolted a huge 3rd-level cache onto the chip (among other things) to get decent performance out of it.
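For a sense of scale, the back-of-the-envelope arithmetic presumably looks like this (a sketch with illustrative numbers: the latency figure is from the post above, the 6-wide issue width is Itanium's real figure):

```c
#include <stdio.h>

int main(void)
{
    /* Illustrative numbers, not measured figures. */
    int mem_latency = 100; /* cycles for a main-memory read (1st-gen Itanium, per the post) */
    int issue_width = 6;   /* Itanium can issue up to 6 instructions per cycle */

    /* To keep every issue slot busy while one load is outstanding, the
       compiler must find this many independent instructions at compile time: */
    printf("independent instructions needed: %d\n", mem_latency * issue_width);
    return 0;
}
```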
Fox5 said:
Alstrong said:
Fox5 said: And the Pentium M does neither, yet can beat both the P4 and Itanium! Well, in a few areas anyhow; I believe it loses badly in most.
It was interesting that, clock for clock, it can match the Athlon 64s.
http://www.tomshardware.com/cpu/20050525/pentium4-10.html
Well, I believe it has a heck of a lot more transistors and a much bigger die than the Athlon 64s.
ralexand said: Wow, I had completely forgotten about the Itanium. Whatever happened to that chip?
Panajev2001a said:
N00b said:
Gubbi said: But how much more complex does an OOOE CPU have to be? The PPRO used 10% of the die area for the reorder buffer and schedulers. If it buys you more than 10% performance it's a win.
Cheers
Gubbi
Exactly. And since the number of transistors used for out-of-order functionality remains more or less constant, or grows only moderately, between processor generations while the total number of transistors grows significantly, it gets cheaper every generation.
Which is why Intel worked so long on a new architecture (IA-64) and put the two best CPU teams in the world (ex-EV7 and ex-EV8 guys) to work on this new architecture, the IPF line.
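Gubbi's break-even rule can be put as a one-line perf-per-area check (a sketch: the 10% area figure is from his post, the 25% speedup is a made-up example value):

```c
#include <stdio.h>

int main(void)
{
    double area_cost = 0.10; /* fraction of die spent on OOO logic (PPRO figure from the post) */
    double speedup   = 0.25; /* hypothetical performance gain from OOO execution */

    /* Performance per unit area relative to stripping the OOO logic out;
       greater than 1.0 exactly when speedup exceeds area_cost. */
    double ratio = (1.0 + speedup) / (1.0 + area_cost);
    printf("perf/area vs. in-order: %.2f (a win if > 1.0)\n", ratio);
    return 0;
}
```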
Fox5 said:
Panajev2001a said: Which is why Intel worked so long on a new architecture (IA-64) and put the two best CPU teams in the world (ex-EV7 and ex-EV8 guys) to work on this new architecture, the IPF line.
But if the P4 can accomplish similar theoretical performance per second, does so much more cheaply, and has higher actual performance, wouldn't it be the better design?
Panajev2001a said: Complex, large OOOe logic stays in your critical path and buys you nothing in terms of redundancy.
Gubbi said:
Panajev2001a said: Complex, large OOOe logic stays in your critical path and buys you nothing in terms of redundancy.
You pipeline your scheduling stage until it doesn't impact cycle time.
The real killer is register file read and result forwarding latency. And these are constant given the same amount of registers and execution units.
In an in-order CPU you need a large architected (visible) register file in order to statically schedule around latencies, the same amount of registers as you would find rename registers in an OOOE CPU.
Exactly how do you define the critical path? Branch mispredict penalty?
I would define the critical path as the schedule-execute loop (or issue-execute loop in the in-order case). That is, how fast can we get the next dependent instruction executed when the results it needs are ready.
If anything, an OOOE CPU has an advantage because instructions can be pre-scheduled before the results they need are available. Both the P4 and the Athlon have two-stage scheduling: a global ROB from which instructions are issued to local schedulers, either when the results the instruction needs are ready or when the global ROB knows the needed results will be produced in the local scheduler's attached exec unit (like the double-pumped ALUs in the P4 and the integer box in the Athlon).
By doing so, the local scheduler can intercept results on a local result bus sooner, resulting in better schedule-execute latency.
Cheers
Gubbi
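To make the schedule-execute loop concrete: on a chain of dependent single-cycle ops, the loop latency directly sets throughput. A toy calculation (illustrative numbers, not figures from any real core):

```c
#include <stdio.h>

int main(void)
{
    int chain_len = 100; /* dependent single-cycle instructions, each needing the previous result */

    /* Wakeup/select hidden by pre-scheduling: a dependent op executes every cycle. */
    int loop_fast = 1;
    /* Naive pipelined scheduler with no pre-scheduling: an extra cycle per hop. */
    int loop_slow = 2;

    printf("1-cycle schedule-execute loop: %d cycles\n", chain_len * loop_fast);
    printf("2-cycle schedule-execute loop: %d cycles\n", chain_len * loop_slow);
    return 0;
}
```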
Fox5 said: Hmm, I thought the advantage of an in-order CPU was massively higher theoretical performance than could be achieved by an out-of-order one. If there are only pluses to the out-of-order design, then why not go with it?
Panajev2001a said:
Gubbi said:
Panajev2001a said: Complex, large OOOe logic stays in your critical path and buys you nothing in terms of redundancy.
You pipeline your scheduling stage until it doesn't impact cycle time.
Or we leave that out and keep our pipeline shorter, which helps us reduce the branch misprediction penalty (which is already getting worse and worse) as well as saving power (it is not as if deep pipelining has no cost of its own).
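The pipeline-depth cost is easy to put in numbers (a sketch with made-up rates; real penalties depend on where in the pipe branches resolve):

```c
#include <stdio.h>

int main(void)
{
    double mispredicts_per_insn = 0.01; /* one mispredict per 100 instructions (assumed) */
    int short_pipe_penalty = 10;        /* cycles flushed on a mispredict, short pipe */
    int deep_pipe_penalty  = 30;        /* cycles flushed on a mispredict, deep pipe  */

    printf("short pipe: %.2f stall cycles per instruction\n",
           mispredicts_per_insn * short_pipe_penalty);
    printf("deep pipe:  %.2f stall cycles per instruction\n",
           mispredicts_per_insn * deep_pipe_penalty);
    return 0;
}
```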
Gubbi said: The real killer is register file read and result forwarding latency. And these are constant given the same amount of registers and execution units.
Too bad the number of physical registers hardly stayed the same going from the Pentium Pro to Prescott, and we have some new execution units too (MMX and SSE/SSE2/SSE3, which the Pentium Pro did not have to carry around).
Gubbi said: In an in-order CPU you need a large architected (visible) register file in order to statically schedule around latencies, the same amount of registers as you would find rename registers in an OOOE CPU.
A large register file which, being visible to the applications, can be used MUCH better by the compiler for unrolling loops, and which helps implement a saner ISA (one that maybe does not offer destructive instruction formats as your only choice).
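Here is the classic shape of what that buys you: with enough architected registers the compiler can unroll a loop into several independent accumulators and statically hide latency, no OOO hardware required (a minimal C sketch, not code from the thread):

```c
/* Dot product unrolled into four independent accumulators. Each
   accumulator is its own dependency chain, so the compiler can schedule
   the four multiply-adds to overlap and hide FP latency statically. */
double dot(const double *a, const double *b, int n)
{
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    int i;

    for (i = 0; i + 3 < n; i += 4) {
        s0 += a[i]     * b[i];
        s1 += a[i + 1] * b[i + 1];
        s2 += a[i + 2] * b[i + 2];
        s3 += a[i + 3] * b[i + 3];
    }
    for (; i < n; i++) /* leftover iterations */
        s0 += a[i] * b[i];

    return (s0 + s1) + (s2 + s3);
}
```

Each live accumulator costs an architected register; unroll further (as an in-order machine with long latencies wants you to) and you quickly see why a big visible register file matters.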
Panajev2001a said: No-one is denying that a very well done OOOe engine can buy you extra performance, since at run-time there are some things you are just in a better position to deal with: it is just that the more processors grow, the more difficult it gets to track single instructions (some architectures are moving to tracking instruction groups instead, even in the high-performance computing arena: see POWER4/POWER4+/POWER5/PowerPC 970).
Panajev2001a said: What I am arguing is that several fields are going to rely on multi-processing and heavily multi-threaded applications, and there, spending many transistors on supporting larger and larger instruction windows, getting more and more instructions in flight, etc. might not be your best bet. Both Intel with IPF and IBM with the XeCPU and the BPA-based CELL architecture are taking that bet.
Panajev2001a said: IPF has also always been a whole manufacturing node behind x86 so far, and it is now getting some of the best microprocessor designers in the industry (people from the EV7 and EV8 design teams) working on the next-generation IPF MPU.
Gubbi said: But giving up on single-thread performance is a sure way to lose as a CPU manufacturer. Look at Sparc. The dual-core UltraSparc IV is a joke; it gets creamed in almost all benchmarks by POWER, IPF and x86. And predictably, SUN's Sparc-related business has gone downhill over the past few years.
DemoCoder said:
Gubbi said: But giving up on single-thread performance is a sure way to lose as a CPU manufacturer. Look at Sparc. The dual-core UltraSparc IV is a joke; it gets creamed in almost all benchmarks by POWER, IPF and x86. And predictably, SUN's Sparc-related business has gone downhill over the past few years.
Actually, it's the opposite. First of all, Sun lost out because of Linux and commodity boxes.
DemoCoder said: If Sun manages to get a Niagara processor into a pizza-box rack-mounted server, and if they stick Linux on it, or make it cheap, they will destroy x86 solutions in database, web serving, and file serving performance.