It's a trade-off: while its performance per MHz dropped, its performance per watt doubled. Aggressive OoO is a huge power hog; dropping it allowed them to add other things that increased performance and functionality without increasing power.
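As a rough sketch of that trade-off, here are some purely hypothetical numbers (not measured POWER figures) plugged into the usual performance = IPC x clock identity:

    # Purely hypothetical numbers -- not measured POWER5/POWER6 data.
    ooo_ipc, ooo_ghz, ooo_watts = 1.0, 2.3, 100.0              # aggressive OoO core
    inorder_ipc, inorder_ghz, inorder_watts = 0.65, 4.7, 70.0  # simpler in-order core

    def perf(ipc, ghz):
        return ipc * ghz            # instructions per nanosecond, as a throughput proxy

    p_ooo, p_io = perf(ooo_ipc, ooo_ghz), perf(inorder_ipc, inorder_ghz)
    print("perf-per-clock ratio (in-order vs OoO):", round((p_io / inorder_ghz) / (p_ooo / ooo_ghz), 2))
    print("perf-per-watt  ratio (in-order vs OoO):", round((p_io / inorder_watts) / (p_ooo / ooo_watts), 2))
    # With numbers in this ballpark, performance per clock falls (IPC is lower)
    # while performance per watt roughly doubles -- the trade-off described above.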
If you want to claim that aggressive OoO is a huge power hog, that needs qualification. OoO doesn't appear to be a huge power hog in other example chips, in part because other factors can intrude.
I doubt they would have even got close to the performance of POWER6. P6 added a lot of extra stuff that has nothing to do with being in-order: the I/O system was all changed.
Massively increased bandwidth and lower interconnect latency are rather helpful for an in-order core, which has a much lower tolerance for latency. POWER6 without the enhanced infrastructure would suffer on all performance fronts; the cache size and enhanced I/O contribute a measurable amount to the performance of the chip.
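A toy CPI model makes the latency point concrete; every parameter here is a made-up illustrative value, with an arbitrary "fraction of miss latency hidden" standing in for what OoO, prefetch, or threading can overlap:

    # Toy CPI model -- every parameter below is an illustrative guess.
    def cpi(base_cpi, misses_per_insn, miss_latency, fraction_hidden):
        # fraction_hidden: how much of each miss's latency the core overlaps
        # with other useful work (0 for a simple in-order core).
        return base_cpi + misses_per_insn * miss_latency * (1 - fraction_hidden)

    misses_per_insn = 0.02
    for latency in (200, 400):                       # cycles to memory
        in_order = cpi(1.0, misses_per_insn, latency, fraction_hidden=0.0)
        ooo = cpi(0.7, misses_per_insn, latency, fraction_hidden=0.4)
        print(f"{latency} cycles: in-order CPI {in_order:.1f}, OoO CPI {ooo:.1f}")
    # Doubling memory latency costs the in-order core 4 extra cycles per
    # instruction here versus 2.4 for the OoO core, which is why faster
    # interconnect and bigger caches matter more to an in-order design.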
There were also a load of new instructions (Altivec, Decimal FP) as well as reliability features. If this had been added to POWER5+ it would have left little room for clock speed gains. P5+ didn't hit its clock speed goals as it was, so while the new process and design techniques would help, I can't see the same sorts of gains being made.
None of those things directly impacts clock speeds, however.
IBM's reasons for going the route it did are many, but if POWER5 had been given a number of the gifts POWER6 enjoys, the gap would not be all that great.
...and assuming you use the same compiler with the same level of tuning, and that it's not data-set sensitive.
POWER6 is privileged in that it has an entire platform (CPU, system, compiler, and OS stack) tailored specifically for it.
A number of competitors do not have that advantage.
One is designed to run largely single-threaded apps in a desktop box. The other is designed to run largely multithreaded apps and sit in a large multiprocessor box. The trade-offs involved are completely different, so comparisons of power and cost are meaningless. For example, P6 has a hefty I/O system needed to communicate with a large number of other processors; that alone means the power figures are going to be very different.
That does not mean POWER6 would be as impressive a performer if it didn't have those other compensating advantages.
There are also other factors: the high clock on P6 increases leakage, and 40% of its power goes to this. Intel runs a lower clock rate, so leakage is vastly reduced.
It can go with a lower clock rate because it is more efficient per clock. Both OoO and clock speed scaling hit diminishing returns when taken to the extreme.
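To put rough numbers on the leakage point, here is a first-order model (illustrative constants only, not actual POWER6 or Intel figures) with dynamic power ~ C*V^2*f and leakage rising super-linearly with supply voltage:

    # First-order power model -- the constants are illustrative only.
    def dynamic_power(c_eff, vdd, freq_ghz):
        return c_eff * vdd**2 * freq_ghz          # ~ a*C*V^2*f

    def leakage_power(leak_at_nominal, vdd, vdd_nominal=1.0, exponent=3.0):
        # Leakage current rises steeply with supply voltage (and temperature);
        # a cubic term stands in for that here.
        return leak_at_nominal * (vdd / vdd_nominal) ** exponent * vdd

    for label, vdd, ghz in [("high clock", 1.2, 4.7), ("lower clock", 1.0, 3.0)]:
        dyn = dynamic_power(c_eff=10.0, vdd=vdd, freq_ghz=ghz)
        leak = leakage_power(leak_at_nominal=20.0, vdd=vdd)
        print(f"{label}: dynamic {dyn:.0f} W, leakage {leak:.0f} W "
              f"({leak / (dyn + leak):.0%} of total)")
    # At the high-voltage, high-frequency point leakage is roughly 40% of the
    # total; backing off the clock (and the voltage needed to sustain it)
    # more than halves the leakage in absolute terms.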
Anyway, I note you quoted SPECint; how about quoting SPECfp rate? Very different story there.
In part that's due to a number of significant ISA and implementation peculiarities in both designs that make it hard to tease out in-order or OoO as the primary cause.
POWER6 will have been in development for years; it probably started before Xenon did. While there could have been some knowledge flow, I doubt it would have been of any significance, because the circuit design techniques will have come from the R&D division.
The alliance between Sony, IBM and Toshiba started design work on Cell in 2001.
At that point, IBM had released POWER4, and was in the thick of designing POWER5.
A good portion of the design effort between the three in-order chips from IBM would have overlapped. It is not a coincidence that all three designs had the same goal of minimizing the logic complexity of pipeline stages to the degree they did.
IBM did benefit by having Sony and Microsoft foot the bill for work that would help IBM everywhere else.
In the context in which I made that comment it was correct; that the same is true for superscalar wasn't relevant, and I clarified this.
An OoO scalar pipeline would have the same number of register ports as a scalar in-order one. A dual-pipeline OoO design has enough read ports for the two instructions it can issue at a time, the same as an in-order design, because neither OoO nor in-order needs more operands than the issue stage requires.
The width of the pipeline is the primary issue, as superscalar dependency checking and wide bypass networks scale quadratically with issue width.
A modest OoO implementation would not scale that badly.
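As a sketch of where that quadratic cost comes from, here is a simplified structural count (generic assumptions: two source operands and one destination per instruction, full bypassing; not tied to any real core) of how the pieces grow with issue width:

    # Simplified structural cost counts -- generic, not any particular design.
    def read_ports(width, srcs_per_insn=2):
        # Register-file read ports grow linearly with issue width,
        # whether the machine is in-order or OoO.
        return width * srcs_per_insn

    def dependency_comparators(width, srcs_per_insn=2):
        # Each later instruction in the issue group compares its sources
        # against every earlier instruction's destination: O(width^2).
        return sum(i * srcs_per_insn for i in range(width))

    def bypass_paths(width):
        # Full bypass: every producing slot forwards to both source inputs
        # of every consuming slot -- also quadratic in width.
        return width * width * 2

    for w in (1, 2, 4, 8):
        print(f"width {w}: read ports {read_ports(w)}, "
              f"dep comparators {dependency_comparators(w)}, "
              f"bypass paths {bypass_paths(w)}")
    # Read ports scale linearly with width; the cross-checks and bypass
    # network blow up quadratically. It's the width of the machine, not
    # OoO by itself, that drives those costs.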