I wouldn't be so sure about that. In floating point SIMD I'd expect a Xenon to demolish even a Woodcrest.
Number of ops is about the same: 8 per cycle at comparable clock speeds. Woodcrest has an advantage in that muls and adds can be scheduled independently (and of course it's OOO).
Xenon's SIMD engines have 128 registers and while it doesn't have a huge cache it gets around added latency by using 2 threads.
The cache isn't big because it's being accessed by 4 processors (the 3 cores and the GPU) and can do things like cache locking.
Having two threads helps keep utilization up, but in effect turns your 3.2GHz CPU into 2 1.6GHz CPUs.
The caches aren't big, because the die diet ate the rest.
But really low power devices like Phones and PDAs all use in-order processors. High throughput chips like GPUs (which completely destroy CPUs) are also in-order.
They are also scalar CPUs with low operating frequencies, using only a few (1-4) mm^2 of die space. You wouldn't want to stick 100 ARM 7 cores on a die, would you?
As I said, OOO is useful for certain types of workload.
100% agree.
What complicates matters is that OOO is also a useful bandaid for designs which have small numbers of registers, e.g. x86. Without OOO you'd lose all the rename registers and performance would likely plummet (see VIA C3 benchmarks). In that case OOO probably does save power since it's boosting performance so much.
However, PowerPC has always had 32 registers, so it doesn't need OOO quite so much and OOO doesn't have as much of an effect. According to IBM's figures OOO only boosts performance by 30-40%.
You have a good point there: x86 benefits a lot more from an aggressive microarchitecture than more straight-up RISC architectures do. But 30-40% is still an astonishing amount of performance, especially for the modest amount of silicon real estate you have to spend to get it.
That said the PPE is NOT a pure in-order machine, it does OOO loads...
Non-blocking loads have been in in-order CPUs for a long time. The PPE/360 CPU also does delayed execution, which is a limited form of OOO, in the FP (+SIMD) pipeline, similar to G3/G4s.
If history goes back to 1998, yes. Before that the in-order Alpha was outgunning the out-of-order PA-RISC. Before the Alpha the fastest CPUs were all huge multi-chip things, the fastest of the fast being Cray's machines, all high clocked in-order designs, all beating the living S**t out of IBM's OOO mainframes.
You make it sound so dramatic, "outgunning".
At the end of 1996, 400MHz Alpha 21164s scored 10.1 SPECint95 (baseline) vs. 9.43 for 160MHz PA-RISC 8000s. In the same time frame 200MHz R10000s (also OOO) scored 10.7. In my world that is neck and neck.
Heck, when the Pentium Pro debuted it was king of the hill in SPECint for two months, faster than the Alphas of the day. That should tell you something about the usefulness of OOO.
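To put those scores on a per-clock footing, here is a quick back-of-the-envelope sketch in Python. It only uses the scores and clock speeds quoted above. The "score per 100 MHz" metric and the function name are my own illustration, not an official SPEC figure, and the metric ignores everything else that differs between these systems (caches, memory, compilers).

```python
# SPECint95 baseline scores and clock speeds as quoted in the post (late 1996).
# "Score per 100 MHz" is an illustrative metric, not an official SPEC number.
chips = {
    "Alpha 21164 (in-order)": (10.1, 400),
    "PA-RISC 8000 (OOO)": (9.43, 160),
    "R10000 (OOO)": (10.7, 200),
}

def score_per_100mhz(score, mhz):
    """Normalise a benchmark score to 100 MHz of clock speed."""
    return score / mhz * 100

for name, (score, mhz) in chips.items():
    print(f"{name}: {score_per_100mhz(score, mhz):.2f} per 100 MHz")
```

On these numbers the two OOO chips get roughly twice as much done per clock as the in-order Alpha, which is how they end up neck and neck at half the frequency or less.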
But that was back then. Cycle times today are a decimal order of magnitude smaller, while memory latency is lower by a factor of two.
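A rough sketch of what that means in cycles, in Python. The only inputs taken from the post are the two ratios (cycle time down 10x, memory latency down 2x). The absolute 1996-era values below are hypothetical placeholders picked to make the arithmetic concrete, not measurements.

```python
def latency_in_cycles(latency_ns, cycle_ns):
    """How many CPU cycles pass while waiting on one memory access."""
    return latency_ns / cycle_ns

# Hypothetical starting point: a 200 MHz part (5 ns cycle) and 100 ns to memory.
old_cycle_ns = 5.0
old_latency_ns = 100.0

# Ratios from the post: cycle time 10x smaller, memory latency 2x lower.
new_cycle_ns = old_cycle_ns / 10
new_latency_ns = old_latency_ns / 2

old_stall = latency_in_cycles(old_latency_ns, old_cycle_ns)
new_stall = latency_in_cycles(new_latency_ns, new_cycle_ns)
growth = new_stall / old_stall

print(f"then: {old_stall:.0f} cycles, now: {new_stall:.0f} cycles, growth: {growth:.0f}x")
```

Whatever starting values you plug in, the growth factor comes out as 10 / 2 = 5: a memory miss costs about five times as many cycles as it did back then.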
It's also incredibly complex and needs to run very fast, i.e. it gets hot. 8% of a die may not sound like that much, but consider that more than half of the die is taken up by cache, which only uses a few percent of the CPU's power budget. Being small doesn't mean it's not a potential problem.
It's not 8% of a die, it's 8% of a core (level 1 caches are counted as part of the core). For a K8 die with 1MB L2 cache the core only takes up 30% of the die area. So it's 8% of 30%!!!
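Spelled out as a quick Python sketch, using only the two percentages above (the variable names are mine):

```python
# Figures from the post: OOO logic is 8% of a core, and the core is 30% of a
# K8 die with 1MB of L2 cache.
ooo_share_of_core = 0.08
core_share_of_die = 0.30

# Share of the whole die spent on OOO logic.
ooo_share_of_die = ooo_share_of_core * core_share_of_die

print(f"OOO logic is about {ooo_share_of_die * 100:.1f}% of the die")
```

So on those figures the OOO machinery comes to roughly 2.4% of the total die area.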
IPC is generally limited by the code, not the hardware. The average IPC you can extract from code is around 2, exactly what the PPE and SPEs were designed for. In reality, however, IPC is usually lower.
IPC is almost always limited by the memory subsystem these days, be it bandwidth or latency.
440, 750, 970, POWER5 and probably several others besides, if OOO was that important they would have got it.
The 440, G3 and G4 have a limited form of OOO, the same as the X360 CPU and PPE have in their FP units. It's called delayed execution and can only handle the latencies incurred by these execution units (6-7 cycles tops).
Power4 and the derived PPC 970 were the first really aggressive OOO superscalar designs from IBM.
OOO was dropped because of space and power concerns, and because the workload (SIMD floating point) doesn't benefit from it much, if at all.
Do you have a source for that? Because other than FFTs and single precision Linpack I haven't seen any workload where FP throughput is anywhere near peak.
Yes, but the PPE/Xenon's integer core was from an older project and was a plug-in they could both use.
Erh, so now you're saying that they _are_ the same?
Cheers