DeanoC said:
I disagree; OoO only makes sense while it is possible to execute a single instruction stream faster than simple decoding allows. A single instruction stream usually has high data dependencies that the memory subsystems can't keep up with. OoO is basically too good at its job: it starves the data caches without breaking a sweat.
OOO makes sense as soon as you have latencies that you (or your compiler) have a hard time scheduling around.
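To illustrate the kind of latency I mean, here's a minimal C sketch (a hypothetical example of mine, not from the post above): a pointer-chasing loop where each load's address depends on the previous load, so neither the programmer nor the compiler can statically schedule around a cache miss, while an OoO core can keep independent work from the same stream in flight past it.

```c
/* Illustrative only: a dependent-load chain the compiler cannot reschedule,
 * because each node's address is only known after the previous load returns. */
struct node { struct node *next; int value; };

int sum_list(const struct node *n, const int *other, int count)
{
    int sum = 0, unrelated = 0;
    int i = 0;
    while (n) {
        sum += n->value;   /* load-to-use chain, serialized by memory latency */
        n = n->next;       /* next address depends on this load               */
        if (i < count)
            unrelated += other[i++];  /* independent work an OoO core can
                                         overlap with the outstanding miss;
                                         an in-order core stalls at the first
                                         dependent use instead               */
    }
    return sum + unrelated;
}
```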
DeanoC said:
Modern console processor designs have shifted to multiple instruction streams: at worst you can do a crude manual form of OoO (each thread running the same code at different points); at best you have totally different execution patterns that stress the cache systems in different ways.
That's not really OOO, but it's true that mixing the workload would probably result in better utilization.
DeanoC said:
Of course, ideally you would have lots of fast OoO cores and threads, but realistically, by spending the gates on lots of fast in-order cores, you achieve better overall results.
It's true that the P4's (Prescott's) scheduler is *huge*, but it can also hold 128 instructions in its global scheduling window, has 256 renaming registers (128 integer, 128 floating point), supports multiple thread contexts, and has 5 issue ports.
The next-gen Xbox CPU is rumoured to have 3 cores, each with 2 threads. If we design our OOO capabilities so that we can schedule around a 30-cycle level-2 cache hit latency (given contention from the other CPUs and the target speed, that is likely IMO), then with our 2-way superscalar core we need to sustain 60 instructions in flight. If we only have 2 issue ports in our scheduler, one for integer and one for SIMD instructions, we need 30 extra registers of each type.
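Just to spell the arithmetic out, here's a back-of-the-envelope sketch in C using Little's law; the numbers are the assumptions above (30-cycle latency, 2-wide issue, 2 ports), not measurements of any real part.

```c
#include <stdio.h>

/* Back-of-the-envelope sizing via Little's law:
 * instructions in flight = issue rate * latency to cover.            */
int main(void)
{
    const int l2_hit_latency = 30;  /* assumed cycles, with contention */
    const int issue_width    = 2;   /* 2-way superscalar               */
    const int issue_ports    = 2;   /* one integer port, one SIMD port */

    int in_flight = issue_width * l2_hit_latency;       /* 2 * 30 = 60 */
    int extra_regs_per_type = in_flight / issue_ports;  /* 60 / 2 = 30 */

    printf("instructions in flight: %d\n", in_flight);
    printf("extra rename registers per type: %d\n", extra_regs_per_type);
    return 0;
}
```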
This is very close to the capabilities we see in the Pentium Pro/2/3, with its 40-instruction scheduling window, 48 renaming registers (mind you, for only 8 architected registers) and 3 issue ports. The ROB and scheduler for the original PPro only took up 10% of the total die area; in later revisions, with various SIMD execution units tacked on, even less.
So my opinion is that it is possible to make a fast and narrow OOO-capable CPU where the (limited) scheduler takes up less than 10% of the total core area, and the performance advantage far exceeds 10% compared to an in-order CPU.
All IMO, of course.
Cheers
Gubbi