About some info from the XB2 leaked documents.

Urian

Regular
I have seen a mistake in people's analysis.

The supposedly leaked document says that the CPU is 3 PowerPC cores capable of doing 6 instructions per clock cycle.

A lot of people talk about a version of the PowerPC 970MP with 3 cores inside.

But my PowerPC 970 can do 5 instructions per clock cycle; I don't believe that 3x PPC 970 could do only 6 instructions per clock cycle. And because the PowerPC 440 does 2 instructions per cycle, I believe one of two things:

Either the leaked document is a fake, or it is genuine and the analysts of the leaked document made a mistake in the CPU analysis.
 
Urian said:
I have seen a mistake in people's analysis.

The supposedly leaked document says that the CPU is 3 PowerPC cores capable of doing 6 instructions per clock cycle.

A lot of people talk about a version of the PowerPC 970MP with 3 cores inside.

But my PowerPC 970 can do 5 instructions per clock cycle; I don't believe that 3x PPC 970 could do only 6 instructions per clock cycle. And because the PowerPC 440 does 2 instructions per cycle, I believe one of two things:

Either the leaked document is a fake, or it is genuine and the analysts of the leaked document made a mistake in the CPU analysis.

And you couldn't post that on, say, the thread with the mentioned leaked document, discussion, flame war and arguments because.........?




(I've been a bit of a pain lately, haven't I? It's just that people seem to feel the need to open new threads on what's been discussed 23210 times before, thinking there are new and exciting points to be discussed that for some reason do not fit in the 88923 threads that deal with the issues.)
 
It's not a matter of instructions issued per clock.
The Xenon CPU should be able to run 2 threads per core.
While one thread is running, the other one is 'sleeping'.
Each core should be capable of switching threads whenever it needs to hide latencies such as L2 cache misses.
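To make that concrete, here is a toy C++ sketch of the switch-on-event idea; the miss pattern and penalty are numbers I made up for illustration, not anything from a spec:

#include <cstdio>

// Toy model of switch-on-event multithreading: one thread owns the pipeline
// until it misses in the L2, then the core swaps in the other context.
struct ThreadContext {
    int pc = 0;        // program counter of this hardware thread
    long retired = 0;  // instructions completed so far
};

int main() {
    ThreadContext thread[2];
    int active = 0;               // which thread owns the pipeline
    const int kMissPenalty = 40;  // made-up L2 miss latency, in cycles
    int stallUntil[2] = {0, 0};   // cycle at which each thread's miss resolves

    for (int cycle = 0; cycle < 1000; ++cycle) {
        if (cycle < stallUntil[active])
            continue;                       // both contexts waiting on memory
        ++thread[active].pc;                // issue one instruction this cycle
        ++thread[active].retired;
        if (thread[active].pc % 10 == 0) {  // pretend every 10th one misses
            stallUntil[active] = cycle + kMissPenalty;
            active ^= 1;                    // the 'event': swap in the other thread
        }
    }
    std::printf("thread 0 retired %ld, thread 1 retired %ld\n",
                thread[0].retired, thread[1].retired);
}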

ciao,
Marco
 
I'm still surprised that multiple cores in a console can be clocked to such speeds by 2005. Technical question coming: this is supposedly possible by excluding from the CPU 'many functions that are not necessary for a game console'. So can anyone tell me just what percentage of the latest, god-awful expensive Athlon/Pentium processor is really not necessary to run Doom 3/Half-Life 2? Examples of such 'unnecessary' functions or instructions?
 
passerby said:
I'm still surprised that multiple cores in a console can be clocked to such speeds by 2005. Technical question coming: this is supposedly possible by excluding from the CPU 'many functions that are not necessary for a game console'. So can anyone tell me just what percentage of the latest, god-awful expensive Athlon/Pentium processor is really not necessary to run Doom 3/Half-Life 2? Examples of such 'unnecessary' functions or instructions?

That's quite an interesting question, but I'm not sure anyone will be able to tell what is "asleep" and unused in CPUs (and therefore expendable in consoles, I guess) while running games like Doom 3 and HL2.
 
passerby said:
I'm still surprised that multiple cores in a console can be clocked to such speeds by 2005. Technical question coming: this is supposedly possible by excluding from the CPU 'many functions that are not necessary for a game console'. So can anyone tell me just what percentage of the latest, god-awful expensive Athlon/Pentium processor is really not necessary to run Doom 3/Half-Life 2? Examples of such 'unnecessary' functions or instructions?

Out-of-order execution units.
All that decoding of rubbish instruction streams into micro-ops is what consumes vast amounts of silicon in the G5, P4, and Athlon 64.

A fast in-order execution unit is a magnitude smaller than a fast out-of-order execution unit. Hence you can pack N of them in the same space.
 
nAo said:
It's not a matter of instructions issued per clock.
The Xenon CPU should be able to run 2 threads per core.
While one thread is running, the other one is 'sleeping'.
Each core should be capable of switching threads whenever it needs to hide latencies such as L2 cache misses.

ciao,
Marco

According to the specs (the leaked ones), the cores have been enhanced with SMT capabilities.

With Simultaneous Multi-Threading (under whatever nickname you find it nowadays... for example, Intel calls it Hyper-Threading) you can have more than one thread active and running at any given time (the trace cache's decoded u-ops have a thread ID as well as other info attached to them), or you can have a single thread using up all the processor resources.

What you described, AFAIK, is something known as switch-on-event Multi-Threading (in your example you are switching threads on L2 cache misses).
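A tiny C++ sketch of the difference, with everything (width, readiness, the interleaving) invented for illustration: under SMT the issue slots can be filled from either thread in the same cycle, rather than only from one 'active' thread.

#include <cstdio>
#include <queue>

// Toy SMT issue stage: u-ops from both threads share the scheduling window,
// and both threads can issue in the same cycle. All numbers are invented.
struct Uop {
    int threadId;  // which hardware thread this u-op belongs to
    int ready;     // first cycle at which it may issue
};

int main() {
    std::queue<Uop> window;            // shared window, in arrival order
    for (int i = 0; i < 20; ++i)       // interleave u-ops from two threads
        window.push({i % 2, i / 4});

    const int kWidth = 2;              // two issue slots per cycle
    for (int cycle = 0; !window.empty(); ++cycle) {
        for (int slot = 0; slot < kWidth && !window.empty(); ++slot) {
            if (window.front().ready > cycle)
                break;                 // head u-op not ready, try next cycle
            Uop u = window.front();
            window.pop();
            std::printf("cycle %d, slot %d: u-op from thread %d\n",
                        cycle, slot, u.threadId);
        }
    }
}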
 
Thanks for the fast reply, Deano. :eek:

So that means games don't benefit much from OoO at all. Another question. :oops: Is it because OoO is more beneficial for desktop systems running multiple applications instead? As opposed to only one game application running on a console.

Thanks!
 
DeanoC said:
Out-of-order execution units.
All that decoding of rubbish instruction streams into micro-ops is what consumes vast amounts of silicon in the G5, P4, and Athlon 64.

A fast in-order execution unit is a magnitude smaller than a fast out-of-order execution unit. Hence you can pack N of them in the same space.

A binary order of magnitude smaller, but also a binary order of magnitude slower.

You'd want to at least be able to schedule around the latencies of the on-die caches. And with 3 cores those latencies are not going to be predictable (i.e. hard to hand/statically schedule).

Anyway, implement the context-tracking apparatus associated with SMT and you're more than halfway towards OOOE.

A simple 2-stage scheduler would suffice (like in Power 4/5 and K7/8). Have a global scheduling window holding groups of instructions.

From there, instructions get sent to either the VMX box or the other box. The VMX box executes all SIMD instructions, the other box all integer and load/store ops.

Each of the schedulers in these boxen has only one issue port and hence is simple and fast.

Retire 1 group at a time (like Power 4/5).
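A rough C++ model of that arrangement, with made-up mnemonics and timings, just to show the flow from group dispatch to the two single-ported clusters:

#include <cstdio>
#include <queue>

// Rough model of the two-cluster scheme above: a dispatched group is split
// between the VMX box and the other box, each with one issue port.
enum class Kind { Simd, IntOrLoadStore };
struct Op { Kind kind; const char* name; };

int main() {
    // One "group" of instructions from the global scheduling window.
    const Op group[] = {
        {Kind::Simd, "vmaddfp"},          // SIMD multiply-add -> VMX box
        {Kind::IntOrLoadStore, "lwz"},    // load word -> other box
        {Kind::IntOrLoadStore, "add"},    // integer add -> other box
        {Kind::Simd, "vperm"},            // SIMD permute -> VMX box
    };

    std::queue<Op> vmxBox, otherBox;      // one issue port each
    for (const Op& op : group)
        (op.kind == Kind::Simd ? vmxBox : otherBox).push(op);

    // Each cycle, each box issues at most one op through its single port.
    for (int cycle = 0; !vmxBox.empty() || !otherBox.empty(); ++cycle) {
        if (!vmxBox.empty()) {
            std::printf("cycle %d, VMX box:   %s\n", cycle, vmxBox.front().name);
            vmxBox.pop();
        }
        if (!otherBox.empty()) {
            std::printf("cycle %d, other box: %s\n", cycle, otherBox.front().name);
            otherBox.pop();
        }
    }
    std::printf("group retired\n");       // retire one group at a time
}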

Cheers
Gubbi
 
It is all a game of balancing things out.

True, you can put more in-order execution units in and design a fast chip, but you can do it also with OOO units (see the now-deceased Alpha EV8, which was supposed to be an 8-way super-scalar processor, or look at the record-breaking POWER5 processor by IBM).

Yes, a fast in-order unit will take less logic and can be run at a higher clock speed, but will it sit idle all the time because memory latency is very high? Sure, you can fight back with MT (like switch-on-event MT, switching on L2-L3 cache misses), but you might need more than that to enhance single-thread performance.

OOOe is not a big need for IA-64 right now because, thanks to Intel's manufacturing capability, they can afford immense amounts of low-latency SRAM-based cache (all three levels of cache are on-chip), so they can worry less about main RAM latency and FSB speed being bottlenecks.

In Montecito they are talking about more than 24 MB of cache on-chip; when they add to that dual processing cores and MT, OOOe will not seem like a huge need.

Designing a wide and fast OOOe CPU like IBM does with their POWER line and like Alpha did is not easy, but neither is designing a wide and fast in-order CPU.

It took Intel a decade of HARD work with HP to complete the IA-64 ISA and the necessary compiler technology: they had a good idea... let's move complexity from the hardware to the software (compilers) as much as we can.

That idea took a LONG time before starting to show its benefits, even though Intel and HP had some of their best engineers working on it for several years and a quite large budget available.

Taking a middle-ground approach between the strategy behind POWER5 (and its successors) and Itanium 2 (and its successors) seems far easier in theory than in practice, and has so far resulted in the stale SPARC architecture.

I do not know why you put the Pentium 4 in the penalty box for x86 instruction decoding, DeanoC: it has only one x86 decoder, which feeds the trace cache, and from that point forward the CPU core feeds off basically all pre-decoded u-ops (decoding was taken off the critical path with the Pentium 4: the trace cache was one of the decisions I really liked about the design, to tell you the truth).
 
Gubbi said:
these boxen

gAh!

I SO detest when people pluralize boxes into boxen, it's simply WRONG.

Wrong in the sense that one does NOT write red text on a purple background, or mix milk and orange juice.

It's just something the universe has decided we're NOT SUPPOSED TO DO!
 
Panajev2001a said:
I do not know why you put the Pentium 4 in the penalty box for x86 instruction decoding, DeanoC: it has only one x86 decoder, which feeds the trace cache, and from that point forward the CPU core feeds off basically all pre-decoded u-ops (decoding was taken off the critical path with the Pentium 4: the trace cache was one of the decisions I really liked about the design, to tell you the truth).

Because I was simplifying things; having a trace cache is itself an OoO-style system, if not technically then in the abstract. There are 2 schools of thought:
1) In Order. Take a bunch of instructions and execute them in turn, doing as little as possible to each instruction.
2) Out of Order. Take a bunch of instructions and re-order them in some way to increase throughput.

The P4 is definitely a type 2; console CPUs are (almost) always type 1.

Why this is a win on consoles and not on PCs is complex, but it basically comes down to the fact that the console CPU is a fixed target. It never has to handle a bad instruction stream (unless written by a bad ASM programmer ;-) ).
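A trivial, invented illustration of the two schools on a load-use dependency:

// An in-order (type 1) core executes this exactly as written and stalls
// between each load and the multiply that consumes it:
float dotNaive(const float* a, const float* b, int n) {
    float sum = 0.0f;
    for (int i = 0; i < n; ++i)
        sum += a[i] * b[i];        // load, load, wait, multiply-accumulate
    return sum;
}

// An out-of-order (type 2) core overlaps iterations on its own. For an
// in-order core, you (or the compiler) express the overlap explicitly,
// e.g. with two independent accumulator chains:
float dotScheduled(const float* a, const float* b, int n) {
    float s0 = 0.0f, s1 = 0.0f;    // independent chains stay in flight together
    int i = 0;
    for (; i + 1 < n; i += 2) {
        s0 += a[i] * b[i];         // no dependency between these two
        s1 += a[i + 1] * b[i + 1]; // statements, so neither waits on the other
    }
    if (i < n)
        s0 += a[i] * b[i];         // odd tail element
    return s0 + s1;
}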
 
DeanoC said:
Because I was simplifying things; having a trace cache is itself an OoO-style system.

The P4 trace cache is there to speed up instruction fetch/decode by reusing previous decodings. It's perfectly possible to build a CPU that has a trace cache but issues and executes those instructions in-order. Nobody has done it because you always pick the low-hanging fruit first, and adding OOO execution is *a lot* lower-hanging than adding a trace cache.

DeanoC said:
1) In Order. Take a bunch of instructions and execute them in turn, doing as little as possible to each instruction.
2) Out of Order. Take a bunch of instructions and re-order them in some way to increase throughput.

The P4 is definitely a type 2; console CPUs are (almost) always type 1.

Both the XBOX and the Gamecube have OOOE CPUs.

DeanoC said:
Why this is a win on consoles and not on PCs is complex, but it basically comes down to the fact that the console CPU is a fixed target. It never has to handle a bad instruction stream (unless written by a bad ASM programmer ;-) ).

That is one of the reasons; another is that software running on consoles is execution-unit-centric. The most demanding parts of a game are typically found in rather small kernels crunching physics, particle systems or vertices. So there's definitely less use for advanced OOOE capabilities than in a PC executing Microsoft Spaghetti.
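For instance, the hot path is often something as small and regular as this made-up particle integration step:

// A made-up example of the small, branch-free kind of kernel games spend
// their CPU time in: a few streams in, one stream out.
struct Vec3 { float x, y, z; };

void integrateParticles(Vec3* pos, Vec3* vel, int count, float dt) {
    for (int i = 0; i < count; ++i) {
        vel[i].y -= 9.81f * dt;     // apply gravity to the velocity
        pos[i].x += vel[i].x * dt;  // step the position forward
        pos[i].y += vel[i].y * dt;
        pos[i].z += vel[i].z * dt;
    }
}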

However, this is likely to change. Next-generation consoles have 2 features that benefit from a self-scheduling ability:

1.) Larger relative latency to memory arrays. Not only main memory, but also the on-die caches are now tens of cycles away.
2.) Multiple cores/threads. These will contend for memory arbitration, making what might have been predictable latencies completely unpredictable.

Shared memory systems compound 2.) by adding the GPU's contention for main memory.

Cheers
Gubbi
 
Panajev2001a said:
nAo said:
It's not a matter of instructions issued per clock.
The Xenon CPU should be able to run 2 threads per core.
While one thread is running, the other one is 'sleeping'.
Each core should be capable of switching threads whenever it needs to hide latencies such as L2 cache misses.

ciao,
Marco

According to the specs (the leaked ones), the cores have been enhanced with SMT capabilities.

With Simultaneous Multi-Threading (under whatever nickname you find it nowadays... for example, Intel calls it Hyper-Threading) you can have more than one thread active and running at any given time (the trace cache's decoded u-ops have a thread ID as well as other info attached to them), or you can have a single thread using up all the processor resources.

What you described, AFAIK, is something known as switch-on-event Multi-Threading (in your example you are switching threads on L2 cache misses).

Historical flashback: the acronym SMT originated as Symmetrical Multi-Threading. Some CPU vendors, though, whose multi-threading cores were not quite symmetrical, changed the acronym into "Simultaneous Multi-Threading" (e.g. Intel's HyperThreading is "simultaneous MT"). A true SMT system in the sense of a symmetrical MT system should behave identically, or very nearly so, to an SMP system, i.e. it should be able to carry out multiple threads w/o the threads blocking each other through (implicitly) "mutexed" CPU resources.
 
In the current generation the Xbox has an OoO CPU, while the PS2 has an in-order CPU. So PS2 game developers have to think about instruction ordering much more than Xbox devs do.

Sony developer support said in a GDC presentation that many PS2 games have trouble getting decent performance out of the CPU. But they say that tools for showing CPU stalls are improving, and that the AAA titles (like Jak and Daxter) are able to get decent CPU utilization.
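A contrived C example of what that ordering work amounts to (purely illustrative, not from any Sony material): start the next element's load before using the current one, so the arithmetic covers the load latency.

// Hand-scheduling of the sort an in-order CPU pushes onto the programmer:
// software-pipeline the loads so arithmetic hides their latency.
void scaleInPlace(float* data, int n, float k) {
    if (n <= 0) return;
    float current = data[0];         // prime the pipeline with the first load
    for (int i = 0; i + 1 < n; ++i) {
        float next = data[i + 1];    // issue the next load *before* the use
        data[i] = current * k;       // multiply while that load is in flight
        current = next;
    }
    data[n - 1] = current * k;       // drain the final element
}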
 
Gubbi said:
The P4 trace cache is there to speed up instruction fetch/decode by reusing previous decodings. It's perfectly possible to build a CPU that has a trace cache but issues and executes those instructions in-order. Nobody has done it because you always pick the low-hanging fruit first, and adding OOO execution is *a lot* lower-hanging than adding a trace cache.
Almost by definition an in-order processor has a simple instruction decoder, else it stalls badly. I guess a trace cache could be used to fit a complex ISA (say x86) to an in-order back end.

Gubbi said:
However, this is likely to change. Next-generation consoles have 2 features that benefit from a self-scheduling ability:

1.) Larger relative latency to memory arrays. Not only main memory, but also the on-die caches are now tens of cycles away.
2.) Multiple cores/threads. These will contend for memory arbitration, making what might have been predictable latencies completely unpredictable.

Shared memory systems compound 2.) by adding the GPU's contention for main memory.
I disagree. OoO only makes sense while it's possible to execute a single instruction stream faster than simple decoding allows. A single instruction stream usually has high data dependencies that the memory subsystem can't keep up with. OoO is basically too good at its job: it starves the data caches without breaking a sweat.
Modern console processor designs have shifted to multiple instruction streams. In the worst case you can do a crude manual form of OoO (each thread running the same code at different points); at best you have totally different execution patterns that stress the cache systems in different ways.

Of course, ideally you would have lots of fast OoO cores and threads, but realistically, spending the gates on lots of fast in-order cores achieves better overall results.
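A sketch of that "crude manual form", with std::thread standing in for whatever native threading API a console actually exposes:

#include <thread>

// The same kernel runs on two hardware threads at different points in the
// data, so their cache misses interleave instead of serializing.
static void scaleRange(float* data, int begin, int end, float k) {
    for (int i = begin; i < end; ++i)
        data[i] *= k;                // each thread stalls at different times
}

void scaleParallel(float* data, int n, float k) {
    std::thread other(scaleRange, data, n / 2, n, k);  // second half
    scaleRange(data, 0, n / 2, k);   // first half on the current thread
    other.join();                    // wait for the second half to finish
}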
 
darkblu said:
Historical flashback: the acronym SMT originated as Symmetrical Multi-Threading. Some CPU vendors, though, whose multi-threading cores were not quite symmetrical, changed the acronym into "Simultaneous Multi-Threading" (e.g. Intel's HyperThreading is "simultaneous MT"). A true SMT system in the sense of a symmetrical MT system should behave identically, or very nearly so, to an SMP system, i.e. it should be able to carry out multiple threads w/o the threads blocking each other through (implicitly) "mutexed" CPU resources.

The only way you get symmetrical multithreading is if you replicate everything. When you do that on one die it's called CMP (chip multi-processing); examples are Power 4/5 and the upcoming multicore chips from Intel and AMD. If it's not on one die it's called SMP, symmetrical multi-processing.

The only multi-threading CPU preceding the P4 is IBM's Northstar, which was used in their AS/400 product line, and that was switch-on-event (the event being a level 2 cache miss). IBM dubbed that DMT (dual multi-threading).

So SMT has always meant simultaneous multi-threading, the S in SMT indicating that a CPU can have instructions from different thread contexts in the same pipeline stage simultaneously. This is the way Intel uses it in the P4 documentation, and it is the way it was originally disclosed in the Alpha EV8 descriptions.

Cheers
Gubbi
 
DeanoC said:
I disagree. OoO only makes sense while it's possible to execute a single instruction stream faster than simple decoding allows. A single instruction stream usually has high data dependencies that the memory subsystem can't keep up with. OoO is basically too good at its job: it starves the data caches without breaking a sweat.
OOO makes sense as soon as you have latencies that you (or your compiler) have a hard time scheduling around.

DeanoC said:
Modern console processor designs have shifted to multiple instruction streams. In the worst case you can do a crude manual form of OoO (each thread running the same code at different points); at best you have totally different execution patterns that stress the cache systems in different ways.
That's not really OOO, but it's true that mixing the workload would probably result in better utilization.

DeanoC said:
Of course, ideally you would have lots of fast OoO cores and threads, but realistically, spending the gates on lots of fast in-order cores achieves better overall results.

It's true that the P4's (Prescott's) scheduler is *huge*, but it's also capable of holding 128 instructions in its global scheduling window, with 256 renaming registers (128 integer, 128 floating point), support for multiple thread contexts, and 5 issue ports.

The next-gen XBOX CPU is rumoured to have 3 cores, each with 2 threads. Say we design our OOO capabilities so that we can schedule around a 30-cycle level 2 cache hit latency (given contention from the other CPUs and the target speed, that is likely IMO). With a 2-way superscalar core we then need to sustain 60 instructions in flight, and if we have only 2 issue ports in our scheduler, one for integer and one for SIMD instructions, we need 30 extra registers of each type.
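Spelling that arithmetic out (all numbers are the assumptions above):

#include <cstdio>

// The back-of-the-envelope numbers above, just multiplied out.
int main() {
    const int latency = 30;                // assumed L2 hit latency, cycles
    const int width = 2;                   // 2-way superscalar issue
    const int inFlight = latency * width;  // 60 instructions to cover a miss
    const int extraRegsPerType = inFlight / 2;  // split across the two ports:
                                                // 30 integer + 30 SIMD results
                                                // each need a renaming register
    std::printf("%d in flight, %d extra registers per type\n",
                inFlight, extraRegsPerType);
}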

This is very close to the capabilities we see in the Pentium Pro/2/3, with its 40-instruction scheduling window, 48 renaming registers (mind you, for only 8 architected registers) and 3 issue ports. The ROB and scheduler of the original PPro only took up 10% of the total die area; in later revisions, with various SIMD execution units tacked on, even less.

So my opinion is that it is possible to make a fast and narrow OOO-capable CPU where the (limited) scheduler takes up less than 10% of the total core area. And the performance advantage far exceeds 10% compared to an in-order CPU.

All IMO, of course.

Cheers
Gubbi
 