Xenon= Modified G5 Triple Core Theory

Gubbi · May 30, 2006

Urian said:
Sorry for saying this, but the PPE/Xenon seems too large for its technical specs as a processor.

Or maybe people overestimate how much silicon is actually used for the OOO apparatus.

The big execution units are similar in the two architectures: VMX arithmetic and VMX permute.

The PPE/Xenon has two contexts instead of one; in Xenon's case that's 256 128-bit VMX registers plus the FP and integer ones. On top of that, they are more aggressively pipelined. More pipe stages mean more latches between pipe stages.

Cheers

Panajev2001a · May 30, 2006

Urian said:
The ADC documentation that I have says that the G5 can manage 5 instructions per cycle, but that this depends on the compiler.

My theory is simple. I know that you cannot remove the OOOE unit like taking the chassis off a car, but the mere 12-14 months of development and the transistor count similar to a 970 core made me think that perhaps IBM put an artificial penalty on the G5 core and reworked it to create the first generation of the PPE.

Sorry for saying this, but the PPE/Xenon seems too large for its technical specs as a processor.

Urian, the basis for the PPE core and thus the Xenon/Waternoose core is a project that IBM had started a lot earlier than these projects began.

MS and SCE had different objectives for their processors: MS wanted more L2 cache, a better VMX implementation and some custom instructions to better adapt it to their graphics APIs and to collaborate with the GPU; SCE wanted a lower-penalty SMT implementation with less of a speed hit when both threads are active and trying to work at full speed (supposedly there is much more resource duplication inside the PPE compared to how SMT is implemented in each of Xenon's cores: the fact that MS recommends pairing one worker thread with a more lightweight one [a memory-bound thread rather than a compute-bound one] MIGHT give some more confidence in this claim).

The DD2+ revisions of the CBE processor show a PPE that is 2x the size of the earlier DD1 revision and larger than each of the 3 cores included inside the Xenon CPU chip. Comparing the PPE with each of Xenon's cores, the integer register file is fundamentally the same (32x64-bit GPRs), the L2 cache of the PPE is 0.5 MB vs 1.0 MB for the whole Xenon, and CELL's VMX implementation has a quarter of the architectural registers (32x128 bits versus 128x128 bits). It makes you wonder why the PPE still manages to take so much space (and I do not think it is because of bad chip design).
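To put rough numbers on the register files mentioned above, here is a minimal back-of-the-envelope sketch. It only counts architectural state, using the register counts quoted in this thread and assuming two hardware threads per core for both designs. The names `regfile_bytes`, `state_bytes` and `CORES` are made up for illustration, and none of this says anything about actual die area, which also depends on port counts, rename/staging storage and layout.

```python
# Architectural register state per core, in bytes.
# Register counts are the ones quoted in the thread; two hardware
# threads per core is an assumption applied to both designs.

THREADS_PER_CORE = 2

# (number of registers, bits per register) for each register file
CORES = {
    "PPE":   {"GPR": (32, 64), "VMX": (32, 128)},
    "Xenon": {"GPR": (32, 64), "VMX": (128, 128)},
}

def regfile_bytes(num_regs, bits_per_reg, threads=THREADS_PER_CORE):
    """Bytes of architectural state for one register file across all threads."""
    return num_regs * bits_per_reg * threads // 8

def state_bytes(core):
    """Map each register file of the given core to its size in bytes."""
    return {name: regfile_bytes(n, bits) for name, (n, bits) in CORES[core].items()}

if __name__ == "__main__":
    for core in CORES:
        print(core, state_bytes(core))
```

Under these assumptions the VMX state of a Xenon core comes out four times larger than that of the PPE, so the register files alone would not explain a PPE that is bigger than a Xenon core.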

Nemo80 · May 30, 2006

Panajev2001a said:
The DD2+ revisions of the CBE processor show a PPE that is 2x the size of the earlier DD1 revision and larger than each of the 3 cores included inside the Xenon CPU chip. Comparing the PPE with each of Xenon's cores, the integer register file is fundamentally the same (32x64-bit GPRs), the L2 cache of the PPE is 0.5 MB vs 1.0 MB for the whole Xenon, and CELL's VMX implementation has a quarter of the architectural registers (32x128 bits versus 128x128 bits). It makes you wonder why the PPE still manages to take so much space (and I do not think it is because of bad chip design).

Agreed. The DD2+ PPE includes a second VMX execution unit, doubling the amount of execution logic compared to a Xenon core.

Xenon cores are based on the very first release of the CELL PPU, which IBM is able to license and sell by themselves.

Just look over here http://209.200.64.147/page.cfm?ArticleID=RWT072405191325&p=1

or this one, DD1 PPE is almost identical to a Xenon core (with exception of the number of registers):

Crossbar · May 30, 2006

Panajev2001a said:
The DD2+ revisions of the CBE processor show a PPE that is 2x the size of the earlier DD1 revision ...

According to the snippet below (from the Cell Handbook), the hardware cost of adding multithreading should not be more than 5%.

Edit: got some answers from the previous post.

Nemo80 · May 30, 2006

Crossbar said:
According to the snippet below (from the Cell Handbook), the hardware cost of adding multithreading should not be more than 5%.

Edit: got some answers from the next post.

I know this paper, though it reads a lot like it's referring to a DD1 PPU.

Urian · May 30, 2006

Panajev2001a said:
Urian, the basis for the PPE core and thus the Xenon/Waternoose core is a project that IBM had started a lot earlier than these projects began.

MS and SCE had different objectives for their processors: MS wanted more L2 cache, a better VMX implementation and some custom instructions to better adapt it to their graphics APIs and to collaborate with the GPU; SCE wanted a lower-penalty SMT implementation with less of a speed hit when both threads are active and trying to work at full speed (supposedly there is much more resource duplication inside the PPE compared to how SMT is implemented in each of Xenon's cores: the fact that MS recommends pairing one worker thread with a more lightweight one [a memory-bound thread rather than a compute-bound one] MIGHT give some more confidence in this claim).

The DD2+ revisions of the CBE processor show a PPE that is 2x the size of the earlier DD1 revision and larger than each of the 3 cores included inside the Xenon CPU chip. Comparing the PPE with each of Xenon's cores, the integer register file is fundamentally the same (32x64-bit GPRs), the L2 cache of the PPE is 0.5 MB vs 1.0 MB for the whole Xenon, and CELL's VMX implementation has a quarter of the architectural registers (32x128 bits versus 128x128 bits). It makes you wonder why the PPE still manages to take so much space (and I do not think it is because of bad chip design).

Thanks for the correction, I didn't know that the DD2 PPE is double the size of the DD1 PPE. It seems that I need to learn a lot about microprocessors and all this stuff.

Crossbar · May 30, 2006

Nemo80 said:
I know this paper, though it reads a lot like it's referring to a DD1 PPU.

It's dated April 19, 2006.

Are you suggesting it's a typo? Should I trust "Real-World Technologies" speculations more than IBM?

Nemo80 · May 30, 2006

Crossbar said:
It's dated April 19, 2006.

Are you suggesting it's a typo? Should I trust "Real-World Technologies" speculations more than IBM?

No, it's not. But such documents are not always very specialised (in fact they write things like "threads may fight for execution resources", which is not very precise, indicating future changes).

The 19th of April might be the release date, but it was surely written months before that date and then reviewed/released later on.

Crossbar · May 30, 2006

Nemo80 said:
The 19th of April might be the release date, but it was surely written months before that date and then reviewed/released later on.

Do you think this image is flawed as well? I mean, it contains just one VMX unit.

Urian · May 30, 2006

Crossbar said:
Do you think this image is flawed as well? I mean, it contains just one VMX unit.

I see a long pipeline.

Is it good for performance for an in-order CPU to have such a long pipeline?

My logic says no, but a part of me says that I am wrong.

Gubbi · May 30, 2006

Urian said:
I see a long pipeline.

Is it good for performance for an in-order CPU to have such a long pipeline?

My logic says no, but a part of me says that I am wrong.

Not really that important if the branch prediction works as it's supposed to.

The important thing is the issue-execute loop latency, which is down to 2 cycles for the fastest instructions.

This is still higher than the one cycle seen in P-M, P-4 and K8, but similar to the PPC 970.

The killer is load-to-use latency.
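A toy model of why these latencies matter on an in-order core: if every instruction needs the result of the previous one and there is nothing independent to issue in between, the latencies simply add up. The 1- and 2-cycle figures are the issue-execute loop latencies mentioned above; `LOAD_TO_USE` is a purely hypothetical placeholder, not a measured number for any of these chips, and `chain_cycles` is a made-up helper.

```python
# Toy in-order model: a chain of dependent instructions, no independent
# work available to hide latency, so the total is the sum of latencies.

ALU_1_CYCLE = 1   # issue-execute loop of 1 cycle (P-M, P-4, K8 per the post)
ALU_2_CYCLE = 2   # issue-execute loop of 2 cycles (PPE/Xenon, PPC 970 per the post)
LOAD_TO_USE = 5   # HYPOTHETICAL load-to-use latency, for illustration only

def chain_cycles(latencies):
    """Cycles to finish a fully dependent chain on an in-order core."""
    return sum(latencies)

if __name__ == "__main__":
    n = 8
    print("8 dependent ALU ops, 1-cycle loop:", chain_cycles([ALU_1_CYCLE] * n))
    print("8 dependent ALU ops, 2-cycle loop:", chain_cycles([ALU_2_CYCLE] * n))
    # Four load -> use pairs: each ALU op has to wait for its load.
    print("4 load+use pairs:", chain_cycles([LOAD_TO_USE, ALU_2_CYCLE] * (n // 2)))
```

An out-of-order core can overlap part of such a chain with unrelated work; an in-order core relies on the compiler (or a second hardware thread) to fill those slots.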

Cheers

Nemo80 · May 30, 2006

Crossbar said:
Do you think this image is flawed as well? I mean, it contains just one VMX unit.

There is only one "VMX Unit" as you call it. The GFLOPS did not change with DD2+. The difference is that the hardware thread execution units/resources have been doubled, so that the 2 threads no longer share as many execution units on the chip and can run more independently, rather than with the constant context and thread swapping done on the 360 because of its single VMX execution unit per thread.

Crossbar · May 30, 2006

Nemo80 said:
There is only one "VMX Unit" as you call it. The GFLOPS did not change with DD2+. The difference is that the hardware thread execution units/resources have been doubled, so that the 2 threads no longer share as many execution units on the chip and can run more independently, rather than with the constant context and thread swapping done on the 360 because of its single VMX execution unit per thread.

I am not really sure I understand what you are trying to say. Do you have any links to a description of the threading model of Xenon supporting this?

BTW the Cell threading priority scheme with 4 priorities: disabled, low, medium and high is pretty neat IMO.

LunchBox · May 31, 2006

I thought the Xenon came from the 4-core chip IBM was originally pitching? And that it was modified by cutting the extra core and adding VMX units per core?

Crossbar said:
I am not really sure I understand what you are trying to say. Do you have any links to a description of the threading model of Xenon supporting this?

Not to intrude on the discussion, but I vaguely remember a particular discussion in this very forum around May of last year. Try checking the search function. I would like to assist you in getting the links, but my google/search skills are really horrid.

If someone else has saved the link please kindly re-up on it.

Nemo80 · May 31, 2006

LunchBox said:
I thought the Xenon came from the 4-core chip IBM was originally pitching? And that it was modified by cutting the extra core and adding VMX units per core?

Not to intrude on the discussion, but I vaguely remember a particular discussion in this very forum around May of last year. Try checking the search function. I would like to assist you in getting the links, but my google/search skills are really horrid.

If someone else has saved the link please kindly re-up on it.

Yeah, I'm also looking for it. I think there was also a die shot of a Xenon core which was perfectly comparable to the DD1 CELL PPE.

ADEX · May 31, 2006

The PPE is derived from an old IBM project called GuTS which was the first CMOS processor above 1GHz (in 1997/8).

This was redesigned in 2000 and then redesigned again for Cell / Xenon.

If you look at a die photo, the front ends (i.e. not the FPU / VMX section) of Xenon and the PPE (DD2) look pretty much identical. Nothing much has been said about what's changed since then, but the PPE did appear to get "large page" support, as TLB flushes were apparently causing problems.

ADEX · May 31, 2006

PPE is completely unrelated to the G5. The G5 is a modified POWER4 core which was built using automated layout tools with static logic.

PPE was done by hand using dynamic logic. I'd expect there to be some design reuse (adders etc.) but overall they are completely different.

BTW the 970GX was presented at ISSCC complete with 3.0GHz figures, they did eventually get there.

nelg · May 31, 2006

ADEX said:
PPE was done by hand using dynamic logic.

Those must be some small hands.

Farid · May 31, 2006

ADEX said:
BTW the 970GX was presented at ISSCC complete with 3.0GHz figures, they did eventually get there.

Way too late, some folks at Cupertino would say.

Asher · May 31, 2006

Apple's beef was with the power use, not clockspeed.
