Fall Processor Forum

Not to mention that XB360 PPEs' VMX is quite different from any other variant of VMX:
  • 128 registers
  • dot-product instruction
  • AoS and SoA support (see the sketch after this list)
  • compression instructions
  • other M$-specific stuff, I'm sure
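
To make the AoS/SoA and dot-product bullets a bit more concrete, here's a minimal plain-C sketch - no actual VMX128 intrinsics, just the two data layouts and the four-element dot product that a single hardware instruction would collapse (the struct and function names are mine, purely for illustration):

/* Illustrative only: plain C, no VMX128 intrinsics. Shows the two data
 * layouts the "AoS and SoA" bullet refers to, and the 4-element dot
 * product that a dedicated dot-product instruction does in one opcode. */
#include <stdio.h>

/* Array-of-Structures: one vertex per struct, fields interleaved in memory. */
struct vec4_aos { float x, y, z, w; };

/* Structure-of-Arrays: each field packed contiguously, which is the
 * SIMD-friendly layout - four x components load straight into one
 * 128-bit vector register. */
struct vec4_soa { float x[4], y[4], z[4], w[4]; };

/* Scalar dot product: four multiplies plus three adds. A hardware
 * dot-product instruction collapses this into one (longer-latency)
 * operation - e.g. the 14 cycles quoted later in this thread. */
static float dot4(const struct vec4_aos *a, const struct vec4_aos *b)
{
    return a->x * b->x + a->y * b->y + a->z * b->z + a->w * b->w;
}

int main(void)
{
    struct vec4_aos a = { 1.0f, 2.0f, 3.0f, 4.0f };
    struct vec4_aos b = { 4.0f, 3.0f, 2.0f, 1.0f };
    printf("dot = %f\n", dot4(&a, &b)); /* 1*4 + 2*3 + 3*2 + 4*1 = 20 */
    return 0;
}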
Anyway, I hope we'll soon get to find out lots more about Xenon - it's been long enough.

Jawed
 
Jawed said:
Anyway, I hope we'll soon get to find out lots more about Xenon - it's been long enough.
I've been looking for a PDF of the paper or at least slides, but nothing's come up... :(
/waiting anxiously
 
The INQ had a little more..

JEFF BROWN TODAY HAD the lucky task of outing the Xbox360 CPU chip. So said Jeff, an IBM chip developer, but as you know, the project is in conjunction with Microsoft. Add in a lot of work from Chartered, the company fabbing it. Microsoft definitely had input at all stages.
The chip itself is a three-way SMP PPC with specialised-function VMX extensions and two threads per core. It has 1MB of L2 cache and an FSB of 21.6GB/s. It has 165M transistors, and is built on IBM's 10KE 90nm SOI process.
The L1 Icache is 32K, 2-way set associative, and has a 128-byte cache line size. It can issue 2 instructions per clock, in order, but can do delayed execution to cover load-to-use delays. The chip, still somewhat unnamed, has 2 fixed point units with a 2-cycle op latency. The Dcache is also 32K but is 4-way set associative, and is non-blocking. The FPU is combined with the VMX unit and can also handle two threads.
The full pipeline is 11 FO4 in length, and has a 10-cycle scalar DP FPU latency, a 2-cycle load latency, 4 cycles for simple VMX and 14 for dot-product VMX. This is important because of the target for the chip: gaming. The VMX extensions are going to be heavily used here, and part of the MS mods was upping the number of VMX registers from 32 to 128. It also adds Direct3D pack and unpack instructions.
The 1MB L2 cache is shared by all three cores, and is 8-way set associative. It is ECC protected, and supports the MESI coherency protocol. The FSB beyond that is specific to the Xbox360, and was designed for the machine itself. It connects to the ATI GPU at 10.8GB/s in each direction, hence the 21.6GB/s noted earlier. Interestingly, the IBM chip runs its link layer at 1.35GHz with an 8-bit width, and ATI does it at 675MHz at 16 bits of width.


http://www.theinquirer.net/?article=27221
----

could this be anything?
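
One thing the numbers as quoted do let you do is sanity-check the cache geometries. A quick sketch, assuming the 128-byte line size applies to all three caches (the article only states it for the L1 Icache):

/* Derive set counts from the cache parameters quoted in the article:
 * sets = size / (line_size * ways). The 128-byte line is only stated
 * for the L1 Icache; using it for the Dcache and L2 is an assumption. */
#include <stdio.h>

static unsigned sets(unsigned size_bytes, unsigned line_bytes, unsigned ways)
{
    return size_bytes / (line_bytes * ways);
}

int main(void)
{
    printf("L1 Icache: %u sets\n", sets(32 * 1024, 128, 2));   /* 128 sets  */
    printf("L1 Dcache: %u sets\n", sets(32 * 1024, 128, 4));   /* 64 sets   */
    printf("L2       : %u sets\n", sets(1024 * 1024, 128, 8)); /* 1024 sets */
    return 0;
}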
 
EndR said:
The INQ had a little more..

JEFF BROWN TODAY HAD the lucky task of outing the Xbox360 CPU chip. So said Jeff, an IBM chip developer, but as you know, the project is in conjunction with Microsoft. Add in a lot of work from Chartered, the company fabbing it. Microsoft definitely had input at all stages.
The chip itself is a three-way SMP PPC with specialised-function VMX extensions and two threads per core. It has 1MB of L2 cache and an FSB of 21.6GB/s. It has 165M transistors, and is built on IBM's 10KE 90nm SOI process.
The L1 Icache is 32K, 2-way set associative, and has a 128-byte cache line size. It can issue 2 instructions per clock, in order, but can do delayed execution to cover load-to-use delays. The chip, still somewhat unnamed, has 2 fixed point units with a 2-cycle op latency. The Dcache is also 32K but is 4-way set associative, and is non-blocking. The FPU is combined with the VMX unit and can also handle two threads.
The full pipeline is 11 FO4 in length, and has a 10-cycle scalar DP FPU latency, a 2-cycle load latency, 4 cycles for simple VMX and 14 for dot-product VMX. This is important because of the target for the chip: gaming. The VMX extensions are going to be heavily used here, and part of the MS mods was upping the number of VMX registers from 32 to 128. It also adds Direct3D pack and unpack instructions.
The 1MB L2 cache is shared by all three cores, and is 8-way set associative. It is ECC protected, and supports the MESI coherency protocol. The FSB beyond that is specific to the Xbox360, and was designed for the machine itself. It connects to the ATI GPU at 10.8GB/s in each direction, hence the 21.6GB/s noted earlier. Interestingly, the IBM chip runs its link layer at 1.35GHz with an 8-bit width, and ATI does it at 675MHz at 16 bits of width.


http://www.theinquirer.net/?article=27221
----

could this be anything?

In other words...

The only thing that Xenon takes from the PowerPC 970 is the 64-bit PowerPC ISA, and we aren't even sure of that because IBM never confirmed it (it could be the 32-bit PowerPC ISA, since you can code with that on the PowerPC 970s that were in the Alpha kits).

It seems that the processor is fast in the pipeline, faster than the 970.
 
Urian said:
It seems that the processor is fast in the pipeline, faster than the 970.
I've no idea what this means...

As for the FSB, I would hazard a guess and say it's based on the point-to-point bus used in the 970. As those who've looked at, for example, Ars' breakdown of that chip have noticed, that bus is also bidirectional and serial-like in nature. I don't think IBM would reinvent the wheel here, particularly if they're under a time constraint (which they were in this case).

Interesting to see it's asymmetrical in width/clockspeed, yet delivers the same bandwidth in both directions, probably because of the difference in clock between the two chips, Xenos running at roughly a sixth of XCPU. Perhaps the narrower interface is upstream and the wider one downstream, giving the CPU quicker access to small, scattered reads from memory... This is just crazy speculation on my part. I'm sure there's not really any technical issue at work here; even though Xenos runs at only a sixth of XCPU speed (less, actually), it still supports one half of the 1.35GHz link plus two 1.4GHz GDDR memory interfaces.
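
For what it's worth, the INQ's link-layer figures only line up with 10.8GB/s per direction if the quoted widths are read as bytes rather than bits - which is my reading, not something IBM stated. A tiny sketch of that arithmetic:

/* Bandwidth check on the asymmetric link-layer figures. The numbers only
 * match the quoted 10.8GB/s per direction if "8 bit" / "16 bits" are
 * really bytes (or lanes carrying a byte per transfer) - an assumption. */
#include <stdio.h>

static double gb_per_s(double clock_ghz, double width_bytes)
{
    return clock_ghz * width_bytes; /* GHz x bytes per transfer = GB/s */
}

int main(void)
{
    printf("CPU side: %.1f GB/s\n", gb_per_s(1.35, 8.0));   /* 10.8 GB/s */
    printf("GPU side: %.1f GB/s\n", gb_per_s(0.675, 16.0)); /* 10.8 GB/s */
    return 0;
}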
 
From looking at http://pcweb.mycom.co.jp/photo/articles/2005/10/27/fpf1/images/Photo03l.jpg it seems XCPU is 4 execution units wide (2 integer units, FPU, VMX), to which a maximum of 2 instructions may be issued per clock, leaving the other two units potentially idle. I should have guessed this to be the case, I suppose, since it has "always" been known XCPU is dual-issue, but I assumed that to refer to the integer part of the chip, with float instructions handled independently. Apparently, this is not the case then.

So there is a big limit on the actual peak performance of this chip, in that heavy integer use is going to block the float units and the other way around. Unless I'm wrong, that is.

*Edit:
http://pcweb.mycom.co.jp/photo/articles/2005/10/27/fpf1/images/Photo05l.jpg implies the data path from the cache is common for all three cores - i.e., the L2 is single-ported, at least from the perspective of the cores (there might be a separate port for the FSB interface). Of course, this might just be a schematic image that doesn't represent the actual architecture, but I thought the point of a presentation such as this was to present the architecture... Interesting to see the cache runs at half the core clock; I wonder what the bus width from the cache is. Probably 256 bits, possibly 512? The L1 cache line width is 128 bytes, so that would mean 8 core clocks to fill one line at 256 bits. Too much? No idea, I'm not a CPU designer. :) Then again, if the bus is shared between all three cores, they'll have to take turns, so make that a potential 12 core clocks to fill a line. And that might not be the worst-case scenario either.
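
To make the line-fill arithmetic above explicit, here's a quick sketch; the 256-bit and 512-bit widths are the guesses from the post, not published figures, and contention between the three cores would only add to these numbers:

/* Core clocks to move one 128-byte L1 line over a shared L2 bus running
 * at half the core clock. The bus widths are guesses, not published. */
#include <stdio.h>

int main(void)
{
    const int line_bytes = 128;
    const int widths_bytes[] = { 32, 64 }; /* 256-bit and 512-bit guesses */

    for (int i = 0; i < 2; i++) {
        int beats       = line_bytes / widths_bytes[i];
        int core_clocks = beats * 2;       /* bus runs at half core clock */
        printf("%d-bit bus: %d core clocks per line (before any sharing)\n",
               widths_bytes[i] * 8, core_clocks);
    }
    return 0;
}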
 
Guden Oden said:
From looking at http://pcweb.mycom.co.jp/photo/articles/2005/10/27/fpf1/images/Photo03l.jpg it seems XCPU is 4 execution units wide (2 integer units, FPU, VMX), to which a maximum of 2 instructions may be issued per clock, leaving the other two units potentially idle. I should have guessed this to be the case, I suppose, since it has "always" been known XCPU is dual-issue, but I assumed that to refer to the integer part of the chip, with float instructions handled independently. Apparently, this is not the case then.

So there is a big limit on the actual peak performance of this chip, in that heavy integer use is going to block the float units and the other way around. Unless I'm wrong, that is.

Ho! My memory isn't serving me well here, but I recall a bit of information released some time back on, I think, cloth physics or something, where they were talking about how the simulation scaled linearly with the number of SPEs... Anyway, in that blurb it seemed that the PPE was performing a bit under par with what people (B3D people) were expecting... perhaps a similar case with the above? Yes, I know, SO specific, sorry.
 
That was the Alias Wavefront cloth simulator converted to Cell, in which the PPE alone wasn't even half the performance of a 4GHz P4.
 
And there was some question whether it was run on DD1, and whether DD1 even has a VMX unit.

Did we get any answers on that?...

Jawed
 
Guden Oden said:
From looking at http://pcweb.mycom.co.jp/photo/articles/2005/10/27/fpf1/images/Photo03l.jpg it seems XCPU is 4 execution units wide (2 integer units, FPU, VMX), to which a maximum of 2 instructions may be issued per clock, leaving the other two units potentially idle. I should have guessed this to be the case, I suppose, since it has "always" been known XCPU is dual-issue, but I assumed that to refer to the integer part of the chip, with float instructions handled independently. Apparently, this is not the case then.

So there is a big limit on the actual peak performance of this chip, in that heavy integer use is going to block the float units and the other way around. Unless I'm wrong, that is.

You always have more execution units than issue ports, otherwise it would be a VLIW. Other CPUs are exactly the same in this respect.

Guden Oden said:
*Edit:
http://pcweb.mycom.co.jp/photo/articles/2005/10/27/fpf1/images/Photo05l.jpg implies the data path from the cache is common for all three cores - i.e., the L2 is single-ported, at least from the perspective of the cores (there might be a separate port for the FSB interface). Of course, this might just be a schematic image that doesn't represent the actual architecture, but I thought the point of a presentation such as this was to present the architecture... Interesting to see the cache runs at half the core clock; I wonder what the bus width from the cache is. Probably 256 bits, possibly 512? The L1 cache line width is 128 bytes, so that would mean 8 core clocks to fill one line at 256 bits. Too much? No idea, I'm not a CPU designer. :) Then again, if the bus is shared between all three cores, they'll have to take turns, so make that a potential 12 core clocks to fill a line. And that might not be the worst-case scenario either.

The time to fill a line is fairly irrelevant, since the L2 will almost certainly deliver critical word first, that is, the exact chunk of the cache line that caused the L1 miss. The only time it will cause problems is when you have back-to-back L1 misses that hit in L2, but with typical L1 hit rates in the >95% range, these won't be common.
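
To put a rough number on that, here's a quick back-of-the-envelope sketch; the 95% hit rate is the figure quoted above, and the independence assumption is mine (real access streams are correlated, so treat it as illustrative only):

/* Chance of back-to-back L1 misses - the case where line-fill occupancy
 * would actually hurt - given a ~95% L1 hit rate. Assumes independent
 * accesses, which real code isn't, so this is only a rough illustration. */
#include <stdio.h>

int main(void)
{
    double hit_rate  = 0.95;
    double miss_rate = 1.0 - hit_rate;
    printf("single L1 miss      : %.1f%%\n", miss_rate * 100.0);              /* 5%    */
    printf("two misses in a row : %.2f%%\n", miss_rate * miss_rate * 100.0);  /* 0.25% */
    return 0;
}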

As for contention for the L2: reads from and stores to the L2 array itself can be wider than the buses to the cores. That way the L2 would be better equipped to handle contention.

Edit: Since the L1s are write-through, I would say a fat pipe to the L2 array is almost certain, since every store in one of the L1s is written through to the L2.
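
As a rough feel for why that fat pipe matters: with write-through L1s, every store from every core lands on the L2. Taking the announced 3.2GHz core clock and some made-up store-mix numbers (everything below is an assumption, nothing from the presentation), the store traffic alone is already substantial:

/* Back-of-the-envelope store traffic into the L2 with write-through L1s.
 * The clock is the announced 3.2GHz; the store fraction and average store
 * width are made-up illustrative values, not figures from the talk. */
#include <stdio.h>

int main(void)
{
    const double core_clock_ghz   = 3.2;
    const int    cores            = 3;
    const double stores_per_clock = 0.3; /* assumed store mix */
    const double bytes_per_store  = 8.0; /* assumed average store width */

    double traffic = core_clock_ghz * cores * stores_per_clock * bytes_per_store;
    printf("write-through store traffic into L2: ~%.1f GB/s\n", traffic);
    return 0;
}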

Cheers
Gubbi
 