Fall Processor Forum

Not to mention that XB360 PPEs' VMX is quite different from any other variant of VMX:
  • 128 registers
  • dot-product instruction
  • AoS and SoA support (see the sketch after this list)
  • compression instructions
  • other M$-specific stuff, I'm sure
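
To make the AoS/SoA and dot-product bullets a bit more concrete, here's a minimal plain-C sketch - no actual VMX128 intrinsics, just the two data layouts and the four-element dot product that a single hardware instruction would collapse (the struct and function names are mine, purely for illustration):

/* Illustrative only: plain C, no VMX128 intrinsics. Shows the two data
 * layouts the "AoS and SoA" bullet refers to, and the 4-element dot
 * product that a dedicated dot-product instruction does in one opcode. */
#include <stdio.h>

/* Array-of-Structures: one vertex per struct, fields interleaved in memory. */
struct vec4_aos { float x, y, z, w; };

/* Structure-of-Arrays: each field packed contiguously, which is the
 * SIMD-friendly layout - four x components load straight into one
 * 128-bit vector register. */
struct vec4_soa { float x[4], y[4], z[4], w[4]; };

/* Scalar dot product: four multiplies plus three adds. A hardware
 * dot-product instruction collapses this into one (longer-latency)
 * operation - e.g. the 14 cycles quoted later in this thread. */
static float dot4(const struct vec4_aos *a, const struct vec4_aos *b)
{
    return a->x * b->x + a->y * b->y + a->z * b->z + a->w * b->w;
}

int main(void)
{
    struct vec4_aos a = { 1.0f, 2.0f, 3.0f, 4.0f };
    struct vec4_aos b = { 4.0f, 3.0f, 2.0f, 1.0f };
    printf("dot = %f\n", dot4(&a, &b)); /* 1*4 + 2*3 + 3*2 + 4*1 = 20 */
    return 0;
}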
Anyway, I hope we'll soon get to find out lots more about Xenon - it's been long enough.

Jawed
 
Jawed said:
Anyway, I hope we'll soon get to find out lots more about Xenon - it's been long enough.
I've been looking for a PDF of the paper or at least slides, but nothing's come up... :(
/waiting anxiously
 
The INQ had a little more..

JEFF BROWN TODAY HAD the lucky task of outing the Xbox360 CPU chip. So said Jeff, an IBM chip developer, but as you know, the project is in conjunction with Microsoft. Add in a lot of work from Chartered, the company fabbing it. Microsoft definitely had input at all stages.
The chip itself is a three-way SMP PPC with specialised-function VMX extensions and two threads per core. It has 1MB of L2 cache and an FSB of 21.6GB/s. It has 165M transistors, and is built on IBM's 10KE 90nm SOI process.
The L1 Icache is 32K, 2-way set associative, and has a 128-byte cache line size. It can issue 2 instructions per clock, in order, but can do delayed execution to cover load-to-use delays. The chip, still somewhat unnamed, has 2 fixed point units with a 2-cycle op latency. The Dcache is also 32K but is 4-way set associative, and is non-blocking. The FPU is combined with the VMX unit and can also handle two threads.
The full pipeline is 11 FO4 in length, and has a 10-cycle scalar DP FPU latency, a 2-cycle load latency, 4 cycles for simple VMX and 14 for dot-product VMX. This is important because of the target for the chip: gaming. The VMX extensions are going to be heavily used here, and part of the MS mods was upping the number of VMX registers from 32 to 128. It also adds Direct3D pack and unpack instructions.
The 1MB L2 cache is shared by all three cores, and is 8-way set associative. It is ECC protected, and supports the MESI coherency protocol. The FSB beyond that is specific to the Xbox360, and was designed for the machine itself. It connects to the ATI GPU at 10.8GB/s in each direction, hence the 21.6GB/s noted earlier. Interestingly, the IBM chip runs its link layer at 1.35GHz with an 8-bit width, and ATI does it at 675MHz at 16 bits of width.


http://www.theinquirer.net/?article=27221
----

could this be anything?
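
One thing the numbers as quoted do let you do is sanity-check the cache geometries. A quick sketch, assuming the 128-byte line size applies to all three caches (the article only states it for the L1 Icache):

/* Derive set counts from the cache parameters quoted in the article:
 * sets = size / (line_size * ways). The 128-byte line is only stated
 * for the L1 Icache; using it for the Dcache and L2 is an assumption. */
#include <stdio.h>

static unsigned sets(unsigned size_bytes, unsigned line_bytes, unsigned ways)
{
    return size_bytes / (line_bytes * ways);
}

int main(void)
{
    printf("L1 Icache: %u sets\n", sets(32 * 1024, 128, 2));   /* 128 sets  */
    printf("L1 Dcache: %u sets\n", sets(32 * 1024, 128, 4));   /* 64 sets   */
    printf("L2       : %u sets\n", sets(1024 * 1024, 128, 8)); /* 1024 sets */
    return 0;
}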
 
EndR said:
The INQ had a little more..

JEFF BROWN TODAY HAD the lucky task of outing the Xbox360 CPU chip. So said Jeff, an IBM chip developer, but as you know, the project is in conjunction with Microsoft. Add in a lot of work from Chartered, the company fabbing it. Microsoft definitely had input at all stages.
The chip itself is a three-way SMP PPC with specialised-function VMX extensions and two threads per core. It has 1MB of L2 cache and an FSB of 21.6GB/s. It has 165M transistors, and is built on IBM's 10KE 90nm SOI process.
The L1 Icache is 32K, 2-way set associative, and has a 128-byte cache line size. It can issue 2 instructions per clock, in order, but can do delayed execution to cover load-to-use delays. The chip, still somewhat unnamed, has 2 fixed point units with a 2-cycle op latency. The Dcache is also 32K but is 4-way set associative, and is non-blocking. The FPU is combined with the VMX unit and can also handle two threads.
The full pipeline is 11 FO4 in length, and has a 10-cycle scalar DP FPU latency, a 2-cycle load latency, 4 cycles for simple VMX and 14 for dot-product VMX. This is important because of the target for the chip: gaming. The VMX extensions are going to be heavily used here, and part of the MS mods was upping the number of VMX registers from 32 to 128. It also adds Direct3D pack and unpack instructions.
The 1MB L2 cache is shared by all three cores, and is 8-way set associative. It is ECC protected, and supports the MESI coherency protocol. The FSB beyond that is specific to the Xbox360, and was designed for the machine itself. It connects to the ATI GPU at 10.8GB/s in each direction, hence the 21.6GB/s noted earlier. Interestingly, the IBM chip runs its link layer at 1.35GHz with an 8-bit width, and ATI does it at 675MHz at 16 bits of width.


http://www.theinquirer.net/?article=27221
----

could this be anything?

In other words...

The only thing that Xenon takes from the PowerPC 970 is the 64-bit PowerPC ISA, and we aren't even sure of that because IBM never confirmed it (it could be the 32-bit PowerPC ISA, since you can code with that on the PowerPC 970s that were in the Alpha kits).

It seems that the processor is fast in the pipeline, faster than the 970.
 
Urian said:
It seems that the processor is fast in the pipeline, faster than the 970.
I've no idea what this means...

As for the FSB, I would hazard a guess and say it's based on the point-to-point bus used in the 970. As those who've looked at, for example, Ars' breakdown of that chip have noticed, that bus is also bidirectional and serial-like in nature. I don't think IBM would reinvent the wheel here, particularly if they're under a time constraint (which they were in this case).

Interesting to see it's asymmetrical in width/clockspeed, yet delivers the same bandwidth in both directions, probably because of the difference in clock between the two chips, Xenos running at roughly a sixth of XCPU. Perhaps the narrower interface is upstream and the wider one downstream, giving the CPU quicker access to small, scattered reads from memory... This is just crazy speculation on my part. I'm sure there's not really any technical issue at work here; even though Xenos runs at only a sixth of XCPU speed (less, actually), it still supports one half of the 1.35GHz link plus two 1.4GHz GDDR memory interfaces.
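
For what it's worth, the INQ's link-layer figures only line up with 10.8GB/s per direction if the quoted widths are read as bytes rather than bits - which is my reading, not something IBM stated. A tiny sketch of that arithmetic:

/* Bandwidth check on the asymmetric link-layer figures. The numbers only
 * match the quoted 10.8GB/s per direction if "8 bit" / "16 bits" are
 * really bytes (or lanes carrying a byte per transfer) - an assumption. */
#include <stdio.h>

static double gb_per_s(double clock_ghz, double width_bytes)
{
    return clock_ghz * width_bytes; /* GHz x bytes per transfer = GB/s */
}

int main(void)
{
    printf("CPU side: %.1f GB/s\n", gb_per_s(1.35, 8.0));   /* 10.8 GB/s */
    printf("GPU side: %.1f GB/s\n", gb_per_s(0.675, 16.0)); /* 10.8 GB/s */
    return 0;
}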
 
From looking at http://pcweb.mycom.co.jp/photo/articles/2005/10/27/fpf1/images/Photo03l.jpg it seems XCPU is 4 execution units wide (2 integer units, FPU, VMX), to which a maximum of 2 instructions may be issued per clock, leaving the other two units potentially idle. I should have guessed this to be the case, I suppose, since it has "always" been known XCPU is dual-issue, but I assumed that to refer to the integer part of the chip, with float instructions handled independently. Apparently, this is not the case then.

So there is a big limit on the actual peak performance of this chip, in that heavy integer use is going to block the float units and the other way around. Unless I'm wrong, that is.

*Edit:
http://pcweb.mycom.co.jp/photo/articles/2005/10/27/fpf1/images/Photo05l.jpg implies the data path from the cache is common for all three cores - i.e., the L2 is single-ported, at least from the perspective of the cores (there might be a separate port for the FSB interface). Of course, this might just be a schematic image that doesn't represent the actual architecture, but I thought the point of a presentation such as this was to present the architecture... Interesting to see the cache runs at half the core clock; I wonder what the bus width from the cache is. Probably 256 bits, possibly 512? The L1 cache line width is 128 bytes, so that would mean 8 core clocks to fill one line at 256 bits. Too much? No idea, I'm not a CPU designer. :) Then again, if the bus is shared between all three cores, they'll have to take turns, so make that a potential 12 core clocks to fill a line. And that might not be the worst-case scenario either.
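
To make the line-fill arithmetic above explicit, here's a quick sketch; the 256-bit and 512-bit widths are the guesses from the post, not published figures, and contention between the three cores would only add to these numbers:

/* Core clocks to move one 128-byte L1 line over a shared L2 bus running
 * at half the core clock. The bus widths are guesses, not published. */
#include <stdio.h>

int main(void)
{
    const int line_bytes = 128;
    const int widths_bytes[] = { 32, 64 }; /* 256-bit and 512-bit guesses */

    for (int i = 0; i < 2; i++) {
        int beats       = line_bytes / widths_bytes[i];
        int core_clocks = beats * 2;       /* bus runs at half core clock */
        printf("%d-bit bus: %d core clocks per line (before any sharing)\n",
               widths_bytes[i] * 8, core_clocks);
    }
    return 0;
}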
 
Guden Oden said:
From looking at http://pcweb.mycom.co.jp/photo/articles/2005/10/27/fpf1/images/Photo03l.jpg it seems XCPU is 4 execution units wide (2 integer units, FPU, VMX), to which a maximum of 2 instructions may be issued per clock, leaving the other two units potentially idle. I should have guessed this to be the case, I suppose, since it has "always" been known XCPU is dual-issue, but I assumed that to refer to the integer part of the chip, with float instructions handled independently. Apparently, this is not the case then.

So there is a big limit on the actual peak performance of this chip, in that heavy integer use is going to block the float units and the other way around. Unless I'm wrong, that is.

Ho! My memory isn't serving me well here, but I recall a bit of information released some time back on, I think, cloth physics or something, where they were talking about how the simulation scaled linearly with the number of SPEs... Anyway, in that blurb it seemed that the PPE was performing a bit under par with what people (B3D people) were expecting... perhaps a similar case with the above? Yes, I know, SO specific, sorry.
 
That was the Alias Wavefront cloth simulator converted to Cell, in which the PPE alone wasn't even half the performance of a 4GHz P4.
 
And there was some question whether it was run on DD1, and whether DD1 even has a VMX unit.

Did we get any answers on that?...

Jawed
 
Guden Oden said:
From looking at http://pcweb.mycom.co.jp/photo/articles/2005/10/27/fpf1/images/Photo03l.jpg it seems XCPU is 4 execution units wide (2 integer units, FPU, VMX), to which a maximum of 2 instructions may be issued per clock, leaving the other two units potentially idle. I should have guessed this to be the case, I suppose, since it has "always" been known XCPU is dual-issue, but I assumed that to refer to the integer part of the chip, with float instructions handled independently. Apparently, this is not the case then.

So there is a big limit on the actual peak performance of this chip, in that heavy integer use is going to block the float units and the other way around. Unless I'm wrong, that is.

You always have more execution units than issue ports, otherwise it would be a VLIW. Other CPUs are exactly the same in this respect.

Guden Oden said:
*Edit:
http://pcweb.mycom.co.jp/photo/articles/2005/10/27/fpf1/images/Photo05l.jpg implies the data path from the cache is common for all three cores - i.e., the L2 is single-ported, at least from the perspective of the cores (there might be a separate port for the FSB interface). Of course, this might just be a schematic image that doesn't represent the actual architecture, but I thought the point of a presentation such as this was to present the architecture... Interesting to see the cache runs at half the core clock; I wonder what the bus width from the cache is. Probably 256 bits, possibly 512? The L1 cache line width is 128 bytes, so that would mean 8 core clocks to fill one line at 256 bits. Too much? No idea, I'm not a CPU designer. :) Then again, if the bus is shared between all three cores, they'll have to take turns, so make that a potential 12 core clocks to fill a line. And that might not be the worst-case scenario either.

The time to fill a line is fairly irrelevant, since the L2 will almost certainly deliver critical word first, that is, the exact chunk of the cache line that caused the L1 miss. The only time it will cause problems is when you have back-to-back L1 misses that hit in L2, but with typical L1 hit rates in the >95% range, these won't be common.
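
To put a rough number on that, here's a quick back-of-the-envelope sketch; the 95% hit rate is the figure quoted above, and the independence assumption is mine (real access streams are correlated, so treat it as illustrative only):

/* Chance of back-to-back L1 misses - the case where line-fill occupancy
 * would actually hurt - given a ~95% L1 hit rate. Assumes independent
 * accesses, which real code isn't, so this is only a rough illustration. */
#include <stdio.h>

int main(void)
{
    double hit_rate  = 0.95;
    double miss_rate = 1.0 - hit_rate;
    printf("single L1 miss      : %.1f%%\n", miss_rate * 100.0);              /* 5%    */
    printf("two misses in a row : %.2f%%\n", miss_rate * miss_rate * 100.0);  /* 0.25% */
    return 0;
}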

As for contention for the L2: reads from and stores to the L2 array itself can be wider than the buses to the cores. That way the L2 would be better equipped to handle contention.

Edit: Since the L1s are write-through, I would say a fat pipe to the L2 array is almost certain, since every store in one of the L1s is written through to the L2.
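
As a rough feel for why that fat pipe matters: with write-through L1s, every store from every core lands on the L2. Taking the announced 3.2GHz core clock and some made-up store-mix numbers (everything below is an assumption, nothing from the presentation), the store traffic alone is already substantial:

/* Back-of-the-envelope store traffic into the L2 with write-through L1s.
 * The clock is the announced 3.2GHz; the store fraction and average store
 * width are made-up illustrative values, not figures from the talk. */
#include <stdio.h>

int main(void)
{
    const double core_clock_ghz   = 3.2;
    const int    cores            = 3;
    const double stores_per_clock = 0.3; /* assumed store mix */
    const double bytes_per_store  = 8.0; /* assumed average store width */

    double traffic = core_clock_ghz * cores * stores_per_clock * bytes_per_store;
    printf("write-through store traffic into L2: ~%.1f GB/s\n", traffic);
    return 0;
}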

Cheers
Gubbi
 