Xbox2 CPU : 240 Gflops..?

version said:
"Oh...I forgot about that! ...So is that 1 TOPS of 8bit Integers, peak, fully pipelined! :p"

no, not multiply-add instruction, hence only 500 GOPS :)

+PPE altivec with 64 GOPS

but parallel with permute 1 TOPS :D

Marketing will like this! :p
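
For anyone wondering where those figures could come from, here is a rough sketch. The 4 GHz clock, the 8 SPEs and the 16 byte-wide lanes per 128-bit vector are assumptions on my part, not something stated in the posts above, and `peak_gops` is just an illustrative helper.

```python
# Back-of-the-envelope peak 8-bit integer throughput.
# Assumptions (not from the thread): 4 GHz clock, 128-bit vectors
# (16 byte lanes), one vector instruction issued per clock per unit.

def peak_gops(units, lanes, clock_ghz, ops_per_lane=1):
    """Peak giga-operations per second for `units` SIMD units."""
    return units * lanes * ops_per_lane * clock_ghz

CLOCK_GHZ = 4.0   # assumed clock
BYTE_LANES = 16   # 128 bits / 8 bits per element

# 8 SPEs, one simple op per lane per clock.
spe_gops = peak_gops(units=8, lanes=BYTE_LANES, clock_ghz=CLOCK_GHZ)

# Same SPEs if two ops per lane per clock could be counted
# (a multiply-add, or a second op issued in parallel).
spe_double_gops = peak_gops(8, BYTE_LANES, CLOCK_GHZ, ops_per_lane=2)

# A single PPE AltiVec unit under the same assumptions.
ppe_gops = peak_gops(units=1, lanes=BYTE_LANES, clock_ghz=CLOCK_GHZ)

print(spe_gops, spe_double_gops, ppe_gops)
```

Under those assumptions the SPEs land near the "500 GOPS" figure, the doubled count near 1 TOPS, and the PPE at 64 GOPS.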
 
version said:
"Well not really, But altiVec can operate across an array of 16 bytes..."

hmm, altivec compute with 8*halfword(16 bit) too ??

Yes...

no, not multiply-add instruction, hence only 500 GOPS :)

+PPE altivec with 64 GOPS

but parallel with permute 1 TOPS :D

If you had the same dispatch and execution layout as the 745x/744x CPUs then you could process 2 vector integer ALU ops per clock...
 
archie4oz said:
version said:
"Well not really, But altiVec can operate across an array of 16 bytes..."

hmm, altivec compute with 8*halfword(16 bit) too ??

Yes...

no, not multiply-add instruction, hence only 500 GOPS :)

+PPE altivec with 64 GOPS

but parallel with permute 1 TOPS :D

If you had the same dispatch and execution layout as the 745x/744x CPUs then you could process 2 vector integer ALU ops per clock...


1.1 TOPS , 300 Gflops
GPU ? sony ,where is it ???? :D
 
Jaws said:
But three beefed up PPE cores would be pushing it at 90nm and we may only see two.
Why would this be pushing it?
Eyeballing it from the Cell die, 3 PPEs (but slightly larger) with 1MB L2 wouldn't be much larger than one half of the 8/1 Cell size... and also targeting lower MHz.

Say we give it ~160mm2, is that really so unreasonable?
Now... should there be some other changes... (such as say, 2MB of L2) then I would start wondering about the size of the chip...

Archie said:
Well not really, But altiVec can operate across an array of 16 bytes...
Doesn't Every integer SIMD? :p

Akira said:
Actually according to Aaron Spink (a real engineer) a better estimate is 105 GFLOPS, taking into account the main FPUs on the PPC cores as well as the FP SIMD units
That only works if the core can dual issue FP and VMX instructions together (there's a reason why the PSP's primary CPU is only rated at 2.6 GFLOPS rather than 3.3, even though the total number of FMACs would give you the latter number).
So... calling any Xenon insiders... can it :?:
 
Fafalada said:
Why would this be pushing it?
Eyeballing it from the Cell die, 3 PPEs (but slightly larger) with 1MB L2 wouldn't be much larger than one half of the 8/1 Cell size... and also targeting lower MHz.

Say we give it ~160mm2, is that really so unreasonable?
Now... should there be some other changes... (such as say, 2MB of L2) then I would start wondering about the size of the chip...
Absolutely, if we believe that Cell is possible for PS3, there is absolutely no reason to suggest that XeCPU won't be. If we assume that the Xe enhanced VMX takes 50% more die space per core, it's still not that big compared to Cell.

Fafalada said:
That only works if the core can dual issue FP and VMX instructions together (there's a reason why the PSP's primary CPU is only rated at 2.6 GFLOPS rather than 3.3, even though the total number of FMACs would give you the latter number).
So... calling any Xenon insiders... can it :?:
FPU and VMX are independent IIRC.

Edit: Removed comment about PSP
 
DeanoC said:
Fafalada said:
Why would this be pushing it?
Eyeballing it from the Cell die, 3 PPEs (but slightly larger) with 1MB L2 wouldn't be much larger than one half of the 8/1 Cell size... and also targeting lower MHz.

Say we give it ~160mm2, is that really so unreasonable?
Now... should there be some other changes... (such as say, 2MB of L2) then I would start wondering about the size of the chip...
Absolutely, if we believe that Cell is possible for PS3, there is absolutely no reason to suggest that XeCPU won't be. If we assume that the Xe enhanced VMX takes 50% more die space per core, it's still not that big compared to Cell.

Fafalada said:
That only works if the core can dual issue FP and VMX instructions together (there's a reason why the PSP's primary CPU is only rated at 2.6 GFLOPS rather than 3.3, even though the total number of FMACs would give you the latter number).
So... calling any Xenon insiders... can it :?:
FPU and VMX are independent IIRC.

OT: What's this about PSP and 222 MHz not 333 MHz?

FPU + VFPU = 10 FP ops/clock = 3.33 GFLOPS at 333 MHz

Since you cannot co-issue from both pipes at the same time, they settled on ~2.6 GFLOPS at 333 MHz.
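
The arithmetic, as a quick sketch. The 10 FP ops/clock total is from the line above; the split into 2 for the FPU and 8 for the VFPU is my assumption, chosen because it reproduces the ~2.6 figure.

```python
# Peak FLOPS with and without co-issue between two FP pipes.
# The 2 + 8 split of the 10 FP ops/clock is an assumption.

CLOCK_GHZ = 0.333           # 333 MHz
FPU_FLOPS_PER_CLOCK = 2     # assumed: one scalar multiply-add
VFPU_FLOPS_PER_CLOCK = 8    # assumed: four-wide multiply-add

# If both pipes could issue in the same cycle, the peaks add up.
co_issue_gflops = (FPU_FLOPS_PER_CLOCK + VFPU_FLOPS_PER_CLOCK) * CLOCK_GHZ

# If only one pipe can issue per cycle, the peak is the wider pipe alone.
no_co_issue_gflops = max(FPU_FLOPS_PER_CLOCK, VFPU_FLOPS_PER_CLOCK) * CLOCK_GHZ

print(f"co-issue:    {co_issue_gflops:.2f} GFLOPS")
print(f"no co-issue: {no_co_issue_gflops:.2f} GFLOPS")
```

The same question decides whether FPU and VMX peaks can be added together for Xenon.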

Deano, do you think the PPE's VMX is different from the VMX units in the Xbox 2 CPU?
 
DeanoC said:
FPU and VMX are independent IIRC.
Well I assumed that much - but FPU and VFPU are also independent but that doesn't mean you can co-issue them :p
Or if you look at an actual dual issue CPU core example - R5900 FPU and VU0 macro instructions couldn't co-issue either.

It would certainly be handy if it can be done though.
 
[Image: kaigaip026.jpg]


Fafalada said:
Jaws said:
But three beefed up PPE cores would be pushing it at 90nm and we may only see two.
Why would this be pushing it?
Eyeballing it from the Cell die, 3 PPEs (but slightly larger) with 1MB L2 wouldn't be much larger than one half of the 8/1 Cell size... and also targeting lower MHz.

Say we give it ~160mm2, is that really so unreasonable?
Now... should there be some other changes... (such as say, 2MB of L2) then I would start wondering about the size of the chip...
...

Hmm, dunno Faf,...I think you must be referring to your 294.912 mm2 version of CELL! :p

Looking at it again: I ignored the XIO and FlexIO areas on the die and took a more obese version of the PPE core, i.e. more than just the extra-large 128 x 128-bit VMX register file per core, adding an extra unknown to the core, like perhaps the FPUs also having 128 x 128-bit registers, etc. Taking those factors into account, 3 cores look like they approach the CELL die, and that looks like pushing it.

I'm expecting a sub-200 mm2 Xe CPU, given MS's talk of being more profitable with Xenon and the precedent of the Xbox CPU die size, but who knows, they could push 4 cores and clock lower. If they used two cores instead, I'd expect them to clock them at 4 GHz+...
 
Fafalada said:
Jaws said:
But three beefed up PPE cores would be pushing it at 90nm and we may only see two.
Why would this be pushing it?
Eyeballing it from the Cell die, 3 PPEs (but slightly larger) with 1MB L2 wouldn't be much larger than one half of the 8/1 Cell size... and also targeting lower MHz.

Say we give it ~160mm2, is that really so unreasonable?
Now... should there be some other changes... (such as say, 2MB of L2) then I would start wondering about the size of the chip...

They could use an individual L2 cache for each core. This way they could create the capability to lock a chunk of the L2, in effect creating something similar to a SPU's local storage (but virtually addressed) and gain the capabilities of a streaming processor without the SPU drawbacks. Either 512KB/core and no Level 3 or 256KB per core and shared 2MB L3 seems reasonable.

Cheers
Gubbi
 
They could use an individual L2 cache for each core. This way they could create the capability to lock a chunk of the L2, in effect creating something similar to a SPU's local storage (but virtually addressed) and gain the capabilities of a streaming processor without the SPU drawbacks.
Why individual caches though? L2 locking is already there (that's how the GPU reads out from it in the first place), and once you've locked it, isn't it easier to allocate the chunks for each core in software? So you even get to decide how much "local memory" each "streaming processor" will get.
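
To make the idea concrete, a purely hypothetical sketch of slicing one locked L2 region between cores in software. The function name, the 128-byte line size, the 512 KB locked region and the 2:1:1 split are all invented for illustration; nothing here reflects an actual Xenon API.

```python
# Hypothetical software partitioning of a locked L2 region.
# All names and sizes are illustrative assumptions.

def partition_locked_region(locked_bytes, shares, line_bytes=128):
    """Split a locked region into per-core (offset, size) slices.

    Sizes are proportional to `shares` and rounded down to a whole
    number of cache lines so that no line is shared between cores.
    """
    total = sum(shares)
    slices = []
    offset = 0
    for share in shares:
        size = (locked_bytes * share // total) // line_bytes * line_bytes
        slices.append((offset, size))
        offset += size
    return slices

# Lock 512 KB and give core 0 twice as much as cores 1 and 2.
slices = partition_locked_region(512 * 1024, shares=[2, 1, 1])
for core, (offset, size) in enumerate(slices):
    print(f"core {core}: offset {offset:#08x}, {size // 1024} KB")
```

With per-core caches the split would be fixed in hardware; here it is just a parameter.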
 
Fafalada said:
They could use an individual L2 cache for each core. This way they could create the capability to lock a chunk of the L2, in effect creating something similar to a SPU's local storage (but virtually addressed) and gain the capabilities of a streaming processor without the SPU drawbacks.
Why individual caches though? L2 locking is already there (that's how the GPU reads out from it in the first place), and once you've locked it, isn't it easier to allocate the chunks for each core in software? So you even get to decide how much "local memory" each "streaming processor" will get.

Well, it would require software, that is an OS call to request locking.

You might be right that having a single large L2 cache with locking is easier. It certainly makes scheduling easier, otherwise you'd want the OS to schedule threads that access the locked region on the core that has locked the memory.

BTW. Why would you need to lock part of the L2 to have the GPU read from it? The caches are kept coherent with the rest of the memory system so a memory request would be served by the L2 if data was found in it.

Cheers
Gubbi
 
Akira said:
Actually according to Aaron Spink (a real engineer)
:)
Being "a real engineer" isn't exotic enough to warrant instant credibility here. There's a lot of 'em about. Ph. Ds, scientists, there are even rumours of professors stalking the fora....
Be afraid. Be Very afraid.

Of course, being flesh and blood people and typically anonymous, they can still be quite opinionated (cough!). Nor do they always speak in their area of expertise. :)
 
I for one do not believe Xenon CPU cores are based on Cell's PU/PPE. I think Xenon's cores are probably beefier Power/POWER cores than the PU/PPE in Cell.

I guess I will have to eat my words if I am wrong though.
 
Megadrive1988 said:
I for one do not believe Xenon CPU cores are based on Cell's PU/PPE. I think Xenon's cores are probably beefier Power/POWER cores than the PU/PPE in Cell.
I tend to agree, but not sure how 'beefy' it is... As I wrote in some other thread Cell has the hardware virtualization feature which MS won't need for Xenon while the assembler for VC8 seems to be able to interpret instructions for the Vanderpool Technology from Intel in addition to the Xbox 2 PPC opcodes. It's nice to run 2 OSs simultaneously on a desktop PC media center, but it's too much for Xenon.

The PPE is meant to run OSs and orchestrate the SPEs (BTW this patent app from Hewlett Packard in 2003 may be interesting), but in the case of Xenon each core looks much more important.
 
Gubbi said:
You might be right that having a single large L2 cache with locking is easier. It certainly makes scheduling easier, otherwise you'd want the OS to schedule threads that access the locked region on the core that has locked the memory.
I also imagine the hardware logic required to implement individual caches would be somewhat higher than for one unified cache? But anyway, I think it's just nicer if I can slice up the locked portion any way I see fit rather than have a predetermined setup.

BTW. Why would you need to lock part of the L2 to have the GPU read from it? The caches are kept coherent with the rest of the memory system so a memory request would be served by the L2 if data was found in it.
Well it just makes more sense to me, GPU generally wants to do batch memory transfers, not read/write single cache lines, and in cases where CPU and GPU cooperate on processing same data like this, I will most likely do some kind of streaming setup on CPU, so I would want to lock cache for that purpose also.
Granted I'm just assuming things, but GCN also has this kind of setup for L1 cache GPU access.

I agree that the GPU accessing the cache as is would be useful too, just IMO not quite as often.
 
Jaws said:
I'm expecting a sub-200 mm2 Xe CPU, given MS's talk of being more profitable with Xenon and the precedent of the Xbox CPU die size, but who knows, they could push 4 cores and clock lower. If they used two cores instead, I'd expect them to clock them at 4 GHz+...

Xbox was a money hole because MS (a) bought the GPU chips directly from nVidia instead of having rights to the tech and placing orders with a fab themselves, (b) bought their CPUs from Intel, and (c) did not, and do not, have the rights to take the tech, integrate it into one chip, and shrink it. (d) They also have an expensive HDD.

MS business decisions on its new console have clearly focused on the issues that lost them a lot of money. While I am not saying that they will have a 200+ mm^2 CPU, I think we are assuming a lot by connecting the dots: "MS wants to be more profitable" + "Large chips cost more money" = "MS won't have a large chip". This logic ignores what specifically led to the Xbox being a money pit.

Dave made a comment about the line of thought above. MS wants to be profitable this next gen, but they have already made moves to fix past mistakes. "Making money" does not necessitate skimping at launch. Making money means that 2-4 years down the road, when a large percentage of console sales are made, the HW is not losing hundreds of dollars per unit. But this does not rule out losing money per unit in the first year. Preventing losses means getting better yields on the chips, shrinking them down and eventually integrating them into one chip. It also means not having fixed contracts with companies to buy a chip that will never achieve the above cost-saving measures. MS, with the tech deals with ATi and IBM, seems to have corrected these mistakes. But this says nothing about whether they will, or will not, have a big CPU. Look at the size of the PS2 (or CELL for that matter) CPU at launch. Yet Sony was able to get on a smaller process quickly and to break even on every unit sold quickly by planning ahead.

I do not know MS's specific design goals, but I am sure they have a plan and are sticking to it. And I am pretty sure that plan is a nice piece of HW that they may lose money on initially (as most consoles lost money initially) but is a thought out design that through future manufacturing processes will allow the console to be built and sold with little to no loss.

Whether this means a smaller CPU is not something any of us know (less ::cough::conaed::cough::). Assuming they plan to be more profitable = a smaller die size is a leap of logic IMO. There are a lot of areas they can save money other than making smaller chips at launch. Considering the CPU will most likely be 90nm and a 65nm shrink will probably happen by the end of 2006 my guess would be that MS could easily go with a large chip at launch and be down to a chip 1/2 the size within 12mo.
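
A quick sanity check on the "1/2 the size" guess, assuming an ideal optical shrink where area scales with the square of the feature size. Real shrinks usually do somewhat worse (pads, analog and I/O don't scale as well), and the 200 mm2 launch size is only a placeholder.

```python
# Ideal die-area scaling for a process shrink.
# Assumes area scales with (new_node / old_node) ** 2.

def ideal_shrink_area(area_mm2, old_nm, new_nm):
    """Area after an ideal linear shrink from old_nm to new_nm."""
    return area_mm2 * (new_nm / old_nm) ** 2

scale = ideal_shrink_area(1.0, old_nm=90, new_nm=65)

launch_area_mm2 = 200.0  # hypothetical launch die size
shrunk_area_mm2 = ideal_shrink_area(launch_area_mm2, 90, 65)

print(f"area scale factor: {scale:.2f}")
print(f"{launch_area_mm2:.0f} mm2 at 90nm -> {shrunk_area_mm2:.0f} mm2 at 65nm")
```

So a 90nm to 65nm move gives roughly half the area in the ideal case, which is in line with the guess above.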
 
I think you're missing my point about a smaller die clocked higher, e.g. assuming identical cores,

3 cores @ 3GHz = 2 cores @ 4.5 GHz for FLOPS

Just an analogy but the higher clocked die should give better single threaded performance. They could go with 2-4 cores depending on clock, heat, die size and performance for single/multi-threading. They already have a very capable R500 for multi-threading performance so may decide to boost single threaded performance on the CPU. A smaller die CPU should give better yields/cost also.
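
Spelling out the analogy, assuming identical cores so that FLOPS per clock per core is the same on both chips. The value of 8 FLOPS per clock is a placeholder; it cancels out of the comparison.

```python
# Aggregate vs. single-thread peak for identical cores.
# FLOPS_PER_CLOCK is a placeholder assumption.

FLOPS_PER_CLOCK = 8  # assumed per-core figure, same for both chips

def peak_gflops(cores, clock_ghz, flops_per_clock=FLOPS_PER_CLOCK):
    """Peak GFLOPS across `cores` identical cores."""
    return cores * clock_ghz * flops_per_clock

three_core_total = peak_gflops(cores=3, clock_ghz=3.0)
two_core_total = peak_gflops(cores=2, clock_ghz=4.5)

# Per-core peak is what a single thread can reach.
three_core_single = peak_gflops(cores=1, clock_ghz=3.0)
two_core_single = peak_gflops(cores=1, clock_ghz=4.5)

print(three_core_total, two_core_total)    # same aggregate peak
print(three_core_single, two_core_single)  # higher clock wins per thread
```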
 
Jaws said:
I think you're missing my point about a smaller die clocked higher, e.g. assuming identical cores,

3 cores @ 3GHz = 2 cores @ 4.5 GHz for FLOPS

Just an analogy but the higher clocked die should give better single threaded performance. They could go with 2-4 cores depending on clock, heat, die size and performance for single/multi-threading. They already have a very capable R500 for multi-threading performance so may decide to boost single threaded performance on the CPU. A smaller die CPU should give better yields/cost also.

A faster clocked die doesn't have to have better performance at all. Look at the Pentium M compared to the Pentium 4. What you need to look at is IPC * speed, not just speed.
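
As a sketch of the IPC * speed point: the numbers below are invented for illustration and are not measurements of a Pentium M or a Pentium 4.

```python
# Performance as IPC * clock. All figures are made up for illustration.

def instructions_per_ns(ipc, clock_ghz):
    """Sustained instructions per nanosecond = IPC * clock (GHz)."""
    return ipc * clock_ghz

# Hypothetical chip A: lower clock, higher IPC.
chip_a = instructions_per_ns(ipc=1.5, clock_ghz=2.0)

# Hypothetical chip B: higher clock, lower IPC.
chip_b = instructions_per_ns(ipc=0.9, clock_ghz=3.0)

print(f"chip A: {chip_a:.2f} instr/ns, chip B: {chip_b:.2f} instr/ns")
```

With these made-up figures the slower-clocked chip comes out ahead, which is the whole point.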
 
a688 said:
Jaws said:
I think you're missing my point about a smaller die clocked higher, e.g. assuming identical cores,

3 cores @ 3GHz = 2 cores @ 4.5 GHz for FLOPS

Just an analogy but the higher clocked die should give better single threaded performance. They could go with 2-4 cores depending on clock, heat, die size and performance for single/multi-threading. They already have a very capable R500 for multi-threading performance so may decide to boost single threaded performance on the CPU. A smaller die CPU should give better yields/cost also.

A faster clocked die doesn't have to have better performance at all. Look at the Pentium M compared to the Pentium 4. What you need to look at is IPC * speed, not just speed.

Yes, absolutely, which is why I stated in my assumption above that the cores are identical.
 
I was thinking this morning about the X2 CPU (or Xenon CPU) and what it could be.

The Power5 is a server CPU made up of:

-4 cores at 3 GHz that support multithreading.
-38MB of L3 cache.
-The L3 cache takes up 50% of the chip area.
-The chip area is around 400 mm2.

If you take out the 38MB L3 cache you have a 200 mm2 chip with 4 cores, but you could swap one of the cores for 3 VMX units and keep the Power5's L3 cache bus for direct transfers between the ATI GPU and the IBM CPU.

This is my idea of the Xenon CPU.
 