Predict what processor will be used for the Xbox 360

McFly · Apr 26, 2005

From our calculations here, Cell should be a little more than three times faster than the XeCPU at the same frequency. Right?

Fredi

blakjedi · Apr 26, 2005

london-boy said:
blakjedi said:

If the XeCPU and the Cell chips were clocked at exactly the same rate, either both 3 GHz or both 4 GHz, which chip would be more powerful? My guess (uneducated at best) is that you could do "more" with the XeCPU.

What makes you think that?
If there is one thing that's for sure, Cell will be hard to beat.
Anything could happen though. And it still has to be seen what each CPU will have to take care of.

I'm no librarian so I can't cite all the MILLIONS of threads on this topic from just this site but... it seems that XeCPU can do "more different types" of work than Cell by virtue of having more PPEs available.

Cell is better at streaming/FP because it has more SPEs, which I consider to be less capable than, but faster than, a VMX unit. But those SPEs only have value when attached to a PPE.

Maybe I should go pull up the numbers in terms of "work" expected to be done by an SPE versus a VMX unit but... in a very touchy-feely, based-on-my-memory kind of way I sense that at the same clock rate the XeCPU could do more, though not faster, than the Cell.

Shifty Geezer · Apr 26, 2005

I think you're off there. If spending zillions on a fresh processor design spawns something that just going multi-core on existing tech can outperform, why waste the money? :?

SPEs aren't just for decompressing MPEGs! There's programmable usefulness that can chug through lots of maths very quickly, more so than any other single block of silicon. I don't think anyone (not under NDA) has enough info on the instruction set to know what is and is not possible, but by all accounts it's fairly extensive.

The only realm where XeCPU might have the upper hand, judging from my questions on this, is AI, but even then I expect smart developers to find different ways of modelling AI that fit well into a stream processor architecture.

My guess...um...Cell will be 2 or 3 times the power of XeCPU. What comes out of it at the other end is dependent on the rest of the system though, and beef-cake CPUs can still get constrained by other hardware limitations.

London Geezer · Apr 26, 2005

blakjedi said:
I'm no librarian so I can't cite all the MILLIONS of threads on this topic from just this site but... it seems that XeCPU can do "more different types" of work than Cell by virtue of having more PPEs available.

Cell is better at streaming/FP because it has more SPEs, which I consider to be less capable than, but faster than, a VMX unit. But those SPEs only have value when attached to a PPE.

Maybe I should go pull up the numbers in terms of "work" expected to be done by an SPE versus a VMX unit but... in a very touchy-feely, based-on-my-memory kind of way I sense that at the same clock rate the XeCPU could do more, though not faster, than the Cell.

Well then the next question would be:

Do more what? Faster at what?
Cell will push an insane amount of floating point calculations, much more than the XeCPU will; that's pretty much all we know.
Also, the "more things" XeCPU might be able to do, will they be what a game needs?
Cell seems to be a chip that, pushing a lot of FLOPs, will be very good for games and multimedia content, which is what these platforms are about.
If the XeCPU has a 3-core-whatever-PPC-they-decide, it remains to be seen how that will perform in games.

They'll both be very good chips, and I have the feeling they will be doing different things (it would explain the focus on FLOPs in Cell), so comparisons will be kinda useless.

Titanio · Apr 26, 2005

blakjedi said:
I'm no librarian so I can't cite all the MILLIONS of threads on this topic from just this site but... it seems that XeCPU can do "more different types" of work than Cell by virtue of having more PPEs available.

The SPEs can do general work. They're no PPE, but they can do it. So Cell can use its own PPE and its 6-8 SPEs for "more different types of work". The question would then become: how much faster is a PPE at "more different types of work" than an SPE? Assuming both CPUs are concentrating on the same things, and only the same things, a PPE would need to be more than 3x (against 6 SPEs) or 4x (against 8 SPEs) as fast as an SPE for three PPEs to outperform 1 PPE and 6/8 SPEs (and yes, this is assuming you can split the task 3 or 4 ways so the SPEs can all contribute to it).
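
The break-even point above can be checked with a few lines of arithmetic. This is only a sketch under the post's own idealised assumption that the work splits perfectly across every unit; `break_even_ratio` and `three_ppes_win` are hypothetical helper names, and `r` stands for how many times faster one PPE is than one SPE at this kind of work.

```python
from fractions import Fraction


def break_even_ratio(num_spes, xe_ppes=3, cell_ppes=1):
    """PPE-to-SPE speed ratio r at which both chips have equal throughput.

    Measured in "SPE units" of work per second:
      XeCPU: xe_ppes * r
      Cell:  cell_ppes * r + num_spes
    Setting them equal gives r = num_spes / (xe_ppes - cell_ppes).
    """
    return Fraction(num_spes, xe_ppes - cell_ppes)


def three_ppes_win(r, num_spes):
    """True if three PPEs out-run one PPE plus num_spes SPEs at ratio r."""
    return 3 * r > 1 * r + num_spes


print(break_even_ratio(6))   # 3
print(break_even_ratio(8))   # 4
print(three_ppes_win(2, 6))  # False
```

So a PPE that is, say, twice as fast as an SPE at general work would not be enough for three of them to beat one PPE plus six SPEs, at least under this perfectly parallel model.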

Aside from all that, I don't think it matters. Cell has been optimised for use in a games console. When you're doing that kind of optimising, you look at the biggest bottlenecks. Cell will excel at those things which take longest per frame to compute - I've read that the biggest offenders include collision detection and physics, things which should mesh very nicely with Cell (I would think)*. The "more different types of work" probably aren't going to be taking up a hell of a lot of time per frame, anyway, so there's not much point in dedicating your silicon to them.

You can't simply say, Cell excels at X,Y,Z but three PPEs are better for A,B,C,D,E, hence 3 PPEs > Cell. You have to look at how much time X,Y,Z take versus A,B,C,D,E - my impression is that STI have optimised for those things that generally take longest, and that the impact of other "types of work" will be relatively small compared to them.

Some of the above is just my own speculation - anyone care to disagree?

*I had read in Christer Ericson's "Real-Time Collision Detection" that collision detection can take over 30% of frametime. I imagine physics ain't a walk in the park either.

blakjedi · Apr 26, 2005

Thanks for the education, guys. I guess I have been getting the sense lately that the "expected" XeCPU was evening up with the "expected" Cell in terms of performance.

I remember some folks from the board writing out chip vs chip performance peaks and rating the XeCPU between 90 and 240 GFLOPS at 3 or 3.5 GHz and the Cell at 256 GFLOPS based on 4~4.6 GHz.

On another note, it would be hard for me to believe that MS, with basically very extensive prior knowledge of the Cell project, would not develop with Cell in mind. The fact that XeCPU is a clean sheet design is one reason. The fact that we know a whole lot more about Cell than about the XeCPU, which may mean that there is more to the XeCPU than meets the eye (and the calculator), is another.

*****This is pure speculation*****

The "beefier" VMX that L-B talked about may just be a dual issue VMX (if it doesn't exist already). Dual issue VMX units would even things up a bit I think. The XeCPU would have the equivalent of 6 single issue SPEs and three dual issue PPEs. In all, 9 vector threads and 3 integer threads simultaneously every clock cycle.

Of course this is all meaningless if there is a PPU, which is also speculation.
*****End of speculation*****

Sorry for having taken you through this exercise just 'cause I'm busy at work and don't have the time to search.

Gubbi · Apr 26, 2005

Titanio said:
blakjedi said:

I'm no librarian so I can't cite all the MILLIONS of threads on this topic from just this site but... it seems that XeCPU can do "more different types" of work than Cell by virtue of having more PPEs available.

You can't simply say, Cell excels at X,Y,Z but three PPEs are better for A,B,C,D,E, hence 3 PPEs > Cell. You have to look at how much time X,Y,Z take versus A,B,C,D,E - my impression is that STI have optimised for those things that generally take longest, and that the impact of other "types of work" will be relatively small compared to them.

True, there'll be no apples to apples comparisons this generation.

Titanio said:
Aside from all that, I don't think it matters. Cell has been optimised for use in a games console. When you're doing that kind of optimising, you look at the biggest bottlenecks. Cell will excel at those things which take longest per frame to compute - I've read that the biggest offenders include collision detection and physics, things which should mesh very nicely with Cell (I would think)*. The "more different types of work" probably aren't going to be taking up a hell of a lot of time per frame, anyway, so there's not much point in dedicating your silicon to them.

*I had read in Christer Ericson's "Real-Time Collision Detection" that collision detection can take over 30% of frametime. I imagine physics ain't a walk in the park either.

Collision detection is a part of physics I'd say.

The problem with collision detection is that you'll need a space decomposition structure to speed up queries. This is almost always some sort of tree (octree).

In a normal CPU you'd sort/bundle the objects for spatial locality. That way you get a fair amount of reuse of the octree nodes thanks to the caches of a general purpose CPU.

An SPE doesn't have automatically demand-loaded caches; it has explicitly loaded local RAM. So you have to do something either to increase the reuse of data or to hide the latency of loading the nodes. This can be done in a variety of ways:

1.) Explicitly load the node needed from main memory when traversing the tree. Vertically thread your collision detection code to have many queries executing simultaneously and thereby hide the main memory latency.
2.) Have a software cache system. Instead of explicitly fetching each node from main memory, use a cache_load function or macro to implicitly load a node into a chunk of local RAM. The first query to hit a node will load the node into the cache; subsequent queries that hit in the cache will get the locally cached node.
3.) Do something completely different than an octree.

1. Is going to be hard since main memory latency will be on the order of 40-50 ns (my guess), cycle time will be 0.26 ns, and hence you'd have to cover 160-200 cycles' worth of latency (or 320-400 instructions at two issued per cycle).
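
As a quick check of the arithmetic in point 1: the quoted 160-200 cycles correspond to a round 0.25 ns cycle (4 GHz), while 0.26 ns gives roughly 154-192. The sketch below assumes two instructions issued per cycle to turn cycles into instruction slots; that issue width, like the 40-50 ns latency guess itself, is an assumption, and `cycles_to_cover` is a hypothetical helper name.

```python
def cycles_to_cover(latency_ns, cycle_ns):
    """Number of clock cycles that elapse while one memory fetch is in flight."""
    return latency_ns / cycle_ns


ISSUE_WIDTH = 2  # assumption: two instructions issued per cycle

for cycle_ns in (0.25, 0.26):
    for latency_ns in (40, 50):
        cycles = cycles_to_cover(latency_ns, cycle_ns)
        slots = cycles * ISSUE_WIDTH
        print(f"{latency_ns} ns at {cycle_ns} ns/cycle: "
              f"{cycles:.0f} cycles, {slots:.0f} instruction slots")
```

Either way, each query would need a few hundred instructions of independent work queued up behind every node fetch for the latency to be fully hidden.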

2. Will induce latency since every memory reference will go through a software layer. A simple flat cache (1-way associative, i.e. direct-mapped) will require a mask, a compare and two loads. In order to get a cached value you'd need to:
a) Mask the low bits of the address (a simple AND)
b) Load the address stored at the cache index (using the masked value)
c) Compare it to the requested address
d) If ok, return the cached value (a branch and subsequent load)
e) Otherwise start main memory fetch

A cache hit (a-d) would be adding 20-something cycles of latency to each load as compared to 3-4 cycles of latency of common L1 caches.
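
Steps a) to e) can be sketched as a small direct-mapped software cache. This is only an illustration in Python rather than SPE code: the node size, the number of lines, the `SoftwareCache` class and the dictionary standing in for main memory are all assumptions, and the index here is formed with a shift plus a mask instead of a single AND.

```python
NODE_SHIFT = 6    # assumption: 64-byte nodes, so the low 6 address bits are the offset
NUM_LINES = 256   # assumption: 256 lines of local RAM set aside for the cache


class SoftwareCache:
    """Direct-mapped (1-way) software cache in front of a slow fetch function."""

    def __init__(self, fetch_from_main_memory):
        self.fetch = fetch_from_main_memory
        self.tags = [None] * NUM_LINES   # address held in each line
        self.data = [None] * NUM_LINES   # node held in each line
        self.hits = 0
        self.misses = 0

    def cache_load(self, address):
        index = (address >> NODE_SHIFT) & (NUM_LINES - 1)  # (a) mask to pick a line
        tag = self.tags[index]                              # (b) load the stored address
        if tag == address:                                  # (c) compare with the request
            self.hits += 1
            return self.data[index]                         # (d) hit: return cached node
        self.misses += 1
        node = self.fetch(address)                          # (e) miss: go to main memory
        self.tags[index] = address
        self.data[index] = node
        return node


# Stand-in for main memory: node addresses are multiples of 64.
main_memory = {addr: f"node@{addr}" for addr in range(0, 64 * 1024, 64)}
cache = SoftwareCache(main_memory.__getitem__)

cache.cache_load(128)                    # miss: first query to touch this node
cache.cache_load(128)                    # hit: served from the local copy
cache.cache_load(128 + 64 * NUM_LINES)   # miss: maps to the same line, evicts 128
print(cache.hits, cache.misses)          # 1 2
```

Even in this toy form, every load pays for the index computation, the tag load, the compare and a branch before any data comes back, which is the overhead the 20-something cycle estimate refers to.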

Having a multi-way cache is going to add cost since it requires multiple compares, and branch mispredicts are expensive (19 cycles).

3. Since latency is the killer and CELL appears to have ample bandwidth you'd probably stuff multiple nodes into one supernode and trade off bandwidth for latency.

To sum it up: Collision is one area where a CPU with demand loaded caches will do better than a SPE, IMO.

The SPEs will do well on workloads with stream properties, including vector workloads. Have any kind of memory indirection and the overhead becomes staggering.

Cheers
Gubbi

Titanio · Apr 26, 2005

blakjedi said:
I remember some folks from the board writing out chip vs chip performance peaks and rating the XeCPU between 90 and 240 GFLOPS at 3 or 3.5 GHz and the Cell at 256 GFLOPS based on 4~4.6 GHz.

The estimates are:

Tri-core X360 CPU @ 3Ghz: 90Gflops
8-SPE Cell chip @ 4Ghz: 296Gflops

I don't expect PS3's CPU to be 8-SPE or 4Ghz though. My own guess is 6 useable SPEs at 3-3.5Ghz (so 174-203Gflops).
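
For reference, one simple model reproduces these figures. It assumes (an assumption, not something stated in the thread) that every vector unit, whether an SPE or a PPE's VMX, retires a 4-wide multiply-add per cycle (8 flops) and that each PPE's scalar FPU adds one multiply-add per cycle (2 flops). `peak_gflops` is a hypothetical helper, and these are paper peaks, not measured performance.

```python
VECTOR_FLOPS_PER_CYCLE = 8  # assumption: 4-wide multiply-add = 4 lanes * 2 flops
SCALAR_FLOPS_PER_CYCLE = 2  # assumption: one scalar multiply-add per PPE per cycle


def peak_gflops(clock_ghz, ppes, spes):
    """Paper peak in GFLOPS for a chip with the given PPE and SPE counts."""
    per_ppe = VECTOR_FLOPS_PER_CYCLE + SCALAR_FLOPS_PER_CYCLE
    flops_per_cycle = ppes * per_ppe + spes * VECTOR_FLOPS_PER_CYCLE
    return flops_per_cycle * clock_ghz


print(peak_gflops(3, ppes=3, spes=0))    # 90    (tri-core X360 CPU @ 3 GHz)
print(peak_gflops(4, ppes=1, spes=8))    # 296   (8-SPE Cell @ 4 GHz)
print(peak_gflops(3, ppes=1, spes=6))    # 174   (6 usable SPEs @ 3 GHz)
print(peak_gflops(3.5, ppes=1, spes=6))  # 203.0 (6 usable SPEs @ 3.5 GHz)
```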

blakjedi said:
On another note, it would be hard for me to believe that MS, with basically very extensive prior knowledge of the Cell project, would not develop with Cell in mind.

I don't know, everyone has different priorities. It's a massive investment for Sony. Of course, MS could match that investment from a purely financial point of view, but Sony can perhaps be more confident of the volumes they can leverage with PS3 sales to make back their money and then some. MS wants to make money off of X360s sold this time around - putting something like Cell into their system would be difficult if you're trying to profit and if you don't have the (virtual) guarantee of high tens of millions of systems being sold.

blakjedi said:
The "beefier" VMX that L-B talked about may just be a dual issue VMX (if it doesn't exist already). Dual issue VMX units would even things up a bit I think. The XeCPU would have the equivalent of 6 single issue SPEs and three dual issue PPEs. In all, 9 vector threads and 3 integer threads simultaneously every clock cycle.

Well, dual-issue is different from dual threads, IIRC. And in the vast majority of cases, I don't think it'd compensate for the lack of extra physical units. Dual-issue has its limitations, it's not like doubling your power. You can't say it'd be like having 6 single-issue cores.

Titanio · Apr 26, 2005

Gubbi said:
Titanio said:

blakjedi said:

I'm no librarian so I can't cite all the MILLIONS of threads on this topic from just this site but... it seems that XeCPU can do "more different types" of work than Cell by virtue of having more PPEs available.

You can't simply say, Cell excels at X,Y,Z but three PPEs are better for A,B,C,D,E, hence 3 PPEs > Cell. You have to look at how much time X,Y,Z take versus A,B,C,D,E - my impression is that STI have optimised for those things that generally take longest, and that the impact of other "types of work" will be relatively small compared to them.

True, there'll be no apples to apples comparisons this generation.

Titanio said:

Aside from all that, I don't think it matters. Cell has been optimised for use in a games console. When you're doing that kind of optimising, you look at the biggest bottlenecks. Cell will excel at those things which take longest per frame to compute - I've read that the biggest offenders include collision detection and physics, things which should mesh very nicely with Cell (I would think)*. The "more different types of work" probably aren't going to be taking up a hell of a lot of time per frame, anyway, so there's not much point in dedicating your silicon to them.

*I had read in Christer Ericson's "Real-Time Collision Detection" that collision detection can take over 30% of frametime. I imagine physics ain't a walk in the park either.

Collision detection is a part of physics I'd say.

The problem with collision detection is that you'll need a space decomposition structure to speed up queries. This is almost always some sort of tree (octree).

In a normal CPU you'd sort/bundle the objects for spatial locality. That way you get a fair amount of reuse of the octree nodes thanks to the caches of a general purpose CPU.

An SPE doesn't have automatically demand-loaded caches; it has explicitly loaded local RAM. So you have to do something either to increase the reuse of data or to hide the latency of loading the nodes. This can be done in a variety of ways:

1.) Explicitly load the node needed from main memory when traversing the tree. Vertically thread your collision detection code to have many queries executing simultaneously and thereby hide the main memory latency.
2.) Have a software cache system. Instead of explicitly fetching each node from main memory, use a cache_load function or macro to implicitly load a node into a chunk of local RAM. The first query to hit a node will load the node into the cache; subsequent queries that hit in the cache will get the locally cached node.
3.) Do something completely different than an octree.

1. Is going to be hard since main memory latency will be on the order of 40-50 ns (my guess), cycle time will be 0.26 ns, and hence you'd have to cover 160-200 cycles' worth of latency (or 320-400 instructions at two issued per cycle).

2. Will induce latency since every memory reference will go through a software layer. A simple flat cache (1-way associative, i.e. direct-mapped) will require a mask and a compare. In order to get a cached value you'd need to:
a) Mask the low bits of the address (a simple AND)
b) Load the address stored at the cache index (using the masked value)
c) Compare it to the requested address
d) If ok, return the cached value (a branch and subsequent load)
e) Otherwise start main memory fetch

A cache hit (a-d) would be adding 20-something cycles of latency to each load as compared to 3-4 cycles of latency of common L1 caches.

Having a multi-way cache is going to add cost since it requires multiple compares, and branch mispredicts are expensive (19 cycles).

3. Since latency is the killer and CELL appears to have ample bandwidth you'd probably stuff multiple nodes into one supernode and trade off bandwidth for latency.

To sum it up: Collision is one area where a CPU with demand loaded caches will do better than a SPE, IMO.

The SPEs will do well on workloads with stream properties, including vector workloads. Have any kind of memory indirection and the overhead becomes staggering.

Cheers
Gubbi

Thank you! This is exactly the kind of feedback I like when I start thinking of how certain things might work..

I don't feel qualified enough to tackle some of these points properly - possibly best left to someone with more experience. It's really hard for me to tell how much of an issue some of these things will be. I'm not sure how bad the second option would be, for example. If the piece of space you're dealing with - its vertices - fits within the SPE's local memory, would this be a problem at all (if you can guarantee everything you need to know about the subspace is locally available)? Can you not just load separate sets of vertices that are isolated from one another and let the SPE crunch away on them? Although perhaps I'm treading into option 3 territory (?). If it really is a problem, would the greater memory latency really outweigh computational advantages and a greater multiplicity of available cores?

Methinks we might be waiting for a definitive answer...

Also, I tend to separate out collision detection and response. I consider "response" to be the physics side, but I guess that's just semantics.

blakjedi · Apr 26, 2005

Titanio said:
Well, dual-issue is different from dual threads, IIRC. And in the vast majority of cases, I don't think it'd compensate for the lack of extra physical units. Dual-issue has its limitations, it's not like doubling your power. You can't say it'd be like having 6 single-issue cores.

Damn. Dual thread vs dual issue... never really considered the difference there. OK, it's so hard keeping up with you guys... y'all so damn informed/smart on this board... back to lurker status for me 8)

pc999 · Apr 26, 2005

Just to make one note.

Most of you seem to think that PS3 is the reason Cell exists, but that may not be so.

Cell should be very good for reviving markets like TV (since most people only buy a new TV because the old one is broken, and TVs with new features are too expensive for most people), and these markets have a lot more potential customers (= money) than PS3.

While Cell, no matter what, looks like a great CPU, assuming that it is built only for PS3 is dangerous.

BOOMEXPLODE · Apr 26, 2005

Yeah you shouldn't think Cell is only for PS3, but don't assume it has unlimited applications either. Even a scaled down Cell is going to be relatively large and high in transistor count for most embedded applications, when ARM processors etc. can be bought for much less. I really wonder how Sony will ever make back their investment in Cell. I wouldn't be surprised if it blows away the XeCPU though.

Mythos · Apr 26, 2005

While Cell, no matter what, looks like a great CPU, assuming that it is built only for PS3 is dangerous.

I think this is an obvious point, and it's common knowledge that Cell will be used elsewhere.

However, I'm inclined to think that we might need to look at Cell and XCPU differently than we have so far. The assumption has been that XCPU is a balanced CPU versus a straightforward raw-power concept like Cell. (Historically, raw power has been Sony's emphasis.) However, I think there might be more to Cell than just what's been shown and announced thus far (especially given the different patent variations on Cell). Also, the emphasis on what Cell and XCPU are going to handle might be different as well. Each architecture might be rendering different things or work well within its own graphics model.

Overall, maybe Cell, XCPU and the Rev. CPU all follow a different model from the familiar PC graphics one.

Titanio · Apr 26, 2005

pc999 said:
Most of you seem to think that PS3 is the reason Cell exists, but that may not be so.

PS3 is the driving force behind Cell. Without a new Playstation, work on Cell would never have started.

I'd be willing to bet that PS3 will be Cell's most popular application.

Carl B · Apr 26, 2005

pc999 said:
Just to make one note.

Most of you seem to think that PS3 is the reason Cell exists, but that may not be so.

Cell should be very good for reviving markets like TV (since most people only buy a new TV because the old one is broken, and TVs with new features are too expensive for most people), and these markets have a lot more potential customers (= money) than PS3.

While Cell, no matter what, looks like a great CPU, assuming that it is built only for PS3 is dangerous.

PS3 and the entire plan for the Playstation evolution was the reason for Cell; Kutaragi has stated as much. The fact that Cell also lends itself to multimedia functions is just a function of the larger vision that went into Cell's mission. Without the definite need for a processor for the Playstation 3, and the foregone conclusion that that console would sell millions of units, I think that the Cell project would have been too much of a shot in the dark to initiate solely for the consumer electronics controller chip and media workstation markets.

BOOMEXPLODE said:
Yeah you shouldn't think Cell is only for PS3, but don't assume it has unlimited applications either. Even a scaled down Cell is going to be relatively large and high in transistor count for most embedded applications, when ARM processors etc. can be bought for much less. I really wonder how Sony will ever make back their investment in Cell. I wouldn't be surprised if it blows away the XeCPU though.

I have no doubt that Sony will make back their investment. What have they invested? $3 billion? $4? Cell may not turn into a gigantic money maker on its own, but being able to use Cell chips in their console means another chip that they don't have to fab, purchase, and/or develop - so there's a definite cost trade-off there; they had to get a chip from somewhere anyway, right? Same thing with consumer electronics. If you can use a stripped down Cell chip instead of another chip, you're probably spending an equivalent sum, or slightly more, and perhaps giving your electronics an edge in the marketplace; at least Sony and Toshiba hope so.

To get more specific - Sony is building a fab and joining in the expansion of two others, which comprises the bulk of their investment. Fabs are an investment you can derive benefits from for a long, long time - they don't HAVE to build Cell chips there, after all. Their investment in the architectural research, a little under $600 million, I feel is also well worth it if it makes the PS3 the 'awesome' console it should be, and even more worth it when you consider that the architecture itself should remain viable for many, many years.

jvd · Apr 26, 2005

Titanio said:
pc999 said:

Most of you seem to think that PS3 is the reason Cell exists, but that may not be so.

PS3 is the driving force behind Cell. Without a new Playstation, work on Cell would never have started.

I'd be willing to bet that PS3 will be Cell's most popular application.

No, but work on Blue Gene would have continued.

blakjedi · Apr 26, 2005

Titanio said:
blakjedi said:

The "beefier" VMX that L-B talked about may just be a dual issue VMX (if it doesn't exist already). Dual issue VMX units would even things up a bit I think. The XeCPU would have the equivalent of 6 single issue SPEs and three dual issue PPEs. In all, 9 vector threads and 3 integer threads simultaneously every clock cycle.

Well, dual-issue is different from dual threads, IIRC. And in the vast majority of cases, I don't think it'd compensate for the lack of extra physical units. Dual-issue has its limitations, it's not like doubling your power. You can't say it'd be like having 6 single-issue cores.

Based on this thread...
http://www.beyond3d.com/forum/viewtopic.php?t=22250&start=100

If the XeCPU has six VMX units (2 per PPE) then what? Or what if it turns out that each core is actually dual cored, meaning six physical cores on die... anything is possible I guess.

still only 144 GFlops rating on the CPU?

Pozer · Apr 26, 2005

You're all wrong, XCPU will be named Nucleus and will be composed of 2500 Zilog Z80 cores.

Seriously, the murmurs I've been hearing are about the possibility of Cell ending up in future Apples, and Sony getting out of the clone business.

Carl B · Apr 26, 2005

Pozer said:
Seriously, the murmurs I've been hearing are about the possibility of Cell ending up in future Apples, and Sony getting out of the clone business.

I don't see why not. Or setting up an environment to emulate x86. Both are probably a ways out for now, and perhaps just not worth the effort, period, but as Sony is able to bring more and more functionality to Cell from the software side of things, I see no reason why they should continue to do things in line with the old WinTel paradigm.

Titanio · Apr 26, 2005

blakjedi said:
Based on this thread...
http://www.beyond3d.com/forum/viewtopic.php?t=22250&start=100

If the XeCPU has six VMX units (2 per PPE) then what? Or what if it turns out that each core is actually dual cored, meaning six physical cores on die... anything is possible I guess.

still only 144 GFlops rating on the CPU?

Where do you get 144Gflops from?

If each core had 2 VMX units, assuming they were standard and could be working away simultaneously, that'd yield 162Gflops @ 3Ghz.
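
The 162 figure follows from simple peak counting, assuming 8 flops per cycle per VMX unit (a 4-wide multiply-add) plus 2 flops per cycle from each core's scalar FPU; both numbers are assumptions rather than anything confirmed in the thread, and `xecpu_peak_gflops` is a hypothetical helper. Dropping the scalar FPUs from the count happens to give 144, which is one possible (unconfirmed) origin for that number.

```python
def xecpu_peak_gflops(clock_ghz, cores=3, vmx_per_core=1, count_scalar_fpu=True):
    """Paper peak for a multi-core PPE-style chip under the assumed per-unit rates."""
    vmx_flops = 8 * vmx_per_core               # assumption: 8 flops/cycle per VMX unit
    fpu_flops = 2 if count_scalar_fpu else 0   # assumption: 2 flops/cycle per scalar FPU
    return cores * (vmx_flops + fpu_flops) * clock_ghz


print(xecpu_peak_gflops(3))                                          # 90
print(xecpu_peak_gflops(3, vmx_per_core=2))                          # 162
print(xecpu_peak_gflops(3, vmx_per_core=2, count_scalar_fpu=False))  # 144
```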

It's not very likely that'll happen, though. I think they'll tweak the VMX units to allow two hardware threads to run on them (would explain the register file sizes in the leaked docs), but you won't have 2 separate physical units, I don't think. With another physical VMX unit the die size would probably be bigger than Cell's, once you factored in cache!

Regarding the thread you linked to, it's not at all clear that there's another VMX unit in there. In fact, I think the claim is that there's "just" another FPU in the VMX unit, but even that's in doubt. That would allow for hardware threads on the VMX units though (probably?).

3 dual-cores isn't going to happen. It'd get you your 6 physical VMX units, but the die size would be enormous.
