Does Cell Have Any Other Advantages Over XCPU Other Than FLOPS?

scificube said:
I would think the overall need to parallelize code in general is where the real difficulties lie.

Then again, I don't see why devs wouldn't take the easy way out. Both physics and graphics are supposedly embarrassingly parallelizable tasks, so why not spread those tasks over the cores of these CPUs and then run AI and game code serially on a separate core dedicated to just those things? I would think the equivalent of a 3.2 GHz processor dedicated solely to AI would be enough to get developers impressive results.
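The split described above - embarrassingly parallel work fanned out across cores, serial game/AI logic kept on one thread - can be sketched with plain pthreads. A minimal illustration only, not actual Xenon or Cell code; `game_step`, the four-worker split, and the body-integration workload are all invented for the example:

```c
#include <pthread.h>
#include <stddef.h>

#define N_BODIES  1024
#define N_WORKERS 4

static float pos[N_BODIES], vel[N_BODIES];

typedef struct { size_t begin, end; float dt; } slice_t;

/* Embarrassingly parallel: each worker integrates a disjoint slice. */
static void *physics_worker(void *arg) {
    slice_t *s = (slice_t *)arg;
    for (size_t i = s->begin; i < s->end; i++)
        pos[i] += vel[i] * s->dt;
    return NULL;
}

/* Serial AI/game logic stays on its own thread ("core"). */
static void run_ai_step(void) { /* branchy, stateful code here */ }

void game_step(float dt) {
    pthread_t tid[N_WORKERS];
    slice_t slices[N_WORKERS];
    size_t chunk = N_BODIES / N_WORKERS;
    for (int w = 0; w < N_WORKERS; w++) {
        slices[w] = (slice_t){ w * chunk, (w + 1) * chunk, dt };
        pthread_create(&tid[w], NULL, physics_worker, &slices[w]);
    }
    run_ai_step();                      /* runs serially meanwhile */
    for (int w = 0; w < N_WORKERS; w++)
        pthread_join(tid[w], NULL);
}
```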

Well, MS actually presented a schematic for how basic functions should be distributed among the cores, so they're certainly thinking along similar lines. (do we have that slide around here somewhere?)
 
Shifty Geezer said:
Cell should be able to maintain very good efficiency in feeding these in many situations due to the LS and structured data access too, so I'd expect the real-world float performance efficiency to be higher than most processors too in these situations.

As long as you realize that's an extremely optimistic outlook based only on hand-picked benchmarks and one-sided technical documents, both from IBM.

There's no way to know what the efficiency of CELL will be in the real world, functioning as a game CPU. Everyone is just guessing, and the combined sales pitch from IBM and the EE-like hype from Sony have sold many people on this amazing power. It's gotten so bad that anyone who even suggests CELL might not be that great is basically ripped to shreds.

Just remember it still has to be proven in the real world. And also remember who you're dealing with: Sony always talks a big game. ALWAYS.
 
xbdestroya said:
Cell actually has superior integer performance to the XeCPU, though the majority of that is on the SPEs - and thus harder to access/utilize(?). That 'three times the general purpose' talk came from Major Nelson comparing the Cell's single PPE to the XeCPU's triple Power-core arrangement. Granted, what would you expect, with him dismissing the SPEs as the 'DSPs' they are?

Again... I find I learn something new every day :)
 
xbdestroya said:
Well, MS actually presented a schematic for how basic functions should be distributed among the cores, so they're certainly thinking along similar lines. (do we have that slide around here somewhere?)

I'd love to see it :)

(guess I could have said these last two things in a single post...sorry)
 
A fair number of devs have spoken now about how they're distributing threads and power on the CPUs in their next-gen games, and to be honest the typical case would seem to mesh with Cell very well indeed. Beyond one core and the main game logic that runs on it, they're talking about spending "extra" threads/cores on physics, animation, audio, rendering optimisation, soft body dynamics (cloth, hair, fluids), particle systems, graphics etc. Something worth keeping in mind if considering the choices STI made with Cell and its likely "usefulness".

I'd be interested to see the MS "template"/guidelines too. I could dig up these other devs' comments I reference above if anyone wants them.
 
SubD said:
Thank you for that clearly argued and well-founded addition to the thread. Well done, two thumbs up, etc. We need more of your kind here at B3D.

Seriously though, this topic is so loaded it's hard to really give any kind of clear answer. A pro for cell versus xenon can just as well be seen as a con instead if one looks at it from the other direction.

What intrigues me most about Cell is the massive bandwidth of the EIB (768 bits/clock) and "chaining up" SPUs to perform a software pipeline operation. I'm sure programmers could do some really interesting things that way, as the SPUs could operate pretty much exclusively on the chip itself, without having to hit main memory except at the ends of the chain. And on-chip memory is reportedly enormously fast, even for random accesses.
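The chaining idea above can be sketched in plain C. This is only an illustration of the software-pipeline structure, not SPU code: `memcpy` stands in for the DMA transfers over the EIB, and the stages, `run_pipeline`, and buffer sizes are all invented for the example:

```c
#include <string.h>
#include <stddef.h>

#define CHUNK 256  /* small enough to fit comfortably in a 256 KB local store */

typedef void (*stage_fn)(float *buf, size_t n);

/* Hypothetical pipeline stages, one per "SPU" in the chain. */
static void stage_scale(float *buf, size_t n) { for (size_t i = 0; i < n; i++) buf[i] *= 2.0f; }
static void stage_bias(float *buf, size_t n)  { for (size_t i = 0; i < n; i++) buf[i] += 1.0f; }

/* Run a chunk through the whole chain. Main memory is touched only at
 * the ends; the intermediate data stays "on chip" as it is handed from
 * stage to stage. */
void run_pipeline(const float *in, float *out, size_t n,
                  stage_fn *stages, size_t n_stages) {
    float local[CHUNK];                       /* stand-in for SPU local store */
    memcpy(local, in, n * sizeof(float));     /* "DMA" in */
    for (size_t s = 0; s < n_stages; s++)
        stages[s](local, n);                  /* hand off down the chain */
    memcpy(out, local, n * sizeof(float));    /* "DMA" out */
}
```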

It's not really interesting what one chip does better than another, but rather how the strengths of each can be leveraged to produce interesting effects. We know Cell has 25GB/s of XDR memory bandwidth dedicated to it, whereas Xenon shares 22.5GB/s of GDDR bandwidth with Xenos. However, will that lead to any practical advantage? Especially as the X360 has twice the main RAM of the PS3. It's so hard to say for sure.
 
scificube said:
Each SPE has its own local memory, and since these aren't caches, cache misses are eliminated outright.
So all I'm getting here is that it doesn't have all the disadvantages of having "true" cache. But doesn't that also mean that the SPEs lack the advantages of having true cache?
The PPE has 512 KB of L2 cache to itself, where Xenon's cores share a single 1MB L2 cache.
Yeah, that's a bit incidental, as the shared 1MB L2 cache is the most cost-effective option. However, I haven't read the XCPU whitepapers (if there are any), but couldn't an engine dedicate 512 KB of the L2 cache to one core (the "main" core) while splitting the other 512 KB between the other cores as needed? Assuming that the cores are going to have different workloads (physics, A.I., audio, etc.), they're going to have different cache needs.
Cell has the ability to access the XDR and GDDR3 memory pools for its own needs, and in assistance to the GPU, via its on-chip XDRAM controller and FlexIO interface. Both the XDRAM controller and FlexIO appear to be faster than Xenon's FSB.
The fact that the Xbox 360 as a whole uses UMA makes this point kind of a wash, doesn't it? I'm assuming that with unified memory, the XCPU can read and write anywhere in memory.

Also, how does Cell compare in terms of general purpose processing (I know both chips are kinda weak in this area, but that's no big deal since that would be like construction workers criticizing a dump truck for not being as fast as a Honda Civic *cough* Anand *cough*)?
 
Shifty Geezer said:
Depends what you mean by integer disadvantage.
Process geometry, yes.
That's what I wanted to know for sure. Thanks.

to design a new CPU and produced something useless. Do you honestly believe them that stupid and incompetent? :???:

No, I think that, generally speaking, RISC processors have more FLOP power and are better suited to pushing geometry, while CISC processors are better suited to things like AI.

But how much power do you need for AI? I don't think this will be a problem for next-gen CPUs.

So the direction both XCPU and Cell are going is toward more geometry power, and that sounds logical to me.

Now we have to see how easily this SPE power will be available.
 
Titanio said:
I'd be interested to see the MS "template"/guidelines too. I could dig up these other devs' comments I reference above if anyone wants them.

Hey, I'd like to read up on them! Any interviews, essays, etc. on the topic at hand.

TIA
 
Titanio said:
A fair number of devs have spoken now about how they're distributing threads and power on the CPUs in their next-gen games, and to be honest the typical case would seem to mesh with Cell very well indeed. Beyond one core and the main game logic that runs on it, they're talking about spending "extra" threads/cores on physics, animation, audio, rendering optimisation, soft body dynamics (cloth, hair, fluids), particle systems, graphics etc. Something worth keeping in mind if considering the choices STI made with Cell and its likely "usefulness".

I'd be interested to see the MS "template"/guidelines too. I could dig up these other devs' comments I reference above if anyone wants them.

Please do. I for one would love some insight into our beloved developer's minds.
 
Alpha_Spartan said:
So all I'm getting here is that it doesn't have all the disadvantages of having "true" cache. But doesn't that also mean that the SPEs lack the advantages of having true cache?

You could implement a cache in software for a SPU if you wanted, although I'm not sure what the results would be like.

There is no hardware cache, correct. But on either system I think you really want to avoid, or be clever about, main memory accesses. On Xenon you're going to want your code to be very cache-aware, and on Cell you'll want your SPU code to be very aware of local memory. The SPU really forces you to make memory management explicit and to map it out in advance.
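A software cache of the kind mentioned above might look something like this direct-mapped sketch. Plain C for illustration only: `memcpy` stands in for the real DMA transfer (`mfc_get` on an actual SPU), and all the names and sizes are invented for the example:

```c
#include <string.h>
#include <stdint.h>

#define LINE_BYTES 128   /* a DMA-friendly line size */
#define N_LINES    64    /* 8 KB of local store spent on the cache */

static uint8_t  main_mem[1 << 16];          /* stand-in for main (XDR) memory */
static uint8_t  lines[N_LINES][LINE_BYTES]; /* lives in local store */
static uint32_t tags[N_LINES];
static int      valid[N_LINES];
static unsigned misses;

/* Direct-mapped, read-only software cache: on a miss, "DMA" a whole
 * line into local store. Returns a local-store pointer for the
 * requested main-memory address. */
uint8_t *sw_cache_get(uint32_t addr) {
    uint32_t line_addr = addr & ~(uint32_t)(LINE_BYTES - 1);
    uint32_t idx = (line_addr / LINE_BYTES) % N_LINES;
    if (!valid[idx] || tags[idx] != line_addr) {
        memcpy(lines[idx], &main_mem[line_addr], LINE_BYTES);  /* the "DMA" */
        tags[idx]  = line_addr;
        valid[idx] = 1;
        misses++;
    }
    return &lines[idx][addr - line_addr];
}
```

A write-back variant would also need dirty bits and a flush on eviction, which is where the bookkeeping cost of doing this in software starts to show.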

Alpha_Spartan said:
Also, how does Cell compare in terms of general purpose processing (I know both chips are kinda weak in this area, but that's no big deal since that would be like construction workers criticizing a dump truck for not being as fast as a Honda Civic *cough* Anand *cough*)?

Define "general purpose"?
 
From the CELL paper I read way back, they said that using a cache produced unpredictable latencies, so they removed the cache and used a register instead(???), so that latencies were always bad, but at least they were predictable (they never mentioned the downside, though).

It's a good example of the one-sidedness of these tech papers, though. They make something like removing the cache sound like a good thing.
 
scooby_dooby said:
From the CELL paper I read way back, they said that using a cache produced unpredictable latencies, so they removed the cache and used a register instead(???), so that latencies were always bad, but at least they were predictable (they never mentioned the downside, though).

They used local SRAM instead. The latencies are far from "always bad" - the exact opposite, in fact. Something like 4 to 6 cycles, I believe, possibly faster than L2 cache (although someone may want to correct me on that? I always assumed one of the big advantages of the local SRAM was its speed and/or access behaviour).

scooby_dooby said:
It's a good example of the one-sidedness of these tech papers, though. They make something like removing the cache sound like a good thing.

If it means, for example, that there's more silicon to use for more execution units, might that not be considered a good thing? You've got to accept that the choices they made, they made in the best interests of the chip. It's not like they were landed with a set of choices they subsequently had to justify or make "sound good". They had the choice to put cache in there or not, and they chose not to. It was a voluntary thing.

drpepper said:
Hey, I'd like to read up on them! Any interviews, essays, etc. on the topic at hand.

TIA

scificube said:
Please do. I for one would love some insight into our beloved developer's minds.

Here are a couple that spring to mind.

On Crytek's next engine: in a gamestar.de interview they were asked how they planned to use the CPU's potential. They said:

"We scale the individual modules such as animation, physics and parts of the graphics with the CPU, depending on how many threads the hardware offers."

John Carmack was asked how he was planning to use the CPU in his next engine, and he said:

"we’ve got the game and the renderer running as two primary threads and then we’ve got targets of opportunity for render surface optimization and physics work going on the spare processor, or the spare threads"

The NFactor 2 engine from Inis - that one with the freaky looking character with the hair - splits threads like this: a thread for the main game loop, and then a thread each for rendering, physics, hair simulation and audio.

Tim Sweeney told Anandtech how he was planning to use the SPUs - for physics, animation, particle systems, sound and "possibly" a few other areas. Perhaps most encouragingly, though, he said that the things the SPUs weren't very well suited for don't take much time anyway, and would run fine on the PPE - which would seem to rather explicitly support the choices STI made in terms of what they optimised and aimed the chip at.

There may be others, but these are the ones I can most readily remember.
 
Titanio said:
If it means, for example, that more silicon to use for more execution units, might that not be considered a good thing? You've got to accept that the choices they made, they made with the best interests of the chip. It's not like they were landed with a set of choices they subsequently had to justify or make "sound good". They had the choice to put cache in there or not, and they made the choice not to. It was a voluntary thing.

All I'm saying is those white-papers are totally one-sided and never tell both sides of the story. So that should be taken into account if we're basing our opinions almost entirely off of them, not to mention the equally one-sided and controlled tech demos.
 
scooby_dooby said:
All I'm saying is those white-papers are totally one-sided and never tell both sides of the story.
Hey, if you were around way back when, you could have read a Matrox whitepaper on the Millennium Mystique and how it was a good idea that they didn't include alpha blending as a supported feature, but rather used stippling... Of course whitepapers are one-sided - everybody knows that. Well, everybody with an ounce of common sense, anyway. ;) They probably served a function once, but have since become a tool of the company PR and marketing department.
 
scooby_dooby said:
All I'm saying is those white-papers are totally one-sided and never tell both sides of the story. So that should be taken into account if we're basing our opinions almost entirely off of them, not to mention the equally one-sided and controlled tech demos.

What does "controlled" mean? Anyway, my point is they had multiple paths they could have taken, and they chose one. I don't think they'd go out of their way to choose the worst path, do you? Obviously they're going to present and defend their work subsequently, but in the genuine belief that the decisions they took were the right ones, and with the ability to explain and justify those decisions.
 
Alpha_Spartan said:
So all I'm getting here is that it doesn't have all the disadvantages of having "true" cache. But doesn't that also mean that the SPEs lack the advantages of having true cache?

Yeah, that's a bit incidental, as the shared 1MB L2 cache is the most cost-effective option. However, I haven't read the XCPU whitepapers (if there are any), but couldn't an engine dedicate 512 KB of the L2 cache to one core (the "main" core) while splitting the other 512 KB between the other cores as needed? Assuming that the cores are going to have different workloads (physics, A.I., audio, etc.), they're going to have different cache needs.

The fact that the Xbox 360 as a whole uses UMA makes this point kind of a wash, doesn't it? I'm assuming that with unified memory, the XCPU can read and write anywhere in memory.

Also, how does Cell compare in terms of general purpose processing (I know both chips are kinda weak in this area, but that's no big deal since that would be like construction workers criticizing a dump truck for not being as fast as a Honda Civic *cough* Anand *cough*)?

Ok...I did say some things could be considered negative as I left room for people to make their own judgments.

I'll try to address what you ask me though.

Are LS's at an advantage over caches? No - both have their advantages. I was noting an advantage in this case because I think it is significant. The disadvantage, to my knowledge, is that devs will have to do a lot of micro-memory-management on their own, unless some high-level tool does it for them. I am of the thinking that developers will not mismanage the SPEs' LS (given it's such an obvious point of concern), so with this being the case, the absence of cache misses becomes the highlight. I contrast this with cache misses due to six threads thrashing one cache on Xenon. It is true that devs can lock off a portion of the cache for whatever reason, but I don't think this eliminates the possibility of a cache miss, nor the fact that six threads must share 1 MB of cache, whether one lets them duke it out on their own or one cuts out explicit portions for each thread.
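The kind of explicit micro-memory-management being described often takes the form of double-buffering: compute on one local buffer while the next chunk streams in. A rough sketch in plain C - here `memcpy` stands in for an asynchronous DMA, so this shows the structure of the pattern but not the actual compute/transfer overlap, and `process_stream` and the sizes are invented for the example:

```c
#include <string.h>
#include <stddef.h>

#define CHUNK 128

/* Double-buffered streaming: while the SPU computes on one local
 * buffer, the DMA engine fills the other. memcpy stands in for an
 * asynchronous transfer, so the overlap here is structural only. */
float process_stream(const float *src, size_t n_chunks) {
    float buf[2][CHUNK];                         /* two local-store buffers */
    float sum = 0.0f;
    memcpy(buf[0], src, sizeof buf[0]);          /* prime buffer 0 */
    for (size_t c = 0; c < n_chunks; c++) {
        int cur = c & 1, nxt = cur ^ 1;
        if (c + 1 < n_chunks)                    /* kick off the next transfer */
            memcpy(buf[nxt], src + (c + 1) * CHUNK, sizeof buf[0]);
        for (size_t i = 0; i < CHUNK; i++)       /* compute on the current chunk */
            sum += buf[cur][i];
    }
    return sum;
}
```

With a real asynchronous DMA the transfer for chunk c+1 would be in flight during the compute loop over chunk c, hiding the memory latency the poster is describing.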

UMA vs. NUMA? My point really didn't go this deep, as I was trying to isolate Cell and Xenon. The Cell in the PS3 has 25.6 GB/s + (20 GB/s read, 15 GB/s write) available to it via its XDRAM and FlexIO interfaces. Xenon has 21.6 GB/s (10.8 GB/s read/write) available to it via its FSB. That's as far as I meant to take that statement. UMA in contrast to NUMA is a deeper discussion, and UMA in the X360 contrasted with NUMA in the PS3 is deeper still. The issue of being able to read/write anywhere in memory was not something I noted as advantageous here, as there would appear to be no advantage in being able to do that. Will all that bandwidth be used? I can't predict that it will be, but we've seen plenty of ideas as to how it could be.

As far as general computing goes? I couldn't tell you. I do have to say I disagree with how Anandtech categorizes both Cell and Xenon, but that's a whole other can of worms. If I had to wager, I'd say Xenon is better at general-purpose computing, where Cell is better at more computationally expensive tasks. The question then becomes: does general-purpose computing need more power? In my own personal experience, I don't buy CPUs to make Firefox run faster for me - I buy them to make my games, video encoding, or some other computationally intensive tasks run faster. It is also interesting how STI claims Cell can handle any OS, and not just one at a time. This again is probably another topic for another time, as 'general computing' has different meanings to many people.

I hope this helps understand me a little better.

----------------------------------------------------------------------

edit:

Thanks Titanio :) I've seen a couple of those statements over the last few months. It would seem I'm just seeing what's obvious to developers out there. Even so, I can take some pleasure in knowing at least some of my favorite developers will be putting these CPUs through their paces.
 
Cell takes a hit of 90% on integer and general-purpose work.
Cell can't do 9 general-purpose threads.
Cell can do 2 general-purpose threads.
SPUs are not general-purpose cores.
Cell has only one PPE with little cache; this, together with the in-order issue, makes Cell a low-performance CPU for gaming and general PC use, but a high-performance CPU for multimedia/streaming tasks.
The SPUs can do some work, such as physics, but not other work, like game code or AI, which is full of logical jumps.
Developing for Cell is a "pain in the ass" (courtesy of Carmack, Gabe Newell).

The CPU of the PS2 had many more FLOPS than the Celeron of the Xbox 1, but the Celeron left it in the dust, so I won't jump on Sony's hype chariot of "fantaflops".
 
Titanio said:
Obviously they're going to present and defend their work subsequently, but in the genuine belief that the decisions they took were the right ones, and with the ability to explain and justify those decisions.

Sure. And at the same time, they do not discuss, or go into any detail on, the negative implications of their decisions (and there must be some).
 