Kentsfield as an alternative to Cell in PS3

pjbliverpool · Feb 13, 2007

Ostepop said:
Which is a stupid assumption because the code wasnt optimized for the Quad-core, while it has been for the cell.

I was working from the assumption that Havok 4 is a PC based middleware and would probably be more optimized for PC cores while 4.5 is supposedly optimized for all "next gen multicore architectures". Polygons points about not knowing what the 3 cores were is something to think about though and we also don't know how much more performance 4.5 has over 4 on the PC cores - although its unlikely to be the 5-10x quoted for Cell.

High memory bandwidth?? What exactly are you refering to, this is something that depends on how you connect the chip, and what kind of memory you have, not something about the cell itself.

Im thinking in terms of FlexIO vs Intels FSB. The FSB is only capable of a little over 8GB/s for everything regardless of what memory you use while flexIO can handle at least 25.6 GB/s to memory with plenty left over for other off chip commincations.

Higher speed interface with the GPU, is again depending on how you connect it, in a traditional PC environment compared to the cell, it is faster.

Same as above, Kentsfield would need to connect over its FSB (which its already using to talk to memory) so it maxes out at 8GB/s while we know Cell can handle at least 35 GB/s.

Cheezdoodles · Feb 13, 2007

pjbliverpool said:
Same as above, Kentsfield would need to connect over its FSB (which its already using to talk to memory) so it maxes out at 8GB/s while we know Cell can handle at least 35 GB/s.

But, since this is going to be in a console in this thread, we could switch the FSB?

Gubbi · Feb 13, 2007

pjbliverpool said:
Im thinking in terms of FlexIO vs Intels FSB. The FSB is only capable of a little over 8GB/s for everything regardless of what memory you use while flexIO can handle at least 25.6 GB/s to memory with plenty left over for other off chip commincations.

I'm fairly certain that most physics problems you can solve at a reasonable number of iterations per second (ie. for use in a game) would be completely cache contained.

Cheers

pjbliverpool · Feb 13, 2007

SPM said:
Errr... Isn't the 3 core CPU they are referring to Xenon, and they can't say so because of an NDA with Microsoft requiring that they do not disclose benchmarks of Microsoft of Microsoft products? Microsoft has such clauses written into their operating system EULAs so you can only publish benchmark results that Microsoft permits, I would be surprised if they didn't have one for Xenon. Nothing else really makes sense. No other 3 core cpu exists and Xenon is a 3 core CPU that Havok has worked on.

It definatly wasn't Xenon, the same benchmarked showed a single PPE performance and it was much lower than 1/3 of the "3 PC cores" score so we are definatly talking about more powerful cores than the PPU.

My guess is that its either a dual core where they have extrapolated the scores up or a Kentsfield with one core dedicated to other tasks/disabled. My money is on the dual core since the date of the test makes it pretty early for Kentsfield.

pjbliverpool · Feb 13, 2007

Ostepop said:
But, since this is going to be in a console in this thread, we could switch the FSB?

Isn't the FSB a fundamental part of the architecture though?

To change it would mean to modify (and improve) the CPU which nullifies the comparison

pjbliverpool · Feb 13, 2007

Sethamin said:
Oh God, please, please, please don't do this. Looking at some narrowly defined data point and trying to draw wide-ranging conclusions from it is absolutely ludicrous. In fact, unless you have a great understanding of this "benchmark's" instruction mix compared to Cell's microarchitecture, you shouldn't really draw any conclusions from it. You might as well look at the 0-60 of a car running on sand and conclude that it's the best handling car on the road today.

Im certainly taking the Havok claims with a pinch of salt but I don't think they were talking about 1 specific benchmark. The statement was more along the lines performance is up by 5-10x across typical scenario's.

Although I guess you could argue that the original Cell vs 3 PC cores test was skewed in Cells favour. It certainly made Xenon look bad. I just don;t see why Havok would choose to do that when there main market is the PC. Would loved to have seen some HavokFX comparisons on there though

Still, this thread is less aimed at physics performance and more overal applicability as a "gaming CPU". I guess the bottom line is, is Cells architecture better suited to those tasks which are commonly handled by the CPU in a game than a traditional x86 mutilcore CPU.

And the answer seems to be - we don't know yet. Still, im sure there are devs out there with a good idea.....

SPM · Feb 13, 2007

Ostepop said:
High memory bandwidth?? What exactly are you refering to, this is something that depends on how you connect the chip, and what kind of memory you have, not something about the cell itself.

But the CPUs memory interface you have to use with a chip is generally determined by the chipset it is designed to work with. Traditional PC architectures generally have lower bandwidth CPU memory than contemporary consoles, because they have to wait for motherboard manufacturers to support a new memory technology and then the memory spec is stuck for around 4-5 years because PC memory is designed to be expandable, whereas consoles can use the latest and greatest memory interface as soon as it comes out.

Unless your refering to how fast the cache is, in which case i believe the Quad-core not only has several times more cache, but its also faster..

For the sort of physics calcs Havok is doing where you fit code and data into localstore, local store + gather/scatter DMA is much faster in terms of total memory bandwidth, cache misses, branch miss penalty, and bus contention than the traditional oooe+cache approach. This explains the 5-10x speed. Kentsfield will have difficulty going against Cell for physics calculations, despite the fact it is much bigger. I would rather have a dual Cell than a Quad Conroe.

Higher speed interface with the GPU, is again depending on how you connect it, in a traditional PC environment compared to the cell, it is faster.

For CPU/GPU communication, traditional PC architecture uses PCIe x16 which is a lot slower (4.0 GB/s each way max).

Considering that the Cell doesnt support dynamic branching, AI is going to be much better on a kentfield, the quad-core also has much local memory, which also helps. (8mb L2 cache vs 512kb)

If you use procedural code for AI (ie. blocks of game code with AI decision branches in between), then Cell is at a disadvantage. However in procedural code for AI which uses long branches which Cell can't fit into the local store, AI is hardly ever the CPU intensive part. Therefore in this type of AI code, the difference between ioe with less cache and oooe with more cache, isn't going to be as significant to overall performance as is made out.

On the other hand demonstrations like the 16000 chicken simulation shows that Cell is very efficient at handling CPU intensive AI - better than traditional PC CPU architecture - if the coding uses appropriate AI algorithms which are separate from other game code, and which treat the AI information as data rather than use branch status to store AI information. Doing this allows AI code to fit into and execute in local store since it is not scattered between blocks of other code, and avoids long branches that cause cache misses, and requires reloading of code - both of which Cell is not as good at handling as traditional big cache + oooe CPU architectures. Aggregating branchy AI code in one place so the code and branches fit within a local store (by removing the blocks of non-AI code in between), and batch processing and updating large numbers of AI data objects in sequence and storing AI status as data for the game code to pick up later, is necessary to avoid the big context switching penalty involved in loading and unloading code into the local store.

SPM · Feb 13, 2007

Sethamin said:
Oh God, please, please, please don't do this. Looking at some narrowly defined data point and trying to draw wide-ranging conclusions from it is absolutely ludicrous. In fact, unless you have a great understanding of this "benchmark's" instruction mix compared to Cell's microarchitecture, you shouldn't really draw any conclusions from it. You might as well look at the 0-60 of a car running on sand and conclude that it's the best handling car on the road today.

But we are looking at a narrowly defined data point - how fast Havok will run on various architectures.

3dilettante · Feb 13, 2007

SPM said:
If you use procedural code for AI (ie. blocks of game code with AI decision branches in between), then Cell is at a disadvantage.

Can you describe a situation where this would occur?
The coders for a strat game's AI decided to sprinkle mouse input queries at random throughout the pathfinding procedures?

It seems improper to have such a lack of modularity.

On the other hand demonstrations like the 16000 chicken simulation shows that Cell is very efficient at handling CPU intensive AI - better than traditional PC CPU architecture - if the coding uses appropriate AI algorithms which are separate from other game code, and which treat the AI information as data rather than use branch status to store AI information.

That's a bunch of qualifications at the end there.

appropriate AI algorithms - you mean appropriately simple AI. Cell's superior throughput is very telling when running a large collection of simple state machines. It is not clear that this can be done for every problem or environment, or should be. It is also unclear that requiring such low-level AI optimization is worth the cost in extra development effort, or if the ability of a game designer to express desired behavior is compromised. Because modern AI expertise lacks a lot of basic structure, this problem is probably worsened.

separate from other game code - I think this is commonly done just to keep the code from turning into a mess. I don't think this is a concern for sane implementations.

treat the AI as information - AI state is stored as data

Doing this allows AI code to fit into and execute in local store since it is not scattered between blocks of other code, and avoids long branches that cause cache misses, and requires reloading of code - both of which Cell is not as good at handling as traditional big cache + oooe CPU architectures.

Skipping the scattered between blocks of code red herring, the big thing seems to be that the AI algorithm works best on Cell if it's simple.

For many AI problems (I'll express my bias by saying that this includes the set of "interesting" ones), the load is often dominated by the explosion of problem complexity, not necessarily the number of AI objects.

Doubling the number of chickens is not the same as handling chickens that have double the amount of state or have double the number of factors in their calculations.

and batch processing and updating large numbers of AI data objects in sequence and storing AI status as data for the game code to pick up later, is necessary to avoid the big context switching penalty involved in loading and unloading code into the local store.

I'm wondering which games don't do this. It's simply good policy not to thrash a cache and the branch predictors by handling a few AI objects and then wandering over to the sound code.
True, Cell would suck worse at lousy code, so we can give Kentsfield that advantage.

For low-level AI, Cell's advantage in raw throughput is telling.
There are a wide number of tasks, including a large amount of the setup for decision-making AI that fall in the number-crunching category.

However, there is a core of decision-making and bounds placement that becomes increasingly important the more complex the AI agents and their interactions with the environment become.

For middle-range problems that still stay within the LS, the picture is less clear. It is quite possible for non-compute bound portions of the AI where the highest advantage expected would be 2x Kentsfield: because there are double the number of cores.

For more complex problems, where blind number crunching becomes more and more likely to be wasted effort, the situation will likely reverse.

It's more of an indictment of today's AI than a bonus for Cell that so much "AI" seems to do so well.

Then again, I am an AI snob. I believe the examples being run have their place, but they don't interest me as much.

edit:
That's not to say that I think Kentsfield would be a viable alternative for the PS3.

SPM · Feb 13, 2007

3dilettante said:
Can you describe a situation where this would occur?
The coders for a strat game's AI decided to sprinkle mouse input queries at random throughout the pathfinding procedures?

It seems improper to have such a lack of modularity.

I am talking about the sort of code you might get in a role playing game, where you might write code which procedurally follows decision making, and the tendency there is for the pure AI parts of the code to be interspersed with code that collects information required for the AI to process and does what needs to be done as a result of the decision or AI status the AI code comes up with. Strip out everything else, and AI decision code will actually be extremely compact.

The approach most programmers take to coding AI is to draw out a flow chart with decision boxes and convert that to code. The resulting code works well on a conventional processor like the PPE but not on SPEs - the SPEs can't fit all the code in local store, and if you do split the code between the PPE and an SPE, the SPE sits idle most of the time because the AI processing is imtermittent in the procedural code, whereas the PPE running everything wouldn't be idle.

Writing AI code for SPEs requires a different, approach - a rules based simulation approach rather than a flow chart approach. That doesn't mean to say that you don't need procedural AI code. Certain things are easier to do this way, you have the PPE there, and it can handle this type of code efficiently while the SPE can't.

That's a bunch of qualifications at the end there.

appropriate AI algorithms - you mean appropriately simple AI. Cell's superior throughput is very telling when running a large collection of simple state machines. It is not clear that this can be done for every problem or environment, or should be. It is also unclear that requiring such low-level AI optimization is worth the cost in extra development effort, or if the ability of a game designer to express desired behavior is compromised. Because modern AI expertise lacks a lot of basic structure, this problem is probably worsened.

separate from other game code - I think this is commonly done just to keep the code from turning into a mess. I don't think this is a concern for sane implementations.

Simple AI processing stacked in layers, becomes very sophisticated AI - that's how our brains work, and the results aren't simple. The main difficulty is that programmers are taught to think procedurally - flow charts and decision boxes. True, some AI problems are difficult to think of except procedurally, but other AI problems - for example crowd behaviour - are much more natural and easier to implement in terms of simple rules governing how each person interacts with others and the environment. Also it is not one or the other, you can use both.

treat the AI as information - AI state is stored as data

Not all is. The position of execution in the code also represents the results of previous decisions in the flow chart and therefore represents some of the AI information. Where the AI code is designed by mapping the AI status into a flow chart, the natural tendency is to put in code that needs to be executed in the path of execution rather than saving the status as data for another process to pick up.

Skipping the scattered between blocks of code red herring, the big thing seems to be that the AI algorithm works best on Cell if it's simple.

Yes, but you can do a lot with simple building blocks.

For many AI problems (I'll express my bias by saying that this includes the set of "interesting" ones), the load is often dominated by the explosion of problem complexity, not necessarily the number of AI objects.

http://www.frams.alife.pl/index.html
Layering of very simple rules can produce results that are much more complex than a programmer has time to hand code. Also there is nothing to stop you using the two together eg. to skew the rules of herd migration according to a programmed set of parameters, or for the programmer to set the positions of pieces of an exploding "replicator" while letting the pieces wiggle around based on simple AI interaction rules.

Doubling the number of chickens is not the same as handling chickens that have double the amount of state or have double the number of factors in their calculations.

But what is to stop you doubling the number of factors in their calculations with two pass AI processing?

I'm wondering which games don't do this. It's simply good policy not to thrash a cache and the branch predictors by handling a few AI objects and then wandering over to the sound code.
True, Cell would suck worse at lousy code, so we can give Kentsfield that advantage.

It may be obvious, but I am just pointing it out because this answers the core of the argument regarding Cell SPE vs conventional big cache CPU. If you are going to traverse code once, maybe with a bit of looping, then cache + CPU always wins - run it on the PPE. If you can fit the code and data required for processing into the SPE, AND you are going to run the code hundreds or thousands of times and then unload it to release the SPE for other things AND you don't have to wait for anything else to do it, then the SPE always wins.

For low-level AI, Cell's advantage in raw throughput is telling.
There are a wide number of tasks, including a large amount of the setup for decision-making AI that fall in the number-crunching category.

However, there is a core of decision-making and bounds placement that becomes increasingly important the more complex the AI agents and their interactions with the environment become.

Can mix and match according to the task.

For middle-range problems that still stay within the LS, the picture is less clear. It is quite possible for non-compute bound portions of the AI where the highest advantage expected would be 2x Kentsfield: because there are double the number of cores.

Quite likely more than 2x for a lot of things. However the non-compute bound AI isn't compute intensive, so getting peak performance isn't a big deal, and for efficiency (ie. you don't want to tie up an SPE to process intermittent tasks), you would do this on the PPE along with other tasks.

For more complex problems, where blind number crunching becomes more and more likely to be wasted effort, the situation will likely reverse.

Certainly the computional effort might be better spent on something else. However AI number crunching isn't expensive compared to graphics number crunching - at least not to the extent it is implemented in current games, and it is the least exploited area and the one where the biggest improvements are there to be made.

3dilettante · Feb 13, 2007

SPM said:
I am talking about the sort of code you might get in a role playing game, where you might write code which procedurally follows decision making, and the tendency there is for the pure AI parts of the code to be interspersed with code that collects information required for the AI to process and does what needs to be done as a result of the decision or AI status the AI code comes up with. Strip out everything else, and AI decision code will actually be extremely compact.

You mean when there's a scripting language that the designer or a player can use to define behaviors? In that case, the script is the data that gets interpreted by the game engine.

That's the point. A scripting language is more accessible and requires less expertise for a character designer or people making player-generated content. Sometimes peak performance is traded off for greater accessibility.

Perhaps that is something unimportant for most PS3 games, it is important for a number of PC games.

The approach most programmers take to coding AI is to draw out a flow chart with decision boxes and convert that to code. The resulting code works well on a conventional processor like the PPE but not on SPEs - the SPEs can't fit all the code in local store, and if you do split the code between the PPE and an SPE, the SPE sits idle most of the time because the AI processing is imtermittent in the procedural code, whereas the PPE running everything wouldn't be idle.

Most programmers? I'd like some stats on that.
That's just an argument for multithreading and avoiding unncessary synchronization, not Cell specifically.

A single x86 core can play the role of a PPE and use the others as SPEs. Their peak rates would be lower (if the data is SIMD friendly, if the data can fit in LS, if the access patterns work well with DMA, if the threads are relatively straight-line, etc), but then again one can send off more complex operations to a full core that would take additional conversion work when passing it to a SPE.

Writing AI code for SPEs requires a different, approach - a rules based simulation approach rather than a flow chart approach. That doesn't mean to say that you don't need procedural AI code. Certain things are easier to do this way, you have the PPE there, and it can handle this type of code efficiently while the SPE can't.

The reality is that there are many approaches to creating AI. No single solution works everywhere and no single solution is necessarily better.

For approaches that only rely on processing power and can meet all those qualifications you bring up, Cell is likely to be best.
For approaches that for whatever reason fail to fit in the predefined box, a solution other than Cell is likely to be best.

Simple AI processing stacked in layers, becomes very sophisticated AI - that's how our brains work, and the results aren't simple. The main difficulty is that programmers are taught to think procedurally - flow charts and decision boxes. True, some AI problems are difficult to think of except procedurally, but other AI problems - for example crowd behaviour - are much more natural and easier to implement in terms of simple rules governing how each person interacts with others and the environment. Also it is not one or the other, you can use both.

The more rules you slather on, the more likely you get a mess. That problem has not been solved, and it is platform-independent.

What you posit is a very low-level view of the AI implementation. It necessitates rules-based methods because systems that are more complicated become unmanagable.

Methods that try to minimize this issue sometimes have issues running well on Cell. They don't always have the same problems if asked to run on Kentsfield.

Not all is. The position of execution in the code also represents the results of previous decisions in the flow chart and therefore represents some of the AI information. Where the AI code is designed by mapping the AI status into a flow chart, the natural tendency is to put in code that needs to be executed in the path of execution rather than saving the status as data for another process to pick up.

That's once again an argument for multithreading and intelligently written code.
AI code is not put directly into a game engine's main loop. It is already partitioned based on that.
A decent simulation is trivially capable of similar optimizations for AI modules.
Unlike the SPE, there isn't as strong a need for another bout with low-level optimizations that not all the dev team has training in.

Yes, but you can do a lot with simple building blocks.

I really like peanut butter on my sandwiches. It's so awesome when I to use it there.
My car battery died, and I needed to get to work. Since the peanut butter worked on my sandwiches, I spread it on the engine. Then my car caught on fire, and since the peanut butter works so well on my toast, I tried to spread more on.

Perhaps other constraints make a low level rules-based approach awkward, excessive, or too time consuming.

http://www.frams.alife.pl/index.html
Layering of very simple rules can produce results that are much more complex than a programmer has time to hand code. Also there is nothing to stop you using the two together eg. to skew the rules of herd migration according to a programmed set of parameters, or for the programmer to set the positions of pieces of an exploding "replicator" while letting the pieces wiggle around based on simple AI interaction rules.

I want my trooper to secure the ammo supply at the top of my base. Do I really need the system to go back to first principles every 1/60th of a second to make sure it's a right turn instead of a left?

But what is to stop you doubling the number of factors in their calculations with two pass AI processing?

How does that stop doubling the number of calculations?
A+B=C, D+E=F
If I put the first operation in pass one and the second in pass two, it's still two operations.
Multiple passes might reduce the burden of the amount of memory that is needed at any given instant, but it incurs a cost.
The act of separating the passes means there is extra setup work.

It may be obvious, but I am just pointing it out because this answers the core of the argument regarding Cell SPE vs conventional big cache CPU. If you are going to traverse code once, maybe with a bit of looping, then cache + CPU always wins - run it on the PPE. If you can fit the code and data required for processing into the SPE, AND you are going to run the code hundreds or thousands of times and then unload it to release the SPE for other things AND you don't have to wait for anything else to do it, then the SPE always wins.

Traversing code once would be more of a tie or an advantage to the SPE. Cache doesn't do quite as much for first and only access.

If instruction throughput is very important (it probably isn't most of the time), it is likely the x86 would win, since its instruction cache has half the latency and the core can decode more instructions at once.

I think you are focusing too much on the instruction stream as a source of issues.

Quite likely more than 2x for a lot of things. However the non-compute bound AI isn't compute intensive, so getting peak performance isn't a big deal, and for efficiency (ie. you don't want to tie up an SPE to process intermittent tasks), you would do this on the PPE along with other tasks.

I only stated that such situations will exist, not that 2x was the only amount that can be expected.

I do like how you like to dump the work that 7 SPEs don't want to do on the poor PPE that can handle less than half of what a single Kentsfield core can do.

Certainly the computional effort might be better spent on something else. However AI number crunching isn't expensive compared to graphics number crunching - at least not to the extent it is implemented in current games, and it is the least exploited area and the one where the biggest improvements are there to be made.

That's because modern AI is stupid, not that AI can't be compute-intensive.
The chip involved has little to do with this problem.

It's not that Cell isn't good for a good portion of AI problems, but the supposed advantages it has over a multicore x86 from an AI perspective are highly variable.

pjbliverpool · Feb 14, 2007

SPM said:
I would rather have a dual Cell than a Quad Conroe.

A dual Cell? Do you mean 2 Cell chips as in 2 PPU's + 14 SPU's?

Wouldn't one be enough?

deathkiller · Feb 14, 2007

pjbliverpool said:
A dual Cell? Do you mean 2 Cell chips as in 2 PPU's + 14 SPU's?

Wouldn't one be enough?

Kentsfield is made of two dual core processor chips so comparing it to 2 Cells is reasonable.

Kentsfield:

pjbliverpool · Feb 14, 2007

deathkiller said:
Kentsfield is made of two dual core processor chips so comparing it to 2 Cells is reasonable.

Kentsfield:

I can see the logic of it but at the end of the day its still a single socket/package CPU which operates like a single unit with 4 cores.

Its not an ideal (some would say native) quad core but personally I believe it should be compared against other single socket/package solutions like Cell. For example you wouldn't consider comparing a PentiumD to 2 Athlon X2's.... well I wouldn't anyway

TheChefO · Feb 14, 2007

TeraScale

With the reasonable die size of this chip (275mm2 on 65nm), what do you guys think about the opportunity to use this chip as an alternative to cell2?

Capeta · Feb 14, 2007

TheChefO said:
With the reasonable die size of this chip (275mm2 on 65nm), what do you guys think about the opportunity to use this chip as an alternative to cell2?

You would have to add some general purpose cores into the mix and remove some of the 80 cores. Say 40 of those FP cores with 4 general purpose cores. Those 80 cores look pretty useless for general purpose stuff. I'm thinking a NiagaraII would be pretty nice. That Terascale chip doesn't seem to have any cache whatsoever.

jonabbey · Feb 14, 2007

TheChefO said:
With the reasonable die size of this chip (275mm2 on 65nm), what do you guys think about the opportunity to use this chip as an alternative to cell2?

Unless Sony decides to ditch backwards compatibility in the next round, the next Playstation will most assuredly use "Cell2".. i.e., a chip with compatible SPU and PPU instruction sets, but on a smaller process node, with more PPEs and more SPEs.

TheChefO · Feb 14, 2007

jonabbey said:
Unless Sony decides to ditch backwards compatibility in the next round, the next Playstation will most assuredly use "Cell2".. i.e., a chip with compatible SPU and PPU instruction sets, but on a smaller process node, with more PPEs and more SPEs.

Indeed Sony would be a fool to ditch the tech they've invested billions in. I'm saying as an alternative to the cell2 which will be in use for ps4.

TheChefO · Feb 14, 2007

Capeta said:
You would have to add some general purpose cores into the mix and remove some of the 80 cores. Say 40 of those FP cores with 4 general purpose cores. Those 80 cores look pretty useless for general purpose stuff. I'm thinking a NiagaraII would be pretty nice. That Terascale chip doesn't seem to have any cache whatsoever.

Agreed - I'm thinking more along the lines of using it as a coprocessor. At 32nm this chip would cost peanuts.

inefficient · Feb 14, 2007

TheChefO said:
With the reasonable die size of this chip (275mm2 on 65nm), what do you guys think about the opportunity to use this chip as an alternative to cell2?

Terascale is about as exciting as a truck with 80 wheels each one with a small 2 stroke engine and about a liter of fuel.

Kentsfield as an alternative to Cell in PS3

pjbliverpool

B3D Scallywag

Cheezdoodles

+ 1

Gubbi

pjbliverpool

B3D Scallywag

pjbliverpool

B3D Scallywag

pjbliverpool

B3D Scallywag

SPM

SPM

3dilettante

SPM

3dilettante

pjbliverpool

B3D Scallywag

deathkiller

pjbliverpool

B3D Scallywag

TheChefO

Capeta

jonabbey

TheChefO

TheChefO

inefficient

Similar threads