A bit of info on Cell's physics abilities.

scatteh316

"Alias Cloth Technology" the cloth simulation algorithm requires complex mathematics and processes large matrices of data, we were able to leverage the vector support of the Cell processor and achieve even greater performance by vectorizing the code that performs arithmetic functions on the mesh data structures. At the current stage of our experiments, a prototype 2.4 GHz Cell processor runs approximately five times as many simulation frames per second as a 3.6 GHz Pentium 4 class processor

http://www.research.ibm.com/cell/whitepapers/alias_cloth.pdf

Note : Sorry for any spelling errors.
 
scatteh316 said:
"Alias Cloth Technology" the cloth simulation algorithm requires complex mathematics and processes large matrices of data, we were able to leverage the vector support of the Cell processor and achieve even greater performance by vectorizing the code that performs arithmetic functions on the mesh data structures. At the current stage of our experiments, a prototype 2.4 GHz Cell processor runs approximately five times as many simulation frames per second as a 3.6 GHz Pentium 4 class processor

http://www.research.ibm.com/cell/whitepapers/alias_cloth.pdf

Note : Sorry for any spelling errors.


Alias? NICE! :D
At last we'll have decent cloth animation.
Not that Maya is perfect, animators still have to hand-tweak it sometimes, but it's definitely better than nothing!
 
What are we talking about here?

Is it for one character, ten or complete battlefields (not very likely I guess)?
 
They cheated quite a bit with their parallelization claims:
They have a linear speed-up with the number of SPEs used. Why is that? Because they're running 8 instances of the cloth simulation, one on each SPE.
Also note that the rendering is done by a PowerMac G5 and not on the Cell.

In summary, nothing interesting. The conclusions: use (fast) DMA transfers from the SPEs, double-buffer the memory transfers, and clear the cache lines that reference main memory used in a DMA transfer, to avoid the slow-down from keeping cache and memory in sync.
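
The double-buffering point can be illustrated with a toy timing model. This is only a sketch, not code from the paper: the function names and the per-chunk transfer and compute costs are made up for illustration, and it assumes a DMA transfer and SPE compute can overlap completely.

```python
def serial_time(chunks, transfer, compute):
    """Total time when each chunk is fetched and then processed, with no overlap."""
    return chunks * (transfer + compute)


def double_buffered_time(chunks, transfer, compute):
    """Total time when the fetch of chunk i+1 overlaps the compute of chunk i.

    Assumes two local buffers and perfect overlap (hypothetical, idealised).
    """
    if chunks == 0:
        return 0.0
    # The first fetch cannot be hidden, and the last compute has no fetch
    # left to overlap with; every step in between costs the slower of the two.
    return transfer + (chunks - 1) * max(transfer, compute) + compute


if __name__ == "__main__":
    # Hypothetical costs in arbitrary time units.
    n, t, c = 8, 1.0, 3.0
    print("serial:", serial_time(n, t, c))
    print("double-buffered:", double_buffered_time(n, t, c))
```

With these made-up numbers the transfer cost is hidden behind the compute for every chunk but the first, which is the whole point of keeping a second buffer in flight.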
 
I guess it depends on the complexity of the simulation, whether it's used just for the "flowing" of cloth, or whether the cloth is interacting with other characters and objects...

For sure, Tekken 6 will have more impressive cloth simulation than something like Dynasty Warriors for PS3.

The Alias demo had some six (or was it nine?) separate simulations running simultaneously, all with different conditions... some had other objects for the cloth to interact with, some cloths were "anchored" at points, and I think even the cloth parameters (thickness, how "stiff" the cloth is, etc.) were different too.

In a game a character can have clothes of different materials that interact differently with the character and objects, and are either "loose" like a cape or more "tight fit" like trousers, etc.

I'd guess this level of cloth simulation would be limited to games with maybe two to four characters on screen.
 
It's still 5x faster in a similarly sized package though, no? And 7x faster at the same clockspeed. Which isn't too bad.

Though I am disappointed in the main that in this instance they couldn't leverage more performance from the SPEs. I wonder what the limiting factor was?
 
Shifty Geezer said:
It's still 5x faster in a similarly sized package though, no? And 7x faster at the same clockspeed. Which isn't too bad.

Though I am disappointed in the main that in this instance they couldn't leverage more performance from the SPEs. I wonder what the limiting factor was?

Also, if 5X faster than a P4 means it runs at 10fps (while the P4 one runs at 2fps), that's not very useful, is it ;) Not saying that's the case, but it would be nice to have more info.
 
Why is a single PPE so weak, about 20% of the P4? Is it related to only having 32 VMX registers?

Presumably this is a DD1 Cell, judging from the 8 SPEs and 2.4GHz clock. Is the DD1's PPE that severely cut-down compared with the DD2?

It seems common knowledge that DD1's VMX is cut down, but this appears to suggest it's non-existent.

Jawed
 
london-boy said:
Also, if 5X faster than a P4 means it runs at 10fps (while the P4 one runs at 2fps), that's not very useful is it ;) Not saying that's the case, but it would be nice to have more info.

This is not game-related at all, so it can even be 1 frame/minute on complex simulations.
 
_phil_ said:
This is not game-related at all, so it can even be 1 frame/minute on complex simulations.

lol I thought so. :D
It's Alias after all, they're probably dealing with Cell in the workstation business, with Cell workstations running Maya and trying to get more performance out of them.
 
They cheated quite a bit with their parallelization claims:
They have a linear speed-up with the number of SPEs used. Why is that? Because they're running 8 instances of the cloth simulation, one on each SPE.

How is that cheating? ;)

Well not if you have to use ALL SPEs just for that!

One can assume that the whole P4 is being used for that also, so it's a fair comparison. In fact with the PPE free, there'd be more power left over on Cell than the P4.

It sounds a little like DD1, at least from the PPE point-of-view. "Although SPEs can run generic code at a comparable speed or faster than the PPE," is pretty interesting, and suggests DD1.

A 5x speedup is impressive, and this isn't trivial cloth simulation. The self-collision is pretty complex, I doubt you'd have that going on in a game (or at least on a large scale).

This also sounds like the demo they showed at E3.
 
I'm sorry, but a 5x speed-up is not impressive.

How many times more peak GFLOPs do 8 SPEs at 2.4GHz have over the P4? An awful lot of that power has gone missing and the paper seems to avoid even touching on that subject.


Jawed
 
Jawed said:
I'm sorry, but a 5x speed-up is not impressive.

How many times more peak GFLOPs do 8 SPEs at 2.4GHz have over the P4? An awful lot of that power has gone missing and the paper seems to avoid even touching on that subject.


Jawed

Maybe because floating point can only help so much? And maybe because I personally don't expect Alias to ever create an application that runs fast? ;)
 
Jawed said:
I'm sorry, but a 5x speed-up is not impressive.

How many times more peak GFLOPs do 8 SPEs at 2.4GHz have over the P4? An awful lot of that power has gone missing and the paper seems to avoid even touching on that subject.


Jawed

STI initially claimed up to a 10x speedup over "conventional processors". With this software, a 3.2GHz Cell should be pushing up on a 7x speedup over a 3.6GHz P4 (scale the P4 to 3.2GHz and the speedup increases further). If they got the PPE to work on the task itself too, there'd be further gains. I think if you have a speedup that roughly matches the number of SPUs, a roughly linear speedup, that's pretty damn good - i.e. one SPU can be nearly as good as one P4 for this task.

Speedups beyond that are likely dependent on how well the task maps to, or has been mapped to, the memory architecture in Cell. We've seen massive speedups in other examples likely due to that, but this task, or the current implementation, may not benefit as much? Remember also that this is a port, unlike other examples which were built from scratch for Cell. As it is, though, expressing disappointment at this kind of improvement - almost a linear speedup - makes you sound spoilt :p

On a side note, the only commentary I found on this from E3 was from IGN:

The next demo was based on a new cloth simulation algorithm being worked into Maya. Again using two Cell processors, the demo was able to run 16 separate simulations simultaneously. Each piece of cloth was defined by 300 vertices, but the real kicker with this demo is that the algorithm incorporated self-intersecting physics, keeping the cloth from flowing through itself. This sort of simulation is much more computationally-intensive than simulating a cloth against another object.

edit - I should look closer at things, the chart in fact shows that a single SPU @ 2.4GHz is better than a P4 @ 3.6GHz for this, looks to be maybe 1.2x. With 8 SPUs, it's a 5x speedup - but if you scaled them to 3.6GHz, it'd be 7.5x (assuming performance scales linearly with clock speed), and thus across the SPUs the speedup would be pretty much linear.
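
The clock-scaling arithmetic in this post can be checked with a couple of lines. This is just a sketch of the sums: `scale_speedup` is a made-up helper name, and it leans on the same assumption stated above, that performance scales linearly with clock speed.

```python
def scale_speedup(measured_speedup, measured_ghz, target_ghz):
    """Rescale a measured speedup to a different clock.

    Assumes performance scales linearly with clock speed, which is the
    assumption made in the post and is not something the paper guarantees.
    """
    return measured_speedup * target_ghz / measured_ghz


if __name__ == "__main__":
    # 5x was measured with a 2.4 GHz Cell against a 3.6 GHz P4.
    same_clock = scale_speedup(5.0, 2.4, 3.6)  # Cell clocked like the P4
    at_3_2 = scale_speedup(5.0, 2.4, 3.2)      # Cell at 3.2 GHz, P4 left at 3.6 GHz
    print(f"8 SPEs at 3.6 GHz vs P4 at 3.6 GHz: {same_clock:.2f}x")
    print(f"8 SPEs at 3.2 GHz vs P4 at 3.6 GHz: {at_3_2:.2f}x")
```

That works out to about 7.5x at the same clock and roughly 6.7x for a 3.2GHz Cell, which is where the "7x" figures in this thread come from.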
 
london-boy said:
And maybe because I personally don't expect Alias to ever create an application that runs fast? ;)

Hey, this is a good point! :)
It's coded in MEL.


(note: I'm not serious)
 
The P4 is rated at around 6 or 7 GFLOPs, is it not?

We're seeing a speed-up that's only 20% of the "peak" figures. It just seems to me yet more proof that "peak" figures are nonsense.

Jawed
 
Jawed said:
The P4 is rated at around 6 or 7 GFLOPs, is it not?

A 3.6GHz P4 should be about 14 GFLOPs, I think?

I don't know, you're a hard man to impress Jawed. A virtually linear speedup for ported code is pretty damn good in my book. Not everything is going to be 50x faster ;) Would you say in instances where we have seen such improvement, and much more still, that GFLOPs are also useless, but uselessly conservative? :p

Just playing devil's advocate. No one here's suggesting using GFLOPs as a pure and sole measure of performance. I think the more impressive and encouraging aspect of this case is that with 8 much simpler "cores" you'd be getting nearly an 8x speedup at the same clock as a big bad P4.
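
For anyone who wants to redo the peak-GFLOPs comparison argued over in this thread, here is a rough sketch. The helper names are made up. It assumes 8 single-precision flops per cycle per SPE (a 4-wide fused multiply-add), and the P4 flops-per-cycle values of 2 and 4 are assumptions picked only to reproduce the two P4 estimates quoted above (about 7 and about 14 GFLOPs), not measured or official figures.

```python
def peak_gflops(ghz, flops_per_cycle, units=1):
    """Theoretical peak in GFLOPs: clock (GHz) x flops per cycle x number of units."""
    return ghz * flops_per_cycle * units


def realized_fraction(measured_speedup, cell_peak, p4_peak):
    """Measured speedup as a fraction of the peak-to-peak ratio."""
    return measured_speedup / (cell_peak / p4_peak)


if __name__ == "__main__":
    # Assumption: 8 SPEs at 2.4 GHz, 8 single-precision flops per cycle each.
    cell = peak_gflops(2.4, 8, units=8)
    # Assumption: two P4 estimates matching the figures argued in the thread.
    for label, flops_per_cycle in (("low P4 estimate", 2), ("high P4 estimate", 4)):
        p4 = peak_gflops(3.6, flops_per_cycle)
        frac = realized_fraction(5.0, cell, p4)
        print(f"{label}: P4 {p4:.1f} GFLOPs, peak ratio {cell / p4:.1f}x, "
              f"5x is {frac:.0%} of that")
```

Under these assumptions the 5x result is somewhere between about a quarter and about half of what the peak figures alone would suggest, depending on which P4 number you believe.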
 