IBM on Cell as an online games server (physics simulation).

rounin said:
Why didn't they bench against AMD stuff? Doesn't it make more sense to run benchmarks against, say, a Dual-Core AMD chip?
AFAIK SIMD units in AMD CPUs are usually slower than those of Pentium 4.
 
Do I understand this correctly?

It was noted that the bulk of the execution time on the SPEs was consumed by computation, but it also seems implicitly true that the SPEs were not executing a good amount of the time while waiting on the PPE. I'm looking at how, after two or three SPEs were put to use, the sim did not see a significant speed increase. To me that suggests that although the SPEs' execution time was consumed by integration, they were not getting to execute as much as possible due to the increased workload on the PPE to feed them data.

If this overhead were removed you would see the SPEs executing more often and thus see a greater speed boost.

Also, if collision detection, or the vectorizable portion of it, were moved to the SPEs, they would have something else to execute during the periods where they have to wait on the PPE. Since it seems the SPEs are faster than the PPE, and they would at the same time be utilized more, this would also translate into a greater speed boost.
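(Just to illustrate the sort of thing I mean, here's a rough, generic C++ sketch of double buffering; it isn't actual SPE/DMA code, and `fetch_next_batch`, `integrate_batch`, the queue and the `Batch` type are all made-up names. The point is simply that a worker can overlap fetching its next chunk of data with crunching the current one, so it isn't sitting idle waiting to be fed.)

```cpp
#include <cstdio>
#include <deque>
#include <future>
#include <utility>
#include <vector>

struct Body { float pos[3]{}, vel[3]{}, force[3]{}; };
using Batch = std::vector<Body>;

// Hypothetical stand-in for the PPE feeding us work; on Cell the fetch would
// really be an asynchronous DMA transfer into the SPE's local store.
std::deque<Batch> g_work_queue(8, Batch(1024));

bool more_batches_pending() { return !g_work_queue.empty(); }

Batch fetch_next_batch()
{
    Batch b = std::move(g_work_queue.front());
    g_work_queue.pop_front();
    return b;
}

// The actual number crunching: a semi-implicit Euler step per body.
void integrate_batch(Batch& batch, float dt)
{
    for (Body& b : batch)
        for (int i = 0; i < 3; ++i) {
            b.vel[i] += b.force[i] * dt;
            b.pos[i] += b.vel[i] * dt;
        }
}

void worker_loop(float dt)
{
    Batch current = fetch_next_batch();
    while (more_batches_pending()) {
        // Kick off the fetch of the next batch...
        std::future<Batch> next = std::async(std::launch::async, fetch_next_batch);
        // ...and integrate the current one while that "transfer" is in flight.
        integrate_batch(current, dt);
        current = next.get();   // only blocks if the fetch hasn't finished yet
    }
    integrate_batch(current, dt);   // last batch
}

int main() { worker_loop(1.0f / 60.0f); }
```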

What is confusing is whether the measured execution time includes time a core is sitting idle doing nothing. If an SPE is waiting on the PPE, is that included in its time doing integration? Or is that included in the PPE's time doing other things?

Another concern I have about the SPE speedup is just how much integration work there is to be done. If there is X amount of this work to be done in the sim, and it requires only Y SPEs to get that amount of work done, then it seems adding more SPEs to the task is not a good use of resources... unless you increase the X amount of work in the sim (which could be done in parallel). I am curious whether what is witnessed is also a result of the SPEs being more than sufficient for the task, or rather, you could say this may be a result of a bad "B" in Amdahl's law (thanks DeanoC for a good discussion on this!). In this particular situation could it be possible the researchers have hit the limit with regards to the number of SPEs executing in parallel on this task? (but the sim could still be faster with faster SPEs) This would seem more true of test B if this is indeed what is going on.
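(For anyone following along, this is the Amdahl's law relationship I mean: with a fraction p of the frame parallelizable over N SPEs, the best overall speedup is 1 / ((1 - p) + p/N). Here's a throwaway snippet with a completely made-up p, just to show how quickly the curve flattens once the serial part dominates:)

```cpp
#include <cstdio>

// Amdahl's law: speedup = 1 / ((1 - p) + p / n)
// p = parallelizable fraction of the frame, n = number of SPEs.
// The p = 0.6 below is an invented figure purely for illustration.
int main()
{
    const double p = 0.6;
    for (int n = 1; n <= 8; ++n)
        std::printf("SPEs: %d  max speedup: %.2fx\n",
                    n, 1.0 / ((1.0 - p) + p / n));
}
```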

Sorry for the long post but I'm just trying to keep up with you guys.
 
scificube said:
What is confusing is whether the measured execution time includes time a core is sitting idle doing nothing.

I think they were just looking at what the SPE was doing in the time it was active.

Beyond a certain number of SPEs, the bound shifted from integration to the PPE's work, so beyond that number, there would be idle time for the SPEs, I think.

scificube said:
Another concern I have about the SPE speedup is just how much integration work there is to be done. If there is X amount of this work to be done in the sim, and it requires only Y SPEs to get that amount of work done, then it seems adding more SPEs to the task is not a good use of resources

True. You're still doing the integration faster than you were before, but integration is no longer the limit on performance, so you don't see that in the "overall" framerate. If you just looked at integration performance, I think you'd see the performance continuing to increase with more SPEs.

scificube said:
In this particular situation could it be possible the researchers have hit the limit with regards to the number of SPEs executing in parallel on this task?

Not sure if I'd put it that way, simply that with a certain number of SPEs, you speed up integration to the point where other things are the bound on performance, and no longer integration. Again, if you solely looked at the integration performance, you'd probably see it continue to improve with more SPEs, but it doesn't matter, since the PPE is slower with its work, and taking longer to do its part for the simulation frame.
 
Thanks Titanio.

Here's another way of saying what I'm thinking.

Ignoring inefficiency etc., you have a 3GHz CPU, and then you have three 1GHz CPUs which are identical to the 3GHz CPU in every way except speed. You then have X amount of work to do on a single task. It takes the first CPU 3 seconds to do the work, and you split the work three ways evenly over the other three CPUs, and each finishes its share in 1 second.

The idea here is that the task is being done as fast as it can be done in both instances. If you had three more CPUs you could allocate to the task, the result would be the same: it takes them less time to complete their share of the work, but all in all, since no more work is being asked of them, the resulting total time to completion stays the same.

This does not mean that all six CPUs on the task are now being taxed as hard as they can go. If there were twice as much work to be done, then this would be true. I am wondering if a similar situation is the case with these tests. I realise the circumstances of the test are a good bit different from the simple situation I'm describing.

I'm not sure I can see whether it's like what you say from the data the IBM researchers present, or if indeed a third test is needed to really "push" the SPEs. I suppose that could have been the purpose of test 1, but I think I only see where they took the baseline performance of the P4 and contrasted this with putting the same amount of work on the CBEA.

I guess what I'm trying to ask is whether these tests flesh out the full capability of the CBEA instead of measuring the performance speedup of the CBEA with similar workloads.

Don't get me wrong, as what you say is entirely plausible, Titanio. I'm just not sure if we can say for sure one way or the other.

Am I wrong... or is guessing really the best we can do here?
 
Can someone translate? Why does the PPE perform so much worse than the SPE and the P4? And what does this mean about the performance of the XCPU, which I thought had 3 PPEs?
 
I'd have to guess it's because the PPE is less about calculating than it is about controlling. The PPE of Cell is more about handling all the processes that get distributed to the SPEs, and controlling the OS and security functions, rather than actually grinding through complex floating-point math.

The SPEs are the physics machines, whereas the PPE isn't. That would be my guess on why the PPE is so weak at this physics simulation stuff. The same would go for the Xbox 360 PPEs, which is why they each have one attached vector unit to help physics calculations a little.
 
ralexand said:
Can someone translate? Why does the PPE perform so much worse than the SPE and the P4? And what does this mean about the performance of the XCPU, which I thought had 3 PPEs?


Nothing here can be directly translated to XCPU performance. The closest you can come is the PPE + VMX, but Cell's VMX unit only provides about a tenth of the processing power of the XCPU's VMX128.
 
10 times?

Powderkeg said:
Cell's VMX unit only provides about a tenth of the processing power of the XCPU's VMX128.

This is a very mysterious statement, my friend. I would appreciate more details from you on this subject. Thank you.
 
It's somewhat difficult to draw conclusions about XB360 performance from the PPE results.

There are significant differences between the PPE and the XB360 processors, especially their vector units, and the paper is really unclear about what it's measuring on the PPE.
 

ERP said:
It's somewhat difficult to draw conclusions about XB360 performance from the PPE results.

There are significant differences between the PPE and the XB360 processors, especially their vector units, and the paper is really unclear about what it's measuring on the PPE.

The paper says that, for their tests, the 2.4GHz PPE is at 20% of the P4's performance across all aspects.

Also, I would like to learn more about what the significant differences are between the PPE and the Xbox 360 processors, especially their vector units. Thank you.
 
There are several techniques for doing stream-based parallel collision detection. Their problem is that they ported an existing serial implementation in C++, instead of designing it from the ground up for the Cell architecture. (For example, search for GPGPU-based collision detection algorithms, or ParFUM, if you want an example; ParFUM obtains a 60% speedup.)

Using STL data structures and C++ pointer/reference chasing is definitely going to kill performance. Hell, it can even sap performance on a P4 if you don't watch how you use the cache.
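(Not from the paper, just my own rough sketch of the difference between a pointer-chasing layout and the flat structure-of-arrays layout that SIMD hardware, SPE or SSE, actually wants; all the type names below are invented for illustration:)

```cpp
#include <list>
#include <memory>
#include <vector>

// Cache/SIMD-hostile: nodes scattered all over the heap, every access is a
// dependent load, and x/y/z for neighbouring bodies aren't contiguous.
struct BodyNode {
    float x, y, z;
    std::shared_ptr<BodyNode> parent;   // pointer chasing on every traversal
};
using SlowScene = std::list<std::shared_ptr<BodyNode>>;

// SIMD-friendly: one flat array per component, so several bodies can be
// loaded into a vector register with a single contiguous read, and the whole
// block can be streamed (or DMA'd to an SPE's local store) in one go.
struct FastScene {
    std::vector<float> x, y, z;
    std::vector<int>   parent;          // index instead of pointer
};
```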
 
I truly hope this is an older revision of Cell, and that in turn the newer Cell has a more robust PPE that would perform at better than 20% of what a P4 can.

It'd be helpful to know just what version of Cell was used in this experiment.
 
ERP said:
There are significant differences between the PPE and the XB360 processors
Regardless of the differences - they both have some rather nasty performance-killing gotchas, and I reckon that's one likely explanation of the results in the above paper.

especially their vector units, and the paper is really unclear about what it's measuring on the PPE.
I think it's pretty safe to assume VMX had no significant role in those results - if there were a large portion of time spent in vector code, I just can't see the performance delta being anywhere near as big.
 
Is there some structural difference in the PPE, in comparison to the SPE, that makes it such a poor performer at physics ops?
 
scificube said:
The idea here is that the task is being done as fast as it can be done in both instances. If you had three more CPUs you could allocate to the task, the result would be the same: it takes them less time to complete their share of the work, but all in all, since no more work is being asked of them, the resulting total time to completion stays the same.

No, if performance scaled linearly, the time to completion would halve.

If you have a task that takes 3 seconds on one CPU, and it's trivially parallel and performance scales linearly with more cores, it'll take 1 second with 3 CPUs, half a second with 6. Remember, they're working in parallel, you don't add their completion times.

Now, to extend your example a little: imagine there was a 7th CPU that had to do work for the frame, and it took 2 seconds to complete. Then regardless of the fact that the 6 other CPUs finish their work for the frame in 0.5 seconds, the total frame time is still 2 seconds, because you can only complete as fast as your slowest component lets you. As my understanding goes, that's what's happening here with the PPE and the SPEs. The SPEs would be idle for some time after they complete, or between calculations, waiting for the PPE to finish up its work and prep them for the next task.
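(Toy numbers, nothing to do with the actual paper, just to show the arithmetic:)

```cpp
#include <algorithm>
#include <cstdio>

// Toy model: the frame time is bounded by the slowest participant,
// not the sum of everyone's work.
int main()
{
    const double spe_work   = 3.0;                 // seconds of integration work
    const int    num_spes   = 6;
    const double spe_time   = spe_work / num_spes; // 0.5s if it scales linearly
    const double ppe_time   = 2.0;                 // serial work done elsewhere
    const double frame_time = std::max(spe_time, ppe_time);

    std::printf("SPEs finish in %.1fs, PPE in %.1fs -> frame takes %.1fs\n",
                spe_time, ppe_time, frame_time);
}
```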
 
scificube said:
I guess what I'm trying to ask is whether these tests flesh out the full capability of the CBEA instead of measuring the performance speedup of the CBEA with similar workloads.
In my view the team set out to demonstrate (yet again, sigh) how much faster Cell is than P4.

This led them to aim for the "low hanging fruit" of the problem, which they also believed to be the most computationally intensive, and therefore the most ripe for showing how much faster Cell is than P4.

Unfortunately they were hoisted by their own petard - which I find greatly amusing. Cell bit them on the arse because they didn't manage bandwidths and bottlenecks while they were busy writing a distributed integration harness.

At least they're honest about it. Gawd knows how the marketing department would spin these results.

Jawed
 
This benchmark is a trick to make Cell look good

Jawed said:
Interesting - even if I did skip over the maths involved in the physics modelling.
That's a very important part that you missed:
IBM said:
However, an alternate method was chosen for this project that was more robust, efficient, and better suited to SIMD hardware, namely semi-implicit integration of a penalty force-based system, that is, a dynamic system that enforces constraints by applying a restorative force when the constraint is violated.

They basically just skipped over the toughest part of doing physics and did it the lazy, slow way: just apply a restoration force to any joint that violates its constraints.

I tried this when doing a physics engine, because I successfully did something similar to this in a simple 3D truck simulation way back in high school (in Turbo Pascal :)). Let me tell you that for real-world situations it converges veeeery slowly for many systems. They borrowed the technique from a cloth simulation paper, but cloth behaves very differently from rigid bodies. Even simply stacking a couple of boxes is really tough for this method, because there is strong coupling between points that creates "loops" of dependency in the system. You need tons of iterations to make it work for anything but the most simple of scenarios. Obviously they're not comparing apples to oranges, and did this for all the architectures, but it's a horrible way to do physics for a game like HL2, for example.
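(For anyone who hasn't seen the technique, here's my own toy 1D sketch of what a penalty constraint plus semi-implicit Euler boils down to; it's just an illustration with made-up constants, not IBM's code:)

```cpp
#include <cmath>
#include <cstdio>

struct Particle { float pos, vel, force, invMass; };

// Penalty method: when the distance constraint between two particles is
// violated, add a restoring spring force proportional to the violation.
// Making this look rigid needs a very stiff k (and a tiny dt or lots of
// iterations), which is exactly why it converges slowly for stacked bodies.
void penalty_constraint_1d(Particle& a, Particle& b, float restLength, float k)
{
    float delta     = b.pos - a.pos;
    float dir       = (delta > 0.0f) ? 1.0f : -1.0f;
    float violation = std::fabs(delta) - restLength;
    if (violation > 0.0f) {
        float f = k * violation;    // restorative force along the constraint
        a.force += f * dir;
        b.force -= f * dir;
    }
}

// Semi-implicit (symplectic) Euler: velocity first, then position with the
// *new* velocity. Branch-free and trivially SIMD-able, hence "suited to SIMD".
void integrate(Particle& p, float dt)
{
    p.vel  += p.force * p.invMass * dt;
    p.pos  += p.vel * dt;
    p.force = 0.0f;                 // clear the accumulator for the next step
}

int main()
{
    Particle a{0.0f, 0.0f, 0.0f, 1.0f};
    Particle b{1.5f, 0.0f, 0.0f, 1.0f};      // 0.5 beyond a rest length of 1.0
    const float dt = 1.0f / 60.0f, k = 50.0f;
    for (int step = 0; step < 120; ++step) { // undamped, so it will oscillate
        penalty_constraint_1d(a, b, 1.0f, k);
        integrate(a, dt);
        integrate(b, dt);
    }
    std::printf("separation after 2s: %f\n", b.pos - a.pos);
}
```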

Here's my point: Their physics algorithm is probably a factor of 10 or 100 slower than a real physics engine (in the real world) just to make it run relatively faster on Cell.

If you want to make a real physics solver, your massive system of differential equations needs to be solved by a method other than conjugate gradient. If you're smart, you can try to decouple this system into smaller equations, which helps tremendously with speed, but you have to do it right. I'm just blown away that IBM can call this even remotely typical of a game server.
 
The Intel Pentium 4 3.0GHz launched in April 2003.

Disappointing performance against a three-year-old CPU.
A third-party benchmark against a dual-core Athlon 64, optimized to fit in the cache and with other optimizations that are missing from the PC code, would show even worse results.
 
DarkRage said:
The Intel Pentium 4 3.0GHz launched in April 2003.

Disappointing performance against a three-year-old CPU.
A third-party benchmark against a dual-core Athlon 64, optimized to fit in the cache and with other optimizations that are missing from the PC code, would show even worse results.

What about optimisations for the Cell code? Read the paper; this clearly isn't an optimal implementation for Cell. And what about the delta in clock speed?
 