IBM on Cell as an online games server (doing physics simulation).

So far in the papers I've seen where the PPE has been benchmarked (this and the Alias Wavefront cloth solver) the PPE has performed well below the P4. Why is this? Is it because the algorithm being run on Cell isn't PPE friendly, so when the PPE does the work on its own it crawls? Or is the PPE fundamentally limited in its calculation capabilities despite having a high clock, chunky VMX, fair cache size etc.?

It also poses the question as to how much the SPEs are going to be PPE bound. If they need the PPE to part-process the data to feed the SPEs in a game engine, the PPE could well have its hands full setting up data for the SPEs. If in this case the PPE was fully under load, if it had to manage IO, AI, and other sundries, the physics would suffer. It seems a lot of rethinking might be needed to rework approaches to promote SPE independence from the PPE. This could well see very diverse software appearing for Cell as different developers try different techniques with varying degrees of success. It also goes to show how important opening Cell up to community development is. With a large userbase of 'homebrew researchers' experimenting, development of new methods should be a lot faster.
 
Shifty Geezer said:
It also poses the question as to how much the SPEs are going to be PPE bound. If they need the PPE to part-process the data to feed the SPEs in a game engine, the PPE could well have its hands full setting up data for the SPEs.

The approach they took did that, but it wasn't a necessary approach. One of the things they highlight for change is that very issue, reducing or eliminating the need for the PPE to set up data for the SPEs.
 
The Alias cloth demo was hypothesised to be running on a DD1 Cell, which in turn was hypothesised to be FPU only (not VMX capable).

In this paper they describe PPE VMX. Having said that, the Alias demo showed similarly awful performance for Cell PPE on its own.

I wonder if in either demo they bothered to utilise the symmetric hardware threads in PPE? Somehow I suspect not, because that's also a great way to make SPEs look better, by crippling the PPE.

Still, SMT wouldn't necessarily save PPE - but on the other hand it might have made a significant difference to the scalability of this implementation.

Though, with Mintmaster's comments, it seems IBM were being rather more deceptive than I realised (I'm really out of my depth with physics) and this really was nothing more than a boutique demo of Cell's power that went horribly wrong.

Jawed
 
Titanio said:
What about optimisations for the Cell code? Read the paper, this clearly isn't an optimal implementation for Cell. And what about the delta in clockspeed?

I have read the paper.
What about the implementations for Pentium4? I haven't seen any comments on hyperthreading or cache implementations which would have to be changed in the migration. But there are low-level optimizations for fitting the SLE size and others. I doubt IBM has spent the same time optimizing for Cell as for P4, don't you?
Now we have to assume it is an optimal implementation for a Pentium4... I don't think so.

What about the delta clockspeed? Well, you have some delta clockspeed in the PC CPUs too in the last 3 years. Actually you have two cores instead of one, running at higher speeds, which in a problem like this one is more than a 100% improvement.

What I am saying is that if you want to compare to the PC CPUs, do it with the 2006 ones, not against 3-year-old CPUs.
 
DarkRage said:
I have read the paper.
What about the implementations for Pentium4? I haven't seen any comments on hyperthreading or cache implementations which would have to be changed in the migration. But there are low-level optimizations for fitting the SLE size and others. I doubt IBM has spent the same time optimizing for Cell as for P4, don't you?

I think it's fair to say that both implementations were sub-optimal. Regardless, an optimal implementation on Cell would look very different from that on the P4. They did at least use SSE, though, and in fact it sounds like the code started life on Wintel.

DarkRage said:
What about the delta clockspeed? Well, you have some delta clockspeed in the PC CPUs too in the last 3 years. Actually you have two cores instead of one, running at higher speeds, which in a problem like this one is more than a 100% improvement.

What I am saying is that if you want to compare to the PC CPUs, do it with the 2006 ones, not against 3-year-old CPUs.

A fair enough point. But they should also use the latest Cell, not the 2.4GHz 6-SPE version.
 
Fafalada said:
I think it's pretty safe to assume VMX had no significant role in those results - if there was a large portion of time spent in vector code I just can't see performance delta being anywhere near as big.

I agree with this statement.
 
ralexand said:
Is there some structural difference in the ppe in comparison to the spe that makes it such a poor physics op performer?

Its VMX unit doesn't have enough registers to effectively hide the instruction latency, but it's also being timed for the whole task, whereas the SPEs are only being timed for the compute-intensive integration portion.

As Faf mentioned above, I'd be surprised if there was significant VMX work on the PPE; there just isn't that much disparity between it and the SPEs when it comes to vector computational performance, even given the register shortage.
 
Titanio said:
No, if performance scaled linearly, the time to completion would halve.

If you have a task that takes 3 seconds on one CPU, and it's trivially parallel and performance scales linearly with more cores, it'll take 1 second with 3 CPUs, half a second with 6. Remember, they're working in parallel, you don't add their completion times.

Now, to extend your example a little. Imagine there was a 7th CPU that had to do work for the frame, and it took 2 seconds to complete. Then regardless of the fact that the 6 other CPUs finish their work for the frame in 0.5 seconds, the total frame time is still 2 seconds, because you can only complete as fast as your slowest component lets you. As my understanding goes, that's what's happening here with the PPEs and the SPEs. The SPEs would be idle for some time after they complete, or between calculations, waiting for the PPE to finish up its work and prep them for the next task.

I see what you're saying about the work being done in parallel. Seems I used yet another bad example of what I was trying to say.

I would have to make the slower CPUs take 3 seconds to complete the work on their portions of the task for what I said to make sense. DOH!

Oh well. I'd really like to know whether performance on the integration portion of the task continued to scale, even while the PPE bottlenecked everything else.

At least these tests highlight that care must be taken with BOTH the PPE and SPEs of the CBEA.
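For what it's worth, the arithmetic in Titanio's example can be written down as a toy model. This is only a sketch: the function names are made up for illustration, and the 3-second and 2-second figures come from the example quoted above, not from anything measured in the paper.

```python
def parallel_time(work_seconds: float, cores: int) -> float:
    """Time for trivially parallel work that scales linearly with core count."""
    return work_seconds / cores


def frame_time(spe_work_seconds: float, spe_count: int, ppe_seconds: float) -> float:
    """A frame is done only when the slowest participant is done.

    The SPEs share spe_work_seconds of work between them, while the PPE
    needs ppe_seconds on its own (hypothetical serial setup work).
    """
    return max(parallel_time(spe_work_seconds, spe_count), ppe_seconds)


if __name__ == "__main__":
    for cores in (1, 3, 6):
        print(cores, parallel_time(3.0, cores), frame_time(3.0, cores, 2.0))
```

Once the parallel part drops below the 2 seconds of serial work, adding more SPEs no longer shortens the frame in this model; they just sit idle for longer.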
 
Maybe I'm reading this wrong here, but adding it all up the 2.4GHz Cell is 11 times more powerful in the first one and 7 times more powerful in the second one.
This is without optimisations on a low-clocked Cell with only 6 SPEs, and you guys are disappointed???
In the worst-case scenario it's 7 times more powerful. That's amazing; show me one benchmark of a dual-core A64 or P4 doing 2 or 3 times the performance of the 3.2GHz P4.
 
You're not reading it correctly, there's nothing to add up.
(Except if I'm reading it completely wrong, which I'm fairly sure is not the case ;))
 
Guilty Bystander said:
Wouldn't the weak PPE performance be because it's ordering all the workload of the SPEs?

That's the problem. It is weakly performing at scheduling the SPEs, at least in the way the engineers (and possibly most other programmers in the world) had assumed it would work.

Changing the data format and offloading more scheduling to an SPE would help this, but it was an unpleasant surprise to find the bottleneck was that limiting.
 
Guilty Bystander said:
Wouldn't the weak PPE performance be because it's ordering all the workload of the SPEs?

It was benched on its own, or appeared to be.

We'd seen relatively weak PPE performance in other benches too.
 
"I'd be surprised if there was significant VMX work on the PPE, there just isn't that much disparity between it and the SPE's when it comes to vector computational performance, even given the register shortage."

Hmm. I don't understand.
 
blakjedi said:
"I'd be surprised if there was significant VMX work on the PPE, there just isn't that much disparity between it and the SPEs when it comes to vector computational performance, even given the register shortage."

Hmm. I don't understand.

For vector computation, a VMX unit and an SPE aren't a million miles apart. They have different numbers of registers to work with, though (32, or 32 per thread, on the PPE vs 128 in an SPE).
 
Good thinking ... and since the large number of registers helps code to hide latency (e.g. by allowing loop-unrolling and pre-fetching), perhaps the register count amounts to an overwhelming advantage for SPEs over PPE.
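To put some rough numbers on that idea, here is a toy model of latency hiding. It is a sketch under loud assumptions: the 12-cycle latency and the 8 registers per unrolled iteration are hypothetical values picked for illustration, and only the 32 and 128 register counts come from this thread. The model assumes one instruction can issue per cycle and that each unrolled iteration is an independent dependency chain.

```python
def chains_in_flight(registers: int, regs_per_chain: int) -> int:
    """How many independent unrolled iterations fit in the register file."""
    return max(1, registers // regs_per_chain)


def cycles_per_op(latency_cycles: int, chains: int) -> float:
    """Steady-state cycles per operation when interleaving independent chains.

    With a single chain every operation waits out the full latency; with
    enough chains the pipeline issues one operation per cycle.
    """
    return max(1.0, latency_cycles / chains)


if __name__ == "__main__":
    LATENCY = 12         # hypothetical instruction latency in cycles
    REGS_PER_CHAIN = 8   # hypothetical registers needed per unrolled iteration
    for registers in (32, 128):
        chains = chains_in_flight(registers, REGS_PER_CHAIN)
        print(registers, chains, cycles_per_op(LATENCY, chains))
```

With these made-up figures, 32 registers leave the pipeline stalled for part of every operation while 128 registers cover the latency completely. Real code depends heavily on the compiler's scheduling, so treat it as an illustration of the mechanism only.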

Hmm...

Admittedly it's so long since I read the article that I can't remember if there was much discussion of this specific point.

Jawed
 
Jawed said:
Good thinking ... and since the large number of registers helps code to hide latency (e.g. by allowing loop-unrolling and pre-fetching), perhaps the register count amounts to an overwhelming advantage for SPEs over PPE.
Having the extra registers definitely helps performance - that's why MS bumped up the number of VMX registers to 128 for the Xenon.

Another important factor that hasn't been mentioned so far is the quality of compiler code generation. The Xenon, PPU, SPUs and P4 use different compilers and currently there are big differences in the quality of code produced by different compilers, particularly for vector / SIMD heavy code.
 
For vector computation, a VMX unit and an SPE aren't a million miles apart. They have different numbers of registers to work with, though (32, or 32 per thread, on the PPE vs 128 in an SPE).

The SPEs have a local store, the PPE has a cache. The differences in the way these work and in their speed are likely to have a very heavy influence on performance.
 
ADEX said:
The SPEs have a local store, the PPE has a cache. The differences in the way these work and in their speed are likely to have a very heavy influence on performance.

I latterly thought of this, and it is true. I guess anywhere you can control memory access ahead of time, the local store is going to be a good advantage for the SPEs, given its speed.
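A rough way to see why knowing your memory accesses ahead of time helps: if the transfer of the next chunk into local store is started while the current chunk is still being processed (double buffering), the transfer time hides behind the compute time. The sketch below is a toy timing model with made-up numbers and hypothetical function names. It is not SPE DMA code and not a measurement.

```python
def blocking_time(chunks: int, transfer: float, compute: float) -> float:
    """Fetch a chunk, wait for it, process it, then repeat."""
    return chunks * (transfer + compute)


def double_buffered_time(chunks: int, transfer: float, compute: float) -> float:
    """Start fetching chunk i+1 while chunk i is being processed.

    Only the first transfer and the last compute are fully exposed;
    every step in between costs whichever of the two is slower.
    """
    if chunks == 0:
        return 0.0
    return transfer + (chunks - 1) * max(transfer, compute) + compute


if __name__ == "__main__":
    # Hypothetical workload: 4 chunks, 1 time unit to transfer, 2 to process.
    print(blocking_time(4, 1.0, 2.0))
    print(double_buffered_time(4, 1.0, 2.0))
```

A cache can only overlap like this if the hardware or software prefetch guesses the access pattern correctly, whereas with a local store the overlap is under the programmer's explicit control.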
 