http://www.research.ibm.com/journal/sj/451/damora.html
hmm, so how about the XCPU for physics simulation (its three cores each have a VMX128 unit)?
The lower performance of the PPE is the major factor in considering overall system performance when porting legacy code. The SPEs performed very well, beyond our expectations, and we did not experience any DMA-related performance issues. Nevertheless, it is clear that with the current Cell BE hardware, there is not a performance advantage unless most of the non-SPE-optimized profile is moved to SPE code. Even using the SPEs to process scalar codes would offer a significant advantage over executing the same code on the PPE.
tema said: XCPU (A) 0.2 x 3 = 0.6 (B) 0.18 x 3 = 0.54?
Which was a little naive (they admit this), given that the Cell implementation of the "game" has a significant computational overhead in order to split the workload across SPEs.

Titanio said: They just didn't bank on the PPE slowing down with the rest of the work, vs. the P4.
Jawed said: Gubbi, I think they were misdirected in their approach, so the craptacular results are more a reflection of the dead ends they encountered than of Cell intrinsically.
Gubbi said: Still, isn't this more or less as expected?
The compute-bound parts of the physics engine are offloaded to the SPEs and see a great speedup; the remaining pointer-chasing-bound part (collision detection) is not, because it isn't as straightforward.
Gubbi said: Well, the multicycle schedule-execute latency for instructions, and the high load-to-use latency of the memory arrays (probably the biggest culprits in the poor collision-detection performance), are intrinsic to Cell, so I disagree.
To get solid performance they would have to be able to distribute collision detection too. We've discussed this in other threads, it is not a workload that fits the SPEs well.
Cheers
blakjedi said: This seems pretty unspectacular considering the fact you are pitting 7 cores versus a single processor... a 4x speedup is NOT impressive imho...
blakjedi said: How would it compare next to one of the new dual-core Athlons/P4s?
ERP said: But it's more interesting to compare the performance in the artificial test to the performance in a real scenario, because it demonstrates how misleading artificial tests can be.
The Cell Broadband Engine™ processor employs multiple accelerators, called synergistic processing elements (SPEs), for high performance. Each SPE has a high-speed local store attached to the main memory through direct memory access (DMA), but a drawback of this design is that the local store is not large enough for the entire application code or data. The application must be decomposed into pieces small enough to fit into local memory, and they must be replaced through DMA without losing the performance gain of multiple SPEs. We propose a new programming model, MPI microtask, based on the standard Message Passing Interface (MPI) programming model for distributed-memory parallel machines. In our new model, programmers do not need to manage the local store as long as they partition their application into a collection of small microtasks that fit into the local store. Furthermore, the preprocessor and runtime in our microtask system optimize the execution of microtasks by exploiting explicit communications in the MPI model. We have created a prototype that includes a novel static scheduler for such optimizations. Our initial experiments have shown some encouraging results.
If it's a CEB 20x0 series, it's a DD2+.

Panajev2001a said: They mention the game server was a CBEA prototype board running at 2.4 GHz and with 6 SPEs... it seems like they were using the DD1 revision. In this case you would not only be 800 MHz from the final speed achieved by the CBEA processor in, say, PLAYSTATION 3, but you would also have a decisively slower PPE implementation (the PPE grew 2x going from DD1 to DD2).
...Another important design change that would alleviate the PPE bottleneck involves the data structures used to store the game scene data, which must be transferred to the SPEs for the collision-detection and integration calculations. These structures, which describe such things as rigid bodies, collision bodies, and forces, were designed as C++ structures in the code base with which we started, and tend to be somewhat complex. For example, a collision body includes (among other things) a vector of shared faces, each of which has a normal, a vector of edges, and a vector indicating which face is on the opposite side of each edge. This complexity allows a degree of abstraction in C++ that makes algorithm development much easier. When we moved the integration to the SPEs, we had two options: packing the information from the various structures needed for each workload into contiguous storage on the PPE side and copying it to the SPE as one “chunk,” or sending the addresses of the structures to the SPE and letting it crawl through the C++ structures to get the necessary data. We chose the former approach because the latter would have been difficult to program and error-prone, and we expected the PPE performance to be somewhat better than it turned out to be. However, the packing step is highly inefficient and further burdens the PPE with a data-processing task that not only would be unnecessary on an Intel platform, but would execute five times faster. An extension to the SPE compiler which would read the PPE's C++ data structures would be helpful in porting the code, but ultimately the data structures need to be simplified to obtain maximum performance.
For collision detection, we are investigating ways to simplify the data structures on the PPE side so that we can transfer blocks of contiguous storage, with pointers to vectors (again in contiguous storage) for some of the variable-sized data structures. This will require some additional discipline on the PPE side. We can store pointers to C++ vectors without resetting them every time we send them to the SPE as long as we do not add elements to the vectors during game play, because this may cause the data to be moved. This simplification also requires more complex code on the SPE side to transfer the data, but should make it possible to improve performance significantly. If so, we would investigate doing something similar with the data structures that are used in the integration step in order to break the bottleneck there...