http://www.research.ibm.com/journal/sj/451/damora.html
hmm, so how about the XCPU for physics simulation (its three cores each have a VMX128 unit)?
The lower performance of the PPE is the major factor in considering overall system performance when porting legacy code. The SPEs performed very well, beyond our expectations, and we did not experience any DMA-related performance issues. Nevertheless, it is clear that with the current Cell BE hardware, there is not a performance advantage unless most of the non-SPE-optimized profile is moved to SPE code. Even using the SPEs to process scalar codes would offer a significant advantage over executing the same code on the PPE.
tema said: XCPU (A) 0.2 x 3 = 0.6 (B) 0.18 x 3 = 0.54?
Which was a little naive (they admit this), given that the Cell implementation of the "game" has a significant computational overhead in order to split the workload across SPEs.

Titanio said: They just didn't bank on the PPE slowing down with the rest of the work, vs. the P4.
Jawed said: Gubbi, I think they were misdirected in their approach, so the craptacular results are more a reflection of the dead ends they encountered than of Cell intrinsically.
Gubbi said: Still, isn't this more or less as expected?
The compute-bound parts of the physics engine are offloaded to the SPEs and see a great speedup; the remaining pointer-chasing-bound part (collision detection) is not, because it isn't as straightforward.
Gubbi said: Well, the multicycle schedule-execute latency for instructions, and the high load-to-use latency of the memory arrays (probably the biggest culprits in the poor collision-detection performance), are intrinsic to Cell, so I disagree.
To get solid performance they would have to be able to distribute collision detection too. We've discussed this in other threads, it is not a workload that fits the SPEs well.
Cheers
blakjedi said: This seems pretty unspectacular considering the fact you are pitting 7 cores versus a single processor... a 4x speedup is NOT impressive imho...
blakjedi said: How would it compare next to one of the new dual-core Athlons/P4s?
ERP said: But it's more interesting to compare the performance in the artificial test to the performance in a real scenario, because it demonstrates how misleading artificial tests can be.
The Cell Broadband Engine™ processor employs multiple accelerators, called synergistic processing elements (SPEs), for high performance. Each SPE has a high-speed local store attached to the main memory through direct memory access (DMA), but a drawback of this design is that the local store is not large enough for the entire application code or data. The application must be decomposed into pieces small enough to fit into local memory, and they must be replaced through DMA without losing the performance gain of multiple SPEs. We propose a new programming model, MPI microtask, based on the standard Message Passing Interface (MPI) programming model for distributed-memory parallel machines. In our new model, programmers do not need to manage the local store as long as they partition their application into a collection of small microtasks that fit into the local store. Furthermore, the preprocessor and runtime in our microtask system optimize the execution of microtasks by exploiting explicit communications in the MPI model. We have created a prototype that includes a novel static scheduler for such optimizations. Our initial experiments have shown some encouraging results.
If it's a CEB 20x0 series, it's a DD2+.

Panajev2001a said: They mention the game server was a CBEA prototype board running at 2.4 GHz and with 6 SPEs... it seems like they were using the DD1 revision. In this case you would not only be 800 MHz from the final speed achieved by the CBEA processor in, say, PLAYSTATION 3, but you would also have a decisively slower PPE implementation (the PPE grew 2x going from DD1 to DD2).
...Another important design change that would alleviate the PPE bottleneck involves the data structures used to store the game scene data, which must be transferred to the SPEs for the collision-detection and integration calculations. These structures, which describe such things as rigid bodies, collision bodies, and forces, were designed as C++ structures in the code base with which we started, and tend to be somewhat complex. For example, a collision body includes (among other things) a vector of shared faces, each of which has a normal, a vector of edges, and a vector indicating which face is on the opposite side of each edge. This complexity allows a degree of abstraction in C++ that makes algorithm development much easier. When we moved the integration to the SPEs, we had two options: packing the information from the various structures needed for each workload into contiguous storage on the PPE side and copying it to the SPE as one “chunk,” or sending the addresses of the structures to the SPE and letting it crawl through the C++ structures to get the necessary data. We chose the former approach because the latter would have been difficult to program and error-prone, and we expected the PPE performance to be somewhat better than it turned out to be. However, the packing step is highly inefficient and further burdens the PPE with a data-processing task that not only would be unnecessary on an Intel platform, but would execute five times faster. An extension to the SPE compiler which would read the PPE's C++ data structures would be helpful in porting the code, but ultimately the data structures need to be simplified to obtain maximum performance.
For collision detection, we are investigating ways to simplify the data structures on the PPE side so that we can transfer blocks of contiguous storage, with pointers to vectors (again in contiguous storage) for some of the variable-sized data structures. This will require some additional discipline on the PPE side. We can store pointers to C++ vectors without resetting them every time we send them to the SPE as long as we do not add elements to the vectors during game play, because this may cause the data to be moved. This simplification also requires more complex code on the SPE side to transfer the data, but should make it possible to improve performance significantly. If so, we would investigate doing something similar with the data structures that are used in the integration step in order to break the bottleneck there...