A new programming paradigm for the Cell
First of all, I'm not a game developer. But from what I know of distributed programming, I would suggest that the current programming style on the Cell is questionable.
The Cell is a parallel multiprocessor, so it works best when resources and computation are shared among its sub-processors (cores).
This is supported by the following paper:
http://hpc.pnl.gov/people/fabrizio/papers/ipdps07-graphs.pdf
2 Cells beat the performance of a 128-CPU BlueGene/L and are pretty much neck and neck with a 256-CPU BlueGene/L.
I have seen many PS3 games that don't use the distributed programming approach, even though it looks quite nice.
For example: HS uses 2 SPUs for AI and a few others for physics...
This style of dividing sub-processors for specific tasks is not efficient.
At school, a professor in our department tested his algorithm: on 2 SPUs it ran 2.2 times faster than on 1 SPU, but when he distributed the algorithm across 5 SPUs it ran 16 times faster than on 1 SPU.
So the question to developers is: should you design your code so that it is distributed across all SPUs?
Suppose you want 60 fps, which means a frame takes 16.67 milliseconds. Out of this 16.67, a third is dedicated to the graphics card and the rest goes to the Cell for other tasks such as geometry, animation or physics... So on all SPUs together, geometry would take x, animation y and physics z of the remaining 11.11 milliseconds.
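The budget arithmetic above can be written out as a small sketch. The 60 fps target and the one-third share for the graphics card come from the post; the x/y/z values in `budget_ms` are purely hypothetical placeholders, not measurements from any game.

```python
# Frame-time budget for the 60 fps example above.
TARGET_FPS = 60
GPU_SHARE = 1 / 3  # share of the frame given to the graphics card (from the post)

frame_ms = 1000 / TARGET_FPS          # about 16.67 ms per frame
cell_ms = frame_ms * (1 - GPU_SHARE)  # about 11.11 ms left for the Cell

# Hypothetical x, y, z split of the Cell's share (illustrative values only).
budget_ms = {"geometry": 4.0, "animation": 3.0, "physics": 4.0}


def fits(budget, limit_ms):
    """Return True if the per-task budgets fit inside the per-frame limit."""
    return sum(budget.values()) <= limit_ms


print(f"frame: {frame_ms:.2f} ms, Cell budget: {cell_ms:.2f} ms")
print("fits:", fits(budget_ms, cell_ms))
```

The point of the check is that x + y + z has to stay within the 11.11 ms, however the work is spread over the SPUs.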
SPURS (SPU Runtime System) is available to PS3 developers, AFAIK. In SPURS, the SPEs are the main processors and the PPE is merely a service co-processor invoked only when absolutely necessary.
one, do you remember if the PPE is involved in any way when an SPU accesses main memory? My current impression is that the PPE is needed during the setup (memory map) stage. Thereafter, the SPU should be able to access main memory or another local store without any external help. There was also some mention of SPUs interacting with I/O devices (via the PPE?) but I can't remember where I read it anymore.
I don't remember the exact details, but in the ideal workload configuration the PPE does only the tasks that only the PPE can do (setup, SPE booting, system calls).
Also, an SPU does lock-free synchronization through the atomic cache in the SPE and doesn't use mutexes in shared memory or those provided by the OS API, since all OS function calls are costly remote procedure calls via the PPU, and the scheduling of PPU threads is independent of the SPEs. In a SPURS job, DMAs are automatically overlapped and pipelined.
inefficient
21-Mar-2007, 03:54
For example: HS uses 2 SPUs for AI and a few others for physics...
That information is old/incorrect. HS and most of the non-launch-window games coming out do not use the SPUs in that fashion. They use a more robust job/task system. If not SPURS, then something in a similar vein.
That information is old/incorrect. HS and most of the non-launch-window games coming out do not use the SPUs in that fashion. They use a more robust job/task system. If not SPURS, then something in a similar vein.
I hope so because if they want to see a leap in performance they must do it in a distributed way.
rendezvous
22-Mar-2007, 07:36
At school, a professor in our department tested his algorithm: on 2 SPUs it ran 2.2 times faster than on 1 SPU, but when he distributed the algorithm across 5 SPUs it ran 16 times faster than on 1 SPU.
I think this sounds a bit fishy. I have yet to see any performance increase that isn't sublinear in the number of SPUs. I could imagine scenarios where it could happen, when you are limited by the amount of local RAM, and inter-SPE communication and computation aren't the limiting factors.
Could you please shed some light on what kind of algorithm he was testing, preferably with information on how he could reach such impressive numbers?
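To make the "sublinear" point concrete, here is a minimal sketch that turns the two reported figures into parallel efficiency (speedup divided by the number of SPUs). The numbers are the ones quoted in the thread; nothing is known here about the algorithm itself.

```python
def parallel_efficiency(speedup, n_units):
    """Speedup per processing unit; 1.0 means perfectly linear scaling."""
    return speedup / n_units


# Figures quoted in the thread: {number of SPUs: speedup relative to 1 SPU}.
reported = {2: 2.2, 5: 16.0}

for n_spus, speedup in reported.items():
    eff = parallel_efficiency(speedup, n_spus)
    kind = "superlinear" if eff > 1.0 else "linear or sublinear"
    print(f"{n_spus} SPUs: speedup {speedup}x, efficiency {eff:.2f} ({kind})")
```

Both data points come out above 1.0, which is why an effect beyond simply adding cores (for example, the working set starting to fit in local store) would be needed to explain them.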
Shompola
22-Mar-2007, 13:17
This is definitely an issue of memory usage: the data (chunks etc.) is of a size that fits the combined local memory better, which reduces accesses to the main RAM pool significantly. What happens if he increases data usage? The speed-up factor should decrease.
inefficient
22-Mar-2007, 14:32
This is definitely an issue of memory usage: the data (chunks etc.) is of a size that fits the combined local memory better, which reduces accesses to the main RAM pool significantly. What happens if he increases data usage? The speed-up factor should decrease.
According to data from this presentation (link (http://www3.stream.co.jp/www11/jinzai/20070301/04/index.html)), the actual latency for one SPU reading from another SPU is still a whopping 200 cycles. In comparison, a DMA from XDR to LS is 500 cycles.
Not to be a skeptic, but given those numbers, I don't see the quoted 16x performance increase coming simply from utilizing the combined LSs for better memory performance.
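A rough way to see why the latency numbers alone do not explain 16x: the sketch below uses a simple Amdahl-style model in which only the memory-wait fraction of the runtime gets faster when accesses move from XDR (500 cycles) to a neighbouring local store (200 cycles). The model and the `memory_fraction` parameter are assumptions for illustration; it ignores bandwidth, overlap of DMA with compute, and everything else about the real workload.

```python
XDR_TO_LS_CYCLES = 500   # latency quoted in the thread for a DMA from XDR to LS
SPU_TO_SPU_CYCLES = 200  # latency quoted in the thread for SPU-to-SPU reads


def latency_bound_speedup(memory_fraction):
    """Speedup if only the memory-wait share of the runtime shrinks.

    memory_fraction is the (hypothetical) share of runtime spent waiting
    on memory before the change, between 0.0 and 1.0.
    """
    ratio = SPU_TO_SPU_CYCLES / XDR_TO_LS_CYCLES
    return 1.0 / ((1.0 - memory_fraction) + memory_fraction * ratio)


for fraction in (0.25, 0.5, 1.0):
    print(f"memory-bound share {fraction:.2f}: {latency_bound_speedup(fraction):.2f}x")
```

Even in the extreme case where the code does nothing but wait on memory, this model tops out at 500/200 = 2.5x.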
To me it sounds more likely that there was just a misunderstanding. Something like: professor X ran a fairly unoptimized simulation on 1 SPU and benchmarked it. Then later down the road, by a combination of exploiting instruction-level parallelism and multiprocessor parallelism, he was able to optimize it to the point of a 16x speedup. I think this is the most realistic scenario.
Unless it was a proof of concept where all code and data fit easily into the 5 x 256K combined memory space. But that kind of example is not really that useful in the real world.
Datasegment
22-Mar-2007, 17:05
According to data from this presentation (link (http://www3.stream.co.jp/www11/jinzai/20070301/04/index.html)), the actual latency for one SPU reading from another SPU is still a whopping 200 cycles. In comparison, a DMA from XDR to LS is 500 cycles.
Not to be a skeptic, but given those numbers, I don't see the quoted 16x performance increase coming simply from utilizing the combined LSs for better memory performance.
Actually, although initiating communications between two SPEs is relatively slow, the actual data throughput rate is phenomenally fast, around the 200 GB/s mark (4 rings * 25.6 GB per second in each direction) - AND this data transaction happens on-chip, with no hit to the main memory subsystem at all. This, coupled with the fact that these memory transactions can be performed while the SPEs simultaneously operate on other data, means that the effective transfer penalty can be reduced to almost zero (setup time and handshaking are still required).
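The throughput figure in the post above can be reproduced with a line of arithmetic. This sketch takes the post's numbers at face value and reads them as gigabytes per second (an assumption); the 16 KB block used for the transfer-time estimate is just an illustrative size.

```python
RINGS = 4             # number of rings, as stated in the post
LINK_GB_PER_S = 25.6  # per-ring rate from the post, read here as GB/s (assumption)
DIRECTIONS = 2        # "in each direction"

aggregate_gb_per_s = RINGS * LINK_GB_PER_S * DIRECTIONS  # about 204.8


def transfer_time_us(n_bytes, gb_per_s=LINK_GB_PER_S):
    """Ideal time in microseconds to move n_bytes at the given rate (1 GB = 1e9 bytes)."""
    return n_bytes / (gb_per_s * 1e9) * 1e6


print(f"aggregate: {aggregate_gb_per_s:.1f} GB/s")
print(f"16 KB at one link's rate: {transfer_time_us(16 * 1024):.2f} us")
```

This is peak arithmetic only; setup and handshaking time, which the post says are still required, are not modelled.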
I think this sounds a bit fishy. I have yet to see any performance increase that isn't sublinear in the number of SPUs. I could imagine scenarios where it could happen, when you are limited by the amount of local RAM, and inter-SPE communication and computation aren't the limiting factors.
Could you please shed some light on what kind of algorithm he was testing, preferably with information on how he could reach such impressive numbers?
He's submitting his paper to a conference. Once his paper is published I'll post it here.
Shompola
23-Mar-2007, 04:41
He's submitting his paper to a conference. Once his paper is published I'll post it here.
That might take up to a few months, no?
Recent (new) Cell programming articles from IBM:
Part 4: http://ps2dev.org/news/Program_the_SPU_for_performance
Part 5: http://ps2dev.org/news/Programming_the_SPU_in_C/C++
one, do you remember if the PPE is involved in any way when an SPU accesses main memory? My current impression is that the PPE is needed during the setup (memory map) stage. Thereafter, the SPU should be able to access main memory or another local store without any external help. There was also some mention of SPUs interacting with I/O devices (via the PPE?) but I can't remember where I read it anymore.
Answering my own question...
Came across a PS3 developer's post in Slashdot.
http://games.slashdot.org/comments.pl?sid=227923&threshold=1&commentsort=0&mode=thread&pid=18466573#18466779
> DMA has latency, and requires a response from the PPU on an interrupt
FWIW, it is possible for DMA to be initiated and controlled completely from the SPUs without any PPU intervention - I'm not sure if this is exposed in PS3 Linux, but the CELL is certainly capable of completely SPU-driven DMA from a hardware perspective. In this case, the only latency is the actual time to fetch data from main memory to local store. This is similar to the type of stall you get with a data cache miss on a general-purpose CPU while moving data from memory to L2 to L1 cache (which can be hundreds of cycles for a single cache miss). The one advantage with programming the CELL is that properly coded algorithms can double-buffer and effectively hide memory access latency to the point where it's no longer an issue. Granted, you can do the same on general-purpose CPUs with Prefetch NTA instructions.
You are right that the maximum for a single DMA transfer is 16K. However, each SPU can queue up to 16 DMA transfers, so effectively an SPU can initiate loads to its entire addressable memory and then continue without interruption or latency in waiting for one DMA to complete before the next begins.
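The "entire addressable memory" remark follows from 16 queued transfers of 16K each, which is 256K, the local store size mentioned earlier in the thread. Below is a minimal sketch of splitting a large transfer into queue-sized chunks. `split_transfer` is a hypothetical helper written for illustration, not part of any Cell SDK, and it ignores the alignment rules real DMA hardware imposes.

```python
MAX_DMA_BYTES = 16 * 1024       # maximum size of a single DMA transfer (from the post)
DMA_QUEUE_DEPTH = 16            # transfers an SPU can queue (from the post)
LOCAL_STORE_BYTES = 256 * 1024  # local store size per SPE, as used in the thread


def split_transfer(total_bytes, max_chunk=MAX_DMA_BYTES):
    """Split one large transfer into (offset, size) chunks of at most max_chunk bytes."""
    chunks = []
    offset = 0
    while offset < total_bytes:
        size = min(max_chunk, total_bytes - offset)
        chunks.append((offset, size))
        offset += size
    return chunks


chunks = split_transfer(LOCAL_STORE_BYTES)
print(len(chunks), "chunks; fits in queue:", len(chunks) <= DMA_QUEUE_DEPTH)
```

Anything larger than 256K would need more than one round of queued transfers in this model.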
More here... http://games.slashdot.org/comments.pl?sid=227923&threshold=1&commentsort=0&mode=thread&pid=18465477#18466339
The following quote fortifies my argument:
All the papers out there from IBM and Sony suggest PREFERRED METHODS which are different from your implied limitation. They advocate cooperatively running multiple small tasks (dozens or even hundreds) rather than dedicating cores to fixed tasks, and double-buffering data (or even code) to mask DMA latencies. Single-task SPU usage can lead to idle SPUs very easily. Additionally, the double-buffer method can hide DMA latencies, which can double the speed of your code if you're roughly equal on memory (DMA) and SPU compute time. We prefer to program SPUs in a manner in which they will be utilized much more fully and run up to 2X faster. For anyone rewriting code for the SPU: since you have to retarget for the SPU anyway, there's no reason to write your code in a manner that will deliberately underperform.
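The "up to 2X" claim for double-buffering can be checked with a simple timing model. This is a sketch under stated assumptions: every block takes the same DMA time and the same compute time, and a transfer can fully overlap with computation on the previous block. The function names are made up for illustration.

```python
def single_buffered_time(n_blocks, dma, compute):
    """Fetch a block, then process it, with no overlap between the two."""
    return n_blocks * (dma + compute)


def double_buffered_time(n_blocks, dma, compute):
    """Fetch block i+1 while processing block i.

    Only the first fetch and the last compute are exposed; in between,
    each step costs whichever of DMA or compute is slower.
    """
    if n_blocks == 0:
        return 0
    return dma + (n_blocks - 1) * max(dma, compute) + compute


n, dma, compute = 100, 1, 1  # equal DMA and compute time, as in the quoted scenario
speedup = single_buffered_time(n, dma, compute) / double_buffered_time(n, dma, compute)
print(f"speedup with double-buffering: {speedup:.2f}x")
```

With equal DMA and compute times the speedup approaches 2 as the number of blocks grows; when one side dominates, the gain shrinks accordingly.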
Same source as above post.
The fixed allocation of SPEs to specific processing tasks is inherently inefficient due to the extreme amounts of idle time experienced. I just find it odd that a game would even contemplate using this design pattern, given its obvious flaws.
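The idle-time argument can be illustrated with a toy scheduling model. The sketch compares a fixed split (2 SPUs for AI, 3 for physics, echoing the example at the top of the thread) against one shared job pool over all 5 SPUs. The job counts and durations are invented for illustration, and the greedy scheduler here is a stand-in, not how SPURS or any real job system works.

```python
import heapq


def makespan(job_times, n_workers):
    """Greedy list scheduling: each job goes to the worker that frees up first."""
    free_at = [0.0] * n_workers
    heapq.heapify(free_at)
    for t in job_times:
        start = heapq.heappop(free_at)
        heapq.heappush(free_at, start + t)
    return max(free_at)


# Hypothetical per-frame workload, in milliseconds per job.
ai_jobs = [1.0] * 2
physics_jobs = [1.0] * 13

# Fixed allocation: the frame is done when the slower group finishes.
fixed = max(makespan(ai_jobs, 2), makespan(physics_jobs, 3))

# Shared pool: any SPU can pick up any job.
pooled = makespan(ai_jobs + physics_jobs, 5)

print(f"fixed allocation: {fixed} ms, shared job pool: {pooled} ms")
```

In this made-up workload the AI SPUs sit idle for most of the frame under the fixed split, while the pool keeps all five busy. A workload that happens to match the fixed split would show no difference.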
Even though Cell can be a building block for a supercomputer, it doesn't mean we always have to use it that way. In supercomputing, the problems are large and compute-intensive, but people may not need the answers right away, so it is best to solve 1 problem using all the available nodes and hope that the answer comes out asap. The system is usually optimized for efficiency (so it can scale to larger problems with reasonable speedup).
I remember Kutaragi mentioned that Cell is also suitable for running interactive, real-time applications. The needs are different although they "prefer" similar CPU traits.
A real-time application, like a game, has strict timing requirements even under heavy load and with many concurrent activities. Because it has more cores, Cell can afford to allocate different ones to separate tasks to meet multiple (stringent) schedules. It is useless to achieve a 100x speedup for large problem sizes if the answer is always late, even for small problem sizes.
The fixed allocation of SPEs to specific processing tasks is inherently inefficient due to the extreme amounts of idle time experienced. I just find it odd that a game would even contemplate using this design pattern, given its obvious flaws.
Because it's easier to fit into an existing single-core game design?