A glimpse inside the CELL processor

Interesting stuff, SPM, thanks.

For the record, I never meant that programming and maximizing the potential for either console would be "easy"; I know that it certainly isn't. I think my post has been misconstrued a bit. It just seemed to me like the Cell was a bit more complicated, but some of the information from SPM that's been brought to my attention makes me wonder if my original statement is a little exaggerated.

Darkon--like someone said, those are mostly cinemas you've been looking at. And, I haven't been all that impressed with the pics I've seen of Heavenly Sword lately. Motor Storm isn't turning out to look anything like what Sony showed at E3 last year, either. I think Sony and their partners have been deceiving people a little.
 
SPM said:
Xenon programming for games isn't necessarily going to be easier than Cell - at least if you want to make use of all the cores.

Multi-core processors pose a lot of problems, but one area Xenon has an advantage in "ease of use" (i.e. easier) for some developers is that it is very similar to the PC model of multithreading at this point in time. That alone may make the Xenon approach easier to many developers as the model is one they are not only familiar with but it is the trend of a huge segment of the industry.

This can be done in Xenon as well, but in Cell, this concept is easier due to the fact that the concept is built into the hardware

But as you note, Xenon can do this as well. The difference is that Xenon is not limited to a specific arrangement, whereas with Cell you have no real choice: you have 1 PPE and 7 SPEs. And you have the added difficulty of having to fit the main game loop on the PPE and then reworking your code to run on an asynchronous core that is quite different, where you have to fit much of your code and data into a very small space. Is that easier than having 3 identical cores? So far developers have not indicated yes. The Chronicles of Riddick devs, while noting the potential of Cell, commented on the difficulty in this very area (not to mention id and Valve, etc.).

and software tools

Do you know for certain that Sony's multi-core tools are built better than MS's in this regard?
 
elementOfpower said:
Darkon--like someone said, those are mostly cinemas you've been looking at. And, I haven't been all that impressed with the pics I've seen of Heavenly Sword lately. Motor Storm isn't turning out to look anything like what Sony showed at E3 last year, either. I think Sony and their partners have been deceiving people a little.

And? If those real-time demos were shown by EA, Bungie, or that SCE studio in London, I would understand being sceptical about them, but these 3 developers... come on mate. :???:
 
SPM said:
In game programming, which is time critical and tied into the video frame, you need to keep different threads synchronised with each other and with the video frame and ensure code is deterministic as far as possible, which can get quite complicated. The easiest way to do this is to have one master core acting as a controller handing off work to other slave cores - the way the Cell architecture is organised. This can be done in Xenon as well, but in Cell, this concept is easier due to the fact that the concept is built into the hardware and software tools, and more cores are available (so you are less likely to need to share a core between multiple processes).

Thanks! You pretty much confirmed everything I wrote :)

(Though I'm just a little nitpicker with quite a low post count :cool: )
 
The difference is that Xenon is not limited to a specific arrangement, whereas with Cell you have no real choice: you have 1 PPE and 7 SPEs. And you have the added difficulty of having to fit the main game loop on the PPE and then reworking your code to run on an asynchronous core that is quite different, where you have to fit much of your code and data into a very small space. Is that easier than having 3 identical cores? So far developers have not indicated yes.

I think DeanoC commented on this matter some time ago. And there are many more programming models for CELL, btw (also stated by DeanoC).

Basically it goes from a PPU-centric to an SPU-centric engine, AFAIR. It should also be in his blog. The different programming models for CELL are also described in multiple IBM docs, like using SPURS for task management on the SPUs themselves, without involving the PPU at all.
 
elementOfpower said:
I'm no technical guru, but in every article I've read so far "the challenge of programming for cell" has been discussed. It seems to be that for that to be mentioned there must be some glaring challenges. It's a brand new architecture. Multi-core processors, like are in the 360, have been out in the PC world for quite a while; the Cell hasn't.


You're right: traditional symmetric multi-core processor systems have been around for a long, long time. The industry, developers, and even the academic sector know a lot about these systems.

And do you know what they know? They know that it is DAMN hard work to get good performance out of them.

People will argue that Xenon is easier to write code for because it is a "better understood" design. But from where I am standing the only thing that is well understood is that it is damn hard to come even close to fully utilizing the power in all 3 cores.

Cell was not designed in a vacuum. It is not just a different design; it is an approach that directly tackles the issues inherent in traditional multi-core/multi-processor systems.

STI engineers who spent most of their careers working with traditional multi-processor architectures sat down and tried to come up with a solution to overcome the problems the existing systems have. The result is the Cell, designed from the ground up to get good performance out of a lot of cores.

Did they solve every problem? Of course not. But we have to give them credit for trying. How effective they actually were, history will tell in time.
 
Acert93 said:
Multi-core processors pose a lot of problems, but one area Xenon has an advantage in "ease of use" (i.e. easier) for some developers is that it is very similar to the PC model of multithreading at this point in time. That alone may make the Xenon approach easier to many developers as the model is one they are not only familiar with but it is the trend of a huge segment of the industry.

The PC model of SMP multi-threading on multi-core CPUs is familiar and easy to use. That is fine for desktop PCs and servers which multi-task and timeshare to run a number of independent applications simultaneously.

Unfortunately that is not what you want for games code - you don't want to multi-task between a number of independent games which do not interact with each other (which is what SMP does) - you want to write code where the processes on the different cores are tightly synchronised with each other. This is at least as difficult to do on the Xenon as on the Cell - more so if you consider that the Cell's programming model, hardware organisation, and programming tools are all orientated towards this type of close-coupled parallel processing.

The 7 to 8 cores on Cell make things easier than Xenon because they make it possible to dedicate some SPEs to specific tasks. You can then simply treat that SPE as a device which you send data to and get results from. With Xenon, if you use one core as a master core and the other two as slaves, you will probably have to timeshare/multi-task processes on these two cores. You then have to start to think about task-switching latency issues and how quickly the sleeping processes will respond to input - should the input be interrupt driven to ensure prompt response? Do you need to rewrite sections of code from other processes to ensure that they relinquish control quickly enough? It is much more difficult to write deterministic time-critical code where you timeshare/multi-task on a processor, because you not only need to get the timing of the time-critical code right, but also of every other possible combination of code that can run on the same processor, in order to prevent those hogging too much time. As far as I know, most realtime OSes go to the extent of forgoing multi-tasking capability completely just to ensure sufficiently fast realtime response, because of the difficulty of programming multi-tasking OSes to respond quickly enough.
 
SPM said:
Unfortunately that is not what you want for games code - you don't want to multi-task between a number of independent games which do not interact with each other (which is what SMP does) - you want to write code where the processes on the different cores are tightly synchronised with each other. This is at least as difficult to do on the Xenon as on the Cell - more so if you consider that the Cell's programming model, hardware organisation, and programming tools are all orientated towards this type of close-coupled parallel processing.

The 7 to 8 cores on Cell make things easier than Xenon because they make it possible to dedicate some SPEs to specific tasks. You can then simply treat that SPE as a device which you send data to and get results from. With Xenon, if you use one core as a master core and the other two as slaves, you will probably have to timeshare/multi-task processes on these two cores. You then have to start to think about task-switching latency issues and how quickly the sleeping processes will respond to input - should the input be interrupt driven to ensure prompt response? Do you need to rewrite sections of code from other processes to ensure that they relinquish control quickly enough? It is much more difficult to write deterministic time-critical code where you timeshare/multi-task on a processor, because you not only need to get the timing of the time-critical code right, but also of every other possible combination of code that can run on the same processor, in order to prevent those hogging too much time. As far as I know, most realtime OSes go to the extent of forgoing multi-tasking capability completely just to ensure sufficiently fast realtime response, because of the difficulty of programming multi-tasking OSes to respond quickly enough.

Thanks again :) Very interesting details; I already expected it to be like that...
 
SPM said:
you want to write code where the processes on the different cores are tightly synchronised with each other.

Actually that's exactly the opposite of what you want to do to make code that scales well with more SPUs (and the same applies to SMP and Xenon).

The more your different threads are tightly synchronized, the more opportunities you have for races, deadlocks, lock contention, and so on, and the harder it will be to make your code perform well.

For optimal SPU usage, you want small independent work items that don't depend on anything else and don't need to be synchronized with anything.
 
First up, PS3 is GCC, not Metrowerks compilers; Xenon is Visual Studio. I'd take a random guess and say Metrowerks might be available on Wii.

PS3 threading IS harder than on an SMP architecture like Xenon, and only the most hardcore ****** would claim otherwise.

Both architectures have atomic instructions, so that's not an issue.

The extra difficulty comes from juggling N separate memory pools (particularly given most of those memory pools are pretty damn small by modern standards).
But that's also where the advantage comes from: if you can factor your algorithm into something working with small discrete pools, it flies.

I'll give you a practical example of the type of issues you have to work with.

I'm working on army logic at the moment, and due to the sheer number of calculations I decided to re-tackle it as a pure SPU problem. A tiny bit of PPU code sets some parameters up and then kicks it off into our SPU task system.

The first SPU task is single threaded, doing some sorting and spatial organisation. This is so it can split the work into lots of discrete chunks (around 100); these are then sent to the SPUs to be processed. Each chunk task is designed to run independently in its own RAM, with its results DMA'd back to main RAM.

So I have 1000s of independent read/writes to local RAM, 100+ DMAs to/from main RAM, 100s of read/writes into an external shared task system, external shared linear allocators, etc. All while trying to minimise atomic operations (I have, at the mo, 100 in the spatial task and 0 in each sub-task...)

On an SMP I'd probably have 3 read/writes to main RAM, with 3 read/write compare-and-sets for synchronisation, and perhaps a few cache prefetches to try and maximise cache hits.

Of course the complexity lets me sustain amazing performance, but it would be stupid to claim it's easy...

The thing about real SPU work is that the good and great will eat this stuff up and make amazing things, but lots of (most?) games programmers these days just aren't set up to worry about this kind of stuff. A good example is our entity state: the main game has KBs worth of state, pointers, object data, etc. per enemy. To do the army logic, I have 16 bytes to try to get a similar level of AI... Certainly makes life interesting :)
 
DeanoC said:
On an SMP I'd probably have 3 read/writes to main RAM, with 3 read/write compare-and-sets for synchronisation, and perhaps a few cache prefetches to try and maximise cache hits.

Deano, I assume you meant per-soldier in the above statement ? So for 1,000 soldiers, we would multiply the number of (logical) reads by 1000. Correct ?

Essentially, the extra complexity comes from:
1. Fitting data and program into 256K local memory
2. Building the scaffolding for parallel execution manually (e.g., NT adopted spatial partitioning to parallelize the army AI --- Yeah! Thank you so much for the extra effort :p ).
3. Ensuring the scaffold overhead does not take up too much time and eat up all the performance gain (This usually means that the problem size has to be large enough).

It seems that 1. will continue to be a problem, while 2. and 3. may be addressed increasingly, albeit partially, by better and a greater variety of libraries/tools over time?

I always wonder about the relative gain once you code to the SPUs' strengths: how big is the gain developers are expecting (from faster execution, faster local memory, and more of them)? e.g., For your army problem, is it on the order of 2-3 times per SPU, assuming linear scaling, when you average them out? (Shucks, these are probably under NDA) :)

On a typical SMP model, the "parallel scaffold" would be handled automatically by the pre-emptive threading kernel. The (minor) catch is you have less to play with to optimize the gain further.
 
DeanoC said:
First up, PS3 is GCC, not Metrowerks compilers; Xenon is Visual Studio. I'd take a random guess and say Metrowerks might be available on Wii.

PS3 threading IS harder than on an SMP architecture like Xenon, and only the most hardcore ****** would claim otherwise.

how about 'PS3's is a different threading model, less universal, but for a good reason too'?
see, difficulty depends on what your ultimate goals are. getting something up and running fast is one potential goal, ergo context of difficulties; getting numerous threads to meet tight timings is another goal and context for difficulties. which of those is more important, of course, is a matter of occasion, but if SMP was so damn easy to achieve _every_ possible viable goal in concurrent designs, then why the heck all the efforts astray from SMP? you know, it makes no sense. at all.

Of course the complexity lets me sustain amazing performance, but it would be stupid to claim it's easy...

so if your ultimate goal is to 'sustain amazing performance' (out of the theoretical performance of the transistors at hand) then actually cell's design makes it easier for you, versus, say, SMP? or did i read you wrong, and you're saying that it would've been easier if you had 8 SPU-worth SMPed cores over UMA?

see, we had this argument with aaaaa00 some time ago - stemming from the same roots your original comments originate from - the equating of 'versatility' to 'ease'. which is something rather mistaken, as versatility is something rather univocal ('A can do equally well more things than B can do equally well'), whereas ease totally changes with actual goals (e.g. 'get 6 threads to run close to their theoretical performance').
 
I think it's a case of: to get a minimum working on SPE, you have to invest a lot more time than getting the same task running on a conventional processor. But when achieved, the headroom is much greater, so you can do more. i.e. The effort required to get 10 AI soldiers on an SPE is probably the same as getting 1000, whereas on XeCPU, for example, a far simpler method would enable 100 soldiers and then you'd be hardware limited.

I don't think writing SPE code should be confused with the original point that XeCPU lacks many of the multicore difficulties of Cell though. The same difficulties inherent in multicore apply to both. Cell just has an added headache of writing for a weird CPU ;)
 
Shifty Geezer said:
Cell just has an added headache of writing for a weird CPU ;)

Plus additional hardware support, which makes it a little easier and faster again, especially in regard to interprocess/thread communication.
 
patsu said:
Deano, I assume you meant per-soldier in the above statement ? So for 1,000 soldiers, we would multiply the number of (logical) reads by 1000. Correct ?
Sorry, bad wording by me. I meant that over the normal main RAM read/writes, the 3 I was talking about were just to split my pool of soldiers into, say, 3 parallel chunks. In reality there would also be the thousands of logic read/writes that would go to main RAM (though hopefully many would be caught by cache). The nice thing about SPU RAM is that you know it's fast, full stop... whereas main-RAM cache architectures are much more fuzzy about when they are fast or not.
 
Thanks Deano for the clarification.

darkblu, it's not a comparison between different parallel computation models. It's a case of:
+ Handling the extra complexity like what Deano and Shifty mentioned (e.g., 256K local memory under any computation models)
+ Cost-Benefit-Analysis. If I can achieve certain amount of gain (to hit the target frame rate or effects) by putting in 3 days of work... it may be good enough.

As for specialized communication primitives, Cell needs them more than Xenon because it's based on a distributed computing model, while the latter is based on a shared-memory model. You should be able to simulate similar sharing, queueing, or concurrency-control effects using other primitives (e.g., semaphores, memory locks, shared buffers, ...).
 
Alstrong said:
One "could" take it to be the case that Sony is cherry-picking what they're showing to the public. ;)
No shite! No need to wink! Everyone knows about the Ninjas...

Rumor is that if a dev even THINKS about leaking something, a Ninja shows up with a copy of the NDA and a stern look in his eyes.
 
aaaaa00 said:
The more your different threads are tightly synchronized, the more opportunities you have for races, deadlocks, lock contention, and so on, and the harder it will be to make your code perform well.

For optimal SPU usage, you want small independent work items that don't depend on anything else and don't need to be synchronized with anything.

Actually this is what I meant - the processing tightly synchronised to the video frame (by interrupts), but with no or minimal locks or semaphores. As opposed to running a mass of asynchronous preemptively multi-tasked threads allocated to processor cores by the SMP scheduler, which then require programmatic synchronisation between threads and cause the sort of races, deadlocks, and lock contention you mention.

Also, I think it is better to loop through a process 1000 times on one or more specifically assigned SPE cores than to spawn 1000 preemptively multi-tasked threads and have them distributed among 3 SMP cores, both for efficiency (no context switches) and for predictability in timing.

Also rather than preemptively multi-task to get say 3 different processes to run on one core, I would run each of the processes on the same core sequentially. This avoids context switching, and makes the timing more predictable, making it less likely that you would have to make one process wait for another.

Preemptive multi-tasking with SMP scheduling, PC style, is easy, but it is no good for real-time programming. It shines only for multiple asynchronous processes like desktop or server applications.

The same performance benefits (and programming effort) of not using the preemptively multi-tasked SMP approach will apply to Xenon also. There is no reason why this approach rather than SMP should not be used on the Xenon to improve performance.
 
patsu said:
it's not a comparison between different parallel computation models. It's a case of:
+ Handling the extra complexity like what Deano and Shifty mentioned (e.g., 256K local memory under any computation models)

what if that extra complexity there is buying you something elsewhere? maybe it spares you an even higher complexity needed to attain similar performance from a similar number of threads in an SMP environment?

+ Cost-Benefit-Analysis. If I can achieve certain amount of gain (to hit the target frame rate or effects) by putting in 3 days of work... it may be good enough.

definitely. if you can get your job done at first try and call it a day. unfortunately there's this general principle that more complex results are achieved through more complex means. so if somebody wants to have technically-bleeding-edge concurrent stuff in their game, they can a) either bite the cell bullet, or b) scale down their goals to fit on simpler-but-less-resourceful SMP architectures.

As for specialized communication primitives, Cell needs them more than Xenon because it's based on a distributed computing model while the latter is based off a shared memory model. You should be able to simulate similar sharing, queueing or concurrency control effects using other primitives (e.g., semaphores, memory locks, shared buffer, ...).

if that was directed at me - yep, surely. SMP is definitely the most versatile concurrent architecture there is. it just has issues at the high-RPM end : )
 