Ok. I wanted to post this long note in another thread but decided to do it here instead.
Now that there is some evidence...
It seems to me that it is not difficult to find some uses for all the SPEs, since there are enough problems to go around (all sorts of physics, AI, I/O pre- and post-processing, procedural stuff, ...). But eventually it will get harder and harder to parallelize the code base unless the entire game engine is designed from the ground up for it (actually, I am not sure whether the main loop needs to be parallelized at all, probably not?).
This is one of the reasons I felt we would be able to see some differences even in early games. Not to mention every game will spend its resources differently, so it's natural that the outcome is unique. Heavenly Sword = where people have hair (1 SPU full of hair). Then towards the end of the console lifecycle, only the best can show even more amazing results.
When I did parallel programming many, many years ago, a given problem always had some gotchas (i.e., not embarrassingly parallelizable, needing to share global state). However, these are usually not show stoppers. It just means you don't get a linear speed-up. In rare cases you get a slowdown, in which case you just serialize the code (with some new twist). At that time, the difficult part of parallel programming was mostly its newness/uncertainty: sometimes it's hard to know the real-life performance (say, beyond 64 processors). Then you have to go back to the drawing board (or enhance the existing code further). I think some of the PS3 devs have gone through this stage already.
There are a few basic models, starting with the simplest (rough code sketches for each one follow after the list):
+ Serialize the code on 1 SPU, with each SPU handling different or even independent things (statically). I remember one of the NT devs mentioned that an SPU can run faster than the PPU if done correctly. I believe the speed-up comes from SIMD (if used), the NUMA-like benefit of the "high-speed" local memory, async request-response, and its simplicity. What happens if the data structure is larger than 256K? It may not matter in some situations. Even if it does, you can usually tweak the code and prefetch to hide the latency (parallel programming often involves this kind of fine-tuning). In the same optimizing exercise, you can throw in branch hints and OOE stuff. I believe Xenon programmers will need to go through similar exercises, but they don't have to handle NUMA (and hence don't get its benefits, though maybe they can play with the L2 cache). Other common tricks, if you have enough free processors and memory, include pre-calculation, approximation (the longer it runs, the more accurate it gets), reorganizing the code base to collapse multiple/different "passes" into one, ...
+ Next up in difficulty is data partitioning (only suitable for some problems). In this case, we allocate more than one SPU to tackle different segments of a shared memory space. This is highly dependent on the data structure and is vulnerable to skewed run-time behaviour (e.g., many polygons/activities in one area only). Sometimes boundary results are estimated/recalculated to avoid communication overhead. Since there are many other problems to solve, a developer may choose to use the second SPU on another problem rather than making two work together on a fixed problem for twice the speed-up.
+ Third would be dynamic allocation of tasks (like a job queue), where each SPU gets its next task from a job manager/queue. I think NT is using this approach because at one point Dean was writing a simple kernel for the SPUs (probably to load different SPUlets depending on the type of work). The advantage is that each SPU can use a different strategy for each task. Your latest comment would click with this idea: the SPUs are preloaded with a default task, but they are not bound to it, because an SPU can get additional work orders from someone (the PPU?) to do something else.
+ Fourth would be a highly specialized setup (e.g., stream processing). On some occasions, if the problem is big enough, we dedicate an esoteric setup to flow the data through the processor grid. This is probably the "signature model" for Cell, but I think it's rare in gaming. The terrain demo is one such example.
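To make the first model concrete, here is a minimal sketch of the double-buffered prefetch idea for data bigger than the 256K local store. I'm writing it against the Cell SDK's spu_mfcio.h intrinsics (mfc_get plus a tag-mask wait) as I remember them; CHUNK, process_chunk and stream_all are made-up names, and it assumes the total size is a non-zero multiple of the chunk size.

[code]
/* Double-buffered streaming on one SPU: fetch chunk i+1 while
   processing chunk i, so the 256K local store never holds the
   whole structure and the DMA latency stays hidden. */
#include <spu_mfcio.h>
#include <stdint.h>

#define CHUNK 16384   /* bytes per DMA transfer (illustrative) */

static char buf[2][CHUNK] __attribute__((aligned(128)));

extern void process_chunk(char *data, int size);  /* your per-chunk work (hypothetical) */

void stream_all(uint64_t ea, uint32_t total)      /* assumes total is a multiple of CHUNK */
{
    int cur = 0;
    uint32_t off = 0;

    /* Kick off the first fetch. */
    mfc_get(buf[cur], ea, CHUNK, cur, 0, 0);

    while (off < total) {
        int next = cur ^ 1;
        uint32_t next_off = off + CHUNK;

        /* Prefetch the next chunk on the other tag while we wait/work. */
        if (next_off < total)
            mfc_get(buf[next], ea + next_off, CHUNK, next, 0, 0);

        /* Wait only for the current buffer's tag, then do the work. */
        mfc_write_tag_mask(1 << cur);
        mfc_read_tag_status_all();
        process_chunk(buf[cur], CHUNK);

        cur = next;
        off = next_off;
    }
}
[/code]

The same trick extends to writing results back out on a third tag, and this is also the spot where the branch hints and scheduling tweaks I mentioned would go.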
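For the second model, the PPU-side split is basically just carving index ranges; the hard part, as noted, is that equal-sized slices don't mean equal work when the scene is skewed. A rough sketch, with every name invented for illustration (a real setup would hand each SPU its slice through whatever argument mechanism you use to start the SPE program):

[code]
/* Static data partitioning: carve one big array into N contiguous
   slices and hand each SPU its own range. */
#include <stdint.h>

typedef struct {
    uint64_t ea_base;   /* effective address of the shared array      */
    uint32_t first;     /* index of the first element this SPU owns   */
    uint32_t count;     /* how many elements it owns                  */
} work_slice_t;

void partition(uint64_t ea_base, uint32_t n_elems, int n_spus, work_slice_t out[])
{
    uint32_t per_spu  = n_elems / n_spus;
    uint32_t leftover = n_elems % n_spus;
    uint32_t next = 0;

    for (int i = 0; i < n_spus; i++) {
        out[i].ea_base = ea_base;
        out[i].first   = next;
        /* Spread the remainder over the first few SPUs so the split stays even. */
        out[i].count   = per_spu + (i < (int)leftover ? 1 : 0);
        next += out[i].count;
    }
}
[/code]

In practice you'd probably also round the slice boundaries to DMA-friendly alignments and decide up front how to treat elements that straddle a boundary, which is exactly the estimate/recalculate trade-off above.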
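The third model is easiest to see as a shared counter that every worker bumps atomically to claim the next job. On an SPU that claim would go through the MFC's atomic update facilities rather than a language-level atomic; the sketch below just uses C11 atomics to show the shape, and job_table/run_job are made-up names:

[code]
/* Job-queue model: a shared counter is the queue head; every worker
   atomically grabs the next index and runs that job. */
#include <stdatomic.h>
#include <stdint.h>

typedef struct {
    uint32_t type;      /* which kind of work: physics, audio, ...    */
    uint64_t data_ea;   /* where the job's input lives                */
} job_t;

extern job_t job_table[];
extern uint32_t job_count;
extern void run_job(const job_t *j);   /* dispatch on j->type (hypothetical) */

static atomic_uint next_job;           /* the shared queue head */

void worker_loop(void)
{
    for (;;) {
        /* Claim the next job index; whoever increments first owns it. */
        uint32_t i = atomic_fetch_add(&next_job, 1);
        if (i >= job_count)
            break;                     /* queue drained, go idle */
        run_job(&job_table[i]);
    }
}
[/code]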
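And the fourth model, stream processing, is essentially each SPU running one fixed transform forever and handing the packet to the next SPU in the chain. The transport calls below (recv_upstream/send_downstream) are placeholders for whatever you would actually use (LS-to-LS DMA, mailboxes, ...); only the shape of the loop matters here:

[code]
/* Stream/pipeline model: each SPU runs one fixed stage and passes its
   output straight to the next SPU, so data flows across the chip
   instead of each SPU doing everything. */
#include <stddef.h>

typedef void (*stage_fn)(void *packet, size_t size);

extern void recv_upstream(void *packet, size_t size);        /* hypothetical transport */
extern void send_downstream(const void *packet, size_t size);

void pipeline_stage(stage_fn transform, size_t packet_size)  /* packet_size <= 16384 */
{
    static char packet[16384];   /* one in-flight packet in local store */

    for (;;) {
        recv_upstream(packet, packet_size);     /* wait for the stage before us   */
        transform(packet, packet_size);         /* this SPU's fixed piece of work */
        send_downstream(packet, packet_size);   /* hand it to the next SPU        */
    }
}
[/code]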
For Xenon, it's not much different from the above, but you don't have to bother with writing an SPU kernel (for the third approach) or scheduling DMA requests. You also don't have to worry about processor boundaries and the tool chain (as John Carmack mentioned), so everything sits cleanly in one logical address space. But you only get 3 "PPUs" (vs. 7 SPUs + 1 PPU).
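For comparison, the same job-queue idea on a symmetric shared-memory design like Xenon's collapses to ordinary threads in one address space: no SPU kernel to load, no DMA to schedule, the workers just read the data in place. Plain POSIX threads here purely for illustration, reusing the made-up names from the queue sketch above:

[code]
#include <pthread.h>
#include <stdatomic.h>
#include <stdint.h>

#define N_WORKERS 6   /* e.g. 3 cores x 2 hardware threads */

typedef struct { uint32_t type; uint64_t data_ea; } job_t;   /* same made-up job as above */

extern job_t job_table[];
extern uint32_t job_count;
extern void run_job(const job_t *j);   /* hypothetical dispatch, as above */

static atomic_uint next_job;

static void *worker(void *arg)
{
    (void)arg;
    uint32_t i;
    /* Same claim-and-run loop as the SPU version, but the job's data is
       simply read in place: no local-store copy, no DMA tags. */
    while ((i = atomic_fetch_add(&next_job, 1)) < job_count)
        run_job(&job_table[i]);
    return NULL;
}

void run_all_jobs(void)
{
    pthread_t t[N_WORKERS];
    for (int k = 0; k < N_WORKERS; k++)
        pthread_create(&t[k], NULL, worker, NULL);
    for (int k = 0; k < N_WORKERS; k++)
        pthread_join(t[k], NULL);
}
[/code]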
I'm probably over-simplifying, but this is the framework I use to understand and evaluate Cell programming. Any devs, feel free to tell me where I go wrong. It's just a discussion.
EDIT: I forgot to highlight Xenon's key advantage. Because Xenon's 3 cores are identical, they are seamless: work can be shared and reallocated freely, cheaply and dynamically. This is a deciding advantage over PS3's SPU+PPU arrangement. Because of the added SPUlet scheduling/allocation overhead, it can be more difficult to parallelize a game engine on PS3 to achieve the same speed-up.