A glimpse inside the CELL processor

deathkiller said:
They hide latency/prevent stalling by using more of the shared cache, so I think they shouldn't be ignored when counting the cache.

Shared cache also has lower size efficiency because of aliasing.

In this case, the cores/hardware threads in question would be sharing/hitting on the locked (part of the) cache. I assume the remaining cache continues to serve its purpose, although the now-smaller cache will have a lower hit rate. Isn't this correct?

I do agree that since Cell is built for an LS kind of environment, it should perform better... like the "fast LS vs cache management overhead" issue below:
Shifty Geezer said:
L2 cache is slower than the LS, I think on the order of 50% slower (6 cycles versus the LS's 4 cycles).

and

Jesus2006 said:
But still, how does the memory management work? What does a locked cache do? When you write data to it, will it write back to main memory immediately, or do you have control over this - when to write back and when to read from it? Because every write-back will make performance drop significantly, since an SPE works on its own LS until the task is finished and then DMAs the results back under user control. I thought that locked cache only guarantees an amount of cache being reserved for a specific thread, but nothing else (like addressability etc.).

Instead of zooming in on specific hardware advantages, can we layer the analysis based on some sort of context (model or application framework) to get a coherent picture?

Here's me thinking out loud [Remember: I am not a game programmer]:

Assuming the application/game requirements are known beforehand ( ;-) ) ...

(1) A (near) real-time, event-driven run-time forms the foundation. This is where basic things like the game loop, user controls, storage/streaming, memory management, network code, animation, rendering, audio, AI and physics are master-planned, laid out and tied down.

(2) Key resources (e.g., cores, memory) are budgeted for each focus area (e.g., visuals, animation, network) based on experience and requirements. In the interest of time, some may be folded into the PPE until devs have time to spin them off to an SPE. This also determines the planned worst-case scenarios for individual areas (e.g., how small a time slice we are talking about, how small or big the data structures can be).

(3) Within each focus area, specific techniques are prototyped/developed.

(4) Things are put together and run end-to-end.

(5) More variety of stuff (e.g., weapons) is added.

(6) More optimization follows to hit specific performance targets.


Now, one way to do this on Cell seems to be:

(A) The cooperative game loop and user controls go to the PPE to form the basic skeleton. The basic math library goes to the VMX unit on the PPE. Esoteric math stuff goes into domain-specific SPEs.

(B) Supporting storage and network code go into 1 SPE (although these are rather slow and we may not need compression that much -- thanks to Blu-ray and the HDD -- I assign the work to a separate SPE to free up the PPE in case other tasks below get thrown back to the PPE because of the 256KB LS limit).

(C) All sorts of AI and world simulation assistance = 1 SPE

(D) Rendering assistance (?) = 1 SPE

(E) Audio assistance = 1 SPE

We still have 2 SPEs to spare for touch-ups. We can also arrange it so that the 4 SPEs above share some of their workload using a job queue model (sketched below). Each of these cores would run at almost full speed.
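
Just to make the job-queue idea concrete, here's roughly what I have in mind (again, I'm not a game programmer - every name here is made up for illustration, and a real Cell version would claim jobs with atomic DMA rather than a compiler builtin):

Code:
/* Hypothetical job-queue sketch in plain C -- not real Cell SDK code.
 * On Cell, each SPE would DMA the job descriptor into its LS and claim
 * the slot with atomic DMA (getllar/putllc) instead of a builtin. */
#include <stdint.h>

typedef struct {
    uint32_t type;       /* e.g. JOB_AI, JOB_AUDIO, JOB_PHYSICS */
    uint64_t input_ea;   /* effective address of the input data */
    uint64_t output_ea;  /* where the results go */
    uint32_t size;       /* payload size, kept small enough for a 256KB LS */
} job_t;

typedef struct {
    uint32_t next;       /* index of the next unclaimed job */
    uint32_t count;      /* total jobs queued this frame */
    job_t    jobs[256];
} job_queue_t;

/* Each worker (SPE or spare core) loops, claiming the next job atomically. */
void worker_loop(job_queue_t *q)
{
    for (;;) {
        uint32_t i = __sync_fetch_and_add(&q->next, 1);
        if (i >= q->count)
            break;                       /* nothing left this frame */
        /* On an SPE: DMA jobs[i].input_ea into the LS, process it there,
         * then DMA the result back out to jobs[i].output_ea. */
    }
}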

Now let's take a step back and look at the overhead. Say putting 1 SPE to work takes up an additional X% of memory, Y% of SPE performance, Z% of PPE performance, and W more man-months of development (because we need a pass to reorganize memory for the LS, or the data structures have to be SPE-friendly to begin with).

For such a 4-SPE arrangement, we would take up 4 * Z% of PPE cycles, 4 * X% of main memory and 4 * W man-months. The worst thing is that these overheads may hit the PPE simultaneously if not managed carefully (e.g., all 4 SPEs need the PPE to pack memory at the same time to meet near-real-time requirements). For the parts that cannot make it, we may need to scale back the problem size (e.g., use simpler AI), or reassign the work back to the PPE to avoid the 256KB LS overhead.

Now _if_ Cell were a fictitious SMP system, these advanced features would be possible, since the same data structure could be used without restrictions. But we would run ~50% slower, borrowing Xenon's shared-cache performance numbers and assuming a 100% cache hit rate.

What kind of conclusion can we draw from this simplified example? :D
 
patsu said:
For such a 4-SPE arrangement, we would take up 4 * Z% of PPE cycles, 4 * X% of main memory and 4 * W man-months. The worst thing is that these overheads may hit the PPE simultaneously if not managed carefully (e.g., all 4 SPEs need the PPE to pack memory at the same time to meet near-real-time requirements).

Why should the PPE be involved in the SPEs' business at all? :) The SPE program can do everything it wants, without the PPE.

Now _if_ Cell were a fictitious SMP system, these advanced features would be possible, since the same data structure could be used without restrictions. But we would run ~50% slower, borrowing Xenon's shared-cache performance numbers and assuming a 100% cache hit rate.

What kind of conclusion can we draw from this simplified example? :D

I have absolutely no idea :)
 
I was sure it should be much larger, but that's the only figure I could dig up. Must have found the wrong one :oops:. Does anyone have figures for L1 cache efficiency? If the L1 is fetching data effectively, and that data is in L2 as micromanaged, what's the likely delay for feeding the registers?

Still, you can see how going all out for fast local memory can really help, and that was the backbone to the Cell design. Emulating the architecture on other chips probably won't have anything like the benefit.
 
Shifty Geezer said:
Still, you can see how going all out for fast local memory can really help, and that was the backbone to the Cell design. Emulating the architecture on other chips probably won't have anything like the benefit.

Yes, and if it worked out, I guess MS would suggest this as the primary programming model for their CPU. Currently, though, it's more traditional SMP there, with each core doing its own stuff, without much synchronization (compared to Cell).
 
Jesus2006 said:
Yes, and if it worked out, I guess MS would suggest this as the primary programming model for their CPU.
Not necessarily. If as a software developer you can obtain the results you want with a simpler (cheaper) method, would you not go that way instead? The manually managed memory model of Cell is great for efficiency, at a cost in developer complexity. If XeCPU is capable of behaving in a similar way and getting performance advantages from it, developers might well choose not to develop for it that way, but instead stick to traditional SMP approaches that make their lives a lot easier, even if it means missing some of the CPU's potential. Kinda like the load of PS2 games that didn't use the VU and missed some of the EE's potential: they decided the effort required wasn't worth the returns.

In Cell's case, devs have to work that way, so they don't have a choice. Yet. Libraries for software caching behaviour may appear, and if that happens you might see devs trading away performance in their apps in order to make their lives easier. Something to keep an eye out for: whether devs, given the choice, pick power over simplicity.
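
Roughly what I'd expect such a software-cache library to look like on an SPE - a toy direct-mapped sketch, where dma_get() is just a stand-in for the SPE's DMA read, not a real API:

Code:
/* Toy direct-mapped software cache living in the LS.  dma_get() is a
 * stand-in for the SPE's DMA read; tags start at 0, so this sketch
 * ignores the corner case of address 0. */
#include <stdint.h>

#define LINE_SIZE 128                        /* bytes per line */
#define NUM_LINES 64                         /* 8KB of LS spent on the cache */

static uint64_t tags[NUM_LINES];             /* main-memory address of each line */
static uint8_t  lines[NUM_LINES][LINE_SIZE];

extern void dma_get(void *ls, uint64_t ea, uint32_t size);   /* stand-in */

void *sw_cache_lookup(uint64_t ea)
{
    uint64_t line_ea = ea & ~(uint64_t)(LINE_SIZE - 1);
    uint32_t slot    = (uint32_t)(line_ea / LINE_SIZE) % NUM_LINES;

    if (tags[slot] != line_ea) {             /* miss: pull the line from main memory */
        dma_get(lines[slot], line_ea, LINE_SIZE);
        tags[slot] = line_ea;
    }
    return &lines[slot][ea - line_ea];       /* hit, or freshly filled */
}

The convenience is obvious, but so is the cost: every access pays for the tag check, which is exactly the performance you'd be trading away.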
 
Shifty Geezer said:
In Cell's case, devs have to work that way, so they don't have a choice. Yet. Libraries for software caching behaviour may appear, and if that happens you might see devs trading away performance in their apps in order to make their lives easier. Something to keep an eye out for: whether devs, given the choice, pick power over simplicity.


This is from the Alex Chow interview:

http://www-128.ibm.com/developerworks/library/pa-expert8/

Chow: There is no de facto programming model. We sort of experimented on different flavors of programming models. Each type of workload may use a certain combination of models. We try not to force the programmer into one specific one. The architecture supports all different kinds of programming models. A programmer can decide one over the other considering development efficiencies and performance.
 
Jesus2006 said:
Why should the PPE be involved in the SPEs' business at all? :) The SPE program can do everything it wants, without the PPE.

In my example, I was assuming that the SPEs have trouble packing their stuff into 256KB and hence need the PPE to "sort things out" for them (from time to time). In general, if all things go smoothly, the SPEs should not need the PPE to help out during normal execution.

I have absolutely no idea :)

Personally, my NUMA Cell example only "enjoys" a 50% improvement (due to the assumption that the LS is 50% faster than L2 cache) compared to SMP Cell. It doesn't really give me a conclusive/decisive answer on whether the "Cell way" is worthwhile. Given a particular problem, a dev may be able to frame it differently, or sacrifice less noticeable things, to achieve a comparable result with much less development cost.

With the new data, I'm kinda speechless. :) With about a 10x advantage in memory access (4 vs ~40 cycles), I might just optimize for Cell. My reason being: once I master it, there's a lot more mileage I can tap into to "simplify" my work later.

This is all of course just my judgement call based on simplified data.
 
patsu said:
In my example, I was assuming that the SPEs have trouble packing their stuff into 256KB and hence need the PPE to "sort things out" for them (from time to time).
That is not necessary. Each SPU has a DMA unit attached to it which can transfer data to and from the local store completely independently of the PPE.
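
For illustration, the basic pattern on the SPE side looks something like this - a minimal sketch using the mfc_* intrinsics from the SDK's spu_mfcio.h as far as I remember them, so take the details with a grain of salt:

Code:
/* SPE-side sketch: pull a chunk in, work on it in the LS, push the
 * results back out.  No PPE involvement anywhere. */
#include <spu_mfcio.h>

#define CHUNK 16384                            /* 16KB, comfortably inside the 256KB LS */

static char buf[CHUNK] __attribute__((aligned(128)));

void process_chunk(unsigned long long in_ea, unsigned long long out_ea)
{
    const unsigned int tag = 1;

    mfc_get(buf, in_ea, CHUNK, tag, 0, 0);     /* DMA: main memory -> LS */
    mfc_write_tag_mask(1 << tag);
    mfc_read_tag_status_all();                 /* wait for the transfer */

    /* ... do the actual work on buf, entirely out of the LS ... */

    mfc_put(buf, out_ea, CHUNK, tag, 0, 0);    /* DMA: LS -> main memory */
    mfc_write_tag_mask(1 << tag);
    mfc_read_tag_status_all();
}

In practice you'd double-buffer (kick off the next mfc_get while still working on the current chunk) so the DMA latency is hidden - still no PPE in the loop.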

Personally, my NUMA Cell example only "enjoys" a 50% improvement (due to the assumption that the LS is 50% faster than L2 cache)
50% faster latency-wise perhaps, but the same isn't necessarily true of bandwidth. L2 in Xenon only runs at half the CPU clock, and has a 256-bit bus attached to it AFAIK. That's not really enough to feed three concurrent 3.2GHz CPU cores. Fortunately, there's an L1 cache also... I'm not sure of the speed/bandwidth of the PPE L2, whether it follows the same pattern as Xenon, or if it runs at full speed.
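
Back-of-envelope, taking those figures at face value: a 256-bit bus is 32 bytes wide, so at half of 3.2GHz it moves about 32 B x 1.6GHz ~ 51 GB/s in total, while three 3.2GHz cores each streaming 128-bit (16-byte) vector loads could in principle ask for 3 x 16 B x 3.2GHz ~ 154 GB/s. So yes, the L1s have to make up the difference.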

compared to SMP Cell. It doesn't really give me a conclusive/decisive answer on whether the "Cell way" is worthwhile.
For tasks that Cell is ideal for, it's certainly worthwhile. The MRI scan results that ran something like 50 times faster than on a standard CPU, or video decoding, or all those other tasks we've seen benched that run anything from 5 to dozens of times faster.

I guess we'll have to see some real actual games running on PS3 before determining whether Cell is worthwhile for more general types of computing loads, but so far things are sure looking good to me. :) Warhawk really impressed the hell out of me, I have to say.
 
Guden Oden said:
That is not necessary. Each SPU has a DMA unit attached to it which can transfer data to and from the local store completely independently of the PPE.

Certainly. The assumptions are (i) the data structure in the main memory is optimized for SPE use already (to ensure consistent and high performance); and (ii) there is enough main memory to hold the reorganized data structure for all active SPEs (plus other "normal" data used by the PPE and RSX).

Guden Oden said:
50% faster latency-wise perhaps, but the same isn't necessarily true of bandwidth. L2 in Xenon only runs at half the CPU clock, and has a 256-bit bus attached to it AFAIK. That's not really enough to feed three concurrent 3.2GHz CPU cores. Fortunately, there's an L1 cache also... I'm not sure of the speed/bandwidth of the PPE L2, whether it follows the same pattern as Xenon, or if it runs at full speed.

Yes, I am reminded in this thread again of Cell's raw performance. :|

In one of my earlier posts, I asked Deano what kind of ballpark speed-up one can expect by transferring a problem from the PPE to SPE(s), since I knew the LS is faster (but I didn't know how fast compared to Xenon or Cell's PPE). More importantly, I also wasn't sure whether, and how much, Cell's inherent weaknesses affect SPE performance in a "live" run with a real dataset.

The only confirmed answer from Deano was that a dev can guarantee a memory write is consistently fast due to the more predictable run-time (the usual argument about no cache and cooperative multitasking). He did not highlight the LS latency/speed advantage at all (which surprised me a little).

An NDA probably precludes him from revealing/discussing more detailed performance specs. So after one big round, I'm back to square one. I hate Sony NDAs.

Guden Oden said:
For tasks that Cell is ideal for, it's certainly worthwhile. The MRI scan results that ran something like 50 times faster than on a standard CPU, or video decoding, or all those other tasks we've seen benched that run anything from 5 to dozens of times faster.

Right, but they probably also utilize the SPE SIMD engines there. There are 7-8 of them, so it's a no-brainer for that kind of problem (almost like bullying little kids :D ). How fast is an SPE DMA compared to a block read from main memory (on a typical CPU)?

Guden Oden said:
I guess we'll have to see some real actual games running on PS3 before determining whether Cell is worthwhile for more general types of computing loads, but so far things are sure looking good to me. :) Warhawk really impressed the hell out of me, I have to say.

Yes, they look good. I am more impressed by MotorStorm, if it plays like the trailers.
I was a wee bit disappointed by Warhawk because it didn't allow me to crash my plane into the water at E3 2006.

Actually my crazy side says they should do an underwater mission rather than ground missions.
 
Shifty Geezer said:
Not necessarily. If as a software developer you can obtain the results you want with a simpler (cheaper) method, would you not go that way instead? The manually managed memory model of Cell is great for efficiency, at a cost in developer complexity. If XeCPU is capable of behaving in a similar way and getting performance advantages from it, developers might well choose not to develop for it that way, but instead stick to traditional SMP approaches that make their lives a lot easier, even if it means missing some of the CPU's potential. Kinda like the load of PS2 games that didn't use the VU and missed some of the EE's potential: they decided the effort required wasn't worth the returns.

In Cell's case, devs have to work that way, so they don't have a choice. Yet. Libraries for software caching behaviour may appear, and if that happens you might see devs trading away performance in their apps in order to make their lives easier. Something to keep an eye out for: whether devs, given the choice, pick power over simplicity.

The only problem is that SMP or NUMA pre-emptive multi-tasking and scheduling (which is what you are talking about when you talk about easier development a la PC) is unsuited to coordinated parallel processing or any time-critical processing. It is easy only when the parallel processes can be allowed to run in an uncoordinated manner and do not have to deliver results by specific points in time. If you have to coordinate the processes (using locks, polling and semaphores) and ensure that certain results are produced by a certain time, then SMP systems become inefficient or slow to respond, and the extensive testing and solving of all the possible timing problems makes it a real headache to program - much more difficult than Cell, I would say.

Proof? Well SMP and NUMA have been around for a long time and run on all conventional desktop and server systems.

Linux also has a cluster SMP type of OS called OpenMosix which turns an entire networked cluster of desktop or server Linux machines into a massive SMP machine by migrating execution threads to other machines on the network the same way as an SMP machine migrates the processes to another CPU. This is easy to set up and works well. http://openmosix.sourceforge.net/#What

Linux/Unix is used in all but a few parallel-processing supercomputers, and the trend, in order to save on floor space, electricity and air-conditioning costs, is to use clusters of SMP or NUMA boxes. The current fastest supercomputer in the world, IBM's Blue Gene, uses clusters of 64-CPU Linux NUMA boxes.

The point is that SMP and NUMA machines are used in most production supercomputing clusters. SMP over the network (OpenMosix) is available for these HPC applications. However, the use of SMP pre-emptive multi-tasking scheduling as a means of distributing processes in HPC applications is essentially zero. SMP scheduling is completely unsuited to running massively parallel processes that need to be tightly coordinated.

SMP pre-emptive multi-tasking scheduling is also completely unsuited to time-critical programming, since the pre-emptive task switching, and the need for processes to stop and wait on locks for other processes as a means of synchronization, make timing extremely unpredictable.

For the above reasons, I would suggest SMP pre-emptive multi-tasking and scheduling is completely unsuitable for games. It doesn't mean Xenon is unsuitable for games, just that the way multiple cores are used on the PC is unsuitable. Instead, processes need to be explicitly managed and assigned to cores. Multi-core programming isn't really going to be easier on Xenon than on Cell.
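
To illustrate what "explicitly managed and assigned to cores" looks like in practice, here is how you would pin a worker thread on an ordinary SMP Linux box (illustration only - the 360's actual SDK call is different and I won't guess at it):

Code:
/* Pin a worker thread to one core so its timing isn't at the mercy of
 * the scheduler migrating it.  Linux/GNU extension, shown for illustration. */
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>

static void *worker(void *arg)
{
    /* ... this thread's fixed job: audio mixing, physics, whatever ... */
    return 0;
}

int spawn_pinned_worker(int core)
{
    pthread_t t;
    cpu_set_t set;

    CPU_ZERO(&set);
    CPU_SET(core, &set);

    if (pthread_create(&t, 0, worker, 0) != 0)
        return -1;
    return pthread_setaffinity_np(t, sizeof(set), &set);
}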
 
SPM said:
Linux also has a cluster SMP type of OS called OpenMosix which turns an entire networked cluster of desktop or server Linux machines into a massive SMP machine by migrating execution threads to other machines on the network the same way as an SMP machine migrates the processes to another CPU. This is easy to set up and works well. http://openmosix.sourceforge.net/#What

if you mean that this delegation of processes is done transparently - that still does not make such an architecture 'SMP'. i mean, it could be 'pseudo-symmetric' alright, but it's much more node-distributed by means of totally separate memory pools - just like with SPEs. in other words, it's not the transparent workload distribution that characterizes desktop-style SMP here (that btw is called 'pervasiveness' in beos - the notion that your actions may spawn arbitrary new threads and each and every single thread can land on an arbitrary core).

btw, we're sort of on the wrong foot in this thread here - by textbook definition cell deviates from SMP because of its cores' non-coherency, whereas we're actually discussing its merits vs SMP based on the local storage/interconnects advantages. i, personally, think that the ppe element in cell is just a convenience measure and as such is nothing exceptional; the major differentiating factor for cell is the distribute-ness of its SPEs and their local storage organisation.
 
Hmm... SPM, there are a few overlapping concepts in your last post, and they don't necessarily support what you're saying:

* NUMA machines vs SMP machines vs distributed cluster

* Pre-emptive vs cooperative multitasking/multiprocessing

* Supercomputing applications (e.g., weather simulation) vs real-time applications (e.g., "RADAR tracking")

It is best to separate them in your discussion.

Also Shifty's point does not conflict with any of the above.
e.g., I can have a machine that does all of the above (any combination you want!) but still choose to solve my problem using just 1 CPU sequentially... as long as it meets the users' needs and mine.
 
patsu said:
Certainly. The assumptions are (i) the data structure in the main memory is optimized for SPE use already (to ensure consistent and high performance); and (ii) there is enough main memory to hold the reorganized data structure for all active SPEs (plus other "normal" data used by the PPE and RSX).

I don't think it's a good idea to restructure your data at runtime to make it fit the SPEs. That's something you should think about before you start programming for CELL :)
 
darkblu said:
if you mean that this delegation of processes is done transparently - that still does not make such an architecture 'SMP'. i mean, it could be 'pseudo-symmetric' alright, but it's much more node-distributed by means of totally separate memory pools - just like with SPEs. in other words, it's not the transparent workload distribution that characterizes desktop-style SMP here (that btw is called 'pervasiveness' in beos - the notion that your actions may spawn arbitrary new threads and each and every single thread can land on an arbitrary core).

OpenMosix assigns processes rather than threads to a particular machine. Presumably any process using shared memory would be migrated to the same machine. Threads created by a process would share memory and would be located on the same processor. Linux processes also communicate using semaphores, message queues, and pipes, and treat hardware device drivers like files, so these can simply be streamed across the network to provide seamless IPC across the whole cluster. One thing about Linux is that a Linux process is almost as lightweight as a Linux thread and can share memory. Hence most Linux applications spawn child processes rather than threads to allow multi-tasking. On Linux, unlike Windows, processes rather than threads are the basic units of scheduling in any case, so as far as multi-threading is concerned, Linux processes are equivalent to Windows threads.

Note processes and threads in Linux and Windows differ:
http://www.lynuxworks.com/products/whitepapers/partition.php
http://www.semack.net/wiki/default.asp?db=SemackNetWiki&o=LinuxVsWindowsKernel

Although it is not quite as fine-grained as SMP on a single multi-core machine, because processes sharing memory must reside on the same machine, it is pretty fine-grained nevertheless, and transparent to the programmer - a process can run on memory on any machine in the cluster and deal with all communication with hardware and IPC with other processes as if it were on the same machine - the patched Linux kernel deals with everything, and OpenMosix should run any Linux application transparently.
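
For what it's worth, the mechanism underneath is clone(): a Linux "process" created with CLONE_VM shares its parent's address space, which is why the process/thread line is so blurry there. A minimal sketch (Linux-specific, error handling omitted):

Code:
/* A "process" spawned with clone(CLONE_VM) shares the parent's memory,
 * making it as lightweight as a thread for scheduling purposes. */
#define _GNU_SOURCE
#include <sched.h>
#include <signal.h>
#include <stdlib.h>

#define STACK_SIZE (64 * 1024)

static int shared_counter;            /* visible to both parent and child */

static int child_fn(void *arg)
{
    shared_counter++;                 /* writes the parent's memory directly */
    return 0;
}

int spawn_lightweight(void)
{
    char *stack = malloc(STACK_SIZE);
    if (!stack)
        return -1;
    /* Drop CLONE_VM and you get a fork()-like child with its own
     * copy-on-write address space instead. */
    return clone(child_fn, stack + STACK_SIZE,
                 CLONE_VM | CLONE_FS | CLONE_FILES | SIGCHLD, 0);
}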

btw, we're sort of on the wrong foot in this thread here - by textbook definition cell deviates from SMP because of its cores' non-coherency, whereas we're actually discussing its merits vs SMP based on the local storage/interconnects advantages. i, personally, think that the ppe element in cell is just a convenience measure and as such is nothing exceptional; the major differentiating factor for cell is the distribute-ness of its SPEs and their local storage organisation.

Yup. The SPEs aren't and can't be called SMP or NUMA, because there is no shared memory - the SPE is a completely isolated processor with local memory. A number of Cell chips connected together using the FlexIO interface (as in the Mercury and IBM blades) can be considered a multi-processor NUMA architecture, though.

Also, there isn't much point doing SMP-style preemptive multi-tasking with SPEs, because of the very heavy penalty of doing a context switch on an SPE due to its large local store.
 
For the above reasons, I would suggest SMP pre-emptive multi-tasking and scheduling is completely unsuitable for games. It doesn't mean Xenon is unsuitable for games, just that the way multiple cores are used on the PC is unsuitable. Instead, processes need to be explicitly managed and assigned to cores. Multi-core programming isn't really going to be easier on Xenon than on Cell.

I think it's unsuitable for everything that requires strict timing and reaction, unless you have such a fast CPU that you just don't need to care. :smile:

Anyway, this sounds like the way it's done right now. Assign tasks to the individual Xenon cores, let them run and do their individual jobs (like rendering, game loop, sound, compression, as described in the GDC slides by MS) - things that don't interfere with each other much, to avoid heavy synchronization...
 
patsu said:
Hmm... SPM, there are a few overlapping concepts in your last post, and they don't necessarily support what you're saying:

* NUMA machines vs SMP machines vs distributed cluster

* Pre-emptive vs cooperative multitasking/multiprocessing

* Supercomputing applications (e.g., weather simulation) vs real-time applications (e.g., "RADAR tracking")

It is best to separate them in your discussion.

Also Shifty's point does not conflict with any of the above.
e.g., I can have a machine that does all of the above (any combination you want!) but still choose to solve my problem using just 1 CPU sequentially... as long as it meets the users' needs and mine.

What I am trying to say is that for writing games code on Xenon or Cell, forget about the easy automatic scheduling of threads, like on a conventional desktop PC, to make use of multiple cores. That only works for processes that can run independently of each other and are not time-critical (i.e., do not have to complete or be synchronized with video frames). An example would be running a word processor, a spreadsheet and a browser at the same time. These run independently, don't need to complete in sync with a video frame, and it is no big deal if they slow down.
 
SPM said:
The only problem is that SMP or NUMA pre-emptive multi-tasking and scheduling (which is what you are talking about when you talk about easier development a la PC).
Actually I wasn't ;) I was talking SMP as in Symmetrical MultiProcessor, a vernacular shorthand for 'XeCPU and similar standard multicore processors'. Cell has a complexity disadvantage in writing Cell code and managing memory, but interacting with the rest of the system in a full application poses the same problems for all multicore processors.

I'm sure I wrote this already. Maybe I didn't but only thought about it. This thread is getting too messy for me! Lots of points and opinions, all seemingly dealing with different arguments, and I can't work out exactly what's trying to be proven here. I thought it was 'writing for Cell is easier/harder than writing for XeCPU', though that doesn't tie in with the OP. My take on that is Cell is harder, as you have to learn different coding models and an ISA compared to XeCPU, but in the realm of parallel processing, all multicore architectures have the same limits. Thus Cell is hard, but XeCPU isn't much easier. But I don't know if that was even the topic, or if so, when it became that topic! :???:
 
SMP and NUMA have NOTHING to do with the way threads and tasks are scheduled.

SMP simply tells you that all the processors in the machine are identical and that they can all see all of the system's memory, and NUMA tells you that the processors may not have access to all the memory in the system, or may not have access to it all at the same speed. An SMP machine can be NUMA (like Opteron systems) or it can be UMA (like Xeon systems).

SPM, what you're calling "SMP multitasking" is simply standard multiprocessor preemptive scheduling. The fact that the OS by default implements this is an artifact of the OS, not the machine's architecture.

Especially on a console like the Xbox or the 360, you are free to use whatever scheduling mechanism you want, and the OS provides primitives that help you build these scheduling mechanisms, be they fiber-based or thread-based or whatever else you want.

Even the Windows kernel, which is a general purpose OS, has support for cooperatively multitasked fibers, and if you want to get esoteric, you can always write a driver, take control of the system and run everything at high IRQL if you need total control. In fact several hard real-time subsystems for Windows exist and provide an environment for developers to write tasks that require real-time responses.

The point is, yes preemptive multitasking may be inappropriate for a game (depending on what you are trying to do), but SMP does not necessarily force this on you, and SMP does NOT imply any particular form of scheduling -- it only describes the layout of the CPUs in the machine.
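
Since fibers came up, the Win32 primitives really are that simple - a minimal sketch of two cooperatively scheduled tasks, error handling omitted:

Code:
/* Minimal Win32 fiber sketch: the "scheduler" and the task explicitly
 * hand control to each other; nothing is ever preempted. */
#include <windows.h>

static LPVOID main_fiber;
static LPVOID task_fiber;

static VOID CALLBACK task_fn(LPVOID param)
{
    for (;;) {
        /* ... do one slice of work ... */
        SwitchToFiber(main_fiber);     /* yield back, entirely under our control */
    }
}

int run_cooperative(void)
{
    main_fiber = ConvertThreadToFiber(NULL);        /* current thread becomes a fiber */
    task_fiber = CreateFiber(64 * 1024, task_fn, NULL);

    SwitchToFiber(task_fiber);         /* run the task until it chooses to yield */
    DeleteFiber(task_fiber);
    return 0;
}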
 
A key difference between Cell and traditional parallel processors (SMP, if you will) is that the threading mechanisms are supported in HW on Cell, whereas on other processors the mechanisms exist in software and thus carry greater overhead.

When it comes to constructing custom real-time schedulers for games, the question is whether Cell's HW support for syncing threads etc. is easier to come to grips with and more straightforward than using semaphores, locks and other OS mechanisms to the same end.

One can still get to the same promised land either way... but is the road traveled to get there harder?
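
The mailboxes are a good example of that HW support - an SPE can block on a 32-bit message from the PPE with no OS lock anywhere in the picture. A rough sketch of the SPE side, using the spu_mfcio.h intrinsics as best I recall them:

Code:
/* SPE-side sketch of the hardware mailbox: stall until the PPE drops a
 * 32-bit message in, do the work, report completion the same way. */
#include <spu_mfcio.h>

void wait_for_work(void)
{
    for (;;) {
        unsigned int msg = spu_read_in_mbox();   /* blocks until the PPE writes */
        if (msg == 0)                            /* e.g. 0 = shut down */
            break;
        /* ... interpret msg as a job id or pointer and do the work ... */
        spu_write_out_mbox(msg);                 /* tell the PPE we're done */
    }
}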

---------

Shifty, did you get some of your math wrong before?

6 threads on Cell each get 256K by virtue of their LS - 7 threads if the OS does anything game-related - and that's ignoring the PPE for now.

Anyway...

On Xenon, 6 threads share one 1MB L2 cache. Each thread would have 166K, not 333K, available to it if the cache were split evenly - that is, if the six threads are all independent tasks and do not share data by design or happenstance. If the cache is locked six ways, the chance that data is useful to any two threads by "happenstance" is eliminated.

I myself feel only 3 threads will dominate execution time, because each core shares its pipes rather than duplicating them, so on average 3 of the six contexts will be blocked while the others are using the pipelines in each core. Staging is eliminated because contexts != instructions, and so instructions belonging to different contexts can't be staged, AFAIK. With this in mind, I think it may be better to say 3 threads would have 333K available to them on average.

---------------

I may be wrong, but another tangent of this discussion is whether the LSs are as good as caches when it comes to scheduling. I don't see how they could be just yet. Whether using preemptive or real-time schedulers, when a switch occurs with a cache there is a chance that at least some of the data the next process needs is still sitting around. With an LS the data gets wiped clean, or, if you attempt to re-use data with a custom scheme, it is difficult to guarantee "exactly" what data is to remain in the LS.

I don't see how they could be better or easier to deal with when switching tasks. When it comes to how tasks execute, that's another matter. Highly deterministic memory access, even across several cores, should be very valuable if you can keep the DMAs to a minimum and hide their latency. With all the models proposed for scheduling on Cell, I wonder whether avoiding the memory wall and expensive context switching doesn't push one to use co-op multi-tasking with shared memory as much as possible. To me it seems about the only way to utilize the available resources as well as possible while avoiding the innate penalties the BE architecture would otherwise introduce.

It would be great if DeanO could comment here, as he's spoken about Heavenly Sword using a co-op scheduler recently. It would be great to know whether Ninja Theory has tried other approaches and they worked, or whether they found co-op with shared memory is the method Cell "prefers." I remember comments from Ninja Theory at E3 that said something along the lines of them using all the SPUs at some point in the game loop for shadows alone. I wonder if the approach is then to find a big task --> split it up --> push to SPUs --> get results --> find another big task, so that switching is avoided and all the SPUs are leveraged. (If a task can't utilize all the SPUs, then the SPUs are partitioned between tasks, and hopefully you can find tasks that complete at about the same time with this partitioning in place.)

I should note: if SPUs work towards one common "task", then I consider their memory shared, whether the SPUs DMA data to one another along a chain or the task permits each SPU to explicitly work on a "chunk" of it with the results sewn together at the end. The LS space of all the SPUs involved is consumed in getting that one goal accomplished regardless.

Ok I'm done :)
 