A glimpse inside the CELL processor

SPM · Jul 21, 2006

DeanoC said:
On an SMP I'd probably have 3 read/writes to main RAM with 3 read/write compare and sets for synchronisation, with perhaps a few cache prefetches to try and maximise cache hits.

Are you actually using pre-emptive multi-tasking with SMP scheduling for game code on the Xenon or PC?

aaaaa00 · Jul 21, 2006

SPM said:
Also I think it is better to loop through a process 1000 times on one or more specifically assigned SPE cores than to spawn 1000 preemptively multi-tasked threads and have them distributed among 3 SMP cores, both for efficiency (no context switches) and for predictability in timing.

You would never do this on SMP either. Who in their right mind would spawn 1000 preemptively multi-tasked threads for 1000 workitems?!?! (I mean, you... could... but WHY?!)

Instead you would spawn 3 or 6 threads, lock each thread to a logical CPU, and just queue jobs to them.
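
A minimal sketch of that job-queue pattern, in Python purely for illustration. The names `run_jobs` and `NUM_WORKERS` are made up for this example, pinning each thread to a logical CPU is platform-specific and left out, and Python threads do not run CPU-bound work in parallel, so this only shows the structure: a handful of long-lived workers draining a shared queue.

```python
import queue
import threading

NUM_WORKERS = 3  # assumption: one worker per core (6 for one per hardware thread)

def run_jobs(jobs, num_workers=NUM_WORKERS):
    """Run every callable in `jobs` on a fixed pool of long-lived worker threads."""
    job_queue = queue.Queue()
    results = [None] * len(jobs)

    def worker():
        while True:
            item = job_queue.get()
            if item is None:           # sentinel: no more work for this worker
                break
            index, job = item
            results[index] = job()     # each job writes to its own result slot

    workers = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in workers:
        t.start()
    for index, job in enumerate(jobs):
        job_queue.put((index, job))
    for _ in workers:
        job_queue.put(None)            # one sentinel per worker
    for t in workers:
        t.join()
    return results

# 1000 work items, but only 3 threads are ever created.
squares = run_jobs([lambda i=i: i * i for i in range(1000)])
```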

The same performance benefits (and programming effort) of not using the preemptively multi-tasked SMP approach will apply to Xenon also. There is no reason why this approach rather than SMP should not be used on the Xenon to improve performance.

SMP and pre-emptive multitasking are two completely separate issues.

SMP describes the fact that each processor is the same. SMP machines are typically shared memory, which means all the processors can see all the system memory.

Whether you choose to use OS threads or instead cooperatively schedule tasks with fibers, or adopt a work-item approach, or instead implement any other paradigm you want on top of this foundation is entirely up to you on an SMP.

SPM · Jul 22, 2006

aaaaa00 said:
Instead you would spawn 3 or 6 threads, lock each thread to a logical CPU, and just queue jobs to them.

Sounds good to me. The 1000 was from DeanoC's post. What I was trying to say is that in the PC SMP scheduling model, you spawn a lot of process threads, more than there are CPU cores, and you allow the scheduler to automatically distribute those threads between the CPU cores to balance the processing load between the cores. In order to balance the load evenly, you need a reasonably large number of threads (probably not 1000 though).

SMP and pre-emptive multitasking are two completely separate issues.

SMP describes the fact that each processor is the same. SMP machines are typically shared memory, which means all the processors can see all the system memory.

There are two different meanings of SMP: symmetrical cores as you mention, but also SMP scheduling, i.e. scheduling of process threads symmetrically between cores on the basis of preemptive timeslices (what you call OS threads). Another method of multi-core scheduling is NUMA (non-uniform memory access) scheduling, which distributes processes between multiple identical processors but gives higher priority to those with faster access to the memory the process wants to access. By SMP preemptive multi-tasking, I meant the automatic scheduling of processes by allocating them to cores in a symmetrical way.

Whether you choose to use OS threads or instead cooperatively schedule tasks with fibers, or adopt a work-item approach, or instead implement any other paradigm you want on top of this foundation is entirely up to you on an SMP.

Again I agree with you, that is why I suggested that you won't get the best out of Xenon by using SMP scheduling to distribute workload between the cores.

However, when people say that multi-core PC programming is easy, and that paradigms for multi-core processing exist for the PC, they are talking about spawning a lot of threads and allowing the OS SMP scheduling to distribute the threads between the cores. This is good for running independent applications that don't need to cooperate, but unfortunately this approach isn't good for games, realtime programming, or high performance computing, where the results need to be aggregated or completed within a short period of time.

patsu · Jul 22, 2006

darkblu said:
what if that extra complexity there is buying you something elsewhere? maybe spares you an even higher complexity needed to attain similar performance from a similar number of threads in an SMP environment?

This is my interpretation. I don't think we are disagreeing.

The principles behind NUMA, SIMD, a larger number of cores, and deterministic run-time work towards Cell's advantage. These salient points are not found in a simple, small-scale SMP setup. Naturally, a developer can do more work to map his/her problems to the Cell architecture to achieve aggressive results. I think we are agreeing here.

However, this does not mean that all our Cell programming effort will contribute to a speed-up. The current Cell implementation is also constrained by power consumption, heat and size requirements. For example, it would be nice to have >256K local store in practice. Without it, developers may need to jump through extra hoops to avoid unnecessary penalties, or even to make parallelization possible.

Then there is also the base parallel execution scaffold (kernel/layer) to do every time a dev tries to organize his/her SPUs for a custom job. With time, this problem should go away. But right now, I do not know if Sony has sufficient libraries and tools to help developers out (debugging!).

Did I understand you correctly ?

darkblu said:
definitely. if you can get your job at first try and call it a day. unfortunately there's this general principle that more complex results are achieved through more complex means. so if somebody wants to have technically-bleeding-edge concurrent stuff in their game they can a) either bite the cell bullet, or b) scale down their goals to fit on simpler-but-less-resourceful SMP architectures.

See... the thing is I'm not sure developers want to "have technically-bleeding-edge concurrent stuff in their game" just for the heck of it. They have many big and small headaches to solve. The more effective the tool/solution, the better. SMP has its place and Cell has its place. There is room for improvement for both.

darkblu said:
if that was directed at me - yep, surely. SMP is definitely the most versatile concurrent architecture there is. it just has issues at the high RPM end : )

Perhaps SMP is not so "useless" if many of the needs are simple ? I do not have the answer. I was answering to SPM's comment about Cell's fast communication primitives.

EDIT:
I found this Cell programming link. Not sure if it has been posted here before:
http://gamasutra.com/features/20060721/chow_01.shtml

DeanoC · Jul 22, 2006

You wouldn't want to spawn 1000s of tasks on Cell either... At the moment we are in the 100-200 region (per frame across all systems) but that will change (probably downwards) once I've had a chance to optimise the spatial partition numbers... Obviously that's late-stage performance tuning.

In our case we send batches of spatially local coherent grunts (the name I use for the basic soldier in an army), so that memory access to things like the collision data is extremely coherent and suited to a single(ish) DMA upload to SPU. The size of each batch is a balance between the optimal number of task stop/start/uploads etc. and the size of the collision data that can fit in the SPU LS.
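
As a rough sketch of that batching trade-off (not the actual engine code): group spatially sorted cells into batches so each batch's collision data fits a local-store budget. The 256 KB figure is the LS size quoted in this thread; `RESERVED_BYTES`, `make_batches` and the cell sizes are hypothetical values picked for illustration.

```python
LOCAL_STORE_BYTES = 256 * 1024   # SPE local store size quoted in the thread
RESERVED_BYTES = 96 * 1024       # assumption: kept free for code, stack, buffers
DATA_BUDGET = LOCAL_STORE_BYTES - RESERVED_BYTES

def make_batches(cells, budget=DATA_BUDGET):
    """Group spatially ordered cells into batches whose data fits `budget`.

    `cells` is a list of (cell_id, collision_bytes) pairs, already sorted so
    that neighbours in the list are neighbours in the world.
    """
    batches, current, used = [], [], 0
    for cell_id, size in cells:
        if size > budget:
            raise ValueError(f"cell {cell_id} alone exceeds the budget")
        if used + size > budget:     # batch is full: start a new one
            batches.append(current)
            current, used = [], 0
        current.append(cell_id)
        used += size
    if current:
        batches.append(current)
    return batches

# Hypothetical world: 10 cells with 40 KB of collision data each.
cells = [(i, 40 * 1024) for i in range(10)]
batches = make_batches(cells)
```

Shrinking the per-cell collision data raises the number of cells per batch, which is the "much bigger batches" effect: fewer task start/stop/upload cycles for the same amount of work.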

I haven't yet optimised the collision data for space before sending it to the SPU, so I expect I'll be sending much bigger batches once that's done...

One important note is that we want more tasks than SPUs, because we can interleave things like animation and push-buffer generation in many cases to reduce stall time.
Pre-emptive multi-tasking is really not that useful if you have enough tasks (and full control; obviously OSs want pre-emptive because of malicious/buggy apps) for a co-operative multi-tasker (which is what we use).
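
For readers unfamiliar with the idea, here is a toy cooperative scheduler in Python. It is only a sketch of the general technique, not how the SPU task system works; `task` and `run_cooperative` are hypothetical names. Each task gives up control voluntarily at a `yield` (standing in for a point where it would otherwise stall), and the scheduler simply runs whichever task is next.

```python
from collections import deque

def task(name, steps, log):
    """A cooperative task: does one step of work, then yields voluntarily."""
    for step in range(steps):
        log.append((name, step))
        yield                      # e.g. where a task would wait on a transfer

def run_cooperative(tasks):
    """Round-robin over generator tasks until all of them have finished."""
    ready = deque(tasks)
    while ready:
        current = ready.popleft()
        try:
            next(current)
            ready.append(current)  # not finished: back of the queue
        except StopIteration:
            pass                   # finished: drop it

log = []
run_cooperative([task("animation", 2, log), task("push_buffer", 3, log)])
```

The two tasks interleave step by step with no timer interrupt involved, which is why this only works when every task is trusted to yield.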

darkblu · Jul 22, 2006

patsu said:
The principles behind NUMA, SIMD, a larger number of cores, and deterministic run-time work towards Cell's advantage. These salient points are not found in a simple, small-scale SMP setup. Naturally, a developer can do more work to map his/her problems to the Cell architecture to achieve aggressive results. I think we are agreeing here.

yes, we are.

However, this does not mean that all our Cell programming effort will contribute to a speed-up. The current Cell implementation is also constrained by power consumption, heat and size requirements. For example, it would be nice to have >256K local store in practice. Without it, developers may need to jump through extra hoops to avoid unnecessary penalties, or even to make parallelization possible.

yes, local SPE memory definitely could be larger and that would ease everybody's lives. if you compare it to the present vanilla (SMP) architectures you'll notice though that 256KB is not that far from what cache a vanilla-architecture cpu gets these days. surely 512KB would have been better, with 1MB being already in the dream domain.

Then there is also the base parallel execution scaffold (kernel/layer) to do every time a dev tries to organize his/her SPUs for a custom job. With time, this problem should go away. But right now, I do not know if Sony has sufficient libraries and tools to help developers out (debugging!).

i think it is only reasonable to expect that this issue will be addressed and likely resolved. if you look at the large-scale distributed architectures of our time, most of them have the distribution/dispatching mechanisms transparently in place, and for a good reason too - e.g. the earth simulator has 640 nodes - imagine having to manage those in your code ; )

See... the thing is I'm not sure developers want to "have technically-bleeding-edge concurrent stuff in their game" just for the heck of it. They have many big and small headaches to solve. The more effective the tool/solution, the better. SMP has its place and Cell has its place. There is room for improvement for both.

i definitely agree with that last sentence of yours - i don't see SMP departing anytime soon. actually a distributed cell-like architecture and a corresponding SMP configuration are not orthogonal to each other, they're residing in the same spectrum, just kind-of at opposite ends of it. but both can be made so that they move slightly closer to the opposite end. e.g. an SMP with shared cache and capability for core-wise partitioning of that cache is a tad closer to cell, and a cell with built-in data consistency mechanisms would be a tad closer to SMP. as things stand now though, getting each to behave like the other can be a bit counter-productive, and i can see that being one of the issues devs are having with cell - them trying to make it behave like an SMP.

I found this Cell programming link. Not sure if it has been posted here before:
http://gamasutra.com/features/20060721/chow_01.shtml

nice find!

patsu · Jul 23, 2006

darkblu said:
yes, local SPE memory definitely could be larger and that would ease everybody's lives. if you compare it to the present vanilla (SMP) architectures you'll notice though that 256KB is not that far from what cache a vanilla-architecture cpu gets these days. surely 512KB would have been better, with 1MB being already in the dream domain.

I believe the issue is the LS is not a cache and the SPEs can only work on LS directly.

I also think that I might have focused on the wrong advantages when putting Cell to work in a gaming context. It seems from Deano's few and far between comments that the hard requirement is real-time scheduling+execution (as opposed to parallel computation, efficiency, raw power, ...), e.g., certain effects must start immediately upon specific events, certain tasks must complete within a specific timeframe, or both.

The Cell architecture has a few traits (predictable run-time, more cores to summon, fast execution path) that allow developers to hit their targets more adequately and consistently regardless of what's happening in the game world. An abstract layer such as prioritized pre-emptive threading has its limitations in this aspect (no matter how good their priority schemes are).

When I did real-time programming eons ago, we built our own primitive kernel and run-time from the ground up to cater for very specific application requirements. In particular the scheduling policies were tuned/tweaked/fudged to business-domain needs (but ran at the lowest level) to guarantee timeliness.

In any case, I'm happy that the NT guys are doing more SPE stuff today. Please keep up the good work. We all know it's not easy.

darkblu · Jul 23, 2006

patsu said:
I believe the issue is the LS is not a cache and the SPEs can only work on LS directly.

yes, whereas with cache-reliant architectures you want to maximise your cache efficiency, ergo data locality, ergo the 'local storage constraints' matter is not foreign to that class of architectures either. in cell this matter is just way more, erm, prominent. but again, for a good reason too (tm).

see, a classic SMP architecture (especially one with shared, lockable, fully-associative cache) can be made to perform not that differently from a distributed, cell-like collection of individual nodes, as in that SMP architecture, too, each cpu could be made to effectively work only in its cache apartment (things like 'io space' aside), given certain discipline on the sw side. it would be rather complicated, but it could have its gains. of course that would not necessarily be 'easy', or for that matter, viable/justified. so enter cell.

I also think that I might have focused on the wrong advantages when putting Cell to work in a gaming context. It seems from Deano's few and far between comments that the hard requirement is real-time scheduling+execution (as opposed to parallel computation, efficiency, raw power, ...), e.g., certain effects must start immediately upon specific events, certain tasks must complete within a specific timeframe, or both.

i'm sorry, i thought we were in the same boat in this regard. but once you've had too many discussions about cell on these boards, you start to forget in which you have laid out your principal views and in which you have not. so, let me clear this up: determinism in performance is exactly what i see as cell's major advantage - the ability of the cpu to deliver certain oomph at a certain moment (hence my 'high RPM' analogy). or put in other words, it's the ability to provide the developer with a reasonable percentage of its theoretical power, subject to current availability, in a timely manner, given the dev has done all the necessary rituals in advance, chanted the right words in latin, facing the right direction of the world. i.e. all those last things being at the dev's expense. but once he's done them right - the demon is in the pentagram. with SMP the demon often materialises with one foot stepping out, or totally out, or happens to be on vacation.

The Cell architecture has a few traits (predictable run-time, more cores to summon, fast execution path) that allow developers to hit their targets more adequately and consistently regardless of what's happening in the game world. An abstract layer such as prioritized pre-emptive threading has its limitations in this aspect (no matter how good their priority schemes are).

see, i should've gotten that clue - you did mention (issues with) preemptiveness in cell and i just let that slip. well, you don't (or should not) really care about preemptiveness in a closed, distributed environment doing RT stuff. because..

When I did real-time programming eons ago, we built our own primitive kernel and run-time from the ground up to cater for very specific application requirements. In particular the scheduling policies were tuned/tweaked/fudged to business-domain needs (but ran at the lowest level) to guarantee timeliness.

..you'd take a well-oiled cooperative system for RT purposes over a preemptive one any day of the week ; ) whereas preemptiveness is a big deal in desktop/multi-app/multi-user environments, its advantages in closed, well-behaved systems with emphasis on real-time are rather, well, absent.

In any case, I'm happy that the NT guys are doing more SPE stuff today. Please keep up the good work. We all know it's not easy.

i'm happy too (while i'm saving for a cell blade ; )

Jesus2006 · Jul 23, 2006

darkblu said:
see, a classic SMP architecture (especially one with shared, lockable, fully-associative cache) can be made to perform not that differently from a distributed, cell-like collection of individual nodes, as in that SMP architecture, too, each cpu could be made to effectively work only in its cache apartment (things like 'io space' aside), given certain discipline on the sw side. it would be rather complicated, but it could have its gains. of course that would not necessarily be 'easy', or for that matter, viable/justified. so enter cell.

Not really comparable. You'd need a lot of cache if you want to imitate that Cell-like behavior. You would need to lock large chunks of cache to fit whole "programs" (SPU-like) in it, and the rest of the system would suffer severely, because it has no uncached access/DMA to main RAM.

Shifty Geezer · Jul 23, 2006

1 MB cache, shared between 3 cores, gives 333KB each, more than SPE's 256 KB.

Jesus2006 · Jul 23, 2006

Shifty Geezer said:
1 MB cache, shared between 3 cores, gives 333KB each, more than SPE's 256 KB.

Isn't it 6 threads? :smile:
Then again, you cannot fetch any data from main memory anymore since most of the cache is locked. In addition, you cannot synchronize these "tasks" well since the individual locks cannot see each other and have no access to each other, in contrast to Cell (via DMA), except by going via GDDR, which is not the fastest solution.

Still, this is cache, it's not addressable and you cannot transparently work in it and expect it to behave in a deterministic way, since it's shared by all threads.

_phil_ · Jul 23, 2006

Shifty Geezer said:
1 MB cache, shared between 3 cores, gives 333KB each, more than SPE's 256 KB.

6 threads, interlaced. Just like PPU * 3.

Jesus2006 · Jul 23, 2006

_phil_ said:
6 threads, interlaced. Just like PPU * 3.

What do you mean by the term interlaced?

darkblu · Jul 24, 2006

Jesus2006 said:
Not really comparable. You'd need a lot of cache if you want to imitate that Cell-like behavior.

256kB per core. i would not exactly put that in the 'lots' domain.

You would need to lock large chunks of cache

yep. those same 256kB.

, to fit whole "programs" (SPU like) on it

so where does the code of a modern CPU reside?

and the rest of the system would suffer severely, because it has no uncached access/DMA to main RAM.

what 'rest of the system'? we're talking of cell emulation here. you don't need DMA - your main RAM resides right behind that cache - the cache controller is your 'DMA'. and since it's unified cache, i.e. everything would go through the same cache controller, contention can be resolved with similar efficiency to how cell's EIB does it. and you don't want uncached access - on the contrary - you want your cache to have very-lazy/deferred writes to main RAM. best case would be if they were code-controllable. that's all from the hw, the rest is all sw discipline.

voila. you've got your hypothetical 'cell-of-a-sort' over SMP. kind of costly and exotic by today's measures, but otherwise fairly similar to the original.

patsu · Jul 24, 2006

In Xenon's defense, my view is:

* One does not need to split the shared cache evenly to exploit the CPUs. The point is Xenon's flexibility allows the dev to allocate larger than 256K to 1 or 2 cores where appropriate. In addition, data beyond 256K can still be processed (although slower) in the main memory to preserve the "locked cache" content (Am I correct here ?). It is less likely to have a "local memory deficiency" show stopper in Xenon.

* Also it seems that one does not need to rely solely on the pre-emptive software/OS threading framework in Xenon. If desired, a dev may develop its own cooperative tasking framework in Xenon at the expense of simplicity. Together with a locked cache of a larger size, it may be able to achieve comparable performance (as Cell). This is especially true if (i) the problem size is not overly large, and/or (ii) All the cores are not fully occupied already. A 1,000 soldier simulation does not sound too large.

Granted, Cell should have larger capacity given the larger number of cores (when it's not tripped by its local store limitation). I ignore hardware threads above because my impression is that they are used more to hide latency/prevent stalling than to provide additional (peak) throughput/computation power.

ihamoitc2005 · Jul 24, 2006

Execute

Jesus2006 said:
What do you mean by the term interlaced?

I think he means like 1080i rather than 1080p. One core runs two threads, but both cannot have the same resource at the same time, so their execution must be interlaced, no? I think this is what he is saying.

deathkiller · Jul 24, 2006

patsu said:
I ignore hardware threads above because my impression is that they are used more to hide latency/prevent stalling than to provide additional (peak) throughput/computation power.

They hide latency/prevent stalling using more of the shared cache so I think that they shouldn't be ignored when counting the cache.

Shared cache also has lower size efficiency because of aliasing.

Shifty Geezer · Jul 24, 2006

Jesus2006 said:
Isn't it 6 threads? :smile:

And SPEs share 256 KB between two threads (if you write two threads). Or in other words, for each logical core that performs work on data, in SPEs you have 256 KB local cache, and in XeCPU you have up to 333 KB local cache.

Then again you cannot fetch any data from main memory anymore since most of the cache is locked.

Or you make 256 KB available to cores and still have 256 KB available as open cache. And besides, if you're locking all three cores to use the cache like LS, you'd be managing RAM access anyway, just like SPEs.

In addition you cannot synchronize these "tasks" well since the individual locks cannot see each other and have no access to each other, in contrast to cell (via DMA), except for going via GDDR

Or, as has been mentioned already, you write to a memory address that is cached, and then read from it, not having to touch GDDR. I don't know though if a core can access a locked part of the cache and whether, if it addresses outside that space, it runs into the normal cache process.

Still, this is cache, it's not addressable and you cannot transparently work in it and expect it to behave in a deterministic way, since it's shared by all threads.

So only run one thread, same as SPEs.

From what I can see, the biggest disadvantages this has versus Cell's SPE LS model are: 1) Simplicity - XeCPU and other processors stick to a simple model so that software developers don't have to worry about low-level workings and yet get good use of the hardware. 2) L2 cache is slower than the LS, I think on the order of 50% slower (6 cycles versus LS 4 cycles). SPEs are going to have an efficiency advantage.
I'm also not sure what low-level memory/cache control is available. SPEs are obviously designed for working that way so surely have advantages.

Which is beside the point. You said you'd need a lot of cache to simulate Cell behaviour. At 256 KB per SPE, and 512 KB for PPE, XeCPU can simulate a two-SPE device by locking 256 KB for each of two cores and leave 512 KB for the third to act as PPE. Hence your suggestion is patently false. If you want to talk about other aspects to LS versus cache, that's fine, but you might want to accept you are wrong about the cache size being a limit first.
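
The capacity arithmetic behind that claim can be checked in a few lines. This is only a size calculation under the figures quoted in this thread (1 MB shared L2, 256 KB per SPE-like region); `remaining_cache_kb` is a made-up helper, and it says nothing about latency, associativity or how cache locking really behaves on the hardware.

```python
L2_CACHE_KB = 1024   # 1 MB shared L2, as stated in the thread

def remaining_cache_kb(locked_kb, total_kb=L2_CACHE_KB):
    """Return the cache left unlocked after reserving the given regions."""
    locked = sum(locked_kb)
    if locked > total_kb:
        raise ValueError("locked regions exceed the cache size")
    return total_kb - locked

# Even three-way split: 341 KB with 1 MB = 1024 KB (roughly the 333 KB quoted).
even_split_kb = L2_CACHE_KB // 3

# Two SPE-like 256 KB regions leave 512 KB for the third, PPE-like core.
two_spe_like_kb = remaining_cache_kb([256, 256])
```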

Jesus2006 · Jul 24, 2006

Shifty Geezer said:
Which is beside the point. You said you'd need a lot of cache to simulate Cell behaviour. At 256 KB per SPE, and 512 KB for PPE, XeCPU can simulate a two-SPE device by locking 256 KB for each of two cores and leave 512 KB for the third to act as PPE. Hence your suggestion is patently false. If you want to talk about other aspects to LS versus cache, that's fine, but you might want to accept you are wrong about the cache size being a limit first.

Ok i agree, you can simulate a 2 SPE device. But with all the performance and efficiency disadvantages you mentioned that push it way below a "real" 2 SPE CELL

so it's not a good simulation after all

Btw. SPEs do not support multiple threads afaik. They are dual issue, but can only execute 2 operations of different kinds at a time, and this is only a performance increase to prevent pipeline stalls.
In an SMP environment like the simulated CELL that we discuss here this would require extra space in cache for both threads (if one uses them).

But still, how does the memory management work? What does a locked cache do? When you write data to it, will it write back to main memory immediately, or do you have control over when to write back and when to read from it? Because every write back will make performance drop significantly, whereas an SPE works on its own LS until the task is finished and then DMAs the results back under user control. I thought that locked cache only guarantees an amount of cache being reserved for a specific thread, but nothing else (like addressability etc.).
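
To make the SPE side of that comparison concrete, here is a toy model of the copy-in / compute / copy-out pattern in Python. Plain lists stand in for main memory and the local store, and `process_in_local_store` is a hypothetical name; real SPE code would issue DMA transfers instead of list slices.

```python
def process_in_local_store(main_memory, start, count, kernel):
    """Copy a block in, work only on the local copy, then copy results back."""
    local = main_memory[start:start + count]    # explicit "DMA in"
    local = [kernel(x) for x in local]          # main memory untouched meanwhile
    main_memory[start:start + count] = local    # explicit "DMA out", once

memory = list(range(8))
process_in_local_store(memory, 2, 3, lambda x: x * 10)
```

Main memory sees exactly one write per block, at a point the programmer chooses, which is the control the question above is asking whether a locked cache can offer.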

_phil_ · Jul 24, 2006

ihamoitc2005 said:
I think he means like 1080i rather than 1080p. One core runs two threads, but both cannot have the same resource at the same time, so their execution must be interlaced, no? I think this is what he is saying.

Yes, kinda. As with the PS3, you should see the XCPU cores as 3*2 threads at 1.6 GHz.
