Micro-tasks and streaming.

Frank

So, how would you build an OS and API that supports this better than the current model? Because that is essentially what is needed to pull it off. Better compilers can help, but especially when mixing different architectures on the same chip (which would include treating the VMX units as a different model as well), what you really need is an efficient way to distribute and manage all those small tasks, and a way to keep the data structures whole while moving them around in bits and pieces.

Forget about threads and objects as we know them; they won't work in the long run. Something like the supercomputer distributed SIMD vector model might be a better fit. But then again, how are you going to handle a single game loop on something like that?

Any ideas?
 
DiGuru said:
So, how would you build an OS and API that supports this better than the current model? Because that is essentially what is needed to pull it off. Better compilers can help, but especially when mixing different architectures on the same chip (which would include treating the VMX units as a different model as well), what you really need is an efficient way to distribute and manage all those small tasks, and a way to keep the data structures whole while moving them around in bits and pieces.

Forget about threads and objects as we know them; they won't work in the long run. Something like the supercomputer distributed SIMD vector model might be a better fit. But then again, how are you going to handle a single game loop on something like that?

Any ideas?

The issue isn't the different processors; it's the limited, disjoint memories that force a streaming model.
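To make that concrete, here's a minimal sketch of the kind of double-buffered streaming loop that model implies on an SPE. It assumes the Cell SDK's spu_mfcio.h DMA intrinsics (mfc_get, mfc_write_tag_mask, mfc_read_tag_status_all), and process_block() is a made-up compute kernel, so treat it as an illustration rather than working code.

#include <spu_mfcio.h>

#define BLOCK 16384                                   /* 16 KB per DMA block */
static char buf[2][BLOCK] __attribute__((aligned(128)));

extern void process_block(char *p, unsigned n);       /* hypothetical kernel */

/* Stream 'total' bytes from effective address 'ea' through local store,
   overlapping the DMA for block i+1 with the compute on block i.            */
void stream(unsigned long long ea, unsigned total)
{
    unsigned i, n = total / BLOCK;

    mfc_get(buf[0], ea, BLOCK, 0, 0, 0);              /* prefetch block 0    */
    for (i = 0; i < n; i++) {
        int cur = i & 1, nxt = cur ^ 1;
        if (i + 1 < n)                                /* kick off next DMA   */
            mfc_get(buf[nxt], ea + (i + 1) * (unsigned long long)BLOCK,
                    BLOCK, nxt, 0, 0);
        mfc_write_tag_mask(1 << cur);                 /* wait for current    */
        mfc_read_tag_status_all();
        process_block(buf[cur], BLOCK);
    }
}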
 
ERP said:
The issue isn't the different processors; it's the limited, disjoint memories that force a streaming model.
Good point.

So, what OS and API do we need to break those monolithic structures into something manageable?
 
Cell is not well suited to general multi-application OSes.

Part of the problem is that the local memories are part of the SPU's local state and would have to be saved in a context switch, which means an SPU context switch can take hundreds of thousands of cycles.

That leaves you with a run-to-completion model, which can work well inside an application but does not work in a multi-application environment, since badly behaved applications may never release the SPU.

Your best bet is probably something similar to the model IBM suggests: break things up into small pieces that can run to completion, with static scheduling of the jobs. However, while this works well for trivially parallel jobs, it's an extremely limited view of parallelism in general.

In more general cases you will want to build cooperatively multithreaded jobs, and you will want to conditionally execute jobs based on the results of other jobs, which makes the scheduling more complex.

But in general the real issue is still breaking the application down into small pieces that have localised memory requirements, and that's harder than it sounds.
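Roughly, a run-to-completion job list with result-dependent scheduling could look like the plain C sketch below. The job struct, the deps field and run_all are invented for illustration, not any real Cell API: each job is a short kernel that runs to completion, and a job only becomes runnable once the jobs it depends on have finished, which is exactly where the scheduling gets harder than the trivially parallel case.

#include <stddef.h>

typedef struct job {
    void        (*run)(void *arg);   /* short, run-to-completion kernel */
    void         *arg;
    struct job  **deps;              /* jobs that must finish first     */
    size_t        ndeps;
    int           done;
} job;

static int ready(const job *j)
{
    for (size_t i = 0; i < j->ndeps; i++)
        if (!j->deps[i]->done)
            return 0;
    return 1;
}

/* Naive single-threaded scheduler: keep sweeping the list, running any
   job whose dependencies are satisfied, until everything has completed.
   A real scheduler would hand ready jobs to idle SPEs instead.          */
void run_all(job *jobs, size_t njobs)
{
    size_t remaining = njobs;
    while (remaining) {
        for (size_t i = 0; i < njobs; i++) {
            if (!jobs[i].done && ready(&jobs[i])) {
                jobs[i].run(jobs[i].arg);
                jobs[i].done = 1;
                remaining--;
            }
        }
    }
}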
 
ERP said:
Cell is not well suited to general multi-application OSes.

Part of the problem is that the local memories are part of the SPU's local state and would have to be saved in a context switch, which means an SPU context switch can take hundreds of thousands of cycles.

That leaves you with a run-to-completion model, which can work well inside an application but does not work in a multi-application environment, since badly behaved applications may never release the SPU.

Your best bet is probably something similar to the model IBM suggests: break things up into small pieces that can run to completion, with static scheduling of the jobs. However, while this works well for trivially parallel jobs, it's an extremely limited view of parallelism in general.

In more general cases you will want to build cooperatively multithreaded jobs, and you will want to conditionally execute jobs based on the results of other jobs, which makes the scheduling more complex.

But in general the real issue is still breaking the application down into small pieces that have localised memory requirements, and that's harder than it sounds.

That was something I was wondering. Now that Linux will be shipped with it and so on, how would you write a program for the PS3? I doubt we will see many programs that directly control the SPEs, since that could collide with another program written to control the SPEs in a similar way. What I mean is: is there a way to write programs for the PS3 where the programmers manually control the SPEs, so that programs made by different programmers do not collide with each other? Are the SPEs capable of running several programs at the same time, fighting for their attention?
 
Couldn't the app provide an estimated time of execution to the OS for its SPE task, and if the task runs significantly longer than that, the OS assumes there's an error and switches the SPE to something else for a while? This way all programs get reasonably fair access, even if the user/system doesn't stop the offending program.

Of course, compilers will need a good built-in profiler to give you estimates for this.
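As a rough sketch of what that could look like (the task fields and the current_time_usec()/preempt_spe() helpers are hypothetical, this is not how any shipped Cell OS actually works):

typedef struct spe_task {
    void     *program;      /* code + data image to load onto the SPE        */
    unsigned  est_usec;     /* app-supplied estimate, e.g. from the profiler */
    unsigned  started_at;   /* filled in by the OS when the task starts      */
} spe_task;

extern unsigned current_time_usec(void);   /* hypothetical clock source      */
extern void     preempt_spe(spe_task *t);  /* hypothetical reclaim routine   */

/* Watchdog the OS could run periodically: if a task has overrun its own
   estimate by more than some slack factor, assume it is misbehaving and
   reclaim the SPE so other programs get their turn.                         */
void watchdog_check(spe_task *t, unsigned slack_factor)
{
    unsigned elapsed = current_time_usec() - t->started_at;
    if (elapsed > t->est_usec * slack_factor)
        preempt_spe(t);     /* costly: has to save the 256 KB local store    */
}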
 
ERP said:
Cell is not well suited to general multi-application OSes.

Part of the problem is that the local memories are part of the SPU's local state and would have to be saved in a context switch, which means an SPU context switch can take hundreds of thousands of cycles.

That leaves you with a run-to-completion model, which can work well inside an application but does not work in a multi-application environment, since badly behaved applications may never release the SPU.

Your best bet is probably something similar to the model IBM suggests: break things up into small pieces that can run to completion, with static scheduling of the jobs. However, while this works well for trivially parallel jobs, it's an extremely limited view of parallelism in general.

In more general cases you will want to build cooperatively multithreaded jobs, and you will want to conditionally execute jobs based on the results of other jobs, which makes the scheduling more complex.

But in general the real issue is still breaking the application down into small pieces that have localised memory requirements, and that's harder than it sounds.

You could also extend IBM's micro-tasking idea a bit and do it by hand, dividing the LS into partitions, one for each mini-thread. You could support several threads in execution: a ~64 KB slot (the 64 KB would have to hold the LS mini-slot plus other context info) used for the currently executing thread, a full 64 KB slot used as context save storage (backing up the context), and 128 KB of LS or less (you can wiggle things around if you want) used as a shared software cache for all threads (when streaming in data and fetching instructions you could, if needed, organise them in "tagged" groups, keeping track of a thread ID). Context switching would then be performed manually.

This is surely a rushed idea, but it might be useful in some cases: if you can divide your data set into small enough batches and take enough advantage of the software cache, it would be faster to multi-thread that way (and use a preemptive multi-threading policy) than to do a full context switch with 256+ KB worth of info... IMHO.
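A back-of-the-envelope version of that layout in C, just to make the idea concrete (the numbers are the ones from the post above; the constants, dma_put/dma_get wrappers and minithread_switch are all invented for the sketch):

#include <string.h>

/* Rough split of the 256 KB local store for hand-rolled mini-threads:
     64 KB  running slot for the current mini-thread
     64 KB  save slot used to stage / back up a context
    128 KB  software cache shared by all mini-threads (entries tagged by id) */
#define LS_RUN      0x00000
#define LS_SAVE     0x10000
#define LS_CACHE    0x20000
#define SLOT_SIZE   0x10000

extern void dma_put(void *ls, unsigned long long ea, unsigned size); /* hypothetical */
extern void dma_get(void *ls, unsigned long long ea, unsigned size); /* hypothetical */

/* Manual "context switch": stage the running slot into the save slot, push
   it out to main memory, and pull the next mini-thread's 64 KB in. Only
   64 KB (plus bookkeeping) moves, instead of the full 256 KB an OS-level
   SPU context switch would have to save and restore.                        */
void minithread_switch(unsigned long long ea_save_cur,
                       unsigned long long ea_load_next)
{
    memcpy((void *)LS_SAVE, (void *)LS_RUN, SLOT_SIZE);
    dma_put((void *)LS_SAVE, ea_save_cur, SLOT_SIZE);
    dma_get((void *)LS_RUN, ea_load_next, SLOT_SIZE);
}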
 
That's why a streaming model would make more sense, as long as you don't use too many data blocks.

As for the task switching: you can give each thread its own local block of registers. Like: thread 0 uses the addresses 00000 to 000ff as registers/local storage, thread 1 uses 00100-001ff, etc. For the streams that go with it, you likewise reserve blocks of local memory, depending on the size needed. And as they are streams, you can use any block size that is a multiple of the structure/record size. You can even adjust that size dynamically as needed, by only asking for new data when there is room, although bigger blocks are more efficient for the DMA transfer.

That way, you can still switch the threads with little overhead, and run the streaming tasks to completion.
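Sketching that addressing scheme in C (the 256-byte block size comes from the addresses in the post; the names and the descriptor struct are just for illustration):

/* Each mini-thread gets a fixed 256-byte block of local store as its
   private "registers"/scratch area: thread 0 at 0x00000-0x000ff,
   thread 1 at 0x00100-0x001ff, and so on.                                */
#define REG_BLOCK_SIZE  0x100
#define REG_BASE(tid)   ((void *)((unsigned)(tid) * REG_BLOCK_SIZE))

/* Per-thread stream buffer descriptor: the buffer is always a whole
   number of records, and can be resized whenever the next refill is
   requested. Switching threads then just means switching which
   descriptor the streaming loop works on; nothing has to be saved.       */
typedef struct stream_buf {
    void     *base;          /* block of local store reserved for the stream */
    unsigned  record_size;   /* size of one structure/record                 */
    unsigned  records;       /* current block size, in records               */
    unsigned  next_record;   /* how far the run-to-completion task has got   */
} stream_buf;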
 
DiGuru said:
That's why a streaming model would make more sense, as long as you don't use too many data blocks.

As for the task switching: you can give each thread its own local block of registers. Like: thread 0 uses the addresses 00000 to 000ff as registers/local storage, thread 1 uses 00100-001ff, etc. For the streams that go with it, you likewise reserve blocks of local memory, depending on the size needed. And as they are streams, you can use any block size that is a multiple of the structure/record size. You can even adjust that size dynamically as needed, by only asking for new data when there is room, although bigger blocks are more efficient for the DMA transfer.

That way, you can still switch the threads, and run the streaming tasks to completion.

The beauty of CELL's SPEs is that you have lots of ways to try to reach a solution (even though that also means you have to do more work to get even the basic things done in an efficient way).

You can see that the people responsible for the PSTwo's architecture are also behind the CBEA (with the BIG addition of IBM to the mix, even though it seems programmers like the SCE/Toshiba-influenced ideas, such as the SPEs, more than the IBM-driven PPE/PPX cores).
 
Yes, I agree. Although it's not that hard, especially for people who have programmed for the segmented DOS memory model and used the earlier task switchers. With the added benefit that those segments can take up all available RAM, and you can have any number of registers you like.

And when you use streams, you don't suffer from memory fragmentation, as you can just allocate a new buffer when you get new data, sized to fit the current situation.

It's actually a rather nice model, and not hard to implement, as many RISC processors and *nix OSes for them use the same method, like SPARC and the earlier Solaris versions.
 