Multithreaded design

Graham

Hello :-)
We don't seem to have much talk on multithreading, so I thought I'd have a think out loud.

I'm wondering about people's opinions on the following (even if it's 'well, duh!'):

One of the bigger issues I've always faced when trying to think in terms of multithreading is how you share data between processing objects, deal with potential changes in that data's state, and handle the read/write race conditions that can occur during those changes.

Now I'm one to often miss the blindingly obvious, so I initially started thinking in terms of reader/writer locks. But this doesn't account for knowing the state of the object (buffering state... shudder).

Anyway. I'll cut a long story short.

Eventually, I looked at all the code I'd written, and decided it could be broken into 4 parts:

read data,
set data,
communicate with other objects,
listen for communication.

Now, having an 'ohh bugger' moment, which became an 'I've wasted so much time' moment, they can be rearranged:

read data,
communicate with other objects,
listen for communication,
set data.

And funnily enough, they can be split down the middle.


read data and send communication to other objects,
read communication and set data.

Suddenly I had two operations, each of which I could easily run in parallel across all my objects.
So, say there are X threads, with Y objects.

on each thread:

process batches of 10 objects,
run 'read data and send communication to other objects'.

sync to other threads

process batches of 10 objects,
run 'read communication and set data'.
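
Something like this, as a minimal sketch (modern C++ just to be concrete - std::barrier and std::jthread are C++20; Object, ReadPhase and WritePhase are made-up names, and I've split the objects by stride instead of batches of 10 to keep it short):

#include <barrier>
#include <cstddef>
#include <thread>
#include <vector>

struct Object {
    float pos = 0, vel = 1;
    float nextPos = 0;                          // private scratch; only this object touches it
    void ReadPhase()  { nextPos = pos + vel; }  // reads shared state, writes nothing shared
    void WritePhase() { pos = nextPos; }        // writes only its own data
};

int main() {
    std::vector<Object> objects(10000);
    const unsigned threadCount = 4;
    std::barrier sync(threadCount);             // the 'sync to other threads' step

    std::vector<std::jthread> threads;
    for (unsigned t = 0; t < threadCount; ++t)
        threads.emplace_back([&, t] {
            for (std::size_t i = t; i < objects.size(); i += threadCount)
                objects[i].ReadPhase();
            sync.arrive_and_wait();             // no writes start until every read is done
            for (std::size_t i = t; i < objects.size(); i += threadCount)
                objects[i].WritePhase();
        });
}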


I was annoyed with myself because it was simple.
In physics for example, it would boil down to this:

'read data' is find collisions.
'send communication to other objects' is send a message to objects that you are colliding with.

'read communication' is apply physics code to this object (if it didn't send a message) and to those objects that did send a message to this object (recursively).
'set data' is update the position/momentum/etc of the object.


The hardest part is the communication bit, but it's not really that bad. If no collisions are occurring, each object can potentially be computed on its own thread - and once collisions start occurring, the colliding objects all get processed on the same thread, so as to avoid races etc. And the best bit is that it's actually not that hard to implement. The locking is very minor as well, with only sending a message really needing a write lock; however, it should be possible to get around this most of the time.
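
To make the message part concrete, here's a rough sketch (ImpactMsg, the per-object inbox and the 1D overlap test are all invented purely for illustration):

#include <cmath>
#include <mutex>
#include <vector>

struct ImpactMsg { int from; float impulse; };  // hypothetical message payload

struct Body {
    float pos = 0, vel = 0, nextPos = 0;
    std::vector<ImpactMsg> inbox;               // communication received this step
    std::mutex inboxLock;                       // the one small write lock

    // 'read data' + 'send communication': scan read-only state, post messages
    void ReadPhase(std::vector<Body>& all, int self) {
        for (int i = 0; i < (int)all.size(); ++i) {
            if (i == self) continue;
            if (std::fabs(all[i].pos - pos) < 1.0f) {        // toy 1D overlap test
                std::lock_guard<std::mutex> g(all[i].inboxLock);
                all[i].inbox.push_back({self, vel});
            }
        }
        nextPos = pos + vel;                    // private scratch
    }

    // 'read communication' + 'set data': apply messages, write only own state
    void WritePhase() {
        // no lock needed here: all sends finished at the sync point
        for (const ImpactMsg& m : inbox)
            vel -= m.impulse;                   // toy collision response
        inbox.clear();
        pos = nextPos;
    }
};

Only the push_back needs the lock; everything else the read phase touches is read-only, and the write phase only touches the object's own data.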

If you do the batches in the same order each time, the caching should work fairly well too I guess.


Thoughts?
 
I'm not quite sure I understand the point you're trying to make; maybe giving a simple example would help.

Other than that, worry about correctness first and performance later, particularly when it comes to multithreaded code. And make sure you don't parallelize at too fine-grained a level.
As a useful paradigm (which is halfway to the scheduling idea I described in an earlier post), I find that a fixed pool of worker threads picking tasks from a central(-ish) task list works well; see for example NSOperationQueue or my simple (as it doesn't really take care of dependencies) Cocoa Worker Thread.
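
A bare-bones version of that worker-thread idea looks something like the following C++ sketch (no dependency handling, just a shared task queue):

#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

class WorkerPool {
public:
    explicit WorkerPool(unsigned n) {
        for (unsigned i = 0; i < n; ++i)
            workers.emplace_back([this] { Run(); });
    }
    ~WorkerPool() {
        { std::lock_guard<std::mutex> g(m); done = true; }
        cv.notify_all();
        for (auto& w : workers) w.join();
    }
    void Push(std::function<void()> task) {
        { std::lock_guard<std::mutex> g(m); tasks.push(std::move(task)); }
        cv.notify_one();
    }
private:
    void Run() {
        for (;;) {
            std::function<void()> task;
            {
                std::unique_lock<std::mutex> lock(m);
                cv.wait(lock, [this] { return done || !tasks.empty(); });
                if (done && tasks.empty()) return;
                task = std::move(tasks.front());
                tasks.pop();
            }
            task();  // run the task outside the lock
        }
    }
    std::vector<std::thread> workers;
    std::queue<std::function<void()>> tasks;
    std::mutex m;
    std::condition_variable cv;
    bool done = false;
};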
 
Basically, if you can break your work units into read and write phases, you can run all the read code in parallel, then run all the write code in parallel.

So problems with fine-grained threading, etc., don't really matter. If you have 10,000 objects to process, get four threads to process ~2,500 each. They all do:

foreach object:
    object.ReadPhase(); // gather data

The threads then sync, and then run:

foreach object:
    object.WritePhase(); // process data

In the first phase objects are only reading data, so you don't get race problems. In the second, objects are only writing *private* data, so you also don't get race conditions.
The threads can easily work with giant batches, and you don't have to do much locking at all. I currently use batches of a few hundred, which seems to scale very well with thread count.

Task buckets are good too; however, if objects in different tasks suddenly need to communicate, or rely on each other's state, you can get into some very tricky problems.

A simple example is a missile tracking a target (where both missile and target have movement tasks). Done this way, the missile would consistently track the same position frame of the target, but using task lists it could be tracking either the current or the next frame, depending on which task runs first - or worse, get into a race condition if both tasks run at once.


I've been thinking about this a lot. Applying this sort of design to every method in a program would (in my mind) allow an entire program to run with some level of parallelism (automatically?), provided methods that return data don't modify data, and modifying methods don't return data.
 
If your problem can be split thus, this seems to work well, but I feel the subset of problems for which this is useful is rather limited (I can only think of your given example of "game entities" at the moment, which is something where memory access patterns (i.e. only walk your structures once per frame) are paramount).
Also, that each object needs to locally cache the data it needs for its write-phase (if I understand correctly) is not good in memory-constrained situations (= nearly always).

How would you cast a collision query or path-finding in this manner?
 
[maven] said:
If your problem can be split thus, this seems to work well, but I feel the subset of problems for which this is useful is rather limited (I can only think of your given example of "game entities" at the moment, which is something where memory access patterns (i.e. only walk your structures once per frame) are paramount).

Yeah which is exactly what I'm thinking of :)
Multithreading tasks that have little reliance on each other's state is fairly easy. It's just that in the real world nothing is easy :). I'm just thinking aloud, trying to find *simple* methods that work around complex synchronisation problems - even if they do have overhead.

The problem for games - in my view - is that no one object or process is going to kill performance (except rendering of course :p); it's the little bits that all add up. Yet throwing all those little bits into a thread pool is going to cause you all sorts of problems... So maybe breaking them into bits that don't conflict, and batching those, is a better alternative? I don't know. It seems to be working nicely so far, but there may be better options.

I see a lot of talk about how to group objects into non-conflicting batches, but why not sort batches into non-conflicting groups?
Take rendering. I personally store details about the object to be rendered in a little struct (vb, ib, pos, etc), which then gets handed off to the actual rendering thread. Now I could loop over the objects in parallel with running the update logic... However, that needs to be synchronised, so the render item collection thread doesn't get ahead of the logic thread. By breaking the logic up, you can just add the render onto the end of the 'read' phase, which removes the sync problem. This is what I do, and it works really well (there are other tasks going on too :)). The difference is you end up with X lists of render items, instead of one giant list (where X is thread count).
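
Roughly like this (a sketch - the RenderItem fields are just my guess at the 'little struct'):

#include <vector>

struct RenderItem { int vb, ib; float pos[3]; };  // vb/ib/pos as mentioned above (guessed)

struct Entity { int vb, ib; float x, y, z; };

// One render list per worker thread, so appending needs no lock at all.
// At the end of the 'read' phase, each object drops its item into the
// list belonging to the thread that processed it.
void EmitRenderItem(const Entity& e, unsigned threadIndex,
                    std::vector<std::vector<RenderItem>>& lists) {
    lists[threadIndex].push_back({e.vb, e.ib, {e.x, e.y, e.z}});
}

// After the sync, the render thread just walks all X lists instead of one big one.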

Also, that each object needs to locally cache the data it needs for its write-phase (if I understand correctly) is not good in memory-constrained situations (= nearly always).

Well, I'm not saying I disagree with you; however, at least for collision detection it isn't so bad. You only need to use memory when you detect a collision - and in such a case you are going to have bigger performance worries than a tiny bit of memory for an impact message (imo). But yes, nothing works for every problem perfectly.
I guess I'm also thinking in terms of a garbage collected world, where small short-lived allocations are very very cheap.
It's just that - at least in this case - I can see alternative methods getting crazy complex, crazy fast. :) Unless I'm missing something else that is obvious? (Which is why I started the thread: not so much to push what I'm doing as to seek ideas for improvements and alternatives.)

How would you cast a collision query or path-finding in this manner?

Well, these are read operations... It's only when you react to the data that you may write data. That said, if you can batch all the path-finding operations into one group, and your path-finding data is small, you may well see nice cache-friendly performance advantages by doing it all in one hit.
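
As a toy sketch of what I mean (the 'nav data' here is just a corridor of cells, invented purely for illustration):

#include <cstddef>
#include <thread>
#include <vector>

struct PathQuery { int start, goal; std::vector<int> path; };

// A query only reads the shared nav data and writes its own result,
// so any number of queries can safely run at once.
void Solve(const std::vector<int>& navCells, PathQuery& q) {
    const int step = (q.goal > q.start) ? 1 : -1;
    for (int c = q.start; c != q.goal; c += step) {
        if (navCells[c] != 0) return;            // blocked cell; give up (toy rule)
        q.path.push_back(c);
    }
    q.path.push_back(q.goal);
}

int main() {
    std::vector<int> navCells(100, 0);           // shared, read-only nav data
    std::vector<PathQuery> queries = {{0, 9, {}}, {50, 42, {}}, {7, 70, {}}};

    std::vector<std::jthread> pool;
    for (std::size_t i = 0; i < queries.size(); ++i)
        pool.emplace_back([&, i] { Solve(navCells, queries[i]); });
}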

Kind of similar to something I wrote a couple of years ago:
http://forum.beyond3d.com/showthread.php?p=544306#post544306

Ok I've written way too much already, and it's stupidly late (and I have a headache!). But thanks, it's nice to know I'm completely behind the times :p
 
Well, these are read operations... It's only when you react to the data that you may write data. That said, if you can batch all the path-finding operations into one group, and your path-finding data is small, you may well see nice cache-friendly performance advantages by doing it all in one hit.
For collision detection:
What kind of simulation environment would this be in?

Time step 0:
If object A in batch 0 only reads the data state of object B in batch 1, it may not know that object C in batch 2 has a velocity vector that will make it collide with object B at a time that lies between time step 0 and time step 1.

Depending on how complex the simulation is, how fast each object is moving, and how far apart in simulated time the time steps are, the way each object is batched will influence the computation's outcome to a greater or lesser degree.
 
Time step 0:
If object A in batch 0 only reads the data state of object B in batch 1, it may not know that object C in batch 2 has a velocity vector that will make it collide with object B at a time that lies between time step 0 and time step 1.

I think you might be thinking of an even more complex problem :)

Well, under the assumption that object A does read the state of objects B and C, and predicts the path of movement of the two:
The problem I'm trying to avoid is that if those batches were to process in an undetermined order (or concurrently), then object C could already be in time step 1 by the time object A comes to process. So any prediction object A makes about object C's path of movement is going to be one time step out of sync. However, if you do all that prediction and collision testing in one go - for every object in parallel - and then go ahead and apply movement and collision response to progress to time step 1 (in parallel), you avoid this problem.
 
I don't know the exact simulation you are discussing.
Is time discrete or considered continuous?

If it's discrete, then the rules of the simulated world prevent objects from interacting between time steps.

If it's continuous, it's not enough to assume that the starting states of B and C alone determine their end states - not without running the collision detection for B and C first.

I'm also unsure of the mechanism you are using to predict collisions, and how all the logic involved in that won't require an unknown number of read and write phases. I also don't know how it can be called a prediction if it's just processing data, unless it can turn out to be wrong...

At any rate, if object A can predict the final states of B and C with complete accuracy, then there is no need for collision detection for B and C, since a completely accurate solution for B and C would involve knowing the end states for any other objects that could have interacted with them in the simulation.
Each batch could try to communicate its results, but this complicates the process beyond your proposal.

The solution for any given object would be calculated many times by different batches.
Such a solution is doable, but it strikes me as a huge waste on non-trivial problems.
 
I remember reading an article on how Valve manages writing multithreaded code (with demos to reporters of their particle system using a quad core -> real rain).

Generally they do both coarse-grained and fine-grained multithreading: coarse-grained for stuff like GPU setup or the sound system, and fine-grained for stuff like batch processing in their particle engine. I also remember them saying they like using lock-free inter-thread communication (assuming atomic reads/writes). They also say they run one thread per core, and always have one manager thread (so a quad core would have 1 manager + 3 workers).
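
For reference, 'lock-free inter-thread communication (assuming atomic reads/writes)' usually boils down to something like a single-producer/single-consumer ring buffer - a generic C++ sketch, not Valve's actual code:

#include <atomic>
#include <cstddef>

template <typename T, std::size_t N>
class SpscQueue {
public:
    bool Push(const T& v) {                       // called by the one producer thread
        const std::size_t h = head.load(std::memory_order_relaxed);
        const std::size_t next = (h + 1) % N;
        if (next == tail.load(std::memory_order_acquire)) return false;  // full
        buf[h] = v;
        head.store(next, std::memory_order_release);  // publish the slot to the consumer
        return true;
    }
    bool Pop(T& out) {                            // called by the one consumer thread
        const std::size_t t = tail.load(std::memory_order_relaxed);
        if (t == head.load(std::memory_order_acquire)) return false;     // empty
        out = buf[t];
        tail.store((t + 1) % N, std::memory_order_release);
        return true;
    }
private:
    T buf[N];
    std::atomic<std::size_t> head{0}, tail{0};    // producer owns head, consumer owns tail
};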

Multithreaded programming will make your brain explode. I still get headaches from my two-thread code, because you just can't make assumptions about the order of execution. I should have googled before I wrote the code :)
 