Concurrency in game programming: interesting blog entry from a developer

Farid

http://www.elbeno.com/blog/?p=351

We (both internally, and I think, in the wider industry) are making some progress on some systems code being multithreaded. However, it’s not great progress. Of course, one problem is that PCs are still lagging behind next-gen consoles: the min spec target, even for games looking at shipping this time next year, is still a single-core P4 based system. Hyperthreaded if we’re lucky. Multicore PCs are out there, and are getting more numerous, but don’t yet have the market share to support dropping the lower end. And of course both Xbox 360 and PS3 still have more cores than a current top-end PC (6 hardware threads and 2 PPU hardware threads + 6 SPUs respectively).

...

But the major problem area for multithreading is where the actual gameplay happens: the intersection of animation, “high-level” AI, player control and to some extent physics. When I talk about “high-level” AI I mean the decision-making and glue code that controls what entities in the game do. The main game loop of object updates has multiple ordering dependencies among these systems and updating one system for a single object often causes cascading requests to other systems. For example: animation must be posed before physics is resolved (because posing animation may cause an arm to stick through a wall that a character is standing close to, and the physics system will then apply constraints to push the arm back outside the wall). Another example is that AI will decide to change what a character is doing and therefore will drive animation blending. A third example is that AI code will often want to know whether two characters can see each other in order to decide what to do. This involves asking the physics system to do a line-of-sight trace between the objects, and the result is needed to make the AI decision, so the call is typically synchronous.
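To make that last example concrete, the two call shapes look roughly like this. Vec3, Entity, PhysicsWorld and lineOfSightTrace are made-up stand-ins, not code from any actual engine; the point is only the difference between blocking on the trace and picking the answer up later.

#include <future>

// Illustrative stand-ins only; a real engine would have its own math and physics types.
struct Vec3 { float x, y, z; };
struct Entity { Vec3 eye; Vec3 eyePos() const { return eye; } };
struct PhysicsWorld {
    bool lineOfSightTrace(const Vec3&, const Vec3&) const { return true; } // stub
};

// Synchronous form: the AI update blocks until the physics trace returns.
bool canSeeSync(const PhysicsWorld& physics, const Entity& a, const Entity& b) {
    return physics.lineOfSightTrace(a.eyePos(), b.eyePos());
}

// One possible asynchronous form: issue the trace and pick the answer up later
// (e.g. next frame), so the AI update never waits on the physics system.
std::future<bool> canSeeAsync(const PhysicsWorld& physics, const Entity& a, const Entity& b) {
    return std::async(std::launch::async,
                      [&physics, pa = a.eyePos(), pb = b.eyePos()] {
                          return physics.lineOfSightTrace(pa, pb);
                      });
}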

...

So what’s to be done? Well, I have a couple of ideas designed to make it easier for “ordinary engineers” to take advantage of multiprocessing in at least a semi-safe way. The first idea is Software Transactional Memory or STM. You’ll find the literature if you care to look around the web. I think this is the only practical solution to achieving some concurrency in updating the thousands of objects that we need to, each of which touches half a dozen others. And I think we should apply it at the object level. This will require a bit of extra memory, but if done right, the performance gains could be significant. And importantly, it will scale. For the first time, I think we have the potential to update thousands of objects across as many cores as we have. Of course, I don’t yet have any practical experience, and we may find that a new bottleneck (e.g. memory access) springs up.
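To give a flavour of what applying it at the object level could mean, here is a toy sketch of a per-object optimistic update: do the work on a private copy with no locks held, then validate a version counter in a short commit step and retry on conflict. This is only an illustration under those assumptions (a single object with a copyable state struct, and GameObjectState is made up); a real STM tracks read and write sets across many objects.

#include <cstdint>
#include <mutex>

// Hypothetical, copyable blob of per-object data.
struct GameObjectState { float x = 0, y = 0, z = 0; int health = 100; };

class TransactionalObject {
    mutable std::mutex m;        // guards only the brief snapshot/commit, not the work
    uint64_t version = 0;
    GameObjectState state{};
public:
    template <class Fn>
    void atomically(Fn mutate) {
        for (;;) {
            uint64_t v0;
            GameObjectState copy;
            {
                std::lock_guard<std::mutex> lk(m);   // snapshot the current state
                v0 = version;
                copy = state;
            }
            mutate(copy);                            // long work on the private copy, no locks held
            std::lock_guard<std::mutex> lk(m);       // validate and commit
            if (version == v0) { state = copy; ++version; return; }
            // somebody else committed in the meantime: throw the copy away and retry
        }
    }
    GameObjectState read() const {
        std::lock_guard<std::mutex> lk(m);
        return state;
    }
};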

The second idea I have is to enable more multiprocessing of mid- and systems-level code (and perhaps some higher-level code) by functional programming. I bet that more than 90% of the loops we write can be implemented in terms of map and fold. The challenge is finding a way to abstract this in C++ (much as I would like to write games in Haskell, I don’t see it being adopted any time soon - but I may well experiment with a scripting language embedded in Haskell). Imagine if AI and player control engineers could, instead of writing a synchronous for-loop to find the closest enemy, write that same functionality in terms of map and fold (which is easy) and automatically have it be distributed across cores? The beauty of this idea is that the concurrency work is done once to make map and fold themselves multicore, and then (with appropriate forethought and data constraints) everyone can take advantage of that.
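As a rough sketch of what such an abstraction might look like, assuming the map function has no side effects and the fold is associative with identity as its identity element (a real version would go through a job system rather than spawning threads per call):

#include <algorithm>
#include <cstddef>
#include <thread>
#include <vector>

template <class In, class Out, class MapFn, class FoldFn>
Out parallel_map_fold(const std::vector<In>& in, MapFn map, FoldFn fold, Out identity)
{
    unsigned workers = std::max(1u, std::thread::hardware_concurrency());
    std::vector<Out> partial(workers, identity);      // one accumulator per worker
    std::vector<std::thread> pool;
    const std::size_t chunk = (in.size() + workers - 1) / workers;

    for (unsigned w = 0; w < workers; ++w) {
        pool.emplace_back([&, w] {
            const std::size_t begin = w * chunk;
            const std::size_t end   = std::min(in.size(), begin + chunk);
            for (std::size_t i = begin; i < end; ++i)
                partial[w] = fold(partial[w], map(in[i]));   // fold this worker's slice
        });
    }
    for (auto& t : pool) t.join();

    Out result = identity;
    for (const Out& p : partial) result = fold(result, p);   // combine partial results
    return result;
}

The closest-enemy example then becomes a map (distance to the player) and a fold (minimum), something like parallel_map_fold(enemies, distanceToPlayer, [](float a, float b){ return std::min(a, b); }, std::numeric_limits<float>::max()), where enemies and distanceToPlayer are whatever the gameplay code already has.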

So the only thing I need to do now is actually write this code, get some preliminary use cases together, try it out, and then introduce it to my team. Wish me luck…
Efficient concurrent multi-threaded programming is still the principal challenge when writing for multi-core CPUs. I thought this blog entry presented an interesting case of that issue, so it definitely deserved its own thread.
 
" However, it’s not great progress. Of course, one problem is that PCs are still lagging behind next-gen consoles: the min spec target, even for games looking at shipping this time next year, is still a single-core P4 based system. Hyperthreaded if we’re lucky. Multicore PCs are out there, and are getting more numerous, but don’t yet have the market share to support dropping the lower end."


Single cores still dominate on the PC

Valve has updated the results of its hardware survey among users of Steam, and the results speak for themselves. It seems that gamers mostly live in a broadband world, with 2Mbit/s internet connections accounting for a third (32.54%) of all users. In the everlasting battle of AMD versus Intel, currently Intel has the nod with 53.19% of all users on Steam, while AMD held 46.80%. However, the results do not sound so good for Intel, since 76.06% of all users own a single-core processor, predominantly an Intel Pentium 4.
 
This STM memory, would it fit in local storage on Cell?

And is the DMA transfer between SPUs a good way to do updates?

It sounds like a fundamental problem in multicore applications, and that made me think that STI already thought about it when they implemented the DMA channels between local stores (I can't remember the name, but it was something similar to a DMA channel between local stores).
 
This STM memory, would it fit in local storage on Cell?

The local store on the SPU is, well, local, so it doesn't need transactions, since only one SPU can use it.
Transactional memory applies primarily to shared memory architectures.

The most common method for multi-threading so far has been to use locks (mutexes, semaphores, events, etc.). Unfortunately, locking doesn't scale very well.

Alternatively, there is a small class of problems that can be solved with lockless methods, using atomic operations. The data structures that can be operated on with atomic operations are very few, and the algorithms can end up being very tricky.
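To give a flavour, here is a sketch of the classic Treiber-style lock-free stack, where push is just a compare-and-swap retry loop (my own illustration, not from any particular codebase). Pop is left out because of the ABA and memory-reclamation problems, which is exactly where these algorithms get tricky.

#include <atomic>

struct Node { int value = 0; Node* next = nullptr; };

struct LockFreeStack {
    std::atomic<Node*> head{nullptr};

    void push(Node* n) {
        Node* old = head.load(std::memory_order_relaxed);
        do {
            n->next = old;                       // link onto the head we last saw
        } while (!head.compare_exchange_weak(old, n,
                                             std::memory_order_release,
                                             std::memory_order_relaxed));
        // compare_exchange_weak reloads `old` on failure, so the loop retries
        // against the latest head until the swap succeeds.
    }
};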

Transactional memory expands on the atomic, lock-free idea by doing a reserve/commit/retry cycle, but on an arbitrary amount of data.
It is very easy to work with, but currently has a hefty performance price.
My feeling is we'll need some essential hardware support to make this more feasible.
 
However, one of the most important data structures has been made completely lock-free: the hash table. Check out Cliff Click's Google presentation on lock-free concurrent hash tables. Amazingly, the constant factor of his implementation is lower than that of traditional non-concurrent and lock-based implementations, so even on a single thread it is faster.

There are some interesting lock-free, concurrent distributed-memory tree algorithms in the HPC area too.

STM would work much better with HW support (I wrote on B3D a few months ago about how CPUs implemented with checkpoint tables/out-of-order commit might be tweaked to support STM). Azul has shown, for example, how GC can be greatly accelerated by some simple CPU extensions.
 
Amazingly, the constant factor of his implementation is lower than that of traditional non-concurrent and lock-based implementations, so even on a single thread it is faster.

This might be true on a Pentium, maybe, but on the Xbox 360/PS3 atomic ops are easily 10 times slower, not even counting memory barriers, so I have a hard time believing this claim.
Regardless though, I tend to prefer (balanced) binary trees over hash maps, especially when I really need to have the nodes embedded.
I have to look at his hash map implementation, but I assume he's resizing his hash table as needed, which would require allocations, which might not be an option in some cases. I guess one could have a fixed-size hash table, but that would then potentially mess up that "constant" factor and make it more "linear".
 
This might be true on a Pentium, maybe, but on the Xbox 360/PS3 atomic ops are easily 10 times slower, not even counting memory barriers, so I have a hard time believing this claim.

Cliff's algorithm only requires memory barriers during a resize. There is no machine-wide coherent state. In other words, if Thread 1 does put(K,V) at time t1, and Thread 2 does get(K,V) at time t2, with t2 > t1, Thread 2 is not guaranteed to see Thread 1's update. If you want those kinds of guarantees, you can insert a fence yourself. The table by default is meant to be truly concurrent, and if you want extra features, such as happens-before guarantees, you can layer those on top.

Regardless though, I tend to prefer (balanced) binary trees over hash maps, especially when I really need to have the nodes embedded.

You prefer O(log n) over O(1) (when order doesn't matter)? Cliff's algorithm stores both K and V within a cache line, BTW.

I have to look at his hash map implementation, but I assume he's resizing his hash table as needed, which would require allocations, which might not be an option in some cases. I guess one could have a fixed-size hash table, but that would then potentially mess up that "constant" factor and make it more "linear".

Of course, his algorithm is optimized for the 768-core machines that Azul Systems sells, Sun Niagaras, and other hyperthreaded machines. It's the kind of algorithm that loves to chew up gigs and gigs of server RAM, the kind of thing Google would like.

You might not want to run it inside SPUs by any means. Then again, I don't know if I'd want to run STM inside an SPU either.

The point is, hash tables are one of the, if not THE, most important data structures in computer science, and it has been proven that a completely lock-free, low-overhead design exists that scales very, very well on real HW out there in the wild. It's not theory; it's practice.
 
I bet that more than 90% of the loops we write can be implemented in terms of map and fold. The challenge is finding a way to abstract this in C++

I think this hits the nail squarely on the head.

I've done some templated external iterators to spawn threads just like this. It's clearly the way forward.
There are plenty of ways of writing C/C++ in the spirit of functional programming where it matters...

Something very interesting is that optimizing a single thread for in-order cores with long pipelines and poor branching is also about finding parallelism and separating inter-dependencies, so it's a good mindset to be in. The two problems seem to overlap.

Your loops want to be a series of 'batch processes', collecting the interactions into messages which get sorted and consumed in subsequent 'passes'.
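Roughly like this, as a sketch (Entity and DamageMsg are made-up types, just to show the two passes): pass 1 only reads shared state and emits messages, so it could be split across cores, and pass 2 applies them with a single writer per target.

#include <algorithm>
#include <vector>

struct Entity {
    int   target = -1;        // -1 means "not attacking this frame"
    float damage = 0.0f;
    float health = 100.0f;
};
struct DamageMsg { int target; float amount; };

void updateCombat(std::vector<Entity>& entities) {
    std::vector<DamageMsg> messages;

    // Pass 1: gather interactions without mutating anyone.
    for (const Entity& e : entities)
        if (e.target >= 0)
            messages.push_back({e.target, e.damage});

    // Sort so each target's messages are contiguous and the outcome is deterministic.
    std::sort(messages.begin(), messages.end(),
              [](const DamageMsg& a, const DamageMsg& b) { return a.target < b.target; });

    // Pass 2: consume. Each entity is only written here, never during the gather.
    for (const DamageMsg& m : messages)
        entities[m.target].health -= m.amount;
}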
 
I think this hits the nail squarely on the head.

I've done some templated external iterators to spawn threads just like this. It's clearly the way forward.
There are plenty of ways of writing C/C++ in the spirit of functional programming where it matters...

I think he was also referring to the "no side effects" nature of the Haskell language, and that it's hard to do this in C/C++.
 
The problem with Map and Fold (and I do think it's fundamentally a good approach) is getting the granularity right, such that the synchronisation/communication overhead doesn't cripple performance.

My current thought on this is to design interfaces that are fundamentally asynchronous and try to get people to start thinking in terms of asynchronous execution where possible. I would fundamentally ban the use of mutexes for the majority of programmers; most of the time you can design solutions that don't need them and can rely on simple lockless data structures for synchronization.
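For example, a single-producer/single-consumer ring buffer is about the simplest such structure. A sketch, assuming exactly one pushing thread, exactly one popping thread, and a power-of-two capacity (my own illustration, not any particular library):

#include <array>
#include <atomic>
#include <cstddef>

template <class T, std::size_t N>   // N must be a power of two
class SpscQueue {
    static_assert((N & (N - 1)) == 0, "N must be a power of two");
    std::array<T, N> buf{};
    std::atomic<std::size_t> head{0};   // advanced only by the consumer
    std::atomic<std::size_t> tail{0};   // advanced only by the producer
public:
    bool push(const T& v) {             // producer thread only
        const std::size_t t = tail.load(std::memory_order_relaxed);
        if (t - head.load(std::memory_order_acquire) == N) return false;   // full
        buf[t & (N - 1)] = v;
        tail.store(t + 1, std::memory_order_release);
        return true;
    }
    bool pop(T& out) {                  // consumer thread only
        const std::size_t h = head.load(std::memory_order_relaxed);
        if (h == tail.load(std::memory_order_acquire)) return false;       // empty
        out = buf[h & (N - 1)];
        head.store(h + 1, std::memory_order_release);
        return true;
    }
};

An AI system could push its line-of-sight requests into a queue like this and pop the answers a frame later, rather than calling into physics synchronously.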
 
The problem with Map and Fold (and I do think it's fundamentally a good approach) is getting the granularity right, such that the synchronisation/communication overhead doesn't cripple performance.

My current thought on this is to design interfaces that are fundamentally asynchronous and try to get people to start thinking in terms of asynchronous execution where possible. I would fundamentally ban the use of mutexes for the majority of programmers; most of the time you can design solutions that don't need them and can rely on simple lockless data structures for synchronization.

YES! I agree with this post. We are doing this for enterprise computing too, but our problem is recovery (how to recover when something has gone wrong).

The lock-less hash table above is useful too. I have forwarded it to my coworkers.
 
Cliff's algorithm

I'm so glad I ran across this. Very interesting stuff.

The point is, hash tables are one of the, if not THE, most important data structures in computer science, and it has been proven that a completely lock-free, low-overhead design exists that scales very, very well on real HW out there in the wild. It's not theory; it's practice.

True.

I wonder, though, how many times shared hash tables are used as work caches, or as containers of lazily loaded static data. In many cases, paying the overhead of a generically correct implementation is not a good tradeoff (though it may be expedient). While I've had to use some of the tricks Cliff describes, many times I have not had to implement the full hash-table behavior. Other times I've needed completely different behavior so that I can ensure that I perform really complicated workloads once.

Of course, while I can sit here and wish that HashMap would allow me to just synchronize-on-put, or that there were 'long/int'-specific implementations (for key, value, and key-value inputs), mostly I spend time beating up people using HashMaps as caches without synchronization. We've got a long way to go to make a lot of people concurrent-aware. :sigh: I suppose one way to do that is to just build safe libraries in the first place and live with the inefficiencies until they dominate.

Thanks for the info,
-Dave
 