CPU Limited Games, When Are They Going To End?

Certain tasks just don't scale past a certain number of cores because the synchronization latency grows with every core added, and eventually it reaches the point where the latency penalty outweighs the performance gain.

Cores don't inherently add latency; it depends on the cache topology and so on.
 
A very good blog post from Durante (game dev & former modder who worked on fixing the Dark Souls PC port) detailing the CPU optimizations used to essentially triple the fps.

Here are some interesting tidbits regarding multi-core optimizations and high-fps stabilization.

Parallelization Step:

This is by far the most exciting step, and also the one that is really a rather questionable idea. When we started to work on the port, the game had a main update thread, a graphics thread, and several background threads for audio and asynchronous tasks. Of those, only the graphics and main update thread do any real CPU work, and only the main thread was limiting performance.

Looking into that main update thread further, we discovered that a large chunk of time is spent updating all the actors in the scene individually -- that includes characters, monsters, pickups, and basically everything else you can see, as well as some things you can't see, such as event triggers. So the basic idea is rather obvious: perform these updates in parallel.

Of course, in reality it's not quite that simple, since each of these update steps can and will interact arbitrarily with some global system or state that was developed under the assumption that everything is sequential. The surprising thing is really that we got it to work, but a lot of development time went into debugging problems that arose due to this parallelization. Ultimately, due to the level of synchronization required, the actor updates only really scale up to 3 threads -- but the improvement is still substantial, especially since it is most pronounced in the hardest-to-run areas (with more individual actors).
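
To make the idea above concrete, here is a rough, self-contained C++ sketch of the pattern being described: per-actor updates split across a small worker pool (capped at 3 threads, mirroring the scaling limit mentioned in the post), with a lock around a stand-in for the shared global state the updates still touch. The types here (Actor, GlobalEventQueue) are invented for illustration and are not taken from the actual engine; the lock on shared state is exactly the kind of synchronization that limits scaling.

```cpp
#include <algorithm>
#include <cstdio>
#include <mutex>
#include <string>
#include <thread>
#include <vector>

struct Actor {
    float x = 0.0f, velocity = 1.0f;
    void update(float dt) { x += velocity * dt; }    // independent, parallel-safe part
};

// Stand-in for a global engine system that actor updates still poke at.
struct GlobalEventQueue {
    std::mutex m;
    std::vector<std::string> events;
    void push(std::string e) {
        std::lock_guard<std::mutex> lk(m);
        events.push_back(std::move(e));
    }
};

void updateActorsParallel(std::vector<Actor>& actors, GlobalEventQueue& events,
                          float dt, unsigned numThreads = 3) {
    const std::size_t chunk = (actors.size() + numThreads - 1) / numThreads;
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < numThreads; ++t) {
        const std::size_t begin = t * chunk;
        const std::size_t end = std::min(actors.size(), begin + chunk);
        workers.emplace_back([&actors, &events, begin, end, dt] {
            for (std::size_t i = begin; i < end; ++i) {
                actors[i].update(dt);                 // scales freely
                if (actors[i].x > 100.0f)             // touching shared state needs the
                    events.push("trigger fired");     // lock -- this is what caps scaling
            }
        });
    }
    for (auto& w : workers) w.join();
}

int main() {
    std::vector<Actor> actors(10000);
    GlobalEventQueue events;
    updateActorsParallel(actors, events, 1.0f / 60.0f);
    std::printf("updated %zu actors, %zu events queued\n",
                actors.size(), events.events.size());
}
```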

GPU Query Step:

The final step on the GPU optimization journey was more related to improving frametime stability at high framerates rather than pure performance. The image shows a frametime chart comparison, and you can see that not only is the release version much faster, it is also a lot more stable. It is important to note here that this is without a frame limiter or V-sync -- because you usually only see such a flat frametime chart with one of those engaged.

What we did here is purposefully introduce a one-frame-off synchronization point between GPU and CPU progress. My initial thought was that this would improve stability but reduce framerate, and so it would have to be a setting, but in actual testing across several configurations it turns out that it improves both. I'm not 100% sure why, but my theory -- after investigations with Tracy -- is that it has to do with thread scheduling decisions from the OS being improved in this case.

https://steamcommunity.com/games/2731870/announcements/detail/4666382742870026336


[Image: frametime chart comparison]
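
To get a feel for what that one-frame-off synchronization point does to pacing, here is a toy, self-contained C++ sketch where a worker thread stands in for the GPU so it runs anywhere; the real port would use a GPU event query or fence, and all names and timings below are invented. The CPU submits frame N, then blocks until the "GPU" reports frame N-1 complete before moving on.

```cpp
#include <chrono>
#include <condition_variable>
#include <cstdio>
#include <mutex>
#include <thread>

std::mutex m;
std::condition_variable cv;
int gpuCompletedFrame = -1;   // highest frame the "GPU" has finished
int submittedFrame = -1;      // highest frame the CPU has submitted
bool shuttingDown = false;

void gpuThread() {            // stand-in for the GPU consuming submitted work
    int next = 0;
    for (;;) {
        {
            std::unique_lock<std::mutex> lk(m);
            cv.wait(lk, [&] { return submittedFrame >= next || shuttingDown; });
            if (shuttingDown && submittedFrame < next) return;
        }
        std::this_thread::sleep_for(std::chrono::milliseconds(5)); // "rendering"
        {
            std::lock_guard<std::mutex> lk(m);
            gpuCompletedFrame = next;
        }
        cv.notify_all();
        ++next;
    }
}

int main() {
    std::thread gpu(gpuThread);
    for (int frame = 0; frame < 20; ++frame) {
        // CPU-side work for frame N would happen here.
        {
            std::lock_guard<std::mutex> lk(m);
            submittedFrame = frame;                   // submit frame N
        }
        cv.notify_all();
        std::unique_lock<std::mutex> lk(m);
        cv.wait(lk, [&] { return gpuCompletedFrame >= frame - 1; }); // wait for N-1
        std::printf("CPU done with frame %d, GPU has finished frame %d\n",
                    frame, gpuCompletedFrame);
    }
    {
        std::lock_guard<std::mutex> lk(m);
        shuttingDown = true;
    }
    cv.notify_all();
    gpu.join();
}
```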
 
That seems a really poor oversight IMO. We've had dev talks highlighting the need for CPU-optimised engine design for years now, and the development of parallelisable entity-based engines. Even I, in my own efforts, have designed my game around concepts that can be parallelised if/when the need arises. For mainstream studios not to be doing this when that's their paid job, and for independent modders to be able to hack it in themselves, is poor form indeed.

I feel there should be an 'Industry Awards' review at the end of the year that highlights the highs and lows of games; "The Foundries" or something. Ys X would win a Wooden Anvil for not parallelising its main code, and Durante et al would get a Golden Hammer for doing it and showing what's possible and what devs should be doing.

Without shedding light on poor practice, it'll be difficult to pressure devs to amend their ways. When they are named and shamed, they might reconsider their priorities.

Note: 'devs' means the whole studio, including the execs making decisions that affect everything. The finger isn't being pointed at engineers for not doing it right, as we don't know whether it's their call or whether they've requested an engine update and the execs are refusing. Something somewhere in these studios needs to change when CPU utilisation in released games is this poor.
 
At this point it honestly seems like the know-how just isn't there for many companies. I think it's too widespread to be anything else. We have heard reports on these forums that programming classes these days don't even teach quality coding.
 
I don't really know how current college students are learning programming, but one obvious thing is that many of them are not really learning the basics of computer architecture. Even back in my student days (and that's more than 20 years ago) there was already pressure to teach students "something that can be used at work right away", meaning something like MFC. I've heard some programs even teach only languages like JavaScript and Python. It's not that these aren't useful programming languages, but if students only learn these, they probably have no idea how their programs are actually run by the CPU.

One of the problems, I think, is that everything is so complex these days. Today's CPUs are so complex that it's no longer possible to teach students everything about their designs. Back in my day a pipelined CPU was "state-of-the-art" and out-of-order execution was like rocket science. Today even embedded CPUs are superscalar, pipelined, and probably out-of-order. The OS is also much more complex than what we had, and GPUs were not even a thing. It's really much harder for students to grasp what's going on under the hood, even for designs considered basic today.

The hugely increased demand for programmers also worsens the problem. There are still many excellent programmers, but most of them are probably in high-paying jobs. Since the game industry pays less than others, it's understandable that many smart people don't want to stay in it. That was also the case back in my day, but there were many talented programmers who loved games and were willing to stay in the industry despite the lower pay. Today I really can't say.
 
Surely senior positions responsible for high-level design choices like this are 1) experienced and 2) invested, so they go to GDC and read the papers and know best practice? If these choices are being left to recent graduates, there's something seriously wrong!
 
I like what Casey Muratori calls non-pessimized code. It's understanding how a computer works and writing your program in a way that's ideal for how a computer works. It's not performing weird tricks with CPU instructions.

My first university programming course was an object-oriented class in Java. The professor basically told us that we can’t out-optimize a compiler and it was better to write code in a way that was maintainable because the compiler would make it fast.

And now most software is horrendous.
 
A very good blog post on multithreading implementations and problems in games.

The cost of multithreading ...

The CPU synchronizes memory writes from one core so that other cores can also see them if it detects that one core is writing to a memory location that a different core is also writing to or reading from.

That synchronization has a cost, so whenever you write multithreaded programs, it's important to try to avoid having multiple threads write to the same memory, or one thread writing while another reads.

For correct synchronization of operations, CPUs also have specific instructions that can synchronize multiple cores, such as atomic instructions. Those instructions are the backbone of the synchronization primitives used to communicate between threads.

Atomic operations are a specific set of instructions that are guaranteed to work as specified even if multiple cores are doing things at once. Things like parallel queues and mutexes are implemented with them.

Atomic operations are often significantly more expensive than normal operations, so you can't just make every variable in your application atomic, as that would harm performance a lot. They are most often used to aggregate data from multiple threads or to do some light synchronization.
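
To make that last point concrete, here is a small, hedged benchmark sketch (the numbers vary wildly by machine and compiler, and an optimizing compiler may even collapse the local loop; the pattern is the point): eight threads hammering one shared atomic counter, versus each thread accumulating into a local variable and publishing its total once at the end.

```cpp
#include <atomic>
#include <chrono>
#include <cstdio>
#include <functional>
#include <thread>
#include <vector>

constexpr int kThreads = 8;
constexpr int kItersPerThread = 2'000'000;

long long timed(void (*fn)(std::atomic<long long>&), std::atomic<long long>& total) {
    auto t0 = std::chrono::steady_clock::now();
    std::vector<std::thread> ts;
    for (int i = 0; i < kThreads; ++i) ts.emplace_back(fn, std::ref(total));
    for (auto& t : ts) t.join();
    return std::chrono::duration_cast<std::chrono::milliseconds>(
               std::chrono::steady_clock::now() - t0).count();
}

// Every increment is an atomic read-modify-write on shared memory: maximum contention.
void contended(std::atomic<long long>& total) {
    for (int i = 0; i < kItersPerThread; ++i)
        total.fetch_add(1, std::memory_order_relaxed);
}

// Each thread works on its own local variable and aggregates once at the end.
void aggregated(std::atomic<long long>& total) {
    long long local = 0;
    for (int i = 0; i < kItersPerThread; ++i) local += 1;
    total.fetch_add(local, std::memory_order_relaxed);
}

int main() {
    std::atomic<long long> a{0}, b{0};
    std::printf("contended : %lld ms (sum=%lld)\n", timed(contended, a), a.load());
    std::printf("aggregated: %lld ms (sum=%lld)\n", timed(aggregated, b), b.load());
}
```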

Multithreading implementations (Unreal Engine 4) ...

The first and most classic way of multithreading a game engine is to make multiple threads, and have each of them perform their own task.

For example, you have a Game Thread, which runs all of the gameplay logic and AI. Then you have a Render Thread that handles all the code that deals with rendering, preparing objects to draw and executing graphics commands.

Unreal Engine 4 has a Game Thread and a Render Thread as the main ones, plus a few others for things such as helpers, audio, or loading. The Game Thread in Unreal Engine runs all of the gameplay logic that developers write in Blueprints and C++, and at the end of each frame it synchronizes the positions and state of the objects in the world with the Render Thread, which then does all of the rendering logic and makes sure to display them.

While this approach is very popular and very easy to use, it has the drawback of scaling terribly. You will commonly see Unreal Engine games struggle to scale past 4 cores, and on consoles performance is much lower than it could be because the 8 cores are not filled with work.

Another issue with this model is that if one of the threads has more work than the others, the entire simulation will wait for it. In Unreal Engine 4 the Game Thread and Render Thread are synced each frame, so if either of them is slow, both will be slowed, as they run at the same time. A game with heavy Blueprint usage and AI calculations in UE4 will have the Game Thread busy doing work on one core while every other core in the machine sits unused.
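
Here is a minimal, self-contained sketch of that dedicated-thread model (invented types and timings, not UE4 code): a game thread and a render thread hand off one frame snapshot per frame, so the frame rate ends up governed by whichever thread is slower, while every other core idles.

```cpp
#include <chrono>
#include <condition_variable>
#include <cstdio>
#include <mutex>
#include <optional>
#include <thread>

struct FrameSnapshot { int frameIndex; /* object transforms, state, etc. */ };

std::mutex m;
std::condition_variable cv;
std::optional<FrameSnapshot> mailbox;   // at most one frame in flight
bool done = false;

void gameThread(int frames) {
    for (int i = 0; i < frames; ++i) {
        std::this_thread::sleep_for(std::chrono::milliseconds(12)); // gameplay, AI...
        std::unique_lock<std::mutex> lk(m);
        cv.wait(lk, [] { return !mailbox.has_value(); }); // wait for render to take last frame
        mailbox = FrameSnapshot{i};
        cv.notify_all();
    }
    std::unique_lock<std::mutex> lk(m);
    cv.wait(lk, [] { return !mailbox.has_value(); });
    done = true;
    cv.notify_all();
}

void renderThread() {
    for (;;) {
        FrameSnapshot snap;
        {
            std::unique_lock<std::mutex> lk(m);
            cv.wait(lk, [] { return mailbox.has_value() || done; });
            if (done && !mailbox.has_value()) return;
            snap = *mailbox;
            mailbox.reset();
            cv.notify_all();
        }
        std::this_thread::sleep_for(std::chrono::milliseconds(16)); // draw calls...
        std::printf("rendered frame %d\n", snap.frameIndex);
    }
}

int main() {
    auto t0 = std::chrono::steady_clock::now();
    std::thread g(gameThread, 10), r(renderThread);
    g.join(); r.join();
    auto ms = std::chrono::duration_cast<std::chrono::milliseconds>(
        std::chrono::steady_clock::now() - t0).count();
    std::printf("10 frames took %lld ms (paced by the slower 16 ms render thread)\n",
                (long long)ms);
}
```
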
A common way of enhancing this architecture is to move it more toward a fork/join approach, where you have a main execution thread and, at certain points, parts of the work are split between threads. Unreal Engine does this for animation and physics.

While the Game Thread is in charge of the whole game-logic part of the engine, when it reaches the point where it has to do animations, it will split the animation work into small tasks and distribute those across helper threads on other cores.

This way, while it still has a main timeline of execution, there are points where it gets extra help from the otherwise unused cores. This improves scalability, but it's still not good enough, as the rest of the frame is still single-threaded.
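
As a tiny sketch of that fork/join pattern (types invented for illustration, not UE4's actual API): the main thread forks per-character animation work out to helpers with std::async and joins before carrying on single-threaded with the rest of the frame.

```cpp
#include <algorithm>
#include <cstdio>
#include <future>
#include <vector>

struct Character { float pose = 0.0f; };

void animate(Character& c, float dt) { c.pose += dt; }  // independent per character

void animationStage(std::vector<Character>& chars, float dt, unsigned helpers = 4) {
    const std::size_t chunk = (chars.size() + helpers - 1) / helpers;
    std::vector<std::future<void>> joins;
    for (unsigned h = 0; h < helpers; ++h) {                     // fork
        const std::size_t begin = h * chunk;
        const std::size_t end = std::min(chars.size(), begin + chunk);
        joins.push_back(std::async(std::launch::async, [&chars, begin, end, dt] {
            for (std::size_t i = begin; i < end; ++i) animate(chars[i], dt);
        }));
    }
    for (auto& j : joins) j.get();                               // join
    // ...the main thread continues single-threaded with the rest of the frame.
}

int main() {
    std::vector<Character> chars(1000);
    animationStage(chars, 1.0f / 60.0f);
    std::printf("animated %zu characters\n", chars.size());
}
```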

Multithreading implementations (id Tech Engine and Naughty Dog Engine) ...

Mostly since the PS4 generation of consoles, which have 8 very weak cores, architectures have evolved toward trying to make sure all the cores are working on something useful.

A lot of game engines have moved to a task-based system for that purpose. There, you don't dedicate one thread to one job; instead you split your work into small sections and have multiple threads work on those sections on their own, merging the results after the tasks finish.

Unlike the fork/join approach, where one dedicated thread ships off work to helpers, here you do everything on the helpers for the most part.

Your main timeline of operations is created as a graph of tasks to do, and those are then distributed across cores. A task can't start until all of its predecessor tasks have finished. If a task system is used well, it grants really good scalability, as everything automatically distributes to however many cores are available.
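
To show what "a graph of tasks distributed across cores" can look like, here is a stripped-down, hedged sketch of such a scheduler (invented class names, no relation to id Tech or Naughty Dog's fiber-based implementation): each task carries a count of unfinished predecessors, and worker threads run whatever becomes ready.

```cpp
#include <atomic>
#include <condition_variable>
#include <cstdio>
#include <functional>
#include <memory>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

struct Task {
    std::function<void()> work;
    std::atomic<int> pending{0};   // unfinished predecessor tasks
    std::vector<int> successors;   // tasks waiting on this one
};

class TaskGraph {
public:
    int add(std::function<void()> fn) {
        tasks_.push_back(std::make_unique<Task>());
        tasks_.back()->work = std::move(fn);
        return static_cast<int>(tasks_.size()) - 1;
    }
    // 'task' may not start until 'prerequisite' has finished. (No cycle detection.)
    void depends(int task, int prerequisite) {
        tasks_[task]->pending.fetch_add(1);
        tasks_[prerequisite]->successors.push_back(task);
    }
    void run(unsigned numThreads) {
        remaining_ = static_cast<int>(tasks_.size());
        for (int i = 0; i < static_cast<int>(tasks_.size()); ++i)
            if (tasks_[i]->pending == 0) push(i);          // seed with ready tasks
        std::vector<std::thread> workers;
        for (unsigned t = 0; t < numThreads; ++t)
            workers.emplace_back([this] { worker(); });
        for (auto& w : workers) w.join();
    }
private:
    void push(int i) {
        { std::lock_guard<std::mutex> lk(m_); ready_.push(i); }
        cv_.notify_one();
    }
    void worker() {
        for (;;) {
            int i;
            {
                std::unique_lock<std::mutex> lk(m_);
                cv_.wait(lk, [this] { return !ready_.empty() || remaining_ == 0; });
                if (remaining_ == 0) return;               // graph fully executed
                i = ready_.front();
                ready_.pop();
            }
            tasks_[i]->work();                             // run the task itself
            for (int s : tasks_[i]->successors)            // release successors whose
                if (tasks_[s]->pending.fetch_sub(1) == 1)  // predecessors are all done
                    push(s);
            if (remaining_.fetch_sub(1) == 1) {            // last task: wake idle workers
                std::lock_guard<std::mutex> lk(m_);
                cv_.notify_all();
            }
        }
    }
    std::vector<std::unique_ptr<Task>> tasks_;
    std::queue<int> ready_;
    std::mutex m_;
    std::condition_variable cv_;
    std::atomic<int> remaining_{0};
};

int main() {
    TaskGraph g;
    int physics = g.add([] { std::puts("physics"); });
    int anim    = g.add([] { std::puts("animation"); });
    int render  = g.add([] { std::puts("render prep"); });
    g.depends(render, physics);   // render prep waits on physics...
    g.depends(render, anim);      // ...and on animation
    g.run(4);                     // physics and animation can run in parallel
}
```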

A great example of this is Doom Eternal, where you can see it smoothly scaling from PCs with 4 cores to PCs with 16 cores. Some great GDC talks about this approach are Naughty Dog's "Parallelizing the Naughty Dog Engine Using Fibers" and the two Destiny engine talks.

 