CPU Limited Games, When Are They Going To End?

Funny enough, years ago there was a thread about game engines supporting multiple cores and I asked the question:
are game engines now being developed to use a specified number of cores (e.g. 8 cores), or are they being developed to use as many cores as they find (e.g. if they find 8 cores they will use 8 cores, if they find 64 cores they will use 64 cores, etc.)?
I was told it was the latter (might have been Sebbi or Demirug, it was someone involved in game dev (it wasn't Humus)).
 

I think there might be a semantics issue with that type of question and answer. Using X cores isn't really the same as being designed for optimal performance on X cores. And saying something will use as many cores as are available also isn't specific at all in terms of what kind of performance scaling you see from 1 through to X cores.

You already see this with current games: they are all extremely well threaded in the sense that they run a lot of threads. But from the layman's side, what users aren't seeing is what they would consider significant performance scaling (at least in terms of FPS) from 6->8->more cores (or in some cases even just 4->6).

Another aspect worth keeping in mind is that designing to scale performance across as many cores as possible may not actually be optimal in an overall sense, because in order to do so you might run into trade-offs, including lower performance with fewer cores.

IHVs could offer their own explicit APIs for the few developers who want to maximize performance.

I think with something like that, and I feel this way about the current APIs to some extent as well, the theory of how it would work runs into serious issues in practice given the actual dynamics of the PC market.
 
There is probably no reason for AMD to do it. Given Nvidia's market share though, I could see a small pool of developers possibly writing a separate path for RTX GPUs.
 
Using X cores isn't really the same as being designed for optimal performance on X cores. And saying something will use as many cores as are available also isn't specific at all in terms of what kind of performance scaling you see from 1 through to X cores.
While what you said is true, modern game engines don't conform to either: they aren't designed for optimal performance on X cores (according to whoever answered me), and they certainly don't use all the cores they find regardless of performance scaling.
ps: is irregardless a word?
 
There is probably no reason for AMD to do it. Given Nvidia's market share though, I could see a small pool of developers possibly writing a separate path for RTX GPUs.
There have been "separate paths" for all GPUs for a long time now, with Nvidia in particular providing tech which just doesn't work on other IHVs' hardware. A game which is using DLSS or some Lovelace-only RT feature is already using a "separate path" for RTX GPUs.

That being said, the main driving forces behind a modern API are that it is 1) a common programming target for everyone and 2) carries some level of promise of being supported on future h/w+OS without developer intervention. Both of these get fuzzy if there is an IHV-exclusive API, which in turn would make it an undesirable target for any development (and I have severe doubts that such an API would provide any noticeable user-facing value in terms of performance, honestly).
 
Question: is OpenGL still a thing in game development? It's been a long time since I've seen a game using it.
 

Don't think so. Maybe in the mobile space if they're supporting older phones with GL ES. There could be some indie games using it. I still think some people use it in academia/research because they find it more productive, depending on what they're doing.
 
You already see this with current games: they are all extremely well threaded in the sense that they run a lot of threads. But from the layman's side, what users aren't seeing is what they would consider significant performance scaling (at least in terms of FPS) from 6->8->more cores (or in some cases even just 4->6).

Another aspect worth keeping in mind is that designing to scale performance across as many cores as possible may not actually be optimal in an overall sense, because in order to do so you might run into trade-offs, including lower performance with fewer cores.

Yeah, you get into memory issues the more cores you have. You also have to make sure threads that share/send data are close to each other so you can get the full effect of L2 caches.
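A minimal sketch of that idea on Windows (core indices are hardcoded placeholders; a real engine would query the cache topology, e.g. via GetLogicalProcessorInformationEx, before choosing a mask):

```cpp
#include <windows.h>
#include <atomic>
#include <thread>

std::atomic<int> shared_value{0};

// Restrict the calling thread to the logical processors set in 'mask'.
void pin_current_thread(DWORD_PTR mask)
{
    SetThreadAffinityMask(GetCurrentThread(), mask);
}

int main()
{
    std::thread producer([] {
        pin_current_thread(1ull << 0);  // logical processor 0
        for (int i = 1; i <= 1000; ++i)
            shared_value.store(i, std::memory_order_release);
    });

    std::thread consumer([] {
        pin_current_thread(1ull << 1);  // logical processor 1, same cache neighbourhood
        while (shared_value.load(std::memory_order_acquire) < 1000)
            ;  // spin; the shared cache line never has to cross a cluster boundary
    });

    producer.join();
    consumer.join();
}
```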
 
A very interesting thread from a developer at Ubisoft on why games are hard to parallelize.

"Re: game optimization, there is usually a “hot path” of CPU updates - read the input, player + camera physics and animation - that need to happen every frame. Very hard to parallelize, without extra latency / memory. Single-core performance is still important"

"It’s about update order. To be able to say where the camera is, you have to run the animation of the player. To do that, you need to do the physics step. And so on. You can make these run in parallel, if you’re OK with latency - e.g. use the physics sim from the frame before"

"Mostly an observation from the last games I worked on. The bottleneck is a single-threaded chain of updates for the player, rather than the number of entities simulated / rendered. Animation + camera updates are surprisingly expensive, due to e.g. ray casts and others"

"As an example, a camera system can require hundreds of sphere casts to not feel janky. Predicting where it will move, smoothly avoiding obstacles, etc"

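A toy sketch of the "use the physics sim from the frame before" idea described above: the camera and rendering consume frame N-1's physics results while frame N's physics runs concurrently. All names and types here are invented for illustration.

```cpp
#include <future>
#include <vector>

struct Transform { float x, y, z; };

// Stand-in for the expensive per-frame physics step.
std::vector<Transform> simulate_physics(std::vector<Transform> in)
{
    for (auto& t : in) t.y -= 9.81f * (1.0f / 60.0f);
    return in;
}

void update_camera(const std::vector<Transform>&) { /* sphere/ray casts would go here */ }
void render(const std::vector<Transform>&)        { /* build and submit draw calls */ }

int main()
{
    std::vector<Transform> previous(100);            // last frame's physics results
    for (int frame = 0; frame < 600; ++frame)
    {
        // Kick off this frame's physics asynchronously...
        auto next = std::async(std::launch::async, simulate_physics, previous);

        // ...while camera and rendering work from last frame's results,
        // trading one frame of latency for parallelism.
        update_camera(previous);
        render(previous);

        previous = next.get();                       // becomes "last frame" next iteration
    }
}
```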




 
NVIDIA published an interesting blog regarding CPU threads in games. In short: more CPU threads are often actually detrimental to performance, and NVIDIA advises games to reduce the thread count of their worker pools to below the core count of modern CPUs.

Many CPU-bound games actually degrade in performance when the core count increases beyond a certain point, as the benefits of the extra threading parallelism are outweighed by the overhead
On high-end desktop systems with greater than eight physical cores for example, some titles can see performance gains of up to 15% by reducing the thread count of their worker pools to be less than the core count of the CPU
Instead, a game’s thread count should be tailored to fit the workload. Light CPU workloads should use fewer threads
Executing threads on both logical cores of a single physical core (hyperthreading or simultaneous multi-threading) can add latency as both threads must share the physical resource (caches, instruction pipelines, and so on). If a critical thread is sharing a physical core, then its performance may decrease. Targeting physical core counts instead of logical core counts can help to reduce this on larger core count systems
On systems with P/E cores, work is scheduled first to physical P cores, then E cores, and then hyperthreaded logical P cores. Using fewer threads than total physical cores enables background threads, such as OS threads, to execute on the E cores without disrupting critical threads running on P cores by executing on their sibling logical cores
On chiplet-based architectures that do not have a unified L3 cache, threads executing on different chiplets can cause heavy cache thrashing
Core parking has been seen to be sensitive to high thread counts, with short bursty threads failing to trigger the heuristic that unparks cores. Having fewer, longer-running threads helps the core parking algorithms
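As a rough illustration of the "fewer worker threads than physical cores" guidance above, here is one way to count physical (not logical) cores on Windows and size a pool from that. The "physical minus two" margin is just an example value, not something taken from the blog.

```cpp
#include <windows.h>
#include <cstdio>
#include <thread>
#include <vector>

// Counts physical cores by walking RelationProcessorCore records;
// std::thread::hardware_concurrency() would return logical processors instead.
unsigned physical_core_count()
{
    DWORD bytes = 0;
    GetLogicalProcessorInformationEx(RelationProcessorCore, nullptr, &bytes);
    std::vector<char> buffer(bytes);
    if (!GetLogicalProcessorInformationEx(
            RelationProcessorCore,
            reinterpret_cast<PSYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX>(buffer.data()),
            &bytes))
        return std::thread::hardware_concurrency();  // fallback: logical count

    unsigned cores = 0;
    for (DWORD offset = 0; offset < bytes;)
    {
        auto* info = reinterpret_cast<PSYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX>(buffer.data() + offset);
        if (info->Relationship == RelationProcessorCore)
            ++cores;                                  // one record per physical core
        offset += info->Size;
    }
    return cores;
}

int main()
{
    const unsigned physical = physical_core_count();
    // Leave headroom for the OS, driver, audio and streaming threads.
    const unsigned workers = physical > 2 ? physical - 2 : 1;
    std::printf("physical cores: %u, worker threads: %u\n", physical, workers);
}
```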

 
Yah, some games benefit from turning hyperthreading off. I know Warzone was a game where you could alter your render thread count, and people were lowering it to boost framerates. Not sure if the guidance was the same for AMD and Intel, or if it's changed, but there was some number of threads that was a sweet spot and anything higher or lower would hurt performance. This was for people trying to push max frame rates in CPU-limited scenarios like low-graphics 1080p. Some people were using the Process Lasso tool to try to keep threads from moving around too. I'm hoping things like this get improved over time so there is less fussing and better out-of-the-box settings.
 
Warhammer Darktide has a setting that controls the number of worker threads available to the game, and the max is always less than the number of available cores: on my 7800X3D system I have 16 threads, and the game only lets me use 14.
 
There are several issues at play there, and just limiting the thread count below the available number isn't always going to be a win (the OS runs just fine on modern CPUs even with all threads in use). Two big issues at play here are CPU memory bandwidth and the use of SMT/HT, which can be a negative if not optimized for properly. The former is mostly why you get an increase in performance by limiting the thread count used, as the effective cache size per thread becomes larger in that case. It is also why Zen 3D CPUs show such high performance gains in some titles.
 
A rare new game where we can compare DX11 with DX12:
PCGH tested the game too; DX11 provided the best frame pacing and average fps, except on the top 3 cards.

*DirectX 12 is only worth it on the Radeon RX 7900 XTX, GeForce RTX 4080 Super and RTX 4090, to escape the CPU limit. On all other GPUs, DX11 delivers the better overall performance with more even frame output.


The Alone in The Dark reboot was released yesterday with both DX12 and DX11. DX12 is only faster at 1080p; once you go to 1440p and 2160p, DX12 becomes slower on NVIDIA GPUs, while AMD GPUs are unaffected.

 
Sebbi posted a thread about CPU scaling on various CPUs. Reproduced here in full as it's quite a long thread.

Let's talk about CPU scaling in games. On recent AMD CPUs, most games run better on the single-CCD 8-core models; 16 cores doesn't improve performance. Also, Zen 5 was only 3% faster in games, while being 17% faster in Linux server workloads. Why? What can we do to make games scale?

History lesson: Xbox One / PS4 shipped with AMD Jaguar CPUs. There were two 4-core clusters with their own LLCs. Communication between these clusters was through main memory. You wanted to minimize data sharing between the clusters to minimize the memory overhead.

6 cores were available to games, with 2 taken by the OS in the second cluster. So a game had 4+2 cores. Many games used the 4-core cluster to run their thread pool with a work-stealing job system. The second cluster's cores did independent tasks such as audio mixing and background data streaming.

Workstation and server apps usually spawn an independent process per core. There's no data sharing. This is why they scale very well to workloads that require more than 8 cores (more than one CCD). We have to design games similarly today. Code must adapt to CPU architectures.

On a two CCD system, you want to have two thread pools locked on these cores, and you want to push tasks to these thread pools in a way that minimizes the data sharing across the thread pools. This requires designing your data model and communication in a certain way.
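A bare-bones sketch of what a thread pool "locked" to one CCD could look like on Windows. The affinity masks below assume a hypothetical 16-core/32-thread two-CCD part purely for illustration; real code would build them from the reported cache/CCD topology rather than hardcoding them.

```cpp
#include <windows.h>
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

// Worker pool whose threads are all pinned to one CCD's logical processors,
// so tasks submitted to it share that CCD's L3 and never migrate across CCDs.
class PinnedThreadPool
{
public:
    PinnedThreadPool(unsigned thread_count, DWORD_PTR ccd_mask)
    {
        for (unsigned i = 0; i < thread_count; ++i)
            workers_.emplace_back([this, ccd_mask] {
                SetThreadAffinityMask(GetCurrentThread(), ccd_mask);  // stay on this CCD
                run();
            });
    }

    ~PinnedThreadPool()
    {
        { std::lock_guard<std::mutex> lock(mutex_); done_ = true; }
        cv_.notify_all();
        for (auto& w : workers_) w.join();
    }

    void submit(std::function<void()> job)
    {
        { std::lock_guard<std::mutex> lock(mutex_); jobs_.push(std::move(job)); }
        cv_.notify_one();
    }

private:
    void run()
    {
        for (;;)
        {
            std::function<void()> job;
            {
                std::unique_lock<std::mutex> lock(mutex_);
                cv_.wait(lock, [this] { return done_ || !jobs_.empty(); });
                if (jobs_.empty()) return;            // done_ set and queue drained
                job = std::move(jobs_.front());
                jobs_.pop();
            }
            job();
        }
    }

    std::vector<std::thread> workers_;
    std::queue<std::function<void()>> jobs_;
    std::mutex mutex_;
    std::condition_variable cv_;
    bool done_ = false;
};

int main()
{
    // Hypothetical layout: logical processors 0-15 = CCD0, 16-31 = CCD1.
    PinnedThreadPool gameplay_pool(8, 0x0000FFFFull);  // game logic tasks stay on CCD0
    PinnedThreadPool physics_pool(8, 0xFFFF0000ull);   // physics jobs stay on CCD1

    physics_pool.submit([] { /* broadphase, islands, solver... */ });
    gameplay_pool.submit([] { /* AI, scripting, gameplay systems... */ });
}
```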

Let's say you use a modern physics library like Jolt Physics. It uses a thread pool (or integrates to yours). You could create Jolt thread pool on the second CCD. All physics collisions, etc are done in threads which share a big LLC with each other.

Once per frame you get a list of changed objects from the physics engine. You copy transforms of changed physics engine objects to your core objects, which live in the first CCD. It's a tiny subset of all the physics data. The physics world itself will never be accessed by CCD0.

Same can be done for rendering. Rendering objects/components should be fully separated from the main game objects. This way you can start simulating the next frame while rendering tasks are still running. Important for avoiding bubbles in your CPU/GPU execution.

Many engines already separate rendering data structures fully from the main data structures. But they make a crucial mistake. They push render jobs in the same global job queue with other jobs, so they will all be distributed to all CCDs with no proper data separation.

Instead, the graphics tasks should be all scheduled to a thread pool that's core locked to a single CCD. If graphics is your heaviest CPU hog, then you could allocate physics and game logic tasks to the thread pool in the other CCD. Whatever suits your workload.

Rendering world data separation is implemented by many engines already. It practically means that you track which objects have been visually modified and bump allocate the changed data to a linear ring buffer which is read by the render update tasks when next frame render starts.
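A simplified illustration of that pattern (a double buffer of "what changed" records rather than a true ring buffer, with invented names): gameplay code appends changed visual state during simulation, and the render update for the next frame consumes last frame's buffer while a fresh one fills up.

```cpp
#include <cstdint>
#include <vector>

// One record per visually-modified object; the renderer keeps its own object list
// and only ever receives these deltas, never touching the game-side objects.
struct RenderObjectUpdate
{
    uint32_t render_object_id;   // handle into the renderer's object array
    float    world_matrix[16];   // new transform / visual state
};

class RenderUpdateQueue
{
public:
    // Called from gameplay/physics code whenever an object's visual state changes.
    void push(const RenderObjectUpdate& u) { write_->push_back(u); }

    // Called once per frame, where render tasks for the new frame start: hands the
    // renderer everything that changed last frame and opens a fresh write buffer,
    // so simulation of the next frame can proceed while rendering reads this one.
    const std::vector<RenderObjectUpdate>& flip()
    {
        std::swap(write_, read_);
        write_->clear();
        return *read_;
    }

private:
    std::vector<RenderObjectUpdate> buffers_[2];
    std::vector<RenderObjectUpdate>* write_ = &buffers_[0];
    std::vector<RenderObjectUpdate>* read_  = &buffers_[1];
};

int main()
{
    RenderUpdateQueue queue;
    queue.push({ 42u, { 1.0f } });       // object 42 moved this frame
    const auto& changed = queue.flip();  // renderer consumes last frame's changes
    return changed.size() == 1 ? 0 : 1;
}
```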

This kind of design where you fully separate your big systems has many advantages: It allows refactoring each of them separately, which makes refactoring much easier to do in big code bases in big companies. Each of these big systems also can have unique optimal data models.

In a two thread pool system, you could allocate independent background tasks such as audio mixing and background streaming to either thread pool to load balance between them. We could also do more fine grained splitting of systems, by investigating their data access patterns.

Next topic: Game devs historically were drooling over new SIMD instructions. 3DNow! Quake sold AMD CPUs. VMX-128 was super important for Xbox 360, and Cell SPUs for PS3. Intel made mistakes with AVX-512: it was initially too scattered and Intel's E-cores didn't support it.

Game devs were used to writing SIMD code either by using a vec4 library or hand-written intrinsics. vec4 already failed with 8-wide AVX2, and hand-written intrinsics failed with the various AVX-512 instruction sets and varying CPU support. How do we solve this problem today?

Unreal Engine's new Chaos Physics was written with Intel's ISPC SPMD compiler. ISPC allows writing SPMD code similar to GPU compute shaders on the CPU side. It supports compiling the same code to SSE4, ARM NEON, AVX, AVX2 and AVX-512. Thus it solves the instruction set fragmentation.

Unity's new Burst C# compiler aims to do the same for Unity. Burst C# is a C99-style C# subset. The compiler leans heavily on autovectorization. Burst C# has implicit knowledge of data aliasing allowing it to autovectorize better than standard compiler. Same is true for Rust.

However, autovectorization is always fragile no matter how many "restrict" keywords are put in, whether manually or by the compiler. ISPC's programming model is better suited for reliable, near-optimal AVX-512 code generation.

ISPC compiles to C/C++-compatible object files, which are easy to call from your game engine code. Workloads such as culling, physics simulation, particle simulation, sorting, etc. can be done using ISPC to get the AVX2 (8-wide) and AVX-512 (16-wide) performance benefits.
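For a flavour of how ISPC kernels typically plug into engine code: an .ispc file marks a function "export", which gives it C linkage, and the C++ side simply declares and calls it once the ispc-compiled object file is linked in. The kernel name, its signature and the SoA layout below are made up for the example; this snippet won't link without the corresponding .ispc object, it only shows the call pattern.

```cpp
#include <cstdint>
#include <vector>

// Declaration of a hypothetical ISPC frustum-culling kernel. The matching .ispc
// source would implement it 8- or 16-wide depending on the compile target
// (SSE4 / AVX2 / AVX-512), with ispc emitting a dispatch stub to pick the best one.
extern "C" void cull_spheres(const float* centers_x, const float* centers_y,
                             const float* centers_z, const float* radii,
                             const float* planes,      // 6 frustum planes * 4 floats
                             int count, uint8_t* visible);

// Structure-of-arrays layout keeps the loads contiguous for the SIMD lanes.
void cull_visible_objects(const std::vector<float>& cx, const std::vector<float>& cy,
                          const std::vector<float>& cz, const std::vector<float>& r,
                          const float* planes, std::vector<uint8_t>& out_visible)
{
    out_visible.resize(cx.size());
    cull_spheres(cx.data(), cy.data(), cz.data(), r.data(),
                 planes, static_cast<int>(cx.size()), out_visible.data());
}
```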

 
On a two CCD system, you want to have two thread pools locked on these cores, and you want to push tasks to these thread pools in a way that minimizes the data sharing across the thread pools. This requires designing your data model and communication in a certain way.

This is such an obvious thing even in less performance intensive applications that any engine designer not doing this should go back to school. Sad that this even needs to be said.
 

Not necessarily; the problem is that this kind of performance characteristic is fairly recent. It's a bit similar to a NUMA system, but it's not quite, as the memory is shared. These characteristics are also not necessarily exposed by the OS. On a fixed system such as a console it's easier and should be done, because programmers are expected to read the platform docs, but on PC it's less clear. For example, should it be the game engine's responsibility to schedule its threads to specific cores, or should it be the OS's responsibility? It's not really that black and white.
 
Certain tasks just don't scale past a certain number of cores, because latency goes up with every core added and eventually reaches a point where the latency penalty outweighs the performance improvement.

There was a paper on BVH generation on the CPU where scaling nosedived beyond 3 cores because of that latency.

I'll have to try and find that paper later on as it was really interesting.
 