CPU Limited Games, When Are They Going To End?

Recently I fired up some old but gold games, and I was sad to discover that several of them were CPU limited beyond hope, even on a top of the line CPU and GPU. I don't mean CPU limited as in they can't reach 500 fps; I mean they can't even reach 60 fps in complex scenes no matter what GPU or CPU you use. And it all comes down to a single factor: draw distance.

This was obvious in ArmA 3 and ArmA 2: set the draw distance to the max in these games and watch your fps fall along with your GPU usage. The settings are called Visibility (for Objects and Shadows), and they extend the draw distance to tens of kilometers. Operation Flashpoint suffered from the same problem too.

Flight Simulator X and Microsoft Flight Simulator (2020) are the poster children of this: these games can't run reliably on any CPU if you set the draw distance settings to the max, and they will constantly drop below 60 fps.

The problem was present in the original Crysis as well: you can push the draw distance beyond the max through console commands, after which your fps and GPU usage plummet. Crysis Remastered suffered the same fate, though Crytek improved CPU usage with a subsequent patch and moderately improved the situation. Still, scenes with large extended views don't run as well as regular scenes.

The original Dying Light had the same problem too, but the developer intervened and cut the draw distance in half with a patch to get rid of the problem altogether! You can restore it, though, and still suffer low fps and low GPU usage in outdoor areas with an extended draw distance.

RTS games with advanced graphics suffer from this problem too. In the original Company of Heroes and Company of Heroes 2, if you bring the camera down to a third person view to follow a soldier or a tank, your fps and GPU usage plummet hard, as the game renders the whole map in front of you, essentially turning it into a third person action game with max draw distance. You can do the same in the Total War games and get the same result, especially with their huge unit counts; the Total War: Warhammer games have the same problem. Essentially, any RTS game that gives you free control of the camera can show the same result: max draw distance tanks your fps and GPU usage.

Watch Dogs 2, Watch Dogs: Legion, and ARK: Survival Evolved suffer from the same problem, but to a lesser extent. In fact, any game that can be modded to offer greater draw distances than the default will suffer the same fate.

The problem is not limited to any single game, engine, or API; they all suffer. A huge draw distance/object count limits CPU performance to a single thread: CPU utilization falls, the GPU follows, and fps tanks. The problem is compounded by how little single core CPU performance has advanced since the early 2000s; two decades later, we have no hope of resolving the issue with faster hardware, and it doesn't seem resolvable soon either, not even in the next 5 years, unless the developers intervene and patch these games with extensive multi core optimizations. Also, most of these examples are PC exclusive titles, so console development had no effect on them.

As a result, massive draw distances are a rarity in modern titles; very few titles offer them on PC, as developers don't seem to have figured out a way to make them performant enough. Worse yet, in 2022 we seem to have gained a new problem: ray traced games with CPU limitations due to the ray tracing code itself, which is CPU limited even in GPU limited scenes! We've seen this in games such as Gotham Knights (now fixed), The Callisto Protocol, Hogwarts Legacy, and others. It seems to be a recurring theme in 2023: problems of the past rearing their heads again in a different manner. We are once again restricted by CPUs, whether through view distance or ray tracing. There doesn't seem to be a solution on the horizon, and we don't know when things are going to change for the better; we don't even have a plan for that. DX12 seemed to offer some hope in the beginning, but it has since backfired.

I am hoping this thread will document CPU limited cases in PC titles, in the hope that this will put pressure on developers to seek out new solutions.
 
in the hope that this will put pressure on developers to seek out new solutions.
It's not just the developers.
For a large draw distance not to affect the CPU, we need:
* GPU driven rendering and occlusion culling, which we can do (but it's still used rarely).
* RT to support LOD by having control over the BVH data structure, including custom build, modification, and streaming. Which we cannot do.

My current plan to address this is to change LOD at very low granularity. Instead of having many small clusters like Nanite does, I might just use large clusters (see the sketch below).
That way, not so many parts of the geometry need a rebuild per frame. The cost will still be very high, and it's a completely redundant cost. There will also be visible popping, like with the current discrete LOD non-solutions. But it's all I can do to support RT.
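
To make the idea concrete, here is a minimal CPU-side sketch of that coarse-grained LOD selection (the names and the distance-to-LOD heuristic are my own illustration, not the actual implementation): each frame, every large cluster gets a discrete LOD picked from its camera distance, and only clusters whose LOD actually changed get queued for a BVH rebuild.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

struct Cluster {
    float cx, cy, cz;   // cluster center
    int   currentLod;   // LOD whose geometry currently sits in the BVH
};

// Hypothetical heuristic: drop one LOD roughly every time distance doubles past 50 units.
int DistanceToLod(float dist) {
    return dist <= 50.f ? 0 : static_cast<int>(std::log2(dist / 50.f)) + 1;
}

// Returns the indices of clusters whose acceleration structure needs a rebuild this frame.
std::vector<std::size_t> SelectRebuilds(std::vector<Cluster>& clusters,
                                        float camX, float camY, float camZ) {
    std::vector<std::size_t> rebuilds;
    for (std::size_t i = 0; i < clusters.size(); ++i) {
        Cluster& c = clusters[i];
        float dx = c.cx - camX, dy = c.cy - camY, dz = c.cz - camZ;
        int wanted = DistanceToLod(std::sqrt(dx * dx + dy * dy + dz * dz));
        if (wanted != c.currentLod) {   // coarse granularity => few switches per frame
            c.currentLod = wanted;
            rebuilds.push_back(i);
        }
    }
    return rebuilds;                    // typically a handful of clusters, not thousands
}

int main() {
    std::vector<Cluster> clusters = {{0, 0, 0, 0}, {500, 0, 0, 0}, {5000, 0, 0, 0}};
    auto rebuilds = SelectRebuilds(clusters, 0.f, 0.f, 0.f);  // only the two distant clusters switch
    (void)rebuilds;
}
```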

we seem to have gained a new problem: ray traced games with CPU limitations due to the ray tracing code itself
I still don't know exactly why RT causes an increased CPU cost at all. We know the BVH is built on the GPU, so why is this?
I could imagine the top levels of the BVH, or possibly the entire TLAS, being built on the CPU for higher tree quality. But that's just speculation.
Is there any information about this? How does it differ across vendors?
 
I've been playing Battlefield 2042 on Game Pass. At launch I could probably maintain 70-80 fps, even on the lowest settings. There'd be areas where I'd get more, but worst case I'd be consistently dipping down to 70-80. Now, on low, I can maintain probably 160+ fps, and it's actually fun to play. Not sure what they did or when they actually fixed it. The game is several years old now.

It's a consistent problem. I think the main issue is that not all CPU jobs are easy to run in parallel with SIMD, and they're not easy to run in parallel across threads either. Most engines end up with some kind of bottleneck on a single thread. Something about ray tracing, either in the game or in the driver, ends up bottlenecking, and I'm not sure what it is. But you can usually set the game resolution to something absurdly low and the framerate will never go past a certain point. It's the main thing stopping me from playing a lot of games with ray tracing turned on.

In terms of non ray traced games, I think aging game engines with technical debt are a big part of the problem. It's expensive to keep game engines going. Maybe your engine was great for Xbox One and PS4, but suddenly it bottlenecks somewhere when you start increasing world size, density, or fidelity. Everybody is abandoning their custom engines because they can't keep up. If you massively overhaul underlying systems, you also have to redevelop all of the tools that game devs use to create levels, create art, create game logic, debug, inspect, etc.

So mostly I think it's a combination of unsolved problems, technical debt and money.

I know id Software and Naughty Dog have engines that don't have a main thread. I'm not sure exactly how they work, but I'm guessing it's a job system, and I know they're less likely to bottleneck on a single thread. Doom Eternal has insane CPU performance, but it's also not an open world game and has more limited, somewhat linear environments.
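
For illustration, a minimal sketch of what such a job system might look like (my guess at the general shape, not id's or Naughty Dog's actual code): a frame is broken into independent jobs that a pool of workers drains, with a barrier at the end of the frame instead of one privileged main thread doing everything serially.

```cpp
#include <condition_variable>
#include <cstddef>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

class JobSystem {
public:
    explicit JobSystem(unsigned workers = std::thread::hardware_concurrency()) {
        for (unsigned i = 0; i < workers; ++i)
            threads_.emplace_back([this] { WorkerLoop(); });
    }
    ~JobSystem() {
        { std::lock_guard<std::mutex> lk(m_); quit_ = true; }
        wake_.notify_all();
        for (auto& t : threads_) t.join();
    }
    void Submit(std::function<void()> job) {
        { std::lock_guard<std::mutex> lk(m_); jobs_.push(std::move(job)); ++pending_; }
        wake_.notify_one();
    }
    void WaitIdle() {  // frame barrier: block until every submitted job has finished
        std::unique_lock<std::mutex> lk(m_);
        idle_.wait(lk, [this] { return pending_ == 0; });
    }
private:
    void WorkerLoop() {
        for (;;) {
            std::function<void()> job;
            {
                std::unique_lock<std::mutex> lk(m_);
                wake_.wait(lk, [this] { return quit_ || !jobs_.empty(); });
                if (quit_ && jobs_.empty()) return;
                job = std::move(jobs_.front());
                jobs_.pop();
            }
            job();
            std::lock_guard<std::mutex> lk(m_);
            if (--pending_ == 0) idle_.notify_all();
        }
    }
    std::vector<std::thread> threads_;
    std::queue<std::function<void()>> jobs_;
    std::mutex m_;
    std::condition_variable wake_, idle_;
    std::size_t pending_ = 0;
    bool quit_ = false;
};

int main() {
    JobSystem jobs;
    // One "frame": simulation work is expressed as independent jobs that any core can pick up.
    jobs.Submit([] { /* cull visible objects  */ });
    jobs.Submit([] { /* update skeletal anims */ });
    jobs.Submit([] { /* mix audio             */ });
    jobs.WaitIdle();
}
```

Real engines add job dependencies, per-thread queues, and work stealing on top of this, but the basic property is the same: no single thread owns the frame.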
 
An issue here is basically latency, due to the nature of games being real time. We tend to oversimplify and divide game performance into a GPU block and a CPU block, but it's more complex and subdivided than that.

Still keeping it simple: CPU performance, in terms of the operations the cores can actually perform, has increased much more than the speed and latency at which they can be fed. This contributes to why you will often see very low CPU utilization in monitoring tools even when "CPU limited." It's not just latency to system memory either, as latency to the lower levels of cache hasn't exactly improved. Also, with the OP's scenario of increasing view distances, I would think the larger data set pushes against even larger lower level caches, forcing accesses back out to a higher level.

Related to this is why I feel there need to be some reservations about how wide games can really scale in terms of thread count, as spreading the load more evenly across more cores doesn't address the issue of data access latency.

In addition, we have to wonder how much of the performance "bottleneck" lies with the rest of the system. For instance, while PCIe bandwidth has increased, I don't think latency across it has.

Speculating going forward, my guess is there is going to be more and more of a need to decouple game rendering, or really whatever finally gets put on the display, from game logic, and even parts of the game logic from each other. Especially if we are going to drive towards ultra high refresh rates, such as the proposals for 480 Hz and beyond for truly smooth motion.
 
@arandomguy Yeah, there's definitely an issue with memory access patterns leading to poor performance. A lot of the software industry switched to object-oriented design for everything in the early 2000s, including the games industry, but now there's a push to go back to procedural programming so you can have cache and SIMD friendliness in how data is laid out in memory, minimizing the impact of latency in accessing the various levels of memory. Different companies are approaching this in different ways: some going full ECS, where everything is laid out in a uniform pattern, and many going hybrid, selecting OO or procedural for particular engine subsystems so they get the benefits of both where it makes sense. Doom Eternal was one where I remember them saying they actually weren't fully concerned with cache friendliness, as eliminating the main thread was the priority. I'd still love to see how their engine works.
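
As a rough illustration of the ECS end of that spectrum (a toy sketch, not any particular engine's design): an entity is just an index, each component type lives in its own contiguous array, and a "system" is a linear pass over those arrays.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

using Entity = std::uint32_t;

struct Position { float x, y, z; };
struct Velocity { float x, y, z; };

// Components are stored in dense arrays indexed by entity id:
// no per-object heap allocation, no virtual dispatch.
struct World {
    std::vector<Position> positions;
    std::vector<Velocity> velocities;
};

Entity CreateEntity(World& w) {
    w.positions.push_back({0.f, 0.f, 0.f});
    w.velocities.push_back({1.f, 0.f, 0.f});
    return static_cast<Entity>(w.positions.size() - 1);
}

// A "system": one cache- and SIMD-friendly sweep over contiguous data.
void MovementSystem(World& w, float dt) {
    for (std::size_t i = 0; i < w.positions.size(); ++i) {
        w.positions[i].x += w.velocities[i].x * dt;
        w.positions[i].y += w.velocities[i].y * dt;
        w.positions[i].z += w.velocities[i].z * dt;
    }
}

int main() {
    World w;
    for (int i = 0; i < 1000; ++i) CreateEntity(w);
    MovementSystem(w, 1.0f / 60.0f);
}
```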

I kind of bundle the cache and main memory with CPU performance, because their only real purpose is to feed the CPU data; they're tightly coupled. I haven't read it yet, but Richard Fabian's data-oriented design book seems interesting: https://www.dataorienteddesign.com/dodmain/

Decoupling rendering from the CPU would be interesting. I'm not really sure how that would work. Not sure you could ever fully get there, but I guess that would look something like having a single simple draw call per frame to minimize the CPU's effect to near zero, and the GPU drives pretty much everything.
 
* GPU driven rendering and occlusion culling, which we can do (but it's still used rarely).
I am reading that NVIDIA and AMD are gearing up for that with their next generation architectures.

Nanite offers a great step in the right direction, but it's still limited by HLODs and shadow distance, and UE5.1 is unfortunately still limited by the CPU on PC.

but I guess that would look something like having a single simple draw call per frame to minimize the CPU's effect to near zero
DX12 held great promise in that regard, but it fell flat on its face.
 
I am reading that NVIDIA and AMD are gearing up for that with their next generation architectures.

Nanite offers a great step in the right direction, but it's still limited by HLODs and shadow distance, and UE5.1 is unfortunately still limited by the CPU on PC.


DX12 held great promise in that regard, but it fell flat on its face.

Is it a DX12 problem, btw, or is it just too hard to do from an engine PoV?
 
I'm going to guess GPU-driven rendering just has a lot of difficult-to-solve problems, and like I said with CPU limitations in general, there's technical debt in legacy engines, lack of resources, lack of money, etc. that just make it very difficult for companies to transform their engines. Even if the graphics API were perfect, it would probably be a difficult transition for a lot of engines.
 
Is it a DX12 problem, btw, or is it just too hard to do from an engine PoV?
Sadly, all APIs are affected, but DX12 in particular promised less CPU overhead (so less chance of the game becoming single threaded), more multi core utilization, and a vastly higher draw call count than ever before (so drawing more stuff on screen without tremendously hurting performance). So far, none of that has materialized to any great extent.

If not coded for carefully, DX12 actually decreases fps and introduces frequent Pipeline State Object (PSO) compilation stutter.
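
An API-agnostic sketch of the usual mitigation (the Pso, PipelineKey, and CompilePipeline names below are placeholders, standing in for the real D3D12/Vulkan objects and calls): compile every pipeline the level can use at load time or on a background thread, so the per-draw path is only a hash-map lookup and never triggers a driver compile mid-frame.

```cpp
#include <cstdint>
#include <unordered_map>
#include <vector>

struct Pso {};                              // would wrap ID3D12PipelineState / VkPipeline
using PipelineKey = std::uint64_t;          // hash of shaders + vertex format + render state

Pso CompilePipeline(PipelineKey) {
    // Placeholder for the expensive driver compile (the part that hitches if done at draw time).
    return Pso{};
}

class PsoCache {
public:
    // Load time / background thread: compile everything the level's materials will need.
    void Precompile(const std::vector<PipelineKey>& keysUsedByLevel) {
        for (PipelineKey k : keysUsedByLevel)
            cache_.emplace(k, CompilePipeline(k));
    }
    // Per draw: a lookup, never a compile, so no mid-frame stutter.
    const Pso* Find(PipelineKey k) const {
        auto it = cache_.find(k);
        return it == cache_.end() ? nullptr : &it->second;
    }
private:
    std::unordered_map<PipelineKey, Pso> cache_;
};

int main() {
    PsoCache cache;
    cache.Precompile({1, 2, 3});            // state combinations gathered offline or during loading
    const Pso* pso = cache.Find(2);         // the render loop only does this
    (void)pso;
}
```

The hard part in practice is knowing the full set of state combinations up front, which seems to be exactly what many DX12 titles get wrong.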
 
Some more examples: Morrowind has a mod that extends the awful default view distance severalfold (from 1 cell to 60 cells). This guy went from 500 fps to literally 20 fps.


Some other mod was optimized to be operational at 60 cells on a GTX 1070, but the GPU is irrelevant here, as the limiting factor is something else!


And this is the max view distance (99 cells) with no object pop-in (all objects are loaded): almost a slide show!


And I will throw in something else for a quick laugh: here is the original Operation Flashpoint (released over 20 years ago) running at max view distance (the Visibility setting), turning into a literal slide show on an RTX 3080 and eventually crashing!

 
DX12 held great promise in that regard, but it fell flat on its face.
Not sure if that's an API problem or devs being slow to make big changes to huge legacy engines.

My own experience is limited to my GI system, but it does indeed require only one draw call (meaning executing one command buffer per frame for the whole system).
It does many of the things a renderer would do, e.g. LOD selection, culling, BVH updates, even a kind of software rendering.
It's all about doing only what's needed, so there are many lightweight shaders to figure out the work. The heavy work is then processed using indirect dispatches.
Besides executing the command buffer, I only need to upload a data buffer containing the transforms of animated objects. CPU cost is practically zero.

There are two general problems I have here. The lightweight shaders that plan the work can't saturate the GPU; this can be addressed with async compute.
The larger problem is the static command buffer. It needs to dispatch all potential processing tasks, e.g. process tree level 0, then level 1, level 2... level 20.
But at runtime only 3 levels of the tree will actually get some work to do per frame. Thus, all the other dispatches and memory barriers are redundant, but are still processed.
This has a cost. Maybe the useless memory barriers cause cache flushes or things like that. I don't know, but it's noticeable.
And that's where I see the most urgent need for improvements. There are many options, and both CUDA and OpenCL 2.0 prove it would be possible.

Though, in comparison to high level gfx APIs, which completely lack an option to upload command buffers and thus end up requiring CPU<->GPU communication for each dispatch or draw call, the improvement we got from DX12/VK is actually very big. New options, e.g. allowing commands which are not needed to be skipped (like Mantle had), would surely help as well, but maybe not that much anymore in relation.

Besides Assassin's Creed Unity, I don't know of examples of heavily GPU driven games. But that was a DX11 game, and DX12/VK really help with this.
Doom Eternal is probably not an example. They mainly talked about multithreaded rendering. But if you're fully GPU driven, you should no longer need to spend multiple threads on gfx at all.
It surely depends, though. Nothing speaks against combining multi threaded and GPU driven rendering.

To say it again: to me, current low level APIs seem quite good. I don't see a need to replace them already. They give good options. Hard to use, yes, but good.
If we see some performance problems in games, we can only guess at the reasons. Draw distance? API issues? Shader compilation stutter? Traversal stutter? Lazy/incompetent devs? Shitty HW or OS?
Who knows. Likely it's all of that, but with a very low percentage for each.
And if so, it's hard even for the devs to figure out what could be improved.
I mean, say you look at some profiling results and you see a list of 100 tasks, each taking 1% of the runtime. Which task do you pick to optimize? Easily answered: none.
This is indeed the reality with complex software, to me at least. The more complex it gets, the faster you get to this situation. You will optimize some heavy bottlenecks before that, but then those flattened profiling results are what you get.
At this point, your gut might tell you: 'This could be twice as fast.' But there is no way to get there. You would not know where to start.

So there surely is some compromise to be made for all the complexity in current games. Sadly, we can't expect it to be perfectly optimized, I think.
But of course, that's not meant as an excuse to release games which clearly perform badly.
 
Some more examples: Morrowind has a mod that extends the awful default view distance severalfold (from 1 cell to 60 cells). This guy went from 500 fps to literally 20 fps.
Don't forget that this is just expected.
Let's assume a view region in the shape of a square, for simplicity.
1^2 = 1 and 60^2 = 3600, so we draw 3,600 times more stuff than the default.
500 / 20 = 25, so we only got 25 times slower although we have 3,600 times the stuff.

Whatever the bottleneck is (likely draw calls), the game was not designed to deal with it. So those perf results are more good than bad, I would say.
 
@DavidGraham Old games like Operation Flashpoint and Morrowind probably can't leverage SIMD at all, and are probably single threaded. CPU single thread performance progress is slowing down. On top of that, they were never considered well optimized in the first place, and probably have memory access patterns that leave the CPU waiting for data.
 
I know id Software and Naughty Dog have engines that don't have a main thread. I'm not sure exactly how they work, but I'm guessing it's a job system, and I know they're less likely to bottleneck on a single thread. Doom Eternal has insane CPU performance, but it's also not an open world game and has more limited, somewhat linear environments.

Every program has a main thread. I can't really think of any exception....
 
Sebbi indirectly chimed in on the conversation.

In my experience the biggest reason for slow performance is allocating a lot of small objects. Allocation is expensive on modern computers (mutexes between 10-32 threads in consumer CPUs). And this practice scatters objects around in the memory, causing a lot of cache misses.

One reason why people allocate single objects is virtual functions. An interface class is not concrete. You can't put objects implementing an interface into a linear array (std::variant is way too limiting). Thus you allocate them separately and access each through a pointer.

I have been optimizing data structures on both CPU and GPU side. Embedding data in the structure is always a big win compared to storing a pointer or array index in the structure and doing an indirect read (hits cold cache lines).

Also pointers are nowadays 8 bytes each. Storing pointers is not free. In order to fetch data stored elsewhere, you first need to load the pointer and then the data. This adds at least 8 bytes of reads. And allocators align by 16 bytes, causing extra losses for small allocs.

One of the nicest things about my new graphics API is that loading/unloading a game is super fast. People immediately noticed that game load stalls were gone. The new API requires no allocations (amortized) for creating a mesh, texture, shader or other resources.

Single object allocation leads to single object processing funcs. Processing functions taking N contiguous elements run faster than calling single object functions N times. This way you run the pre/post conditions once. Data access is linear and compiler can optimize better.

I am not saying that allocations are bad or virtual functions are bad. Fine grained allocations are bad, and with fine grained allocations you often end up with fine grained virtual function calls (trashing both i$ and d$).

A fast memory allocator (like TLSF) causes up to 6 cache line misses per allocation (and free) operation. But this isn't the biggest cost. The biggest cost is the mutex in MT environment.

I prefer to design the sync points in my programs carefully. Fine grained locks are a poison to performance, but they are also a code smell. It's hard to keep MT program safe if you have a lot of inter-thread dependencies. Allocator is the most fine grained lock in most programs.

I know that there are allocators with thread local portions to reduce the mutex locking cost, but these allocators often fragment more and cost more CPU cycles (when lock cost is not considered). Calling an allocator like this for millions of small objects is super expensive.
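
A toy C++ illustration of that fine-grained vs. contiguous contrast (my example, not sebbi's code): the first version pays a heap allocation, a pointer chase, and a virtual call per object; the second stores concrete objects in one contiguous array and processes N elements in a single non-virtual call.

```cpp
#include <memory>
#include <vector>

// Fine-grained style: every object is a separate allocation reached through
// an interface pointer, so each update is a scattered load plus a virtual call.
struct IEntity {
    virtual ~IEntity() = default;
    virtual void Update(float dt) = 0;
};
struct Projectile : IEntity {
    float x = 0.f, v = 100.f;
    void Update(float dt) override { x += v * dt; }
};

// Contiguous style: one allocation for all projectiles, one function that
// processes N elements linearly; pre/post work is paid once and the loop vectorizes.
void UpdateProjectiles(std::vector<Projectile>& ps, float dt) {
    for (auto& p : ps) p.x += p.v * dt;
}

int main() {
    const int n = 100000;

    // ~100k tiny allocations up front, then 100k pointer chases + virtual calls per frame...
    std::vector<std::unique_ptr<IEntity>> scattered;
    for (int i = 0; i < n; ++i) scattered.push_back(std::make_unique<Projectile>());
    for (auto& e : scattered) e->Update(1.0f / 60.0f);

    // ...versus one allocation and one batched call over contiguous memory.
    std::vector<Projectile> packed(n);
    UpdateProjectiles(packed, 1.0f / 60.0f);
}
```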


 
@DavidGraham Any time you dereference a pointer you can have a cache miss. If you allocate small objects on a heap, those objects may not be near each other. A cache line is typically 64 bytes, but if you are heap allocating two objects that are 32 bytes in size, they may not be picked up within the same read. The first one will be in cache, the second one will miss, and then you have to do a second read to memory, which wastes cycles. The typical thought process for "data oriented design" is to allocate contiguous blocks of memory for the same types of data. If you're doing an operation on similar data, you lay out the data sequentially to avoid cache misses.

Cache align structs. Turn the struct into a struct of arrays that fits within a 64-byte cache line. Create an array of structs, or an array of structs of arrays. It doesn't work for everything, but if you're running the same transformation across many of the same thing, it's a good way of laying things out in memory.
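
A rough sketch of that AoS-to-SoA idea (hypothetical Enemy data, assuming 64-byte cache lines): in the AoS layout a movement pass drags cold AI state through the cache with every enemy, while the SoA layout lets the same pass stream through tightly packed position arrays only.

```cpp
#include <cstddef>
#include <vector>

constexpr std::size_t kCount = 100000;

// Array-of-structs: a position-only pass still pulls the whole struct,
// cold AI state included, through the cache for every enemy.
struct EnemyAoS {
    float x, y, z;
    float health;
    float ai_state[12];   // cold data sharing cache lines with the hot fields
};

// Struct-of-arrays: each field is contiguous, so the movement pass touches
// only the x array and every fetched cache line is fully used.
struct EnemiesSoA {
    std::vector<float> x, y, z;
    std::vector<float> health;
    std::vector<float> ai_state;   // kCount * 12, touched only by the AI pass
};

void MoveAll(EnemiesSoA& e, float dx) {
    for (std::size_t i = 0; i < e.x.size(); ++i)
        e.x[i] += dx;              // linear, cache- and SIMD-friendly
}

int main() {
    std::vector<EnemyAoS> aos(kCount);  // the layout the SoA version replaces

    EnemiesSoA soa;
    soa.x.assign(kCount, 0.f);
    soa.y.assign(kCount, 0.f);
    soa.z.assign(kCount, 0.f);
    soa.health.assign(kCount, 100.f);
    soa.ai_state.assign(kCount * 12, 0.f);
    MoveAll(soa, 1.0f);
}
```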

 
The more things change, the more they stay the same. When I was first getting into more complex coding decades ago, you had to do your own memory allocation, and the best way of handling that was chunking/carving so you never had individual objects fragmenting things to hell and back. Somewhere along the way everyone forgot about that. It became too easy to just let some other library deal with it, or to not even worry about minimizing memory size.
 