Sebbi posted a thread about CPU scaling on various CPUs. Reproduced here in full as it's quite a long thread.
Let's talk about CPU scaling in games. On recent AMD CPUs, most games run better on the single-CCD 8-core models; 16 cores don't improve performance. Also, Zen 5 was only 3% faster in games, while it was 17% faster in Linux server workloads. Why? What can we do to make games scale?
History lesson: Xbox One / PS4 shipped with AMD Jaguar CPUs. There were two 4-core clusters, each with its own LLC. Communication between these clusters went through main memory, so you wanted to minimize data sharing between the clusters to minimize the memory overhead.
6 cores were available to games; 2 in the second cluster were taken by the OS. So a game had 4+2 cores. Many games used the 4-core cluster to run their thread pool with a work-stealing job system. The second cluster's cores did independent tasks such as audio mixing and background data streaming.
Workstation and server apps usually spawn an independent process per core, so there's no data sharing. This is why they scale very well to workloads that require more than 8 cores, i.e. more than one CCD. We have to design games similarly today. Code must adapt to CPU architectures.
On a two-CCD system, you want two thread pools, each locked to one CCD's cores, and you want to push tasks to these thread pools in a way that minimizes data sharing across them. This requires designing your data model and communication in a certain way.
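Core pinning itself is platform specific. Below is a minimal Windows sketch, assuming a CPU where logical processors 0-15 belong to CCD0 and 16-31 to CCD1 (query the real topology with GetLogicalProcessorInformationEx; machines with more than 64 logical processors also need processor groups). The SpawnPinnedWorkers helper and the worker loop are hypothetical stand-ins for your job system.

```cpp
#include <windows.h>
#include <functional>
#include <thread>
#include <vector>

// Hypothetical helper: spawn one worker per logical processor in
// [firstCore, firstCore + coreCount), each pinned to its own core.
std::vector<std::thread> SpawnPinnedWorkers(unsigned firstCore, unsigned coreCount,
                                            std::function<void()> workerMain)
{
    std::vector<std::thread> workers;
    for (unsigned i = 0; i < coreCount; ++i)
    {
        workers.emplace_back([=]
        {
            // Restrict this worker to a single logical processor on the target CCD.
            SetThreadAffinityMask(GetCurrentThread(), DWORD_PTR(1) << (firstCore + i));
            workerMain(); // Runs the job system's worker loop (not shown).
        });
    }
    return workers; // Caller keeps these alive and joins them at shutdown.
}

// Usage: two pools, one per CCD (core ranges are an assumption, see above).
// auto ccd0Pool = SpawnPinnedWorkers(0, 16, GameLogicWorkerLoop);
// auto ccd1Pool = SpawnPinnedWorkers(16, 16, PhysicsWorkerLoop);
```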
Let's say you use a modern physics library like Jolt Physics. It uses its own thread pool (or integrates with yours). You could create the Jolt thread pool on the second CCD, so all physics collision work etc. is done on threads that share a big LLC with each other.
Once per frame you get a list of changed objects from the physics engine. You copy the transforms of the changed physics objects to your core game objects, which live on the first CCD. It's a tiny subset of all the physics data. The physics world itself is never accessed by CCD0.
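A minimal sketch of that per-frame sync point, assuming a hypothetical TransformUpdate list produced by the physics pool on CCD1 and a hypothetical GameObject type owned by game logic on CCD0. Only this small array crosses the CCD boundary.

```cpp
#include <cstdint>
#include <unordered_map>
#include <vector>

// Hypothetical record produced by the physics pool (CCD1) for every body that moved.
struct TransformUpdate { uint32_t objectId; float pos[3]; float rot[4]; };

// Hypothetical game-side object owned by the game logic pool (CCD0).
struct GameObject
{
    float pos[3]; float rot[4];
    void SetTransform(const float p[3], const float r[4])
    {
        for (int i = 0; i < 3; ++i) pos[i] = p[i];
        for (int i = 0; i < 4; ++i) rot[i] = r[i];
    }
};

// Called once per frame after the physics step finishes. Only the small update
// array crosses over; no pointers into the physics world are retained.
void SyncPhysicsToGame(const std::vector<TransformUpdate>& changed,
                       std::unordered_map<uint32_t, GameObject>& gameObjects)
{
    for (const TransformUpdate& u : changed)
    {
        auto it = gameObjects.find(u.objectId);
        if (it != gameObjects.end())
            it->second.SetTransform(u.pos, u.rot);
    }
}
```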
The same can be done for rendering. Rendering objects/components should be fully separated from the main game objects. This way you can start simulating the next frame while rendering tasks are still running, which is important for avoiding bubbles in your CPU/GPU execution.
Many engines already separate the rendering data structures fully from the main data structures. But they make a crucial mistake: they push render jobs into the same global job queue as all other jobs, so the work gets distributed across all CCDs with no proper data separation.
Instead, the graphics tasks should all be scheduled to a thread pool that's core-locked to a single CCD. If graphics is your heaviest CPU hog, then you could allocate the physics and game logic tasks to the thread pool on the other CCD. Whatever suits your workload.
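In other words, route jobs by system instead of through one global queue. A small sketch follows, where TaskPool is a hypothetical stand-in for a job system whose workers are pinned to one CCD as shown earlier, and the system-to-CCD assignment is just one example split.

```cpp
#include <functional>
#include <utility>

struct TaskPool
{
    // Placeholder: a real pool would enqueue to worker threads pinned to one CCD.
    void Push(std::function<void()> job) { job(); }
};

enum class System { GameLogic, Physics, Rendering, Background };

struct Scheduler
{
    TaskPool renderPool; // core-locked to CCD0: all graphics tasks
    TaskPool simPool;    // core-locked to CCD1: physics + game logic

    void Submit(System system, std::function<void()> job)
    {
        switch (system)
        {
        case System::Rendering:  renderPool.Push(std::move(job)); break;
        case System::Physics:
        case System::GameLogic:  simPool.Push(std::move(job));    break;
        case System::Background: simPool.Push(std::move(job));    break; // or load balance between pools
        }
    }
};
```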
Rendering world data separation is implemented by many engines already. In practice it means you track which objects have been visually modified and bump-allocate the changed data into a linear ring buffer, which the render update tasks read when the next frame's rendering starts.
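A simplified sketch of such a buffer, with hypothetical types. It assumes the job-system barrier between simulation and the next frame's render update provides the synchronization (publishing and consuming never overlap), and that the capacity is never exceeded within a frame; a real implementation would handle overflow.

```cpp
#include <atomic>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical per-object record; real engines publish material/visibility changes too.
struct RenderObjectUpdate { uint32_t renderObjectId; float worldMatrix[16]; };

struct RenderUpdateRing
{
    std::vector<RenderObjectUpdate> buffer = std::vector<RenderObjectUpdate>(1 << 16); // power-of-two capacity
    std::atomic<uint64_t> writeCursor{0}; // shared by all simulation threads
    uint64_t readCursor = 0;              // touched only by the render update task

    // Called by simulation tasks whenever an object is visually modified.
    void Publish(uint32_t id, const float matrix[16])
    {
        uint64_t slot = writeCursor.fetch_add(1, std::memory_order_relaxed); // bump allocate a slot
        RenderObjectUpdate& u = buffer[slot & (buffer.size() - 1)];
        u.renderObjectId = id;
        std::memcpy(u.worldMatrix, matrix, sizeof(u.worldMatrix));
    }

    // Called by the render update task when the next frame's render starts,
    // after this frame's simulation tasks have completed (the job barrier is the sync).
    template <typename Fn>
    void Consume(Fn&& apply)
    {
        uint64_t end = writeCursor.load(std::memory_order_relaxed);
        for (; readCursor < end; ++readCursor)
            apply(buffer[readCursor & (buffer.size() - 1)]);
    }
};
```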
This kind of design, where you fully separate your big systems, has many advantages: it allows refactoring each of them separately, which makes refactoring much easier in big code bases at big companies. Each of these big systems can also have its own optimal data model.
In a two-thread-pool system, you could allocate independent background tasks such as audio mixing and background streaming to either thread pool to load balance between them. We could also do more fine-grained splitting of systems by investigating their data access patterns.
Next topic: Game devs historically drooled over new SIMD instructions. 3DNow! Quake sold AMD CPUs. VMX-128 was super important for Xbox 360, and the Cell SPUs for PS3. Intel made mistakes with AVX-512: it was initially scattered across too many instruction subsets, and Intel's E-cores didn't support it.
Game devs were used to writing SIMD code either by using a vec4 library or hand-written intrinsics. vec4 already failed with 8-wide AVX2, and hand-written intrinsics failed with the various AVX-512 instruction subsets and varying CPU support. How do we solve this problem today?
Unreal Engine's new Chaos Physics was written with Intel's ISPC SPMD compiler. ISPC allows writing SPMD code on the CPU side, similar to GPU compute shaders. It supports compiling the same code to SSE4, ARM NEON, AVX, AVX2 and AVX-512. Thus it solves the instruction set fragmentation.
Unity's new Burst C# compiler aims to do the same for Unity. Burst C# is a C99-style C# subset. The compiler leans heavily on autovectorization: Burst C# has implicit knowledge of data aliasing, allowing it to autovectorize better than a standard compiler. The same is true for Rust.
However, autovectorization is always fragile no matter how many "restrict" keywords you add, either manually or via the compiler. ISPC's programming model is better suited for reliable, near-optimal AVX-512 code generation.
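To illustrate the aliasing point, here is a tiny C++ sketch (assuming the non-standard __restrict extension supported by MSVC, GCC and Clang). Without the qualifier the compiler must assume dst and src may overlap and often emits scalar code; with it, the loop can be vectorized, but the compiler may still silently give up, which is exactly the fragility described above.

```cpp
// Without __restrict the compiler must assume dst may alias src, which usually
// blocks vectorization. With __restrict it can emit AVX2/AVX-512 code, but
// whether it actually does remains a compiler heuristic.
void Scale(float* __restrict dst, const float* __restrict src, int count, float factor)
{
    for (int i = 0; i < count; ++i)
        dst[i] = src[i] * factor;
}
```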
ISPC compiles to C/C++-compatible object files, which are easy to call from your game engine code. Workloads such as culling, physics simulation, particle simulation, sorting, etc. can be done using ISPC to get AVX2 (8-wide) and AVX-512 (16-wide) performance benefits.
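A sketch of what the C++ call site looks like. The kernel name, its signature and the file names are hypothetical; ISPC emits a header whose exported functions live in the ispc namespace, and the object file links like any other.

```cpp
#include <cstdint>
#include <vector>

// #include "cull_ispc.h"  // header generated by: ispc cull.ispc -o cull_ispc.o -h cull_ispc.h

// Hypothetical export, declared here for illustration; normally the generated
// header provides this declaration.
namespace ispc
{
    extern "C" void FrustumCullSpheres(const float centers[], const float radii[],
                                       const float planes[], int32_t count,
                                       uint8_t visible[]);
}

void CullScene(const std::vector<float>& centersXYZ, const std::vector<float>& radii,
               const float frustumPlanes[24], std::vector<uint8_t>& visible)
{
    visible.resize(radii.size());
    // One call processes the whole array; the ISPC side runs 8 lanes per
    // instruction when compiled for AVX2 and 16 lanes for AVX-512.
    ispc::FrustumCullSpheres(centersXYZ.data(), radii.data(), frustumPlanes,
                             int32_t(radii.size()), visible.data());
}
```

Compiling with multiple targets (e.g. --target=avx2-i32x8,avx512skx-x16) produces a single object with runtime dispatch, so the same binary uses the widest instruction set the CPU supports.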