What's SoA? Soup of arrays?

Structure of Arrays. Compare the two layouts:
AoS (array of structures):
[object1[name1][location1][hp1][...]]
[object2[name2][location2][hp2][...]]
[object3[name3][location3][hp3][...]]

SoA (structure of arrays):
[names[name1][name2][name3]]
[locations[location1][location2][location3]]
[hps[hp1][hp2][hp3]]
[...[...][...][...]]
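In C++ terms, the two layouts above might look like the following minimal sketch (the ObjectAoS / ObjectsSoA names and the name/location/hp fields are just placeholders taken from the diagram):

```cpp
#include <cstdint>
#include <string>
#include <vector>

// AoS: one struct per object, whole objects stored contiguously.
struct ObjectAoS {
    std::string  name;
    float        location[3];
    std::int32_t hp;
};
std::vector<ObjectAoS> objects_aos;

// SoA: one array per field; object i is reassembled from index i of each array.
struct ObjectsSoA {
    std::vector<std::string>  names;
    std::vector<float>        locations_x, locations_y, locations_z;
    std::vector<std::int32_t> hps;
};
ObjectsSoA objects_soa;
```

A loop that only touches hp now streams through a dense int32_t array instead of striding over whole objects.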
Many programmers have the mindset that code is the problem for parallel execution. They program critical sections / locks to prevent multiple threads from accessing the same piece of code. In reality, code is immutable (read only) and can be freely accessed by as many cores concurrently as needed. Parallelism is purely a data problem. Good data separation is the key to multi-threaded code bases (both safe and efficient ones). Big data blobs that contain lots of loosely coupled data are problematic (deep object-oriented inheritance hierarchies often produce these). The main problem is that humans tend to group data together based on real-life objects and concepts, but in a computer program you should group data together based on functionality / use cases / access patterns. Object-oriented teaching in schools often uses real-world objects as examples, and that is highly problematic. People need to unlearn this to be able to write good code.

The problem with multi-threading isn't really that it's hard; it's more that a lot of programmers don't have the discipline and rigor to do it, and many also still think that Object Oriented Programming is a panacea, which is pretty much the opposite of what you want for making fast programs (multi-threaded or not). (Note that going for SoA and arrays of data vs. lists of objects does help split work for concurrency, but you still have to take care of data dependencies.)
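As a minimal sketch of "parallelism is purely a data problem" (assuming the ObjectsSoA layout from the snippet above; regenerate_hp and the +1 update are made-up placeholders): once the data is separated by access pattern, each thread can own a disjoint slice of one array and no locks are needed anywhere.

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>
#include <thread>
#include <vector>

// Each worker updates only its own [begin, end) slice, so no two threads
// ever touch the same element: no critical sections, no locks.
void regenerate_hp(std::vector<std::int32_t>& hps,
                   std::size_t begin, std::size_t end) {
    for (std::size_t i = begin; i < end; ++i)
        hps[i] += 1;  // placeholder transformation
}

void regenerate_all(std::vector<std::int32_t>& hps, unsigned num_threads) {
    std::vector<std::thread> workers;
    const std::size_t chunk = hps.size() / num_threads;
    for (unsigned t = 0; t < num_threads; ++t) {
        std::size_t begin = t * chunk;
        std::size_t end   = (t + 1 == num_threads) ? hps.size() : begin + chunk;
        workers.emplace_back(regenerate_hp, std::ref(hps), begin, end);
    }
    for (auto& w : workers) w.join();
}
```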
AoS is better for random access, assuming you use all of the struct's data fields. You should of course mix and match AoS and SoA based on your access patterns. AoSoA is often a good idea. Link: https://software.intel.com/en-us/articles/memory-layout-transformations

AoS does have some advantages, notably cache locality of all the properties of an object after one access. However, for most use cases the SoA style is significantly faster.
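A minimal AoSoA sketch along the lines of the linked article (the block width of 8 is an illustrative choice, e.g. one SIMD register of floats, not a recommendation):

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr std::size_t kBlock = 8;  // illustrative block width

// SoA within each fixed-size block...
struct ObjectBlock {
    float        location_x[kBlock];
    float        location_y[kBlock];
    float        location_z[kBlock];
    std::int32_t hp[kBlock];
};

// ...AoS across blocks. Object i lives in objects[i / kBlock], lane i % kBlock.
std::vector<ObjectBlock> objects;
```

This keeps each field SIMD-friendly within a block, while one random access still pulls in all the fields of a handful of neighboring objects.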
A pragmatic point of view would be that a program does nothing but transform data: it is a series of data transformations. You are right that the order of these transformations matters, and it is the code that defines that order. The strictest order dependency (where each instruction depends on the previous one) prohibits all parallelism. The point of my previous post was to discuss false dependencies (not actual ones) caused by bad data layouts and synchronization. If you put all the data regarding a single object into one big data blob (usually a class derived from something called CObject or BaseObject), it becomes either unsafe or slow to access that object from multiple threads, depending on whether you give unguarded access to the same data blob to multiple cores simultaneously or add lots of fine-grained locks. The correct way is to split that big data blob into multiple data structures, allowing you to access each of them simultaneously from multiple threads, safely and without the need for any fine-grained synchronization.

There are exceptions. When the order of processing matters because there are dependencies in the progression of the algorithm(s), you get a very different problem, one you cannot solve with approaches that focus on your data. You have to focus on state, time, latencies and pipelines. When parallelizing data compression that is not fixed-rate, for example, you almost always end up with semaphore-driven pipelines. Progressive data expressions have low-hanging-fruit parallelization (process each "level" in parallel, no locks), but you can get another 50% speed out of them when you manage to fine-grain pipeline the dependencies between the "levels" and run all levels in parallel (x and y as well as z, thinking in the dimensionalities of a multi-resolution image), as in the sketch below.
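A minimal sketch of such a semaphore-driven pipeline (C++20; the two stages and the integer "items" are made-up stand-ins for two dependent "levels"): stage B may start on item i as soon as stage A has finished it, so the stages overlap instead of running strictly one after the other.

```cpp
#include <semaphore>
#include <thread>
#include <vector>

constexpr int kItems = 64;
std::vector<int> data(kItems);
std::counting_semaphore<kItems> ready{0};  // counts items finished by stage A

void stage_a() {
    for (int i = 0; i < kItems; ++i) {
        data[i] = i * 2;   // placeholder for "process item i at level N"
        ready.release();   // publish: item i is done
    }
}

void stage_b() {
    for (int i = 0; i < kItems; ++i) {
        ready.acquire();   // block until stage A has finished item i
        data[i] += 1;      // placeholder for the dependent level N+1 work
    }
}

int main() {
    std::thread a(stage_a), b(stage_b);
    a.join();
    b.join();
}
```

Because stage A finishes items in order, the i-th successful acquire in stage B guarantees that data[i] has already been written, with no locks on the data itself.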
An entire game engine behaves very much like a progressive "image" generator. Pieces of the system can be parallelized easily, mostly the ones with lots of data; others not so much. And when you want to go from fluctuating 2-3 core utilization to load-balanced full utilization of 16 cores, you need to know a lot more than how to lay out data.
(I will stop my rambling now, since this is the cheap 16-core CPU thread, not a thread about writing good code for 16-core CPUs.)
It does give a glimpse of why 16-core machines aren't sold like potatoes, though.
Look at the ARM ecosystem: there are plenty of 8-core CPUs at very low prices. And these cores outperform the x86 Jaguar cores.

For one, the A57 is mostly equal to Jaguar, at least in the browser benchmarks I could find (the error margin is huge, though, with varying browser/test versions and varying A57 clocks).