What's SoA? Soup of arrays?

Roderic answered, but I'll expand a little:

Let's say my game world contains lots of actors, each of which has the properties name, location, hp (etc...).

Array of structures means that in memory they are laid out as follows:

Code:
[object1[name1][location1][hp1][...]] 
[object2[name2][location2][hp2][...]]
[object3[name3][location3][hp3][...]]

While in structures of arrays style they would be:

Code:
[names[name1][name2][name3]]
[locations[location1][location2][location3]]
[hps[hp1][hp2][hp3]]
[...[...][...][...]]

AoS does have some advantages, notably cache locality of all properties of an object after one access. However, for most use cases the SoA style is significantly faster.
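
To make the difference concrete, here is a minimal C++ sketch of both layouts (the actor fields and the apply_poison pass are just illustrative, not from any particular engine):

Code:
#include <string>
#include <vector>

// Array of Structures: all of an actor's fields sit next to each other in memory.
struct ActorAoS {
    std::string name;
    float x, y, z;   // location
    int hp;
};
std::vector<ActorAoS> actors_aos;

// Structure of Arrays: each field gets its own contiguous array, index i = actor i.
struct ActorsSoA {
    std::vector<std::string> names;
    std::vector<float> xs, ys, zs;   // locations
    std::vector<int> hps;
};
ActorsSoA actors_soa;

// A pass that only touches hp wastes far fewer loaded bytes in the SoA layout:
void apply_poison(ActorsSoA& a, int damage) {
    for (int& hp : a.hps)
        hp -= damage;   // contiguous ints, every cache line fully used
}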
 
The problem with multi-threading isn't really that it's hard, it's more that a lot of programmers don't have the discipline and rigor to do it, and many also still think that Object Oriented Programming is a panacea, which is pretty much the opposite of what you want when writing fast (multi-threaded or not) programs. (Note that going for SoA and arrays of data vs lists of objects does help split work for concurrency, but you still have to care about data dependencies.)
Many programmers have the mindset that code is the problem for parallel execution. They program critical sections / locks to prevent multiple threads accessing the same piece of code. In reality, code is immutable (read only), and can be freely accessed by as many cores concurrently as needed. Parallelism is purely a data problem. Good data separation is the key to multi-thread code bases (both safely and efficiently). Big data blobs that contain lots of loosely coupled data are problematic (deep object oriented inheritance hierarchies often produce these). The main problem is that humans tend to group data together based on real life objects and concepts, but in a computer program you should group data together based on functionality / use cases / access patterns. Object oriented teaching in schools often uses real world objects as examples, and that is highly problematic. People need to unlearn this to be able to write good code.

The classic example of humans messing things up is the Animal -> Bird -> XXX inheritance hierarchy. A Bird can fly(), but there are some birds that cannot. Inheritance in computer programs should always be based on functionality, not on grouping (especially an arcane species grouping system made by non-programmers). In this context, the class should inherit Flying instead of Bird, if that is what it does. As a bonus, functionality-based inheritance also increases code reusability (Lizards that are Flying can use the same code as Birds to fly()). This is also one of the reasons why people don't know how to properly split data into components (in an entity/component model). You should always look at the processing call sites and split the data based on those. Grouping data based on what feels good is not a good idea. The component model is an SoA layout, but often when technical people talk about SoA layout, they mean splitting data in a very fine grained way for SSE/AVX (16/32 bit lanes).
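
As an illustration of inheriting Flying instead of Bird (all type names here are made up):

Code:
#include <vector>

// Functionality-based inheritance: anything that flies inherits Flying,
// so the flight code is written once and reused.
struct Flying {
    float altitude = 0.0f;
    float speed = 10.0f;
    void fly(float dt) { altitude += speed * dt; }   // shared flight code
};

struct Sparrow   : Flying { /* sparrow data */ };     // a bird that flies
struct Pterosaur : Flying { /* lizard data */ };      // a "lizard" reusing the same fly()
struct Penguin            { /* bird data */ };        // a bird that cannot fly: no Flying at all

// In an entity/component model the same idea becomes a Flying component:
// all Flying components live in one array and are updated in a single pass.
std::vector<Flying> flying_components;

void update_flight(std::vector<Flying>& comps, float dt) {
    for (Flying& f : comps)
        f.fly(dt);
}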

Object oriented programming also often emphasizes that classes are black boxes, and virtual function calls are one instance of this problem. When you call a virtual function, you don't know what code gets called and you don't know what data gets accessed. It is impossible to ensure (in a robust and quick way) that data races/hazards do not occur, if you don't know what data your code is accessing at each call site.
AoS does have some advantages, notably cache locality of all properties of an object after one access. However, for most use cases the SoA style is significantly faster.
AoS is better for random access, assuming you use all of the struct's data fields. You should of course mix and match AoS and SoA based on your access patterns. AoSoA is often a good idea. Link: https://software.intel.com/en-us/articles/memory-layout-transformations.
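
And a hypothetical AoSoA sketch: the struct holds small fixed-size arrays, and you keep an array of those structs (the 8-float block width is an assumption, sized for 8-wide AVX; pick it per target ISA):

Code:
#include <vector>

// One "structure" holds small fixed-size arrays (8 lanes each)...
struct ParticleBlock {
    float x[8], y[8], z[8];      // positions
    float vx[8], vy[8], vz[8];   // velocities
};

// ...and we keep an array of those structures.
std::vector<ParticleBlock> particles;

void integrate(std::vector<ParticleBlock>& blocks, float dt) {
    for (ParticleBlock& b : blocks)
        for (int lane = 0; lane < 8; ++lane) {   // trivially vectorizable inner loop
            b.x[lane] += b.vx[lane] * dt;
            b.y[lane] += b.vy[lane] * dt;
            b.z[lane] += b.vz[lane] * dt;
        }
}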
 
In reality, code is immutable (read only), and can be freely accessed by as many cores concurrently as needed. Parallelism is purely a data problem. Good data separation is the key to multi-thread code bases (both safely and efficiently).

There are exceptions. When the order of processing matters because there are dependencies in the order of the progression of the algorithm(s), then you get a very different problem, which you cannot solve with approaches that focus on your data. You have to focus on state, time, latencies and pipelines. When parallelizing data compression that is not fixed rate, for example, you almost always end up with semaphore-driven pipelines. Progressive data representations have low-hanging-fruit parallelization (process each "level" in parallel, no locks), but you can get another 50% speed out of it when you manage to fine-grain pipeline the dependencies between the "levels" and run all levels in parallel (xy as well as z, thinking in the dimensionalities of a multi-resolution image).
An entire game engine behaves very much like a progressive "image" generator. Pieces in the system can be parallelized easily, mostly the ones with lots of data; others not so much. And when you want to go from 2-3 core fluctuating utilization to 16 core full utilization, load balanced, you need to know way more than how to lay out data.
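
For illustration, a minimal sketch of a semaphore-driven two-stage pipeline (C++20; the stage names and the stand-in per-block "work" are hypothetical, a real compressor would have more stages and real block data):

Code:
#include <array>
#include <cstdio>
#include <semaphore>
#include <thread>

constexpr int kSlots  = 4;    // blocks that may be in flight at once
constexpr int kBlocks = 16;   // total blocks to process

std::array<int, kSlots> ring{};                       // ring buffer between the stages
std::counting_semaphore<kSlots> free_slots{kSlots};   // slots stage A may fill
std::counting_semaphore<kSlots> full_slots{0};        // slots stage B may drain

int main() {
    std::thread stage_a([] {                 // e.g. a "transform/model" stage
        for (int i = 0; i < kBlocks; ++i) {
            free_slots.acquire();            // wait until a slot is free
            ring[i % kSlots] = i * 2;        // stand-in for the real per-block work
            full_slots.release();            // hand the block to stage B
        }
    });

    std::thread stage_b([] {                 // e.g. an "entropy coding" stage
        for (int i = 0; i < kBlocks; ++i) {
            full_slots.acquire();            // wait until a block is ready
            std::printf("block %d -> %d\n", i, ring[i % kSlots]);
            free_slots.release();            // give the slot back to stage A
        }
    });

    stage_a.join();
    stage_b.join();
}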
 
There are exceptions. When the order of processing matters because there are dependencies in the order of the progression of the algorithm(s), then you get a very different problem, which you cannot solve with approaches that focus on your data. You have to focus on state, time, latencies and pipelines. When parallelizing data compression that is not fixed rate, for example, you almost always end up with semaphore-driven pipelines. Progressive data representations have low-hanging-fruit parallelization (process each "level" in parallel, no locks), but you can get another 50% speed out of it when you manage to fine-grain pipeline the dependencies between the "levels" and run all levels in parallel (xy as well as z, thinking in the dimensionalities of a multi-resolution image).
An entire game engine behaves very much like a progressive "image" generator. Pieces in the system can be parallelized easily, mostly the ones with lots of data; others not so much. And when you want to go from 2-3 core fluctuating utilization to 16 core full utilization, load balanced, you need to know way more than how to lay out data.
A pragmatic point of view would be that a program does nothing but transform data. It is a series of data transformations. You are right that the order of these transformations matters, and it is the code that defines that order. The strictest order dependency (each instruction depends on the previous one) prohibits all parallelism. The point of my previous post was to discuss false dependencies (not actual ones) caused by bad data layouts and synchronization. If you put all the data regarding a single object into one big data blob (usually a class derived from something called CObject or BaseObject), it becomes either unsafe or slow to access that object from multiple threads, depending on whether you give unguarded access to the same data blob to multiple cores simultaneously or add lots of fine grained locks. The correct way is to split that big data blob into multiple data structures, allowing you to access each of them simultaneously from multiple threads safely and without the need for any fine grained synchronization.
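
A small sketch of that split (type and function names are hypothetical): each system owns exactly the data it touches, so two threads can update the same set of objects without any locks:

Code:
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

// Each system owns exactly the data it needs, instead of sharing one big Object blob.
struct PhysicsData   { std::vector<float> x, y, vx, vy; };
struct AnimationData { std::vector<float> frame, speed; };

void update_physics(PhysicsData& p, float dt) {
    for (std::size_t i = 0; i < p.x.size(); ++i) {
        p.x[i] += p.vx[i] * dt;
        p.y[i] += p.vy[i] * dt;
    }
}

void update_animation(AnimationData& a, float dt) {
    for (std::size_t i = 0; i < a.frame.size(); ++i)
        a.frame[i] += a.speed[i] * dt;
}

void update_frame(PhysicsData& phys, AnimationData& anim, float dt) {
    std::thread t1(update_physics, std::ref(phys), dt);     // disjoint data,
    std::thread t2(update_animation, std::ref(anim), dt);   // so no locks are needed
    t1.join();
    t2.join();
}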

Data amplification/reduction can often be implemented with an atomic counter. If you are dealing with an extremely fine grained problem, such as sorting (all elements can be in random order in memory -> the result might interleave them in any way), prefix sums are often a good solution. For example a radix sorter (32 bit keys, 8 bits per pass = 4 passes) has only four synchronization points ("barriers"), one after each pass. As each core gets an identical amount of work, the algorithm has almost linear scaling (it is actually slightly superlinear, since the additional L1/L2 caches of each core improve the data accesses). The algorithm is dead simple (for each core, times 4): calculate a histogram, sync, prefix sum, move items.

Radix sort is also a good example of why knowing your data is important. A generic comparison based sorting algorithm's running time is O(n log n). Radix sorting's running time (for fixed length keys) is O(n). If you know that all your keys are fixed length, you can use a faster algorithm. As a bonus, radix sorting is trivial to parallelize (even on GPUs).
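
For reference, a minimal single-threaded sketch of the passes described above (histogram, prefix sum, move/scatter); in the parallel version each core runs the same three steps on its own slice of the keys, with a barrier between the steps:

Code:
#include <cstdint>
#include <vector>

// One 8-bit pass of an LSD radix sort: histogram, exclusive prefix sum, scatter.
void radix_pass(const std::vector<std::uint32_t>& in,
                std::vector<std::uint32_t>& out, int shift) {
    std::uint32_t count[256] = {};

    for (std::uint32_t key : in)                  // 1. histogram of the current byte
        ++count[(key >> shift) & 0xFF];

    std::uint32_t sum = 0;                        // 2. exclusive prefix sum -> bucket start offsets
    for (int i = 0; i < 256; ++i) {
        std::uint32_t c = count[i];
        count[i] = sum;
        sum += c;
    }

    for (std::uint32_t key : in)                  // 3. stable scatter into the buckets
        out[count[(key >> shift) & 0xFF]++] = key;
}

void radix_sort(std::vector<std::uint32_t>& keys) {
    std::vector<std::uint32_t> tmp(keys.size());
    for (int pass = 0; pass < 4; ++pass) {        // 4 passes x 8 bits = 32-bit keys
        radix_pass(keys, tmp, pass * 8);
        keys.swap(tmp);
    }
}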

We have found that in most cases where you write complex branchy code, you should instead split that code by branch. Splitting by branch makes it easier to understand the data (in addition to removing branch prediction stalls), and allows the programmer to split the data as well. Split data is more cache friendly (a higher percentage of actually used bytes per loaded cache line) and is often easier to parallelize. Some people might argue that data driven design is premature optimization, but this is not true. Good data splitting almost always improves the code quality and makes it easier to maintain.
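
A tiny made-up example of splitting by branch: the per-element branch disappears and each loop only touches the bytes it actually needs:

Code:
#include <cstddef>
#include <vector>

// Before (illustrative): one array, a data-dependent branch on every iteration.
void update_mixed(std::vector<float>& hp, const std::vector<bool>& shielded, float dmg) {
    for (std::size_t i = 0; i < hp.size(); ++i) {
        if (shielded[i]) hp[i] -= dmg * 0.5f;   // shielded units take half damage
        else             hp[i] -= dmg;
    }
}

// After: shielded and unshielded units live in separate arrays; each loop is
// branch-free and only touches the data it actually needs.
void update_split(std::vector<float>& shielded_hp, std::vector<float>& unshielded_hp, float dmg) {
    for (float& hp : shielded_hp)   hp -= dmg * 0.5f;
    for (float& hp : unshielded_hp) hp -= dmg;
}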

(I will stop my rambling now, since this is the cheap 16 core CPU thread, not a thread about writing good code for 16 core CPUs)
 
It gives a glimpse of insight into why 16-core machines are not sold like potatoes, though. :)

If they were priced like potatoes, they would sell like potatoes for sure.

We need more competition or CPU alternatives on the desktop though; that is the fundamental issue.
Look at the ARM ecosystem: there are plenty of 8 core CPUs at a very low price.
And these cores outperform the x86 Jaguar cores.

It probably needs another super chip from Apple to open our eyes.
 
Look at the ARM ecosystem: there are plenty of 8 core CPUs at a very low price.
And these cores outperform the x86 Jaguar cores.
For one, A57 is mostly equal to Jaguar, at least in the browser benchmarks I could find (the error margin is huge, though, with varying browser/test versions and varying A57 clocks).
For two, most of those 8 cores are actually 8xA53, which (for most applications) is slower than even 2xKrait.
And the ones that are 4xA53+4xA57 are in flagship phones, which often cost more than a lot of quad core laptops and a lot of Core laptops, with the dual Core (Ivy Bridge, Haswell, Broadwell, etc.) CPUs being way faster than quad/octa phone SoCs.

Besides which, if you want a cheap quad, you can get one – the Atom/Kabini option costs like $70. Waaaaaay cheaper than a dual i3.
 
If you want to know how fast various ARM and x86 based multi-cores perform on a well threaded and SIMD optimized application, have a look at this App.
(Not to the point, but an A53 is way faster than a Krait at a similar clock frequency.)
Unfortunately, Atom is no competition here, and it gets worse in 64 bit.
 