A Comparison: SSE4, AVX & VMX

Discussion in 'PC Industry' started by pjbliverpool, Apr 28, 2012.

  1. pjbliverpool

    pjbliverpool B3D Scallywag
    Legend

    Joined:
    May 8, 2005
    Messages:
    7,454
    Location:
    Guess...
    As it stands, it can be argued that there are three major CPU SIMD instruction sets in use for modern high-end gaming (okay, ignoring SPUs).

    Those being:

    SSE4: Used on Penryn and Nehalem (in slightly different configurations)
    AVX: Used on the very latest PC CPU architectures, namely Sandy Bridge, Bulldozer and Ivy Bridge.
    VMX: Used in Xenon (x3) and, in a slightly reduced form, in the PPU on Cell

    So given the same theoretical throughput, what are the general thoughts about which of these instruction sets is best suited for modern gaming?

    Obviously AVX has twice the theoretical single-precision throughput of SSE4 and VMX per clock, so let's say we're using as near as dammit 100% vectorised code on the following hypothetical CPUs:

    1x Penryn core @ 3.2 GHz
    1x Sandy Bridge core @ 1.6 GHz
    1x Xenon core @ 3.2 GHz

    Any views on how these would fare against one another?
     
  2. Davros

    Legend

    Joined:
    Jun 7, 2004
    Messages:
    12,921
    Is AVX used in any games?
    Or is it transparent to the programmer? I'm guessing SSE3 is needed for older CPUs; my CPU doesn't support SSE4.
     
  3. rpg.314

    Veteran

    Joined:
    Jul 21, 2008
    Messages:
    4,298
    Location:
    /
    Then all of them suck. You should be using a GPU.
     
  4. mczak

    Veteran

    Joined:
    Oct 24, 2002
    Messages:
    2,914
    You cannot really say which instruction set is faster, as that depends on the implementation. The latency and throughput of e.g. SSE2 instructions vary greatly between different CPUs.
    Furthermore, the AVX instruction set isn't actually different from SSE(4): it's exactly the same instructions, just extended to 256 bits (for floats only; 256-bit integer operations have to wait until AVX2 in Haswell). The instructions are mostly encoded slightly differently with AVX, though, since the VEX encoding has a non-destructive (3-operand) syntax. That makes the instructions slightly larger but saves most register-register move instructions, which should be good for a small performance improvement.
    AVX with ints is thus only minimally faster than SSE4 on the same CPU (the only advantage comes from fewer move instructions), and with floats it's a bit more than twice as fast in theory (except for divisions on Sandy Bridge, as the divide unit is only 4-wide, though Ivy Bridge "fixed" that). This assumes, though, that your algorithm really can be adjusted to use 8-wide floats trivially, and that there are no load/store bottlenecks (Sandy Bridge can load two 128-bit values and store one 128-bit value per clock), not to mention that other limitations, such as memory bandwidth and latency, obviously stay the same.

    I don't know much about VMX. I believe it has better support for horizontal operations and shuffles, but whether you can benefit from such instructions can't be said in general. As for VMX on Xenon, I have absolutely no idea what the throughput of even the "basic" operations (float vector multiply, add) is; just because the instructions are 4-wide doesn't tell you much about what the CPU can do per clock. I'm not sure that information was published anywhere for Xenon (it's possible that, just like older CPUs supporting SSE2, it really only has 2-wide instead of 4-wide execution units, for instance).
     
  5. tunafish

    Regular

    Joined:
    Aug 19, 2011
    Messages:
    478
    AVX1 is honestly not all that interesting. Getting 8-wide parallelism without gather is a whole lot harder than 4-wide. AVX2, to be released with Haswell, however, is very interesting. All the low-level coders I routinely talk to are pretty stoked about the gather support and FMA. Not only does it make "lists of elements" style code a lot easier to vectorize, it should finally make reasonable gains from autovectorization a reality. Vector instructions with gather are just better than ones without.

    I'm reasonably certain that VMX is full-width, and has always been.
     
  6. pjbliverpool

    pjbliverpool B3D Scallywag
    Legend

    Joined:
    May 8, 2005
    Messages:
    7,454
    Location:
    Guess...
    Thanks for the input guys. So it sounds like AVX is most certainly not just 2x SSE4, but AVX2, coming with Haswell, might be getting close? Sounds like there's a lot to be excited about in Haswell.

    Does VMX support FMA? I assume that's quite advantageous for games and so would give it a leg up in some respects over AVX?
     
  7. Davros

    Legend

    Joined:
    Jun 7, 2004
    Messages:
    12,921
    Do you need to code specifically for SSE4/AVX?
    Back in the day (cue old fart story), if I wanted to support x87 I would just use a compiler directive ($N, I think) and that was it. Job done: if the PC had a math co-pro it would get used; if not, the program would just use the integer unit.
     
  8. tunafish

    Regular

    Joined:
    Aug 19, 2011
    Messages:
    478
    Just to elaborate: most loads do not vectorize that easily on all implementations. That's why comparing ideal cases is pointless; you just don't see them that much in the real world. The ability to vectorize more cases is much more important than throughput in the optimal case.

    I'd say that AVX2 doesn't "get close to being 2x SSE4". It's much better than that. It will allow a lot of code that is currently done with single elements to be autovectorized by the compiler.

    It does. Note that AVX (and SSE) can do both a multiply and an add in the same clock, so the advantage isn't that dramatic.

    For doing math on single elements, yeah, sure. But that's not all that much faster than x87. SIMD does not make the operations any faster; it allows you to do more of them at the same time. So instead of loading two individual elements, you load two vectors of 4 or 8 and multiply each element of one vector with the corresponding element of the other. So not only do you need to use special instructions, pre-AVX2 you have to lay out your data so that you can load consecutive (16-byte-aligned) elements from memory. And since cross-lane operations are slow, you ideally want the vectors to have elements from different objects. So instead of putting the value in the object, you have to build an array that has one value from each object, for each value in said objects.

    You can probably see why this gets hairy fast. It's hard to do by hand, and nigh-impossible for a compiler to do automatically. There is some downright heroic work on the subject by the Intel and GCC teams, but even they really don't get that much speedup from autovectorized code. So today, only the things that are absolutely trivial tend to get optimized (position and velocity as two 4-element vectors, say).

    AVX2 brings gather instructions, which are basically vectorized loads. They take a base address and a vector full of offsets, and fill the target register with [base + offset]. This should make vector instructions useful in a lot of places they weren't before, because a lot of loops can then be trivially vectorized by the compiler.
     
  9. fellix

    fellix Hey, You!
    Veteran

    Joined:
    Dec 4, 2004
    Messages:
    3,347
    Location:
    Varna, Bulgaria
    Let's not forget the hardware transactional memory support, primed for Haswell too. It will further optimize memory pipeline performance under heavy MP loads.
     
  10. sebbbi

    Veteran

    Joined:
    Nov 14, 2007
    Messages:
    2,185
    Location:
    Helsinki, Finland
    VMX128 is actually a very good set of instructions (compared to SSE, at least). It has very good shuffles/inserts/selects, multiply-add, complex bit-packing instructions (including float16 conversion), an (AOS) dot product, etc. However, the instruction set is only one side of the coin; the other is the CPU architecture implementing it.

    Nothing, of course, compares to AVX2 (in Haswell). But gather is only good if it is fast enough, and nobody really knows that yet. 256-bit-wide integer operations are of course a nice addition as well.
     
  11. pjbliverpool

    pjbliverpool B3D Scallywag
    Legend

    Joined:
    May 8, 2005
    Messages:
    7,454
    Location:
    Guess...
    Sounds like Haswell is going to be a pretty impressive chip. I wonder how long it will be before CPUs drop specialised SIMD units altogether, though, and move vector processing to the GPU. Are we getting close to that yet? Or would GPUs be unsuitable as complete replacements?

    I know AMD has been hinting at it in a future Fusion iteration, but I'm not sure whether that would be a complete replacement for the CPU's SIMD abilities or just complementary.
     
  12. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    6,830
    Location:
    Well within 3d
    There would need to be some pretty striking advances in implementation to allow the FP unit to be completely stripped out of the CPU core.
    The latency of hopping from a CPU to a GPU would be unacceptable for workloads that require higher straight-line performance. Problems that do not need much more data-level parallelism than a CPU provides would also be a waste on a CU that needs four to eight times as many work items.
     
  13. tunafish

    Regular

    Joined:
    Aug 19, 2011
    Messages:
    478
    It's all about latency. If you were to move the processing to the GPU, even on the same die you are talking about several extra cycles. No matter how awesome your throughput is, that would still hurt you on a lot of loads.

    I really don't think that the CPU vector units will ever be dropped. More likely, either they will evolve into the GPU ones (expand AVX to full width, put 4-8 threads into the front end, run GPU code on the CPU), or at some point the manufacturers will stop adding to them and just put all the new advancements in a new dedicated vector block.
     
  14. Nick

    Veteran

    Joined:
    Jan 7, 2003
    Messages:
    1,881
    Location:
    Montreal, Quebec
    The reverse will happen. Note that a Haswell quad-core will be capable of 500 GFLOPS, while today's 22 nm HD 4000 can only do about 300 GFLOPS. GPUs also still have a lot of catching up to do to support complex code and not choke due to latency and bandwidth. So you can't get rid of the CPU's SIMD units any time soon, and the GPU is evolving into a CPU architecture to support more complex generic code. So the GPU and CPU are converging.

    Eventually it will make sense to just move all programmable throughput computing to the CPU. AVX2 will already be perfectly suitable for graphics shaders. The only remaining deal breaker is the higher power consumption. But this can be tackled with AVX-1024. The VEX encoding already supports extending it to 1024-bit registers, and by executing such instructions on 256-bit units in four cycles, the CPU's front-end and scheduler will have four times less switching activity, hence dramatically lowering the power consumption. A 16 nm successor to Haswell could deliver 2 TFLOPS for the same die size and not break a sweat.

    GPGPU is dying. Even though AMD is making its GPU architecture more flexible, NVIDIA went the other direction with Kepler. And on top of that you get wildly inconsistent performance between discrete and integrated parts. So GPGPU is utter rubbish for mainstream applications. Developers will instead focus on AVX2, since that will be available in every CPU from Haswell forward, and is only going to get more powerful.
     
  15. Davros

    Legend

    Joined:
    Jun 7, 2004
    Messages:
    12,921
    Perhaps NV are trying to ensure that people who do GPGPU will buy Tesla cards, but then again there's the risk that people on a tight budget would buy AMD unless they are locked into NV's tool chain.
     
  16. pjbliverpool

    pjbliverpool B3D Scallywag
    Legend

    Joined:
    May 8, 2005
    Messages:
    7,454
    Location:
    Guess...
    Interesting stuff, cheers. It'd certainly be great to see developers start to really take advantage of the vector processing on PC CPUs. I can't help but think that AVX is pretty underutilised at the moment. Obviously AVX2 is going to be a lot more useful, so once it starts becoming the standard, hopefully developers will start pushing it to its limits, thus driving it forwards to more GPU-like performance.

    I'm not sure how you get 500 GFLOPS out of a quad-core Haswell though? Even running at 4 GHz (which is certainly possible), it would need to be capable of twice the single-precision FLOPS of Ivy Bridge. Is AVX2 going to double the throughput of AVX? (32 FLOPS per cycle vs 16.)
     
  17. fellix

    fellix Hey, You!
    Veteran

    Joined:
    Dec 4, 2004
    Messages:
    3,347
    Location:
    Varna, Bulgaria
    Dual FMA3 pipelines, replacing the current ADD and MUL vector units?
     
  18. denev2004

    Newcomer

    Joined:
    Apr 28, 2010
    Messages:
    143
    Location:
    China
    That's not enough if you're talking about DP.

    Even talking about SP, it's just barely enough, as I don't think a 4-core Haswell can go up to 4 GHz.
     
  19. fellix

    fellix Hey, You!
    Veteran

    Joined:
    Dec 4, 2004
    Messages:
    3,347
    Location:
    Varna, Bulgaria
    By the time Haswell is out, I think Intel should already have a refined 22 nm process up and running. After all, Haswell will be the first architecture designed natively for Tri-Gate.
     
  20. pjbliverpool

    pjbliverpool B3D Scallywag
    Legend

    Joined:
    May 8, 2005
    Messages:
    7,454
    Location:
    Guess...
    Is this actually what Haswell will have, or just a guess at this point? 1 TFLOP SP from an 8-core x86 chip would be pretty impressive!

    EDIT: I've no doubt Haswell will be capable of hitting 4 GHz, but I doubt Intel will clock it that high given the lack of competition. I'm fairly sure Intel could have been releasing stock 4 GHz CPUs since Sandy Bridge if they'd felt the need.
     
