A Comparison: SSE4, AVX & VMX

Discussion in 'PC Industry' started by pjbliverpool, Apr 28, 2012.

  1. pjbliverpool

    pjbliverpool B3D Scallywag
    Legend

    Joined:
    May 8, 2005
    Messages:
    7,454
    Location:
    Guess...
    As it stands, it can be argued that there are three major CPU SIMD instruction sets in use for modern high-end gaming (okay, ignoring SPUs).

    Those being:

    SSE4: Used on Penryn and Nehalem (in slightly different configurations)
    AVX: Used on the very latest PC CPU architectures, namely Sandy Bridge, Bulldozer and Ivy Bridge.
    VMX: Used in Xenon (x3) and, in a slightly reduced form, in the PPU on Cell

    So given the same theoretical throughput, what are the general thoughts about which of these instruction sets is best suited for modern gaming?

    Obviously AVX has twice the theoretical single-precision throughput of SSE4 and VMX per clock, so let's say we're using as near as dammit 100% vectorised code on the following hypothetical CPUs:

    1x Penryn core @ 3.2 GHz
    1x Sandy Bridge core @ 1.6 GHz
    1x Xenon core @ 3.2 GHz

    Any views on how these would fare against one another?
     
  2. Davros

    Legend

    Joined:
    Jun 7, 2004
    Messages:
    12,921
    Is AVX used in any games?
    Or is it transparent to the programmer? I'm guessing SSE3 is needed for older CPUs; my CPU doesn't support SSE4.
     
  3. rpg.314

    Veteran

    Joined:
    Jul 21, 2008
    Messages:
    4,298
    Location:
    /
    Then all of them suck. You should be using a GPU.
     
  4. mczak

    Veteran

    Joined:
    Oct 24, 2002
    Messages:
    2,914
    You cannot really say which instruction set is faster, as that depends on the implementation. The latency and throughput of e.g. SSE2 instructions vary greatly between different CPUs.
    Furthermore, the AVX instruction set isn't actually different from SSE(4): it's exactly the same instructions, just extended to 256 bits (for floats only; 256-bit integer operations have to wait until AVX2 in Haswell). The instructions are mostly encoded slightly differently with AVX, though, since the VEX encoding has a non-destructive (3-operand) syntax. That makes the instructions slightly larger but saves most register-register move instructions, which should be good for a small performance improvement.
    AVX with ints is thus only minimally faster than SSE4 on the same CPU (the only advantage comes from fewer move instructions), and with floats it's a bit more than twice as fast in theory (except for divisions on Sandy Bridge, as the divide unit is only 4-wide, though Ivy Bridge "fixed" that). This assumes, though, that your algorithm really can be adjusted to use 8-wide floats trivially, and that there are no load/store bottlenecks (Sandy Bridge can load two 128-bit values and store one 128-bit value per clock), not to mention that other limitations, such as memory bandwidth and latency, obviously stay the same.

    I don't know much about VMX. I believe it has better support for horizontal operations and shuffles, but whether you can benefit from such instructions can't be said in general. As for VMX on Xenon, I have absolutely no idea what the throughput of even the "basic" operations (float vector multiply, add) is; just because the instructions are 4-wide doesn't tell you much about what the CPU can do per clock. I'm not sure that information was published anywhere for Xenon (it's possible that, just like older CPUs supporting SSE2, it really only has 2-wide instead of 4-wide execution units, for instance).
     
  5. tunafish

    Regular

    Joined:
    Aug 19, 2011
    Messages:
    478
    AVX1 is honestly not all that interesting. Getting 8-wide parallelism without gather is a whole lot harder than 4-wide. AVX2, to be released with Haswell, however, is very interesting. All the low-level coders I routinely talk to are pretty stoked about the gather support and FMA. Not only does it make "lists of elements" style code a lot easier to vectorize, it should finally make reasonable gains from autovectorization a reality. Vector instructions with gather are just better than ones without.

    I'm reasonably certain that VMX is full-width, and has always been.
     
  6. pjbliverpool

    pjbliverpool B3D Scallywag
    Legend

    Joined:
    May 8, 2005
    Messages:
    7,454
    Location:
    Guess...
    Thanks for the input guys. So it sounds like AVX is most certainly not just 2x SSE4, but AVX2, coming with Haswell, might be getting close? Sounds like there's a lot to be excited about in Haswell.

    Does VMX support FMA? I assume that's quite advantageous for games and so would give it a leg up in some respects over AVX?
     
  7. Davros

    Legend

    Joined:
    Jun 7, 2004
    Messages:
    12,921
    Do you need to code specifically for SSE4/AVX?
    Back in the day (cue old fart story), if I wanted to support x87 I would just use a compiler directive ($N, I think) and that was it. Job done: if the PC had a math co-pro it would get used; if not, the program would just use the integer unit.
     
  8. tunafish

    Regular

    Joined:
    Aug 19, 2011
    Messages:
    478
    Just to elaborate: most loads do not vectorize that easily on all implementations. That's why comparing ideal cases is pointless; you just don't see them that much in the real world. The ability to vectorize more cases is much more important than throughput in the optimal case.

    I'd say that AVX2 doesn't "get close to being 2x SSE4". It's much better than that. It will allow a lot of code that is currently done with single elements to be autovectorized by the compiler.

    It does. Note that AVX (and SSE) can do both a multiply and an add in the same clock, so the advantage isn't that dramatic.

    For doing math on single elements, yeah, sure. But that's not all that much faster than x87. SIMD does not make the operations any faster; it allows you to do more of them at the same time. So instead of loading two individual elements, you load two vectors of 4 or 8 and multiply each element of one vector with the corresponding element of the other. So not only do you need to use special instructions, pre-AVX2 you have to lay out your data so that you can load consecutive (16-byte-aligned) elements from memory. And since cross-lane operations are slow, you ideally want the vectors to have elements from different objects. So instead of putting the value in the object, you have to build an array that has one value from each object, for each value in said objects.

    You can probably see why this gets hairy fast. It's hard to do by hand, and nigh-impossible for a compiler to do automatically. There is some downright heroic work on the subject by the Intel and GCC teams, but even they really don't get that much speedup from autovectorized code. So today, only the things that are absolutely trivial tend to get optimized (position and velocity as two 4-element vectors, say).

    AVX2 brings gather instructions, which are basically vectorized loads. They take a base address and a vector full of offsets, and fill the target register with [base + offset]. This should make vector instructions useful in a lot of places they weren't before, because a lot of loops can then be trivially vectorized by the compiler.
     
  9. fellix

    fellix Hey, You!
    Veteran

    Joined:
    Dec 4, 2004
    Messages:
    3,347
    Location:
    Varna, Bulgaria
    Let's not forget the hardware transactional memory support, primed for Haswell too. It will further optimize memory pipeline performance under heavy MP loads.
     
  10. sebbbi

    Veteran

    Joined:
    Nov 14, 2007
    Messages:
    2,185
    Location:
    Helsinki, Finland
    VMX128 is actually a very good set of instructions (compared to SSE, at least). It has very good shuffles/inserts/selects, multiply-add, complex bit-packing instructions (including float16 conversion), an (AOS) dot product, etc. However, the instruction set is only one side of the coin; the other is the CPU architecture implementing it.

    Nothing, of course, compares to AVX2 (in Haswell). But gather is only good if it is fast enough, and nobody really knows that yet. 256-bit-wide integer operations are of course a nice addition as well.
     
  11. pjbliverpool

    pjbliverpool B3D Scallywag
    Legend

    Joined:
    May 8, 2005
    Messages:
    7,454
    Location:
    Guess...
    Sounds like Haswell is going to be a pretty impressive chip. I wonder how long it will be before CPUs drop specialised SIMD units altogether, though, and move vector processing to the GPU. Are we getting close to that yet? Or would GPUs be unsuitable as complete replacements?

    I know AMD has been hinting at it in a future Fusion iteration, but I'm not sure whether that would be a complete replacement for the CPU's SIMD abilities or just complementary.
     
  12. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    6,830
    Location:
    Well within 3d
    There would need to be some pretty striking advances in implementation to allow the FP unit to be completely stripped out of the CPU core.
    The latency of hopping from a CPU to a GPU would be unacceptable for workloads that require higher straight-line performance. Problems that do not need much more data-level parallelism than a CPU provides would also be a waste on a CU that needs four to eight times as many work items.
     
  13. tunafish

    Regular

    Joined:
    Aug 19, 2011
    Messages:
    478
    It's all about latency. If you were to move the processing to the GPU, even on the same die you are talking about several extra cycles. No matter how awesome your throughput is, that would still hurt you on a lot of loads.

    I really don't think that the CPU vector units will ever be dropped. More likely, either they will evolve into the GPU ones (expand AVX to full width, put 4-8 threads into the front end, run GPU code on the CPU), or at some point the manufacturers will stop adding to them and just put all the new advancements in a new dedicated vector block.
     
  14. Nick

    Veteran

    Joined:
    Jan 7, 2003
    Messages:
    1,881
    Location:
    Montreal, Quebec
    The reverse will happen. Note that a Haswell quad-core will be capable of 500 GFLOPS, while today's 22 nm HD 4000 can only do about 300 GFLOPS. GPUs also still have a lot of catching up to do to support complex code and not choke due to latency and bandwidth. So you can't get rid of the CPU's SIMD units any time soon, and the GPU is evolving into a CPU architecture to support more complex generic code. So the GPU and CPU are converging.

    Eventually it will make sense to just move all programmable throughput computing to the CPU. AVX2 will already be perfectly suitable for graphics shaders. The only remaining deal breaker is the higher power consumption. But this can be tackled with AVX-1024. The VEX encoding already supports extending it to 1024-bit registers, and by executing such instructions on 256-bit units in four cycles, the CPU's front-end and scheduler will have four times less switching activity, hence dramatically lowering the power consumption. A 16 nm successor to Haswell could deliver 2 TFLOPS for the same die size and not break a sweat.

    GPGPU is dying. Even though AMD is making its GPU architecture more flexible, NVIDIA went the other direction with Kepler. And on top of that you get wildly inconsistent performance between discrete and integrated parts. So GPGPU is utter rubbish for mainstream applications. Developers will instead focus on AVX2, since that will be available in every CPU from Haswell forward, and is only going to get more powerful.
     
  15. Davros

    Legend

    Joined:
    Jun 7, 2004
    Messages:
    12,921
    Perhaps NV are trying to ensure that people who do GPGPU will buy Tesla cards, but then again there's the risk that people on a tight budget would buy AMD unless they are locked into NV's tool chain.
     
  16. pjbliverpool

    pjbliverpool B3D Scallywag
    Legend

    Joined:
    May 8, 2005
    Messages:
    7,454
    Location:
    Guess...
    Interesting stuff, cheers. It'd certainly be great to see developers start to really take advantage of the vector processing on PC CPUs. I can't help but think that AVX is pretty underutilised at the moment. Obviously AVX2 is going to be a lot more useful, so once it starts becoming the standard, hopefully developers will start pushing it to its limits, thus driving it forwards to more GPU-like performance.

    I'm not sure how you get 500 GFLOPS out of a quad-core Haswell though? Even running at 4 GHz (which is certainly possible), it would need to be capable of twice the single-precision FLOPS of Ivy Bridge. Is AVX2 going to double the throughput of AVX? (32 FLOPS per cycle vs 16.)
     
  17. fellix

    fellix Hey, You!
    Veteran

    Joined:
    Dec 4, 2004
    Messages:
    3,347
    Location:
    Varna, Bulgaria
    Dual FMA3 pipelines, replacing the current ADD and MUL vector units?
     
  18. denev2004

    Newcomer

    Joined:
    Apr 28, 2010
    Messages:
    143
    Location:
    China
    That's not enough if you're talking about DP.

    Even talking about SP, it's just barely enough, as I don't think a 4-core Haswell can go up to 4 GHz.
     
  19. fellix

    fellix Hey, You!
    Veteran

    Joined:
    Dec 4, 2004
    Messages:
    3,347
    Location:
    Varna, Bulgaria
    By the time Haswell is out, I think Intel should already have a refined 22 nm process up and running. After all, Haswell will be the first architecture designed natively for Tri-Gate.
     
  20. pjbliverpool

    pjbliverpool B3D Scallywag
    Legend

    Joined:
    May 8, 2005
    Messages:
    7,454
    Location:
    Guess...
    Is this actually what Haswell will have, or just a guess at this point? 1 TFLOP SP from an 8-core x86 chip would be pretty impressive!

    EDIT: I've no doubt Haswell will be capable of hitting 4 GHz, but I doubt Intel will clock it that high given the lack of competition. I'm fairly sure Intel could have been releasing stock 4 GHz CPUs since Sandy Bridge if they'd felt the need.
     
