Intel reveals AVX2 instruction set

Discussion in 'PC Industry' started by fellix, Jun 12, 2011.

  1. fellix

    fellix Hey, You!
    Veteran

    Joined:
    Dec 4, 2004
    Messages:
    3,490
    Likes Received:
    400
    Location:
    Varna, Bulgaria
    Major feature points of the new ISA are support for 256-bit integer vector format and memory gather operations:
    White paper
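    For readers unfamiliar with the term: a gather fills each lane of a vector from its own memory location in one instruction, instead of one scalar load per element. A minimal Python sketch of those semantics follows, assuming eight 32-bit lanes in a 256-bit register. The name `gather_epi32` is made up for illustration, and the real masking and scaling details are in the white paper.

```python
LANES = 8  # assumption: a 256-bit register holding eight 32-bit integers

def gather_epi32(table, indices):
    """Emulate a vector gather: lane i receives table[indices[i]]."""
    if len(indices) != LANES:
        raise ValueError("expected one index per lane")
    return [table[i] for i in indices]

table = [10 * n for n in range(16)]   # 0, 10, 20, ..., 150
idx = [3, 0, 15, 7, 7, 1, 8, 2]       # arbitrary indices, repeats allowed
print(gather_epi32(table, idx))       # prints [30, 0, 150, 70, 70, 10, 80, 20]
```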
     
  2. rpg.314

    Veteran

    Joined:
    Jul 21, 2008
    Messages:
    4,298
    Likes Received:
    0
    Location:
    /
    IIRC, most of the integer ops in SSEx were aimed at video encode/decode.

    With fixed-function (ff) hw for the same, what is the point of wasting hw there?

    Interestingly, no scatter and no widening of vec width. So not before 2015 at best.
     
  3. Nick

    Veteran

    Joined:
    Jan 7, 2003
    Messages:
    1,881
    Likes Received:
    17
    Location:
    Montreal, Quebec
    There are lots of formats Quick Sync doesn't support. Also, integer vector operations have plenty of other uses besides video. Furthermore, it doesn't necessarily waste hardware. They might decide to execute 256-bit instructions on 128-bit execution units in two cycles. Considering that everything else is in place already, widening the ALUs to 256-bit isn't all that expensive either.
    Scatter is nowhere near as important as gather. And instead of widening the vectors it seems Haswell will support FMA. Considering that it will have two such units per core, this will also provide a nice increase in floating-point throughput. So even with 256-bit integer units, they're keeping a focus on floating-point performance.

    I don't think widening the execution units makes much sense anyway. Instead, they should just widen the registers, and execute 1024-bit instructions on 256-bit execution units, in four cycles. That would dramatically lower the power consumed by the out-of-order execution, and help hide latency. It might even rival future GPUs, which continue to have to sacrifice performance/Watt for programmability. So it's all converging toward the same architecture, except that the CPU is already way faster at serial tasks which makes it win hands down at workloads which suffer from Amdahl's Law. The way things are scaling, that will become more critical than raw power.
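    A toy model of the idea above, where one wide instruction is fed through a narrower execution unit over several cycles. It assumes 32-bit lanes and one unit-sized slice per cycle. `execute_wide_add` and its parameters are illustrative names, not anything from Intel documentation.

```python
def execute_wide_add(a, b, vector_bits=1024, unit_bits=256, lane_bits=32):
    """Add two wide vectors lane by lane, one unit-sized slice per cycle.

    Returns (result, cycles). The scheduler would track a single
    instruction; only the execution unit iterates over the slices.
    """
    lanes = vector_bits // lane_bits
    lanes_per_pass = unit_bits // lane_bits
    if len(a) != lanes or len(b) != lanes:
        raise ValueError("operand length does not match vector width")
    mask = (1 << lane_bits) - 1
    result, cycles = [], 0
    for start in range(0, lanes, lanes_per_pass):
        stop = start + lanes_per_pass
        # Wraparound integer add on one slice of the register
        result.extend((x + y) & mask for x, y in zip(a[start:stop], b[start:stop]))
        cycles += 1
    return result, cycles

a = list(range(32))
b = [100] * 32
out, cycles = execute_wide_add(a, b)
print(cycles)    # prints 4
print(out[:4])   # prints [100, 101, 102, 103]
```

    The same function with `vector_bits=256, unit_bits=128` models the two-cycle case mentioned earlier in the post.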

    NVIDIA needs Project Denver more than ever. Let's just hope they know how to turn it into a homogeneous architecture which can turn the heat on Intel...
     
  4. ninelven

    Veteran

    Joined:
    Dec 27, 2002
    Messages:
    1,709
    Likes Received:
    122
    I don't think Nvidia or AMD need homogeneous architectures or intend to go that route. A heterogeneous system which developers can treat as homogeneous should suffice.
     
  5. Nick

    How would you achieve that?
     
  6. rpg.314

    Like what?

    Also, where does it say that it will have 2 FMA units/core?

    You'll need 6 reads and 2 writes per clock to feed these two units. The current limit is 4. Hardly a foregone conclusion.
     
  7. pcchen

    pcchen Moderator
    Moderator Veteran Subscriber

    Joined:
    Feb 6, 2002
    Messages:
    2,766
    Likes Received:
    146
    Location:
    Taiwan
    A while ago I wrote a piece-puzzle solver that uses bit fields for the board and pieces, so it can easily use SSE2 integer instructions for better performance. The board and pieces are too small to use AVX2 (128 bits suffice for this program), but I can imagine some programs will be able to use 256 bits quite well.

    Big number libraries and some cryptography libraries can also use SSE2 integer instructions.
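    A rough illustration of the bit-field approach described above (not the poster's actual code): the board is one integer, each piece placement is a bit mask, and a collision test is a single AND. The 8x8 board and the L-shaped piece are assumptions chosen to keep the example small.

```python
WIDTH, HEIGHT = 8, 8  # assumption: a small board that fits in 64 bits

def piece_mask(cells, row, col):
    """Build a bit mask for a piece from its (dr, dc) cells and an offset."""
    mask = 0
    for dr, dc in cells:
        r, c = row + dr, col + dc
        if not (0 <= r < HEIGHT and 0 <= c < WIDTH):
            return None  # the piece would stick out of the board
        mask |= 1 << (r * WIDTH + c)
    return mask

def can_place(board, mask):
    """A piece fits if it is on the board and overlaps no occupied cell."""
    return mask is not None and (board & mask) == 0

L_PIECE = [(0, 0), (1, 0), (2, 0), (2, 1)]  # hypothetical L-shaped piece

board = 0
first = piece_mask(L_PIECE, 0, 0)
board |= first                                       # place the first piece
print(can_place(board, piece_mask(L_PIECE, 0, 1)))   # prints False (cell (2, 1) is taken)
print(can_place(board, piece_mask(L_PIECE, 0, 2)))   # prints True
```

    With SSE2 the same overlap test on a 128-bit board is a bitwise AND followed by a compare against zero.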
     
  8. rpg.314

    These use bitwise ops. More to the point, I expect most of them to be using the "regular" version of operations. And these are only a part of the overall ops. Many of these instructions are video encode/decode specific which will be better off using the ff hw.
     
  9. pcchen

    Actually, the only SSE2 integer instruction which can be said to be "video encode specific" is probably the SAD (sum of absolute differences), although it too is very useful in computer vision algorithms. The ability to efficiently handle multiple 8-bit or 16-bit integers is still quite valuable outside of video encode/decode applications.
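    For reference, this is what the SSE2 SAD instruction (PSADBW) computes, emulated in plain Python: each 8-byte half of a 16-byte vector produces one sum of absolute differences. The function name and the sample data are illustrative.

```python
def psadbw(a, b):
    """Emulate SSE2 PSADBW on two 16-byte vectors.

    Each 8-byte half yields one sum of absolute differences, so the
    result holds two values (one per 64-bit lane).
    """
    if len(a) != 16 or len(b) != 16:
        raise ValueError("expected 16 unsigned bytes per operand")
    return [sum(abs(x - y) for x, y in zip(a[i:i + 8], b[i:i + 8]))
            for i in (0, 8)]

block = bytes(range(16))   # pixel values 0..15
ref = bytes([8] * 16)      # flat reference block
print(psadbw(block, ref))  # prints [36, 28]
```

    Block matching in motion estimation and template matching in computer vision both reduce to many such sums.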
     
  10. Nick

    Any algorithm using some integers really. There's a gazillion code loops with independent iterations which can be parallelized more effectively with gather and extensive integer vector instructions. Their use in OpenCL and other compute applications also goes way beyond video processing. Trying to sum them up would be futile, and some uses haven't even been conceived yet. Use your imagination!

    And again, this probably costs less than 1% of die space. The reasons not to include these instructions don't outweigh the potential uses. GPUs have also continuously added support for seemingly exotic features at times when most people didn't see the point. Nowadays it would be unthinkable not to support most of those features...
    Haswell has to outperform Bulldozer. Also, it makes no sense to extend integer operations to 256-bit but cripple floating-point performance.
    You don't necessarily need 6 read ports. The last operand can be fetched in a later cycle. Also in many cases the bypass network provides some of the operands. So even with 4 read ports the average throughput can be really high.

    The register file can also be multi-banked to provide 6 read ports at a low cost. I don't know how Bulldozer does it, but it's clearly possible. And Haswell will use a far superior fab process. There's no reason to doubt it will have two FMA units per core.
     
  11. Nick

  12. rpg.314

    Again, more handwaving.

    BD has half the per-core FP throughput of Sandy Bridge.

    How will that give enough operands for 2 fma's per clock?
     
  13. rpg.314

    It will need 2x cache rw bw. With gather, I would expect it to be a huge addition.

    The compilers will be hard pressed to issue 2 FMAs per clock from only 16 AVX registers. The FP unit is big. I don't see the utilization compensating for the area and the power cost.

    Bottom line, I'll believe it when I see it.
     
  14. Nick

    Face recognition, gesture recognition, speech recognition, speech encoding/decoding, compression, encryption, ray tracing, packet inspection, data sorting, financial computing, path finding, artificial intelligence, ...

    You also need to realize that pretty much every loop with independent iterations and some integer data types, could benefit from AVX2's support for 256-bit integer operations (and gather). Such loops are present (and often a hotspot) in any respectable sized code base.

    Also, integer processing is critical for implementing transcendental floating-point functions. So even heavily floating-point oriented applications can benefit significantly. And last but not least, Bulldozer appears to have twice the integer vector processing capability so Intel can't risk staying behind.
    8-core Bulldozer will be up against 4-core Sandy Bridge and Ivy Bridge. A few ALUs dedicated to a specific thread don't really count as a separate core. Nearly everything else is shared, making a Bulldozer module more like an alternative to a single Intel core with Hyper-Threading.
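    On the transcendental-function point, one common example is the last step of an exp2/exp routine, where 2^n for integer n is built by writing the biased exponent straight into the float's bit pattern with an integer add and shift. A minimal sketch for IEEE 754 single precision, assuming n stays in the normal exponent range. `pow2_int` is an illustrative name.

```python
import struct

def pow2_int(n):
    """Return 2.0**n as a float32 value using only an integer add and shift."""
    if not -126 <= n <= 127:
        raise ValueError("outside the normal float32 exponent range")
    bits = (n + 127) << 23  # biased exponent field, zero mantissa, sign 0
    return struct.unpack("<f", struct.pack("<I", bits))[0]

print(pow2_int(10))   # prints 1024.0
print(pow2_int(-3))   # prints 0.125
```

    A vectorized math library does this add and shift on every lane at once. First-generation AVX has no 256-bit integer instructions, so that step has to be split into 128-bit halves.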

    As the software starts to adopt AVX-256, and Intel offers CPUs with more cores, AMD will have no other choice but to equip FlexFP with a pair of 256-bit FMA units. Intel already has 256-bit units, so it goes the other direction: adding FMA support to each unit. Neither company is going to allow the other to take a significant lead in floating-point performance, since that would cost them big deals in the HPC market.
    4 operands in the first cycle, 2 in the next. Of course this means in the worst case it can't sustain 2 FMA's per clock, but like I said the bypass network reduces the pressure on the register file. In fact when you have a high density of FMA instructions they're bound to have close dependencies so the result of the previous instruction can be fed into the top of the pipeline again without the need to occupy a register file port.

    Note that for many years Intel had only 3 register file read ports, and hardly anyone ever ran into this bottleneck, precisely because the bypass network provides a lot of the operands.
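    The back-of-the-envelope arithmetic behind this argument, as a sketch: each FMA has three source operands, so two per clock demand six reads, and every operand caught from the bypass network is one fewer register file read. The bypass fractions below are made-up inputs that show the shape of the trade-off, not measurements.

```python
def sustained_fmas_per_clock(read_ports, bypass_fraction, fma_units=2, operands=3):
    """Upper bound on FMA issue rate when register file reads are the limit.

    bypass_fraction is the share of source operands forwarded from the
    bypass network instead of being read from the register file.
    """
    reads_per_fma = operands * (1.0 - bypass_fraction)
    if reads_per_fma == 0:
        return float(fma_units)
    return min(float(fma_units), read_ports / reads_per_fma)

for fraction in (0.0, 0.25, 0.5):
    print(fraction, sustained_fmas_per_clock(4, fraction))
```

    With no bypassing, 4 ports give 4/3 of an FMA per clock, which matches the "4 operands in the first cycle, 2 in the next" worst case. In this model the port limit disappears once a third of the operands come from the bypass network.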
     
  15. rpg.314

    All suitable for GPU acceleration. :cool:
     
  16. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,135
    Likes Received:
    2,935
    Location:
    Well within 3d
    Can you explain what savings there are on the part of the OoO engine?
    It doesn't stop trying to pick pending operations just because one issue port is stuck on a multicycle non-pipelined operation.


    I assume you are including the IMAC along with the two integer SIMD pipes?
    I suppose if the IMAC gets used this would be the case for a 4-module chip.

    There are two physically distinct issue networks, instruction control units, integer register files, and data caches. Aside from a shared decoder and icache, there isn't anything in a Bulldozer core that is less of a core than a CPU in the days prior to floating point coprocessors.
    Whether that is the best solution for the full range of markets the chip will be sold in is a different debate.
     
  17. Blazkowicz

    Legend Veteran

    Joined:
    Dec 24, 2004
    Messages:
    5,607
    Likes Received:
    256
    Not those latter ones, I would say: "data sorting, financial computing, path finding, artificial intelligence"
     
  18. sebbbi

    Veteran

    Joined:
    Nov 14, 2007
    Messages:
    2,924
    Likes Received:
    5,288
    Location:
    Helsinki, Finland
    The CPU sorting algorithm that all recent sorting papers seem to benchmark against is fully vectorized. The authors state a 3.3x improvement over the scalar version (using 128-bit SSE). Full paper can be grabbed here: http://portal.acm.org/citation.cfm?id=1454171

    The best GPU data sorting algorithms are around 10x faster than the CPU versions (high end CPU vs high end GPU). I think this is the fastest published GPU sorter currently: http://portal.acm.org/citation.cfm?id=1854273.1854344. There are also many data mining and financial computing applications for GPUs. Check a recent survey in the field.
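    For context on why sorting vectorizes at all: SIMD sorters are typically built from compare-exchange steps, which map to lane-wise min/max instructions with no branches. A minimal sketch of a 4-input sorting network applied to several columns at once follows. It is a generic illustration, not the algorithm from either linked paper.

```python
def compare_exchange(lo, hi):
    """Lane-wise min/max, the branch-free primitive a SIMD sorter relies on."""
    return ([min(a, b) for a, b in zip(lo, hi)],
            [max(a, b) for a, b in zip(lo, hi)])

def sort4_columns(r0, r1, r2, r3):
    """Sort each column of four equal-length rows with a 5-comparator network."""
    r0, r1 = compare_exchange(r0, r1)
    r2, r3 = compare_exchange(r2, r3)
    r0, r2 = compare_exchange(r0, r2)
    r1, r3 = compare_exchange(r1, r3)
    r1, r2 = compare_exchange(r1, r2)
    return r0, r1, r2, r3

# Each row stands in for one vector register; each column is sorted independently.
rows = ([4, 1, 9, 7], [3, 2, 8, 7], [2, 3, 7, 7], [1, 4, 6, 0])
for row in sort4_columns(*rows):
    print(row)
```

    After the network runs, reading down any column gives that column's values in ascending order. A full sorter would follow this with a transpose and merge passes.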
     
  19. denev2004

    Newcomer

    Joined:
    Apr 28, 2010
    Messages:
    143
    Likes Received:
    0
    Location:
    China
    Mostly I care about when AVX2 will actually arrive. I think Haswell might use LNI, so...
    Does that mean Intel wants Ivy Bridge to use AVX2 but still leave out FMA3?
     
  20. rpg.314

    Neither Haswell nor Ivy will have LNI.

    Haswell will have AVX2.
     