Intel reveals AVX2 instruction set

fellix · Jun 12, 2011

Major feature points of the new ISA are support for 256-bit integer vector format and memory gather operations:

AVX2 extends Intel AVX by promoting most of the 128-bit SIMD integer instructions with 256-bit numeric processing capabilities. AVX2 instructions follow the same programming model as AVX instructions.
In addition, AVX2 provides enhanced functionality for broadcast/permute operations on data elements, vector shift instructions with variable-shift count per data element, and instructions to fetch non-contiguous data elements from memory.
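
To make two of the listed features concrete, here is a minimal scalar sketch of what a gather and a per-element variable shift compute. It models only the semantics, not the hardware instructions; the function names `gather` and `shift_left_variable` are made up for this illustration, and the 32-bit lane width is an assumption.

```python
def gather(table, indices):
    """Each output lane i loads table[indices[i]]; the indices need not be contiguous."""
    return [table[i] for i in indices]


def shift_left_variable(values, counts, width=32):
    """Shift each lane left by its own count, truncated to `width` bits.

    A count of `width` or more yields 0 in this model.
    """
    mask = (1 << width) - 1
    return [((v << c) & mask) if c < width else 0
            for v, c in zip(values, counts)]


table = [10, 20, 30, 40, 50, 60, 70, 80, 90]
print(gather(table, [8, 0, 3, 3, 1, 7, 2, 5]))
print(shift_left_variable([1, 1, 1, 0x80000000], [0, 4, 31, 1]))
```

Without a gather instruction, the eight loads in the first call would each be a separate scalar load followed by an insert into the vector register.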

White paper

rpg.314 · Jun 12, 2011

IIRC, most of the integer ops in SSEx were aimed at video encode/decode.

With ff hw for the same, what is the point of wasting hw there?

Interestingly, no scatter and no widening of vec width. So not before 2015 at best.

Nick · Jun 13, 2011

rpg.314 said:
IIRC, most of the integer ops in SSEx were aimed at video encode/decode.

With ff hw for the same, what is the point of wasting hw there?

There are lots of formats Quick Sync doesn't support. Also, integer vector operations have plenty of other uses besides video. Furthermore, it doesn't necessarily waste hardware. They might decide to execute 256-bit instructions on 128-bit execution units in two cycles. Considering that everything else is in place already, widening the ALUs to 256-bit isn't all that expensive either.

Interestingly, no scatter and no widening of vec width. So not before 2015 at best.

Scatter is nowhere near as important as gather. And instead of widening the vectors it seems Haswell will support FMA. Considering that it will have two such units per core, this will also provide a nice increase in floating-point throughput. So even with 256-bit integer units, they're keeping a focus on floating-point performance.

I don't think widening the execution units makes much sense anyway. Instead, they should just widen the registers, and execute 1024-bit instructions on 256-bit execution units, in four cycles. That would dramatically lower the power consumed by the out-of-order execution, and help hide latency. It might even rival future GPUs, which continue to have to sacrifice performance/Watt for programmability. So it's all converging toward the same architecture, except that the CPU is already way faster at serial tasks which makes it win hands down at workloads which suffer from Amdahl's Law. The way things are scaling, that will become more critical than raw power.

NVIDIA needs Project Denver more than ever. Let's just hope they know how to turn it into a homogeneous architecture which can turn the heat on Intel...

ninelven · Jun 13, 2011

Let's just hope they know how to turn it into a homogeneous architecture which can turn the heat on Intel

I don't think Nvidia or AMD need homogeneous architectures or intend to go that route. A heterogeneous system which developers can treat as homogeneous should suffice.

Nick · Jun 14, 2011

ninelven said:
A heterogeneous system which developers can treat as homogeneous should suffice.

How would you achieve that?

rpg.314 · Jun 14, 2011

Nick said:
Also, integer vector operations have plenty of other uses besides video.

Like what?

Also

Intel said:
AVX2’s integer support is particularly useful for processing visual data commonly encountered in consumer imaging and video processing workloads.

Scatter is nowhere near as important as gather. And instead of widening the vectors it seems Haswell will support FMA. Considering that it will have two such units per core, this will also provide a nice increase in floating-point throughput.

Where does it say that it will have 2 fma units/core?

You'll need 6 reads and 2 writes per clock to feed these two units. The current limit is 4. Hardly a foregone conclusion.

pcchen · Jun 14, 2011

rpg.314 said:
Like what?

A while ago I wrote a piece puzzle solver, which uses bit fields for the board and pieces, so it can easily use SSE2 integer instructions for better performance. The board and pieces are too small to use AVX2 (128 bits suffice for this program) but I can imagine some programs will be able to use 256 bits quite well.

Big number libraries and some cryptography libraries can also use SSE2 integer instructions.
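
As a rough illustration of the bit-field approach described above, the sketch below stores a board and a piece as integers and uses AND/OR to test and apply a placement. It is not the solver from the post; the 8-wide board, the L-shaped piece, and the helper names are all assumptions chosen for the example.

```python
BOARD_W = 8  # assumed board width; one bit per cell, row-major


def piece_mask(cells, row=0, col=0):
    """Build a bit mask for a piece given as (row, col) offsets, placed at (row, col)."""
    mask = 0
    for r, c in cells:
        mask |= 1 << ((row + r) * BOARD_W + (col + c))
    return mask


def can_place(board, piece):
    """A piece fits if it shares no set bit with the occupied cells."""
    return (board & piece) == 0


def place(board, piece):
    """Mark the piece's cells as occupied."""
    return board | piece


L_PIECE = [(0, 0), (1, 0), (2, 0), (2, 1)]  # hypothetical piece shape

board = place(0, piece_mask(L_PIECE, 0, 0))
print(can_place(board, piece_mask(L_PIECE, 0, 1)))  # overlaps the first piece
print(can_place(board, piece_mask(L_PIECE, 0, 2)))  # fits next to it
```

The collision test is a single AND over the whole board, which is why a board that fits in one 128-bit or 256-bit register maps well onto integer vector instructions.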

rpg.314 · Jun 14, 2011

These use bitwise ops. More to the point, I expect most of them to be using the "regular" version of operations. And these are only a part of the overall ops. Many of these instructions are video encode/decode specific which will be better off using the ff hw.

pcchen · Jun 14, 2011

rpg.314 said:
These use bitwise ops. More to the point, I expect most of them to be using the "regular" version of operations. And these are only a part of the overall ops. Many of these instructions are video encode/decode specific which will be better off using the ff hw.

Actually, the only SSE2 integer instruction which can be said to be "video encode specific" is probably the SAD (sum of absolute differences), although it too is very useful in computer vision algorithms. The ability to efficiently handle multiple 8-bit or 16-bit integers is still quite valuable outside of video encode/decode applications.
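
For readers unfamiliar with SAD, here is a small scalar sketch of the operation. The grouping into blocks of 8 bytes mirrors how the SSE2 PSADBW instruction produces one sum per 64-bit half; the function names are invented for this example.

```python
def sad_u8(a, b):
    """Sum of absolute differences between two equal-length byte sequences."""
    return sum(abs(x - y) for x, y in zip(a, b))


def sad_per_group(a, b, group=8):
    """One SAD result per group of `group` bytes, as a packed SAD instruction would return."""
    return [sad_u8(a[i:i + group], b[i:i + group])
            for i in range(0, len(a), group)]


row_a = list(range(16))            # 0, 1, ..., 15
row_b = list(range(15, -1, -1))    # 15, 14, ..., 0
print(sad_per_group(row_a, row_b))
```

In motion estimation the two inputs are pixel rows from a candidate block and a reference block; in computer vision the same measure serves as a cheap block-matching cost.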

Nick · Jun 14, 2011

rpg.314 said:
Like what?

Any algorithm using some integers really. There's a gazillion code loops with independent iterations which can be parallelized more effectively with gather and extensive integer vector instructions. Their use in OpenCL and other compute applications also goes way beyond video processing. Trying to sum them up would be futile, and some uses haven't even been conceived yet. Use your imagination!

And again, this probably costs less than 1% of die space. The reasons not to include these instructions don't outweigh the potential uses. GPUs have also continuously added support for seemingly exotic features at times when most people didn't see the point. Most of such features are nowadays unthinkable not to be supported...

Where does it say that it will have 2 fma units/core?

Haswell has to outperform Bulldozer. Also, it makes no sense to extend integer operations to 256-bit but cripple floating-point performance.

You'll need 6 reads and 2 writes per clock to feed these two units. The current limit is 4. Hardly a foregone conclusion.

You don't necessarily need 6 read ports. The last operand can be fetched in a later cycle. Also in many cases the bypass network provides some of the operands. So even with 4 read ports the average throughput can be really high.

The register file can also be multi-banked to provide 6 read ports at a low cost. I don't know how Bulldozer does it, but it's clearly possible. And Haswell will use a far superior fab process. There's no reason to doubt it will have two FMA units per core.

Nick · Jun 14, 2011

Quoting the blog post (http://software.intel.com/en-us/blogs/2011/06/13/haswell-new-instruction-descriptions-now-available/):

Intel said:
Floating Point Multiply Accumulate – Our floating-point multiply accumulate significantly increases peak flops and provides improved precision to further improve transcendental mathematics.

Emphasis mine. This pretty much confirms they're planning on adding two FMA units per core.
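
The two claims in the quote (higher peak flops, improved precision) can be sketched numerically. The snippet below is illustrative only: the count of two FMA units per core is the assumption being argued in this thread, not something the blog post states, and `fma_exact` is a hypothetical helper that emulates a fused multiply-add by doing the arithmetic exactly and rounding once.

```python
from fractions import Fraction


def peak_flops_per_cycle(units, vector_bits=256, element_bits=32, flops_per_lane=2):
    """Peak flops per cycle per core: units * lanes * flops per lane.

    flops_per_lane is 2 for FMA (one multiply plus one add per lane).
    """
    lanes = vector_bits // element_bits
    return units * lanes * flops_per_lane


def fma_exact(a, b, c):
    """Emulate a fused multiply-add: compute a*b + c exactly, then round once."""
    return float(Fraction(a) * Fraction(b) + Fraction(c))


a = 1.0 + 2.0 ** -27
b = 1.0 - 2.0 ** -27
c = -1.0

unfused = a * b + c          # the product is rounded to a double before the add
fused = fma_exact(a, b, c)   # a single rounding at the end

print(peak_flops_per_cycle(units=2))  # assumed two 256-bit FMA units
print(unfused, fused)
```

Here the exact product is 1 - 2^-54, which rounds to 1.0 as a double, so the separate multiply and add return 0.0 while the fused version keeps the -2^-54 term. That lost low-order term is the kind of error that matters when evaluating polynomial approximations of transcendental functions.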

rpg.314 · Jun 14, 2011

Nick said:
Any algorithm using some integers really. There's a gazillion code loops with independent iterations which can be parallelized more effectively with gather and extensive integer vector instructions. Their use in OpenCL and other compute applications also goes way beyond video processing. Trying to sum them up would be futile, and some uses haven't even been conceived yet. Use your imagination!

Again, more handwaving.

Haswell has to outperform Bulldozer.

BD has half the per core fp throughput of sandy bridge.

You don't necessarily need 6 read ports. The last operand can be fetched in a later cycle.

How will that give enough operands for 2 fma's per clock?

rpg.314 · Jun 14, 2011

Nick said:
Quoting the blog post (http://software.intel.com/en-us/blogs/2011/06/13/haswell-new-instruction-descriptions-now-available/):

Emphasis mine. This pretty much confirms they're planning on adding two FMA units per core.

It will need 2x cache rw bw. With gather, I would expect it to be a huge addition.

The compilers will be hard pressed to issue 2 fma's per clock from only 16 avx registers. The FP unit is big. I don't see the utilization compensating for the area and the power cost.

Bottom line, I'll believe it when I see it.

Nick · Jun 14, 2011

rpg.314 said:
Again, more handwaving.

Face recognition, gesture recognition, speech recognition, speech encoding/decoding, compression, encryption, ray tracing, packet inspection, data sorting, financial computing, path finding, artificial intelligence, ...

You also need to realize that pretty much every loop with independent iterations and some integer data types could benefit from AVX2's support for 256-bit integer operations (and gather). Such loops are present (and often a hotspot) in any respectably sized code base.

Also, integer processing is critical for implementing transcendental floating-point functions. So even heavily floating-point oriented applications can benefit significantly. And last but not least, Bulldozer appears to have twice the integer vector processing capability so Intel can't risk staying behind.

BD has half the per core fp throughput of sandy bridge.

8-core Bulldozer will be up against 4-core Sandy Bridge and Ivy Bridge. A few ALUs dedicated to a specific thread doesn't really count as a separate core. Nearly everything else is shared, making a Bulldozer module more like an alternative to a single Intel core with Hyper-Threading.

As the software starts to adopt AVX-256, and Intel offers CPUs with more cores, AMD will have no other choice but to equip FlexFP with a pair of 256-bit FMA units. Intel already has 256-bit units, so it goes the other direction: adding FMA support to each unit. Neither company is going to allow the other to take a significant lead in floating-point performance, since that would cost them big deals in the HPC market.

How will that give enough operands for 2 fma's per clock?

4 operands in the first cycle, 2 in the next. Of course this means in the worst case it can't sustain 2 FMA's per clock, but like I said the bypass network reduces the pressure on the register file. In fact when you have a high density of FMA instructions they're bound to have close dependencies so the result of the previous instruction can be fed into the top of the pipeline again without the need to occupy a register file port.

Note that for many years Intel had only 3 register file read ports, and hardly anyone ever ran into this bottleneck, precisely because the bypass network provides a lot of the operands.
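
The port arithmetic in this exchange can be written down as a very simple steady-state bound. This is a back-of-the-envelope model, not a description of any real core: it ignores write ports, scheduling, and memory operands, and the function name and the `bypass_fraction` parameter are assumptions introduced for the sketch.

```python
from fractions import Fraction


def sustained_fma_per_clock(read_ports, bypass_fraction,
                            issue_width=2, operands_per_fma=3):
    """Upper bound on FMAs per clock given register file read ports.

    bypass_fraction is the share of source operands supplied by the
    bypass network instead of a register file read port.
    """
    reads_per_fma = operands_per_fma * (1 - Fraction(bypass_fraction))
    if reads_per_fma == 0:
        return Fraction(issue_width)
    return min(Fraction(issue_width), Fraction(read_ports) / reads_per_fma)


print(sustained_fma_per_clock(4, 0))               # every operand read from the register file
print(sustained_fma_per_clock(4, Fraction(1, 3)))  # one operand in three comes from bypass
print(sustained_fma_per_clock(6, 0))               # the 6-port case
```

Under this model, 4 read ports sustain 4/3 FMAs per clock with no bypassing, and reach the full 2 per clock once an average of one operand per FMA arrives through the bypass network.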

rpg.314 · Jun 14, 2011

Nick said:
Face recognition, gesture recognition, speech recognition, speech encoding/decoding, compression, encryption, ray tracing, packet inspection, data sorting, financial computing, path finding, artificial intelligence, ...

All suitable for GPU acceleration.

3dilettante · Jun 14, 2011

Nick said:
I don't think widening the execution units makes much sense anyway. Instead, they should just widen the registers, and execute 1024-bit instructions on 256-bit execution units, in four cycles. That would dramatically lower the power consumed by the out-of-order execution, and help hide latency.

Can you explain what savings there are on the part of the OoO engine?
It doesn't stop trying to pick pending operations just because one issue port is stuck on a multicycle non-pipelined operation.

Also, integer processing is critical for implementing transcendental floating-point functions. So even heavily floating-point oriented applications can benefit significantly. And last but not least, Bulldozer appears to have twice the integer vector processing capability so Intel can't risk staying behind.

I assume you are including the IMAC along with the two integer SIMD pipes?
I suppose if the IMAC gets used this would be the case for a 4-module chip.

8-core Bulldozer will be up against 4-core Sandy Bridge and Ivy Bridge. A few ALUs dedicated to a specific thread doesn't really count as a separate core. Nearly everything else is shared, making a Bulldozer module more like an alternative to a single Intel core with Hyper-Threading.

There are two physically distinct issue networks, instruction control units, integer register files, and data caches. Aside from a shared decoder and icache, there isn't anything in a bulldozer core that is less of a core than a CPU in the days prior to floating point coprocessors.
Whether that is the best solution for the full range of markets the chip will be sold in is a different debate.

Blazkowicz · Jun 14, 2011

rpg.314 said:
All suitable for GPU acceleration.

Not those latter ones, I would say: "data sorting, financial computing, path finding, artificial intelligence"

sebbbi · Jun 22, 2011

rpg.314 said:
Again, more handwaving.

The CPU sorting algorithm that all recent sorting papers seem to benchmark against is fully vectorized. The authors report a 3.3x improvement over the scalar version (using 128-bit SSE). The full paper can be grabbed here: http://portal.acm.org/citation.cfm?id=1454171
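
To give a feel for why sorting vectorizes at all, here is a scalar sketch of a bitonic sorting network, a common building block in SIMD sorting. It is not the implementation from the linked paper. The point is that every step is a data-independent compare-exchange built from min and max, which maps onto packed min/max instructions instead of branches. The input length is assumed to be a power of two.

```python
def bitonic_sort(values):
    """Sort a list whose length is a power of two with a bitonic network."""
    a = list(values)
    n = len(a)
    k = 2
    while k <= n:                 # size of the bitonic sequences being merged
        j = k // 2
        while j >= 1:             # distance between compared elements
            for i in range(n):
                partner = i ^ j
                if partner > i:
                    lo = min(a[i], a[partner])
                    hi = max(a[i], a[partner])
                    if i & k == 0:        # ascending half
                        a[i], a[partner] = lo, hi
                    else:                 # descending half
                        a[i], a[partner] = hi, lo
            j //= 2
        k *= 2
    return a


print(bitonic_sort([7, 3, 0, 5, 6, 1, 4, 2]))
```

For a fixed `j`, all compare-exchanges in the inner `for` loop are independent of each other, so a vector unit can process several of them per instruction.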

Blazkowicz said:
not those later ones I would say : "data sorting, financial computing, path finding, artificial intelligence"

The best GPU data sorting algorithms are around 10x faster than the CPU versions (high end CPU vs high end GPU). I think this is the fastest published GPU sorter currently: http://portal.acm.org/citation.cfm?id=1854273.1854344. Also, there are many data mining and financial computing applications for GPUs. Check a recent survey in the field.

denev2004 · Oct 11, 2011

Mostly I care about when AVX2 will actually ship. I think Haswell might use LNI, so...
Does that mean Intel wants Ivy Bridge to use AVX2 but still leave out its FMA3 support?

rpg.314 · Oct 11, 2011

Neither Haswell nor Ivy will have LNI.

Haswell will have AVX2.
