Intel extends AVX to 512-bit

It was known to be possible for a long time but it's good to see some sort of timeframe for its introduction (as an earlier slide showed 2015 for Knights Landing).

I wonder if AVX-512 is also coming in Skylake?
 
Intel said:
The evolution to Intel AVX-512 contributes to our goal to grow peak FLOP/sec by 8X over 4 generations: 2X with AVX1.0 with the Sandy Bridge architecture over the prior SSE4.2, extended by Ivy Bridge architecture with 16-bit float and random number support, 2X with AVX2.0 and its fused multiply-add (FMA) in the Haswell architecture and then 2X more with Intel AVX-512.
To achieve 8x in FLOP/sec over 4 (architecture) generations, Skylake would indeed have to feature AVX-512:

Nehalem/Westmere: 128-bit FMUL+FADD
Sandy Bridge/Ivy Bridge: 256-bit FMUL+FADD
Haswell/Broadwell: 256-bit FMA+FMA
Skylake/Skymont: 512-bit FMA+FMA

Since Knights Corner already supported 512-bit, and they explicitly mention Sandy/Ivy Bridge and Haswell as part of the 'generations', it's clearly the CPU line they're talking about.
 
Why not make AVX available for the graphics pipeline too, in some way, and save die space while also debuting 1024-bit earlier?
 
With AVX-512 Intel seems to be focusing on more efficient execution of SPMD style programs. They have dedicated mask registers (and according to the reference, most instructions support lane masking), both scatter and (improved) gather instructions, full wide (512 bit) integer operations (for address calculation) and embedded splatting (+ many other very interesting instructions). Now we just need a good standardized SPMD extension for C++ (C++ AMP is quite good, but not cross platform yet).

Skylake high end i7 consumer model (6 core at 4 GHz) would be: 16 (lanes) * 4 (2 x FMA pipes) * 6 (cores) * 4 (GHz) = 1.5 GFLOPS. High end Xeon server models would have at least 12 cores (as we already have Ivy Bridge based Xeons with 12 cores). Those chips would reach 3.0 GFLOPS (assuming no improvements in core counts). But that's for 2015. Current high end GPUs perform more FLOPS than that (but of course much lower efficiency in many algorithms). So the CPU is still a few years late in raw FLOPS race, but catching up nicely.
 
Skylake high end i7 consumer model (6 core at 4 GHz) would be: 16 (lanes) * 4 (2 x FMA pipes) * 6 (cores) * 4 (GHz) = 1.5 GFLOPS.

You mean TFLOPS ;) In 1998 my K6-2 300 with 3DNow! reached 1.2 GFLOPS. Time flies for sure :)
 
High end Xeon server models would have at least 12 cores (as we already have Ivy Bridge based Xeons with 12 cores). Those chips would reach 3.0 GFLOPS (assuming no improvements in core counts).


HSW already gets more cores than IVB. Haswell-EP up to 15 cores, Haswell-EX up to 18 cores.
 
Not that I don't believe these, but are they confirmed by Intel, or only rumors so far? And has Intel confirmed any release dates (or estimates) for these yet?

An 18 core Skylake at 4 GHz would be ~4.6 TFLOPS :)
 
With AVX-512 Intel seems to be focusing on more efficient execution of SPMD style programs. [...] Now we just need a good standardized SPMD extension for C++ (C++ AMP is quite good, but not cross platform yet).

I was thinking the same. AVX2 is not nearly enough for that. With AVX-512 it should be easier for OpenCL kernels optimized for the GPU to run well on the CPU, except the ones that have a poor locality and require the bandwidth and latency hiding of GPUs. I wonder if Intel will also significantly improve the L3 cache bandwidth, which they didn't on Haswell.

In that timeframe, however, I expect Nvidia to sell high-end SoCs with ARM cores for supercomputers with the equivalent of AMD's hUMA, which would partially negate the programmability advantage of CPUs.
 
Since Knights Corner already supported 512-bit, and they explicitly mention Sandy/Ivy Bridge and Haswell as part of the 'generations', it's clearly the CPU line they're talking about.

Sorry, it is not. They might "define" Knights-whatever as a Xeon, because it's a Xeon (Phi), and it's an x86 evolution. It also supports "AVX" because Intel just decided to brand LRB vector instructions v3 as "AVX-512". That implies they will merge AVX-512 into mainstream x86 cores in the future, without specifying a timeline. Again, implies: they might still maintain separate vector instruction sets branded with similar words. AVX-512, AVX-1024, ... for the compute x86s; AVX-2.1, AVX-3.0, AVXX-3.0, AVX-4.1, AVX-4.2, ... for the patent minefield lying on mainstream x86.

Until there's a press release specifically stating what we want to believe, don't believe any vaguely worded information that kinda implies what we wish was true.
 
Sorry, it is not. [...] Until there's a press release specifically stating what we want to believe, don't believe any vaguely worded information that kinda implies what we wish was true.
Then what would the 8x over 4 generations refer to exactly, and why mention Sandy Bridge/Ivy Bridge and Haswell as part of those generations? How do you justify a jump to MIC as a successor generation to Haswell?
Intel said:
Intel AVX-512 will be first implemented in the future Intel® Xeon Phi™ processor and coprocessor known by the code name Knights Landing, and will also be supported by some future Xeon processors scheduled to be introduced after Knights Landing.
Why would they say it will "also" be supported by another "Xeon", if there's already a Xeon (Phi) with confirmed support?

Just asking. The hints at Skylake supporting AVX-512 are quite strong. Other than the article not speaking in absolutes, nothing seems to hint at this merely being a renaming of the Xeon Phi ISA to AVX-512.
 
There are encoding and instruction behavior differences between the current Xeon Phi ISA and AVX-512, although the precursor role of Phi's ISA is obvious.

Scatter makes it into the mainline, and we can see some of the more stringent ordering requirements that would have made delaying its implementation relative to gather compelling.

The fun with optional extensions to the vector extensions continues, which probably means we can expect various processor lines in progress: some that might get the base instructions, some that might get the extra functions, and some that get a mix of the rest.

It makes me think we'll be seeing Phis with the full set of transcendentals and prefetch extensions, Silvermont descendants with delayed adoption--then base, and mainline cores at Skywell (edit: Skylake) or later having whatever wacky decoder wheel Intel decides to use for consumers, and Xeons that will operate either with upmarket consumer cores or on the delayed cadence of the EP and EX lines.

There are some nifty other instructions not part of the SIMD extensions I'm looking at.
 
Then what would the 8x over 4 generations refer to exactly, and why mention Sandy Bridge/Ivy Bridge and Haswell as part of those generations? How do you justify a jump to MIC as a successor generation to Haswell?
Four generations of x86 cores. MIC-next is an x86 core; it came after Haswell, so it is a generation after Haswell, easy! It's a "Xeon", like its predecessors.

Why would they say it will "also" be supported by another "Xeon", if there's already a Xeon (Phi) with confirmed support?
Ah, missed that sentence. Is this the sentence you mean: "Intel AVX-512 will be first implemented in the future Intel® Xeon Phi™ processor and coprocessor known by the code name Knights Landing, and will also be supported by some future Xeon processors scheduled to be introduced after Knights Landing"
I'd say "some future Xeon processors" includes MIC successors to "Intel® Xeon Phi™ processor and coprocessor known by the code name Knights Landing".
Just asking. The hints at Skylake supporting AVX-512 are quite strong. Other than the article not speaking in absolutes, nothing seems to hint at this merely being a renaming of the Xeon Phi ISA to AVX-512.
I agree. But I've learnt to read PR articles as if I were reading a contract to buy a house. Assume hostile&deceptive intent, read every comma and every space. Don't read what isn't there.
 
Also, "4 generations" may not necessarily mean 4 consecutive generations. (Some evidence is from Ivy Bridge being considered as a separate architecture from Sandy Bridge in the article.)
 
Today's GPUs already have 1024-bit wide vector computing units, so AVX-512 is not that new.

However, MIC-2 (KL) could be just like today's CPUs: each core can execute two instead of one set of 512-bit vector math ops per cycle, which means it likely has twice the computing power of today's MIC per core, and is more robust across computing tasks compared to today's GPUs' 1024-bit vector ALUs.

Of course, Intel is also likely to increase the total core count and frequency of their next generation MIC. On several different occasions talking with Intel's staff, they strongly hinted that MIC-2's target will be 4+ TFLOPS; that's about 4x the performance of the current MIC, and comparable to the target TFLOPS of Nvidia's Maxwell.

But of course, just like Kepler, Maxwell could also be optimized for a higher SP peak, something like ~12 TFLOPS at SP. One shouldn't underestimate the importance of SP peak, since many science tasks can take advantage of high SP output. Also, NV's SP performance is about the same as their products' integer performance, and in many computing programs integer performance happens to be very important, from raw integer math to computing the indexes of matrices, etc.
 
So will AVX-512 remove the iGPU, as per Larrabee?
Will AVX-1024?

Until this happens I don't see OpenCL, C++ AMP etc. going away, since you need them to utilize the GFLOPS in the iGPU shaders.

How will the iGPU in Skylake compare FLOPS-wise with the AVX-512 of its CPUs?
I presume the iGPU in Skylake will be an advancement over Haswell's because Intel needs to compete with AMD's APUs.

Besides being behind in the hardware, I also see Intel lagging in the tooling.
E.g. I am unable to construct an OpenCL kernel that will emit an AVX2 FMA instruction.
From what I can tell, Intel's OpenCL kernel performance profiling doesn't spit out a large amount of information compared to the AMD tools,
i.e. it's missing all the hardware performance counters etc.
For the CPU you're supposed to use VTune, but I have a lot of trouble relating VTune performance information to OpenCL/GPGPU concepts.
 
The GT3e in the 4770R has 832 SP GFLOPS at its max Turbo of 1.3 GHz if I calculated correctly ((1.3 GHz) x (8 MAD/clock) x (2 for MAD + MAD) x (40 shaders)) or 128 GFLOPS with the 200 MHz base frequency. I wouldn't be surprised if Skylake's highest-end IGP ends up with ≥ 2 times the GFLOPS of the Haswell GT3e.

Assuming Skylake has AVX-512 and similar CPU clock speeds to Haswell, then we could see around 896 SP GFLOPS for 4 cores ((3.5 GHz) x (512-bit) / (32-bit SP) x (4 for FMA + FMA) x (4 cores)), likely more for parts with more cores. But I doubt there will be one part with the highest-performing CPU cores and the highest-performing IGP.
 
Assuming Skylake has AVX-512 and similar CPU clock speeds to Haswell, then we could see around 448 SP GFLOPS for 4 cores ((3.5 GHz) x (512-bit) / (32-bit SP) x (2 for FMA + FMA) x (4 cores))

Wouldn't it be double that, i.e. 896 GFLOPS? Each of the 2 FMAs is worth 2 FLOPS, isn't it?
 
With AVX-512 it should be easier for OpenCL kernels optimized for the GPU to run well on the CPU, except the ones that have a poor locality and require the bandwidth and latency hiding of GPUs.
GPUs might have better latency hiding in some scenarios, but that's not true in the general case. Most GPUs scale the number of active threads by (peak) register usage in the shader (simpler shaders thus have better latency hiding, and more complex ones have worse, even if the peak register usage is in a branch that isn't taken). The total thread count might seem really big, but a single execution unit (or SM, or whatever you call it) doesn't have that many extra active threads to hide latency (for a relatively complex shader you might just get "8-way hyperthreading"). CPUs on the other hand have register renaming to allocate registers dynamically (both HT threads use the same pool) by actual demand. This is much better for complex (long) programs with lots of branches.

The Haswell CPU is an out-of-order core and has a 192 entry ROB (likely even larger in Skylake). Compared to a GPU, Haswell can additionally hide memory latency by executing (up to 192) other (independent) instructions from either of the 2 instruction streams if an instruction stalls. Intel's GPU (and most other GPUs) executes instructions strictly in the order they arrive. A thread immediately stalls if data is not available. Because in-order threads stall much more easily, a GPU needs more simultaneous threads to have acceptable latency hiding capability. Having more simultaneous threads active isn't always a good practice, since it can easily lead to cache thrashing, as the recently used data sets of all currently active threads need to be in cache simultaneously. A good balance of TLP and ILP is the best way to hide latency (in architectures that have caches).

It's a misconception that GPUs somehow have superior latency hiding compared to CPUs. This impression exists because GPU access patterns have historically been very simple (mostly graphics rendering with almost linear access patterns). Most academic papers about complex GPU compute algorithms state that access pattern optimizations yield over 10x gains in performance. Complex algorithms on the GPU often need much more programmer effort for memory access optimization compared to similar CPU algorithms. GPU performance falls off a cliff with bad memory access patterns. A GPU cannot hide latency if a bad access pattern combined with lots of simultaneously active threads thrashes all the caches. And this of course increases bandwidth usage as well (because the same cache lines need to be repeatedly reloaded multiple times).
 