It was known to be possible for a long time but it's good to see some sort of timeframe for its introduction (as an earlier slide showed 2015 for Knights Landing).
I wonder if AVX-512 is also coming in Skylake?
To achieve 8x in FLOP/sec over 4 (architecture) generations, Skylake would indeed have to feature AVX-512:
Intel said: The evolution to Intel AVX-512 contributes to our goal to grow peak FLOP/sec by 8X over 4 generations: 2X with AVX1.0 with the Sandy Bridge architecture over the prior SSE4.2, extended by Ivy Bridge architecture with 16-bit float and random number support, 2X with AVX2.0 and its fused multiply-add (FMA) in the Haswell architecture and then 2X more with Intel AVX-512.
With AVX-512 Intel seems to be focusing on more efficient execution of SPMD-style programs. They have dedicated mask registers (and according to the reference, most instructions support lane masking), both scatter and (improved) gather instructions, full-width (512-bit) integer operations (for address calculation) and embedded splatting (broadcast), plus many other very interesting instructions. Now we just need a good standardized SPMD extension for C++ (C++ AMP is quite good, but not cross-platform yet).
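To make the lane-masking idea a bit more concrete, here is a rough plain-Python model of a masked vector add. It is not AVX-512 code: the `masked_op` helper, the 4-lane width and the sample values are all made up for illustration (a real 512-bit register would hold 16 single-precision lanes, with the mask living in one of the dedicated mask registers).

```python
def masked_op(op, a, b, mask, passthrough):
    """Apply op lane by lane; lanes whose mask bit is off keep the passthrough value."""
    return [op(x, y) if m else p
            for x, y, m, p in zip(a, b, mask, passthrough)]

# SPMD-style "if (x < 0) x = x + 10" without a branch:
# the comparison produces a per-lane mask, and only masked lanes commit the add.
x = [3, -1, 4, -5]            # hypothetical 4-lane vector
ten = [10] * len(x)           # the constant, splatted across all lanes
mask = [v < 0 for v in x]     # per-lane predicate
result = masked_op(lambda p, q: p + q, x, ten, mask, x)

print(mask)    # which lanes take the "if" branch
print(result)  # lanes 1 and 3 updated, lanes 0 and 2 untouched
```

The point is only that every lane runs the same instruction stream and the mask decides which lanes actually write a result, which is what makes SPMD-style code map well onto wide vectors.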
Skylake high end i7 consumer model (6 cores at 4 GHz) would be: 16 (lanes) * 4 (2 x FMA pipes) * 6 (cores) * 4 (GHz) = 1536 GFLOPS, or about 1.5 TFLOPS. High end Xeon server models would have at least 12 cores (as we already have Ivy Bridge based Xeons with 12 cores). Those chips would reach about 3.0 TFLOPS (assuming no improvements in core counts). But that's for 2015. Current high end GPUs deliver more FLOPS than that (though of course with much lower efficiency in many algorithms). So the CPU is still a few years behind in the raw FLOPS race, but catching up nicely.
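The back-of-the-envelope math above, as a small Python sketch. The function name and parameters are mine, and the inputs are just the post's assumptions (16 SP lanes, two FMA pipes counted as 4 FLOPs per lane per cycle, 4 GHz for both parts), not confirmed specs.

```python
def peak_gflops(lanes, flops_per_lane_per_cycle, cores, ghz):
    """Theoretical peak single-precision GFLOPS for a whole chip."""
    return lanes * flops_per_lane_per_cycle * cores * ghz

# Assumed inputs taken from the estimate above, not from any Intel spec sheet.
consumer = peak_gflops(lanes=16, flops_per_lane_per_cycle=4, cores=6, ghz=4.0)
server = peak_gflops(lanes=16, flops_per_lane_per_cycle=4, cores=12, ghz=4.0)

print(f"6-core consumer part: {consumer:.0f} GFLOPS ({consumer / 1000:.2f} TFLOPS)")
print(f"12-core server part:  {server:.0f} GFLOPS ({server / 1000:.2f} TFLOPS)")
```

Note that a 12-core server part would realistically clock lower than 4 GHz, so the second number is an upper bound under these assumptions.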
Not that I don't believe these, but are they confirmed by Intel, or only rumors so far? And has Intel confirmed any release dates (or estimates) for them yet?
HSW already gets more cores than IVB. Haswell-EP up to 15 cores, Haswell-EX up to 18 cores.
I wonder if Intel will also significantly improve the L3 cache bandwidth, which they didn't on Haswell.
To achieve 8x in FLOP/sec over 4 (architecture) generations, Skylake would indeed have to feature AVX-512:
Nehalem/Westmere: 128-bit FMUL+FADD
Sandy Bridge/Ivy Bridge: 256-bit FMUL+FADD
Haswell/Broadwell: 256-bit FMA+FMA
Skylake/Skymont: 512-bit FMA+FMA
Since Knights Corner already supported 512-bit, and they explicitly mention Sandy/Ivy Bridge and Haswell as part of the 'generations', it's clearly the CPU line they're talking about.
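The generation list above can be turned into peak single-precision FLOPs per cycle per core to check the 8x claim. A quick sketch, assuming two vector pipes per core in every generation and counting an FMA as 2 FLOPs per lane; the table layout and names are only illustrative, and the Skylake row is the speculation from this thread, not a confirmed spec.

```python
SP_BITS = 32  # width of a single-precision float

# (name, vector width in bits, vector pipes per core, FLOPs per lane per op)
GENERATIONS = [
    ("Nehalem/Westmere",        128, 2, 1),  # FMUL + FADD
    ("Sandy Bridge/Ivy Bridge", 256, 2, 1),  # FMUL + FADD
    ("Haswell/Broadwell",       256, 2, 2),  # FMA + FMA
    ("Skylake/Skymont",         512, 2, 2),  # FMA + FMA (assumed)
]

def flops_per_cycle(bits, pipes, flops_per_op):
    """Peak SP FLOPs per cycle for one core."""
    return (bits // SP_BITS) * pipes * flops_per_op

per_cycle = [flops_per_cycle(b, p, f) for _, b, p, f in GENERATIONS]
speedup = per_cycle[-1] / per_cycle[0]

for (name, *_), flops in zip(GENERATIONS, per_cycle):
    print(f"{name}: {flops} SP FLOPs/cycle/core")
print(f"Last vs. first generation: {speedup:.0f}x")
```

Each step doubles the previous one, which is where the 2X, 2X, 2X in Intel's statement would come from.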
Then what would the 8x over 4 generations refer to exactly, and why mention Sandy Bridge/Ivy Bridge and Haswell as part of those generations? How do you justify a jump to MIC as a successor generation to Haswell?
Sorry, it is not. They might "define" Knights whatever as a Xeon, because it's a Xeon (Phi), and it's an x86 evolution. It also supports "AVX" because Intel just decided to brand LRB vector instructions v.3 as "AVX-512". That implies they will merge AVX-512 into mainstream x86 cores in the future, without specifying a timeline. Again, it only implies; they might still maintain separate vector instructions branded with similar words. AVX-512, AVX-1024, ... for the compute x86s, AVX-2.1, AVX-3.0, AVXX-3.0, AVX-4.1, AVX-4.2, ... for the patent minefield laying on mainstream x86.
Until there's a press release specifically stating what we want to believe, don't believe any vaguely worded information that kinda implies what we wish was true.
Why would they say it will "also" be supported by another "Xeon", if there's already a Xeon (Phi) with confirmed support?
Intel said: Intel AVX-512 will be first implemented in the future Intel® Xeon Phi™ processor and coprocessor known by the code name Knights Landing, and will also be supported by some future Xeon processors scheduled to be introduced after Knights Landing.
Four generations of x86 cores. MIC-next is an x86 core, it came after Haswell, so it is a generation after Haswell, easy! It's a "Xeon", like its predecessors.
Then what would the 8x over 4 generations refer to exactly, and why mention Sandy Bridge/Ivy Bridge and Haswell as part of those generations? How do you justify a jump to MIC as a successor generation to Haswell?
Ah, missed that sentence. Is this the sentence you mean: "Intel AVX-512 will be first implemented in the future Intel® Xeon Phi™ processor and coprocessor known by the code name Knights Landing, and will also be supported by some future Xeon processors scheduled to be introduced after Knights Landing"?
Why would they say it will "also" be supported by another "Xeon", if there's already a Xeon (Phi) with confirmed support?
I agree. But I've learnt to read PR articles as if I were reading a contract to buy a house. Assume hostile and deceptive intent, read every comma and every space. Don't read what isn't there.
Just asking. The hints at Skylake supporting AVX-512 are quite strong. Other than the article not speaking in absolutes, nothing seems to hint at this merely being a renaming of the Xeon Phi ISA to AVX-512.
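Assuming Skylake has AVX-512 and similar CPU clock speeds to Haswell, then we could see around 896 SP GFLOPS for 4 cores: (3.5 GHz) x (512-bit / 32-bit SP = 16 lanes) x (2 FMA units) x (2 FLOPs per FMA) x (4 cores). Counting each FMA as a single operation gives half of that, 448 GFLOPS, which is already the peak of a 4-core Haswell at 3.5 GHz.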
GPUs might have better latency hiding in some scenarios, but that's not true in the general case. Most GPUs scale the number of active threads by the (peak) register usage of the shader (simpler shaders thus have better latency hiding, and more complex ones have worse, even if the peak register usage is in a branch that isn't taken). The total thread count might seem really big, but a single execution unit (or SM, or whatever you call it) doesn't have that many extra active threads to hide latency (for a relatively complex shader you might just get "8 way hyperthreading"). CPUs on the other hand have register renaming to allocate registers dynamically (both HT threads use the same pool) by actual demand. This is much better for complex (long) programs with lots of branches.
With AVX-512 it should be easier for OpenCL kernels optimized for the GPU to run well on the CPU, except the ones that have poor locality and require the bandwidth and latency hiding of GPUs.