Intel extends AVX to 512-bit

Discussion in 'Architecture and Products' started by Nick, Jul 23, 2013.

  1. sebbbi

    sebbbi Veteran

    Yes, Haswell has two FMA ports, and each FMA operation is worth 2 FLOP. So total 8 lanes * 4 FLOP = 32 FLOP per cycle (for a single core). If Skylake has AVX 512 and two FMA ports like Haswell, it can do 16 lanes * 4 FLOP = 64 FLOP per cycle (per core).
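    That arithmetic can be sketched in a few lines (a back-of-envelope helper, not anything Intel publishes; the port count and element width are the assumptions):

```python
# Peak FLOP/cycle per core = SIMD lanes * 2 FLOP per FMA * number of FMA ports.
def flops_per_cycle(vector_bits, fma_ports=2, element_bits=32):
    lanes = vector_bits // element_bits
    return lanes * 2 * fma_ports

haswell = flops_per_cycle(256)   # AVX2: 8 single-precision lanes
skylake = flops_per_cycle(512)   # hypothetical AVX-512 with two FMA ports
print(haswell, skylake)          # 32 and 64 FLOP/cycle per core
```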
     
  2. iMacmatician

    iMacmatician Regular

    Yeah you're right. Thanks.
     
  3. Homeles

    Homeles Newcomer

    Didn't the move to 256-bit wide AVX2 "require" a doubling in cache bandwidth? If so, wouldn't a 512-bit wide AVX3 require another doubling?

    If it does, then how would Intel go about doubling the bandwidth? Doubling L1D and L2 cache sizes? Would the instruction cache bandwidth need to be doubled as well? Registers?
     
  4. mczak

    mczak Veteran

    AVX essentially required doubling the cache bandwidth to keep the same bandwidth per instruction (that would have been 1 256-bit load + 1 256-bit store per clock instead of the 128-bit ones). But Intel didn't actually do that: Sandy Bridge (through some clever use of the 2 AGUs) could handle 2x128-bit loads + 1 128-bit store (and those 2x128-bit loads really need to be a split-up 256-bit load to achieve peak throughput).
    AVX2 doesn't really change the picture at all, but FMA certainly does, as for 2 FMAs per clock you really need more than a single load per clock. So Haswell has 3 AGUs, and the paths to cache are all 256 bits wide, so it indeed doubles bandwidth compared to Sandy Bridge (2x256-bit loads, 1x256-bit store).
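    Putting numbers on that (a simple sketch; the widths and port counts are the ones quoted above):

```python
# L1D bandwidth per clock = sum of load-port widths plus store-port width, in bytes.
def l1_bytes_per_cycle(load_bits, n_loads, store_bits, n_stores):
    return (load_bits * n_loads + store_bits * n_stores) // 8

sandy_bridge = l1_bytes_per_cycle(128, 2, 128, 1)  # 2x128b load + 1x128b store
haswell      = l1_bytes_per_cycle(256, 2, 256, 1)  # 2x256b load + 1x256b store
print(sandy_bridge, haswell)  # 48 and 96 bytes/cycle: an exact doubling
```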

    Yes it would, though cache size doesn't really come into play; the bank organization might be different, though. The "straightforward" solution would of course be to just have 2 512-bit loads + 1 512-bit store per clock, but who knows how it will work (there's presumably also (fast) gather, so some more changes to load/store may be in order).
     
  5. liolio

    liolio Aquoiboniste Legend

    I find it funny that Intel CPUs are going to end up wider than their iGPU, that is a shocker.

    Tiny question: to increase iGPU performance, make the platform more compelling for gamers, and put their new SIMD to good use, could Intel move vertex handling back to the CPU cores?
     
  6. Frontino

    Frontino Newcomer

    If they kept only the fixed-function units needed for D3D compliance and widened AVX directly to 1024 bits, wouldn't that be better/possible?
    Then, when D3D gets dumped, they only have to take out the FFUs.
     
  7. Neo

    Neo Banned

    Xeon Phi was launched before Haswell, so why would they state that the "goal" was 8x peak FLOPS in 4 generations if they had it earlier? Also, first they mention Haswell's FMA, and "then 2X more with Intel AVX-512". That doesn't make sense unless they're talking about something other than MICs. And why would they call the "evolution" to AVX-512 a contributor to the 8x goal? The only thing that can evolve to 512-bit is the CPU line.
    But why say another MIC successor will also support it? That makes no sense. Also, he makes the distinction between Xeon processors, and Xeon Phi coprocessors, and then goes on to say the 512-bit capabilities from the coprocessor will be brought into the official Intel instruction set in a way that can be utilized in processors as well. So clearly AVX-512 is coming to CPUs.
    Alright, let's assume Skylake doesn't support AVX-512. Then the only way this announcement could possibly make sense, is if there's a future Xeon CPU which has its own architecture (out-of-order and all that - worthy of the name), distinct from MIC and the desktop CPU architecture.

    The question then becomes: how likely is it that Intel would create a separate architecture for the various markets targeted by the Xeon brand? Is that worth the effort? And why keep it away from mainline CPUs, which also service several markets? Or would Xeons become the new high-end CPUs for gamers and such, while Skylake targets low-end desktops, laptops and smaller mobile devices?
     
  8. iMacmatician

    iMacmatician Regular

    Not saying the following will happen, but is it possible (or even useful) to have AVX-512 in Skylake but disable it for non-Xeon parts (leaving them with AVX 2.0)?
     
  9. Nick

    Nick Veteran

    You don't need fixed-function units. Everything can be turned into new SIMD instructions. Graphics is becoming 100% programmable anyway. People are doing anti-aliasing in shaders now, you can fetch unfiltered texels to do your own filtering, Tegra has programmable blending, etc.

    Future of 3D Graphics is in Software
     
  10. AlexV

    AlexV Heteroscedasticitate Moderator Veteran

    Nick, we don't like sock puppets (in this case alter-ego Neo) around here. You will not be warned again.
     
  11. moozoo

    moozoo Newcomer

    So Skylake might have ~1600 GFLOPS on its iGPU and 896 GFLOPS on its CPU cores.
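    One way to arrive at that 896 GFLOPS CPU figure (the core count and clock here are my assumptions, not anything confirmed: a quad-core part at 3.5 GHz with AVX-512 and dual FMA ports):

```python
flop_per_cycle = 64   # per core: 16 SP lanes * 2 FLOP per FMA * 2 FMA ports
cores = 4             # assumed mainstream quad-core
clock_ghz = 3.5       # assumed clock speed

peak_gflops = flop_per_cycle * cores * clock_ghz
print(peak_gflops)    # 896.0 GFLOPS
```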
     
  12. 3dilettante

    3dilettante Legend Alpha

    In any given cycle, the scheduler in Haswell can pick from 60 entries to send down the issue pipes, although this is not too much of a problem if we are going with the very, very optimistic case of 192 independent ROB entries ready to go on the rename and issue side of the pipeline. That case could live with as many scheduler entries as there are issue ports, as unlikely as it is.

    Determining latency hiding capability would also depend on what is meant by that term, such as whether we are looking primarily at whether the vector units are idling or not. This is typically what people look at when analyzing the latency hiding capability of a GPU, although that's not the same as a thread stalling in either architecture.

    A stall-free condition and optimum instruction mix for Haswell's 192 ROB entries, a single main-memory access excepted, would take 24 cycles to run through (192 entries over 8 issue ports).
    A purely VALU-focused measurement would take the 168 AVX registers, subtract 32 for the architectural state of the two threads, and divide that by two for the two FMA ports, yielding a respectable 68 cycles of latency hiding.
    That's 8 or ~22 nanoseconds respectively at 3 GHz, which is good enough for an on-die memory access.
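    Sketching that arithmetic (using the register, port, and buffer counts quoted above):

```python
rob_entries = 192
issue_ports = 8           # Haswell dispatch ports
avx_phys_regs = 168       # AVX physical register file
arch_state = 2 * 16       # 16 architectural YMM registers per thread, 2 threads
fma_ports = 2
clock_ghz = 3.0

rob_cycles = rob_entries // issue_ports                  # 24 cycles
valu_cycles = (avx_phys_regs - arch_state) // fma_ports  # 68 cycles
print(rob_cycles / clock_ghz, valu_cycles / clock_ghz)   # 8.0 ns and ~22.7 ns
```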

    That isn't a good measure of MLP, which could be used to overlap miss penalties. Load and store buffers (72 and 48) and the 10 or 16 outstanding line misses (L1 and L2 respectively) would track that better.
    It's difficult to find corresponding access numbers for GPUs. The ISA-permitted theoretical peaks for GCN, for example, are extremely high.
     
  13. pjbliverpool

    pjbliverpool B3D Scallywag Legend

    Only in the 4-core version though. We can safely assume that Skylake-E will have a minimum of 8 cores, but I'm hoping that even the mainstream models will be bumped up to 6 or preferably 8 cores.
     
  14. Nick

    Nick Veteran

    I guess another possibility is if this future Xeon is actually a socketed MIC. That way it doesn't have to be a coprocessor, and it could justify dropping the Phi suffix.
     
  15. rapso

    rapso Newcomer

    That's what I've been saying all along, too :)

    Merge the x86 (Core 2 Duo etc.) line with LRB and the iGPU to get a unified vector-unit architecture, and then feed it from various front ends (kind of like how Bulldozer shares its vector unit across 2 cores).
    Assuming Intel's SIMDs keep growing and growing, it makes no sense to let them idle while you're waiting for the other 50% of the die to render some images (and vice versa).

    Looking at the iGPU unit layout here: http://www.anandtech.com/show/6993/intel-iris-pro-5200-graphics-review-core-i74950hq-tested/2 it makes me think that those 2x4 SIMD units are actually AVX2-compatible units.

    And if you think about it, across GPUs, HPC CPUs, and desktop/mainstream CPUs, it's not really about the data-transforming units anymore; it's about how you feed them.
     