Intel extends AVX to 512-bit

Discussion in 'Architecture and Products' started by Nick, Jul 23, 2013.

  1. sebbbi

    sebbbi Veteran

    Yes, Haswell has two FMA ports, and each FMA operation is worth 2 FLOP. So total 8 lanes * 4 FLOP = 32 FLOP per cycle (for a single core). If Skylake has AVX 512 and two FMA ports like Haswell, it can do 16 lanes * 4 FLOP = 64 FLOP per cycle (per core).
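    That arithmetic can be sketched in a few lines (a back-of-envelope helper, not anything Intel publishes; the port count and element width are the assumptions):

```python
# Peak FLOP/cycle per core = SIMD lanes * 2 FLOP per FMA * number of FMA ports.
def flops_per_cycle(vector_bits, fma_ports=2, element_bits=32):
    lanes = vector_bits // element_bits
    return lanes * 2 * fma_ports

haswell = flops_per_cycle(256)   # AVX2: 8 single-precision lanes
skylake = flops_per_cycle(512)   # hypothetical AVX-512 with two FMA ports
print(haswell, skylake)          # 32 and 64 FLOP/cycle per core
```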
     
  2. iMacmatician

    iMacmatician Regular

    Yeah you're right. Thanks.
     
  3. Homeles

    Homeles Newcomer

    Didn't the move to 256-bit wide AVX2 "require" a doubling in cache bandwidth? If so, wouldn't a 512-bit wide AVX3 require another doubling?

    If it does, then how would Intel go about doubling the bandwidth? Doubling L1D and L2 cache sizes? Would the instruction cache bandwidth need to be doubled as well? Registers?
     
  4. mczak

    mczak Veteran

    AVX essentially required doubling the cache bandwidth to keep the same bandwidth per instruction (that would have been 1 256-bit load + 1 256-bit store per clock instead of the 128-bit ones). But Intel didn't actually do that: Sandy Bridge (through some clever use of the 2 AGUs) could handle 2x128-bit loads + 1 128-bit store (and those 2x128-bit loads really need to be a split-up 256-bit load to achieve peak throughput).
    AVX2 doesn't really change the picture at all, but FMA certainly does, as for 2 FMAs per clock you really need more than a single load per clock. So Haswell has 3 AGUs, and the paths to cache are all 256 bits wide, so it indeed doubles bandwidth compared to Sandy Bridge (2x256-bit loads, 1x256-bit store).
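    Putting numbers on that (a simple sketch; the widths and port counts are the ones quoted above):

```python
# L1D bandwidth per clock = sum of load-port widths plus store-port width, in bytes.
def l1_bytes_per_cycle(load_bits, n_loads, store_bits, n_stores):
    return (load_bits * n_loads + store_bits * n_stores) // 8

sandy_bridge = l1_bytes_per_cycle(128, 2, 128, 1)  # 2x128b load + 1x128b store
haswell      = l1_bytes_per_cycle(256, 2, 256, 1)  # 2x256b load + 1x256b store
print(sandy_bridge, haswell)  # 48 and 96 bytes/cycle: an exact doubling
```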

    Yes it would, though cache size doesn't really come into play; the bank organization might be different, though. The "straightforward" solution would of course be to just have 2 512-bit loads + 1 512-bit store per clock, but who knows how it will work (there's presumably also (fast) gather, so some more changes to load/store may be in order).
     
  5. liolio

    liolio Aquoiboniste Legend

    I find it funny that Intel CPUs are going to end up wider than their iGPU, that is a shocker.

    Tiny question: to increase iGPU performance, make the platform more compelling for gamers, and put their new SIMD to good use, could Intel move vertex handling back to the CPU cores?
     
  6. Frontino

    Frontino Newcomer

    If they kept only the fixed-function units needed for D3D compliance and widened AVX directly to 1024 bits, wouldn't that be better/possible?
    Then, when D3D gets dumped, they only have to take out the FFUs.
     
  7. Neo

    Neo Banned

    Xeon Phi was launched before Haswell, so why would they state that the "goal" was 8x peak FLOPS in 4 generations if they had it earlier? Also, first they mention Haswell's FMA, and "then 2X more with Intel AVX-512". That doesn't make sense unless they're talking about something other than MICs. And why would they call the "evolution" to AVX-512 a contributor to the 8x goal? The only thing that can evolve to 512-bit is the CPU line.
    But why say another MIC successor will also support it? That makes no sense. Also, he makes the distinction between Xeon processors, and Xeon Phi coprocessors, and then goes on to say the 512-bit capabilities from the coprocessor will be brought into the official Intel instruction set in a way that can be utilized in processors as well. So clearly AVX-512 is coming to CPUs.
    Alright, let's assume Skylake doesn't support AVX-512. Then the only way this announcement could possibly make sense, is if there's a future Xeon CPU which has its own architecture (out-of-order and all that - worthy of the name), distinct from MIC and the desktop CPU architecture.

    The question then becomes: how likely is it that Intel would create a separate architecture for the various markets targeted by the Xeon brand? Is that worth the effort? And why keep it away from mainline CPUs, which also service several markets? Or would Xeons become the new high-end CPUs for gamers and such, while Skylake targets low-end desktops, laptops and smaller mobile devices?
     
  8. iMacmatician

    iMacmatician Regular

    Not saying the following will happen, but is it possible (or even useful) to have AVX-512 in Skylake but disable it for non-Xeon parts (leaving them with AVX 2.0)?
     
  9. Nick

    Nick Veteran

    You don't need fixed-function units. Everything can be turned into new SIMD instructions. Graphics is becoming 100% programmable anyway. People are doing anti-aliasing in shaders now, you can fetch unfiltered texels to do your own filtering, Tegra has programmable blending, etc.

    Future of 3D Graphics is in Software
     
  10. AlexV

    AlexV Heteroscedasticitate Moderator Veteran

    Nick, we don't like sock puppets (in this case alter-ego Neo) around here. You will not be warned again.
     
  11. moozoo

    moozoo Newcomer

    So Skylake might have ~1600 GFLOPS on its iGPU and 896 GFLOPS on its CPU cores.
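    One way to arrive at that 896 GFLOPS CPU figure (the core count and clock here are my assumptions, not anything confirmed: a quad-core part at 3.5 GHz with AVX-512 and dual FMA ports):

```python
flop_per_cycle = 64   # per core: 16 SP lanes * 2 FLOP per FMA * 2 FMA ports
cores = 4             # assumed mainstream quad-core
clock_ghz = 3.5       # assumed clock speed

peak_gflops = flop_per_cycle * cores * clock_ghz
print(peak_gflops)    # 896.0 GFLOPS
```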
     
  12. 3dilettante

    3dilettante Legend Alpha

    In any given cycle, the scheduler in Haswell can pick from 60 entries to send down the issue pipes, although this is not too much of a problem if we are going with the very, very optimistic case of 192 independent ROB entries ready to go on the rename and issue side of the pipeline. That case could live with as many scheduler entries as there are issue ports, as unlikely as it is.

    Determining latency hiding capability would also depend on what is meant by that term, such as whether we are looking primarily at whether the vector units are idling or not. This is typically what people look at when analyzing the latency hiding capability of a GPU, although that's not the same as a thread stalling in either architecture.

    A stall-free condition and optimum instruction mix for Haswell's 192 ROB entries, a single main-memory access excepted, would take 24 cycles to run through (192 entries over 8 issue ports).
    A purely VALU-focused measurement would take the 168 AVX registers, subtract 32 for the architectural state of the two threads, and divide that by two for the two FMA ports, yielding a respectable 68 cycles of latency hiding.
    That's 8 or ~22 nanoseconds respectively at 3 GHz, which is good enough for an on-die memory access.
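    Sketching that arithmetic (using the register, port, and buffer counts quoted above):

```python
rob_entries = 192
issue_ports = 8           # Haswell dispatch ports
avx_phys_regs = 168       # AVX physical register file
arch_state = 2 * 16       # 16 architectural YMM registers per thread, 2 threads
fma_ports = 2
clock_ghz = 3.0

rob_cycles = rob_entries // issue_ports                  # 24 cycles
valu_cycles = (avx_phys_regs - arch_state) // fma_ports  # 68 cycles
print(rob_cycles / clock_ghz, valu_cycles / clock_ghz)   # 8.0 ns and ~22.7 ns
```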

    That isn't a good measure of MLP, which could be used to overlap miss penalties. Load and store buffers (72 and 48) and the 10 or 16 outstanding line misses (L1 and L2 respectively) would track that better.
    It's difficult to find corresponding access numbers for GPUs. The ISA-permitted theoretical peaks for GCN, for example, are extremely high.
     
  13. pjbliverpool

    pjbliverpool B3D Scallywag Legend

    Only in the 4-core version though. We can safely assume that Skylake-E will have a minimum of 8 cores, but I'm hoping that even the mainstream models will be bumped up to 6 or preferably 8 cores.
     
  14. Nick

    Nick Veteran

    I guess another possibility is if this future Xeon is actually a socketed MIC. That way it doesn't have to be a coprocessor, and it could justify dropping the Phi suffix.
     
  15. rapso

    rapso Newcomer

    That's what I've been saying all along, too :)

    Merge the x86 (Core 2 Duo etc.) line with LRB and the iGPU to get a unified vector-unit architecture, and then feed it from various front ends (kind of like how Bulldozer shares its vector unit across 2 cores).
    Assuming Intel's SIMDs keep growing and growing, it makes no sense to let them idle while you're waiting for the other 50% of the die to render some images (and vice versa).

    Looking at the iGPU unit layout here: http://www.anandtech.com/show/6993/intel-iris-pro-5200-graphics-review-core-i74950hq-tested/2 it makes me think that those 2x4 SIMD units are actually AVX2-compatible units.

    And if you think about it, across GPUs, HPC CPUs, and desktop/mainstream CPUs, it's not really about the data-transforming units anymore; it's about how you feed them.
     