22 nm Larrabee

Discussion in 'Architecture and Products' started by Nick, May 6, 2011.

  1. iMacmatician

    Regular

    Joined:
    Jul 24, 2010
    Messages:
    771
    Likes Received:
    200
    I hadn't thought of 2x 256-bit per core in a future Xeon Phi, and the last sentence seems like a good reason to do so (the same theoretical throughput as the current Xeon Phi, but more similar to Haswell). 2x 512-bit per core for Knights Landing, on second thought, may give too high an FP performance compared to the 3+ TFLOPS projection, at least if it uses Atom cores.
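    For reference, here is a quick sketch of the per-core peak arithmetic behind that comparison (my own illustration in C; the configurations are the ones being discussed above, not confirmed specs):

    Code:
    /* Per-core peak DP throughput: flops/cycle = FMA units x SIMD lanes x 2 (mul+add). */
    #include <stdio.h>

    static double dp_flops_per_cycle(int fma_units, int simd_bits) {
        return fma_units * (simd_bits / 64.0) * 2.0;   /* 64-bit doubles, one FMA counts as 2 flops */
    }

    int main(void) {
        printf("1x 512-bit FMA (current Xeon Phi core): %g flops/cycle\n", dp_flops_per_cycle(1, 512)); /* 16 */
        printf("2x 256-bit FMA (Haswell-like core):     %g flops/cycle\n", dp_flops_per_cycle(2, 256)); /* 16 */
        printf("2x 512-bit FMA (speculative KNL core):  %g flops/cycle\n", dp_flops_per_cycle(2, 512)); /* 32 */
        return 0;
    }
    The first two configurations come out identical per clock, which is the "same theoretical throughput" point; the third doubles it, hence the mismatch with the 3+ TFLOPS projection unless clocks or core counts come down.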
     
  2. moozoo

    Newcomer

    Joined:
    Jul 23, 2010
    Messages:
    109
    Likes Received:
    1
    The Intel Xeon processor roadmap for HPC lists Haswell Xeons at 500 GFLOPS DP peak.
    My i7 4770 gets 90 GFLOPS single precision and 45 GFLOPS double precision using FlopsCL and Intel OpenCL.

    So when they talk about 500 GFLOPS for Xeons, do they mean 10-core, quad-socket machines? A single $500 HD 7970 blows that away.
     
  3. Paran

    Regular Newcomer

    Joined:
    Sep 15, 2011
    Messages:
    251
    Likes Received:
    14

    They are talking about 14-core Haswell-EP or 18-core Haswell-EX.
     
  4. DavidC

    Regular

    Joined:
    Sep 26, 2006
    Messages:
    347
    Likes Received:
    24
    Not comparable at all.

    The 4770K numbers are measured, which will be lower than peak, and based on your figures it isn't even taking advantage of AVX2 FMA.

    3.5 GHz x 4 cores x 8 DP FMAs/cycle x 2 flops/FMA = 224 GFLOPS DP

    There will be a significant number of applications that will probably favor the 4770K in DP.
     
  5. DavidC

    Regular

    Joined:
    Sep 26, 2006
    Messages:
    347
    Likes Received:
    24
    Sounds reasonable, except that if they want to reach the "Petaflop system with 40MW in 2018" goal it really needs to have that kind of efficiency by then. The same can probably be said of Nvidia, which is expecting similar DP FLOPS/watt in that timeframe.

    They said that to build such a system it needs to achieve 10x the DP FLOPS/watt efficiency of current accelerators, which works out to roughly 50 DP GFLOPS/watt.

    Actually, there are parts that achieve more than that. The Xeon Phi 5110P does 1.011 TFLOPS @ 225 W, and the Xeon Phi 7120X/7120P does 1.208 TFLOPS @ 300 W. And while the 71xx parts have a 1.33 GHz Turbo, the rated FLOPS are at the 1.23 GHz base clock.

    I expect them to go the "more cores versus frequency" route, so imagine a 120+ core part with frequency more or less similar to what we have today. Remember they'll be at 14nm with Knights Landing, meaning die space freedom. I think the strategy behind the 15W ULT Haswell parts, using 40 EUs (i.e. die area) to reach lower power, is one they'll be pursuing here too. :)
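    Just to make those rated numbers explicit, here is the trivial GFLOPS/W arithmetic (a sketch; the inputs are only the figures quoted above):

    Code:
    /* DP GFLOPS per watt implied by the rated figures quoted above. */
    #include <stdio.h>

    int main(void) {
        printf("Xeon Phi 5110P: %.2f DP GFLOPS/W\n", 1011.0 / 225.0);   /* ~4.5 */
        printf("Xeon Phi 7120X: %.2f DP GFLOPS/W\n", 1208.0 / 300.0);   /* ~4.0 */
        return 0;
    }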
     
    #1165 DavidC, Jul 6, 2013
    Last edited by a moderator: Jul 6, 2013
  6. moozoo

    Newcomer

    Joined:
    Jul 23, 2010
    Messages:
    109
    Likes Received:
    1
    "Intel OpenCL* SDK 2013 introduces performance improvements that include full code generation on the Intel Advanced Vector Extensions (Intel AVX and Intel AVX2). The Implicit CPU Vectorization Module generates the Intel AVX and Intel AVX2 code depending on what the target platform supports."
    OpenCL has an fma function, so I see no reason an OpenCL kernel couldn't exploit AVX2 FMA.

    Can you point me at a benchmark that I can run that will come up with this magical 224 GFLOPS DP number?
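    Something like the following is roughly what I'd expect such a benchmark to boil down to (my own rough sketch, illustrative only; one Haswell core, built with something like gcc -O2 -mfma):

    Code:
    /* AVX2 FMA throughput test for one thread (illustrative sketch only).
       Haswell has two FMA ports with 5-cycle latency, so 10 independent
       accumulator chains are needed to keep both ports busy every cycle. */
    #include <immintrin.h>
    #include <stdio.h>
    #include <time.h>

    int main(void) {
        const long iters = 100000000L;                 /* 100M iterations */
        const __m256d b = _mm256_set1_pd(1.000000001);
        const __m256d c = _mm256_set1_pd(0.999999999);
        __m256d a0 = c, a1 = c, a2 = c, a3 = c, a4 = c,
                a5 = c, a6 = c, a7 = c, a8 = c, a9 = c;

        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (long i = 0; i < iters; i++) {
            a0 = _mm256_fmadd_pd(a0, b, c);
            a1 = _mm256_fmadd_pd(a1, b, c);
            a2 = _mm256_fmadd_pd(a2, b, c);
            a3 = _mm256_fmadd_pd(a3, b, c);
            a4 = _mm256_fmadd_pd(a4, b, c);
            a5 = _mm256_fmadd_pd(a5, b, c);
            a6 = _mm256_fmadd_pd(a6, b, c);
            a7 = _mm256_fmadd_pd(a7, b, c);
            a8 = _mm256_fmadd_pd(a8, b, c);
            a9 = _mm256_fmadd_pd(a9, b, c);
        }
        clock_gettime(CLOCK_MONOTONIC, &t1);

        /* Fold the accumulators so the compiler cannot discard the work. */
        __m256d acc = _mm256_add_pd(a0, a1);
        acc = _mm256_add_pd(acc, _mm256_add_pd(a2, a3));
        acc = _mm256_add_pd(acc, _mm256_add_pd(a4, a5));
        acc = _mm256_add_pd(acc, _mm256_add_pd(a6, a7));
        acc = _mm256_add_pd(acc, _mm256_add_pd(a8, a9));
        double sum[4];
        _mm256_storeu_pd(sum, acc);

        double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
        /* 10 FMAs/iter x 4 doubles x 2 flops (mul+add) = 80 flops per iteration. */
        printf("%.1f GFLOPS DP, one thread (checksum %g)\n",
               iters * 80.0 / secs / 1e9, sum[0]);
        return 0;
    }
    At 3.5 GHz one core peaks at 56 GFLOPS DP this way, so four copies (one per core) correspond to the 224 GFLOPS figure; a real solver like LINPACK will always land somewhat below that.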
     
  7. pcchen

    pcchen Moderator
    Moderator Veteran Subscriber

    Joined:
    Feb 6, 2002
    Messages:
    2,743
    Likes Received:
    106
    Location:
    Taiwan
  8. moozoo

    Newcomer

    Joined:
    Jul 23, 2010
    Messages:
    109
    Likes Received:
    1
    I get 68-83 GFLOPS DP for LINPACK xeon64.
    An OpenCL program called MaxFlopsCl gives ~67 GFLOPS for DP multiplication.
    It's not working for FMA; I will investigate.
    With FMA I guess that figure would roughly double, i.e. ~140 GFLOPS DP.

    In any case, GPU DP FLOPS >> CPU.

    I believe Kaveri is supposed to be ~1 TFLOPS single precision. If they built one with a 1/4 DP ratio (they won't) they could get ~256 GFLOPS DP.

    ---------------------
    Intel(R) Optimized LINPACK Benchmark data

    Current date/time: Sun Jul 07 12:58:18 2013

    CPU frequency: 3.898 GHz
    Number of CPUs: 1
    Number of cores: 4
    Number of threads: 8
    .......
    Performance Summary (GFlops)

    Size LDA Align. Average Maximal
    1000 1000 4 66.9419 70.5405
    2000 2000 4 73.4502 74.0641
    3000 3000 4 76.9867 77.1281
    4000 4000 4 78.7903 79.2638
    5000 5000 4 79.4511 79.8972
    10000 10000 4 82.1355 83.2612
    15000 15000 4 78.4566 79.3601
    20000 20000 4 78.0148 78.2111
    25000 25000 4 77.3551 77.4088
    30000 30000 4 79.2122 79.2122
    35000 35000 4 78.8542 78.8542
    40000 40000 4 78.5863 78.5863

    --------------------------
     
  9. pcchen

    pcchen Moderator
    Moderator Veteran Subscriber

    Joined:
    Feb 6, 2002
    Messages:
    2,743
    Likes Received:
    106
    Location:
    Taiwan
    I don't have it right now to test, but I think disabling Hyperthreading may improve LINPACK performance.
     
  10. Paran

    Regular Newcomer

    Joined:
    Sep 15, 2011
    Messages:
    251
    Likes Received:
    14
    I expect them to go for stronger cores, probably based on Silvermont cores from the new Atom.


    http://jobs.intel.com/job/Hillsboro-MicroarchitectVerification-Engineer-Job-OR-97006/2592907/
     
  11. Hornet

    Newcomer

    Joined:
    Nov 28, 2009
    Messages:
    120
    Likes Received:
    0
    Location:
    Italy
    No, it doesn't. The benchmark should use thread affinity properly to avoid co-scheduling two threads on the same core. Even if it doesn't, most OS schedulers do this job quite well.
     
  12. Hornet

    Newcomer

    Joined:
    Nov 28, 2009
    Messages:
    120
    Likes Received:
    0
    Location:
    Italy
    That means less than 80% efficiency, which is quite unimpressive for LINPACK.
     
  13. OpenGL guy

    Veteran

    Joined:
    Feb 6, 2002
    Messages:
    2,357
    Likes Received:
    28
    In a hyperthreaded environment, you have more available threads than cores. Thus, if you schedule one thread per hardware thread, you will maximize hardware thread utilization. However, enabling hyperthreading reduces the amount of resources available per thread, thus, in some cases, you might get better results with hyperthreading disabled and scheduling fewer threads.

    Thread affinity does nothing to fix the issue that hyperthreading reduces the amount of resources available per thread.
     
  14. Homeles

    Newcomer

    Joined:
    May 25, 2012
    Messages:
    234
    Likes Received:
    0
    Linpack is actually a classic case for increasing performance by disabling hyperthreading.
     
  15. pcchen

    pcchen Moderator
    Moderator Veteran Subscriber

    Joined:
    Feb 6, 2002
    Messages:
    2,743
    Likes Received:
    106
    Location:
    Taiwan
    I don't have access to a Haswell system right now (I'm at home), so I tested with my home rig (a Sandy Bridge Core i7 2600).

    One problem on my system with the default testing script in the latest version is the affinity setting. It's set to compact, but that's wrong; it should be set to scatter. Setting it to compact restricts all the threads to two physical cores instead of all four. I don't know why the script is set up that way, though.

    Restricting it to 4 threads versus allowing 8 threads makes very little difference (93.6 GFLOPS vs. 92.2 GFLOPS, ~86% of peak). Watching the task manager while testing again, I think that's because Intel's LINPACK now automatically restricts the number of threads when Hyperthreading is detected.

    For those with Haswell to test, I suggest modifying the script (runme_xeon64.bat) and changing:

    Code:
    set KMP_AFFINITY=nowarnings,compact,granularity=fine
    to
    set KMP_AFFINITY=nowarnings,scatter,granularity=fine
    As for Haswell's lower efficiency, I think it's natural that an FMA+FMA design achieves a lower fraction of its peak performance than a MUL+ADD design.
     
  16. Exophase

    Veteran

    Joined:
    Mar 25, 2010
    Messages:
    2,406
    Likes Received:
    429
    Location:
    Cleveland, OH
    Silvermont changes more than just adding OoOE, which, by the way, is for the integer pipelines only and not the FP/SIMD or memory pipelines. I expect that it's precisely because it's for the ALUs only that the die impact of OoOE is so small, but that's no good at all for something like Xeon Phi. You can't use Saltwell vs Silvermont performance as an evaluation of OoO vs SMT.

    Saltwell's SMT is pretty different from what's in Xeon Phi anyway, not to mention that the typical workloads are very different. If you're in a situation where you can require at least two threads loaded on each core in order to get good performance, then hiding latency and filling execution unit occupancy with simple SMT is going to be a much better choice than doing so with OoOE. Xeon Phi falls under that category, unless Intel really does want to do a unified core like Nick thinks.
     
  17. liolio

    liolio Aquoiboniste
    Legend

    Joined:
    Jun 28, 2005
    Messages:
    5,723
    Likes Received:
    193
    Location:
    Stateless
    OK, I think it is going to be interesting to compare how Jaguar and Silvermont do in that regard.
    --------------------------------------
    EDIT
    By the way, what is your take on Intel's performance goals? It is quite a jump in both throughput and power efficiency.
    To some extent it has me wondering about what DavidC stated: how about a beefy iGPU, the first Intel iGPU to support DP calculation?
    ---------
    EDIT 2: One thing I find bothersome about all this talk of Intel widening its SIMD is that its iGPUs are actually highly threaded machines, but the execution units / SIMD are pretty narrow (8 wide, I would think).
     
    #1177 liolio, Jul 8, 2013
    Last edited by a moderator: Jul 8, 2013
  18. Hornet

    Newcomer

    Joined:
    Nov 28, 2009
    Messages:
    120
    Likes Received:
    0
    Location:
    Italy
    A decent LAPACK/LINPACK implementation for hyper-threaded processors only spawns as many threads as the number of cores, and sets thread affinity to make sure that each thread runs on a separate core.
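    A minimal sketch of that approach (Linux-specific, using pthread_setaffinity_np; it assumes the common enumeration where logical CPUs 0..N-1 are distinct physical cores and the HT siblings come afterwards, which is worth verifying with /proc/cpuinfo or lstopo first):

    Code:
    /* One worker thread per physical core, each pinned to its own core.
       Build with: gcc -O2 -pthread affinity.c */
    #define _GNU_SOURCE
    #include <pthread.h>
    #include <sched.h>
    #include <stdio.h>

    enum { PHYS_CORES = 4 };            /* e.g. an i7 4770K */

    static void *worker(void *arg) {
        long core = (long)arg;
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET((int)core, &set);
        pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
        /* ... this thread's share of the DGEMM / LINPACK work goes here ... */
        return NULL;
    }

    int main(void) {
        pthread_t tid[PHYS_CORES];
        for (long i = 0; i < PHYS_CORES; i++)
            pthread_create(&tid[i], NULL, worker, (void *)i);
        for (long i = 0; i < PHYS_CORES; i++)
            pthread_join(tid[i], NULL);
        return 0;
    }
    With Intel's prebuilt LINPACK binary you should get the same effect without touching code by setting OMP_NUM_THREADS to the core count and KMP_AFFINITY=scatter, as pcchen suggested above.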
     
  19. pcchen

    pcchen Moderator
    Moderator Veteran Subscriber

    Joined:
    Feb 6, 2002
    Messages:
    2,743
    Likes Received:
    106
    Location:
    Taiwan
    I think the point is that for some highly optimized code such as LINPACK, Hyperthreading is not useful, so you have to try not to use it.
     
  20. iMacmatician

    Regular

    Joined:
    Jul 24, 2010
    Messages:
    771
    Likes Received:
    200
    From CPU World: "Intel plans to expand Xeon Phi 3100 and 5100 series in 2014."

    They can't do the same thing with the 71xx series since they already have 16 GB, but I'm still thinking there might be a clock speed bump for those next year, possibly with all 62 cores enabled, to reach higher performance and perf/W ( ≥ 5 DP GF/W?).

    Also, I have a question. Since Knights Landing is going to use DDR4, what sorts of memory bus widths and memory capacities can we expect to see on it?
     