Knights Landing Speculation

Discussion in 'Architecture and Products' started by dkanter, Nov 19, 2013.

  1. dkanter

    Regular

    Joined:
    Jan 19, 2008
    Messages:
    360
    Likes Received:
    20
  2. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,125
    Likes Received:
    2,885
    Location:
    Well within 3d
    I look forward to seeing the updates that start giving the contours of the speculated design. I'm curious just how custom the core can be, or which other core line it will most resemble.

    Is there anything we can infer from the AVX3 extensions that lie outside the foundation instructions, for instance whether Skywell will support them?
    If there is a disparity there, would that have some import on the differences between the cores?
    In the other direction, is it known if Knights Landing will omit other extensions to x86, keeping it out of lockstep with the main lines?

    One point of clarification: while it is the case that the pre-Knights Landing chips aren't bootable, is it really the core's fault, or something about the design as a whole that caused it?
    It's not like the P54 couldn't boot.

    There were details mentioned about the LLC and ring bus for Xeon Phi, and their similarity to the big core ring bus. Has the cache protocol been disclosed for Larrabee or others in that line? I've just gone with the assumption they probably didn't have time or justification to put MESIF into the P54 cache subsystem.


    One thing I have been wondering about is whether there is something significant about how the caches in Larrabee and Knights Landing are indexed versus something like GCN. GCN's caches are described as virtually addressed (the Onion bus excepted?), and I am going with the assumption that the TLB still figures in the Phi memory pipeline.
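    The virtual-vs-physical indexing question can be made concrete with a toy calculation. The cache geometry below is purely illustrative, not any actual Phi or GCN configuration: when all of the set-index bits fall inside the page offset, a lookup indexed with the virtual address selects the same set a physical address would, so the virtual/physical distinction stops mattering for indexing.

    ```cpp
    #include <cstdint>
    #include <cassert>

    // Hypothetical geometry: 64-byte lines, 64 sets, 4 KiB pages.
    // Line bits (6) + set bits (6) = 12 = page-offset bits, so the set index
    // is drawn entirely from the page offset, which virtual and physical
    // addresses share.
    constexpr uint64_t kLineBits = 6;   // 64-byte cache lines
    constexpr uint64_t kSetBits  = 6;   // 64 sets
    constexpr uint64_t kPageBits = 12;  // 4 KiB pages

    uint64_t set_index(uint64_t addr) {
        return (addr >> kLineBits) & ((1u << kSetBits) - 1);
    }

    int main() {
        // A virtual and a physical address for the same page (offset 0xa40):
        uint64_t vaddr = 0x7f0000001a40ULL;
        uint64_t paddr = 0x0000001234a40ULL;
        assert((vaddr & ((1u << kPageBits) - 1)) ==
               (paddr & ((1u << kPageBits) - 1)));
        // Index bits lie inside the shared page offset, so the sets agree:
        assert(set_index(vaddr) == set_index(paddr));
        return 0;
    }
    ```

    Once the index spills above the page offset (larger cache, same associativity), the two diverge and a virtually indexed cache has to deal with aliasing.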
     
  3. Ninjaprime

    Regular

    Joined:
    Jun 8, 2008
    Messages:
    337
    Likes Received:
    1
    I don't really see how they plan to more than double perf while using AVX3 unless they either double core count or double clock rates. I think that lends credibility to the 72 core custom Atom based rumor, a few more cores clocked much higher seems more doable (and desirable) than putting 120+ cores on a chip.
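    The doubling arithmetic behind that argument is just peak DP FLOPS = cores x clock x DP lanes x 2 (FMA). The core counts and clocks below are illustrative placeholders, not confirmed specs, but they show why you need more cores, more clock, or wider/more vector units to move the peak:

    ```cpp
    #include <cstdio>

    // Back-of-the-envelope peak DP throughput in GFLOPS.
    // fma_ops = 2 counts a fused multiply-add as two floating-point ops.
    double peak_gflops(int cores, double ghz, int dp_lanes, int fma_ops = 2) {
        return cores * ghz * dp_lanes * fma_ops;
    }

    int main() {
        // Knights Corner-class guess: ~61 cores, ~1.1 GHz, one 512-bit VPU
        // (8 DP lanes). These numbers are illustrative, not official.
        std::printf("KNC-class: %.0f GFLOPS\n", peak_gflops(61, 1.1, 8));
        // Rumored 72-core part at a hypothetical 1.5 GHz, same lane count:
        std::printf("72c @ 1.5 GHz: %.0f GFLOPS\n", peak_gflops(72, 1.5, 8));
        return 0;
    }
    ```

    With those assumed inputs, 72 cores at a higher clock lands well short of double the KNC-class figure, which is the gap a wider vector unit (or a second VPU per core) would have to close.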
     
  4. CarstenS

    Veteran Subscriber

    Joined:
    May 31, 2002
    Messages:
    4,798
    Likes Received:
    2,056
    Location:
    Germany
  5. Megadrive1988

    Veteran

    Joined:
    May 30, 2002
    Messages:
    4,638
    Likes Received:
    148
  6. AnarchX

    Veteran

    Joined:
    Apr 19, 2007
    Messages:
    1,559
    Likes Received:
    34
  7. iMacmatician

    Regular

    Joined:
    Jul 24, 2010
    Messages:
    773
    Likes Received:
    200
    Interesting that they claim that the socketed variants will come earlier than the PCIe variants. Late 2015 for the latter could mean that they'll be up against GPUs two generations from now (16FF?).
     
  8. Paran

    Regular Newcomer

    Joined:
    Sep 15, 2011
    Messages:
    251
    Likes Received:
    14
  9. xpea

    Regular Newcomer

    Joined:
    Jun 4, 2013
    Messages:
    372
    Likes Received:
    309
    Nice reading.
    However, a quick comment on this point:
    The KL timeframe will face 20nm Maxwell and/or 16nm FinFET Volta, and both will use a bunch of high-performance ARMv8 cores with Nvidia's idea of HSA. One NV guru also said in an interview that their goal is to get off main-CPU dependence in the HPC space, so everyone should expect bootable Maxwell/Volta...
     
  10. Paran

    Regular Newcomer

    Joined:
    Sep 15, 2011
    Messages:
    251
    Likes Received:
    14
    Of course it will face 20nm Maxwell, but even then it will be hard for Nvidia to match KNL. They need a doubling of DP performance or more to match KNL's DP numbers. That is not easy, since Kepler to Maxwell is most likely a much smaller architectural jump than Fermi to Kepler was, and the same goes for the process shrink. Also, KNL's TDP decreased to 160-215W according to the leaks, and the eDRAM usage in KNL should give Intel a huge latency advantage over Maxwell. This is a very tough opponent for Maxwell in 2015.
     
  11. ninelven

    Veteran

    Joined:
    Dec 27, 2002
    Messages:
    1,702
    Likes Received:
    117
    What is this statement based upon, if you don't mind me asking?
     
  12. xpea

    Regular Newcomer

    Joined:
    Jun 4, 2013
    Messages:
    372
    Likes Received:
    309
    Well, I will take a pragmatic, wait-and-see approach when it comes to Intel's numbers in the HPC space. From many interviews with people in the know, current Xeon Phi's real-world HPC efficiency is way below Kepler's.
    So even if the KL DP theoretical performance claims are very high, let's wait and see what it can do in the field before announcing a winner...

    edit: first Google link for xeon phi vs tesla: http://blog.xcelerit.com/intel-xeon-phi-vs-nvidia-tesla-gpu/
     
    #12 xpea, Jan 5, 2014
    Last edited by a moderator: Jan 5, 2014
  13. LiXiangyang

    Newcomer

    Joined:
    Mar 4, 2013
    Messages:
    81
    Likes Received:
    47
    I bet it can barely match the performance that a 20nm Maxwell can deliver.

    Just ignore the raw Gflops nonsense: besides a few LAPACK routines where the two were comparable, for the more general and more flexible tasks my co-workers and I have tested so far, MIC is roughly half the performance of the K20X, despite the roughly equal rated Gflops of the two products.

    And furthermore, despite the marketing nonsense, it actually takes more effort to optimize code on MIC than on a GPU, and the code is less portable.

    The main problems with MIC are:

    1. SIMD vs. SIMT: the programming interface of CUDA is thread-level, whilst for MIC it is a very fat vector level, which means you need to write assembly-like vector-op code that could end up being less portable to future generations of MIC. And of course you have to forget template-like abstractions for high productivity, and more or less stick with C and an assembly-like programming style.

    2. The GPU has a programmable L1 cache and large register files that many general tasks can take advantage of, whilst MIC has no such features, nor do they intend to implement them in future versions, according to Intel technical staff I have met.

    3. It seems to me Nvidia's hardware can simply handle/manage massively parallel workloads better and can utilize its memory bandwidth better, for whatever reason.

    If the rumored Maxwell specs are true, I don't think Nvidia needs to worry about Intel's MICs in the near future.
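    The SIMD-vs-SIMT contrast in point 1 can be sketched in plain C++. The SSE intrinsics below are only a narrow stand-in for MIC's 512-bit ones, and both functions are illustrative, but they show how the vector width and the remainder loop leak into the explicit-SIMD source while the SIMT-style body stays scalar:

    ```cpp
    #include <xmmintrin.h>  // SSE intrinsics (baseline on x86-64)
    #include <cassert>

    // SIMT style (CUDA-like): write the scalar body once per element; the
    // loop here stands in for a grid of hardware threads.
    void saxpy_simt_style(float a, const float* x, float* y, int n) {
        for (int i = 0; i < n; ++i)
            y[i] = a * x[i] + y[i];
    }

    // Explicit SIMD style (MIC/intrinsics-like): the programmer manages the
    // vector width, loads, stores, and the scalar tail by hand.
    void saxpy_simd_style(float a, const float* x, float* y, int n) {
        __m128 va = _mm_set1_ps(a);
        int i = 0;
        for (; i + 4 <= n; i += 4) {
            __m128 vx = _mm_loadu_ps(x + i);
            __m128 vy = _mm_loadu_ps(y + i);
            _mm_storeu_ps(y + i, _mm_add_ps(_mm_mul_ps(va, vx), vy));
        }
        for (; i < n; ++i)  // tail the programmer must remember to handle
            y[i] = a * x[i] + y[i];
    }

    int main() {
        float x[6] = {1, 2, 3, 4, 5, 6};
        float y1[6] = {1, 1, 1, 1, 1, 1}, y2[6] = {1, 1, 1, 1, 1, 1};
        saxpy_simt_style(2.0f, x, y1, 6);
        saxpy_simd_style(2.0f, x, y2, 6);
        for (int i = 0; i < 6; ++i) assert(y1[i] == y2[i]);
        assert(y1[0] == 3.0f && y1[5] == 13.0f);
        return 0;
    }
    ```

    Porting the second function to a different vector width means touching every intrinsic, which is the portability complaint in a nutshell.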
     
    #13 LiXiangyang, Jan 6, 2014
    Last edited by a moderator: Jan 6, 2014
  14. moozoo

    Newcomer

    Joined:
    Jul 23, 2010
    Messages:
    109
    Likes Received:
    1
    I'm wondering how something like AMD's Berlin APU (server APU) would compare against Knights Landing if they added 4P support (4 sockets, via HyperTransport?) and a 1/2 DP rate to the GPU.
     
  15. Alexko

    Veteran Subscriber

    Joined:
    Aug 31, 2009
    Messages:
    4,496
    Likes Received:
    910
    Probably not too well, but then again Toronto should be out by then, presumably on 20nm.
     
  16. rpg.314

    Veteran

    Joined:
    Jul 21, 2008
    Messages:
    4,298
    Likes Received:
    0
    Location:
    /
    Do you use Cilk for MIC? Intrinsics? Something else?
     
    #16 rpg.314, Jan 6, 2014
    Last edited by a moderator: Jan 6, 2014
  17. Andrew Lauritzen

    Moderator Veteran

    Joined:
    May 21, 2004
    Messages:
    2,526
    Likes Received:
    454
    Location:
    British Columbia, Canada
    For the love of god, do yourself a favor and use ISPC for SIMD :) Then you can use Cilk on top of it for work stealing or whatever solution you want. It's a super-productive and powerful environment, speaking from experience.
     
  18. LiXiangyang

    Newcomer

    Joined:
    Mar 4, 2013
    Messages:
    81
    Likes Received:
    47
    I tried Cilk Plus before, but it seems the compiler cannot handle vectorization very efficiently when the functions being mapped are more complicated, containing some loops and branches (compared to CUDA), so I stick with hand-tuned vectorization within the function body instead.

    As for intrinsics, well, I have used them extensively, but they look like assembly, especially if you use them a lot in your code.
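    As a hedged illustration of the pragma-style middle ground between auto-vectorization and hand-written intrinsics: OpenMP's `#pragma omp simd` (a close cousin of Cilk Plus's `#pragma simd`; it is simply ignored as an unknown pragma when OpenMP is off) lets the programmer assert that a loop with a branch in its body is safe to vectorize. The kernel below is a made-up example, not anyone's production code:

    ```cpp
    #include <cassert>

    // Scale then clamp each element. The ternary in the body is the kind of
    // branch that trips up conservative auto-vectorizers; the pragma promises
    // the iterations are independent so it can be compiled as a masked blend.
    void clamp_scale(const float* in, float* out, int n, float lo, float hi) {
        #pragma omp simd
        for (int i = 0; i < n; ++i) {
            float v = in[i] * 2.0f;
            out[i] = v < lo ? lo : (v > hi ? hi : v);
        }
    }

    int main() {
        float in[4] = {-1.0f, 0.25f, 1.0f, 10.0f};
        float out[4];
        clamp_scale(in, out, 4, 0.0f, 2.0f);
        assert(out[0] == 0.0f && out[1] == 0.5f);
        assert(out[2] == 2.0f && out[3] == 2.0f);
        return 0;
    }
    ```

    Whether the compiler actually emits good vector code for bodies with real loops inside is exactly the complaint above; the pragma only removes the dependence analysis, not the code-generation problem.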
     
  19. rpg.314

    Veteran

    Joined:
    Jul 21, 2008
    Messages:
    4,298
    Likes Received:
    0
    Location:
    /
    Sounds like only the SPMD model works for vectorization.

    It's a shame ISO C++ is going the annotated-loop way. Sigh...
     
  20. Andrew Lauritzen

    Moderator Veteran

    Joined:
    May 21, 2004
    Messages:
    2,526
    Likes Received:
    454
    Location:
    British Columbia, Canada
    Yeah, ISPC is clearly better, and it's better than the GPU computing languages too. If the whole persistent-threads line of research has shown us anything, it's that virtualizing (vs. parameterizing) the SIMD width is far too harmful to the performance of non-trivial kernels.

    Unfortunately HPC and other parties have too much sway as far as the standards go, and they are completely in the land of non-coding physicists who just want to trust compiler magic to get them a 2x even on 8+-wide SIMD... and again, I speak from experience, as someone who has rewritten a lot of scientist code :)
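    A minimal sketch of what "parameterizing" rather than "virtualizing" the width means, using a C++ template as a stand-in for ISPC's compile-time gang size (the function and names here are invented for illustration): the width W is a visible compile-time constant the kernel is specialized against, so the inner loop fully unrolls, instead of being a hidden runtime detail the program cannot reason about.

    ```cpp
    #include <cassert>
    #include <cstddef>

    // Sum the input in groups of W, one output per group. Each group of W
    // "lanes" cooperates on one result, gang-style; because W is a template
    // parameter, the lane loop has a static trip count the compiler can
    // unroll and map onto actual vector lanes.
    template <std::size_t W>
    void sum_chunks(const float* in, float* out, std::size_t n) {
        for (std::size_t g = 0; g < n / W; ++g) {
            float acc = 0.0f;
            for (std::size_t lane = 0; lane < W; ++lane)
                acc += in[g * W + lane];
            out[g] = acc;
        }
    }

    int main() {
        float in[8] = {1, 2, 3, 4, 5, 6, 7, 8};
        float out4[2], out8[1];
        sum_chunks<4>(in, out4, 8);  // specialize for a 4-wide target...
        sum_chunks<8>(in, out8, 8);  // ...or an 8-wide one, at compile time
        assert(out4[0] == 10.0f && out4[1] == 26.0f);
        assert(out8[0] == 36.0f);
        return 0;
    }
    ```

    The contrast with a virtualized width is that here the source can make width-dependent decisions (blocking, tails, shuffles) explicitly, which is what non-trivial kernels end up needing.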
     
  • About Us

    Beyond3D has been around for over a decade and prides itself on being the best place on the web for in-depth, technically-driven discussion and analysis of 3D graphics hardware. If you love pixels and transistors, you've come to the right place!

    Beyond3D is proudly published by GPU Tools Ltd.