Knights Landing details at RWT

Discussion in 'Architecture and Products' started by dkanter, Jan 13, 2014.

  1. RecessionCone

    RecessionCone Regular Subscriber

    Yes, you're right: KNL runs the code you've already been running. However, it runs that code slower than HSW. Have you ever used a Xeon Phi?

    Perhaps counter-intuitively, there are far more applications in the wild that use CUDA or OpenCL than use KNL vector instructions effectively. KNL is several years too late to the GPU compute party, and it has the same drawback as GPUs: if you want improved compute density, you have to rewrite your code in a non-trivial way. However, because it's so late, it has to compete against the far more mature GPU compute ecosystem, with a significant performance/W and performance/$ disadvantage compared to either NVIDIA or AMD GPUs, but without a gaming market to justify Intel's considerable R&D costs.

    I do computationally dense simulations for a living. We buy large numbers of GPUs, and we couldn't get our work done without them. We write very little CUDA or OpenCL code - it's not necessary to get the job done, thanks to the libraries that already exist. I'd buy Xeon Phi if it would make my applications run faster, but so far it's been a big disappointment. KNL looks better than KNC in many ways, but adoption will be slow, because the important libraries don't support KNL, and writing code to make Xeon Phi perform well is much more difficult than writing efficient GPU code. After years of broken promises (Larrabee! Knights Ferry! Knights Corner! Knights Landing!), Intel has a lot to prove with respect to Xeon Phi performance on things that matter.
     
    Grall and Lightman like this.
  2. Blazkowicz

    Blazkowicz Legend

    Pascal competes on this point somehow when used with an IBM POWER8 or POWER9 CPU. No, I take that back: it's only with Volta and POWER9, Volta benefiting from NVLink 2.0. (This source helped me: http://www.hardware.fr/news/13987/ibm-power9-nvidia-volta-100-petaflops-2017.html)

    Nvidia will literally use a connection with the same order of magnitude of bandwidth, with coherency, so it should be about as seamless, though likely slower and with higher latency.
     
  3. psurge

    psurge Regular

    I thought NVLink was supposed to arrive with Pascal (sans cache-coherency)?

    Anyway, at work someone I know did try a Xeon Phi on a non-trivial image processing code (without a rewrite), and the experience was exactly as spworley described: it worked, but was slower than on a regular Xeon. That said, it's not obvious to me why writing new code that efficiently makes use of the Xeon Phi vector units would be so much harder than writing efficient GPU code. Does Intel not provide a decent OpenCL implementation for Phi? Is Intel ISPC not viable for some reason?
     
    Dade likes this.
  4. pharma

    pharma Veteran

    I think you're right .... According to the Hardware.fr link above, Pascal will use NVLink along with HBM, but Volta will use NVLink 2.0, which also supports fully coherent memory connectivity between CPU and GPU.

     
  5. Dade

    Dade Newcomer

    Exactly, I don't understand why Xeon Phi cannot run OpenCL as well as any other GPU.

    ISPC works well (at least in my field of interest), but who wants to rewrite their code yet again after all the effort of moving to OpenCL (and that was quite a pain, given the horribly buggy drivers provided by many vendors)?
     
  6. RecessionCone

    RecessionCone Regular Subscriber

    Have you tried running OpenCL code on Xeon Phi? My experience has been that it requires float16 vectors in your code in order to perform well. Intel's OpenCL compiler for Xeon Phi did not vectorize across lanes, last time I used it, but instead required the user to manually vectorize their code (like when using intrinsics). Writing 16-way vector code is not fun. OpenCL could do what ISPC does and vectorize work items across lanes, but like I said, at the moment it doesn't.
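    To make the distinction concrete, here's a plain-C sketch (not actual OpenCL; the function names are illustrative, and the 16-wide loop just mimics what a float16 kernel looks like to the programmer) of the two styles being contrasted: scalar per-work-item code that a compiler like ISPC maps across lanes automatically, versus manually strip-mined 16-wide code.

    ```c
    #include <stddef.h>

    /* Scalar per-element kernel: one "work item" per element.  An
     * ISPC-style compiler maps consecutive work items onto the 16
     * SIMD lanes for you. */
    void saxpy_scalar(float a, const float *x, const float *y,
                      float *out, size_t n) {
        for (size_t i = 0; i < n; i++)
            out[i] = a * x[i] + y[i];
    }

    /* Manually 16-wide version, the style the KNC OpenCL compiler
     * effectively demanded: the programmer strip-mines the loop and
     * operates on 16-element groups, mirroring a float16 kernel,
     * and has to handle the remainder elements themselves. */
    void saxpy_wide16(float a, const float *x, const float *y,
                      float *out, size_t n) {
        size_t i = 0;
        for (; i + 16 <= n; i += 16)
            for (int lane = 0; lane < 16; lane++)  /* one float16 op */
                out[i + lane] = a * x[i + lane] + y[i + lane];
        for (; i < n; i++)                         /* scalar remainder */
            out[i] = a * x[i] + y[i];
    }
    ```

    Both compute the same thing; the point is how much extra bookkeeping the second style forces on you for every kernel.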

    The problem with Xeon Phi is that Intel doesn't have a coherent programming model strategy for it. There are several competing visions:
    1. Run normal x86 code, you just get more threads!
    2. Recompile with ICC autovectorizing magic!
    3. Rewrite your code using the Cilk Plus C extensions (array notation etc.) & let ICC do the work!
    4. Use OpenMP and extra pragmas for vectorization!
    5. Use OpenCL with 16-wide vector types!
    6. Write Xeon Phi intrinsics!

    Of all these visions, #1 is the most alluring, but also the most disappointing. Performance is worse than traditional Xeon in this case.

    #6 is the most practical, and everyone I know that seriously uses Xeon Phi and requires good performance does this. However, this is way more work than writing OpenCL code.

    ISPC is great. I wish Intel would focus its programming model efforts on it. But because of Intel's culture, ISPC isn't taken seriously outside of the graphics group, and Intel has not committed serious resources to really push it forward. (I think this dysfunction is why Matt Pharr is at Google now.)

    The Intel compiler team is always pushing 2, 3 & 4, but I haven't seen good performance on serious projects with any of this, and am quite skeptical.

    OpenCL could work, but again: Intel hasn't devoted the resources to make a great OpenCL compiler for Xeon Phi. And I doubt it will, because the corporate culture at Intel is dominated by manufacturing, not programming models. All the OpenCL benchmarks I've seen for Xeon Phi have been quite disappointing, even though people have gone to great lengths to write OpenCL with 16-wide vector types to try to make it work.

    Intel doesn't really want Xeon Phi to succeed. Just as they didn't really want Itanium to succeed. Traditional Xeon is king at Intel, and I believe will be the only thing left standing.
     
    spworley, Lightman and Grall like this.
  7. Jawed

    Jawed Legend

    No.

    Not sure how that's counter-intuitive, since MIC is still so new.

    It's worth noting that the #1 machine in the Top 500 is Phi based. So someone's been using it.

    I wonder if there'll be a market for 300W gaming GPUs in 5 years?... 10 years, definitely not.

    I've spent a while today looking at what's out there in Phi land and what I see looks like 2003-2005 in GPGPU. Most people with results to report are "experimenting" with Phi. Some code gives precisely the performance expected (SGEMM, ooh what a surprise) and other stuff has gotchas in unexpected places (nothing to do with the compute architecture, for instance).

    I think it's worth pointing out that people are still trying to figure out how to write fast SGEMM on GPUs (DGEMM, not so much) and they've had, oh I dunno, about 12 years of practice now. GPUs are actually more difficult to program than Phi, because there are more layers in the machine's architecture, each of which hides corner cases.

    As an aside, I believe I've written the fastest GCN SGEMM, 3.1 TFLOPS at 1GHz on 7970 (sticking with OpenCL, though the temptation to patch the binary is strong). I've been talking about SGEMM for most of a decade, because it's not easy and GPUs have made it particularly hard:

    NVIDIA GT200 Rumours & Speculation Thread

    Whereas by all appearances it's pretty simple on Phi to get the expected performance. (In b4 the smart alec says, "well that's all Phi's good for".)
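    To illustrate why SGEMM is such a long-running battleground: the naive triple loop does the right arithmetic but is bandwidth-bound, so fast kernels block the loops so that a tile of C stays resident (in registers/LDS on a GPU, in cache here) across the whole k loop. A minimal C sketch of the idea - the tile size BS=4 is purely illustrative; picking tile shapes that fill the register file without spilling is the part people spend years tuning:

    ```c
    #include <stddef.h>
    #include <string.h>

    enum { BS = 4 };  /* illustrative tile size */

    /* C = A * B for n x n row-major matrices, loop-blocked so each
     * BS x BS tile of C is accumulated in a local before being
     * written back, instead of streaming C through memory once
     * per k as the naive triple loop does. */
    void sgemm_blocked(const float *A, const float *B, float *C, size_t n) {
        memset(C, 0, n * n * sizeof *C);
        for (size_t ii = 0; ii < n; ii += BS)
          for (size_t jj = 0; jj < n; jj += BS)
            for (size_t kk = 0; kk < n; kk += BS)
              for (size_t i = ii; i < ii + BS && i < n; i++)
                for (size_t j = jj; j < jj + BS && j < n; j++) {
                    float acc = C[i * n + j];   /* tile entry stays hot */
                    for (size_t k = kk; k < kk + BS && k < n; k++)
                        acc += A[i * n + k] * B[k * n + j];
                    C[i * n + j] = acc;
                }
    }
    ```

    On a GPU the same idea has to be layered across registers, LDS, wavefront scheduling and memory coalescing at once, which is where the extra difficulty comes from.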

    Yes that's fair, Intel has yet to prove it's going to make it as compelling as the theories say it should be.
     
  8. spworley

    spworley Newcomer

    Maybe not. A report this month shows that Tianhe-2's Phi accelerators are not often used.

    The reason quoted? First, operational cost. Second, "Xeon Phi hasn’t proven itself in ease of use when compared to pure CPU code or accelerated code through GPGPU accelerators such as the Nvidia Tesla or AMD FirePro S Series."
     
  9. Jawed

    Jawed Legend

    I suggest people read that article (it has the content of about 2 tweets) - it's 100% speculation.
     
  10. psurge

    psurge Regular

    That's good to know and pretty disappointing to hear - I would have expected Intel to at least ship a competitive OpenCL implementation given their lateness to the party.
     
  11. CarstenS

    CarstenS Legend Subscriber

    I wonder whether all this MIC/Knights-something/Phi stuff is just a doorstop, so that the HPC community doesn't forget about Intel while they slowly and steadily ramp up the vector units inside the regular x86 CPUs.
     
  12. It's more of a smoking gun.
    If the leading guru of HPC can't find out from anyone in China as to what if anything is being run on the Tianhe-2 that says a lot.
     
  13. nutball

    nutball Veteran Subscriber

    Or PR spin. Speaking of which...
     
  14. Blazkowicz

    Blazkowicz Legend

    I remember reading the same about Teslas, but that was simply a very generic comment. Scientists, not computer scientists, have to use these things; they're likely to simply hack up something in Python (w/ libraries), Matlab or R to run on their desktop or a single node.
    I'd wager code for a supercomputer or cluster is less commonly written, ditto code for GPUs, and GPUs on a supercomputer are a subset of both.
    A CPU-only supercomputer can be very useful on its own.

    So no matter what, I believe CPU-only code competes with Xeon Phi enabled code on the big Chinese machine.
    You might be able to run both concurrently, but network and I/O might be a severe limitation.
    At worst: the supercomputer causes some of the coal-fired smog build-up, so you might want to throttle it down sometimes :shock: :smile:
     
  15. Seems that you don't like the subject matter, so out comes the old PR-spin line.

    I am inclined to believe an expert in the field of HPC who was there from the beginning rather than someone who posts nothing to disprove the claim.
     
    Last edited: Apr 1, 2015
  16. The problem for the Tianhe-2 is that those Scientists have to pay for running their jobs on the Tianhe-2.
     
  17. nutball

    nutball Veteran Subscriber

    Seems you know nothing about HPC, nor the politics of HPC. I'm not really interested in educating you.
     
  18. aaronspink

    aaronspink Veteran

    Except he doesn't really say anything in that article except that their funding model is flawed.
     
  19. And yet another post without any facts to back up your claim that the Tianhe-2 is running near capacity, or running many jobs on the accelerator part of the machine.

    Unless you can provide proof of that then your claims of this being only politics and a PR spin fall into the old deny and spin category yourself.

    Why would I even be interested in your teachings if you won't even provide proof of your claims? Seems like I would gain nothing useful in those classes.
     
  20. Dr. Jack Dongarra's quotes:
     