Knights Landing details at RWT

The reason GPUs became interesting is that they offered 1-2 orders of magnitude greater compute density or performance per watt or per $ (and combinations thereof).

None of those things apply with Knights Landing in the wild. And it will run the code that everyone has been running, and happily support compute paradigms beyond crappy old MPI and OpenMP.

You can freely choose your desired mix of task- and data-parallel kernels within a single architecture, memory hierarchy, execution model, instruction set and clustering topology. It's a flat, sane landscape.

Yes, you're right: KNL runs the code you've already been running. However, it runs that code slower than HSW. Have you ever used a Xeon Phi?

Perhaps counter intuitively, there are far more applications in the wild that use CUDA or OpenCL than use KNL vector instructions effectively. KNL is several years too late to the GPU compute party, and it has the same drawbacks as GPUs: if you want improved compute density, you have to rewrite your code in a non-trivial way. However, because it's so late, it has to compete against the far more mature GPU compute ecosystem, with a significant performance/W and performance/$ disadvantage compared to either NVIDIA or AMD GPUs, but without a gaming market to justify Intel's significant R&D costs.

I do computationally dense simulations for a living. We buy large numbers of GPUs, and we couldn't get our work done without them. We write very little CUDA or OpenCL code - it's not necessary to get the job done, thanks to the libraries that already exist. I would buy Xeon Phi if it would make my applications run faster, but so far it's been a big disappointment. KNL looks better than KNC in many ways, but adoption will be slow, because the important libraries don't support KNL, and writing code to make Xeon Phi perform well is much more difficult than writing efficient GPU code. After years of broken promises (Larrabee! Knights Ferry! Knights Corner! Knights Landing!), Intel has a lot to prove with respect to Xeon Phi performance on things that matter.
 
KNL's use of 90GB/sec DDR4 seems laughable at first compared to even today's 300GB/sec GPUs, soon to be 1TB/sec. But the real use of this is for a large data store, not for performance. When you have 16GB of fast MCDRAM on the package, the off-chip DDR4 RAM is essentially used like host system memory is now... a slow but large main data staging and storage region. And for HPC, having that memory as dense DIMMs gives you flexibility and a very large maximum size (384 GB). So I almost think of it as replacing a 6GB/sec PCIE bus to system memory with a 90GB/sec connection instead. KNL is the system now.
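
To make that near/far split concrete: on KNL the 16GB of MCDRAM can be exposed as its own NUMA node and allocated explicitly, for example through the memkind library's hbwmalloc interface. Here's a minimal sketch of the idea (my own illustration, assuming memkind is installed; the 'hot' and 'staging' names are just placeholders):

Code:
#include <hbwmalloc.h>   /* memkind's high-bandwidth-memory API (assumed installed) */
#include <stdlib.h>
#include <stdio.h>

int main(void)
{
    size_t n = 1 << 26;   /* 64M floats, ~256 MB: small enough to live in 16 GB of MCDRAM */

    /* Large, rarely-touched staging data stays in ordinary DDR4. */
    float *staging = malloc(8UL * n * sizeof *staging);

    /* The bandwidth-critical working set goes to MCDRAM when the OS exposes it. */
    int have_hbm = (hbw_check_available() == 0);
    float *hot = have_hbm ? hbw_malloc(n * sizeof *hot)
                          : malloc(n * sizeof *hot);

    printf("MCDRAM available: %s\n", have_hbm ? "yes" : "no");

    /* ... stream the kernel out of 'hot', stage inputs/outputs in 'staging' ... */

    if (have_hbm) hbw_free(hot); else free(hot);
    free(staging);
    return 0;
}

(In flat mode you can get much the same effect without touching the code at all, via numactl --membind.)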

Pascal competes on this point to some extent when used with an IBM POWER8 or POWER9 CPU. No, I take that back: it's only with Volta and POWER9, Volta benefiting from NVLink 2.0. (This source helped me: http://www.hardware.fr/news/13987/ibm-power9-nvidia-volta-100-petaflops-2017.html )

Nvidia will literally use a connection with the same order of magnitude of bandwidth, with coherency, so it should be about as seamless, though likely slower and with higher latency.
 
I thought NVLink was supposed to arrive with Pascal (sans cache-coherency)?

Anyway, at work someone I know did try a Xeon Phi on a non trivial image processing code (without a rewrite), and the experience was exactly as spworley described: it worked, but was slower than on a regular Xeon. That said, it's not obvious to me why writing new code that efficiently makes use of the Xeon Phi vector units would be so much harder than writing efficient GPU code. Does Intel not provide a decent OpenCL implementation for Phi? Is Intel ISPC not viable for some reason?
 
I thought NVLink was supposed to arrive with Pascal (sans cache-coherency)?

I think you're right. According to the Hardware.fr link above, Pascal will use NVLink along with HBM, but Volta will use NVLink 2.0, which also supports fully coherent memory connectivity between CPU and GPU.

Still on the memory side: with Volta, each GPU can be equipped with a large amount of very high-performance memory thanks to HBM technology. There is no question, however, of debugging all of this while these supercomputers are being brought up; the technologies will have been proven beforehand, which is what Nvidia has planned. In 2016 the Pascal GPUs will be the first to carry NVLink, HBM memory and the new module format. That should leave everything ready for 2017 and the Volta GPUs, which will benefit from NVLink 2.0, whose main change will be the ability to support a fully coherent memory space between the CPU(s) and GPU(s). To take advantage of it, high bandwidth is required, and it can go up to 200 GB/s across all the NVLink links (5 links at 40 GB/s?). Enough to allow a thorough overhaul of supercomputer architecture.
 
That said, it's not obvious to me why writing new code that efficiently makes use of the Xeon Phi vector units would be so much harder than writing efficient GPU code. Does Intel not provide a decent OpenCL implementation for Phi?

Exactly, I don't understand why Xeon Phi cannot run OpenCL as well as any other GPU.

Is Intel ISPC not viable for some reason?

ISPC works well (at least in my field of interest), but who wants to rewrite their code yet again after all the effort of moving to OpenCL (and that was quite a pain, given the rather horrible, bug-ridden drivers provided by many vendors)?
 
Exactly, I don't understand why Xeon Phi cannot run OpenCL as well as any other GPU.



ISPC works well (at least in my field of interest), but who wants to rewrite their code yet again after all the effort of moving to OpenCL (and that was quite a pain, given the rather horrible, bug-ridden drivers provided by many vendors)?
Have you tried running OpenCL code on Xeon Phi? My experience has been that it requires float16 vectors in your code in order to perform well. Intel's OpenCL compiler for Xeon Phi did not vectorize across lanes, last time I used it, but instead required the user to manually vectorize their code (like when using intrinsics). Writing 16-way vector code is not fun. OpenCL could do what ISPC does and vectorize work items across lanes, but like I said, at the moment it doesn't.
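
For anyone who hasn't seen it, the difference looks roughly like this (a sketch of my own, not from Intel's samples; saxpy is just a stand-in kernel). On a GPU you write the scalar version and the compiler spreads work-items across lanes; on Xeon Phi's OpenCL you only got good performance if you wrote the float16 version yourself:

Code:
/* OpenCL C. Scalar per-work-item code: what GPU compilers vectorize for you. */
__kernel void saxpy_scalar(const float a,
                           __global const float *x,
                           __global float *y)
{
    size_t i = get_global_id(0);
    y[i] = a * x[i] + y[i];
}

/* Hand-vectorized float16 version: each work-item carries a full 512-bit lane,
   which is roughly what the Xeon Phi OpenCL compiler needed to perform well. */
__kernel void saxpy_vec16(const float a,
                          __global const float16 *x,
                          __global float16 *y)
{
    size_t i = get_global_id(0);
    y[i] = a * x[i] + y[i];
}

That's tolerable for saxpy, but doing it to a real kernel full of branches and gathers is exactly the "not fun" part.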

The problem with Xeon Phi is that Intel doesn't have a coherent programming model strategy for it. There are several competing visions:
1. Run normal x86 code, you just get more threads!
2. Recompile with ICC autovectorizing magic!
3. Rewrite your code using C extensions for Array Notation and Cilk++ & let ICC do the work!
4. Use OpenMP and extra pragmas for vectorization!
5. Use OpenCL with 16-wide vector types!
6. Write Xeon Phi intrinsics!

Of all these visions, #1 is the most alluring, but also the most disappointing. Performance is worse than traditional Xeon in this case.

#6 is the most practical, and everyone I know that seriously uses Xeon Phi and requires good performance does this. However, this is way more work than writing OpenCL code.
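
To illustrate the gap between #4 and #6, here is a sketch of my own (not from any vendor sample), assuming AVX-512F as on KNL, n a multiple of 16, and 64-byte-aligned pointers:

Code:
#include <immintrin.h>
#include <stddef.h>

/* Vision #4: hope the compiler vectorizes a plain loop via an OpenMP pragma. */
void saxpy_pragma(size_t n, float a, const float *x, float *y)
{
    #pragma omp simd
    for (size_t i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}

/* Vision #6: spell out the 16-wide vectors by hand with AVX-512 intrinsics. */
void saxpy_intrinsics(size_t n, float a, const float *x, float *y)
{
    __m512 va = _mm512_set1_ps(a);
    for (size_t i = 0; i < n; i += 16) {
        __m512 vx = _mm512_load_ps(x + i);
        __m512 vy = _mm512_load_ps(y + i);
        _mm512_store_ps(y + i, _mm512_fmadd_ps(va, vx, vy));
    }
}

For a toy loop like this the pragma version is fine; the pain starts once the loop has gathers, predication and cross-lane reductions, which is where people end up at #6.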

ISPC is great. I wish Intel would focus its programming model efforts on it. But because of Intel's culture, ISPC isn't taken seriously outside of the graphics group, and Intel has not committed serious resources to really push it forward. (I think this dysfunction is why Matt Pharr is at Google now.)

The Intel compiler team is always pushing 2, 3 & 4, but I haven't seen good performance on serious projects with any of this, and am quite skeptical.

OpenCL could work, but again: Intel hasn't devoted the resources to make a great OpenCL compiler for Xeon Phi. And I doubt it will, because the corporate culture at Intel is dominated by manufacturing, not programming models. All the OpenCL benchmarks I've seen for Xeon Phi have been quite disappointing, even though people have gone to great lengths to write OpenCL with 16-wide vector types to try to make it work.

Intel doesn't really want Xeon Phi to succeed. Just as they didn't really want Itanium to succeed. Traditional Xeon is king at Intel, and I believe will be the only thing left standing.
 
Yes, you're right: KNL runs the code you've already been running. However, it runs that code slower than HSW. Have you ever used a Xeon Phi?
No.

Perhaps counter intuitively, there are far more applications in the wild that use CUDA or OpenCL than use KNL vector instructions effectively.
Not sure how that's counter-intuitive, since MIC is still so new.

It's worth noting that the #1 machine in the Top 500 is Phi based. So someone's been using it.

KNL is several years too late to the GPU compute party, and it has the same drawbacks as GPUs: if you want improved compute density, you have to rewrite your code in a non-trivial way. However, because it's so late, it has to compete against the far more mature GPU compute ecosystem, with a significant performance/W and performance/$ disadvantage compared to either NVIDIA or AMD GPUs, but without a gaming market to justify Intel's significant R&D costs.
I wonder if there'll be a market for 300W gaming GPUs in 5 years?... 10 years, definitely not.

I do computationally dense simulations for a living. We buy large numbers of GPUs, and we couldn't get our work done without them. We write very little CUDA or OpenCL code - it's not necessary to get the job done, thanks to the libraries that already exist. I would buy Xeon Phi if it would make my applications run faster, but so far it's been a big disappointment. KNL looks better than KNC in many ways, but adoption will be slow, because the important libraries don't support KNL, and writing code to make Xeon Phi perform well is much more difficult than writing efficient GPU code.
I've spent a while today looking at what's out there in Phi land and what I see looks like 2003-2005 in GPGPU. Most people with results to report are "experimenting" with Phi. Some code gives precisely the performance expected (SGEMM, ooh what a surprise) and other stuff has gotchas in unexpected places (nothing to do with the compute architecture, for instance).

I think it's worth pointing out that people are still trying to figure out how to write fast SGEMM on GPUs (DGEMM, not so much) and they've had, oh I dunno, about 12 years of practice now. GPUs are actually more difficult to program than Phi, because there are more layers in the machine's architecture, each of which hides corner cases.

As an aside, I believe I've written the fastest GCN SGEMM, 3.1 TFLOPS at 1GHz on 7970 (sticking with OpenCL, though the temptation to patch the binary is strong). I've been talking about SGEMM for most of a decade, because it's not easy and GPUs have made it particularly hard:

NVIDIA GT200 Rumours & Speculation Thread

Whereas by all appearances it's pretty simple on Phi to get the expected performance. (In b4 the smart alec says, "well that's all Phi's good for".)

After years of broken promises (Larrabee! Knights Ferry! Knights Corner! Knights Landing!), Intel has a lot to prove with respect to Xeon Phi performance on things that matter.
Yes, that's fair; Intel has yet to prove it's going to make it as compelling as the theories say it should be.
 
It's worth noting that the #1 machine in the Top 500 is Phi based. So someone's been using it.
Maybe not. A report this month shows that Tianhe-2's Phi accelerators are not often used.

The reason quoted? First, operational cost. Second, "Xeon Phi hasn’t proven itself in ease of use when compared to pure CPU code or accelerated code through GPGPU accelerators such as the Nvidia Tesla or AMD FirePro S Series."
 
I suggest people read that article (it has the content of about 2 tweets) - it's 100% speculation.
 
That's good to know and pretty disappointing to hear - I would have expected Intel to at least ship a competitive OpenCL implementation given their lateness to the party.
 
I wonder whether all this MIC / Knights-something / Phi stuff is just a doorstop so that the HPC community does not forget about Intel while they slowly and steadily ramp up the vector units inside the regular x86 CPUs.
 
I suggest people read that article (it has the content of about 2 tweets) - it's 100% speculation.

It's more of a smoking gun.
But in an exclusive interview with VR World, Dr. Jack Dongarra of Oak Ridge National Laboratory and the University of Tennessee, said that China’s HPC stature may be something of a facade. Tianhe-2, while definitely the world’s fastest supercomputer, is somewhat idle and is not being used to its full capacity.

“The real question is: what are they going to use the machine for. I question, at some level, what the Chinese are doing with these big machines,” Dongarra said. “They are not using the accelerator part of the machine.” [48,000 Intel Xeon Phi 31S1P accelerator cards]

“I go visit the computing facilities [in China] – and I’m not saying that they are being used for things that are secret – I’m saying that I don’t know what they are being used for,” he continued.

If the leading guru of HPC can't find out from anyone in China what, if anything, is being run on Tianhe-2, that says a lot.
Dongarra explained that part of the reason why Tianhe-2 is more idle than other top supercomputers is because of the funding model China’s government provides. The government paid for the costs to develop and construct the machine, but not for its operational costs which is not the norm in the scientific computing community.

The additional difficulty might be the machine setup China decided to go with. Intel’s (NASDAQ: INTC) Xeon Phi hasn’t proven itself in ease of use when compared to pure CPU code or accelerated code through GPGPU accelerators such as the Nvidia (NASDAQ: NVDA) Tesla or AMD (NASDAQ: AMD) FirePro S Series.

“They have to come up with some mechanism to pay for it,” Dongarra said. “In scientific computing we don’t pay for computing time. It’s not in the culture of how we do business. A situation where people have to pay for computing time limits the computing time being used.”
 
Maybe not. A report this month shows that Tianhe-2's Phi accelerators are not often used.

I remember reading the same about Teslas, but that was simply a very generic comment. Scientists, not computer scientists, have to use these things; they're likely to simply hack up something in Python (w/ libraries), Matlab or R to run on their desktop or a single node.
I'd wager code for a supercomputer or cluster is less commonly written, ditto code for GPUs, and GPUs on a supercomputer are a subset of both.
A CPU-only supercomputer can be very useful on its own.

So no matter what, I believe CPU-only code competes with Xeon Phi-enabled code on the big Chinese machine.
You might be able to run both concurrently, but network and I/O might be a severe limitation.
At worst: the supercomputer causes some of the coal-fired smog build-up, so you might want to throttle it down sometimes :oops: :smile:
 
Or PR spin. Speaking of which...

Seems that you don't like the subject matter so out comes the old PR spin line.

I am more inclined to believe an expert in the field of HPC who was there from the beginning than someone who posts nothing to disprove the claim.
Jack Dongarra has been involved since the origin and formation of the TOP500 list in 1993, which used his Linpack benchmark as the common application for evaluating the performance of supercomputers. Through the consistent use of the Linpack benchmark, the TOP500 list provides a standardized measure of supercomputers over the past 20 years. Dongarra holds an appointment at the University of Tennessee, Oak Ridge National Laboratory, and the University of Manchester. He specializes in numerical algorithms in linear algebra, parallel computing, use of advanced-computer architectures, programming methodology, and tools for parallel computers.

In addition to Linpack and the TOP500, Dongarra has contributed to the design and implementation of the following open source software packages and systems: EISPACK, the BLAS, LAPACK, ScaLAPACK, Netlib, PVM, MPI, Open-MPI, NetSolve, ATLAS, PAPI, PLASMA, and MAGMA. He has published approximately 200 articles, papers, reports and technical memoranda and he is coauthor of several books. He was awarded the IEEE Sid Fernbach Award in 2004 for his contributions in the application of high performance computers using innovative approaches; in 2008 he was the recipient of the first IEEE Medal of Excellence in Scalable Computing; in 2010 he was the first recipient of the SIAM Special Interest Group on Supercomputing's award for Career Achievement; and in 2011 he was the recipient of the IEEE IPDPS 2011 Charles Babbage Award. He is a Fellow of the AAAS, ACM, IEEE, and SIAM and a member of the National Academy of Engineering. Dongarra has been an ISC Fellow since 2012.

Dongarra received a Bachelor of Science in Mathematics from Chicago State University in 1972 and a Master of Science in Computer Science from the Illinois Institute of Technology in 1973. He received his Ph.D. in Applied Mathematics from the University of New Mexico in 1980. He worked at the Argonne National Laboratory until 1989, becoming a senior scientist. He is the director of the Innovative Computing Laboratory at the University of Tennessee.

http://www.top500.org/project/authors/jack-dongarra
 
Scientists, not computer scientists, have to use these things
The problem for Tianhe-2 is that those scientists have to pay to run their jobs on it.
Dongarra explained that part of the reason why Tianhe-2 is more idle than other top supercomputers is because of the funding model China’s government provides. The government paid for the costs to develop and construct the machine, but not for its operational costs which is not the norm in the scientific computing community.

“They have to come up with some mechanism to pay for it,” Dongarra said. “In scientific computing we don’t pay for computing time. It’s not in the culture of how we do business. A situation where people have to pay for computing time limits the computing time being used.”
 
Seems that you don't like the subject matter so out comes the old PR spin line.

I am inclined to believe an expert in the field of HPC that was there from the beginning than someone who posts nothing to disprove the claim.

Except he doesn't really say anything in that article except that their funding model is flawed.
 
Seems you know nothing about HPC, nor the politics of HPC.
And yet another post without any facts to back up your claim that the Tianhe-2 is running near capacity, or running many jobs on the accelerator part of the machine.

Unless you can provide proof of that, your claims that this is only politics and PR spin fall into the old deny-and-spin category yourself.

I'm not really interested in educating you.
Why would I even be interested in your teachings if you won't even provide proof of your claims? Seems like I would gain nothing useful in those classes.
 
Except he doesn't really say anything in that article except that their funding model is flawed.
Dr. Jack Dongarra's quotes:
Dr. Jack Dongarra of Oak Ridge National Laboratory and the University of Tennessee, said that China’s HPC stature may be something of a facade. Tianhe-2, while definitely the world’s fastest supercomputer, is somewhat idle and is not being used to its full capacity.

“The real question is: what are they going to use the machine for. I question, at some level, what the Chinese are doing with these big machines,” Dongarra said. “They are not using the accelerator part of the machine.” [48,000 Intel Xeon Phi 31S1P accelerator cards]

http://www.vrworld.com/2015/03/22/jack-dongarra-china-isnt-the-emerging-hpc-power-you-think-it-is/
 