Knights Landing details at RWT

I wonder whether all this MIC/Knights-whatever/Phi stuff is just a doorstop so that the HPC community does not forget about Intel while they slowly and steadily ramp up the vector units inside the regular x86 CPUs.
Or, what about a big little configuration? :mrgreen:
 
And yet another post without any facts to back up your claim that the Tianhe-2 is running near capacity, or running many jobs on the accelerator part of the machine.

You are arguing past the point, but that's hardly surprising. International HPC willy-waving is just a little bit different from NVIDIA Uber Alles posing on internet forums.
 
Dr. Jack Dongarra's quotes:

And all he's saying is that the funding model for the machine means it's underused. As far as what's running on the machine, I'm pretty sure Jack isn't going to tell China details about what's running on the ***CLASSIFIED*** US machines either...

And having actually worked designing components of these systems, even the vendors don't really know what's running on them. You'll never get actual source code or data. In the case where they need to bring in a vendor to debug something, they try to replicate the issue without using actual code or data. In the rare case that they have to show actual code or data, the people shown have to go through clearance and sign NDAs. And that's for supportive vendors designing the actual hardware and software infrastructure, not a foreign national who works directly for a foreign government as part of its military-scientific complex!
 
http://anandtech.com/show/9151/inte...contracts-for-2-dept-of-energy-supercomputers

The flagship of these two computers is Aurora, a next-generation Cray “Shasta” supercomputer that is scheduled for delivery in 2018. Designed to deliver 180 PetaFLOPS of peak compute performance, Aurora will be heavily leveraging Intel’s suite of HPC technologies.
[...]
According to Intel this is the first time in nearly two decades that they have been awarded the prime contractor role in a supercomputer, their last venture being ASCI Red in 1996.
 
Or, what about a big little configuration? :mrgreen:
Isn't traditional x86 + a large vector unit already some kind of big little without the transparency thing? ;)
Not in the sense big.LITTLE is used in the ARM world, of course, but still. You have legacy cores for day-to-day tasks and specialized units for the heavy lifting, only those require explicit application support.
 
I wasn't positing big and little "cores" or instruction types so much as "complexes" for want of a better word, e.g. 8 cores of conventional 3 GHz+ x86 + 64 cores of MIC on a single die.
 
It's almost common knowledge among HPC people that the first gen of MIC sucks. Sure, it is easier than a GPU for writing some "working" code, but when it comes to performance, which is the whole purpose of an accelerator, it is MUCH harder to write optimized code there, and yes, even Intel itself cannot. MIC's silicon design is also a failure: it may only be comparable to a GPU at GEMM, and it is much worse at almost all other tasks. Actually, for most tasks it is barely better than a standard Xeon.

I have had some projects with the Guangzhou supercomputer center (where Tianhe-2 is located), and as far as I know, few serious people use Xeon Phi in their applications there.

Intel enjoys a better process (22nm vs 28nm) and doesn't need to waste any silicon on graphics stuff, yet they fail to deliver a competitive product.
 
Intel enjoys a better process (22nm vs 28nm) and doesn't need to waste any silicon on graphics stuff, yet they fail to deliver a competitive product.
Yes, but that's the old Phi; this thread's about the new one, which is largely an unknown entity at the moment. It's like saying the Geforce FX was terrible, so the next one must be too. ;)
 
It's almost common knowledge among HPC people that the first gen of MIC sucks. Sure, it is easier than a GPU for writing some "working" code, but when it comes to performance, which is the whole purpose of an accelerator, it is MUCH harder to write optimized code there, and yes, even Intel itself cannot. MIC's silicon design is also a failure: it may only be comparable to a GPU at GEMM, and it is much worse at almost all other tasks. Actually, for most tasks it is barely better than a standard Xeon.

I have had some projects with the Guangzhou supercomputer center (where Tianhe-2 is located), and as far as I know, few serious people use Xeon Phi in their applications there.

Intel enjoys a better process (22nm vs 28nm) and doesn't need to waste any silicon on graphics stuff, yet they fail to deliver a competitive product.

It's also almost common knowledge among HPC people that GPGPU sucks too, so there's that. For most tasks, current-generation GPUs and MICs are generally no better than a standard Xeon. This shouldn't be too shocking, since both GPUs and MICs currently have lots of hardware and software issues. KNL fixes many of MIC's problems, the major one being the lack of available/usable memory bandwidth, with actual delivered STREAM bandwidths of 400+ GB/s. It also solves the capacity and communication issues. In addition, it provides viable single-thread performance, which is something KNC sorely lacked.

As far as what's actually run on Tianhe-2, it's highly doubtful any of us really know, and highly unlikely that anyone who really knows can or should actually talk about it.
 
Speaking of Xeon: does anyone have insight into, or an educated guess at, how many real-world problems and/or algorithms out there actually utilize recent instruction set updates like AVX(2)? I am wondering whether this is really used or just employed for the sake of the Linpack rating.
 
My uneducated guess is that libraries like MKL, the vectorizers for various compilers, the Intel OpenCL driver, and so on all get mileage out of them - linpack rating alone doesn't seem like enough of a marketing win to justify introducing a new ISA extension every time somebody sneezes.
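
To make the library route concrete, here is a minimal sketch of what "getting mileage out of them" can look like without writing any vector code yourself: a plain CBLAS-style DGEMM call, which MKL dispatches internally to SSE/AVX/AVX2 kernels depending on the CPU. Treat it as illustrative rather than a tuned example; the matrix size and values are arbitrary.

Code:
// Illustrative only: the application never touches AVX directly; MKL's
// runtime dispatch picks the SSE/AVX/AVX2 code path based on the CPU.
#include <mkl.h>      // provides cblas_dgemm when building against MKL
#include <vector>

int main()
{
    const int n = 512;
    std::vector<double> A(n * n, 1.0), B(n * n, 2.0), C(n * n, 0.0);

    // C = 1.0 * A * B + 0.0 * C, row-major, no transposes
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                n, n, n,
                1.0, A.data(), n,
                B.data(), n,
                0.0, C.data(), n);
    return 0;
}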
 
Autovectorization is still a crapshoot, and the Intel OpenCL driver sees very little use. In HPC, if you want any kind of real use out of the vector extensions, you either use intrinsics in C++ or straight up write assembly. There are workloads where this is worth it -- for example, when working with DNA and proteins, >20x speedups are possible and there is enough computational load that spending programmer time to save machine time can save money.

A simplified way to put it is that most users get nothing out of large vectors, but the small subset that does benefit from them really likes them.
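
As a rough sketch of the "intrinsics in C++" route (not code from any project mentioned here, and the function name is made up): counting positions where two byte sequences match, the kind of comparison that shows up in DNA/protein matching. A scalar loop handles one byte per iteration; this handles 32, which is where the headroom for those large speedups comes from on the right workload.

Code:
#include <immintrin.h>
#include <cstddef>
#include <cstdint>

// Assumes n is a multiple of 32; a real kernel would also handle the tail.
std::size_t count_matches_avx2(const char* a, const char* b, std::size_t n)
{
    std::size_t matches = 0;
    for (std::size_t i = 0; i < n; i += 32) {
        __m256i va = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(a + i));
        __m256i vb = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(b + i));
        __m256i eq = _mm256_cmpeq_epi8(va, vb);                  // 0xFF where bytes match
        std::uint32_t mask = static_cast<std::uint32_t>(_mm256_movemask_epi8(eq));
        matches += _mm_popcnt_u32(mask);                         // one bit per matching byte
    }
    return matches;
}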
 
how many real-world problems and/or algorithms out there actually utilize recent instruction set updates like AVX(2)?
Grid 2 uses AVX2 for order-independent transparency.

edit: sorry, Grid 2 uses Haswell-exclusive features, not AVX2.
 
Speaking of Xeon: does anyone have insight into, or an educated guess at, how many real-world problems and/or algorithms out there actually utilize recent instruction set updates like AVX(2)? I am wondering whether this is really used or just employed for the sake of the Linpack rating.

I believe x264 uses AVX2/AVX where possible, with at least some degree of measured success.

But I'm not sure I understand the question. Any workload that can make use of (wider) vectors would benefit. You could ask the same question about SSE, really. AVX(2) mostly just extends the vector length to 256-bit, adds support for three operands, and I think added a low-performing gather. Skylake will extend the vector length again to 512-bit and (I think) add a high-performing gather/scatter (and I'm sure other things). You're right that there are a lot of workloads that can't be vectorized (or perhaps not vectorized to the point where >128-bit vectors start becoming a win), but I don't think Intel adds these ISA extensions for the lulz/marketing. :D
 
AVX(2) mostly just extends the vector length to 256-bit, adds support for three operands, and I think added a low-performing gather
There is nothing "low-performing" about the AVX2 gather. It can do everything you'd wish, including using a mask (and best of all, masked elements aren't looked up, meaning you don't need to hack in valid offsets or anything for such elements). The particular implementation in Haswell may not be all that fast (though I forget how fast it was exactly), but that's entirely up to the implementation.
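
For reference, the masked form being described is the _mm256_mask_i32gather_ps intrinsic; a minimal, hypothetical usage sketch (the function and variable names are mine, not from any real code):

Code:
#include <immintrin.h>

// Lanes whose sign bit in 'mask' is clear are not loaded from memory at all;
// they simply keep the corresponding value from 'src', so inactive lanes
// don't need valid offsets patched in.
__m256 gather_active_lanes(const float* table, __m256i indices, __m256 mask)
{
    const __m256 src = _mm256_setzero_ps();   // value kept for masked-off lanes
    return _mm256_mask_i32gather_ps(src, table, indices, mask, 4 /* scale: 4-byte floats */);
}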
 
There is nothing "low-performing" about the AVX2 gather. It can do everything you'd wish, including using a mask (and best of all, masked elements aren't looked up, meaning you don't need to hack in valid offsets or anything for such elements). The particular implementation in Haswell may not be all that fast (though I forget how fast it was exactly), but that's entirely up to the implementation.

Ah, I believe you are right. The implementation on Haswell is not any faster than loading the elements in a "scalar fashion", if I remember Agner correctly. My poor memory attributed this to the ISA and not to Haswell. But looking at the instructions again, it would seem that I was wrong! Thanks for the insight.
 