Knights Landing Speculation

I look forward to seeing the updates that start giving the contours of the speculated design. I'm curious just how custom the core can be, or which other core line it will most resemble.

Is there anything we can infer from the AVX3 extensions that lie outside the foundation instructions, such as whether Skywell will support them?
If there is a disparity there, would that tell us something about the differences between the cores?
In the other direction, is it known whether Knights Landing will omit other x86 extensions, keeping it out of lockstep with the main lines?

One point of clarification: while it is the case that the pre-Knights Landing chips aren't bootable, is that really the core's fault, or something about the design as a whole?
It's not like the P54 couldn't boot.

There were details mentioned about the LLC and ring bus for Xeon Phi, and their similarity to the big core ring bus. Has the cache protocol been disclosed for Larrabee or others in that line? I've just gone with the assumption they probably didn't have time or justification to put MESIF into the P54 cache subsystem.


One thing that I have been wondering about is whether there is something significant about how the caches in Larrabee and Knights Landing are indexed versus something like GCN. GCN's caches are described as virtually addressed (the Onion bus excepted?), and I am going with the assumption that the TLB still figures into the Phi memory pipeline.
 
I don't really see how they plan to more than double performance while using AVX3 unless they either double core count or double clock rates. I think that lends credibility to the 72-core custom-Atom rumor; a few more cores clocked much higher seems more doable (and desirable) than putting 120+ cores on a chip.
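A back-of-envelope sketch of that reasoning, with all numbers being illustrative assumptions rather than anything confirmed: a KNC-class part modeled as 61 cores at ~1.1 GHz with one 8-wide DP FMA per cycle, and a hypothetical 72-core part at 1.4 GHz keeping the same 512-bit width.

Code:
#include <cstdio>

// Peak DP GFLOPS = cores * GHz * (double lanes per vector) * 2 (FMA) * FMA units per core.
static double peak_dp_gflops(int cores, double ghz, int dp_lanes, int fma_units) {
    return cores * ghz * dp_lanes * 2.0 * fma_units;
}

int main() {
    // Roughly KNC-class: 61 cores, ~1.1 GHz, one 512-bit (8 x double) FMA per core.
    std::printf("KNC-ish            : %.0f GFLOPS\n", peak_dp_gflops(61, 1.1, 8, 1));
    // Hypothetical AVX3 part at the same vector width: only more cores/clocks
    // (or a second FMA unit per core) can more than double the peak.
    std::printf("72c @ 1.4 GHz      : %.0f GFLOPS\n", peak_dp_gflops(72, 1.4, 8, 1));
    std::printf("72c @ 1.4 GHz, 2xFMA: %.0f GFLOPS\n", peak_dp_gflops(72, 1.4, 8, 2));
    return 0;
}

Under those assumptions, keeping the 512-bit width and only nudging cores and clocks gets you roughly 1.5x; something has to double, whether it is cores, clocks, or FMA units per core.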
 
Very much reminds me of Platform 2015 from years ago.

[Image: evolution.jpg]


[Image: platform2015.jpg]



Also:

Intel Platform 2015 PDF
 
Interesting that they claim that the socketed variants will come earlier than the PCIe variants. Late 2015 for the latter could mean that they'll be up against GPUs two generations from now (16FF?).
 
Nice reading.
However, a quick comment on this point:
Second, while Knights Landing can act as a bootable CPU, many applications will demand greater single threaded performance due to Amdahl’s Law. For these workloads, the optimal configuration is a Knights Landing (which provides high throughput) coupled to a mainstream Xeon server (which provides single threaded performance). In this scenario, latency is critical for communicating results between the Xeon and Knights Landing. This is also a huge competitive advantage over Nvidia GPUs, which do not have QPI and therefore must rely on PCI-E for connecting to the host Xeon processor.
In the KNL timeframe it will face 20nm Maxwell and/or 16nm FinFET Volta, and they will both pair a bunch of high-performance ARMv8 cores with Nvidia's idea of HSA. One NV guru also said in an interview that their goal is to get away from main-CPU dependence in the HPC space, so everyone should expect a bootable Maxwell/Volta...
 
Of course it will face 20nm Maxwell, but even then it will be hard for Nvidia to match KNL. They need a doubling or more in DP to match KNL's DP performance. That is not easy, since Kepler to Maxwell is most likely a much smaller architectural jump than Fermi to Kepler was. Same for the process shrink. Also, KNL's TDP decreased to 160-215W according to the leaks. The eDRAM usage in KNL should give Intel a huge latency advantage over Maxwell. This is a very tough opponent for Maxwell in 2015.
 
Paran said:
That is not easy, since Kepler to Maxwell is most likely a much smaller architectural jump than Fermi to Kepler was.
What is this statement based upon, if you don't mind me asking?
 
Of course it will face 20nm Maxwell, but even then it will be hard for Nvidia to match KNL. They need a doubling or more in DP to match KNL's DP performance. That is not easy, since Kepler to Maxwell is most likely a much smaller architectural jump than Fermi to Kepler was. Same for the process shrink. Also, KNL's TDP decreased to 160-215W according to the leaks. The eDRAM usage in KNL should give Intel a huge latency advantage over Maxwell. This is a very tough opponent for Maxwell in 2015.
Well, I will take a pragmatic, "wait and see" approach when it comes to Intel's numbers in the HPC space. From many interviews with people in the know, the current Xeon Phi's real-world HPC efficiency is way below Kepler's.
So even if KNL's theoretical DP performance claims are very high, let's wait and see what it can do in the field before announcing a winner...

Edit: first Google hit for "xeon phi vs tesla": http://blog.xcelerit.com/intel-xeon-phi-vs-nvidia-tesla-gpu/
 
Of course it will face 20nm Maxwell, but even then it will be hard for Nvidia to match KNL. They need a doubling or more in DP to match KNL's DP performance. That is not easy, since Kepler to Maxwell is most likely a much smaller architectural jump than Fermi to Kepler was. Same for the process shrink. Also, KNL's TDP decreased to 160-215W according to the leaks. The eDRAM usage in KNL should give Intel a huge latency advantage over Maxwell. This is a very tough opponent for Maxwell in 2015.

I bet it can barely match the performance that a 20nm Maxwell can deliver.

Just ignore the raw GFLOPS nonsense. Besides a few LAPACK routines where the two were comparable, for the more general and more flexible tasks my co-workers and I have tested so far, MIC is roughly half the performance of a K20X, despite the roughly equal rated GFLOPS of the two products.

And furthermore, despite the marketing nonsense, it actually takes more effort to optimize code on MIC than on the GPU, and the code is less portable.

The main problems with MIC are:

1. SIMD vs. SIMT: the programming interface of CUDA is thread-level, whilst for MIC it is a very wide vector level, which means you need to write assembly-like vector-op code that could end up being less portable to future generations of MIC. And of course you have to forget template-like constructs for high productivity, and more or less stick with C and assembly-like programming styles (see the sketch at the end of this post).

2. The GPU has a programmable L1 cache and large register files that many general tasks can take advantage of, whilst MIC has no such features, nor does Intel intend to implement them in future versions, according to Intel technical staff I have met.

3. It seems to me that Nvidia's hardware can simply handle/manage massively parallel workloads better and can utilize its memory bandwidth better, for whatever reason.

If the rumored Maxwell specs are true, I don't think Nvidia needs to worry about Intel's MICs in the near future.
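To make point 1 concrete, here is a minimal sketch of the contrast, using AVX-512-style _mm512_* intrinsics as a stand-in for the MIC vector ISA; the kernel and function names are made up for illustration. The thread-level version is the scalar body you would hand to a single CUDA thread, while the hand-vectorized version has to spell out the loads, masks, and blends itself.

Code:
#include <immintrin.h>
#include <cstddef>

// Thread-level style: the scalar body you would write once per element
// (roughly what a single CUDA thread executes).
static inline float kernel_scalar(float x) {
    return (x > 0.0f) ? x * 2.0f : x + 1.0f;
}

// Hand-vectorized style: the same kernel written against 512-bit vectors,
// with the branch turned into an explicit mask + blend.
void kernel_vectorized(const float* in, float* out, std::size_t n) {
    std::size_t i = 0;
    for (; i + 16 <= n; i += 16) {
        __m512    x  = _mm512_loadu_ps(in + i);
        __mmask16 gt = _mm512_cmp_ps_mask(x, _mm512_setzero_ps(), _CMP_GT_OS);
        __m512    a  = _mm512_mul_ps(x, _mm512_set1_ps(2.0f));  // "then" path
        __m512    b  = _mm512_add_ps(x, _mm512_set1_ps(1.0f));  // "else" path
        _mm512_storeu_ps(out + i, _mm512_mask_blend_ps(gt, b, a));
    }
    for (; i < n; ++i)                                           // scalar remainder
        out[i] = kernel_scalar(in[i]);
}

Every branch in the scalar body becomes another mask to carry by hand, which is where the assembly-like feel, the productivity hit, and the portability worry all come from.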
 
I'm wondering how something like AMD's Berlin APU (server APU) would compare against Knights Landing if they added 4P (4 sockets, via HyperTransport?) and a 1/2 DP rate to the GPU.
 
I'm wondering how something like AMD's Berlin APU (server APU) would compare against Knights Landing if they added 4P (4 sockets, via HyperTransport?) and a 1/2 DP rate to the GPU.

Probably not too well, but then again Toronto should be out by then, presumably on 20nm.
 
I bet it can barely match the performance that a 20nm Maxwell can deliver.

Just ignore the raw GFLOPS nonsense. Besides a few LAPACK routines where the two were comparable, for the more general and more flexible tasks my co-workers and I have tested so far, MIC is roughly half the performance of a K20X, despite the roughly equal rated GFLOPS of the two products.

And furthermore, despite the marketing nonsense, it actually takes more effort to optimize code on MIC than on the GPU, and the code is less portable.

The main problems with MIC are:

1. SIMD vs. SIMT: the programming interface of CUDA is thread-level, whilst for MIC it is a very wide vector level, which means you need to write assembly-like vector-op code that could end up being less portable to future generations of MIC. And of course you have to forget template-like constructs for high productivity, and more or less stick with C and assembly-like programming styles.

2. The GPU has a programmable L1 cache and large register files that many general tasks can take advantage of, whilst MIC has no such features, nor does Intel intend to implement them in future versions, according to Intel technical staff I have met.

3. It seems to me that Nvidia's hardware can simply handle/manage massively parallel workloads better and can utilize its memory bandwidth better, for whatever reason.

If the rumored Maxwell specs are true, I don't think Nvidia needs to worry about Intel's MICs in the near future.
Do you use Cilk for MIC? Intrinsics? Something else?
 
Do you use Cilk for MIC? Intrinsics? Something else?
For the love of god do yourself a favor and use ISPC for SIMD :) Then you can use Cilk above it for work stealing or whatever solution you want. Super-productive and powerful environment, speaking from experience.
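For what it's worth, a minimal sketch of that split, assuming ISPC's usual C-callable export mechanism and the Intel compiler's cilk_for; the saxpy kernel, the generated header name, and the block size are all hypothetical.

Code:
#include <cilk/cilk.h>   // cilk_for (Cilk Plus: icc or the cilkplus GCC branch)
#include <algorithm>
#include <cstddef>
#include <cstdint>

// Hypothetical header generated by "ispc saxpy.ispc -o saxpy_ispc.o -h saxpy_ispc.h"
// for an exported SPMD kernel:
//   export void saxpy(uniform float a, uniform float x[],
//                     uniform float y[], uniform int32 n);
#include "saxpy_ispc.h"

// Cilk does the coarse-grained work stealing across blocks;
// ISPC fills the SIMD lanes inside each block.
void saxpy_parallel(float a, float* x, float* y, std::size_t n) {
    const std::size_t block   = 16 * 1024;
    const std::size_t nblocks = (n + block - 1) / block;
    cilk_for (std::size_t b = 0; b < nblocks; ++b) {
        const std::size_t begin = b * block;
        const int32_t     len   = static_cast<int32_t>(std::min(block, n - begin));
        ispc::saxpy(a, x + begin, y + begin, len);
    }
}

The nice property is that the SIMD width lives entirely on the ISPC side, so the C++ caller never touches intrinsics.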
 
I tried Cilk Plus before, but it seems the compiler cannot handle vectorization very efficiently when the functions being mapped are more complicated, containing loops and branches (compared to CUDA), so I stick with hand-tuned vectorization within the function body instead.

As for intrinsics, well, I have used them extensively, but they look like assembly, especially if you use them a lot in your code.
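For reference, the annotated-loop style being described above looks roughly like this (a hypothetical example; #pragma simd is the Intel compiler's Cilk Plus vectorization pragma). The pragma asserts the iterations are independent, but once the body grows inner loops and data-dependent branches the generated code is often disappointing, which is what pushes people back to intrinsics.

Code:
// Hypothetical annotated-loop version of a mapped function: the pragma asserts
// the iterations are independent and asks the compiler to vectorize the loop.
void map_kernel(const float* in, float* out, int n) {
#pragma simd                              // Cilk Plus vectorization pragma (icc)
    for (int i = 0; i < n; ++i) {
        float x   = in[i];
        float acc = 0.0f;
        for (int k = 0; k < 4; ++k)          // inner loop the vectorizer must handle
            acc += (x > k) ? x * 0.5f : -x;  // data-dependent branch per lane
        out[i] = acc;
    }
}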
 
For the love of god do yourself a favor and use ISPC for SIMD :) Then you can use Cilk above it for work stealing or whatever solution you want. Super-productive and powerful environment, speaking from experience.

I tried Cilk Plus before, but it seems the compiler cannot handle vectorization very efficiently when the functions being mapped are more complicated, containing loops and branches (compared to CUDA), so I stick with hand-tuned vectorization within the function body instead.

As for intrinsics, well, I have used them extensively, but they look like assembly, especially if you use them a lot in your code.

Sounds like only the SPMD model works for vectorization.

It's a shame ISO C++ is going the annotated-loop way. Sigh...
 
Sounds like only the SPMD model works for vectorization.
It's a shame ISO C++ is going the annotated-loop way. Sigh...
Yeah, ISPC is clearly better, and it's better than the GPU computing languages too. If the whole persistent-threads line of research has shown us anything, it's that virtualizing (vs. parameterizing) the SIMD width is far too harmful to the performance of non-trivial kernels.

Unfortunately, HPC and other parties have too much sway over the standards, and they are completely in the land of non-coding physicists who just want to trust compiler magic to get them a 2x speedup even on 8+-wide SIMD... and again, I speak from experience on that, as someone who has rewritten a lot of scientist code :)
 