Knights Landing details at RWT

A little off topic but David I have a question on your article coverages.

In the past your articles covered more vendors (Apple, ARM, AMD, Intel, Nvidia), but over the last 18 months you have had 7 Intel articles, 2 ARM articles, and one Apple article.

Is there a reason for the very heavy coverage on Intel and the now non-existent coverage of those other vendors?
 
Have the other vendors done anything worthy of being covered from an architectural level? Have other vendors provided any sort of base material?
 
Being open about architectural detail isn't particularly common.
A number of those other vendors are known to be tight-lipped.

As one example, we've had Anandtech trying to tease out something as basic as the issue width of Apple's Cyclone processor with their own code loops.
I've seen complaints about the lack of feedback from ARM as well.

x86, and for a time POWER, had a much broader level of exposure.
Between paywalling and an increasing share of custom platform designs, IBM and AMD have backtracked significantly.
The other question mark is that, with generations of rebranding and Orwellian PR-speak under their belts, some vendors might worry that disclosing much at this point would just show how much of the external messaging is decoupled from the physical reality.

That leaves Intel as a company whose broad range of products has the unusual property of frequently being new, and one that derives commercial utility from educating its customers. Granted, its architectural disclosures for general consumption are not as forthcoming as they once were.

I'm not sure if it's just the optics, but it looks like the increased exposure to the mobile market, and its lack of transparency and/or honesty, has led to a regression toward a much less open mean.
 
My words are my own. I speak for no one else. Who do you speak for?

If you don't want others participating then don't ask questions in an open discussion forum.
 
Have the other vendors done anything worthy of being covered from an architectural level? Have other vendors provided any sort of base material?

18 months?
nVidia's GK110, Tegra K1; some of AMD's GCN family and FM2 APUs, most recently Hawaii, Bonaire and Kaveri. Both consoles sporting HSA APUs.

Just off the top of my head.

Disclaimer: I'm not saying what should or should not be in RWT. I'm just answering your questions.
If dkanter has any reasons not to cover other vendors, I'm pretty sure it's not for the lack of base material.
He's one of the few writing about Intel's not-exclusively-CPU solutions, which tend to pass by unnoticed in the media, and that's reason enough for me.
 
My words are my own. I speak for no one else. Who do you speak for?

If you don't want others participating then don't ask questions in an open discussion forum.

I speak for myself, and my question was directed to David. You, however, seem to want to speak for others.
 
I'm curious about the prediction of a 144MB L3 on-die, and the additional prediction of QPI.

How would this L3 be distributed on-die?
Intel went through a fair amount of work creating the distributed ring bus protocol and hashed distribution of addresses across L3 slices to minimize traffic hot spots in a comparatively constrained topology. For smaller core counts, there is a unidirectional ring bus, with clients able to add traffic at known cycles. Higher core counts can up the number of directions to two.

How does the fabric play into this, with the additional dimension for routing and an unknown sequencing of request cycles? If the L3 is still sliced, at what granularity and how many directions across how many cycles can contend with any particular slice?

If QPI is now the interconnect, earlier RWT articles tied its introduction to MESIF. Does this mean KNL is using MESIF? How is that going to play with the new fabric?
I've seen discussions about earlier MIC architectures leveraging some kind of L2 directory or filter structure, but the KNL article indicates the L3 is the new filter.
Does a line in the F state need to wander around a 2D fabric with an unknown number of requesters and intermediaries when there is an L3/directory close at hand? Perhaps the tiles are MESI (shared L2 means S can still forward between tile cores), with a forwarding state at the L3 level?
 
18 months?
nVidia's GK110, Tegra K1; some of AMD's GCN family and FM2 APUs, most recently Hawaii, Bonaire and Kaveri. Both consoles sporting HSA APUs.

Just from the top of my head.

Disclaimer: I'm not saying what should or should not be in RWT. I'm just answering your questions.
If dkanter has any reasons not to cover other vendors, I'm pretty sure it's not for the lack of base material.
He's one of the few writing about Intel's non-exclusively-CPU solutions - which tend to pass around unnoticed in the media, and that's enough of a reason for me.

I would guess other vendors have launched products but haven't made disclosures commensurate with RWT's depth.
 
I would guess other vendors have launched products but haven't made disclosures commensurate with RWT's depth.

I for one hope to see a detailed report on Nvidia's upcoming Denver CPU.

Between this patent information:

Instruction-optimizing processor with branch-count table in hardware
https://www.google.com/patents/US20130311752

and announced information: 7-way superscalar, 128K/64K L1 caches

Tegra K1 64-bit Denver core analysis: Are Nvidia’s x86 efforts hidden within?
http://www.extremetech.com/computin...nalysis-are-nvidias-x86-efforts-hidden-within

and a 3-year development timeframe, it will be interesting to see what Nvidia's custom 64-bit ARM CPU is.
 
The wide issue could point to a LIW or VLIW processor core for Nvidia, if the patents turn out to be implemented.
The patent makes note that there is a hardware decoder for non-native instructions and that native instructions bypass it. There's some ambiguity here as to whether that means there's a decode engine and an implied native decoder, or if Nvidia is going for a really old school VLIW internal representation. I'd feel more comfortable with the idea that there's at least some level of decode still present, even if it's not mentioned in the patent.

Since Nvidia wants HPC, giving details should be encouraged.
On the other hand, if it is a code-morphing processor, that's a mark against transparency if Transmeta's behavior is any guide. RWT had an article about reverse-engineering it.

One handy thing that might come of this isn't so much ARM or x86 compatibility: compatibility with the GPU ISA could be offered as well, for cases where, say, a kernel showed bad divergence or small granularity.
Kepler's inclusion of dependence data in its ISA would make creating instruction packets easier, but that's more a what-if and not a KNL speculation.


As far as KNL goes, if it uses Silvermont as a basis, how it works in relation to a 4-way SMT setup with vector requirements will be interesting.
That core's rename resources are not large enough for 4-way threading to avoid significant resource pressure; the load and store queues will fill with outstanding accesses with that many threads in a stream-compute context; and the FP rename capability is simply not large enough for an AVX3 32-register context.
The design choices for Silvermont would point to a multithreading scenario where register files get duplicated, instead of pointer following in a physical register file like Haswell and the like.

I'm curious if the explicit breakout of the VPU means a new domain or possibly a coprocessor model like AMD's FPU, or something even more separate.
The VPUs are going to like using gather, have many outstanding accesses, tolerate latency, and like a lot of bandwidth. Itanium routed FP accesses straight to the L2.
What if they combined all of this and gave a separate VPU ALU and memory domain that hooks into that new and custom L2?
(edit: not sure how they'd approach the reorder buffer, sharing one with the current size poses the risk of it becoming a hazard, but the size or sharing/duplication of it are knobs that can be turned)
 
I'm curious if the explicit breakout of the VPU means a new domain or possibly a coprocessor model like AMD's FPU, or something even more separate.
If you look at Silvermont, it actually looks almost like AMD's K7 through K10, with separate reservation stations for each unit instead of a unified scheduler for everything as in Intel's high-performance CPUs (or two schedulers, one for int and one for FP, in the case of AMD's BD/PD/SR).
 
Mmm, this looks bootiful:

http://www.tweaktown.com/news/44225/details-intels-next-gen-knights-landing-platform/index.html

6 TFLOPS SP, 3 TFLOPS DP, 300W. All 60 cores directly accessible to the OS. 16GB of on-package (pseudo-HMC, it seems) RAM at 400GB/s. 384GB per socket, off package.

I see now why NVidia is pushing neural nets running at limited precision. Intel's just made CUDA irrelevant for all other HPC.

KNL does indeed look interesting, although Intel needs to show they can use all those peak numbers to do something useful. KNC was abysmal in that regard - it was tough to use more than 50% of off-chip memory bandwidth, for example, meaning that AMD and NVIDIA GPUs had much higher sustained throughputs than KNC, even though their peak bandwidths were lower.

KNL's peak numbers are actually a little less than I was expecting, especially since GM200 already beats it in SP Flops and Fiji will likely beat it in memory bandwidth, and both of them are shipping earlier than KNL.

However, GP100, which will be KNL's main competitor, will likely dominate KNL in SP & DP flops, as well as off chip bandwidth.

So I think it's demonstrably false that CUDA's irrelevant for other HPC, even if KNL does look pretty interesting.
 
http://www.theplatform.net/2015/03/25/more-knights-landing-xeon-phi-secrets-unveiled/

The interesting thing about the Knights Landing processor is that it will have three memory modes. The first mode is the 46-bit physical addressing and 48-bit virtual addressing used with the current Xeon processors, only addressing that DDR4 main memory. In the second mode, which is called cache mode, that 16 GB of near memory is used as a fast cache for the DDR4 far memory on the Knights Landing package. The third mode is called flat mode, and in this mode the 384 GB of DDR4 memory and 16 GB of MCDRAM memory are turned into a single address space, and programmers have to allocate specifically into the near memory. Intel is tweaking its compilers so Fortran can allocate into the near memory using this flat addressing mode.

Mmm, 16GB "L3 cache".

“Actually, we did debate that quite a bit, making a two-socket,” says Sodani, “One of the big reasons for not doing it was that given the amount of memory bandwidth we support on the die – we have 400 GB/sec plus of bandwidth – even if you make the thing extremely NUMA aware and even if only five percent of the time you have to snoop on the other side, that itself would be 25 GB/sec worth of snoop and that would swamp any QPI channel.”
 
KNL's use of 90GB/sec DDR4 seems laughable at first compared to even today's 300GB/sec GPUs, soon to be 1TB/sec. But the real use of this is for a large data store, not for performance. When you have 16GB of fast MCDRAM on the package, the off-chip DDR4 RAM is essentially used like host system memory is now... a slow but large main data staging and storage region. And for HPC, having that memory as dense DIMMs gives you flexibility and a very large maximum size (384 GB). So I almost think of it as replacing a 6GB/sec PCIE bus to system memory with a 90GB/sec connection instead. KNL is the system now.
 
So I think it's demonstrably false that CUDA's irrelevant for other HPC, even if KNL does look pretty interesting.
The reason GPUs became interesting is that they offered 1-2 orders of magnitude greater compute density or performance per watt or per $ (and combinations thereof).

None of those things apply with Knights Landing in the wild. And it will run the code that everyone has been running, and happily support compute paradigms beyond crappy old MPI and OpenMP.

You can freely choose your desired mix of task- and data-parallel kernels within a single architecture, memory hierarchy, execution model, instruction set and clustering topology. It's a flat, sane landscape.
 