22 nm Larrabee

Discussion in 'Architecture and Products' started by Nick, May 6, 2011.

  1. rapso

    rapso Newcomer

    I wish Intel would extend those cards into the lower end. Some $500 50-core card, just for development/research, would be nice. Otherwise, sooner or later, there will be more software optimized for GPUs than for Xeon Phi and the x86 argument will be void.
     
  2. Dade

    Dade Newcomer

    I agree, it would be very useful for developers, and they could also sell a few cards for some niche markets like rendering, etc.

    However, they would have to be competitive with gaming GPUs used in the same niche markets (and I'm not sure they can be).
     
  3. rapso

    rapso Newcomer

    Well, nobody knows if they are competitive, and that's already a problem.
    And I did not want to imply those cards should do rendering, rather that you could port and optimize any software you currently optimize for normal CPUs. Right now you can write software and optimize it on some $300 i7-4700, and that software can run on crazy expensive 8-socket Xeon systems (e.g. some fluid simulation, databases, chess engines, ...). But it's not that way with the Xeon Phi; they can only sell it to devs right now, no rendering farm or anything until someone writes dedicated software for it.
     
  4. Blakhart

    Blakhart Newcomer

    No Larrabees to be had in PCIe form?
     
  5. liolio

    liolio Aquoiboniste Legend

  6. iMacmatician

    iMacmatician Regular

    Informative read. I assumed that modified Atom cores would work for Knights Landing, but it looks like they won't (at least without changing them so much that they'd basically be a separate core). I'm looking forward to their follow-up article.
     
  7. Paran

    Paran Regular

  8. iMacmatician

    iMacmatician Regular

    A link in the comments goes to a heise online report "Intel's supercomputer accelerators: Xeon Phi with 72 cores, integrated InfiniBand and 3 teraflops DGEMM" (original), which claims (among other specs) that Knights Landing will have up to 72 "advanced" Silvermont cores.

    72 cores and 3 DP TFLOPS would imply, if all the cores are active, 2.6 GHz with one 512-bit vector unit per core or 1.3 GHz with two 512-bit vector units per core. In either case I would not be surprised if release parts had some cores disabled (like in Knights Corner), and in that case I would expect potentially higher clocks depending on the TFLOPS numbers.
     
  9. Paran

    Paran Regular

    Sounds plausible; the leaked roadmap claimed ~3+ TFLOPS and 14-16 GFLOPS/W.

    Pretty big stacked memory too: 8 or 16 GB, at up to 500 GB/s.
     
  10. dkanter

    dkanter Regular

    Silvermont is OOOE for integer and memory; the FPU is in-order.

    David
     
  11. Exophase

    Exophase Veteran

    The queue in front of the AGEN unit itself is in-order, which is what I was referring to in my post.

    According to the optimization manual:

    "Out-of-order execution for integer instructions and de-coupled ordering between non-integer and memory instructions."

    "Memory instructions must generate their addresses (AGEN) in-order and schedule from the scheduling queue in-order but they may complete out-of-order."

    Simply saying Silvermont is OOOE for memory is not telling the whole story.
     
  12. dkanter

    dkanter Regular

    In-order issue isn't really a problem with a retry buffer and a single AGU.

    David
     
  13. Exophase

    Exophase Veteran

    Unless the address-generation-to-use latency is a single cycle (the manual doesn't say what it is, but I strongly doubt it), it's going to stall at least occasionally on loads depending on loads, i.e. any kind of pointer chasing. That's not just going to happen when traversing data structures, but also with things like applying LUTs to loaded values or loading a pointer off the stack. If you have something like this in a tight loop, you can miss a lot of opportunity for hiding the latency of the loads. That latency can be hard to hide in software with only 16 registers, and it's considerably worse if you're stuck with 8 registers like on Android.

    I don't know what the typical impact is going to be (I'll let someone else try to profile that) but the in-order nature of the MEC is not something that should be overlooked entirely.
     
  14. Gubbi

    Gubbi Veteran

    The queue in front of the AGEN is fed from the global ROB/instruction queue when operands are ready. I.e. once issued to the AGEN reservation station, ops run at full tilt.

    Your pointer-chasing scenario doesn't stall the AGEN unit, because the AGEN for the dependent load won't be issued to the AGEN queue until the first load has completed.

    Cheers
     
  15. Exophase

    Exophase Veteran

    "The MEC also owns the MEC RSV, which is responsible for scheduling of all loads and stores. Load and store instructions go through addresses generation phase in program order to avoid on-the-fly memory ordering later in the pipeline. Therefore, an unknown address will stall younger memory instructions"

    You seem to be describing the instruction queue as a unified scheduler, which is not the case. The ARR (allocation/rename/reorder) cluster processes instructions in-order, even if the ROB tracks out of order completion towards retirement. The 32-entry post-decode instruction queue is there to absorb some of the penalty of first-stage branch mispredictions and to act as a loop buffer, it's not a scheduler.

    The actual scheduling is distributed to the separate reservation stations, and only the two integer ones allow execution to begin from any place within the RSV. The two FEC and the MEC RSVs can only issue from the oldest slot.

    "Each reservation station is responsible for receiving up to 2 ops from the ARR cluster in a cycle and selecting one ready op for dispatching to execution as soon as the op becomes ready."

    The dependent load in my scenario will be put in the MEC the same cycle as the previous one, assuming that they are output from the ARR in the same cycle. Then the dependent one will sit in the queue and wait for the older one to finish, where "finish" presumably means that the value is back from L1 cache. While doing so it'll block all younger independent memory instructions.
     
  16. chris1515

    chris1515 Legend

    Last edited: Aug 16, 2016
  17. CSI PC

    CSI PC Veteran

  18. ImSpartacus

    ImSpartacus Regular

  19. sebbbi

    sebbbi Veteran

    8 texture sampling units took 10% of die space, and they would have needed 16 to be competitive = 20% of die space. I didn't realize modern texture sampling units are so large. I wonder whether the tiny 4 KB L1 texture caches are tied to the sampling unit (they borrowed the sampling units from their GPU). IIRC it contains uncompressed texels ready for filtering. Intel GPUs also have L2 texture caches (16 KB / 24 KB), but a general-purpose cache would likely replace the bigger L2 texture cache.
     
  20. silent_guy

    silent_guy Veteran Subscriber
