22 nm Larrabee

Discussion in 'Architecture and Products' started by Nick, May 6, 2011.

  1. rapso

    rapso Newcomer

    I wish Intel would extend those cards into the lower end. Some $500 50-core card, just for development/research, would be nice. Otherwise, sooner or later, there will be more software optimized for GPUs than for Xeon Phi and the x86 argument will be void.
     
  2. Dade

    Dade Newcomer

    I agree, it would be very useful for developers, and they could also sell a few cards for some niche markets like rendering, etc.

    However, they would have to be competitive with gaming GPUs used in the same niche markets (and I'm not sure they can be).
     
  3. rapso

    rapso Newcomer

    Well, nobody knows if they are competitive, and that's already a problem.
    And I did not want to imply those cards should do rendering, rather that you could port and optimize any software you currently optimize for normal CPUs. Right now you can write software and optimize it on some $300 i7-4700, and that software can run on crazy expensive 8-socket Xeon systems (e.g. some fluid simulation, databases, chess engines, ...). But it's not that way with the Xeon Phi; they can only sell it to devs right now, no rendering farm or anything until someone writes dedicated software for it.
     
  4. Blakhart

    Blakhart Newcomer

    No Larrabees to be had in PCIe form?
     
  5. liolio

    liolio Aquoiboniste Legend

  6. iMacmatician

    iMacmatician Regular

    Informative read. I assumed that modified Atom cores would work for Knights Landing, but it looks like they won't (at least without changing them so much that they'd basically be a separate core). I'm looking forward to their follow-up article.
     
  7. Paran

    Paran Regular

  8. iMacmatician

    iMacmatician Regular

    A link in the comments goes to a heise online report "Intel's supercomputer accelerators: Xeon Phi with 72 cores, integrated InfiniBand and 3 teraflops DGEMM" (original), which claims (among other specs) that Knights Landing will have up to 72 "advanced" Silvermont cores.

    72 cores and 3 DP TFLOPS would imply, if all the cores are active, 2.6 GHz with one 512-bit vector unit per core or 1.3 GHz with two 512-bit vector units per core. In either case I would not be surprised if release parts had some cores disabled (like in Knights Corner), and in that case I would expect potentially higher clocks depending on the TFLOPS numbers.
     
  9. Paran

    Paran Regular

    Sounds plausible; the leaked roadmap claimed ~3+ TFLOPS and 14-16 GFLOPS/W.

    Pretty big stacked memory too: 8 or 16 GB, at up to 500 GB/s.
     
  10. dkanter

    dkanter Regular

    Silvermont is OOOE for integer and memory; the FPU is in-order.

    David
     
  11. Exophase

    Exophase Veteran

    The queue in front of the AGEN unit itself is in-order, which is what I was referring to in my post.

    According to the optimization manual:

    "Out-of-order execution for integer instructions and de-coupled ordering between non-integer and memory instructions."

    "Memory instructions must generate their addresses (AGEN) in-order and schedule from the scheduling queue in-order but they may complete out-of-order."

    Simply saying Silvermont is OOOE for memory is not telling the whole story.
     
  12. dkanter

    dkanter Regular

    In-order issue isn't really a problem with a retry buffer and a single AGU.

    David
     
  13. Exophase

    Exophase Veteran

    Unless the address-generation-to-use latency is a single cycle (the manual doesn't say what it is, but I strongly doubt it), it's going to stall at least occasionally on loads depending on loads, i.e. any kind of pointer chasing. That's not just going to happen when traversing data structures, but also with things like applying LUTs to loaded values or loading a pointer off the stack. If you have something like this in a tight loop, you can miss a lot of opportunity for hiding the latency of the loads. That latency can be hard to hide in software with only 16 registers, and it's considerably worse if you're stuck with 8 registers like on Android.

    I don't know what the typical impact is going to be (I'll let someone else try to profile that) but the in-order nature of the MEC is not something that should be overlooked entirely.
     
  14. Gubbi

    Gubbi Veteran

    The queue in front of the AGEN is fed from the global ROB/instruction queue when operands are ready. I.e. once issued to the AGEN reservation station, ops run at full tilt.

    Your pointer-chasing scenario doesn't stall the AGEN unit, because the AGEN for the dependent load won't be issued to the AGEN queue until the first load has completed.

    Cheers
     
  15. Exophase

    Exophase Veteran

    "The MEC also owns the MEC RSV, which is responsible for scheduling of all loads and stores. Load and store instructions go through addresses generation phase in program order to avoid on-the-fly memory ordering later in the pipeline. Therefore, an unknown address will stall younger memory instructions"

    You seem to be describing the instruction queue as a unified scheduler, which is not the case. The ARR (allocation/rename/reorder) cluster processes instructions in-order, even if the ROB tracks out of order completion towards retirement. The 32-entry post-decode instruction queue is there to absorb some of the penalty of first-stage branch mispredictions and to act as a loop buffer, it's not a scheduler.

    The actual scheduling is distributed to the separate reservation stations, and only the two integer ones allow execution to begin from any place within the RSV. The two FEC and the MEC RSVs can only issue from the oldest slot.

    "Each reservation station is responsible for receiving up to 2 ops from the ARR cluster in a cycle and selecting one ready op for dispatching to execution as soon as the op becomes ready."

    The dependent load in my scenario will be put in the MEC the same cycle as the previous one, assuming that they are output from the ARR in the same cycle. Then the dependent one will sit in the queue and wait for the older one to finish, where "finish" presumably means that the value is back from L1 cache. While doing so it'll block all younger independent memory instructions.
     
  16. chris1515

    chris1515 Legend

    Last edited: Aug 16, 2016
  17. CSI PC

    CSI PC Veteran

  18. ImSpartacus

    ImSpartacus Regular

  19. sebbbi

    sebbbi Veteran

    8 texture sampling units took 10% of die space, and they would have needed 16 to be competitive = 20% of die space. I didn't realize modern texture sampling units are so large. I wonder whether the tiny 4 KB L1 texture caches are tied to the sampling unit (they borrowed the sampling units from their GPU). IIRC it contains uncompressed texels ready for filtering. Intel GPUs also have L2 texture caches (16 KB / 24 KB), but a general-purpose cache would likely replace the bigger L2 texture cache.
     
  20. silent_guy

    silent_guy Veteran Subscriber
