22 nm Larrabee

Discussion in 'Architecture and Products' started by Nick, May 6, 2011.

  1. rapso

    Newcomer

    Joined:
    May 6, 2008
    Messages:
    215
    Likes Received:
    27
    I wish Intel would extend those cards into the lower end. Some $500, 50-core card just for development/research would be nice. Otherwise, sooner or later, there will be more software optimized for GPUs than for Xeon Phi and the x86 argument will be void.
     
  2. Dade

    Newcomer

    Joined:
    Dec 20, 2009
    Messages:
    206
    Likes Received:
    20
    I agree, it would be very useful for developers, and they could also sell a few cards into some niche markets like rendering, etc.

    However, they would have to be competitive with gaming GPUs used in the same niche markets (and I'm not sure they can be).
     
  3. rapso

    Newcomer

    Joined:
    May 6, 2008
    Messages:
    215
    Likes Received:
    27
    Well, nobody knows whether they are competitive, and that's already a problem.
    And I didn't want to imply those cards should do rendering, rather that you could port and optimize any software you currently optimize for normal CPUs. Right now you can write software and optimize it on some $300 i7-4700, and that software can run on crazy expensive 8-socket Xeon systems (e.g. fluid simulation, databases, chess engines, ...). But it's not that way with the Xeon Phi: until someone writes dedicated software for it, they can only sell it to devs, not to a rendering farm or the like.
     
  4. Blakhart

    Newcomer

    Joined:
    Sep 27, 2006
    Messages:
    103
    Likes Received:
    0
    No Larrabees to be had on PCIe?
     
  5. liolio

    liolio Aquoiboniste
    Legend

    Joined:
    Jun 28, 2005
    Messages:
    5,723
    Likes Received:
    193
    Location:
    Stateless
  6. iMacmatician

    Regular

    Joined:
    Jul 24, 2010
    Messages:
    771
    Likes Received:
    200
    Informative read. I assumed that modified Atom cores would work for Knights Landing, but it looks like they won't (at least not without changing them so much that they'd basically be a separate core). I'm looking forward to their follow-up article.
     
  7. Paran

    Regular Newcomer

    Joined:
    Sep 15, 2011
    Messages:
    251
    Likes Received:
    14
  8. iMacmatician

    Regular

    Joined:
    Jul 24, 2010
    Messages:
    771
    Likes Received:
    200
    A link in the comments goes to a heise online report "Intel's supercomputer accelerators: Xeon Phi with 72 cores, integrated InfiniBand and 3 teraflops DGEMM" (original), which claims (among other specs) that Knights Landing will have up to 72 "advanced" Silvermont cores.

    72 cores and 3 DP TFLOPS would imply, if all the cores are active, 2.6 GHz with 1 512-bit vector unit per core or 1.3 GHz with 2 512-bit vector units per core. In either situation I would not be surprised if release parts had some cores disabled (like in Knights Corner) and in that case I would expect potentially higher clocks depending on the TFLOPS numbers.
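
    For reference, the back-of-the-envelope math as a small sketch (assuming one FMA per 512-bit vector unit per cycle, i.e. 16 DP FLOPS per unit per clock; the function is just illustrative):

        #include <stdio.h>

        /* Implied clock for a target DP TFLOPS figure, assuming each 512-bit
         * vector unit retires one fused multiply-add per cycle (16 DP FLOPS). */
        static double implied_ghz(double tflops, int cores, int vec_units_per_core)
        {
            const double flops_per_core_per_cycle = 16.0 * vec_units_per_core;
            return tflops * 1e3 / (cores * flops_per_core_per_cycle); /* GHz */
        }

        int main(void)
        {
            printf("1 VPU/core:  %.2f GHz\n", implied_ghz(3.0, 72, 1)); /* ~2.6 GHz */
            printf("2 VPUs/core: %.2f GHz\n", implied_ghz(3.0, 72, 2)); /* ~1.3 GHz */
            return 0;
        }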
     
  9. Paran

    Regular Newcomer

    Joined:
    Sep 15, 2011
    Messages:
    251
    Likes Received:
    14
    Sounds plausible; the leaked roadmap claimed ~3+ TFLOPS and 14-16 GFLOPS/W.

    Pretty big stacked memory too: 8 or 16 GB at up to 500 GB/s.
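
    Taken together, those two figures imply a board power somewhere around 190-215 W. A quick sanity-check sketch (nothing here beyond the leaked numbers themselves):

        #include <stdio.h>

        /* Implied board power from the leaked ~3 TFLOPS and 14-16 GFLOPS/W figures. */
        int main(void)
        {
            const double tflops = 3.0;
            printf("at 14 GFLOPS/W: %.0f W\n", tflops * 1000.0 / 14.0); /* ~214 W */
            printf("at 16 GFLOPS/W: %.0f W\n", tflops * 1000.0 / 16.0); /* ~188 W */
            return 0;
        }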
     
  10. dkanter

    Regular

    Joined:
    Jan 19, 2008
    Messages:
    360
    Likes Received:
    20
    Silvermont is OOOE for integer and memory; the FPU is in-order.

    David
     
  11. Exophase

    Veteran

    Joined:
    Mar 25, 2010
    Messages:
    2,406
    Likes Received:
    429
    Location:
    Cleveland, OH
    The queue in front of the AGEN unit itself is in-order, which is what I was referring to in my post.

    According to the optimization manual:

    "Out-of-order execution for integer instructions and de-coupled ordering between non-integer and memory instructions."

    "Memory instructions must generate their addresses (AGEN) in-order and schedule from the scheduling queue in-order but they may complete out-of-order."

    Simply saying Silvermont is OOOE for memory is not telling the whole story.
     
  12. dkanter

    Regular

    Joined:
    Jan 19, 2008
    Messages:
    360
    Likes Received:
    20
    In-order issue isn't really a problem with a retry buffer and a single AGU.

    David
     
  13. Exophase

    Veteran

    Joined:
    Mar 25, 2010
    Messages:
    2,406
    Likes Received:
    429
    Location:
    Cleveland, OH
    Unless the address-generation-to-use latency is a single cycle (the manual doesn't say what it is, but I strongly doubt it), it's going to stall at least occasionally on loads that depend on loads, i.e. any kind of pointer chasing. That's not just going to happen when traversing data structures, but also with stuff like applying LUTs to loaded values or loading a pointer off the stack. If you have something like this in a tight loop you can miss a lot of opportunities to hide the latency of the loads. That latency can be hard to hide in software with only 16 registers, and it's considerably worse if you're stuck with 8 registers like on Android.
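
    A contrived sketch of the kind of code in question (purely illustrative, nothing from the manual): a linked-list walk with a LUT applied to each loaded value, so every iteration has loads whose addresses depend on earlier loads.

        #include <stddef.h>
        #include <stdint.h>

        struct node {
            struct node *next;
            uint8_t      value;
        };

        /* Walk a linked list and remap each value through a lookup table.
         * Each iteration has a load whose address depends on the previous
         * load (n->next) and a load whose address depends on a loaded value
         * (lut[n->value]) -- the load-to-load dependencies described above. */
        uint32_t sum_remapped(const struct node *n, const uint8_t lut[256])
        {
            uint32_t sum = 0;
            for (; n != NULL; n = n->next)
                sum += lut[n->value];
            return sum;
        }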

    I don't know what the typical impact is going to be (I'll let someone else try to profile that) but the in-order nature of the MEC is not something that should be overlooked entirely.
     
  14. Gubbi

    Veteran

    Joined:
    Feb 8, 2002
    Messages:
    3,519
    Likes Received:
    852
    The queue in front of the AGEN is fed from the global ROB/instruction queue when operands are ready. I.e. once issued to the AGEN reservation station, ops run at full tilt.

    Your pointer-chasing scenario doesn't stall the AGEN unit, because the AGEN for the dependent load won't be issued to the AGEN queue until the first load has completed.

    Cheers
     
  15. Exophase

    Veteran

    Joined:
    Mar 25, 2010
    Messages:
    2,406
    Likes Received:
    429
    Location:
    Cleveland, OH
    "The MEC also owns the MEC RSV, which is responsible for scheduling of all loads and stores. Load and store instructions go through addresses generation phase in program order to avoid on-the-fly memory ordering later in the pipeline. Therefore, an unknown address will stall younger memory instructions"

    You seem to be describing the instruction queue as a unified scheduler, which is not the case. The ARR (allocation/rename/reorder) cluster processes instructions in-order, even though the ROB tracks out-of-order completion towards retirement. The 32-entry post-decode instruction queue is there to absorb some of the penalty of first-stage branch mispredictions and to act as a loop buffer; it's not a scheduler.

    The actual scheduling is distributed to the separate reservation stations, and only the two integer ones allow execution to begin from any place within the RSV. The two FEC and the MEC RSVs can only issue from the oldest slot.

    "Each reservation station is responsible for receiving up to 2 ops from the ARR cluster in a cycle and selecting one ready op for dispatching to execution as soon as the op becomes ready."

    The dependent load in my scenario will be put in the MEC the same cycle as the previous one, assuming that they are output from the ARR in the same cycle. Then the dependent one will sit in the queue and wait for the older one to finish, where "finish" presumably means that the value is back from L1 cache. While doing so it'll block all younger independent memory instructions.
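
    To make the blocking concrete, a hypothetical sequence (illustrative only, names made up):

        #include <stdint.h>

        /* The second load's address depends on the first load's result, so its
         * AGEN cannot proceed until load 1 completes. Because the MEC RSV issues
         * strictly from the oldest slot, the third load -- whose address is
         * already known -- is held up behind it anyway. */
        uint32_t example(uint32_t **pp, const uint32_t *independent)
        {
            uint32_t *p = *pp;          /* load 1 */
            uint32_t  a = *p;           /* load 2: address depends on load 1 */
            uint32_t  b = *independent; /* load 3: independent, but younger */
            return a + b;
        }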
     
  16. chris1515

    Veteran Regular

    Joined:
    Jul 24, 2005
    Messages:
    3,415
    Likes Received:
    2,040
    Location:
    Barcelona Spain
    #1196 chris1515, Aug 16, 2016
    Last edited: Aug 16, 2016
  17. CSI PC

    Veteran Newcomer

    Joined:
    Sep 2, 2015
    Messages:
    2,050
    Likes Received:
    844
  18. ImSpartacus

    Regular Newcomer

    Joined:
    Jun 30, 2015
    Messages:
    252
    Likes Received:
    199
  19. sebbbi

    Veteran

    Joined:
    Nov 14, 2007
    Messages:
    2,924
    Likes Received:
    5,288
    Location:
    Helsinki, Finland
    8 texture sampling units took 10% of die space, and they would have wanted 16 to be competitive = 20% of die space. I didn't realize modern texture sampling units are so large. I'm wondering whether the tiny 4 KB L1 texture caches are tied to the sampling units (they borrowed the sampling units from their GPU); IIRC those hold uncompressed texels ready for filtering. Intel GPUs also have L2 texture caches (16 KB / 24 KB), but a general-purpose cache would likely replace the bigger L2 texture cache.
     
  20. silent_guy

    Veteran Subscriber

    Joined:
    Mar 7, 2006
    Messages:
    3,754
    Likes Received:
    1,379