22 nm Larrabee

Discussion in 'Architecture and Products' started by Nick, May 6, 2011.

  1. Nick

    Veteran

    Joined:
    Jan 7, 2003
    Messages:
    1,881
    Likes Received:
    17
    Location:
    Montreal, Quebec
    I don't think such an instruction exists. I believe Gipsel was describing the functionality from a conceptual point of view. That is, if such an instruction existed, implementing Knights Corner's gather would mainly be a matter of collecting all the other elements in the same cache line and updating the mask register.
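
    Conceptually, that would work something like this (just a sketch: gather_step() is a made-up name for that hypothetical one-cache-line-at-a-time operation, not an actual mnemonic or intrinsic):

    Code:
    /* Conceptual sketch of a KNC-style gather: each step fetches whatever
       elements it can service from one cache line, writes them into their
       lanes of dest and clears the corresponding bits of the mask; software
       loops until the mask is empty. gather_step() is hypothetical. */
    unsigned short mask = 0xFFFF;                    /* 16 lanes still outstanding */
    while (mask != 0)
        gather_step(&dest, base, indices, &mask);    /* clears the bits it served  */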
     
  2. Gipsel

    Veteran

    Joined:
    Jan 4, 2010
    Messages:
    1,620
    Likes Received:
    264
    Location:
    Hamburg, Germany
    @Exophase:
    I was basically saying that a gather in the worst case is probably not much faster than loading all the elements manually and individually (and you could save the loop, which with gather runs 16 iterations in the worst case). The reason it wouldn't work that well is simply that there is no efficient way to manipulate the write masks (you can't use an immediate or a scalar reg as a write mask, or manipulate a write mask directly with a bit shift).
    But what you called the major hindrance (individually moving the offsets of each element from the vector regs to scalar regs) isn't that much of a problem in my opinion. It's a single vector move to memory (L1-D$), followed by loading the offsets as scalars to use as (part of the) address for the broadcast instruction. It is highly probable that one vector load op (the broadcast instructions run through the vector pipe) and one scalar load can run in parallel. It would have been pretty fast in this case if KC allowed immediates as masks.
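
    To make the write-mask point concrete: without an immediate mask, every element needs some extra scalar work before its broadcast, roughly like this (sketch only; gpr_to_mask() and masked_broadcast() are made-up stand-ins, not actual KNC mnemonics):

    Code:
    /* Per-element write-mask handling in a broadcast-based gather emulation.
       Since the mask can't be an immediate, the single-lane mask has to be
       built in a scalar register and moved into a mask register first.
       gpr_to_mask() and masked_broadcast() are hypothetical helpers. */
    for (int i = 0; i < 16; i++) {
        gpr_to_mask(k1, 1u << i);                        /* scalar reg -> mask reg */
        masked_broadcast(dest, &base[offsets[i]], k1);   /* element i into lane i  */
    }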
     
  3. Exophase

    Veteran

    Joined:
    Mar 25, 2010
    Messages:
    2,406
    Likes Received:
    429
    Location:
    Cleveland, OH
    Okay, I think I understand why I was confused - I didn't realize broadcasts came from memory instead of scalar registers. So your proposed loop is like this:

    Code:
    /* Pseudocode for the broadcast-based gather emulation: spill the vector of
       offsets to memory (L1), then reload each offset as a scalar and use it to
       address a masked broadcast that fills one lane of the destination.
       16 iterations for the 16 32-bit lanes of a 512-bit vector. */
    vmov(offsets_array, vector_offsets);               /* vector store of the offsets */
    for (int i = 0; i < 16; i++)
    {
      int offset = offsets_array[i];                   /* scalar reload of offset i   */
      vbroadcastss(dest_vector, &(base[offset]), vector_mask_field(i));
    }
    
    But I'm not convinced that these two loads can actually be performed in parallel. Do you have any information that the cache can perform two line loads, one vector and one scalar, in one cycle? Even then, we'd have to know that the broadcast uses the vector load path...

    Even if these do co-issue, I strongly doubt you'll be able to issue them back to back like that without a stall; you'd eat the full latency of that scalar load, which is probably substantial when it feeds the effective address of the next instruction. You'd also need to hope that store-to-load forwarding from a vector store to a scalar load is any good. You'd probably need a fair number of independent instructions at the start of the loop to get things rolling, and you'd easily blow through all your scalar registers.
     
  4. Gipsel

    Veteran

    Joined:
    Jan 4, 2010
    Messages:
    1,620
    Likes Received:
    264
    Location:
    Hamburg, Germany
    I actually proposed to completely unroll it, but otherwise yes.
    The latter is stated in the ISA manual, iirc. But you are probably right about the former point: although the vector memory operations have their own address generation (mainly to enable scatter/gather and the different interpretation of offsets) and the L1-D$ allows two simultaneous accesses, it appears to have only a single read port and a single write port. So everything else doesn't matter much, after all.
    I somewhat doubt the load-to-use latency of the L1 is higher than 4 cycles (and a store won't be much slower). That Knights Corner supports 4-way SMT looks like a natural fit to hide the L1 latency almost completely. I guess there is also a reason behind the ability to use memory operands directly in the vector operations: the L1 has to work as an extension of the (compared to GPUs) pretty small register file. It has to be quite fast and low-latency to accomplish that (and as Intel builds 4-cycle-latency, 32 kB L1-D$ running at ~4 GHz in their CPUs, a 32 kB L1-D$ running at ~1.3 GHz does not look like a very hard problem). You can almost think of the vector registers as a slightly larger register file cache, similar to what nV likes to talk about in connection with Maxwell/Einstein/Project Denver, just explicitly managed by the running program.
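
    As a rough illustration of that "explicitly managed" idea, here is a sketch under assumptions: it uses AVX-512-style _mm512_* intrinsics rather than KNC's own, the accumulate() function and its 8-slot scratch array are made up, and n is assumed to be a multiple of 128. Temporaries that don't fit into the vector registers live in a small aligned scratch array that stays resident in the L1-D$, so reloading them costs only the few cycles of L1 load-to-use latency, which SMT can hide:

    Code:
    #include <immintrin.h>

    /* Hypothetical example: treat a small L1-resident scratch array as an
       extension of the vector register file by explicitly "spilling" and
       reloading accumulators each iteration. */
    void accumulate(const float *in, float *out, int n)     /* n: multiple of 128 */
    {
        __attribute__((aligned(64))) float scratch[8][16];  /* 8 "virtual" vector regs in L1 */

        for (int j = 0; j < 8; j++)
            _mm512_store_ps(scratch[j], _mm512_setzero_ps());

        for (int i = 0; i < n; i += 8 * 16) {
            for (int j = 0; j < 8; j++) {
                __m512 acc = _mm512_load_ps(scratch[j]);    /* reload "spilled" accumulator */
                acc = _mm512_add_ps(acc, _mm512_loadu_ps(in + i + j * 16));
                _mm512_store_ps(scratch[j], acc);           /* spill it back to L1          */
            }
        }

        for (int j = 0; j < 8; j++)
            _mm512_storeu_ps(out + j * 16, _mm512_load_ps(scratch[j]));
    }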
     
  5. rapso

    Newcomer

    Joined:
    May 6, 2008
    Messages:
    215
    Likes Received:
    27
    Having a gather instruction that returns immediately instead of looping/stalling internally can have some nice advantages (if they prefetch the remaining loads).

    You could XOR the returned mask against the previous one and do your work with only the lanes that have already returned.
    I imagine something like a kd-tree traversal could be really efficient: the nearer a node is to the root, the higher the chance it's in L2 or even L1, and if one lane needs a slow fetch from RAM, all the other lanes could continue working. It would be 'finer threading' than what GPUs do with wavefronts/warps. With a very divergent access pattern, e.g. secondary rays in a path tracer, I'd think it's way more efficient on LRB, as you could work on at least something rather than keeping 31 lanes (or threads) on hold because one is still not loaded.
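
    A rough sketch of what that could look like in code (purely conceptual: gather_step() and process_lanes() are made-up helpers standing in for a partially-completing gather and the per-lane work, not real KNC instructions or intrinsics):

    Code:
    /* Issue a gather that completes whatever lanes it can, then immediately
       work on the lanes that have arrived while the rest are (hopefully)
       being prefetched. gather_step() and process_lanes() are hypothetical. */
    unsigned short pending = 0xFFFF;                  /* all 16 lanes outstanding     */
    while (pending) {
        unsigned short before = pending;
        gather_step(&v, base, indices, &pending);     /* clears the bits it served    */
        unsigned short arrived = before & ~pending;   /* i.e. before XOR pending      */
        process_lanes(&v, arrived);                   /* work on what we already have */
    }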
     
  6. Gipsel

    Veteran

    Joined:
    Jan 4, 2010
    Messages:
    1,620
    Likes Received:
    264
    Location:
    Hamburg, Germany
    As you also have to loop over the vector instructions that process the returned data until everything has arrived (at least in the worst case), it looks like it could be an excellent way of wasting energy for no, or even a negative, performance gain. But I agree that one may be able to construct cases where it could help. As you said, tree traversals are a candidate, but one has to look carefully: one needs a likely early out of the loop, otherwise it will be slower. If you need all of the returned data anyway, it's not an option to process some of it earlier and some other parts later (it's just SIMD with lane masking, like on GPUs, after all).
     
  7. rapso

    Newcomer

    Joined:
    May 6, 2008
    Messages:
    215
    Likes Received:
    27
    Spinning on a condition (the completion mask from the load units) isn't beneficial either, unless Intel does some smart energy saving in those tight loops like they have done on their Core CPUs since Nehalem.
    I'm not sure how many gather instructions you could have in the pipe simultaneously to unroll the spinning and start working on the first gather that is done, but it sounds like quite an overhead to emulate the warp/wavefront scheduling of GPUs, and you would still do some spinning. It might have been smarter to have a stalling gather, if the other 'hyper-threads' could continue working and pick up the time slices of the waiting 'hyper-thread'.
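
    Something like the following is what I mean by unrolling the spinning (again just a conceptual sketch reusing the made-up gather_step()/process_lanes() helpers from above; whether KNC can actually keep two gathers in flight like this is exactly what I'm unsure about):

    Code:
    /* Keep two partial gathers in flight and work on whichever lanes have
       arrived instead of spinning on a single mask. All helpers here are
       hypothetical, not real KNC instructions. */
    unsigned short pend_a = 0xFFFF, pend_b = 0xFFFF;
    while (pend_a | pend_b) {
        unsigned short before_a = pend_a, before_b = pend_b;
        if (pend_a) gather_step(&va, base, idx_a, &pend_a);
        if (pend_b) gather_step(&vb, base, idx_b, &pend_b);
        process_lanes(&va, before_a & ~pend_a);    /* lanes of A that just arrived */
        process_lanes(&vb, before_b & ~pend_b);    /* lanes of B that just arrived */
    }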

    I also wonder what causes a return with a mask that isn't fully cleared: a long-latency memory fetch, or even just addresses spread across cache lines (in that case I hope those are 128 or 256 bytes long).

    Yeah, it only works if you have some kind of dynamic data consumer; a regular matrix × matrix multiply would only see disadvantages that way.
     
  8. trinibwoy

    trinibwoy Meh
    Legend

    Joined:
    Mar 17, 2004
    Messages:
    10,414
    Likes Received:
    411
    Location:
    New York
    Do we have any independent performance data on either Knights Ferry or Knights Corner? It's a bit strange that Intel is so quiet on that front. I know there have been a few preliminary papers focused on KF scaling, but I haven't seen anything with absolute performance numbers.
     
  9. lanek

    Veteran

    Joined:
    Mar 7, 2012
    Messages:
    2,469
    Likes Received:
    315
    Location:
    Switzerland
    I'm not surprised to see Intel being really tight-lipped about it; I'm sure they don't want to reveal too much data before the official launch.
     
  10. A1xLLcqAgt0qc2RyMz0y

    Regular

    Joined:
    Feb 6, 2010
    Messages:
    980
    Likes Received:
    268
    Then that is a change for them.

    They have been pumping Xeon Phi aka Knights Corner aka Larrabee non-stop.

    For Intel to be this silent seems to mean either that something is wrong, or that maybe they are embarrassed by the K20 from Nvidia with its 1.3 TF DP.
     
  11. rpg.314

    Veteran

    Joined:
    Jul 21, 2008
    Messages:
    4,298
    Likes Received:
    0
    Location:
    /
    AFAIK, everything so far is under NDA.
     
  12. Cookie Monster

    Newcomer

    Joined:
    Sep 12, 2008
    Messages:
    167
    Likes Received:
    8
    Location:
    Down Under
    I'm guessing their power figures might not be so good?

    And does anyone know when exactly the official launch date is :?:
     
  13. nAo

    nAo Nutella Nutellae
    Veteran

    Joined:
    Feb 6, 2002
    Messages:
    4,325
    Likes Received:
    93
    Location:
    San Francisco
    Really? Does Intel regularly disclose performance numbers before launching a new processor?
     
  14. A1xLLcqAgt0qc2RyMz0y

    Regular

    Joined:
    Feb 6, 2010
    Messages:
    980
    Likes Received:
    268
    Yes, when it is Xeon Phi aka Knights Corner aka Larrabee.

    Do a Google Search on the above and see the hype from Intel including expected performance numbers.
     
  15. trinibwoy

    trinibwoy Meh
    Legend

    Joined:
    Mar 17, 2004
    Messages:
    10,414
    Likes Received:
    411
    Location:
    New York
    They've made some public claims about MIC performance based on runs behind closed doors. However, they're not allowing independent researchers who have hardware in hand to say anything beyond "it works".

    The best guess I've seen is this from the Register.

    http://www.theregister.co.uk/2012/06/18/intel_mic_xeon_phi_cray/

     
  16. willardjuice

    willardjuice super willyjuice
    Moderator Veteran Alpha Subscriber

    Joined:
    May 14, 2005
    Messages:
    1,372
    Likes Received:
    239
    Location:
    NY
    lol I don't think Marco has to google anything wrt larrabee. :razz:

    And fwiw he's correct; Intel has remained pretty consistent on how it releases *juicy* information regarding larrabee. I know they have been a bit more open with CPUs, but that really hasn't translated to larrabee. The secrecy certainly doesn't seem new to me and can hardly be called a "change" in strategy.
     
  17. Alexko

    Veteran Subscriber

    Joined:
    Aug 31, 2009
    Messages:
    4,489
    Likes Received:
    907
    There are often "leaks" or previews of upcoming processors weeks or months before launch, usually through big websites like Anandtech.

    Granted, that may not make as much sense for an HPC product.
     
  18. nAo

    nAo Nutella Nutellae
    Veteran

    Joined:
    Feb 6, 2002
    Messages:
    4,325
    Likes Received:
    93
    Location:
    San Francisco
    How is this any different from any other Intel processor launch? It's not like IVB machines were parachuted into selected ISV hands the day of the official launch.
     
  19. Billy Idol

    Legend Veteran

    Joined:
    Mar 17, 2009
    Messages:
    5,933
    Likes Received:
    768
    Location:
    Europe
    I am looking forward to using those little beasts for HPC... I already broke the O(10^5) processor mark with my simulations... now I want the million... bring it on and build a real HPC system!
     
  20. rpg.314

    Veteran

    Joined:
    Jul 21, 2008
    Messages:
    4,298
    Likes Received:
    0
    Location:
    /
    Everybody (minus Apple) discloses hype before the next shiny is about to launch. Real world numbers, tested by unbiased parties without the aid of cripple-everything-but-my-chips tools with real world code are another matter.
     