Larrabee at GDC 09

Discussion in 'Architecture and Products' started by bowman, Feb 16, 2009.

  1. ssp

    ssp
    Newcomer

    Joined:
    Dec 26, 2008
    Messages:
    2
    Likes Received:
    0
  2. tinokun

    Newcomer Subscriber

    Joined:
    Jul 23, 2004
    Messages:
    60
    Likes Received:
    67
    Location:
    Peru
Larrabee to support scatter/gather!
     
  3. MfA

    MfA
    Legend

    Joined:
    Feb 6, 2002
    Messages:
    7,155
    Likes Received:
    586
    Kinda disappointing that they seemingly aren't trying to do anything on the language front.
     
  4. Rufus

    Newcomer

    Joined:
    Oct 25, 2006
    Messages:
    246
    Likes Received:
    61
    Why is that listed as if it's a good thing? We all know how well it turned out the last time Intel released a new architecture where that was true.
     
  5. Arwin

    Arwin Now Officially a Top 10 Poster
    Moderator Legend

    Joined:
    May 17, 2006
    Messages:
    18,065
    Likes Received:
    1,660
    Location:
    Maastricht, The Netherlands
    Well I think that's why they apparently have 500 programmers working on this? ;)

    Seriously though, I think this is kind of what we need to evolve the graphics space in the transition phase from a pure rasterizer to new technologies and hybrid forms blending all sorts of physics and rendering techniques.
     
  6. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,365
    Likes Received:
    3,955
    Location:
    Well within 3d
    Every disclosure from Intel has referred to the vector unit in the singular sense, and all the diagrams only show one.
    It doesn't look likely at this point.
     
  7. CarstenS

    Veteran Subscriber

    Joined:
    May 31, 2002
    Messages:
    4,943
    Likes Received:
    2,286
    Location:
    Germany
Either I don't get your point or we're talking about different things here. I didn't mean that the entire core or the entire chip cannot do more than 1 FMAC/cycle, but rather that their FMACs would need at least 1 cycle to complete, as everyone else's do.

    And surely, if they can do any number of FMACs in parallel, that'd be factored into their peak TFLOPS rate, wouldn't it?
     
  8. rpg.314

    Veteran

    Joined:
    Jul 21, 2008
    Messages:
    4,298
    Likes Received:
    0
    Location:
    /
While they can do more than 1 FMAC/cycle, I don't think they will. I was merely demonstrating a hypothetical possibility, not something I expect to happen with LRB.
     
  9. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    10,876
    Likes Received:
    768
    Location:
    London
    EDIT:[strike]Aha, that paper's been updated from the version previously published.[/strike]

    Interestingly enough it seems to say that gather isn't really performing any kind of nice packing:

so it's quite possible for a gather to waterfall over 16 successive fetches from L1. But at least L1 fetch latency can be hidden by hardware-thread switching. So it's a half-way house solution. I suspect NVidia can also hide, or partially hide, this kind of fetch latency, but I'm not sure.

    Jawed
     
    #49 Jawed, Mar 30, 2009
    Last edited by a moderator: Mar 30, 2009
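The waterfall behaviour Jawed describes can be sketched as a toy model (my own illustration, not Intel's hardware: I assume one L1 fetch per distinct cache line touched by the 16 lanes):

```python
# Hypothetical sketch of a 16-wide gather that "waterfalls" by servicing one
# L1 cache line per pass. Lanes whose indices share a line complete together;
# in the worst case all 16 lanes hit distinct lines and the gather takes 16
# serialized fetches. CACHE_LINE_BYTES matches the 64 B line size discussed.
CACHE_LINE_BYTES = 64

def gather_passes(byte_addresses):
    """Number of serialized L1 fetches a 16-wide gather would need."""
    distinct_lines = {addr // CACHE_LINE_BYTES for addr in byte_addresses}
    return len(distinct_lines)  # one pass (one L1 fetch) per distinct line

# Best case: 16 consecutive dwords all sit in one line -> a single fetch.
best = gather_passes([4 * i for i in range(16)])
# Worst case: every lane in its own line -> the full 16-fetch waterfall.
worst = gather_passes([64 * i for i in range(16)])
```
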
  10. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,365
    Likes Received:
    3,955
    Location:
    Well within 3d
    The load hardware could determine relatively quickly how many cache lines will need to be accessed based on the calculated addresses.
    Determining this per gather/scatter might inject some latency in the process.
     
  11. TimothyFarrar

    Regular

    Joined:
    Nov 7, 2007
    Messages:
    427
    Likes Received:
    0
    Location:
    Santa Clara, CA
    Gather / Scatter performance between LRB and NV will be interesting.

If I remember correctly, NVidia GT2xx in the "waterfall" to 16 independent fetches gather case can reduce those independent fetches to 32 bytes (vs 64 bytes for LRB). Latency hiding works like texture fetch latency hiding, so as long as there's enough ALU work in there, it seems easy enough to keep the ALUs busy.

With LRB, if you do hit this worst-case waterfall, you are effectively killing 1/8 of this thread's L1 cache for just one worst-case gather (32 KB cache / 4 hyperthreads / 64 B lines = 128 lines per thread; 16 lines is 1/8 of that). I don't know the LRB set associativity, but if you have 4 hyperthreads doing this at the same time, things might get very bad for the L1 ... not sure how useful the gather prefetch will be in such a condition. It would seem to me that with 4 threads doing worst-case waterfall gathering, the ALUs would stall on L1, even if the data locality fit into L2?

    Jawed you have any thoughts here?
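The footprint figure in that post checks out as back-of-envelope arithmetic (all sizes are taken from the post itself, not from a spec sheet):

```python
# Worst-case L1 footprint of a single 16-line gather on the assumed
# configuration: 32 KB L1 shared by 4 hyperthreads, 64 B cache lines.
L1_BYTES = 32 * 1024
HYPERTHREADS = 4
LINE_BYTES = 64
WORST_CASE_LINES = 16  # one distinct cache line per gather lane

lines_per_thread = L1_BYTES // HYPERTHREADS // LINE_BYTES  # lines in a thread's share
fraction = WORST_CASE_LINES / lines_per_thread             # share consumed by one gather
```
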
     
  12. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,365
    Likes Received:
    3,955
    Location:
    Well within 3d
    If Larrabee decomposes gathers and scatters as serialized accesses, then perhaps a minimum of 4-way associativity should keep things relatively civil between threads.

    It might be more. The Pentium had a 2-way 8KiB cache, and Larrabee's is up by a factor of 4.
    Intel has shown a preference for increasing associativity with the increase of capacity in other designs.
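Why the associativity matters here can be illustrated with a small model (the set and way counts below are assumptions for illustration, not disclosed Larrabee figures):

```python
# Map the 16 lines of a worst-case gather onto cache sets. If more of them
# land in one set than there are ways, the gather starts evicting its own
# earlier fetches before the vector is assembled.
LINE_BYTES = 64
SETS = 128   # assumed: 32 KiB / 64 B lines / 4 ways = 128 sets
WAYS = 4     # assumed 4-way associativity, per the discussion above

def max_set_pressure(byte_addresses):
    """Largest number of distinct accesses falling into any one cache set."""
    counts = {}
    for addr in byte_addresses:
        s = (addr // LINE_BYTES) % SETS
        counts[s] = counts.get(s, 0) + 1
    return max(counts.values())

# A stride of SETS * LINE_BYTES aliases every lane to the same set:
pathological = [i * SETS * LINE_BYTES for i in range(16)]
pressure = max_set_pressure(pathological)  # 16 lines contending for 4 ways
```
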
     
  13. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    10,876
    Likes Received:
    768
    Location:
    London
    Nice analysis. I honestly don't know what to add.

    Someone needs to show a Larrabee raytracer with rays intersecting all over the place :razz:

I'm in over my head to be honest. In graphics, intense multi-level dependent texturing, e.g. as often seen in Perlin noise shaders, is a recipe for lag.

    Jawed
     
  14. MfA

    MfA
    Legend

    Joined:
    Feb 6, 2002
    Messages:
    7,155
    Likes Received:
    586
It would keep the misses down, but the pure cycles necessary even on hits are still a problem. For that the only solution is having multiple banks with smaller-width ports (or just plain adding more ports, but that's obviously a bit too expensive). A little like the texture cache, in fact. It would make sense to put a separate scatter/gather unit in there.
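The banked alternative MfA suggests can be modelled crudely (the bank count and word size are my assumptions): with independently ported banks, a 16-element access costs as many cycles as the most heavily loaded bank, rather than one cycle per element.

```python
# Toy cost model for a banked L1 servicing a 16-wide gather/scatter.
BANKS = 16       # assumed number of independently ported banks
WORD_BYTES = 4   # assumed port width per bank

def banked_cycles(byte_addresses):
    """Cycles = accesses hitting the busiest bank (bank conflicts serialize)."""
    loads = [0] * BANKS
    for addr in byte_addresses:
        loads[(addr // WORD_BYTES) % BANKS] += 1
    return max(loads)

# Conflict-free: 16 consecutive words hit 16 different banks -> 1 cycle.
fast = banked_cycles([4 * i for i in range(16)])
# Worst case: a stride of BANKS words aliases to one bank -> fully serialized.
slow = banked_cycles([4 * BANKS * i for i in range(16)])
```
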
     
  15. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,365
    Likes Received:
    3,955
    Location:
    Well within 3d
The tenor of Intel's statements is that the design assumes most gathers will have high locality, and that those that don't will incur the penalties of moving cache lines around.

    The memory architecture, such as the way the ring bus seems to be optimized for 64 byte transfers, might not be the best for such an arrangement.
Separate ports are separate memory coherence requests. Fragmenting traffic at the vector memory pipe would have knock-on effects all the way down the memory hierarchy.
     
  16. MfA

    MfA
    Legend

    Joined:
    Feb 6, 2002
    Messages:
    7,155
    Likes Received:
    586
Maybe next gen; snooping-based coherency will have a hard time scaling anyway.
     
  17. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    10,876
    Likes Received:
    768
    Location:
    London
Would it be practical to have one or more hardware threads dedicated to "grouping and assembling" L1 lines? Give this thread 16-wide sets of indices and let it fetch and build single cache lines in response, which then get copied to your local L1.

    This wouldn't solve latency, but it would at least keep all the L1-trashing to a corner of Larrabee, out of the way of real work. And the effective gain in L1 space for worker threads would allow them to hide the longer latency of this technique.

    Such dedicated threads don't sound particularly different from texturing hardware and its own private texture cache. So the next question is, can the texture units take on this kind of workload?

    Jawed
     
  18. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,365
    Likes Received:
    3,955
    Location:
    Well within 3d
    In other words, we have a thread doing memory copies from throughout memory into a single contiguous buffer space.

    Larrabee's flexible, so it could probably be done.
    The scheme is pretty complex. We're either adding a whole other layer of indirection per cache line so that shader code can properly determine what it has accessed, or we've regimented the software renderer to operate on the collected data format.

    The methodology described for texturing is more of a command/response relationship between a core and the texturing unit, and the texture unit sends back filtered data as opposed to heavily manipulating the cache lines.
    The exact amount of intelligence present in the texturing units is unclear.

    Since texture units don't handle faults that might pop up while striding through memory constructing the optimum L1 arrangement, the texture hardware might not be flexible enough.
     
  19. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    10,876
    Likes Received:
    768
    Location:
    London
    It doesn't seem particularly complex to me: the originating thread has to generate a vector of 16 indices for a gather, so all that's happening here is that the vector is being sent to another thread.

    LRBni seems to have all the bit-wise masking/shifting support necessary for the gather thread to extract and pack the requested data to assemble a vector of fetched data to send back to the worker thread.

    True, one thing that's not clear yet is whether there's any LOD, bias and addressing computation in the TUs. The justification for making them dedicated was based upon decompression and filtering.


    So, ahem, it sounds like the texture units would be no help. Though the "pipelined gather logic" for a quad implies some kind of walk over disparate addresses to gather the texels required.

    Jawed
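The mask-driven extract-and-pack loop Jawed alludes to can be sketched in scalar code (a hedged illustration of the decomposition idea, not the actual semantics of LRBni's masked gather):

```python
# Walk a 16-bit mask of outstanding lanes, fetch one element per step, and
# pack the results into a 16-wide vector - the shape of a gather that is
# decomposed into serialized accesses. 'memory' here is just a dict standing
# in for addressable storage; base and indices are per the discussion above.
def software_gather(memory, base, indices):
    result = [0] * 16
    todo = (1 << 16) - 1                       # mask of lanes still outstanding
    while todo:
        lane = (todo & -todo).bit_length() - 1 # lowest set lane
        result[lane] = memory[base + indices[lane]]
        todo &= todo - 1                       # clear the completed lane
    return result

mem = {i: i * 10 for i in range(256)}
out = software_gather(mem, 0, list(range(16)))  # -> [0, 10, 20, ..., 150]
```

A real implementation would retire all lanes that hit the same cache line in one step rather than one lane per iteration; the loop above shows only the masking/packing bookkeeping.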
     
  20. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,365
    Likes Received:
    3,955
    Location:
    Well within 3d
    The data has to come back.
    Sending off the vector doesn't tell the original thread what address the compiled cache line is in, nor does it tell the shader thread when the worker thread is done.
    The shader thread has to be told which address contains the desired results, and then the shader has to perform a read at this unknown address to make the data migrate back.

    edit: And the base address needs to be sent. There's not enough bit space per index to derive the address otherwise.
     