Larrabee at Siggraph

Discussion in 'Architecture and Products' started by nAo, Jun 2, 2008.

  1. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    10,873
    Likes Received:
    767
    Location:
    London
    This comes back to the programmer. e.g. they've designed a way to construct a D3D pipeline that sizes a tile to fit within cache alongside other stuff that's also going to use cache. So the "driver" must assess the pixel shader for its register payload versus the amount of texture latency it needs to hide, and trade those off against tile size. ATI and NVidia don't have a tile size to worry about, but they do have to worry about cache thrashing caused by the raggedness of the progress of the batches - i.e. what's the greatest difference in program counter amongst the extant batches and what effect that has on cache thrashing.
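
    Purely as a toy illustration of that trade-off (nothing from the paper; every constant below is an assumption), the driver's tile-sizing decision might look like this:

```c
/* Toy sketch, not Larrabee's actual driver logic: pick the largest
   power-of-two square tile whose colour + depth footprint fits in
   the L2 budget left over after reserving room for texture results
   and the shader's register spill area. All sizes are assumptions. */
static int pick_tile_side(int l2_bytes, int reserved_bytes,
                          int bytes_per_pixel)
{
    int budget = l2_bytes - reserved_bytes;
    int side = 128;                      /* largest tile considered */
    while (side > 0 && side * side * bytes_per_pixel > budget)
        side /= 2;                       /* halve until it fits     */
    return side;
}
```

    With the paper's 256 KiB L2 slice per core, a hypothetical 64 KiB reserved for texture results and spills, and 8 bytes/pixel (32-bit colour + 32-bit depth), this picks a 128x128 tile (128 KiB). A shader needing more reserved space forces a smaller tile.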

    So in Larrabee the programmer is supposed to configure L2 cache lines to suit the types of fibres running. Once a core gets under way with a phase of rendering I get the impression that the cache lines are pretty much static - e.g. in pixel shading a block of lines for the tile data, another set of lines for texture results (parameters too) and some lines for general scheduling.

    One thing that's occurred to me is that Larrabee's circular fibre scheduling could lead to under-utilisation of the texture units - this is the average versus worst-case latency hiding that Mintmaster was alluding to earlier, I think. Not sure, need to think about it more.

    Depends on the interval between starting the move and the other unit consuming the data - i.e. whether this mostly stays within L1 or often ends up going to L2.

    ---

    So, what happens on interrupts? I've got no idea what happens to x86 SSE registers in this situation, so not sure what to expect in Larrabee and the effect on VPU. Is it likely that Larrabee will turn off interrupts on most cores, e.g. leaving one core as able to accept them?

    Jawed
     
  2. randomhack

    Newcomer

    Joined:
    Apr 4, 2008
    Messages:
    41
    Likes Received:
    0
    How's CUDA simpler than OpenMP?

     
  3. randomhack

    Newcomer

    Joined:
    Apr 4, 2008
    Messages:
    41
    Likes Received:
    0
    edit : removed post
     
    #463 randomhack, Aug 16, 2008
    Last edited by a moderator: Aug 16, 2008
  4. armchair_architect

    Newcomer

    Joined:
    Nov 28, 2006
    Messages:
    128
    Likes Received:
    8
    I think you're confusing HW threads and SW threads.

    EDIT: having finished reading the thread, I guess you're not. Not sure what you meant in this post though.

    They're reserving the term "thread" for HW threads, which do indeed switch every cycle just like on a GPU. Each HW thread has real registers assigned to it (statically). The round-robin HW threads will hide instruction and L1 latency; I imagine they'll also need some non-dependent instructions (like fibers with >16 strands, see below) to fully hide L2 latency, unless their L2 latency is amazingly low.

    Fibers are "SW threads", and switching between them is like a thread switch on a CPU: using normal instructions you write out any live registers to memory (cache, really) including any special registers like condition codes and the vector predicate register, then write out the address this fiber will resume execution at, then read in and jump to the incoming fiber's resume address, which points to code that will read in that fiber's live registers and keep going. It's going to be 10+ instructions for a save+restore cycle assuming only a handful of live registers each in the outgoing/incoming fibers.
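
    The save/restore dance described above can be sketched in plain C. This is a toy model with invented names, plain ints standing in for vector registers, and a struct standing in for the cache-resident spill area:

```c
/* Minimal sketch of a cooperative fibre switch: spill the outgoing
   fibre's live state with ordinary stores, record where it resumes,
   then reload the incoming fibre's state. On real hardware this is
   10+ ordinary load/store instructions. */
struct fibre {
    int regs[4];     /* spill area for live "registers"          */
    int predicate;   /* stand-in for the vector predicate mask   */
    int resume_pc;   /* where this fibre resumes execution       */
};

static void fibre_switch(struct fibre *out, struct fibre *in,
                         int live[4], int *pred, int next_pc)
{
    for (int i = 0; i < 4; i++) out->regs[i] = live[i];  /* spill  */
    out->predicate = *pred;
    out->resume_pc = next_pc;
    for (int i = 0; i < 4; i++) live[i] = in->regs[i];   /* reload */
    *pred = in->predicate;
    /* the caller then jumps to in->resume_pc */
}
```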

    It sounds like they'll also be doing a sort of hybrid, where if the fibers don't need all of the vector registers they'll have multiple sets of strands active and round-robin between them within a fiber. So a fiber can be more than just 16 strands. Very similar to how NVIDIA and AMD both run each instruction through the ALUs for 4 ALU-clocks.
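
    A toy cycle-count model makes the benefit of round-robining strand groups concrete. The latency and instruction counts below are made-up numbers, and real issue rules are certainly more subtle:

```c
/* Toy model: each instruction's result is ready `latency` cycles
   after issue, and the core round-robins strictly between strand
   groups, one per cycle. A group's next dependent instruction can
   only issue every `groups` cycles, so stalls vanish once the
   group count covers the latency. Assumes groups <= 8. */
static int cycles_for(int groups, int instrs_per_group, int latency)
{
    int ready[8] = {0};   /* earliest cycle each group may issue  */
    int left[8];
    for (int g = 0; g < groups; g++) left[g] = instrs_per_group;
    int remaining = groups * instrs_per_group;
    int cycle = 0, g = 0;
    while (remaining > 0) {
        if (left[g] > 0 && ready[g] <= cycle) {
            left[g]--;                     /* issue one instruction */
            remaining--;
            ready[g] = cycle + latency;    /* next dependent waits  */
        }
        cycle++;
        g = (g + 1) % groups;              /* strict round-robin    */
    }
    return cycle;
}
```

    With a 4-cycle latency, one group of 8 dependent instructions takes 29 cycles, while four groups of 2 take 8: the extra strand sets soak up the dependency stalls.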

    Check out Tom Forsyth's course presentation for most of this.
     
    #464 armchair_architect, Aug 16, 2008
    Last edited by a moderator: Aug 16, 2008
  5. armchair_architect

    Newcomer

    Joined:
    Nov 28, 2006
    Messages:
    128
    Likes Received:
    8
    They're more about the cache line size, which happens to match the SIMD width, for reasons that should be obvious (which is why SSE/AVX/Larrabee will all have the same issue). Yes, it does break the abstraction a bit. But like I said, in my experience the 80/20 rule applies: getting the last 20% of performance by optimizing this takes 80% of the time. Given the vast speedup vs. a CPU available even without that 20%, you can certainly ignore it for starters. That's of course not true for all problems, YMMV, etc.
     
  6. armchair_architect

    Newcomer

    Joined:
    Nov 28, 2006
    Messages:
    128
    Likes Received:
    8
    I've looked at it. Ct to me looks like a library version of old vector architectures married to the fancy template meta-programming linear algebra libraries that started popping up a few years ago. Works great for some problems, but doesn't seem as broadly applicable as CUDA. Maybe I just lack imagination though.

    DX11 compute shaders and OpenCL both appear (from what we know) to essentially be the CUDA programming model. Not CUDA-the-language exactly, but very similar decomposition of parallelism and combination thread/memory hierarchy. To be more clear, what I like is the CUDA programming model (which includes DX11 compute and OpenCL), not just NVIDIA's current implementation of that programming model.
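
    For anyone who hasn't used it, the decomposition of parallelism being described can be sketched as plain C loops (a vector add split CUDA-style into blocks of threads; on a GPU both loops would run in parallel rather than serially):

```c
/* CUDA-style two-level decomposition sketched serially: a grid of
   independent blocks, each a group of threads, with the familiar
   block*dim+thread global indexing. */
static void vec_add_blocked(const float *a, const float *b, float *c,
                            int n, int block_dim)
{
    int nblocks = (n + block_dim - 1) / block_dim;  /* round up */
    for (int blk = 0; blk < nblocks; blk++) {       /* the "grid"  */
        for (int thr = 0; thr < block_dim; thr++) { /* one "block" */
            int i = blk * block_dim + thr;          /* global index */
            if (i < n)                              /* guard tail   */
                c[i] = a[i] + b[i];
        }
    }
}
```

    The point of the model is that blocks share nothing with each other, so they can be scheduled across however many cores the hardware has.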
     
  7. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    10,873
    Likes Received:
    767
    Location:
    London
    Sod it, I deferred reading most of those decks until this weekend. ARGH, that's what I should have read first.

    Agreed with all that. The paper refers to thread switching as a way to hide L2->L1 latency and to obviate stalls caused by serially dependent instructions.

    Interestingly, it's quite possible that only 1 thread is running, so the core is no longer switching threads each clock - it is merely evaluating which threads it can issue from each clock. So I don't think threads are scheduled in any kind of strict round-robin fashion. That's merely a possibility allowed under SMT.

    My mind boggles at the sheer expense of wasted cycles switching contexts due to texturing :cry:

    Yeah, so they're trading off the granularity of the register file against the total number of strands in flight: a few fibres with lots of strands versus lots of fibres with few strands. The former case wastes fewer cycles on context switching, so it will hurt less when there's a lot of texturing. If the shader's free of dynamic branching then lots of strands obviously won't impact performance.

    OK, well that undoes all my thinking about fibre scheduling. This also means that the register file is prolly gonna be tiny as 3dilletante was originally asserting - much more like souped-up SSE than cut-down GPU SIMD. Oh well.

    Jawed
     
  8. nAo

    nAo Nutella Nutellae
    Veteran

    Joined:
    Feb 6, 2002
    Messages:
    4,325
    Likes Received:
    93
    Location:
    San Francisco
    Weird, to me it looks like some sort of functional programming language....
     
  9. Barbarian

    Regular

    Joined:
    Jun 27, 2005
    Messages:
    289
    Likes Received:
    15
    Location:
    California, USA
    The size of the vector register file is rumored to be 32 entries. Not huge, but not small either.
    In the end it might not actually matter: a low-latency L1 cache plus the mem-op capability of the instruction set should allow an easy L1-as-huge-register-file model.
     
  10. MfA

    MfA
    Legend

    Joined:
    Feb 6, 2002
    Messages:
    6,806
    Likes Received:
    473
    I wonder why only the primary pipeline can do vector loads... I'd expect the secondary pipeline to at least be able to do unformatted loads for reading a fibre's flushed registers; it seems silly to waste the primary pipeline on that.
     
  11. nAo

    nAo Nutella Nutellae
    Veteran

    Joined:
    Feb 6, 2002
    Messages:
    4,325
    Likes Received:
    93
    Location:
    San Francisco
    How do you know that?
     
  12. MfA

    MfA
    Legend

    Joined:
    Feb 6, 2002
    Messages:
    6,806
    Likes Received:
    473
    Well, they said so :)

     
  13. nAo

    nAo Nutella Nutellae
    Veteran

    Joined:
    Feb 6, 2002
    Messages:
    4,325
    Likes Received:
    93
    Location:
    San Francisco
    Ok, fair enough, I didn't remember that part of the paper.
     
  14. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    10,873
    Likes Received:
    767
    Location:
    London
    Anything to do with the ability of the VPU to read one operand directly from L1?

    Jawed
     
  15. PeterT

    Regular

    Joined:
    May 14, 2002
    Messages:
    702
    Likes Received:
    14
    Location:
    Austria
    In the terminology you use here, are the manual shared-memory management and coalescing requirements of CUDA part of the programming model, or just of NV's current implementation? As someone who's used both traditional GPGPU and CUDA on HPC problems, that's the part of CUDA that I can't see as part of any future more-or-less "mainstream" parallel programming language/model. I also think those should be mentioned before talking about how well CUDA scales, because a CUDA program that's actually optimized won't come close to porting to another architecture optimally (and when it comes to coalescing and shared memory we're not talking about single-percent optimizations, but potentially orders-of-magnitude differences).
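
    A toy counter shows where those large factors come from. Assuming 64-byte memory segments (an invented round number, loosely in the spirit of current NV hardware), 32 threads reading consecutive 4-byte words touch 2 segments, while the same reads at a stride of 16 words touch 32:

```c
/* Count distinct 64-byte segments touched when `nthreads` threads
   each read one 4-byte word at the given word stride. Each distinct
   segment is one memory transaction in this toy model. */
static int segments_touched(int nthreads, int stride_words)
{
    int seen[64];                                /* crude set, <= 64 ids */
    int count = 0;
    for (int t = 0; t < nthreads; t++) {
        int seg = (t * stride_words * 4) / 64;   /* 64-byte segment id  */
        int found = 0;
        for (int i = 0; i < count; i++)
            if (seen[i] == seg) { found = 1; break; }
        if (!found && count < 64)
            seen[count++] = seg;                 /* new segment fetched */
    }
    return count;
}
```

    A 16x difference in transactions for the same amount of useful data, before any latency effects, is how naive layouts lose an order of magnitude.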

    It's also interesting (and I haven't seen anyone explicitly mention it in this thread) to see the different programming trade-offs between Larrabee and NV GPUs. As I currently understand it, on the former you get automatic/hardware cache management but no hardware thread scheduling; on current NV hardware it's just the other way round. I'm not yet sure which is preferable; ideally you'd have (the option of using) both.
     
  16. Kaotik

    Kaotik Drunk Member
    Legend

    Joined:
    Apr 16, 2003
    Messages:
    8,183
    Likes Received:
    1,840
    Location:
    Finland
    Slightly related: the Larrabee (A Many-Core Intel Architecture for Visual Computing) presentation was apparently too popular at IDF Fall 2008, and a big portion of the press was left outside due to lack of space. Because of this, the presentation will get another run on Thursday.
     
  17. nAo

    nAo Nutella Nutellae
    Veteran

    Joined:
    Feb 6, 2002
    Messages:
    4,325
    Likes Received:
    93
    Location:
    San Francisco
    Hopefully they'll release some more details...
     
  18. kyetech

    Regular

    Joined:
    Sep 10, 2004
    Messages:
    532
    Likes Received:
    0
    Tech demo vid for us laymen to enjoy would be good!
     
  19. nAo

    nAo Nutella Nutellae
    Veteran

    Joined:
    Feb 6, 2002
    Messages:
    4,325
    Likes Received:
    93
    Location:
    San Francisco
    A tech video of emulated hardware? Not exactly exciting :)
     
  20. kyetech

    Regular

    Joined:
    Sep 10, 2004
    Messages:
    532
    Likes Received:
    0
    Just because samples aren't out till November doesn't mean they haven't got anything now?

    Besides, I like seeing pretty graphics, real-time or otherwise.
     


  • About Us

    Beyond3D has been around for over a decade and prides itself on being the best place on the web for in-depth, technically-driven discussion and analysis of 3D graphics hardware. If you love pixels and transistors, you've come to the right place!

    Beyond3D is proudly published by GPU Tools Ltd.