NVIDIA Maxwell Speculation Thread

Discussion in 'Architecture and Products' started by Arun, Feb 9, 2011.

  1. silent_guy

    Veteran Subscriber

    Joined:
    Mar 7, 2006
    Messages:
    3,754
    Likes Received:
    1,379
    Almost like all memory interfaces with multiple outstanding requests and in-order completion. :wink:
     
  2. rpg.314

    Veteran

    Joined:
    Jul 21, 2008
    Messages:
    4,298
    Likes Received:
    0
    Location:
    /
  3. Gipsel

    Veteran

    Joined:
    Jan 4, 2010
    Messages:
    1,620
    Likes Received:
    264
    Location:
    Hamburg, Germany
    Variable vector length (minimum is 1), so it can indeed be configured as MIMD (and the moniker SIMT may start to make sense after all ;)).

    And they do static scheduling by the compiler into "LIW" instruction groups (2 arithmetic instructions + 1 L/S). SP is twice as fast as DP (each arithmetic instruction slot can operate on two SP values; it's basically a mini 64-bit SIMD).
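A toy sketch (my own illustration, not anything from the paper) of how one LIW arithmetic slot might treat its 64-bit datapath as two packed SP lanes, 3DNow!-style:

```python
import struct

def pack2(a, b):
    """Pack two FP32 values into one 64-bit register slot."""
    return struct.unpack("<Q", struct.pack("<2f", a, b))[0]

def unpack2(slot):
    """Split a 64-bit slot back into its two FP32 halves."""
    return struct.unpack("<2f", struct.pack("<Q", slot))

def vec2_add(x, y):
    """One arithmetic slot issuing a single op over both SP halves."""
    a0, a1 = unpack2(x)
    b0, b1 = unpack2(y)
    return pack2(a0 + b0, a1 + b1)
```

Running the same op at the full 64-bit width instead gives the DP rate, which is why SP comes out twice as fast in this scheme.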
     
  4. MfA

    MfA
    Legend

    Joined:
    Feb 6, 2002
    Messages:
    6,768
    Likes Received:
    470
    As AMD abandons (V)LIW, NVIDIA embraces it. I always said it was the way to go for MIMD.
     
    #64 MfA, Jan 22, 2012
    Last edited by a moderator: Jan 23, 2012
  5. Arun

    Arun Unknown.
    Moderator Legend Veteran

    Joined:
    Aug 28, 2002
    Messages:
    5,023
    Likes Received:
    299
    Location:
    UK
    It seems noteworthy how many of these ideas bear some resemblance to the PowerVR SGX/Rogue shader core. While there are clear differences, and some of the proposed ideas, like the configurable cache hierarchy, are (once again) truly novel and exciting, I don't think IMG could ask for a better validation of their design choices than this. Anyway, back on subject...

    Obviously I've been hoping for MIMD shader cores to go mainstream (partly for selfish reasons) for a very long time so there's not much I can say here but a simple "yay" :)

    Their basic argument for MIMD seems to be that while it may increase area, it won't hurt power efficiency by anywhere near as much if you're clever about it, so if you're power-limited anyway the greater programming flexibility and performance is very much worth the cost.

    That certainly makes a lot of sense although it's interesting that they hardly mention leakage at all in their entire paper (only as a limiter to Vdd scaling). Yes, you can save wire energy and improve locality by coalescing on a MIMD architecture, but you've still got some extra leakage to contend with. Hmm... I know some companies are (over?)optimistic over extremely fine-grained power gating, but I've never seen any indication that NVIDIA is one of them.

    Do you have any idea how the 'mini 64bit SIMD' would work? Surely it's not effectively true Vec2 for SP?!
     
  6. Gipsel

    Veteran

    Joined:
    Jan 4, 2010
    Messages:
    1,620
    Likes Received:
    264
    Location:
    Hamburg, Germany
    I don't get the question. Each of the two arithmetic LIW slots is Vec2 for SP. It should work like the obsolete 3DNow! in AMD CPUs, and it will probably have the same problem CPUs have with their vector extensions: getting the 32-bit components into one 64-bit location. So you need to vectorize a bit (or you need a clever compiler for more than some simple cases); the instruction packing into the LIWs for SP is less flexible than in AMD's VLIW architectures, for instance. I guess it is done to reduce the (power) overhead that would be necessary to individually address 16 operands of 32 bits each in the register files (AMD's VLIW architectures do exactly that). In Einstein/Echelon this is reduced to 8 operands of 64 bits per clock cycle.
     
  7. Arun

    Arun Unknown.
    Moderator Legend Veteran

    Joined:
    Aug 28, 2002
    Messages:
    5,023
    Likes Received:
    299
    Location:
    UK
    That was also my initial assumption but I can't seriously believe that's what NVIDIA is going to do here given how important FP32 for graphics will remain going forward. This looks like the key sentence to me: "Second, within a lane, threads can execute independently, or coalesced threads can execute in a SIMT fashion with instructions from different threads sharing the issue slots".

    So my guess is they must have found a way to do 4 scalar FP32 instructions per clock per shader when using a warp size of 2 or more (at least in certain cases). This is not a bad solution at all, especially if they continue focusing on quads for pixel shaders. I'm really not sure how the whole "instructions from different threads sharing the issue slots" thing would work in practice though...
     
  8. Gipsel

    Veteran

    Joined:
    Jan 4, 2010
    Messages:
    1,620
    Likes Received:
    264
    Location:
    Hamburg, Germany
    Could be as simple as the compiler packing 2 or 4 work items ("threads") into one lane (one needs some kind of "slot masking" in that case). If it were 4, this could be the traditional quad; the effective warp size would be 4 for SP in this case. Doing it dynamically at runtime would defeat the purpose of the whole LIW decision.
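A minimal sketch of what such compiler-side packing with slot masking could look like (purely illustrative; the op names are made up):

```python
def run_packed(quad, trace):
    """Run one shared instruction trace over 4 packed work items,
    with a per-item mask standing in for 'slot masking' on divergence."""
    regs = list(quad)
    mask = [True] * 4
    for op, arg in trace:
        if op == "add":        # arithmetic op, applied to active items only
            regs = [r + arg if m else r for r, m in zip(regs, mask)]
        elif op == "mask_lt":  # deactivate items whose value is >= arg
            mask = [m and (r < arg) for r, m in zip(regs, mask)]
        elif op == "unmask":   # reconverge: all items active again
            mask = [True] * 4
    return regs
```

All four packed items step through the same trace, so the single LIW issue slot is shared, while the mask keeps diverged items from being clobbered.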
     
  9. psurge

    Regular

    Joined:
    Feb 6, 2002
    Messages:
    939
    Likes Received:
    35
    Location:
    LA, California
    Gipsel - given that you can run 4 SP FMAs per clock, running a quad per thread makes perfect sense to me. But it's also not clear to me that the exact same lane configuration would be used for graphics SKUs. Maybe for HPC, SP is much less important than DP.
     
  10. trinibwoy

    trinibwoy Meh
    Legend

    Joined:
    Mar 17, 2004
    Messages:
    10,428
    Likes Received:
    426
    Location:
    New York
    Where does this lead us in terms of software and the 3D pipeline? Lots of parallelism + short vectors + MIMD seems like a good time to seriously rethink the rasterization process. Somehow I doubt graphics APIs will change dramatically in the next five years though.
     
  11. Dade

    Newcomer

    Joined:
    Dec 20, 2009
    Messages:
    206
    Likes Received:
    20
    Lots of parallelism + short vectors + MIMD (+ cache) = ray tracing :wink:
     
  12. rpg.314

    Veteran

    Joined:
    Jul 21, 2008
    Messages:
    4,298
    Likes Received:
    0
    Location:
    /
    What about the mess of incoherent memory access? Ray tracing never really had many problems with wide vectors or SIMD compared to its insatiable need for cache size.
     
    #72 rpg.314, Jan 26, 2012
    Last edited by a moderator: Jan 26, 2012
  13. rpg.314

    Veteran

    Joined:
    Jul 21, 2008
    Messages:
    4,298
    Likes Received:
    0
    Location:
    /
    Speaking of validating a competitor's design choices, I think the most important change is the possibility to configure an SRAM pool as registers, cache, or scratchpad. For all of Larrabee's flaws, it got this bit exactly right, IMO.
    The real big deal is simply letting tools allocate storage for arbitrary data structures.

    My guess is that it will be able to do 2 SP ops, but it will not require them to use contiguous registers. IOW, conventional static dual issue of SP ops that just share the ALU datapath.

    With the focus on handling control incoherence, I don't think they are about to mandate manual or compiler-driven work-item packing for performance.
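    The configurable SRAM pool mentioned above could be modeled roughly like this (a toy sketch; the class name and KB granularity are my own assumptions):

```python
class SRAMPool:
    """Toy model of one on-chip SRAM pool that software carves into
    a register file and a scratchpad; whatever is left backs the cache."""
    def __init__(self, total_kb):
        self.total_kb = total_kb
        self.claimed = {}

    def configure(self, registers_kb=0, scratchpad_kb=0):
        """Claim part of the pool for registers and scratchpad."""
        if registers_kb + scratchpad_kb > self.total_kb:
            raise ValueError("partition exceeds pool size")
        self.claimed = {"registers": registers_kb, "scratchpad": scratchpad_kb}

    def cache_kb(self):
        """Unclaimed capacity is available for ordinary caching."""
        return self.total_kb - sum(self.claimed.values())
```

    A kernel that needs little register or scratchpad space simply leaves more of the pool configured as cache, which is exactly the flexibility being praised here.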
     
  14. rpg.314

    Veteran

    Joined:
    Jul 21, 2008
    Messages:
    4,298
    Likes Received:
    0
    Location:
    /
    I hope we get efficient micropolygons though. That is something that doesn't need radical change and looks great on screen. Though chances are, at this point, MS doesn't care where DX goes as long as xbox3 is nice at compute.
     
  15. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,122
    Likes Received:
    2,873
    Location:
    Well within 3d
    How did Larrabee do that?
     
  16. hoho

    Veteran

    Joined:
    Aug 21, 2007
    Messages:
    1,218
    Likes Received:
    0
    Location:
    Estonia
    With proper acceleration structures that's not really a problem on a GPU, where you can easily hide latency by throwing a ton of threads at it. A much bigger problem is still algorithmic, namely parallelizing the construction of that acceleration structure so that you can have dynamic worlds.
     
  17. fellix

    fellix Hey, You!
    Veteran

    Joined:
    Dec 4, 2004
    Messages:
    3,486
    Likes Received:
    397
    Location:
    Varna, Bulgaria
    Hm, is that why Cayman performs so well in LuxMark despite the lack of coherent caching? A large register file and properly accelerated data structures.
     
  18. CarstenS

    Veteran Subscriber

    Joined:
    May 31, 2002
    Messages:
    4,797
    Likes Received:
    2,056
    Location:
    Germany
    In LuxBall HDR, even the 5870 is almost as fast. Other tests within LuxMark seem to favor the Radeons a little less, though.
     
  19. rpg.314

    Veteran

    Joined:
    Jul 21, 2008
    Messages:
    4,298
    Likes Received:
    0
    Location:
    /
    The primary context storage today in a GPU is in registers. And you can't really cache acceleration structures in registers, no matter how well it is built.

    Since a GPU needs a ton of threads and has tiny caches, effective cache per thread is really small. Larrabee would have been better balanced in this regard.
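    To put rough numbers on that (illustrative figures of my own, not vendor specs): a Fermi-class GPU shares on the order of 768 KB of L2 among tens of thousands of resident threads, while a Larrabee core had 256 KB of L2 for just 4 hardware threads:

```python
def cache_per_thread(cache_bytes, threads):
    """Effective cache capacity per resident thread, in bytes."""
    return cache_bytes // threads

# Illustrative figures (assumptions, not vendor specs):
gpu_l2_per_thread = cache_per_thread(768 * 1024, 24576)  # whole-chip L2 / resident threads
lrb_l2_per_thread = cache_per_thread(256 * 1024, 4)      # per-core L2 / hardware threads
```

    Roughly three orders of magnitude apart, which is the balance argument being made.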
     
  20. rpg.314

    Veteran

    Joined:
    Jul 21, 2008
    Messages:
    4,298
    Likes Received:
    0
    Location:
    /
    As I understood Larrabee's implementation, they were allocating all the registers and shared memory in cache lines, and then marking those lines as "don't replace" using special cache-management instructions in the driver. Whatever was left of the L2, after subtracting the code and the per-core runtime, was available for caching.
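    That scheme could be sketched like this (a toy model of line pinning, not Intel's actual mechanism):

```python
class PinnedCache:
    """Toy L2: lines pinned to back registers/shared memory are marked
    'don't replace'; only the remaining lines take part in replacement."""
    def __init__(self, num_lines):
        self.num_lines = num_lines
        self.pinned = set()

    def pin(self, line):
        """Reserve one line for register/scratch storage."""
        self.pinned.add(line)

    def evictable_lines(self):
        """Lines still available for ordinary caching."""
        return [l for l in range(self.num_lines) if l not in self.pinned]
```

    The cache then behaves normally over the evictable subset, so the register/scratch carve-out comes for free once the lines are pinned.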
     