NVIDIA Maxwell Speculation Thread

Discussion in 'Architecture and Products' started by Arun, Feb 9, 2011.

  1. LiXiangyang

    Newcomer

    Joined:
    Mar 4, 2013
    Messages:
    81
    Likes Received:
    47
    It seems that NVCC uses significantly more registers by default when building for Maxwell GPUs, yet the register file size per SM is the same as Kepler's.

    I don't know whether that means Maxwell simply doesn't care about occupancy as much, thanks to the L2 cache and possibly reduced instruction latencies, or whether they actually have no choice but to use that many registers because they cut corners here and there too much (L1?).

    https://devtalk.nvidia.com/default/...hats-new-about-maxwell-/post/4127010/#4127010

    Also notice that the deviceQuery tool that comes with CUDA 6.0 still reports a maximum of 48 KB of shared memory per block for Maxwell cards, despite Maxwell having 64 KB of shared memory per SM. Yet according to the figure above, NVCC can build SM 5.0 objects even if a block asks for 64 KB of shared memory, so I don't know whether this is an issue with the driver, with NVCC, or with both.
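    For reference, a minimal sketch of the kind of runtime query the deviceQuery sample is built on (the device index and the exact fields printed are just my choices here, not from the linked post); on a first-generation Maxwell card it reports 48 KB for sharedMemPerBlock even though the SMM has 64 KB in total:

    #include <cstdio>
    #include <cuda_runtime.h>

    // Print the per-block limits the CUDA runtime reports for device 0.
    int main() {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);   // assuming the Maxwell card is device 0
        printf("Compute capability : %d.%d\n", prop.major, prop.minor);
        printf("Shared mem / block : %zu bytes\n", prop.sharedMemPerBlock);
        printf("Registers / block  : %d\n", prop.regsPerBlock);
        return 0;
    }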
     
  2. CarstenS

    Veteran Subscriber

    Joined:
    May 31, 2002
    Messages:
    4,875
    Likes Received:
    2,183
    Location:
    Germany
    ...
     
    #1162 CarstenS, Feb 19, 2014
    Last edited by a moderator: Feb 19, 2014
  3. pixelio

    Newcomer

    Joined:
    Feb 17, 2014
    Messages:
    47
    Likes Received:
    75
    Location:
    Seattle, WA
    Nope, sorry to mislead w/the register count. It's still 64K. The regs were changed because I was trying out a different block size. I'll fix the post so that the regs appear the same.
     
  4. Osamar

    Newcomer

    Joined:
    Sep 19, 2006
    Messages:
    200
    Likes Received:
    14
    Location:
    40,00ºN - 00,00ºE
    What is the possibility of, or NVIDIA's desire for, a dual-GM107-style chip with a 256-bit memory bus (two GM107s in one chip) if there are delays or problems with TSMC 20nm?
     
  5. pixelio

    Newcomer

    Joined:
    Feb 17, 2014
    Messages:
    47
    Likes Received:
    75
    Location:
    Seattle, WA
    @LiXiangyang, actually you are right: when NVCC is targeting sm_35 it's very frugal with registers, and the Maxwell sm_50 allocation seems less optimized. Perhaps because sm_35 code generation has been polished for the past year? :smile:

    Anyway, you made a good observation, and it actually took some effort to push the register allocation upward in my dummy test program. The test program just emitted a slew of outstanding loads followed by an equal number of stores, with some shared memory accesses mixed in; a rough sketch is included below. The sm_35 target was intelligently optimizing the register footprint until I stopped it (__syncthreads() is the solution to everything).

    The screen captures are now updated.
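    Roughly, the kernel looked something like this (a reconstruction, not the actual test program; the batch size is arbitrary): a batch of independent loads kept live across a __syncthreads(), then stored back, so the compiler cannot collapse the register footprint.

    #define BATCH 32

    // Hypothetical register-pressure kernel: BATCH outstanding loads, a barrier,
    // then BATCH stores. Raising BATCH pushes the per-thread register count up.
    __global__ void reg_pressure(const float* __restrict__ in, float* __restrict__ out) {
        float v[BATCH];
        int base = (blockIdx.x * blockDim.x + threadIdx.x) * BATCH;
    #pragma unroll
        for (int i = 0; i < BATCH; ++i)
            v[i] = in[base + i];          // all BATCH values stay live here
        __syncthreads();                  // keeps the compiler from folding each load straight into its store
    #pragma unroll
        for (int i = 0; i < BATCH; ++i)
            out[base + i] = v[i];
    }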
     
    #1165 pixelio, Feb 19, 2014
    Last edited by a moderator: Nov 10, 2014
  6. Gipsel

    Veteran

    Joined:
    Jan 4, 2010
    Messages:
    1,620
    Likes Received:
    264
    Location:
    Hamburg, Germany
    Do you have a number for this? Is the pipeline really shorter or did they implement result forwarding to cut the register file access from the instruction latency?
     
  7. Gipsel

    Veteran

    Joined:
    Jan 4, 2010
    Messages:
    1,620
    Likes Received:
    264
    Location:
    Hamburg, Germany
    I thought about one of the specific tests some time ago (pre GCN) to find an explanation for the weird (and clearly wrong) numbers. My conclusion was that at least in one of the tests (I think it was local/shared memory latency) Sandra was basically spitting out a convoluted measure for the clause scheduling latency. So yes, sometimes Sandra appears to be a test of the scheduling behaviour (and the architectural knowledge of the Sandra guys ;)), not memory latency.
     
  8. rpg.314

    Veteran

    Joined:
    Jul 21, 2008
    Messages:
    4,298
    Likes Received:
    0
    Location:
    /
    Wow, no wonder nv managed to double perf/W on the same node.
     
  9. rpg.314

    Veteran

    Joined:
    Jul 21, 2008
    Messages:
    4,298
    Likes Received:
    0
    Location:
    /
    On a GPU, the stack variables are in registers.

    L1 is for spilled registers and other data.
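    A quick way to see this (my own sketch, assuming a Kepler/Maxwell-class part): cap the register budget with __launch_bounds__ and compile with nvcc -Xptxas -v, and ptxas reports the spill stores/loads and local memory that back the "stack" variables which no longer fit in registers.

    // 8 resident blocks of 256 threads caps the allocation at ~32 registers per thread
    // (64K registers per SM / 2048 threads), which forces the array below to spill.
    __global__ void __launch_bounds__(256, 8)
    spill_demo(const float* in, float* out) {
        float tmp[64];                           // 64 live values can't fit in ~32 registers
        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        int stride = gridDim.x * blockDim.x;
        for (int i = 0; i < 64; ++i)
            tmp[i] = in[idx + i * stride];
        float acc = 0.f;
        for (int i = 0; i < 64; ++i)
            acc += tmp[i] * tmp[63 - i];         // use every element so the array can't be dropped
        out[idx] = acc;
    }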
     
  10. Bob

    Bob
    Regular

    Joined:
    Apr 22, 2004
    Messages:
    424
    Likes Received:
    47
    There is a bug in the RC version of the ptx assembler. The limit is indeed 48 KB of shared memory per block. This will be corrected for the official release.
     
  11. psurge

    Regular

    Joined:
    Feb 6, 2002
    Messages:
    946
    Likes Received:
    46
    Location:
    LA, California
    Sorry - my ignorance of GPU programming is showing. I can see how you'd want to keep as much stuff as possible in regs. But I don't really understand where you would spill registers to, if not to something that is essentially stack like. Explanations appreciated :)
     
  12. CarstenS

    Veteran Subscriber

    Joined:
    May 31, 2002
    Messages:
    4,875
    Likes Received:
    2,183
    Location:
    Germany
    Which? The limit or the documentation?
     
  13. Ailuros

    Ailuros Epsilon plus three
    Legend Subscriber

    Joined:
    Feb 7, 2002
    Messages:
    9,429
    Likes Received:
    181
    Location:
    Chania
    Then you obviously didn't understand what you just quoted. If GM204 is also going to have slightly more FP64 SPs compared to GK104 (just as GM107 does vs. GK107), GM204 might end up with MORE SMMs than I would have expected.
     
  14. RecessionCone

    Regular Subscriber

    Joined:
    Feb 27, 2010
    Messages:
    499
    Likes Received:
    177

    GK110 fits 30720 threads on chip at once. The runtime allocates stack space for each of these threads to hold spilling, but only a small amount for each thread, since there are so many of them. The other wrinkle is that stacks are not contiguous - all the stacks are interleaved so that when a SIMD vector does a load from the same address relative to each lane's stack pointer, the load is vectorized. This is why they talk about "local memory", since it's allocated and accessed differently than normal "global memory".
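    A rough back-of-the-envelope check of those numbers (my sketch; the interleaved layout itself is handled by the hardware and driver and isn't visible here): the per-thread stack size can be queried with cudaDeviceGetLimit, and multiplying it by the 15 SMX x 2048 = 30720 resident threads gives the order of magnitude of local memory the runtime sets aside for stacks.

    #include <cstdio>
    #include <cuda_runtime.h>

    int main() {
        size_t stack_bytes = 0;
        cudaDeviceGetLimit(&stack_bytes, cudaLimitStackSize);    // per-thread stack reservation
        const size_t resident_threads = 15 * 2048;               // GK110: 15 SMX * 2048 threads each
        printf("Per-thread stack : %zu bytes\n", stack_bytes);
        printf("Stack backing    : ~%zu KB for %zu resident threads\n",
               stack_bytes * resident_threads / 1024, resident_threads);
        return 0;
    }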
     
  15. Bob

    Bob
    Regular

    Joined:
    Apr 22, 2004
    Messages:
    424
    Likes Received:
    47
    On Maxwell sm50 profiles, blocks are HW-limited to 48 KB of shared memory (each). This is the same as on Kepler and Fermi. ptxas was misreporting the capability as 64 KB per block. This is incorrect. The real HW maximum size is 48 KB of shared memory per block.

    SMM has a total shared memory capacity of 64 KB. So it can run one block that uses 48 KB, or two blocks each using 32 KB of shared memory at the same time.

    Hope this clarifies things.
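    In code, that limit looks like this (a hypothetical illustration, not from the post above): a single block can request up to 48 KB of dynamic shared memory, and asking for 64 KB makes the launch fail, even though the SMM could keep two 32 KB blocks resident at once.

    #include <cstdio>
    #include <cuda_runtime.h>

    __global__ void use_smem(float* out) {
        extern __shared__ float buf[];           // dynamic shared memory, sized at launch
        buf[threadIdx.x] = threadIdx.x;
        __syncthreads();
        out[threadIdx.x] = buf[threadIdx.x];
    }

    int main() {
        float* d_out;
        cudaMalloc(&d_out, 256 * sizeof(float));
        use_smem<<<1, 256, 48 * 1024>>>(d_out);  // OK: exactly the 48 KB per-block limit
        printf("48 KB: %s\n", cudaGetErrorString(cudaGetLastError()));
        use_smem<<<1, 256, 64 * 1024>>>(d_out);  // fails: above the per-block HW limit on sm_50
        printf("64 KB: %s\n", cudaGetErrorString(cudaGetLastError()));
        cudaFree(d_out);
        return 0;
    }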
     
  16. CarstenS

    Veteran Subscriber

    Joined:
    May 31, 2002
    Messages:
    4,875
    Likes Received:
    2,183
    Location:
    Germany

    Absolutely. Thanks Bob!
     
  17. Osamar

    Newcomer

    Joined:
    Sep 19, 2006
    Messages:
    200
    Likes Received:
    14
    Location:
    40,00ºN - 00,00ºE
  18. Cookie Monster

    Newcomer

    Joined:
    Sep 12, 2008
    Messages:
    167
    Likes Received:
    8
    Location:
    Down Under
  19. UniversalTruth

    Veteran

    Joined:
    Sep 5, 2010
    Messages:
    1,747
    Likes Received:
    22
    A $150 video card is anything but low-end; enthusiasts who either don't want to spend insane amounts of money or simply don't have it buy cards in exactly this price segment and play everything at close to maximum settings!

    Keep in mind also that just one process node (40 nm) generation ago, this was high-end performance!
     
  20. xDxD

    Regular

    Joined:
    Jun 7, 2010
    Messages:
    412
    Likes Received:
    1

    at a $500 MSRP...
     