Nvidia Volta Speculation Thread

Discussion in 'Architecture and Products' started by DSC, Mar 19, 2013.

  1. CSI PC

    Veteran

    Joined:
    Sep 2, 2015
    Messages:
    2,050
    Likes Received:
    844
    I think the only question would be whether it makes sense to use 64 CUDA cores per SM (like P100 and V100) or 128 CUDA cores per SM (like every other Maxwell/Pascal GPU).
    I am sure the Quadro GP100 was tested with games by some people; does anyone with a better memory remember where?
    It may be worth revisiting that to see if there is a correlation with the CUDA core/SM ratio and whether it is potentially detrimental to some games; I am fairly sure I heard it may not be ideal, and we see quite a few games falling short of the expected performance scaling even outside of ROP limits, while a few others do reach the expected 30-40% improvement.
    Apart from that, like you I would expect Volta to be the next gaming arch minus Tensor/FP64, albeit possibly with a differentiated name outside of the flagship mixed-precision/Tensor GPU.
    Jonah Alben has mentioned Volta architecture generally works well with games.
     
    #941 CSI PC, Dec 16, 2017
    Last edited: Dec 16, 2017
  2. DavidGraham

    Veteran

    Joined:
    Dec 22, 2009
    Messages:
    3,976
    Likes Received:
    5,213
    Never tested with games unfortunately, only pro apps.

    We have multiple factors indeed: low clocks, a low ROP count, and immature drivers (obvious from the frame pacing issues and fps locks that several games have). Nevertheless, NVIDIA is listing the Titan V as part of the 10 series on the driver selection page, so a completely new product is almost a given at this point.
     
  3. silent_guy

    Veteran Subscriber

    Joined:
    Mar 7, 2006
    Messages:
    3,754
    Likes Received:
    1,382
    I don’t understand what the practical difference is between the 128-core and the 64-core Pascal SM. From where I stand, they are identical. Even Nvidia is confused about it, sometimes giving GP100 30 SMs of 128 cores and sometimes 60 SMs of 64 cores.
     
  4. CSI PC

    Veteran

    Joined:
    Sep 2, 2015
    Messages:
    2,050
    Likes Received:
    844
    You are changing the structure/size of the register file, instruction buffer/instruction cache, and texture units/cache relative to the number of CUDA cores, all of which affects warps/threads/thread blocks, by doubling the SM-to-CUDA-core ratio.
    It is efficient in many ways, but how it behaves with some workloads/code/gaming, *shrug*.
    There must be a reason Nvidia never went this route with any of their other Pascal GPUs, especially the higher-end GP102 (remember, shared with the gaming segment); the Tesla P40 was Nvidia's highest-performing FP32 HPC/scientific card.
    Hmm, I have never seen Nvidia describe the GP100 as anything other than 64 CUDA cores per SM; the reality is 56 SMs, as the P100 has 3584 FP32 cores active.
    60 is in theory possible, as that would be the fully active die, but it was never released as such.

    If we had gaming results for both the Quadro GP100 and the Titan V it would help to give a better picture when it comes to games.
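    To sanity check the SM accounting being debated here, a quick sketch (the SM and per-SM core figures are the ones quoted in this thread; the function name is just mine):

```python
# Quick sanity check of GP100 CUDA core counts under the two
# SM accounting schemes discussed above (figures from this thread).

def total_cores(sms: int, cores_per_sm: int) -> int:
    """Total FP32 CUDA cores for a given SM configuration."""
    return sms * cores_per_sm

# Tesla P100 ships with 56 of the full die's 60 SMs enabled.
p100_shipping = total_cores(56, 64)   # 64 cores/SM, as in the whitepaper
p100_full_die = total_cores(60, 64)

# The HotChips slide's alternative accounting: 30 "SMs" of 128 cores
# (really 30 TPCs of 2 SMs each) gives the same full-die total.
hotchips_view = total_cores(30, 128)

print(p100_shipping)  # 3584
print(p100_full_die)  # 3840
print(hotchips_view)  # 3840
```

    Same silicon either way; only the bookkeeping differs, which is presumably how the HotChips slide ended up with 30.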
     
  5. silent_guy

    Veteran Subscriber

    Joined:
    Mar 7, 2006
    Messages:
    3,754
    Likes Received:
    1,382
    Check out this slide: https://cdn.wccftech.com/wp-content/uploads/2016/08/NVIDIA-GP100-GPU.jpg

    30 SMs, not 60.
     
  6. CSI PC

    Veteran

    Joined:
    Sep 2, 2015
    Messages:
    2,050
    Likes Received:
    844
    Ah, WCCFT; that really does not look like anything from Nvidia tbh, so more likely WCCFT made a mistake or quoted someone who did not have all the facts correct.
    Every reliable source reports it as either 56 or 60 SMs (60 being the full die, which was never released; we only saw the 3584-core model).
     
    #946 CSI PC, Dec 17, 2017
    Last edited: Dec 17, 2017
    nnunn and BRiT like this.
  7. nnunn

    Newcomer

    Joined:
    Nov 27, 2014
    Messages:
    40
    Likes Received:
    31
    #947 nnunn, Dec 17, 2017
    Last edited by a moderator: Dec 17, 2017
    Lightman, CSI PC and pharma like this.
  8. pharma

    Veteran

    Joined:
    Mar 29, 2004
    Messages:
    4,887
    Likes Received:
    4,534
    For reference purposes.
    https://www.nextplatform.com/2017/08/11/nvidia-textbook-case-sowing-reaping-markets/


    https://www.nextplatform.com/2017/05/10/nvidias-tesla-volta-gpu-beast-datacenter/
     
    nnunn likes this.
  9. CSI PC

    Veteran

    Joined:
    Sep 2, 2015
    Messages:
    2,050
    Likes Received:
    844
    The Power9 with Nvidia is looking nice; fingers crossed IBM gets back into the HPC segment, as they deserve to IMO.
     
  10. silent_guy

    Veteran Subscriber

    Joined:
    Mar 7, 2006
    Messages:
    3,754
    Likes Received:
    1,382
    It’s a slide that Nvidia presented at HotChips 2016...
     
    Ryan Smith, Rufus and CSI PC like this.
  11. CSI PC

    Veteran

    Joined:
    Sep 2, 2015
    Messages:
    2,050
    Likes Received:
    844
    Hmm..
    And their 2016 P100 whitepaper has the correct number of 56; can you link the article or slides?
    Edit:
    NVM, and thanks, you're right; I found the HotChips one, but that is the first I have seen that is wrong, and they assume it has the same layout as the rest of the GPU line, which it obviously does not.

    To put it into perspective, the 56-SM figure was on their primary devblog page well before that August presentation, along with the whitepaper, also well before August; so no idea how they messed up the HotChips presentation *shrug*.
    https://devblogs.nvidia.com/parallelforall/inside-pascal/
    https://images.nvidia.com/content/pdf/tesla/whitepaper/pascal-architecture-whitepaper.pdf
     
    #951 CSI PC, Dec 17, 2017
    Last edited: Dec 17, 2017
  12. CSI PC

    Veteran

    Joined:
    Sep 2, 2015
    Messages:
    2,050
    Likes Received:
    844
    It was bugging me, as I know John Danskin is pretty much on the ball as an engineer, and a senior one at that. Although the scope of the HotChips talk was specifically silicon/NVLink, that is still no excuse for it to be wrong; John Danskin is the Nvidia name associated with the HotChips presentation.
    So I dug around for some of his other presentations; later on he shows the GP100 with 64 CUDA cores per SM, meaning double the SM-to-core ratio of the other GPUs and 56/60 SMs.
    HETEROGENEOUS COMPUTING: http://salishan.ahsc-nm.org/uploads/4/9/7/0/49704495/2017-danskin.pdf
    You will notice the information with the same/similar die image (page 5 in the above link) has now changed, and the next slide shows the correct Pascal GP100 structure of 56/60 SMs.

    Slightly off tangent, but considering the P100 was never released as the fully active die (60 SMs), it makes me think we will see the same again with the V100 staying at 80 SMs rather than eventually 84; just saying, as some feel it may eventually launch as a fully active GPU, but some thought that about the P100 as well.
     
    #952 CSI PC, Dec 17, 2017
    Last edited: Dec 17, 2017
  13. Anarchist4000

    Veteran

    Joined:
    May 8, 2004
    Messages:
    1,439
    Likes Received:
    359
    We've seen papers from AMD and Nvidia on the matter. Both showed rather significant gains (30-40% as I recall), but didn't really go into scenarios beyond single-die fabrication limits. The physics are simple, and multiple chips will win, as more silicon at lower voltages is more efficient due to less leakage; at least up until monolithic designs are running at threshold voltages. Then it's a question of absolute performance versus efficiency.

    Epyc may not be a GPU, but we've seen cost and performance tradeoffs versus Intel's.
     
  14. silent_guy

    Veteran Subscriber

    Joined:
    Mar 7, 2006
    Messages:
    3,754
    Likes Received:
    1,382
    I don't think it matters very much whether it has 30 SMs that have 128 cores or 60 SMs that have 64. But I still would like to know the reason why. :)
     
  15. keldor314

    Newcomer

    Joined:
    Feb 23, 2010
    Messages:
    132
    Likes Received:
    13
    Basically, each SM has a block of shared memory/L1 cache associated with it. In GP100, they doubled the L1/shared memory by having two blocks of it per what used to be an SM. Due to addressing/whatever, each half of the original SM can only see one of the blocks, so it behaves like two 64-core SMs. There are likely structures (the SFUs, most likely) still shared by both halves, so it's not as clear cut as it could be.
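    A toy model of that split (the 96 KiB vs 2x64 KiB shared memory figures are my assumption from the published Pascal whitepapers, and the dict/function names are just for illustration):

```python
# Toy model of the point above: splitting one 128-core SM into two
# 64-core SMs, each with its own shared-memory block, caps the shared
# memory a single thread block can see, even though total capacity grows.

MAXWELL_STYLE = {"sms": 1, "cores_per_sm": 128, "smem_per_sm_kib": 96}
GP100_STYLE   = {"sms": 2, "cores_per_sm": 64,  "smem_per_sm_kib": 64}

def totals(cfg: dict) -> tuple[int, int]:
    """Return (total shared memory KiB, max KiB visible to one block)."""
    total_smem = cfg["sms"] * cfg["smem_per_sm_kib"]
    per_block_cap = cfg["smem_per_sm_kib"]  # a thread block runs on one SM
    return total_smem, per_block_cap

print(totals(MAXWELL_STYLE))  # (96, 96)
print(totals(GP100_STYLE))    # (128, 64) -> more in total, smaller cap
```

    So aggregate capacity goes up, but any single thread block is limited to one 64 KiB half.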
     
    silent_guy and nnunn like this.
  16. CSI PC

    Veteran

    Joined:
    Sep 2, 2015
    Messages:
    2,050
    Likes Received:
    844
    I said earlier that it changes the structure/size of the register file, instruction buffer/instruction cache, and texture units/cache relative to the number of CUDA cores, all of which affects warps/threads/thread blocks by doubling the SM-to-CUDA-core ratio.

    Apply Occam's Razor:
    If doubling the SM structure made no difference between 128 and 64 cores, then why do it in the first place? Although, like I said, Nvidia did report it is generally more efficient.
    If it does make an efficiency difference, then why only on the P100 and V100, the flagship mixed-precision HPC-dedicated GPUs?
    And lastly, why is it not applied to what used to be their fastest FP32 HPC GPU, the GP102-based Tesla P40? An important difference between the GP102 and GP100 is that the former is also shared with the gaming segment and is not an FP64/FP32/FP16 mixed-precision GPU.

    I mentioned earlier it is more efficient, but it may also not be ideal for all workloads/code, including some or many games; notice the gaming results range from 11% to 40% faster than a Titan Xp, and it cannot all be explained away by ROP/CPU limits - some games just cannot use the cores/SMs with that front end very well.
    Unfortunately, the only way to really know is to also test the Quadro GP100 with games to see if there is a trend/behaviour.
     
    #956 CSI PC, Dec 18, 2017
    Last edited: Dec 18, 2017
  17. Kaotik

    Kaotik Drunk Member
    Legend

    Joined:
    Apr 16, 2003
    Messages:
    10,244
    Likes Received:
    4,465
    Location:
    Finland
    It's pretty clearly a mistake on the HotChips slide since every other NVIDIA paper mentions 60 (56+4)
    edit: It probably should say "TPC" instead of "SM" on the HotChips slide, since each TPC on GP100 has 2 SMs
     
  18. Arun

    Arun Unknown.
    Legend

    Joined:
    Aug 28, 2002
    Messages:
    5,023
    Likes Received:
    302
    Location:
    UK
    I think the biggest high-level difference is 2x the shared memory bandwidth per ALU, and ~1.33x the shared memory capacity (2x64KiB vs 1x96KiB). The bandwidth is especially important, as the lack of shared memory bandwidth was a big bottleneck for Maxwell (vs Kepler and especially GCN/Fermi), which showed up a lot more in compute than in gaming.

    On GV100, the shared memory architecture is completely different, so it's hard to predict what they're going to do for future gaming GPUs... 128KB of hybrid L1/shared memory per 64 FP32 ALUs seems like an expensive luxury for gaming...

    BTW, this is completely unrelated to the increase in register file size, as registers are effectively per warp scheduler. Some of the other points that CSI PC brings up, e.g. instruction buffer/cache sizes, are interesting and I hadn't thought about before - I don't really know whether HPC kernels are so large they might justify a larger cache size compared to gaming (and again how it plays with the different instruction cache architecture in GV100).
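    The per-ALU comparison above works out as follows; a back-of-the-envelope sketch using only the 2x64KiB vs 1x96KiB figures, with the 2x bandwidth factor assumed to come from having two independently ported banks serving the same 128 ALUs:

```python
# Per-ALU shared memory capacity for the two Pascal SM layouts
# discussed above (figures taken from this thread).
KIB = 1024

def bytes_per_alu(banks: int, kib_per_bank: int, alus: int) -> float:
    """Shared memory bytes available per FP32 ALU."""
    return banks * kib_per_bank * KIB / alus

gp100 = bytes_per_alu(2, 64, 128)   # two 64 KiB banks per 128 ALUs
gp104 = bytes_per_alu(1, 96, 128)   # one 96 KiB bank per 128 ALUs

print(gp100)          # 1024.0 bytes/ALU
print(gp104)          # 768.0 bytes/ALU
print(gp100 / gp104)  # ~1.33x capacity per ALU
# Bandwidth scales with the number of independent banks, so two banks
# serving the same ALU count gives roughly 2x bandwidth per ALU.
```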
     
    #958 Arun, Dec 18, 2017
    Last edited: Dec 18, 2017
    nnunn and silent_guy like this.
  19. pharma

    Veteran

    Joined:
    Mar 29, 2004
    Messages:
    4,887
    Likes Received:
    4,534
  20. silent_guy

    Veteran Subscriber

    Joined:
    Mar 7, 2006
    Messages:
    3,754
    Likes Received:
    1,382
    That makes a lot of sense.

    But that means that GP100 can’t fit some shared data structures in shared memory, because the maximum size visible to a single thread block is actually smaller. This is probably a rare exception.

    I wasn’t aware that Maxwell was a regression in shared memory bandwidth compared to Kepler. (To be honest: I’ve never given it any thought!)
     