Nvidia Volta Speculation Thread

Discussion in 'Architecture and Products' started by DSC, Mar 19, 2013.

  1. Rootax

    Veteran

    Joined:
    Jan 2, 2006
    Messages:
    2,401
    Likes Received:
    1,845
    Location:
    France
    Maybe CUDA allows more "hardware" functions to be exploited than OpenCL, or sits closer to the metal (whatever that means in 2017), hence the performance boost?
     
  2. CarstenS

    Legend Subscriber

    Joined:
    May 31, 2002
    Messages:
    5,800
    Likes Received:
    3,920
    Location:
    Germany
    That's why I said weird: OpenCL and CUDA results are all over the place, with RAW conversion, for example, being massively faster in OpenCL than in CUDA. DoF is also a bit faster there. Normally I'd agree: you would expect CUDA to be faster than OpenCL.
     
    Lightman likes this.
  3. BoMbY

    Newcomer

    Joined:
    Aug 31, 2017
    Messages:
    68
    Likes Received:
    31
    The NVidia DGX-1 system is supposed to have 8x Tesla V100. A shame Geekbench doesn't show how many compute devices it used. Going by the specs, 743537 seems too good for just one of them, but it would be really bad for 8.
     
  4. pharma

    Veteran

    Joined:
    Mar 29, 2004
    Messages:
    4,891
    Likes Received:
    4,539
    https://forum.beyond3d.com/posts/2001559/
     
  5. BoMbY

    Newcomer

    Joined:
    Aug 31, 2017
    Messages:
    68
    Likes Received:
    31
    Yeah, it just doesn't add up. By the specs, the V100 has maybe 150% of the raw performance of a P100, yet the V100 result is 232% of the best P100 result, and the individual subtest results are 2x to 3.5x higher than a P100's. A score in the 500k range would be believable, but not more than 700k for a single V100.
     
  6. pharma

    Veteran

    Joined:
    Mar 29, 2004
    Messages:
    4,891
    Likes Received:
    4,539
    CUDA 9 has significant improvements over CUDA 8, so that may also be coming into play with the results.
    Edit: It would be interesting to see V100 vs P100 with both using CUDA 9.
     
    CarstenS and DavidGraham like this.
  7. xpea

    Regular

    Joined:
    Jun 4, 2013
    Messages:
    551
    Likes Received:
    783
    Location:
    EU-China
    In OpenCL, so using the same API version, the V100 reaches 481k vs 278k for the P100.
    Still a MA-SSI-VE perf increase
     
    Lightman and pharma like this.
  8. BoMbY

    Newcomer

    Joined:
    Aug 31, 2017
    Messages:
    68
    Likes Received:
    31
    Yeah, but 50% raw performance plus a 23% architecture improvement is much more believable than 50% raw performance plus an 82% architecture improvement.
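This additive bookkeeping can be checked against the scores quoted earlier in the thread (a back-of-the-envelope sketch; the score figures are the ones posted above, not independent measurements):

```python
# How many percentage points of improvement are left over once the
# ~50% raw spec gain of V100 over P100 is accounted for.
RAW_SPEC_GAIN = 0.50  # V100 vs P100 raw throughput, per the specs

def extra_gain(v100_score, p100_score, raw=RAW_SPEC_GAIN):
    """Points of gain beyond the raw spec gain, in percentage points."""
    ratio = v100_score / p100_score
    return (ratio - 1.0 - raw) * 100

# OpenCL: 481k vs 278k -> roughly 23 points beyond the raw gain.
print(round(extra_gain(481_000, 278_000)))
# CUDA: the V100 result was 232% of the best P100 -> roughly 82 points.
print(round(extra_gain(2.32, 1.0)))
```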
     
  9. LiXiangyang

    Newcomer

    Joined:
    Mar 4, 2013
    Messages:
    87
    Likes Received:
    48
    The V100 has 2.85x the shared-memory bandwidth of Pascal, a greatly improved L1 cache, a greatly improved memory controller, and a more flexible warp-level SIMD execution model, and it can execute FP and INT computation (the latter usually for computing indices) at the same time. All of the above can reduce latency and instruction stalls significantly.

    You cannot just compare the GFLOPS there; tensor cores aside, GFLOPS is probably the least-improved dimension of the V100's performance.
     
    pharma, xpea, nnunn and 1 other person like this.
  10. CarstenS

    Legend Subscriber

    Joined:
    May 31, 2002
    Messages:
    5,800
    Likes Received:
    3,920
    Location:
    Germany
    Is that shared memory bandwidth figure aggregate over all SMs?
     
  11. BoMbY

    Newcomer

    Joined:
    Aug 31, 2017
    Messages:
    68
    Likes Received:
    31
    So you are saying improved cache speed and 1.2x memory bandwidth are responsible for almost double the overall performance? Okay ... I think I'll wait for independent benchmarks, with an actual, comparable baseline, before believing any of this.

    Hmm, why isn't there an edit function? I also want to add: even NVidia only claims a 1.5x HPC performance gain over the P100 on their V100 product page ...
     
    #611 BoMbY, Sep 19, 2017
    Last edited by a moderator: Sep 19, 2017
  12. Malo

    Malo Yak Mechanicum
    Legend Subscriber

    Joined:
    Feb 9, 2002
    Messages:
    8,931
    Likes Received:
    5,530
    Location:
    Pennsylvania
    You're a new member with only a few messages, you don't have any Edit rights yet.
     
  13. Matoka

    Joined:
    Sep 7, 2017
    Messages:
    1
    Likes Received:
    5
    It's per SM.

    That's because most HPC apps are memory-bandwidth bound, and 1.5x is how much the effective bandwidth improved in Volta versus Pascal. (The gain is much bigger than the ratio of peak bandwidths because Volta's efficiency improved from the high 70s percent of peak to the mid 90s percent of peak.)
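    As a sanity check of that 1.5x figure (a sketch: the peak HBM2 numbers, 732 GB/s for P100 and 900 GB/s for V100, are public specs, and the efficiency fractions are the ones quoted above):

```python
# Effective bandwidth = peak bandwidth x achieved fraction of peak.
p100_eff = 732 * 0.77  # P100: "high 70s percent" of its 732 GB/s peak
v100_eff = 900 * 0.95  # V100: "mid 90s percent" of its 900 GB/s peak
print(f"{v100_eff / p100_eff:.2f}x")  # ~1.52x, close to the quoted 1.5x
```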

    One other improvement worth mentioning is the reduction in back-to-back math op latency:
    • Kepler: 9 clocks
    • Maxwell / Pascal: 6 clocks
    • Volta: 4 clocks
     
    #613 Matoka, Sep 19, 2017
    Last edited by a moderator: Sep 19, 2017
    gamervivek, nnunn, Lightman and 2 others like this.
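    The latency figures above translate directly into dependent-chain throughput: a chain of back-to-back ops where each result feeds the next completes one op per `latency` clocks, so fewer clocks means a proportionally faster serial chain (and fewer warps needed to hide the latency). A minimal sketch using the numbers from the post:

```python
# Back-to-back math op latency, in clocks, as listed in the post above.
LATENCY_CLOCKS = {"Kepler": 9, "Maxwell/Pascal": 6, "Volta": 4}

kepler = LATENCY_CLOCKS["Kepler"]
for arch, lat in LATENCY_CLOCKS.items():
    # A fully dependent op chain issues one op every `lat` clocks.
    print(f"{arch}: {lat} clocks, serial-chain speedup vs Kepler = {kepler / lat:.2f}x")
```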
  14. xpea

    Regular

    Joined:
    Jun 4, 2013
    Messages:
    551
    Likes Received:
    783
    Location:
    EU-China
    The key word here is HPC.
    Geekbench is not an HPC bench. Maybe this light workload fits in Volta's massive caches and gets a big performance boost...
     
  15. CSI PC

    Veteran

    Joined:
    Sep 2, 2015
    Messages:
    2,050
    Likes Received:
    844
    Yeah,
    it will be worth keeping an eye on the AMBER benchmarks, which give a better general indicator for HPC; I would expect the V100 to appear there in the next month or two.
    http://ambermd.org/gpus/benchmarks.htm#Benchmarks
    The downside is that they never do a full node, so the results do not really reflect the mesh improvement. However, of interest to many, they test both PCIe and NVLink-paired GPU setups.

    Cheers
     
  16. xpea

    Regular

    Joined:
    Jun 4, 2013
    Messages:
    551
    Likes Received:
    783
    Location:
    EU-China
    SUMMIT details emerge via nextplatform. A small extract:
    Full article here at the source: https://www.nextplatform.com/2017/09/19/power9-rollout-begins-summit-sierra/
     
    sonen, Alexko, iMacmatician and 2 others like this.
  17. Alexko

    Veteran Subscriber

    Joined:
    Aug 31, 2009
    Messages:
    4,541
    Likes Received:
    964
    So with tensor cores, it could be the first machine to reach 1 exaflop—at weird, limited precision, not FP64, but still.
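    A rough scale check for that remark (an illustration, not a Summit spec; the ~120 TFLOPS figure is Nvidia's quoted per-V100 tensor throughput, FP16 multiply with FP32 accumulate):

```python
# How many V100s it takes to reach 1 exaflop at tensor-core precision.
TENSOR_TFLOPS_PER_V100 = 120       # Nvidia's quoted tensor throughput
EXAFLOP_IN_TFLOPS = 1_000_000      # 1 exaflop = 10^18 FLOPS = 10^6 TFLOPS
gpus = EXAFLOP_IN_TFLOPS / TENSOR_TFLOPS_PER_V100
print(f"~{gpus:,.0f} V100s")
```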
     
  18. Infinisearch

    Veteran

    Joined:
    Jul 22, 2004
    Messages:
    779
    Likes Received:
    146
    Location:
    USA
    Where did you read this?
     
  19. MDolenc

    Regular

    Joined:
    May 26, 2002
    Messages:
    696
    Likes Received:
    446
    Location:
    Slovenia
    Here for example:
     
  20. gamervivek

    Regular

    Joined:
    Sep 13, 2008
    Messages:
    805
    Likes Received:
    320
    Location:
    india
    Would that make much of a difference on the gaming side of things? I was more interested in the fact that they have dedicated INT32 units alongside FP32. How much can Nvidia cut down GV100 for a gaming chip?

    I don't think they'd want to put more FP32 cores in the gaming chip than GV100 has, unless they're really hurting for the GPU performance crown, so a 400-500mm2 chip perhaps.
     
    pharma, nnunn and DavidGraham like this.

