Nvidia Volta Speculation Thread

Discussion in 'Architecture and Products' started by DSC, Mar 19, 2013.

  1. CSI PC

    Veteran

    Joined:
    Sep 2, 2015
    Messages:
    2,050
    Likes Received:
    844
    Although that ignores one of the big selling points for many: CUDA and its integration with various frameworks, along with highly optimised libraries and considerable flexibility across diverse solutions and implementations.
    Whichever large-scale HW solution scientists/devs use, they will have to spend a lot of time learning and optimising their code, especially if they require both modelling/simulation and training.
    Importantly, Nvidia heavily supports a broad range of frameworks.

    But as a reference, even moving from traditional Intel Xeon to Xeon Phi meant a lot of reprogramming/optimising to make it worthwhile; one of the HPC labs investigated this and published their work.
    I agree CUDA will split opinions though, with some looking to avoid it while others embrace it from an HPC perspective.
     
    pharma likes this.
  2. OlegSH

    Regular

    Joined:
    Jan 10, 2010
    Messages:
    797
    Likes Received:
    1,624

    It seems even high-precision HPC computations can benefit from tensor cores.
     
    silent_guy, jlippo, Rufus and 6 others like this.
  3. Shaklee3

    Newcomer

    Joined:
    Apr 9, 2016
    Messages:
    18
    Likes Received:
    10
    @Ryan Smith could you elaborate more on how you did the matrix multiply? You mentioned the size earlier, but did you use the tensor core example GEMM code that comes with CUDA? I tried your exact size using that code and only got 50 TFLOPS.
     
  4. RecessionCone

    Regular Subscriber

    Joined:
    Feb 27, 2010
    Messages:
    505
    Likes Received:
    189
    The CUDA example is using WMMA, the CUDA abstraction for tensor cores. 50 TFLOPS is about right for the WMMA interface with current CUDA. To get full performance, use cuBLAS.
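    [Editor's note: for reference, the "TFLOPS" figures quoted in this exchange are effective throughput, conventionally computed as 2*M*N*K floating-point operations divided by the kernel time. A minimal sketch; the 8192-cubed size and 11 ms timing below are hypothetical, not the exact size Ryan used:

```python
def gemm_tflops(m, n, k, seconds):
    """Effective TFLOPS for an m x n x k GEMM: each of the m*n output
    elements needs k multiply-adds, i.e. 2*k floating-point operations."""
    flops = 2.0 * m * n * k
    return flops / seconds / 1e12

# Hypothetical example: an 8192^3 half-precision GEMM finishing in 11 ms
# works out to roughly 100 TFLOPS.
print(gemm_tflops(8192, 8192, 8192, 0.011))
```

    The same formula is what benchmark harnesses typically report, so WMMA-sample and cuBLAS numbers are directly comparable.]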
     
    Ryan Smith and pharma like this.
  5. pharma

    Veteran

    Joined:
    Mar 29, 2004
    Messages:
    4,887
    Likes Received:
    4,534
    #1005 pharma, Jan 10, 2018
    Last edited by a moderator: Jan 10, 2018
  6. Shaklee3

    Newcomer

    Joined:
    Apr 9, 2016
    Messages:
    18
    Likes Received:
    10
    Thanks. It seems to me that the WMMA functions should be equivalent to an intrinsic that compiles very closely to SASS. I wouldn't have expected more than a 100% throughput difference. Do they expect that anyone who doesn't want to use their libraries will just have to pay a performance penalty?
     
  7. CSI PC

    Veteran

    Joined:
    Sep 2, 2015
    Messages:
    2,050
    Likes Received:
    844
    Titan V and V100 scientific benchmarks.
    I've been keeping an eye out, as I was expecting this benchmarking around now or a bit earlier.
    Titan V and V100 PCIe (not Mezzanine) with Amber, various models at single precision: http://ambermd.org/gpus/benchmarks.htm#Benchmarks

    Wow in terms of performance, and compared to the P100 there is a nice price/performance benefit, although Titan V is crazy good in that respect.
    So Titan V is a nice buy for universities/small labs that just want a few scaled nodes; not sure how much larger Nvidia would allow this to be scaled, but they do try to support and assist academia and labs within reason.
    1x Titan V is faster than a dual Quadro GP100 PCIe NVLinked setup with these SP solvents...
    Insane value with the Titan V, while V100 PCIe has top performance and is easier to build efficient nodes/clusters around.
    Just to reiterate, these are single-precision solvent models.
     
    #1007 CSI PC, Jan 11, 2018
    Last edited: Jan 11, 2018
    nnunn, xpea and pharma like this.
  8. Ryan Smith

    Regular

    Joined:
    Mar 26, 2010
    Messages:
    629
    Likes Received:
    1,131
    Location:
    PCIe x16_1
    RecessionCone is correct. This was all a pretty thin wrapper calling up the appropriate CUBLAS functions.
     
    BRiT likes this.
  9. RecessionCone

    Regular Subscriber

    Joined:
    Feb 27, 2010
    Messages:
    505
    Likes Received:
    189
    The intrinsic is fine. The missing performance is because the CUDA compiler can't optimally schedule and register-allocate the code that uses the intrinsic. Hopefully that will improve with time. Getting 100% utilization of the tensor cores requires the whole chip to work at full tilt; doing anything even slightly suboptimally reduces performance measurably.
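    [Editor's note: to put the 50 vs. ~90+ TFLOPS numbers in context, V100's quoted ~125 TFLOPS tensor peak falls out of the tensor-core count, per-core FMA rate, and boost clock. A quick sanity-check sketch, using the figures from NVIDIA's published V100 specifications (640 tensor cores, 64 FMAs per core per clock, ~1.53 GHz boost):

```python
def peak_tensor_tflops(tensor_cores, fma_per_core_per_clock, clock_ghz):
    """Peak mixed-precision tensor throughput; each FMA counts as 2 FLOPs."""
    return tensor_cores * fma_per_core_per_clock * 2 * clock_ghz / 1e3

# V100: 640 tensor cores x 64 FMAs/clock x 2 FLOPs/FMA x 1.53 GHz
peak = peak_tensor_tflops(640, 64, 1.53)
print(f"peak: {peak:.1f} TFLOPS")        # ~125 TFLOPS
print(f"WMMA sample: {50 / peak:.0%}")   # ~40% utilization
print(f"cuBLAS path: {92 / peak:.0%}")   # ~73% utilization
```

    So the WMMA sample lands around 40% of peak and the cuBLAS-backed run around 73%, which is consistent with the scheduling/register-allocation explanation above.]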
     
    pharma likes this.
  10. Shaklee3

    Newcomer

    Joined:
    Apr 9, 2016
    Messages:
    18
    Likes Received:
    10
    Thanks @RecessionCone and @Ryan Smith. I was able to get 92 TFLOPS using your matrix size, which is close enough for me.
     
    nnunn, pharma, CSI PC and 1 other person like this.
  11. pharma

    Veteran

    Joined:
    Mar 29, 2004
    Messages:
    4,887
    Likes Received:
    4,534
    Benchmarking Tensorflow Performance on Next Generation GPUs
    Jan 22, 2018
    https://medium.com/initialized-capi...formance-on-next-generation-gpus-e68c8dd3d0d4
     
    xpea and Geeforcer like this.
  12. pharma

    Veteran

    Joined:
    Mar 29, 2004
    Messages:
    4,887
    Likes Received:
    4,534
    In this video from SC17 in Denver, Eric Nielsen from NASA presents: Unstructured-Grid CFD Algorithms on the NVIDIA Pascal and Volta Architectures.

    “In the field of computational fluid dynamics, the Navier-Stokes equations are often solved using an unstructured-grid approach to accommodate geometric complexity. Furthermore, turbulent flows encountered in aerospace applications generally require highly anisotropic meshes, driving the need for implicit solution methodologies to efficiently solve the discrete equations. To prepare NASA Langley Research Center’s FUN3D CFD solver for the future HPC landscape, we port two representative kernels to NVIDIA Pascal and Volta GPUs and present performance comparisons with a common multi-core CPU benchmark.”

    https://insidehpc.com/2018/01/unstructured-grid-cfd-algorithms-nasa-volta-gpus/


    Volta architecture
     
    #1012 pharma, Feb 2, 2018
    Last edited: Feb 2, 2018
    CSI PC, fellix, nnunn and 4 others like this.
  13. xpea

    Regular

    Joined:
    Jun 4, 2013
    Messages:
    551
    Likes Received:
    783
    Location:
    EU-China
    Extract from Video:
    Unstructured-Grid CFD Algorithms on Pascal and Volta.jpg
     
    nnunn and pharma like this.
  14. CSI PC

    Veteran

    Joined:
    Sep 2, 2015
    Messages:
    2,050
    Likes Received:
    844
    His talk on the cache is probably also an indicator of why Amber is so much more efficient/faster on Volta than on Pascal, much more than the spec sheet would suggest when considering cores/clocks.
    The Amber scientific solvent/model-simulation results in my earlier post show that a single Titan V or single GV100 is faster than a 2x Titan Xp or 2x 1080 Ti setup in all their solvent tests.
    Results are FP32.

    Some applications will see quite a dramatic increase.
    Edit:
    I did not bother mentioning the P100, as it has lower FP32 throughput than either the Xp or the 1080 Ti, even as the DGX-1 (which they also tested for P100 as a single card).
     
    #1014 CSI PC, Feb 5, 2018
    Last edited: Feb 5, 2018
  15. manux

    Veteran

    Joined:
    Sep 7, 2002
    Messages:
    3,034
    Likes Received:
    2,276
    Location:
    Self Imposed Exhile
    Wrong thread, but I'm too lazy to start a separate, appropriately named thread for Google news. The TPU is now "out in the wild"; more concretely, one can now pay to compute on Google Cloud using TPUs.

    https://www.forbes.com/sites/moorin...e-announces-expensive-cloud-tpu-availability/

    If true, HBM2 seems to be fairly expensive. I wonder whether that price is accurate.
     
    xpea and pharma like this.
  16. pharma

    Veteran

    Joined:
    Mar 29, 2004
    Messages:
    4,887
    Likes Received:
    4,534
    Well, chip for chip the TPU is 2 times slower, so I wonder whether efficiency is playing a part in the increased cost. From the Forbes article:
     
    DavidGraham and xpea like this.
  17. Arun

    Arun Unknown.
    Legend

    Joined:
    Aug 28, 2002
    Messages:
    5,023
    Likes Received:
    302
    Location:
    UK
    It is rather ironic that the TPU2 chip is practically 2 independent processors glued together on the same chip, each with its own independent HBM controller... They could probably get away with less memory (footprint, not bandwidth; so fewer HBM2 dies per stack but the same number of stacks) if they let both controllers work for both parts of the chip with a proper bus infrastructure.

    I wonder if Google has any kind of TPU roadmap now that the lead HW engineers have moved to Groq (groq.com), and if not, whether they even care about making it competitive outside of internal Google projects.
     
  18. entity279

    Veteran Subscriber

    Joined:
    May 12, 2008
    Messages:
    1,332
    Likes Received:
    500
    Location:
    Romania
    My hunch is that the TPU is a doomed project.

    It had its window when they were the first with mixed-precision tensor processors. Now that nV & Co are in, how could Google keep up?
     
  19. Kaotik

    Kaotik Drunk Member
    Legend

    Joined:
    Apr 16, 2003
    Messages:
    10,244
    Likes Received:
    4,465
    Location:
    Finland
    Why couldn't they? NVIDIA (or any other company) isn't some almighty deity that makes everything better than the rest.
     
  20. entity279

    Veteran Subscriber

    Joined:
    May 12, 2008
    Messages:
    1,332
    Likes Received:
    500
    Location:
    Romania
    No, it doesn't take a semiconductor deity to "stop" someone who's new to the market in a resource and project-management battle.

    I'm not excluding the possibility that Google (or anyone) can make a breakthrough. But as I've said, I fear their window of opportunity is running out.
     