Nvidia Ampere Discussion [2020-05-14]

Discussion in 'Architecture and Products' started by Man from Atlantis, May 14, 2020.

  1. CarstenS

    Legend Veteran Subscriber

    Joined:
    May 31, 2002
    Messages:
    5,117
    Likes Received:
    2,587
    Location:
    Germany
    AI training - probably yes, since that is what it was designed to do well. But this line of discussion started around the usefulness for scientific HPC applications.

    Apart from that: Tensor cores are massive MMA arrays; AFAIK they cannot do anything else. For example, they do not have the SFUs that the traditional cores have.
     
  2. DavidGraham

    Veteran

    Joined:
    Dec 22, 2009
    Messages:
    3,360
    Likes Received:
    3,734
    Once more, FP64 Tensor format is designed for HPC simulation workloads as well as AI training. NVIDIA is encouraging developers to migrate their HPC code to the new format.

    "With FP64 and other new features, the A100 GPUs based on the NVIDIA Ampere architecture become a flexible platform for simulations, as well as AI inference and training — the entire workflow for modern HPC. That capability will drive developers to migrate simulation codes to the A100.

    Users can call new CUDA-X libraries to access FP64 acceleration in the A100. Under the hood, these GPUs are packed with third-generation Tensor Cores that support DMMA, a new mode that accelerates double-precision matrix multiply-accumulate operations.

    A single DMMA job uses one computer instruction to replace eight traditional FP64 instructions. As a result, the A100 crunches FP64 math faster than other chips with less work, saving not only time and power but precious memory and I/O bandwidth as well.

    We refer to this new capability as Double-Precision Tensor Cores."
    https://agenparl.eu/double-precision-tensor-cores-speed-high-performance-computing/
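    To make the quoted passage concrete, here is a minimal sketch (my own illustration, not from the article) of what "calling a CUDA-X library to access FP64 acceleration" looks like from the application side: an ordinary cuBLAS DGEMM call in IEEE 754 FP64. On A100 the library can route this multiply through the Double-Precision Tensor Core (DMMA) path; the application source itself does not change.

    // Minimal host-side sketch: standard cuBLAS DGEMM in FP64.
    // On A100, cuBLAS may execute this via the FP64 tensor-core (DMMA) path
    // without any change to this code.
    #include <cublas_v2.h>
    #include <cuda_runtime.h>
    #include <vector>
    #include <cstdio>

    int main() {
        const int n = 1024;                      // square matrices for simplicity
        std::vector<double> hA(n * n, 1.0), hB(n * n, 2.0), hC(n * n, 0.0);

        double *dA, *dB, *dC;
        cudaMalloc(&dA, n * n * sizeof(double));
        cudaMalloc(&dB, n * n * sizeof(double));
        cudaMalloc(&dC, n * n * sizeof(double));
        cudaMemcpy(dA, hA.data(), n * n * sizeof(double), cudaMemcpyHostToDevice);
        cudaMemcpy(dB, hB.data(), n * n * sizeof(double), cudaMemcpyHostToDevice);
        cudaMemcpy(dC, hC.data(), n * n * sizeof(double), cudaMemcpyHostToDevice);

        cublasHandle_t handle;
        cublasCreate(&handle);

        const double alpha = 1.0, beta = 0.0;
        // C = alpha * A * B + beta * C, all in IEEE 754 double precision.
        cublasDgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                    n, n, n, &alpha, dA, n, dB, n, &beta, dC, n);

        cudaMemcpy(hC.data(), dC, n * n * sizeof(double), cudaMemcpyDeviceToHost);
        printf("C[0] = %f (expect %d)\n", hC[0], 2 * n);

        cublasDestroy(handle);
        cudaFree(dA); cudaFree(dB); cudaFree(dC);
    }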
     
    Lightman, nnunn, Konan65 and 3 others like this.
  3. CarstenS

    Legend Veteran Subscriber

    Joined:
    May 31, 2002
    Messages:
    5,117
    Likes Received:
    2,587
    Location:
    Germany
    What is FP64 Tensor format? For scientific use, it has to be IEEE 754 FP64.
     
  4. DavidGraham

    Veteran

    Joined:
    Dec 22, 2009
    Messages:
    3,360
    Likes Received:
    3,734
    It is.

    https://devblogs.nvidia.com/nvidia-ampere-architecture-in-depth/
     
    nnunn, xpea, Konan65 and 3 others like this.
  5. Benetanegia

    Regular Newcomer

    Joined:
    Sep 4, 2015
    Messages:
    344
    Likes Received:
    316
    They are not for DL only. They are just matrix-on-matrix multiplication instead of "scalar", so their use may (or may not) be more limited, depending on the requirements of the application. My remark comes after a stream of posts talking about how matrix-on-matrix multiplication can actually be useful in HPC in some/most cases. That is something I wouldn't know first hand, but if it is, then my remark follows: A100 offers 19.5 TFLOPS of FP64.

    Because matrix multiplication may not be useful in every situation. And FWIW, A100 may not even have FP64 ALUs at all outside the tensor cores, just like Volta/Turing didn't have FP16 ALUs, yet offered 2x FP16 rates through the TCs.
     
    xpea likes this.
  6. ToTTenTranz

    Legend Veteran Subscriber

    Joined:
    Jul 7, 2008
    Messages:
    11,284
    Likes Received:
    5,905
    What are all these FP64 matrix-on-matrix non-ML loads that the posts mention?
    And if they exist in any relevant proportion, why is nvidia's own GA100 page only linking the FP64 Tensor performance to ML training?


    Except that is in direct contradiction to the public specifications.
     
  7. Malo

    Malo Yak Mechanicum
    Legend Veteran Subscriber

    Joined:
    Feb 9, 2002
    Messages:
    7,852
    Likes Received:
    4,034
    Location:
    Pennsylvania
    Maybe due to the reduced precision, so it likely depends on the mantissa/exponent requirements of the application itself? TF32 definitely has less precision than FP32, but for ML purposes it's suitable? The Ampere documents don't say which IEEE standard the FP64 is compliant with, nor what level of precision it has, the way they do for TF32. At least, this is all based on my limited knowledge.
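    For what it's worth, NVIDIA's Ampere material describes TF32 as keeping FP32's 8-bit exponent (same range) but only 10 explicit mantissa bits (FP16's precision), while the FP64 Tensor path operates on standard IEEE 754 doubles. A rough illustration of what the TF32 precision loss means for a single FP32 value, assuming simple truncation of the low mantissa bits (the hardware actually rounds, this is just my sketch):

    // Rough illustration (not NVIDIA's implementation): approximate TF32's
    // reduced precision by keeping only the top 10 of FP32's 23 mantissa bits.
    // The 8-bit exponent (dynamic range) is untouched; only precision drops.
    #include <cstdint>
    #include <cstring>
    #include <cstdio>

    float to_tf32_precision(float x) {
        uint32_t bits;
        std::memcpy(&bits, &x, sizeof(bits));
        bits &= 0xFFFFE000u;                 // zero the low 13 mantissa bits
        std::memcpy(&x, &bits, sizeof(bits));
        return x;
    }

    int main() {
        float v = 3.14159265f;
        printf("FP32: %.8f  ~TF32: %.8f\n", v, to_tf32_precision(v));
    }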
     
  8. ToTTenTranz

    Legend Veteran Subscriber

    Joined:
    Jul 7, 2008
    Messages:
    11,284
    Likes Received:
    5,905
    I brought this here.

    That makes sense. It's a larger chip with more execution units and the TDP is similar, so they needed to lower the clocks to keep power consumption and heat dissipation in check.
    Also, IIRC the P100 was nvidia's very first 16FF chip so there could have been some process optimization between GP100 and GP102.


    Only if you believe TDP doesn't depend on clock rates in a given chip.
     
  9. DavidGraham

    Veteran

    Joined:
    Dec 22, 2009
    Messages:
    3,360
    Likes Received:
    3,734
    FP64 Tensor runs at the full precision of FP64. The only mode with reduced precision is TF32.

    Yes, and it doesn't require a code change.

    IEEE FP64 compliant means the standard FP64.
     
    nnunn likes this.
  10. Benetanegia

    Regular Newcomer

    Joined:
    Sep 4, 2015
    Messages:
    344
    Likes Received:
    316
    Why? I don't care? As shown by DavidGraham above, twice, Nvidia is actually advising to switch?

    No.
     
  11. Bondrewd

    Veteran Newcomer

    Joined:
    Sep 16, 2017
    Messages:
    1,043
    Likes Received:
    441
    idk ask the poor cuDNN devs lmao
     
  12. Benetanegia

    Regular Newcomer

    Joined:
    Sep 4, 2015
    Messages:
    344
    Likes Received:
    316
     
  13. CarstenS

    Legend Veteran Subscriber

    Joined:
    May 31, 2002
    Messages:
    5,117
    Likes Received:
    2,587
    Location:
    Germany
  14. Benetanegia

    Regular Newcomer

    Joined:
    Sep 4, 2015
    Messages:
    344
    Likes Received:
    316
    For the same reason Turing had a 2x FP16 rate alongside the much higher Tensor FP16 rate. One is matrix multiplication, the other is not. Both modes still executed on the TCs.
     
  15. CarstenS

    Legend Veteran Subscriber

    Joined:
    May 31, 2002
    Messages:
    5,117
    Likes Received:
    2,587
    Location:
    Germany
    That's what I said, ain't it? ;)
     
  16. OlegSH

    Regular Newcomer

    Joined:
    Jan 10, 2010
    Messages:
    410
    Likes Received:
    406
    Yet it's used enough to serve as the performance measure for the TOP500 supercomputer ranking.
    It's used in many practical HPC tasks here and there; otherwise it would be pointless to test supercomputers on DGEMM.
     
    Benetanegia, Konan65 and pharma like this.
  17. Benetanegia

    Regular Newcomer

    Joined:
    Sep 4, 2015
    Messages:
    344
    Likes Received:
    316
    Not really sure what you're trying to say, tbh. As OlegSH just pointed out right above, DGEMM's importance in HPC is beyond any doubt. I don't see what doesn't compute about Nvidia offering different instruction types for different tasks.
    Also, AFAIK in Turing at least it was possible to dual-issue a TC op and either an FP32 or INT32 op too, so maybe that extends to the FP64 ALUs, if there are any at all.
     
  18. Benetanegia

    Regular Newcomer

    Joined:
    Sep 4, 2015
    Messages:
    344
    Likes Received:
    316
    If the apparent inclusion of FP64 ALUs is what bugs you, my theory for that is the following:

    Turing didn't have FP16 ALUs, yet it did offer "scalar" FP16 instructions at 2x rate running on the tensor cores. If my math is correct, that translates to "scalar" instructions running on the TCs at 1/4 of the tensor/matrix-on-matrix rate (i.e. 32 TFLOPS vs 120). If running "scalar" FP64 instructions on Ampere's TCs had the same 1/4 impact (probable), that would result in a 1:4 FP64:FP32 rate instead of the desired 1:2 rate. Hence the inclusion of actual FP64 ALUs.
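    A quick sanity check of that ratio with A100's public peak numbers (19.5 TFLOPS FP32, 19.5 TFLOPS FP64 Tensor, 9.7 TFLOPS standard FP64); the 1/4 scalar-on-TC factor below is the assumption carried over from Turing, not a published figure:

    19.5 TFLOPS (FP64 Tensor) x 1/4 ≈ 4.9 TFLOPS "scalar" FP64  ->  ~1:4 vs 19.5 TFLOPS FP32
    dedicated FP64 ALUs: 9.7 TFLOPS                             ->   1:2 vs 19.5 TFLOPS FP32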
     
  19. xpea

    Regular Newcomer

    Joined:
    Jun 4, 2013
    Messages:
    421
    Likes Received:
    461
    But AI ML is where the money is. From hyperion research:
    [Attached image: Hyperion Research market chart]
    HPC is shrinking, and Nvidia made the right decision to go all-in on AI/ML/DL with Ampere. For the record, by the end of 2019 Nvidia had more than 1,500 direct customers for DGX systems. From nothing 2 years ago, and at $100k+ per system, that's where the money is. Forget HPC; already for this generation it's not the main focus anymore...
     
    #159 xpea, May 27, 2020
    Last edited: May 27, 2020
  20. nnunn

    Newcomer

    Joined:
    Nov 27, 2014
    Messages:
    38
    Likes Received:
    30
    Just to clarify, double precision (FP64) matrix multiplication is just one of many things that HPC codes do. All our other (double precision) kernels (currently) expect traditional double precision ALUs.
     
    BRiT, pharma and CarstenS like this.