Nvidia Ampere Discussion [2020-05-14]

Discussion in 'Architecture and Products' started by Man from Atlantis, May 14, 2020.

  1. xpea

    Regular

    Joined:
    Jun 4, 2013
    Messages:
    551
    Likes Received:
    783
    Location:
    EU-China
    Looks like you are not happy about the new sparsity tech. And BTW it's not only for INT, it also works on FP16, BF16 and TF32
    it's all here:
    https://devblogs.nvidia.com/nvidia-ampere-architecture-in-depth/
     
    pharma and A1xLLcqAgt0qc2RyMz0y like this.
  2. neckthrough

    Newcomer

    Joined:
    Mar 28, 2019
    Messages:
    138
    Likes Received:
    388
    For an equal number of non-zero parameters a larger sparse (pruned) network is almost always more accurate than a smaller dense (unpruned) network. You need hardware support to run the sparse network (that supports the specific pruning patterns used), but if you do you get close to the accuracy of a larger network at the execution cost of the smaller network.
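    As a rough sketch of the 2:4 structured sparsity pattern NVIDIA describes for Ampere (the function name and flat-list representation here are mine, purely for illustration):

    ```python
    def prune_2_to_4(weights):
        """Prune a flat list of weights to a 2:4 structured-sparsity pattern:
        in every group of four consecutive weights, keep the two with the
        largest magnitude and zero the other two. Real pruning is followed
        by fine-tuning to recover accuracy; this only shows the pattern."""
        pruned = []
        for i in range(0, len(weights), 4):
            group = weights[i:i + 4]
            # indices of the two largest-magnitude entries in this group
            keep = set(sorted(range(len(group)), key=lambda j: abs(group[j]))[-2:])
            pruned.extend(w if j in keep else 0.0 for j, w in enumerate(group))
        return pruned

    print(prune_2_to_4([0.9, -0.1, 0.05, -0.7]))  # → [0.9, 0.0, 0.0, -0.7]
    ```

    The hardware can then skip the zeroed multiplies, which is where the claimed 2x throughput for sparse tensor ops comes from.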
     
    pharma, nnunn and xpea like this.
  3. Man from Atlantis

    Regular

    Joined:
    Jul 31, 2010
    Messages:
    961
    Likes Received:
    855
    Doesn't look like the A100 in the picture. Or a very cut-down version with only 4 HBM stacks.
    https://blogs.nvidia.com/blog/2020/05/14/drive-platform-nvidia-ampere-architecture/
     
  4. ShaidarHaran

    ShaidarHaran hardware monkey
    Veteran

    Joined:
    Mar 31, 2007
    Messages:
    4,027
    Likes Received:
    90
    It may not be a perfect scale representation, but surely some effort was made to indicate the area occupied by each SM component in the block diagram.
     
  5. trinibwoy

    trinibwoy Meh
    Legend

    Joined:
    Mar 17, 2004
    Messages:
    12,058
    Likes Received:
    3,116
    Location:
    New York
    Nvidia has been using the same block diagrams since Kepler. They’ve never been to scale. E.g. the schedulers are likely larger than depicted and the fixed function geometry hardware isn’t represented at all.
     
  6. DegustatoR

    Veteran

    Joined:
    Mar 12, 2002
    Messages:
    3,242
    Likes Received:
    3,405
    Nah, it's just a schematic. The sizes are up to the artist, drawn however looks best.
     
  7. pharma

    Veteran

    Joined:
    Mar 29, 2004
    Messages:
    4,891
    Likes Received:
    4,539
    NVIDIA A100 Ampere Resets the Entire AI Industry
    May 14, 2020

    Every AI company measures its performance against the Tesla V100. Today, that measuring stick changes and takes a big leap with the NVIDIA A100. Note the cards are actually labeled GA100, but we are using A100 to align with NVIDIA’s marketing materials.

    [Image: A100 performance figures]
    You may ask yourself what the asterisks next to many of those numbers are. NVIDIA says those are the numbers with structural sparsity enabled. We are going to discuss more of that in a bit.
    ...
    NVIDIA is inventing new math formats and adding Tensor Core acceleration to many of these. Part of the story of the NVIDIA A100’s evolution from the Tesla P100 and Tesla V100 is that it is designed to handle BFLOAT16, TF32, and other new computation formats. This is exceedingly important because it is how NVIDIA is getting claims of 10-20x the performance of previous generations. At the same time, raw FP64 (non-Tensor Core) performance, for example, has gone from 5.3 TFLOPS with the Tesla P100 to 7.5 TFLOPS for the SXM2 Tesla V100 (a bit more in the SXM3 versions) and up to 9.7 TFLOPS in the A100. While traditional FP64 double precision is increasing, the accelerators and new formats are on a different curve.
    ...
    NVLink speeds have doubled to 600GB/s from 300GB/s. We figured this was the case recently in NVIDIA A100 HGX-2 Edition Shows Updated Specs. That observation seems to be confirmed along with the PCIe Gen4 observation.

    The A100 now utilizes PCIe Gen4. That is actually a big deal. With Intel’s major delays of the Ice Lake Xeon platform that will include PCIe Gen4, NVIDIA was forced to move to the AMD EPYC 64-core PCIe Gen4 capable chips for its flagship DGX A100 solution. While Intel is decisively going after NVIDIA with its Xe HPC GPU and Habana Labs acquisition, AMD is a GPU competitor today. Still, NVIDIA had to move to the AMD solution to get PCIe Gen4. NVIDIA’s partners will also likely look to the AMD EPYC 7002 Series to get PCIe Gen4 capable CPUs paired to the latest NVIDIA GPUs.

    NVIDIA wanted to stay x86 rather than go to POWER for Gen4 support. The other option would have been to utilize an Ampere Altra Q80-30 or similar as part of NVIDIA CUDA on Arm. It seems like NVIDIA does not have enough faith in Arm server CPUs to move its flagship DGX platform to Arm today. This may well happen in future generations so it does not need to design-in a competitor’s solution.

    I was able to ask Jensen a question directly on the obvious question: supply. Starting today, a Tesla V100 is a tough sell for anything that can be accelerated with Tensor Cores on the A100. As a result, the industry will want the A100. I asked how NVIDIA will prioritize which customers get the supply of new GPUs. Jensen said that the A100 is already in mass production, that cloud customers already have the A100, and that it will be in every major cloud. There are already customers who have the A100. Systems vendors can take the HGX A100 platform and deliver solutions around it. The DGX A100 is available for order today. That is a fairly typical data center launch where some customers are already deploying before the launch. Still, our sense is that there will be lead times as organizations rush to get the new GPU for hungry AI workloads.

    With the first round of GPUs, we are hearing that NVIDIA is focused on the 8x GPU HGX and 4x GPU boards to sell in its own and partner systems. NVIDIA is not selling these initial A100s as single PCIe GPUs. Instead, NVIDIA is selling them as pre-assembled GPU and PCB assemblies.

    https://www.servethehome.com/nvidia-tesla-a100-ampere-resets-the-entire-ai-industry/


     
    Lightman and PSman1700 like this.
  8. Voxilla

    Regular

    Joined:
    Jun 23, 2007
    Messages:
    832
    Likes Received:
    505
    This is one impressive GPU.
    Though clearly the design target was not reached, as 1/8 of the SMs (+4) and 1/6 of the HBM2 (8 GB) + L2 (8 MB) were disabled, presumably due to low yield.
    Later versions will likely have ~8K CUDA cores (all 128 SMs) and 48 GB HBM2, though that will either push past the 400 Watt TDP or lower the clock further.

    One 'deception' I noticed concerns the claim of 156 TF for FP32.
    This was presented in the videos as computation equivalent to FP32 without using tensor cores.
    But it is not: the tensor cores work with the new floating-point format 'deceptively' called TF32.
    In fact this is a 19-bit floating-point format and would have been better called FP19.
    TF32 has an 8-bit exponent and a 10-bit mantissa, as shown below.
    [Image: GTC_PPB_08.jpg — TF32 bit layout]
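    That 19-bit layout can be checked in software by pulling the fields out of an FP32 value; this little helper (my own, for illustration only) shows which bits TF32 keeps:

    ```python
    import struct

    def tf32_fields(x: float):
        """Decompose a float into the fields TF32 keeps: 1 sign bit, FP32's
        full 8-bit exponent, and only the top 10 of FP32's 23 mantissa bits
        (1 + 8 + 10 = 19 bits total, hence the 'FP19' comparison)."""
        bits = struct.unpack('<I', struct.pack('<f', x))[0]
        sign = bits >> 31
        exponent = (bits >> 23) & 0xFF
        mantissa10 = (bits >> 13) & 0x3FF  # top 10 of the 23 mantissa bits
        return sign, exponent, mantissa10

    print(tf32_fields(-1.5))  # → (1, 127, 512)
    ```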
     
  9. DavidGraham

    Veteran

    Joined:
    Dec 22, 2009
    Messages:
    3,976
    Likes Received:
    5,213
    I guess they will up the power consumption, as V100's power was also increased twice, first to 350w and a second time to 450w.

    Or they may go the V100S route, where they increased clocks and bandwidth while simultaneously slashing power from 300w to 250w.

    The new format works without changing code, so I guess there is "some" merit to this comparison?
     
    pharma and A1xLLcqAgt0qc2RyMz0y like this.
  10. pharma

    Veteran

    Joined:
    Mar 29, 2004
    Messages:
    4,891
    Likes Received:
    4,539
    Also, any AI acceleration from the new format will only be available on Ampere. Merely matching Volta is no longer a competitive option.
     
  11. CarstenS

    Legend Subscriber

    Joined:
    May 31, 2002
    Messages:
    5,800
    Likes Received:
    3,920
    Location:
    Germany
    Provided your use case is fine with TF32's 10-bit mantissa precision.
     
  12. pharma

    Veteran

    Joined:
    Mar 29, 2004
    Messages:
    4,891
    Likes Received:
    4,539
    https://www.nextplatform.com/2020/05/14/nvidia-unifies-ai-compute-with-ampere-gpu/
     
  13. nAo

    nAo Nutella Nutellae
    Veteran

    Joined:
    Feb 6, 2002
    Messages:
    4,400
    Likes Received:
    440
    Location:
    San Francisco
    There’s no claim of 156 TF/s for FP32. That claim is for TF32, which is a mixed-precision format for matrix multiplication and addition. The input and output operands are FP32, but the multipliers’ inputs are FP19, with their output accumulated at FP32. The app doesn’t need to change its code, and you get 8x (or 16x with sparsity) higher peak flop/s than full FP32 math.
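    A simplified software model of that pipeline (my own sketch; it uses round-to-nearest-up truncation, which may not match the hardware's exact rounding mode):

    ```python
    import struct

    def tf32_round(x: float) -> float:
        """Reduce an FP32 value's 23-bit mantissa to TF32's 10 bits
        (sketch: round-to-nearest-up, not necessarily the hardware mode)."""
        bits = struct.unpack('<I', struct.pack('<f', x))[0]
        bits = (bits + (1 << 12)) & ~((1 << 13) - 1)  # round, clear low 13 bits
        return struct.unpack('<f', struct.pack('<I', bits))[0]

    def tf32_dot(a, b):
        """Dot product the way a TF32 tensor op works: FP32 inputs and output,
        multiplier operands rounded to TF32, accumulation at full precision."""
        return sum(tf32_round(x) * tf32_round(y) for x, y in zip(a, b))

    print(tf32_dot([1.0, 2.0], [3.0, 4.0]))  # → 11.0 (these values are exact in TF32)
    ```

    Small integers and their sums survive exactly; it is only the low 13 mantissa bits of each multiplier input that get discarded, e.g. `tf32_round(1.0001)` comes back as `1.0`.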
     
  14. Voxilla

    Regular

    Joined:
    Jun 23, 2007
    Messages:
    832
    Likes Received:
    505
    If the app doesn't need to change, how does it then make the difference between training with FP32 and FP19 (aka TF32)?
    It also begs the question whether there is much value to FP19, as dropping 3 more bits from the mantissa gives you BFloat16, which makes training 2x faster.
    BTW, remark that you cannot use the sparsity feature for training; it is for inferencing.
     
  15. DavidGraham

    Veteran

    Joined:
    Dec 22, 2009
    Messages:
    3,976
    Likes Received:
    5,213
    Regarding the power efficiency of Ampere: A100 has 54 billion transistors, Titan RTX has 18 billion, V100 has 21 billion, and both of the latter came in at a TDP of 280~300w. So roughly speaking, A100 has 2.5X to 3X the transistor count while simultaneously increasing power to 400w (a 40% increase).

    I know this math is extremely rough around the edges, but it can give us some sort of an indication of how much progress NVIDIA has achieved on 7nm, the claim that Ampere is 50% faster than Turing at half the power is not that far fetched at least?
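    The back-of-envelope math above, written out (figures taken from the post; transistors per watt is only a loose proxy for performance per watt):

    ```python
    # Assumed figures from the post: transistor counts and TDPs.
    v100 = {"transistors": 21e9, "tdp": 300}
    a100 = {"transistors": 54e9, "tdp": 400}

    transistor_ratio = a100["transistors"] / v100["transistors"]  # ≈ 2.57x
    power_ratio = a100["tdp"] / v100["tdp"]                       # ≈ 1.33x
    density_per_watt = transistor_ratio / power_ratio             # ≈ 1.93x
    print(f"{transistor_ratio:.2f}x transistors at {power_ratio:.2f}x power "
          f"→ {density_per_watt:.2f}x transistors per watt")
    ```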
     
    nnunn likes this.
  16. troyan

    Regular

    Joined:
    Sep 1, 2015
    Messages:
    605
    Likes Received:
    1,126
    The app doesn't. Standard mode will be TF32 when developers use certain libraries from NVIDIA and, in the future, from other companies.
     
  17. Man from Atlantis

    Regular

    Joined:
    Jul 31, 2010
    Messages:
    961
    Likes Received:
    855
    With TF32, is there any benefit to using it for denoising in ray tracing games? Currently no games use tensor cores for denoising.
     
  18. pharma

    Veteran

    Joined:
    Mar 29, 2004
    Messages:
    4,891
    Likes Received:
    4,539
    Sparsity can be used for both training and inferencing, though it currently has more benefit when used for inferencing.
    https://www.nvidia.com/en-us/data-center/nvidia-ampere-gpu-architecture/

    https://blogs.nvidia.com/blog/2020/05/14/tensorfloat-32-precision-format/
    https://blogs.nvidia.com/blog/2020/05/14/sparsity-ai-inference/
     
    #78 pharma, May 17, 2020
    Last edited: May 17, 2020
    PSman1700, Lightman and xpea like this.
  19. DavidGraham

    Veteran

    Joined:
    Dec 22, 2009
    Messages:
    3,976
    Likes Received:
    5,213
    Any comment on this, guys?
     
  20. pharma

    Veteran

    Joined:
    Mar 29, 2004
    Messages:
    4,891
    Likes Received:
    4,539
    We'll likely know more once independent testing/reviews are done. Right now we only have NVIDIA's numbers.
     