Nvidia Ampere Discussion [2020-05-14]

Discussion in 'Architecture and Products' started by Man from Atlantis, May 14, 2020.

Tags:
  1. manux

    Veteran

    Joined:
    Sep 7, 2002
    Messages:
    3,034
    Likes Received:
    2,276
    Location:
    Self Imposed Exhile
    Would this mean the ability to program the GPU directly with C++ instead of using CUDA?





    edit. Digging a bit more
     
  2. So from the specs it looks like the A100 is running at 1425MHz.
    That's lower than any V100 16FFN implementation I know of (1440-1455MHz), even though the A100 is on 7nm and can draw 100W more than the highest-consuming V100 (mezzanine, 300W).

    So with 400W maximum power and 19.5 FP32 TFLOPs, the A100 comes in at about 20.5 W/TFLOP. The 3-year-old V100, with 300W max power and 14.9 FP32 TFLOPs, comes in at about 20.1 W/TFLOP.

    Did FP32 power efficiency actually go down in the transition from 16FFN Volta to 7nm Ampere?
    Or does that 400W only apply when the GPU is running its tensor units fully in parallel with the FP32 ALUs (and is that even possible without hitting L2 cache / VRAM bottlenecks)?

    This is just the first data point of many, but it could be that nvidia's delay in adopting 7nm is due to the fact that they can't reach higher clocks on 7nm than they could on 16/12FFN.
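    The perf-per-watt comparison above is simple division; a minimal sketch using the figures quoted in this post (spec-sheet peak TDPs and FP32 throughputs):

```python
# Watts per FP32 TFLOP at peak, using the numbers quoted in the post.
def watts_per_tflop(tdp_w: float, fp32_tflops: float) -> float:
    return tdp_w / fp32_tflops

a100 = watts_per_tflop(400, 19.5)  # roughly 20.5 W/TFLOP
v100 = watts_per_tflop(300, 14.9)  # roughly 20.1 W/TFLOP
print(f"A100: {a100:.1f} W/TFLOP, V100: {v100:.1f} W/TFLOP")
```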


    @Nebuchadnezzar wrote that the A100 can go up to 400W, though he also assumed the GPUs in Robotaxi are the A100, which they don't seem to be if you compare their pictures.
    The dGPUs in Robotaxi seem to be substantially smaller, maybe around 600mm^2.
     
    BRiT likes this.
  3. troyan

    Regular

    Joined:
    Sep 1, 2015
    Messages:
    605
    Likes Received:
    1,126
    FP32 and DP compute performance is not relevant for the A100. With TensorCores (FP16), or FP16 alone, compute performance is 2.5x higher. That works out to 87.5% higher efficiency, with 1.6TB/s of bandwidth...
     
    DavidGraham and PSman1700 like this.
  4. troyan

    Regular

    Joined:
    Sep 1, 2015
    Messages:
    605
    Likes Received:
    1,126
    TensorCore throughput is what matters, not DP over CUDA cores.
     
    DavidGraham likes this.
  5. DegustatoR

    Veteran

    Joined:
    Mar 12, 2002
    Messages:
    3,242
    Likes Received:
    3,405
    It's impossible to say what matters and what doesn't without knowing how many transistors are assigned to which h/w unit and how they contribute to the rated 400W power draw figure.

    If we compare transistor numbers directly, then there's an obvious transistors-per-watt gain with Ampere - but again, it's hard to say how uniform that gain is or how it will translate into graphics-oriented GPUs.
     
  6. DavidGraham

    Veteran

    Joined:
    Dec 22, 2009
    Messages:
    3,976
    Likes Received:
    5,213
    The 1st iteration of NVLink alone accounts for about 50W of power, which is the difference between the V100 NVLink and V100S PCI-E power consumption figures. The 2nd iteration will probably need even more power than that.

    An obvious one, and a huge one at that.
    Tensor core count has been reduced in the A100: it now stands at 432 vs ~640 in V100, and they even run at lower clocks. We can infer from that that a significant amount of transistor budget went into the new tensor units in A100 to improve their per-clock throughput; they also now support significantly more formats, which likewise requires a significant transistor budget.
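    The "more work per tensor core" inference can be back-of-enveloped from the public peak numbers. A sketch - the peak TFLOPS figures and boost clocks below are spec-sheet assumptions (V100: ~125 FP16 tensor TFLOPS, 640 cores at ~1.53GHz; A100: 312 TFLOPS, 432 cores at ~1.41GHz):

```python
# FP16 FLOPs per tensor core per clock, inferred from peak throughput.
# Peak numbers and boost clocks are assumptions taken from spec sheets.
def flops_per_core_per_clock(peak_tflops: float, cores: int, clock_ghz: float) -> float:
    return peak_tflops * 1e12 / (cores * clock_ghz * 1e9)

v100 = flops_per_core_per_clock(125, 640, 1.53)  # roughly 128 FLOPs/clock/core
a100 = flops_per_core_per_clock(312, 432, 1.41)  # roughly 512 FLOPs/clock/core
print(f"V100: {v100:.0f}, A100: {a100:.0f}")
```

    So each A100 tensor core does roughly 4x the work per clock of a V100 tensor core, which is consistent with fewer cores at lower clocks still delivering much higher peak throughput.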
     
    Pete and A1xLLcqAgt0qc2RyMz0y like this.
  7. Malo

    Malo Yak Mechanicum
    Legend Subscriber

    Joined:
    Feb 9, 2002
    Messages:
    8,931
    Likes Received:
    5,530
    Location:
    Pennsylvania
    Doesn't the required precision for the task at hand determine how fast and what you can use in A100? We're talking scientific HPC here, not related to gaming at all. If you need DP, how do the tensor cores help?
     
  8. DegustatoR

    Veteran

    Joined:
    Mar 12, 2002
    Messages:
    3,242
    Likes Received:
    3,405
    Ampere GA100 TCs support FP64 now.
     
  9. manux

    Veteran

    Joined:
    Sep 7, 2002
    Messages:
    3,034
    Likes Received:
    2,276
    Location:
    Self Imposed Exhile
    Often, but not always, that math comes down to matrix multiplies, and that's exactly what tensor cores accelerate.
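    For illustration, the core pattern a tensor core instruction accelerates is a fused multiply-accumulate of small matrix tiles, D = A x B + C. A plain-Python sketch of that math (obviously not how you'd write it for a GPU):

```python
# A tiny dense matrix multiply-accumulate: the D = A*B + C pattern
# that tensor cores implement in hardware on small tiles.
def matmul_add(A, B, C):
    n, k, m = len(A), len(B), len(B[0])
    return [[C[i][j] + sum(A[i][p] * B[p][j] for p in range(k))
             for j in range(m)] for i in range(n)]

A = [[1.0, 2.0], [3.0, 4.0]]
B = [[5.0, 6.0], [7.0, 8.0]]
C = [[0.0, 0.0], [0.0, 0.0]]
print(matmul_add(A, B, C))  # [[19.0, 22.0], [43.0, 50.0]]
```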
     
  10. Malo

    Malo Yak Mechanicum
    Legend Subscriber

    Joined:
    Feb 9, 2002
    Messages:
    8,931
    Likes Received:
    5,530
    Location:
    Pennsylvania
    OK, I thought I'd seen that in the spec sheet.
     
  11. DavidGraham

    Veteran

    Joined:
    Dec 22, 2009
    Messages:
    3,976
    Likes Received:
    5,213
    The A100 supports two significant new Tensor Core data formats: TF32 and FP64. TF32 doesn't require code changes, but FP64 does.

    NVIDIA is instructing developers to migrate their code to the FP64 Tensor format to achieve a 2.5X increase in throughput.

    https://agenparl.eu/double-precision-tensor-cores-speed-high-performance-computing/
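    As a rough illustration of why TF32 is a drop-in replacement for FP32 code: it keeps FP32's 8-bit exponent (so dynamic range is unchanged) but carries only 10 mantissa bits. A sketch that emulates the precision loss by truncating the low mantissa bits of an FP32 encoding (an assumption for illustration - the real hardware rounds rather than truncates):

```python
import struct

def tf32_truncate(x: float) -> float:
    # FP32 has 23 mantissa bits; TF32 keeps the top 10, so zero the
    # low 13 bits of the IEEE-754 single-precision bit pattern.
    bits = struct.unpack('<I', struct.pack('<f', x))[0]
    return struct.unpack('<f', struct.pack('<I', bits & ~0x1FFF))[0]

print(tf32_truncate(1.0 + 2**-10))  # representable in 10 mantissa bits: survives
print(tf32_truncate(1.0 + 2**-11))  # precision lost: collapses to 1.0
```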
     
    Pete, nnunn, BRiT and 2 others like this.
  12. troyan

    Regular

    Joined:
    Sep 1, 2015
    Messages:
    605
    Likes Received:
    1,126
  13. BRiT

    BRiT (>• •)>⌐■-■ (⌐■-■)
    Moderator Legend Alpha

    Joined:
    Feb 7, 2002
    Messages:
    20,511
    Likes Received:
    24,411
  14. pharma

    Veteran

    Joined:
    Mar 29, 2004
    Messages:
    4,891
    Likes Received:
    4,539
    manux and BRiT like this.
  15. troyan

    Regular

    Joined:
    Sep 1, 2015
    Messages:
    605
    Likes Received:
    1,126
  16. manux

    Veteran

    Joined:
    Sep 7, 2002
    Messages:
    3,034
    Likes Received:
    2,276
    Location:
    Self Imposed Exhile
    This snippet from the whitepaper is neat

     
    JoeJ, BRiT and pharma like this.
  17. trinibwoy

    trinibwoy Meh
    Legend

    Joined:
    Mar 17, 2004
    Messages:
    12,058
    Likes Received:
    3,116
    Location:
    New York
    Exactly. Unless we know what workloads result in 400W power consumption, the numbers don't tell us anything. If I had to guess, the peak power numbers probably correspond to max tensor throughput, given the massive amount of data movement required.
     
  18. I still think it's odd that this new 7nm GPU is clocking lower than nvidia's very first implementation of 16FF, but sure, it could be due to it being so massively wide.
    We'll wait and see how the consumer GPUs come out.
     
  19. xpea

    Regular

    Joined:
    Jun 4, 2013
    Messages:
    551
    Likes Received:
    783
    Location:
    EU-China
    We don't know about gaming yet, but am I the only one seeing MA-SSI-VE architecture changes that provide a HU-GE performance jump, and thus an efficiency jump, in the intended workloads?

    A 3 to 7 times real-world performance gain on BERT training/inference is above expectations
     
    Konan65 likes this.

  • About Us

    Beyond3D has been around for over a decade and prides itself on being the best place on the web for in-depth, technically-driven discussion and analysis of 3D graphics hardware. If you love pixels and transistors, you've come to the right place!

    Beyond3D is proudly published by GPU Tools Ltd.