Nvidia Ampere Discussion [2020-05-14]

Discussion in 'Architecture and Products' started by Man from Atlantis, May 14, 2020.

  1. manux

    manux Veteran

    Would this mean the ability to program the GPU directly with C++ instead of using CUDA?





    edit. Digging a bit more
     
  2. So from the specs it looks like the A100 is running at 1425MHz.
    That's lower than any V100 16FFN implementation I know of (1440-1455MHz), even though the A100 is on 7nm and can draw 100W more than the highest-consuming V100 (mezzanine, 300W).

    So with 400W maximum power and 19.5 FP32 TFLOPs, the A100 comes out at 20.5 W per TFLOP. The 3-year-old V100, with 300W max power and 14.9 FP32 TFLOPs, comes out at 20.1 W per TFLOP.
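    The arithmetic behind those watts-per-TFLOP figures can be checked in a few lines, using only the numbers quoted above:

    ```python
    # Watts per FP32 TFLOP at maximum board power, from the figures in the post.
    a100_w_per_tflop = 400 / 19.5   # ~20.5 W per FP32 TFLOP
    v100_w_per_tflop = 300 / 14.9   # ~20.1 W per FP32 TFLOP
    print(f"A100: {a100_w_per_tflop:.1f}  V100: {v100_w_per_tflop:.1f}")
    ```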

    Did FP32 power efficiency actually go down in the transition from 16FFN Volta to 7nm Ampere?
    Or does that 400W only apply when the GPU is running its tensor units fully in parallel with the FP32 ALUs (and is that even possible without hitting L2 cache / VRAM bottlenecks)?

    This is just the first data point of many, but it could be that nvidia's delay in adopting 7nm comes down to not being able to reach higher clocks on 7nm than they could on 16/12FFN.


    @Nebuchadnezzar wrote that the A100 can go up to 400W, though he also assumed the GPUs in Robotaxi are A100s, which they don't seem to be if you compare the pictures.
    The dGPUs in Robotaxi look substantially smaller, maybe around 600mm^2.
     
    BRiT likes this.
  3. troyan

    troyan Regular

    FP32 and DP compute performance is not relevant for the A100. With Tensor Cores (FP16), or FP16 alone, compute performance is 2.5x higher. That works out to 87.5% higher efficiency, with 1.6TB/s of bandwidth...
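    A sketch of where an "87.5% higher efficiency" figure can come from, assuming the quoted ~2.5x peak FP16/tensor throughput gain and the 400W vs 300W board power limits:

    ```python
    # Efficiency gain = throughput gain divided by power increase, minus one.
    throughput_ratio = 2.5             # assumed A100 vs V100 peak FP16 tensor throughput
    power_ratio = 400 / 300            # max board power ratio
    efficiency_gain = throughput_ratio / power_ratio - 1
    print(f"{efficiency_gain:.1%}")    # 87.5%
    ```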
     
    DavidGraham and PSman1700 like this.
  4. troyan

    troyan Regular

    TensorCore throughput matters, not DP over CUDA cores.
     
    DavidGraham likes this.
  5. DegustatoR

    DegustatoR Veteran

    It's impossible to say what matters and what doesn't without knowing how many transistors are assigned to each h/w unit and how they contribute to the rated 400W power draw figure.

    If we compare transistor numbers directly then there's an obvious transistors/watt gain with Ampere - but again it's hard to say how uniform that gain is or how it will translate into graphics oriented GPUs.
     
  6. DavidGraham

    DavidGraham Veteran

    The 1st iteration of NVLink alone accounts for about 50W, which is the difference between the V100 NVLink and V100S PCI-E power consumption figures. The 2nd iteration will probably need even more power than that.

    An obvious gain, and a huge one at that.
    Tensor core count has actually been reduced in the A100: it now stands at 432 vs ~640 in the V100, and they even run at lower clocks. From that we can infer that a significant amount of transistor budget went into the new tensor units to improve their IPC. They also now support significantly more formats, which likewise requires significant transistor budget.
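    The per-unit improvement can be estimated from the published peaks. Assuming the commonly quoted figures (V100: 640 tensor cores at ~1530MHz boost for ~125 FP16 TFLOPS; A100: 432 tensor cores at ~1410MHz boost for ~312 dense FP16 TFLOPS), a rough sketch:

    ```python
    def flops_per_tc_per_clock(peak_tflops, num_tc, boost_mhz):
        # FLOPs each tensor core must deliver per clock to hit the rated peak.
        return peak_tflops * 1e12 / (num_tc * boost_mhz * 1e6)

    v100 = flops_per_tc_per_clock(125, 640, 1530)   # ~128 FLOPs/clk (64 FMAs)
    a100 = flops_per_tc_per_clock(312, 432, 1410)   # ~512 FLOPs/clk (256 FMAs)
    print(round(a100 / v100, 1))                    # ~4x per tensor core
    ```

    So despite fewer units at lower clocks, each A100 tensor core does roughly 4x the work per clock, consistent with a bigger per-unit transistor budget.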
     
    Pete and A1xLLcqAgt0qc2RyMz0y like this.
  7. Malo

    Malo Yak Mechanicum Legend Subscriber

    Doesn't the required precision for the task at hand determine how fast you can go and what you can use on the A100? We're talking scientific HPC here, not related to gaming at all. If you need DP, how do the tensor cores help?
     
  8. DegustatoR

    DegustatoR Veteran

    Ampere GA100 TCs support FP64 now.
     
  9. manux

    manux Veteran

    Often, though not always, that math is about matrix multiplies, and that's exactly what tensor cores accelerate.
     
  10. Malo

    Malo Yak Mechanicum Legend Subscriber

    OK, I thought I had seen that in the spec sheet.
     
  11. DavidGraham

    DavidGraham Veteran

    The A100 supports two significant new Tensor Core data formats: TF32 and FP64. TF32 doesn't require code changes, but FP64 does.

    NVIDIA is instructing developers to migrate their code to the FP64 Tensor format to achieve a 2.5X increase in throughput.

    https://agenparl.eu/double-precision-tensor-cores-speed-high-performance-computing/
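    The reason TF32 works without code changes is that it keeps FP32's sign bit and 8-bit exponent (so the range is identical) and only shortens the mantissa to 10 bits. A minimal sketch of that reduction in Python, using simple truncation rather than the round-to-nearest the hardware actually performs:

    ```python
    import struct

    def to_tf32(x: float) -> float:
        # Reinterpret as a 32-bit float, then zero the low 13 mantissa bits,
        # keeping sign + 8-bit exponent + the top 10 mantissa bits (TF32).
        bits = struct.unpack('<I', struct.pack('<f', x))[0]
        return struct.unpack('<f', struct.pack('<I', bits & 0xFFFFE000))[0]
    ```

    Values exactly representable in 10 mantissa bits pass through unchanged; everything else loses only the low-order precision, which is why most FP32 code tolerates it.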
     
    Pete, nnunn, BRiT and 2 others like this.
  12. troyan

    troyan Regular

  13. BRiT

    BRiT (>• •)>⌐■-■ (⌐■-■) Moderator Legend Alpha

  14. pharma

    pharma Veteran

    manux and BRiT like this.
  15. troyan

    troyan Regular

  16. manux

    manux Veteran

    This snippet from the whitepaper is neat

     
    JoeJ, BRiT and pharma like this.
  17. trinibwoy

    trinibwoy Meh Legend

    Exactly. Unless we know what workloads result in 400W power consumption, the numbers don't tell us anything. If I had to guess, the peak power numbers probably correspond to max tensor throughput, given the massive amount of data movement required.
     
  18. I still think it's odd that this new 7nm GPU clocks lower than nvidia's very first implementation of 16FF, but sure, it could be due to the chip being so massively wide.
    We'll wait and see how the consumer GPUs come out.
     
  19. xpea

    xpea Regular

    We don't know about gaming yet, but am I the only one seeing MA-SSI-VE architecture changes that provide a HU-GE performance jump, and thus an efficiency jump, in the intended workloads?

    [Attached: four benchmark screenshots]
    A 3x to 7x real-world performance gain on BERT training/inference is above expectations
     
    Konan65 likes this.