Nvidia Ampere Discussion [2020-05-14]

Discussion in 'Architecture and Products' started by Man from Atlantis, May 14, 2020.

  1. Kaotik

    Kaotik Drunk Member
    Legend

    Joined:
    Apr 16, 2003
    Messages:
    8,862
    Likes Received:
    2,792
    Location:
    Finland
    Isn't HPC an intended workload too? I mean, of course they focused more on the AI stuff this time around, but it's still their big HPC chip for everything else too.
    There the improvements aren't that impressive, considering the 2.5x transistor budget and higher power consumption.
    [attached image: upload_2020-5-19_22-13-56.png]

    Also, it seems the AI comparisons aren't really apples to apples: they're comparing A100 at the lower TF32 precision against BERT-Large FP32 training.
     
  2. Bondrewd

    Regular Newcomer

    Joined:
    Sep 16, 2017
    Messages:
    631
    Likes Received:
    297
    Yeah, A100 is a funky GEMM machine, but nothing snazzy besides that.
    Which works for the intended market, but may or may not piss off the wide and varied HPC crowds.
     
  3. DavidGraham

    Veteran

    Joined:
    Dec 22, 2009
    Messages:
    3,254
    Likes Received:
    3,463
    That's a 70% average speedup in HPC workloads, which is pretty good when contrasted against the mere 30% increase in pure FP32. Alongside the more considerable speedups in AI, I would say that's quite impressive, actually.
     
    A1xLLcqAgt0qc2RyMz0y likes this.
  4. ToTTenTranz

    Legend Veteran Subscriber

    Joined:
    Jul 7, 2008
    Messages:
    10,840
    Likes Received:
    5,407
    Is it? What I see is them spending the extra transistor budget on an absurd number of tensor cores that matter little to HPC, while at the same time being unable to increase clocks even when said tensor cores are not in use, thus reducing power efficiency for HPC tasks compared to their own predecessor or even Vega 20 (let alone Arcturus, which should appear this year).

    Of course, nvidia will be trying to sell the idea that every scientific calculation out there is being replaced with machine learning, though I don't know if that's true.
     
  5. Bondrewd

    Regular Newcomer

    Joined:
    Sep 16, 2017
    Messages:
    631
    Likes Received:
    297
    Eh, the GEMM cores now go brrrrrrr even in proper FP64, plus it's a solid bandwidth uptick.

    Arcturus is, yeah, a far meaner actual GPGPU part, but it has no software or documentation to speak of, so it's basically a Frontier devboard.
     
    nnunn likes this.
  6. nAo

    nAo Nutella Nutellae
    Veteran

    Joined:
    Feb 6, 2002
    Messages:
    4,331
    Likes Received:
    118
    Location:
    San Francisco
    Is it a ‘meaner’ part? I haven’t heard anything about it.
     
  7. Bondrewd

    Regular Newcomer

    Joined:
    Sep 16, 2017
    Messages:
    631
    Likes Received:
    297
    Depends on what you want from it.
    Current AMD is a very tightly run ship, yeah.
    But it's somewhere around a Q3 launch, so soon(TM).
     
  8. CarstenS

    Veteran Subscriber

    Joined:
    May 31, 2002
    Messages:
    4,913
    Likes Received:
    2,232
    Location:
    Germany
    I'm looking forward to seeing the market impact A100 and Arcturus will have, and especially in which markets, with their respective TAMs. Right now, all the rage (and thus all the money) seems to be in machine learning. I'm also wondering how much of an HPC accelerator the FP64 FMACs in the Tensor Cores will be, given they're (probably) not fully fledged FP64 units. I'm pretty sure they could be used to good effect in Linpack (yes, I know, linpack linshmack).
     
  9. Voxilla

    Regular

    Joined:
    Jun 23, 2007
    Messages:
    734
    Likes Received:
    309
    I wonder if Nvidia will create a smaller/more affordable PCIe-based AI/HPC card.
    As a thought experiment, chop the A100 in half and keep 3 HBM2 stacks.
    This would result in a ~400mm² GPU, immensely improving yield.
    With 64 SMs and a 10% clock increase, this would deliver roughly 65% of A100 performance (quick sanity check below).
    The spec would be:
    • Power <250 W
    • 24 GB at 960 GB/s
    • 200 TFLOPS FP16 / BF16
    • 100 TFLOPS TF32
    • 12.5 TFLOPS FP32
    • 6.25 TFLOPS FP64
    Price: $2K
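    A minimal sanity check of that ~65% figure, assuming the shipping A100's 108 enabled SMs and its published peak rates; everything else is simple linear scaling, which ignores bandwidth and power limits:

    ```python
    # Back-of-the-envelope check of the "65% of A100" claim above.
    # Assumes throughput scales linearly with SM count and clock - a
    # rough estimate only.

    A100_SMS = 108                 # SMs enabled on the shipping A100
    A100_PEAKS_TFLOPS = {          # Nvidia's published A100 peak rates
        "FP16/BF16 tensor": 312,
        "TF32 tensor": 156,
        "FP32": 19.5,
        "FP64": 9.7,
    }

    HALF_SMS = 64                  # the proposed half-size part
    CLOCK_SCALE = 1.10             # the proposed +10% clock bump

    scale = HALF_SMS / A100_SMS * CLOCK_SCALE
    print(f"scaling factor: {scale:.3f}")        # ~0.652, i.e. ~65%

    for fmt, peak in A100_PEAKS_TFLOPS.items():
        print(f"{fmt:>17}: {peak * scale:6.1f} TFLOPS")
    ```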
     
  10. Kaotik

    Kaotik Drunk Member
    Legend

    Joined:
    Apr 16, 2003
    Messages:
    8,862
    Likes Received:
    2,792
    Location:
    Finland
    Nah, but we've already seen the one chip with 4 stacks
     
  11. Frenetic Pony

    Regular Newcomer

    Joined:
    Nov 12, 2011
    Messages:
    452
    Likes Received:
    171
    It's definitely not; sometimes you know you need an accurate simulation. Like up to and including 64-bit-precision sim, not "an AI says this maybe is the answer based on guessing a trillion times".

    But I think they see the most money in the AI market, and designing a 7nm chip is expensive and takes damned long, so they just went for what they saw as the highest profit margins they could get, first and foremost.
     
    ToTTenTranz, ethernity and nnunn like this.
  12. keldor

    Newcomer

    Joined:
    Dec 22, 2011
    Messages:
    75
    Likes Received:
    113
    My understanding of scientific HPC algorithms is that at their heart they very often rely on huge amounts of matrix multiplication. This is not something restricted to neural nets by any means. Since the Tensor Cores are specialized matrix multiplication hardware, and since Ampere added 32- and 64-bit float support to them, they are in fact extremely well suited for HPC.
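    As a toy illustration of that point (nothing Ampere-specific, just the access pattern): a blocked power iteration, where essentially all of the runtime sits in dense FP64 matrix products, exactly the GEMM shape the FP64 tensor cores target.

    ```python
    import numpy as np

    # Toy HPC-style kernel: blocked power (subspace) iteration for the
    # dominant eigenvalues of a dense symmetric operator. Nearly all of
    # the work is in the A @ V products - plain FP64 GEMM.

    rng = np.random.default_rng(0)
    n, k = 1024, 8                       # operator size, block width
    A = rng.standard_normal((n, n))
    A = (A + A.T) / 2                    # symmetrize: real eigenvalues

    V = rng.standard_normal((n, k))
    for _ in range(50):
        V = A @ V                        # the FP64 GEMM hot spot
        V, _ = np.linalg.qr(V)           # keep the block orthonormal

    # Rayleigh quotients approximate the top-k eigenvalues
    print(np.sort(np.diag(V.T @ A @ V))[::-1])
    ```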
     
    xpea, nnunn, PSman1700 and 2 others like this.
  13. troyan

    Newcomer

    Joined:
    Sep 1, 2015
    Messages:
    149
    Likes Received:
    237
    And nVidia has published numbers from different applications and use cases. Performance improvements range between 1.5x and 2.1x for HPC. GA100 has 2.5x more transistors than GV100, so the scaling isn't bad. Considering that GA100 covers more use cases than GV100, the performance improvement is about as good as possible. I don't think just increasing the number of SMs would achieve the same.
     
  14. CarstenS

    Veteran Subscriber

    Joined:
    May 31, 2002
    Messages:
    4,913
    Likes Received:
    2,232
    Location:
    Germany
    Some applications seem to be OK with iterative solvers to achieve a desired precision. For those, if the extra throughput of your specialized cores at the lower precision more than offsets the added iterations the solver needs to reach the target, you come out with a net win as well.
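    A minimal sketch of that trade, using classic mixed-precision iterative refinement (the textbook scheme, not Nvidia's library code; a real implementation would also factor the FP32 matrix once and reuse the factors):

    ```python
    import numpy as np

    # Mixed-precision iterative refinement: do the expensive solve in
    # fast low precision, then polish the FP64 residual until the
    # answer reaches FP64-level accuracy.

    rng = np.random.default_rng(1)
    n = 512
    A = rng.standard_normal((n, n)) + n * np.eye(n)   # well-conditioned
    b = rng.standard_normal(n)

    A32 = A.astype(np.float32)           # the "fast, low-precision" path
    x = np.linalg.solve(A32, b.astype(np.float32)).astype(np.float64)

    for i in range(5):
        r = b - A @ x                    # residual computed in FP64
        dx = np.linalg.solve(A32, r.astype(np.float32))
        x += dx.astype(np.float64)
        print(f"iter {i}: residual norm = {np.linalg.norm(r):.3e}")
    ```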
     
    Voxilla likes this.
  15. Voxilla

    Regular

    Joined:
    Jun 23, 2007
    Messages:
    734
    Likes Received:
    309
    From the white paper, this one needs more explanation:
    "A100 adds Compute Data Compression. Compression saves up to 4x DRAM read/write bandwidth, up to 4x L2 read bandwidth, and up to 2x L2 capacity."
    Lacking any detail on how this is implemented, it looks like some kind of SM software-based compression/decompression.
     
  16. Voxilla

    Regular

    Joined:
    Jun 23, 2007
    Messages:
    734
    Likes Received:
    309
    Until you realize that the new tensor cores cannot replace general FP32 matrix-matrix multiplication.
    Hint: TF32 != FP32
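    One way to see the gap: simulate TF32's 10-bit mantissa by truncating FP32 inputs (hardware rounds to nearest rather than truncating, so this slightly overstates the error):

    ```python
    import numpy as np

    # TF32 keeps FP32's 8-bit exponent but only 10 mantissa bits.
    # Zeroing the low 13 mantissa bits of an FP32 value approximates
    # that input rounding; products still accumulate in FP32 here.

    def to_tf32(x):
        u = np.ascontiguousarray(x, dtype=np.float32).view(np.uint32)
        return (u & np.uint32(0xFFFFE000)).view(np.float32)

    rng = np.random.default_rng(2)
    n = 1024
    A = rng.standard_normal((n, n)).astype(np.float32)
    B = rng.standard_normal((n, n)).astype(np.float32)

    exact = A.astype(np.float64) @ B.astype(np.float64)
    for name, C in [("FP32", A @ B), ("TF32-ish", to_tf32(A) @ to_tf32(B))]:
        err = np.abs(C - exact).max() / np.abs(exact).max()
        print(f"{name:>8}: max relative error = {err:.2e}")
    ```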
     
    entity279 and ToTTenTranz like this.
  17. pjbliverpool

    pjbliverpool B3D Scallywag
    Legend

    Joined:
    May 8, 2005
    Messages:
    7,719
    Likes Received:
    929
    Location:
    Guess...
    Interesting. Sounds suspiciously like what MooresLawIsDead claimed about tensor core based VRAM compression.
     
    Konan65 likes this.
  18. Benetanegia

    Regular Newcomer

    Joined:
    Sep 4, 2015
    Messages:
    285
    Likes Received:
    184
    FP64 = FP64 though, and there the A100 tensor cores deliver 19.5 TFLOPS, which I doubt would be attainable otherwise: 40 TFLOPS FP32 with a 1:2 FP64 ratio seems unrealistic, and 1:1 seems even more unrealistic and probably a total waste of die area and power, tbh.

    At the end of the day, I really think it's more realistic to assume that Nvidia knew (from customer feedback) which formats would benefit their prospective customers the most, and where, and delivered accordingly. That's reflected in the fact that they offered FP64 support on the TCs at 2x the normal rate, while not even bothering to support FP32. Or I guess we can go around assuming it's a massive oversight and Nvidia is clueless.
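    For reference, both FP64 figures fall straight out of A100's published configuration (108 SMs, 32 FP64 ALUs per SM, ~1.41 GHz boost clock; the tensor-core DMMA path is specified at twice the ALU rate):

    ```python
    # Where A100's two FP64 numbers come from (published configuration).

    SMS = 108                        # enabled SMs on A100
    FP64_ALUS_PER_SM = 32
    BOOST_CLOCK_HZ = 1.41e9
    FLOPS_PER_FMA = 2                # a fused multiply-add counts as 2

    alu_fp64 = SMS * FP64_ALUS_PER_SM * FLOPS_PER_FMA * BOOST_CLOCK_HZ
    print(f"FP64 via ALUs:         {alu_fp64 / 1e12:.1f} TFLOPS")      # ~9.7

    # The FP64 tensor-core (DMMA) path is rated at 2x the ALU rate.
    print(f"FP64 via tensor cores: {alu_fp64 * 2 / 1e12:.1f} TFLOPS")  # ~19.5
    ```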
     
  19. xpea

    Regular Newcomer

    Joined:
    Jun 4, 2013
    Messages:
    404
    Likes Received:
    430
    According to Nvidia, they provide the same accuracy in training. From here:
    https://blogs.nvidia.com/blog/2020/05/14/tensorfloat-32-precision-format/
    In fact, I have the results of Nvidia's comparison between FP32 and TF32. Not sure I can share them since I don't see them anywhere online, but I can say that the networks trained using TF32 have the same accuracy as those trained in FP32. For AI, TF32 really is a safe replacement for FP32, with a huge speedup in performance.
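    For what it's worth, this is how the trade-off was later surfaced in frameworks: in PyTorch (1.7 and later, so after this thread), TF32 math on Ampere is a pair of global switches, and FP32 code runs unchanged:

    ```python
    import torch

    # PyTorch's TF32 switches (1.7+): tensors stay torch.float32; only
    # the internal matmul/conv math drops to TF32 on Ampere GPUs.
    torch.backends.cuda.matmul.allow_tf32 = True   # matmuls may use TF32
    torch.backends.cudnn.allow_tf32 = True         # cuDNN convs may use TF32

    a = torch.randn(1024, 1024, device="cuda")     # requires a CUDA GPU
    b = torch.randn(1024, 1024, device="cuda")
    c = a @ b    # executes on tensor cores in TF32 when enabled
    ```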
     
    pharma, DavidGraham and DegustatoR like this.
  20. ToTTenTranz

    Legend Veteran Subscriber

    Joined:
    Jul 7, 2008
    Messages:
    10,840
    Likes Received:
    5,407
    Peak FP64 is 9.7 TFLOPS, not 19.5.

    19.5 TFLOPS FP64 is for deep learning training.
    If their FP64 tensor cores were capable of non-ML tasks, why would they even need any FP64 ALUs at all?



    Or GA100 might just be a chip that tries really hard to compete with the likes of Google's TPU in the ML market, while going up against competitor CPUs, GPUs and dedicated accelerators in the HPC market, all while stubbornly keeping its GPU core functionality.
    That sounds great on paper, but if it meets decent opposition on more than one front, the product could end up in a difficult position, because it spread its transistors too thin fighting on different fronts and power efficiency ended up suffering on some of them.

    It's a risk, like the one AMD took with Vega 10, which didn't pay off as they expected.


    So nvidia is immune to mistakes and they're only capable of delivering perfect products?
     