Nvidia Ampere Discussion [2020-05-14]

Discussion in 'Architecture and Products' started by Man from Atlantis, May 14, 2020.

  1. techuse

    Regular Newcomer

    Joined:
    Feb 19, 2013
    Messages:
    280
    Likes Received:
    162
    We don't have any real performance comparisons, so it's unanswerable.
     
  2. Kaotik

    Kaotik Drunk Member
    Legend

    Joined:
    Apr 16, 2003
    Messages:
    9,137
    Likes Received:
    3,036
    Location:
    Finland
    While traditional FP32 (or FP64 for that matter) certainly wasn't the main target for A100, that 2.5x transistor budget turned into 24% higher theoretical performance at 33% higher consumption (partly disabled chip, but enabling the rest doesn't come for free either).

    So from a gaming perspective it looks more far-fetched than ever based on A100, but A100 doesn't necessarily reflect the gaming models at all, and these are just theoretical numbers we're looking at.
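    Quick sanity check on those percentages, using the publicly quoted spec-sheet figures (peak numbers, so take them as rough):

        # Rough check of "24% faster at 33% more power" (quoted peak specs)
        v100_fp32_tflops, v100_tdp_w = 15.7, 300   # V100 SXM2
        a100_fp32_tflops, a100_tdp_w = 19.5, 400   # A100 SXM
        print(a100_fp32_tflops / v100_fp32_tflops)  # ~1.24 -> ~24% higher FP32
        print(a100_tdp_w / v100_tdp_w)              # ~1.33 -> ~33% more power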
     
    Lightman and BRiT like this.
  3. Samwell

    Newcomer

    Joined:
    Dec 23, 2011
    Messages:
    127
    Likes Received:
    154
    It's hard to really guess much out of this data. I would look at it as a 33% increase in power from V100 NVLink (300W) to A100 NVLink (400W). But NVLink is already a big confounder: the power draw of 600GB/s NVLink should be massive. A PCIe A100 would make it easier to compare.
    Ignoring this and calculating TDP per transistor, we could imagine that 7nm DUV with Ampere gets ~1.9x the transistors at the same TDP with slightly lower clocks (A100 vs V100). But for consumer cards they won't reduce clock speeds and might even increase them.
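    A quick sketch of that TDP-per-transistor estimate, with the publicly quoted transistor counts (rough figures, assumed):

        # ~1.9x transistors per watt, A100 vs V100 (public spec figures)
        v100_transistors_b, v100_tdp_w = 21.1, 300   # V100 SXM2
        a100_transistors_b, a100_tdp_w = 54.2, 400   # A100 SXM
        ratio = (a100_transistors_b / a100_tdp_w) / (v100_transistors_b / v100_tdp_w)
        print(ratio)  # ~1.93 -> ~1.9x the transistors at the same TDP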
     
  4. DavidGraham

    Veteran

    Joined:
    Dec 22, 2009
    Messages:
    3,289
    Likes Received:
    3,550
    That's the wrong angle; you don't look at traditional FP32 in an AI-optimized chip. So looking at the transistor budget is enough for now, IMO. What we know so far, based on the shown numbers, is that NVIDIA crammed in 2.5X the number of transistors while only using 33% more power (a considerable portion of which is coming from the beefed-up NVLink). That's a ton of power efficiency right there.

    If we care about comparing similar performance metrics, then FP16 and INT8 should be enough; those also got an increase of 2.5X (compared to V100 and Titan RTX respectively), despite the reduction in Tensor core count in A100 compared to both of them.
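    Checking that against the publicly quoted dense tensor throughput (spec-sheet numbers, so approximate):

        # Peak dense tensor throughput (quoted, not measured)
        a100_fp16, v100_fp16 = 312.0, 125.0        # TFLOPS
        a100_int8, titan_rtx_int8 = 624.0, 261.0   # TOPS
        print(a100_fp16 / v100_fp16)       # ~2.5x FP16 vs V100
        print(a100_int8 / titan_rtx_int8)  # ~2.4x INT8 vs Titan RTX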

    They will definitely increase them, just like the situation with Volta/Turing. In fact, if you think about it, we are in the exact same situation here: V100 was devoid of RT cores, and Turing was the gaming/workstation version of V100 with added RT cores and higher clocks. In the same way, A100 is devoid of RT cores, and we are just waiting for the gaming version of Ampere.
     
    nnunn, Cuthalu, pharma and 1 other person like this.
  5. ShaidarHaran

    ShaidarHaran hardware monkey
    Veteran

    Joined:
    Mar 31, 2007
    Messages:
    4,003
    Likes Received:
    51
    Fair enough. Unfortunate, but I'm not surprised NV would not disclose this information.
     
  6. nAo

    nAo Nutella Nutellae
    Veteran

    Joined:
    Feb 6, 2002
    Messages:
    4,333
    Likes Received:
    128
    Location:
    San Francisco
    All major DL frameworks, which call into NVIDIA's APIs, will automatically use TF32 by default.

    The value is not having to change your application. BF16 doesn't tend to work well across a wide range of networks without manual intervention. Most DL developers are not performance experts. They just want their code to work well and fast out of the box.
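    For illustration, this is the kind of zero-code-change behavior a framework can expose; a sketch using the TF32 backend flags PyTorch later shipped for Ampere (the flags are PyTorch's, not part of the A100 announcement itself):

        import torch

        # On Ampere, FP32 matmuls/convolutions can be routed through TF32
        # tensor cores underneath unchanged FP32 code; opting out is one flag.
        torch.backends.cuda.matmul.allow_tf32 = True  # cuBLAS matmuls
        torch.backends.cudnn.allow_tf32 = True        # cuDNN convolutions

        a = torch.randn(1024, 1024, device="cuda")
        b = torch.randn(1024, 1024, device="cuda")
        c = a @ b  # runs on TF32 tensor cores; inputs/outputs stay FP32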
     
    pharma likes this.
  7. Voxilla

    Regular

    Joined:
    Jun 23, 2007
    Messages:
    751
    Likes Received:
    320
    Any references to articles that show BF16 doesn't work well for a wide range of networks?
    Google's TPU2 and TPU3 are exclusively BF16.
    Intel adopted BF16 with Habana and AVX-512, as has ARM.
    It is quite remarkable that the Ampere introduction video did not even mention BF16, as if it's not even there.
     
    ethernity likes this.
  8. nAo

    nAo Nutella Nutellae
    Veteran

    Joined:
    Feb 6, 2002
    Messages:
    4,333
    Likes Received:
    128
    Location:
    San Francisco
    AFAIK TPUs go faster if you use BF16, but they also support FP32.
    Google says BF16 is close to a drop-in replacement for FP32, so it's not positioned the way TF32 is.

    Evidently with Ampere the focus is to first enable the vast majority of DL practitioners who just want to get great performance out of the box, without having to change a line of code.
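    To make the contrast concrete, here is the kind of per-region opt-in BF16 typically needs (sketched with PyTorch's later autocast API; an illustration by assumption, not something from the A100 launch):

        import torch

        model = torch.nn.Linear(1024, 1024).cuda()
        x = torch.randn(8, 1024, device="cuda")

        # BF16 is an explicit opt-in: the developer chooses which regions
        # run in reduced precision. TF32 instead kicks in underneath
        # existing FP32 code with no source changes at all.
        with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
            y = model(x)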
     
  9. Frenetic Pony

    Regular Newcomer

    Joined:
    Nov 12, 2011
    Messages:
    499
    Likes Received:
    220
    Eh, they already pushed the Super cards pretty hard; the 2080 Super already draws 250 watts. While I think RDNA1 taught them their lesson about power draw versus performance (namely that 90% or more of consumers don't give a shit and just want to "go fast"), the difference in the fmax curve between 7nm and 12nm isn't actually that great. While I'd expect them to push the chips pretty hard, much like the Super versions, I doubt we'll see frequencies much beyond those cards.

    Part of this is that the consumer cards will apparently be the same architecture. And while A100 has just over double the transistor count, clock speed actually went down from V100 while power draw increased by 33%, which suggests there's no particular focus on getting more frequency versus power draw out of the silicon.
     
  10. JoeJ

    Regular Newcomer

    Joined:
    Apr 1, 2018
    Messages:
    981
    Likes Received:
    1,108
    Lower-precision ALUs take much less power AFAIK, so with more chip area spent on those we cannot compare to previous generations directly.

    Predictions about the consumer parts are quite hard to make yet, I think? I guess they trade tensor area vs. RT, but who knows. Maybe consumer even ends up at higher FP32 perf.
    The way Mr. Jensen presented DLSS as the solution to making RT practical at least makes me pretty certain tensor cores will not be removed or shrunk in comparison to Turing.

    These are pretty interesting times. RT and ML on one side, totally unexpected success of and demand for traditional compute on the other (UE5). Surely hard to pick the sweet spot of compromise here.
     
    ethernity and chris1515 like this.
  11. PSman1700

    Veteran Newcomer

    Joined:
    Mar 22, 2019
    Messages:
    2,717
    Likes Received:
    905
    As implied by DF, both. On PC at least.
     
  12. Benetanegia

    Regular Newcomer

    Joined:
    Sep 4, 2015
    Messages:
    288
    Likes Received:
    189
    They also take far fewer transistors and occupy a much smaller die area, so it basically evens out.
     
  13. JoeJ

    Regular Newcomer

    Joined:
    Apr 1, 2018
    Messages:
    981
    Likes Received:
    1,108
    Hmm, yeah - not sure if I interpreted the quote properly.
    Sure, but for entry-level and midrange the ratio matters (assuming all models get the features this time), and I could only guess what the right ratio should be. It's only harder this gen.
     
    PSman1700 likes this.
  14. PSman1700

    Veteran Newcomer

    Joined:
    Mar 22, 2019
    Messages:
    2,717
    Likes Received:
    905
    Hardware RT probably has a future; it would be a bit meh if it sits practically unused for the next 7 years :)

    Though what they did in the UE5 demo was just as amazing. But I guess hardware RT will be faster; maybe use it for just reflections etc.
     
  15. JoeJ

    Regular Newcomer

    Joined:
    Apr 1, 2018
    Messages:
    981
    Likes Received:
    1,108
    UE5 really shakes things up, maybe even harder than RTX did. It's a game changer, and it changes a lot of things that are not obvious at first, for example NV's lead over AMD.
    The primary reason for this lead is better rasterization performance (and other fixed-function stuff like tessellation). But I assume Nanite draws only a small number of triangles, and most is rendered from compute.
    If this is true, and I have no reason to assume Ampere compute performance could beat RDNA2 by a large factor or at all, the picture could change.
    NV could no longer afford experiments like tensor cores so easily. And a lead in RT perf would also weigh less, because GI has more effect on the overall image, and compute can do diffuse GI better than RT. (Personal opinion, but Lumen confirms it a bit.)

    With AMD offering RT too, devs will come up with their own upscaling solutions; e.g., there are more options, like RT at half resolution but visibility at full resolution, which have not been explored but make sense (see the sketch below).
    DLSS is vendor-locked, so it is no longer an option there. NV can no longer rely on it to be the system seller for tensor cores that sit unused during most of the frame.
    So I guess they'll come up with ML denoising, but it's not guaranteed it will become the norm. If denoising benefited that much from ML, we should have seen it already during Turing.
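    To illustrate the half-res idea: a toy depth-guided upsample in the spirit of joint bilateral filtering (my own NumPy sketch, not from any shipping renderer; assumes even resolutions and scalar GI):

        import numpy as np

        def upsample_half_res_gi(gi_half, depth_half, depth_full, sigma=0.05):
            # Upsample half-res GI to full res, weighting each sample by how
            # well its depth matches the full-res depth buffer, so lighting
            # does not bleed across silhouettes resolved at full resolution.
            H, W = depth_full.shape
            gi_up = np.repeat(np.repeat(gi_half, 2, axis=0), 2, axis=1)[:H, :W]
            d_up = np.repeat(np.repeat(depth_half, 2, axis=0), 2, axis=1)[:H, :W]
            w = np.exp(-((d_up - depth_full) ** 2) / (2.0 * sigma ** 2))
            # Toy fallback where depths disagree; a real filter would search
            # neighbouring half-res samples instead of using a global mean.
            return w * gi_up + (1.0 - w) * gi_up.mean()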

    In the worst case for NV, we soon end up with 20TF flagships from both NV and AMD, and AMD might achieve this with smaller and cheaper chips if NV increases the chip area spent on tensor cores.

    So that's why I think the competition between NV and AMD will become more interesting this time.


    UE5 plus RT is also interesting. To make this level of detail traceable, we need some options for LOD and to stream BVH per level, not only per object.
    DX12U is not enough.
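    A hypothetical sketch of what per-level BVH streaming could look like on the app side (every name here is made up to illustrate the idea; current DXR exposes nothing like it):

        from dataclasses import dataclass, field

        @dataclass
        class ClusterLOD:
            level: int        # 0 = finest detail
            blas_blob: bytes  # prebuilt/streamed bottom-level BVH for this LOD

        @dataclass
        class StreamedObject:
            lods: list = field(default_factory=list)

            def select_blas(self, distance, lod_scale=50.0):
                # Pick a coarser LOD as the object recedes; only that level's
                # BVH needs to be resident, instead of one monolithic BLAS.
                level = min(int(distance / lod_scale), len(self.lods) - 1)
                return self.lods[level].blas_blob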
     
  16. PSman1700

    Veteran Newcomer

    Joined:
    Mar 22, 2019
    Messages:
    2,717
    Likes Received:
    905
    That's good, because there's basically none atm.
     
    Rootax likes this.
  17. JoeJ

    Regular Newcomer

    Joined:
    Apr 1, 2018
    Messages:
    981
    Likes Received:
    1,108
    In that sense my speculations may be a bit optimistic :D
     
    Rootax and PSman1700 like this.
  18. DavidGraham

    Veteran

    Joined:
    Dec 22, 2009
    Messages:
    3,289
    Likes Received:
    3,550
    3 years ago, if you had looked at V100's frequencies alone, you wouldn't have thought Turing's clocks would reach as high as they did. Gaming chips will have a completely different configuration.

    These V100 and A100 chips are also passively cooled; you can't expect anyone to go all out on frequency with such cooling solutions.

    As explained above, a big part of this increase is the upgraded NVLink with 600GB/s. For example, V100 NVLink increased power consumption over the V100S PCIe by 50W.
     
    #98 DavidGraham, May 18, 2020
    Last edited: May 18, 2020
    A1xLLcqAgt0qc2RyMz0y likes this.
  19. pharma

    Veteran Regular

    Joined:
    Mar 29, 2004
    Messages:
    3,579
    Likes Received:
    2,298
    I agree, and AMD is already jumping the gun in that regard. The recent debacle with Radeon Rays 4.0 losing its open-source status, and AMD's re-think due to the community backlash, indicates AMD may be trying to secure a strategic software presence for future architectural endeavors.
     
    Lightman likes this.
  20. troyan

    Newcomer

    Joined:
    Sep 1, 2015
    Messages:
    151
    Likes Received:
    241
    Metro's RTGI is more stable, faster, and includes a better RTAO. I don't see Lumen as being as important as Nanite.
     
    pharma and PSman1700 like this.