Nvidia Ampere Discussion [2020-05-14]

Discussion in 'Architecture and Products' started by Man from Atlantis, May 14, 2020.

  1. OlegSH

    Regular

    Joined:
    Jan 10, 2010
    Messages:
    797
    Likes Received:
    1,624
    From what I understand, what is depicted in this image is the possibility of using the new async modes to overlap all this stuff; "concurrency" is probably not the best wording for it.
     
  2. CarstenS

    Legend Subscriber

    Joined:
    May 31, 2002
    Messages:
    5,800
    Likes Received:
    3,920
    Location:
    Germany
    I also wonder how this holds up wrt power requirements. And if I'm not mistaken, this means the tensor cores either have to apply DLSS for the previous frame or are working on something else, like AI denoising, for the current frame.
     
    Lightman likes this.
  3. Isn't AI denoising part of DLSS, and isn't DLSS always a post-processing effect?
     
  4. dorf

    Newcomer

    Joined:
    Dec 21, 2019
    Messages:
    126
    Likes Received:
    417
    I don't think so. Maybe this elucidates the matter somewhat (timestamped):
     
    Cyan, Krteq, BRiT and 3 others like this.
  5. MDolenc

    Regular

    Joined:
    May 26, 2002
    Messages:
    696
    Likes Received:
    446
    Location:
    Slovenia
    FP16 is 2x faster, so no. Bandwidth for matrix multiplication depends a lot on how large a chunk of both matrices you can keep as close to the ALUs as possible; in the context of tensor cores this basically means the register file directly. If I remember correctly, there were some investigations around here showing that the code to get the absolute maximum out of the tensor cores loaded data into registers twice to avoid bank conflicts.
    The 4x FP16 rate is 2 general-purpose FP16 ops from the scalar path plus 2 additional FP16 ops from the tensor cores. But the tensor cores are capable of more than that: looking at A100, there are 64 FP16x2 units per SM and 4 tensor cores, each capable of 256 ops.
     
    DegustatoR and BRiT like this.
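    A quick back-of-the-envelope check of the per-SM unit counts quoted above (a sketch: the 1.41 GHz boost clock and 108-SM count are assumptions taken from the A100 whitepaper, not from the post):

    ```python
    # Rough check of the A100 per-SM FP16 figures quoted above.
    # Assumes 2 FLOPs per FMA; clock and SM count (1.41 GHz, 108 SMs)
    # are assumed A100 whitepaper values, not from the post itself.

    FP16X2_UNITS = 64        # general-purpose FP16x2 units per SM
    TENSOR_CORES = 4         # tensor cores per SM
    TC_FMA_PER_CLOCK = 256   # FP16 FMAs per tensor core per clock

    # FMAs per SM per clock
    scalar_fma = FP16X2_UNITS * 2                  # 128 from the SIMD path
    tensor_fma = TENSOR_CORES * TC_FMA_PER_CLOCK   # 1024 from tensor cores

    SMS, CLOCK_GHZ = 108, 1.41
    scalar_tflops = scalar_fma * 2 * SMS * CLOCK_GHZ / 1e3  # 2 FLOPs per FMA
    tensor_tflops = tensor_fma * 2 * SMS * CLOCK_GHZ / 1e3

    print(round(scalar_tflops))  # ~39 TFLOPS general-purpose FP16
    print(round(tensor_tflops))  # ~312 TFLOPS FP16 tensor (dense)
    ```

    Both results line up with the whitepaper throughput numbers discussed later in the thread.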
  6. Picao84

    Veteran

    Joined:
    Feb 15, 2010
    Messages:
    2,109
    Likes Received:
    1,195
    Just noticed the RTX3080 requires a 750W power supply. I have a 650W one and my CPU is Ryzen 3900X. I wonder if my power supply would be constrained if I got a 3080. What do you think?
     
  7. pjbliverpool

    pjbliverpool B3D Scallywag
    Legend

    Joined:
    May 8, 2005
    Messages:
    9,235
    Likes Received:
    4,259
    Location:
    Guess...
    Yes, and the graph shows the tensor work taking place at the same time as the RT work with a slight lag. So RT denoising makes sense rather than DLSS.
     
    PSman1700 likes this.
  8. Man from Atlantis

    Regular

    Joined:
    Jul 31, 2010
    Messages:
    960
    Likes Received:
    853
    With the +15% power limit the 3080 is almost at 370W. An overclocked 3900X is around 200W, so you are already near 90% load on the PSU.
     
    Lightman and PSman1700 like this.
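    The headroom estimate above, spelled out (a sketch using the posters' own figures, which ignores drives, fans and other board power):

    ```python
    # PSU headroom for the build discussed above: a 3080 at its raised
    # +15% power limit plus an overclocked Ryzen 3900X on a 650 W supply.
    # All wattages are the posters' estimates, not measurements.

    PSU_W = 650
    GPU_W = 370   # 3080 with +15% power limit
    CPU_W = 200   # overclocked 3900X estimate from the post

    load = GPU_W + CPU_W                # ignores drives, fans, etc.
    print(load, f"{load / PSU_W:.0%}")  # 570 88%
    ```

    At stock CPU load (the ~142W mentioned below) the margin is a bit more comfortable, but transient spikes are the usual worry at this level.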
  9. pjbliverpool

    pjbliverpool B3D Scallywag
    Legend

    Joined:
    May 8, 2005
    Messages:
    9,235
    Likes Received:
    4,259
    Location:
    Guess...
    24:00 - "With Nvidia RTX IO, vast worlds will load instantly. Picking up where you left off will be instant."

    A hint that quick resume might be coming to PC? Please let it be!!

     
    Lightman, BRiT and PSman1700 like this.
  10. iroboto

    iroboto Daft Funk
    Legend Subscriber

    Joined:
    Mar 6, 2014
    Messages:
    14,833
    Likes Received:
    18,633
    Location:
    The North
    Separate, if I understand correctly. DLSS occurs around the same time an antialiasing algo would normally occur. It’s meant to slot in as a replacement.

    AI denoising should occur during/just after ray tracing step.
     
    Dictator and PSman1700 like this.
  11. LeStoffer

    Veteran

    Joined:
    Feb 6, 2002
    Messages:
    1,262
    Likes Received:
    22
    Location:
    Land of the 25% VAT
    Yeah, but you will have to wait on DirectStorage according to Microsoft: "We’re targeting getting a development preview of DirectStorage into the hands of game developers next year."

    So late 2021 at the earliest, methinks.
     
  12. Picao84

    Veteran

    Joined:
    Feb 15, 2010
    Messages:
    2,109
    Likes Received:
    1,195
    I'm not overclocking the CPU, although I've seen max load at 142W. True that with the boost clocks on the 3080. The 3070 is a much safer proposition indeed. I would wait for a 3070 with 16GB, but my brother's birthday is close, I'm travelling back home and wanted to surprise him with my current GTX1060. Might replace with his GTX750Ti meanwhile maybe...
     
  13. CarstenS

    Legend Subscriber

    Joined:
    May 31, 2002
    Messages:
    5,800
    Likes Received:
    3,920
    Location:
    Germany
    Maybe I don't understand you right, but FP16 throughput for A100 is 78 TFlops as per the Ampere whitepaper. Half of that comes from the FP32-SIMDs in the SM; the TCs deliver the other half: 39 TFlops. With FP16-Tensor math, they are at 312 TFlops w/o sparsity. So my take on it is that that's far from their maximum throughput. That has to have an effect on RF usage.
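    The utilization argument above in numbers (a sketch using the whitepaper figures as quoted in the post):

    ```python
    # In plain-FP16 mode the tensor cores contribute half of the 78 TFLOPS
    # total, which is only a small fraction of their 312 TFLOPS FP16-tensor
    # peak (dense, w/o sparsity).

    total_fp16 = 78.0          # TFLOPS, general-purpose FP16 (SIMD + TC)
    tc_share = total_fp16 / 2  # 39 TFLOPS coming from the tensor cores
    tc_peak = 312.0            # TFLOPS, FP16 tensor math w/o sparsity

    print(f"{tc_share / tc_peak:.1%}")  # 12.5% of tensor-core peak
    ```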
     
  14. pjbliverpool

    pjbliverpool B3D Scallywag
    Legend

    Joined:
    May 8, 2005
    Messages:
    9,235
    Likes Received:
    4,259
    Location:
    Guess...
  15. Picao84

    Veteran

    Joined:
    Feb 15, 2010
    Messages:
    2,109
    Likes Received:
    1,195
    Cyan likes this.
  16. trinibwoy

    trinibwoy Meh
    Legend

    Joined:
    Mar 17, 2004
    Messages:
    12,055
    Likes Received:
    3,112
    Location:
    New York
    disco_ and Lightman like this.
  17. MDolenc

    Regular

    Joined:
    May 26, 2002
    Messages:
    696
    Likes Received:
    446
    Location:
    Slovenia
    So if I understand you correctly, you mean that general-purpose FP16 should be even higher than 4x (78 TFlops), but it's not due to RF usage/bandwidth? The 312 TFlops figure does burn pretty much all the RF bandwidth, and the 78 TFlops figure does not. But TCs are special pieces of hardware: for example, they require the cooperation of an entire warp, and you can't have divergence where some threads in a warp issue commands to the TC and others do not.
    If you mean that you can run RT and FP16 compute shaders at the 4x rate and occupy the RT cores, CUDA cores and tensor cores, then that's technically correct. But you're not really stressing the tensors all that much. :)
     
  18. CarstenS

    Legend Subscriber

    Joined:
    May 31, 2002
    Messages:
    5,800
    Likes Received:
    3,920
    Location:
    Germany
    Nope, I just stated that it's at that level and that it's a lower utilization than its peak (78 vs. 312 TFlops). But it obviously has enough RF bandwidth to get near its peak. So my train of thought was that the RF is not utilized 100% by pure FP16 non-MMA calculations, thus a certain amount should be available to at least partially feed the FP32-SIMDs.
     
  19. RecessionCone

    Regular Subscriber

    Joined:
    Feb 27, 2010
    Messages:
    505
    Likes Received:
    189
    CarstenS: tensor cores lower register file bandwidth per math operation because they work on tensors rather than scalars (N^3/N^2).
    Nvidia actually shows this phenomenon in their animated tensor core cartoons in their keynotes.

    So it’s likely they are close to peak RF bandwidth both at 78 scalar TFlops as well as at 312 tensor TFlops.
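    The N^3/N^2 point above, made concrete (a sketch; the tile sizes are illustrative, not actual tensor core shapes):

    ```python
    # An N x N x N matrix-multiply tile does O(N^3) FMAs while moving only
    # O(N^2) operands, so register-file traffic per math op shrinks as the
    # tile grows. That is why tensor cores can sustain far more FLOPS on
    # the same RF bandwidth than scalar FMAs can.

    def ops_per_operand(n: int) -> float:
        """FMAs per operand moved for an n x n x n matrix-multiply tile."""
        fmas = n ** 3          # one FMA per (i, j, k) triple
        operands = 3 * n ** 2  # read A and B tiles, write the C tile
        return fmas / operands

    print(ops_per_operand(1))   # scalar FMA: ~0.33 ops per operand moved
    print(ops_per_operand(16))  # 16x16x16 tile: ~5.33 ops per operand moved
    ```

    A 16x scaling in arithmetic intensity from a 16-wide tile is consistent with both modes sitting near the same RF bandwidth limit at very different FLOP rates.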
     
  20. swaaye

    swaaye Entirely Suboptimal
    Legend

    Joined:
    Mar 15, 2003
    Messages:
    9,044
    Likes Received:
    1,116
    Location:
    WI, USA
    I think AV1 is near so maybe that isn't really worthwhile now. Actually why don't they have AV1 encoding at this point? Heh
     
    #1260 swaaye, Sep 5, 2020
    Last edited: Sep 5, 2020