Nvidia Ampere Discussion [2020-05-14]

Discussion in 'Architecture and Products' started by Man from Atlantis, May 14, 2020.

  1. Scott_Arm

    Legend

    Joined:
    Jun 16, 2004
    Messages:
    14,798
    Likes Received:
    6,984
    If the games are being bottlenecked by bandwidth or the raster engines, as people are suggesting, then I don't see how that could be true. More likely, I think as games drop the vertex shader pipeline and switch to mesh shaders or compute front-ends, that difference in compute power will shine more on the 3080.
     
    pharma and PSman1700 like this.
  2. agent_x007

    Newcomer

    Joined:
    Dec 2, 2014
    Messages:
    25
    Likes Received:
    3
I found this undervolt test:
     
    nnunn likes this.
  3. pcchen

    pcchen Moderator
    Moderator Veteran Subscriber

    Joined:
    Feb 6, 2002
    Messages:
    2,938
    Likes Received:
    449
    Location:
    Taiwan
Personally I believe the trend in game engines will be toward a higher compute/bandwidth ratio. The reason is that GPUs are walking the same path CPUs went through some time ago: computation is, in general, more power efficient than bandwidth.

Ray tracing is a good example. It can reduce the number of off-screen rendering passes per frame, because many of them (e.g. shadow maps, reflection renderings) can be replaced with ray tracing. This favors GPUs with a higher compute/bandwidth ratio.
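To put rough numbers on that ratio, here's a back-of-envelope sketch using public spec-sheet figures for peak FP32 throughput and memory bandwidth (sustained numbers will differ, so treat this as illustrative only):

```python
# Back-of-envelope arithmetic intensity (peak FP32 / DRAM bandwidth),
# using public spec-sheet numbers; measured figures will differ.
specs = {
    "RTX 2080 Ti": (13.4e12, 616e9),  # peak FLOP/s, bytes/s
    "RTX 3080":    (29.8e12, 760e9),
}
for name, (flops, bw) in specs.items():
    print(f"{name}: {flops / bw:.1f} FLOPs per byte of bandwidth")
# RTX 2080 Ti: ~21.8 FLOPs/byte; RTX 3080: ~39.2 FLOPs/byte --
# the compute/bandwidth ratio nearly doubles in one generation.
```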
     
    Lightman, pharma, DegustatoR and 3 others like this.
  4. pjbliverpool

    pjbliverpool B3D Scallywag
    Legend

    Joined:
    May 8, 2005
    Messages:
    8,700
    Likes Received:
    3,201
    Location:
    Guess...
    I certainly hope so.
     
    PSman1700 and BRiT like this.
  5. Scott_Arm

    Legend

    Joined:
    Jun 16, 2004
    Messages:
    14,798
    Likes Received:
    6,984
Yah, I've seen that. That's an undervolt to see how low he can get power while maintaining pretty much stock FE clocks in games. I'm more interested in seeing if you can overclock with a milder undervolt that just gets it under the power limit. It would be a balancing act of lowering power and pushing clocks as high as they can go while staying just under that limit. I definitely prefer a setup where the clock is 100% stable, which is how I have my card set up right now. It feels a lot smoother when you have an fps cap set.
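For illustration, a toy model of that balancing act, assuming the usual CMOS dynamic-power relation P ≈ k·V²·f (real cards add static leakage and a voltage/frequency curve, and every number below is hypothetical, not a measurement):

```python
# Toy model only: dynamic power P = k * V^2 * f, ignoring leakage.
# All numbers are hypothetical, not measurements of any real card.
stock_v, stock_f, power_limit = 1.05, 1900.0, 320.0  # volts, MHz, watts
k = power_limit / (stock_v**2 * stock_f)

undervolt = 0.95  # a milder undervolt than in the linked test
f_max = power_limit / (k * undervolt**2)
print(f"~{f_max:.0f} MHz fits under {power_limit:.0f} W at {undervolt} V")
# ~2320 MHz in this toy model; in practice stability gives out well before that.
```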
     
    PSman1700 and BRiT like this.
  6. w0lfram

    Regular Newcomer

    Joined:
    Aug 7, 2017
    Messages:
    254
    Likes Received:
    48

No, AMD split their graphics architecture into two separate domain-specific architectures, RDNA and CDNA.

A real-time gaming (frames/second) architecture versus a high-performance compute (FLOPS/second) architecture.
     
  7. bdmosky

    Newcomer

    Joined:
    Jul 31, 2002
    Messages:
    177
    Likes Received:
    48
    You've mentioned this several times before. What I don't understand is why you continue to act like Nvidia hasn't done the exact same thing. V100 and A100 are very different beasts compared to Nvidia's gaming chips.
     
  8. Scott_Arm

    Legend

    Joined:
    Jun 16, 2004
    Messages:
    14,798
    Likes Received:
    6,984
I think people are assuming that because compute power scaled up much more drastically than the rest of the GPU, the architecture must not be designed for gaming. I honestly don't agree: memory bandwidth is an incredibly expensive problem to solve, and the vertex shader pipeline is being replaced with a highly parallel, compute-driven mesh-shader pipeline that will leverage all of that compute performance. That leaves ROPs, and we don't know whether any of these games are ROP-limited.
     
  9. Kugai Calo

    Regular Newcomer

    Joined:
    Mar 6, 2020
    Messages:
    254
    Likes Received:
    288
    Location:
    The Prairies
    Even A100 and GA102 appear to be different.
     
    Rootax likes this.
  10. Frenetic Pony

    Regular Newcomer

    Joined:
    Nov 12, 2011
    Messages:
    739
    Likes Received:
    417
The arch is obviously not designed for gaming; the compute is way out of proportion to the other bottlenecks, and just tossing out "mesh shaders will solve it!" isn't a compelling argument. It's going to be just as bandwidth-strapped running everything else. Even if meshes have a lower data rate than vertex buffers, it's still just as bottlenecked, and there's still a vast amount of hypothetical compute performance lying around doing nothing for most of a frame.

So far Ampere is, transistor for transistor, less efficient than Turing for gaming and will probably remain that way. And given the vast power usage versus the advance in silicon nodes, it's almost certainly less efficient in power as well. It's also not like AMD's CDNA/RDNA split: those appear to be two distinct architectures, while the Ampere gaming models mostly seem to differ from A100 in things like Tensor and CUDA core counts more than anything else.

I severely doubt Nvidia intended their highest-end chip to perform on average only 10% or so better than the mass-market bin of the exact same die; that's probably why it's called the 3090 instead of a Titan, as the PR guys want to preserve the latter name's prestige.
     
    w0lfram likes this.
  11. Scott_Arm

    Legend

    Joined:
    Jun 16, 2004
    Messages:
    14,798
    Likes Received:
    6,984
@Frenetic Pony The mesh shader pipeline, or compute pipelines like Unreal's Nanite, will leverage that compute where the current vertex shader pipeline will not. As for bandwidth, I don't know what the options are. A 512-bit bus or HBM, I suppose, but those are very costly choices.
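For scale, memory bandwidth is just bus width times per-pin data rate, so the options look roughly like this (the 512-bit configuration is hypothetical, and the HBM figure is a ballpark per-stack number):

```python
# GB/s = (bus width in bits / 8) * per-pin data rate in Gbps.
def gddr_bandwidth(bus_bits, gbps):
    return bus_bits / 8 * gbps

print(gddr_bandwidth(320, 19))    # 760 GB/s  -- RTX 3080 as shipped
print(gddr_bandwidth(384, 19.5))  # 936 GB/s  -- RTX 3090
print(gddr_bandwidth(512, 19))    # 1216 GB/s -- hypothetical 512-bit card
# HBM2e manages roughly 400-460 GB/s per stack, so two to three stacks
# beat the hypothetical 512-bit bus, at the cost of an interposer
# and pricier packaging.
```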

I think they pushed the stock config much further than usual, which left less headroom and high power consumption. They could have released it below 300 W with nearly the same performance.
     
    PSman1700 likes this.
  12. Benetanegia

    Regular Newcomer

    Joined:
    Sep 4, 2015
    Messages:
    394
    Likes Received:
    425
Nonsense. It's nearly identical on the compute side, and what's changed is more related to being able to run RTX and DLSS concurrently.

I don't think so. It's just people being knee-jerk fixated on single-dimensional GFLOPS as if that were the absolute performance metric. It never has been.
There was a very clear opportunity to maximize scheduling rate on the SM while increasing FP32, and they did it the only way possible: by adding a second FP32 SIMD (which didn't even get its own data path**) alongside the many other units that were already there. It was never supposed to come with a doubling of performance; there was simply no other way to increase FP32 throughput but to double the unit. The same thing has happened with TMUs and ROPs in the past: every few generations they seem overkill, but they're just a small percentage of the actual transistor budget, and the same applies here. It's enough if the performance increase is greater than the area increase, and so far it came with a more-than-30% performance uplift against a card (2080 Ti) with the same number of SMs (68), for a very minor increase in area.

    EDIT:

Yeah, and a lot of texturing performance lying around doing nothing, and a lot of ROP performance lying around doing nothing, and in Turing also a lot of INT32 compute performance lying around doing nothing, and a long list of other hypothetical performances lying around doing nothing. What exactly makes FP32 so special that it requires special consideration?

How so? Even accounting for the much-improved RT and Tensor cores plus the scheduling changes to make those run concurrently, GA104 is 17.4 billion transistors vs. TU102's 18.6 billion, and it will most definitely beat it. As for the 3080, it has 20% of its chip disabled, so it's equivalent to a 28.3 × 0.8 ≈ 22.6 billion transistor chip, and that's roughly 20% more transistors for a 30%+ performance uplift.
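Spelling out that effective-transistor arithmetic explicitly, with the published die figures of 28.3B for GA102 and 18.6B for TU102:

```python
# Sanity-checking the effective-transistor argument above.
ga102 = 28.3e9   # full GA102 transistor count
enabled = 0.8    # RTX 3080 has roughly 20% of the chip disabled
tu102 = 18.6e9   # full TU102 transistor count

effective = ga102 * enabled
print(f"effective budget: {effective / 1e9:.1f}B transistors")  # ~22.6B
print(f"vs TU102: {effective / tu102 - 1:+.0%}")                # ~+22%
# A 30%+ uplift on ~22% more effective transistors is a net gain in
# per-transistor efficiency, which is the point being made above.
```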

It is not that way, and its advantage will do nothing but grow as games that play to its strengths start popping up.

** Now that would have been an indication supporting your claim, if it had had its own data path.
     
    #1712 Benetanegia, Sep 20, 2020
    Last edited: Sep 20, 2020
    T2098, tinokun, pharma and 5 others like this.
  13. trinibwoy

    trinibwoy Meh
    Legend

    Joined:
    Mar 17, 2004
    Messages:
    11,309
    Likes Received:
    1,944
    Location:
    New York
Titans have never been significantly faster than the Tis below them, so that's not it.
     
    T2098, PSman1700 and Rootax like this.
  14. Voxilla

    Regular

    Joined:
    Jun 23, 2007
    Messages:
    832
    Likes Received:
    505
Regarding the comparison in the whitepaper between the Titan RTX and the 3090 for AI compute:
in that table the Titan RTX is listed at 65.2 TFLOPS for FP16 Tensor with FP32 accumulate, when it actually has 130 TFLOPS.
Is this a mistake? And similarly, does the 3090 then have 142 TFLOPS rather than the listed 71 TFLOPS?
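For reference, the full-rate Titan RTX figure can be reconstructed from its unit counts (using the spec-sheet boost clock; each Turing Tensor Core performs 64 FP16 FMAs, i.e. 128 FLOPs, per clock):

```python
# Rebuilding the Titan RTX tensor numbers from unit counts.
tensor_cores = 576        # full TU102
flops_per_core_clk = 128  # 64 FP16 FMAs * 2 ops each
boost_hz = 1.77e9         # spec-sheet boost clock

peak = tensor_cores * flops_per_core_clk * boost_hz / 1e12
print(f"full-rate FP16 tensor:       {peak:.1f} TFLOPS")      # ~130.5
print(f"half-rate (FP32 accumulate): {peak / 2:.1f} TFLOPS")  # ~65.2
# 65.2 is exactly the half-rate number, so the whitepaper either made a
# mistake or quietly applied the GeForce half-rate restriction to the Titan.
```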
     
    #1714 Voxilla, Sep 20, 2020
    Last edited: Sep 20, 2020
  15. Qesa

    Newcomer

    Joined:
    Feb 23, 2020
    Messages:
    27
    Likes Received:
    46
Previous generations had FP16 accumulate at full throughput and FP32 accumulate at half rate. I'd guess it's the same situation here.
     
  16. CarstenS

    Legend Veteran Subscriber

    Joined:
    May 31, 2002
    Messages:
    5,398
    Likes Received:
    3,181
    Location:
    Germany
Look through the reviews, single out the titles where the Radeons shine, and cross-check against the relative improvement of Ampere vs. Turing.
     
  17. w0lfram

    Regular Newcomer

    Joined:
    Aug 7, 2017
    Messages:
    254
    Likes Received:
    48
    Are you suggesting they don't use the same architecture...? (V100/A100)
     
  18. Voxilla

    Regular

    Joined:
    Jun 23, 2007
    Messages:
    832
    Likes Received:
    505
Turing TU102 Quadro and Titan cards have full-rate FP16 with FP32 accumulate enabled, at 130 TFLOPS (unlike the 65.2 TFLOPS the Ampere whitepaper mistakenly quotes for the Titan).

    https://blog.slavv.com/titan-rtx-quality-time-with-the-top-turing-gpu-fe110232a28e
    "Full-rate mixed-precision training (FP16 with FP32 accumulation) — A few paragraphs ago, mixed precision training was explained. When the model utilizes Tensor cores, it performs matrix multiply-accumulate operation really quick. The second step in this operation (accumulate) must be done at FP32 to preserve accuracy and is then converted to FP16. The accumulate operation performs at half speed on RTX 2080 and RTX 2080 Ti, but on full-rate on the Titan RTX. In practice, this makes the Titan RTX perform 10% to 20% faster where Tensor Cores are utilized."
     
    #1718 Voxilla, Sep 20, 2020
    Last edited: Sep 20, 2020
  19. Picao84

    Veteran Regular

    Joined:
    Feb 15, 2010
    Messages:
    2,003
    Likes Received:
    1,053
Do we have any idea when RTX 3070 reviews are going to drop? On the same day as release?
     
  20. Cat Merc

    Newcomer

    Joined:
    May 14, 2017
    Messages:
    134
    Likes Received:
    130
I'm honestly getting déjà vu from the Fury X days with these arguments.
     
    Lightman, Krteq, Rootax and 1 other person like this.