Nvidia's Next-Generation RTX GPU [3060, 3070, 3080, 3090 now with TIs]

Discussion in 'Architecture and Products' started by Shortbread, Sep 1, 2020.

  1. PSman1700

    Legend Newcomer

    Joined:
    Mar 22, 2019
    Messages:
    5,043
    Likes Received:
    2,242
    I think what most meant was this comment.

    The 3080, for example, is a 30 TF GPU. If one is interested in the truth, one could dig a little deeper than just saying 'it's not the truth, it's a lie, we are being misled', etc.
    The post I linked to in my previous comment seems to sum things up quite well. No, TFs don't mean everything; there's obviously more to it, but that doesn't mean they mean nothing either.

    https://www.gamespot.com/forums/sys...zealous-33515692/?page=1#js-message-356853198
    From user 04dcarraher

    ''... FLOPS mean nothing when you're comparing totally unrelated GPU architectures..... Now if you were to compare, say, GCN 1.0 vs GCN 1.4 based GPUs, then I would say "maybe", since they are still based on the same core design.

    RX 6800xt (20 TFLOPS) has a pixel rate of 288 GPixel/s and a texture rate of 648.0 GTexel/s.

    The RTX 2080ti(13.45 TFLOPS) has a pixel rate of 136.0 GPixel/s and a texture rate of 420.2 GTexel/s. Yet it still beats the 6800xt in RT but loses to the 6800xt in normal rasterization rendering performance.

    The RTX 3080 (29.77 TFLOPS), meanwhile, has a pixel rate of 164.2 GPixel/s and a texture rate of 465.1 GTexel/s, and beats the 6800xt overall, with much, much better RT performance. The TFLOP increase comes from the 128-core-per-SM design vs Turing's 64-core design, hence the 2x potential for math crunching. However, 64 of the 128 cores can be allocated to INT or to other FP types (i.e. 8/16/32, etc.) based on the type of job, making the GPU more flexible.

    The RX 6800 series is a great performer in normal rasterization rendering, but it falls flat on its face when it has to do it all, with RT and/or high resolutions. RDNAv2 does a lot of its RT work using its free TMUs (texture mapping units), "what gives the gpu its texture rate". So higher texture/pixel resolution and heavier RT use both eat into RDNA's TMU resources, hurting performance.

    The design might be more flexible, in that you "could" allocate more TMUs to RT and adjust the amount, while Nvidia's RTX design has a fixed number of dedicated RT processors. But AMD's use of the GPU's TMUs was a shortcut for them to check the "does it have RT" box.''
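
The throughput figures in the quoted post follow from unit counts multiplied by clock speed. A minimal sketch of that arithmetic, assuming the commonly published shader/TMU/ROP counts and boost clocks for each card (these inputs are not given in the thread) and counting an FMA as two floating-point operations:

```python
# Spec inputs are assumptions taken from public spec sheets, not from the thread:
# (shader cores, TMUs, ROPs, boost clock in GHz)
SPECS = {
    "RTX 2080 Ti": (4352, 272, 88, 1.545),
    "RTX 3080": (8704, 272, 96, 1.710),
    "RX 6800 XT": (4608, 288, 128, 2.250),
}

def peak_rates(cores, tmus, rops, boost_ghz):
    """Return (FP32 TFLOPS, GTexel/s, GPixel/s) at the boost clock."""
    tflops = 2 * cores * boost_ghz / 1000.0  # 2 FLOPs per core per clock (FMA)
    gtexels = tmus * boost_ghz               # one texel per TMU per clock
    gpixels = rops * boost_ghz               # one pixel per ROP per clock
    return tflops, gtexels, gpixels

for name, spec in SPECS.items():
    tflops, gtexels, gpixels = peak_rates(*spec)
    print(f"{name}: {tflops:.2f} TFLOPS, {gtexels:.1f} GTexel/s, {gpixels:.1f} GPixel/s")
```

With these inputs the 3080 and 2080 Ti land on the quoted 29.77 and 13.45 TFLOPS, while the 6800 XT comes out nearer 20.7 TFLOPS, which the post rounds to 20. None of these peak rates says anything about how well a game keeps the units busy.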
     
  2. OlegSH

    Regular Newcomer

    Joined:
    Jan 10, 2010
    Messages:
    617
    Likes Received:
    1,076
    Other than that, according to NVIDIA's Turing whitepaper, there are on average 36 INT instructions per 100 FP instructions. With that instruction mix, RDNA2 is left with 73.53% peak theoretical FP efficiency (100 / 136) due to its 1:1 FP and INT unit split.
    ̶W̶i̶t̶h̶ ̶A̶m̶p̶e̶r̶e̶,̶ ̶u̶n̶i̶t̶s̶ ̶s̶p̶l̶i̶t̶ ̶i̶s̶ ̶2̶ ̶F̶P̶ ̶t̶o̶ ̶1̶ ̶I̶N̶T̶ ̶S̶I̶M̶D̶ ̶u̶n̶i̶t̶s̶,̶ ̶b̶e̶t̶t̶e̶r̶ ̶e̶f̶f̶i̶c̶i̶e̶n̶c̶y̶ ̶i̶s̶ ̶p̶o̶s̶s̶i̶b̶l̶e̶ ̶o̶n̶ ̶t̶h̶e̶ ̶3̶6̶ ̶I̶N̶T̶ ̶p̶e̶r̶ ̶1̶0̶0̶ ̶F̶P̶ ̶i̶n̶s̶t̶r̶u̶c̶t̶i̶o̶n̶ ̶m̶i̶x̶,̶ ̶8̶6̶,̶7̶9̶ ̶%̶ ̶p̶e̶a̶k̶ ̶t̶h̶e̶o̶r̶e̶t̶i̶c̶a̶l̶ ̶F̶P̶ ̶e̶f̶f̶i̶c̶i̶e̶n̶c̶y̶ ̶s̶i̶n̶c̶e̶ ̶I̶N̶T̶ ̶i̶n̶s̶t̶r̶u̶c̶t̶i̶o̶n̶s̶ ̶w̶i̶l̶l̶ ̶b̶e̶ ̶e̶x̶e̶c̶u̶t̶e̶d̶ ̶i̶n̶ ̶p̶a̶r̶a̶l̶l̶e̶l̶ ̶w̶i̶t̶h̶ ̶F̶P̶ ̶(̶w̶h̶i̶c̶h̶ ̶l̶e̶a̶v̶e̶s̶ ̶t̶h̶e̶ ̶a̶d̶d̶i̶t̶i̶o̶n̶a̶l̶ ̶1̶3̶,̶2̶6̶ ̶%̶ ̶o̶f̶ ̶F̶P̶ ̶p̶e̶r̶f̶o̶r̶m̶a̶n̶c̶e̶ ̶o̶n̶ ̶t̶h̶e̶ ̶t̶a̶b̶l̶e̶)̶.̶
    ̶L̶e̶t̶s̶ ̶c̶h̶e̶c̶k̶ ̶n̶u̶m̶b̶e̶r̶s̶ ̶o̶n̶ ̶s̶u̶c̶h̶ ̶s̶p̶l̶i̶t̶ ̶-̶ ̶6̶8̶0̶0̶ ̶X̶T̶ ̶=̶ ̶2̶1̶ ̶T̶F̶L̶O̶P̶S̶*̶0̶.̶7̶4̶ ̶+̶ ̶2̶1̶ ̶T̶O̶P̶S̶*̶0̶.̶2̶6̶ ̶=̶ ̶2̶1̶ ̶t̶r̶i̶l̶l̶i̶o̶n̶s̶ ̶o̶f̶ ̶m̶i̶x̶e̶d̶ ̶o̶p̶e̶r̶a̶t̶i̶o̶n̶s̶ ̶p̶e̶r̶ ̶s̶e̶c̶o̶n̶d̶,̶ ̶3̶0̶8̶0̶ ̶=̶ ̶3̶0̶*̶0̶.̶8̶7̶ ̶+̶ ̶1̶5̶*̶0̶.̶2̶6̶ ̶=̶ ̶3̶0̶ ̶t̶r̶i̶l̶l̶i̶o̶n̶s̶ ̶o̶f̶ ̶m̶i̶x̶e̶d̶ ̶o̶p̶e̶r̶a̶t̶i̶o̶n̶s̶ ̶p̶e̶r̶ ̶s̶e̶c̶o̶n̶d̶.̶
    My bad, the numbers above are partially wrong for the 3080; everything is simpler: 30 TFLOPS is the peak, so 30*0.74 = 22.2 FP TFLOPS and 7.8 INT TOPS are required for the 100 FP / 36 INT split. That is still 30 TOPS in total, but it doesn't break the instruction percentages this time.
    If there were shaders with just integer instructions, then RDNA2 would win, but shaders contain a mix of instructions. And Ampere's 2:1 FP / INT unit split is better for the 100 FP / 36 INT instruction mix.
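
The arithmetic behind the corrected figures can be sketched as follows. This assumes only the 100 FP : 36 INT mix cited in this post and a nominal 30 TFLOPS peak; the function names are illustrative.

```python
def fp_share(fp_instr=100, int_instr=36):
    """Fraction of issued instructions that are FP for a given FP:INT mix."""
    return fp_instr / (fp_instr + int_instr)

def split_peak(peak_tops, fp_instr=100, int_instr=36):
    """Split a peak rate (trillions of ops/s) into FP and INT parts for the mix."""
    fp_rate = peak_tops * fp_share(fp_instr, int_instr)
    return fp_rate, peak_tops - fp_rate

share = fp_share()
fp_tflops, int_tops = split_peak(30.0)
print(f"FP share of the mix: {share:.2%}")
print(f"30 peak -> {fp_tflops:.1f} FP TFLOPS + {int_tops:.1f} INT TOPS")
```

Unrounded, the split is about 22.1 FP TFLOPS and 7.9 INT TOPS; the 22.2 / 7.8 figures come from rounding the share to 0.74 first.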
     
    #442 OlegSH, Jan 18, 2021
    Last edited: Jan 18, 2021
    HLJ, pharma, PSman1700 and 1 other person like this.
  3. troyan

    Regular Newcomer

    Joined:
    Sep 1, 2015
    Messages:
    352
    Likes Received:
    688
    INT32 throughput on AMD hardware is half speed.
     
    DavidGraham, HLJ and PSman1700 like this.
  4. OlegSH

    Regular Newcomer

    Joined:
    Jan 10, 2010
    Messages:
    617
    Likes Received:
    1,076
    This makes perfect sense given the 100 to 36 FP/INT instruction split.
     
    HLJ, pharma and PSman1700 like this.
  5. DegustatoR

    Veteran

    Joined:
    Mar 12, 2002
    Messages:
    2,405
    Likes Received:
    1,941
    Location:
    msk.ru/spb.ru
    Lightman, HLJ, pharma and 1 other person like this.
  6. troyan

    Regular Newcomer

    Joined:
    Sep 1, 2015
    Messages:
    352
    Likes Received:
    688
    PSman1700, HLJ, pharma and 1 other person like this.
  7. DegustatoR

    Veteran

    Joined:
    Mar 12, 2002
    Messages:
    2,405
    Likes Received:
    1,941
    Location:
    msk.ru/spb.ru
    Hm, maybe it's artificially limited in consumer products for some reason?
     
    Krteq and PSman1700 like this.
  8. trinibwoy

    trinibwoy Meh
    Legend

    Joined:
    Mar 17, 2004
    Messages:
    11,309
    Likes Received:
    1,944
    Location:
    New York
    Actually he made a clearly false assertion which was corrected as it should be on a technical forum. No need to defend it.

    If you want to claim Ampere isn’t really a 30 TFLOP GPU while running INT instructions you can’t claim that RDNA is a 20 TFLOP GPU in the same breath.
     
  9. Qesa

    Newcomer

    Joined:
    Feb 23, 2020
    Messages:
    27
    Likes Received:
    46
    Professional cards have the same pattern, e.g. https://www.servethehome.com/amd-radeon-pro-w5700-gpu-review/3/

    The white paper you took the excerpt from explicitly lists the various FP instructions as being full rate, then follows with "as well as 24/32-bit integer". The implication there, I believe, is simply that the execution units handle them, but not necessarily at full rate.
     
    #449 Qesa, Jan 19, 2021
    Last edited by a moderator: Jan 19, 2021
  10. arandomguy

    Newcomer

    Joined:
    Jul 27, 2020
    Messages:
    125
    Likes Received:
    190
    What we really need is a mixed-operation test (preferably configurable). I wonder what the actual FPU/ALU throughput is in mixed workloads with various FP16/FP32/INT32/etc. operations, for all uarchs.
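
Short of a real benchmark, a back-of-envelope issue-slot model gives a feel for what a configurable mix would probe. The sketch below is a heavily idealized assumption: it ignores scheduling, latency, memory, and per-instruction rate differences, and the three layout names are hypothetical labels rather than descriptions of any specific GPU.

```python
def cycles(layout, fp_ops, int_ops):
    """Issue cycles for one pair of lanes under an idealized slot model.

    "dedicated": one FP-only lane plus one INT-only lane.
    "flexible":  one FP-only lane plus one lane that takes FP or INT.
    "shared":    both lanes take FP or INT.
    """
    if layout == "dedicated":
        return max(fp_ops, int_ops)
    if layout == "flexible":
        # All INT work must go through the single flexible lane.
        return max(int_ops, (fp_ops + int_ops) / 2)
    if layout == "shared":
        return (fp_ops + int_ops) / 2
    raise ValueError(f"unknown layout: {layout}")

def fp_per_cycle(layout, fp_ops, int_ops):
    """FP operations retired per cycle for the given mix."""
    return fp_ops / cycles(layout, fp_ops, int_ops)

# Sweep the INT count against a fixed 100 FP operations.
for int_ops in (0, 36, 100, 200):
    row = ", ".join(
        f"{layout}={fp_per_cycle(layout, 100, int_ops):.2f}"
        for layout in ("dedicated", "flexible", "shared")
    )
    print(f"100 FP : {int_ops} INT -> FP/cycle: {row}")
```

In this model the "dedicated" layout holds its one FP op per cycle until INT work exceeds FP work, while the "flexible" layout starts from two per cycle and gives slots back as INT work grows. A measured test would show how far real hardware departs from that.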
     
  11. HLJ

    HLJ
    Regular Newcomer

    Joined:
    Aug 26, 2020
    Messages:
    424
    Likes Received:
    706
    In my recollection, NVIDIA has always done better in games with fewer FLOPs than AMD... so indeed, FLOPs do not tell the full picture.
    As far as I remember, NVIDIA has been better at keeping its "pipeline full", which is why "async computing" initially didn't do as much for them as for AMD.
     
    PSman1700 likes this.
  12. CarstenS

    Legend Veteran Subscriber

    Joined:
    May 31, 2002
    Messages:
    5,398
    Likes Received:
    3,181
    Location:
    Germany
    There's no single magic number that's the be-all-end-all of performance metrics. I'd love to be proven wrong though.
     
    Kej, Lightman, pjbliverpool and 2 others like this.
  13. trinibwoy

    trinibwoy Meh
    Legend

    Joined:
    Mar 17, 2004
    Messages:
    11,309
    Likes Received:
    1,944
    Location:
    New York
    Nope there isn't, even within the same arch. And it's hopeless when comparing different architectures. Take the 3080 vs 3060 Ti for example.

    The 3080 is 40-60% faster depending on the game, averaging around 50% faster. But the 3080 has...

    84% more flops and texture throughput
    70% more bandwidth
    24% more fillrate

    The advantage in games never drops as low as 24% and never rises as high as 84%. None of the top line theoretical numbers predicted the actual 50% gain.
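
The comparison can be written out as a quick check. This sketch uses only the percentages stated in this post; the helper name is illustrative.

```python
# Figures as stated in the post (3080 relative to 3060 Ti), in percent.
theoretical_uplift = {
    "flops / texture throughput": 84,
    "bandwidth": 70,
    "fillrate": 24,
}
observed_low, observed_avg, observed_high = 40, 50, 60

def prediction_errors(uplifts, observed):
    """Percentage-point gap between each theoretical uplift and the observed one."""
    return {name: value - observed for name, value in uplifts.items()}

errors = prediction_errors(theoretical_uplift, observed_avg)
in_range = [
    name for name, value in theoretical_uplift.items()
    if observed_low <= value <= observed_high
]

for name, gap in errors.items():
    print(f"{name}: {gap:+d} points vs the observed average")
print("metrics inside the observed 40-60% range:", in_range)
```

Every headline metric misses the average by at least 20 points, and none falls inside the observed range at all.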
     
  14. ToTTenTranz

    Legend Veteran

    Joined:
    Jul 7, 2008
    Messages:
    12,272
    Likes Received:
    7,226

    Exactly. Ampere has no "free" INT32 ALUs like Turing did and that's exactly like RDNA2, so I don't get sniffy's point. Both metrics are 100% comparable as they both relate to theoretical maximum TFLOP output, regardless of them being used by marketing departments or not.
    No GPU ever really reaches its theoretical maximum TFLOP output, much less when rendering games, but that's a fact that everyone here should indeed take into account.


    Regardless, the tables have turned, and RDNA2 is now an architecture with better rasterization performance-per-theoretical-max-TFLOP than Ampere. Which is as much of a completely worthless metric now as it was when Pascal had a much higher ratio than Vega, IMO.
    The only thing that really matters is performance/price (where we deal with die size, process node, yields, PCB and memory cost, margins, etc.) and to a lower degree of importance performance/power.




    According to some, the be-all-end-all performance metric is Cyberpunk 2077 with raytracing options maxed out at 4K with DLSS.
    Nothing else matters anymore. :runaway::runaway:
     
  15. DegustatoR

    Veteran

    Joined:
    Mar 12, 2002
    Messages:
    2,405
    Likes Received:
    1,941
    Location:
    msk.ru/spb.ru
    This is certainly a possibility, but it would be a pretty skittish way of describing the spec in a whitepaper on AMD's part. INT24 seems to run at full rate, at least.

    Actually, the only two things which matter are perf/watt and perf/transistor. The latter is also where you should account for features absent from competition. All the rest are a result of these two.
     
    PSman1700 likes this.
  16. ToTTenTranz

    Legend Veteran

    Joined:
    Jul 7, 2008
    Messages:
    12,272
    Likes Received:
    7,226
    They're not.
    If your 5nm SoC is getting the same perf/watt and perf/transistor as the competition's SoC built on 28nm, then your product is obviously weaker and less able to compete.
    Besides, the perf/transistor metric is a bit worthless considering it's variable within the same process, as chip designers can select between performance-optimized transistors and density-optimized ones.
     
    Qesa likes this.
  17. HLJ

    HLJ
    Regular Newcomer

    Joined:
    Aug 26, 2020
    Messages:
    424
    Likes Received:
    706
    Still waiting for your reply...but I guess no reply is all the reply I need ;)
     
    PSman1700 likes this.
  18. Xmas

    Xmas Porous
    Veteran Subscriber

    Joined:
    Feb 6, 2002
    Messages:
    3,334
    Likes Received:
    162
    Location:
    On the path to wisdom
    RDNA2 performs int32 additions at full rate, or an int24*int24+int32 multiply-add (with the multiplication result being int32). An int32 mul requires four smaller mul/mads.

    EDIT: And a substantial share of array indexing calculations, I imagine, can be proven by the compiler to fall into the int24*int24 range (often one side being a constant and the other a bounded loop index).
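
One generic way a 32-bit multiply can be assembled from narrower multiplies is the schoolbook split into 16-bit halves, which takes four partial products for the full result. This is an illustration of the general technique, not a description of the instruction sequence RDNA2 actually emits; the function names are hypothetical.

```python
MASK16 = 0xFFFF
MASK32 = 0xFFFFFFFF

def mul32_via_partials(a, b):
    """Multiply two unsigned 32-bit values using four 16x16-bit partial products.

    Returns (low 32 bits of the product, full 64-bit product).
    """
    a &= MASK32
    b &= MASK32
    a_lo, a_hi = a & MASK16, a >> 16
    b_lo, b_hi = b & MASK16, b >> 16

    # Each operand is at most 16 bits wide, so it fits a 24-bit multiplier
    # input and each partial product fits in 32 bits (unsigned).
    lo_lo = a_lo * b_lo
    lo_hi = a_lo * b_hi
    hi_lo = a_hi * b_lo
    hi_hi = a_hi * b_hi

    full = lo_lo + ((lo_hi + hi_lo) << 16) + (hi_hi << 32)
    return full & MASK32, full

def fits_int24(value):
    """True if value is representable as a signed 24-bit integer."""
    return -(1 << 23) <= value < (1 << 23)

low, full = mul32_via_partials(0xDEADBEEF, 0x12345678)
print(hex(low), hex(full))

# Array-indexing case: a bounded loop index times a constant stride.
max_index, stride = 1000, 64
print(fits_int24(max_index) and fits_int24(stride))
```

If only the low 32 bits are needed, the `hi_hi` term shifts out entirely. When both operands pass a check like `fits_int24`, as in the bounded-index case, a single narrow multiply-add would be enough.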
     
    #458 Xmas, Jan 19, 2021
    Last edited: Jan 21, 2021
    Lightman, pharma, trinibwoy and 2 others like this.
  19. DegustatoR

    Veteran

    Joined:
    Mar 12, 2002
    Messages:
    2,405
    Likes Received:
    1,941
    Location:
    msk.ru/spb.ru
    They are. A GPU which has higher perf/watt will win the high end in performance. A GPU which has higher perf/transistor will win in perf/price and features. Your weird comparisons of products on 28 and 5nm are irrelevant for this reality.
     
    pharma, PSman1700 and HLJ like this.
  20. Qesa

    Newcomer

    Joined:
    Feb 23, 2020
    Messages:
    27
    Likes Received:
    46
    What if one GPU is using high performance libraries and another high density? What if the HP library GPUs are made on a highly contested (and therefore expensive) node, while the high density design is on a less performant (and thus cheaper per transistor) node? I think "performance per transistor" goes out the door as a relevant metric when one GPU designer is laying out 40-50 MT/mm^2 while the other is getting 65 MT/mm^2 on the same node.

    Performance per die cost is the obvious metric, with the minor problem that the few people who know the cost for sure aren't in a position to tell anybody.
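
To see how the rankings can flip, here is a small sketch with made-up numbers. Only the 45 and 65 MT/mm^2 densities echo the range mentioned in this post; the performance-per-transistor and cost-per-area values are purely hypothetical, and yield is ignored.

```python
def perf_per_mm2(perf_per_mtransistor, density_mt_per_mm2):
    """Performance per unit die area, given perf per million transistors."""
    return perf_per_mtransistor * density_mt_per_mm2

def perf_per_cost(perf_per_mtransistor, density_mt_per_mm2, cost_per_mm2):
    """Performance per unit of die cost, ignoring yield and packaging."""
    return perf_per_mm2(perf_per_mtransistor, density_mt_per_mm2) / cost_per_mm2

# Hypothetical designs: (perf per MTransistor, density in MT/mm^2, cost per mm^2)
designs = {
    "high-performance library": (1.0, 45.0, 1.0),
    "high-density library": (0.8, 65.0, 1.0),
}

for name, (ppt, density, cost) in designs.items():
    print(
        f"{name}: perf/transistor={ppt:.2f}, "
        f"perf/mm^2={perf_per_mm2(ppt, density):.1f}, "
        f"perf/cost={perf_per_cost(ppt, density, cost):.1f}"
    )
```

With these invented inputs the high-performance design wins on perf/transistor, yet the denser design wins on perf/mm^2 and on perf per die cost. Changing the cost per mm^2 for either design moves the last column again.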
     
    #460 Qesa, Jan 19, 2021
    Last edited: Jan 19, 2021
    neckthrough and ToTTenTranz like this.
  • About Us

    Beyond3D has been around for over a decade and prides itself on being the best place on the web for in-depth, technically-driven discussion and analysis of 3D graphics hardware. If you love pixels and transistors, you've come to the right place!

    Beyond3D is proudly published by GPU Tools Ltd.