GPU Ray Tracing Performance Comparisons [2021] *spawn*

Discussion in 'Architecture and Products' started by DavidGraham, Mar 29, 2021.

  1. DavidGraham

    Veteran

    Joined:
    Dec 22, 2009
    Messages:
    3,976
    Likes Received:
    5,210
    The video has zero games running ray tracing. I am not sure about the relevance of it to this discussion.

    Also note that the user has set the TDP of the 6900XT to 275W, which is lower than the 300W default limit. The 6900XT's clocks are low (~2280MHz max) and stay fixed throughout each test, unlike the 3090, which changes its clocks constantly in the video. This is a dead giveaway of a hard power cap; perhaps he did it because his 6900XT was running at a constant 75°C, which is 10°C more than his 3090.
     
    #1501 DavidGraham, Apr 2, 2022
    Last edited: Apr 2, 2022
    PSman1700 likes this.
  2. DavidGraham

    Veteran

    Joined:
    Dec 22, 2009
    Messages:
    3,976
    Likes Received:
    5,210
    I have some numbers for the RTX 3080 vs RX 6800XT, both OC'ed. Generally the 3080 consumes 370W, while the 6800XT maxes out at 285W.

    In Minecraft path tracing, the 3080 delivers 2X to 3X more frames than the 6800XT at 4K.
    In Control the 3080 delivers 60% more fps at 4K.
    In Call of Duty Cold War, the 3080 delivers 60% more fps at 4K.
    In Watch Dogs: Legion, the 3080 delivers 60% more fps at 4K.
    In Battlefield V, the 3080 delivers 65% more fps at 4K.
    In Metro Exodus (old version), the 3080 delivers 40% more fps at 4K.
    In Shadow of the Tomb Raider, the 3080 delivers 50% more fps at 4K.

    So for 30% more watts, the 3080 delivers, in most of these titles, at minimum double that percentage in extra performance, depending on the complexity of the ray tracing effects.
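
    The figures above can be turned into a rough fps-per-watt comparison. This is a minimal sketch that uses only the numbers quoted in this post; the Minecraft entry uses the lower bound of the quoted "2X to 3X" range, and treating the quoted power draw as constant across all games is an assumption.

```python
# Power draw quoted in the post (both cards overclocked), in watts.
POWER_W = {"RTX 3080": 370.0, "RX 6800XT": 285.0}

# 3080 frame rate relative to the 6800XT at 4K (1.60 means "60% more fps").
# Minecraft uses the lower bound of the quoted "2X to 3X" range (assumption).
SPEEDUP_3080 = {
    "Minecraft (lower bound)": 2.00,
    "Control": 1.60,
    "Call of Duty Cold War": 1.60,
    "Watch Dogs: Legion": 1.60,
    "Battlefield V": 1.65,
    "Metro Exodus (old version)": 1.40,
    "Shadow of the Tomb Raider": 1.50,
}


def relative_perf_per_watt(speedup, power_fast, power_slow):
    """fps per watt of the faster card relative to the slower card."""
    return speedup / (power_fast / power_slow)


power_ratio = POWER_W["RTX 3080"] / POWER_W["RX 6800XT"]
print(f"Power ratio: {power_ratio:.2f}x")
for game, speedup in SPEEDUP_3080.items():
    ratio = relative_perf_per_watt(
        speedup, POWER_W["RTX 3080"], POWER_W["RX 6800XT"]
    )
    print(f"{game:28s} {speedup:.2f}x fps -> {ratio:.2f}x fps per watt")
```

    A value above 1.0 in the last column means the 3080 delivers more frames per watt than the 6800XT in that title, under the assumptions stated above.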

     
    PSman1700, sonen and pharma like this.
  3. Albuquerque

    Albuquerque Red-headed step child
    Veteran

    Joined:
    Jun 17, 2004
    Messages:
    4,309
    Likes Received:
    1,102
    Location:
    35.1415,-90.056
    Just to add to the dogpile, my 3080Ti finds a pretty whopping efficiency gain when poking at the undervolting options. I wrote a decently-sized post a few weeks ago regarding my undervolting / overclocking findings with my card; I'll post this, go find my old post, and then edit this one to insert it here: I was happier with my laptop's GTX 1050Ti 4GB than I'll ever be with my desktop GTX 1060 3GB | Beyond3D Forum

    The cliff's notes for my particular situation:
     
    #1503 Albuquerque, Apr 2, 2022
    Last edited: Apr 2, 2022
    pharma and PSman1700 like this.
  4. PSman1700

    Legend

    Joined:
    Mar 22, 2019
    Messages:
    7,118
    Likes Received:
    3,088
    Hopefully AMD will also follow suit in the ML game like Intel/NV in their next approach.
     
  5. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    11,708
    Likes Received:
    2,132
    Location:
    London
    The reason I asked about watts per RT-FPS is to find out whether "theoretical TFLOPS" is part of the difference we're seeing.

    With the wide range in performance delta between RDNA 2 and Ampere (and perhaps with clues based upon Turing) I'm wondering if we can determine whether there are situations where FP throughput is a significant portion of the delta between Ampere and RDNA 2.
     
  6. PSman1700

    Legend

    Joined:
    Mar 22, 2019
    Messages:
    7,118
    Likes Received:
    3,088
    We will know more when we're leaving cross-gen behind, perhaps.
     
  7. trinibwoy

    trinibwoy Meh
    Legend

    Joined:
    Mar 17, 2004
    Messages:
    12,055
    Likes Received:
    3,109
    Location:
    New York
    RT shading performance is probably dominated by shader and memory coherence (or lack thereof) and not so much by raw flops. I don’t know how you would isolate FP throughput from all the other dependencies in an RT pass.

    I ran a few traces of the Bright Memory raytracing benchmark and SM instruction throughput was pretty low during the Dispatch Rays call. What's interesting is that right after each call to Dispatch Rays there was a period of high SM utilization which makes me think RT in this benchmark is writing to a V/G buffer and the hit shading is done after the raytracing pass and not during.

    It's inconclusive but the only thing I could gather is that there are periods of ~80% SM instruction throughput. So if you set aside FP32 flops for a sec and just look at overall instruction throughput Ampere may have an edge as I believe its peak instruction throughput is higher than on RDNA2. It's murky because a lot of those instructions are ALU/SFU transcendentals, type-conversions and bit-manipulation so it's not a straightforward flops comparison.
     
  8. techuse

    Veteran

    Joined:
    Feb 19, 2013
    Messages:
    1,424
    Likes Received:
    908
    Those results show the same power gap just with the variances of silicon quality.
     
  9. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    11,708
    Likes Received:
    2,132
    Location:
    London
    For those two things, RDNA and Ampere should be equivalent, ish. I suppose.

    That's the standard DXR 1.0 approach, as I understand it. Favours NVidia (and soon to be, Intel). Though I assume there's a dedicated buffer set up by the driver to handle the results of DispatchRays (spilling results to off-chip memory). That buffer is then consumed by the hit shaders.

    Is that the correct interpretation?

    Type conversion and bit manipulation should run at native SIMD rate, for at least one SIMD out of the pair in Ampere. Perhaps those only run on the "integer" SIMD?

    Does 80% of SM instruction throughput imply 1.6 instructions per clock? You're implying that's the peak utilisation as far as I can tell, so the overall utilisation is lower. Also, is some of that shading unrelated to hit shaders?

    AMD's ray tracing performance diagnostic tools have only just been announced, so diving deep isn't possible yet.

    In the end, comparing say 6800XT and 3080, where 3080 has around 40% more FLOPS, how much of the 60%+ RT advantage (at 4K) seen in 3080 is due to FLOPS?
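
    One rough way to frame that question is to ask what is left over once the FLOPS gap is divided out. The sketch below assumes, purely for illustration, that frame rate scales linearly with theoretical FLOPS; that is an upper bound on what FLOPS could explain, not a measurement, and both ratios are the approximate ones quoted above.

```python
import math

# Approximate ratios quoted in the post (3080 relative to 6800XT).
flops_ratio = 1.40  # "around 40% more FLOPS"
rt_ratio = 1.60     # "60%+ RT advantage (at 4K)", lower bound

# Advantage that remains if FLOPS scaled frame rate perfectly (assumption).
residual = rt_ratio / flops_ratio

# Share of the advantage attributable to FLOPS on a log (multiplicative) scale.
flops_share = math.log(flops_ratio) / math.log(rt_ratio)

print(f"Residual advantage not explained by FLOPS: {residual:.3f}x")
print(f"Upper bound on the FLOPS share of the gap: {flops_share:.1%}")
```

    Under that idealised scaling, a little over 14% of the advantage is left unexplained by FLOPS at the 60% lower bound; any title with a larger RT gap leaves a larger residual.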
     
  10. trinibwoy

    trinibwoy Meh
    Legend

    Joined:
    Mar 17, 2004
    Messages:
    12,055
    Likes Received:
    3,109
    Location:
    New York
    DispatchRays doesn't return until after the hit shader runs so it seems the bound hit shader isn't doing a lot of math. My guess is that it's just writing parameters out to a buffer and a separate compute shader is spun up after to actually do the material shading.

    Yep, they run on the INT pipeline.

    According to the profiler each SM has a peak instruction issue rate of 4.0 and peak measured is 3.2 (80%). It makes sense as an SM has 4 independent partitions each with its own instruction scheduler and execution units. What's bizarre is that the profiler reports peak issue rate for each of the two FP pipelines as 0.5 per clock. I would expect it to be 2.0 as each of the 4 partitions can issue to each FP pipeline every other clock. ALU peak is 2.0 as expected as it takes 2 clocks to process an ALU warp. SFU is 0.5 as expected as it takes 8 clocks to process an SFU warp. I don't know how to interpret the FP peaks.

    [​IMG]
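
    The peak rates in this post follow from a simple model: four partitions per SM, each able to issue one warp instruction to a pipeline every N clocks. A minimal sketch of that arithmetic is below; the model itself is an assumption taken from the reasoning above, not from NVIDIA documentation, and the function name is made up for illustration.

```python
PARTITIONS_PER_SM = 4  # each with its own scheduler, per the post


def peak_issue_rate(clocks_per_warp, partitions=PARTITIONS_PER_SM):
    """Peak warp instructions per clock per SM for one pipeline (toy model)."""
    return partitions / clocks_per_warp


# Rates the post calls "as expected".
alu_peak = peak_issue_rate(2)  # ALU: 2 clocks per warp
sfu_peak = peak_issue_rate(8)  # SFU: 8 clocks per warp

# FP pipelines: the post expects issue every other clock per partition.
fp_expected = peak_issue_rate(2)
fp_reported = 0.5  # what the profiler shows

# Clocks per warp that the reported FP peak would imply under this model.
fp_implied_clocks = PARTITIONS_PER_SM / fp_reported

# Overall SM issue utilisation: measured peak over theoretical peak.
sm_utilisation = 3.2 / 4.0

print(f"ALU peak {alu_peak}, SFU peak {sfu_peak}")
print(f"FP expected {fp_expected}, reported {fp_reported}")
print(f"Reported FP peak implies {fp_implied_clocks:.0f} clocks per warp")
print(f"SM issue utilisation {sm_utilisation:.0%}")
```

    The model reproduces the ALU and SFU figures but not the reported FP peak, which is the discrepancy described above.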
     
    PSman1700 and pharma like this.
  11. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    11,708
    Likes Received:
    2,132
    Location:
    London
    That sounds like a very good theory.

    This appears to explain the terms being used:

    Advanced Learning :: Nsight Graphics Documentation (nvidia.com)

    What I don't understand is why the picture in that document is not explained by the bullet points that follow. Light Pipe and Heavy Pipe?

    This describes the pipelines:

    Kernel Profiling Guide :: Nsight Compute Documentation (nvidia.com)

    where the heavy FMA pipeline is different from the light one because it has integer dot product functionality. Other "integer" operations occur on other pipelines, the primary one being "alu".

    Then we have the "fma" pipeline: "Fused Multiply Add/Accumulate. The FMA pipeline processes most FP32 arithmetic (FADD, FMUL, FMAD). It also performs integer multiplication operations (IMUL, IMAD), as well as integer dot products. On GA10x, FMA is a logical pipeline that indicates peak FP32 and FP16x2 performance. It is composed of the FMAHeavy and FMALite physical pipelines."

    Notice that this isn't a real pipeline in Ampere, merely a convenience term to describe the grouping of the capability of the heavy and light pipelines.

    The "alu" pipeline is for bitwise and boolean operations and some integer math.

    As for the discrepancy in the peak rates, I have no idea!

    Back to the subject of the benefits of FP32 throughput in NVidia ray tracing:

    nvidia-ampere-ga-102-gpu-architecture-whitepaper-v2.pdf

    "Ray tracing denoising shaders are a good example of a workload that can benefit greatly from doubling FP32 throughput." I have no idea what proportion of the frame time is spent denoising ray tracing results.
     
  12. DavidGraham

    Veteran

    Joined:
    Dec 22, 2009
    Messages:
    3,976
    Likes Received:
    5,210
    The 2080Ti and 3070 are superior to the 6900XT in heavy RT workloads.

    In Minecraft RTX, the 2080Ti is 30% faster than 6900XT, the 3070 is 66% faster than 6900XT.
    In Quake 2 RTX, the 2080Ti is 10% faster than 6900XT, the 3070 is 35% faster than 6900XT.
    In Battlefield V, the 2080Ti/3070 are both 10% faster.
    In Call Of Duty Cold War, the 2080Ti is 35% faster than 6900XT.
    Other games have the 6900XT slightly ahead.

    https://www.comptoir-hardware.com/a...-test-nvidia-geforce-rtx-3070-ti.html?start=5

    Other tests from other sources reveal similar results.
    Minecraft RTX: 2080Ti is 35% faster than 6900XT
    Amid Evil RTX: 2080Ti is 45% faster than 6900XT
    Call Of Duty Cold War: 2080Ti is 12% faster than 6900XT



    In synthetics, the 6900XT is 10% faster than the 2080Ti in the Port Royal test, but the more heavily ray-tracing-focused benchmark (Ray Tracing Feature Test) has the 6900XT and the 2080Ti/3070 on equal footing.

    https://www.sweclockers.com/test/34...0-ti-snabbt-dyrt-och-laskigt-effekttorstigt/5

    Take from that what you will.
     
    PSman1700 likes this.
  13. TopSpoiler

    Newcomer

    Joined:
    Aug 18, 2020
    Messages:
    74
    Likes Received:
    176
    ASVGF takes about 1/5 of frame time in Q2RTX.
    FYI, the best sample is the RTXDI SDK, which includes the latest versions of all the RTX-ish technologies including NRD. You have full control to turn every effect in the sample application on and off, so you can measure it yourself.
    https://github.com/NVIDIAGameWorks/RTXDI

    q2rtx.png
     
    T2098, Jawed, PSman1700 and 2 others like this.
  14. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    11,708
    Likes Received:
    2,132
    Location:
    London
    In the three games there, merely comparing 99th percentiles for RT on 3070 and 2080Ti (fps, 3070 v 2080Ti), we have:
    • Battlefield - 37 v 44 - 2080Ti is 119% of 3070
    • Control - 23 v 23 - 2080Ti is 100% of 3070
    • Metro Exodus - 33 v 28 - 2080Ti is 85% of 3070
    Without RT:
    • Battlefield - 63 v 68 - 2080Ti is 108% of 3070
    • Control - 38 v 29 - 2080Ti is 76% of 3070
    • Metro Exodus - 52 v 52 - 2080Ti is 100% of 3070
    Control seems to be the only "tough" test there, as Metro Exodus looks like it's the old version with just GI, not the Enhanced Edition.

    So merely comparing NVidia to NVidia across the Turing/Ampere generation makes it hard to conclude much.

    Why does Control punish the 3070 so much more than 2080Ti? 3070 has higher triangle intersection throughput, more FLOPS and "full fat" async compute. Maybe the game's implementation is old enough that it doesn't exploit async compute? With a question over the integer instruction mix for ray tracing, maybe instruction issue rate (concurrently to FMA heavy and FMA light pipes) is more important than FLOPS? Which would mean that 3070 has no advantage in instruction issue rate.
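
    The comparison above can be recomputed directly from the quoted frame rates, along with how much of its non-RT frame rate each card keeps once RT is enabled. This is a minimal sketch; the (3070, 2080Ti) column order is inferred from the quoted percentages, and the dictionary keys are made up for illustration.

```python
# 99th-percentile fps from the post, stored as (RTX 3070, RTX 2080Ti).
# The column order is inferred from the quoted percentages (assumption).
FPS = {
    "Battlefield": {"rt": (37, 44), "no_rt": (63, 68)},
    "Control": {"rt": (23, 23), "no_rt": (38, 29)},
    "Metro Exodus": {"rt": (33, 28), "no_rt": (52, 52)},
}


def pct(numerator, denominator):
    """numerator as a percentage of denominator."""
    return 100.0 * numerator / denominator


def summarise(fps):
    """Per-game card-to-card ratios and RT retention for each card."""
    rows = {}
    for game, modes in fps.items():
        rt_3070, rt_2080ti = modes["rt"]
        raster_3070, raster_2080ti = modes["no_rt"]
        rows[game] = {
            "2080Ti_vs_3070_rt": pct(rt_2080ti, rt_3070),
            "2080Ti_vs_3070_no_rt": pct(raster_2080ti, raster_3070),
            "3070_rt_retained": pct(rt_3070, raster_3070),
            "2080Ti_rt_retained": pct(rt_2080ti, raster_2080ti),
        }
    return rows


summary = summarise(FPS)
for game, row in summary.items():
    print(game, {key: round(value) for key, value in row.items()})
```

    On these numbers the 3070 keeps roughly 61% of its non-RT frame rate in Control, against roughly 79% for the 2080Ti, which is the gap the question above is about.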

    I worry whether all the results presented on that page are from the same driver.
     
  15. DavidGraham

    Veteran

    Joined:
    Dec 22, 2009
    Messages:
    3,976
    Likes Received:
    5,210
    Hardwareluxx:
    https://www.hardwareluxx.de/index.p...o3d-geforce-rtx-3090-ti-im-test.html?start=17

    Cyberpunk 2077:
    3090Ti is 85% faster @ both 2160p and 1440p than 6900XT

    Control:
    3090Ti is 70% faster @2160p and 80% faster @1440p than 6900XT

    Battlefield V:
    3090Ti is 120% faster @2160p and 70% faster @1440p than 6900XT

    Call of Duty: Black Ops Cold War:
    3090Ti is 65% faster @2160p and 60% faster @1440p than 6900XT

    Call Of Duty Modern Warfare:
    3090Ti is 30% faster @both 2160p and 1440p than 6900XT
    Probably a VRAM limitation at this 4K resolution.
     
    #1515 DavidGraham, Apr 5, 2022
    Last edited: Apr 5, 2022
    PSman1700 likes this.
  16. trinibwoy

    trinibwoy Meh
    Legend

    Joined:
    Mar 17, 2004
    Messages:
    12,055
    Likes Received:
    3,109
    Location:
    New York
    This should be 120% faster. You don’t need to add another 100% when something is over 100% faster. It’s the same math as all the others.
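
    The distinction between "X% faster" and "X% of" can be written out explicitly. The sketch below uses made-up frame rates (66 and 30 fps) purely for illustration; they are not taken from any review.

```python
def percent_faster(fps_a, fps_b):
    """How much faster A is than B, in percent (0 means equal)."""
    return (fps_a / fps_b - 1.0) * 100.0


def percent_of(fps_a, fps_b):
    """A's frame rate expressed as a percentage of B's (100 means equal)."""
    return fps_a / fps_b * 100.0


# Hypothetical frame rates, chosen so that A is 2.2x as fast as B.
fps_a, fps_b = 66.0, 30.0

print(f"A is {percent_faster(fps_a, fps_b):.0f}% faster than B")
print(f"A runs at {percent_of(fps_a, fps_b):.0f}% of B's frame rate")
```

    The two figures always differ by exactly 100 points, which is why a card that is 2.2x as fast is 120% faster, not 220% faster.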
     
    T2098, DavidGraham, CeeGee and 2 others like this.
  17. TopSpoiler

    Newcomer

    Joined:
    Aug 18, 2020
    Messages:
    74
    Likes Received:
    176
  18. DavidGraham

    Veteran

    Joined:
    Dec 22, 2009
    Messages:
    3,976
    Likes Received:
    5,210
    pharma and PSman1700 like this.
  19. trinibwoy

    trinibwoy Meh
    Legend

    Joined:
    Mar 17, 2004
    Messages:
    12,055
    Likes Received:
    3,109
    Location:
    New York
    PSman1700 likes this.
  20. Tarkin1977

    Newcomer

    Joined:
    Mar 10, 2018
    Messages:
    15
    Likes Received:
    15
    Don't get this wrong... Everybody already knows that RDNA2 has significantly lower ray tracing performance. No need to spam the forums with every single game performance test you come across.

    AMD needs at least a 3x increase (which is rumored) to not be obliterated again this fall.
     