GPU Ray Tracing Performance Comparisons [2021] *spawn*

Discussion in 'Architecture and Products' started by DavidGraham, Mar 29, 2021.

  1. troyan

    Regular

    Joined:
    Sep 1, 2015
    Messages:
    603
    Likes Received:
    1,123
    3DMark Feature Test uses DXR: https://www.legitreviews.com/3dmark-directx-raytracing-feature-test-benchmark-results_223376
    Ampere is around 60% faster per Compute Unit than Turing.

    /edit:
    Perf/mm^2 increased around 2.2x and perf/xtors increased about 1.25x between TU104 and GA104.
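As a sanity check, those two ratios imply roughly the same absolute speedup once you fold in the die specs. A quick sketch (the area and transistor figures are assumptions taken from public spec sheets, not from this thread: TU104 ≈ 545 mm² / 13.6B transistors, GA104 ≈ 392.5 mm² / 17.4B):

```python
# Back-of-envelope check: infer the absolute perf ratio implied by the
# perf/mm^2 and perf/transistor claims. Die figures are assumptions.
TU104 = {"mm2": 545.0, "xtors_b": 13.6}
GA104 = {"mm2": 392.5, "xtors_b": 17.4}

perf_per_mm2_gain = 2.2    # claimed above
perf_per_xtor_gain = 1.25  # claimed above

# absolute perf ratio = (perf/mm^2 gain) * (area ratio)
via_area = perf_per_mm2_gain * GA104["mm2"] / TU104["mm2"]
# absolute perf ratio = (perf/transistor gain) * (transistor ratio)
via_xtors = perf_per_xtor_gain * GA104["xtors_b"] / TU104["xtors_b"]

print(f"implied speedup via area:        {via_area:.2f}x")
print(f"implied speedup via transistors: {via_xtors:.2f}x")
```

Both routes land at roughly 1.6x, consistent with "around 60% faster".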
     
    #661 troyan, Aug 20, 2021
    Last edited: Aug 20, 2021
  2. DegustatoR

    Veteran

    Joined:
    Mar 12, 2002
    Messages:
    3,240
    Likes Received:
    3,397
    I mean most games are "black boxes" too since we don't have their code and can't know what they are doing.
     
    PSman1700, pharma and Jawed like this.
  3. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    11,708
    Likes Received:
    2,132
    Location:
    London
    Anyone with an RTX card can download and run the GPSnoopy RayTracingInVulkan application:

    https://forum.beyond3d.com/posts/2218671/

    like I did. You can vary the bounce counts, rendering resolution and samples per pixel per frame.

    In theory it's possible to coordinate amongst testers to find "worst-case" scenarios, with high bounce counts and high ray divergence.

    Also I think we need to be wary of old benchmarks with "out of date" drivers.
     
    T2098 and BRiT like this.
  4. CarstenS

    Legend Subscriber

    Joined:
    May 31, 2002
    Messages:
    5,800
    Likes Received:
    3,920
    Location:
    Germany
    So, that's a no?
     
  5. trinibwoy

    trinibwoy Meh
    Legend

    Joined:
    Mar 17, 2004
    Messages:
    12,055
    Likes Received:
    3,112
    Location:
    New York
    I don't understand what you're saying but I'm pretty sure that divergence will be an issue no matter the workload. So smaller wavefronts = better.
     
  6. nAo

    nAo Nutella Nutellae
    Veteran

    Joined:
    Feb 6, 2002
    Messages:
    4,400
    Likes Received:
    440
    Location:
    San Francisco
    Moving (possibly a lot of..) compute state around to extract coherency can be very expensive. It doesn't automatically give you a win and it can only help with execution divergence, not data divergence.
     
  7. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    11,708
    Likes Received:
    2,132
    Location:
    London
    1024 work items running on hardware with a hardware thread size of 32 means 32 hardware threads are required to support that workgroup.

    In the limit: if a single ray follows a very long path e.g. traversing more nodes of the BVH than all the other rays in the workgroup, then that ray "drags along" 31 others, assuming there's 1 ray per work item. So, the entire hardware thread runs at the speed of the slowest ray. The 31 other hardware threads can run other code (e.g. hit shader), provided that there's an uber shader wrapping traversal and hit/miss.

    In reality, of course, there'll be averaging. Some hardware threads might have a very narrow range of ALU execution cycles for BVH traversal, e.g. highly coherent rays or rays that travel a short distance or rays that terminate upon their first bounce. Other hardware threads will see a large range of ALU execution cycles for BVH traversal, meaning that those hardware threads will see many ALU cycles wasted to "finish off" the last one or two rays.
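The "slowest ray drags along the wavefront" effect is easy to sketch numerically. A toy model (hypothetical per-ray traversal costs, 32 lanes per hardware thread; not a measurement of any real GPU):

```python
import random

LANES = 32  # lanes per hardware thread (wavefront)

def wasted_fraction(ray_cycles, lanes=LANES):
    """Fraction of lane-cycles idle when each group of `lanes` rays
    must all wait for the slowest ray in the group."""
    total = wasted = 0
    for i in range(0, len(ray_cycles), lanes):
        group = ray_cycles[i:i + lanes]
        worst = max(group)
        total += worst * len(group)
        wasted += sum(worst - c for c in group)
    return wasted / total

random.seed(1)
# Coherent rays: similar BVH path lengths across the workgroup.
coherent = [100 + random.randint(0, 5) for _ in range(1024)]
# Divergent rays: wildly varying path lengths.
divergent = [random.randint(10, 400) for _ in range(1024)]
print(f"coherent rays:  {wasted_fraction(coherent):.1%} lane-cycles idle")
print(f"divergent rays: {wasted_fraction(divergent):.1%} lane-cycles idle")
```

The coherent case wastes only a few percent; the divergent case wastes a large fraction, because each wavefront runs at the speed of its worst straggler.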

    It appears AMD took the decision to use uber ray-trace shaders to combine traversal with miss/hit shaders, and then combined rays into workgroups that are larger than the hardware thread size.

    Yes, smaller hardware threads are always better in terms of ALU utilisation. Intel with 16 work item hardware threads should be rewarded with less wastage. The trade-off is that there'll be more hardware:
    • scheduling for each SIMD
    • register file mechanics (banking, porting control)
    • data paths (to and from the rest of the GPU)
    Cache is for data divergence.

    Execution divergence can be mitigated by keeping work on the same SIMD, but moving it to a different hardware thread. If you do that right, then pretty much all the work is in the register file and operand collector.
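A minimal sketch of that kind of regrouping, under the assumption that each work item is tagged with its next shader state and gets re-packed into state-uniform wavefronts (all names here are hypothetical, not any vendor's API):

```python
from collections import defaultdict

LANES = 32

def repack_by_state(work_items):
    """Bucket work items by their next shader state (e.g. 'traverse',
    'hit', 'miss') and re-pack each bucket into full wavefronts, so
    every lane in a wavefront executes the same branch."""
    buckets = defaultdict(list)
    for item in work_items:
        buckets[item["state"]].append(item)
    wavefronts = []
    for state, items in buckets.items():
        for i in range(0, len(items), LANES):
            wavefronts.append((state, items[i:i + LANES]))
    return wavefronts

# Hypothetical mix: 70 rays still traversing, 40 hits, 18 misses.
items = ([{"state": "traverse"}] * 70 +
         [{"state": "hit"}] * 40 +
         [{"state": "miss"}] * 18)
for state, wf in repack_by_state(items):
    print(state, len(wf))
```

On real hardware the expensive part this sketch glosses over is exactly the point above: each item's registers live in lanes of the register file, so "moving" it means shuffling that state around.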

    But yes, I'm not going to pretend that this is easy. Nested control flow wrecks performance pretty much regardless (though BVH traversal is a kind of uniform nested loop). I'm still sceptical about it in the end.

    Intel looks like it's mitigating the problem with a hardware thread size of 16.
     
  8. trinibwoy

    trinibwoy Meh
    Legend

    Joined:
    Mar 17, 2004
    Messages:
    12,055
    Likes Received:
    3,112
    Location:
    New York
    Yeah this is the case for any high latency instruction. Stalled wavefronts are swapped out for other wavefronts from the same or a different thread group. Nothing unique there and it doesn’t help with divergence within an actively executing wavefront (or hardware thread as you call it). I’m not sure why you don’t like to use standard GPU compute terminology :)

    Ok, this is standard best practice for GPU compute. Not following how it helps with divergence within an active wavefront.

    Yup

    Register files are banked and populated based on mapping of SIMD lanes. Arbitrary rearrangement of in-flight threads in those lanes would wreak havoc on register access.

    Why 16 and not 8?
     
  9. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    11,708
    Likes Received:
    2,132
    Location:
    London
    I didn't say it did. Divergence within a workgroup is the broad topic. Often a workgroup uses shared memory, so shaders that utilise shared memory as part of their algorithm are an example where divergence is less of a hit than expected, because of the speed-up provided by shared memory - provided that there's more than one hardware thread issued as part of the workgroup.

    RDNA 2's BVH traversal algorithm appears to be heavily dependent upon shared memory. It's unclear if it uses workgroups of more than 32 work items. I'm speculating as to why performance in ray-tracing doesn't fall off a cliff:
    • uber-shader mixes BVH traversal and miss/hit shading
    • large work-groups don't waste 1023 ALU cycles per cycle, if 1 work item is running and 1023 are not - the ratio is closer to 31:1 in that extreme case
    The key here is that the program counter is per hardware thread, not per workgroup.
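The per-hardware-thread program counter is what turns the 1023:1 worst case into 31:1. A quick illustration, assuming a 1024-item workgroup split into 32-lane hardware threads:

```python
WORKGROUP = 1024
LANES = 32
threads = WORKGROUP // LANES  # 32 hardware threads per workgroup

# Worst case: exactly one work item still running.
# If the PC were shared by the whole workgroup, 1023 lanes would
# idle every cycle:
idle_per_workgroup_pc = WORKGROUP - 1
# With a PC per hardware thread, the other 31 threads retire (or run
# other code); only the 31 sibling lanes of the straggler idle:
idle_per_thread_pc = LANES - 1

print(f"{threads} hardware threads per workgroup")
print(f"per-workgroup PC: {idle_per_workgroup_pc}:1 idle ratio")
print(f"per-thread PC:    {idle_per_thread_pc}:1 idle ratio")
```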

    For RDNA, AMD beefed up:
    • register file size
    • count of hardware threads per SIMD
    • LDS size for large workgroups (portions of 128KB, or 128KB entirely used by a single workgroup?)
    and I speculate that RDNA 2's ray tracing relies upon large workgroup sizes coupled with large shared memory allocations.

    Agreed.

    Remember how G80 did vertical and horizontal register allocations? It was a bit of a mess.

    Reorganisation in units of quad-lanes might be an acceptable trade-off.

    Well, that's the control-hardware versus work-done trade-off.

    Related to this we have the 1-pixel-sized triangles problem. It's as if there were lots of control-hardware for less ALU lanes! Pixels are shaded in units of fragment-quads so the hardware can end up running at one-quarter throughput in this extreme case. So your SIMD-32 is suddenly a SIMD-8 in terms of throughput.
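That worst case can be made concrete (standard 2x2 fragment-quad shading; this is the theoretical floor, not a measurement):

```python
SIMD_WIDTH = 32
QUAD = 4  # pixels are shaded in 2x2 quads (needed for derivatives)

# Worst case: every triangle covers a single pixel, so each quad
# carries 1 useful fragment and 3 "helper" fragments.
useful_per_quad = 1
quads_per_wave = SIMD_WIDTH // QUAD              # 8 quads per wavefront
useful_lanes = quads_per_wave * useful_per_quad  # 8 useful lanes
print(f"effective width: SIMD-{useful_lanes} "
      f"({useful_lanes / SIMD_WIDTH:.0%} of peak)")
```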

    I do wonder whether the power usage of control hardware is now relatively small versus what the ALU lanes use at full capacity, such that there is a trend towards narrower SIMDs and/or an increase in control hardware per ALU lane.

    We'll have to wait for the Arc whitepaper or similar to find out though.
     
  10. trinibwoy

    trinibwoy Meh
    Legend

    Joined:
    Mar 17, 2004
    Messages:
    12,055
    Likes Received:
    3,112
    Location:
    New York
    Pretty cool app. My numbers on a 3090 below.
    • 6900xt is faster than the 3090 with only 1 bounce.
    • 6900xt is faster when intersecting procedural geometry (scene 1 and 2).
    • 3090 is faster with more bounces but not fast enough to make up for deficit on procedural geometry
    • 3090 is faster when intersecting triangle geometry but still loses by a significant margin with only 1 bounce (scene 4 and 5)
    Speculation based on these results:
    • 6900xt has much more raw ray casting throughput but can only achieve it for highly coherent primary rays (1 bounce)
    • 6900xt performance drops precipitously after 1 bounce indicating SIMD efficiency loss due to ray divergence
    • 6900xt is much faster at tracing procedural geometry because everything is done on the shaders (traversal and intersection) and there's no need to exchange data with the ray accelerator
    • 3090 is much faster with multiple bounces due to better handling of divergent / incoherent rays by the MIMD RT core
    • 3090 is much slower at intersecting procedural geometry because it has to hand off to the shader core to run the intersection and the architecture is not optimized for this
    • 3090 is happiest when it can leverage the RT core to intersect triangle geometry
    The 3090's strengths are more in line with practical use cases which explains why it comes out ahead in actual games (triangle based geometry, incoherent rays)

    [image: 3090 benchmark results]
     
    Lightman, Newguy, T2098 and 6 others like this.
  11. vjPiedPiper

    Newcomer

    Joined:
    Nov 23, 2005
    Messages:
    136
    Likes Received:
    88
    Location:
    Melbourne Aus.
    Would this describe mesh shader workflows?
     
  12. trinibwoy

    trinibwoy Meh
    Legend

    Joined:
    Mar 17, 2004
    Messages:
    12,055
    Likes Received:
    3,112
    Location:
    New York
    Not really. Mesh shaders spit out triangles and are part of the classic rasterization pipeline.
     
  13. nAo

    nAo Nutella Nutellae
    Veteran

    Joined:
    Feb 6, 2002
    Messages:
    4,400
    Likes Received:
    440
    Location:
    San Francisco
    I've just run the app for the first time. My hunch is that the 6900xt is faster at primary rays because a larger portion of the traversal work is assigned to the programmable cores (relative to RTX GPUs), and since the surface shading is very simple, you're basically pegging the whole 6900xt on ray tracing.

    In this specific scenario in theory one could throw much more complex shading at RTX GPUs without impacting performance too much.
     
    pharma, Lightman, Newguy and 4 others like this.
  14. CarstenS

    Legend Subscriber

    Joined:
    May 31, 2002
    Messages:
    5,800
    Likes Received:
    3,920
    Location:
    Germany
    That's interesting, since even an RTX 3080 Ti supposedly has 67 dedicated "RT-TFLOPS" plus its regular 34 TFLOPS (I can't find the officially mentioned numbers for the 3090 right now; they should be ever so slightly higher), compared to the 6900 XT's 23 total TFLOPS ;)
    Sorry, I can imagine you also cringed when you read that marketing stunt.
    --
    So, how would one design a test to best isolate the Ray/Triangle intersection rate? In effect like a full screen quad fillrate test for Raytracing.
     
    Jawed likes this.
  15. OlegSH

    Regular

    Joined:
    Jan 10, 2010
    Messages:
    797
    Likes Received:
    1,624
    That GPSnoopy RayTracingInVulkan application has a toy scene with minimal geometry overlap (each ray takes the bare minimum of traversal steps to hit some geometry). Aside from that, the scene itself is just too small, with too few pieces of geometry; it won't be representative of performance in real games, which are way more complex.
    Tracing a toy scene that should fit nicely into L2 won't reveal much (real games have large BVHs, not tiny ones). If we add a little bit of divergence, as Futuremark did in its synthetic RT DOF benchmark (still primary visibility with a simple toy scene and rays presorted by direction), SW traversal becomes slower. Any hint of divergent execution drops SW traversal performance dramatically, and that can come from anything: more complex scenes or simply more stochastic ray directions.

    Sorry, but such tests look useless even when we account for their synthetic nature.

    Why not just use game engines for RT tests? With a game engine you can use in-game scenes, adjust the number of materials and the scene complexity, and the results would also be representative of performance in real games. UE5, for example, would allow testing primary visibility, path tracing, RT reflections, RT shadows, GI, etc.
     
    milk, Lightman, PSman1700 and 3 others like this.
  16. OlegSH

    Regular

    Joined:
    Jan 10, 2010
    Messages:
    797
    Likes Received:
    1,624
    There is an exchange of data with the ray-AABB intersection units in the ray accelerators; otherwise GPUs with HW acceleration would be just as slow as the all-SW GTX 1080 Ti in the tables here - https://github.com/GPSnoopy/RayTracingInVulkan
    What this test doesn't take into account is BVH complexity, or it could also be SM <-> RA communication latency bound. Wide BVH architectures should win on more complex scenes, while narrow BVH architectures will shine on toy examples where there isn't enough complexity to extract additional BVH parallelism on wider architectures.
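The wide-versus-narrow trade-off shows up in simple depth arithmetic. A sketch for idealised, fully balanced trees (real BVHs are messier, so treat this as illustrative only):

```python
import math

def traversal_steps(leaves, branching):
    """Depth of an idealised balanced BVH: each step tests
    `branching` child AABBs and descends one level."""
    return math.ceil(math.log(leaves, branching))

for leaves in (500, 1_000_000):  # toy scene vs game-sized scene
    d2 = traversal_steps(leaves, 2)
    d8 = traversal_steps(leaves, 8)
    print(f"{leaves:>9} leaves: {d2} steps (2-wide) vs {d8} steps (8-wide)")
```

Fewer steps means fewer round trips between the shader core and the intersection hardware, but on a toy scene a wide node tests many child boxes per step for little gain.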
     
    PSman1700 likes this.
  17. trinibwoy

    trinibwoy Meh
    Legend

    Joined:
    Mar 17, 2004
    Messages:
    12,055
    Likes Received:
    3,112
    Location:
    New York
    Yeah presumably in mixed workloads the 6900xt will lose a step.

    Here's a trace comparison on the 3090 between scene 1 (procedural spheres) and scene 4 (Cornell box with real geometry). Instruction issue rates are pretty low and it's mostly INT. So lots of opportunity to overlap other compute or graphics. The main difference between the scenes is much higher cache and vram usage for the Cornell box. For scene 1 where the 6900xt shines it seems everything basically fits in L1 which makes sense since there's no real geometry or textures to load.

    [image: trace comparison on the 3090, scenes 1 and 4]
     
    Lightman, Jawed, T2098 and 5 others like this.
  18. trinibwoy

    trinibwoy Meh
    Legend

    Joined:
    Mar 17, 2004
    Messages:
    12,055
    Likes Received:
    3,112
    Location:
    New York
    They're definitely useless as an indicator of performance in games. But they're useful for trying to isolate specific behavior of the hardware. In any real RT scenario like a game there will be so many variables that it would be difficult to identify the contribution of any one factor.
     
    Albuquerque, Lightman, Jawed and 3 others like this.
  19. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    11,708
    Likes Received:
    2,132
    Location:
    London
    Taking the row of 16 bounces, which corresponds with the test results table shown here:

    GPSnoopy/RayTracingInVulkan: Implementation of Peter Shirley's Ray Tracing In One Weekend book using Vulkan and NVIDIA's RTX extension. (github.com)

    • Scene 1 - 42.8fps - 1.26 GRPS - trinibwoy 1.23 GRPS
    • Scene 2 - 43.6fps - 1.29 GRPS - trinibwoy 1.26 GRPS
    • Scene 3 - 38.9fps - 1.15 GRPS - trinibwoy 1.11 GRPS
    • Scene 4 - 79.5fps - 2.34 GRPS - trinibwoy 2.33 GRPS
    • Scene 5 - 40.0fps - 1.18 GRPS - trinibwoy 1.10 GRPS
    You have a 3090FE too, don't you? Scene 5 shows the most difference, with your result at 93% of the test results table.

    Did you test at 2560x1440 or at 3840x2160? I'm curious whether resolution has made a difference here as 7% seems like an outlier. I don't see why resolution would make a difference.
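The 93% figure falls straight out of the posted numbers (values copied from the list above):

```python
# scene: (reference GRPS from the results table, trinibwoy's GRPS)
scenes = {1: (1.26, 1.23), 2: (1.29, 1.26), 3: (1.15, 1.11),
          4: (2.34, 2.33), 5: (1.18, 1.10)}
for scene, (ref, measured) in scenes.items():
    print(f"scene {scene}: {measured / ref:.1%} of reference")
```

Scenes 1-4 sit within a few percent of the reference; scene 5 at ~93% is the outlier.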

    When I did my tests the GPU was effectively running at throttled clocks. I ran my tests like this (excerpt from a much longer BAT file) :

    Code:
    RayTracer.exe --benchmark --width 3840 --height 2160 --fullscreen --scene 2 --present-mode 0 --bounces 16  --samples 1    > 016x001.txt
    RayTracer.exe --benchmark --width 3840 --height 2160 --fullscreen --scene 2 --present-mode 0 --bounces 16  --samples 2    > 016x002.txt
    RayTracer.exe --benchmark --width 3840 --height 2160 --fullscreen --scene 2 --present-mode 0 --bounces 16  --samples 4    > 016x004.txt
    RayTracer.exe --benchmark --width 3840 --height 2160 --fullscreen --scene 2 --present-mode 0 --bounces 16  --samples 8    > 016x008.txt
    RayTracer.exe --benchmark --width 3840 --height 2160 --fullscreen --scene 2 --present-mode 0 --bounces 16  --samples 16   > 016x016.txt
    RayTracer.exe --benchmark --width 3840 --height 2160 --fullscreen --scene 2 --present-mode 0 --bounces 16  --samples 32   > 016x032.txt
    RayTracer.exe --benchmark --width 3840 --height 2160 --fullscreen --scene 2 --present-mode 0 --bounces 16  --samples 64   > 016x064.txt
    so that there was no chance for the GPU to "thermally relax" as it progressed. The spreadsheet data I collected shows, in many cases, initial data points that are slow due to "start up", but averaging takes care of that.
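The warm-up handling can be sketched like this (hypothetical: assumes you have already parsed per-run fps samples out of the redirected output files; the discard count is arbitrary):

```python
def steady_state_fps(samples, warmup=3):
    """Average fps after discarding the first `warmup` data points,
    which tend to be slow due to start-up effects."""
    usable = samples[warmup:] if len(samples) > warmup else samples
    return sum(usable) / len(usable)

# Hypothetical run: first readings slow while clocks and caches settle.
run = [31.0, 36.5, 39.8, 42.6, 42.9, 42.7, 42.8]
print(f"{steady_state_fps(run):.1f} fps steady-state")
```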

    Another explanation for your outlier could be that your card was thermally throttled. When running these tests is your card thermally throttled? Power throttled?

    I should point out that my 6900 XT is not using the XTX chip, as I once mistakenly said, but XTXH. Mine, at default, runs at 310-312W according to the Radeon software metrics, which I think is a ~30W overclock versus the XTX chip (280W). Of course the cooler is relevant here, too. I deliberately ran the case at low airflow to get a solid throttle, but the chip supposedly throttles at 95C and in my testing it only reached 91 ("junction", which I think is a "hotspot" reading). So power appears to be the dominant throttling factor with my card/case.

    XTX chips supposedly throttle at 110C, while XTXH have a lower limit of 95:

    The Radeon RX 6900 XT Owners Thread. | Page 125 | Overclockers UK Forums

    When I ran the tests back in July I was under the impression that the thermal throttle is 90C, but it seems not...

    Throttling, of course, is relevant in a discussion of "shading FLOPS" versus "ray tracing FLOPS"...
     
    Lightman likes this.
  20. OlegSH

    Regular

    Joined:
    Jan 10, 2010
    Messages:
    797
    Likes Received:
    1,624
    It's very far from a microbenchmark, and the lack of any way to adjust the scene makes it even less usable, probably because it was never meant for such usage.
     