GPU Ray Tracing Performance Comparisons [2021-2022]

I think it's very much worthwhile to contrast "professional" ray-traced rendering with game ray-traced rendering.

OptiX is a black box, as far as I can tell. It's running on top of black-box hardware.

Black box inception makes it pretty hard to talk about the hardware.

3DMark Feature Test uses DXR: https://www.legitreviews.com/3dmark-directx-raytracing-feature-test-benchmark-results_223376
Ampere is around 60% faster per Compute Unit than Turing.

/edit:
Perf/mm^2 increased around 2.2x and perf/transistor increased about 1.25x between TU104 and GA104.
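
As a rough sanity check on those ratios (my own back-of-the-envelope, not from the benchmark itself): taking the commonly quoted die sizes and transistor counts for the two chips, and assuming a ~1.6x raw result ratio in the feature test, the 2.2x and 1.25x figures fall out directly.

Code:
// Back-of-the-envelope check of the perf/mm^2 and perf/transistor figures.
// The ~1.6x raw ratio is an assumption derived from the "~60% faster" figure
// above; die sizes / transistor counts are the commonly quoted ones.
#include <cstdio>

int main() {
    const double raw_ratio     = 1.6;                          // assumed GA104 vs TU104 feature-test ratio
    const double tu104_mm2     = 545.0, ga104_mm2     = 392.5; // die sizes in mm^2
    const double tu104_xtors_b = 13.6,  ga104_xtors_b = 17.4;  // billions of transistors

    const double perf_per_mm2  = raw_ratio * (tu104_mm2 / ga104_mm2);          // ~2.2x
    const double perf_per_xtor = raw_ratio * (tu104_xtors_b / ga104_xtors_b);  // ~1.25x

    std::printf("perf/mm^2 ratio: %.2f\nperf/transistor ratio: %.2f\n",
                perf_per_mm2, perf_per_xtor);
    return 0;
}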
 
Anyone with an RTX card can download and run the GPSnoopy RayTracingInVulkan application:

https://forum.beyond3d.com/posts/2218671/

like I did. You can vary the bounce counts, rendering resolution and samples per pixel per frame.

In theory it's possible to coordinate amongst testers to find "worst-case" scenarios, with high bounce counts and high ray divergence.

Also I think we need to be wary of old benchmarks with "out of date" drivers.
 
It's worth remembering that a workgroup of, say, 1024 work items intrinsically offers a speed-up for dynamic branching on current hardware: when an entire hardware thread goes dark, that subset of work items (e.g. 32 on RDNA 2) no longer occupies any execution slots. During traversal, those execution slots can run the closest-/any-/miss-shader for those work items instead, if the hardware is running an uber-ray-shader that combines traversal with closest-/any-/miss-shaders.
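
To make the "uber-ray-shader" idea concrete, here's a minimal scalar C++ sketch of what such a combined kernel could look like. It's purely illustrative: the types and functions (Stage, Ray, step_traversal, shade_hit, shade_miss) are hypothetical stand-ins, not any vendor's API. Each ray carries a stage, and a lane that has finished traversal immediately moves on to hit/miss shading inside the same loop rather than sitting idle.

Code:
// Hypothetical scalar sketch of an uber ray shader: traversal and
// closest-hit/miss shading live in one loop. All names here are illustrative.
#include <cstdlib>

enum class Stage { Traverse, Hit, Miss, Done };
struct Ray { Stage stage = Stage::Traverse; int steps_left = 8; };

Stage step_traversal(Ray& r) {                        // stub for one BVH traversal step
    if (--r.steps_left > 0) return Stage::Traverse;
    return (std::rand() & 1) ? Stage::Hit : Stage::Miss;
}
void shade_hit(Ray&)  { /* closest-hit shading would go here */ }
void shade_miss(Ray&) { /* miss shading would go here */ }

void uber_shade(Ray* rays, int count) {               // 'count' plays the role of the 32 lanes
    bool any_active = true;
    while (any_active) {                              // the hardware thread loops until its
        any_active = false;                           // slowest lane is done
        for (int i = 0; i < count; ++i) {
            switch (rays[i].stage) {
            case Stage::Traverse: rays[i].stage = step_traversal(rays[i]); break;
            case Stage::Hit:      shade_hit(rays[i]);  rays[i].stage = Stage::Done; break;
            case Stage::Miss:     shade_miss(rays[i]); rays[i].stage = Stage::Done; break;
            case Stage::Done:     break;
            }
            if (rays[i].stage != Stage::Done) any_active = true;
        }
    }
}

int main() { Ray rays[32]; uber_shade(rays, 32); return 0; }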

I don't understand what you're saying but I'm pretty sure that divergence will be an issue no matter the workload. So smaller wavefronts = better.
 
Finally, if you can build a compute unit that can do conditional routing to mitigate the slow-down of incoherent branching, then hardware traversal is entirely pointless. RDNA 2 is not 10x or more slower even in the worst-case scenarios, so a 2-4x speed-up from such a solution is more than enough.
Moving (possibly a lot of..) compute state around to extract coherency can be very expensive. It doesn't automatically give you a win and it can only help with execution divergence, not data divergence.
 
I don't understand what you're saying but I'm pretty sure that divergence will be an issue no matter the workload. So smaller wavefronts = better.
1024 work items running on hardware with a hardware thread size of 32 means 32 hardware threads are required to support that workgroup.

In the limit: if a single ray follows a very long path e.g. traversing more nodes of the BVH than all the other rays in the workgroup, then that ray "drags along" 31 others, assuming there's 1 ray per work item. So, the entire hardware thread runs at the speed of the slowest ray. The 31 other hardware threads can run other code (e.g. hit shader), provided that there's an uber shader wrapping traversal and hit/miss.

In reality, of course, there'll be averaging. Some hardware threads might have a very narrow range of ALU execution cycles for BVH traversal, e.g. highly coherent rays or rays that travel a short distance or rays that terminate upon their first bounce. Other hardware threads will see a large range of ALU execution cycles for BVH traversal, meaning that those hardware threads will see many ALU cycles wasted to "finish off" the last one or two rays.

It appears AMD took the decision to use uber ray-trace shaders to combine traversal with miss/hit shaders, and then combined rays into workgroups that are larger than the hardware thread size.

Yes, smaller hardware threads are always better in terms of ALU utilisation. Intel with 16 work item hardware threads should be rewarded with less wastage. The trade-off is that there'll be more hardware:
  • scheduling for each SIMD
  • register file mechanics (banking, porting control)
  • data paths (to and from the rest of the GPU)
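
A toy Monte-Carlo illustration of this post's point (my own sketch, not measured data): give every ray a random traversal step count, charge each hardware thread the cost of its slowest lane, and compare hardware thread widths. The lognormal spread of step counts is an arbitrary assumption; the only point is the trend that narrower threads waste fewer slots.

Code:
// Toy Monte-Carlo illustration (not measured data) of ALU utilisation loss
// from divergent traversal lengths: a hardware thread runs for as many steps
// as its slowest lane, so utilisation = sum(lane steps) / (width * max steps).
#include <algorithm>
#include <cstdio>
#include <random>
#include <vector>

int main() {
    std::mt19937 rng(42);
    std::lognormal_distribution<double> steps(3.0, 0.8);  // assumed spread, purely illustrative

    const int num_rays = 1 << 16;
    std::vector<double> cost(num_rays);
    for (double& c : cost) c = steps(rng);

    for (int width : {16, 32, 64}) {
        double useful = 0.0, issued = 0.0;
        for (int base = 0; base + width <= num_rays; base += width) {
            double worst = 0.0, sum = 0.0;
            for (int lane = 0; lane < width; ++lane) {
                worst = std::max(worst, cost[base + lane]);
                sum += cost[base + lane];
            }
            useful += sum;
            issued += worst * width;   // every lane holds a slot until the slowest finishes
        }
        std::printf("width %2d: ALU utilisation ~%.0f%%\n", width, 100.0 * useful / issued);
    }
    return 0;
}
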
Moving (possibly a lot of..) compute state around to extract coherency can be very expensive. It doesn't automatically give you a win and it can only help with execution divergence, not data divergence.
Cache is for data divergence.

Execution divergence can be mitigated by keeping work on the same SIMD, but moving it to a different hardware thread. If you do that right, then pretty much all the work is in the register file and operand collector.

But yes, I'm not going to pretend that this is easy. Nested control flow wrecks performance pretty much regardless (though BVH traversal is a kind of uniform nested loop). I'm still sceptical about it in the end.
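
For what it's worth, here's a minimal sketch of the repacking idea being argued about, assuming (and this is the hard part) that per-ray state can be moved freely between lanes: compact the rays that still need traversal into contiguous lanes between rounds, so the next round issues fewer hardware threads. The names and structure are mine, purely for illustration.

Code:
// Minimal sketch of compacting still-active rays between traversal rounds,
// assuming per-ray state can be moved freely (in practice that state movement
// is exactly the expensive part). All names here are illustrative.
#include <cstddef>
#include <cstdio>
#include <vector>

struct Ray { bool active; int node; /* ...the rest of the per-ray state... */ };

// Keep only rays that still need traversal; they now occupy contiguous lanes,
// so the next round issues ceil(active / width) hardware threads instead of all of them.
std::vector<Ray> compact_active(const std::vector<Ray>& rays) {
    std::vector<Ray> packed;
    packed.reserve(rays.size());
    for (const Ray& r : rays)
        if (r.active) packed.push_back(r);
    return packed;
}

int main() {
    const std::size_t width = 32;
    std::vector<Ray> rays(1024, {true, 0});
    for (std::size_t i = 0; i < rays.size(); ++i)   // pretend most rays terminated this round
        rays[i].active = (i % 7 == 0);

    const std::vector<Ray> packed = compact_active(rays);
    std::printf("hardware threads next round: %zu -> %zu\n",
                (rays.size() + width - 1) / width,
                (packed.size() + width - 1) / width);
    return 0;
}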

Intel looks like it's mitigating the problem with a hardware thread size of 16.
 
In the limit: if a single ray follows a very long path e.g. traversing more nodes of the BVH than all the other rays in the workgroup, then that ray "drags along" 31 others, assuming there's 1 ray per work item. So, the entire hardware thread runs at the speed of the slowest ray. The 31 other hardware threads can run other code (e.g. hit shader), provided that there's an uber shader wrapping traversal and hit/miss.

Yeah this is the case for any high latency instruction. Stalled wavefronts are swapped out for other wavefronts from the same or a different thread group. Nothing unique there and it doesn’t help with divergence within an actively executing wavefront (or hardware thread as you call it). I’m not sure why you don’t like to use standard GPU compute terminology :)

It appears AMD took the decision to use uber ray-trace shaders to combine traversal with miss/hit shaders, and then combined rays into workgroups that are larger than the hardware thread size.

Ok, this is standard best practice for GPU compute. Not following how it helps with divergence within an active wavefront.

Yes, smaller hardware threads are always better in terms of ALU utilisation. Intel with 16 work item hardware threads should be rewarded with less wastage. The trade-off is that there'll be more hardware:
  • scheduling for each SIMD
  • register file mechanics (banking, porting control)
  • data paths (to and from the rest of the GPU)
Yup

Execution divergence can be mitigated by keeping work on the same SIMD, but moving it to a different hardware thread. If you do that right, then pretty much all the work is in the register file and operand collector.

Register files are banked and populated based on mapping of SIMD lanes. Arbitrary rearrangement of in-flight threads in those lanes would wreak havoc on register access.

Intel looks like it's mitigating the problem with a hardware thread size of 16.
Why 16 and not 8?
 
Ok, this is standard best practice for GPU compute. Not following how it helps with divergence within an active wavefront.
I didn't say it did. Divergence within a workgroup is the broad topic. Often a workgroup uses shared memory, so shaders that utilise shared memory as part of their algorithm are an example where divergence is less of a hit than expected, because of the speed-up provided by shared memory - provided that there's more than one hardware thread issued as part of the workgroup.

RDNA 2's BVH traversal algorithm appears to be heavily dependent upon shared memory. It's unclear if it uses workgroups of more than 32 work items. I'm speculating as to why performance in ray-tracing doesn't fall off a cliff:
  • uber-shader mixes BVH traversal and miss/hit shading
  • large work-groups don't waste 1023 ALU slots per cycle if 1 work item is running and 1023 are not - the wastage is closer to 31:1 in that extreme case
The key here is that the program counter is per hardware thread, not per workgroup.

For RDNA, AMD beefed up:
  • register file size
  • count of hardware threads per SIMD
  • LDS size for large workgroups (portions of 128KB, or 128KB entirely used by a single workgroup?)
and I speculate that RDNA 2's ray tracing relies upon large workgroup sizes coupled with large shared memory allocations.
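
For reference, here's a generic stack-based traversal sketch (not AMD's actual code, just the textbook shape of the algorithm) showing why LDS is the natural home for traversal state: the hot per-ray scratch is a small stack of node indices that gets touched every iteration.

Code:
// Generic stack-based BVH traversal sketch (not AMD's implementation).
// The per-ray node stack is the small, hot scratch area a GPU shader would
// want to keep in LDS/shared memory rather than spill to VRAM.
#include <cstdint>
#include <utility>
#include <vector>

struct AABB { float lo[3], hi[3]; };
struct Node { AABB box; std::int32_t left, right, prim; };   // prim >= 0 marks a leaf

// Slab test; assumes inv_dir = 1/dir was precomputed per ray.
bool hit_aabb(const AABB& b, const float org[3], const float inv_dir[3], float t_max) {
    float t0 = 0.0f, t1 = t_max;
    for (int a = 0; a < 3; ++a) {
        float tn = (b.lo[a] - org[a]) * inv_dir[a];
        float tf = (b.hi[a] - org[a]) * inv_dir[a];
        if (tn > tf) std::swap(tn, tf);
        if (tn > t0) t0 = tn;
        if (tf < t1) t1 = tf;
    }
    return t0 <= t1;
}

// Returns the last leaf whose bounds the ray touched (a real traversal would
// intersect the primitive and track the closest hit).
int traverse(const std::vector<Node>& bvh, const float org[3], const float inv_dir[3]) {
    std::int32_t stack[32];     // per-ray scratch; assumes tree depth <= 32
    int sp = 0, hit = -1;
    stack[sp++] = 0;            // start at the root
    while (sp > 0) {
        const Node& n = bvh[stack[--sp]];
        if (!hit_aabb(n.box, org, inv_dir, 1e30f)) continue;
        if (n.prim >= 0) hit = n.prim;                        // leaf
        else { stack[sp++] = n.left; stack[sp++] = n.right; }
    }
    return hit;
}

int main() {
    std::vector<Node> bvh = { { { {-1,-1,-1}, {1,1,1} }, -1, -1, 0 } };   // single leaf
    const float org[3] = {0, 0, -5}, inv_dir[3] = {1e30f, 1e30f, 1.0f};   // ray along +z
    return traverse(bvh, org, inv_dir) == 0 ? 0 : 1;
}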

Register files are banked and populated based on mapping of SIMD lanes. Arbitrary rearrangement of in-flight threads in those lanes would wreak havoc on register access.
Agreed.

Remember how G80 did vertical and horizontal register allocations? It was a bit of a mess.

Reorganisation in units of quad-lanes might be an acceptable trade-off.

Why 16 and not 8?
Well, that's the control-hardware versus work-done trade-off.

Related to this we have the 1-pixel-sized triangles problem. It's as if there were lots of control hardware for fewer ALU lanes! Pixels are shaded in units of 2x2 fragment quads, so with 1-pixel triangles the hardware can end up running at one-quarter throughput in this extreme case: your SIMD-32 is suddenly a SIMD-8 in terms of throughput.

I do wonder whether the power usage of control hardware is now relatively small versus what the ALU lanes use at full capacity, such that there is a trend towards narrower SIMDs and/or an increase in control hardware per ALU lane.

We'll have to wait for the Arc whitepaper or similar to find out though.
 
Anyone with an RTX card can download and run the GPSnoopy RayTracingInVulkan application

Pretty cool app. My numbers on a 3090 below.
  • 6900xt is faster than the 3090 with only 1 bounce.
  • 6900xt is faster when intersecting procedural geometry (scene 1 and 2).
  • 3090 is faster with more bounces but not fast enough to make up for deficit on procedural geometry
  • 3090 is faster when intersecting triangle geometry but still loses by a significant margin with only 1 bounce (scene 4 and 5)
Speculation based on these results:
  • 6900xt has much more raw ray casting throughput but can only achieve it for highly coherent primary rays (1 bounce)
  • 6900xt performance drops precipitously after 1 bounce indicating SIMD efficiency loss due to ray divergence
  • 6900xt is much faster at tracing procedural geometry because everything is done on the shaders (traversal and intersection) and there's no need to exchange data with the ray accelerator
  • 3090 is much faster with multiple bounces due to better handling of divergent / incoherent rays by the MIMD RT core
  • 3090 is much slower at intersecting procedural geometry because it has to hand off to the shader core to run the intersection and the architecture is not optimized for this
  • 3090 is happiest when it can leverage the RT core to intersect triangle geometry
The 3090's strengths are more in line with practical use cases which explains why it comes out ahead in actual games (triangle based geometry, incoherent rays)

[Attached results table: gpsnoopy-3090.png]
 
Pretty cool app. My numbers on a 3090 below.
  • 6900xt is faster than the 3090 with only 1 bounce.
  • 6900xt is faster when intersecting procedural geometry (scene 1 and 2).
  • 3090 is faster with more bounces but not fast enough to make up for deficit on procedural geometry
  • 3090 is faster when intersecting triangle geometry but still loses by a significant margin with only 1 bounce (scene 4 and 5)
Speculation based on these results:
  • 6900xt has much more raw ray casting throughput but can only achieve it for highly coherent primary rays (1 bounce)
  • 6900xt performance drops precipitously after 1 bounce indicating SIMD efficiency loss due to ray divergence
  • 6900xt is much faster at tracing procedural geometry because everything is done on the shaders (traversal and intersection) and there's no need to exchange data with the ray accelerator
  • 3090 is much faster with multiple bounces due to better handling of divergent / incoherent rays by the MIMD RT core
  • 3090 is much slower at intersecting procedural geometry because it has to hand off to the shader core to run the intersection and the architecture is not optimized for this
  • 3090 is happiest when it can leverage the RT core to intersect triangle geometry
The 3090's strengths are more in line with practical use cases which explains why it comes out ahead in actual games (triangle based geometry, incoherent rays)

I've just run the app for the 1st time, and my hunch is that the 6900xt is faster at primary rays because a larger portion of the traversal work is assigned to the programmable cores (compared with RTX GPUs); given that surface shading is very simple, one is basically pegging the whole 6900xt just to perform ray tracing.

In this specific scenario, in theory, one could throw much more complex shading at RTX GPUs without impacting performance too much.
 
I've just run the app for the 1st time, and my hunch is that the 6900xt is faster at primary rays because a larger portion of the traversal work is assigned to the programmable cores (compared with RTX GPUs); given that surface shading is very simple, one is basically pegging the whole 6900xt just to perform ray tracing.

In this specific scenario, in theory, one could throw much more complex shading at RTX GPUs without impacting performance too much.
That's interesting, since even an RTX 3080 Ti supposedly has 67 dedicated "RT TFLOPS" plus its regular 34 TFLOPS (I can't find the officially mentioned numbers for the 3090 right now, they should be ever so slightly higher) compared to the 6900XT's 23 total TFLOPS ;)
Sorry, I can imagine you also cringed when you read this marketing stunt.
--
So, how would one design a test that best isolates the ray/triangle intersection rate? In effect, like a full-screen-quad fillrate test, but for ray tracing.
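
One crude way to think about it (this is a CPU-side analogue of the idea, not an actual GPU test): fire a large batch of identical, perfectly coherent rays at a single triangle and time nothing but the intersection routine, so traversal, shading and memory traffic are out of the picture. Möller-Trumbore below; the GPU equivalent would be something like a one-triangle BLAS with null hit/miss shaders.

Code:
// Crude CPU-side analogue of a "ray/triangle fillrate" test: time only the
// Moller-Trumbore intersection routine against one triangle, with coherent rays.
#include <chrono>
#include <cstdio>

struct Vec3 { float x, y, z; };
static Vec3 sub(Vec3 a, Vec3 b)   { return {a.x - b.x, a.y - b.y, a.z - b.z}; }
static Vec3 cross(Vec3 a, Vec3 b) { return {a.y*b.z - a.z*b.y, a.z*b.x - a.x*b.z, a.x*b.y - a.y*b.x}; }
static float dot(Vec3 a, Vec3 b)  { return a.x*b.x + a.y*b.y + a.z*b.z; }

bool intersect(Vec3 org, Vec3 dir, Vec3 v0, Vec3 v1, Vec3 v2) {
    Vec3 e1 = sub(v1, v0), e2 = sub(v2, v0);
    Vec3 p = cross(dir, e2);
    float det = dot(e1, p);
    if (det > -1e-7f && det < 1e-7f) return false;   // ray parallel to the triangle
    float inv = 1.0f / det;
    Vec3 t = sub(org, v0);
    float u = dot(t, p) * inv;
    if (u < 0.0f || u > 1.0f) return false;
    Vec3 q = cross(t, e1);
    float v = dot(dir, q) * inv;
    if (v < 0.0f || u + v > 1.0f) return false;
    return dot(e2, q) * inv > 0.0f;                  // hit in front of the origin
}

int main() {
    const Vec3 v0{-1, -1, 5}, v1{1, -1, 5}, v2{0, 1, 5};
    const long n = 50'000'000;
    long hits = 0;
    auto start = std::chrono::steady_clock::now();
    for (long i = 0; i < n; ++i) {
        // Coherent rays: same direction, origins swept across the triangle.
        Vec3 org{(i % 1000) * 0.001f - 0.5f, 0.0f, 0.0f};
        hits += intersect(org, {0, 0, 1}, v0, v1, v2);
    }
    double sec = std::chrono::duration<double>(std::chrono::steady_clock::now() - start).count();
    std::printf("%ld hits, %.1f M intersections/s\n", hits, n / sec / 1e6);
    return 0;
}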
 
That GPSnoopy RayTracingInVulkan application has a toy scene with minimal geometry overlap (each ray needs only the bare minimum of traversal steps to hit something). Aside from that, the scene itself is just too small, with too few pieces of geometry, so it won't be representative of performance in real games, which are way more complex than this.
Tracing a toy scene which should fit nicely into L2 won't reveal much (real games have large BVHs, not tiny ones). If we add a little bit of divergence like Futuremark did in its synthetic RT DOF benchmark (still primary visibility with a simple toy scene and ray presorting by direction), SW traversal becomes slower. Any sign of divergent execution drops SW traversal performance dramatically, and that can come from anything: more complex scenes or simply more stochastic ray directions.

Sorry, but such tests look useless even when we account for their synthetic nature.

Why not just use game engines for RT tests? With a game engine, you can use in-game scenes, you can adjust the number of materials and the scene complexity, and it would also be representative of performance in real games. UE5, for example, would allow testing primary visibility, path tracing, RT reflections, RT shadows, GI, etc.
 
6900xt is much faster at tracing procedural geometry because everything is done on the shaders (traversal and intersection) and there's no need to exchange data with the ray accelerator
There is an exchange of data with the ray/AABB intersection units in the ray accelerators, otherwise GPUs with HW acceleration would be just as slow as the all-SW GTX 1080 Ti in the tables here - https://github.com/GPSnoopy/RayTracingInVulkan
What this test doesn't take into account is BVH complexity, and it can also be SM <-> RA communication latency bound. Wide BVH architectures should win on more complex scenes, while narrow BVH architectures would shine on toy examples where there isn't enough complexity to extract additional BVH parallelism on wider architectures.
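
For anyone not following the narrow-vs-wide distinction: it just refers to the branching factor of the BVH nodes. A rough sketch of the two node layouts (generic, not either vendor's actual format):

Code:
// Generic node layouts (not either vendor's actual format) showing the
// narrow-vs-wide trade-off: a wider node amortises one fetch over more box
// tests, but the extra slots only do useful work when the subtree actually
// has that many live children.
#include <cstdint>
#include <cstdio>

struct AABB { float lo[3], hi[3]; };

// Narrow (BVH2): one node fetch -> up to 2 box tests; deeper tree, more steps per ray.
struct NodeBVH2 {
    AABB         child_box[2];
    std::int32_t child[2];       // child node index, or an encoded leaf reference
};

// Wide (BVH4): one node fetch -> up to 4 box tests; shallower tree, fewer steps,
// but slots sit idle wherever a node has fewer than 4 children.
struct NodeBVH4 {
    AABB         child_box[4];
    std::int32_t child[4];
    std::uint8_t valid_mask;     // which of the 4 slots actually hold children
};

int main() {
    std::printf("BVH2 node: %zu bytes, BVH4 node: %zu bytes\n",
                sizeof(NodeBVH2), sizeof(NodeBVH4));
    return 0;
}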
 
I've just run the app for the 1st time, and my hunch is that the 6900xt is faster at primary rays because a larger portion of the traversal work is assigned to the programmable cores (compared with RTX GPUs); given that surface shading is very simple, one is basically pegging the whole 6900xt just to perform ray tracing.

In this specific scenario, in theory, one could throw much more complex shading at RTX GPUs without impacting performance too much.

Yeah presumably in mixed workloads the 6900xt will lose a step.

Here's a trace comparison on the 3090 between scene 1 (procedural spheres) and scene 4 (Cornell box with real geometry). Instruction issue rates are pretty low and it's mostly INT, so there's lots of opportunity to overlap other compute or graphics. The main difference between the scenes is much higher cache and VRAM usage for the Cornell box. For scene 1, where the 6900xt shines, it seems everything basically fits in L1, which makes sense since there's no real geometry or textures to load.

[Attached GPU trace comparison: gputrace-gpsnoopy-scenes-1-4.png]
 
Sorry, but such tests look useless even when we account for their synthetic nature.

They're definitely useless as an indicator of performance in games. But they're useful for trying to isolate specific behavior of the hardware. In any real RT scenario like a game there will be so many variables that it would be difficult to identify the contribution of any one factor.
 
Taking the row of 16 bounces, which corresponds with the test results table shown here:

GPSnoopy/RayTracingInVulkan: Implementation of Peter Shirley's Ray Tracing In One Weekend book using Vulkan and NVIDIA's RTX extension - https://github.com/GPSnoopy/RayTracingInVulkan

  • Scene 1 - 42.8fps - 1.26 GRPS - trinibwoy 1.23 GRPS
  • Scene 2 - 43.6fps - 1.29 GRPS - trinibwoy 1.26 GRPS
  • Scene 3 - 38.9fps - 1.15 GRPS - trinibwoy 1.11 GRPS
  • Scene 4 - 79.5fps - 2.34 GRPS - trinibwoy 2.33 GRPS
  • Scene 5 - 40.0fps - 1.18 GRPS - trinibwoy 1.10 GRPS
You have a 3090FE too, don't you? Scene 5 shows the most difference, with your result at 93% of the test results table.

Did you test at 2560x1440 or at 3840x2160? I'm curious whether resolution has made a difference here as 7% seems like an outlier. I don't see why resolution would make a difference.

When I did my tests the GPU was effectively running at throttled clocks. I ran my tests like this (excerpt from a much longer BAT file) :

Code:
RayTracer.exe --benchmark --width 3840 --height 2160 --fullscreen --scene 2 --present-mode 0 --bounces 16  --samples 1    > 016x001.txt
RayTracer.exe --benchmark --width 3840 --height 2160 --fullscreen --scene 2 --present-mode 0 --bounces 16  --samples 2    > 016x002.txt
RayTracer.exe --benchmark --width 3840 --height 2160 --fullscreen --scene 2 --present-mode 0 --bounces 16  --samples 4    > 016x004.txt
RayTracer.exe --benchmark --width 3840 --height 2160 --fullscreen --scene 2 --present-mode 0 --bounces 16  --samples 8    > 016x008.txt
RayTracer.exe --benchmark --width 3840 --height 2160 --fullscreen --scene 2 --present-mode 0 --bounces 16  --samples 16   > 016x016.txt
RayTracer.exe --benchmark --width 3840 --height 2160 --fullscreen --scene 2 --present-mode 0 --bounces 16  --samples 32   > 016x032.txt
RayTracer.exe --benchmark --width 3840 --height 2160 --fullscreen --scene 2 --present-mode 0 --bounces 16  --samples 64   > 016x064.txt

so that there was no chance for the GPU to "thermally relax" as it progressed. The spreadsheet data I collected shows, in many cases, initial data points that are slow due to "start up", but averaging takes care of that.

Another explanation for your outlier could be throttling. When running these tests, is your card thermally throttled? Power throttled?

I should point out that my 6900XT is not using the XTX chip, as I once mistakenly said, but the XTXH. Mine, at default, runs at 310-312W according to the Radeon software metrics, which I think is a ~30W overclock versus the XTX chip (280W). Of course the cooler is relevant here, too. I deliberately ran the case at low airflow to get a solid throttle, but the chip supposedly throttles at 95C and in my testing it only got to 91C ("junction", which I think is a hotspot reading). So power appears to be the dominant throttling factor with my card/case.

XTX chips supposedly throttle at 110C, while XTXH chips have a lower limit of 95C:

The Radeon RX 6900 XT Owners Thread. | Page 125 | Overclockers UK Forums

When I ran the tests back in July I was under the impression that the thermal throttle is 90C, but it seems not...

Throttling, of course, is relevant in a discussion of "shading FLOPS" versus "ray tracing FLOPS"...
 