GPU Ray Tracing Performance Comparisons [2021-2022]

What performance bug are people encountering here with DLSS, exactly? I'm not sure if I've "had that" yet.
From the above Hitman 3 review, it seems to occur randomly.
Now as we said in our previous article, we encountered a DLSS glitch/bug which resulted in really low performance. We were able to replicate this by enabling/disabling Ray Tracing (without making any changes at all to DLSS). This issue appears randomly, so we don't know what is really causing it. Still, if you ever experience it, you can fix it by disabling and then re-enabling DLSS.
 
Not a good look that Nvidia brags about the DLSS performance improvement so much here. This is nothing to brag about. RT-enabled games should run fine without DLSS and very smoothly with it on, not run like a slideshow without DLSS and barely acceptably with it on. This is ruining the reputation of both their cards and ray tracing in general. These effects probably run at full resolution, unnecessarily so.

This is a disaster IMO. Thankfully it's an older game, so this might get overlooked.
I'd say it depends on what improvement you get from using RT. Here specifically it just isn't worth the performance hit.
This situation isn't helped by the weird settings UI, which nowhere explains that the "reflections quality" option also controls RT reflections. So most people just turn on RT and get their 30 fps with it as a result, while it's perfectly fine to use RT even with "low" reflections quality here IMO. The biggest visual impact of reflections, on transparent glass surfaces, is already there on "low".

It's difficult to know if DLSS/P is functioning normally, since the DLSS glitch/bug that results in very low performance is still evident in this game.
I haven't seen any issues with DLSS functioning normally here. DLSS scaling is in line with what you'd expect from internal resolution changes.
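For reference, a quick sketch of the internal render resolutions implied by the commonly documented DLSS 2.x scale factors at 4K output (exact in-game pixel counts can differ by a pixel due to rounding):

```python
# Internal render resolutions implied by the standard DLSS 2.x scale
# factors at 3840x2160 output. Exact in-game values may round differently.
output_w, output_h = 3840, 2160
modes = {
    "Quality": 2 / 3,           # ~66.7% per axis
    "Balanced": 0.58,
    "Performance": 0.50,        # DLSS/P
    "Ultra Performance": 1 / 3,
}
for mode, scale in modes.items():
    print(f"{mode}: {round(output_w * scale)}x{round(output_h * scale)}")
# Quality: 2560x1440, Balanced: 2227x1253,
# Performance: 1920x1080, Ultra Performance: 1280x720
```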
 
The 3090 is 67% faster than 6950XT in Serious Sam First Encounter Path Tracing @4K.


The 3090 is 77% faster than 6950XT in Doom Path Tracing @4K.


What's noticeable in these mods is the poor performance of Turing GPUs; they really appear to be doing something wrong on Turing, as the 3070 is more than 65% faster than the 2080Ti, which should never happen!
 

Nvidia did claim Ampere was almost twice as fast as Turing in the optimal case when doing RT, and there are a lot of Ampere-specific enhancements that might come into play in a legacy title 'converted' to path tracing in ways they wouldn't elsewhere:

- GPU Accelerated RT Motion Blur
- Massive FP32 increase
- RT + Compute concurrency

These path-traced legacy games tend to spend a much higher proportion of their frame time on RT vs legacy rasterization, so there's a lot more benefit to be seen.

In a modern AAA title that might only splash a couple of RT 'effects' on top of a render pipeline that's mostly legacy rasterization, if you're only spending, let's say, 25% of your frame time on RT, then even if you make RT infinitely fast (adding 0ms to your frame time) you still only get ~33% more FPS.
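To put that math in one place (a back-of-the-envelope sketch; the frame-time shares and speedups are assumed figures, not measurements):

```python
# Amdahl's-law style estimate: FPS gain from speeding up only the RT
# portion of the frame. All shares/speedups below are illustrative.

def fps_gain(rt_share: float, rt_speedup: float) -> float:
    """Fractional FPS gain when the RT share of frame time becomes
    rt_speedup times faster; the rest of the frame is untouched."""
    new_frame_time = (1 - rt_share) + rt_share / rt_speedup
    return 1 / new_frame_time - 1

# Hybrid AAA title: RT is 25% of frame time, RT made infinitely fast.
print(f"{fps_gain(0.25, float('inf')):.0%}")  # -> 33%

# Path-traced legacy title: RT is 75% of frame time, RT merely doubled.
print(f"{fps_gain(0.75, 2.0):.0%}")           # -> 60%
```

Which is why a doubling of RT throughput shows up much more clearly in these path-traced mods than in hybrid titles.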

Quake 2 RTX on my 3080 spends fully 30-35% of its frametime on denoising (which I believe leans heavily on FP32, giving Ampere a significant advantage), and, who knows, they may have managed to take advantage of RT + compute concurrency too.
 

Attachments

  • q2rtx.png (218.7 KB)
The 2080Ti is so far behind the 6900XT and 6800XT as well, which just doesn't happen elsewhere. It's certainly not the norm in either Quake 2 or Minecraft path tracing, where the 2080Ti dominates RDNA2 GPUs by a significant margin. Something is wrong in these two mods (Doom and Serious Sam) that causes Turing GPUs to exhibit awful performance.
 
That's a good point; there's clearly something different about the way those two titles are implemented, although it may help to look at it from the opposite viewpoint: don't ask 'why did they nerf poor Turing', but rather what they are leveraging that both RDNA2 and Ampere happen to be better at than Turing.

One such thing I was able to find:

"Ampere RT doubled ray-triangle performance vs Turing, which according to some evidence suggest it runs 2 ray-box and 1 ray-triangle per clock.
Ampere RT now does 2 ops/clk for both.
RDNA 2 RT has 4 ray-box and 1 ray-triangle per clock. Thus, RDNA2 has 2x faster ray-box, but only 50% ray-triangle vs Ampere."

I wasn't able to find concrete numbers from NVidia about Turing's ray-box rate per clock, but RDNA2's ray-box rate is definitely 4x its ray-triangle rate.
If Doom or Serious Sam were heavily leveraging ray-box intersections, and RDNA2 really is that much faster at them per clock than Turing, the results may make a bit more sense.
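As a very rough sketch of why that could matter, here's peak ray-box throughput if you take the per-clock rates quoted above at face value and multiply by approximate unit counts and boost clocks (the Turing per-clock rate in particular is unconfirmed; all of these figures are assumptions for illustration):

```python
# Paper peak ray-box throughput = RT units x clock x ray-box per clock.
# Per-clock rates are the unconfirmed figures quoted above; unit counts
# and boost clocks are approximate.
gpus = {
    #          (RT units, clock GHz, ray-box/unit/clock)
    "2080 Ti": (68, 1.6, 2),    # Turing rate is speculation, per the quote
    "3090":    (82, 1.7, 2),
    "6900 XT": (80, 2.25, 4),
}
for name, (units, ghz, box_per_clk) in gpus.items():
    print(f"{name}: ~{units * ghz * box_per_clk:.0f} G ray-box/s")
# 2080 Ti: ~218, 3090: ~279, 6900 XT: ~720
```

On those (shaky) numbers, RDNA2's paper ray-box rate comes out well ahead, which would fit the pattern if these two mods are unusually box-test heavy.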
 

Attachments

  • rdna2-rb.PNG (157.3 KB)
"Ampere RT doubled ray-triangle performance vs Turing, which according to some evidence suggest it runs 2 ray-box and 1 ray-triangle per clock.”

Out of curiosity, what's the evidence for 2 box intersections per clock? Nvidia's RT patents refer to 8 boxes per clock. Either way, the BVH should be relatively simple given the low-density geometry and low-detail environment being rendered. Is the BVH build being done on the CPU or the GPU?

It doesn't really make sense for RDNA 2 to gain ground on Turing in ray-traced versions of old games. RT should be an even greater percentage of the workload in such games, given the simple geometry and shading, and should therefore play to Turing's RT strengths.
 

This reminds me of the good old days before RDNA2 launched, when we were all comparing the XSX's advertised intersection rates (or some other paper RT metric) and some were concluding it was going to run rings around any Turing-based GPU.

Ahhh good times.
 

I've been trying to find details, but NV is pretty tight-lipped in their documentation.

Starting on page 17 of the Ampere whitepaper, they talk about the doubled ray-triangle intersection rate and are very careful not to mention a ray-box intersection rate increase anywhere, which leads me to believe the ray-box rate is the same as Turing's. Still trying to find concrete information on Turing's rate for each, though.

They also talk about Ampere's concurrent compute and RT capability on the next page, specifically the case where you do RT and denoising concurrently. Looking at the Quake 2 RTX frametime profile I posted above, an engine that leveraged that concurrency properly would see an enormous speedup, given that there looked to be at least 3ms each of denoising and RT in a 12ms frame. According to Nvidia, Turing simply isn't capable of this. They show their best-case example of the speedup from concurrent compute/RT, and it's nearly double. If RDNA2 is capable of the same thing, it's not too hard to envision a scenario where we get results like those in Doom/Serious Sam.
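As a toy model of what that overlap could buy (the 12ms frame with ~3ms each of RT and denoising is just my eyeballed reading of the Quake 2 RTX capture above, and perfect overlap is an idealization):

```python
# Toy frame-time model: serial RT + denoise vs. ideal full overlap.
# The split is eyeballed from the Quake 2 RTX capture, not measured.
other_ms, rt_ms, denoise_ms = 6.0, 3.0, 3.0    # assumed 12 ms frame total

serial = other_ms + rt_ms + denoise_ms          # run one after the other
overlap = other_ms + max(rt_ms, denoise_ms)     # RT and denoise concurrent

print(f"serial:  {serial:.0f} ms ({1000 / serial:.0f} fps)")
print(f"overlap: {overlap:.0f} ms ({1000 / overlap:.0f} fps)")
print(f"speedup: {serial / overlap:.2f}x")      # -> 1.33x for this split
```

Even this crude model shows a healthy uplift, and the larger the RT/denoise share of the frame, the closer the gap gets to Nvidia's near-2x best case.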
 
Turing is fully capable of running compute concurrently with RT.
What Ampere adds is the ability to run compute, tensor ops and RT concurrently.
I'm assuming that Nvidia is talking about denoising done on the tensor cores in this example. But that's hardly relevant to any real-world usage scenario, because no one is running denoising on the tensor cores due to the compatibility issues that would arise from it.
 
That's an odd one, as Nvidia claims multiple times in the Ampere whitepaper that Turing cannot.
Table 4 on page 18 pretty clearly says "Concurrent RT and Shading: NO" for Turing.

 

Attachments

  • ampere-concurrency.PNG (67 KB)
  • turing-rt-featurematrix.PNG (86.1 KB)