GPU Ray Tracing Performance Comparisons [2021-2022]

The 6900 XT uses a bit more power when running RT, roughly 10-20 watts over rasterization. A 3090 uses quite a bit more power across a variety of games. They are not in the same ballpark.

The video has zero games running ray tracing, so I'm not sure how relevant it is to this discussion.

Also note that the user has set the TDP of the 6900XT to 275W, which is lower than the 300W default limit. The 6900XT's clocks are also low (~2280MHz max) and stay fixed throughout each test, unlike the 3090, which changes its clocks constantly in the video. That is a dead giveaway of a hard power cap; perhaps he did it because his 6900XT was running at a constant 75°C, which is 10°C hotter than his 3090.
 
Is there a ray-tracing frames per watt measurement out there?
I have some numbers for the RTX 3080 vs RX 6800XT, both OC'ed. Generally the 3080 consumes 370W, while the 6800XT maxes out at 285W.

In Minecraft path tracing, the 3080 delivers 2X to 3X more frames than the 6800XT at 4K.
In Control the 3080 delivers 60% more fps at 4K.
In Call of Duty Cold War, the 3080 delivers 60% more fps at 4K.
In Watch Dogs Legion, the 3080 delivers 60% more fps at 4K.
In Battlefield V, the 3080 delivers 65% more fps at 4K.
In Metro Exodus (original version), the 3080 delivers 40% more fps at 4K.
In Shadow of the Tomb Raider, the 3080 delivers 50% more fps at 4K.

So for roughly 30% more power, the 3080 delivers at minimum double that percentage in additional performance, depending on the complexity of the ray tracing effects.
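
To put a rough number on ray-tracing frames per watt from those figures, here is a minimal sketch in Python. It is purely illustrative: it just divides each fps ratio by the ~1.3x power ratio, assuming the quoted 370W / 285W board powers hold during the RT runs.

```python
# Rough relative RT performance-per-watt from the figures quoted above (illustrative).
POWER_3080_W = 370
POWER_6800XT_W = 285
power_ratio = POWER_3080_W / POWER_6800XT_W  # ~1.30, i.e. the "30% more watts"

# fps_ratio = 3080 fps / 6800XT fps at 4K, taken from the per-game deltas above
fps_ratios = {
    "Minecraft path tracing (low end)": 2.0,
    "Minecraft path tracing (high end)": 3.0,
    "Control / Cold War / Legion": 1.6,
    "Battlefield V": 1.65,
    "Metro Exodus (original)": 1.4,
    "Shadow of the Tomb Raider": 1.5,
}

for game, fps_ratio in fps_ratios.items():
    rel_fps_per_watt = fps_ratio / power_ratio  # >1.0 means the 3080 is more efficient in RT
    print(f"{game:35s} 3080 RT fps/W vs 6800XT: {rel_fps_per_watt:.2f}x")
```

Even the worst case above (Metro's 40% delta) works out to the 3080 being slightly ahead in RT fps per watt, and the path-traced titles put it far ahead.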

 
Just to add to the dogpile, my 3080Ti finds a pretty whopping efficiency gain when poking at the undervolting options. I wrote a decently-sized post a few weeks ago regarding my undervolting / overclocking findings with my card; I'll post this, go find my old post, and then edit this one to insert it here: I was happier with my laptop's GTX 1050Ti 4GB than I'll ever be with my desktop GTX 1060 3GB | Beyond3D Forum

The Cliff's Notes for my particular situation:
My GPU could stably maintain 1640MHz at 0.750v for hours on end while consuming literally half the power I observed earlier. Because the clock was capped at 1640, the peak framerate was notably lower than stock -- but the average framerate was only down by a few frames at most. Multiple rounds of testing over the course of a weekend found that 800mv could reliably sustain 1715MHz at a paltry 240W, 850mv could maintain 1790MHz at an acceptable 280W, and 875mv would get me all the way to 1840MHz while pushing into the 330W max power consumption area.

And when I say sustained speeds, I actually mean sustained speeds. It wasn't the stock behavior of a brief glimpse of 1800MHz followed by a crash to the 1600s; it was legitimately sticking at 1715MHz, or 1790MHz, or even 1840MHz and not budging, because it was hitting neither the thermal limit nor the power limit.
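
For a quick feel for how those sustained points trade clock for power, here is a minimal sketch in Python, using only the three voltage points above that came with both a clock and a power figure (the 875mv / 330W point is taken as the baseline):

```python
# Sustained clock-per-watt at the three undervolt points reported above.
points = [
    (0.800, 1715, 240),  # volts, sustained MHz, watts
    (0.850, 1790, 280),
    (0.875, 1840, 330),
]

base_mhz, base_w = points[-1][1], points[-1][2]  # baseline: the 875 mV point
for volts, mhz, watts in points:
    print(f"{volts:.3f} V: {mhz} MHz @ {watts} W -> {mhz / watts:.2f} MHz/W, "
          f"{mhz / base_mhz:.0%} of the clock for {watts / base_w:.0%} of the power")
```

In other words, the 800mv point keeps roughly 93% of the sustained clock for about 73% of the power, which is the efficiency gain described above.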
 
The reason I asked about watts per RT-FPS is to find out whether "theoretical TFLOPS" is part of the difference we're seeing.

With the wide range in performance delta between RDNA 2 and Ampere (and perhaps with clues based upon Turing) I'm wondering if we can determine whether there are situations where FP throughput is a significant portion of the delta between Ampere and RDNA 2.
 
The reason I asked about watts per RT-FPS is to find out whether "theoretical TFLOPS" is part of the difference we're seeing.

With the wide range in performance delta between RDNA 2 and Ampere (and perhaps with clues based upon Turing) I'm wondering if we can determine whether there are situations where FP throughput is a significant portion of the delta between Ampere and RDNA 2.

RT shading performance is probably dominated by shader and memory coherence (or lack thereof) and not so much by raw flops. I don't know how you would isolate FP throughput from all the other dependencies in an RT pass.

I ran a few traces of the Bright Memory raytracing benchmark and SM instruction throughput was pretty low during the Dispatch Rays call. What's interesting is that right after each call to Dispatch Rays there was a period of high SM utilization which makes me think RT in this benchmark is writing to a V/G buffer and the hit shading is done after the raytracing pass and not during.

It's inconclusive but the only thing I could gather is that there are periods of ~80% SM instruction throughput. So if you set aside FP32 flops for a sec and just look at overall instruction throughput Ampere may have an edge as I believe its peak instruction throughput is higher than on RDNA2. It's murky because a lot of those instructions are ALU/SFU transcendentals, type-conversions and bit-manipulation so it's not a straightforward flops comparison.
 
I'm not sure if that's the right interpretation. The RTX 3090 has a higher TDP at 350W vs 300W and therefore uses about 50W more power in GPU-limited gaming scenarios, regardless of what those are. Modern cards basically just all clock up and then run at their limits. It's not like the pre-Kepler/GCN era, where you could have extremely wide-ranging power consumption, especially if you ran something like Furmark. The video in question is basically showing this to be the case: all the tests essentially hit roughly the same ceiling without much significant deviation.

Also, how reliable is that YouTube channel? Some of those channels, especially the ones that only show gameplay loops with stat overlays and zero other footage, are somewhat controversial/questionable in terms of veracity. The 275W limit shown by the 6900XT in this sample seems to differ from what more vetted publications report -

https://www.tomshardware.com/reviews/amd-radeon-rx-6900-xt-review/4
https://www.techpowerup.com/review/amd-radeon-rx-6900-xt/31.html
https://www.igorslab.de/en/grasps-a...with-benchmarks-and-a-technology-analysis/13/ (igor's does have both raster and ray trace data)

All three of those that measure power at the GPU essentially show the cards roughly in spec with their official TDP power limits. Which makes sense, since the power governors are set with other engineering/safety considerations in mind and would cause issues if one type of workload ended up consistently way above the maximum spec.
Those results show the same power gap just with the variances of silicon quality.
 
RT shading performance is probably dominated by shader and memory coherence (or lack thereof) and not so much by raw flops. I don't know how you would isolate FP throughput from all the other dependencies in an RT pass.
For those two things, RDNA and Ampere should be equivalent, ish. I suppose.

I ran a few traces of the Bright Memory raytracing benchmark and SM instruction throughput was pretty low during the Dispatch Rays call. What's interesting is that right after each call to Dispatch Rays there was a period of high SM utilization which makes me think RT in this benchmark is writing to a V/G buffer and the hit shading is done after the raytracing pass and not during.
That's the standard DXR 1.0 approach, as I understand it. Favours NVidia (and soon to be, Intel). Though I assume there's a dedicated buffer set up by the driver to handle the results of despatch rays (spilling results to off-chip memory). That buffer is then consumed by the hit shaders.

Is that the correct interpretation?

It's inconclusive but the only thing I could gather is that there are periods of ~80% SM instruction throughput. So if you set aside FP32 flops for a sec and just look at overall instruction throughput Ampere may have an edge as I believe its peak instruction throughput is higher than on RDNA2. It's murky because a lot of those instructions are ALU/SFU transcendentals, type-conversions and bit-manipulation so it's not a straightforward flops comparison.
Type conversion and bit manipulation should run at native SIMD rate, for at least one SIMD out of the pair in Ampere. Perhaps those only run on the "integer" SIMD?

Does 80% of SM instruction throughput imply 1.6 instructions per clock? You're implying that's the peak utilisation as far as I can tell, so the overall utilisation is lower. Also, is some of that shading unrelated to hit shaders?

AMD's ray tracing performance diagnostic tools have only just been announced, so diving deep isn't possible yet.

In the end, comparing say the 6800XT and 3080, where the 3080 has around 40% more FLOPS, how much of the 60%+ RT advantage (at 4K) seen on the 3080 is due to FLOPS?
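
For reference, the "around 40% more FLOPS" figure can be sanity-checked with the usual spec-sheet formula (shader count x 2 ops per FMA x boost clock). A minimal sketch, assuming the public reference boost clocks; actual clocks vary by board and workload:

```python
# Rough FP32 TFLOPS from public shader counts and reference boost clocks (approximate).
def fp32_tflops(shaders: int, boost_ghz: float) -> float:
    return shaders * 2 * boost_ghz / 1000.0  # 2 FP32 ops per FMA per clock

rtx_3080 = fp32_tflops(8704, 1.71)    # ~29.8 TFLOPS
rx_6800xt = fp32_tflops(4608, 2.25)   # ~20.7 TFLOPS
print(f"3080: {rtx_3080:.1f} TF, 6800XT: {rx_6800xt:.1f} TF, "
      f"ratio: {rtx_3080 / rx_6800xt:.2f}x")  # ~1.44x
```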
 
That's the standard DXR 1.0 approach, as I understand it. Favours NVidia (and soon to be, Intel). Though I assume there's a dedicated buffer set up by the driver to handle the results of despatch rays (spilling results to off-chip memory). That buffer is then consumed by the hit shaders.

Is that the correct interpretation?

DispatchRays doesn't return until after the hit shader runs so it seems the bound hit shader isn't doing a lot of math. My guess is that it's just writing parameters out to a buffer and a separate compute shader is spun up after to actually do the material shading.

Type conversion and bit manipulation should run at native SIMD rate, for at least one SIMD out of the pair in Ampere. Perhaps those only run on the "integer" SIMD?

Yep, they run on the INT pipeline.

Does 80% of SM instruction throughput imply 1.6 instructions per clock? You're implying that's the peak utilisation as far as I can tell, so the overall utilisation is lower. Also, is some of that shading unrelated to hit shaders?

According to the profiler each SM has a peak instruction issue rate of 4.0 and peak measured is 3.2 (80%). It makes sense as an SM has 4 independent partitions each with its own instruction scheduler and execution units. What's bizarre is that the profiler reports peak issue rate for each of the two FP pipelines as 0.5 per clock. I would expect it to be 2.0 as each of the 4 partitions can issue to each FP pipeline every other clock. ALU peak is 2.0 as expected as it takes 2 clocks to process an ALU warp. SFU is 0.5 as expected as it takes 8 clocks to process an SFU warp. I don't know how to interpret the FP peaks.

ampere-instr.png
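
To spell out the arithmetic behind those expectations: warps issued per clock per SM = partitions / (32 / lanes-per-partition). A minimal sketch, assuming a GA10x SM with 4 partitions, 32-thread warps, and the commonly cited per-partition lane counts (16 per FP32 pipe, 16 for ALU, 4 for SFU):

```python
# Expected per-SM warp issue rates for a GA10x SM (4 partitions, 32-thread warps).
PARTITIONS = 4
WARP_SIZE = 32

def issue_rate_per_sm(lanes_per_partition: int) -> float:
    clocks_per_warp = WARP_SIZE / lanes_per_partition
    return PARTITIONS / clocks_per_warp  # warps issued per clock across the whole SM

print("ALU  (16 lanes):", issue_rate_per_sm(16))  # 2.0 -> matches the profiler
print("SFU  ( 4 lanes):", issue_rate_per_sm(4))   # 0.5 -> matches the profiler
print("FP32 (16 lanes):", issue_rate_per_sm(16))  # 2.0 expected, yet reported as 0.5
```

The ALU and SFU peaks fall straight out of that formula, which is why the 0.5 per clock reported for each FP pipeline looks odd.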
 
DispatchRays doesn't return until after the hit shader runs so it seems the bound hit shader isn't doing a lot of math. My guess is that it's just writing parameters out to a buffer and a separate compute shader is spun up after to actually do the material shading.
That sounds like a very good theory.

According to the profiler each SM has a peak instruction issue rate of 4.0 and peak measured is 3.2 (80%). It makes sense as an SM has 4 independent partitions each with its own instruction scheduler and execution units. What's bizarre is that the profiler reports peak issue rate for each of the two FP pipelines as 0.5 per clock. I would expect it to be 2.0 as each of the 4 partitions can issue to each FP pipeline every other clock. ALU peak is 2.0 as expected as it takes 2 clocks to process an ALU warp. SFU is 0.5 as expected as it takes 8 clocks to process an SFU warp. I don't know how to interpret the FP peaks.

ampere-instr.png
This appears to explain the terms being used:

Advanced Learning :: Nsight Graphics Documentation (nvidia.com)

What I don't understand is why the picture in that document is not explained by the bullet points that follow. Light Pipe and Heavy Pipe?

This describes the pipelines:

Kernel Profiling Guide :: Nsight Compute Documentation (nvidia.com)

where the heavy FMA pipeline is different from the light one because it has integer dot product functionality. Other "integer" operations occur on other pipelines, the primary one being "alu".

Then we have the "fma" pipeline: "Fused Multiply Add/Accumulate. The FMA pipeline processes most FP32 arithmetic (FADD, FMUL, FMAD). It also performs integer multiplication operations (IMUL, IMAD), as well as integer dot products. On GA10x, FMA is a logical pipeline that indicates peak FP32 and FP16x2 performance. It is composed of the FMAHeavy and FMALite physical pipelines."

Notice that this isn't a real pipeline in Ampere, merely a convenience term describing the combined capability of the heavy and light pipelines.

The "alu" pipeline is for bitwise and boolean operations and some integer math.

As for the discrepancy in the peak rates, I have no idea!

Back to the subject of the benefits of FP32 throughput in NVidia ray tracing:

nvidia-ampere-ga-102-gpu-architecture-whitepaper-v2.pdf

"Ray tracing denoising shaders are a good example of a workload that can benefit greatly from doubling FP32 throughput." I have no idea what proportion of the frame time is spent denoising ray tracing results.
 
(and perhaps with clues based upon Turing) I'm wondering if we can determine whether there are situations where FP throughput is a significant portion of the delta between Ampere and RDNA 2.

The 2080Ti/3070 are superior to the 6900XT in heavy RT workloads.

In Minecraft RTX, the 2080Ti is 30% faster than 6900XT, the 3070 is 66% faster than 6900XT.
In Quake 2 RTX, the 2080Ti is 10% faster than 6900XT, the 3070 is 35% faster than 6900XT.
In Battlefield V, the 2080Ti/3070 are both 10% faster.
In Call Of Duty Cold War, the 2080Ti is 35% faster than 6900XT.
Other games have the 6900XT slightly ahead.

https://www.comptoir-hardware.com/a...-test-nvidia-geforce-rtx-3070-ti.html?start=5

Other tests from other sources reveal similar results.
Minecraft RTX: 2080Ti is 35% faster than 6900XT
Amid Evil RTX: 2080Ti is 45% faster than 6900XT
Call Of Duty Cold War: 2080Ti is 12% faster than 6900XT


In synthetics, the 6900XT is 10% faster than the 2080Ti in the Port Royal test, but the more heavily ray-tracing-focused benchmark (the Ray Tracing Feature Test) has the 6900XT and 2080Ti/3070 on equal footing.

https://www.sweclockers.com/test/34...0-ti-snabbt-dyrt-och-laskigt-effekttorstigt/5

Take from that what you will.
 
I have no idea what proportion of the frame time is spent denoising ray tracing results.

ASVGF takes about 1/5 of frame time in Q2RTX.
FYI, the best sample is the RTXDI SDK, which includes the latest versions of all the RTX-ish technologies, including NRD, and gives you full control to turn every effect on and off in the sample application, so you can measure it yourself.
https://github.com/NVIDIAGameWorks/RTXDI

q2rtx.png
 
In synthetics, the 6900XT is 10% faster than the 2080Ti in the Port Royal test, but the more heavily ray-tracing-focused benchmark (the Ray Tracing Feature Test) has the 6900XT and 2080Ti/3070 on equal footing.

https://www.sweclockers.com/test/34...0-ti-snabbt-dyrt-och-laskigt-effekttorstigt/5
In the three games there, merely comparing 99th percentiles for RT on the 3070 and 2080Ti, we have:
  • Battlefield - 37 v 44 - 2080Ti is 119% of 3070
  • Control - 23 v 23 - 2080Ti is 100% of 3070
  • Metro Exodus - 33 v 28 - 2080Ti is 85% of 3070
Without RT:
  • Battlefield - 63 v 68 - 2080Ti is 108% of 3070
  • Control - 38 v 29 - 2080Ti is 78% of 3070
  • Metro Exodus - 52 v 52 - 2080Ti is 100% of 3070
Control seems to be the only "tough" test there, as Metro Exodus looks like it's the old version with just GI, not the Enhanced Edition.

So merely comparing NVidia to NVidia across the Turing/Ampere generation makes it hard to conclude much.

Why does Control punish the 3070 so much more than 2080Ti? The 3070 has higher triangle-intersection throughput, more FLOPS and "full fat" async compute. Maybe the game's implementation is old enough that it doesn't exploit async compute? With a question over the integer instruction mix for ray tracing, maybe instruction issue rate (issuing concurrently to the FMA heavy and FMA light pipes) is more important than FLOPS? Which would mean that the 3070 has no advantage in instruction issue rate.
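
A rough spec-sheet check on that last point, assuming 4 warp schedulers per SM on both Turing and GA10x and the public reference boost clocks (actual clocks vary by board):

```python
# Rough peak instruction-issue capacity vs FP32 FLOPS for 2080 Ti and 3070,
# from public SM counts, shader counts, and reference boost clocks.
SCHEDULERS_PER_SM = 4  # assumed for both Turing and GA10x

cards = {
    "2080 Ti": (68, 4352, 1.545),  # SMs, FP32 "cores", reference boost GHz
    "3070":    (46, 5888, 1.725),
}

for name, (sms, cores, ghz) in cards.items():
    issue_gps = sms * SCHEDULERS_PER_SM * ghz  # billions of issue slots per second
    tflops = cores * 2 * ghz / 1000.0          # FP32 TFLOPS (2 ops per FMA)
    print(f"{name}: ~{issue_gps:.0f} G issue slots/s, ~{tflops:.1f} TFLOPS")
```

On that crude measure the 2080Ti has roughly 30% more raw issue capacity than the 3070 despite far fewer FP32 FLOPS, which would be consistent with issue rate, rather than FLOPS, being the limiter in Control.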

I worry whether all the results presented on that page are from the same driver.
 
Hardwareluxx:
https://www.hardwareluxx.de/index.p...o3d-geforce-rtx-3090-ti-im-test.html?start=17

Cyberpunk 2077:
3090Ti is 85% faster @ both 2160p and 1440p than 6900XT

Control:
3090Ti is 70% faster @2160p and 80% faster @1440p than 6900XT

Battlefield V:
3090Ti is 120% faster @2160p and 70% faster @1440p than 6900XT

Call Of Duty Black Ops Cold War:
3090Ti is 65% faster @2160p and 60% faster @1440p than 6900XT

Call Of Duty Modern Warfare:
3090Ti is 30% faster @ both 2160p and 1440p than 6900XT
Why does Control punish the 3070 so much more than 2080Ti?
Probably a VRAM limitation at this 4K resolution.
 