Yeah, but RT is a very bad compute benchmark: the inner loops hit VRAM, and with random access at that. IC (Infinity Cache) should help with this a lot, yet it's still slower.
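To make the 'random access in inner loops' point concrete, here is a minimal CPU-side sketch (plain C++, not my GI code, all names hypothetical): the inner loop is a chain of dependent loads at unpredictable addresses, so it measures memory latency far more than ALU throughput. That's roughly the access pattern a BVH traversal produces, and the reason a large on-die cache like IC helps.

```cpp
// Minimal sketch of why an RT-style inner loop is a poor compute benchmark:
// each step is a *dependent* load from a random address, so the loop is
// bound by memory latency, not by ALU throughput. All names hypothetical.
#include <algorithm>
#include <chrono>
#include <cstdint>
#include <cstdio>
#include <random>
#include <vector>

struct Node { uint32_t next; float pad[7]; }; // 32 bytes, like a small BVH node

int main()
{
    const uint32_t count = 1u << 23;   // 8M nodes = 256 MB, far exceeds caches
    std::vector<Node> nodes(count);

    // Build one random permutation cycle: each node points at a random
    // successor, mimicking the unpredictable child pointers a ray follows.
    std::vector<uint32_t> order(count);
    for (uint32_t i = 0; i < count; ++i) order[i] = i;
    std::mt19937 rng(42);
    std::shuffle(order.begin(), order.end(), rng);
    for (uint32_t i = 0; i < count; ++i)
        nodes[order[i]].next = order[(i + 1) % count];

    // 'Traversal': a chain of dependent random loads, exactly the pattern
    // that HW RT units and big on-die caches are there to hide.
    auto t0 = std::chrono::steady_clock::now();
    uint32_t cur = order[0];
    for (uint32_t i = 0; i < count; ++i) cur = nodes[cur].next;
    auto t1 = std::chrono::steady_clock::now();

    long long ms = std::chrono::duration_cast<std::chrono::milliseconds>(t1 - t0).count();
    printf("dependent random loads: %lld ms (end node %u)\n", ms, cur);
}
```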
Maybe NV has the compute lead now even for my work, but I need to see that first to believe it. If I get an opportunity to compare, I will share results...
My personal argument here was: we would have approached RT even without HW and its restrictions, and Crytek has proven that. Of course I did not expect similar results in terms of reflections, shadows, and GI all at once, but even that has been shown to be possible and practical by Epic, to quite some degree.
My work on compute GI uses raytracing, but the algorithm can't be compared to the 'classical' approach of Crytek or DXR (nor to SDF or voxel tracing). In my case AMD indeed was much faster across all previous generations, which turned me into the 'AMD fanboy' I appear to be. Before my compute work, I preferred NV GPUs like most people.
In the early discussions after the DXR launch I was too optimistic about classical RT using compute. I assumed I could find similarly efficient solutions for high frequency details as those I had found for low frequency GI, but I was wrong about that.
I say all that just to explain my stance, which is complicated and not the general perspective.
The thing is: my GI can solve the complete lighting model faster than any other solution I've seen so far. But I also want HF detail, thus I want HW RT.
Because the improvement from 'adding just HF detail' is not that big, I remain more critical of the high cost of HW RT than others, simply because the cost/benefit ratio is smaller in my case.
On top of this, the limitation that continuous LOD is not possible really adds to those doubts, up to the point where I see DXR as a 'failure / barely useful', etc. (see the API sketch at the end of this post).
But that's not because I'm a stubborn 'Glorious Compute Warrior'.
I just need those API improvements to make HW RT 'worth it' for my use case.
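To make the continuous LOD point concrete, here is a hedged sketch against the D3D12 DXR API: a BLAS refit (PERFORM_UPDATE) may only move vertices, while triangle count and topology must stay identical to the original build. Continuous LOD changes topology constantly, so it is forced onto the full-rebuild path every time the mesh changes. The function name and parameters below are my own illustration; resource creation and sizing are omitted.

```cpp
// Hedged illustration of the DXR refit restriction. The D3D12 types and the
// BuildRaytracingAccelerationStructure call are real; RefitBlas and its
// parameters are hypothetical scaffolding for this example.
#include <d3d12.h>

void RefitBlas(ID3D12GraphicsCommandList4* cl,
               const D3D12_RAYTRACING_GEOMETRY_DESC& geometry, // MUST match the original build's topology
               D3D12_GPU_VIRTUAL_ADDRESS blas,                 // existing BLAS, built with ALLOW_UPDATE
               D3D12_GPU_VIRTUAL_ADDRESS scratch)
{
    D3D12_BUILD_RAYTRACING_ACCELERATION_STRUCTURE_INPUTS inputs = {};
    inputs.Type = D3D12_RAYTRACING_ACCELERATION_STRUCTURE_TYPE_BOTTOM_LEVEL;
    inputs.Flags = D3D12_RAYTRACING_ACCELERATION_STRUCTURE_BUILD_FLAG_ALLOW_UPDATE
                 | D3D12_RAYTRACING_ACCELERATION_STRUCTURE_BUILD_FLAG_PERFORM_UPDATE;
    inputs.NumDescs = 1;
    inputs.DescsLayout = D3D12_ELEMENTS_LAYOUT_ARRAY;
    inputs.pGeometryDescs = &geometry;

    D3D12_BUILD_RAYTRACING_ACCELERATION_STRUCTURE_DESC desc = {};
    desc.Inputs = inputs;
    desc.SourceAccelerationStructureData  = blas;    // refit reads the old BLAS...
    desc.DestAccelerationStructureData    = blas;    // ...and updates it in place
    desc.ScratchAccelerationStructureData = scratch;

    cl->BuildRaytracingAccelerationStructure(&desc, 0, nullptr);
    // If the LOD change adds or removes triangles, PERFORM_UPDATE is illegal:
    // the only option left is a full, far more expensive BLAS rebuild.
}
```

The workaround DXR does support is swapping between a few discrete LOD BLASes at the TLAS instance level, but that is exactly the discrete LOD restriction I'm complaining about, not continuous LOD.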