GPU Ray Tracing Performance Comparisons [2021-2022]

DavidGraham · Aug 18, 2021

DegustatoR said:
https://www.computerbase.de/2021-08/f1-2021-raytracing-mittel-hoch-ultrahoch-test/

Wow, so F1 2021 fixed many of the problems with RT shadows, and added higher levels of RT.

The difference at 4K goes like this:
The 3080 is 18% faster than the 6800XT at Medium RT.
The 3080 is 43% faster at High RT.
The 3080 is 55% faster at Ultra RT.

DegustatoR · Aug 19, 2021

Speaking of ray tracing, this is another area where Intel is giving us a bit more detail on the inner-workings of the architecture. Intel is now confirming that their RT units are capable of accelerating ray traversals, bounding box intersections, and triangle intersections. All of which is very similar to what NVIDIA’s own RT cores are capable of. Intel is not talking about the actual performance throughput of these units at this time, so how they will perform remains to be seen.

https://www.anandtech.com/show/16895/a-sneak-peek-at-intels-xe-hpg-gpu-architecture

PSman1700 · Aug 19, 2021

DegustatoR said:
https://www.anandtech.com/show/16895/a-sneak-peek-at-intels-xe-hpg-gpu-architecture

Yeah, RT hardware acceleration like Nvidia's is the way to go. AMD currently seems to be the only one forgoing it, but RDNA3 might, and probably will, change that.

trinibwoy · Aug 19, 2021

DegustatoR said:
https://www.anandtech.com/show/16895/a-sneak-peek-at-intels-xe-hpg-gpu-architecture

Nice, I like "core" much better than "subslice". Intel is sticking to its guns and referring to self-contained execution units as cores. This is arguably more accurate but pretty useless for comparison to AMD's and Nvidia's "cores". This picture also makes it seem like the RT and texture units can be accessed from any of the cores. That would be really interesting, but the picture is probably just misleading. The rasterizer and ROPs inside the slice are very similar to Ampere and RDNA.

Intel Slice = Nvidia GPC = AMD Shader Array
Intel Core = Nvidia SM = AMD WGP
Intel Vector engine = Nvidia Partition = AMD SIMD

Did I get that right?

TopSpoiler · Aug 20, 2021

PSman1700 said:
Yeah, RT hardware acceleration like Nvidia's is the way to go. AMD currently seems to be the only one forgoing it, but RDNA3 might, and probably will, change that.

If rumors are true that AMD will be reusing the Navi 2x for their next-gen mid-range SKUs, I think the RT implementation for RDNA3 will not be different from RDNA2.

nAo · Aug 20, 2021

TopSpoiler said:
If rumors are true that AMD will be reusing the Navi 2x for their next-gen mid-range SKUs, I think the RT implementation for RDNA3 will not be different from RDNA2.

Why?

TopSpoiler · Aug 20, 2021

nAo said:
Why?

To keep architectural/strategic consistency. I think AMD is against special-purpose compute units in its GPUs and probably thinks that improving the efficiency of general-purpose compute units is the better way. The massive L3 cache is one example of that approach.

nAo · Aug 20, 2021

TopSpoiler said:
To keep architectural/strategic consistency. I think AMD is against special-purpose compute units in its GPUs and probably thinks that improving the efficiency of general-purpose compute units is the better way. The massive L3 cache is one example of that approach.

AMD already has special-purpose units for RT. Sure, they don't have as much HW acceleration for RT as other GPUs (including Intel DG2, according to what was disclosed earlier today), but that doesn't necessarily mean they are not going to throw more fixed function at it.

Consistency doesn't matter if you can't compete well on a progressively larger fraction of future workloads.

CarstenS · Aug 20, 2021

And since it's using a driver abstraction, you can have consistency from a software POV anyway. You just need to provision a compute path for when you do something the fixed-function hardware cannot do.

trinibwoy · Aug 20, 2021

TopSpoiler said:
To keep architectural/strategic consistency. I think AMD is against special-purpose compute units in its GPUs and probably thinks that improving the efficiency of general-purpose compute units is the better way. The massive L3 cache is one example of that approach.

If anything Intel has the best setup for running divergent RT kernels on general compute units given their support for small wavefront sizes. 8 threads vs 32 on AMD/Nvidia. Yet Intel still decided to go with hardware traversal.
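To make the wavefront-width argument concrete, here is a minimal sketch, not a model of any real GPU. It assumes each ray needs a random number of traversal steps, rays are packed contiguously into waves, and a wave keeps all of its lanes occupied until its slowest ray finishes. The function name `simd_utilization` and the step distribution are hypothetical.

```python
import random

def simd_utilization(steps_per_ray, wave_width):
    """Fraction of lane-steps doing useful work when rays are packed
    contiguously into waves of `wave_width` lanes (toy model)."""
    useful = sum(steps_per_ray)
    occupied = 0
    for start in range(0, len(steps_per_ray), wave_width):
        wave = steps_per_ray[start:start + wave_width]
        # Assumption: the whole wave stays resident until its slowest ray is done,
        # so every lane is charged for max(wave) steps.
        occupied += wave_width * max(wave)
    return useful / occupied

rng = random.Random(0)  # fixed seed so the run is repeatable
steps = [rng.randint(10, 100) for _ in range(4096)]  # hypothetical per-ray step counts

util_8 = simd_utilization(steps, 8)    # 8-wide waves
util_32 = simd_utilization(steps, 32)  # 32-wide waves
print(f"8-wide utilization:  {util_8:.2f}")
print(f"32-wide utilization: {util_32:.2f}")
```

In this toy model the narrower wave can never do worse, because each group of 8 rays only waits for its own slowest ray rather than the slowest of 32. How much that matters on real hardware depends on ray coherence and scheduling, which the sketch ignores.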

Jawed · Aug 20, 2021

Geometry Shader is a lesson from history: it is essentially a complete waste of time and was when it was introduced.

Also, triangle intersection test rate (which RDNA 2 seems to be bad at) is difficult to isolate as a bottleneck separate from BVH traversal incoherency. BVH traversal is bottlenecked by memory (just like texture filtering, in general), meaning that compressed BVH nodes and a rambo cache are vital.

In my opinion NVidia has extremely good node-compression, based on the years-long research that NVidia has on the subject of BVH format.

It's worth remembering that a workgroup of, say, 1024 work items intrinsically offers a speed-up for dynamic branching on current hardware: when an entire hardware thread goes dark, that subset of work items (e.g. 32 on RDNA 2) no longer occupies any execution slots. During traversal, those execution slots can run the closest-/any-/miss-shader for those work items instead, if the hardware is running an uber-ray-shader that combines traversal with closest-/any-/miss-shaders.

Finally, if you can build a compute unit that can do conditional routing to mitigate the slow-down of incoherent branching, then hardware traversal is entirely pointless. RDNA 2 is not 10x or more slower even in the worst-case scenarios, so a 2-4x speed-up from such a solution is more than enough.

Intel's variable-width hardware threads, if Arc has them, will be interesting in their own right as a solution to dynamic branching in shaders. In combination with hardware BVH traversal that will be interesting.

Intel's hardware traversal may not be MIMD.

Dictator · Aug 20, 2021

Jawed said:
Intel's hardware traversal may not be MIMD.

From what I understand - it is!

CarstenS · Aug 20, 2021

Jawed said:
Also, triangle intersection test rate (which RDNA 2 seems to be bad at)

How so? Isn't it 1 tri/clk/RA?

Jawed · Aug 20, 2021

CarstenS said:
How so? Isn't it 1 tri/clk/RA?

Apparently, Ampere is at least twice that. Is that right?

troyan · Aug 20, 2021

Nvidia claimed they doubled the triangle intersection test rate with Ampere.

CarstenS · Aug 20, 2021

Jawed said:
Apparently, Ampere is at least twice that. Is that right?

2x vs. Turing. Now we need to find "x". Even in the Hot Chips presentation from Burgess, I did not find anything substantial. Only the "2x" and that (supposedly, but not explicitly) an RTX 3080 vs. an unnamed Turing achieves "58 vs. 34 RT TFLOPS" (in Nvidia-style calculation). Additionally, they say that Turing has 7x the RT perf of Pascal, but that includes ray/box intersection and ray traversal, which allegedly consumes "many thousands of instruction slots per ray" in a shader program.

I could not find any definitive number on whether ray/triangle intersection on Turing really is 1/RT core/clk. I'd be happy to see such a number. Or a benchmark I could run that would not consume my whole free weekend.

Coreteks did analyze the perf drop of 3080 vs. 2080 Ti with RT based on data from HUB shortly after the . They came to the conclusion that the drop only lessens by 6% between both cards on average in 4K and 8% in 1440p. Maybe that's an indication, or maybe ray/triangle intersection is not such an important performance metric? It should not happen too often for each ray (1 for solids, 2 with translucents?).

And apart from all the numbers: if RDNA2 really is on par with Turing here, is that already "bad"? The RX 6900XT would still be faster in this metric than a 3070 Ti.

DegustatoR · Aug 20, 2021

CarstenS said:
Coreteks did analyze the perf drop of 3080 vs. 2080 Ti with RT based on data from HUB shortly after the . They came to the conclusion that the drop only lessens by 6% between both cards on average in 4K and 8% in 1440p. Maybe that's an indication, or maybe ray/triangle intersection is not such an important performance metric? It should not happen too often for each ray (1 for solids, 2 with translucents?).

Gaming side of RT is limited by traditional shading and bandwidth more than the RT h/w of Ampere. HUB "analysis" is a joke.

[Image: Blender 2.91 Cycles, NVIDIA OptiX render performance, Classroom render, December 2020]

https://techgage.com/article/nvidia-turing-ampere-cuda-optix-rendering-performance/

Here the 3070 with its 46 RT cores is 37% faster than the Titan RTX with its 72 RT cores. Clocks are comparable.
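As a rough back-of-the-envelope check of what those figures imply per RT core, here is a small sketch. It assumes the Classroom render is limited purely by RT-core throughput and that clocks really are equal; both are assumptions, and the helper name `per_unit_speedup` is hypothetical.

```python
def per_unit_speedup(overall_speedup, units_new, units_old):
    """Per-unit throughput ratio implied by an overall speedup,
    assuming performance scales linearly with unit count at equal clocks."""
    return overall_speedup * units_old / units_new

# Figures from the post: 3070 (46 RT cores) is 37% faster than Titan RTX (72 RT cores).
ratio = per_unit_speedup(1.37, units_new=46, units_old=72)
print(f"{ratio:.2f}")  # prints 2.14
```

Under those assumptions the 3070 does roughly 2.1x the work per RT core. Changes outside the RT cores, such as caches and shader throughput, could account for part of that figure.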

CarstenS · Aug 20, 2021

DegustatoR said:
Gaming side of RT is limited by traditional shading and bandwidth more than the RT h/w of Ampere. HUB "analysis" is a joke.

https://techgage.com/article/nvidia-turing-ampere-cuda-optix-rendering-performance/

Here the 3070 with its 46 RT cores is 37% faster than the Titan RTX with its 72 RT cores. Clocks are comparable.

It's true that gaming workloads do obscure individual architectural traits.
Are you sure that this is not the doubled L1s with twice the transfer rate at work? I could dismiss that whole benchmark on the missing scaling between the 2080S and the Titan RTX alone (the latter should be 46% faster, but is only 5% faster), and that would have more merit than your blatant dismissal of HUB's data (it was data from HUB; the analysis was from Coreteks).

So I take it you don't have concrete data either?

Jawed · Aug 20, 2021

I think it's very much worthwhile to contrast "professional" ray-traced rendering with game ray-traced rendering.

Optix is a black box, as far as I can tell. It's running on top of black box hardware.

Black box inception makes it pretty hard to talk about the hardware.

DegustatoR · Aug 20, 2021

CarstenS said:
So I take it you don't have concrete data either?

ProViz benches are as concrete as it gets when comparing Turing with Ampere in RT.
