GPU Ray Tracing Performance Comparisons [2021-2022]

Speaking of ray tracing, this is another area where Intel is giving us a bit more detail on the inner-workings of the architecture. Intel is now confirming that their RT units are capable of accelerating ray traversals, bounding box intersections, and triangle intersections. All of which is very similar to what NVIDIA’s own RT cores are capable of. Intel is not talking about the actual performance throughput of these units at this time, so how they will perform remains to be seen.
https://www.anandtech.com/show/16895/a-sneak-peek-at-intels-xe-hpg-gpu-architecture
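
For anyone wanting to picture what "bounding box intersections" and "triangle intersections" actually are, this is roughly the per-ray work those units take off the shaders' hands: a minimal CUDA-style sketch of the standard slab and Moeller-Trumbore tests, not how any vendor's fixed-function hardware is built (Ray, rayBox, rayTri and the little vector helpers are just names invented for the sketch). Hardware traversal is then essentially the loop that walks the BVH and feeds these two tests.

```
// Minimal sketch of the two intersection tests RT units accelerate
// (slab test for boxes, Moeller-Trumbore for triangles). This is not how
// the fixed-function hardware is implemented; it just shows the per-ray
// work a plain compute shader would otherwise do at every BVH node/leaf.
#include <cuda_runtime.h>
#include <math.h>

struct Ray { float3 o, d; };   // origin, direction

__device__ __forceinline__ float3 sub3(float3 a, float3 b) {
    return make_float3(a.x - b.x, a.y - b.y, a.z - b.z);
}
__device__ __forceinline__ float3 cross3(float3 a, float3 b) {
    return make_float3(a.y*b.z - a.z*b.y, a.z*b.x - a.x*b.z, a.x*b.y - a.y*b.x);
}
__device__ __forceinline__ float dot3(float3 a, float3 b) {
    return a.x*b.x + a.y*b.y + a.z*b.z;
}

// Ray/AABB slab test: true if the ray crosses the box within [tmin, tmax].
__device__ bool rayBox(Ray r, float3 lo, float3 hi, float tmin, float tmax)
{
    float t0 = (lo.x - r.o.x) / r.d.x, t1 = (hi.x - r.o.x) / r.d.x;
    tmin = fmaxf(tmin, fminf(t0, t1)); tmax = fminf(tmax, fmaxf(t0, t1));
    t0 = (lo.y - r.o.y) / r.d.y; t1 = (hi.y - r.o.y) / r.d.y;
    tmin = fmaxf(tmin, fminf(t0, t1)); tmax = fminf(tmax, fmaxf(t0, t1));
    t0 = (lo.z - r.o.z) / r.d.z; t1 = (hi.z - r.o.z) / r.d.z;
    tmin = fmaxf(tmin, fminf(t0, t1)); tmax = fminf(tmax, fmaxf(t0, t1));
    return tmin <= tmax;
}

// Ray/triangle test (Moeller-Trumbore): true on hit, writes the hit distance.
__device__ bool rayTri(Ray r, float3 v0, float3 v1, float3 v2, float* t)
{
    float3 e1 = sub3(v1, v0), e2 = sub3(v2, v0);
    float3 p  = cross3(r.d, e2);
    float det = dot3(e1, p);
    if (fabsf(det) < 1e-8f) return false;        // ray parallel to the triangle
    float inv = 1.0f / det;
    float3 s  = sub3(r.o, v0);
    float u   = dot3(s, p) * inv;
    if (u < 0.0f || u > 1.0f) return false;
    float3 q  = cross3(s, e1);
    float v   = dot3(r.d, q) * inv;
    if (v < 0.0f || u + v > 1.0f) return false;
    *t = dot3(e2, q) * inv;
    return *t > 0.0f;
}
```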
 

[Image: Intel Alchemist slice block diagram]


Nice, I like "core" much better than "subslice". Intel is sticking to its guns and referring to self-contained execution units as cores. This is arguably more accurate, but pretty useless for comparison to AMD's and Nvidia's "cores". This picture also makes it seem like the RT and texture units can be accessed from any of the cores. That would be really interesting, but the picture is probably just misleading. The rasterizer and ROPs inside the slice are very similar to Ampere and RDNA.

Intel Slice = Nvidia GPC = AMD Shader Array
Intel Core = Nvidia SM = AMD WGP
Intel Vector engine = Nvidia Partition = AMD SIMD

Did I get that right?
 
Yeah, RT hardware acceleration like Nvidia's is the way to go. AMD currently seems to be the only one forgoing it, but RDNA3 might, and probably will, change that.

If the rumors are true that AMD will be reusing Navi 2x for their next-gen mid-range SKUs, I think the RT implementation in RDNA3 will not be different from RDNA2's.
 

To keep architectural/strategic consistency, I think. AMD seems to be against special-purpose compute units in its GPUs and probably thought that improving the efficiency of the general-purpose compute units is the better way; the massive L3 cache is one example of that approach.
 
To keep architectural/strategic consistency, I think. AMD seems to be against special-purpose compute units in its GPUs and probably thought that improving the efficiency of the general-purpose compute units is the better way; the massive L3 cache is one example of that approach.
AMD already has special-purpose units for RT. Sure, they don't have as much hardware acceleration for RT as other GPUs (including Intel DG2, according to what was disclosed earlier today), but that doesn't necessarily mean they are not going to throw more fixed-function hardware at it.

Consistency doesn't matter if you can't compete well on a progressively larger fraction of future workloads.
 
And since it's behind a driver abstraction, you can have consistency from a software POV anyway. You just need to provision a compute path for anything the fixed-function hardware cannot do.
 
To keep architectural/strategic consistency, I think. AMD seems to be against special-purpose compute units in its GPUs and probably thought that improving the efficiency of the general-purpose compute units is the better way; the massive L3 cache is one example of that approach.

If anything, Intel has the best setup for running divergent RT kernels on general compute units, given its support for small wavefront sizes: 8 threads vs. 32 on AMD/Nvidia. Yet Intel still decided to go with hardware traversal.
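
A crude way to see the effect: a SIMD group keeps issuing the traversal loop until its slowest lane is done, so what suffers under divergence is average lane utilization. Here is a back-of-the-envelope model; everything in it (the utilization helper, the lognormal distribution of per-ray traversal lengths) is made up for the illustration, it is plain host-side code that compiles as C++ or in a .cu file, and real GPUs hide a lot of this by keeping other waves in flight:

```
// Model: a SIMD group issues the traversal loop until its slowest lane is
// done, so utilization = useful lane-iterations / issued lane-iterations.
// The lognormal "steps per ray" distribution is an assumption.
#include <algorithm>
#include <cstdio>
#include <random>
#include <vector>

static double utilization(const std::vector<int>& steps, int width)
{
    long long useful = 0, issued = 0;
    for (size_t base = 0; base + width <= steps.size(); base += width) {
        int worst = 0;
        for (int l = 0; l < width; ++l) {
            useful += steps[base + l];
            worst = std::max(worst, steps[base + l]);
        }
        issued += (long long)worst * width;   // whole group issues until the worst lane finishes
    }
    return (double)useful / (double)issued;
}

int main()
{
    std::mt19937 rng(42);
    std::lognormal_distribution<double> len(3.0, 0.6);  // skewed per-ray traversal lengths
    std::vector<int> steps(1 << 20);
    for (int& s : steps) s = std::max(1, (int)len(rng));

    std::printf("SIMD8  lane utilization: %.2f\n", utilization(steps, 8));
    std::printf("SIMD32 lane utilization: %.2f\n", utilization(steps, 32));
    return 0;
}
```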
 
The geometry shader is a lesson from history: it is essentially a complete waste of time, and was even when it was introduced.

Also, triangle intersection test rate (which RDNA 2 seems to be bad at) is difficult to isolate as a bottleneck separate from BVH traversal incoherency. BVH traversal is bottlenecked by memory (just like texture filtering, in general), meaning that compressed BVH nodes and a rambo cache are vital.

In my opinion NVidia has extremely good node compression, based on the years of research NVidia has put into BVH formats.
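
For illustration, the general idea of node compression looks something like the struct below. To be clear, this is not NVidia's actual format (that isn't public at this level of detail), and CompressedNode4/decodeChildBox are names invented for the sketch: child bounds become 8-bit offsets inside the parent's box, so a 4-wide node carries 24 bytes of child bounds instead of 96, which is exactly what a memory-bound traversal wants.

```
// Sketch of the general idea behind BVH node compression (NOT NVidia's
// actual format): store each child's AABB as 8-bit offsets inside the
// parent's bounds instead of six full floats per child.
#include <stdint.h>
#include <cuda_runtime.h>

struct CompressedNode4 {
    float3  origin;        // parent box minimum corner
    float3  scale;         // parent extent / 255 per axis
    uint8_t qlo[4][3];     // child minima, quantized (rounded down when built)
    uint8_t qhi[4][3];     // child maxima, quantized (rounded up, to stay conservative)
    int32_t child[4];      // child node index, or negative to mark a leaf
};

// Decode one child's box back to floats right before the ray/box slab test.
__device__ void decodeChildBox(const CompressedNode4& n, int c, float3* lo, float3* hi)
{
    lo->x = n.origin.x + n.scale.x * n.qlo[c][0];
    lo->y = n.origin.y + n.scale.y * n.qlo[c][1];
    lo->z = n.origin.z + n.scale.z * n.qlo[c][2];
    hi->x = n.origin.x + n.scale.x * n.qhi[c][0];
    hi->y = n.origin.y + n.scale.y * n.qhi[c][1];
    hi->z = n.origin.z + n.scale.z * n.qhi[c][2];
}
```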

It's worth remembering that a workgroup of, say, 1024 work items intrinsically offers a speed-up for dynamic branching on current hardware: when an entire hardware thread goes dark, that subset of work items (e.g. 32 on RDNA 2) no longer occupies any execution slots. During traversal, those execution slots can run the closest-/any-/miss-shader for those work items instead, if the hardware is running an uber-ray-shader that combines traversal with closest-/any-/miss-shaders.
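
The control-flow shape of such an uber-ray-shader would be roughly the sketch below. All the names in it (traverseOneStep, shadeClosestHit, shadeMiss, the Hit struct) are placeholders invented for the sketch, not any vendor's API; the point is only that traversal and shading share one kernel, so execution slots freed by finished waves are reused immediately instead of waiting at a dispatch boundary.

```
// Shape of an "uber ray shader": traversal and closest-/any-/miss-shading
// live in one kernel. When lanes (or whole 32-wide waves) finish traversing,
// their issue slots go straight to shading or to other waves.
#include <cuda_runtime.h>

struct Hit { int prim; float t; int steps; };   // steps = toy traversal length

// Toy stand-ins so the sketch compiles; a real version pops BVH nodes from a
// per-lane stack and runs the box/triangle tests.
__device__ bool traverseOneStep(int ray, Hit* h)
{
    h->steps--;                                  // "visited" one more node
    if (h->steps == 0 && (ray & 3)) { h->prim = ray; h->t = 1.0f; }
    return h->steps > 0;
}
__device__ void shadeClosestHit(int ray, const Hit& h) { (void)ray; (void)h; }
__device__ void shadeMiss(int ray)                     { (void)ray; }

__global__ void uberRayShader(int numRays)
{
    int ray = blockIdx.x * blockDim.x + threadIdx.x;
    if (ray >= numRays) return;

    Hit h = { -1, 3.4e38f, 8 + (ray % 57) };     // divergent per-ray workload

    // Divergent part: lanes drop out of this loop at different times. A lane
    // that is done idles until its wave is done, but a wave that is entirely
    // done stops occupying issue slots, which other work can then use.
    while (traverseOneStep(ray, &h)) { }

    // Shading in the same kernel; this is what soaks up the freed slots.
    if (h.prim >= 0) shadeClosestHit(ray, h);
    else             shadeMiss(ray);
}
```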

Finally, if you can build a compute unit that can do conditional routing to mitigate the slow-down of incoherent branching, then hardware traversal is entirely pointless. RDNA 2 is not 10x or more slower even in the worst-case scenarios, so a 2-4x speed-up from such a solution is more than enough.

Intel's variable-width hardware threads, if Arc has them, will be interesting in their own right as a solution to dynamic branching in shaders, and even more so in combination with hardware BVH traversal.

Intel's hardware traversal may not be MIMD.
 
Apparently, Ampere is at least twice that. Is that right?
2x vs. Turing. Now we need to find "x". Even in the Hot Chips presentation from Burgess, I did not find anything substantial. Only the "2x", and that (supposedly, but not explicitly) an RTX 3080 vs. an unnamed Turing achieves "58 vs. 34 RT TFLOPS" (in Nvidia-style calculation). Additionally, they say that Turing is 7x the RT perf of Pascal, but that includes ray/box intersection and ray traversal, which allegedly consumes "many thousands of instruction slots per ray" in a shader program.

I could not find any definitive number on whether ray/triangle intersection on Turing really is 1 per RT core per clock. I'd be happy to see such a number, or a benchmark I could run that would not consume my whole free weekend. ;)
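
(Just to put a number on that assumption: if it really were 1 ray/triangle test per RT core per clock, a 2080 Ti with 68 RT cores at its ~1545 MHz reference boost would top out around 68 × 1.545 GHz ≈ 105 billion triangle tests per second. That is a purely hypothetical ceiling, not a measured figure.)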

Coreteks analyzed the RT performance drop of the 3080 vs. the 2080 Ti, based on data from HUB shortly after the launch. They came to the conclusion that the drop only lessens by 6% between the two cards on average in 4K, and by 8% in 1440p. Maybe that's an indication, or maybe ray/triangle intersection is not such an important performance metric? It should not happen too often for each ray (once for solids, twice with translucents?).

And apart from all the numbers: if RDNA2 really is on par with Turing here, is that already "bad"? An RX 6900 XT would still be faster in this metric than a 3070 Ti.
 
Coreteks analyzed the RT performance drop of the 3080 vs. the 2080 Ti, based on data from HUB shortly after the launch. They came to the conclusion that the drop only lessens by 6% between the two cards on average in 4K, and by 8% in 1440p. Maybe that's an indication, or maybe ray/triangle intersection is not such an important performance metric? It should not happen too often for each ray (once for solids, twice with translucents?).
Gaming side of RT is limited by traditional shading and bandwidth more than the RT h/w of Ampere. HUB "analysis" is a joke.

[Chart: Blender 2.91 Cycles NVIDIA OptiX render performance, Classroom render, December 2020]

https://techgage.com/article/nvidia-turing-ampere-cuda-optix-rendering-performance/

Here the 3070 with its 46 RT cores is 37% faster than the Titan RTX with its 72 RT cores. Clocks are comparable.
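
Per RT core that works out to roughly 1.37 × 72 / 46 ≈ 2.1×, which would line up with the "2x vs. Turing" claim, assuming this render time scales purely with RT throughput (which it probably doesn't entirely).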
 
Gaming side of RT is limited by traditional shading and bandwidth more than the RT h/w of Ampere. HUB "analysis" is a joke.

[Chart: Blender 2.91 Cycles NVIDIA OptiX render performance, Classroom render, December 2020]

https://techgage.com/article/nvidia-turing-ampere-cuda-optix-rendering-performance/

Here the 3070 with its 46 RT cores is 37% faster than the Titan RTX with its 72 RT cores. Clocks are comparable.
It's true that gaming workloads do obscure individual architectural traits.
Are you sure that this is not the doubled L1 caches with twice the transfer rate at work? I could dismiss that whole benchmark on the missing scaling between the 2080 Super and the Titan RTX alone (the latter should be 46% faster, but is only 5% faster), and that would have more merit than your blatant dismissal of HUB's data (it was data from HUB; the analysis was from Coreteks).
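
(The 46% is presumably RT core count scaled by reference boost clocks: 72 × ~1.77 GHz for the Titan RTX vs. 48 × ~1.815 GHz for the 2080 Super, which comes out to about 1.46×.)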

So I take it you don't have concrete data either?
 
I think it's very much worthwhile to contrast "professional" ray-traced rendering with game ray-traced rendering.

OptiX is a black box, as far as I can tell. It's running on top of black-box hardware.

Black-box inception makes it pretty hard to talk about the hardware.
 