GPU Ray Tracing Performance Comparisons [2021-2022]

I don't see how anyone can make a qualified statement on AMD RT hardware right now. It was literally just released a few months ago. It's a different implementation than RTX, which has been around 2+ years. Given that RDNA2's RT doesn't accelerate everything that RTX does, it will force devs to come up with better solutions to deal with that issue, and that's going to take time to implement.
I see this quite differently: 1. Current games and performance results are a qualified measure and statement.
2. Devs on PC cannot come up with better solutions to optimize for a HW architecture. Currently we can only do general things like optimizing sampling strategy and denoising, but that's orthogonal to HW architecture.
(Well, we can do things like trying inline tracing and seeing if a certain architecture runs faster or slower from that, but that's not something we can expect too much from.)

I imagine RDNA2's RT will never be as performant as most of RTX's current offerings, but that does not mean we will not see more capable games that better reflect what the current AMD cards can offer.
If AMD would expose their intersection instructions to compute shaders, I could reuse my own BVH for RT. Because this BVH is streamed from disk, no building at runtime is necessary. My BVH also links to lower-LOD geometry in internal nodes, so traversal finishes earlier.
I cannot imagine RTX would be faster than RDNA2 for me, because RTX only works with a BVH built by the driver, and LOD only works by replacing parts of geometry (which increases the necessary but redundant work on BVH building even further).

I expect a very similar situation for UE5, and also for other engines as they increase detail and utilize acceleration structures. AMD's potential flexibility might become a big advantage, reversing the picture we see now.
At the moment, hardware traversal is a performance win, but in the long run the resulting fixed acceleration structure might turn out to be a restriction and a net loss.
 
At the moment, hardware traversal is a performance win, but in the long run the resulting fixed acceleration structure might turn out to be a restriction and a net loss.

I believe we will see a low-level API in the future. RT will likely follow a similar path as unified shaders: first you have something pretty fixed-function, then something kind of programmable, and eventually things unify. For now the fixed-function HW seems to be faster, but in the future flexibility will likely win over brute force. We are not there yet.

edit: I'll reserve an exception for chiplets though. It's possible, due to manufacturing scaling issues, that chiplets become a really big thing. In that case perhaps we could see RT chiplets connected to a unified L3 cache and a unified memory subsystem. Instead of unified shaders we might end up with heterogeneous computing consisting of chiplets designed to accelerate very specific things.

edit2: UE5, I believe, is pure compute, SW rasterizing in compute. I don't believe UE5 uses DXR at all. I wouldn't discount the doubled FP32 throughput of Ampere here. RDNA2 is also an FP32 beast, though for a very different reason (clock speed versus doubled execution units).
 
If AMD would expose their intersection instructions to compute shaders, I could reuse my own BVH for RT. Because this BVH is streamed from disk, no building at runtime is necessary. My BVH also links to lower-LOD geometry in internal nodes, so traversal finishes earlier.
I cannot imagine RTX would be faster than RDNA2 for me, because RTX only works with a BVH built by the driver, and LOD only works by replacing parts of geometry (which increases the necessary but redundant work on BVH building even further).
Do you know if it's exposed on console DX12U?
Would be nice to know where consoles have lower-level access and more detail than PC.
 
Because this BVH is streamed from disk, no building at runtime is necessary.
I don't buy it.

Can you stream a BVH for destructible parts, movable objects and skinned geometry? That's a rhetorical question.

Static geometry is pretty much free, and it's faster and more practical to create its BVH at runtime rather than stream it from disk, which would take way more time and require way more cache memory to be practical.

It's dynamic and skinned geometry that is expensive to build, yet it can be overlapped and hidden with async compute.
 
I don't buy it.

Can you stream a BVH for destructible parts, movable objects and skinned geometry? That's a rhetorical question.

Static geometry is pretty much free, and it's faster and more practical to create its BVH at runtime rather than stream it from disk, which would take way more time and require way more cache memory to be practical.

It's dynamic and skinned geometry that is expensive to build, yet it can be overlapped and hidden with async compute.
Just curious, but what do you think is happening with RE8 with respect to loading rooms? @Dictator showcases how the 2060 drops dramatically in frame rate in every single room it hits.
Reducing texture sizes eases the pressure, but I haven't figured out why exactly it's dropping so much. Do you think it's possible they are doing some sort of loading/rebuilding of the BVH as they enter a new room?
 
Can you stream a BVH for destructible parts, movable objects and skinned geometry? That's a rhetorical question.
Yes for movable parts and skinning. Destruction would actually require precomputing the fractured parts and replacing the geometry with them at runtime when destruction happens.

Static geometry is pretty much free, and it's faster and more practical to create its BVH at runtime rather than stream it from disk, which would take way more time and require way more cache memory to be practical.
Disagree. Basically all my concerns and worries mentioned here all the time are about static geometry. Dynamic stuff is no big problem because it's usually rare in comparison.
If we have all static geometry at high detail, we need LOD for it too. We can no longer ignore it and have LOD only for detailed dynamic objects like characters.
Thus the static world becomes a dynamic mesh in some form, which we likely handle best with hierarchical data structures. The RTX BVH is such a hierarchy, but we cannot modify it, thus we need to rebuild from scratch even if only a small patch of the model changes detail.
Very important: LOD makes only small and gradual changes, so we want one single BVH over all LODs of a model, and changing detail should only turn internal nodes into leaf nodes and vice versa.
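A minimal sketch of what such a node layout could look like (all names hypothetical; this illustrates the idea, not any real API):

```cpp
// Sketch of a BVH node that can act either as an internal node or, when
// coarser detail is wanted, as a leaf holding low-LOD proxy geometry.
// Hypothetical layout for illustration only.
#include <cstdint>

struct Aabb { float min[3], max[3]; };

struct LodBvhNode {
    Aabb     bounds;
    uint32_t firstChild;      // index of first child node
    uint32_t childCount;
    uint32_t proxyTriOffset;  // coarse triangles representing this subtree
    uint32_t proxyTriCount;
    float    maxExtent;       // used for the LOD decision below
};

// Treat the node as a leaf when its projected size falls below a threshold:
// traversal then intersects the low-LOD proxy triangles and finishes early,
// instead of descending to the full-detail leaves.
inline bool actsAsLeaf(const LodBvhNode& n, float distance, float detailScale) {
    return n.maxExtent / distance < detailScale;
}
```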

It's dynamic and skinned geometry that is expensive to build
But it does not have to be expensive. If so, that's an API limitation. In fact we can precompute the BVH once and only refit after skinning.
We can even enlarge bounding boxes so they are guaranteed to bound the triangles under animation, and remove inter-tree-level barriers this way, so the whole process becomes just a transformation without dependencies (of course this hurts tracing speed a bit, and it might break HW assumptions of parent boxes also bounding child boxes, not just leaf triangles).
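A minimal sketch of that dependency-free refit, assuming each node's box was precomputed in object space to conservatively bound its triangles under any pose (all names hypothetical):

```cpp
// Hypothetical sketch: per-node box transform with no inter-level barriers.
// Assumes each node's object-space box conservatively bounds its triangles
// under all animation poses, so parent/child relations need no fixing after
// the transform. One bone per node is a simplification for illustration.
#include <algorithm>

struct Aabb   { float min[3], max[3]; };
struct Mat3x4 { float m[3][4]; };  // rotation/scale rows, translation in last column

// Conservative AABB transform (Arvo's method): sum the absolute axis
// contributions of the input extents per output axis.
Aabb transformAabb(const Mat3x4& xf, const Aabb& b) {
    Aabb r;
    for (int i = 0; i < 3; ++i) {
        float lo = xf.m[i][3], hi = lo;
        for (int j = 0; j < 3; ++j) {
            float a = xf.m[i][j] * b.min[j];
            float c = xf.m[i][j] * b.max[j];
            lo += std::min(a, c);
            hi += std::max(a, c);
        }
        r.min[i] = lo; r.max[i] = hi;
    }
    return r;
}

// Every node is processed independently (one thread per node on a GPU):
// no bottom-up pass and no barriers between tree levels.
void refitFree(Aabb* nodes, const Aabb* restPose, const Mat3x4* boneXf,
               const int* nodeBone, int nodeCount) {
    for (int i = 0; i < nodeCount; ++i)
        nodes[i] = transformAabb(boneXf[nodeBone[i]], restPose[i]);
}
```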


Do you know if it's exposed on console DX12U?
Would be nice to know where consoles have lower-level access and more detail than PC.
Never worked on consoles, just PC and mobile.
4A Games said in some interview they do 'custom traversal' on consoles, which they cannot do on PC because of API limitations. But IDK if this was understood/printed correctly, and which platforms were meant.


I believe we will see a low-level API in the future. RT will likely follow a similar path as unified shaders: first you have something pretty fixed-function, then something kind of programmable, and eventually things unify. For now the fixed-function HW seems to be faster, but in the future flexibility will likely win over brute force. We are not there yet.
I believe and hope so too. Otherwise I have a problem :O
 
The RTX BVH is such a hierarchy, but we cannot modify it, thus we need to rebuild from scratch even if only a small patch of the model changes detail.
Very important: LOD makes only small and gradual changes, so we want one single BVH over all LODs of a model, and changing detail should only turn internal nodes into leaf nodes and vice versa.

There is refitting in the DXR API. No need to rebuild from scratch for every change:

https://developer.nvidia.com/rtx/ra...acing-tutorial/extra/dxr_tutorial_extra_refit
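For context, a rough host-side sketch of that refit path (abbreviated: resource creation, barriers, and error handling omitted; variable names are placeholders):

```cpp
#include <d3d12.h>

// Abbreviated sketch of a DXR BLAS refit as in the linked tutorial.
// 'blas' must originally have been built with ALLOW_UPDATE, and the geometry
// topology must be unchanged; only vertex data may differ.
void RefitBlas(ID3D12GraphicsCommandList4* cmdList,
               ID3D12Resource* blas, ID3D12Resource* scratch,
               const D3D12_RAYTRACING_GEOMETRY_DESC* geometryDescs,
               UINT geometryCount)
{
    D3D12_BUILD_RAYTRACING_ACCELERATION_STRUCTURE_INPUTS inputs = {};
    inputs.Type           = D3D12_RAYTRACING_ACCELERATION_STRUCTURE_TYPE_BOTTOM_LEVEL;
    inputs.DescsLayout    = D3D12_ELEMENTS_LAYOUT_ARRAY;
    inputs.NumDescs       = geometryCount;
    inputs.pGeometryDescs = geometryDescs;
    // ALLOW_UPDATE was set at the initial build; PERFORM_UPDATE requests a
    // refit instead of a full rebuild.
    inputs.Flags = D3D12_RAYTRACING_ACCELERATION_STRUCTURE_BUILD_FLAG_ALLOW_UPDATE
                 | D3D12_RAYTRACING_ACCELERATION_STRUCTURE_BUILD_FLAG_PERFORM_UPDATE;

    D3D12_BUILD_RAYTRACING_ACCELERATION_STRUCTURE_DESC desc = {};
    desc.Inputs                           = inputs;
    desc.SourceAccelerationStructureData  = blas->GetGPUVirtualAddress(); // previous BLAS
    desc.DestAccelerationStructureData    = blas->GetGPUVirtualAddress(); // update in place
    desc.ScratchAccelerationStructureData = scratch->GetGPUVirtualAddress();

    cmdList->BuildRaytracingAccelerationStructure(&desc, 0, nullptr);
}
```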
 
Do you think it's possible they are doing some sort of loading/rebuilding of the BVH as they enter a new room?
Honestly, I don't know the exact reason for these freezes. It could be PSO compilation on the rendering thread, memory oversubscription, etc. I don't think they build a BVH per room, and even if they did, it shouldn't stall rendering.

Yes for movable parts and skinning. Destruction would actually require precomputing the fractured parts and replacing the geometry with them at runtime when destruction happens.
I don't think a precomputed BVH can be efficient for procedurally animated stuff (say, foliage or cloth) and skinning; creating such a BVH would require accounting for the worst-case scenario (making boxes large and loose, hence degrading BVH quality).
Procedurally animated trees or cloth or particles etc. can swing widely, and this would require an inefficient BLAS with large boxes; traversing a low-quality BLAS can cost way more than refitting the BVH for the same tree every frame (which still decreases BLAS quality).

Basically all my concerns and worries mentioned here all the time are about static geometry
There are quite a few RTX-accelerated renderers which handle scenes with billions of triangles and instancing at interactive frame rates; here is just one example.

Of course instancing does help a lot since there are not too many BLASes, but we are talking about at least a few million polygons per BLAS here.

Dynamic stuff is no big problem because it's usually rare in comparison.
But that's where all rendering time is spent.

If we have all static geometry at high detail, we need LOD for it too.
Each LOD should be stored in a separate BLAS in any case. How is your proposal any different from how LODs are implemented in DXR?

We can no longer ignore it and have LOD only for detailed dynamic objects like characters.
What limitations don't allow you to do LOD for whatever geometry you want right now?

The RTX BVH is such a hierarchy, but we cannot modify it, thus we need to rebuild from scratch even if only a small patch of the model changes detail.
Not true. In the case of static geometry, it's recommended to use refitting whenever it benefits overall perf (it will work as a pass-through if the geometry is static).
Refitting a BLAS is an order of magnitude faster than building it from scratch, but it obviously degrades BLAS quality, hence it's up to you to decide what to use in the case of procedurally animated geometry or skinning.

If so, that's an API limitation. In fact we can precompute the BVH once and only refit after skinning.
Creating a BVH with good quality is the expensive thing; you can create a low-quality BVH way faster, but then you can lose way more time traversing this BVH.
 
There is refitting in the DXR API.
Refitting does not allow removing or adding triangles, so it won't help me with my LOD issues.

However, there should be two forms of refitting:
Refitting TLAS: Useful if the BLASes are known to displace only slightly, so we can assume the same tree is still good after refitting bounds. (This is what's shown in your link.)
Refitting BLAS: Useful if we know the vertices of a model only change slightly, e.g. after animating and skinning a character. (I think DXR should have this too, but after OlegSH's response about skinning being expensive I'm not sure. Refitting is not free either, of course, especially if barriers between tree levels are necessary.)
 
I don't see how anyone can make a qualified statement on AMD RT hardware right now. It was literally just released a few months ago.

It would help if AMD released something that showed off what their hardware can really do. For comparison, Nvidia showed off their Star Wars demo months before Turing launched.
 
I don't think a precomputed BVH can be efficient for procedurally animated stuff (say, foliage or cloth) and skinning; creating such a BVH would require accounting for the worst-case scenario (making boxes large and loose, hence degrading BVH quality).
That's true, but on the other hand an offline tree allows higher quality than one built quickly on model load, so your assumption might not hold in practice. I guess even with degradation from animation it is still better, at least at the lower tree levels.

Of course instancing does help a lot since there are not too many BLASes, but we are talking about at least a few million polygons per BLAS here.
Yeah, if I'm lucky the current DXR approach works well enough for me to be practical. But I doubt it, and even if it does, constantly building a BVH over millions of static triangles during gameplay is wasted cycles. They have to address this sooner or later.

But that's where all rendering time is spent.
Because current games have high-detail characters but low-detail environments. I doubt that in the UE5 demo the characters cost more than the environment.

Each LOD should be stored in a separate BLAS in any case. How is your proposal any different from how LODs are implemented in DXR?
You missed my point here: "Very important: LOD makes only small and gradual changes, so we want one single BVH over all LODs of a model, and changing detail should only turn internal nodes into leaf nodes and vice versa."
Imagine UE5 geometry again, which is divided into surface patches, each scaling its triangle count up and down on demand.
At such high geometry resolution, a LOD change probably means doubling or halving geometric detail, just like texels and mip maps for textures.
What you propose, and DXR allows, is like having a whole mip pyramid for each mip level, which of course is dumb. We want only one mip pyramid, but use the detail level we are currently interested in. Otherwise storage and memory demands explode for no reason.
So we also want only one BVH for all LODs, for the same reason. It's more difficult because geometry is irregular, but it's possible, and necessary anyway to achieve high detail.
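A minimal sketch of traversal against such a "LOD cut" (hypothetical structure, not DXR): a LOD change only flips a node between internal and leaf, and nothing is rebuilt.

```cpp
// Hypothetical sketch: one BVH over all LODs, traversed against a "cut".
// A per-node flag, updated as detail changes, marks where the current cut
// lies; traversal treats cut nodes as leaves holding their LOD's triangles.
#include <cstdint>
#include <vector>

struct Node {
    bool     isCutLeaf;             // node currently acts as a leaf (active LOD)
    uint32_t firstChild, childCount;
    uint32_t triOffset, triCount;   // triangles at this node's detail level
};

// Counts the triangles the current cut would test; stands in for actual
// ray-box/ray-triangle intersection to keep the sketch short.
uint32_t trianglesOnCut(const std::vector<Node>& nodes, uint32_t i = 0) {
    const Node& n = nodes[i];
    if (n.isCutLeaf) return n.triCount;          // stop here: this LOD is active
    uint32_t total = 0;
    for (uint32_t c = 0; c < n.childCount; ++c)  // descend toward finer detail
        total += trianglesOnCut(nodes, n.firstChild + c);
    return total;
}
```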

What limitations don't allow you to do LOD for whatever geometry you want right now?
Constant BVH building on LOD change at runtime. Division into 'assets' or 'models' does not allow fine-grained LOD of larger objects like terrain. Instancing does not allow compositing all larger objects from modular models (terrain being again the best example for this, but at high detail anything becomes a terrain, technically).

Not true. In the case of static geometry, it's recommended to use refitting whenever it benefits overall perf (it will work as a pass-through if the geometry is static)
I cannot select some triangles of a mesh, subdivide them and add detail, and then refit the BVH. If topology changes, the whole BLAS has to be rebuilt. That's what I meant. (But correct me again if I'm wrong.)

What I could do eventually is this: make a BLAS for each surface patch, replace some of them on demand, and rebuild/refit the TLAS. But that's not fine-grained enough. I'd end up having 100k tiny BLASes around, and the TLAS rebuild would become too costly.

DXR is fine if you build your stuff from instances of static models plus some dynamic ones. But as soon as we make LOD a first-class citizen, the whole paradigm of having static models is gone.
 
So, once more, the RT in this game is using a lot of screen space information, whether in GI or reflections, exhibiting the same issues as screen space effects, where object occlusion or camera movement decimates the effect. The video also shows how horrible the resolution of the RT reflections is.

 
Well, some of the issues can be explained by performance compromises, e.g. using lower res for RT, or not using shadow rays to prevent leaking.
But others can't. E.g. why not jitter the low-res grid in TAA fashion, and use a better filter for upscaling? That's really little work, would fix the blocky reflections at no cost, and devs are used to doing such things all the time.
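For illustration, the usual TAA jitter pattern, here as a sketch rather than anything this game does: offset the low-res RT grid by a Halton(2,3) sequence each frame, so accumulated frames cover the full-res pixel area, and unjitter during the temporal resolve.

```cpp
// Sketch: Halton(2,3) per-frame jitter offsets for a low-res RT sample grid.
#include <cstdio>

// Radical inverse in a given base; bases 2 and 3 give the common TAA sequence.
float radicalInverse(unsigned i, unsigned base) {
    float inv = 1.0f / base, f = inv, r = 0.0f;
    while (i) { r += (i % base) * f; i /= base; f *= inv; }
    return r;
}

int main() {
    const int period = 8;  // cycle length; 8-16 frames is typical for TAA
    for (int frame = 0; frame < period; ++frame) {
        // Jitter in [-0.5, 0.5) of a low-res cell; add this to the ray's
        // sub-pixel position when tracing, subtract it when resolving.
        float jx = radicalInverse(frame + 1, 2) - 0.5f;
        float jy = radicalInverse(frame + 1, 3) - 0.5f;
        std::printf("frame %d: jitter (%+.3f, %+.3f)\n", frame, jx, jy);
    }
}
```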

I would prefer the idea of doing RT at low res over upscaling the whole frame, but this game does not serve to make comparisons between the two approaches.
I did not really expect much because it's not a next-gen game yet, but this feels lazy for no reason, even if I ignore the PC port.
 
So, once more, the RT in this game is using a lot of screen space information, whether in GI or reflections, exhibiting the same issues as screen space effects, where object occlusion or camera movement decimates the effect. The video also shows how horrible the resolution of the RT reflections is.


I mean, theoretically, temporal re-use is great. In practice... we've seen RTGI has noise at 1 spp, let alone a fourth of that, and it shouldn't be that expensive when done right (see Metro EE). The reflections... well again, great, temporally accumulate and reproject smooth reflections, awesome. But surely even the most basic filtering of the reflections buffer, or however they store it, would look better, or as JoeJ points out, temporal jitter as well. As an aside, it shows off a good target for upscaling/filtering: take the curvature of the reflection in screen space, stick to smooth continuous reflections of course, and then you have a curvature correction for neighbor-based upscaling (neural net or whatever, really), a possibly large win allowing you to use quite low-res reflections as long as they're nigh perfectly smooth surfaces (where they're the most noticeable anyway).

Seems to support the notion that Village was released perhaps a tad early, and could've used another month or so. Well, maybe there'll be a nice patch or two.
 
For anyone interested in how Quake II RTX implements multi-bounce lighting effects, I can shed some more light on their code ...

As I've elaborated before, it is an established fact that all of Quake II RTX's ray tracing pipeline state objects have a hard-coded maximum ray recursion depth of 1. On a tangent, Nvidia mentions an alternative to recursive ray tracing for multi-bounce effects, which is an iterative ray tracing method using loops in the ray generation shaders, and I imagine this example demonstrates their description. When I compared this example to Quake II RTX's shader source (any ".rgen" files), what I found is that what the mod is doing with its ray generation shaders isn't consistent with the description given by Nvidia, so this precludes the possibility that the mod is doing loops inside the ray generation shaders ...

When we take a look into the "vkpt_pt_create_pipelines()" function, what is of particular interest to us is that they are creating unique pipelines for each bounce! This means that the pipeline for "INDIRECT_LIGHTING_FIRST" corresponds to the 1st bounce and the pipeline for "INDIRECT_LIGHTING_SECOND" corresponds to the 2nd bounce. A similar idea applies to the pipelines for "REFLECT_REFRACT". We notice next that the "INDIRECT_LIGHTING" pipelines share the exact same shader module, "QVK_MOD_INDIRECT_LIGHTING_RGEN" (which contains the ray generation program for indirect lighting), but what is different are the specialization constants! The most obvious conclusion is that multi-bounce lighting effects are also hard-coded into the shaders themselves and that specialization constants are used to enable/disable different portions of the shader program ...
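A rough host-side illustration of that mechanism (not Quake II RTX's actual code; the constant ID and names are assumptions): one shader module, several pipelines, each specialized with a different bounce index so the compiler can strip branches that don't apply.

```cpp
#include <vulkan/vulkan.h>

// Builds the specialization info that bakes a bounce index into a pipeline
// at creation time. Assumes the ray-gen GLSL declares
//   layout(constant_id = 0) const int BOUNCE = 0;
// which is an illustrative assumption, not the mod's actual layout.
VkSpecializationInfo makeBounceSpecialization(const uint32_t* bounceIndex) {
    static const VkSpecializationMapEntry entry = {
        0,                // constantID, matches constant_id = 0 in the shader
        0,                // offset into pData
        sizeof(uint32_t)  // size of the constant
    };
    VkSpecializationInfo info = {};
    info.mapEntryCount = 1;
    info.pMapEntries   = &entry;
    info.dataSize      = sizeof(uint32_t);
    info.pData         = bounceIndex;  // baked in at pipeline compile time
    return info;
}

// Usage, one pipeline per bounce from the same shader module:
//   uint32_t bounce = 1;
//   VkSpecializationInfo spec = makeBounceSpecialization(&bounce);
//   stageCreateInfo.pSpecializationInfo = &spec;  // then create the RT pipeline
```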

As a bonus, light scattering effects like caustics are implemented with their own unique pipeline as well, despite sharing the same shader module with the direct lighting pipeline!

If the Quake II RTX mod did use recursion, then the mod wouldn't work on AMD drivers at all, since RT PSO creation would fail (which could result in the application crashing or device loss) when the application doesn't respect the limit behind "maxRayRecursionDepth", which is exactly 1 on AMD's implementation. We can see that pipeline creation succeeds on their drivers, so we know for a fact that the mod doesn't use recursion at all; instead, what the mod does to implement multi-bounce effects is trace rays with a different RT PSO for each bounce ...
 
PCGH used a second, more RT heavy scene in Metro - the church in Volga: https://www.pcgameshardware.de/Metr...us-Enhanced-Edition-Test-Vergleich-1371524/3/

In 4K with raytracing "Ultra" the 3090 is 74% faster than the 6900XT.

In the post you quoted I said "Don't lose context of the post". And yet you did. 74% is still not 300%, and vice versa.

And that result is still at parity with the 2080 Ti: Nvidia's first-gen RT vs AMD's first-gen RT. Is the 2080 Ti suddenly unviable as an RT GPU? No.
 