GPU Ray Tracing Performance Comparisons [2021-2022]

You would never use 16-wide WGs in practice, because half of the threads would remain idle.
If I understood this correctly, RDNA can fill its SIMDs with new work faster, which matters after switching shaders (so also when shaders are short) and when there is a lot of work to schedule (e.g. async compute).
Maybe it also matters after switching WGs because of a wait on VRAM access, texture fetch, RT, etc. It would be a huge win then.
On GCN this takes longer because of the pipelining of 64 threads over 16-wide SIMDs. RDNA, with 32 threads on 32-wide SIMDs, can fill those SIMDs twice as fast.
A downside is higher latency: many instructions take 4 cycles on GCN but 5 on RDNA.
But I'm really not sure about any of this!
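
A back-of-envelope sketch of the issue-cadence part of that argument, using only the public wave and SIMD widths (a simplification on my part; real wave launch and scheduling involve more than this):

Code:
#include <cstdio>

int main() {
    // GCN: a 64-thread wavefront is pipelined over a 16-lane SIMD.
    // RDNA: a 32-thread wavefront maps 1:1 onto a 32-lane SIMD.
    const int gcnWave  = 64, gcnLanes  = 16;
    const int rdnaWave = 32, rdnaLanes = 32;

    printf("GCN:  %d cycles per wave instruction\n", gcnWave / gcnLanes);   // 4
    printf("RDNA: %d cycle per wave instruction\n", rdnaWave / rdnaLanes);  // 1

    // So an RDNA SIMD can pick up fresh work (a new shader, async compute,
    // a wave resuming after a memory stall) every cycle, while a GCN SIMD
    // is occupied for 4 cycles per issued instruction -- the trade-off being
    // RDNA's longer dependent-instruction latency mentioned above.
    return 0;
}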


That's another topic. Indeed I don't like tabletop mechanics in realtime video games, because they mean GUI clutter that doesn't conform to the reality simulation going on beside it. So I have ignored the RPG genre as a whole and expect some issues on my side with playing CP.
But obviously it implements all the FPS mechanics: free roam of a 3D world, and shooting at enemies which shoot back. And even ignoring the RPG mechanics, I can judge this implementation to be bad and way below what other modern (console) games achieve.
To address the fact that aiming does not work so well with gamepads, there are many options:
Aim assist. A style shooter like Deathloop does not even have a crosshair, just like Doom had no crosshair because it had no mouselook yet.
Cover shooter. Solving the problem by alternating movement and shooting.
Melee mechanics. Button mash to attack close enemies, not requiring aim.
Special abilities. See Dishonored, Control, etc. Usually married with the style shooter genre.

But all of those solutions have issues, which I mostly find annoying:
Aim assist: No problem with that, but the game ends up too shallow and simple, so to add some challenge, some of the other mechanics get added as well.
Cover shooter: Disables free roam. Constantly I get sucked into some cover, then I'm trapped in that cover and try to break free, ending up making movements I did not intend.
Melee: That's just too dumb to be fun, mostly.
Special abilities: I have to learn them, remember buttons, and when I come back to the game weeks later I have it all forgotten. I hate it - it's a bolt-on cancer. Remembering Control's levitation mode, where you're meant to control flying up and flying down with just one button, gives me the creeps. I've fallen to my death more often than not.

So, in short, it's all a form of complexity bloat. I want simple controls, and complexity should manifest in the game's world and challenges, not in GUI pages or in mastering combos of a dozen buttons.
If Doom Eternal had a 'Classic' mode, without glory kills and such, just simple but challenging shooting, I would replay this game 5 times. As is, I won't, although I'd actually like to. I'm not willing to relearn its mechanics again.



It's not about shooting. The best genre of recent years, to me, was indie horror, as pioneered by Frictional Games' Penumbra and Amnesia. Those games have no shooting at all, but free roam and interaction with a properly simulated and interactive world, and this gives great immersion, which is what I'm after.
So actually, to evolve games respecting all platforms, I think I would sacrifice that 'shooting' altogether, or tone it down. But if we do so, it turns out too quickly that game worlds are just smoke and mirrors and not interesting enough on their own. Thus my belief in pushing physics to lift the technical base here.

I too have problems with FPS games like CoD or BF due to being a veteran, so I avoid games trying to be "real" (but actually being arcade).
One FPS (more of a mil-sim actually) I do kinda enjoy is ARMA (on servers with 3rd person disabled)... the closest I have seen to a "real" shooting/damage implementation.

Hence I avoid games trying to be "realistic"... and go for games like CP2077 or Warhammer 40K... unless I am doing a mil-flight-sim.

Shooters on rails bore me to death... autoaim/assist is a pestilence... and I hate controllers with a vengeance ;)
 
Yes, if anything, they should be more efficient by being able to pool certain resources from two old-style Vega NCUs. Except maybe if you've got tons and tons of independent 16-wide workgroups which for some reason cannot be combined into half as many 32-wide ones.
They are more efficient on simple and single-threaded workloads, but those aren't typical in GPU compute, since you generally use it when you have a massively parallel workload. Thus this efficiency is more about graphics and less about compute, where GCN may still end up being more efficient.

And it's 32- and 64-wide, not 16 and 32. I don't think there was any research on how WGPs compare to CUs when running old code, but I'd expect them to be less efficient, if only because the dCU mode isn't their native mode of operation. And in the case of RVII vs 6900XT specifically, memory bandwidth will play a huge role in RVII's advantage. Compute workloads don't do too well on IC.
 
F1 2021 uses light RT for shadows and reflections. RT shadows in the game are low resolution, which causes shimmering, and the shadow draw distance appears short. RT reflections are rendered at 1/4 resolution, which makes them blurry and flickery.

Despite this, RDNA2 loses more performance with RT than even Turing: RT reflections incur a ~35% penalty on the 2070 Super and 3080 but a 42% hit on the 6800 XT; RT shadows incur an 8% penalty on the 2070 Super and a 5% penalty on the 3080, while the 6800 XT takes a 14% hit.

In the end, the 3090 is 20% faster than the 6900 XT at 4K.
https://www.computerbase.de/2021-07/f1-2021-benchmark-test/3/#diagramm-f1-2021-3840-2160-raytracing
 
CB updated their review with results using the 21.7.1 driver:

CB.de - F1 2021 im Test: Benchmarks in Full HD, WQHD sowie UHD, Frametimes und der Adrenalin 21.7.1 (Update)
 
Seems like a nice bit of extra perf, but only applicable to RDNA2. Makes you wonder why they couldn't get the developer to include the fix in the first place.
My guess is that most gains in RT workloads on RDNA2 are down to IC data management, and devs won't do that since there is no IC anywhere but in RDNA2 PC GPUs. There's no way for devs to do it through regular APIs either.
 
When I asked, roughly around the RDNA2 launch timeframe, AMD said there was no mechanism to explicitly manage the ∞$ (Infinity Cache).
 
The gains aren't specific to RT workloads, though: the driver improves F1 2021 performance even more without RT.
 
For @JoeJ: Frostbite's next-gen solution uses surfels and hardware ray tracing acceleration. They will give a presentation at SIGGRAPH 2021:

Global Illumination Based on Surfels
Henrik Halen (SEED at Electronic Arts),
Andreas Brinck (Ripple Effect Studios at Electronic Arts),
Kyle Hayward (Frostbite at Electronic Arts),
Xiangshun Bei (Ripple Effect Studios at Electronic Arts)

https://advances.realtimerendering.com/s2021/index.html

Surfel



Abstract: Global Illumination Based on Surfels (GIBS) is a solution for calculating indirect diffuse illumination in real-time. The solution combines hardware ray tracing with a discretization of scene geometry to cache and amortize lighting calculations across time and space. It requires no pre-computation, no special meshes, and no special UV sets, freeing artists from tedious and time-consuming processes required by traditional solutions. GIBS enables new possibilities in the runtime, allowing for high fidelity lighting in dynamic environments and for user created content, while accommodating content of arbitrary scale. The algorithm is part of the suite of tools available to developers and teams throughout EA as part of the Frostbite engine.


This talk will detail the GIBS algorithm and how surfels are used to enable real-time ray traced global illumination. We will describe how the scene is discretized into surfels on the fly, and why we think this discretization is a good fit for caching lighting operations. The talk will describe the acceleration structure used to enable efficient access to surfel data, and how this structure allows us to cover environments of arbitrary size, while keeping a predictable performance and memory footprint. We will detail how the algorithm handles dynamic objects, skinned characters, and transparency. Several techniques have been developed to efficiently integrate irradiance on surfels. We will describe our use of ray guiding, ray binning, spatial filters, and how we handle scenes with large numbers of lights.
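
Not from the talk, but a minimal sketch of what a surfel record and its spatial lookup might look like; the field names and the hash-based grid are my assumptions, not Frostbite's actual structures:

Code:
#include <cstdint>
#include <cmath>

// Hypothetical surfel: a small disc that caches lighting for the patch
// of surface it covers.
struct Surfel {
    float position[3];
    float normal[3];
    float radius;        // world-space footprint of the disc
    float irradiance[3]; // cached lighting, refined by rays over time
};

// Hypothetical spatial hash: map a world position to a cell key so nearby
// surfels can be looked up without allocating a fixed-size world grid.
uint32_t surfelCellKey(const float p[3], float cellSize) {
    const int32_t x = (int32_t)std::floor(p[0] / cellSize);
    const int32_t y = (int32_t)std::floor(p[1] / cellSize);
    const int32_t z = (int32_t)std::floor(p[2] / cellSize);
    // Common 3D spatial-hash mix (large primes).
    return (uint32_t)(x * 73856093) ^ (uint32_t)(y * 19349663) ^
           (uint32_t)(z * 83492791);
}

Hashing positions instead of storing a fixed grid would be one plausible way to get the "arbitrary scale with predictable memory footprint" property the abstract claims.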
 
Some notes gathered ...

Common:

Rebuilding your TLAS is virtually always a good idea.
Ray flags FORCE_OPAQUE, ACCEPT_FIRST_HIT_AND_END_SEARCH, and SKIP_PROCEDURAL_PRIMITIVES can help accelerate traversal of the acceleration structure (see the sketch after these notes).

AMD:

Ray flags involving traversal can change the traversal programs. Using these flags will allow the compiler to generate the optimal traversal program for the hardware.
General consensus is that building the acceleration structure is slow, so use the build flag PREFER_FAST_TRACE as much as possible, for both static and low-deformation dynamic geometry. Avoid including high-deformation dynamic geometry in the acceleration structure. Static geometry will give the best performance, since you can build the highest-quality acceleration structure and it never needs to be rebuilt or updated.
Avoid simultaneously tracing multiple ray query objects in the shaders.

NV:

Traversal is implied to be implemented as a state machine in the hardware. Using related ray flags will let the hardware change into a more optimal state for traversal.
Using the build flag ALLOW_COMPACTION is a hint for the driver to apply compression to the acceleration structures.
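
To make those flags concrete, here's a minimal host-side sketch using the enums from d3d12.h (the ray-flag values also exist in HLSL as RAY_FLAG_* for TraceRay); the shadow-ray combination is a common example of mine, not something the notes prescribe:

Code:
#include <d3d12.h>

// Ray flags for a shadow/occlusion ray: treat everything as opaque (no
// any-hit shaders), stop at the first hit, and skip procedural primitives.
const UINT kShadowRayFlags =
    D3D12_RAY_FLAG_FORCE_OPAQUE |
    D3D12_RAY_FLAG_ACCEPT_FIRST_HIT_AND_END_SEARCH |
    D3D12_RAY_FLAG_SKIP_PROCEDURAL_PRIMITIVES;

// Build flags for a static BLAS: best trace quality, and allow the driver
// to compact the structure after the build.
const D3D12_RAYTRACING_ACCELERATION_STRUCTURE_BUILD_FLAGS kStaticBlasFlags =
    D3D12_RAYTRACING_ACCELERATION_STRUCTURE_BUILD_FLAG_PREFER_FAST_TRACE |
    D3D12_RAYTRACING_ACCELERATION_STRUCTURE_BUILD_FLAG_ALLOW_COMPACTION;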
 
Didn't know where to post this; it's Nvidia's Real-time Neural Radiance Caching for Path Tracing:

Neural Radiance Caching combines RTX’s neural network acceleration hardware (NVIDIA TensorCores) and ray tracing hardware (NVIDIA RTCores) to create a system capable of fully-dynamic global illumination that works with all kinds of materials, be they diffuse, glossy, or volumetric. It handles fine-scale textures such as albedo, roughness, or bump maps, and scales to large, outdoor environments neither requiring auxiliary data structures nor scene parameterizations.

Combined with NVIDIA’s state-of-the-art direct lighting algorithm, ReSTIR, Neural Radiance Caching can improve rendering efficiency of global illumination by up to a factor of 100—two orders of magnitude.

At the heart of the technology is a single tiny neural network that runs up to 9x faster than TensorFlow v2.5.0. Its speed makes it possible to train the network live during gameplay in order to keep up with arbitrary dynamic content. On an NVIDIA RTX 3090 graphics card, Neural Radiance Caching can provide over 1 billion global illumination queries per second.

White paper:
https://d1qx31qr3h6wln.cloudfront.net/publications/paper_4.pdf

CUDA source code:
https://github.com/nvlabs/tiny-cuda-nn
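
Very rough sketch of the core idea as I read it: normal render paths terminate early into a cache query, while a small fraction of training paths are traced a few bounces further and written back as targets, so the network trains on its own output plus real samples. All names and stubs below are hypothetical (the real system batches queries into the GPU MLP, e.g. via tiny-cuda-nn):

Code:
#include <cstdio>

struct Vec3 { float x, y, z; };
struct Hit  { Vec3 pos; Vec3 n; bool valid; };

// Stubs standing in for the real renderer and the tiny MLP.
Hit  trace(Vec3 o, Vec3 d)        { return {o, {0, 1, 0}, true}; }
Vec3 sampleDirect(const Hit&)     { return {0.5f, 0.5f, 0.5f}; } // e.g. ReSTIR
Vec3 queryCache(const Hit&)       { return {0.1f, 0.1f, 0.1f}; } // MLP inference
void trainCache(const Hit&, Vec3) { /* append a training record */ }
Vec3 add(Vec3 a, Vec3 b)          { return {a.x + b.x, a.y + b.y, a.z + b.z}; }

// Render paths stop at the cache after a couple of bounces; training
// paths go further and feed their result back into the cache.
Vec3 shade(Vec3 o, Vec3 d, int depth, bool trainingPath) {
    Hit h = trace(o, d);
    if (!h.valid) return {0, 0, 0};
    if (!trainingPath && depth >= 2) return queryCache(h); // early out into cache
    Vec3 L = sampleDirect(h);
    if (depth < 4) L = add(L, shade(h.pos, h.n, depth + 1, trainingPath)); // toy bounce
    if (trainingPath) trainCache(h, L); // self-supervised target
    return L;
}

int main() {
    Vec3 L = shade({0, 0, 0}, {0, 0, 1}, 0, /*trainingPath=*/false);
    printf("radiance: %f %f %f\n", L.x, L.y, L.z);
    return 0;
}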
 
So in the end, they're trading bandwidth usage / memory access for tensor core usage, right? If the tensor cores are not fully used even with DLSS, it makes sense, even if I didn't understand everything :D
 
I must admit to not understanding most of the words in that article, but with a 100x speed-up, do we expect path tracing to become practical for modern games?

Wow the raytraced radiance caching techniques are coming out fast and furious.

Those fps numbers do look promising. Granted, it's on a 3090 using tensor cores, so we won't be seeing fully path-traced games anytime soon. The paper doesn't mention anything about dynamic BVH update cost, and the test scenes are pretty small, so these numbers probably aren't representative of an actual game.

I don’t know if Quake 2 RTX already does importance sampling or any sort of irradiance caching but it’ll be interesting to see it updated with the latest techniques.

So in the end, they're trading bandwidth usage / memory access for tensor core usage, right? If the tensor cores are not fully used even with DLSS, it makes sense, even if I didn't understand everything :D

I understood it as trading tensor core usage for a lower number of rays and a less noisy image for the denoiser to deal with.

The point about avoiding trips to VRAM when training the network was an optimization of their DL training routine, not really a reduction in bandwidth usage compared to the baseline renderer. The actual performance savings come from casting fewer rays overall.
 