GPU Ray Tracing Performance Comparisons [2021-2022]

JoeJ · Dec 11, 2022

Hehe, ok. But I was talking about NV here, not Epic or other game devs : )

Though, it's NV's young internal game studio, and this is their first game. So maybe some inefficiencies are indeed expected.

OlegSH · Dec 11, 2022

JoeJ said:
Yes. Inline disables any reordering or material sorting in HW, so we should use it only with care.

Inline RT doesn't disable anything when it's done via API (SER way) or manually.

JoeJ said:
UberShader seems weakly defined here maybe

The UberShader name speaks for itself - it's just a big shader. Inlining produces big shaders.

JoeJ said:
Looking at the Quake RTX code back then, I definitely had this impression of classical RT, and simplicity > performance. The whole approach just ignores how GPUs work. They are not single threaded, and treating them as such performs poorly.

Quake RTX has SW material sorting, UE 4 and 5 have it, and all other major engines with RT support have it too, so there is no single-threaded anything. It's up to the programmer whether to sort or not (the API doesn't impose restrictions on that, and many devs do sorting in SW). That will obviously depend on the programmer's skills as well as runtime stats: the sorting itself must be fast, and the perf hit w/o sorting must be bad enough to justify the sorting overhead.
SER just makes it way easier for devs to implement such features and get good performance out of them.
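To make the trade-off above concrete, here is a minimal CPU-side sketch of what software material sorting buys. It is a toy model, not code from Quake RTX or any engine: the function names, the 8-material scene and the "one serialized pass per distinct material in a warp" cost measure are all assumptions for illustration.

```python
import random

WARP_SIZE = 32  # threads that execute in lockstep on NVIDIA hardware


def shading_passes(material_ids, warp_size=WARP_SIZE):
    """Count serialized shading passes: one per distinct material in each warp."""
    return sum(
        len(set(material_ids[i:i + warp_size]))
        for i in range(0, len(material_ids), warp_size)
    )


def sort_hits_by_material(material_ids):
    """Return hit indices ordered so that equal materials end up adjacent."""
    return sorted(range(len(material_ids)), key=lambda i: material_ids[i])


if __name__ == "__main__":
    rng = random.Random(0)
    hits = [rng.randrange(8) for _ in range(1024)]  # hypothetical 8-material scene
    order = sort_hits_by_material(hits)
    sorted_hits = [hits[i] for i in order]
    print("passes unsorted:", shading_passes(hits))
    print("passes sorted:  ", shading_passes(sorted_hits))
```

Sorting only pays off when the passes saved cost more than the sort itself, which is the runtime-stats condition described in the post.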

JoeJ · Dec 11, 2022

OlegSH said:
Inline RT doesn't disable anything when it's done via API (SER way) or manually.

As I understand it, inline RT means tracing rays from any shader, likely compute. It returns the result immediately to the same thread, so there is no longer any way the HW could implement reordering or material binning.
By contrast, if we use generation and hit shaders, the HW can shuffle rays around in arbitrary ways between those shader stages, either to group hits by material or to enable in-traversal reordering in the future.

What's the benefit of SER in the scenario of inline tracing? I assume it has no effect at all. Am I wrong with that?

OlegSH · Dec 11, 2022

JoeJ said:
What's the benefit of SER in the scenario of inline tracing? I assume it has no effect at all. Am I wrong with that?

SER is just an API (that exploits Ada HW capabilities) that you can use in any Inline RT shader type to sort something by "key". You can sort hits by material IDs, or you can sort basically anything else with it. The API does allow that; you can read here how it's done.
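The "sort by key" idea can be sketched on the CPU as follows. This is a toy model and not the NVAPI interface: `make_key`, `reorder`, and the choice to put the material ID in the high bits with a small application-defined hint in the low bits are hypothetical, chosen only to show how one integer key can carry more than a material ID.

```python
def make_key(material_id, hint, hint_bits):
    """Pack a material ID (high bits) and a secondary hint (low bits) into one key."""
    if not 0 <= hint < (1 << hint_bits):
        raise ValueError("hint does not fit in hint_bits")
    return (material_id << hint_bits) | hint


def reorder(threads, hint_bits):
    """Order (material_id, hint) pairs by the packed key; ties keep their order."""
    return sorted(threads, key=lambda t: make_key(t[0], t[1], hint_bits))


if __name__ == "__main__":
    # Hypothetical threads: (material ID, 1-bit hint such as "ray bounced again").
    threads = [(1, 0), (0, 1), (1, 1), (0, 0)]
    print(reorder(threads, hint_bits=1))
    # prints [(0, 0), (0, 1), (1, 0), (1, 1)]
```

Because the material ID occupies the most significant bits, threads group by material first and by the hint only within a material.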

TopSpoiler · Dec 11, 2022

The ReorderThread function is only available in the raygeneration shader. I have no idea if it could be callable from arbitrary material shaders.

JoeJ · Dec 11, 2022

OlegSH said:
The UberShader name speaks for itself - it's just a big shader. Inlining produces big shaders.

'Big shader' does not tell much, so that's no definition of what it means.
Afaik, the term came up with deferred shading. The goal was to have just one shader handling all materials. That's a definition. But it's not meant to be big, but as small as possible while still handling variety.
Now it could also mean stuffing many material shaders into one, selecting with a jump per thread. Not sharing common code paths, no goal of generalization, minimized parallel execution.

So which definition would you associate with uber shaders? You cannot just say it speaks for itself, because it doesn't.

I don't agree inline tracing causes big shaders in general. It simply depends on how much code you type around your traceRay call. For short range AO, inline would be fine for example. Because SER or reordering is pointless, and shader stages only add overhead.

OlegSH said:
Quake RTX has SW material sorting

Ok, missed this from looking at the code if so. Maybe they added it later, or I was wondering about some other things.

OlegSH said:
SER just makes it way easier for devs to implement such features and get good performance out of them.

How? Idk what SER exactly is. NVAPI seems not public.

I assume it does internal binning so hit shaders have more active threads. But idk what the granularity is. I assume the feature is restricted to a single SM.
Thus, a global binning in SW might still make sense in cases. But SER won't help me with that. It only gives me the option of not doing this, in case SER does better.
But that's just guessing based on marketing slides.
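The granularity question can be illustrated with a small model: reorder keys only within fixed-size windows (standing in for whatever scope the hardware sorts over) and count how many distinct materials each warp still sees. The window sizes and function names are assumptions for illustration; the thread does not establish what scope SER actually uses.

```python
WARP_SIZE = 32


def reorder_in_windows(keys, window):
    """Sort keys independently inside each consecutive window of threads."""
    reordered = []
    for start in range(0, len(keys), window):
        reordered.extend(sorted(keys[start:start + window]))
    return reordered


def shading_passes(keys, warp_size=WARP_SIZE):
    """One serialized pass per distinct key in each warp."""
    return sum(
        len(set(keys[i:i + warp_size]))
        for i in range(0, len(keys), warp_size)
    )


if __name__ == "__main__":
    keys = [i % 4 for i in range(128)]  # 4 materials, fully interleaved
    for window in (32, 64, 128):
        print(window, shading_passes(reorder_in_windows(keys, window)))
    # prints:
    # 32 16
    # 64 8
    # 128 4
```

In this model a window no larger than a warp gains nothing, and coherence improves as the window grows, which is why a purely local reorder could still leave room for a global binning pass in software.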

nAo · Dec 12, 2022

JoeJ said:
How? Idk what SER exactly is. NVAPI seems not public.

The SER API is public, you can download the SER SDK and use it right away:

Improve Shader Performance and In-Game Frame Rates with Shader Execution Reordering | NVIDIA Technical Blog

Learn about Shader Execution Reordering (SER), a performance optimization that unlocks the potential for better ray and memory coherency in ray tracing shaders.

developer.nvidia.com

SER in-depth whitepaper:

https://developer.nvidia.com/sites/default/files/akamai/gameworks/ser-whitepaper.pdf

JoeJ · Dec 12, 2022

OlegSH said:
SER is just an API (that exploits Ada HW capabilities) that you can use in any Inline RT shader type to sort something by "key". You can sort hits by material IDs, or you can sort basically anything else with it. The API does allow that; you can read here how it's done.

Very interesting - missed this before, thanks!
I see my assumptions are pretty right but it can do more than just that. Good stuff.
Question about granularity is still open. Is the reordering happening across the whole chip or local to an SM?

JoeJ · Dec 12, 2022

nAo said:
The SER API is public, you can download the SER SDK and use it right away:

Is there something out on DMM as well?
I have filled out the form to get API info, but nothing yet.

TopSpoiler · Dec 12, 2022

Dictator said:
That is one thing that is enjoyable in Portal RTX from my perspective: things in reflections have visually similar quality to things in the primary view. GI, material responses, reflections in their own right... I cannot wait to see more games getting to that level.

It is nice to see after playing Fortnite, where reflections are... not very good looking in the base setup they have. I really wish Epic allowed hit lighting in Fortnite as an option.

FYI, RTX Remix uses Primary Surface Replacement for mirror reflection and refraction. For the diffuse reflection, it's done through secondary bounces sampling of the GI pass.

trinibwoy · Dec 12, 2022

JoeJ said:
Very interesting - missed this before, thanks!
I see my assumptions are pretty right but it can do more than just that. Good stuff.
Question about granularity is still open. Is the reordering happening across the whole chip or local to an SM?

The white paper is a little vague on that point. There are a few references to sorting and moving thread context “across the GPU” but that can mean anything. If I had to make a wild guess, the sorting may be localized to a GPC, as a single SM doesn’t have enough active threads to make sorting useful.

gamervivek · Dec 12, 2022

https://twitter.com/x/status/1602129847576559617

Intel could use some tuning as well, since it doesn't even start on the A770.

Portal RTX not launching with ARC a770

Hi, I downloaded Portal RTX today. It's a free DLC on Steam, but the game won't launch with the ARC a770. There are some other posts out on the internet from people with the same problem. Is there a workaround for this issue?

community.intel.com

OlegSH · Dec 12, 2022

TopSpoiler said:
The ReorderThread function is only available in the raygeneration shader. I have no idea if it could be callable from arbitrary material shaders.

The whitepaper suggests setting up a ray tracing pipeline for games with Ray Queries as an easy path to integrate SER (not sure what the hard path would be). As far as I understand, this can be just a raygen ubershader without indexing into the shader table for closest hit shaders.

JoeJ · Dec 12, 2022

trinibwoy said:
The white paper is a little vague on that point. There are a few references to sorting and moving thread context “across the GPU” but that can mean anything. If I had to make a wild guess, the sorting may be localized to a GPC, as a single SM doesn’t have enough active threads to make sorting useful.

Makes sense. An SM is not enough, and the whole chip would probably already diminish the wins, even more so if there is a chiplet future for NV too.

I always wanted some HW accelerated binning / sorting. SER is an interesting solution. As with mesh shader tasks, it would be nice if such features could become exposed to compute at some point. It's useful in general.

troyan · Dec 12, 2022

More registers and L1 cache and you can get 3080 level performance in Portal RTX with a 7900XT(X): https://www.comptoir-hardware.com/a...-radeon-rx-7900-xt-a-rx-7900-xtx.html?start=7

/edit: Explains the huge difference between Turing and Ampere, too. Ampere has 50% more L1 cache per ComputeUnit.

troyan · Dec 12, 2022

BTW: Computerbase has Raytracing numbers from the Callisto Protocol: https://www.computerbase.de/2022-12...bschnitt_rdna_3_in_aktuellen_neuerscheinungen
A 4090 loses 100FPS in 4K and it performs worse than Cyberpunk with Psycho RT settings on a 4090...

Phantom88 · Dec 12, 2022

troyan said:
BTW: Computerbase has Raytracing numbers from the Callisto Protocol: https://www.computerbase.de/2022-12...bschnitt_rdna_3_in_aktuellen_neuerscheinungen
A 4090 loses 100FPS in 4K and it performs worse than Cyberpunk with Psycho RT settings on a 4090...

Yeah, that game with RT is double broken: broken in general regarding performance, and another set of broken on NVIDIA cards. Losing 3 times the performance on NVIDIA cards for some shadows... It only loses a little over double the performance on the new XTX.

DavidGraham · Dec 12, 2022

troyan said:
Computerbase has Raytracing numbers from the Callisto Protocol: https://www.computerbase.de/2022-12...bschnitt_rdna_3_in_aktuellen_neuerscheinungen
A 4090 loses 100FPS in 4K and it performs worse than Cyberpunk with Psycho RT settings on a 4090...

It's a single-threaded RT implementation. It's limited by the CPU on NVIDIA GPUs, not by the GPU, as the utilization of NVIDIA GPUs is below 80%. The CPU overhead on NVIDIA GPUs is much larger than on AMD GPUs in this game.

troyan · Dec 12, 2022

The 4090 is 70% faster than the 3090. It doesn't look like it is CPU limited. More like the hardware acceleration isn't working and this is a pure compute implementation on NVIDIA GPUs.

Dictator · Dec 12, 2022

troyan said:
BTW: Computerbase has Raytracing numbers from the Callisto Protocol: https://www.computerbase.de/2022-12...bschnitt_rdna_3_in_aktuellen_neuerscheinungen
A 4090 loses 100FPS in 4K and it performs worse than Cyberpunk with Psycho RT settings on a 4090...

Getting a big lol about how the RX 6800 XT is just behind the 3090 Ti in a game with 3 ray tracing effects. Truly, the signs of a very representative title in ray tracing.
