GPU Ray Tracing Performance Comparisons [2021-2022]

Hehe, ok. But i was talking about NV here, not Epic or other game devs : )

Though, it's NV's young internal game studio, and this is their first game. So maybe some inefficiencies are indeed expected.
 
Yes. Inline disables any reordering or material sorting in HW, so we should use it only with care.
Inline RT doesn't disable anything when it's done via API (SER way) or manually.

UberShader seems weakly defined here maybe
The UberShader name speaks for itself - it's just a big shader. Inlining produces big shaders.

Looking at the Quake RTX code back then, i definitely had this impression of classical RT, and simplicity > performance. The whole approach just ignores how GPUs work. They are not single threaded, and treating them as such performs poorly.
Quake RTX has SW materials sorting, UE 4 and 5 have it, and all other major engines with RT support have it too, so there is no single-threaded anything. It's up to the programmer whether to sort something or not (the API doesn't impose restrictions on that, and many devs do sorting in SW). That will obviously depend on the programmer's skills as well as runtime stats: the sorting itself must be fast, and the perf hit w/o sorting must be bad enough to compensate for the sorting overhead.
SER just makes it way easier for devs to implement such features and get good performance out of them.
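
To make the trade-off above concrete, here is a minimal sketch of SW material sorting as a toy CPU-side model. It is not Quake RTX, UE or SER code. It assumes 32-thread warps and that a warp pays one serialized shading pass per distinct material it contains; the function names and the cost units are hypothetical.

```python
import random

WARP_SIZE = 32  # assumption: NVIDIA-style warps of 32 threads


def shading_passes(material_ids, warp_size=WARP_SIZE):
    """Toy cost model: each warp runs one pass per distinct material it holds."""
    total = 0
    for start in range(0, len(material_ids), warp_size):
        warp = material_ids[start:start + warp_size]
        total += len(set(warp))
    return total


def sort_hits_by_material(material_ids):
    """SW 'material sorting': reorder hits so equal material IDs are adjacent."""
    return sorted(material_ids)


def sorting_pays_off(unsorted_cost, sorted_cost, sort_cost):
    """Sorting only wins if its own overhead is smaller than what it saves."""
    return sorted_cost + sort_cost < unsorted_cost


random.seed(0)
hits = [random.randrange(8) for _ in range(1024)]  # 1024 hits, 8 materials

unsorted_passes = shading_passes(hits)
sorted_passes = shading_passes(sort_hits_by_material(hits))
print("passes without sorting:", unsorted_passes)
print("passes with sorting:   ", sorted_passes)
```

With 1024 hits there are 32 warps, so the sorted case can never need more than 32 + 7 passes (one extra per material boundary), while the unsorted case depends on how incoherent the hits are. Whether the sort is worth it is then just `sorting_pays_off(...)` with a measured sort cost, which is the runtime-stats point made above.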
 
Inline RT doesn't disable anything when it's done via API (SER way) or manually.
As i understand it, inline RT means tracing rays from any shader, likely compute. It returns the result immediately to the same thread, so there is no way left for HW to implement any reordering or material binning.
In contrast, if we use generation and hit shaders, the HW can shuffle rays around in arbitrary ways between those shader stages. Either to group hits by material, or to enable in-traversal reordering in the future.

What's the benefit of SER in the scenario of inline tracing? I assume it has no effect at all. Am i wrong with that?
 
SER is just an API (that exploits Ada HW capabilities) that you can use in any Inline RT shader type to sort something by "key". You can sort hits by material IDs or you can sort basically anything else with it. The API does allow that, you can read here how it's done.
 
The ReorderThread function is only available in the raygeneration shader. I have no idea if it could be callable from the arbitrary material shaders.
 
The UberShader name speaks for itself - it's just a big shader. Inlining produces big shaders.
'Big shader' does not tell much, so that's no definition of what it means.
Afaik, the term came up with deferred shading. The goal was to have just one shader handling all materials. That's a definition. But it's not meant to be big, but as small as possible while still handling variety.
Now it could also mean stuffing many material shaders into one, selecting with a jump per thread. Not sharing common code paths, no goal of generalization, minimized parallel execution.
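
The two readings above can be sketched side by side. This is a toy Python model, not shader code from any engine: the `Material` fields, the shading math and all function names are hypothetical, and `code_paths_in_warp` just assumes a warp serializes once per distinct branch target.

```python
from dataclasses import dataclass


@dataclass
class Material:
    kind: str        # hypothetical material class: "matte", "glossy", "emissive"
    albedo: float
    roughness: float
    emissive: float


def generalized_uber_shader(m, n_dot_l):
    """Reading 1: one small shared code path, variety comes from parameters."""
    lambert = max(n_dot_l, 0.0)
    diffuse = m.albedo * lambert
    specular = (1.0 - m.roughness) * lambert ** 8
    return diffuse + specular + m.emissive


def shade_matte(m, n_dot_l):
    return m.albedo * max(n_dot_l, 0.0)


def shade_glossy(m, n_dot_l):
    lambert = max(n_dot_l, 0.0)
    return m.albedo * lambert + (1.0 - m.roughness) * lambert ** 8


def shade_emissive(m, n_dot_l):
    return m.emissive


STUFFED = {"matte": shade_matte, "glossy": shade_glossy, "emissive": shade_emissive}


def stuffed_uber_shader(m, n_dot_l):
    """Reading 2: many unrelated shaders in one, selected by a per-thread jump."""
    return STUFFED[m.kind](m, n_dot_l)


def code_paths_in_warp(materials, stuffed):
    """Distinct code paths a warp has to execute under each reading."""
    return len({m.kind for m in materials}) if stuffed else 1


warp = [
    Material("matte", 0.5, 1.0, 0.0),
    Material("glossy", 0.5, 0.2, 0.0),
    Material("emissive", 0.0, 1.0, 2.0),
]
print(code_paths_in_warp(warp, stuffed=False), code_paths_in_warp(warp, stuffed=True))
```

Under the first reading every thread in a warp runs the same instructions regardless of material; under the second, the same warp splits into as many paths as it has material kinds.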

So which definition would you associate with uber shaders? You cannot just say it speaks for itself, because it doesn't.

I don't agree inline tracing causes big shaders in general. It simply depends on how much code you type around your traceRay call. For short range AO, for example, inline would be fine, because SER or reordering is pointless there and shader stages only add overhead.

Quake RTX has SW materials sorting
Ok, if so i missed this from looking at the code. Maybe they added it later, or i was wondering about some other things.

SER just makes it way easier for devs to implement such features and get good performance out of them.
How? Idk what SER exactly is. NVAPI seems not public.

I assume it does internal binning so hit shaders have more active threads. But idk what the granularity is. I assume the feature is restricted to a single SM.
Thus, a global binning in SW might still make sense in some cases. But SER won't help me with that. It only gives me the option of not doing this, in case SER does better.
But that's just guessing based on marketing slides.
 
The SER API is public, you can download the SER SDK and use it right away:

SER in-depth whitepaper:
 
SER is just an API (that exploits Ada HW capabilities) that you can use in any Inline RT shader type to sort something by "key". You can sort hits by material IDs or you can sort basically anything else with it. The API does allow that, you can read here how it's done.
Very interesting - missed this before, thanks!
I see my assumptions are pretty right but it can do more than just that. Good stuff.
Question about granularity is still open. Is the reordering happening across the whole chip or local to a SM?
 
That is one thing that is enjoyable in Portal RTX from my perspective: things in reflections have visually similar quality to things in the primary view. GI, material responses, reflections in their own right... I cannot wait to see more games getting to that level.

It is nice to see after playing Fortnite, where reflections are... not very good looking in the base setup they have. I really wish Epic allowed hitlighting in Fortnite as an option.
FYI, RTX Remix uses Primary Surface Replacement for mirror reflection and refraction. For the diffuse reflection, it's done through secondary bounces sampling of the GI pass.
 
Very interesting - missed this before, thanks!
I see my assumptions are pretty right but it can do more than just that. Good stuff.
Question about granularity is still open. Is the reordering happening across the whole chip or local to a SM?

The white paper is a little vague on that point. There are a few references to sorting and moving thread context “across the GPU” but that can mean anything. If I had to make a wild guess, the sorting may be localized to a GPC, as a single SM doesn’t have enough active threads to make sorting useful.
 
The ReorderThread function is only available in the raygeneration shader. I have no idea if it could be callable from the arbitrary material shaders.
The whitepaper suggests setting up a ray tracing pipeline for games with Ray Queries as an easy path to integrate SER (not sure what the hard path would be). As far as I understand, this can be just a raygen ubershader without indexing into the shader table for closest hit shaders.
 
The white paper is a little vague on that point. There are a few references to sorting and moving thread context “across the GPU” but that can mean anything. If I had to make a wild guess, the sorting may be localized to a GPC, as a single SM doesn’t have enough active threads to make sorting useful.
Makes sense. SM is not enough, whole chip would probably already diminish the wins, even more so if there is a chiplet future for NV too.
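
The granularity question can at least be illustrated with a toy model. The sketch below sorts hits only within fixed-size windows and counts serialized shading passes per 32-thread warp. The window sizes are arbitrary stand-ins for "one warp", "something SM-sized" and "everything in flight", not real SM or GPC thread counts, and the model ignores the data movement cost that would make whole-chip reordering less attractive.

```python
import random

WARP_SIZE = 32  # assumption: 32-thread warps


def shading_passes(keys, warp_size=WARP_SIZE):
    """Toy cost model: one pass per distinct sort key inside each warp."""
    return sum(len(set(keys[i:i + warp_size]))
               for i in range(0, len(keys), warp_size))


def reorder_within_windows(keys, window):
    """Sort by key, but only among threads that share the same window."""
    out = []
    for i in range(0, len(keys), window):
        out.extend(sorted(keys[i:i + window]))
    return out


random.seed(1)
hits = [random.randrange(64) for _ in range(4096)]  # 4096 hits, 64 materials

baseline = shading_passes(hits)
results = {window: shading_passes(reorder_within_windows(hits, window))
           for window in (32, 256, 4096)}  # hypothetical window sizes

print("no reordering:", baseline)
for window, passes in results.items():
    print(f"window {window:5d}: {passes} passes")
```

Reordering inside a single warp changes nothing in this model, a small window helps somewhat, and a large window gets close to one pass per warp. That matches the intuition that a tiny scope has too few threads to find coherent groups; where the real HW draws the line is still a guess.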

I always wanted some HW accelerated binning / sorting. SER is an interesting solution. As with mesh shader tasks, it would be nice if such features could become exposed to compute at some point. It's useful in general.
 
BTW: Computerbase has Raytracing numbers from the Callisto Protocol: https://www.computerbase.de/2022-12...bschnitt_rdna_3_in_aktuellen_neuerscheinungen
A 4090 loses 100 FPS in 4K and it performs worse than Cyberpunk with Psycho RT settings on a 4090...

yeah, that game with RT is double broken. Broken in general regarding performance and another set of broken on nvidia cards. Losing 3 times the performance on nvidia cards for some shadows ... It only loses a little over double the performance on the new XTX
 
Computerbase has Raytracing numbers from the Callisto Protocol: https://www.computerbase.de/2022-12...bschnitt_rdna_3_in_aktuellen_neuerscheinungen
A 4090 loses 100 FPS in 4K and it performs worse than Cyberpunk with Psycho RT settings on a 4090...
It's a single threaded RT implementation, it's limited by the CPU on NVIDIA GPUs, not by the GPU, as the utilization of NVIDIA GPUs is sub 80%. The CPU overhead on NVIDIA GPUs is much larger than AMD GPUs in this game.
 
The 4090 is 70% faster than the 3090. It doesn't look like it is CPU limited. More like the hardware acceleration isn't working and this is a pure compute implementation on NVIDIA GPUs.
 