GPU Ray Tracing Performance Comparisons [2021-2022]

PowerVR has outlined 6 levels of RT acceleration:
  • Level 3: Bounding Volume Hierarchy (BVH) processing in hardware (NVIDIA RTX GPUs)
  • Level 4: BVH processing and coherency sorting in hardware (PowerVR IMG CXT GPU)

Some researchers have estimated that RTX performs ray sorting/grouping in hardware.
http://ceur-ws.org/Vol-2485/paper3.pdf

Conclusion #2: Nvidia RTX implements some raygrouping/ray-sorting. It’s done probably in combination with GPU work creation (see conclusion #4). ...

It's only an assumption based on the result, though.
 
PowerVR has outlined 6 levels of RT acceleration:
  • Level 0: Legacy solutions/CPUs
  • Level 1: Software on traditional GPUs
  • Level 2: Ray/box and ray/tri-testers in hardware (AMD GPUs/consoles)
  • Level 3: Bounding Volume Hierarchy (BVH) processing in hardware (NVIDIA RTX GPUs)
  • Level 4: BVH processing and coherency sorting in hardware (PowerVR IMG CXT GPU)
  • Level 5: Coherent BVH processing with Scene Hierarchy Generation (SHG) in hardware

https://www.imaginationtech.com/whitepapers/ray-tracing-levels-system/
https://www.imaginationtech.com/ray-tracing/

Interesting. They're tripling down on dedicated hardware. No mention of custom traversal or LOD. Hardware-accelerated BVH construction would impose even more limits on API flexibility.

I guess the theory is that if you can build and cast really, really fast, you can just brute-force your RT through high-resolution geometry without software optimization tricks.
 
The majority of GPU reviewers are targeting PC gamers and showing them how different GPUs stack up; your expectations of them are way off, and a lot of sites will fail to meet your criteria.
Stop attributing your words to me. I made it crystal clear that by my criteria they should not be jumping to conclusions in areas where they don't have any expertise; this has nothing to do with benchmarking GPUs.
Though the quality of tests leaves much to be desired, as was noted on a previous page (not by me).

If game X gets 30 FPS on high-end GPUs at high settings with RT enabled at 1440p, and we see screenshots or even a video of the benchmarked scene, am I interpreting you right that you really think these results are worthless unless the reviewer can show that he understands the math?
Again you're trying to attribute your wild guesses to me.
But yes, it would indeed have been nice if all reviewers were able to interpret the math behind FPS and the low-FPS metrics, which they seem to struggle with; otherwise that weird low "instantaneous" FPS metric would never have appeared in the first place.

If gamers do not think that tessellation, ray tracing, or any other upcoming feature is worth the performance loss, it will get a bad reputation until the swansong game comes along that convinces everyone of its worth.
The problem is that they succeed at highlighting basic FPS numbers (because it's so easy), but fail hard at highlighting where RT has an impact (super easy too) because they don't know how it works or where to look, hence wrong conclusions and misinformation. All that typical "RT works only on puddles" moaning in YT comments comes from watching videos like that.
What's the problem with showing the shortcomings of screen-space solutions by simply moving the camera up and down in a game? Why can't they show more diffuse reflections on metal walls and other surfaces in games (of which there are plenty)? Why don't they highlight light leaking without RT? Specular occlusion is just as important as mirror reflections in puddles, etc.
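To make that screen-space failure mode concrete, here's a tiny host-side sketch (entirely my own toy, with a hypothetical scene, not code from any engine): an SSR-style march can only sample the depth buffer that is already on screen, so the moment the reflected content scrolls off the top of the frame, the ray has nothing left to hit and the reflection silently vanishes.

[code]
// Toy model of a screen-space reflection march (illustrative only).
#include <cstdio>
#include <vector>

// March straight up in screen space from (x, y0); return the hit row,
// or -1 when the ray leaves the viewport -- the classic SSR failure.
int ssr_march_up(const std::vector<float>& depth, int w, int x, int y0,
                 float ray_depth)
{
    for (int y = y0 - 1; ; --y) {
        if (y < 0) return -1;                // off screen: no data exists
        if (depth[y * w + x] <= ray_depth)   // crossed stored depth: hit
            return y;
    }
}

int main()
{
    const int w = 4, h = 8;
    std::vector<float> depth(w * h, 1.0f);   // far plane everywhere
    // "Scenery" above the water occupying rows 1-2 at depth 0.5.
    for (int y = 1; y <= 2; ++y)
        for (int x = 0; x < w; ++x) depth[y * w + x] = 0.5f;

    // Framing A: scenery is on screen -> the reflection ray finds it.
    printf("hit row: %d\n", ssr_march_up(depth, w, 0, 7, 0.6f));

    // Framing B: the camera pitches so the scenery leaves the frame
    // (simulated by clearing those rows) -> the very same ray now misses.
    for (int y = 1; y <= 2; ++y)
        for (int x = 0; x < w; ++x) depth[y * w + x] = 1.0f;
    printf("hit row: %d\n", ssr_march_up(depth, w, 0, 7, 0.6f));
    return 0;
}
[/code]

That second print is the artifact a reviewer could demonstrate in thirty seconds of camera movement; RT reflections don't depend on what happens to be rasterized on screen, so they don't have this failure mode.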

After that, it is a fully subjective opinion of the reviewer and their individual readers whether enabling or disabling individual features is worth it or not.
If that's a fully subjective opinion of the reviewer, then they need a disclaimer: "Hey, we are not experts in RT, so we can't highlight it for you in this video because we don't know where to look", but instead they prefer jumping to the conclusion that RT doesn't matter.
 
All that typical "RT works only on puddles" moaning in YT comments comes from watching videos like that.
What's the problem with showing the shortcomings of screen-space solutions by simply moving the camera up and down in a game? Why can't they show more diffuse reflections on metal walls and other surfaces in games (of which there are plenty)? Why don't they highlight light leaking without RT? Specular occlusion is just as important as mirror reflections in puddles, etc.

It’s probably true that some people don’t notice or don’t care about those details or that good looking reflections don’t matter to them when gaming. On the other hand I’m currently playing FC4 and there is a lot of water in that game. The reflections are hot steaming garbage. The people who aren’t bothered by it should consider themselves lucky.
 
PowerVR has outlined 6 levels of RT acceleration:
These tiers seem artificial at best without PPA and absolute performance metrics in game traces.
They could have invented the same kind of tiers for their deferred raster architecture and would likely win on the number of levels, lol, but that means nothing in practice.

People have done SW coherency sorting since the very first game with RT, Battlefield V; that's probably a mandatory requirement for RT on consoles.
Unlike with SW tree traversal, modern SMs/CUs are efficient at sorting thanks to all sorts of inter-lane operations, so dedicated sorting HW might be useless or even bad for performance (if not enough silicon is dedicated to it).
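For the curious, here's a minimal sketch of how cheap that inter-lane grouping is on current HW (my own toy, not from any shipping title; needs compute capability 7.0+ for __match_any_sync): classify each ray by its direction octant, and a single warp intrinsic returns which lanes share the bin, with no shared-memory sort pass at all.

[code]
// Warp-level ray coherency binning sketch (illustrative only).
#include <cstdio>

struct Ray { float dx, dy, dz; };

// 3-bit key from direction signs: rays in the same octant tend to take
// similar BVH traversal paths, which is the whole point of binning them.
__device__ int direction_octant(Ray r)
{
    return (r.dx < 0.f ? 1 : 0) | (r.dy < 0.f ? 2 : 0) | (r.dz < 0.f ? 4 : 0);
}

__global__ void bin_rays(const Ray* rays, int n, unsigned* peers_out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    int key = direction_octant(rays[i]);
    // One inter-lane instruction: mask of all lanes whose rays share this
    // octant. A compaction/reorder by this mask would follow in practice.
    peers_out[i] = __match_any_sync(__activemask(), key);
}

int main()
{
    const int n = 6;
    Ray h_rays[n] = {{1,1,1},{-1,1,1},{1,1,1},{1,-1,1},{-1,1,1},{1,1,1}};
    Ray* d_rays; unsigned* d_peers; unsigned h_peers[n];
    cudaMalloc(&d_rays, sizeof h_rays);
    cudaMalloc(&d_peers, sizeof h_peers);
    cudaMemcpy(d_rays, h_rays, sizeof h_rays, cudaMemcpyHostToDevice);
    bin_rays<<<1, 32>>>(d_rays, n, d_peers);
    cudaMemcpy(h_peers, d_peers, sizeof h_peers, cudaMemcpyDeviceToHost);
    for (int i = 0; i < n; ++i)   // rays 0/2/5 share a bin, as do 1/4
        printf("ray %d: peer lane mask 0x%02x\n", i, h_peers[i]);
    cudaFree(d_rays); cudaFree(d_peers);
    return 0;
}
[/code]

A real implementation would use a fuller sort key (origin cell plus octant, or material ID at hit points), but the grouping machinery itself is nearly free on modern SMs/CUs, which is exactly why dedicated sorting HW has a high bar to clear.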
 
It’s probably true that some people don’t notice or don’t care about those details or that good looking reflections don’t matter to them when gaming. On the other hand I’m currently playing FC4 and there is a lot of water in that game. The reflections are hot steaming garbage. The people who aren’t bothered by it should consider themselves lucky.
For me, specular occlusion is just as important as mirror-like reflections on water, etc. Without specular occlusion, light leaking makes objects look out of place (because of how unnatural it looks). Without explaining that stuff, many would prefer the shiny glowing look just because people love shiny things even when they look out of place.
 
Interesting. They're tripling down on dedicated hardware. No mention of custom traversal or LOD. Hardware-accelerated BVH construction would impose even more limits on API flexibility.

I guess the theory is that if you can build and cast really, really fast, you can just brute-force your RT through high-resolution geometry without software optimization tricks.
Is building a BVH through GPU compute considered "h/w accelerated" here?
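For context on that question: what usually gets called a "GPU compute" build is something like an LBVH in the style of Karras's work. Here's a sketch of its first stage (the bit-interleaving below is the widely published trick; the kernel framing and names are mine): one thread per primitive computes a Morton code for its centroid, the codes get radix-sorted, and the node hierarchy is then emitted in parallel, all on the shader cores with zero fixed-function help.

[code]
// First stage of an LBVH-style GPU BVH build: Morton code generation.
#include <cmath>
#include <cstdio>

// Spread the lower 10 bits of v so two zero bits separate each data bit.
__host__ __device__ unsigned expand_bits(unsigned v)
{
    v = (v * 0x00010001u) & 0xFF0000FFu;
    v = (v * 0x00000101u) & 0x0F00F00Fu;
    v = (v * 0x00000011u) & 0xC30C30C3u;
    v = (v * 0x00000005u) & 0x49249249u;
    return v;
}

// 30-bit Morton code for a point inside the unit cube [0,1]^3.
__host__ __device__ unsigned morton3d(float x, float y, float z)
{
    x = fminf(fmaxf(x * 1024.0f, 0.0f), 1023.0f);
    y = fminf(fmaxf(y * 1024.0f, 0.0f), 1023.0f);
    z = fminf(fmaxf(z * 1024.0f, 0.0f), 1023.0f);
    return (expand_bits((unsigned)x) << 2) |
           (expand_bits((unsigned)y) << 1) |
            expand_bits((unsigned)z);
}

// One thread per primitive centroid; a radix sort over 'codes' plus a
// parallel hierarchy-emit pass completes the build on the compute units.
__global__ void morton_kernel(const float3* centroids, unsigned* codes, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        codes[i] = morton3d(centroids[i].x, centroids[i].y, centroids[i].z);
}

int main()
{
    // Nearby centroids get nearby codes; the sorted code order is what
    // makes a usable leaf order for the tree.
    printf("%08x %08x %08x\n",
           morton3d(0.10f, 0.10f, 0.10f),
           morton3d(0.11f, 0.10f, 0.10f),
           morton3d(0.90f, 0.90f, 0.90f));
    return 0;
}
[/code]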
 
Is that screenshot representative of the actual RT in action in Hitman 3? It's a still image, though, so yeah... it doesn't say much; and then there's the Intel logo at the top right.
 
Reflections and transparency reflections are in, it seems.
Not sure it's a good choice for Hitman specifically, since I don't really recall a lot of reflective materials in the first two games.
But maybe that is the reason for the choice?
 
Stop attributing your words to me. I made it crystal clear that by my criteria they should not be jumping to conclusions in areas where they don't have any expertise; this has nothing to do with benchmarking GPUs.
Though the quality of tests leaves much to be desired, as was noted on a previous page (not by me).

What I am telling you is that you are looking in the wrong place if you think that GPU reviews targeting the public are where you will find experts in the field. That is what you have been complaining about: game benchmarks and GPU reviews lacking actual technical expertise, and why no one should care about the results they present.
And that is why you should stick to actual GPU conferences or other material directly written by and targeted towards actual experts in the field.

This thread has a lot of posts with RT benchmarks from different sites whose validity is questionable according to your criteria. The only ones I assume get an actual pass are the ones from actual tech papers or live presentations.

Again you're trying to attribute your wild guesses to me.
But yes, it would indeed have been nice if all reviewers were able to interpret the math behind FPS and the low-FPS metrics, which they seem to struggle with; otherwise that weird low "instantaneous" FPS metric would never have appeared in the first place.

What I asked reflects exactly how you come across with your statements. And I asked the question to be sure that I had interpreted you correctly, which indeed seems to be the case, so you really ought to stop calling them wild guesses now.
It is baffling, though, that you are indeed prepared to turn tail the moment the reviewer shows proof that he has the technical expertise.

Things like poor frame latency were missed even by AMD's actual driver team, and a lot of benchmarking tools only started including it after the issue had been discovered, so even the actual experts are not error-free. It is even one of those disadvantages you would have thought the experts working for the competition would have discovered at once and tried to make a big deal of to the public.

The problem is that they succeed at highlighting basic FPS numbers (because it's so easy), but fail hard at highlighting where RT has an impact (super easy too) because they don't know how it works or where to look, hence wrong conclusions and misinformation. All that typical "RT works only on puddles" moaning in YT comments comes from watching videos like that.
What's the problem with showing the shortcomings of screen-space solutions by simply moving the camera up and down in a game? Why can't they show more diffuse reflections on metal walls and other surfaces in games (of which there are plenty)? Why don't they highlight light leaking without RT? Specular occlusion is just as important as mirror reflections in puddles, etc.

So now you say that you prefer cherry-picking scenes in order to show off the tech, whereas many reviews, I assume, try to find an average but repeatable scene to show the average performance you can expect in the game.

Basically, what you ought to be looking for are specific reviews like "Where will you see the key advantages of ray tracing?", not general gaming and GPU benchmarks. Read around and you will see a lot of people focusing on the gaming benchmarks and not the 3DMark ones.


That is the problem every new feature has had in the past: the industry says this shiny new feature is better, but the provided showcases have failed to convince the public why.

If that's a fully subjective opinion of the reviewer, then they need a disclaimer: "Hey, we are not experts in RT, so we can't highlight it for you in this video because we don't know where to look", but instead they prefer jumping to the conclusion that RT doesn't matter.

A review is by its nature prone to be subjective, be it of books, films, or GPUs. And that is what most reviews targeting PC gamers boil down to: both the reviewer and the readers asking, is what you see here worth the performance loss?
 
The sole purpose of the 99th-percentile frame-time metric, aka the 1% low "FPS" metric, is to highlight outliers; you don't need graphs to figure that out unless this is the very first time you've dealt with frame times.
Shader compilation and resource allocation are the two most common causes of stutters, so guess which frames will fall into the 99th-percentile frame-time bucket?
Besides, you probably didn't notice, but look at the results for the 3070 Ti here; drops like that usually happen due to memory allocation problems (memory leaks, etc.).


People have changed. I miss the professional reviews from techreport.com, but other folks simply don't care what they're being fed.

Considering that you can get 100 frames within 1-2 seconds, the 99th-percentile frame time isn't exactly an outlier. If it's being measured correctly, then you'd be getting a hitch every couple of seconds, or the FPS drops big time in some area.
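Since the definitions keep causing confusion, here's a minimal worked sketch (one common definition; tools differ in the details) of how a "1% low FPS" figure falls out of a frame-time trace, and why at ~100 FPS it stops being an outlier measure once hitches land about once a second:

[code]
// "1% low FPS" from a frame-time trace (plain host code, illustrative).
#include <algorithm>
#include <cstdio>
#include <vector>

double one_percent_low_fps(std::vector<double> frame_ms)
{
    // The 99th-percentile frame time is the value 99% of frames beat;
    // reviewers then invert it and report it as an "FPS" number.
    std::sort(frame_ms.begin(), frame_ms.end());
    size_t idx = (size_t)(0.99 * (frame_ms.size() - 1));
    return 1000.0 / frame_ms[idx];
}

int main()
{
    // ~60 s at ~100 FPS: 6000 frames of 10 ms, with a 40 ms hitch every
    // 50 frames (say, shader compilation or an allocation stall).
    std::vector<double> trace(6000, 10.0);
    for (size_t i = 0; i < trace.size(); i += 50) trace[i] = 40.0;

    // Average frame time is ~10.6 ms (~94 FPS), barely moved by the
    // hitches, but the 99th percentile lands right on the 40 ms spikes,
    // so the "1% low" collapses to 25 FPS. At 100 FPS, 1% of frames is
    // one frame every second: regular hitching, not a rare outlier.
    printf("1%% low: %.1f FPS\n", one_percent_low_fps(trace));
    return 0;
}
[/code]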

So the save file doesn't save time of day or sky settings. Makes sense, but a bit inconvenient for what we need.

This time I grabbed the screenshot properly, with Temporal AA and the same sky settings:
[screenshot: 2020-12-23-18-46.png]

Revisited this save game on my 5800X and 6800 XT. The biggest difference I see is that there is next to no CPU load; not sure if it's a change or whether something else was going on in the background before.

[screenshot: WbVhcTn.jpg]
 
Many levels in Hitman 3 have LOTS and LOTS of reflective surfaces.
Yeah, I played the demo with the Dubai level (or whatever the skyscraper level was), and all I could think was "this should have RT": so many reflective surfaces and so many artifacts. Even though their use of planar reflections is pretty nice in some places. I wonder if RT will make its way to consoles at some point?
 
After some digging into Nvidia's TTU (RT Core) patents published in 2018, I've found that they actually do ray grouping in hardware.

https://patents.justia.com/patent/11157414
https://www.freepatentsonline.com/11157414.pdf

Sorry for the long text!
Ray Operation Scheduling

To provide high efficiency, the example non-limiting embodiment L0 cache 750 provides ray execution scheduling via the data path into the cache itself. In example non-limiting embodiments, the cache 750 performs its ray execution scheduling based on the order in which it fulfills data requests. In particular, the cache 750 keeps track of which rays are waiting for the same data to be returned from the memory system and then—once it retrieves and stores the data in a cache line—satisfies at about the same time the requests of all of those rays that are waiting for that same data.

The cache 750 thus imposes a time-coherency on the TTU 700's execution of any particular collection of currently-activated rays that happen to be currently waiting for the same data by essentially forcing the TTU to execute on all of those rays at about the same time. Because all of the rays in the group execute at about the same time and each take about the same time to execute, the cache 750 effectively bunches the rays into executing time-coherently by serving them at about the same time. These bunched rays go on to repetitively perform each iteration in a recursive traversal of the acceleration data structure in a time-coherent manner so long as the rays continue to request the same data for each iteration.

The cache 750's bunching of rays in a time-coherent manner by delivering to them at about the same time the data they were waiting on, effectively schedules the TTU 700's next successive data requests for these rays to also occur at about the same time, meaning that the cache can satisfy those successive data requests at about the same time with the same new data retrieved from the memory system.

An advantage of the L0 cache 750 grouping rays in this way is that the resulting group of ray requests executing on the same data take roughly the same traversal path through the hierarchical data structure and therefore will likely request the same data from the L0 cache 750 at about the same time for each of several successive iterations—even though each individual ray request is not formally coordinated with any other ray request. By the fact that the L0 cache 750 is scheduling via the data path to TTU blocks 710, 712, 740, the L0 cache is effectively scheduling its own future requests to the memory system on behalf of the rays it has bunched together in order to minimize latency while providing acceptable performance with a relatively small cache data RAM having a relatively small number of cache lines. This bunching also has the effect of improving the locality of reference in the L1 cache and any downstream caches.

If rays in the bunch begin to diverge by requesting different traversal data, cache 750 ceases serving them at the same time as other rays in the bunch. The divergence happens as the size of the bounding boxes in the BVH decreases at lower levels. What might have been minute, ignorable differences in origin or direction early on, now cause rays previously bunched to miss or hit those smaller bounding boxes differently.


Hierarchical Data Structure Traversal—how Ray Operations are Activated

...

Additionally, by grouping the requests together so that many ray-complet tests that are tested against the same complet data are scheduled to be performed at more or less at the same time, rays that are “coherent”—meaning that they are grouped together to perform their ray-complet tests against the same complet data—will remain grouped together for additional tests. By continuing to group these rays together as a bundle of coherent rays, the number of redundant memory access to retrieve the same complet data over and over again is substantially reduced and therefore the TTU 700 operates much more efficiently. In other words, the rays that are taking more or less the same traversal path through the acceleration data structure tend to be grouped together for purposes of execution of the ray-complet test—not just for the current test execution but also for further successive test executions as this bundle of “coherent” rays continue their way down the traversal of the acceleration data structure. This substantially increases the efficiency of each request made out to memory by leveraging it across a number of different rays, and substantially decreases the power consumption of the hardware.

...

Great advantages are obtained by the ability of the L0 caching structure 750 to group ray execution based on the data the grouped rays require to traverse the acceleration data structure. The SM 132 that presents rays to TTU 700 for complet testing, in a general case may have no idea that those rays are “coherent.” The rays may exist adjacent to one another in three-dimensional space, but typically traverse the acceleration data structure entirely independently. Whether particular rays are thus coherent with one another depends not only on the spatial positions of the rays (which SM 132 can determine itself or through other utility operations such as even artificial intelligence), but also on the detailed particular acceleration data structure the rays are traversing. Because the SM 132 requesting the ray-complet tests does not necessarily have direct access to the detailed acceleration data structure (nor would it have the time to analyze the acceleration data structure even if it did have access), the SM relies on TTU 700 to accelerate the data structure traversal for whatever rays the SM 132 presents to the TTU for testing. In the example non-limiting embodiment, the TTU 700's own L0 cache 750 provides an additional degree of intelligence that discovers, based on independently-traversing rays requesting the same complet data at about the same time, that those rays are coherently traversing the acceleration data structure. By initially grouping these coherent rays together so that they execute ray-complet tests at about the same time, these coherent rays are given the opportunity again to be grouped together for successive tests as the rays traverse the acceleration data structure. The TTU L0 cache 750 thus does not rely on any predetermined grouping of rays as coherent (although it does make use of a natural presentation order of rays by the requesting SM 132 based simply on spatial adjacency of the rays as presented by the SM for testing), but instead observes based on the data the rays require for testing as they traverse the acceleration data structure that these rays are traversing the same parts of the acceleration data structure and can be grouped together for efficiency.
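To make the quoted mechanism easier to digest, here's a toy host-side model (my own drastic simplification, nothing like NVIDIA's actual design): treat each BVH node as one cache line, park every stalled ray on the line it's waiting for, and wake all rays parked on a line together when that line arrives. The bunch then issues its next requests together too, so the grouping reinforces itself until rays diverge onto different lines.

[code]
// Toy model of data-path ray bunching in an L0-style cache (illustrative).
#include <cstdio>
#include <unordered_map>
#include <vector>

struct Ray { int id; int node; };   // node = BVH node the ray wants next

int main()
{
    // Rays stalled on node fetches; several want node 7. The cache infers
    // their coherence purely from the addresses they request -- no spatial
    // analysis, no predetermined grouping by the SM.
    std::vector<Ray> rays = {{0,7},{1,3},{2,7},{3,7},{4,3},{5,9}};

    // Park each ray on the cache line (here: node id) it waits for.
    std::unordered_map<int, std::vector<int>> waiting;
    for (const Ray& r : rays) waiting[r.node].push_back(r.id);

    // Memory returns one line at a time; every ray parked on that line is
    // woken in the same "cycle", so the bunch also issues its *next*
    // requests together. Divergent rays simply end up on different lines.
    for (const auto& entry : waiting) {
        printf("line for node %d arrives -> wake rays:", entry.first);
        for (int id : entry.second) printf(" %d", id);
        printf("\n");
    }
    return 0;
}
[/code]

Which would line up with conclusion #2 from the CEUR paper quoted earlier: the grouping is emergent from the memory system rather than an explicit up-front sort.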
 