GPU Ray Tracing Performance Comparisons [2021-2022]

More programmable than the "normal" cores? So a second fully programmable co-processor just for raytracing?

It makes sense to start with "fixed function" hardware (NVIDIA's RT performance advantage over AMD shows that current "compute units" do not have sufficient power in comparison) and then migrate it over once the "compute units" are capable enough.

RT is a computational beast and AMD's implementation shows the downside of using "compute units" versus "fixed units".
This will most likely change in the future...but for now, it is not the optimal solution...all reviews show this.

You have to crawl before you walk...
 
I also don't understand the discussion. Are people really against offering more programmable hardware? I see the consoles as having more flexible APIs in the RT space, even if the differences are fairly small. Like if DXR1.2 comes out and allows developers to write some kind of custom traversal, are people going to argue against it? Like, asking people to prove programmability is good is kind of weird.
The question is what's the cost of this flexibility? Would you be okay with RT performance dropping 2-3 times universally just so some (not all) engines would be able to use h/w RT in a "nice" way (as in, instead of relying on hacks like all graphics do)? What would that give us? Nanite native mesh RT at performance lower than that of compute based Lumen?

There's a reason why current APIs are limited. This reason is performance. Full flexibility gives you general compute, good luck using it for per pixel RT.

Console APIs are not better either; if that were the case we would already have seen the advantages they provide.
 
It makes sense to start with "fixed function" hardware (NVIDIA's RT performance advantage over AMD shows that current "compute units" do not have sufficient power in comparison) and then migrate it over once the "compute units" are capable enough.

Don't understand this. You cannot make "general compute units" more capable for raytracing. Professional ISVs have used GPUs for raytracing for years. NVIDIA has a pure software solution with OptiX. And yet nearly everyone has adopted hardware accelerated raytracing.

Obviously "new" cores or compute units are necessary for hardware accelerated raytracing.
 
Ok, don't think anyone is arguing with you there.
But that's always been the case. Consoles allow you to spend dev cycles programming to the metal because there's only one or a handful of hardware platforms to support. With PCs you need abstraction, which is a boon for many but a bane for ninja devs. Always has been, always will be.

I think @JoeJ 's complaint is that DXR's abstraction level is even higher than usual. I could see that, it's wrapping a fairly complex set of hardware primitives. The API will evolve with time (as will the hardware), but it'll never reach console-level control.
 
Don't understand this. You cannot make "general compute units" more capable for raytracing. Professional ISVs have used GPUs for raytracing for years. NVIDIA has a pure software solution with OptiX. And yet nearly everyone has adopted hardware accelerated raytracing.

Obviously "new" cores or compute units are necessary for hardware accelerated raytracing.

Look at how "compute units" and APIs have evolved over time.
The G80 marked NVIDIA's first transition to a more compute-oriented architecture.
This is a nice look back:
https://www.extremetech.com/gaming/...orce-8800-changed-pc-gaming-computing-forever

Look at what he says about Tesla to Fermi to Pascal...
 
The question is what's the cost of this flexibility? Would you be okay with RT performance dropping 2-3 times universally just so some (not all) engines would be able to use h/w RT in a "nice" way (as in, instead of relying on hacks like all graphics do)? What would that give us? Nanite native mesh RT at performance lower than that of compute based Lumen?
No, in the context of an open BVH and Nanite that's not the question, because it does not come up:
Nanite is regular triangle meshes, RT Cores expect just that, so their operation is not affected, nor do we need new flexibility here. Traversal and the RT Core can and should remain black boxed as they are.
We only need to modify the BVH data to update partial mesh clusters to switch geometry resolution. The result is again static, regular triangle meshes.

The question of whether we need or want flexibility within traversal is a very different one; see Intel's stochastic LOD paper.
Nanite solves LOD on the level of geometry, so both RT and rasterization can 'fake' continuous LOD using discrete changes of the mesh.
Stochastic LOD in contrast solves LOD in image space, requiring discrete detail levels to be switched individually per pixel, and so requiring traversal shaders to work properly.
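To illustrate what "switching detail per pixel" means in practice, here is a minimal sketch of the per-ray LOD pick that stochastic LOD relies on (the function name and the continuous-LOD input are made up for illustration, not taken from Intel's paper):

```cpp
#include <cmath>

// Hypothetical helper: dither a continuous LOD value to a discrete level,
// independently per ray/pixel, so the average over many rays matches the
// continuous target. Because the chosen level differs ray by ray, the
// decision has to happen during traversal, which is why a programmable
// traversal stage (traversal shaders) is needed for this approach.
int SelectStochasticLod(float continuousLod, float random01)
{
    const int   coarse = static_cast<int>(std::floor(continuousLod));
    const float frac   = continuousLod - static_cast<float>(coarse);
    return (random01 < frac) ? coarse + 1 : coarse;  // pick one of the two nearest levels
}
```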

Both are interesting, but Nanite is quite convincing by its results and proven to work. It's harder to implement, but there's no need to question actual hardware acceleration.

Actually the only effect on tracing performance comes from the difference between a BVH generated offline by custom engine code vs. the driver building it in real time. But that's a hypothetical question, because the driver cannot do this in practice.
 
I think @JoeJ 's complaint is that DXR's abstraction level is even higher than usual. I could see that, it's wrapping a fairly complex set of hardware primitives. The API will evolve with time (as will the hardware), but it'll never reach console-level control.

Yeah the BVH structure is opaque for a very good reason. Microsoft probably didn’t want the responsibility of defining one acceleration structure to rule them all. DXR doesn’t even care if underneath the hood it’s a BVH, k-d tree or something else as long as you can cast a ray into it and hit something.

The irony is that in order to give developers flexibility you have to limit the flexibility of the hardware implementation by mandating a specific acceleration structure. This would probably also mean mandating how compression of that structure works. This is no different to mandating the structure of a triangle strip or a texture compression format. The only difference is that triangle strips have an obvious “best” representation. This isn’t the case for RT acceleration structures so Microsoft decided to punt. Or the IHVs demanded control.
 
The irony is that in order to give developers flexibility you have to limit the flexibility of the hardware implementation by mandating a specific acceleration structure. This would probably also mean mandating how compression of that structure works. This is no different to mandating the structure of a triangle strip or a texture compression format. The only difference is that triangle strips have an obvious “best” representation. This isn’t the case for RT acceleration structures so Microsoft decided to punt. Or the IHVs demanded control.
Yes. But that's solvable, e.g. using abstracted shader language structures and/or functions to access nodes and set child pointers, and then running some post process for vendor compression.
Such a post process could even handle conversion from BVH4 to BVH8, for example, to make it really easy for the devs.
Though personally I think this would again end up compromising performance or still having limitations, not sure.
Seems better to get started with vendor extensions, and shape the DXR API after it turns out what the differences, practices, problems, etc. are.
We see on console, where there is only one vendor, it's quite easy. And treating each vendor specifically seems easier than forcing conventions on them just yet.
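Just to make the idea concrete, here is a rough sketch of what such an abstracted node interface could look like. Nothing here is an existing API; it's only the shape of the thing being proposed (the app writes vendor-neutral nodes, a driver pass compresses them into the native layout, possibly collapsing BVH4 into BVH8 along the way):

```cpp
#include <cstdint>
#include <vector>

// Hypothetical vendor-neutral node the developer would author directly.
struct AbstractBvhNode
{
    float    boundsMin[3] = {};
    float    boundsMax[3] = {};
    uint32_t firstChild   = 0;   // index of first child node, or first triangle if leaf
    uint32_t count        = 0;   // number of children, or triangles if leaf
    bool     isLeaf       = false;
};

// Hypothetical editable BVH: the app sets nodes and child pointers in the
// abstract form, then hands the result to a vendor-specific post process.
class EditableBvh
{
public:
    void SetNode(uint32_t index, const AbstractBvhNode& node)
    {
        if (index >= nodes.size()) nodes.resize(index + 1);
        nodes[index] = node;
    }

    // Placeholder for the vendor post process: this is where collapsing to
    // a wider branching factor (e.g. BVH4 -> BVH8), box quantization and
    // compression into the native memory layout would happen in a driver.
    void CompressToVendorFormat() { /* driver / vendor specific */ }

private:
    std::vector<AbstractBvhNode> nodes;
};
```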
 
We only need to modify the BVH data to update partial mesh clusters to switch geometry resolution. The result is again static, regular triangle meshes.

What data structure would allow you to do this? Unless you treat each cluster as its own BLAS there is no practical way for a developer to reference a specific node or treelet within the BVH. How would they even know where to look for the geometry they want to modify?

If we treat each cluster as its own BLAS then we can accomplish LOD today with DXR. Just delete/rebuild the BLAS as needed.
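Roughly, yes: with plain DXR today each cluster would get its own BLAS that is rebuilt (or swapped) when its LOD changes, with the TLAS rebuilt over the instances afterwards. A compressed sketch of the per-cluster build call, where the Cluster struct and buffer handles are placeholders and resource creation, sizing (GetRaytracingAccelerationStructurePrebuildInfo) and barriers are left out:

```cpp
#include <d3d12.h>

// Placeholder description of one cluster at its currently selected LOD.
// The struct and buffer handles are made up for this sketch; in a real
// engine they come from the cluster streaming system.
struct Cluster
{
    D3D12_GPU_VIRTUAL_ADDRESS vertexBuffer;
    UINT                      vertexCount;
    UINT                      vertexStride;
    D3D12_GPU_VIRTUAL_ADDRESS indexBuffer;
    UINT                      indexCount;
    D3D12_GPU_VIRTUAL_ADDRESS blasBuffer;    // destination AS memory
    D3D12_GPU_VIRTUAL_ADDRESS scratchBuffer; // scratch memory for the build
};

// Rebuild one cluster's BLAS after its LOD (i.e. its index/vertex data) changed.
void RebuildClusterBlas(ID3D12GraphicsCommandList4* cmdList, const Cluster& c)
{
    D3D12_RAYTRACING_GEOMETRY_DESC geom = {};
    geom.Type  = D3D12_RAYTRACING_GEOMETRY_TYPE_TRIANGLES;
    geom.Flags = D3D12_RAYTRACING_GEOMETRY_FLAG_OPAQUE;
    geom.Triangles.VertexBuffer.StartAddress  = c.vertexBuffer;
    geom.Triangles.VertexBuffer.StrideInBytes = c.vertexStride;
    geom.Triangles.VertexFormat = DXGI_FORMAT_R32G32B32_FLOAT;
    geom.Triangles.VertexCount  = c.vertexCount;
    geom.Triangles.IndexBuffer  = c.indexBuffer;
    geom.Triangles.IndexFormat  = DXGI_FORMAT_R16_UINT;
    geom.Triangles.IndexCount   = c.indexCount;

    D3D12_BUILD_RAYTRACING_ACCELERATION_STRUCTURE_DESC build = {};
    build.Inputs.Type           = D3D12_RAYTRACING_ACCELERATION_STRUCTURE_TYPE_BOTTOM_LEVEL;
    build.Inputs.Flags          = D3D12_RAYTRACING_ACCELERATION_STRUCTURE_BUILD_FLAG_PREFER_FAST_BUILD;
    build.Inputs.DescsLayout    = D3D12_ELEMENTS_LAYOUT_ARRAY;
    build.Inputs.NumDescs       = 1;
    build.Inputs.pGeometryDescs = &geom;
    build.DestAccelerationStructureData    = c.blasBuffer;
    build.ScratchAccelerationStructureData = c.scratchBuffer;

    cmdList->BuildRaytracingAccelerationStructure(&build, 0, nullptr);
    // ...then a UAV barrier on the BLAS, and rebuild/refit the TLAS over all instances.
}
```

The catch, as discussed below, is what this does to the BLAS count and the per-frame TLAS build cost.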
 
Yes. But that's solvable, e.g. using abstracted shader language structures and/or functions to access nodes and set child pointers, and then running some post process for vendor compression.
Such a post process could even handle conversion from BVH4 to BVH8, for example, to make it really easy for the devs.
Though personally I think this would again end up compromising performance or still having limitations, not sure.
Seems better to get started with vendor extensions, and shape the DXR API after it turns out what the differences, practices, problems, etc. are.
We see on console, where there is only one vendor, it's quite easy. And treating each vendor specifically seems easier than forcing conventions on them just yet.

If you do this you still bias the API toward a particular hardware implementation. If an IHV chooses to go their own way they will have to pay the cost of constantly converting back and forth from Microsoft’s data structure. Not worth it.
 
If we treat each cluster as its own BLAS then we can accomplish LOD today with DXR. Just delete/rebuild the BLAS as needed.
Yeah, in theory. But then we have far too many BLASes, and building a huge TLAS every frame takes too long.
That's bad: we are still talking about static geometry, so we could keep the node count of the TLAS as small as it is.
Now we could maybe add some more levels here, like TL, BL0, BL1, BL2. Maybe this is what Karis has in mind from his Twitter posts. IDK, my own proposals here are just my personal visions and may differ from Epic's ideas.

What data structure would allow you to do this? Unless you treat each cluster as its own BLAS there is no practical way for a developer to reference a specific node or treelet within the BVH. How would they even know where to look for the geometry they want to modify?
The data structure we need is a node, its child pointers / triangle indices. The developer knows which node refers to which patch of triangles by linking it to their own tree used to select LOD.
(Notice this means we still have duplicated AS data this way: ours and the RT BVH. A traversal shader in a flexible future could eventually work using just one, totally custom AS. But that's far-fetched, and I'm not sure if we ever want this.)

Edit: Of course we also need to set bounding boxes per node, and, most difficult, we need to generate / delete nodes, which involves memory management and compaction problems, and possibly memory orderings expected by the HW. Depending on HW, this can be quite a big problem.
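A minimal sketch of that linkage, purely hypothetical since DXR exposes no handle into the driver's BVH nodes: the engine's own LOD tree remembers which (writable) BVH node each cluster patch lives in, and switching resolution patches that node's triangle range and box. Node creation/deletion, allocation and compaction are exactly the hard parts left out here.

```cpp
#include <cstdint>
#include <vector>

// All hypothetical: sketches the linkage between the engine's own LOD tree
// and the nodes of an assumed writable RT BVH.

struct WritableBvhNode                // assumed writable RT BVH node
{
    float    boundsMin[3], boundsMax[3];
    uint32_t firstChildOrTriangle;
    uint32_t count;
    bool     isLeaf;
};

struct EngineLodNode                  // the engine's own hierarchy used to pick LOD
{
    uint32_t bvhNodeIndex;            // which RT BVH node this patch of triangles lives in
    uint32_t children[8];             // indices into the engine tree
    uint32_t childCount;
};

// After a cluster group switched resolution: point its BVH node at the new
// triangle range and refit the box. Creating/deleting nodes (allocation,
// compaction, HW-expected memory layout) is not shown.
void PatchClusterLod(std::vector<WritableBvhNode>& bvh, const EngineLodNode& engineNode,
                     uint32_t newFirstTriangle, uint32_t newTriangleCount,
                     const float newMin[3], const float newMax[3])
{
    WritableBvhNode& n = bvh[engineNode.bvhNodeIndex];
    n.isLeaf               = true;
    n.firstChildOrTriangle = newFirstTriangle;
    n.count                = newTriangleCount;
    for (int i = 0; i < 3; ++i) { n.boundsMin[i] = newMin[i]; n.boundsMax[i] = newMax[i]; }
    // Parent boxes would also need refitting up the tree.
}
```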
 
In regards to "compute units" and performance:
https://www.gamersnexus.net/dictionary/2-cuda-cores

"Architecture changes in a fashion that makes cross-generation comparisons often non-linear, but generally speaking (within a generation), more CUDA cores will equate more raw compute power from the GPU. The Kepler to Maxwell architecture jump saw nearly a 40% efficiency gain in CUDA core processing ability, illustrating the difficulty of linearly drawing comparisons without proper benchmarks."
 
The question is what's the cost of this flexibility? Would you be okay with RT performance dropping 2-3 times universally just so some (not all) engines would be able to use h/w RT in a "nice" way (as in, instead of relying on hacks like all graphics do)? What would that give us? Nanite native mesh RT at performance lower than that of compute based Lumen?

There's a reason why current APIs are limited. This reason is performance. Full flexibility gives you general compute, good luck using it for per pixel RT.

Console APIs are not better either; if that were the case we would already have seen the advantages they provide.

Imagining the hypothetical scenario where you could install and run full Windows 10 on a PS5 or Series X, I doubt that the console APIs offer less performance than DXR for exactly the same hardware. Whatever additional flexibility the console APIs have probably does not come with any overall performance cost. So I don't really know what people are arguing. We have the hardware. I don't know if there are features that are not exposed in DXR. I'm not exactly sure what extra options you have for manipulating the ray tracing data in the console APIs. I just think it's kind of plainly true that over time the DXR API will expose more access to hardware features and more opportunities to manipulate the data. This seems non-controversial. I'm not talking about weird hypothetical scenarios like what if DXR never existed and we could have a do-over and redesign the ray tracing hardware to work differently and be a generalized compute function.
 
Yeah, in theory. But then we have far too many BLASes, and building a huge TLAS every frame takes too long.
That's bad: we are still talking about static geometry, so we could keep the node count of the TLAS as small as it is.
Now we could maybe add some more levels here, like TL, BL0, BL1, BL2. Maybe this is what Karis has in mind from his Twitter posts. IDK, my own proposals here are just my personal visions and may differ from Epic's ideas.

The data structure we need is a node, its child pointers / triangle indices. The developer knows which node refers to which patch of triangles by linking it to their own tree used to select LOD.
(Notice this means we still have duplicated AS data this way: ours and the RT BVH. A traversal shader in a flexible future could eventually work using just one, totally custom AS. But that's far-fetched, and I'm not sure if we ever want this.)

Edit: Of course we also need to set bounding boxes per node, and, most difficult, we need to generate / delete nodes, which involves memory management and compaction problems, and possibly memory orderings expected by the HW. Depending on HW, this can be quite a big problem.

Yeah when you start unpacking the details it becomes super obvious why DXR is the way it is. Consoles have the benefit of a fixed platform so there’s no need to be flexible from a hardware perspective.
 
If an IHV chooses to go their own way they will have to pay the cost of constantly converting back and forth from Microsoft’s data structure. Not worth it.
Thus I want vendor extensions first, to be sure.
However, BVH data structures are usually simple. The only potential variations seem to be:
Branching factor (BVH2, 4, 8... 64?)
Pointer per child vs. pointer to first child plus child count. (Same for triangles in leaves)
Treelets with box coords relative to the treelet root.


What else?
An API to expose all this, plus a way to query the driver for what it expects, would do, and we would already have a vendor-independent BVH API.
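For concreteness, the kind of layout differences that list boils down to might look like this (illustrative structs only, not any vendor's real node format):

```cpp
#include <cstdint>

// Variation 1: branching factor. A BVH4 node holds four child boxes in SoA form.
struct Bvh4Node
{
    float    childMinX[4], childMinY[4], childMinZ[4];
    float    childMaxX[4], childMaxY[4], childMaxZ[4];
    uint32_t childIndex[4];        // pointer per child...
    uint32_t childIsLeaf;          // ...plus a leaf bitmask
};

// Variation 2: one pointer to the first child plus a count
// (same idea for triangle ranges in leaves).
struct CompactNode
{
    float    boundsMin[3], boundsMax[3];
    uint32_t firstChild;           // children stored contiguously
    uint8_t  childCount;
    uint8_t  isLeaf;
};

// Variation 3: treelets with box coordinates quantized relative to the
// treelet root, so each child box needs only a few bits per axis.
struct TreeletNode
{
    uint8_t  qMinX[4], qMinY[4], qMinZ[4];   // quantized against the treelet root box
    uint8_t  qMaxX[4], qMaxY[4], qMaxZ[4];
    uint32_t childIndex[4];
};

struct Treelet
{
    float       rootMin[3], rootMax[3];      // dequantization frame
    TreeletNode nodes[8];                    // small fixed group of nodes
};
```

An API that lets the app read and write nodes in one of these forms, plus a driver query for which form (branching factor, quantization, treelet size) the device expects, is roughly the vendor-independent BVH API being described.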
 
Thus I want vendor extensions first, to be sure.
You have to understand the politics of the situation: only NVIDIA seems to regard RT highly, and they think their current approach is the optimal one.

AMD is not promoting RT. Their public speakers regard it as a complementary feature alongside rasterization, and two years ago they outright downplayed RT completely. Their RT roadmap wants current RT to be limited to shadows, with the next step happening in the cloud. They actively sponsor games to implement RT shadows alone (Godfall, WoW, Dirt 5, Riftbreaker, Far Cry 6), they didn't even release a public demo to showcase RT to their user base, and even their pre-rendered RT demo is downright unimpressive. Heck, Sony and Microsoft appear more enthusiastic about RT than AMD themselves.

Then add to that the almost non-existent professional RT acceleration, their low market share, and their uncompetitive hardware implementation, which could well change to a much better solution in RDNA3... if you think they are going to spend time creating custom extensions on PC for RDNA2 after all of that, then I would say you are unrealistically optimistic.
 
Nvidia isn't really incentivized to create extensions either. Their immediate goal should be to drive adoption in a fair game they're currently winning. Not to split the user base and encourage griping about proprietary features / consumer lock-in etc.
 
Going back two decades, we had two generations of fixed function hardware TnL before some more configurability was introduced, and full programmability came even later. Maybe it will be the same with RT: a few generations with limited functionality, just to get developers accustomed to it and to establish best practices, all the while building an installed hw base. Then RT expands into more flexible approaches, while at the same time the compute part might become fast enough that we can do away with hybrid approaches.

Exactly, that's how things have been going for ages now.

I also don't understand the discussion. Are people really against offering more programmable hardware? I see the consoles as having more flexible APIs in the RT space, even if the differences are fairly small. Like if DXR1.2 comes out and allows developers to write some kind of custom traversal, are people going to argue against it? Like, asking people to prove programmability is good is kind of weird.

That's not what I'm trying to say, at least on my part. What I'm disputing is the claim that 'console RT is better' than PC ray tracing, which is a totally false statement.

The current console API is better, not the HW. Console perf being less than high-end PC is nothing new and can be scaled as usual. It's not relevant to the API topic at all.

You don't need 'high end' to greatly surpass the consoles in ray tracing capabilities either. I do agree the APIs allow lower-level access on consoles, but that doesn't really help the case right now in the console RT vs PC RT debate.

I'm out of this for a while.

No you're not :p

I don't think people are against customization at the lowest possible levels. I think the debate is around perspective of what needs to arrive first: faster generic speed at the cost of customization, or slower generic speed with customization.

As many have stated earlier, hardware TnL is how we started before moving to compute. Admittedly, a lot of games still leverage the 3D pipeline even though compute exists, so I suspect there is some benefit (to teams) in having a faster fixed function pipeline (as an option), even though I believe in fully compute based engines. Along the same path, I believe RT does in fact need to go in the direction of generic compute over time. I think the debate is whether it needs to move there at the starting line, or to get there over time. As @JoeJ says, we already took our learnings from compute, so why take the step back. I definitely see the argument from both sides, and imo there is no clear winner here that I can see; perhaps it just comes down to preference.

Perhaps it's not for us to debate; this is probably something the IHVs need to chime in on. NVIDIA knows precisely why they went one route, AMD another, and MS is attempting to create an API to support all IHVs. I wish Max McCullen was still here; perhaps he could chime in.

Agree with your post, nicely written. I wasn't debating software vs hardware, because obviously software is better if possible, but we don't have 200TF GPUs just yet. It's like PS2 vs GF4 Ti4600....
Again, what I didn't agree with is the claim that 'console RT is better than PC RT', which is the first time I've heard someone claim this btw, notwithstanding what the actual results show.

I'm sure NVIDIA will eventually go in the direction of RT on compute as well, just like any tech before it that went HW-first on NV hardware.

Imagining the hypothetical scenario where you could install and run full Windows 10 on a PS5 or Series X, I doubt that the console APIs offer less performance than DXR for exactly the same hardware. Whatever additional flexibility the console APIs have probably does not come with any overall performance cost. So I don't really know what people are arguing. We have the hardware. I don't know if there are features that are not exposed in DXR. I'm not exactly sure what extra options you have for manipulating the ray tracing data in the console APIs. I just think it's kind of plainly true that over time the DXR API will expose more access to hardware features and more opportunities to manipulate the data. This seems non-controversial. I'm not talking about weird hypothetical scenarios like what if DXR never existed and we could have a do-over and redesign the ray tracing hardware to work differently and be a generalized compute function.

Been thinking this as well; who's to say DXR/PC APIs won't improve over time. Anyway, thanks for a civil/polite discussion ;)
 