RDNA 2 Ray Tracing

Jawed

Legend
Annoyingly, page 222 (230 of the PDF) has an incomplete sentence/paragraph:

The image_bvh_intersect_ray and image_bvh64_intersect_ray opcode do not support all of

I'm going to guess this is merely a reference to "texture" functionality that is not supported, such as sampler modes. See the section "Restrictions on image_bvh instructions" on page 81 (89 of the PDF):
  • DMASK must be set to 0xf (instruction returns all four DWORDs)
  • D16 must be set to 0 (16 bit return data is not supported)
  • R128 must be set to 1 (256 bit T#s are not supported)
  • UNRM must be set to 1 (only unnormalized coordinates are supported)
  • DIM must be set to 0 (BVH textures are 1D)
  • LWE must be set to 0 (LOD warn is not supported)
  • TFE must be set to 0 (no support for writing out the extra DWORD for the PRT hit status)
  • SSAMP must be set to 0 (just a placeholder, since samplers are not used by the instruction)
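A minimal sketch of those restrictions as a checker, in Python. The required values come straight from the list above; the `ImageBvhFields` container and the `violations` helper are hypothetical names made up for illustration and are not part of any AMD tool or driver.

```python
from dataclasses import dataclass

# Required field values, copied from the restriction list above.
REQUIRED = {
    "dmask": 0xF,  # all four DWORDs returned
    "d16": 0,      # no 16-bit return data
    "r128": 1,     # 256-bit T#s are not supported
    "unrm": 1,     # unnormalized coordinates only
    "dim": 0,      # BVH textures are 1D
    "lwe": 0,      # no LOD warn
    "tfe": 0,      # no extra PRT status DWORD
    "ssamp": 0,    # placeholder, samplers are unused
}


@dataclass
class ImageBvhFields:
    """Hypothetical container for the fields of an image_bvh instruction."""
    dmask: int = 0xF
    d16: int = 0
    r128: int = 1
    unrm: int = 1
    dim: int = 0
    lwe: int = 0
    tfe: int = 0
    ssamp: int = 0


def violations(fields):
    """Return one message per field that breaks a documented restriction."""
    problems = []
    for name, want in REQUIRED.items():
        got = getattr(fields, name)
        if got != want:
            problems.append(f"{name.upper()} must be {want:#x}, got {got:#x}")
    return problems
```

For example, `violations(ImageBvhFields(d16=1))` reports a single problem for D16, while the default-constructed fields report none.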
It's notable that the query instruction isn't specialised for box or triangle testing: the node provided as the starting point determines whether the result consists of multiple box nodes or a single triangle node. This might imply that a single "texture" contains nodes that are either boxes or triangles. Or it could imply that there's one texture for boxes and another for triangles.

There's support for BVH data that is larger than 32GB, but the count of nodes can only be described by a 42-bit number.

There's support for float16 ray and inverse ray direction vectors, "for performance and power optimization", saving 3 VGPRs. I can imagine console developers being all over that optimisation. Though until someone captures the ray cast shaders, we won't really know much more.

There are some interesting flags:
  • Box sorting enable
  • Triangle_return_mode
  • big_page
I'm struggling to understand why a single ray is producing multiple box hit results. "For box nodes the results contain the 4 pointers of the children boxes in intersection time sorted order." That is the order that the machine produced the results, not necessarily "box sorted". So the zeroth result, when sorted, would be a "first hit". But a first box hit doesn't mean there was a triangle to hit.

So I think this is merely a way to "speed-up" box traversal, by allowing the ray to progress into multiple boxes per query and then to determine which of those boxes should be queried further. Or rather, start to evaluate the boxes, since a triangle hit could be in any of them. I'm struggling to come up with a fast algorithm here, so not sure if it's a speed-up technique or perhaps the 4-way result merely reflects how AMD has structured the boxes...

I can't find anything that describes what happens when there are fewer than 4 box results. If you start the query one level up from leaf nodes, then you might get fewer than 4 results?
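To make the "4 pointers in intersection time sorted order" idea concrete, here is a rough software sketch of a 4-wide box query using the standard slab test. Everything about the data layout is an assumption: the `(pointer, box_min, box_max)` tuples, the `INVALID` sentinel used to pad misses, and the function names are invented for illustration, and the documentation quoted above does not say what the hardware returns when fewer than four boxes are hit.

```python
import math

INVALID = 0xFFFFFFFF  # assumed "no hit" sentinel, not taken from the ISA document


def inverse_direction(direction):
    """Per-component reciprocal, mapping zero components to +infinity."""
    return tuple(math.inf if d == 0 else 1.0 / d for d in direction)


def ray_box_entry(origin, inv_dir, box_min, box_max, t_max=math.inf):
    """Slab test: return the entry distance along the ray, or None on a miss."""
    t_near, t_far = 0.0, t_max
    for o, inv, lo, hi in zip(origin, inv_dir, box_min, box_max):
        t0, t1 = (lo - o) * inv, (hi - o) * inv
        if t0 > t1:
            t0, t1 = t1, t0
        t_near = max(t_near, t0)
        t_far = min(t_far, t1)
    return t_near if t_near <= t_far else None


def intersect_box_node(origin, direction, children, sort=True):
    """Test up to four child boxes and return exactly four pointers.

    With sort=True the hit children are ordered nearest first;
    misses are padded with INVALID (an assumption, see above).
    """
    inv_dir = inverse_direction(direction)
    hits = []
    for pointer, box_min, box_max in children:
        t = ray_box_entry(origin, inv_dir, box_min, box_max)
        if t is not None:
            hits.append((t, pointer))
    if sort:
        hits.sort(key=lambda hit: hit[0])
    pointers = [pointer for _, pointer in hits]
    return pointers + [INVALID] * (4 - len(pointers))
```

With sorting on, the zeroth result is the nearest box the ray enters, which is only a hint about where to descend first: the nearest box can still contain no triangle on the ray's path.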

Triangle return mode either returns a pair of triangle ID and hit status or i/j coordinates. I'm not sure how this relates to DXR's concept of instancing, but I presume the i/j pair is the barycentrics for the triangle. The first pair of floats seem to be "intersection time". I can't find any description of the use that intersection time might be put to.

It seems that if you want both the triangle ID and the barycentrics you have to run the query twice?

I can't find any explanation for the use of the big_page flag (either no override or pages that are >= 64KB in size), but I suppose this could be an optimisation that suits the total size of the BVH.

Box growing amount is an 8-bit description of "number of ULPs to be added during ray-box test, encoded as unsigned integer", so allowing some fuzziness and control of the fuzziness. I've been wondering whether fuzziness is a direct technique to speed-up ray queries, lots of fuzziness at the start of traversal, with reduced fuzziness as the depth increases. I'm not sure how traversal can evaluate depth though, except for a fuzzy depth derived from the count of queries issued for a ray direction. So the overall algorithm would use a single fuzzy ray instead of a group of, say, 32 rays and then increase the ray count with depth and reduced fuzziness.
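As a rough illustration of what "growing by a number of ULPs" means numerically, the sketch below steps float32 values by a whole number of representable values and pushes a box's bounds outwards. Where the hardware actually applies the growth inside the ray-box test is not stated in the text quoted above, so applying it to the box bounds is purely an assumption, as are the function names.

```python
import struct


def f32_bits(x):
    """Bit pattern of x rounded to float32."""
    return struct.unpack("<I", struct.pack("<f", x))[0]


def bits_f32(bits):
    """float32 value for a 32-bit pattern."""
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def step_ulps(x, n):
    """Move float32 value x by n representable steps (n may be negative).

    Infinities and NaNs are not handled; this is only a sketch.
    """
    bits = f32_bits(x)
    # Map the sign-magnitude pattern to an integer that increases with x.
    key = bits if bits < 0x80000000 else 0x80000000 - bits
    key += n
    bits = key if key >= 0 else 0x80000000 - key
    return bits_f32(bits)


def grow_box(box_min, box_max, ulps):
    """Assumed behaviour: push each bound outwards by `ulps` float32 steps."""
    grown_min = tuple(step_ulps(v, -ulps) for v in box_min)
    grown_max = tuple(step_ulps(v, ulps) for v in box_max)
    return grown_min, grown_max
```

An 8-bit field allows 0 to 255 steps. Near 1.0 a float32 step is about 1.2e-7, so even the maximum setting is a very small amount of fuzz relative to the box size, unless the coordinates are large.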

I suppose it would also allow boxes to implicitly overlap each other. So far I've thought of overlapping boxes as being a design decision for the IHV, in terms of a fixed choice for how the hardware works. BVH data structures are way more complex than I first thought (irregular octree), so I'm lost.

I'm intrigued by the idea of a single traversal shader having access to multiple "BVH" textures concurrently. There's just a base address, so seemingly multiple BVHs could be available.

Along the way I've found this page:

DirectX Raytracing (DXR) Functional Spec | DirectX-Specs (microsoft.github.io)

which is quite detailed :) "AABB" is the best way to search the page to understand bounding boxes/volumes. BVH isn't really used.

There's a concept of a degenerate AABB, which is worrying!:

During traversal, degenerate AABBs may still report possible (false positive) intersections and invoke the intersection shader. The shader may check the validity of the hit by, for example, inspecting the bounds.
I've not heard of bounds being inspectable before. There's nothing in the RDNA 2 documentation which would appear to support the idea of inspecting bounds. It sounds to me like wishful thinking...

Overall, pretty interesting, but less conclusive than I was hoping to find.
 
Regarding AABB vs. Triangles:

A normal naive BVH consists of bounding boxes in the nodes and a single triangle in each leaf. If you split the acceleration structure, then you have a "top" (TLAS) structure consisting only of nodes, whose "leaves" contain pointer(s) to further hierarchies at the "bottom" (BLAS), each of which is the naive one from above. You can instance this way, or quickly hide objects, and so on.

Now, as in a naive CPU implementation, you stop iterating when hitting leaves, and I would guess what you mention is the marker for that, because flow control is your responsibility.

If you can create the BVH yourself (and own the traversal), then you can make a directed acyclic graph (instead of a tree), basically this:

  *
 / \
*   *   >> up
 \ /
  *

to achieve instancing without special handling. But to be honest, there are so many possibilities for how you can deal with this, it's all over the place. The performance also changes from scene to scene and view to view, so there is hardly a possibility to even consider making an optimal encoding/hierarchy convention.
 
Well, this looks like there could be quite a few ways one could use this.

I really do not have a good understanding of the specifics in this area, but it seems like a lot of fun tricks could be possible.
Some instance-based interior mapping for buildings: as there seem to be no instance limits, repeated use of the same assets within 'rooms' should be possible.
Cubemap-like small bits of environments with added dynamic objects within, for reflections etc.
 
Excellent stuff!

I had a good look at all of those. The fog in my brain relating to BVH has cleared somewhat.

The Psychopath blog entry has clarified for me that AMD has implemented a BVH4 structure. Earlier I didn't appreciate that it's required to query all four children (or however many are actually present, if less). You could only do less if you knew the bounds before doing the queries.

His article on Light Trees:

https://psychopath.io/post/2020_04_20_light_trees

is interesting to me since it's based upon augmented nodes:

A light tree is a pretty straightforward data structure: it's just a BVH of the lights in the scene, where each node in the BVH acts as an approximation of the lights under it. For it to be an effective approximation, that means we also need to store some extra data at each node.

The simplest kind of light tree would simply store the total energy of all the lights under each node. You could then treat each node as a light with the same size as the spatial bounds of the node, and emitting the same energy as the lights under it.
Would it be possible to augment each DXR BVH node? Would that be a CPU process to query the BVH that was built and then create a texture containing the per-node light energies?
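The "simplest kind of light tree" from the quote can be sketched as a bottom-up sum over a tree of lights. This assumes the tree topology can be read at all, which is exactly what is in question for DXR; the dictionary layout and the `accumulate_energy` name are made up for illustration.

```python
def accumulate_energy(children, leaf_energy, root=0):
    """Total emitted energy under every node of a light tree.

    children:    dict node id -> list of child ids (leaves map to []).
    leaf_energy: dict leaf id -> energy of that light.
    Returns a dict node id -> summed energy of all lights beneath it.
    """
    total = {}

    def visit(node):
        kids = children.get(node, [])
        if kids:
            total[node] = sum(visit(kid) for kid in kids)
        else:
            total[node] = leaf_energy[node]
        return total[node]

    visit(root)
    return total
```

The resulting per-node totals are the kind of data that would have to live in a separate buffer or texture, since the quote describes them as extra data stored at each node.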

He links to techniques such as stochastic light cuts:


which scales very impressively :)

The paper that describes Coherent Large Packet Traversal and Ordered Ray Stream Traversal is interesting again because of the augmented node data that's required for both. This time the data is quite large and complex, the principal idea being to classify child AABBs into relative orientations with respect to each other and assign that to their parent node.

ORST is particularly interesting because it gathers together diffuse rays by the signs of their direction vectors, e.g. all rays that run from far-north-east to close-south-west are evaluated as a packet.
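The grouping step described here, binning rays by the signs of their direction components, is simple to sketch. This only shows the binning into the eight sign octants, not the ORST traversal itself, and the function names are invented for illustration.

```python
from collections import defaultdict


def sign_octant(direction):
    """3-bit key: bit 0 set if x is negative, bit 1 for y, bit 2 for z."""
    dx, dy, dz = direction
    return int(dx < 0) | (int(dy < 0) << 1) | (int(dz < 0) << 2)


def group_by_octant(directions):
    """Map each octant key to the indices of the rays that fall in it."""
    groups = defaultdict(list)
    for index, direction in enumerate(directions):
        groups[sign_octant(direction)].append(index)
    return dict(groups)
```

All rays in one group agree on which side of every axis they head towards, so a front-to-back child order chosen for one of them is valid for the whole group.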

Can't really tell whether any of these algorithms are practical on RDNA 2.
 
Sadly, in RTX/DXR there is no way to associate meta-data with BVH internals (like nodes). :cry:
Maybe you can do some abomination and make triangulated bboxes, then you get meta-data on hit (on leaves you can, vertexid -> indirect lookup), and then you say it's transparent and traversal continues ... cringeworthy, albeit functional.

My assumption is that, if you get to the secret internals (no DXR, bare metal), it might be possible. BVHs are regularly linearized in a buffer, so if you know the position of a node, you essentially have a unique id (distance from beginning) and can associate whatever you want with it, even without having any special meta-data attachment to the node-packet itself.
 
Honestly, I'm wondering what console devs will do with the bare metal.

I'm still struggling to understand how BVH build and update work. I get the impression that it's a split effort, with some kind of "pre-process" on the CPU which then sends the data to the GPU where more work is done to finalise it. I need to spend more time on the functional spec... The way that animation interacts with the BVH is a concern. e.g. a low-flying plane that flies rapidly through what were "empty" volumes which simply aren't present in the BVH before the plane arrived.

I'm assuming that the BVH is linear, as you describe. Since a custom traversal shader can read memory that isn't the BVH, a node augmentation buffer can be created, keyed upon node ID, as you describe.

So the problem then is to augment all of the nodes, which implies reading all of them and extracting their AABBs. On console this really shouldn't be an opaque data structure, at least on PS5.
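A sketch of the "node augmentation buffer keyed upon node ID" idea, assuming a linearized BVH with fixed-stride nodes. The 64-byte stride is a placeholder chosen for illustration (real node sizes may differ per node type), and `NodeSideTable` is a hypothetical name.

```python
NODE_STRIDE = 64  # assumed node size in bytes, for illustration only


def node_id(byte_offset, stride=NODE_STRIDE):
    """Derive a unique id from a node's distance from the start of the BVH."""
    if byte_offset % stride:
        raise ValueError("offset is not aligned to a node boundary")
    return byte_offset // stride


class NodeSideTable:
    """Per-node metadata stored outside the BVH, indexed by node id."""

    def __init__(self, bvh_size_bytes, default=None, stride=NODE_STRIDE):
        self.stride = stride
        self.slots = [default] * (bvh_size_bytes // stride)

    def set(self, byte_offset, value):
        self.slots[node_id(byte_offset, self.stride)] = value

    def get(self, byte_offset):
        return self.slots[node_id(byte_offset, self.stride)]
```

A traversal shader would do the equivalent lookup in a second buffer each time it visits a node, which costs an extra memory read per node but needs no change to the node format itself.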

One thing I've realised is that the settings in the Texture Descriptor (image resource), such as box growing amount, are fixed when the descriptor is created. So you can't vary the box growing amount as traversal progresses, unless several descriptors are associated with a single BVH and traversal issues queries against them separately as needed.
 
I'm not sure if this has been raised before but AMD have now exposed a DXR extension for their AGS 6.0 SDK ...

The ray tracing hit token can be either used for bypassing the traversal of the acceleration structure during ray intersections or be used to do hit group sorting to reduce the shading divergence ...

The new structs include AmdExtRtHitToken, AmdExtRtHitTokenIn, and AmdExtRtHitTokenOut. New intrinsics are AmdGetLastHitToken, AmdSetHitToken, and AmdTraceRay ...
 
It's interesting how that stuff will work with D3D and appears to provide some "close to the metal" functionality. I wasn't really expecting that, but then I've never looked at AGS before.

Lots of reverse engineering still required it seems, since there's no real documentation.
 
Instruction set is out, so i tried to demystify RT. To me it seems:
No traversal hardware. Intersection instructions work on BVH which is stored as 1D textures.
Bounds intersection takes 4 boxes and can return their order from hit distances.

That's all. And so i can conclude the following:
No BVH at all, meaning we can implement whatever data structure we need. There is not even a true constraint to use 4 children per node.
Addressing the 'BVH texture' only means to address bbox coords. We can store pointers there as well, but HW only cares about bbox coords.

In short: Total flexibility, and all my dreams became true. Current game benchmarks are biased because flexibility is not used. Discussion about 'Compute vs. FF HWRT' is not over, and has just started.
... somehow annoying, haha, and too good to be true. :)

Let me know if you think i got something wrong.

Please expose those 4 instructions and give some more specs, AMD! :D
 
I told you, very flexible on consoles.
 
Instruction set is out, ...

In short: Total flexibility, and all my dreams became true. Current game benchmarks are biased because flexibility is not used. ...

But what about performance? You need a nice balance between speed and flexibility. Being flexible but with very low performance, or, kind of worse, having no way to tap into this flexibility (hello DXR?), is not a good thing...
 
Arguably a higher degree of flexibility in what you can have or do with the acceleration structure may result in higher performance which may be enough to hit the same ballpark performance as that of an overall faster h/w solution with less flexibility and thus a less optimal acceleration structure.

The problem is the quantity though. BVH isn't something dreamed up by NV or chosen by them via tossing a coin. It's the most generalized approach to RT AS which have been proposed so far. While it's totally possible that some other approach would net better results in some specific cases I'd wager that the majority of them will still use the same BVH. And even in cases where a more lean AS would help so much as to provide sizeable performance increases it's not at all a given that a faster RT h/w wouldn't get there just as well with the same BVH.

So demos, yeah, they are nice. If there's at least a chance of such outcome someone should probably do a demonstration of that.
 
To me it seems:
No traversal hardware. Intersection instructions work on BVH which is stored as 1D textures.
Bounds intersection takes 4 boxes and can return their order from hit distances.
Total flexibility, and all my dreams became true.
Let me know if you think i got something wrong.
If you recall our discussions on the AMD raytracing patent, it specifically claims that the arrangement of raytracing blocks can be flexible - while fixed function hardware in TMUs would be best suited for BVH traversal due to massive memory bandwidth available, universal shader processors can also contain instructions to help test custom BVH structures, and it's up to the driver implementation to decide which unit to use for walking each specific BVH tree.

https://forum.beyond3d.com/posts/2084929/
https://forum.beyond3d.com/posts/2084455/

Therefore new shader instructions for BVH traversal do not rule out BVH traversal by fixed-function hardware implemented in TMUs, unless this was specifically clarified by AMD.
 
Arguably a higher degree of flexibility in what you can have or do with the acceleration structure may result in higher performance which may be enough to hit the same ballpark performance as that of an overall faster h/w solution with less flexibility and thus a less optimal acceleration structure.

The problem is the quantity though. BVH isn't something dreamed up by NV or chosen by them via tossing a coin. It's the most generalized approach to RT AS which have been proposed so far. While it's totally possible that some other approach would net better results in some specific cases I'd wager that the majority of them will still use the same BVH. And even in cases where a more lean AS would help so much as to provide sizeable performance increases it's not at all a given that a faster RT h/w wouldn't get there just as well with the same BVH.

So demos, yeah, they are nice. If there's at least a chance of such outcome someone should probably do a demonstration of that.

Flexibility helps you work smarter by doing less work for the same result or to do things that are simply impossible due to constraints of hardware solutions. I suspect there isn’t much room to do the former as DXR seems to provide decent enough support for skipping work where it’s not needed.

The real benefits would be in doing even more advanced RT with more complex data structures and more detailed geometry. But that would be dog slow on today’s compute hardware anyway so it’s a moot point.
 
The Psychopath blog entry has clarified for me that AMD has implemented a BVH4 structure. Earlier I didn't appreciate that it's required to query all four children (or however many are actually present, if less). You could only do less if you knew the bounds before doing the queries.
The assumption of AMD using BVH4 is there since the TMU patent, but we did not know if the hardware relies on that data structure or not. They mentioned both options would be possible, but no decision was made yet.
Now, after isa docs not mentioning such dependencies (although they use the term BVH all the time, hmmm) it really seems the data structure is software, and only boxes and triangles have a specified format.
A little doubt remains, since isa is about instructions not data, but... there's no way to describe instructions without mentioning their arguments, so i'm pretty sure.
So we could use octree over BVH on their hardware, and their BVH4 processing is probably done with CS similar to Radeon Rays.

I'm still struggling to understand how BVH build and update work. I get the impression that it's a split effort, with some kind of "pre-process" on the CPU which then sends the data to the GPU where more work is done to finalise it.
You mentioned having more octree experience than BVH, so i guess that causes the confusion.
Octrees (even loose) are tied to a global regular grid. Thus the tree needs a complete rebuild as objects pass cells.
BVH is not tied to a grid. This gives the option to only refit the existing tree by adjusting the bounding volumes to new positions. It's more flexible - we can build once offline and only refit at runtime with animation.
And we can rebuild some parts of the tree, and refit others. In practice you end up precomputing a branch of tree per object / character / model. While the character moves around, limbs stay connected, so the quality of the tree remains ok even if we never rebuild it. Which then leads to the BLAS (build high quality on CPU once, refit on GPU per frame) and TLAS (build low quality on GPU per frame) approach as seen in DXR. (I just assume they use exactly this approach.)
We can conclude BVH is probably better for animated scenes because maintaining it is faster than rebuilding an octree from scratch each frame.
Ofc. there are downsides. E.g. bounds overlap, so you need to traverse multiple children (same for loose octree). And also: Unlike octree you can not make assumptions on spatial order of children, so you may want to test all child boxes and sort by distance to descend the closest first.
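A minimal sketch of the refit described above: topology stays fixed, leaf bounds are replaced with the animated positions, and every inner node is recomputed as the union of its children. It assumes a linearized tree in which children always have larger indices than their parent, so one reverse pass suffices; the layout and names are illustrative only.

```python
def union(box_a, box_b):
    """Smallest AABB containing both boxes; a box is (min_xyz, max_xyz)."""
    (a_min, a_max), (b_min, b_max) = box_a, box_b
    return (tuple(map(min, a_min, b_min)), tuple(map(max, a_max, b_max)))


def refit(children, bounds):
    """Recompute inner-node bounds in place after the leaves have moved.

    children[i]: list of child indices of node i (empty for a leaf),
                 every child index is greater than i.
    bounds[i]:   AABB of node i; leaf entries must already be up to date.
    """
    for node in range(len(children) - 1, -1, -1):
        kids = children[node]
        if kids:
            box = bounds[kids[0]]
            for kid in kids[1:]:
                box = union(box, bounds[kid])
            bounds[node] = box
    return bounds
```

Note that nothing here repairs tree quality: if leaves that share a parent drift far apart, the parent's box simply grows, which is the overlap downside mentioned above.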

Would it be possible to augment each DXR BVH node?
He links to techniques such as stochastic light cuts:
which scales very impressively :)

Possible? Can we somehow hack and misuse blackboxed data structures so we can squeeze in and out some more usefulness?
Or wouldn't it be easier to ditch such hardware data structures completely, and implement what we need without compromise in software, giving us all options?
What if it suddenly turns out sparse grids would be more suited for realtime RT than BVH, or whatever?

IDK, but i hope RDNA will allow us to explore such questions in practice... :)
 
The assumption of AMD using BVH4 is there since the TMU patent, but we did not know if the hardware relies on that data structure or not. They mentioned both options would be possible, but no decision was made yet.
I disagree. The explicit return of 4 results per query is specified, and since BVH traversal requires all child nodes to be queried simultaneously and since there is no way for the hardware to request query results for a subset of BVH nodes, it's not possible with the documented instructions to get results for more than 4 children in a parent node.

Clearly, if you wanted to run with two parallel BVHs to get more than 4 children per node, you could, but that would be pretty strange since 4 children is the optimal "power of two" child count for a BVH: you get the smallest count of traversal steps on average. There's a nice graph in the blog post that was linked earlier that shows that 3 is the optimal count of children.
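One simple cost model that gives this result, offered as a back-of-envelope sketch and not necessarily the model behind the blog's graph: assume a balanced tree with branching factor b over N leaves, and assume a ray tests all b children at each of the log_b(N) levels it descends. The box tests per ray are then b * ln(N) / ln(b), which is minimised at b = e, so 3 is the best integer while 2 and 4 tie.

```python
import math


def box_tests_per_ray(branching, leaf_count):
    """Box tests for one root-to-leaf descent in a balanced tree.

    Assumes every child is tested at each visited node and that the ray
    follows a single path, which real traversals only approximate.
    """
    depth = math.log(leaf_count) / math.log(branching)
    return branching * depth


def node_visits_per_ray(branching, leaf_count):
    """Traversal steps (one query per visited node) for the same descent."""
    return math.log(leaf_count) / math.log(branching)
```

Under these assumptions BVH4 performs the same number of box tests as BVH2 but needs half as many traversal steps, which fits the point about step count if each step is a single 4-wide query.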

If you wanted to use BVH2, I suppose you could, but you'd be throwing away half of the AABB intersection throughput, since it has been explicitly designed to produce four results per parent node.

Now, after isa docs not mentioning such dependencies (although they use the term BVH all the time, hmmm) it really seems the data structure is software, and only boxes and triangles have a specified format.
A little doubt remains, since isa is about instructions not data, but... there's no way to describe instructions without mentioning their arguments, so i'm pretty sure.
I don't understand what you're saying here, since the arguments are specified. The resource descriptor tells the fixed function hardware some other facts about how to process the query, such as whether to sort AABB results.

So we could use octree over BVH on their hardware, and their BVH4 processing is probably done with CS similar to Radeon Rays.
You could build a quadtree.

I don't know what you mean by BVH4 processing. AMD's been quite explicit that a shader that issues ray queries will make use of shared memory (LDS) when handling BVH traversal. So that could be a pixel shader or a compute shader etc. (if not using DXR 1.0's ray cast shader). The shader takes the results and decides whether to continue with traversal (e.g. ray length limit reached) or query each of the 4 children (or less, in theory, when they are leaves?).
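The shader-side loop described here can be sketched as a stack-based traversal around an opaque query. The `query` callback stands in for the intersect instruction and its return format is invented for this example; the Python list plays the role of the stack that would be kept in LDS, and `INVALID` is an assumed padding value for missed children.

```python
INVALID = 0xFFFFFFFF  # assumed padding value for children that were missed


def trace(root, query, t_max=float("inf")):
    """Closest-hit traversal driven by per-node queries.

    query(node, t_limit) must return either
      ("box", [four child pointers, nearest first, INVALID for misses]) or
      ("tri", hit distance or None).
    Returns (triangle node, distance), or (None, t_max) if nothing was hit.
    """
    stack = [root]  # stands in for a per-ray stack held in LDS
    best_t, best_node = t_max, None
    while stack:
        node = stack.pop()
        kind, result = query(node, best_t)
        if kind == "box":
            # Push far-to-near so the nearest child is popped first.
            for child in reversed(result):
                if child != INVALID:
                    stack.append(child)
        elif result is not None and result < best_t:
            best_t, best_node = result, node
    return best_node, best_t
```

Passing the current best distance into the query lets a real implementation shorten the ray after each hit, so boxes that start beyond the closest triangle found so far are rejected.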

I don't understand how triangles work though. It's almost as if there can only be one triangle per leaf AABB, because if there were multiple transparent triangles inside an AABB a ray could hit an arbitrary subset of them and I can't discern how a query would get past the first one. Unless the shader computes a new ray origin which is a delta-sized step along the ray direction past the intersection point.

In truth it would appear that there are tens if not hundreds of compute cycles available per query: we should now talk about the ALU:query ratio as we once did with ALU:TEX. So computing delta steps along the ray direction isn't so terrible and in theory after the first hit, the remaining triangles will come in clumps from memory for moderately fast delta-queries.
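The delta-step idea can be sketched as a loop that re-issues a closest-hit query from just past the previous intersection. The `closest_hit` callback and the toy `planes_at` scene are hypothetical stand-ins for real ray queries, and the epsilon value is arbitrary.

```python
def hits_along_ray(closest_hit, origin, direction, epsilon=1e-4, max_hits=64):
    """Collect successive hit distances by restarting past each hit.

    closest_hit(origin, direction) returns the nearest hit distance
    in front of the origin, or None when nothing more is hit.
    """
    distances = []
    travelled = 0.0
    while len(distances) < max_hits:
        t = closest_hit(origin, direction)
        if t is None:
            break
        distances.append(travelled + t)
        step = t + epsilon  # move just beyond the surface that was hit
        origin = tuple(o + d * step for o, d in zip(origin, direction))
        travelled += step
    return distances


def planes_at(xs):
    """Toy scene: planes perpendicular to the x axis at the given positions."""
    def closest_hit(origin, direction):
        ts = [(x - origin[0]) / direction[0] for x in xs]
        ts = [t for t in ts if t > 0]
        return min(ts) if ts else None
    return closest_hit
```

For a ray along +x through `planes_at([4.0, 1.0, 2.0])`, the loop reports hits near 1, 2 and 4 in that order. Two surfaces closer together than epsilon would be merged into one hit, which is the obvious weakness of this approach.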

You mentioned having more octree experience than BVH, so i guess that causes the confusion.
I wouldn't call it experience. Simply that the concept of an octree was getting in the way of my thinking about BVHs.

Octrees (even loose) are tied to a global regular grid. Thus the tree needs a complete rebuild as objects pass cells.
BVH is not tied to a grid. This gives the option to only refit the existing tree by adjusting the bounding volumes to new positions. It's more flexible - we can build once offline and only refit at runtime with animation.
A refit still has to process all descendants of the top-most node affected by the refit. It doesn't sound trivial to me. How deep are these BVH trees in games? 10? 15?

We sort of have a datapoint for BVH animation cost: in Cyberpunk 2077 the character that you play as is never reflected in ray traced reflections specifically because "it's too expensive". Sounds like something that should have gone into the "Psycho" setting, but that's for another thread.

I don't know if the game shows the vehicle being driven by the player in ray traced reflections. That's theoretically a much lower-cost BVH animation update...

And we can rebuild some parts of the tree, and refit others. In practice you end up precomputing a branch of tree per object / character / model. While the character moves around, limbs stay connected, so the quality of the tree remains ok even if we never rebuild it. Which then leads to the BLAS (build high quality on CPU once, refit on GPU per frame) and TLAS (build low quality on GPU per frame) approach as seen in DXR. (I just assume they use exactly this approach.)
We can conclude BVH is probably better for animated scenes because maintaining it is faster than rebuilding an octree from scratch each frame.
Ofc. there are downsides. E.g. bounds overlap, so you need to traverse multiple children (same for loose octree). And also: Unlike octree you can not make assumptions on spatial order of children, so you may want to test all child boxes and sort by distance to descend the closest first.
Yes it seems that animation highly favours BVH, though BVH presumably slows down progressively as updates for animation are repeatedly applied.

Possible? Can we somehow hack and misuse blackboxed data structures so we can squeeze in and out some more usefulness?
Or wouldn't it be easier to ditch such hardware data structures completely, and implement what we need without compromise in software, giving us all options?
What if it suddenly turns out sparse grids would be more suited for realtime RT than BVH, or whatever?

IDK, but i hope RDNA will allow us to explore such questions in practice... :)
DXR's BVH is opaque supposedly because it enables the graphics card companies to "do their own thing".

What will Intel's Xe do?
 
Therefore new shader instructions for BVH traversal do not rule out BVH traversal by fixed-function hardware implemented in TMUs, unless this was specifically clarified by AMD.
But if there were traversal units, there should be still some mentioned instruction to start them up? ...still, can't be sure, yes.

The problem is the quantity though. BVH isn't something dreamed up by NV or chosen by them via tossing a coin. It's the most generalized approach to RT AS which have been proposed so far. While it's totally possible that some other approach would net better results in some specific cases I'd wager that the majority of them will still use the same BVH. And even in cases where a more lean AS would help so much as to provide sizeable performance increases it's not at all a given that a faster RT h/w wouldn't get there just as well with the same BVH.

This holds if your goal is classical RT with static and skinned models and finite scenes. But if we add big worlds and lod to the mix it breaks down:
BVH (or any other tree - really does not matter which) is a hierarchical structure, so it is well suited to model LOD by turning leaves into internal nodes and vice versa, adding and removing geometry, changing its resolution and refitting bounds or making sure they bound all LODs, etc.
And with DXR we can't do any of this (which does not mean NV has no way to do it technically.)
As a result, using discrete LODs is the only option, but currently this means building an independent BVH for each level of a model, so LOD likely gives no speed up in practice and is not worth it.

Now we could say that current games do not really use LOD. Levels are still designed to limit draw distance. Cyberpunk feels indoors even if i'm outdoors.
The more details we add, the less acceptable those artificial restrictions appear. We have to address this finally and we could, but now DXR makes this almost impossible and does not help.
So no matter how many decades of expertise and state of the art DXR represents, it does not suffice for our goals, and it does not yet work. It seemingly has different and meanwhile outdated motivation.

I do not really care how we solve this - fixed function or programmable. Both are possible, and i'm fine with either.
 
So no matter how many decades of expertise and state of the art DXR represents, it does not suffice for our goals, and it does not yet work. It seemingly has different and meanwhile outdated motivation.
My theory is that NVidia made hardware accelerated ray tracing for the professional market and then went to Microsoft and said "hey this is realtime". That's why you have to run at 540p to 720p to get playable results when all the ray tracing effects are turned on, in games that do more than one ray traced effect simultaneously.
 
Did not know this. It likely means i only need to wait and see what they come up with... :D

Everything is there to be as flexible as possible; more things just need to be opened up on the DXR side. And I suppose this is the same on the Intel side. Give flexibility to developers and they will do something with it if they have enough power. :D

We are beginning to see devs using software rasterization where the fixed-function hardware is inefficient.
 