RDNA 2 Ray Tracing

Now that I know even RT cores are programmable, I see my previous view of 'flexible AMD' vs. 'fast but limited NV' is pointless - sorry for that.
Still, the problem remains much the same:
It's difficult for both NV and AMD to expose flexibility to devs, because neither can guarantee such things will still work on future HW.
Proposal: experimental extensions for research purposes. They could be used for games too, but then future HW requires patching those games.
Currently I guess only the big fish like Epic collaborate with HW vendors, discussing options and doing some tests. The results will lead to new standards and better APIs.
That's fine, but because the LOD problem has no standard solution, making such extensions public would make sense to spur more research and explore more options.

However, the situation seems much better than I had thought all along. Chances we get traversal shaders with Turing support are pretty good, I guess.
 
I disagree. The explicit return of 4 results per query is specified, and since BVH traversal requires all child nodes to be queried simultaneously and since there is no way for the hardware to request query results for a subset of BVH nodes, it's not possible with the documented instructions to get results for more than 4 children in a parent node.

Clearly, if you wanted to run with two parallel BVHs to get more than 4 children per node, you could, but that would be pretty strange since 4 children is the optimal "power of two" child count for a BVH: you get the smallest count of traversal steps on average. There's a nice graph in the blog post that was linked earlier that shows that 3 is the optimal count of children.

If you wanted to use BVH2, I suppose you could, but you'd be throwing away half of the AABB intersection throughput, since it has been explicitly designed to produce four results per parent node.
Not sure what you disagree about, but I meant the TMU patent gave two options: either implement the traversal loop in FF, or leave it to the shader cores. They did not say which they would use, but it was already clear back then that their intersectors operate on 4 boxes.

With BVH there is no strict need to query all children simultaneously, nor do we have to process them in closest-first order. Stackless traversal algorithms usually do neither. (Just to mention - the norm seems to be ordered traversal using some short stack.)
A research result of 3 being the optimal branching factor can't be general. It depends on the algorithm, quality of the BVH, the scene, etc. The only general argument I could come up with is this geometric one: 'If we want to subdivide space so children have half the size of their parents, in 3D space we get an ideal number of 8 children, like an octree', which is a very weak argument as well.

Agreed that BVH2 is not a good idea. But I think we could implement octrees or sparse grids by calling the intersection multiple times and refining the order manually if necessary. That's not really what I'm personally interested in, but hey - I could raytrace my fluid simulator, which uses a sparse grid.
What I am interested in is reusing my GI BVH for HW RT. I use 8 children at most, but at the surface it looks like a quadtree of surfels and has mostly 4 children. I guess it would just work.
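And if the hardware node format really is fixed at 4 children, the straightforward fallback would be to split my 8-wide nodes into two 4-wide nodes when exporting them. Just a sketch of that idea - the HwNode/SrcNode layouts and names are my assumptions for illustration, not the actual encoding:

Code:
// Assumed 4-wide hardware-style node - layout is a guess for illustration.
struct HwNode
{
    float3 boxMin[4];
    float3 boxMax[4];
    int    child[4];      // -1 marks an unused slot
};

// My 8-wide source node, boxes and children already computed.
struct SrcNode
{
    float3 boxMin[8];
    float3 boxMax[8];
    int    child[8];
    int    childCount;    // 1..8
};

// Split one 8-wide node into one or two 4-wide nodes; returns how many.
int Export8As4(SrcNode s, out HwNode a, out HwNode b)
{
    a = (HwNode)0;
    b = (HwNode)0;
    for (int i = 0; i < 4; i++)
    {
        a.child[i] = -1;
        b.child[i] = -1;
    }
    for (int k = 0; k < s.childCount; k++)
    {
        if (k < 4)
        {
            a.boxMin[k] = s.boxMin[k];
            a.boxMax[k] = s.boxMax[k];
            a.child[k]  = s.child[k];
        }
        else
        {
            b.boxMin[k - 4] = s.boxMin[k];
            b.boxMax[k - 4] = s.boxMax[k];
            b.child[k - 4]  = s.child[k];
        }
    }
    return (s.childCount > 4) ? 2 : 1;
}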

I don't understand what you're saying here, since the arguments are specified. The resource descriptor tells the fixed function hardware some other facts about how to process the query, such as whether to sort AABB results.
What I mean is simply... writing some code:

struct Node
{
    AABox bbox;
    int childIndex;
    int childCount;

    int parentIndex;       // optional, e.g. for stackless traversal
    int adjacencyIndices;  // optional
};

The intersection instructions take only pointers to bounding boxes, but they do not make assumptions about anything else. So it's completely up to us what acceleration structure we use, and how we traverse it.
All my talk here is about the idea of not using DXR at all, but implementing the data structure and traversal in compute. This makes sense because AMD lacks fixed-function traversal hardware. So we should get the same or better performance than their generic DXR implementation, with extra features like LOD.
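To make that concrete, a minimal sketch of such a compute traversal, with a plain slab test standing in for the hardware box-intersection instruction (the Node layout and every name here are my own assumptions, not AMD's format):

Code:
// Assumed node layout - children stored consecutively; sketch only.
struct Node
{
    float3 boxMin, boxMax;   // AABox bbox
    int    childIndex;       // first child; siblings follow consecutively
    int    childCount;       // 0 = leaf
    int    primStart;        // primitives if leaf (surfels, triangles, ...)
    int    primCount;
};

StructuredBuffer<Node> nodes;   // built and streamed by our own compute code

// Scalar slab test; in a real implementation groups of child boxes would
// go through the hardware box-intersection instruction instead.
bool RayVsBox(float3 o, float3 invD, float tMax, float3 bmin, float3 bmax)
{
    float3 t0 = (bmin - o) * invD;
    float3 t1 = (bmax - o) * invD;
    float3 tn = min(t0, t1);
    float3 tf = max(t0, t1);
    float tNear = max(max(tn.x, tn.y), tn.z);
    float tFar  = min(min(tf.x, tf.y), min(tf.z, tMax));
    return tNear <= tFar && tFar >= 0.0;
}

void Traverse(float3 origin, float3 dir, float tMax)
{
    float3 invD = 1.0 / dir;
    int stack[32];               // short stack; no overflow handling here
    int sp = 0;
    stack[sp++] = 0;             // root

    while (sp > 0)
    {
        Node n = nodes[stack[--sp]];
        if (!RayVsBox(origin, invD, tMax, n.boxMin, n.boxMax))
            continue;
        if (n.childCount == 0)
        {
            // intersect n.primCount primitives starting at n.primStart,
            // or apply a LOD rule and stop right at this node
            continue;
        }
        for (int i = 0; i < n.childCount; i++)
            stack[sp++] = n.childIndex + i;   // unordered, depth-first
    }
}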

I did not pay attention to triangles, but only one triangle per node sounds like the BVH will take a lot of memory. I assume they may iterate over a list of triangles serially, but no idea.


A refit still has to process all descendants of the top-most node affected by the refit. It doesn't sound trivial to me. How deep are these BVH trees in games? 10? 15?
Parent-child dependencies on refit can be avoided by enlarging boxes so they still bound their children under animation. That's useful at least for the lowest levels - we only need to transform the boxes with the geometry.
To further reduce dependencies, we can at least process level x of all the BLAS we have, then sync and continue with level x+1. So, ideally, we get one dispatch per tree level.
Alternatively, NV often does research on doing it all in one go at the cost of doing more work, like searching and sorting. I'm not up to date with this. Assuming modern NV is good with async compute, the number of dispatches should no longer be a big problem if other work keeps the GPU saturated.
What I personally do as well is build multiple top levels in a single wide workgroup, which moves the barriers from the API level into the compute program. That's quite a win for me.
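For illustration, a rough sketch of the one-dispatch-per-level refit idea, assuming the host has bucketed node ids by depth and issues the dispatches from the deepest internal level up to the root (the buffers, the Node layout and all names are just assumptions):

Code:
struct Node { float3 boxMin, boxMax; int childIndex, childCount; };

cbuffer RefitParams { uint levelNodeCount; };   // nodes in the current level

StructuredBuffer<uint>   levelNodeIds;   // node ids of this level, bucketed on host
RWStructuredBuffer<Node> nodes;

// One thread refits one node of the current level. Children were finished
// in the previous dispatch, so there are no intra-dispatch dependencies.
[numthreads(64, 1, 1)]
void RefitLevel(uint tid : SV_DispatchThreadID)
{
    if (tid >= levelNodeCount)
        return;

    Node n = nodes[levelNodeIds[tid]];
    if (n.childCount == 0)
        return;                          // leaves get their boxes from geometry

    float3 bmin = float3( 1e30,  1e30,  1e30);
    float3 bmax = float3(-1e30, -1e30, -1e30);
    for (int i = 0; i < n.childCount; i++)
    {
        Node c = nodes[n.childIndex + i];
        bmin = min(bmin, c.boxMin);
        bmax = max(bmax, c.boxMax);
    }
    n.boxMin = bmin;
    n.boxMax = bmax;
    nodes[levelNodeIds[tid]] = n;
}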

How deep the tree has to be depends mostly on the branching factor. A binary tree needs too many levels.
With my BVH8 I need about 20 levels for an open world, but my surfels are 10 cm, so larger than game triangles.
For fluid sim I use a sparse grid (to me that's just an octree but with 4^3 instead of 2^3 children), and IIRC I use only two internal levels for grid sizes of 1024^3. How aggressively a larger branching factor can reduce tree levels is very interesting and not so obvious at first.
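That aggressiveness is just the logarithm at work: depth is roughly log_B(N) = log2(N) / log2(B), so for one million leaves a binary tree needs about 20 levels, BVH4 about 10 and BVH8 about 7. A one-liner to play with (pure illustration):

Code:
// depth ~= log_B(N).  N = 1,000,000:  B=2 -> 20,  B=4 -> 10,  B=8 -> 7.
uint TreeDepth(float leafCount, float branching)
{
    return (uint)ceil(log2(leafCount) / log2(branching));
}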
 
But if there were traversal units, shouldn't there still be some instruction mentioned to start them up? ...Still, can't be sure, yes.
No, it's the other way round. In the raytracing pipeline, generated rays are submitted to the traversal units, which walk each ray through the BVH tree and start specific raytracing shaders (intersection, any-hit, closest-hit) when they find ray-triangle intersection(s) or a ray miss.
 
I believe you could also have multiple BVH structures and decide, when casting a ray, which BVH to use.
 
How about inline tracing? If there were traversal units, there would be at least one traversal instruction for that.
Inline raytracing makes no difference. The whole raytracing pipeline is data-driven - you shoot rays from screen space using the ray generation shader, and its output implicitly flows into traversal/intersection units (no matter whether fixed-function or shader-based), which walk the BVH tree to find ray-triangle intersections and spawn 'hit' or 'miss' shaders, which modify each ray's attributes to define the final color of your pixel.

Inlining combines these separate stages into one 'uber-shader', but it's just an abstraction to simplify programming of low-res effects that do not require full performance. You don't really need flow instructions which interrupt your ray shader program and start BVH traversal - scheduling multiple work units and controlling fixed-function processing is the graphics driver's job (which may be aided by a hardware scheduler to reduce driver overhead).
 
Hmm, isn't the specialized TraceRayInline control flow pretty simplified?
Microsoft said:
The image below depicts how a particular choice of template flags used with the initial declaration of RayQuery can prune down the full flow graph (depicted above). There is no shader participation needed in the search for intersections here. Further, the search is configured to end upon the first hit. Simplifications like this can free the system to generate more performant inline raytracing code.

 
The whole raytracing pipeline is data-driven
I'm still not convinced. It may look so from the API side, but if we use inline tracing from compute, there should be some related instruction.
I can imagine it would also work to put ray parameters into some memory / registers, then call a generic sync instruction to signal that work is waiting. But then I would expect at least some related RT tokens to be mentioned for that instruction.

I believe you could also have multiple BVH structures and decide, when casting a ray, which BVH to use.
Which would give us stochastic LOD. Overcoming DXR instancing limitations would be another application.
 
I'm still not convinced. It may look so from the API side, but if we use inline tracing from compute, there should be some related instruction.
I can imagine it would also work to put ray parameters into some memory / registers, then call a generic sync instruction to signal that work is waiting. But then I would expect at least some related RT tokens to be mentioned for that instruction.


Which would give us stochastic LOD. Overcoming DXR instancing limitations would be another application.

Shouldn't this work pretty well?
float4 MyPixelShader(float2 uv : TEXCOORD) : SV_Target0
{
    ...

    // Instantiate ray query object.
    // Template parameter allows driver to generate a specialized
    // implementation.
    RayQuery<RAY_FLAG_CULL_NON_OPAQUE |
             RAY_FLAG_SKIP_PROCEDURAL_PRIMITIVES |
             RAY_FLAG_ACCEPT_FIRST_HIT_AND_END_SEARCH> q;

    // Set up a trace. No work is done yet.
    q.TraceRayInline(
        myAccelerationStructure,
        myRayFlags, // OR'd with flags above
        myInstanceMask,
        myRay);
    ...
    ...
https://devblogs.microsoft.com/directx/dxr-1-1/
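For completeness, the part elided after the TraceRayInline call typically continues by running the query and reading back the committed hit - roughly like this (condensed from the same DXR 1.1 post):

Code:
// Proceed() is where the traversal actually runs. With the flags above
// (opaque only, no procedural primitives, accept first hit) there is no
// candidate processing, so a single call is enough.
q.Proceed();

if (q.CommittedStatus() == COMMITTED_TRIANGLE_HIT)
{
    // shade using e.g. q.CommittedRayT(), q.CommittedInstanceIndex(),
    // q.CommittedPrimitiveIndex(), q.CommittedTriangleBarycentrics()
}
else
{
    // miss
}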
 
I'm still not convinced. It may look so from the API side, but if we use inline tracing from compute, there should be some related instruction.
I can imagine it would also work to put ray parameters into some memory / registers, then call a generic sync instruction to signal that work is waiting. But then I would expect at least some related RT tokens to be mentioned for that instruction.

The instruction(s) take a node-pointer and some parameters. While you potentially could still provide a node in situ, you would have to export it to memory, as these are pure memory instructions.
It's a similar case to trying to think of providing a texture content in the shader: It seems possible, but you have to spill, as the fetch/sample instruction is a pure memory instruction.
The bit-pattern of the node/leaf itself is not specified. The instruction, though, returns actual new node-pointers, which only need to be passed to the next iteration (in any distinct order you like), except if you hit a leaf, then it's whatever is configured in the T# uop field.

Edit: BTW it seems easy enough to peek into the DXR buffers and get an understanding of what precisely is up ... it's a public buffer, so suit yourself. :)
 
The instruction(s) take a node-pointer and some parameters. While you potentially could still provide a node in situ, you would have to export it to memory, as these are pure memory instructions.
It's a similar case to trying to think of providing a texture content in the shader: It seems possible, but you have to spill, as the fetch/sample instruction is a pure memory instruction.
The bit-pattern of the node/leaf itself is not specified. The instruction, though, returns actual new node-pointers, which only need to be passed to the next iteration (in any distinct order you like), except if you hit a leaf, then it's whatever is configured in the T# uop field.

My goal would be to generate the 1D-BVH textures in video memory myself, not to create temporary nodes during traversal, so this would be fine.
But it reminds me that I often think it would be nice to have texture filter support for pixels in LDS memory.

Edit: BTW it seems easy enough to peek into the DXR buffers and get an understanding of what precisely is up ... it's a public buffer, so suit yourself. :)
I'll do this... if I ever get to the bottom of the todo list, which is where RT support sits. :D
 
The node/leaf patterns could easily have a HW-dependent format and even compression. :)

It certainly is hardware dependent. So far, I think, every new generation of AMD has changed the texture descriptor bit-field.
For sure it's possible to hack the encoding and patch it together yourself, but it's of extremely limited use, even just for fun.

When I worked out the advantages of an embedded system (as the console is) vs. the PC in the other thread, this is it. You'll never be able to be forward compatible (as PC games/drivers need to be) if you tailor it for some specific architecture revision. And the API by definition wants to guarantee absolute compatibility, vertically and laterally. Which means the only thing you can do is the stuff defined in the API, and nothing else.

Normally programming means you can do whatever you want and produce infinite permutations of your idea. In this case you have precisely 2 broadly defined approaches (pipeline and inline), and you can't deviate at all. It happens often enough that a single changed line within a specific setup has major performance implications (I'm not saying this is the case here generally, or for RT specifically, but it comes up regularly in tight-spot optimization), but this possibility simply doesn't exist with DXR.

Obviously Microsoft or Khronos could offer more specific and neat instructions, identifying smart ways of breaking out of the box, but at the cost of generality (forward compatibility means it can't be taken out later, or changed - see CPU SIMD instructions), and if it happens at all it's an extremely long extension process. Take as an example the lane instructions in the Direct3D Shader Models: it's already taking years to get the most simple of common low-level features into the spec, the drivers and ultimately the public.

On the consoles you can be an RT programmer; on the PC with RT as it is, you're a manager, not a programmer. At least it feels like that. :)
 
With BVH there is no strict need to query all children simultaneously, nor do we have to process them in closest-first order. Stackless traversal algorithms usually do neither. (Just to mention - the norm seems to be ordered traversal using some short stack.)
"Simultaneously" is strictly speaking incorrect, agreed. But all children need to be evaluated.

In fixed function hardware, given that the number of children is small (2 or 4 or even 8) and that the evaluation of each sibling is independent of the others, it makes sense to evaluate them all simultaneously.

What I am interested in is reusing my GI BVH for HW RT. I use 8 children at most, but at the surface it looks like a quadtree of surfels and has mostly 4 children. I guess it would just work.

What I mean is simply... writing some code:

struct Node
{
    AABox bbox;
    int childIndex;
    int childCount;

    int parentIndex;       // optional, e.g. for stackless traversal
    int adjacencyIndices;  // optional
};

The intersection instructions take only pointers to bounding boxes, but they do not make assumptions about anything else. So it's completely up to us what acceleration structure we use, and how we traverse it.
All my talk here is about the idea of not using DXR at all, but implementing the data structure and traversal in compute. This makes sense because AMD lacks fixed-function traversal hardware. So we should get the same or better performance than their generic DXR implementation, with extra features like LOD.
I think we can come up with a minimal node definition in AMD's BVH4:

Code:
struct Node {
    AABox     bbox;
    nodeIndex childIndices[4];   // BVH4: up to four children
}

Until we see the generated ISA for a traversal shader, we can't know how a path/stack is used though (e.g. is there a "nodeIndex parentIndex" in the struct?).

We can assume that bbox can be set to all zeroes to indicate a degenerate node (one that can't be queried) for scenarios where there are fewer than four children, or some similar encoding, since a hardware implementation would have zero time overhead while avoiding the storage overhead per child. Though AABox might implement some kind of compression scheme (assuming 6 floats for diagonally opposed cuboid corners), in which case the encoding could be more obscure.
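An alternative cheap coding for unused slots would be an inverted box (min greater than max on every axis), which a slab-style test rejects for any ray. Purely as a sketch of what such an encoding could look like (an assumption, not the documented format):

Code:
// An "impossible" box: min > max on every axis, so tNear > tFar for any
// ray and a slab test rejects it with zero extra logic.
static const float3 EMPTY_MIN = float3( 1e30,  1e30,  1e30);
static const float3 EMPTY_MAX = float3(-1e30, -1e30, -1e30);

struct HwNode4
{
    float3 boxMin[4];
    float3 boxMax[4];
    int    child[4];
};

// Nodes with fewer than four children get the empty box in unused slots,
// so the 4-wide test still runs but those lanes can never report a hit.
void FillUnusedSlots(inout HwNode4 n, int childCount)
{
    for (int i = childCount; i < 4; i++)
    {
        n.boxMin[i] = EMPTY_MIN;
        n.boxMax[i] = EMPTY_MAX;
        n.child[i]  = -1;
    }
}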

Anyway, what I'm trying to do here is separate the minimal conceptual model the hardware could use (because we do not have access to the data structure of the BVH nodes) from algorithmic layers that interact with hardware-accelerated BVH4 queries.

What I personally do as well is build multiple top levels in a single wide workgroup, which moves the barriers from the API level into the compute program. That's quite a win for me.
Presumably if you build a queue of top levels, you can then feed them into running workgroups as some rays expire, a sort of "hybrid persistent compute" traversal model.

To me the interesting idea here is to opportunistically generate rays deep in the BVH, benefitting from cache locality and taking decisions about the degree of ray coherence or what happens in a ray's diffuse bounce.
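A bare-bones sketch of that hybrid persistent compute idea - workgroups stay resident and keep pulling ray jobs from a global queue until it runs dry (the job layout and buffer names are invented for illustration):

Code:
struct RayJob
{
    float3 origin;
    float3 dir;
    uint   startNode;   // subtree entry point, e.g. a BLAS top level
};

StructuredBuffer<RayJob> jobs;         // filled by an earlier pass
RWStructuredBuffer<uint> jobCounter;   // [0] = next job to take
cbuffer QueueParams { uint jobCount; };

[numthreads(64, 1, 1)]
void PersistentTraversal()
{
    while (true)
    {
        uint jobIndex;
        InterlockedAdd(jobCounter[0], 1, jobIndex);   // grab the next job
        if (jobIndex >= jobCount)
            return;                                   // queue empty, retire

        RayJob job = jobs[jobIndex];
        // Traverse(job.origin, job.dir, ...) starting at job.startNode,
        // e.g. with a compute traversal loop like the one sketched earlier.
    }
}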

How deep the tree has to be depends mostly on the branching factor. A binary tree needs too many levels.
With my BVH8 I need about 20 levels for an open world, but my surfels are 10 cm, so larger than game triangles.
For fluid sim I use a sparse grid (to me that's just an octree but with 4^3 instead of 2^3 children), and IIRC I use only two internal levels for grid sizes of 1024^3. How aggressively a larger branching factor can reduce tree levels is very interesting and not so obvious at first.
This picture:

[graph: number of node tests as a function of branching factor]

(X-axis is the branch factor B, and Y-axis is the number of node tests. Note that although this particular graph assumes a fixed N, the patterns hold for all values of N.)

from: https://psychopath.io/post/2017_08_03_bvh4_without_simd
 
I think we can come up with a minimal node definition in AMD's BVH4:

Code:
struct Node {
    AABox     bbox;
    nodeIndex childIndices[4];   // BVH4: up to four children
}

We could likely figure that out. But even if it turns out those child pointers are there (which is likely), it does not tell us whether they are used only from software (AMD's DXR implementation and outer traversal loop),
or whether there is a hardware mechanism which uses them too (the outer traversal loop processed by some fixed-function block).

After having learned about potential microcode (making NV RT cores partially programmable from their side, eventually), I realize my view of compute vs. fixed function might be naive.
But my assumption would be like this: AMD did some work with Radeon Rays, and algorithms were taken from there to implement DXR. Those programs are not much different from the compute shaders we can use within our gfx APIs.
If that's true, it becomes very likely we could beat their performance with a customized solution in many cases. Unlimited flexibility would not be the only advantage.

To me the interesting idea here is to opportunistically generate rays deep in the BVH, benefitting from cache locality and taking decisions about the degree of ray coherence or what happens in a ray's diffuse bounce.
You mean to start rays at the surface from a known BVH node, so the traversal from the root down just to find the start can be avoided?
Yeah, that's a nice idea for such custom optimizations!
I remember I thought about this some time back, probably when working on a path tracer for reference renders, but did not try it out. For paths this could be a big win, because we can reuse the nodes at the hit point.
But I see a big problem for the game application: no paths, and if we start rays from pixels we don't know any starting node yet. We could use a texture to fetch it from, but generating and storing this, dealing with LOD, etc. is not easy. It would be more practical if we used primary rays too - no more rasterization.
 
We could likely figure that out.
I've realised that you can shift the storage of bounding boxes upwards by one level in the tree:

Code:
struct Node
{
    BBox      childBoxes[4];
    NodeIndex childIndices[4];
}

and this could present an opportunity to increase the compression ratio for bounding boxes. With 24 floats there could be shared exponents and some rule-breaking with respect to IEEE 754 (a variant of denormal formatting). Or run length encoding. Or both combined. In the end you need a compression algorithm that results in conservatively erroneous bounds (i.e. the errors produce larger boxes, not smaller). This is similar to how texture formats rely upon fairly complex bitwise operations, which of course are trivial for hardware to decompress. The key is to make a node a fixed size and smaller than the raw data.

So each node is "larger" but now there would be one less level to reach an arbitrary leaf. And since all four children need to be tested, it makes sense to perform only a single memory access to then run four tests, instead of four incoherent memory accesses before running four tests.
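As a sketch of the principle (not a claim about AMD's actual format): quantise each child box relative to the parent's bounds with 8 bits per axis, rounding outwards so any error only ever enlarges a box:

Code:
// Child bounds stored as 8-bit offsets within the parent box: 6 bytes of
// quantized data per child (held in the low 24 bits of two uints here)
// instead of 24 bytes of full floats.
struct PackedNode
{
    float3 parentMin;
    float3 parentMax;
    uint   qMin[4];
    uint   qMax[4];
    int    child[4];
};

uint QuantizeDown(float3 p)   // round down -> decoded min <= true min (box only grows)
{
    uint3 q = (uint3)clamp(floor(p * 255.0), 0.0, 255.0);
    return q.x | (q.y << 8) | (q.z << 16);
}

uint QuantizeUp(float3 p)     // round up -> decoded max >= true max (box only grows)
{
    uint3 q = (uint3)clamp(ceil(p * 255.0), 0.0, 255.0);
    return q.x | (q.y << 8) | (q.z << 16);
}

// Encode child box i; childMin/childMax must lie inside the parent box.
void PackChildBox(inout PackedNode n, int i, float3 childMin, float3 childMax)
{
    float3 ext = max(n.parentMax - n.parentMin, 1e-6);
    n.qMin[i] = QuantizeDown((childMin - n.parentMin) / ext);
    n.qMax[i] = QuantizeUp  ((childMax - n.parentMin) / ext);
}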

But even if it turns out those child pointers are there (which is likely), it does not tell us whether they are used only from software (AMD's DXR implementation and outer traversal loop), or whether there is a hardware mechanism which uses them too (the outer traversal loop processed by some fixed-function block).
We know that the test of a ray against four AABBs is in hardware: there's a dedicated instruction for that. We know that there is no fixed function traversal in hardware.

After having learned about potential microcode (making NV RT cores partially programmable from their side, eventually), I realize my view of compute vs. fixed function might be naive.
But my assumption would be like this: AMD did some work with Radeon Rays, and algorithms were taken from there to implement DXR. Those programs are not much different from the compute shaders we can use within our gfx APIs.
If that's true, it becomes very likely we could beat their performance with a customized solution in many cases. Unlimited flexibility would not be the only advantage.
I get the sense that algorithms are evolving rapidly...

You mean to start rays at the surface from a known BVH node, so the traversal from the root down just to find the start can be avoided?
Yeah, that's a nice idea for such custom optimizations!
I remember I thought about this some time back, probably when working on a path tracer for reference renders, but did not try it out. For paths this could be a big win, because we can reuse the nodes at the hit point.
But I see a big problem for the game application: no paths, and if we start rays from pixels we don't know any starting node yet. We could use a texture to fetch it from, but generating and storing this, dealing with LOD, etc. is not easy. It would be more practical if we used primary rays too - no more rasterization.
I'm thinking about how in global illumination you have per triangle locality when generating bounce rays. So you can start with extremely sparse rays to find triangles that are candidates for GI value and then spawn bounce rays at each triangle.
 
I've realised that you can shift the storage of bounding boxes upwards by one level in the tree:
Makes sense.
Though it's also possible they store children sequentially in memory and need only firstChild with two bits for childCount.
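That variant needs very little per node - the whole child reference fits in a single 32-bit value, for example (just a sketch):

Code:
// 30 bits of first-child index, 2 bits of (childCount - 1): 1..4 children.
uint PackChildren(uint firstChild, uint childCount)
{
    return (firstChild << 2) | (childCount - 1);
}

void UnpackChildren(uint packed, out uint firstChild, out uint childCount)
{
    firstChild = packed >> 2;
    childCount = (packed & 3) + 1;
}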

I just thought about something else: they store the whole BVH in a texture, so we could use the virtual texturing feature to stream BLAS very easily. And with their support for large BVHs there is no need to resolve indirections through pointers.
I really think consoles will go that route. Still requesting it for PC too...

I'm thinking about how in global illumination you have per triangle locality when generating bounce rays. So you can start with extremely sparse rays to find triangles that are candidates for GI value and then spawn bounce rays at each triangle.
Sounds a bit like photon mapping? However, the locality you aim for is a must-have IMO, because modern hardware really wants it. Things like Quake RTX or Metro fail to utilize it, and path tracing in general does too. That's a big reason why this classical RT with DXR is so fishy for games.
 
They must be fooling us; Quake's ray traced performance isn't what it could have been, and people are stuck with bad performance.
People should sue whoever is behind DXR and ray tracing.
You failed to catch my sarcasm in the other post, but I don't fail to detect yours here ;)

Seriously, there's nothing wrong with brainstorming some optimization ideas about new HW.
But if you think DXR is just fine as is, then enlighten us about how you think UE5 could support it so easily, or tell me how I should use it on geometry that constantly changes due to LOD...
 