I disagree. The explicit return of 4 results per query is specified, and since BVH traversal requires all child nodes to be queried simultaneously and since there is no way for the hardware to request query results for a subset of BVH nodes, it's not possible with the documented instructions to get results for more than 4 children in a parent node.
Clearly, if you wanted to run two parallel BVHs to get more than 4 children per node, you could, but that would be pretty strange, since 4 children is the optimal "power of two" child count for a BVH: you get close to the smallest average number of traversal steps. There's a nice graph in the blog post linked earlier showing that 3 is the theoretically optimal child count, which makes 4 the nearest power of two.
If you wanted to use BVH2, I suppose you could, but you'd be throwing away half of the AABB intersection throughput, since it has been explicitly designed to produce four results per parent node.
Not sure what you disagree about, but I meant the TMU patent gave two options: either implement the traversal loop in fixed function hardware, or leave it to the shader cores. They did not say which they might use, but it was clear their intersectors already operated on 4 boxes back then.
With a BVH there is no strict need to query all children simultaneously, nor do we have to process them in closest-first order. Stackless traversal algorithms usually do neither. (Just to mention: the norm seems to be ordered traversal using some short stack.)
A research result of 3 being the optimal branching factor can't be general. It depends on the algorithm, the quality of the BVH, the scene, etc. The only general argument I could come up with is this geometric one: 'If we want to subdivide space so children have half the size of their parents, then in 3D we get an ideal number of 8 children, like an octree.' But that's a very weak argument as well.
Agree that BVH2 is no good idea. But I think we could implement octrees or sparse grids by calling the intersection instruction multiple times and refining the order manually if necessary. That's not really what I'm personally interested in, but hey - I could raytrace my fluid simulator, which uses a sparse grid.
What I am interested in is reusing my GI BVH for HW RT. I use 8 children at most, but at the surface it looks like a quadtree of surfels and has mostly 4 children. I guess it would just work.
I don't understand what you're saying here, since the arguments are specified. The resource descriptor tells the fixed function hardware some other facts about how to process the query, such as whether to sort AABB results.
What I mean is simply... writing some code:
struct Node
{
    AABox bbox;
    int childIndex;
    int childCount;
    int parentIndex;      // optional
    int adjacencyIndices; // optional
};
The intersection instructions take only pointers to bounding boxes, but they do not make assumptions about anything else. So it's completely up to us what acceleration structure we use, and how we traverse it.
All my talk here is about the idea of not using DXR at all, but implementing the data structure and traversal in compute. This makes sense because AMD lacks fixed function traversal hardware, so we should get the same or better performance than their generic DXR implementation, with extra features like LOD.
I did not pay attention to triangles, but only one triangle per node sounds like the BVH will take a lot of memory. I assume they may iterate over a list of triangles serially, but no idea.
A refit still has to process all descendants of the top-most node affected by it. It doesn't sound trivial to me. How deep are these BVH trees in games? 10? 15?
Parent-child dependencies during refit can be avoided by enlarging boxes so they still bound their children under animation. That's useful at least for the lowest levels - then we only need to transform the boxes along with the geometry.
To further reduce dependencies, we can process level x of all the BLAS we have, then sync and continue with level x+1. So, ideally we get one dispatch per tree level.
Alternatively, NV often does research on doing it all in one go at the cost of extra work like searching and sorting. I'm not up to date with this. Assuming modern NV is good with async compute, the number of dispatches should no longer be a big problem if other work keeps the GPU saturated.
What I personally do as well is build multiple top levels in a single wide workgroup, which moves the barriers from the API level into the compute program. That's quite a win for me.
How deep the tree has to be depends mostly on the branching factor. A binary tree needs too many levels.
With my BVH8 I need about 20 levels for an open world, but my surfels are 10 cm, so larger than game triangles.
For fluid sim I use a sparse grid (to me that's just an octree, but with 4^3 instead of 2^3 children), and IIRC I use only two internal levels for grid sizes of 1024^3. How aggressively a larger branching factor can reduce tree levels is very interesting and not so obvious at first.