DirectX Ray-Tracing [DXR]


Time for some Titan XP SLI benchmarking...

 
AMD are going to have DXR acceleration of some form, just like they have hardware support for all the other DX features.
 
He can't occlusion cull during the BVH traversal though; he has to build a list of every bounding volume hit before he can start doing triangle intersection tests ... that's a bit shitty for large scenes.
 
There is one question which was puzzling me for a while...

Why did NVidia opt for a non-uniform BVH with overlapping bounding boxes rather than a densely packed, uniformly subdivided octree?

Especially considering that an octree can be compacted down to a really hardware-friendly bitset representation for quick hardware filtering in correct traversal order? There are some extremely fast, stack-free data layouts and algorithms for traversing uniform trees. E.g. an uncle/sibling list for pure horizontal traversal, reducing amortized runtime from O(n * log(m)) down to O(n + log(m)) for "n" near hits and "m" total leaf nodes. Or bit-mask based, higher-order nodes like 4³ or 8³ cubes which can be tested with a simple shift+mask operation as fast as the memory goes (compile a conservative rasterized hit pattern, shift and mask it by the x/y/z offset, with a second lookup phase on a near hit for the next node tier).
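To make the shift+mask idea concrete, here is a minimal sketch, scaled down to a 4x4x4 node whose 64 occupancy bits fit one machine word (an 8x8x4 node would use a 32-byte bitset instead). All names are made up; the hit pattern is assumed to be a precomputed conservative rasterization of the ray through a canonical node, which gets shifted by the cell offset and masked so nothing wraps across rows or slices:

```cpp
#include <cstdint>

// Cell (x,y,z) in [0,4)^3 maps to bit x + 4*y + 16*z.
// Shift a canonical 4x4x4 hit pattern by an (dx,dy,dz) cell offset.
// The per-row / per-slice masks drop bits that would wrap around.
inline uint64_t shift_pattern(uint64_t p, unsigned dx, unsigned dy, unsigned dz) {
    p = (p << dx)     & (uint64_t(0xF & (0xF << dx)) * 0x1111111111111111ULL);
    p = (p << (4*dy)) & (uint64_t(0xFFFF & (0xFFFF << (4*dy))) * 0x0001000100010001ULL);
    p =  p << (16*dz); // bits shifted past z=3 simply fall off the top
    return p;
}

// One AND filters all 64 cells against the node's occupancy at once.
inline uint64_t candidate_cells(uint64_t occupancy, uint64_t shifted_pattern) {
    return occupancy & shifted_pattern;
}
```

Each set bit in the result is a child cell that is both occupied and conservatively hit by the ray, so the second lookup phase only ever touches real candidates.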

The catch is geometry / transform updates.
You can update each leaf node in a self-intersecting BVH independently, and the only change which propagates upwards is that the individual bounding boxes widen / shift. The BVH is effectively degenerating (as the bounding boxes gain more overlap), but it's not changing structure. Moving anything in a uniform distribution tree effectively requires rebuilding parts of the tree. That is quite tricky to do in parallel, while a parallel update of all composed bounding boxes marching from leaf nodes towards the root node is almost trivial (on each level, all nodes may be updated in parallel without synchronization).

A distinguishable bucket per level and you have all you need for a GPU side update routine (only updating transforms though). Insertions into arbitrary leaf nodes are also possible, even though they imply a rapid degeneration. (So there is one "insert + remove" pass, followed by a parallel "update transforms" pass updating leaf node bounding boxes, followed by a parallel "update BVH" pass propagating bounding boxes.) And that's btw. where the GPU accelerated BVH updates come into play ;) Update structure on CPU, compute bounding boxes on GPU.
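The "update BVH" pass can be sketched roughly like this (serial CPU code for clarity; the names and the per-level bucket layout are my assumptions). Leaf boxes are assumed to have been refreshed by the transform pass already; the inner loop over each level's bucket is the part that maps to one synchronization-free GPU dispatch per level:

```cpp
#include <algorithm>
#include <vector>

struct AABB {
    float min[3], max[3];
};
struct Node {
    AABB box;
    int  child[2]; // indices into the node array; -1 = no child
};

inline AABB merge(const AABB& a, const AABB& b) {
    AABB r;
    for (int i = 0; i < 3; ++i) {
        r.min[i] = std::min(a.min[i], b.min[i]);
        r.max[i] = std::max(a.max[i], b.max[i]);
    }
    return r;
}

// levels[k] holds the node indices of level k, leaves in the last bucket.
// March from the deepest inner level toward the root; within one level a
// node only reads its (already refreshed) children, so no sync is needed.
void refit(std::vector<Node>& nodes, const std::vector<std::vector<int>>& levels) {
    for (int k = int(levels.size()) - 2; k >= 0; --k) {   // skip the leaf level
        for (int idx : levels[k]) {                       // parallel-for on GPU
            Node& n = nodes[idx];
            if (n.child[0] < 0) continue;                 // leaf box set elsewhere
            n.box = nodes[n.child[0]].box;
            if (n.child[1] >= 0) n.box = merge(n.box, nodes[n.child[1]].box);
        }
    }
}
```

Note that this only ever widens/shifts boxes in place, which is exactly the structure-preserving degeneration described above.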

So what does that imply?
First of all, the acceleration structure used in RTX is a trade-off. In terms of pure rendering performance, it's easily an estimated factor 5-10x behind what a hardware accelerated, uniformly distributed acceleration structure could achieve at the same hardware complexity. But it's a necessary trade-off if it's to be used for anything except static, non-animated scene geometry. Conversely, it is not the ideal solution if your goal is quality >> realtime updates; in that case, the cost saved by trading easy updates for a sub-ideal data structure is easily eaten by the vastly higher traversal cost, due to more intermediate steps, backtracking on top of that, and a vast number of unnecessary any-hits caused by the undefined traversal order.
 
We don't know what their tree looks like. If you treat BVH8 as a generalization of octree, you can get many of the advantages you list and still call it BVH. (BTW, AMD's TMU patent says their tree has 4 children, IIRC. But NV is unknown?)
But I have similar thoughts and assume ImgTec's HW builder could work a lot like you say.

Besides the rebuild problem you already listed, I see some more disadvantages of an octree:
If you link each triangle to only one node, and take advantage of storing no node positions and bounding volumes (which makes the octree so attractive), then you have to assume a loose octree, with each bounding box having 8 times the volume of the grid cell of the node.
Likely this hurts traversal performance a lot, remembering how big the win usually is from using SAH, for example.
But the huge wins in reduced BW and HW intersection may outweigh this. It all seems hardware friendly.

And second, if the scene is quite dense, you will need more nodes and you come closer to the cost of processing a volume, while what you want is the cost of a 2D surface.
Though, the difference to a BVH can't be that large here.

I guess they tried all those options in their simulators and picked the best - whatever it is. But might change...
 
If you look at the traversal gains to be had simply by making the boxes of unused space larger in AABBs I'm not sure straight octrees are a good idea.
 
If you link each triangle to only one node, and take advantage of storing no node positions and bounding volumes (which makes the octree so attractive), then you have to assume a loose octree, with each bounding box having 8 times the volume of the grid cell of the node.
If you go for a uniform tree, you have hardly any choice but to reference each triangle as a leaf from more than one intermediate tree node. It's a trade-off between the space saved on the leaf nodes and the complexity of traversal / near misses. A loose octree and geometry on intermediate nodes complicate traversal significantly. Especially a loose octree is going to amplify the necessary memory transactions.

If you look at the traversal gains to be had simply by making the boxes of unused space larger in AABBs I'm not sure straight octrees are a good idea.
That's not something I'm worried about at all. You can compress it down to 1 bit per cell for encoding empty space, and then index by hardwired pop-count into linear storage of the occupied cells only. Each child node should usually be unique within the list of children, so no further compression applies. It should work ideally for an 8x8x4 node size (the axis with 4 cells rotating by level) if you wanted the bitset to exactly fit a 32-byte cache line, followed by the packed indices. If you like, you can also encode metadata for the next child within the packed field, in order to cut down on waste on the second memory access.
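A rough sketch of that occupancy-bit + pop-count indexing, scaled down to a 4x4x4 node in a single 64-bit word so it stays self-contained (on a GPU the pop-count would be one hardware instruction; the struct and function names are made up):

```cpp
#include <cstdint>

// Hypothetical compacted node: one occupancy bit per cell, children stored
// packed in a separate array, addressed via pop-count over the lower bits.
struct OctNode64 {
    uint64_t occupancy;   // bit i set => cell i holds a child
    uint32_t first_child; // base index into the packed child array
};

// Linearize (x,y,z) in [0,4)^3 into a bit position.
inline unsigned cell_bit(unsigned x, unsigned y, unsigned z) {
    return x | (y << 2) | (z << 4);
}

// Occupancy test: a single shift+mask.
inline bool cell_occupied(const OctNode64& n, unsigned x, unsigned y, unsigned z) {
    return (n.occupancy >> cell_bit(x, y, z)) & 1u;
}

// Portable stand-in for a hardware popcount instruction.
inline uint32_t popcnt64(uint64_t v) {
    uint32_t c = 0;
    while (v) { v &= v - 1; ++c; } // clear lowest set bit
    return c;
}

// Packed index of an occupied cell's child: count the set bits below it.
inline uint32_t child_index(const OctNode64& n, unsigned x, unsigned y, unsigned z) {
    unsigned b = cell_bit(x, y, z);
    uint64_t below = (b == 0) ? 0 : (n.occupancy & ((uint64_t(1) << b) - 1));
    return n.first_child + popcnt64(below);
}
```

Empty cells cost exactly one bit; only occupied cells pay for a child index.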
We don't know what their tree looks like. [...] But NV is unknown?
Nvidia's optimization guidelines mentioned something about degeneration of the acceleration structures, so you were urged to rebuild occasionally. In hindsight, that gave it away. There are not many options for acceleration structures which can degenerate on an update to begin with, and even fewer which would be suitable for GPU accelerated updates. The only part we can't deduce for now is the precise data layout of each node, but we can safely say it contains a bounding box (unclear if all 3 or only 2 ordinates are encoded per level?) and 2-4 child indices, plus some metadata to distinguish tree nodes from leaf nodes. The metadata may be encoded in the index, rather than inside the node. A possible layout may be 6x4 byte for the BB, plus 2x4 byte for child indices, with 1-2 bits per index reserved for metadata encoding child properties.

For a 64byte cache line, 4 children would be plausible, with some additional data encoded for earlier child rejection.
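Purely as an illustration of the guessed 32-byte variant, not Nvidia's actual format, such a node with the metadata packed into the low bits of each child index might look like:

```cpp
#include <cstdint>

// Speculative layout: 6 floats of AABB plus two child indices,
// with 2 low bits per index reserved for metadata (node kind etc.).
struct NodeGuess32 {
    float    bb_min[3];
    float    bb_max[3];
    uint32_t child[2];
};
static_assert(sizeof(NodeGuess32) == 32, "fits a 32-byte line");

inline uint32_t child_offset(uint32_t c) { return c >> 2;  } // actual index
inline uint32_t child_kind(uint32_t c)   { return c & 0x3; } // metadata bits
```

The 64-byte variant would simply double the child array to 4 entries and use the spare bits for early-rejection data.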
 
If you go for a uniform tree, you have hardly any choice but to reference each triangle as a leaf from more than one intermediate tree node.
How would you do this step efficiently? Sounds very expensive. Assuming we already know which leaf nodes are necessary, three steps remain:
1. Run over all triangles, intersect them with the grid, traverse to the occupied leaves, increment a counter per leaf.
2. Prefix sum over the counts to get indices for a compacted list.
3. All the work from step 1 again to insert the triangles into the list.
How to do better than this? Each step is a huge workload.
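For reference, the three steps amount to a counting-sort style compaction. A serial CPU sketch with hypothetical names (on a GPU, step 1 and 3 would use atomics per cell and step 2 a parallel prefix sum; `item_cells` stands in for the triangle/grid intersection result):

```cpp
#include <cstddef>
#include <vector>

// item_cells[i] lists the leaf cells overlapped by triangle i (a triangle
// may appear in several cells). Returns the compacted per-cell triangle
// list; cell_start[c]..cell_start[c+1] is cell c's range in it.
std::vector<int> bin_items(const std::vector<std::vector<int>>& item_cells,
                           std::size_t num_cells, std::vector<int>& cell_start) {
    cell_start.assign(num_cells + 1, 0);
    for (const auto& cells : item_cells)
        for (int c : cells) ++cell_start[c + 1];   // 1. count per cell
    for (std::size_t i = 1; i <= num_cells; ++i)
        cell_start[i] += cell_start[i - 1];        // 2. prefix sum -> offsets
    std::vector<int> packed(cell_start[num_cells]);
    std::vector<int> cursor(cell_start.begin(), cell_start.end() - 1);
    for (std::size_t i = 0; i < item_cells.size(); ++i)
        for (int c : item_cells[i])
            packed[cursor[c]++] = int(i);          // 3. re-walk and scatter
    return packed;
}
```

The duplicated intersection work in step 3 can be avoided by caching the (triangle, cell) pairs from step 1, at the cost of extra memory.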

But for a static world this would be no problem. Could it be worth using an octree for static and a BVH for dynamic geometry? Not sure what I want: the option to try such ideas ourselves, or fixed function and no more worries. :)
 
Do any GPU/cuda ray tracing engines *not* use BVH? Pretty much every piece of literature and source I've seen in the GPU-RT space seems to prefer BVH.
 
Do any GPU/cuda ray tracing engines *not* use BVH? Pretty much every piece of literature and source I've seen in the GPU-RT space seems to prefer BVH.

BVH is a fairly generalized form. There are papers regarding k-DOPs (basically k-plane convex hulls) which shift the overhead around. The BVH sits at a good spot between "fast" updates and "fast" traversal, while the alternatives (say an octree, which is a less general subset of a specific BVH expression, or k-DOPs, which are a more general expression of BVHs) are either ugly to update and faster to traverse, or very easy to update and less efficient to traverse.
 
Khronos will discuss "standardised ray tracing" at GDC - AMD and Nvidia will be there
To date, only one retail game has delivered raytracing support through the Vulkan API: Wolfenstein Youngblood. If you want to include more games in this list, you'll need to count Vulkan-based projects like Quake II RTX and other non-retail releases.

As it stands, the Vulkan API lacks support for raytracing. Yes, Nvidia has released RTX specific extensions for Vulkan raytracing, but this support is hardware exclusive and lacks the multi-platform/multi-vendor raytracing support that Microsoft's DXR (DirectX Raytracing) delivers.

Sadly, we do not know what form Khronos' ray tracing implementation will take. Still, given Khronos' recent moves with Vulkan 1.2, we guess that Vulkan will closely align with Microsoft and its DXR implementation. Vulkan 1.2 already supports HLSL (DirectX's Shading Language) with support for up to Shader Model 6.2. Support for Shader Model 6.3 will bring with it support for DXR HLSL code, and this code should be usable with Vulkan's planned raytracing implementation.
https://www.overclock3d.net/news/so...acing_at_gdc_-_amd_and_nvidia_will_be_there/1
 