DirectX Ray-Tracing [DXR]

Discussion in 'Rendering Technology and APIs' started by BRiT, Mar 19, 2018.

  1. Shortbread

    Shortbread Island Hopper
    Veteran

    Joined:
    Jul 1, 2013
    Messages:
    4,337
    Likes Received:
    2,648
  2. SlmDnk

    Regular

    Joined:
    Feb 9, 2002
    Messages:
    614
    Likes Received:
    278
    I know this is a DXR thread but due to the latest GeForce driver release, Quake 2 Vulkan Path Tracer now works on GTX cards as well.

    http://brechpunkt.de/q2vkpt/
     
    Lightman, Heinrich4, OCASM and 3 others like this.
  3. Rootax

    Veteran Newcomer

    Joined:
    Jan 2, 2006
    Messages:
    1,631
    Likes Received:
    1,002
    Location:
    France
    Do you think AMD will follow soon ?
     
  4. A1xLLcqAgt0qc2RyMz0y

    Veteran Regular

    Joined:
    Feb 6, 2010
    Messages:
    1,501
    Likes Received:
    1,260
  5. Unknown Soldier

    Veteran

    Joined:
    Jul 28, 2002
    Messages:
    2,794
    Likes Received:
    459
    Well if rumours are to be believed.
     
  6. iroboto

    iroboto Daft Funk
    Legend Regular Subscriber

    Joined:
    Mar 6, 2014
    Messages:
    11,515
    Likes Received:
    12,373
    Location:
    The North
    Yes.
     
  7. Shifty Geezer

    Shifty Geezer uber-Troll!
    Moderator Legend

    Joined:
    Dec 7, 2004
    Messages:
    43,576
    Likes Received:
    16,031
    Location:
    Under my bridge
    AMD are going to have DXR acceleration of some form just like they have hardware support for the all the other DX features.
     
  8. Alessio1989

    Regular Newcomer

    Joined:
    Jun 6, 2015
    Messages:
    604
    Likes Received:
    320
  9. MfA

    MfA
    Legend

    Joined:
    Feb 6, 2002
    Messages:
    7,205
    Likes Received:
    607
    He can't occlusion cull during the BVH traversal though, he has to build a list of every bounding volume hit before he can start doing triangle intersection tests ... that's a bit shitty for large scenes.
     
  10. Alessio1989

    Regular Newcomer

    Joined:
    Jun 6, 2015
    Messages:
    604
    Likes Received:
    320
    I guess there a tons of way to remedy to it.. Clustered shading anyone?
     
  11. A1xLLcqAgt0qc2RyMz0y

    Veteran Regular

    Joined:
    Feb 6, 2010
    Messages:
    1,501
    Likes Received:
    1,260
    Ray Tracing On Nvidia’s Pascal Architecture, Benchmarked

    https://www.tomshardware.com/reviews/nvidia-pascal-ray_tracing-tested,6085.html

     
  12. Ext3h

    Regular Newcomer

    Joined:
    Sep 4, 2015
    Messages:
    393
    Likes Received:
    404
    There is one question which was puzzling me for a while...

    Why did NVidia opt for non-uniform BVH with overlapping bounding boxes rather than a uniform, densely packed, uniform distribution oct-tree?

    Especially considering that an oct-tree can be compacted down to a really hardware friendly bitset representation for quick hardware filtering in correct traversal order? There are some extremely fast, stack-free data layouts and algorithms for traversing uniform trees. E.g. uncle/sibling list for pure horizontal traversal, reducing amortized runtime from O(n * log(m)) down to O(n + log(m)) for "n" near hits and "m" total leaf nodes. Or bit-mask based, higher-order nodes like 4³ or 8³ cubes which can be tested with a simple shift+mask operation as fast as the memory goes (compile conservative rasterized hit pattern, shift and mask by x/y/z offset, on near-hit second lookup phase for next node tier).

    The catch is geometry / transform updates.
    You can update each leaf node in a self-intersecting BVH idenpendently and the only change which propagates upwards is that the individual bounding boxes widen / shift. The BVH is effectively degenerating (as the bounding boxes gain more overlap), but it's not changing structure. Moving anything in a uniform distribution tree requires effectively to rebuild parts of the tree. Quite tricky to do that in parallel, while a parallel update of all composed bounding boxes marching from leaf nodes towards to the root node is almost trivial (on each level, all nodes may be updated in parallel without synchronization).

    A distinguishable bucket per level and you have all you need for a GPU side update routine (only updating transforms though). Insertions into arbitrary leaf nodes are also possible, even though they imply a rapid degeneration. (So there is one "insert + remove" pass, followed by a parallel "update transforms" pass updating leaf node bounding boxes, followed by a parallel "update BVH" pass propagating bounding boxes.) And that's btw. where the GPU accelerated BVH updates come into play ;) Update structure on CPU, compute bounding boxes on GPU.

    So what does that imply?
    First of all, the acceleration structure used in RTX is a trade-off. In terms of pure rendering performance, it's easily an estimated factor 5-10x behind what a hardware accelerated uniform distributed acceleration structure could achieve at same hardware complexity. But it's a necessary trade-off, if it's to be used for anything except static, non-animated scene geometry. In contrast, it is not the ideal solution if your goal is quality >> realtime updates, in that case the cost saved by trading easy updates for a sub-ideal data structure is easily eaten by the vastly higher traversal cost due to both more intermediate steps, adding on top of backtracking and vast number of unnecessary any-hits due to undefined traversal order.
     
    #272 Ext3h, Jan 1, 2020
    Last edited: Jan 1, 2020
    Lightman, 3dcgi, pharma and 3 others like this.
  13. JoeJ

    Regular Newcomer

    Joined:
    Apr 1, 2018
    Messages:
    990
    Likes Received:
    1,122
    We don't know how their tree looks like. If you treat BVH8 as a generalization of octree, you can get many of the advantages you list and still call it BVH. (BTW, AMDs TMU patent says their tree has 4 chlidren, IIRC. But NV is unknown?)
    But i have similar thoughts and assume ImgTecs HW builder could be a lot like you say.

    Beside no rebuild being possible as you already listed, i see some more disadvantages of octree:
    If you link each triangle to only one node, and take the advantage to store no node positions and bounding volumes (which makes octree so attractive), then you have to assume a loose octree, with each bounding box having 8 times the volume than the grid cell of the node.
    Likely this hurts traversal performance a lot, remembering how big the win usually is from using SAH for example.
    But the huge wins in reduced BW and HW intersection may outweight this. It all seems hardware friendly.

    And second, if the scene is quite dense, you will need more nodes and you come closer to the costs of volume processing, while what you want is the cost of 2D surface.
    Though, the difference to BVH can't be that large here.

    I guess they tried all those options in their simulators and picked the best - whatever it is. But might change...
     
    Lightman, 3dcgi, pharma and 2 others like this.
  14. MfA

    MfA
    Legend

    Joined:
    Feb 6, 2002
    Messages:
    7,205
    Likes Received:
    607
    If you look at the traversal gains to be had simply by making the boxes of unused space larger in AABBs I'm not sure straight octrees are a good idea.
     
  15. Ext3h

    Regular Newcomer

    Joined:
    Sep 4, 2015
    Messages:
    393
    Likes Received:
    404
    If you go for an uniform tree, you have hardly any choice but to record triangles as leaf nodes to more than one intermediate tree-node. It's a trade-off between space saved on the leaf nodes and complexity of traversal / near misses. Loose octtree and geometry on intermediate nodes complicates traversal significantly. Especially loose octtree is going to amplify necessary memory transactions.

    That's not something I'm worried about at all. You can compress it down to 1bit per cell for encoding empty spaces, and then index by hardwired pop-count into linear storage of occupied cells only. Each child node should be usually unique within the list of children, so no further compression applies. Should work ideally for a 8x8x4 (the 4 rotating axis by level) node size if you wanted to exactly fit a 32 byte cache line for the bitset, followed by packed indices. If you like, you can also encode metadata for the next child within the packed field, in order to cut on waste on the second memory access.
    Nvidia's optimization guidelines mentioned something about degeneration of the acceleration structures, so you were urged to rebuild occasionally. In hindsight, that gave it away. Not many options for acceleration structures which can degenerate on an update to begin with, and even less which would be suitable for GPU accelerated updates. Only part we can't deduce for now is the precise data layout for each node, but we can safely say it contains a bounding box (unclear if all 3 or only 2 ordinates are encoded per level?) and 2-4 child indices, plus some metadata to distinguish tree nodes from leaf nodes. Metadata may be encoded in index, rather than inside the node. Possible layout may be 6x4byte for BB, plus 2x4byte for child indices, with 1-2 bit per index reserved for metadata encoding child properties.

    For a 64byte cache line, 4 children would be plausible, with some additional data encoded for earlier child rejection.
     
    3dcgi and BRiT like this.
  16. JoeJ

    Regular Newcomer

    Joined:
    Apr 1, 2018
    Messages:
    990
    Likes Received:
    1,122
    How would you do this step efficiently? Sounds very expensive. Assuming we already know which leaf nodes are necessary, two steps remain:
    1. Run over all triangles, intersect them with the grid, traverse for occupied leafs, increase counter per leaf.
    2. Prefix sum over the counts to get indices for a compacted list.
    3. All the work from step 1 again to insert the triangles to the list.
    How to do better than this? Each step is a huge workload.

    But for static world this would be no problem. Could be worth to use octree for static and BVH for dynamic? Not sure what i want - the option we could try such ideas, or fixed function and no more worries. :)
     
  17. hughJ

    Regular

    Joined:
    Feb 7, 2002
    Messages:
    800
    Likes Received:
    331
    Do any GPU/cuda ray tracing engines *not* use BVH? Pretty much every piece of literature and source I've seen in the GPU-RT space seems to prefer BVH.
     
  18. Ethatron

    Regular Subscriber

    Joined:
    Jan 24, 2010
    Messages:
    873
    Likes Received:
    278
    BVH is a fairly generalized form. There are papers regarding k-DOPs (basically k-plane convex hulls) which have a shifted overhead. The BVH is at a good spot for "fast" permutation and "fast" traversal, while others (say a OctTree which is a less general sub-set of a specific BVH expression, or k-DOPs which is a more general expression of BVHs) are ugly to update and faster to traversal, or very easy to update and less efficient to traverse.
     
    pharma and Dictator like this.
  19. pharma

    Veteran Regular

    Joined:
    Mar 29, 2004
    Messages:
    3,730
    Likes Received:
    2,584
    Khronos will discuss "standardised ray tracing" at GDC - AMD and Nvidia will be there
    https://www.overclock3d.net/news/so...acing_at_gdc_-_amd_and_nvidia_will_be_there/1
     
  20. Alessio1989

    Regular Newcomer

    Joined:
    Jun 6, 2015
    Messages:
    604
    Likes Received:
    320
    Khronos Gruop... Khronos Group never changes...
     
Loading...

Share This Page

  • About Us

    Beyond3D has been around for over a decade and prides itself on being the best place on the web for in-depth, technically-driven discussion and analysis of 3D graphics hardware. If you love pixels and transistors, you've come to the right place!

    Beyond3D is proudly published by GPU Tools Ltd.
Loading...