RDNA 2 Ray Tracing

Discussion in 'Architecture and Products' started by Jawed, Dec 8, 2020.

  1. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    11,266
    Likes Received:
    1,524
    Location:
    London
    You're thinking 10 or 20GB of BVH kept on disk? Streaming in a few hundred MB as needed? LOD strategies?

    Doing any of these techniques in real time requires a huge amount of denoising. So when I talk about opportunistic bounce ray generation, I'm expecting it to produce a noisy result. In general, real-time ray tracing runs at very low sample counts per pixel.
     
  2. JoeJ

    Veteran Newcomer

    Joined:
    Apr 1, 2018
    Messages:
    1,053
    Likes Received:
    1,239
    Yeah, but there might be just not enough storage, depending on the game.
    But then we could at least build 'nearby' BVH during startup or as a background task, and manage a cache on disk to avoid constant rebuilds of the same models as they pop in and out.
    Yes, but this maximizes incoherence.
    For GI we can avoid this. Examples would be DDGI probes, where we have many rays originating from the same probe center. So I thought you were thinking about doing the same, but making the emitter the center instead of the receiver.
     
  3. PSman1700

    Veteran Newcomer

    Joined:
    Mar 22, 2019
    Messages:
    4,270
    Likes Received:
    1,915
    Then I did a good job :)
    Of course DXR needs refining, just like any other recently introduced technology. I think it will be OK, just like any other 'great/huge' new feature out there now. All these combined should help a lot in improving on the 'lack' of generational jumps these days.
     
  4. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    11,266
    Likes Received:
    1,524
    Location:
    London
    milk likes this.
  5. Ethatron

    Regular Subscriber

    Joined:
    Jan 24, 2010
    Messages:
    913
    Likes Received:
    347
    I'm kind of surprised there are no hardware-based numerical integrators; it should be one of the more decisive cases where hardware just beats software by some orders of magnitude. Let's say we can get sums out of texture fetches instead of averages/filtered values, and then we can get masked sums (masked either by another texture, or by a polynomial/spherical discrete function), and then we get hierarchical texture sums (in the mips), and then we extend it to 3D textures and the true fun with voxel hierarchy integration starts.
    This would solve more lighting equation performance and precision problems than what's brought up currently. What's being done now on GPUs is a four-decade-old technique, and you can predict how this goes fairly well just by looking at the history of offline raytracing tech. It would also raise programmer productivity instantly, as all the MoBlur, DoF, Bloom, SSS and myriad other effects, which are sum-based, would suddenly be a couple of lines long instead of 4k code blobs.
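To make the "hierarchical texture sums (in the mips)" idea concrete, here is a minimal software sketch. It is only an illustration of the arithmetic: the function names `sum_mips` and `masked_sum` are hypothetical, the texture is assumed to be square with a power-of-two size, and nothing here corresponds to an existing hardware feature or API.

```python
def sum_mips(level0):
    """Build a pyramid where each texel is the SUM (not the average)
    of the 2x2 block beneath it. Assumes a square, power-of-two texture."""
    mips = [level0]
    while len(mips[-1]) > 1:
        src = mips[-1]
        n = len(src) // 2
        mips.append([
            [src[2 * y][2 * x] + src[2 * y][2 * x + 1]
             + src[2 * y + 1][2 * x] + src[2 * y + 1][2 * x + 1]
             for x in range(n)]
            for y in range(n)
        ])
    return mips


def masked_sum(texture, mask):
    """Sum of texels weighted by a second texture of the same size (the mask)."""
    return sum(t * m
               for t_row, m_row in zip(texture, mask)
               for t, m in zip(t_row, m_row))


tex = [[1, 2, 3, 4],
       [5, 6, 7, 8],
       [9, 10, 11, 12],
       [13, 14, 15, 16]]
mips = sum_mips(tex)

# Mask that selects only the top-left 2x2 block.
mask = [[1, 1, 0, 0],
        [1, 1, 0, 0],
        [0, 0, 0, 0],
        [0, 0, 0, 0]]
top_left = masked_sum(tex, mask)
```

The top mip holds the integral over the whole texture, and each texel of the intermediate level holds the integral over one quadrant, so a masked sum aligned to a quadrant can be read back with a single fetch from the pyramid.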
     
    milk and JoeJ like this.
  6. JoeJ

    Veteran Newcomer

    Joined:
    Apr 1, 2018
    Messages:
    1,053
    Likes Received:
    1,239
    Nice idea. Personally I see two applications: one is building env map probes for GI, the other is fluid simulations. But in both cases it would only help me if it works with fragments of texture living in LDS memory.
     
    Ethatron likes this.
  7. Lurkmass

    Newcomer

    Joined:
    Mar 3, 2020
    Messages:
    227
    Likes Received:
    226
    Some reverse engineering on RDNA2 raytracing implementation:

    https://gitlab.freedesktop.org/bnieuwenhuizen/mesa/-/wikis/RDNA2-hardware-BVH
    https://gitlab.freedesktop.org/bnieuwenhuizen/mesa/-/wikis/Raytracing
    https://github.com/GPUOpen-LibrariesAndSDKs/AGS_SDK/commit/4f2442c13eec0e8cb4feeef1185861ae99a13a9d

    Putting this all together, could these raytracing hit tokens represent a stack? It would potentially imply that AMD is exposing explicit functionality with their AGS driver extension to store the stack information in LDS memory for non-inline ray tracing ...
     
    JoeJ, Lightman, Ethatron and 3 others like this.
  8. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    11,266
    Likes Received:
    1,524
    Location:
    London
    Ooh, some interesting facts there:
    • Box nodes on RDNA2 can have 4 boxes (and hence children), and come in fp16 and fp32 variants. The fp16 variant is smaller (only 1 cacheline) but has all the precision and range drawbacks from using fp16
    • Noteworthy is that this [fp16 AABBs] node is exactly a 64-byte line without any holes so it is hard to fit any extra information in there
    • 32-bit node is 112 bytes with 16 unused
    It appears that there is no compression of node data.
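The byte counts quoted above can be reproduced with some simple arithmetic. The sketch below assumes a layout that is not confirmed in the thread: four 32-bit child IDs followed by four axis-aligned boxes of six floats each (min xyz, max xyz). Under that assumption the totals match the quoted 64 bytes for fp16 and 112 bytes plus 16 unused for fp32.

```python
import struct

CHILDREN = 4     # four boxes (and hence children) per node, as quoted above
CACHELINE = 64   # bytes

# Assumed layout: 4 x uint32 child IDs, then one AABB (6 floats) per child.
# 'e' is a 16-bit half float, 'f' is a 32-bit float; '<' disables padding.
FP16_NODE = struct.Struct("<4I" + "6e" * CHILDREN)
FP32_NODE = struct.Struct("<4I" + "6f" * CHILDREN)


def padded(size, align=CACHELINE):
    """Round size up to the next multiple of align."""
    return (size + align - 1) // align * align


fp16_bytes = FP16_NODE.size                    # 16 + 4 * 12
fp32_bytes = FP32_NODE.size                    # 16 + 4 * 24
fp32_unused = padded(fp32_bytes) - fp32_bytes  # slack up to two cachelines

print(fp16_bytes, fp32_bytes, fp32_unused)
```

With this layout the fp16 node fills one cacheline exactly, which is consistent with the remark that there is no room for extra information in it.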
     
    Krteq, Rodéric, Lightman and 2 others like this.
  9. Ethatron

    Regular Subscriber

    Joined:
    Jan 24, 2010
    Messages:
    913
    Likes Received:
    347
    One thing he forgot in his rejection of stackless traversal is that he can interleave custom data with the node data, so after each node there's unlimited private space until the next node, where he can put the parent pointer. Just pretend the private space is also nodes and omit the linear ids occupied by them (e.g. just use node_ids divisible by 2, 3, 4, 5 and so on).

    Edit: Intel presented short stacks at HPG'19, that is pretty amazing: https://www.embree.org/papers/2019-HPG-ShortStack.pdf
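The interleaving idea in this post can be sketched as follows. This is a simplified model, not the real hardware encoding: it assumes node IDs map linearly onto 64-byte slots, and the class and function names (`InterleavedBVH`, `node_slot`, `private_slot`) are hypothetical. Only even slots hold real nodes; the odd slot after each node is private space that stores a parent pointer.

```python
import struct

SLOT_BYTES = 64  # one fp16 box node per slot (size taken from the thread)
STRIDE = 2       # only every 2nd slot is a real node; the rest is private


def node_slot(i):
    """Slot index of the i-th real node (always divisible by STRIDE)."""
    return i * STRIDE


def private_slot(i):
    """Slot index of the private space that follows the i-th real node."""
    return node_slot(i) + 1


class InterleavedBVH:
    """Flat buffer of node slots interleaved with private slots."""

    def __init__(self, node_count):
        self.buf = bytearray(node_count * STRIDE * SLOT_BYTES)

    def set_parent(self, i, parent):
        # Store the parent's node index at the start of node i's private slot.
        struct.pack_into("<I", self.buf, private_slot(i) * SLOT_BYTES, parent)

    def parent(self, i):
        return struct.unpack_from("<I", self.buf, private_slot(i) * SLOT_BYTES)[0]


bvh = InterleavedBVH(node_count=3)
bvh.set_parent(2, 1)  # node 2's parent is node 1
```

A stackless traversal could then walk back up the tree by reading `bvh.parent(i)` instead of popping a stack, at the cost of doubling the buffer size.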
     
    #49 Ethatron, Jan 6, 2021
    Last edited: Jan 6, 2021
  10. andermans

    Newcomer

    Joined:
    Sep 11, 2020
    Messages:
    24
    Likes Received:
    38

    A problem is that the nodes have to be 64-byte aligned, so if you add data after the fp16-box node you essentially get a 128-byte structure. Might as well use the fp32-box node then (assuming no performance difference in processing of fp16/fp32 in the RT hardware). Of course you could put them in a second array to avoid the alignment cost, but then you have data in multiple cachelines, which is going to be hard on the memory hierarchy.
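The trade-off described here comes down to a few lines of arithmetic. The sketch below assumes a 4-byte parent pointer as the extra per-node data (a hypothetical choice for illustration) and uses the node sizes quoted earlier in the thread.

```python
ALIGN = 64          # required node alignment in bytes, per the post above
FP16_NODE = 64      # fp16 box node size quoted in the thread
FP32_NODE_USED = 112  # fp32 box node payload quoted in the thread
PARENT_PTR = 4      # assumed size of the extra per-node data


def align_up(size, align=ALIGN):
    """Round size up to the next multiple of align."""
    return (size + align - 1) // align * align


# Option 1: append the pointer to the fp16 node, then re-align.
interleaved_stride = align_up(FP16_NODE + PARENT_PTR)

# Option 2: use the fp32 node, which has spare bytes after its payload.
fp32_stride = align_up(FP32_NODE_USED)
fp32_spare = fp32_stride - FP32_NODE_USED

# Option 3: keep fp16 nodes packed and store pointers in a second array.
side_array_bytes_per_node = FP16_NODE + PARENT_PTR

print(interleaved_stride, fp32_stride, fp32_spare, side_array_bytes_per_node)
```

Options 1 and 2 cost the same 128 bytes per node, and the fp32 node already has room for the pointer in its unused bytes. Option 3 is the smallest in total, but each node visit touches two separate cachelines, which is the memory-hierarchy concern raised in the post.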
     
  11. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    11,266
    Likes Received:
    1,524
    Location:
    London
    It looks like we have an ISA sample of the inner loop for BVH traversal from a TraceRays shader call:

    [IMG]

    which comes from:

    Radeon™ GPU Profiler 1.9 introduces support for Radeon™ RX 6000 Series - GPUOpen

    Amongst the statistics reported for ISA is "Stack size", alongside things like LDS or Instruction cost. I've not seen stack size reported as an explicit statistic before (I'm years out of touch though). I suspect this is merely the call stack for functions that are assembled into multiple ISA functions and has nothing to do with a stack used for BVH traversal.
     
    Lightman and BRiT like this.
  12. cho

    cho
    Regular

    Joined:
    Feb 9, 2002
    Messages:
    418
    Likes Received:
    11
    19 clocks? It's 15 clocks in my case.
     
    Lightman and BRiT like this.
  13. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    11,266
    Likes Received:
    1,524
    Location:
    London
    The page I linked seems to imply that there's a variety of code that's produced, so perhaps not that surprising.

    I'm unclear if the ISA in the picture is from a SphereIntersect or something else, which adds to the uncertainty.
     