GART: Games and Applications using RayTracing

Discussion in 'Rendering Technology and APIs' started by BRiT, Jan 1, 2019.

  1. DegustatoR

    DegustatoR Veteran

    You need to access the ray accelerators somehow I assume.
     
    PSman1700 likes this.
  2. JoeJ

    JoeJ Veteran

    hmm, could be the instruction performs traversal over multiple tree levels, not just a single one...?
    But no, ISA doc says this:
    The bolded parts make clear the instruction processes one level only. Returning sorted children only makes sense if we add them to a traversal stack, so both traversal and stack management are left to the calling code.


    This also makes little sense from NV's perspective: once we have a ray hit, we always want to run custom code. In DXR 1.0 that's what the hit shaders are for, which are regular single threaded programs like pixel or vertex shaders.
    If the RT core could decide what to do after the hit, it would need to execute this program on its own, and NV would have to implement a mini SM inside their RT core to run such a general purpose program, whose result might be tracing another ray, but also generic stuff like writing to VRAM.
    No way they do that. Hit shaders are surely executed on the SM like any other shader. The SM is built for that, not the RT cores.
     
    Jawed likes this.
  3. CarstenS

    CarstenS Legend Subscriber

    In the tech briefing with Mike Mantor and Andrew Pomianowski, AMD wrote explicitly on the slide:
    "[...]
    - The Ray Accelerator handles intersection of rays within the BVH, and sorting of ray intersection times.
    - It provides an order of magnitude increase in intersection performance compared to a software implementation.
    - Traversal of the BVH and shading of ray results is handled by shader code running on the Compute Units. [...]"
    [actually their bold]
    I am not aware of any update to this bit of information.
     
    Jawed likes this.
  4. DegustatoR

    DegustatoR Veteran

    "Traversal" is hit evaluation here, deciding what to do with a ray once it hits something. AMD's RT h/w returns hits to the shader on each crossing of a BVH node or triangle; NV's RT h/w does this only when a triangle is hit (although there is a way to make it work in a similar way with any-hit shaders, I believe).
    Intersection against the BVH is still handled by the ray accelerators - what would be the point of having them if this is not what they are doing?
     
    PSman1700 likes this.
  5. JoeJ

    JoeJ Veteran

    Pretty sure you're interpreting the term 'traversal' wrongly, which causes some confusion on your side.
    This is how I see it, using the simplest example of a shadow ray:

    Code:
    // regular shader code
    Ray ray;
    ray.origin = shadingPoint;
    ray.direction = lightPos - shadingPoint;
    ray.length = length(ray.direction);
    ray.direction /= ray.length;
    
    bool miss = true;
    
    // traversal loop. NV RT core processes the whole loop; AMD does it still on regular shader
    Stack stack;
    stack.push(rootNode);
    while (!stack.empty())
    {
        Node node = stack.pop();
        if (isLeaf(node))
        {
            if (IntersectRayTriangle(ray, node.triangle)) // AMD intersection instruction
            {
                miss = false;
                break;
            }
            continue; // a leaf has no children to visit
        }
        foreach (Node childNode : node.children)
        {
            if (IntersectRayBBox(ray, childNode.bbox)) // AMD intersection instruction
            {
                stack.push(childNode);
            }         
        }
    }
    // traversal done. NV RT core returns hit or miss result
    // regular shader continues processing
    
    if (miss)
    {
        pixel += light;
    }
    AFAIK AMD never said anywhere that they have more than those intersection instructions, while NV said their RT cores do the traversal loop as well.
    Looking at Radeon Rays we see they tested a lot of traversal loop variants: short stack, stackless, etc. Likely those were used to implement DXR, with the intersection math replaced by the new RDNA2 instruction.
    It's pretty interesting that they get good enough performance with this simple solution. It's slower but more future proof. Curious how Intel will handle this...
     
  6. JoeJ

    JoeJ Veteran

    Adding to this: even if we made a synthetic benchmark that is only about RT and nothing else, we would still need to take die area into account.
    Let's say Turing is 10 times faster in our benchmark. That's a solid number then, but AMD's intersection instruction obviously takes just a fraction of the die area, which AMD then has available for other tasks, no matter if RT is used or not. Conversely, NV's RT core becomes just a nice cooling pad while no RT is happening.

    So it's really hard to come up with fair numbers. In the end it's still total fps for games which matters more than RT perf in isolation.
     
    CarstenS likes this.
  7. DegustatoR

    DegustatoR Veteran

    Yeah, maybe. There isn't a lot of info around on how the h/w handles this, so many things are just left to interpretation.
     
    JoeJ likes this.
  8. trinibwoy

    trinibwoy Meh Legend

    It’s also about more than just the traversal implementation. If Nvidia’s implementation is anything like their patents, there’s also a dedicated memory subsystem feeding the RT hardware. I wonder how AMD is handling caching and compression of the BVH.
     
    PSman1700 likes this.
  9. JoeJ

    JoeJ Veteran

    AFAIK, the BVH is accessed through the TMU's memory path, so it's treated like texture memory. There are two versions of the instruction: one is a 64-bit variant to support very large BVHs, the other is the 'compressed' variant. But I can't remember if the box coords are fp16 or fp32; the ISA docs would tell this as well. Infinity Cache likely helps with caching.
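    If the compressed variant really does store box coords at reduced precision, the encoder has to round conservatively so the quantized box still encloses the original geometry. A rough sketch of that idea, assuming fp16-style boxes (my assumption about the scheme, not the documented format; values must be within fp16 range):

```python
import struct

def to_f16(x):
    # Round-trip a float through IEEE half precision (round to nearest).
    return struct.unpack('e', struct.pack('e', x))[0]

def quantize_aabb_f16(box_min, box_max):
    # Conservative quantization: wherever round-to-nearest pulled a bound
    # inward, nudge it back outward by roughly one ulp, so the fp16 box
    # always contains the original fp32 box (a fat box only costs a few
    # false-positive traversal steps; a thin one would miss geometry).
    qmin, qmax = [], []
    for v in box_min:
        q = to_f16(v)
        while q > v:
            q = to_f16(q - max(abs(q) * 2.0**-10, 2.0**-24))
        qmin.append(q)
    for v in box_max:
        q = to_f16(v)
        while q < v:
            q = to_f16(q + max(abs(q) * 2.0**-10, 2.0**-24))
        qmax.append(q)
    return qmin, qmax
```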

    I have never read an NV patent. Do they mention some reordering / sorting of hits spatially, or stuff like that? Would be interesting. I remember one or two papers from people trying to draw conclusions from synthetic test cases on Turing, and they concluded NV likely has such reordering. That would be very advanced.
     
  10. trinibwoy

    trinibwoy Meh Legend

    NV has a bunch of new RT patents starting around 2018/2019. Good chance they describe the Turing implementation. I'm pretty sure we discussed them at length in this forum a while back. It's been a while since I read them, but I don't recall any mention of reordering or sorting.

    In these patents the RT core is referred to as the TTU (tree traversal unit).

    US20160071310A1 - Beam tracing
    US20160071313A1 - Relative encoding for a block-based bounding volume hierarchy
    US10866990B2 - Block-based lossless compression of geometric data
    US10825230B2 - Watertight ray triangle intersection
    US10235338B2 - Short stack traversal of tree data structures
    US10580196B1 - Method for continued bounding volume hierarchy traversal on intersection without shader intervention

    SM is a client of the TTU:
    [IMG]

    TTU internals showing caches and intersection units:
    [IMG]
     
    Last edited: Jul 5, 2021
    Krteq, jlippo, HLJ and 4 others like this.
  11. Ethatron

    Ethatron Regular Subscriber

    https://gitlab.freedesktop.org/bnieuwenhuizen/mesa/-/wikis/RDNA2-hardware-BVH
    https://gitlab.freedesktop.org/bnieuwenhuizen/mesa/-/wikis/Raytracing
    https://blog.froggi.es/bringing-vulkan-raytracing-to-older-amd-hardware/
     
    Krteq, CarstenS and JoeJ like this.
  12. JoeJ

    JoeJ Veteran

    At least some groundwork for reordering seems done. The block compression patent divides the BVH into branches, which reminds me of Nvidia's 'treelets' paper. (Can't find it anymore, but guys like Laine / Karras were involved, iirc.)
    My own plan for software reordering was to build such branches that fit into LDS, bin sets of rays to them, and then brute-force intersect all of this without VRAM access in the inner loops. That would solve the divergent memory access to the BVH, and as I understood it, the treelets paper was about the same idea. (Making such beams to bound sets of rays for quick rejection would probably be a nice optimization here too.)
    However, while this solves the BVH memory issues, the rays now become the big issue instead. For optimal reordering we would need to rebin the rays in every traversal iteration. So they move from registers into VRAM, requiring a huge prefix sum for the binning, and then optionally a reorder in memory to get nice packets for the next step. Super heavy, and no win for a software implementation, I guess. For RT cores it would mean rays have to move from one core to another. Likely one core alone doesn't have enough rays in flight to make reordering a win either.
    There might be a proper compromise in doing the reorder only every Nth iteration, and using sets of rays small enough that all work remains on chip and no VRAM roundtrip for binning / reordering is needed.
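    The rebinning step described above could be sketched as a counting sort keyed on treelet id, built from a histogram plus a prefix sum. This is a hypothetical serial sketch (bin_rays and the treelet ids are made-up names); a GPU version would run the histogram, scan and scatter in parallel:

```python
def bin_rays(treelet_of_ray, num_treelets):
    # Histogram: how many rays want each treelet.
    counts = [0] * num_treelets
    for t in treelet_of_ray:
        counts[t] += 1
    # Exclusive prefix sum: start offset of each treelet's ray packet.
    offsets = [0] * (num_treelets + 1)
    for b in range(num_treelets):
        offsets[b + 1] = offsets[b] + counts[b]
    # Scatter ray indices into their packets.
    cursor = list(offsets[:num_treelets])
    order = [0] * len(treelet_of_ray)
    for i, t in enumerate(treelet_of_ray):
        order[cursor[t]] = i
        cursor[t] += 1
    # Rays order[offsets[b]:offsets[b+1]] all traverse treelet b together.
    return order, offsets
```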

    No matter what - with proper reordering, the traceRay function is no longer some small atomic thing, but becomes a global workload like a big compute dispatch.
    DXR 1.0 seems not really designed for this, inline tracing would just break it, and potential traversal shaders would move even further out of reach. So again I conclude there is no reordering yet. It also just doesn't seem worth it yet.

    Interesting for my requests: I would need to handle their block compression on my side. So that's a vendor specialization, but not an unexpected problem. It's even likely other vendors would do the same.

    This answers my question about pipelining. I did not know if the instruction is parallel or serial, but considering the unit seems tiny, it's likely serial.
    So when the CU issues the instruction, the wavefront goes idle and another wavefront takes the SIMD. After the RT unit is done with all 32/64 intersections, the wavefront can continue.
    Could AMD then make RT faster in future GPUs simply by making the unit wider? Traversal and stack handling would not be affected. This gives me some hope they stick with this flexible solution. :)
     
  13. Ethatron

    Ethatron Regular Subscriber

    It's as pipelined as a texture fetch.
     
    JoeJ, Krteq and BRiT like this.
  14. pharma

    pharma Veteran

    July 8, 2021
     
    PSman1700 likes this.
  15. pharma

    pharma Veteran

    Forza Horizon 5 Uses Audio Ray Tracing To Make "The World Feel Alive" | SegmentNext
    July 13, 2021
     
    Dictator, pjbliverpool, Krteq and 3 others like this.
  16. DegustatoR

    DegustatoR Veteran

    https://developer.nvidia.com/blog/new-ray-tracing-sdk-improves-memory-allocation-for-games/
    https://developer.nvidia.com/blog/reducing-acceleration-structure-memory-with-nvidia-rtxmu/

    + https://github.com/NVIDIAGameWorks/RTXMU

    I don't really get this paragraph though:
    Is this due to how AMD drivers handle BVH structures memory footprint?
    Also wouldn't 75% of 200% be ~75% compared to 50% which isn't really 3.26x smaller?
     
    Last edited: Jul 19, 2021
    Krteq, PSman1700 and pharma like this.
  17. Dictator

    Dictator Regular

    I do not want to guess at their numbers there, but it would imply that the acceleration structure in the AMD driver is more bloated.
     
  18. CarstenS

    CarstenS Legend Subscriber

    Applying Occam's Razor, the numbers roughly work out if you read it as "only to 75%" here: compaction then reduces NVIDIA's memory by another 50% on average, while on AMD compaction only reduces memory to 75%.

    Start with Nvidia using 2 MB; compaction removes half: 1 MB. AMD uses twice as much, so 4 MB; compaction to 75% leaves 3 MB, roughly a factor of three.
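    In code form, with that hypothetical 2 MB NVIDIA baseline, the reading gives:

```python
nv_before = 2.0                  # hypothetical NVIDIA BLAS footprint, MB
amd_before = 2.0 * nv_before     # AMD at roughly twice the footprint
nv_after = nv_before * 0.50      # compaction removes ~50% on NVIDIA
amd_after = amd_before * 0.75    # compaction "only to 75%" on AMD
print(amd_after / nv_after)      # 3.0, in the ballpark of the quoted ~3.26x
```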
     
    BRiT and DegustatoR like this.
  19. DegustatoR

    DegustatoR Veteran

    So it's a typo then, which would make sense.
     
  20. trinibwoy

    trinibwoy Meh Legend

    I'm not sure I understand the need for this SDK or for developer intervention. Under DXR the acceleration structure is a black box, so why wouldn't Nvidia just implement this suballocation scheme in their driver? AMD obviously already supports compaction as part of the standard API, and Nvidia is adding suballocation as an additional enhancement. The question is why they wouldn't just do this in the driver, since they have access to the necessary inputs (i.e. the size of each BLAS).

    "Suballocation tells a slightly different story here in which scenes with many small acceleration structures like Zero Day benefit greatly. The average memory savings from suballocation ends up being 123 MB but the standard deviation is rather large at 153 MB. From this data, we can assert that suballocation is highly dependent on the scene geometry and benefits from thousands of small triangle count BLAS geometry."
     