GART: Games and Applications using RayTracing

Status
Not open for further replies.
Oh, interesting! Where do you have this information from?
The intersection instruction is documented in ISA docs, but if they had traversal units, there would be no direct need to expose such instruction to compute?
You need to access the ray accelerators somehow I assume.
 
You need to access the ray accelerators somehow I assume.
hmm, could be the instruction performs traversal over multiple tree levels, not just a single one...?
But no, ISA doc says this:
8.2.10. Ray Tracing
Ray Tracing support includes the following instructions:
• IMAGE_BVH_INTERSECT_RAY
• IMAGE_BVH64_INTERSECT_RAY
These instructions receive ray data from the VGPRs and fetch BVH (Bounding Volume Hierarchy) from memory.
• Box BVH nodes perform 4x Ray/Box intersection, sorts the 4 children based on intersection distance and returns the child pointers and hit status.
• Triangle nodes perform 1 Ray/Triangle intersection test and returns the intersection point and triangle ID.
The bolded parts make clear the instruction processes one level only. Returning sorted children only makes sense if we add them to a traversal stack. So both traversal and stack management are left to the calling code.


The difference between NV and AMD is in triangle hits, where the RT core can decide what to do with rays on its own, while RDNA2 has to run a shader
This also makes little sense from NV's perspective: once we have a ray hit, we always want to run custom code. In DXR 1.0 that's what the hit shaders are for, which are regular single-threaded programs like pixel or vertex shaders.
If the RT core could decide what to do after the hit, it would need to execute this program on its own, and NV would need to implement a mini SM inside their RT core to run such a general-purpose program and get its result, which might be tracing another ray, but also generic stuff like writing to VRAM.
No way they do that. Hit shaders are surely executed on the SM like any other shaders. The SM is built for that, not the RT cores.
 
You're confusing ray traversal with hit evaluation. RDNA2 does hit evaluation (what to do when a ray hits a BVH volume or a triangle) on shading h/w, but ray traversal is handled by dedicated RT h/w.
In the tech briefing with Mike Mantor and Andrew Pomianowski, AMD wrote explicitly on the slide:
"[...]
- The Ray Accelerator handles intersection of rays within the BVH, and sorting of ray intersection times.
- It provides an order of magnitude increase in intersection performance compared to a software implementation.
- Traversal of the BVH and shading of ray results is handled by shader code running on the Compute Units.[...]"
[actually their bold]
I am not aware of any update to this bit of information.
 
- Traversal of the BVH and shading of ray results is handled by shader code running on the Compute Units.[...]"
"Traversal" is hit evaluation here, deciding what to do with a ray once it hits something. AMD's RT h/w returns hits to the shader on each crossing of a BVH node or triangle; NV's RT h/w does this only when a triangle is hit (although there is a way to make it work in a similar way with any hit, I believe).
Intersection of the BVH is still handled by the ray accelerators - what's the point of having them if this is not what they are doing?
 
"Traversal" is hit evaluation here, deciding what to do with a ray once it hits something.
Pretty sure you interpret the term 'traversal' wrong, causing some confusion on your side.
This is how I see it, using the simplest example of a shadow ray:

Code:
// regular shader code
Ray ray;
ray.origin = shadingPoint;
ray.direction = lightPos - shadingPoint;
ray.length = length(ray.direction);
ray.direction /= ray.length;

bool miss = true;

// traversal loop. NV RT core processes the whole loop; AMD still does it on the regular shader
Stack stack;
stack.push(rootNode);
while (!stack.empty())
{
    Node node = stack.pop();
    if (isLeaf(node))
    {
        if (IntersectRayTriangle(ray, node.triangle)) // AMD intersection instruction
        {
            miss = false;
            break;
        }
        continue; // leaf nodes have no children to push
    }
    foreach (Node childNode : node.children)
    {
        if (IntersectRayBBox(ray, childNode.bbox)) // AMD intersection instruction
        {
            stack.push(childNode);
        }
    }
}
// traversal done. NV RT core returns hit or miss result
// regular shader continues processing

if (miss)
{
    pixel += light;
}

AFAIK AMD never said anywhere they have more than those intersection instructions, while NV said their RT cores do the traversal loop as well.
Looking at Radeon Rays we see they tested a lot of traversal loop variants: short stack, stackless, etc. Likely these were used to implement DXR, but with the intersection math replaced by this new RDNA2 instruction.
It's pretty interesting they get good enough performance with this simple solution. It's slower but future-proof. Curious how Intel will handle this...
 
[EDIT] There is also the "overlapping" factor to be considered, as some of the RT computations can be done concurrently with traditional rendering work, but we have even less information on this, and it is probably reasonable to assume the overlapping portion is more or less proportional to the non-overlapping portion.

Adding to this, even if we made a synthetic benchmark only about RT and nothing else, etc., we would still need to take die area into account too.
Let's say Turing is 10 times faster in our benchmark. That's a solid number then, but obviously AMD's intersection instruction takes just a fraction of the die area, which AMD then has available for other tasks, no matter if RT is used or not. Conversely, the NV RT core becomes just a nice cooling pad while no RT is happening.

So it's really hard to come up with fair numbers. In the end it's still total fps for games which matters more than RT perf in isolation.
 
AFAIK AMD never said anywhere they have more than those intersection instructions, while NV said their RT cores do the traversal loop as well.
Looking at Radeon Rays we see they tested a lot of traversal loop variants: short stack, stackless, etc. Likely these were used to implement DXR, but with the intersection math replaced by this new RDNA2 instruction.
It's pretty interesting they get good enough performance with this simple solution. It's slower but future-proof. Curious how Intel will handle this...

It’s also about more than just the traversal implementation. If Nvidia’s implementation is anything like their patents, there’s also a dedicated memory subsystem feeding the RT hardware. I wonder how AMD is handling caching and compression of the BVH.
 
I wonder how AMD is handling caching and compression of the BVH.
AFAIK, BVH is accessed through the TMU memory path, so it's treated like texture memory. They have two versions of the instruction: one is 64-bit to support very large BVHs, the other is then the 'compressed' variant. But I can't remember if the box coords are fp16 or fp32. The ISA docs would tell this as well. Infinity Cache likely helps with caching.
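To get a feel for why a 'compressed' variant matters for the memory path, here is some back-of-the-envelope arithmetic for a 4-wide box node. The layout (6 coords per AABB plus one child pointer per child) is purely an illustrative assumption, not AMD's actual node format:

```python
# Hypothetical node sizes for a 4-wide BVH box node.
# The layout is an assumption for illustration, not AMD's documented format.

def box_node_bytes(coord_bytes, children=4, ptr_bytes=4):
    # each child: an AABB = 2 corners * 3 coords, plus a child pointer
    aabb = 2 * 3 * coord_bytes
    return children * (aabb + ptr_bytes)

fp32_node = box_node_bytes(coord_bytes=4)   # full-precision boxes
fp16_node = box_node_bytes(coord_bytes=2)   # 'compressed' half-precision boxes

print(fp32_node, fp16_node)  # 112 64
```

Under these assumptions the fp16 variant roughly halves the bytes fetched per node, which is exactly the kind of saving a texture-path fetch (and the Infinity Cache) would benefit from.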

I have never seen an NV patent. Does it mention some reordering / sorting of hits spatially or some stuff like that? Would be interesting. I remember one or two papers from people trying to draw conclusions from synthetic test cases for Turing, and they concluded NV likely has such reordering. Would be very advanced then.
 
AFAIK, BVH is accessed through the TMU memory path, so it's treated like texture memory. They have two versions of the instruction: one is 64-bit to support very large BVHs, the other is then the 'compressed' variant. But I can't remember if the box coords are fp16 or fp32. The ISA docs would tell this as well. Infinity Cache likely helps with caching.

I have never seen an NV patent. Does it mention some reordering / sorting of hits spatially or some stuff like that? Would be interesting. I remember one or two papers from people trying to draw conclusions from synthetic test cases for Turing, and they concluded NV likely has such reordering. Would be very advanced then.

NV has a bunch of new RT patents starting around 2018/2019. Good chance they describe the Turing implementation. I'm pretty sure we discussed them in this forum at length a while back. It's been a while since I read them, but I don't recall any mention of reordering or sorting.

In these patents the RT core is referred to as the TTU (tree traversal unit).

US20160071310A1 - Beam tracing
US20160071313A1 - Relative encoding for a block-based bounding volume hierarchy
US10866990B2 - Block-based lossless compression of geometric data
US10825230B2 - Watertight ray triangle intersection
US10235338B2 - Short stack traversal of tree data structures
US10580196B1 - Method for continued bounding volume hierarchy traversal on intersection without shader intervention

SM is a client of the TTU:
[Image: nvidia-TTU.png]


TTU internals showing caches and intersection units:
[Image: nvidia-TTU-detail.png]
 
AFAIK, BVH is accessed through the TMU memory path, so it's treated like texture memory. They have two versions of the instruction: one is 64-bit to support very large BVHs, the other is then the 'compressed' variant. But I can't remember if the box coords are fp16 or fp32. The ISA docs would tell this as well. Infinity Cache likely helps with caching.

https://gitlab.freedesktop.org/bnieuwenhuizen/mesa/-/wikis/RDNA2-hardware-BVH
https://gitlab.freedesktop.org/bnieuwenhuizen/mesa/-/wikis/Raytracing
https://blog.froggi.es/bringing-vulkan-raytracing-to-older-amd-hardware/
 

At least some groundwork for reordering seems done. The block compression patent divides the BVH into branches, which reminds me of Nvidia's 'treelets' paper. (Can't find it anymore, but guys like Laine / Karras were involved IIRC.)
My own plan for software reordering was to make such branches which fit into LDS, bin sets of rays to them, and then brute-force intersect all this without VRAM access in the inner loops. It would solve the divergent memory access to the BVH. And as I understood it, the treelets paper was about the same idea. (Making such beams to bound sets of rays for quick rejections would probably be a nice optimization here too.)
However, while we solve the BVH memory issues, the rays themselves now become a big issue on the other hand. For optimal reordering, we would need to rebin rays in every traversal iteration. So they move from registers into VRAM, requiring a huge prefix sum for the binning, and then optionally reordering them in memory to get nice packets for the next step. Super heavy, and no win for a software implementation I guess. For RT cores it would mean rays have to move from one core to another. Likely one core alone has not enough rays in flight to make reordering a win either.
There might be a proper compromise in doing the reorder only every Nth iteration, and using only sets of rays small enough that all work remains on chip and no VRAM round trip for binning / reordering is needed.
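The rebinning step described above is essentially a counting sort over treelet IDs. A minimal sketch, with `treelet_of` as a hypothetical stand-in for whatever maps a ray's next node to a branch that fits in LDS (on a GPU the offsets would come from a parallel prefix sum):

```python
# Sketch of rebinning: group rays by the treelet (BVH branch) they visit
# next, via a counting sort. 'treelet_of' is a hypothetical mapping.

def bin_rays(rays, treelet_of, num_treelets):
    counts = [0] * num_treelets
    for r in rays:
        counts[treelet_of(r)] += 1
    # exclusive prefix sum gives each bin's start offset
    offsets, total = [], 0
    for c in counts:
        offsets.append(total)
        total += c
    out, cursor = [None] * len(rays), list(offsets)
    for r in rays:
        b = treelet_of(r)
        out[cursor[b]] = r
        cursor[b] += 1
    return out, offsets

# toy example: (ray name, treelet id)
rays = [("r0", 2), ("r1", 0), ("r2", 1), ("r3", 0)]
sorted_rays, offsets = bin_rays(rays, lambda r: r[1], 3)
print(sorted_rays)  # rays grouped by treelet: r1, r3, r2, r0
```

The two full passes over the ray set (count, then scatter) are exactly the VRAM round trip the post worries about when the ray count exceeds what fits on chip.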

No matter what - with proper reordering the traceRay function is no longer some atomic small thing, but becomes a global workload like a big compute dispatch.
DXR 1.0 seems not really designed for this; inline tracing would just break it, and potential traversal shaders would go out of reach as well (even more so). So again I conclude there is no reordering yet. Also it just doesn't seem worth it yet either.

Interesting for my requests: I would need to handle their block compression on my side. So that's a vendor specialization, but not an unexpected problem. It's even likely other vendors would do the same.


VALU budget
Per CU per cycle the GPU can process 1 BVH node (aka 1 lane of BVH) and 64 lanes of VALU instructions.

This answers my question about pipelining. I did not know if the instruction is parallel or serial, but considering the unit seems tiny, it's likely serial.
So when the CU calls the instruction, the wavefront goes idle and another wavefront takes the SIMD. After the RT unit is done with all 32/64 intersections, the wavefront can continue.
Now, AMD could easily make RT faster just by making the unit wider for future GPUs? Traversal and stack stuff would not be affected. This gives me some hope they stick with this flexible solution. :)
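Taking the quoted budget (1 BVH node vs. 64 VALU lanes per CU per cycle) at face value, the serial interpretation gives a rough latency figure. This is only an assumption-driven sketch, not a documented timing:

```python
# Rough latency math for the serial interpretation above (an assumption,
# not a documented figure): a wavefront of N lanes, each needing one BVH
# node intersection, fed through a unit handling 'nodes_per_cycle' nodes.

def intersect_cycles(wave_lanes, nodes_per_cycle):
    # ceil division: cycles until the whole wavefront's requests are done
    return -(-wave_lanes // nodes_per_cycle)

print(intersect_cycles(64, 1))  # 64 cycles; the wavefront sleeps meanwhile
print(intersect_cycles(64, 2))  # 32 cycles if a future unit were twice as wide
```

This is why switching to another wavefront while waiting makes sense, and why simply widening the unit would speed up the intersection part without touching the traversal or stack code at all.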
 
Forza Horizon 5 Uses Audio Ray Tracing To Make "The World Feel Alive" | SegmentNext
July 13, 2021
Forza Horizon 5 will be using ray tracing to achieve not only realistic lighting but also high-definition audio for increased immersion.
...
During a new Let’s iGo! episode (via VG247) earlier today, creative director Mike Brown and lead audio designer Fraser Strachan discussed how developer Playground Games designed a system to send ray traces all over the in-game world of Forza Horizon 5 to detect objects in real time.

The received data is then used to deliver an ultra-enhanced audio to give players an impression that sounds are actually bouncing off objects as they pass by.

Forza Horizon 5 will feature all types of environments and audio ray tracing will ensure that players are able to detect those environments with just sound. Foliage for example will give off a more muffled sound compared to buildings which will sound off in a more meaty and profound manner.
 
https://developer.nvidia.com/blog/new-ray-tracing-sdk-improves-memory-allocation-for-games/
https://developer.nvidia.com/blog/reducing-acceleration-structure-memory-with-nvidia-rtxmu/

+ https://github.com/NVIDIAGameWorks/RTXMU

I don't really get this paragraph though:
When enabling compaction on NVIDIA and AMD HW, the memory savings on NVIDIA HW is much improved compared to AMD. NVIDIA ends up being on average 3.26x smaller than AMD for acceleration structure memory when enabling compaction. The reason for such a huge reduction in memory footprint on NVIDIA is that AMD without compaction uses double the memory as is when compared to NVIDIA. Compaction then also reduces the NVIDIA memory by another 50% on average while AMD tends to reduce memory only by 75%.
Is this due to how AMD drivers handle BVH structures memory footprint?
Also wouldn't 75% of 200% be ~75% compared to 50% which isn't really 3.26x smaller?
 
https://developer.nvidia.com/blog/new-ray-tracing-sdk-improves-memory-allocation-for-games/
https://developer.nvidia.com/blog/reducing-acceleration-structure-memory-with-nvidia-rtxmu/

+ https://github.com/NVIDIAGameWorks/RTXMU

I don't really get this paragraph though:

Is this due to how AMD drivers handle BVH structures memory footprint?
Also wouldn't 75% of 200% be ~75% compared to 50% which isn't really 3.26x smaller?
I do not want to guess at their numbers there, but it would imply that the acceleration structure in the AMD driver is more bloated, I guess.
 
Is this due to how AMD drivers handle BVH structures memory footprint?
Also wouldn't 75% of 200% be ~75% compared to 50% which isn't really 3.26x smaller?
Applying Occam's Razor, the numbers roughly work out when you calculate with "only to 75%" here: Compaction then also reduces the NVIDIA memory by another 50% on average while AMD tends to reduce memory only by 75%.

Start with Nvidia using 2 MB; compaction removes half: 1 MB. AMD uses twice as much, so 4 MB; compaction to 75% leaves 3 MB, roughly a factor of three.
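The arithmetic behind the "only to 75%" reading, using the same example numbers:

```python
# Checking the "only to 75%" reading with the numbers from the post (MB).
nv_before, amd_before = 2.0, 4.0     # AMD uses ~2x NVIDIA pre-compaction
nv_after = nv_before * 0.5           # compaction removes half on NVIDIA
amd_after = amd_before * 0.75        # compaction down *to* 75% on AMD
print(amd_after / nv_after)          # 3.0, in the ballpark of the blog's 3.26x
```

A factor of 3.0 with round example numbers is close enough to the blog's measured 3.26x average, whereas reading "by 75%" (i.e. down to 25%) would make AMD end up smaller than NVIDIA, contradicting the blog's conclusion.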
 
Applying Occam's Razor, the numbers roughly work out when you calculate with "only to 75%" here: Compaction then also reduces the NVIDIA memory by another 50% on average while AMD tends to reduce memory only by 75%.

Start with Nvidia using 2 MB; compaction removes half: 1 MB. AMD uses twice as much, so 4 MB; compaction to 75% leaves 3 MB, roughly a factor of three.
So it's a typo then which would make sense.
 

I'm not sure I understand the need for this SDK or developer intervention. Under DXR the acceleration structure is a black box so why wouldn't Nvidia just implement this suballocation scheme in their driver? AMD obviously already supports compaction as part of the standard api. Nvidia is adding suballocation as an additional enhancement. Question is why wouldn't they just do this in the driver since they have access to the necessary inputs (i.e. size of each BLAS).

"Suballocation tells a slightly different story here in which scenes with many small acceleration structures like Zero Day benefit greatly. The average memory savings from suballocation ends up being 123 MB but the standard deviation is rather large at 153 MB. From this data, we can assert that suballocation is highly dependent on the scene geometry and benefits from thousands of small triangle count BLAS geometry."
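Why scenes with thousands of tiny BLASes benefit so much: if each standalone acceleration structure allocation gets rounded up to a large placement granularity (64 KB is a common D3D12 figure, used here purely as an illustrative assumption), the padding dominates for small BLASes, and packing them into shared pools recovers it:

```python
# Sketch of the suballocation saving: assume each standalone allocation is
# padded to a 64 KB granularity (an illustrative assumption), while a
# suballocator packs BLASes back to back inside big pool buffers.

GRANULARITY = 64 * 1024

def standalone_bytes(blas_sizes):
    round_up = lambda s: -(-s // GRANULARITY) * GRANULARITY
    return sum(round_up(s) for s in blas_sizes)

def suballocated_bytes(blas_sizes, alignment=256):
    round_up = lambda s: -(-s // alignment) * alignment
    return sum(round_up(s) for s in blas_sizes)

# 1000 small BLASes of ~10 KB each, as in a Zero Day-like scene
sizes = [10 * 1024] * 1000
saved = standalone_bytes(sizes) - suballocated_bytes(sizes)
print(saved // (1024 * 1024))  # ~52 MB saved just from padding
```

Under these assumptions the savings scale with the BLAS count rather than the geometry size, which matches the blog's observation that the benefit is highly scene-dependent (123 MB average, 153 MB standard deviation).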
 