RDNA 2 Ray Tracing

I just thought of something else: they store the whole BVH in a texture, so we could use the virtual texturing feature to stream BLAS very easily. And with their support for large BVHs there is no need to resolve pointer indirection.
I really think consoles will go that route. Still hoping for it on PC too...
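Roughly what I have in mind, as a hedged sketch — treating the BVH buffer as virtually paged the way a virtual texture treats its texels. All names and the page-table scheme here are hypothetical illustrations, not AMD's actual mechanism:

```cpp
#include <cstdint>
#include <vector>

// Hypothetical sketch: node fetches go through a page table; non-resident
// pages are queued for streaming from disk, like PRT tile requests.
constexpr uint32_t kPageSize     = 64 * 1024;           // 64 KiB pages
constexpr uint32_t kNodeSize     = 64;                  // one cacheline per node
constexpr uint32_t kNodesPerPage = kPageSize / kNodeSize;

struct BvhPageTable {
    std::vector<uint32_t> physicalPage;  // virtual page -> physical page, or ~0u
    std::vector<uint32_t> streamQueue;   // pages to fetch in the background

    // Returns a physical node address, or ~0u if the page is still on disk.
    uint32_t Translate(uint32_t nodeId) {
        uint32_t vpage = nodeId / kNodesPerPage;
        uint32_t ppage = physicalPage[vpage];
        if (ppage == ~0u) {              // "page fault": request the BLAS chunk
            streamQueue.push_back(vpage);
            return ~0u;                  // caller falls back, e.g. to a coarser LOD
        }
        return ppage * kNodesPerPage + (nodeId % kNodesPerPage);
    }
};
```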
You're thinking 10 or 20GB of BVH kept on disk? Streaming in a few hundred MB as needed? LOD strategies?

Sounds a bit like photon mapping? However, the locality you aim for is a must-have IMO, because modern hardware really wants it. Things like QuakeRTX or Metro fail to exploit it, and path tracing misses it in general. That's a big reason why this classical RT with DXR is so fishy for games.
All of these techniques, done in real time, require a huge amount of denoising. So when I talk about opportunistic bounce-ray generation, I'm expecting it to produce a noisy result. In general with real-time ray tracing, we're seeing very low sample counts per pixel.
 
You're thinking 10 or 20GB of BVH kept on disk? Streaming in a few hundred MB as needed? LOD strategies?
Yeah, but there might just not be enough storage, depending on the game.
But then we could at least build 'nearby' BVH during startup or as a background task, and manage a cache on disk to avoid constant rebuilds of the same models as they pop in and out.
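Something like the following rough sketch of that disk cache — BuildBvh and the on-disk format are stand-ins of mine, not a real API:

```cpp
#include <cstdint>
#include <filesystem>
#include <fstream>
#include <iterator>
#include <vector>

// Placeholder for the engine's own (expensive) background BVH build.
std::vector<uint8_t> BuildBvh(uint64_t modelHash);

// Key each built BVH by a hash of the model, so a model popping back into
// view reloads its serialized BVH instead of being rebuilt from scratch.
std::vector<uint8_t> LoadOrBuildBvh(uint64_t modelHash,
                                    const std::filesystem::path& cacheDir) {
    auto path = cacheDir / (std::to_string(modelHash) + ".bvh");
    if (std::filesystem::exists(path)) {
        std::ifstream in(path, std::ios::binary);
        return std::vector<uint8_t>(std::istreambuf_iterator<char>(in), {});
    }
    std::vector<uint8_t> bvh = BuildBvh(modelHash);
    std::ofstream out(path, std::ios::binary);
    out.write(reinterpret_cast<const char*>(bvh.data()),
              static_cast<std::streamsize>(bvh.size()));
    return bvh;
}
```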
In general with real time ray tracing, we're seeing very low sample counts per pixel.
Yes, but this maximizes incoherence.
For GI we can avoid this. Examples would be DDGI probes, where we have many rays originating from the same probe center. So I thought you were thinking about doing the same, but making the emitter the center instead of the receiver.
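To illustrate why this is coherent: all N rays share one origin, and smoothly distributed directions mean neighbouring rays traverse similar parts of the BVH. A minimal sketch of spherical-Fibonacci directions from one probe center (my own illustration, not the DDGI reference code):

```cpp
#include <cmath>
#include <vector>

struct Vec3 { float x, y, z; };

// N unit directions, evenly spread over the sphere. The emitter-centric
// variant would just swap which end of the path owns the shared origin.
std::vector<Vec3> ProbeRayDirections(int n) {
    const float golden = (1.0f + std::sqrt(5.0f)) * 0.5f;
    std::vector<Vec3> dirs(n);
    for (int i = 0; i < n; ++i) {
        float t   = i / golden;
        float phi = 2.0f * 3.14159265f * (t - std::floor(t));
        float z   = 1.0f - (2.0f * i + 1.0f) / n;   // uniform in cos(theta)
        float r   = std::sqrt(1.0f - z * z);
        dirs[i] = { r * std::cos(phi), r * std::sin(phi), z };
    }
    return dirs;
}
```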
 
You failed to notice my sarcasm in the other post, but I don't fail to detect yours here.

Then I did a good job :)
Of course DXR needs refining, just like any other technology that arrived recently. I think it will be OK, just like any other 'great/huge' new feature out there now. All of these combined should help a lot in making up for the 'lack' of generational jumps these days.
 
I'm kind of surprised there are no hardware-based numerical integrators; it should be one of the more decisive cases where hardware just beats software by some orders of magnitude. Let's say we could get sums out of texture fetches instead of averages/filtered values, then masked sums (either by another texture, or by a polynomial/spherical discrete function), then hierarchical texture sums (in the mips), and then extend it to 3D textures, where the true fun with voxel hierarchy integration starts.
This would solve more lighting-equation performance and precision problems than anything brought up currently. What's being done on GPUs now is a four-decade-old technique, and you can predict how this goes fairly well just by looking at offline raytracing tech history. It would also raise programmer productivity instantly, as MoBlur, DoF, Bloom, SSS and the myriad of other sum-based effects would suddenly be a couple of lines instead of 4k code blobs.
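To make the "hierarchical texture sums" idea concrete, here is a minimal CPU sketch of a sum-preserving mip chain (my illustration, assuming power-of-two dimensions — not an existing API):

```cpp
#include <vector>

// Build a mip chain that stores sums rather than averages, so integrating
// a power-of-two region becomes a handful of fetches instead of a loop
// over texels. A hardware integrator could do the same reduction in the
// sampler, with masks or weights applied per fetch.
std::vector<std::vector<float>> BuildSumMips(std::vector<float> level0,
                                             int w, int h) {
    std::vector<std::vector<float>> mips{ std::move(level0) };
    while (w > 1 && h > 1) {
        const auto& src = mips.back();
        std::vector<float> dst((w / 2) * (h / 2));
        for (int y = 0; y < h / 2; ++y)
            for (int x = 0; x < w / 2; ++x)
                dst[y * (w / 2) + x] =               // sum, not average:
                    src[(2 * y    ) * w + 2 * x] + src[(2 * y    ) * w + 2 * x + 1] +
                    src[(2 * y + 1) * w + 2 * x] + src[(2 * y + 1) * w + 2 * x + 1];
        mips.push_back(std::move(dst));
        w /= 2; h /= 2;
    }
    return mips;
}
// The top 1x1 mip is the integral of the whole texture; a box-filtered
// bloom or DoF gather then reads one texel per region instead of thousands.
```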
 
I'm kind of surprised there are no hardware based numerical integrators
Nice idea. Personally I see two applications: one is building env-map probes for GI, the other is fluid simulations. But in both cases it would only help me if it worked on fragments of a texture living in LDS memory.
 
Some reverse engineering of the RDNA2 raytracing implementation:

https://gitlab.freedesktop.org/bnieuwenhuizen/mesa/-/wikis/RDNA2-hardware-BVH
https://gitlab.freedesktop.org/bnieuwenhuizen/mesa/-/wikis/Raytracing
https://github.com/GPUOpen-LibrariesAndSDKs/AGS_SDK/commit/4f2442c13eec0e8cb4feeef1185861ae99a13a9d

Putting this all together, could these raytracing hit tokens represent a stack? It would potentially imply that AMD is exposing explicit functionality with their AGS driver extension to store the stack information in LDS memory for non-inline ray tracing ...
 
Ooh, some interesting facts there:
  • Box nodes on RDNA2 can have 4 boxes (and hence 4 children), and come in fp16 and fp32 variants. The fp16 variant is smaller (only 1 cacheline) but has all the precision and range drawbacks of using fp16
  • Noteworthy is that this [fp16 AABB] node is exactly a 64-byte line without any holes, so it is hard to fit any extra information in there
  • The 32-bit (fp32) node is 112 bytes with 16 bytes unused
It appears that there is no compression of node data.
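For illustration, here is a layout that is consistent with the sizes quoted above; the field order is my guess, and the linked Mesa wiki documents the real one:

```cpp
#include <cstdint>

// 4 child indices + 4 fp16 AABBs fill exactly one 64-byte cacheline,
// while the fp32 variant needs 112 bytes and pads out to 128.
struct Fp16BoxNode {                 // 64 bytes, no holes
    uint32_t child[4];               // 16 bytes of child indices
    uint16_t box[4][6];              // 4 AABBs * 6 fp16 bounds = 48 bytes
};
static_assert(sizeof(Fp16BoxNode) == 64);

struct Fp32BoxNode {                 // 112 bytes used...
    uint32_t child[4];               // 16 bytes
    float    box[4][6];              // 96 bytes
    uint32_t pad[4];                 // ...plus 16 unused, padding to 128
};
static_assert(sizeof(Fp32BoxNode) == 128);
```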
 
It would potentially imply that AMD is exposing explicit functionality with their AGS driver extension to store the stack information in LDS memory for non-inline ray tracing ...

One thing he forgot in his rejection of stackless is that he can interleave custom data with the node data, so after each node there's unlimited private space until the next node, where he can put the parent pointer. Just pretend the private space is also nodes, and omit the linear ids occupied by them (e.g. only use node_ids divisible by 2, or by 3, 4, 5 and so on, depending on how much private space is needed).

Edit: Intel presented short stacks at HPG'19, which is pretty amazing: https://www.embree.org/papers/2019-HPG-ShortStack.pdf
 
One thing he forgot in his rejection of stackless is that he can interleave custom data with the node data, so after each node there's unlimited private space until the next node, where he can put the parent pointer. Just pretend the private space is also nodes, and omit the linear ids occupied by them (e.g. only use node_ids divisible by 2, or by 3, 4, 5 and so on, depending on how much private space is needed).

Edit: Intel presented short stacks at HPG'19, which is pretty amazing: https://www.embree.org/papers/2019-HPG-ShortStack.pdf


A problem is that the nodes have to be 64-byte aligned, so if you add data after the fp16 box node you essentially get a 128-byte structure. Might as well use the fp32 box node then (assuming no performance difference between fp16 and fp32 processing in the RT hardware). Of course you could put the extra data in a second array to avoid the alignment issue, but then you have data in multiple cachelines, which is going to be hard on the memory hierarchy.
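To make the interleaving scheme under discussion concrete — a sketch with hypothetical helper names, which also shows why the effective stride doubles to 128 bytes:

```cpp
#include <cstdint>

// Real nodes occupy the even 64-byte slots; the odd slot right after each
// node is "private space" (e.g. a parent pointer for stackless backtracking).
// The hardware only ever sees the even ids, so the private slots are
// invisible to traversal -- at the cost of the 128-byte stride objected to above.
constexpr uint32_t kSlotBytes = 64;

uint64_t NodeAddress(uint64_t base, uint32_t nodeId) {
    return base + uint64_t(2 * nodeId) * kSlotBytes;        // even slots: nodes
}
uint64_t PrivateAddress(uint64_t base, uint32_t nodeId) {
    return base + uint64_t(2 * nodeId + 1) * kSlotBytes;    // odd slots: custom data
}
```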
 
It looks like we have an ISA sample of the inner loop for BVH traversal from a TraceRays shader call:

[image: ISA sample of the BVH traversal inner loop]


which comes from:

Radeon™ GPU Profiler 1.9 introduces support for Radeon™ RX 6000 Series - GPUOpen

Amongst the statistics reported for the ISA is "Stack size", alongside things like LDS usage or instruction cost. I've not seen stack size reported as an explicit statistic before (I'm years out of touch, though). I suspect this is merely the call stack for functions that are assembled into multiple ISA functions, and has nothing to do with a stack used for BVH traversal.
 
19 clocks? It's 15 clocks in my case.
The page I linked seems to imply that a variety of code can be produced, so perhaps that's not surprising.

I'm unclear if the ISA in the picture is from a SphereIntersect or something else, which adds to the uncertainty.
 
For a few weeks now, on and off, I've been playing with the GPSnoopy RayTracingInVulkan application:

[chart: RayTracingInVulkan r7 summary on the 6900 XT]


Here I've charted the scenes at 8 samples with varying bounces:

[chart: scene comparison at 8 samples with varying bounces]



I've used giga rays per second as the metric, instead of frames per second from the application, to make comparisons easier.
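For reference, this is roughly how I converted the application's frames per second into giga rays per second. The assumption that each path traces (bounces + 1) ray segments, including the primary ray, is mine and may not match the app's internals exactly:

```cpp
#include <cstdint>

// Rough model: rays/frame = pixels * samples * path segments.
double GigaRaysPerSecond(uint32_t width, uint32_t height,
                         uint32_t samples, double avgBounces, double fps) {
    double raysPerFrame = double(width) * height * samples * (avgBounces + 1.0);
    return raysPerFrame * fps / 1e9;
}
```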

I have also provided the spreadsheet xlsx here which will make viewing the data easier:

https://github.com/JawedAshraf/B3D/raw/master/RayTracingInVulkan.r7/RayTracingInVulkan.r7 6900XT.xlsx

There are lots more graphs in the document.

A key factor that's generally hard to pin down in a ray tracing benchmark like this is the length of each ray path, measured in bounces.

Looking at the numbers, I'd guess the average bounce counts per scene are as follows:
  1. 3 bounces
  2. 3 bounces
  3. 4 bounces
  4. 6 bounces
  5. 10 bounces
A serious problem with the benchmark scenes is that the Cornell box in scenes 4 and 5 effectively captures only about 60% of the rays. This really makes a direct comparison of ray rates in scene 3 versus scene 5 invalid.

Scenes 4 and 5 are more comparable. The difference in the average bounce counts is hard to disentangle from the difference in triangle counts.

Scene 3, when compared with scene 2, gets relatively slower and slower as the bounce counts are allowed to go higher.
 
Here is a comparison of the "heatmap" for scenes 4 and 5, using the same "scaling factor" of 3.16:

[image: scene 4 heatmap]

[image: scene 5 heatmap]

What's peculiar, looking at these, are the "regular rectangles" that look like artefacts of scheduling on the GPU. There are 5 very strongly defined vertical stripes inside the box in the first picture, and remnants of them can be seen in the second. Additionally, smaller, horizontally arranged rectangles can be seen from top to bottom.

As resolution increases, the rectangles remain about the same size; in other words, there are more of them.
 

Still climbing a steep learning curve.

I think it's interesting that BVH builders often have no limit on the depth they will construct to. A graph of performance showed a case where the stack was 79 deep, which is really going some.

AMD's own traversal algorithm appears to use a 16-entry short stack.
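For illustration, a minimal sketch of the short-stack discipline in the spirit of the HPG'19 paper linked earlier: a fixed 16-entry stack that drops its oldest entries on overflow, with the caller restarting from the root if it ever pops an empty stack after a drop. The details here are my assumptions, not AMD's actual traversal code:

```cpp
#include <cstdint>

constexpr int kStackSize = 16;

struct ShortStack {
    uint32_t entry[kStackSize];
    int top = 0;        // entries currently held
    int dropped = 0;    // entries lost to overflow

    void Push(uint32_t node) {
        if (top == kStackSize) {                 // overflow: forget the oldest
            for (int i = 1; i < kStackSize; ++i) entry[i - 1] = entry[i];
            --top;
            ++dropped;
        }
        entry[top++] = node;
    }
    // Returns false when empty; if anything was dropped, the caller must
    // restart traversal from the root (culling by the current hit distance)
    // to stay correct.
    bool Pop(uint32_t& node) {
        if (top == 0) return false;
        node = entry[--top];
        return true;
    }
};
```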
 