AMD Radeon RDNA2 Navi (RX 6500, 6600, 6700, 6800, 6900 XT)

Why exponential? Pretty much the majority of the BVH is just going to be the raw triangle data, which would be a lower bound anyway (40 bytes for 9 floats + the triangle ID).
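As a purely illustrative lower bound (my own sketch, not AMD's actual node format), a leaf triangle record storing raw positions plus an ID would look roughly like this:

Code:
// Hypothetical 40-byte leaf triangle record - illustrative only, not a real driver layout.
#include <cstdint>

struct TriangleRecord {
    float    v0[3], v1[3], v2[3]; // 9 floats of raw position data = 36 bytes
    uint32_t triangleId;          // + 4 bytes of ID = 40 bytes per triangle
};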

The leaf BLAS doesn't necessarily contain raw triangle data, but rather a range of indices referencing vertex/index buffers (stored outside the BVH).

D3D12_RAYTRACING_GEOMETRY_TRIANGLES_DESC (d3d12.h) - Win32 apps | Microsoft Docs

Code:
typedef struct D3D12_RAYTRACING_GEOMETRY_TRIANGLES_DESC {
  D3D12_GPU_VIRTUAL_ADDRESS            Transform3x4;
  DXGI_FORMAT                          IndexFormat;
  DXGI_FORMAT                          VertexFormat;
  UINT                                 IndexCount;
  UINT                                 VertexCount;
  D3D12_GPU_VIRTUAL_ADDRESS            IndexBuffer;
  D3D12_GPU_VIRTUAL_ADDRESS_AND_STRIDE VertexBuffer;
} D3D12_RAYTRACING_GEOMETRY_TRIANGLES_DESC;
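For illustration, the application just points this desc at vertex/index buffers it already owns, and the runtime consumes it as the input to the BLAS build. A minimal sketch (the buffer objects and counts are hypothetical app-side names):

Code:
#include <d3d12.h>

// Describe triangle geometry for a BLAS build. The vertex/index data itself stays
// in the app's own buffers; this struct only carries GPU addresses and formats.
D3D12_RAYTRACING_GEOMETRY_DESC DescribeTriangles(ID3D12Resource* vertexBuffer,
                                                 ID3D12Resource* indexBuffer,
                                                 UINT vertexCount, UINT indexCount)
{
    D3D12_RAYTRACING_GEOMETRY_DESC geom = {};
    geom.Type  = D3D12_RAYTRACING_GEOMETRY_TYPE_TRIANGLES;
    geom.Flags = D3D12_RAYTRACING_GEOMETRY_FLAG_OPAQUE;
    geom.Triangles.Transform3x4 = 0;                           // no per-geometry transform
    geom.Triangles.IndexFormat  = DXGI_FORMAT_R16_UINT;
    geom.Triangles.VertexFormat = DXGI_FORMAT_R32G32B32_FLOAT; // 12-byte float3 positions
    geom.Triangles.IndexCount   = indexCount;
    geom.Triangles.VertexCount  = vertexCount;
    geom.Triangles.IndexBuffer  = indexBuffer->GetGPUVirtualAddress();
    geom.Triangles.VertexBuffer.StartAddress  = vertexBuffer->GetGPUVirtualAddress();
    geom.Triangles.VertexBuffer.StrideInBytes = sizeof(float) * 3;
    return geom;
}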

For a packing with N triangles (and assuming each box node has at least 2 children) you need N triangle nodes + N/2 box nodes + N/4 box nodes and so on, coming to N triangle nodes + (N-1) box nodes.
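A quick sketch of that count for an arbitrary branching factor (my own illustration, not driver code):

Code:
// Count interior (box) nodes for N single-triangle leaves and branching factor k.
// k = 2 reproduces the N - 1 figure above; k = 4 cuts it to roughly N/3.
static unsigned boxNodes(unsigned leaves, unsigned k)
{
    unsigned boxes = 0;
    while (leaves > 1) {
        leaves = (leaves + k - 1) / k; // ceil(leaves / k) parents at the level above
        boxes += leaves;
    }
    return boxes; // e.g. N = 8, k = 2: 4 + 2 + 1 = 7 = N - 1
}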

Right. With N approaching 100k for character models in modern titles, you're looking at a huge reduction in node count simply by allowing multiple triangles in leaf nodes. Not only is your memory footprint reduced significantly, but your tree is shallower and traversal is faster. Given AMD is doing traversal on shaders, it's very unlikely that they're building very deep trees.

One caveat is triangle intersection performance. It seems AMD's implementation is 4x faster at intersecting box nodes than triangles. In that case maybe they would benefit overall from deeper trees that minimize triangle intersection testing, but I doubt it.
 

I think that is the input to the BVH building process though? The triangle nodes definitely contain the raw position data after the BVH is built.

The factor 2 was only meant as a minimal number of children to get an upper bound, but I'd indeed hope AMD on average gets closer to 4 children. Even so, the tree will still be comparatively deep. For example, Intel can do 6 children per box node, which results in a shallower tree.
 
I think that is the input to the BVH building process though?

That's right, DXR has no visibility into the contents of the acceleration structure. It doesn't even know that it's a BVH.

The triangle nodes definitely contain the raw position data after the BVH is built.

Why do you say definitely? Is that based on knowledge of AMD's internal representation? Judging from this Nvidia patent, for example, it's not definitive at all that the BVH contains raw triangle data (it may hold just pointers to it).

"Again, a leaf node is a node that is associated with zero child nodes and includes an element of the data represented by the tree data structure. For example, a leaf node may include a pointer or pointers to one or more geometric primitives of a 3D model."

The factor 2 was only meant as a minimal number of children to get an upper bound, but I'd indeed hope AMD on average gets closer to 4 children. Even so, the tree will still be comparatively deep. For example, Intel can do 6 children per box node, which results in a shallower tree.

Yes, BVH4 helps.
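As a rough back-of-the-envelope (one triangle per leaf, ~100k triangles): depth scales like log_k(N), so a binary tree needs about log2(100,000) ≈ 17 levels, a 4-wide tree about 8-9, and a 6-wide tree about 6-7, before multi-triangle leaves shave anything further off.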
 
Crytek's software RT is tested in the Neon Noir demo: at 4K the 3090 is 40% faster than the 6900 XT, and the 3080 is 20% faster.


Yikes, not much of an excuse for AMD there if it's all running on the shaders. Looks like the Infinity Cache just can't keep both the rasterization and RT stages fed.
To paraphrase the immortal Jerry Sanders and the very non-PC statement he made years ago: "Real cards have real memory bandwidth"

For those unfamiliar - https://www.eetimes.com/real-men-have-fabsor-do-they/

... it's even slightly worse upon second inspection when you consider that the 3070 is beating all three Navi cards in the 1% lows (worst case), given that Big Navi has more physical memory bandwidth than the 3070 to play with.
 
Or Ampere's FP32 advantage actually showing up in a compute heavy workload.

That is true and a good point, although the 6800 XT and 6900 XT both have a theoretical FP32 TFLOPS advantage over the 3070 in their pocket as well, not to mention an enormous theoretical advantage in ROP/TMU/FP16 throughput.
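For rough context, using boost-clock figures (so treat these as ballpark numbers): 3070 ≈ 5888 FP32 lanes × 2 × ~1.73 GHz ≈ 20.3 TFLOPS, 6800 XT ≈ 4608 × 2 × ~2.25 GHz ≈ 20.7 TFLOPS, 6900 XT ≈ 5120 × 2 × ~2.25 GHz ≈ 23.0 TFLOPS.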
 
The Radeons also have twice the peak INT32 throughput of the 3070.

I tried pointing Nvidia's DX11 profiler at Neon Noir but it threw up with some error about a pure function call.
 
Yes, this was my initial thought too.

Interesting to note that Nanite is supposed to scale quite well relative to compute power, so Ampere may well shine in UE5-based games that use the tech.

Nanite is also so fast that it's probably dominated by texturing anyway, and the UE5 demo's performance was dominated by their SDFGI, not Nanite. So Nanite performance isn't really that interesting, as it's apparently so efficient.

The software RT demo on the other hand does show interesting numbers, but I'd hesitate to pin down exactly what it's bottlenecked by. The performance differences between each GPU scale almost perfectly regardless of resolution, so I'm not sure it's bandwidth, as you'd expect the 6800 to catch up with the XT variants. But at the same time it might be latency issues: if the acceleration structure isn't present in the cache, the high clock speeds of RDNA2 might be relatively useless compared to Ampere's ultra-wide architecture allowing more parallel accesses in flight, which makes the most sense to me for chasing pointers through memory.
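To illustrate what I mean by chasing pointers (a generic dependent-load sketch, nothing to do with either vendor's actual traversal code):

Code:
#include <cstdint>

// Each read's address is only known once the previous read has completed, so the loop
// is paced by memory latency rather than raw bandwidth; a wider machine can at least
// keep more independent chains (rays) in flight at once.
struct Node {
    const Node* next;  // stand-in for "which child to visit next"
    uint32_t    payload;
};

uint32_t chase(const Node* n)
{
    uint32_t sum = 0;
    while (n) {
        sum += n->payload; // stalls until the dependent load of n has arrived
        n = n->next;       // no streaming / prefetch-friendly access pattern here
    }
    return sum;
}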
 
I think it would be useful if reviewers attached a CPU/GPU utilization median and spread next to each FPS min/max.
 
Wouldn't put too much thought into Neon Noir since it's using D3D11, which can negatively affect how the shaders are optimized by the compiler. The fxc compiler is deprecated, so every vendor now does a translation from DXBC to DXIL, which adds overhead, and DXBC is also a suboptimal match for AMD HW/drivers too ...
 
Not sure about Big Navi, but on Vega the Neon Noir RT demo is heavily bottlenecked somewhere not really related to compute or (possibly) memory bandwidth - it barely goes above 200 W in consumption (and fps stays the same even if I tank the memory clock by 30%), while in various games the same settings usually produce 300 W+. Same goes for World of Tanks RT: super high clocks and very modest power consumption. I guess they never bothered optimizing for GCN/RDNA and their architectural peculiarities.
 

Finally got time to look at it, and my first impression is that, in general, SAM potentially helps more when there is a memory bottleneck. Hence why it generally has a greater benefit at higher resolutions.

This might also help explain why it shows regressions at lower resolutions in some titles: with less memory pressure, whatever extra work is being done to facilitate direct CPU access to GPU memory becomes a net detriment rather than a benefit, while at higher resolutions, where memory pressure is greater, it mostly evens back out again.

Just a theory.

Regards,
SB
 