AMD Radeon RDNA2 Navi (RX 6500, 6600, 6700, 6800, 6900 XT)

T2098 · Dec 30, 2020

Some good SAM benchmarks here, including with some older cards:
https://www.computerbase.de/2020-12/amd-smart-access-memory-test-radeon-rx-6800/

The 5700XT numbers in Cyberpunk + 6800XT numbers in Tomb Raider are nuts (they're close to the bottom)
Interesting to see in Tomb Raider's profiler how all the gains are on the CPU side.

trinibwoy · Dec 31, 2020

andermans said:
Why exponential? Pretty much the majority of the BVH is just going to be the raw triangle data, which would be a lower bound anyway (40 bytes for 9 floats + the triangle id).

The leaf BLAS doesn't necessarily contain raw triangle data but a range of indices referencing vertex/index buffers (stored outside the BVH).

D3D12_RAYTRACING_GEOMETRY_TRIANGLES_DESC (d3d12.h) - Win32 apps | Microsoft Docs

Code:

typedef struct D3D12_RAYTRACING_GEOMETRY_TRIANGLES_DESC {
  D3D12_GPU_VIRTUAL_ADDRESS            Transform3x4;
  DXGI_FORMAT                          IndexFormat;
  DXGI_FORMAT                          VertexFormat;
  UINT                                 IndexCount;
  UINT                                 VertexCount;
  D3D12_GPU_VIRTUAL_ADDRESS            IndexBuffer;
  D3D12_GPU_VIRTUAL_ADDRESS_AND_STRIDE VertexBuffer;
} D3D12_RAYTRACING_GEOMETRY_TRIANGLES_DESC;

For a packing with N triangles (and assuming each box node has at least 2 children) you need N triangle nodes + N/2 box nodes + N/4 box nodes, etc., coming to N triangle nodes + at most N-1 box nodes.

Right. With N approaching 100k for character models in modern titles you're looking at a huge reduction in node count simply by allowing multiple triangles in leaf nodes. Not only is your memory footprint reduced significantly but your tree is shallower and traversal is faster. Given AMD is doing traversal on shaders it's very unlikely that they're building very deep trees.

One caveat is triangle intersection performance. It seems AMD's implementation is 4x faster at intersecting nodes than triangles. In this case maybe they would benefit overall from deeper trees that minimize triangle intersection testing but I doubt it.
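
The node-count argument above can be checked with a few lines of arithmetic. This is only a sketch under stated assumptions: a strictly binary tree, leaves packed completely full, and a hypothetical leaf size of 4 triangles that is not taken from any AMD documentation. The function names are made up for illustration.

```c
#include <assert.h>

/* Bytes for one raw triangle as quoted in the thread:
   9 floats (three xyz vertices) + a 32-bit triangle id = 40 bytes. */
#define TRIANGLE_BYTES (9 * 4 + 4)

/* Leaves needed when each leaf holds up to tris_per_leaf triangles.
   Assumption: leaves are filled completely, which a real builder
   optimizing for traversal cost will not generally achieve. */
unsigned long leaf_count(unsigned long n_tris, unsigned long tris_per_leaf) {
    assert(tris_per_leaf > 0);
    return (n_tris + tris_per_leaf - 1) / tris_per_leaf; /* ceiling division */
}

/* Total nodes in a binary BVH: L leaves plus L - 1 box nodes. */
unsigned long binary_bvh_nodes(unsigned long n_tris, unsigned long tris_per_leaf) {
    unsigned long leaves = leaf_count(n_tris, tris_per_leaf);
    if (leaves == 0) {
        return 0; /* empty mesh, no nodes */
    }
    return leaves + (leaves - 1);
}

/* Lower bound on BVH size if every triangle is stored raw in the tree. */
unsigned long raw_triangle_bytes(unsigned long n_tris) {
    return n_tris * (unsigned long)TRIANGLE_BYTES;
}
```

For a 100k-triangle model, binary_bvh_nodes(100000, 1) gives 199999 nodes, while binary_bvh_nodes(100000, 4) gives 49999, and raw_triangle_bytes(100000) comes to 4,000,000 bytes for the triangle data alone.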

trinibwoy · Dec 31, 2020

Frenetic Pony said:
Turns out it's 4 per leaf, which makes sense. (RDNA2 docs)

The ISA doc mentions 4 children per box node but doesn't really go into the structure of triangle (leaf) nodes.

andermans · Dec 31, 2020

trinibwoy said:
The leaf BLAS doesn't necessarily contain raw triangle data but a range of indices referencing vertex/index buffers (stored outside the BVH).

D3D12_RAYTRACING_GEOMETRY_TRIANGLES_DESC (d3d12.h) - Win32 apps | Microsoft Docs

Code:

typedef struct D3D12_RAYTRACING_GEOMETRY_TRIANGLES_DESC {
  D3D12_GPU_VIRTUAL_ADDRESS            Transform3x4;
  DXGI_FORMAT                          IndexFormat;
  DXGI_FORMAT                          VertexFormat;
  UINT                                 IndexCount;
  UINT                                 VertexCount;
  D3D12_GPU_VIRTUAL_ADDRESS            IndexBuffer;
  D3D12_GPU_VIRTUAL_ADDRESS_AND_STRIDE VertexBuffer;
} D3D12_RAYTRACING_GEOMETRY_TRIANGLES_DESC;

Right. With N approaching 100k for character models in modern titles you're looking at a huge reduction in node count simply by allowing multiple triangles in leaf nodes. Not only is your memory footprint reduced significantly but your tree is shallower and traversal is faster. Given AMD is doing traversal on shaders it's very unlikely that they're building very deep trees.

One caveat is triangle intersection performance. It seems AMD's implementation is 4x faster at intersecting nodes than triangles. In this case maybe they would benefit overall from deeper trees that minimize triangle intersection testing but I doubt it.

I think that is the input to the BVH building process though? The triangle nodes definitely contain the raw position data after the BVH is built.

The factor 2 was only meant as a minimal number of children to get an upper bound, but I'd indeed hope AMD on average gets closer to 4 children. Though the tree will still be comparatively deep. For example Intel can do 6 children/box node, which will result in a tree that is less deep.
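
To put rough numbers on the depth comparison, here is a minimal sketch that computes the best-case depth for a given box-node width. It assumes a perfectly balanced tree in which every box node is completely full, so real trees will be deeper. The function name is hypothetical, and the widths 2, 4 and 6 are simply the ones mentioned in this post.

```c
#include <assert.h>

/* Minimum number of box-node levels needed to reach n_leaves leaves
   when every box node has exactly `width` children.
   Assumption: perfectly balanced, fully packed tree (a best case). */
unsigned min_depth(unsigned long n_leaves, unsigned long width) {
    assert(width >= 2);
    unsigned depth = 0;
    unsigned long reachable = 1; /* leaves reachable at the current depth */
    while (reachable < n_leaves) {
        reachable *= width;
        depth++;
    }
    return depth;
}
```

With 100,000 leaves this gives 17 levels for 2 children per box node, 9 levels for 4, and 7 levels for 6, so under these idealized assumptions going from 4-wide to 6-wide saves about two levels.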

trinibwoy · Dec 31, 2020

andermans said:
I think that is the input to the BVH building process though?

That's right, DXR has no visibility into the contents of the acceleration structure. It doesn't even know that it's a BVH.

The triangle nodes definitely contain the raw position data after the BVH is built.

Why do you say definitely? Is that based on knowledge of AMD's internal representation? E.g. from this Nvidia patent it's not definitive at all that the BVH contains raw triangle data (just pointers to it).

"Again, a leaf node is a node that is associated with zero child nodes and includes an element of the data represented by the tree data structure. For example, a leaf node may include a pointer or pointers to one or more geometric primitives of a 3D model."

The factor 2 was only meant as a minimal number of children to get an upper bound, but I'd indeed hope AMD on average gets closer to 4 children. Though the tree will still be comparatively deep. For example Intel can do 6 children/box node, which will result in a tree that is less deep.

Yes, BVH4 helps.

Wesker · Jan 4, 2021

Some cheap rumours, but still may be interesting to discuss:
https://videocardz.com/newz/amd-navi-2x-xtxh-and-nashira-point-gpus-spotted

DavidGraham · Jan 4, 2021

Crytek's software RT has been tested in the Neon Noir demo: the 3090 is 40% faster than the 6900 XT at 4K, and the 3080 is 20% faster.

T2098 · Jan 4, 2021

DavidGraham said:
Crytek's software RT has been tested in the Neon Noir demo: the 3090 is 40% faster than the 6900 XT at 4K, and the 3080 is 20% faster.

Yikes, not much of an excuse for AMD there if it's all running on the shaders. Looks like the infinity cache just can't keep up with keeping both the rasterization and RT stages fed.
To paraphrase the immortal Jerry Sanders and the very non-PC statement he made years ago: "Real cards have real memory bandwidth."

For those unfamiliar - https://www.eetimes.com/real-men-have-fabsor-do-they/

... it's even slightly worse upon second inspection when you think that the 3070 is beating all 3 Navi cards in the 1% lows (worst case), given that Big Navi has more physical memory bandwidth than the 3070 does to play with.

DegustatoR · Jan 4, 2021

T2098 said:
Looks like the infinity cache just can't keep up with keeping both the rasterization and RT stages fed.

Or Ampere's FP32 advantage actually showing up in a compute heavy workload.

T2098 · Jan 4, 2021

DegustatoR said:
Or Ampere's FP32 advantage actually showing up in a compute heavy workload.

That is true and a good point, although the 6800xt and 6900xt both have a theoretical FP32 TFLOPs advantage over the 3070 in their pocket as well, not to mention an enormous theoretical advantage in ROP/TMU/FP16 throughput.

pjbliverpool · Jan 4, 2021

DegustatoR said:
Or Ampere's FP32 advantage actually showing up in a compute heavy workload.

Yes this was my initial thought too.

Interesting to note that Nanite is supposed to scale quite well relative to compute power so Ampere may well shine in UE5 based games that use the tech.

trinibwoy · Jan 5, 2021

T2098 said:
That is true and a good point, although the 6800xt and 6900xt both have a theoretical FP32 TFLOPs advantage over the 3070 in their pocket as well, not to mention an enormous theoretical advantage in ROP/TMU/FP16 throughput.

The Radeons also have twice the peak INT32 throughput of the 3070.

I tried pointing Nvidia's DX11 profiler at Neon Noir but it threw up with some error about a pure function call.

Frenetic Pony · Jan 5, 2021

pjbliverpool said:
Yes this was my initial thought too.

Interesting to note that Nanite is supposed to scale quite well relative to compute power so Ampere may well shine in UE5 based games that use the tech.

Nanite is also so fast that it's probably dominated by texturing anyway, and the UE5 demo's performance was dominated by their SDFGI, not Nanite. So Nanite performance isn't really interesting, as it's apparently so efficient.

The software RT demo, on the other hand, does show interesting numbers, but I'd hesitate to pin down exactly what it's bottlenecked by. The performance differences between the GPUs scale almost perfectly regardless of resolution, so I'm not sure it's bandwidth, as you'd expect the 6800 to catch up with the XT variants. But at the same time it might be latency issues. If the acceleration structure isn't present in the cache, the high clock speeds of RDNA2 might be relatively useless compared to Ampere's ultra-wide architecture allowing more parallel access, which makes the most sense to me for chasing pointers through memory.

Ethatron · Jan 5, 2021

I think it would be useful if reviewers attached a CPU/GPU utilization median and spread next to each FPS min/max.

Lurkmass · Jan 5, 2021

Frenetic Pony said:
The software RT demo, on the other hand, does show interesting numbers, but I'd hesitate to pin down exactly what it's bottlenecked by. The performance differences between the GPUs scale almost perfectly regardless of resolution, so I'm not sure it's bandwidth, as you'd expect the 6800 to catch up with the XT variants. But at the same time it might be latency issues. If the acceleration structure isn't present in the cache, the high clock speeds of RDNA2 might be relatively useless compared to Ampere's ultra-wide architecture allowing more parallel access, which makes the most sense to me for chasing pointers through memory.

Wouldn't put too much thought into Neon Noir since it's using D3D11, which can negatively affect how the shaders are optimized by the compiler. The fxc compiler is deprecated, so every vendor now translates from DXBC to DXIL, which adds overhead, and DXBC is also a suboptimal match for AMD HW/drivers ...

chris1515 · Jan 5, 2021

https://videocardz.com/newz/amd-radeon-rx-6700-series-rumored-to-launch-by-the-end-of-march

tsa1 · Jan 5, 2021

Not sure about Big Navi, but on Vega the Neon Noir RT demo is heavily bottlenecked somewhere not really related to compute or (possibly) memory bandwidth: it barely goes above 200W in consumption (and fps stays the same even if I tank the memory clock by 30%), while in various games the same settings usually produce over 300W. Same goes for World of Tanks RT: super high clocks and very modest power consumption. I guess they never bothered optimizing for GCN/RDNA and their architectural peculiarities.

trinibwoy · Jan 6, 2021

https://www.techspot.com/article/2178-amd-smart-access-memory/

Today we’re taking a detailed look at how AMD’s Smart Access Memory (SAM) technology influences performance in a wide range of games. All in all, we plan to benchmark 36 games at 1080p, 1440p and 4K.

Silent_Buddha · Jan 7, 2021

trinibwoy said:
https://www.techspot.com/article/2178-amd-smart-access-memory/

Finally got time to look at it, and my first impression is that, in general, SAM potentially helps more if there is a memory bottleneck. Hence why it generally has a greater benefit at higher resolutions.

Also this might help explain why it has regressions in performance at lower resolutions in some titles. Less memory pressure means that whatever extra is being done to facilitate direct access to GPU memory becomes a net detriment rather than a benefit. And at higher resolutions where memory pressure is higher it mostly evens back out again.

Just a theory.

Regards,
SB

Deleted member 2197 · Jan 7, 2021

Puget Systems Professional Application reviews:

Agisoft Metashape 1.7.0 vs 1.6.5 Performance - Featuring Radeon RX 6900 XT
Written on January 8, 2021 by William George

DaVinci Resolve Studio - AMD Radeon RX 6900 XT Performance
Written on January 6, 2021 by Matt Bach

Adobe Premiere Pro - AMD Radeon RX 6900 XT Performance
Written on January 5, 2021 by Matt Bach

Adobe After Effects - AMD Radeon RX 6900 XT Performance
Written on January 4, 2021 by Matt Bach

Adobe Photoshop - AMD Radeon RX 6900 XT Performance
Written on January 4, 2021 by Matt Bach
