AMD Radeon RDNA2 Navi (RX 6500, 6600, 6700, 6800, 6900 XT)

Discussion in 'Architecture and Products' started by BRiT, Oct 28, 2020.

  1. T2098

    Newcomer

    Joined:
    Jun 15, 2020
    Messages:
    55
    Likes Received:
    115
    pharma, Lightman and pjbliverpool like this.
  2. trinibwoy

    trinibwoy Meh
    Legend

    Joined:
    Mar 17, 2004
    Messages:
    12,058
    Likes Received:
    3,116
    Location:
    New York
    The leaf BLAS doesn't necessarily contain raw triangle data but a range of indices referencing vertex/index buffers (stored outside the BVH).

    D3D12_RAYTRACING_GEOMETRY_TRIANGLES_DESC (d3d12.h) - Win32 apps | Microsoft Docs

    Code:
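    typedef struct D3D12_RAYTRACING_GEOMETRY_TRIANGLES_DESC {
      D3D12_GPU_VIRTUAL_ADDRESS            Transform3x4;  /* optional 3x4 transform applied during the build (0 = none) */
      DXGI_FORMAT                          IndexFormat;   /* index format; DXGI_FORMAT_UNKNOWN for non-indexed geometry */
      DXGI_FORMAT                          VertexFormat;  /* format of the vertex positions */
      UINT                                 IndexCount;    /* number of indices */
      UINT                                 VertexCount;   /* number of vertices */
      D3D12_GPU_VIRTUAL_ADDRESS            IndexBuffer;   /* GPU address of the index buffer */
      D3D12_GPU_VIRTUAL_ADDRESS_AND_STRIDE VertexBuffer;  /* GPU address and stride of the vertex buffer */
    } D3D12_RAYTRACING_GEOMETRY_TRIANGLES_DESC;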
    
    Right. With N approaching 100k for character models in modern titles you're looking at a huge reduction in node count simply by allowing multiple triangles in leaf nodes. Not only is your memory footprint reduced significantly but your tree is shallower and traversal is faster. Given AMD is doing traversal on shaders it's very unlikely that they're building very deep trees.

    One caveat is triangle intersection performance. It seems AMD's implementation is 4x faster at intersecting nodes than triangles. In that case they might benefit overall from deeper trees that minimize triangle intersection tests, but I doubt it.
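
    To put rough numbers on the leaf-size point, here is a minimal sketch in C. It assumes a binary tree (two children per box node) as the worst case and ignores how a real builder actually splits geometry. The function names are made up for illustration and are not part of DXR or any vendor's BVH format.

```c
#include <assert.h>
#include <stdint.h>

/* Leaves needed for `triangles` primitives when each leaf can hold up to
   `tris_per_leaf` of them (rounded up). Hypothetical helper. */
uint32_t leaf_count(uint32_t triangles, uint32_t tris_per_leaf)
{
    assert(tris_per_leaf > 0);
    return (triangles + tris_per_leaf - 1) / tris_per_leaf;
}

/* Total nodes (box nodes + leaves) in a binary tree where every box node
   has exactly two children: L leaves need L - 1 box nodes above them. */
uint32_t binary_bvh_node_count(uint32_t triangles, uint32_t tris_per_leaf)
{
    uint32_t leaves = leaf_count(triangles, tris_per_leaf);
    return leaves == 0 ? 0 : 2 * leaves - 1;
}
```

    For a 100,000-triangle model this gives 199,999 nodes at one triangle per leaf versus 49,999 at four triangles per leaf.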
     
    Lightman, pharma and BRiT like this.
  3. trinibwoy

    trinibwoy Meh
    Legend

    Joined:
    Mar 17, 2004
    Messages:
    12,058
    Likes Received:
    3,116
    Location:
    New York
    The ISA doc mentions 4 children per box node but doesn't really go into the structure of triangle (leaf) nodes.
     
  4. andermans

    Newcomer

    Joined:
    Sep 11, 2020
    Messages:
    28
    Likes Received:
    43
    I think that is the input to the BVH building process though? The triangle nodes definitely contain the raw position data after the BVH is built.

    The factor of 2 was only meant as the minimum number of children, to get an upper bound, but I'd indeed hope AMD gets closer to 4 children on average. The tree will still be comparatively deep, though. For example, Intel can do 6 children per box node, which results in a shallower tree.
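
    As a rough illustration of how much the branching factor matters for depth, here is a small C sketch. It computes a best-case lower bound only, assuming every box node is completely full, which real trees rarely achieve. The function name is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Smallest number of box-node levels needed above `leaves` leaf nodes when
   every box node is completely full with `branching` children. Real trees
   are deeper because nodes are rarely full. */
uint32_t min_box_levels(uint64_t leaves, uint32_t branching)
{
    assert(branching >= 2);
    uint32_t levels = 0;
    uint64_t capacity = 1; /* leaves reachable with `levels` levels of box nodes */
    while (capacity < leaves) {
        capacity *= branching;
        ++levels;
    }
    return levels;
}
```

    For 100,000 leaves that works out to at least 17 levels with 2 children per node, 9 with 4, and 7 with 6.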
     
  5. trinibwoy

    trinibwoy Meh
    Legend

    Joined:
    Mar 17, 2004
    Messages:
    12,058
    Likes Received:
    3,116
    Location:
    New York
    That's right, DXR has no visibility into the contents of the acceleration structure. It doesn't even know that it's a BVH.

    Why do you say definitely? Is that based on knowledge of AMD's internal representation? E.g., going by this Nvidia patent, it's not at all definitive that the BVH contains raw triangle data (it may just hold pointers to it).

    "Again, a leaf node is a node that is associated with zero child nodes and includes an element of the data represented by the tree data structure. For example, a leaf node may include a pointer or pointers to one or more geometric primitives of a 3D model."

    Yes, BVH4 helps.
     
  6. Wesker

    Regular

    Joined:
    May 3, 2008
    Messages:
    299
    Likes Received:
    186
    Location:
    Oxford, UK
  7. DavidGraham

    Veteran

    Joined:
    Dec 22, 2009
    Messages:
    3,976
    Likes Received:
    5,213
    Crytek's software RT was tested with the Neon Noir demo: the 3090 is 40% faster than the 6900 XT at 4K, and the 3080 is 20% faster.

     
  8. T2098

    Newcomer

    Joined:
    Jun 15, 2020
    Messages:
    55
    Likes Received:
    115
    Yikes, not much of an excuse for AMD there if it's all running on the shaders. Looks like the Infinity Cache just can't keep both the rasterization and RT stages fed.
    To paraphrase the immortal Jerry Sanders and the very non-PC statement he made years ago: "Real cards have real memory bandwidth."

    For those unfamiliar - https://www.eetimes.com/real-men-have-fabsor-do-they/

    ... it's even slightly worse on second inspection when you consider that the 3070 is beating all 3 Navi cards in the 1% lows (worst case), given that Big Navi has more physical memory bandwidth to play with than the 3070 does.
     
    xpea and DavidGraham like this.
  9. DegustatoR

    Veteran

    Joined:
    Mar 12, 2002
    Messages:
    3,242
    Likes Received:
    3,405
    Or Ampere's FP32 advantage actually showing up in a compute-heavy workload.
     
    tinokun, Lightman, Rootax and 4 others like this.
  10. T2098

    Newcomer

    Joined:
    Jun 15, 2020
    Messages:
    55
    Likes Received:
    115
    That is true and a good point, although the 6800 XT and 6900 XT both have a theoretical FP32 TFLOPs advantage over the 3070 in their pocket as well, not to mention an enormous theoretical advantage in ROP/TMU/FP16 throughput.
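
    For reference, the theoretical FP32 figure is just lanes times clock times two. A minimal C sketch follows, assuming one fused multiply-add (2 FLOPs) per lane per clock. The function name is made up, and the spec values in the note below are the publicly advertised shader counts and boost clocks taken as assumptions, not measurements.

```c
#include <assert.h>

/* Theoretical peak FP32 throughput in TFLOPs. Each lane is assumed to
   retire one fused multiply-add (2 FLOPs) per clock; sustained clocks and
   real-world utilization are ignored. */
double peak_fp32_tflops(unsigned fp32_lanes, double clock_mhz)
{
    assert(clock_mhz > 0.0);
    return 2.0 * (double)fp32_lanes * clock_mhz / 1.0e6;
}
```

    Assuming 5120 lanes at 2250 MHz for the 6900 XT, 4608 lanes at 2250 MHz for the 6800 XT and 5888 lanes at 1725 MHz for the 3070, this gives roughly 23.0, 20.7 and 20.3 TFLOPs respectively, so the paper advantage over the 3070 is real but small for the 6800 XT.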
     
    Frenetic Pony likes this.
  11. pjbliverpool

    pjbliverpool B3D Scallywag
    Legend

    Joined:
    May 8, 2005
    Messages:
    9,236
    Likes Received:
    4,259
    Location:
    Guess...
    Yes, this was my initial thought too.

    Interesting to note that Nanite is supposed to scale quite well with compute power, so Ampere may well shine in UE5-based games that use the tech.
     
    tinokun and PSman1700 like this.
  12. trinibwoy

    trinibwoy Meh
    Legend

    Joined:
    Mar 17, 2004
    Messages:
    12,058
    Likes Received:
    3,116
    Location:
    New York
    The Radeons also have twice the peak INT32 throughput of the 3070.

    I tried pointing Nvidia's DX11 profiler at Neon Noir but it threw up with some error about a pure function call.
     
    Frenetic Pony likes this.
  13. Frenetic Pony

    Regular

    Joined:
    Nov 12, 2011
    Messages:
    807
    Likes Received:
    478
    Nanite is also so fast that it's probably dominated by texturing anyway, and the UE5 demo's performance was dominated by their SDFGI, not Nanite. So Nanite performance isn't really interesting, as it's apparently so efficient.

    The software RT demo, on the other hand, does show interesting numbers, but I'd hesitate to pin down exactly what it's bottlenecked by. The performance differences between the GPUs hold almost perfectly regardless of resolution, so I'm not sure it's bandwidth, as you'd then expect the 6800 to catch up with the XT variants. At the same time, it might be a latency issue. If the acceleration structure isn't present in the cache, the high clock speeds of RDNA2 might be relatively useless compared to Ampere's ultra-wide architecture, which allows more parallel accesses. That makes the most sense to me for a workload that chases pointers through memory.
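
    The latency argument can be sketched with Little's law: when every access has to wait on memory, sustained throughput is set by how much data is in flight, not by core clock. This is a toy model in C with a hypothetical function name, and the numbers in the note below are invented for illustration rather than measured on either architecture.

```c
#include <assert.h>

/* Little's law: sustained throughput = data in flight / latency.
   Bytes per nanosecond is numerically equal to GB/s. */
double latency_bound_gbps(unsigned requests_in_flight,
                          unsigned bytes_per_request,
                          double latency_ns)
{
    assert(latency_ns > 0.0);
    return (double)requests_in_flight * (double)bytes_per_request / latency_ns;
}
```

    With made-up values of 64 requests of 64 bytes in flight at 400 ns each, that is about 10 GB/s. Doubling the requests in flight doubles the figure, while in this model a higher core clock changes nothing.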
     
    pjbliverpool likes this.
  14. Ethatron

    Regular Subscriber

    Joined:
    Jan 24, 2010
    Messages:
    946
    Likes Received:
    413
    I think it would be useful if reviewers attached a CPU/GPU utilization median and spread next to each FPS min/max.
     
  15. Lurkmass

    Regular

    Joined:
    Mar 3, 2020
    Messages:
    565
    Likes Received:
    711
    I wouldn't put too much thought into Neon Noir, since it's using D3D11, which can negatively affect how the shaders are optimized by the compiler. The fxc compiler is deprecated, so every vendor now does translation from DXBC to DXIL, which adds overhead, and DXBC is also a suboptimal match for AMD HW/drivers ...
     
    Frenetic Pony and chris1515 like this.
  16. tsa1

    Newcomer

    Joined:
    Oct 8, 2020
    Messages:
    89
    Likes Received:
    97
    Not sure about Big Navi, but on Vega the Neon Noir RT demo is heavily bottlenecked somewhere not really related to compute or (possibly) memory bandwidth: it barely goes above 200W in power consumption (and fps stays the same even if I drop the memory clock by 30%), while in various games the same settings usually draw over 300W. Same goes for World of Tanks RT: super high clocks and very modest power consumption. I guess they never bothered optimizing for GCN/RDNA and their architectural peculiarities.
     
  17. trinibwoy

    trinibwoy Meh
    Legend

    Joined:
    Mar 17, 2004
    Messages:
    12,058
    Likes Received:
    3,116
    Location:
    New York
  18. Silent_Buddha

    Legend

    Joined:
    Mar 13, 2007
    Messages:
    19,423
    Likes Received:
    10,317
    Finally got time to look at it, and my first impression is that, in general, SAM (Smart Access Memory) potentially helps more when there is a memory bottleneck. That would explain why it generally shows a greater benefit at higher resolutions.

    Also this might help explain why it has regressions in performance at lower resolutions in some titles. Less memory pressure means that whatever extra is being done to facilitate direct access to GPU memory becomes a net detriment rather than a benefit. And at higher resolutions where memory pressure is higher it mostly evens back out again.

    Just a theory.

    Regards,
    SB
     
  19. pharma

    Veteran

    Joined:
    Mar 29, 2004
    Messages:
    4,891
    Likes Received:
    4,539
    #2100 pharma, Jan 7, 2021
    Last edited: Jan 9, 2021
    Lightman and PSman1700 like this.
  • About Us

    Beyond3D has been around for over a decade and prides itself on being the best place on the web for in-depth, technically-driven discussion and analysis of 3D graphics hardware. If you love pixels and transistors, you've come to the right place!

    Beyond3D is proudly published by GPU Tools Ltd.