GPU Ray Tracing Performance Comparisons [2021] *spawn*

Discussion in 'Architecture and Products' started by DavidGraham, Mar 29, 2021.

  1. DavidGraham

    Veteran

    Joined:
    Dec 22, 2009
    Messages:
    3,976
    Likes Received:
    5,213
    Lightman, pjbliverpool and PSman1700 like this.
  2. DegustatoR

    Veteran

    Joined:
    Mar 12, 2002
    Messages:
    3,240
    Likes Received:
    3,397
    https://www.anandtech.com/show/16895/a-sneak-peek-at-intels-xe-hpg-gpu-architecture
     
    Lightman, PSman1700 and pharma like this.
  3. PSman1700

    Legend

    Joined:
    Mar 22, 2019
    Messages:
    7,118
    Likes Received:
    3,090
    Lightman and Jupiter like this.
  4. trinibwoy

    trinibwoy Meh
    Legend

    Joined:
    Mar 17, 2004
    Messages:
    12,055
    Likes Received:
    3,112
    Location:
    New York
    [​IMG]

    Nice, I like "core" much better than "subslice". Intel is sticking to its guns and referring to self-contained execution units as cores. This is arguably more accurate but pretty useless for comparison with AMD's and Nvidia's "cores". This picture also makes it seem like the RT and texture units can be accessed from any of the cores. That would be really interesting, but the picture is probably just misleading. The rasterizer and ROPs inside the slice are very similar to Ampere and RDNA.

    Intel Slice = Nvidia GPC = AMD Shader Array
    Intel Core = Nvidia SM = AMD WGP
    Intel Vector engine = Nvidia Partition = AMD SIMD

    Did I get that right?
     
    PSman1700 and DegustatoR like this.
  5. TopSpoiler

    Newcomer

    Joined:
    Aug 18, 2020
    Messages:
    74
    Likes Received:
    176
    If rumors are true that AMD will be reusing the Navi 2x for their next-gen mid-range SKUs, I think the RT implementation for RDNA3 will not be different from RDNA2.
     
  6. nAo

    nAo Nutella Nutellae
    Veteran

    Joined:
    Feb 6, 2002
    Messages:
    4,400
    Likes Received:
    440
    Location:
    San Francisco
    Why?
     
    pharma, DegustatoR and PSman1700 like this.
  7. TopSpoiler

    Newcomer

    Joined:
    Aug 18, 2020
    Messages:
    74
    Likes Received:
    176
    For keeping architectural/strategic consistency. I think AMD is against special-purpose compute units in GPUs, and they probably think improving the efficiency of general-purpose compute units is the better way. The massive L3 cache is one example of that approach.
     
  8. nAo

    nAo Nutella Nutellae
    Veteran

    Joined:
    Feb 6, 2002
    Messages:
    4,400
    Likes Received:
    440
    Location:
    San Francisco
    AMD already has special-purpose units for RT. Sure, they don't have as much HW acceleration for RT as other GPUs (including Intel DG2, according to what was disclosed earlier today), but that doesn't necessarily mean they are not going to throw more fixed function at it.

    Consistency doesn't matter if you can't compete well on a progressively larger fraction of future workloads.
     
  9. CarstenS

    Legend Subscriber

    Joined:
    May 31, 2002
    Messages:
    5,800
    Likes Received:
    3,920
    Location:
    Germany
    And since it's using a driver abstraction, you can have consistency from a software POV anyway. You just need to provide a compute path for when you do something the fixed-function (ff) hw cannot do.
     
    Lightman and PSman1700 like this.
  10. trinibwoy

    trinibwoy Meh
    Legend

    Joined:
    Mar 17, 2004
    Messages:
    12,055
    Likes Received:
    3,112
    Location:
    New York
    If anything, Intel has the best setup for running divergent RT kernels on general compute units given their support for small wavefront sizes: 8 threads vs. 32 on AMD/Nvidia. Yet Intel still decided to go with hardware traversal.
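
    The wavefront-size argument can be illustrated with a toy model. This is only a sketch under loud assumptions: each ray independently picks one of `num_branches` code paths at random (real ray batches usually have some coherence), and a wave pays full width for every distinct path taken by any of its lanes. The function name and parameters are hypothetical; only the 8 vs. 32 widths come from the post.

```python
import random

def simd_utilization(width, num_rays=4096, num_branches=16, seed=0):
    """Estimate lane utilization for a given wave width.

    Toy assumption: every ray picks one of num_branches paths uniformly
    at random, and a wave executes each distinct path serially with the
    non-participating lanes masked off.
    """
    rng = random.Random(seed)
    rays = [rng.randrange(num_branches) for _ in range(num_rays)]
    useful = 0  # lane-slots doing real work
    issued = 0  # lane-slots occupied, including masked-off lanes
    for start in range(0, num_rays, width):
        wave = rays[start:start + width]
        distinct_paths = len(set(wave))
        useful += len(wave)
        issued += distinct_paths * width
    return useful / issued

util_8 = simd_utilization(8)    # narrow waves, as cited for Intel
util_32 = simd_utilization(32)  # wider waves, as cited for AMD/Nvidia
print(f"width 8: {util_8:.3f}, width 32: {util_32:.3f}")
```

    In this model the narrower wave wastes fewer lanes because fewer distinct paths land in the same wave. It says nothing about the cost of scheduling more, smaller waves.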
     
    DegustatoR and PSman1700 like this.
  11. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    11,708
    Likes Received:
    2,132
    Location:
    London
    The Geometry Shader is a lesson from history: it is essentially a complete waste of time, and was even when it was introduced.

    Also, triangle intersection test rate (which RDNA 2 seems to be bad at) is difficult to isolate as a bottleneck separate from BVH traversal incoherency. BVH traversal is bottlenecked by memory (just like texture filtering, in general), meaning that compressed BVH nodes and a rambo cache are vital.

    In my opinion NVidia has extremely good node-compression, based on the years-long research that NVidia has on the subject of BVH format.

    It's worth remembering that a workgroup of, say, 1024 work items intrinsically offers a speed-up for dynamic branching on current hardware: when an entire hardware thread goes dark, that subset of work items (e.g. 32 on RDNA 2) no longer occupies any execution slots. During traversal, those execution slots can run the closest-/any-/miss-shader for those work items instead, if the hardware is running an uber-ray-shader that combines traversal with closest-/any-/miss-shaders.
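
    A minimal sketch of the "hardware thread goes dark" point, assuming a workgroup is simply split into consecutive groups of `wave_size` work items and a group frees its execution slots only when all of its items are finished. The helper `active_waves` is hypothetical and not any vendor's scheduler.

```python
def active_waves(done, wave_size=32):
    """Count hardware threads that still occupy execution slots.

    done[i] is True when work item i has finished traversal. A hardware
    thread only goes dark when every one of its work items is done.
    """
    return sum(
        not all(done[i:i + wave_size])
        for i in range(0, len(done), wave_size)
    )

# 1024 work items, half finished, grouped coherently:
# 16 of the 32 hardware threads go dark.
coherent = [True] * 512 + [False] * 512
print(active_waves(coherent))

# Same number of finished items, but interleaved: no thread goes dark.
scattered = [i % 2 == 0 for i in range(1024)]
print(active_waves(scattered))
```

    The same amount of finished work frees half the slots in the first case and none in the second, which is why the benefit depends on how the finished items are distributed across hardware threads.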

    Finally, if you can build a compute unit that can do conditional routing to mitigate the slow-down of incoherent branching, then hardware traversal is entirely pointless. RDNA 2 is not 10x or more slower even in the worst-case scenarios, so a 2-4x speed-up from such a solution is more than enough.

    Intel's variable-width hardware threads, if Arc has them, will be interesting in their own right as a solution to dynamic branching in shaders. In combination with hardware BVH traversal that will be interesting.

    Intel's hardware traversal may not be MIMD.
     
    Lightman likes this.
  12. Dictator

    Regular

    Joined:
    Feb 11, 2011
    Messages:
    681
    Likes Received:
    3,969
    From what I understand - it is!
     
    PSman1700 and BRiT like this.
  13. CarstenS

    Legend Subscriber

    Joined:
    May 31, 2002
    Messages:
    5,800
    Likes Received:
    3,920
    Location:
    Germany
    How so? Isn't it 1 tri/clk/RA?
     
  14. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    11,708
    Likes Received:
    2,132
    Location:
    London
    Apparently, Ampere is at least twice that. Is that right?
     
  15. troyan

    Regular

    Joined:
    Sep 1, 2015
    Messages:
    603
    Likes Received:
    1,123
    nVidia claimed they have doubled triangle intersection test rate with Ampere.
     
    PSman1700 likes this.
  16. CarstenS

    Legend Subscriber

    Joined:
    May 31, 2002
    Messages:
    5,800
    Likes Received:
    3,920
    Location:
    Germany
    2x vs. Turing. Now we need to find "x". Even in the Hot Chips presentation from Burgess, I did not find anything substantial. Only the "2x" and that (supposedly, but not explicitly) an RTX 3080 vs. an unnamed Turing achieves "58 vs. 34 RT TFLOPS" (in Nvidia-style calculation). Additionally, they say that Turing has 7x the RT perf of Pascal, but that includes ray/box intersection and ray traversal, which allegedly consume "many thousands of instruction slots per ray" in a shader program.

    I could not find any definitive number on whether ray/triangle intersection on Turing really is 1/RT core/clk. I'd be happy to see such a number. Or a benchmark I could run that would not consume my whole free weekend. ;)

    Coreteks did analyze the perf drop of 3080 vs. 2080 Ti with RT based on data from HUB shortly after the . They come to the conclusion that the drop only lessens by 6% between both cards on average in 4K and 8% in 1440p. Maybe that's an indication, or maybe ray/triangle intersection is not such an important performance metric? It should not happen too often for each ray (1 for solids, 2 with translucents?).

    And apart from all the numbers: if RDNA2 really is on par with Turing here, is that already "bad"? The RX 6900 XT would still be faster in this metric than a 3070 Ti.
     
    #656 CarstenS, Aug 20, 2021
    Last edited: Aug 20, 2021
    Lightman and Jawed like this.
  17. DegustatoR

    Veteran

    Joined:
    Mar 12, 2002
    Messages:
    3,240
    Likes Received:
    3,397
    The gaming side of RT is limited by traditional shading and bandwidth more than by the RT h/w of Ampere. The HUB "analysis" is a joke.

    [​IMG]
    https://techgage.com/article/nvidia-turing-ampere-cuda-optix-rendering-performance/

    Here the 3070 with its 46 RT cores is 37% faster than the Titan RTX with its 72 RT cores. Clocks are comparable.
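
    The per-RT-core ratio implied by those numbers can be worked out as below. This is only arithmetic on the figures quoted above, assuming equal clocks and that the benchmark scales with RT core count alone; the helper name `per_core_speedup` is hypothetical.

```python
def per_core_speedup(overall_speedup, cores_new, cores_old):
    """Throughput per RT core of the new part relative to the old one,
    assuming equal clocks and perfect scaling with RT core count."""
    return overall_speedup * cores_old / cores_new

# 3070 (46 RT cores) is 37% faster than Titan RTX (72 RT cores).
ratio = per_core_speedup(1.37, cores_new=46, cores_old=72)
print(f"{ratio:.2f}x per RT core")
```

    Under those assumptions the result comes out a little above 2x per RT core. Any other difference between the two cards (caches, shader throughput, memory) would be folded into that figure, so it does not isolate the triangle intersection rate.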
     
    PSman1700, DavidGraham and xpea like this.
  18. CarstenS

    Legend Subscriber

    Joined:
    May 31, 2002
    Messages:
    5,800
    Likes Received:
    3,920
    Location:
    Germany
    It's true that gaming workloads do obscure individual architectural traits.
    Are you sure that this is not the doubled L1s with twice the transfer rate at work? I could dismiss that whole benchmark on the missing scaling between 2080S and Titan RTX alone (the latter should be 46% faster, but is only 5%), and that would have more merit than your blatant dismissal of HUB's data (the data was from HUB; the analysis was from Coreteks).

    So I take it you don't have concrete data either?
     
    Lightman likes this.
  19. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    11,708
    Likes Received:
    2,132
    Location:
    London
    I think it's very much worthwhile to contrast "professional" ray-traced rendering with game ray-traced rendering.

    Optix is a black box, as far as I can tell. It's running on top of black box hardware.

    Black box inception makes it pretty hard to talk about the hardware.
     
  20. DegustatoR

    Veteran

    Joined:
    Mar 12, 2002
    Messages:
    3,240
    Likes Received:
    3,397
    ProViz benches are as concrete as it gets when comparing Turing with Ampere in RT.
     
    PSman1700 likes this.