GPU Ray Tracing Performance Comparisons [2021] *spawn*

Discussion in 'Architecture and Products' started by DavidGraham, Mar 29, 2021.

  1. trinibwoy

    trinibwoy Meh
    Legend

    Joined:
    Mar 17, 2004
    Messages:
    12,055
    Likes Received:
    3,109
    Location:
    New York
    Great find. I've seen this patent before but never bothered to read it. Had no idea it was related to ray sorting.

    I wonder how effective it is in practice given the temporal component. It does seem more elegant (and simpler) than trying to sort rays based on origin and direction.
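
    For reference, a minimal sketch of what sorting rays by origin and direction usually looks like: quantize each ray's origin (relative to the scene bounds) and its direction, interleave the bits into a Morton-style key, and sort on that key. The function names, the 5 bits per component and the scene bounds are illustrative assumptions, not something taken from the patent or from any vendor's hardware.

```python
from typing import Sequence, Tuple

Vec3 = Tuple[float, float, float]


def quantize(x: float, lo: float, hi: float, bits: int) -> int:
    """Map x from [lo, hi] to an integer in [0, 2**bits - 1], clamping outliers."""
    t = (x - lo) / (hi - lo)
    t = min(max(t, 0.0), 1.0)
    return min(int(t * (1 << bits)), (1 << bits) - 1)


def interleave(values: Sequence[int], bits: int) -> int:
    """Interleave the low `bits` bits of each value into one Morton-style key."""
    key = 0
    n = len(values)
    for b in range(bits):
        for i, v in enumerate(values):
            key |= ((v >> b) & 1) << (b * n + i)
    return key


def ray_sort_key(origin: Vec3, direction: Vec3,
                 scene_min: Vec3, scene_max: Vec3, bits: int = 5) -> int:
    """Key built from the quantized origin and the quantized unit direction."""
    comps = [quantize(o, lo, hi, bits)
             for o, lo, hi in zip(origin, scene_min, scene_max)]
    comps += [quantize(d, -1.0, 1.0, bits) for d in direction]
    return interleave(comps, bits)


def sort_rays(rays, scene_min: Vec3, scene_max: Vec3):
    """Return (origin, direction) pairs ordered by their sort key."""
    return sorted(rays, key=lambda r: ray_sort_key(r[0], r[1], scene_min, scene_max))
```

    Rays that start close together and point the same way end up with nearby keys, so they land next to each other after the sort. Building the key and moving the ray data is the per-ray cost that has to be paid back by faster traversal.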
     
    PSman1700 likes this.
  2. Ailuros

    Ailuros Epsilon plus three
    Legend Subscriber

    Joined:
    Feb 7, 2002
    Messages:
    9,511
    Likes Received:
    224
    Location:
    Chania
    Inevitably we'll have to wait until the first device with their IP is available for independent measurements, but my gut feeling is that, since their primary target is actually low power/mobile, they don't have much to worry about on PPA metrics. It'll be interesting to see how QCOM's future (Adreno) solutions compare to those.
     
  3. trinibwoy

    trinibwoy Meh
    Legend

    Joined:
    Mar 17, 2004
    Messages:
    12,055
    Likes Received:
    3,109
    Location:
    New York
    Here is a relatively recent paper on the performance of various ray sorting techniques on a 2080 Ti. The ray sorting was done in a shader, with the actual ray traversal done in hardware. The "software" sorting resulted in up to 2x speedups of the traversal step, which would imply that Turing either isn't doing ray sorting in hardware or isn't very good at it.

    https://meistdan.github.io/publications/raysorting/paper.pdf
     
  4. TopSpoiler

    Newcomer

    Joined:
    Aug 18, 2020
    Messages:
    74
    Likes Received:
    176
    I think ray grouping in the TTU L0 cache is performed against only the small group of rays allocated to an SM, so it is less efficient than whole-screen ray sorting/grouping.
     
    milk and PSman1700 like this.
  5. OlegSH

    Regular

    Joined:
    Jan 10, 2010
    Messages:
    797
    Likes Received:
    1,622
    I think HW/SW sorting is usually done at tile granularity, since that's another tradeoff: you need to balance keeping data in caches against the time spent sorting, which could otherwise have been spent on other computations.
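
    A toy sketch of the difference, assuming each ray already has an integer coherence key (the key, the tile size and the function names are hypothetical). Sorting within a tile only permutes rays inside that tile, so their data can stay in a nearby cache; a global sort can move any ray anywhere in the frame.

```python
from typing import List, Sequence


def sort_within_tiles(keys: Sequence[int], tile_size: int) -> List[int]:
    """Return ray indices reordered so keys are sorted inside each tile only."""
    if tile_size <= 0:
        raise ValueError("tile_size must be positive")
    order: List[int] = []
    for start in range(0, len(keys), tile_size):
        tile = range(start, min(start + tile_size, len(keys)))
        # Rays never leave their tile; only the order inside it changes.
        order.extend(sorted(tile, key=lambda i: keys[i]))
    return order


def sort_globally(keys: Sequence[int]) -> List[int]:
    """Return ray indices sorted by key across the whole frame."""
    return sorted(range(len(keys)), key=lambda i: keys[i])
```

    With keys [3, 1, 2, 0] and a tile size of 2, the tiled version returns [1, 0, 3, 2] while the global version returns [3, 1, 2, 0]: the global order is more coherent, but every ray may have to travel.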
     
    PSman1700 likes this.
  6. TopSpoiler

    Newcomer

    Joined:
    Aug 18, 2020
    Messages:
    74
    Likes Received:
    176


    Software-based global sorting: 3.66 ms reordering overhead and 1.85x trace speedup.
    Local (block/tile) sorting only provides small benefits because that is mostly done in the TTU anyway, whether the rays are pre-sorted or not.

    We really need a fixed-function global sorting unit and, if Imagination has done this with reasonable die area, it's a good step forward for RTRT.
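
    A back-of-the-envelope check on when software global sorting pays for itself, taking the two figures above at face value. It assumes the reordering overhead is fixed and the speedup applies uniformly to the whole trace time, which is a simplification; the function names are made up for illustration.

```python
def break_even_trace_ms(sort_overhead_ms: float, trace_speedup: float) -> float:
    """Unsorted trace time above which sorting is a net win.

    Sorting wins when  T > overhead + T / speedup,
    i.e. when          T > overhead / (1 - 1 / speedup).
    """
    if trace_speedup <= 1.0:
        raise ValueError("sorting never pays off without a speedup > 1")
    return sort_overhead_ms / (1.0 - 1.0 / trace_speedup)


def net_saving_ms(unsorted_trace_ms: float, sort_overhead_ms: float,
                  trace_speedup: float) -> float:
    """Frame time saved by sorting (negative means sorting made it slower)."""
    sorted_total = sort_overhead_ms + unsorted_trace_ms / trace_speedup
    return unsorted_trace_ms - sorted_total


threshold = break_even_trace_ms(3.66, 1.85)
print(f"break-even unsorted trace time: {threshold:.2f} ms")
```

    Under those assumptions the trace step has to take roughly 8 ms unsorted before the sort breaks even, so the overhead is the number a fixed-function unit would have to shrink.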
     
    iroboto, T2098, pharma and 2 others like this.
  7. OlegSH

    Regular

    Joined:
    Jan 10, 2010
    Messages:
    797
    Likes Received:
    1,622
    I wonder what the numbers would have looked like if they had tested a 3080 with 2x the flops)
    Still, these tests don't look complete to me because the tested shadows were relatively light on tracing time; there are games where tracing takes 10 ms or more at 4K just for reflections - https://forum.beyond3d.com/threads/...arisons-2021-spawn.62346/page-36#post-2220214

    This sorting unit had better be universal and not coupled to any particular graphics stage. There are other places where sorting can improve performance by a lot; here is NVIDIA's recommendation for material sorting in UE4, for example:
     
    T2098, PSman1700 and TopSpoiler like this.
  8. trinibwoy

    trinibwoy Meh
    Legend

    Joined:
    Mar 17, 2004
    Messages:
    12,055
    Likes Received:
    3,109
    Location:
    New York
    Global sorting seems like a non-starter given the amount of data you will need to move around the chip. You would need to run extremely expensive traces for it to be worthwhile and at that point performance will be in the gutter anyway.
     
    DavidGraham, OlegSH and TopSpoiler like this.
  9. TopSpoiler

    Newcomer

    Joined:
    Aug 18, 2020
    Messages:
    74
    Likes Received:
    176
    Remember that the tests in the paper were performed in an isolated environment to eliminate the various factors that cause inaccuracy.
    Also, a typical offline path tracing renderer accumulates samples over time, so the final result looks unrealistically compute-heavy.
     
  10. TopSpoiler

    Newcomer

    Joined:
    Aug 18, 2020
    Messages:
    74
    Likes Received:
    176
    This would be the theoretical background of the ray grouping implementation in the TTU.
    https://my.eng.utah.edu/~cs6958/papers/HWRT-seminar/a160-nah.pdf

    Imagination blogged about coherency gathering nearly 2 years ago, which is used by PowerVR's ray tracing implementation.
    https://www.imaginationtech.com/blo...racing-the-benefits-of-hardware-ray-tracking/
     
    Lightman, Krteq and PSman1700 like this.
  11. TopSpoiler

    Newcomer

    Joined:
    Aug 18, 2020
    Messages:
    74
    Likes Received:
    176
    It seems PowerVR's RTU is separate from the compute clusters, unlike Nvidia's RT cores. Now it makes sense how they implemented global ray sorting without a negative impact.
    https://ubm-twvideo01.s3.amazonaws.com/o1/vault/gdceurope2015/Davis_Joe_PowerVRGraphicsLatest.pdf
     
    Lightman, Man from Atlantis and Krteq like this.
  12. OlegSH

    Regular

    Joined:
    Jan 10, 2010
    Messages:
    797
    Likes Received:
    1,622
    It doesn't really matter where the RT cores are located; what matters is whether RT operations are decoupled from the SIMDs or not.
    Both PowerVR's RTU and Nvidia's RT cores act as offload accelerators that do all tree-traversal and intersection ops on their own and return to the SM only when certain shader-controlled decisions have to be made, such as whether to trace further in the any-hit shader.
    The difference between the two comes down to which SM or tile infrastructure can be reused, and to latencies.

    It doesn't. To do the global sort efficiently without travelling back and forth between caches and DRAM, you have to store state for all of the screen's rays/pixels in on-chip SRAM, which you obviously can't; if you could, there would have been no tiled processing in the first place :)
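
    A rough size estimate for that on-chip state, as a sketch only. It assumes one ray per pixel at 4K and a minimal 32-byte ray record (origin, direction, tmin, tmax as 32-bit floats); real ray state carries payload on top, so this is a lower bound under those assumptions.

```python
def ray_state_bytes(width: int, height: int,
                    rays_per_pixel: int, bytes_per_ray: int) -> int:
    """Bytes needed to keep every in-flight ray of a frame resident at once."""
    return width * height * rays_per_pixel * bytes_per_ray


# Assumed minimal ray record: origin (3 floats) + direction (3 floats)
# + tmin + tmax, all 32-bit floats.
BYTES_PER_RAY = 8 * 4

total = ray_state_bytes(3840, 2160, 1, BYTES_PER_RAY)
print(f"{total / 2**20:.1f} MiB")  # prints "253.1 MiB"
```

    Even this stripped-down record comes to about 253 MiB for a single bounce at 4K, which is the scale of storage a whole-screen sort would have to keep close to the sorter.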
     
  13. jlippo

    Veteran

    Joined:
    Oct 7, 2004
    Messages:
    1,744
    Likes Received:
    1,090
    Location:
    Finland
    Nvidia also returns to SM in some cases of tree traversal, like multi layer instances etc.

    Would love to see synthetic benchmarks for such cases.
     
    Jawed, Lightman, DavidGraham and 4 others like this.
  14. Ailuros

    Ailuros Epsilon plus three
    Legend Subscriber

    Joined:
    Feb 7, 2002
    Messages:
    9,511
    Likes Received:
    224
    Location:
    Chania
    The GR6500 (Wizard) is legacy IP by now, and you're linking to a presentation from 2015 that has absolutely nothing in common with their newly announced Photon architecture. You can find correct details on how and where RT has been integrated into their C Series GPU (called ALBIORIX, which is 2 entire GPU generations later than Rogue: Rogue --> Furian --> Albiorix, plus 2 refresh generations), amongst others, here: https://www.imaginationtech.com/graphics-processors/architecture/powervr-photon-architecture/

    A good whitepaper for it is the PowerVR Photon whitepaper and one of the interesting tidbits would be:

    Otherwise, IMHO their implementation inherits advantages and disadvantages similar to those of their usual PowerVR architecture.
     
    #1334 Ailuros, Nov 28, 2021
    Last edited: Nov 28, 2021
  15. TopSpoiler

    Newcomer

    Joined:
    Aug 18, 2020
    Messages:
    74
    Likes Received:
    176
    My bad, I didn't check the recent whitepaper carefully.
    It seems they have chosen to spread the RACs out all around (just like RT cores), and the coherency gathering occurs in each RAC independently.
    Therefore it's not a global sort, and it sounds identical to the ray operation scheduling that Nvidia is already using:
    I hope this will be my final opinion on PowerVR's coherency gathering. :-|
     
    pharma, PSman1700 and DavidGraham like this.
  16. Rootax

    Veteran

    Joined:
    Jan 2, 2006
    Messages:
    2,400
    Likes Received:
    1,845
    Location:
    France
    As much as I love PowerVR, we don't know how it translates to real life in the PC market. For all we know, the solution is smart on paper but doesn't work at all in a real product....
     
  17. Ailuros

    Ailuros Epsilon plus three
    Legend Subscriber

    Joined:
    Feb 7, 2002
    Messages:
    9,511
    Likes Received:
    224
    Location:
    Chania
    It's OT anyway, but it was only a couple of days ago that the first datacenter GPU from a Chinese vendor appeared, based on their previous-generation B-series GPU:

    https://www.tomshardware.com/news/chinese-xindong-fenghua-gpu-announced

    If any manufacturer does build a chip based on the C series GPU IP, I guess it'll take at least 1.5 years from now until it appears on shelves, and then probably in China only.
     
    Rootax likes this.
  18. trinibwoy

    trinibwoy Meh
    Legend

    Joined:
    Mar 17, 2004
    Messages:
    12,055
    Likes Received:
    3,109
    Location:
    New York
    I always wondered how well TBDR translates into this new era of compute. If you have a compute shader writing to a UAV the hardware isn’t able to automatically “tile” the compute threads and UAV memory accesses like it can with a pixel shader and render target.
     
  19. Lurkmass

    Regular

    Joined:
    Mar 3, 2020
    Messages:
    565
    Likes Received:
    711
    The Metal API introduced the concept of tile shading for the specific purpose of letting compute shaders directly access tile memory in a render pass. The problem with exploiting tile-based architectures for compute is the limited amount of data that you can hold in a tile. It can be very awkward to do post-processing filters like bloom or DoF in this case if you need to access data that lies outside the tile boundary. On the face of it, bindless and ray tracing don't seem like a good fit for the ideal usage patterns of tile memory ...
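
    To put a number on the tile boundary problem, here is a small sketch. The 32x32 tile and the 8-texel filter radius are assumed values for illustration, not the tile size of any particular GPU or anything the Metal API specifies.

```python
def halo_overhead(tile_w: int, tile_h: int, radius: int) -> float:
    """Ratio of texels a filter has to read to texels the tile itself holds.

    A filter of the given radius needs a border ("halo") of `radius` texels
    around the tile, and that border is not resident in the tile's memory.
    """
    needed = (tile_w + 2 * radius) * (tile_h + 2 * radius)
    return needed / (tile_w * tile_h)


print(halo_overhead(32, 32, 8))  # prints 2.25
```

    With those assumed numbers, more than half of the texels the filter reads sit outside the tile, so they have to come from somewhere other than tile memory.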
     
    TopSpoiler likes this.
  20. TopSpoiler

    Newcomer

    Joined:
    Aug 18, 2020
    Messages:
    74
    Likes Received:
    176