GPU Ray Tracing Performance Comparisons [2021-2022]

After some digging into Nvidia's TTU (RT Core) patents published in 2018, I've found that they actually do ray grouping in the hardware.

https://patents.justia.com/patent/11157414
https://www.freepatentsonline.com/11157414.pdf

Sorry for the long text!

Great find. I've seen this patent before but never bothered to read it. Had no idea it was related to ray sorting.

I wonder how effective it is in practice given the temporal component. It does seem more elegant (and simpler) than trying to sort rays based on origin and direction.
 
These tiers seem artificial at best without PPA and absolute performance metrics in game traces.

Inevitably we'll have to wait until the first device with their IP is available for independent measurements; but my gut feeling tells me that since their primary target is actually low-power/mobile, they don't have much to worry about regarding PPA metrics. It'll be interesting to see how QCOM's future (Adreno) solutions will compare to those.
 
After some digging into Nvidia's TTU (RT Core) patents published in 2018, I've found that they actually do ray grouping in the hardware.

https://patents.justia.com/patent/11157414
https://www.freepatentsonline.com/11157414.pdf

Here is a relatively recent paper on the performance of various ray sorting techniques on a 2080 Ti. The ray sorting was done in a shader with actual ray traversal done in hardware. The "software" sorting resulted in up to 2x speedups of the traversal step which would imply that Turing isn't doing ray sorting in hardware or it's not very good at it.

We evaluated the discussed ray reordering techniques in the context of wavefront path tracing using hardware accelerated RTX trace kernel accessed through DirectX 12 and OptiX 7. The path tracer uses next event estimation with two shadow rays per bounce with eight samples per pixel. We use seven scenes of various complexity with a single area light source with the size of 5% of the largest scene extent. All measurements were performed on the RTX 2080 Ti GPU with the image resolution of 1920 × 1080. We measured the discussed ray ordering strategies in terms of trace performance, hardware utilization, and ray coherence measures.

https://meistdan.github.io/publications/raysorting/paper.pdf
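
For reference, the kind of software reordering the paper evaluates boils down to computing a sort key per ray and reordering the ray buffer before the hardware trace call. Below is only a generic sketch (a coarse direction bucket plus a Morton code of the quantised origin), not the paper's exact hash, and it omits all of the DXR/OptiX plumbing:

```cpp
// Generic sketch of shader-side ray reordering: build a sort key per ray from a
// coarse direction bucket plus a Morton code of the quantised origin, then sort
// the ray buffer before handing it to the hardware trace call.
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <vector>

struct Ray { float ox, oy, oz, dx, dy, dz; };

// Spread the low 10 bits of v so they occupy every third bit (for 3D Morton codes).
static uint32_t expandBits(uint32_t v) {
    v = (v | (v << 16)) & 0x030000FF;
    v = (v | (v <<  8)) & 0x0300F00F;
    v = (v | (v <<  4)) & 0x030C30C3;
    v = (v | (v <<  2)) & 0x09249249;
    return v;
}

// Key = 5-bit direction bucket (sign octant + dominant axis) in the high bits,
// 30-bit Morton code of the origin quantised to the scene bounds in the low bits.
static uint64_t rayKey(const Ray& r, const float lo[3], const float hi[3]) {
    auto quantise = [&](float v, int axis) -> uint32_t {
        float t = (v - lo[axis]) / (hi[axis] - lo[axis]);          // normalise to [0,1]
        return (uint32_t)std::fmin(std::fmax(t * 1023.0f, 0.0f), 1023.0f);
    };
    uint32_t morton = (expandBits(quantise(r.ox, 0)) << 2) |
                      (expandBits(quantise(r.oy, 1)) << 1) |
                       expandBits(quantise(r.oz, 2));
    uint32_t octant = (r.dx < 0 ? 1u : 0u) | (r.dy < 0 ? 2u : 0u) | (r.dz < 0 ? 4u : 0u);
    float ax = std::fabs(r.dx), ay = std::fabs(r.dy), az = std::fabs(r.dz);
    uint32_t major  = (ax > ay && ax > az) ? 0u : (ay > az ? 1u : 2u);
    return ((uint64_t)((octant << 2) | major) << 30) | morton;
}

// Reorder rays so that the subsequent (hardware) trace sees more coherent batches.
void reorderRays(std::vector<Ray>& rays, const float lo[3], const float hi[3]) {
    std::sort(rays.begin(), rays.end(), [&](const Ray& a, const Ray& b) {
        return rayKey(a, lo, hi) < rayKey(b, lo, hi);
    });
}
```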
 
Here is a relatively recent paper on the performance of various ray sorting techniques on a 2080 Ti. The ray sorting was done in a shader with actual ray traversal done in hardware. The "software" sorting resulted in up to 2x speedups of the traversal step which would imply that Turing isn't doing ray sorting in hardware or it's not very good at it.

https://meistdan.github.io/publications/raysorting/paper.pdf

I think the ray grouping in the TTU L0 cache is performed against only a small group of rays allocated in an SM, so it is less efficient than whole-screen ray sorting/grouping.
 
is less efficient than whole-screen ray sorting/grouping
I think the HW/SW sorting is usually done at tile granularity since that's another tradeoff: you need to balance keeping data in caches against the time spent sorting, which could otherwise have been spent on other computations.
 
The "software" sorting resulted in up to 2x speedups of the traversal step
you need to balance keeping data in caches against the time spent sorting, which could otherwise have been spent on other computations.


Software-based global sorting: 3.66 ms reordering overhead and a 1.85x trace speedup.
Local (block/tile) sorting only provides small benefits because it's mostly done in the TTU anyway, whether the rays are pre-sorted or not.

We really need a fixed-function global sorting unit, and if Imagination has done this job with reasonable die area, it's a good step forward in RTRT.
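
Just putting those two figures together (plain arithmetic on the numbers quoted above; the 12 ms trace time is only a made-up example):

```cpp
// Back-of-the-envelope check of when the quoted software global sort pays off:
// with reorder overhead R and a trace speedup S, sorting only wins when the
// unsorted trace time T satisfies  T - T/S > R,  i.e.  T > R*S/(S-1).
#include <cstdio>

int main() {
    const double R = 3.66;  // reordering overhead in ms (figure quoted above)
    const double S = 1.85;  // trace speedup factor (figure quoted above)
    const double breakEven = R * S / (S - 1.0);   // ~7.97 ms
    std::printf("Sorting wins only if the unsorted trace step takes > %.2f ms\n", breakEven);

    // Example: a 12 ms trace step would drop to 12/1.85 + 3.66 ~= 10.1 ms.
    const double T = 12.0;
    std::printf("T = %.1f ms  ->  %.1f ms with sorting\n", T, T / S + R);
    return 0;
}
```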
 
Software-based global sorting: 3.66 ms reordering overhead and a 1.85x trace speedup.
Wonder what the numbers would have been like if they had tested a 3080 with 2x the FLOPS.
Still, these tests don't look complete to me because the tested shadows were relatively light on tracing time; there are games where tracing takes 10 ms or more in 4K just for reflections - https://forum.beyond3d.com/threads/...arisons-2021-spawn.62346/page-36#post-2220214

We really need a fixed-function global sorting unit, and if Imagination has done this job with reasonable die area, it's a good step forward in RTRT.
This sorting unit had better be universal and not coupled to any particular graphics stage; there are other places where sorting can improve performance by a lot. Here is NVIDIA's recommendation for material sorting in UE4, for example:
NVIDIA said:
Material Sorting
Since ray traced reflections can be expensive, enabling material sorting may improve the cost of shading the reflected surfaces. Material sorting enables shading to be more efficient by sorting reflection shading work by material coherence. This comes with some cost overhead, so material shading is only a win when shading coherence is a limiter. Gains of 50% are not uncommon when enabling this feature.

r.RayTracing.Reflections.SortMaterials [0|1]

The Soul City sample is a good example of a case where the gains from material sorting are dramatic, improving performance by 3x in places.
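
The mechanism behind that gain is simple to sketch. This is only the general idea of grouping hit records by material before shading, not UE4's actual implementation (the struct layout below is made up):

```cpp
// General idea behind material sorting: group reflection hit records by
// material ID before shading, so that threads that end up in the same
// wave/warp are far more likely to run the same material shader and touch
// the same textures.
#include <algorithm>
#include <cstdint>
#include <vector>

struct HitRecord {
    uint32_t pixel;       // where the shaded result should be written
    uint32_t materialId;  // which material/shader the hit resolved to
    float    t;           // hit distance (payload details are illustrative)
};

void sortHitsByMaterial(std::vector<HitRecord>& hits) {
    // A stable sort keeps pixel order within each material bucket, which helps
    // write-back locality a little; the key point is the materialId grouping.
    std::stable_sort(hits.begin(), hits.end(),
                     [](const HitRecord& a, const HitRecord& b) {
                         return a.materialId < b.materialId;
                     });
}
```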
 
We really need a fixed-function global sorting unit, and if Imagination has done this job with reasonable die area, it's a good step forward in RTRT.

Global sorting seems like a non-starter given the amount of data you will need to move around the chip. You would need to run extremely expensive traces for it to be worthwhile and at that point performance will be in the gutter anyway.
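
A rough, back-of-the-envelope illustration of that data movement; the per-ray state size is an assumption, not a measured figure:

```cpp
// Rough, illustrative estimate of how much ray state a whole-screen sort has to
// touch; the per-ray payload size is assumed, not measured.
#include <cstdio>

int main() {
    const double raysPerPixel = 1.0;          // e.g. one reflection ray per pixel
    const double pixels = 1920.0 * 1080.0;
    const double bytesPerRay = 48.0;          // origin + direction + payload handle (assumed)
    const double mb = raysPerPixel * pixels * bytesPerRay / (1024.0 * 1024.0);
    std::printf("~%.0f MB of ray state per sort pass, read and written at least once\n", mb);
    return 0;
}
```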
 
Still, these tests don't look complete to me because the tested shadows were relatively light on tracing time; there are games where tracing takes 10 ms or more in 4K just for reflections

Remember that the tests in the paper are performed in an isolated environment to eliminate various factors that could cause inaccuracy.
And a typical offline path tracing renderer accumulates samples over time, so the final result looks unrealistically computation-heavy.
 
This would be the theoretical background of the ray grouping implementation in the TTU.
https://my.eng.utah.edu/~cs6958/papers/HWRT-seminar/a160-nah.pdf
6 Ray Accumulation Unit for Latency Hiding

In our approach, the rays that reference the same cache line are accumulated in the same row in an RA buffer. When the shape data requested by the accumulated rays is fetched from the L2 cache, these rays with the shape data are transferred from the RA buffer to the operation pipeline.
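
A toy software model of that RA-buffer idea (purely illustrative; the real unit is fixed-function hardware, not a hash map):

```cpp
// Toy model of the Ray Accumulation (RA) buffer described in the paper: rays
// that request the same node/cache line are parked in the same row, and the
// whole row is released to the intersection pipeline once the line arrives.
#include <cstdint>
#include <unordered_map>
#include <vector>

using RayId = uint32_t;
using CacheLineAddr = uint64_t;

class RayAccumulationBuffer {
public:
    // A ray stalls on a node fetch: file it under the cache line it is waiting for.
    void accumulate(RayId ray, CacheLineAddr line) {
        rows_[line].push_back(ray);
    }

    // The memory system delivered a line: every ray parked on it can now run
    // its box/triangle tests back to back against the same (now resident) data.
    std::vector<RayId> onLineArrived(CacheLineAddr line) {
        auto it = rows_.find(line);
        if (it == rows_.end()) return {};
        std::vector<RayId> ready = std::move(it->second);
        rows_.erase(it);
        return ready;
    }

private:
    std::unordered_map<CacheLineAddr, std::vector<RayId>> rows_;
};
```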


Imagination blogged about coherency gathering nearly 2 years ago, which is used by PowerVR's ray tracing implementation.
https://www.imaginationtech.com/blo...racing-the-benefits-of-hardware-ray-tracking/
Coherency gathering

What PowerVR’s implementation of ray tracing hardware acceleration does, which is unique compared to any other hardware ray tracing acceleration on the market today, is hardware ray tracking and sorting, which, transparently to the software, makes sure that parallel dispatches of rays do have similar underlying properties when executed by the hardware. We call that coherency gathering. Other ray tracing solutions in the industry do this crucial step in software, which inevitably will be slower and more inefficient.

The hardware maintains a database of rays in flight that the software has launched and is able to select and group them by where they’re heading off to in the acceleration structure, based on their direction. This means that when they’re processed, they’re more likely to share the acceleration structure data being accessed in memory, with the added bonus of being able to maximise the amount of parallel ray-geometry intersections being performed by the GPU as testing occurs afterwards.
 
Global sorting seems like a non-starter given the amount of data you will need to move around the chip. You would need to run extremely expensive traces for it to be worthwhile and at that point performance will be in the gutter anyway.

It seems PowerVR's RTU is separate from the compute clusters, unlike Nvidia's RT core. Now it makes sense how they implemented global ray sorting without a negative impact.
https://ubm-twvideo01.s3.amazonaws.com/o1/vault/gdceurope2015/Davis_Joe_PowerVRGraphicsLatest.pdf
 
It seems PowerVR's RTU is separate from the compute clusters, unlike Nvidia's RT core.
It doesn't really matter where the RT cores are located; what matters is whether RT operations are decoupled from the SIMDs or not.
Both PowerVR's RTU and Nvidia's RT cores act as offload accelerators which do all tree-traversal and intersection ops on their own and return to the SM only when certain shader-controlled decisions have to be made, such as whether to trace further in an any-hit shader, for example.
The difference between the two comes down to what SMs or Tile infrastructure can be reused and to latencies.

Now it makes sense how they implemented global ray sorting without a negative impact.
It doesn't. In order to do the global sort efficiently without travelling back and forth between caches and DRAM, you have to store state for all of the screen's rays/pixels in on-chip SRAM, which you obviously can't; if that were possible, there would have been no need for tiled processing in the first place :)
 
Both PowerVR's RTU and Nvidia's RT cores act as offload accelerators which do all tree-traversal and intersection ops on their own and return to the SM only when certain shader-controlled decisions have to be made, such as whether to trace further in an any-hit shader, for example.
Nvidia also returns to the SM in some cases of tree traversal, like multi-layer instances, etc.

Would love to see synthetic benchmarks for such cases.
 
It seems PowerVR's RTU is separate from the compute clusters, unlike Nvidia's RT core. Now it makes sense how they implemented global ray sorting without a negative impact.
https://ubm-twvideo01.s3.amazonaws.com/o1/vault/gdceurope2015/Davis_Joe_PowerVRGraphicsLatest.pdf

The GR6500 (Wizard) is legacy IP by now, and you're linking to a presentation from 2015 which has absolutely nothing in common with their newly announced Photon architecture. You can find the correct details on how and where RT has been integrated into their C-Series GPU (called Albiorix, which is two entire GPU generations later than Rogue: Rogue -> Furian -> Albiorix, plus two refresh generations), amongst others, here: https://www.imaginationtech.com/graphics-processors/architecture/powervr-photon-architecture/

A good reference for it is the PowerVR Photon whitepaper, and one of the interesting tidbits is:

PowerVR TBDR and Coherency Sorting

PowerVR pioneered tile-based deferred rendering (TBDR) as far back as 1996. The focus of TBDR is efficiency, both in processing as well as bandwidth. Tile-based rendering does this by sorting all the triangle geometry into screen-space tiled regions first before rendering. This is different from immediate mode rendering (IMR) where every triangle is transformed and immediately drawn. The benefit of sorting all geometry and then rendering per screen-space tile region (usually 16x16 or 32x32 pixels in size) is that we can complete the rendering of the tile region solely using on-chip memory for the depth/stencil buffer as well as the colour buffer. IMRs push all this bandwidth off-chip and depend on cache hits to reduce it, but as geometry submissions are not spatially coherent in screen space this caching approach typically fails, leading to high bandwidth, latency sensitivity and poor power efficiency.

Therefore, by sorting geometry first the cache hit rate effectively becomes 100%. Additionally, depth and stencil buffers are often only used once and hence can be discarded. With GBuffer and MRT rendering many of the MRT "colour" targets are only used for intermediate scratchpad data and only one colour buffer is required to be written out to memory. With TBDR, all of this can be done on chip, saving memory footprint and very significant amounts of bandwidth.

TBDR also offers significant benefits in handling anti-aliasing. As the oversampled buffers only ever exist in on-chip memory, only the downsampled colour targets are written out, yet again saving memory footprint and bandwidth.

The PowerVR Photon ray tracing architecture is in many ways identical to the PowerVR TBDR architecture in that a spatial sort is also done, only rather than in 2D screen space we bin rays into packets which travel along similar paths through the BVH. The benefits here are similar to what we find with coherency sorting; namely significant cache efficiency and reduced bandwidth, while processing remains in a SIMD/SIMT nature, ensuring high power efficiency of the logic and overall processing.

Otherwise, IMHO, their implementation inherits similar advantages and disadvantages to their usual PowerVR architecture.
 
The GR6500 (Wizard) is legacy IP by now, and you're linking to a presentation from 2015 which has absolutely nothing in common with their newly announced Photon architecture. You can find the correct details on how and where RT has been integrated into their C-Series GPU (called Albiorix, which is two entire GPU generations later than Rogue: Rogue -> Furian -> Albiorix, plus two refresh generations), amongst others, here: https://www.imaginationtech.com/graphics-processors/architecture/powervr-photon-architecture/

My bad that I didn't check the recent whitepaper carefully.
It seems they have chosen to spread the RACs out all around (just like RT cores), and the coherency gathering occurs in each RAC independently.
Therefore it's not global sorting, and it sounds identical to the ray operation scheduling that Nvidia is already using:
Photon:
rays are grouped into processing packets that will achieve high efficiency, not only in processing but also in memory access. This sorting gives us another benefit: rather than a MIMD architecture we return instead to the high-efficiency processing approach common inside the GPU: many units which all do the same thing.
As a result, we can exploit parallelism as we do not just check one ray against one box, we can check many rays against the same box. This brings significant efficiency gains and reduces stress on the cache and memory subsystems. The same is true for triangle intersections: we can check a ray against multiple triangles concurrently.

RTX:
The cache 750 thus imposes a time-coherency on the TTU 700's execution of any particular collection of currently-activated rays that happen to be currently waiting for the same data by essentially forcing the TTU to execute on all of those rays at about the same time. Because all of the rays in the group execute at about the same time and each take about the same time to execute, the cache 750 effectively bunches the rays into executing time-coherently by serving them at about the same time. These bunched rays go on to repetitively perform each iteration in a recursive traversal of the acceleration data structure in a time-coherent manner so long as the rays continue to request the same data for each iteration.
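
For what it's worth, both quotes boil down to amortising one node/box fetch over a group of rays. Here's a generic slab-test sketch of "many rays against the same box" (nothing vendor-specific, just the plain algorithm):

```cpp
// Generic sketch of the "many rays against the same box" idea: the box is
// loaded once and amortised over a whole packet of rays.
#include <algorithm>
#include <vector>

struct Aabb { float lo[3], hi[3]; };
struct Ray  { float o[3], invD[3], tMax; };   // direction stored as reciprocal

// Test one ray against one AABB using the slab method.
static bool hitBox(const Ray& r, const Aabb& b) {
    float tNear = 0.0f, tFar = r.tMax;
    for (int a = 0; a < 3; ++a) {
        float t0 = (b.lo[a] - r.o[a]) * r.invD[a];
        float t1 = (b.hi[a] - r.o[a]) * r.invD[a];
        if (t0 > t1) std::swap(t0, t1);
        tNear = std::max(tNear, t0);
        tFar  = std::min(tFar,  t1);
    }
    return tNear <= tFar;
}

// Intersect a whole packet against a single node box: one box fetch, many tests.
std::vector<bool> packetVsBox(const std::vector<Ray>& packet, const Aabb& box) {
    std::vector<bool> hit(packet.size());
    for (size_t i = 0; i < packet.size(); ++i)
        hit[i] = hitBox(packet[i], box);      // in hardware these run in parallel
    return hit;
}
```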

I hope this will be my final opinion on PowerVR's coherency gathering. :-|
 
As much as I love PowerVR, we don't know how it translates into real life for the PC market. For all we know, the solution is smart on paper but doesn't work at all in a real product....
 
As much as I love PowerVR, we don't know how it translates into real life for the PC market. For all we know, the solution is smart on paper but doesn't work at all in a real product....

It's OT anyway, but it was only a couple of days ago that the first datacenter GPU from a Chinese vendor appeared, based on their previous-generation B-Series GPU:

https://www.tomshardware.com/news/chinese-xindong-fenghua-gpu-announced

If any manufacturer builds a chip based on the C-Series GPU IP, I guess it'll take at least 1.5 years from now until it appears on shelves, and then probably in China only.
 
As much as I love PowerVR, we don't know how it translates into real life for the PC market. For all we know, the solution is smart on paper but doesn't work at all in a real product....

I always wondered how well TBDR translates into this new era of compute. If you have a compute shader writing to a UAV the hardware isn’t able to automatically “tile” the compute threads and UAV memory accesses like it can with a pixel shader and render target.
 
I always wondered how well TBDR translates into this new era of compute. If you have a compute shader writing to a UAV the hardware isn’t able to automatically “tile” the compute threads and UAV memory accesses like it can with a pixel shader and render target.

The Metal API introduced the concept of tile shading for the specific purpose of letting compute shaders directly access tile memory in a render pass. The problem with exploiting tile-based architectures for compute is the limited amount of data that you can hold in a tile. It can be very awkward to do post-processing filters like bloom or DoF in this case if you need to access data that lies outside of the tile boundary. On the face of it, bindless and ray tracing don't seem like a good fit for the ideal usage patterns of tile memory ...
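
A small back-of-the-envelope example of why wide-footprint filters fight with tile memory; the tile size and filter radius below are just assumed typical values:

```cpp
// Illustration of why filters with a wide footprint are awkward for on-chip tile
// memory: a blur of radius R over a TxT tile has to read a (T+2R)x(T+2R) halo,
// and the extra reads fall outside what the tile pass kept on chip.
#include <cstdio>

int main() {
    const int tile = 32;           // typical tile edge length in pixels (assumed)
    const int radius = 16;         // e.g. a wide bloom/DoF gather radius (assumed)
    const double inTile   = double(tile) * tile;
    const double withHalo = double(tile + 2 * radius) * (tile + 2 * radius);
    std::printf("Halo reads inflate the working set by %.1fx (%d^2 -> %d^2)\n",
                withHalo / inTile, tile, tile + 2 * radius);
    return 0;
}
```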
 