Next gen lighting technologies - voxelised, traced, and everything else *spawn*

Coherency gathering in ray tracing: the benefits of hardware ray tracking
11 February 2020 - Rys Sommefeldt


https://www.imgtec.com/blog/coheren...racing-the-benefits-of-hardware-ray-tracking/

The write-up brings up the breakdown of access locality with RT versus traditional rasterization, the latter of which, as I've noted, is very good at matching the underlying structure of cache and DRAM subsystems. That and other long-established efficiency and acceleration methods make it difficult to replace a solution that is physically robust, if graphically incomplete.

How this solution specifically handles coherence gathering isn't stated, but some googling turned up mentions of subdividing the acceleration structure into spatial and bounding-volume levels. The spatial portion may coincide with the stated grouping of rays by their direction into the structure. Tracking how often rays with a given direction hit which sub-structures when entering a certain part of space would allow the most frequently hit structures to be ordered first. There is probably an inflection point where a decently coherent bunch of rays can cache a good subset of the structure on-die, or possibly share some calculations if the hardware can somehow determine that multiple rays will have very similar results.
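To make that concrete, here is a minimal sketch of what such grouping could look like in software (names and the binning scheme are my own illustration, not anything IMG has described): quantise each ray's origin to a coarse grid cell and its direction to an octant, then sort by the combined key so that rays likely to enter the same part of the structure are processed together.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <vector>

struct Ray { float ox, oy, oz; float dx, dy, dz; };

// Quantise origin to a coarse grid cell and direction to an octant.
// Rays sharing a key tend to enter the same region of the structure.
static uint32_t BinKey(const Ray& r) {
    auto cell = [](float v) { return uint32_t(int32_t(std::floor(v / 8.0f)) & 0xFF); };
    uint32_t octant = uint32_t(r.dx > 0.f)
                    | (uint32_t(r.dy > 0.f) << 1)
                    | (uint32_t(r.dz > 0.f) << 2);
    return (cell(r.ox) << 19) | (cell(r.oy) << 11) | (cell(r.oz) << 3) | octant;
}

// Sort rays by bin key so coherent groups are dispatched together.
void ReorderRays(std::vector<Ray>& rays) {
    std::sort(rays.begin(), rays.end(),
              [](const Ray& a, const Ray& b) { return BinKey(a) < BinKey(b); });
}
```

A hardware implementation would presumably bin incrementally as rays arrive rather than sorting in bulk, but the key construction captures the idea.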
A mobile architecture is going to feel the memory pinch sooner, so this may place enough of a premium on bandwidth to justify hardware for ray tracking, much like rasterization has accumulated built-in acceleration over time. I'd be curious whether this finds uptake, and how significant a difference it makes.

It does seem like some of the other industry trends remain, in that there's likely a BVH or something similar at some level in the structure, and the execution model is compatible with a stack-based model, which just seems to be what silicon and DRAM play best with thus far. There's some reference to how other solutions may work to improve coherence with compute or coincidental capture of locality, likely via caches backing hardware that traverses at a different granularity than the wide fragment hardware.
 
Hmmm... grouping rays by direction and origin into clusters of the BVH is pretty much the standard idea of reordering, which has existed for a long time.
The fact that NV did not implement it (yet), although they did a lot of related research, and the fact that ImgTec's licensing lists the coherency engine as optional, lead me to the conclusion that it is not worth it yet.
That's contrary to what I thought all those years before, when I assumed reordering would be the only way to practical realtime RT at all.

Besides those facts above, there are some assumptions supporting this conclusion, starting from the question: what do we want to achieve with RT for now?

I'd say area lights and shadows have the highest priority, because shadow maps and point lights suck. We have not seen a lot of this yet, but that's because RT is still niche. Technically it should work well, because rays are already coherent and no hit-point shading is necessary.
Next is sharp reflections, because SSR sucks even more. Again, rays are already coherent - no need for reordering.
Then AO. Rays share only their origin; direction is not coherent. But they are short-range, and there's no hit-point shading. HW reordering still seems not worth it.
Finally, glossy reflections and GI. Only here does reordering seem a win. But IIRC the win was only about a factor of two in the papers I remember? It's also the field where RT appears inefficient in general.

So that's why I say scratch it for now.
The worst limitation actually is missing LOD, so RT does not scale. (Assuming discrete LOD is no longer an option with the image quality we aim for and the high cost RT has, which I'm not sure about.)
If I could choose, I'd pick traversal shaders over HW reordering. I doubt we'll get both with upcoming GPUs.
 
What I really would like to see, instead of a black-boxed reordering solution, is a way to let us implement this ourselves.
But I don't say this because I'm the fixed-function-hostile guy ;)

No, what I mean is this: a key requirement for reordering, but also for many other applications, including in games, is the need to sort or bin large amounts of data.
The article above mentions this too. Having a hardware sorting unit but not exposing it for other uses seems a waste and a missed opportunity.
Programmable reordering for RT may appear impractical now, but if we are not in a hurry it may look different in the future. Traversal shaders would already be a first step.
 
Not sure what you mean by data management, but that sounds like an application.
I really mean just sorting a huge array. Some applications coming to mind:
Binning lights to a grid of tiles.
Binning a hash from ray origin and direction to group rays.
Linking fluid particles to a uniform grid.
Building trees (which can reduce to simple sorting).
Resolving indirections.
Any kind of clustering / grouping problem.

It's not that GPUs perform badly here, but if HW could give a speedup of 10x it would surely be worth it, considering it's such a simple problem with so many applications.
But I lack a detailed idea of how this should look exactly. For example, 'just binning' can be done faster than 'full sorting', and the algorithms differ completely.
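To illustrate that last distinction, a minimal sketch of the 'just binning' case (hypothetical code, CPU-side for clarity): a histogram, a prefix sum, and a scatter bin n items in O(n) with no comparisons, leaving order within a bin undefined - which is exactly where it differs from a full sort. A GPU version would do the same with atomics and a parallel prefix sum.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Bin items by a precomputed key in two passes: count per bin,
// prefix-sum the counts into start offsets, then scatter indices.
// Unlike a full comparison sort, order inside a bin is not defined.
std::vector<uint32_t> BinByKey(const std::vector<uint32_t>& keys, uint32_t numBins) {
    std::vector<uint32_t> offsets(numBins + 1, 0);
    for (uint32_t k : keys) ++offsets[k + 1];      // histogram
    for (uint32_t b = 1; b <= numBins; ++b)
        offsets[b] += offsets[b - 1];              // exclusive prefix sum
    std::vector<uint32_t> order(keys.size());
    for (std::size_t i = 0; i < keys.size(); ++i)
        order[offsets[keys[i]]++] = uint32_t(i);   // scatter item index
    return order;
}
```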
 
Hmmm... grouping rays by direction and origin into clusters of the BVH is pretty much the standard idea of reordering, which has existed for a long time.
The fact that NV did not implement it (yet), although they did a lot of related research, and the fact that ImgTec's licensing lists the coherency engine as optional, lead me to the conclusion that it is not worth it yet.
That's contrary to what I thought all those years before, when I assumed reordering would be the only way to practical realtime RT at all.
IMG's ray tracing media go back a number of years, so questions like whether this latest article applies to a coherency engine today, versus a block of that name in the 2014 Wizard GPU, might illuminate why this seems to apply to upcoming IP.
Unlike Nvidia, there's also hardware for the other elements of the acceleration structure build, traversal, and intersection-testing process.
Having that much additional processing on-chip may provide more metadata to the coherence hardware, or can provide caching, tighter latency, or better prefetching of said metadata.
Since this is a TBDR rather than a somewhat intermediate-sorted architecture, perhaps the initial decision to put the work up-front simplifies the thinking further down the pipeline, since there's hardware that is more aware of the residency and lifetime of such data, rather than making a best guess mid-stream with off-chip buffers, separate shaders, and looser synchronization.

It's also a mobile architecture, where die area, memory bus, and power consumption can be more costly to expend in pursuit of this method. After a review, I was also reminded that IMG's A-Series is much more coarse than other IP, at 128-thread granularity. The wider hardware and simplified ISA would be hit harder by kernels that already show partial utilization on 32-wide or narrower hardware.
 
IMG's ray tracing media go back a number of years, so questions like whether this latest article applies to a coherency engine today, versus a block of that name in the 2014 Wizard GPU, might illuminate why this seems to apply to upcoming IP.
Do you think ImgTec still offers their RT blocks to other GPU makers? The question bothering me is how difficult this would be to integrate. (Same question as when considering non-AMD RT in PS5.)
Or are their licensing offers only for whole ImgTec GPUs for SoC makers?

IMG's A-Series
Which has 2 TF at most.
Can we assume its practical compute performance would be similar to a 2 TF desktop GPU, or a PS4?
I ask in the context of comparing mobile GPUs of any brand vs. desktop GPUs in general, so if anyone can share some experience or an educated guess I'd appreciate it.
It's hard to get any impression of mobile compute perf from the sparse specs and graphics benchmarks available. For example, I don't know whether mobile GPUs have reserved on-chip LDS memory at all.
 
Do you think ImgTec still offers their RT blocks to other GPU makers? The question bothering me is how difficult this would be to integrate. (Same question as when considering non-AMD RT in PS5.)
Or are their licensing offers only for whole ImgTec GPUs for SoC makers?


Which has 2 TF at most.
Can we assume its practical compute performance would be similar to a 2 TF desktop GPU, or a PS4?
I ask in the context of comparing mobile GPUs of any brand vs. desktop GPUs in general, so if anyone can share some experience or an educated guess I'd appreciate it.
It's hard to get any impression of mobile compute perf from the sparse specs and graphics benchmarks available. For example, I don't know whether mobile GPUs have reserved on-chip LDS memory at all.
Even if they did offer blocks, they would still need to be heavily customized for this to work. As you noted, the previous GPU their RT shipped in was a mobile GPU. They would need to scale this up significantly while working with AMD tech.
 
Do you think ImgTec continues to offer their RT blocks for other GPU makers? The question bothering me is how difficult this is to integrate. (Same question as when considering non AMD RT in PS5)
Or are their licensing offers only about a whole ImgTec GPU for SOC makers?
I've only seen references to licensing the GPU to SoC makers, although the pool of those that don't make their own is smaller these days.
There was the recent announcement of some kind of IP agreement with Apple, after a period in which Apple claimed it had weaned itself off of IMG.


Which has 2 TF at most.
Can we assume its practical compute performance would be similar to a 2 TF desktop GPU, or a PS4?
I haven't found a good comparison point. The few benchmarks I've found are old and ran on implementations far below that range.

It's hard to get any impression of mobile compute perf from the sparse specs and graphics benchmarks available. For example, I don't know whether mobile GPUs have reserved on-chip LDS memory at all.
PowerVR's Rogue architecture has a Common Store per shading cluster that holds workgroup shared memory. Details aren't as well documented for the A-Series (or the B-Series that was listed under 2020 on the roadmap in the A-Series announcement).
 
So maybe I was more on point with my initial assumption that RTX does reordering. Though I'm not sure one can draw such conclusions from the given tests at all, but... I know nothing :)
Also interesting that they speculate it is 'closely connected to the TMU'.

Edit: Likely it's a compromise. In their diagrams it seems rays are reordered per RT core, so not globally. Makes sense - no need for huge binning workloads and sync across the whole chip. Grouping ray generation spatially (pixel quads) would still be expected to have a big positive impact on performance.

Still hoping to see the mentioned GPU work-generation capabilities (also shown with task shaders) become exposed for general compute soon... :)
 
(Among others) they come to the conclusion that RTX does use some sorting
That's not backed by the measured results though; the conclusions are just wishful thinking. The boost they measured is indistinguishable from coincidental cache coherency, in an implementation which is balanced so well that (for common use cases) there is only a factor of 4-10x between being ALU bound (best case) and memory-throughput bound (worst case).
There were no methodical measurements, which would have required specifically crafted synthetic scenes to deliberately trigger certain divergence patterns and thereby provide interpretable results.

No indication of "GPU work generation" either. In contrast, NVidia warns not to increase the recursion level too much, urging you to write it explicitly in iterative form - indicating that this is actually just pure recursion, and when overdoing it you pay the price of spilling the stack to main memory. The API doesn't even allow you to formulate dynamic work generation, as you read back the results of each dispatch synchronously.
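For illustration, a C++-flavoured sketch of that recursion-versus-iteration point (Ray, Hit, and TraceClosest are hypothetical stand-ins, not the actual API): both functions compute the same result, but the iterative form keeps a constant stack footprint per thread, which is what the advice to rewrite in iterative form buys you.

```cpp
struct Ray { float o[3]; float d[3]; };
struct Hit { bool valid; Ray next; float attenuation; };

// Hypothetical stand-in for the hardware trace call (dummy body here).
Hit TraceClosest(const Ray& r) { return Hit{false, r, 0.0f}; }

// Recursive form: each bounce nests another shader frame on the
// per-thread stack - the pattern that gets expensive at high depths.
float ShadeRecursive(const Ray& r, int depth) {
    if (depth == 0) return 0.0f;
    Hit h = TraceClosest(r);
    if (!h.valid) return 1.0f;                // miss: sky contribution
    return h.attenuation * ShadeRecursive(h.next, depth - 1);
}

// Iterative form: same result, constant stack depth. The bounce state
// lives in a few registers instead of nested shader frames.
float ShadeIterative(Ray r, int maxDepth) {
    float throughput = 1.0f;
    for (int depth = 0; depth < maxDepth; ++depth) {
        Hit h = TraceClosest(r);
        if (!h.valid) return throughput;      // miss: sky contribution
        throughput *= h.attenuation;
        r = h.next;
    }
    return 0.0f;                              // path terminated
}
```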

Nor is work generation necessary at all. Dispatching another ray from the closest-hit shader is merely re-entering the ray-traversal state machine with a new ray, while other rays of the same warp may still be in flight. (The individual threads' stack depths diverge at that point, with both the closest-hit and ray-generation shaders on the stack for the first thread to enter a secondary ray, but that doesn't matter for SIMD execution of the warp.)

As for "reordering", no indicator for that either. The traversal unit may just be marching over all threads (rays) of a warp in SIMD as far as we know. So far there is no indicator the scheduling is any finer than the usual "per-warp" tracking of in-flight memory accesses, stalling each a full warp until the memory dependencies for all threads are satisfied.

What isn't clear so far is how any-hit shader invocations are batched.
Only when two threads of a warp happen to coincidentally get their hit in the same cycle (and otherwise instant shader invocation, effectively scalar)?
Or deferred until either all threads have arrived in the any-hit state, or some threads have reached a common state but another thread wants to enter an incompatible state?
I expect it's some form of deferred invocation.
 
As for "reordering", no indicator for that either. The traversal unit may just be marching over all threads (rays) of a warp in SIMD as far as we know. So far there is no indicator the scheduling is any finer than the usual "per-warp" tracking of in-flight memory accesses, stalling each a full warp until the memory dependencies for all threads are satisfied.

As I understood it, they saw a speedup when making paths like generation->hit->hit shaders, but it was slower when doing just a loop inside a single generation shader.
The conclusion could be that the loop, by forcing the rays to return to the caller, disables reordering, while changing shaders could allow rays to be reordered freely.
Real or not, this would allow shuffling at least the rays in flight within a single SM, increasing both traversal and hit-shader selection coherence.

No idea if this is practical and worth it. But if so, the advantage would vanish once we look towards traversal shaders for stochastic LOD, which would be a problem.
At this point I would propose a hardware solution for stochastic LOD. It probably would not be that hard to agree upon a common specification for this.
 
As I understood it, they saw a speedup when making paths like generation->hit->hit shaders, but it was slower when doing just a loop inside a single generation shader.
And they saw that "speedup" even for primary rays only, too. So no loop involved, which indicates something bogus about their test setup. Also some really weird speedups when going from secondary to tertiary rays, once again reaching the performance level of fully coherent primary rays?!
The numbers for the Vulkan implementation are not plausible at all. And I seriously doubt they used comparable flags for the AS build and BLAS instantiation in both the DX12 and Vulkan implementations.
 
Also, comparing different implementations in different APIs seems not that wise if the goal is reverse-engineering HW, I have to add :)
 
I'd say area lights and shadows have the highest priority

I don't see how ray tracing helps much with accurate area lights for major light sources (the small stuff, and GI secondary lighting where fudging visibility is less relevant, are orthogonal). You can use ray tracing to do the exact same thing as irregular-Z-buffer hard shadows, but you will likely also just use the same image-space fudging to soften the shadows... because ray tracing does little to make soft shadows more efficient.

If the hardware could trace a cone or a frustum through the BVH, so that you could efficiently implement bitmasked soft shadows, it would be a different matter.
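For what that might look like, a rough sketch (all types and the test itself are my own illustration): approximate the cone by an axis whose radius grows linearly with distance, and each BVH node by its bounding sphere. For the narrow cones relevant to soft shadows this errs on the inclusive side, which is what a traversal test wants - false positives just cost a redundant leaf visit.

```cpp
#include <algorithm>
#include <cmath>

struct Vec3 { float x, y, z; };
struct Aabb { Vec3 lo, hi; };

static float Dot(Vec3 a, Vec3 b) { return a.x*b.x + a.y*b.y + a.z*b.z; }
static Vec3  Sub(Vec3 a, Vec3 b) { return {a.x - b.x, a.y - b.y, a.z - b.z}; }

// Cheap conservative-ish cone-vs-node overlap: compare the node's
// bounding-sphere distance from the cone axis against the cone radius
// at the projected distance. Exact behaviour near the apex is fiddly,
// but for narrow shadow cones this is close enough as an illustration.
bool ConeOverlapsNode(Vec3 apex, Vec3 axis /*normalised*/, float tanHalfAngle,
                      const Aabb& box) {
    Vec3 center = { 0.5f * (box.lo.x + box.hi.x),
                    0.5f * (box.lo.y + box.hi.y),
                    0.5f * (box.lo.z + box.hi.z) };
    Vec3 half = Sub(box.hi, center);
    float sphereR = std::sqrt(Dot(half, half));   // node bounding sphere

    Vec3 v = Sub(center, apex);
    float t = std::max(Dot(v, axis), 0.0f);       // distance along the axis
    Vec3 onAxis = { apex.x + axis.x * t, apex.y + axis.y * t, apex.z + axis.z * t };
    Vec3 off = Sub(center, onAxis);
    float distToAxis = std::sqrt(Dot(off, off));

    float coneR = t * tanHalfAngle;               // cone radius at t
    return distToAxis <= coneR + sphereR;
}
```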
 
because ray tracing does little to make soft shadows more efficient.
Huh? Why do you think so?

The only extra cost for an RT area light is the rays becoming a bit less coherent.
The only extra cost for many lights is nothing: just pick one random light stochastically per sample.
Of course it depends on how well denoising works, but we've seen it works pretty well. Just trade between IQ and perf.
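As a minimal sketch of that 'one random light per sample' idea (names hypothetical, the shadow ray itself stubbed out): the cost stays at one shadow ray per sample however many lights there are, and dividing by the selection probability keeps the estimate unbiased; the denoiser then deals with the variance.

```cpp
#include <cstdint>
#include <random>

struct Light { float x, y, z; float intensity; };

// One shadow ray per sample regardless of light count: pick a light
// uniformly at random and divide by its pick probability (1/count),
// so the Monte Carlo estimate stays unbiased.
float SampleOneLight(const Light* lights, uint32_t count, std::mt19937& rng) {
    std::uniform_int_distribution<uint32_t> pick(0, count - 1);
    const Light& l = lights[pick(rng)];
    float visibility = 1.0f;  // stand-in: trace one shadow ray toward l here
    return visibility * l.intensity * float(count);  // divide by pdf = 1/count
}
```

Picking proportional to estimated contribution instead of uniformly would cut variance further; the structure stays the same, only the pdf changes.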

Now compare with the SM alternatives: we need to pick one of many techniques, each with its own limitations and costs. Likely we end up implementing multiple.
Many lights is impossible, because each light adds a high constant cost.
It does not scale, and it does not work well either, so it's extra work for artists too.
 