Game development presentations - a useful reference

Those result clips are very convincing, and the scope to develop the ideas here seems broad. Isn't this completely novel?

I've been reading this and it sounds like an amazingly smart optimization. I didn't understand what was happening at first. I'm sure you understand it already, but for anyone that hasn't read it so far: basically they take the concept of an occupancy map (OM), which is an already well-understood voxel representation, and pre-calculate a distribution of candidate ray directions to generate a number of rotated occupancy maps (rotations of the original OM), so that all rays share the same direction, parallel to the z-axis. That set of occupancy maps is a scene representation called a ray-aligned occupancy map array (ROMA). As far as I've read, that makes it very easy to sort rays to be cache friendly and keep GPU occupancy high. The ROMA can be generated very fast, faster than distance-field representations, and tracing a ray is O(1).
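To make that concrete, here's a minimal CPU-side sketch of what a single ray query against one map of the ROMA could look like, assuming each (x, y) cell stores a 64-bit occupancy mask along z. The struct and names are mine, not the paper's; finding the closest occupied voxel in front of the ray origin is then just one load plus a couple of bit operations.

```cpp
#include <bit>
#include <cstdint>
#include <vector>

// One z-aligned occupancy map of a ROMA (illustrative layout, not the
// paper's): every (x, y) cell is a 64-bit column, bit i set = voxel i
// along z is occupied. Rays for this map were pre-aligned with +z by
// picking the rotated map closest to their direction.
struct ZAlignedOccupancyMap {
    int resXY = 0;                       // columns per axis
    std::vector<uint64_t> columns;       // resXY * resXY column masks

    // First occupied voxel at or beyond startZ, or -1 on miss.
    // Constant time: no traversal, just bit masking and a count.
    int traceRayZ(int x, int y, int startZ) const {
        if (startZ >= 64) return -1;                 // past the volume
        uint64_t m = columns[y * resXY + x];
        m &= ~uint64_t{0} << startZ;                 // drop voxels behind origin
        return m ? std::countr_zero(m) : -1;         // closest set bit
    }
};
```

An arbitrary ray is served by the pre-rotated map whose alignment is closest to its direction, which, as described above, is what lets every query take this same z-aligned fast path.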

It has some limitations, similar to other voxel techniques: pure specular reflections and hard shadows are not supported, and soft shadows and glossy reflections miss out on an optimization, since those effects require precise visibility.

It's also not a GI solution. Interesting that they say constructing the ROMA is slower than building a BVH, because BVH builds are hardware accelerated, but tracing rays against the ROMA is faster. They suggest hardware acceleration for building the ROMA could make its construction similarly fast to a BVH's. It's also not suitable for large scenes, so they suggest partitioning a scene into multiple ROMAs, with lower resolutions distant from the camera.

Interesting stuff.
 
Isn't this completely novel?
See the given references, but I think it's a new variant of an old idea.
Which basically is to presort the scene along a given direction, eventually projecting and binning it on a regular grid in the perpendicular plane.
If we have a max bound on depth complexity, we can then claim an O(1) TraceRay function along this direction. Instead of traversal we can use binary or linear search with linear memory access.
That's awesome, but to support multiple directions, we now need to build one such acceleration structure for each. And if we have many unique directions, classical RT quickly becomes faster.

One older variant of this idea I remember is this realtime GI experiment:
The blog post seems gone, but IIRC he first rendered the scene with depth peeling from multiple directions with an ortho projection.
The peeled depth layers give us all surface fragments along the direction, in order. So with a max depth of, say, 32, we get an O(1) TraceRay function just like with the new paper.
We can also easily calculate light transport from one fragment to the next, which is what I think he has done.
Same idea, different implementation, similar data structures.
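A minimal sketch of the kind of data structure both variants boil down to (names and layout are my own guess, not from either source): each cell of a grid perpendicular to the shared direction holds its surface fragments presorted by depth, capped at a small maximum, so finding the next hit in front of a query depth is a short bounded search.

```cpp
#include <array>
#include <cstdint>
#include <vector>

// Presorted-per-direction scene: a 2D grid perpendicular to the chosen
// direction, each cell holding up to kMaxLayers fragment depths in order
// (e.g. produced by depth peeling). Illustrative only.
constexpr int kMaxLayers = 32;

struct DirectionalDepthLayers {
    int res = 0;
    std::vector<std::array<float, kMaxLayers>> cellDepths; // sorted per cell
    std::vector<uint8_t> cellCounts;                        // fragments per cell

    // Index of the first fragment at or beyond 'depth', or -1 on miss.
    // Bounded by kMaxLayers, so the cost has a fixed small cap; a binary
    // search over the sorted depths works just as well.
    int nextHit(int x, int y, float depth) const {
        const auto& layers = cellDepths[y * res + x];
        const int count = cellCounts[y * res + x];
        for (int j = 0; j < count; ++j)
            if (layers[j] >= depth) return j;
        return -1;
    }
};
```

Because consecutive fragments of a cell sit next to each other in memory, propagating light from one fragment to the next along the direction (as the GI experiment apparently did) is just a linear walk over the same array.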

The problem in general is scaling. It's fast if we're happy with a 32^3 voxel box representing the whole scene, but if we need much higher resolution, the brute-force generation of the acceleration structure quickly becomes the bottleneck. Compressed offline data for static scenes might make sense.
We can see in the paper that DXR generating the BVH is already faster than voxelizing and then rotating / projecting the result multiple times.

The specific problem of the paper is probably the lack of a surface representation, which is why they used reflective shadow maps to show GI (the surface material comes from the shadow map, but is thus limited to just the first bounce).
Due to that limitation, we can not use it as a simple fallback for HWRT on non-RT hardware: because we do not know the surface material at ray hitpoints, we could not easily implement the same RT lighting model on the fallback branch.
Adding surface data means one bit per voxel is no longer enough.

So to me it seems most interesting to get some proper long range AO. But in case we need that, we're already fucked anyway. :D
But still interesting. : )
 
Not looked at the doc, but does Intel Arc already do ray reordering etc.?
Would this still benefit GPUs that have that hardware?
Not saying it's not useful in general. Just a question.
 

I really like this talk. He spends a lot of time talking about "simple" things that he doesn't think are easily available. Really just trying to cover their tech and also improve general availability of some common things.

The GI solution here seems interesting.
 
Not looked at the doc, but does Intel Arc already do ray reordering etc.?
They group hit points with the same materials, so a thread group likely processes the same material shaders, afaik.
But they do no ray reordering during traversal, which would let a group of rays traverse the same group of geometry, afaik.
The same likely applies to NV.
Notice that material sorting does not make the TraceRay call itself faster, only the processing of its results.
Current APIs are also not designed to support full reordering yet. If they were, we would likely dispatch a large buffer of rays, the GPU would do traversal and reordering in the background, and we would get back a large buffer of hit points to process in a later shader.
While traversal reordering is, in theory, the most promising optimization to make tracing faster, a fully global reordering step happening multiple times during traversal would cause a lot of data movement across the chip.
It's not clear whether it can be worth it. IIRC, (now dated) NV research on GPU simulation achieved only a 2x speedup in practice from reordering, and only if rays are very incoherent (GI or long-range AO; no benefit for shadows or reflections).
At the moment I expect bigger speedups from software research (ReSTIR, neural radiance caching, etc.) and don't expect global traversal reordering anytime soon.
But I would not be surprised if NV already does, or plans to add, some local reordering, e.g. at the granularity of one SM (or a cluster of SMs).
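For the material-sorting part, here is a purely illustrative CPU-side sketch (not how any driver actually does it) of binning hit points by material so a group shades one material at a time; note this only helps the processing of results, the TraceRay work itself is untouched.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Hypothetical hit record; the field names are made up for illustration.
struct Hit {
    uint32_t materialId;   // key used for grouping
    uint32_t primitiveId;
    float    t;
};

// Group hits so consecutive entries share a material, letting the shading
// pass run coherent per-material code. Tracing has already happened; only
// the processing of its results benefits from this sort.
void sortHitsByMaterial(std::vector<Hit>& hits) {
    std::stable_sort(hits.begin(), hits.end(),
                     [](const Hit& a, const Hit& b) {
                         return a.materialId < b.materialId;
                     });
}
```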

Would this still benefit GPUs that have that hardware?
Maybe if the HW has RT support, but the chip is just too small to make it practically useful. Steam Deck for example.
Or if combining both HW and SW RT gives a net win ofc.

But it's really comparing apples with oranges. Both are RT, but the paper's accuracy is way too limited to see it as an alternative to classical RT.
 
Yes, of course. The observations he made work in world space too, not just in screen space. ;)

I'd like to see a more extensive talk on just the GI, comparing drawbacks vs other probe-based GI implementations. Like, does it still have significant light leaking, or anything like that?
 
It'll have light leaking up to the length of the shortest ray. Or thicker shadowing to the same extent, I guess, depending on how you bias your shorter rays. It'll also miss small details: all his drawn blobs are chunky and not single pixels. Single pixels would be hit-and-miss as to whether they were detected.

Edit: That's not really expressed right. The rays don't have a length, but resolution is restricted to the spacing and sampling frequency of the lowest cascade.
Edit 2: And that's also an issue with conventional RT, where fewer rays will miss small geometry. The solution is stochastic sampling over time, which could be applied here.

As a GI solution using coarser geometry, it looks very smart and applicable, plus compatible with HWRT. Ray count might not be insignificant for the first depth, though. If you have a 64x64x64 volume of probes each sampling 8 rays for the first cascade, that's 2 million rays, which is a 100% sampling rate for 1080p. If you can get away with 32x32x32x8, that's a 1/8th sampling rate. I think each cascade adds 1/8th of the complexity, so you're pretty much defined by that first cascade resolution. Depending on how many rays you can afford, resolution might be quite limited, and you probably can't focus it on areas that need higher sampling, unlike direct RT.
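Back-of-the-envelope numbers for the paragraph above, using its own assumptions (8 rays per probe in the first cascade, each further cascade costing roughly 1/8 of the previous one); under that assumption the geometric series converges to 8/7 of the first cascade, so the first cascade indeed dominates the budget.

```cpp
#include <cstdio>

int main() {
    const long long pixels1080p = 1920LL * 1080;     // 2,073,600 pixels

    // 64^3 probes * 8 rays each for the first cascade (the example above).
    const long long cascade0 = 64LL * 64 * 64 * 8;   // 2,097,152 rays
    std::printf("cascade 0: %lld rays (~%.0f%% of 1080p)\n",
                cascade0, 100.0 * cascade0 / pixels1080p);

    // Assuming each further cascade costs ~1/8 of the previous one,
    // the series sums toward cascade0 * 8/7 in the limit.
    double total = 0.0, c = static_cast<double>(cascade0);
    for (int i = 0; i < 6; ++i) { total += c; c /= 8.0; }
    std::printf("6 cascades: ~%.2f M rays total\n", total / 1e6);
    return 0;
}
```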

They can probably get away with a lot less in their top-down engine.
 
Did not see this talk posted ... little about FSR 3.
July 25, 2023
This presentation provides a deep dive into temporal upscaling, describing how different parts of the algorithm work together to generate the final image and what techniques are used in FSR to mitigate various common artifacts.

The presentation will also cover lessons learned from integrating temporal upscaling into various AAA games and will suggest best practices for integration, from quickly getting a working prototype to full integration into your engine using all bells and whistles.

Finally, this presentation will discuss the evolution of FSR 2 since initial release, as several internal features required a significant redesign due to special requirements of various titles.
 
Did not see this talk posted ... little about FSR 3.
July 25, 2023


The short of it here is that they do the obvious smart thing and wait until the control inputs for the next frame are known before reprojecting. This gives motion vectors and allows the in-between frames to be in the correct direction, essentially cutting latency for camera movement (though not for other controller input).

Kind of baffling that Nvidia didn't do this with DLSS3. Either way hope the results are good.
 
Here is a link to the presentation on the GI solution in Path of Exile 2:

Radiance Cascades: A Novel Approach to Calculating Global Illumination [WIP]

It's an interesting, if not exactly new, observation. Cascaded resolution has been around for a while, and inverting the angular/linear relationship is a solid idea, but problems arise with smooth representations of, say, perfectly specular materials. Take a mirror: you see both a continuous linear and angular representation; shove it through this caching scheme and you'll see discontinuities as the representation "jumps" between cache points. For their type of content, an isometric top-down mid-range camera, it's probably not much of a concern.

But for generalized content, caching as they propose is good for diffuse and very approximate specular, but the more specular you go, the more obvious the error becomes. An inverted caching scheme, say 1 m^3 cells with low-order spherical (whatever) probes moving to high-order probes at 8 m^3, could smooth things out. But you'd still want something like hash grids to store radiance, and then an acceleration structure to trace through, for high spec as well as near-range realtime diffuse.
 