Game development presentations - a useful reference

DGF looks like it's not lossless compression. Is Nanite compression lossless? For UE5, is there a conversion between the Nanite cluster format and Mega Geometry clusters? My thought here is whether Nanite has to do any conversion for Mega Geometry.

Whatever UE5 decides to do could be a big win for adoption because it covers so much of the industry.
 
BVH construction is a big performance problem. CLAS construction forces one to transcode and then still build all of those Nanite clusters. With DGF neither is necessary, you put what you have on disk directly into the hierarchy.
You put that into the hierarchy - which you still need to build and then update as needed. Nothing changes aside from the fact that you load pre-made clusters of triangles into your CLASes. I doubt that this is saving much performance.
 
CLAS construction forces one to transcode and then still build all of those Nanite clusters. With DGF neither is necessary, you put what you have on disk directly into the hierarchy.
As I already said, transcoding from UE5's storage format has never been an issue. You won’t magically achieve a major speedup by accelerating a nonexistent portion of rendering time. Broad compatibility is far more important in that case.

And why exactly would DGF somehow remove the necessity of creating the CLAS? It does not encapsulate a BVH structure, nor would it ever be able to, because each vendor still has its own BVH format. It does not eliminate the CLAS or BVH construction, even for static geometry.
 
DGF looks like it's not lossless compression. Is Nanite compression lossless?
Both are lossy when quantization is in use, but this is adjustable, allowing the losses to be tweaked per content. In fact, both are quite similar in what they do and what can be achieved with them. The main difference is that DGF is optimized for hardware decoding, while Nanite is optimized for fast software decoding. IMO, it is at least questionable which is the faster approach - somewhat similar to the debate between the vertex pipeline with hardware triangle culling vs the software culling in mesh shaders.
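To make the adjustable-loss point concrete, here's a minimal sketch (my own illustration, not the actual DGF or Nanite codec) of the usual quantization trade-off: positions are snapped to a fixed-point grid inside a cluster's bounds, and the bit width is the per-content quality knob.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>

// Illustrative only: quantize one position component to `bits` of fixed point
// inside the cluster's [lo, hi] range (assumes 1 <= bits <= 24).
// More bits = less loss, larger data.
uint32_t quantizeAxis(float v, float lo, float hi, uint32_t bits)
{
    const uint32_t maxCode = (1u << bits) - 1u;
    float t = std::clamp((v - lo) / (hi - lo), 0.0f, 1.0f);  // normalize to [0,1]
    return static_cast<uint32_t>(std::round(t * maxCode));
}

float dequantizeAxis(uint32_t code, float lo, float hi, uint32_t bits)
{
    const uint32_t maxCode = (1u << bits) - 1u;
    return lo + (hi - lo) * (static_cast<float>(code) / static_cast<float>(maxCode));
}

// Worst-case error per axis is half a grid cell: 0.5f * (hi - lo) / maxCode,
// which is the knob you tune per content when deciding how lossy to go.
```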
 
As I already said, transcoding from UE5's storage format has never been an issue. You won’t magically achieve a major speedup by accelerating a nonexistent portion of rendering time. Broad compatibility is far more important in that case.
I agree that compatibility is more important in the near term, but I think it might be overstepping to claim that being able to stream precomputed BVHs in a more native format, as can be done on consoles, is not an advantage. It is absolutely desirable to get there eventually, as there are non-trivial benefits to that on consoles.

That said, we've seen the history of this stuff before and it's all over the place. We do have standard texture compression formats, but the attempt to get standard swizzle formats was basically a dud. The cost/quality tradeoff of on-the-fly texture compression is not very good compared to offline, but reswizzling things during upload is not as big a deal. (Standard swizzle is of course a much larger benefit on UMA SoCs, but typically it is hard to get game developers to optimize a lot for those systems outside of consoles.)

It's difficult for me to accurately predict which case BVH building/streaming will fall into here. With DXR 1.0 it's clearly way too expensive to build things on PC, so from that point of view that would favor offline BVHs. But CLAS/MegaGeometry/Nanite ray tracing shifts that needle fairly significantly, to the point that it may well be fast enough to again be able to ignore the BVH compatibility problem. That said, I recall it being noted in the MegaGeometry presentations that there is still some tracing penalty for that representation vs. static offline BVHs, so it doesn't seem like the debate is over on this yet.

And why exactly would DGF somehow remove the necessity of creating the CLAS? It does not encapsulate a BVH structure, nor would it ever be able to, because each vendor still has its own BVH format. It does not eliminate the CLAS or BVH construction, even for static geometry.
I agree that if all it does is what Nanite compression does it is not interesting. My assumption (admittedly having only looked at the article) is that the main reason to do this is as part of enabling offline BVH building and streaming on PC.

I also don't think the ability to directly rasterize the format is super relevant going forward, but I suppose if you standardize something you might as well make it orthogonal.
 
I agree that compatibility is more important in the near term, but I think it might be overstepping to claim that being able to stream precomputed BVHs in a more native format, as can be done on consoles, is not an advantage.
A. How are we jumping from DGF compression of triangles into "clusters of triangles" to "precomputed BVHs" exactly?
B. We're likely never getting that on PC for about the same reasons we'll never get shader binaries distributed with game code. It is both a disadvantage and an advantage though: you lose some performance, but you gain flexibility, compatibility, and room to innovate.
 
I recall it being noted in the MegaGeometry presentations that there is still some tracing penalty for that representation vs. static offline BVHs, so it doesn't seem like the debate is over on this yet.
The Mega Geo samples you can download show increased trace time vs. a standard BVH but massively speed up the build time. They have a nice perf UI to show it. I'll grab some numbers tomorrow - it is interesting stuff.
 
The Mega Geo samples you can download show increased trace time vs. a standard BVH but massively speed up the build time. They have a nice perf UI to show it. I'll grab some numbers tomorrow - it is interesting stuff.

[Screenshot of the Mega Geometry sample's perf UI]
Resolution: 3070x1462
~1/10th the AS build cost
Render/Trace time (this is path traced) is 32.7% slower
 
Resolution: 3070x1462
~1/10th the AS build cost
Render/Trace time (this is path traced) is 32.7% slower
What GPU was used for this test? Pre-Blackwell GPUs paying a tracing performance penalty because they don't have the Triangle Cluster Intersection Engine is expected, but if this is Blackwell these results are disappointing.
 
What GPU was used for this test? Pre-Blackwell GPUs paying a tracing performance penalty because they don't have the Triangle Cluster Intersection Engine is expected, but if this is Blackwell these results are disappointing.
This should be on an RTX 5080 if I recall, so Blackwell. IMO, compile the samples if you can to try it out; perhaps we can throw some more data into this thread (or better yet, the existing MegaGeo thread).
 
A. How are we jumping from DGF compression of triangles into "clusters of triangles" to "precomputed BVHs" exactly?
DGF best encodes connected triangles which share vertices. That is the "definition" of a cluster. It is also one of the motivations for the research: if one already has clustered geometry, let's take advantage of the inherent redundancy.

When you have a fixed-size block containing up to 128 triangles, it's only a small step to develop a hardware block that simply brute-forces through the block instead of following a hierarchy. Or to develop hardware BVH construction.
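For illustration, here's roughly what "brute-forcing through the block" could look like in principle: no per-triangle hierarchy inside the cluster, just a ray tested against every triangle in a fixed-size block. The struct layout and names are my own assumptions, not any vendor's actual hardware or the real DGF layout.

```cpp
#include <cmath>
#include <cstdint>
#include <limits>

struct Vec3 { float x, y, z; };

static Vec3  sub(Vec3 a, Vec3 b)   { return {a.x - b.x, a.y - b.y, a.z - b.z}; }
static Vec3  cross(Vec3 a, Vec3 b) { return {a.y*b.z - a.z*b.y, a.z*b.x - a.x*b.z, a.x*b.y - a.y*b.x}; }
static float dot(Vec3 a, Vec3 b)   { return a.x*b.x + a.y*b.y + a.z*b.z; }

// Hypothetical fixed-size cluster: up to 128 triangles over a shared vertex
// pool, loosely inspired by the DGF block description (sizes are illustrative).
struct Cluster
{
    Vec3     vertices[256];
    uint8_t  indices[128][3];
    uint32_t triangleCount;   // <= 128
};

struct Hit { float t; uint32_t tri; };

// Brute-force intersection: test every triangle in the block, no per-triangle
// BVH. Moller-Trumbore without backface culling.
Hit intersectCluster(const Cluster& c, Vec3 org, Vec3 dir)
{
    Hit best{ std::numeric_limits<float>::infinity(), ~0u };
    for (uint32_t i = 0; i < c.triangleCount; ++i)
    {
        Vec3 v0 = c.vertices[c.indices[i][0]];
        Vec3 v1 = c.vertices[c.indices[i][1]];
        Vec3 v2 = c.vertices[c.indices[i][2]];
        Vec3 e1 = sub(v1, v0), e2 = sub(v2, v0);
        Vec3 p  = cross(dir, e2);
        float det = dot(e1, p);
        if (std::fabs(det) < 1e-8f) continue;   // ray parallel to triangle plane
        float inv = 1.0f / det;
        Vec3 tv = sub(org, v0);
        float u = dot(tv, p) * inv;
        if (u < 0.0f || u > 1.0f) continue;
        Vec3 q = cross(tv, e1);
        float v = dot(dir, q) * inv;
        if (v < 0.0f || u + v > 1.0f) continue;
        float t = dot(e2, q) * inv;
        if (t > 0.0f && t < best.t) { best.t = t; best.tri = i; }
    }
    return best;
}
```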
 
When you have a fixed-size block containing up to 128 triangles, it's only a small step to develop a hardware block that simply brute-forces through the block instead of following a hierarchy. Or to develop hardware BVH construction.
What do you mean by "brute-forces through the block"? The BVH hierarchy exists for ray tracing acceleration, not for anything related to triangles per se.
The BVH is already constructed by the h/w.
And none of this has anything to do with streaming precomputed BVH.
 
And none of this has anything to do with streaming precomputed BVH.
Isn't it assumed that the leaves of NVIDIA's BVH contain more than one triangle? That, together with their higher branching factor, makes their BVHs much smaller.

Obviously NVIDIA isn't going to adopt DGF at that level, but if AMD could somehow get some devs to use DGF as an engine format, and they also use it at the hardware level for the BVH leaves, they could stream them through untouched with an appropriate API.

Bit optimistic though.
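As a back-of-the-envelope illustration of why more triangles per leaf plus a wider branching factor shrinks the structure (my own arithmetic, not vendor data), compare rough node counts:

```cpp
#include <cstdio>

// Rough estimate for a complete BVH over `tris` triangles:
// leaves = ceil(tris / trisPerLeaf); a full tree with branching factor b has
// roughly leaves / (b - 1) internal nodes on top. Illustrative only - real
// builders, node sizes, and compression differ per vendor.
unsigned long long bvhNodeEstimate(unsigned long long tris,
                                   unsigned trisPerLeaf,
                                   unsigned branching)
{
    unsigned long long leaves   = (tris + trisPerLeaf - 1) / trisPerLeaf;
    unsigned long long internal = (leaves + branching - 2) / (branching - 1);
    return leaves + internal;
}

int main()
{
    const unsigned long long tris = 1000000;
    std::printf("1 tri/leaf,  2-wide: ~%llu nodes\n", bvhNodeEstimate(tris, 1, 2));
    std::printf("4 tris/leaf, 8-wide: ~%llu nodes\n", bvhNodeEstimate(tris, 4, 8));
    return 0;
}
```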
 
That last stage is roughly equivalent to the occlusion culling and cluster selection which kicks off rasterization in Nanite, but instead it updates a BVH for ray tracing.

It's as straight a conversion of Nanite to ray tracing as possible, but in doing so it sits in the middle of rasterization and ray tracing, inheriting things like occlusion culling, which ray tracing really should rid us of. "Real-Time Ray Tracing of Micro-Poly Geometry with Hierarchical Level of Detail" is a much more pure ray tracing solution to the same problem.
Yes, because you need some general pass that answers the question "what LODs do I want resident for my scene right now". The Intel paper has the same thing, and with very similar heuristics. I'll quote section 4.1:

This also sounds extremely similar and sensible as a base implementation, right? I.e., for onscreen geometry you project it and compute rough triangle sizes in camera space, and for offscreen geometry you use some sort of distance-scaled representation. Regarding occlusion culling, it's entirely up to the implementation whether you use that as part of your heuristic or not. I would imagine for practical reasons you do want to scale the quality of occluded geometry - just like offscreen geometry - but how much is entirely up to the heuristic. Obviously, if you want to just stream all your geometry at high LODs based purely on solid angles or something similar you are welcome to, but you will almost certainly run into VRAM problems with that approach.

You could of course imagine in the future driving or augmenting some of this selection with feedback from secondary rays/differentials themselves; that would be the only real way to capture stuff like refraction and the like in sufficient detail, but it also brings its own can of worms that probably doesn't make sense for the near term.
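To sketch the kind of heuristic described above (my own toy version, not UE5's or Intel's actual code): project a cluster's world-space error to pixels for onscreen geometry, and relax the budget for offscreen/occluded geometry so it stays resident for secondary rays at coarser detail.

```cpp
#include <algorithm>
#include <cmath>

// Illustrative LOD check for one cluster. "errorRadius" is the world-space
// simplification error of the cluster's current LOD; names and constants are
// assumptions for the sketch.
struct ClusterLODInput
{
    float distanceToCamera;   // world units
    float errorRadius;        // world-space error of this LOD
    bool  inFrustum;
    bool  occluded;           // from whatever occlusion test the engine runs
};

// Returns true if this LOD is detailed enough; otherwise the caller should
// refine to the next finer LOD (mirrors the "project error to screen" idea).
bool lodIsSufficient(const ClusterLODInput& c,
                     float projScale,          // ~ screenHeight / (2 * tan(fovY/2))
                     float pixelErrorBudget)   // e.g. 1 pixel for visible geometry
{
    // Project the world-space error to an approximate screen-space size in pixels.
    float screenError = projScale * c.errorRadius / std::max(c.distanceToCamera, 1e-3f);

    // Relax the budget for geometry that cannot currently be seen directly,
    // so it still exists for secondary rays but at coarser detail.
    float budget = pixelErrorBudget;
    if (!c.inFrustum) budget *= 4.0f;   // offscreen: distance-scaled, coarser
    if (c.occluded)   budget *= 2.0f;   // occluded: how much is up to the heuristic

    return screenError <= budget;
}
```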
Oops, I completely misread how Intel was descending the hierarchy :/ Well, replace that with the AMD method of approximation using the BVH hierarchy then, which raytracing fan mentioned :)

Occlusion culling is an inherent part of rasterization IMO, and I'd rather not see all its foibles inherited by a ray tracing engine. With primary-ray ray tracing, I was hoping it would disappear.
This discussion on occlusion culling and LOD heuristics for ray tracing in the other thread made me think of this presentation from HPG 2024:


One major challenge for all real-time ray tracing techniques is management of acceleration structures in graphics memory. They are needed for fast intersection testing and have to represent the entire scene. This can quickly become a limiting factor, especially for large and highly dynamic scenes. Real-time renderers often employ culling strategies to handle such scenes. However, known strategies such as frustum or occlusion culling cannot directly be applied when using ray tracing. We present LiPaC, a new application-level method for efficient culling of ray traced scenes. The method divides the scene spatially and counts ray intersections in each cell to determine its importance for the lighting of the scene. Each object in the scene has a persistent state deciding over its ray tracing representation and culling. This state is changed by heuristics based on the previously encountered number of light paths.
Our culling algorithm aims to remove all instances that have little impact on the lighting from the TLAS. It tracks how many light paths are encountered by each model instance during path tracing and then uses these counters for decisions during acceleration structure management. Every instance in the scene has an associated state based on the number of intersections encountered in previous frames that decides if and how it is represented in the TLAS. Areas of the scene that are rarely encountered by light paths are not added to the TLAS anymore and instead replaced with bounding boxes that only count light path intersections. Therefore, we call the method light path guided culling (LiPaC).
LiPaC realizes a culling strategy that has a much stronger foundation than traditional culling approaches in the context of ray tracing. Instead of relying on plausible heuristics based on distance to camera, bounding box size, solid angle or covered pixels on screen, our heuristic is built upon the number of encountered light paths as an estimator for the impact on the final lighting of the image. More sophisticated hit feedback information could improve the quality of this estimation further. An optimal culling algorithm would find the best trade-off between performance/memory overhead and impact to lighting of the scene. LiPaC typically achieves good results that are close to this optimal tradeoff.

The general idea of using the number of light paths/rays that encounter an instance in the previous few frames to determine its LOD or if it should be culled entirely for subsequent frames is rather elegant. And with Mega Geometry, this strategy could be applied to cull entire partitions of a Partitioned TLAS or to determine culling/LOD selection at the cluster level.
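A rough sketch of how I read the counter-and-state idea from the abstract (names, thresholds, and states are my assumptions, not the authors' code): each instance accumulates ray encounters, and a small state machine decides whether it stays in the TLAS at full detail, gets demoted, or is replaced by a counting proxy.

```cpp
#include <cstdint>
#include <vector>

// Per-instance state, loosely following the LiPaC description: instances that
// few light paths encounter get demoted and eventually replaced by a bounding
// box that only counts intersections. Thresholds/names are my assumptions.
enum class InstanceState { FullDetail, ReducedDetail, CountingProxy };

struct InstanceCulling
{
    InstanceState state   = InstanceState::FullDetail;
    uint32_t      rayHits = 0;   // light paths that encountered it this frame
};

void updateCullingStates(std::vector<InstanceCulling>& instances,
                         uint32_t promoteThreshold = 64,
                         uint32_t demoteThreshold  = 4)
{
    for (auto& inst : instances)
    {
        if (inst.rayHits >= promoteThreshold)
            inst.state = InstanceState::FullDetail;      // important for lighting
        else if (inst.rayHits < demoteThreshold)
            // Rarely encountered: step down toward a counting proxy so it can
            // come back later if rays start hitting its bounding box again.
            inst.state = (inst.state == InstanceState::FullDetail)
                           ? InstanceState::ReducedDetail
                           : InstanceState::CountingProxy;
        inst.rayHits = 0;   // counters are re-accumulated by the next frame's paths
    }
    // The TLAS is then rebuilt/refit with full BLASes for FullDetail,
    // simplified ones for ReducedDetail, and AABB proxies for CountingProxy.
}
```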
 
GPU Friendly Laplacian Texture Blending

Abstract

Texture and material blending is one of the leading methods for adding variety to rendered virtual worlds, creating composite materials, and generating procedural content. When done naively, it can introduce either visible seams or contrast loss, leading to an unnatural look not representative of blended textures. Earlier work proposed addressing this problem through careful manual parameter tuning, lengthy per-texture statistics precomputation, look-up tables, or training deep neural networks. In this work, we propose an alternative approach based on insights from image processing and Laplacian pyramid blending. Our approach does not require any precomputation or increased memory usage (other than the presence of a regular, non-Laplacian, texture mipmap chain), does not produce ghosting, preserves sharp local features, and can run in real time on the GPU at the cost of a few additional lower mipmap texture taps.
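For context on what this builds on, here's a tiny sketch of classic Laplacian-pyramid blending expressed with mip-level taps; it is not the paper's exact formulation, just the general mechanism the abstract refers to (the per-level samples and mask pyramid are my own framing).

```cpp
#include <cassert>
#include <vector>

// texA[i]/texB[i] are samples of the two textures at mip level i at the same
// UV (level 0 = finest), and mask[i] is the blend weight sampled at the same
// level (i.e. a low-pass pyramid of the blend mask).
float laplacianBlend(const std::vector<float>& texA,
                     const std::vector<float>& texB,
                     const std::vector<float>& mask)
{
    assert(texA.size() == texB.size() && texA.size() == mask.size() && !texA.empty());
    const size_t coarsest = texA.size() - 1;

    // Start from the blended coarsest (low-frequency) level...
    float result = mask[coarsest] * texA[coarsest]
                 + (1.0f - mask[coarsest]) * texB[coarsest];

    // ...then add back each blended Laplacian band (detail = level i - level i+1),
    // from coarse to fine, which preserves local contrast across the blend seam.
    for (size_t i = coarsest; i-- > 0; )
    {
        float bandA = texA[i] - texA[i + 1];
        float bandB = texB[i] - texB[i + 1];
        result += mask[i] * bandA + (1.0f - mask[i]) * bandB;
    }
    return result;
}
```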

 