Game development presentations - a useful reference

NVIDIA released a paper covering Floating Point and IEEE 754 Compliance for NVIDIA GPUs.

The key points covered are the following:

Use the fused multiply-add operator.
The fused multiply-add operator on the GPU has high performance and increases the accuracy of computations. No special flags or function calls are needed to gain this benefit in CUDA programs. Understand that a hardware fused multiply-add operation is not yet available on the CPU, which can cause differences in numerical results.
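As a minimal illustration (kernel and buffer names here are made up, not from the paper): nvcc normally contracts a multiply followed by an add into a single FMA, and fmaf()/__fmaf_rn() request the fused, single-rounded operation explicitly.

```cpp
// Illustrative CUDA kernel. nvcc normally contracts a * x[i] + y[i] into one
// FMA instruction; fmaf() asks for the fused, single-rounded operation
// explicitly, and __fmaf_rn() does the same with round-to-nearest.
__global__ void axpy_fma(const float* x, const float* y, float a, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = fmaf(a, x[i], y[i]);  // one rounding step instead of two
}
```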

Compare results carefully.
Even in the strict world of IEEE 754 operations, minor details such as organization of parentheses or thread counts can affect the final result. Take this into account when doing comparisons between implementations.
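A tiny host-side example of why grouping matters (values chosen purely to make the effect visible): float addition is not associative, so a GPU reduction that parenthesizes a sum differently from a serial CPU loop can legitimately produce a different result.

```cpp
// Host-side illustration: float addition is not associative, so grouping matters.
#include <cstdio>

int main()
{
    float a = 1.0e8f, b = -1.0e8f, c = 1.0f;
    float left  = (a + b) + c;  // 1.0f: the large terms cancel first
    float right = a + (b + c);  // 0.0f: c is absorbed into b before the cancellation
    printf("left = %g, right = %g\n", left, right);
    return 0;
}
```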

Know the capabilities of your GPU.
The numerical capabilities are encoded in the compute capability number of your GPU. Devices of compute capability 2.0 and later are capable of single and double precision arithmetic following the IEEE 754 standard, and have hardware units for performing fused multiply-add in both single and double precision.
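For reference, a short host-side sketch of checking this at runtime through the CUDA runtime API:

```cpp
// Query the compute capability of each visible device via the CUDA runtime API.
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    int count = 0;
    cudaGetDeviceCount(&count);
    for (int dev = 0; dev < count; ++dev) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, dev);
        printf("Device %d: %s, compute capability %d.%d\n",
               dev, prop.name, prop.major, prop.minor);
        // Compute capability 2.0+ => IEEE 754 single/double arithmetic and
        // hardware FMA in both precisions, per the paper's summary.
    }
    return 0;
}
```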

Take advantage of the CUDA math library functions.
These functions are documented in the CUDA C++ Programming Guide [7]. The math library includes all the math functions listed in the C99 standard [3] plus some additional useful functions. These functions have been tuned for a reasonable compromise between performance and accuracy. We constantly strive to improve the quality of our math library functionality. Please let us know about any functions that you require that we do not provide, or if the accuracy or performance of any of our functions does not meet your needs. Leave comments in the NVIDIA CUDA forum or join the Registered Developer Program and file a bug with your feedback.
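As a quick illustration (the kernel and buffer names are invented), a few of the documented CUDA Math API functions in device code:

```cpp
// Device-side sketch using a few CUDA Math API functions beyond plain C99
// (norm3df, sinpif, rsqrtf are documented math library functions).
__global__ void math_demo(const float3* p, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float len   = norm3df(p[i].x, p[i].y, p[i].z);  // length of (x, y, z), computed robustly
        float phase = sinpif(0.25f * i);                // sin(pi * x) without multiplying by pi
        out[i] = phase * rsqrtf(len + 1.0f);            // fast reciprocal square root
    }
}
```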

 
I am going to do a practical summary, a "sort of interesting tidbits" rundown, of all those NVIDIA presentations.

Advances in RTX - Full Session Replay
-Path Tracing is going to be available in a lot more games. NVIDIA is working with many developers who want their games to look their best, and who are therefore choosing to implement path tracing.

-NVIDIA is working with Bethesda on RTX Neural Faces (with early demos on Starfield).

Path Tracing Nanite in Nvidia Zorah
RTX Mega Geometry doesn't support World Position Offset (WPO), skinned meshes, or tessellated Nanite just yet. NVIDIA is working to add all of these features, which could be why Mega Geometry is not yet part of DXR.

Scale Up Ray Tracing in Games with RTX Mega Geometry
-Alan Wake 2 uses mesh shaders; its geometry consists mainly of meshlets, but the developer converted it to clustered geometry to get higher performance with Mega Geometry.

-Overall, meshlets took 6 ms to build the BVH and ray trace the scene, while clusters took 5 ms.
 
RTX Mega Geometry doesn't support World Position Offset (WPO), skinned meshes, or tessellated Nanite just yet. NVIDIA is working to add all of these features, which could be why Mega Geometry is not yet part of DXR.
My understanding of what was said is that the NvRTX branch of UE5 hasn't added support for these features yet, not that Mega Geometry in its current state is incapable of supporting these features.
 
More GDC 2025 presentations ...

Crossing the Uncanny Valley With RTX Neural Face Rendering (featuring a demo on Starfield).


-Neural Faces took 7 ms for inference and rendering in Starfield at native 1440p on a 4090, using 1.2 GB of VRAM with an FP16 autoencoder. FP8/FP4 should be much faster and require far less VRAM. All of this is early work.

-The tech works purely on screen-space data for now, with no awareness of animation or 3D space. NVIDIA is working to integrate more data into the model.

Creating Next-Gen Agents in Krafton’s inZOI

-inZOI needed a small language model running on device because the cloud would be too slow (the game can run at 5x speed and the AI has to be very responsive) and too costly (servers running 24/7).

-llama.cpp/GGML is used to run the model on Windows; DirectML was considered, but other options worked much better.

-NVIDIA uses CUDA in Graphics (CiG) to quickly switch contexts without latency or execution bubbles.

-RTX exclusive.


Achieving AI Teammates in 'NARAKA: BLADEPOINT' PC Version

-On-device inference for companion AI, voice-controlled by the player. The companions are also able to chat freely with the player.

-CUDA in Graphics (CiG) is used to maintain good performance.

-There is a cloud version available, but with higher latency and fewer capabilities.

 
 
Per-frame textures based on performance capture are... interesting. At least that is what it reads like. But it also sounds like a ridiculous amount of memory that would have trouble scaling to real time... and scaling to arbitrary frame rates?
 
Not a presentation but I read the book Behind the Scenes at PlayStation by Masayuki Chatani. Totally pointless book that tells us nothing new. There is ZERO actual behind the scenes info.
 
Every now and then we get asked what a beginner-friendly website is for learning graphics programming. We’d love to recommend GPUOpen of course, but the truth is, the main target audience for GPUOpen is intermediate or advanced graphics programmers. For someone who just started to dive into the world of graphics, there are surely other websites more suitable for them.

As with so many things, there is no one right way to get into graphics. It mostly depends on pre-existing knowledge, how you like to learn, personal preference, available hardware, etc. Hence, this guide is more a collection of websites that we think are useful for beginners, along with a small discussion weighing the pros and cons of the websites and what they teach.

 
AMD - Using Neural Networks for Geometric Representation

Introduction

Monte Carlo ray tracing is a cornerstone of physically based rendering, simulating the complex transport of light in 3D environments to achieve photorealistic imagery. Central to this process is ray casting, which computes intersections between rays and scene geometry. Due to the computational cost of these intersection tests, spatial acceleration structures such as bounding volume hierarchies (BVHs) are widely employed to reduce the number of candidate primitives a ray must test against.
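To make that cost concrete, here is a minimal ray/AABB slab test of the kind a BVH traversal repeats at every visited node (an illustrative sketch, not AMD's implementation):

```cpp
// Minimal ray vs. axis-aligned bounding box slab test, the per-node work a BVH
// traversal repeats many times per ray. Purely illustrative, not AMD's code.
__device__ bool intersect_aabb(float3 o, float3 inv_d, float t_max,
                               float3 box_min, float3 box_max)
{
    float t0x = (box_min.x - o.x) * inv_d.x, t1x = (box_max.x - o.x) * inv_d.x;
    float t0y = (box_min.y - o.y) * inv_d.y, t1y = (box_max.y - o.y) * inv_d.y;
    float t0z = (box_min.z - o.z) * inv_d.z, t1z = (box_max.z - o.z) * inv_d.z;

    float t_near = fmaxf(fmaxf(fminf(t0x, t1x), fminf(t0y, t1y)), fminf(t0z, t1z));
    float t_far  = fminf(fminf(fmaxf(t0x, t1x), fmaxf(t0y, t1y)), fmaxf(t0z, t1z));
    return t_near <= t_far && t_far >= 0.0f && t_near <= t_max;
}
```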


Despite decades of research and optimization, BVH-based ray tracing still poses challenges on modern hardware, particularly on Single-Instruction Multiple-Thread (SIMT) architectures like GPUs. BVH traversal is inherently irregular: it involves divergent control flow and unpredictable memory access patterns. These characteristics make it difficult to fully utilize the parallel processing power of GPUs, which excel at executing uniform, data-parallel workloads. As a result, even with the addition of specialized ray tracing hardware, such as RT cores, the cost of BVH traversal remains a bottleneck in high-fidelity rendering workloads.
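A stripped-down, stack-based traversal skeleton shows where the irregularity comes from: each thread follows its own data-dependent path through the tree, so branches and memory accesses diverge across a warp (sketch only, reusing the intersect_aabb helper from above; the Node layout is invented for illustration):

```cpp
// Sketch of stack-based BVH traversal. The loop trip count, branch outcomes
// and node addresses all depend on the individual ray, which is what makes
// this workload irregular on SIMT hardware.
struct Node {
    float3 box_min, box_max;   // node bounds
    int    left, right;        // child indices; -1 marks "no child" (leaf)
};

__device__ int count_visited_nodes(const Node* nodes, float3 o, float3 inv_d, float t_max)
{
    int stack[64];
    int sp = 0, visited = 0;
    stack[sp++] = 0;                                         // push root
    while (sp > 0) {                                         // iteration count differs per ray
        const Node n = nodes[stack[--sp]];                   // scattered, data-dependent load
        if (!intersect_aabb(o, inv_d, t_max, n.box_min, n.box_max))
            continue;                                        // divergent branch
        ++visited;
        if (n.left >= 0)  stack[sp++] = n.left;
        if (n.right >= 0) stack[sp++] = n.right;
    }
    return visited;
}
```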


In contrast, neural networks, especially fully connected networks, offer a regular and predictable computational pattern, typically dominated by dense matrix multiplications. These operations map well to GPU hardware, making neural network inference highly efficient on SIMT platforms. This contrast between the irregularity of BVH traversal and the regularity of neural network computation raises an intriguing question: Can we replace the BVH traversal in ray casting with a neural network to better exploit the GPU’s architecture?
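For contrast, the core of fully connected inference is just a dense matrix-vector product with fixed loop bounds, so every thread in a warp does the same amount of work on predictably laid-out data (a naive sketch; real inference would go through cuBLAS or tensor cores):

```cpp
// Naive fully connected layer: out = relu(W * in + b). Every thread runs the
// same loop with the same trip count over contiguous memory, which is why this
// maps so well to SIMT hardware (production code would use cuBLAS/tensor cores).
__global__ void fc_relu(const float* W, const float* in, const float* b,
                        float* out, int in_dim, int out_dim)
{
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < out_dim) {
        float acc = b[row];
        for (int k = 0; k < in_dim; ++k)
            acc = fmaf(W[row * in_dim + k], in[k], acc);   // uniform, predictable accesses
        out[row] = fmaxf(acc, 0.0f);                       // ReLU
    }
}
```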


This idea is beginning to gain traction as researchers explore alternative spatial acceleration strategies that leverage learned models. In this post, we dive into the motivation behind this approach, examine the challenges and opportunities it presents, and explore how our invention, Neural Intersection Function, might reshape the future of real-time and offline ray tracing.


 
Using a neural network only for intersection calculations would be a criminally poor use of resources. Let the network prefilter too.

Thinking there is value in sticking close to Monte-Carlo/PBR purity is an expensive fantasy. Everything in real time is a hack and purity will often have negative gains.
 
Intel is working on its own neural denoiser and its own version of Mega Geometry's Partitioned TLAS.

Performance and image quality are proportional to the number of rays at each stage of the path tracing. To save on compute and memory traffic we use 1spp and 1 ray on every bounce. Due to the stochastic nature of path tracing, the rendered image has significant noise. Each pixel is determined by a single random light path, causing extreme fluctuations in brightness and color, especially in complex lighting scenarios such as indirect illumination, caustics, soft shadows, etc. To remove noise and reconstruct details, we use our spatiotemporal joint neural denoising and supersampling model.
The large-scale open-world scene featured in Jungle Ruins poses further challenges for path tracing due to its geometric complexity. Millions of dynamic mesh instances need to be animated, which requires updating the acceleration structures prior to ray tracing. The two-level acceleration structures defined by modern ray tracing APIs do not scale well with this complexity. While the animation of the foliage can be efficiently amortized at the per-mesh level (BLAS), the high number of instances makes a full update of the top-level acceleration structure (TLAS) prohibitively costly. To this end, we demonstrate a solution that partitions the TLAS into subsets (AS fragments) that can be updated independently at a fraction of the cost of a global TLAS update.
 