Guided denoising standardization (aka ray reconstruction)

AMD and Intel (and Apple and mobile hardware developers, to the extent they want to cooperate) should really get together and create a standard for guided denoising and upscaling/upconversion.

FSR and XeSS could still be handled separately because they just slotted in for TAA, but guided denoising will require more invasive work in the engine. A common standard for the G-buffer to be passed to the guided spatiotemporal postprocessing will help (maybe also passing in a shader, so it doesn't have to be passed back to the game).
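For illustration, a minimal sketch of what such a vendor-neutral input contract could look like, assuming per-frame guide buffers similar to what ray reconstruction consumes today (every name and field here is hypothetical, not part of DirectSR or any shipping API):

```cpp
#include <cstdint>

// Hypothetical, vendor-neutral description of the per-frame inputs a guided
// spatiotemporal denoiser/upscaler would consume. None of these names exist
// in DirectSR or any shipping API; they only illustrate the kind of contract
// a standard could pin down.
enum class GuideBuffer : uint32_t {
    Color,           // noisy low-resolution radiance
    Depth,
    MotionVectors,   // what TAA-style upscalers already require
    Normals,         // the extra G-buffer data guided denoising wants
    Albedo,
    Roughness,
};

struct GuideBufferDesc {
    GuideBuffer semantic;
    void*       resource;   // API-specific texture handle
    uint32_t    width, height;
    uint32_t    format;     // API-specific pixel format
};

struct GuidedDenoiseDispatch {
    const GuideBufferDesc* inputs;
    uint32_t               inputCount;
    void*                  outputColor;   // full-resolution denoised result
    uint64_t               frameIndex;    // for temporal accumulation/history
};
```

The semantics would be fixed by the standard; the handles and formats stay API-specific, and a vendor back end could consume them on the GPU, an NPU, or both.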

AMD and Intel want to sell NPUs, this can sell NPUs.

PS: this also goes back to DirectSR being way too unambitious.
 
Apple doesn't care about AAA gaming, and mobile graphics hardware vendors don't ever see RT itself moving past the tech demo stage; their biggest engine (Unity) doesn't support HDRP (RT is feature-gated behind it) on Android. That leaves AMD, for which we have a good idea about how they feel towards RT in general, and Intel, whose future is entirely uncertain. Some NPU implementations (particularly AMD & Intel) are architected for streaming computation tasks like the Cell processor, so their performance is highly sensitive to external memory roundtrips (DRAM/PCIe), which isn't conducive to running complex multi-pass neural networks such as DLSS/XeSS ...
 
this also goes back to DirectSR being way too unambitious.
It's unambitious by design. Denoising would require fast texture sampling and vector hardware on the NPU, which it lacks. Additionally, passing buffers mid-frame back and forth between the NPU and GPU would be a disaster for latency, performance optimization, and more. And there's really no need for this, as Lunar Lake's iGPU, for example, features faster XMX blocks compared to its NPU.
 
Texture sampling involves perspective transformation and texture filtering ... this just needs to dereference pointers with xy coordinates; calling that texture sampling is a bit exaggerated. It's just 2D postprocessing.

Interpolative framerate conversion is a latency disaster; this is over an order of magnitude less relevant.

There is a lot of general-purpose vector processing in those NPUs because of all the funky architectures they have to deal with, starting with activation functions.

Regardless of how you implement it, it digs very deep into the game engine. An open API has a better chance of broad support. Like DirectSR, but more ambitious.
 
Some NPU implementations (particularly AMD & Intel) are architected for streaming computation tasks like the Cell processor, so their performance is highly sensitive to external memory roundtrips (DRAM/PCIe), which isn't conducive to running complex multi-pass neural networks such as DLSS/XeSS
If you ignore the ME/MC stuff, how is it complex? It's almost certainly just a local filter on a sliding window.
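
As a rough illustration of what "a local filter on a sliding window" means here, a toy CPU version of a depth-guided filter (the weighting is made up; real denoisers tune or learn their kernels, but the access pattern is the point):

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

struct Pixel { float r, g, b; };

// Each output pixel reads only a fixed-size neighborhood of the noisy input
// and a depth guide buffer; there is no global data dependency.
void guidedFilter(const std::vector<Pixel>& noisy,
                  const std::vector<float>& depth,
                  std::vector<Pixel>& out,
                  int w, int h, int radius)
{
    for (int y = 0; y < h; ++y) {
        for (int x = 0; x < w; ++x) {
            float wSum = 0.0f;
            Pixel acc{0.0f, 0.0f, 0.0f};
            for (int dy = -radius; dy <= radius; ++dy) {
                for (int dx = -radius; dx <= radius; ++dx) {
                    const int nx = std::clamp(x + dx, 0, w - 1);
                    const int ny = std::clamp(y + dy, 0, h - 1);
                    // Guide term: down-weight neighbors across depth
                    // discontinuities so edges are preserved (illustrative only).
                    const float dz = depth[ny * w + nx] - depth[y * w + x];
                    const float weight = std::exp(-dz * dz * 100.0f);
                    const Pixel& p = noisy[ny * w + nx];
                    acc.r += weight * p.r;
                    acc.g += weight * p.g;
                    acc.b += weight * p.b;
                    wSum += weight;
                }
            }
            out[y * w + x] = { acc.r / wSum, acc.g / wSum, acc.b / wSum };
        }
    }
}
```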

Qualcomm and MediaTek spent a lot of money putting RT hardware in their SoCs; Unity dragging their feet isn't necessarily a reason not to want to cooperate.
 
If you ignore the ME/MC stuff, how is it complex? It's almost certainly just a local filter on a sliding window.
Even if the computations can be efficiently localized in memory for specific NPU HW implementations, there's still potentially a lot of memory access overhead in passing around the many parameters, or in other hidden passes going on behind the scenes ...

A robust AI HW system isn't a single-focus race where you can simply throw more matrix multiply-accumulate logic at the problem and call it a day. Good AI HW is also highly correlated with strong memory architectures, like we see on GPUs (directly addressable device memory w/ transparent cache hierarchies), in comparison to the more ham-fisted memory models of some NPUs (explicit DMA interface w/ local SRAM) that bear an eerie resemblance to the Cell co-processor's SPUs ...
Qualcomm and MediaTek spent a lot of money putting RT hardware in their SoCs; Unity dragging their feet isn't necessarily a reason not to want to cooperate.
Both ARM and QCOM only expose inline RT API support at the driver level, which might suggest that their HW implementations can't really apply many specific optimizations to the RT PSO model, similar to AMD. And do you really want to trust their shader compilers with a feature as complex as RT?

Unity's lack of RT integration alone may not stop cooperation between the mobile hardware vendors, but at the same time there really isn't any reason to invest more in the idea when the lion's share of the mobile gaming market uses Unity even for 3D graphics. It's a bit redundant for them to spend time optimizing a non-existent use case like RT itself ...
 
Texture sampling involves perspective transformation and texture filtering ... this just needs to dereference pointers with xy coordinates; calling that texture sampling is a bit exaggerated. It's just 2D postprocessing.
Filterless point sampling of render targets is also considered texture sampling, and it should be the dominant texture operation in pixel and compute shaders. It should still be significantly faster than performing generic memory loads and calculating texture addressing in SW. Whether it's beneficial to use the NPU comes down to performance, and leveraging the closely coupled XMX units integrated within the SM should be both faster and easier when it comes to extracting performance.
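
For context, this is roughly what "calculating texture addressing in SW" amounts to for a simple linear-layout target; real GPU textures are additionally tiled/swizzled, which is the part the sampling hardware hides (sketch only, no real API involved):

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>

// Minimal linear-layout render target description (no tiling/swizzling).
struct Texture2D {
    const uint8_t* base;
    uint32_t width, height;
    uint32_t rowPitchBytes;   // bytes per row, including padding
    uint32_t bytesPerTexel;   // e.g. 8 for RGBA16F
};

// Clamp-to-edge point sampling done "by hand": a couple of clamps, a pitch
// multiply and an add per fetch. Dedicated samplers do this (plus the
// tiled-layout address swizzle) in fixed-function hardware.
inline const uint8_t* pointSampleSW(const Texture2D& t, int x, int y)
{
    x = std::clamp(x, 0, static_cast<int>(t.width)  - 1);
    y = std::clamp(y, 0, static_cast<int>(t.height) - 1);
    return t.base + static_cast<std::size_t>(y) * t.rowPitchBytes
                  + static_cast<std::size_t>(x) * t.bytesPerTexel;
}
```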

Interpolative framerate conversion is a latency disaster; this is over an order of magnitude less relevant.
I don't agree. Work distribution between devices often involves pipelining, such as the frame rendering queue between the CPU and GPU, which typically causes much worse delays than the minimal hold needed to complete a fast interpolation frame (plus some sleeps for even pacing). This is why Reflex more than compensates for the latency introduced by frame interpolation. Pipelining resources or frames between the GPU and NPU would be an entirely different kind of nightmare.
 
Reflex has to work partly with user mode code and can magically shave off more than 33 msec (it generally can't, but let's ignore that), and you fear the latency of an interrupt?

Anyway, the value of the NPU isn't really the main topic of the thread ... it's the need for a standardized API. The data passed to DLSS/FSR/XeSS came from the relatively standardized use of TAA. That won't work this time. NVIDIA has the market power to force it through; everyone else will need to be a bit more diplomatic.
 
Reflex has to work partly with user mode code and can magically shave off more than 33 msec (it generally can't, but let's ignore that), and you fear the latency of an interrupt?
Working with the NPU in the case of Auto SR is not just a simple interrupt. As Digital Foundry noted in their review, the Auto SR spatial upscaling requires additional buffering of one frame, which doubles the latency compared to frame interpolation, for example, which only adds half a frame of additional latency. More complex workloads will require even more buffering and processing time.
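
To put rough numbers on that, assuming a 60 fps base frame rate (my example figure, not from the DF review):

```cpp
#include <cstdio>

int main()
{
    // 60 fps base rate is an arbitrary example; the "one full frame" and
    // "half a frame" figures are taken from the post above.
    const double frameMs        = 1000.0 / 60.0;   // ~16.7 ms per rendered frame
    const double autoSrBufferMs = 1.0 * frameMs;   // one extra buffered frame
    const double interpHoldMs   = 0.5 * frameMs;   // hold until the midpoint frame
    std::printf("Auto SR buffering:   +%.1f ms\n", autoSrBufferMs);
    std::printf("Frame interpolation: +%.1f ms\n", interpHoldMs);
    return 0;
}
```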
 
I am not talking about the GPU drivers. What I meant is that to leverage the NPU effectively, you'd need pipelining. Otherwise, there's no point in using the NPU at all. Why is pipelining necessary? It allows you to perform post-processing for "free", so to speak. For example, you need at least three buffered and pipelined frames for this to work: one frame processed on the CPU, another on the GPU, and a third on the NPU, all in parallel. This results in a minimum of three frames of latency. As long as the NPU processing time is shorter than the GPU processing time, you would effectively get post-processing for free from a frame-rendering perspective, though not in terms of latency. Even more frames would need to be buffered for optimal performance and additional features (double-buffered v-sync, etc.), but this would skyrocket the latency, making gameplay a nightmare.
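
A schematic of that three-deep pipeline, with illustrative stage assignments only:

```cpp
#include <cstdio>

int main()
{
    // While frame N is post-processed on the NPU, frame N+1 renders on the
    // GPU and frame N+2 is prepared on the CPU. Throughput is set by the
    // slowest stage, but input-to-display latency is the sum of all three.
    const int framesInFlight = 3;
    for (int tick = 0; tick < 6; ++tick) {
        const int cpuFrame = tick;       // game simulation + submission
        const int gpuFrame = tick - 1;   // rasterization + ray tracing
        const int npuFrame = tick - 2;   // guided denoise / upscale
        if (npuFrame >= 0)
            std::printf("tick %d: CPU f%d | GPU f%d | NPU f%d\n",
                        tick, cpuFrame, gpuFrame, npuFrame);
    }
    std::printf("minimum latency: %d frame times\n", framesInFlight);
    return 0;
}
```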
 
The GPU drivers are important, because if you control the hardware on all sides, the difference between intra-GPU pipelining and cross-PCIe pipelining is bandwidth and interrupt masking. The work is pipelined regardless.

It would make sense to have the API allow taking the G-buffer in strips. That allows better pipelining and utilization even intra-GPU. With raytracing it has gotten a lot easier for gamedevs to do a significant amount of work striped too.
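
Something like this invented interface illustrates the idea; the names are made up and only the strip-submission pattern matters:

```cpp
#include <cstdint>

// Region of G-buffer rows that has finished rendering.
struct StripRegion { uint32_t yStart, height; };

// Invented interface, not an existing API: the consumer (denoiser/upscaler,
// on-GPU or on another device) accepts the G-buffer strip by strip.
struct IGuidedDenoiser {
    virtual void submitStrip(const void* gbufferRows, const StripRegion& r) = 0;
    virtual void finalizeFrame(uint64_t frameIndex) = 0;
    virtual ~IGuidedDenoiser() = default;
};

void renderFrameInStrips(IGuidedDenoiser& dn, const void* gbuffer,
                         uint32_t height, uint32_t stripCount, uint64_t frame)
{
    const uint32_t stripH = height / stripCount;
    for (uint32_t i = 0; i < stripCount; ++i) {
        // ... the engine renders G-buffer rows [i*stripH, (i+1)*stripH) here ...
        dn.submitStrip(gbuffer, StripRegion{ i * stripH, stripH });
    }
    dn.finalizeFrame(frame);   // temporal/history state is updated once per frame
}
```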
 
It would make sense to have the API allow taking the G-buffer in strips. That allows better pipelining and utilization even intra-GPU.
For better intra-GPU utilization or "free" denoising/post-processing, all you need is a custom present from the async compute queue. The rest would involve overlapping post-processing with the G-buffer fill of the next frame via async compute, saving performance with minimal to no additional latency (especially compared to the significant latency added by potential NPU processing). This is how it's done in id Tech and some other games.
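
A rough sketch of that overlap, with invented queue/fence placeholders rather than a real API (and not id Tech's actual code):

```cpp
#include <cstdint>

// Placeholder queue/fence types standing in for a real graphics API.
struct Fence {};
struct Queue {
    void submitGBufferFill(uint64_t /*frame*/) {}
    void submitPostProcessAndPresent(uint64_t /*frame*/) {}
    void signal(Fence&, uint64_t /*value*/) {}
    void wait(Fence&, uint64_t /*value*/) {}
};

// Frame N's denoise/post-process and custom present run on the async compute
// queue, while the graphics queue immediately moves on to frame N+1's
// G-buffer fill, so the two overlap instead of serializing.
void renderLoop(Queue& gfx, Queue& asyncCompute, Fence& gbufferDone, uint64_t frames)
{
    for (uint64_t n = 0; n < frames; ++n) {
        gfx.submitGBufferFill(n);            // graphics queue: frame N G-buffer
        gfx.signal(gbufferDone, n);

        asyncCompute.wait(gbufferDone, n);   // compute queue: wait for frame N data
        asyncCompute.submitPostProcessAndPresent(n);
        // No wait on the graphics queue here: the next loop iteration starts
        // frame N+1's G-buffer fill while frame N is still post-processing.
    }
}
```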
 
They don't all utilize the same resources; most normal shaders don't have much use for XMX, for instance.
 