Machine Learning: WinML/DirectML, CoreML & all things ML

New, faster material rendering using Neural Rendering.

This will be released as RTX Neural Materials.

This is even more interesting than DLSS 4 in my opinion because it makes neural rendering a core, integral part of the graphics pipeline, not something applied at the end like DLSS.
 
In a way it might be better to apply it at the end. Having golden samples which must be displayed makes things hard for AI; it's a direct cause of instability.

Maybe it's time for deferred hallucination. Have a subsampled G-buffer as a mere suggestion to the neural renderer, then let the Deep Learning Hallucinating Renderer make something up (for framegen, the frame time and view matrix would also be inputs, and the G-buffer might not be sampled at all).

PS. not being sarcastic.
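
To make that concrete, here is a toy PyTorch-style sketch of what I mean; every name and shape here is made up by me, it just illustrates the interface of a network that takes a subsampled G-buffer plus frame time and view matrix and invents the frame:

```python
# Hypothetical sketch of a "deferred hallucination" renderer: the subsampled
# G-buffer is only a hint; the network makes up the final frame.
import torch
import torch.nn as nn

class DeferredHallucinationRenderer(nn.Module):
    def __init__(self, gbuffer_channels=8, hidden=64):
        super().__init__()
        # Conditioning on frame time + flattened 4x4 view matrix (1 + 16 values).
        self.cond = nn.Linear(1 + 16, hidden)
        self.net = nn.Sequential(
            nn.Conv2d(gbuffer_channels + hidden, hidden, 3, padding=1),
            nn.GELU(),
            nn.Conv2d(hidden, hidden, 3, padding=1),
            nn.GELU(),
            nn.Conv2d(hidden, 3, 3, padding=1),  # RGB output
        )

    def forward(self, sparse_gbuffer, frame_time, view_matrix):
        # sparse_gbuffer: (B, C, H, W), subsampled, zeros where no sample exists.
        # frame_time: (B, 1), view_matrix: (B, 16)
        c = self.cond(torch.cat([frame_time, view_matrix], dim=1))
        c = c[:, :, None, None].expand(-1, -1, *sparse_gbuffer.shape[2:])
        return self.net(torch.cat([sparse_gbuffer, c], dim=1))
```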
 
Perhaps this should be another thread, but with the "Cooperative Vectors" API coming to DirectX, devs can now access tensor cores from normal shading/compute shaders and thus leverage quick ML performance for games.
Enabling Neural Rendering in DirectX: Cooperative Vector Support Coming Soon

What are Cooperative Vectors, and why do they matter?

Cooperative vector support will accelerate AI workloads for real-time rendering, which directly improves the performance of neural rendering techniques. It will do so by enabling multiplication of matrices with arbitrarily sized vectors, which optimize the matrix-vector operations that are required in large quantities for AI training, fine-tuning, and inferencing. Cooperative vectors also enable AI tasks to run in different shader stages, which means a small neural network can run in a pixel shader without consuming the entire GPU. Cooperative vectors will enable developers to seamlessly integrate neural graphics techniques into DirectX applications and light up access to AI-accelerator hardware across multiple platforms. Our aim is to provide game developers with the cutting-edge tools they need to create the next generation of immersive experiences.

Intel, AMD, and Qualcomm support is due, as the blog mentions. After having seen the Neural Rendering demos I am 100% onboard for this and would love to see how devs manage to use this type of feature, even if it just means getting better universal upscaling in the mid-term or other smaller things.
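
To illustrate what cooperative vectors actually accelerate (this is not the DirectX API, just NumPy showing the shape of the math): a small per-pixel neural network boils down to a chain of matrix-vector multiplies, evaluated once per shaded pixel.

```python
# Not the DirectX API: plain NumPy, to show the kind of math cooperative
# vectors accelerate. A tiny per-pixel MLP is a chain of matrix-vector
# multiplies, one evaluation per shaded pixel.
import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.standard_normal((32, 8)), rng.standard_normal(32)   # layer 1
W2, b2 = rng.standard_normal((3, 32)), rng.standard_normal(3)    # layer 2 (RGB out)

def tiny_material_mlp(features):
    """features: 8 values per pixel (e.g. normal, view dir, roughness...)."""
    h = np.maximum(W1 @ features + b1, 0.0)   # matrix * vector, ReLU
    return W2 @ h + b2                        # matrix * vector

# One evaluation per pixel -> millions of small matrix-vector products per
# frame, which is what tensor-core-backed cooperative vectors are meant to
# speed up inside pixel/compute shaders.
pixel = tiny_material_mlp(rng.standard_normal(8))
```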
 
Deepseek has just flipped the script. It used to be believed that MoE could not compete for the highest end frontier models, yet they are competing ... at over an order of magnitude lower computational complexity, due to the combination of MoE, native fp8 and multi-token prediction. They also went a large way towards solving KV cache memory consumption (and Tencent took it further).

This is bringing the potential of very potent language models running off SSD much closer. Deepseek V3 is obviously still a little too complex for that, but something the size of Aria gets close (potent in its own right, but less architecturally adventurous than Deepseek V3). A MoE model specifically designed for SSD would be trained a bit differently too, by restricting the number of new experts per token instead of using simple top-K gating. Pre-gating using outputs from an earlier layer is also an option, so expert selection and layer computation can be pipelined ... though that's not essential, since these MoE models are so cheap to compute that compute is almost irrelevant.
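
A toy sketch of what I mean by restricted gating vs. plain top-K (my own guess at a policy, not anything DeepSeek actually does): treat experts already resident in RAM as free and cap how many cold experts may be fetched from SSD per token.

```python
import numpy as np

def top_k_gating(scores, k):
    # Plain top-K: take the k highest-scoring experts, wherever they live.
    return sorted(np.argsort(scores)[-k:].tolist())

def restricted_gating(scores, k, resident, new_budget=1):
    # SSD-friendly variant: experts already resident in RAM are free,
    # but cap how many "cold" experts may be fetched from SSD this token.
    chosen, new_used = [], 0
    for e in np.argsort(scores)[::-1].tolist():   # best expert first
        if e in resident:
            chosen.append(e)
        elif new_used < new_budget:
            chosen.append(e)
            new_used += 1
        if len(chosen) == k:
            break
    return sorted(chosen)

scores = np.random.default_rng(1).standard_normal(64)  # router logits, 64 experts
resident = set(range(8))                                # experts already loaded
print(top_k_gating(scores, 4))
print(restricted_gating(scores, 4, resident))
```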

I think MoE is going to invade everything soon, not just language models but image gen and rendering too. Dense is dead; everything is going to get ~10x cheaper to run.
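
Rough back-of-the-envelope behind the ~10x, using the commonly reported DeepSeek V3 figures (~671B total parameters, ~37B activated per token):

```python
# Per-token compute scales roughly with the activated parameters, so versus a
# dense model of the same total size:
total_params = 671e9   # DeepSeek V3, total
active_params = 37e9   # DeepSeek V3, activated per token
print(f"~{total_params / active_params:.0f}x fewer FLOPs per token")  # ~18x
```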
 
Deepseek has just flipped the script. It used to be believed that MoE could not compete for the highest end frontier models, yet they are competing ... at over an order of magnitude lower computational complexity, due to the combination of MoE, native fp8 and multi-token prediction
It was revealed that DeepSeek has been trained on a cluster of 50k H100s. Does that still count as lower computational complexity?

According to Wang, when it comes to the Chinese accessing NVIDIA's advanced GPUs, "the reality is yes and no. You know the Chinese labs, they have more H100s than, than people think." He added and shared that his "understanding is that DeepSeek has about fifty thousand H100s." Wang outlined, "they can't talk about obviously because it is against the export controls that United States has put in place." He also thinks that "they have more chips than other people expect."

 
It was revealed that DeepSeek has been trained on a cluster of 50k H100s. Does that still count as lower computational complexity?

The inference side has MoE, native fp8 and multi-token prediction. It's open weights; it's factual.

Whether they can really use fp8 matmul and sparse experts at training time is academic for actually using the models. Though I choose to believe they can.
 