Realtime AI content generation *spawn

Those game clips are generated with Gen 3, which has a max length of 10 seconds. That's not great for temporal consistency: every 10 seconds Lara Croft's shirt might turn from wool to leather to whatever other material.

If this type of architecture is ever going to be used for visual effects in a real-time gameplay context, it has a lot of maturing ahead of it. The good thing is that game engines already have depth buffers, color buffers, etc. to ground the video.
If this type of architecture were used, I don't think it would be an off-the-shelf model; the devs would train their own model for the game.
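As a rough illustration of what "grounding" on engine buffers could look like, here is a minimal sketch in PyTorch. Every name in it (FrameGenerator, build_conditioning) is invented for illustration: it just stacks the depth, albedo, normal, and motion-vector buffers the engine already produces into a conditioning tensor for a learned frame generator. A real system's backbone and training would obviously be far more involved.

```python
# Hypothetical sketch, not any shipping engine/model API: conditioning a
# frame generator on G-buffer channels the engine already produces.
import torch
import torch.nn as nn

class FrameGenerator(nn.Module):
    """Stand-in for a learned image/video generator conditioned on G-buffer data."""
    def __init__(self, cond_channels: int, out_channels: int = 3):
        super().__init__()
        # A real model would be a diffusion/transformer backbone; a single
        # conv keeps the sketch runnable.
        self.net = nn.Conv2d(cond_channels, out_channels, kernel_size=3, padding=1)

    def forward(self, cond: torch.Tensor) -> torch.Tensor:
        return torch.sigmoid(self.net(cond))

def build_conditioning(depth, albedo, normals, motion):
    """Stack engine buffers channel-wise so the generator is grounded in the
    scene's actual geometry and shading rather than guessing it."""
    return torch.cat([depth, albedo, normals, motion], dim=1)

# Toy buffers standing in for one frame's G-buffer (batch of 1).
H, W = 180, 320
depth   = torch.rand(1, 1, H, W)   # linear depth
albedo  = torch.rand(1, 3, H, W)   # base color
normals = torch.rand(1, 3, H, W)   # world-space normals
motion  = torch.rand(1, 2, H, W)   # screen-space motion vectors

cond = build_conditioning(depth, albedo, normals, motion)   # 9 channels
frame = FrameGenerator(cond_channels=cond.shape[1])(cond)   # (1, 3, H, W)
print(frame.shape)
```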
 
It doesn't really solve lighting, animation, or physics interaction. It just gussies up the render, replacing fine detail. Most of the very hard simulation problems remain even if you could run this in real time, used some kind of transfer from a long-term reference for consistency, and ignored the inevitable screw-ups.
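For the "transfer from a long-term reference" idea, one heavily hedged sketch: encode a few canonical frames of an asset into a single appearance code once, then feed that same code to every short generation window so the model has something stable to stay consistent with. The names and the mean-pooling "encoder" below are placeholders, not a known production method.

```python
# Hypothetical sketch: reuse one fixed reference embedding across every
# short generation window to fight appearance drift between chunks.
import torch

def encode_reference(reference_frames: torch.Tensor) -> torch.Tensor:
    """Collapse a few canonical frames into one appearance code.
    A real system would use a learned encoder; mean-pooling keeps it runnable."""
    return reference_frames.mean(dim=(0, 2, 3))          # (C,) appearance code

def condition_window(window_cond: torch.Tensor, ref_code: torch.Tensor) -> torch.Tensor:
    """Broadcast the long-term reference over every frame of a new window."""
    B, C, H, W = window_cond.shape
    ref_map = ref_code.view(1, -1, 1, 1).expand(B, -1, H, W)
    return torch.cat([window_cond, ref_map], dim=1)

reference = torch.rand(4, 3, 64, 64)        # canonical shots of the asset
ref_code = encode_reference(reference)      # reused for every generation window
window = torch.rand(8, 9, 64, 64)           # per-frame G-buffer conditioning
print(condition_window(window, ref_code).shape)  # (8, 12, 64, 64)
```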
 
It's so...blobby! Also the mess it makes of the overhead cables. I do wonder how generative AI will be able to cope when it has zero understanding. If it can be fed geometry data as well, it'll be a lot more robust. But really, we need to see AI working well in offline movies before we have even the hint of a chance of it working well in realtime in games. When Hollywood feeds a model a basic previs render and it outputs final-quality results, then we can look at accelerating that.
 
There's a new generative AI startup in town called Tales, and its goal is to empower gamers and creators to essentially make whatever they want with mere text prompts.

The technology is based on a Large World Model (LWM) called Sia.

This LWM is allegedly capable of generating all the components of a video game, from environments, 3D models, and gameplay to NPC (non-player character) behavior, along with detailed metadata.

 
Nvidia's goal for neural rendering in the near future seems to be a unified hybrid rasterization-raytracing-neural approach, in which the neural renderer would have access to the G-buffer produced by conventional rendering and standard objects (triangle meshes) can be used alongside neural objects. Artists would still have full control of the final output. There would be no need to resort to feeding frames and text descriptions to an AI video generation model and praying it gives the desired output instead of hallucinating something random.
From the paper:
The purpose of the image generator is to synthesize an image from a novel, unobserved view of the scene. The generator receives: parameters of the camera, a G-buffer rendered using a traditional renderer from the novel view, and a view-independent scene representation extracted by the encoder.
One shortcoming of the neural model employed in this article (up to this point) is the rather poor visual quality on high-frequency visual features. However, the fine details and structures due to local illumination can be synthesized inexpensively using classical methods (if an accurate 3D model is available). The output from the classical renderer can be provided to the neural renderer as an additional input, or simply combined with the generated image. The two renderers can complement each other with the neural one focusing on the costly effects only.

We investigate one such scenario in Figure 17: we compute direct illumination via ray tracing and optimize the neural model to produce only indirect illumination.
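A minimal sketch of that split, with invented names (IndirectLightingNet, composite) and the camera parameters folded away for brevity: the classical path supplies ray-traced direct illumination, a stub network predicts only the indirect term from the G-buffer and a view-independent scene code, and the two are summed. This is an illustration of the idea, not the paper's actual code.

```python
# Hedged sketch: classical direct lighting + neural indirect lighting.
import torch
import torch.nn as nn

class IndirectLightingNet(nn.Module):
    """Stand-in for a neural image generator restricted to indirect light."""
    def __init__(self, gbuffer_channels: int, code_dim: int):
        super().__init__()
        self.net = nn.Conv2d(gbuffer_channels + code_dim, 3, kernel_size=3, padding=1)

    def forward(self, gbuffer: torch.Tensor, scene_code: torch.Tensor) -> torch.Tensor:
        B, _, H, W = gbuffer.shape
        code_map = scene_code.view(B, -1, 1, 1).expand(B, -1, H, W)
        return torch.relu(self.net(torch.cat([gbuffer, code_map], dim=1)))

def composite(direct: torch.Tensor, indirect: torch.Tensor) -> torch.Tensor:
    """Sum ray-traced direct light with the neurally predicted indirect term."""
    return direct + indirect

B, H, W = 1, 128, 128
gbuffer = torch.rand(B, 9, H, W)        # depth/normals/albedo from the rasterizer
scene_code = torch.rand(B, 32)          # view-independent code from the encoder
direct = torch.rand(B, 3, H, W)         # ray-traced direct illumination

model = IndirectLightingNet(gbuffer_channels=9, code_dim=32)
frame = composite(direct, model(gbuffer, scene_code))
print(frame.shape)                      # (1, 3, 128, 128)
```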
In recent years, computer-vision algorithms have demonstrated a great potential for extracting scenes from images and videos in a (semi-)automated manner [Eslami et al. 2018]. The main limitation, common to most of these techniques, is that the extracted scene representation is monolithic with individual scene objects mingled together. While this may be acceptable on micro and meso scales, it is undesired at the level of semantic components that an artist may need to animate, relight, or otherwise alter.

Compositionality and modularity—patterns that arise naturally in the graphics pipeline—are key to enable fine control over the placement and appearance of individual objects. Classical 3D models and their laborious authoring, however, are ripe for revisiting as deep learning can circumvent (parts of) the tedious creation process.

We envision future renderers that support graphics and neural primitives. Some objects will still be handled using classical models (e.g. triangles, microfacet BRDFs), but whenever these struggle with realism (e.g. parts of human face), fail to appropriately filter details (mesoscale structures), or become inefficient (fuzzy appearance), they will be replaced by neural counterparts that demonstrated great potential. To enable such hybrid workflows, compositional and controllable neural representations need to be developed first.
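A hedged sketch of what such a hybrid scene and render loop might look like, with entirely invented interfaces (Primitive, TriangleMesh, NeuralAsset): classical and neural objects sit side by side in one scene list, each renders its own layer, and the layers are composited, so every object stays an individually editable entity rather than part of one monolithic model.

```python
# Hypothetical sketch of a hybrid classical/neural render loop.
# No existing engine exposes this API; the interfaces are illustrative only.
from dataclasses import dataclass
from typing import Protocol
import numpy as np

class Primitive(Protocol):
    def render(self, camera: np.ndarray) -> np.ndarray: ...

@dataclass
class TriangleMesh:
    vertices: np.ndarray
    def render(self, camera: np.ndarray) -> np.ndarray:
        # Placeholder for rasterizing / ray tracing a classical mesh into an RGBA layer.
        return np.zeros((64, 64, 4))

@dataclass
class NeuralAsset:
    weights: np.ndarray          # learned representation of fuzzy or hard-to-model appearance
    def render(self, camera: np.ndarray) -> np.ndarray:
        # Placeholder for evaluating a neural representation (e.g. hair, skin).
        return np.zeros((64, 64, 4))

def render_scene(primitives: list[Primitive], camera: np.ndarray) -> np.ndarray:
    """Composite per-primitive RGBA layers; each object stays a separate,
    controllable entity that an artist can move, relight, or swap."""
    frame = np.zeros((64, 64, 4))
    for prim in primitives:
        layer = prim.render(camera)
        alpha = layer[..., 3:4]
        frame[..., :3] = layer[..., :3] * alpha + frame[..., :3] * (1 - alpha)
        frame[..., 3:] = np.maximum(frame[..., 3:], alpha)
    return frame

camera = np.eye(4)
scene = [TriangleMesh(vertices=np.zeros((3, 3))), NeuralAsset(weights=np.zeros(16))]
print(render_scene(scene, camera).shape)    # (64, 64, 4)
```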
 