Realtime AI content generation *spawn

Those game clips are generated with Gen-3, which has a maximum clip length of 10 seconds. That's not great for temporal consistency: every 10 seconds, Lara Croft's shirt might flip from wool to leather to some other material.

If this type of architecture is ever to be used for visual effects in a real-time gameplay context, it has a lot of maturing ahead of it. The good thing is that game engines already have depth buffers, color data, etc. to ground the video.
If this type of architecture were used, I don't think it would be an off-the-shelf model. The devs would train their own model for the game.
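As a rough sketch of what that grounding could look like, here's a hypothetical denoiser that takes engine G-buffer channels as extra conditioning input. Everything here (the class, the channel layout) is illustrative, not any shipping API:

```python
import torch
import torch.nn as nn

class GBufferConditionedDenoiser(nn.Module):
    """Hypothetical sketch: a denoising network that sees the noisy frame
    concatenated with engine G-buffer channels (depth, normals, albedo),
    so generated detail stays anchored to the game's actual geometry."""

    def __init__(self, gbuffer_channels: int = 7):  # 1 depth + 3 normals + 3 albedo
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3 + gbuffer_channels, 64, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv2d(64, 3, kernel_size=3, padding=1),  # predicted RGB
        )

    def forward(self, noisy_frame: torch.Tensor, gbuffer: torch.Tensor) -> torch.Tensor:
        # The G-buffer provides per-pixel geometry and material cues,
        # which is exactly the grounding a game engine gets for free.
        return self.net(torch.cat([noisy_frame, gbuffer], dim=1))

model = GBufferConditionedDenoiser()
out = model(torch.randn(1, 3, 256, 256), torch.randn(1, 7, 256, 256))
```

A per-game fine-tune would then amount to training that conditioning path on captures from the game's own renderer.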
 
It doesn't really solve lighting, animation, or physics interaction. It just gussies up the render, replacing fine detail. Most of the very hard simulation problems remain even if you could run this in real time, used some kind of transfer from a long-term reference for consistency, and ignored the inevitable screw-ups.
 
It's so...blobby! Also the mess it makes of the overhead cables. I do wonder how generative AI will be able to cope when it has zero understanding. If it can be fed geometry data as well, it'll be a lot more robust. But really, we need to see AI working well in offline movies before we have even the hint of a chance of it working well in realtime in games. When Hollywood feeds a model a basic previs render and it outputs final-quality results, then we can look at accelerating that.
 
There's a new generative AI startup in town called Tales, and its goal is to empower gamers and creators to essentially make whatever they want with mere text prompts.

The technology is based on a Large World Model (LWM) called Sia.

This LWM is allegedly capable of generating all the components of a video game, from environments, 3D models, and gameplay to NPC (non-player character) behavior, along with detailed metadata.

 
Nvidia's goal for neural rendering in the near future seems to be a unified hybrid rasterization-ray tracing-neural approach, in which the neural renderer has access to the G-buffer produced by conventional rendering, and standard objects (triangle meshes) can be used alongside neural objects. Artists would still have full control of the final output. There would be no need to resort to feeding frames and text descriptions to an AI video-generation model and praying it gives the desired output instead of hallucinating something random.
From the paper:
The purpose of the image generator is to synthesize an image from a novel, unobserved view of the scene. The generator receives: parameters of the camera, a G-buffer rendered using a traditional renderer from the novel view, and a view-independent scene representation extracted by the encoder.
One shortcoming of the neural model employed in this article (up to this point) is the rather poor visual quality on high-frequency visual features. However, the fine details and structures due to local illumination can be synthesized inexpensively using classical methods (if an accurate 3D model is available). The output from the classical renderer can be provided to the neural renderer as an additional input, or simply combined with the generated image. The two renderers can complement each other with the neural one focusing on the costly effects only.

We investigate one such scenario in Figure 17: we compute direct illumination via ray tracing and optimize the neural model to produce only indirect illumination.
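To make the split concrete, here's a minimal sketch of that Figure 17 setup, assuming the neural part is just some image-to-image network and both renderers write linear radiance (all names hypothetical, not the paper's code):

```python
import torch
import torch.nn as nn

# Hypothetical stand-in for the paper's generator: from the G-buffer,
# predict only the expensive component (indirect illumination).
indirect_net = nn.Sequential(
    nn.Conv2d(7, 32, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.Conv2d(32, 3, kernel_size=3, padding=1),
)

def shade(gbuffer: torch.Tensor, direct: torch.Tensor) -> torch.Tensor:
    """The classical renderer supplies sharp, cheap direct lighting;
    the network fills in the soft, costly indirect term; the two are
    simply added in linear radiance."""
    return direct + indirect_net(gbuffer)

gbuffer = torch.randn(1, 7, 128, 128)  # depth/normals/albedo from the rasterizer
direct = torch.rand(1, 3, 128, 128)    # ray-traced direct illumination
final = shade(gbuffer, direct)
```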
In recent years, computer-vision algorithms have demonstrated a great potential for extracting scenes from images and videos in a (semi-)automated manner [Eslami et al. 2018]. The main limitation, common to most of these techniques, is that the extracted scene representation is monolithic with individual scene objects mingled together. While this may be acceptable on micro and meso scales, it is undesired at the level of semantic components that an artist may need to animate, relight, or otherwise alter.

Compositionality and modularity—patterns that arise naturally in the graphics pipeline—are key to enable fine control over the placement and appearance of individual objects. Classical 3D models and their laborious authoring, however, are ripe for revisiting as deep learning can circumvent (parts of) the tedious creation process.

We envision future renderers that support graphics and neural primitives. Some objects will still be handled using classical models (e.g. triangles, microfacet BRDFs), but whenever these struggle with realism (e.g. parts of human face), fail to appropriately filter details (mesoscale structures), or become inefficient (fuzzy appearance), they will be replaced by neural counterparts that demonstrated great potential. To enable such hybrid workflows, compositional and controllable neural representations need to be developed first.
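At heart, the "graphics and neural primitives" idea is per-object dispatch. A toy sketch, where every type and function is made up purely for illustration:

```python
from dataclasses import dataclass
from typing import Callable
import numpy as np

@dataclass
class Mesh:
    name: str  # resolved by the classical path: triangles, microfacet BRDFs

@dataclass
class NeuralAsset:
    name: str
    evaluate: Callable[[], np.ndarray]  # learned model: view -> radiance layer

def rasterize(mesh: Mesh) -> np.ndarray:
    return np.zeros((64, 64, 3))  # stand-in for the whole classical pipeline

def render(scene) -> np.ndarray:
    frame = np.zeros((64, 64, 3))
    for obj in scene:
        layer = rasterize(obj) if isinstance(obj, Mesh) else obj.evaluate()
        frame += layer  # naive additive composite, purely for illustration
    return frame

scene = [Mesh("level geometry"),
         NeuralAsset("hair", evaluate=lambda: np.full((64, 64, 3), 0.1))]
frame = render(scene)
```

The hard part the authors flag isn't the dispatch; it's making the neural objects compositional and controllable enough to sit in such a loop at all.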
 
Not realtime, but this illustrates what's possible with greater source clarity than just images.


Although the fries do go a bit disconnected. With 3D sources, a lot more is going to be possible than processing images. I'm also curious how the horse was handled. I'm guessing there must be prompts.
 
Wouldn’t it be easy enough to infer the 2D image was a horse and go from there?

Yes, but you'd need to have a predefined idea of what a horse is. There must be something prompting the AI on the different possible interpretations.

And from that, how much deviation can you have before it doesn't recognise a 'horse' and doesn't know to give it four legs? Or what if you create a six-legged salamander, drawing three 2D legs? Or a three-legged tripod creature?

So, if it is to work well, I think there must be something guiding the AI, whether word prompts or some other metadata to inform it.
Exactly, the AI models don't know what a horse is the way humans do. A model can be trained on x images and told "that's a horse", but that will only get you so far. What if someone stuck a carrot on its forehead to make it look like a unicorn? A human would know it's a horse with a carrot stuck on its forehead, but the AI might fail to recognize it as anything it's been trained on, or, if it's been trained on unicorn images, it could interpret it as a unicorn.
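That brittleness is easy to poke at with an off-the-shelf vision-language model. A sketch using Hugging Face's CLIP (the model name is real; the test image is a hypothetical file you'd supply):

```python
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

labels = ["a horse", "a unicorn", "a horse with a carrot on its forehead"]
image = Image.open("horse_with_carrot.jpg")  # hypothetical test image

inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
probs = model(**inputs).logits_per_image.softmax(dim=1)

# The scores reflect statistical association, not understanding; nothing
# guarantees the common-sense reading wins over "unicorn".
for label, p in zip(labels, probs[0].tolist()):
    print(f"{label}: {p:.3f}")
```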
 
What would it actually take to get photorealistic graphics at 1080p and 2160p? Tim Sweeney said 40 TFs, but that seems far off the mark save in some specific cases like body-cam urban scenes. Is it a case of the computational power being wrongly directed, or has the workload of reality been grossly underestimated? Given that we're nowhere near solving things like accurate foliage, truly natural human behaviours, solid, correct illumination, and realistic fire and smoke, and given the many, many flops of the ML we're hoping will solve some of these, the actual workload to create something like watching a film in realtime seems a long, long way off, if even possible. We inch ever closer, but the closer we get, the more the shortcomings stand out.
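For scale, a quick back-of-envelope on what a fixed budget buys per pixel (the 40 TF figure is Sweeney's; the rest is plain arithmetic, ignoring real-world utilization entirely):

```python
# How many FLOPs per pixel does 40 TFLOPS buy at 60 fps?
tflops = 40e12
for w, h, fps in [(1920, 1080, 60), (3840, 2160, 60)]:
    flops_per_pixel = tflops / (w * h * fps)
    print(f"{w}x{h} @ {fps} fps: ~{flops_per_pixel:,.0f} FLOPs/pixel")
# 1920x1080 @ 60 fps: ~321,502 FLOPs/pixel
# 3840x2160 @ 60 fps: ~80,376 FLOPs/pixel
```

Whether ~80K FLOPs per pixel is anywhere near enough for film-quality light transport is exactly the open question.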
 
Animation is always going to be a differentiator between pre-rendered CGI and real-time game graphics. Being able to create bespoke animations for any given scene is a huge advantage compared to the more freeform nature of interactive gameplay. You'd need some nearly magical level of flawless procedural animation and physics working together to make a game with truly realistic and dynamic behaviours for people and other entities.

Naughty Dog is the clear industry leader in this area and seems to heavily prioritize these things, yet at the end of the day they're still heavily reliant on a limited library of scripted animations, much like everybody else.
 
Even if we excuse things like ambulation, which needs to be cheated to provide responsiveness, and look at a Quantic Dream type thing, or the Death Stranding walking simulator, facial animation isn't close to realistic yet. The closer we get, the more the fine details fail. Lip sync isn't perfect, and that really gives the game away. Facial deformation isn't perfect at a minuscule level, but that minuscule level is enough to make it look wrong. Clothing doesn't fold and bunch and crease and flex right. Skin doesn't either. At the level of world simulation, we've only really just begun and have so very far to go!
 
I actually think that with the advent of LLMs and machine learning we have a shot at reaching photorealism quickly; AI will be the shortcut here.


As we've seen with the videos showing game scenes converted by AI into photorealistic scenes full of lifelike characters, hair physics, cloth simulation, realistic lighting, shadowing, and reflections, we have a glimpse of the future. There are many shortcomings, of course, but they will be fixed when the AI is closely integrated into the game engine.

The AI model will have access to 3D data, full world-space coordinates, lighting information, and various other details instead of just 2D video data; this will be enough to boost its accuracy and minimize the amount of inference it has to do. We will also have faster and smarter models requiring less time to do their thing.

I can see future GPUs having much larger matrix cores, to the point of outnumbering the regular FP32 cores, and CPUs having bigger NPUs to assist; this would be enough to do 720p at 60 fps rendering, maybe even 1080p30 or 1080p60 if progress allows it.

Next, this output will be upscaled, denoised, and frame-generated to the desired fidelity.
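A minimal sketch of that post-process chain, with each learned stage replaced by the simplest possible stand-in (real pipelines use trained networks for all three steps, so treat this purely as the shape of the pipeline):

```python
import torch
import torch.nn.functional as F

def postprocess(frames: torch.Tensor, target=(1080, 1920)) -> torch.Tensor:
    """Upscale -> denoise -> frame-generate, as described above.
    frames: (N, 3, 720, 1280) low-res renders."""
    # 1. Upscale to 1080p (a learned upscaler in practice).
    up = F.interpolate(frames, size=target, mode="bilinear", align_corners=False)
    # 2. "Denoise" with a box blur (a trained denoiser in practice).
    kernel = torch.ones(3, 1, 3, 3) / 9.0
    den = F.conv2d(up, kernel, padding=1, groups=3)
    # 3. Frame generation: blend a new frame between each pair
    #    (optical-flow-guided interpolation in practice).
    mids = 0.5 * (den[:-1] + den[1:])
    interleaved = [f for pair in zip(den[:-1], mids) for f in pair] + [den[-1]]
    return torch.stack(interleaved)

frames_720p = torch.rand(4, 3, 720, 1280)
frames_1080p = postprocess(frames_720p)  # 7 frames at 1080p
```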

All in all, this path is, at least in theory, much quicker than waiting for traditional rendering to become mature and fast enough, which is getting harder and taking longer; we simply lack the transistor budget to scale up the horsepower required for traditional rendering to reach photorealism, and to do so at previously feasible economic levels.

Even now, traditional rendering faces huge challenges, chief among them code being limited by the CPU and the slow progress of CPUs themselves; something has to give to escape these seemingly inescapable hurdles that have existed for far too long.

So, shifting the transistor budget toward a bigger machine-learning portion at the expense of the traditional portion would be the smart thing to do, especially when it opens up entirely new visual capabilities.
 
I'm not so convinced. It's always the case with prototyping games that you get something fabulous in a weekend, but all the effort needed to produce the polished final product takes forever. I think these quick results show promise, but the end result is actually a long way off and the imagined potential isn't within reach. At best, subdividing the game into aspects ML can solve, like cloth dynamics, might work. I've too much life experience to look at these current results and extrapolate a near-term future of the best we can imagine! The magic bullets never are, and what we always end up with is an awkward compromise of glitchy fudges no matter how much power we throw at it.
 
Yeah, tbh I really don't think that video looks good at all. Most of these 'AI re-imagined' games look like stylistic messes.

I feel like people forget games are art, and that therefore you need more than an artificial intelligence coming up with all the artwork. Chasing photorealism at the expense of the art form produces bad results.
 