Impact of nVidia Turing RayTracing enhanced GPUs on next-gen consoles *spawn

Yup, they got pretty great results using statistical approximation, which in a nutshell is basically what we're doing with AI.
And you can do this without using an NN; different statistical algorithms would work.
But it also could have been done using a DNN, which benefits from massive data sets.
They'd have to run it and see the results, though.
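To make "statistical approximation" a bit more concrete, even something as simple as an edge-aware cross-bilateral filter over a noisy 1spp buffer is a usable non-NN denoiser. A minimal sketch of one such pass in CUDA (my own illustration only, not anyone's shipping denoiser; real ones like SVGF add temporal accumulation and several à-trous passes):

// Illustrative only: a single cross-bilateral filter pass over a noisy 1spp radiance buffer,
// guided by depth and normal G-buffers so the blur stops at geometric edges.
__global__ void denoisePass(const float3* noisy, const float* depth, const float3* normal,
                            float3* out, int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height) return;

    int idx = y * width + x;
    float3 nC = normal[idx];
    float  dC = depth[idx];

    float3 sum  = make_float3(0.f, 0.f, 0.f);
    float  wSum = 0.f;
    for (int dy = -2; dy <= 2; ++dy)
    for (int dx = -2; dx <= 2; ++dx)
    {
        int sx = min(max(x + dx, 0), width  - 1);
        int sy = min(max(y + dy, 0), height - 1);
        int s  = sy * width + sx;

        // Edge-stopping weights: samples across depth/normal discontinuities count less.
        float wN = powf(fmaxf(0.f, nC.x*normal[s].x + nC.y*normal[s].y + nC.z*normal[s].z), 32.f);
        float wD = expf(-fabsf(depth[s] - dC) * 50.f);
        float w  = wN * wD;

        sum.x += noisy[s].x * w;  sum.y += noisy[s].y * w;  sum.z += noisy[s].z * w;
        wSum  += w;
    }
    out[idx] = make_float3(sum.x / wSum, sum.y / wSum, sum.z / wSum);
}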
Post here on resetera:

I guess it largely depends on the algorithm used and how effective it is.
 
Consoles have been using advanced upscaling techniques for years now, starting with Killzone Shadow Fall on PS4, moving through checkerboarding in Rainbow Six, and on to temporal reconstruction, of which different developers have their own take; the best is probably Insomniac's. As it's incredibly effective in compute on <2 TF machines, the response to DLSS has been very muted from some of us, and in cases like mine it looks more like an attempt to find something for the AI units to do, assuming they weren't added to Turing for graphics reasons in the first place. DLSS's main benefit might well be just adoption in the PC space, where temporal reconstruction has remained niche, as reconstruction offers loads more quality per pixel for a minimal reduction in clarity.
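For anyone not familiar with what these reconstruction techniques actually do, the shared core is simple: render a jittered, lower-resolution frame, reproject last frame's output with motion vectors, clamp that history against the current neighbourhood to limit ghosting, and blend. A rough sketch of that accumulation step (a simplification of the general idea, not Insomniac's or anyone else's actual implementation):

// Illustration of the accumulation step shared by most temporal upscalers.
// current = this frame's jittered, lower-resolution samples (already mapped to the target grid)
// history = last frame's reconstructed output
// motion  = per-pixel motion vectors in output-resolution texels
__global__ void temporalAccumulate(const float3* current, const float3* history,
                                   const float2* motion, float3* out,
                                   int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height) return;
    int idx = y * width + x;

    // Reproject: fetch where this pixel was last frame.
    int px = min(max(int(x - motion[idx].x + 0.5f), 0), width  - 1);
    int py = min(max(int(y - motion[idx].y + 0.5f), 0), height - 1);
    float3 hist = history[py * width + px];

    // Clamp history to the local neighbourhood of the current frame to limit ghosting.
    float3 cMin = current[idx], cMax = current[idx];
    for (int dy = -1; dy <= 1; ++dy)
    for (int dx = -1; dx <= 1; ++dx)
    {
        int s = min(max(y + dy, 0), height - 1) * width + min(max(x + dx, 0), width - 1);
        cMin.x = fminf(cMin.x, current[s].x); cMax.x = fmaxf(cMax.x, current[s].x);
        cMin.y = fminf(cMin.y, current[s].y); cMax.y = fmaxf(cMax.y, current[s].y);
        cMin.z = fminf(cMin.z, current[s].z); cMax.z = fmaxf(cMax.z, current[s].z);
    }
    hist.x = fminf(fmaxf(hist.x, cMin.x), cMax.x);
    hist.y = fminf(fmaxf(hist.y, cMin.y), cMax.y);
    hist.z = fminf(fmaxf(hist.z, cMin.z), cMax.z);

    // Exponential blend: most of the output comes from accumulated history.
    const float alpha = 0.1f;   // weight of the new frame
    out[idx].x = alpha * current[idx].x + (1.f - alpha) * hist.x;
    out[idx].y = alpha * current[idx].y + (1.f - alpha) * hist.y;
    out[idx].z = alpha * current[idx].z + (1.f - alpha) * hist.z;
}

Checkerboarding and Insomniac's temporal injection differ in how the sparse samples are laid out and resolved, but the reprojection and history clamping above is roughly the part they all share, and it's cheap enough to run in ordinary compute.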

If you want some strong pro-upscaling opinions, search Sebbbi's post history for reconstruction.
Now that Unreal Engine has temporal upscaling and it's really easy to get working, we should see it even in PC games from now on.

Temporal upscaling is cheap in terms of processing power as well, so I'm really looking forward to the first game using both it and DLSS to get some decent comparisons done.
I wouldn't be surprised if TU is faster and gives better quality in many cases, and it works with dynamic resolution.
 
c) If you consider how much tensor cores can do, then it's easy to see them doing RT denoising (which hasn't been enabled yet) alongside DLSS, and any other AI models, like animation, physics, or even on-the-fly content creation.
I didn't pass judgement on the value of Tensor cores. I said for upscaling they seemed inefficient.

This is false or at least should be false.
If true, that makes DLSS as an upscale solution even more silicon inefficient. ;)

To be clear, I am only describing upscaling games, the state of the art of which DavidGraham seems to be unaware, and which we know works wonderfully well in compute. So the case for DLSS on Tensor (the Turing implementation, as this thread is about Turing) is: can it upscale better/cheaper? That means comparing results to silicon costs. And if, as you say, DLSS can be performed on compute, then its value needs to be factored into that.

The major argument you're presenting is that Tensor cores are worth it anyway, and if they're in there, they can be used for upscaling. That may be a fair argument, but it doesn't show that Tensor cores are an efficient solution just for upscaling.

TensorCores are nearly for free. They are just processing units. nVidia is putting them into the SM and using the upgraded scheduler hardware to feed them. You can compare GP100 with GV100 for the size penalty.
I'm seeing lots of numbers and no clear way to compare them. Can you provide the Tensor core size cost as a figure? What percentage of the die on an RTX 2070 is Tensor cores? How much compute could be had in the same space, and how much compute is needed for high-quality temporal injection (we know that's a fraction of PS4's 1.8 TFs)?
 
I'm seeing lots of numbers and no clear way to compare them. Can you provide the Tensor core size cost as a figure? What percentage of the die on an RTX 2070 is Tensor cores? How much compute could be had in the same space, and how much compute is needed for high-quality temporal injection (we know that's a fraction of PS4's 1.8 TFs)?

GV100 is 33% bigger and has 40% more compute units (SM) than GP100.

nVidia's Bill Dally has talked a lot about the cost of FP units in recent years. They are really cheap on the latest process technologies, but the biggest problem is feeding them efficiently. TensorCores are just FP units included in the normal infrastructure of the compute units. Their size doesn't really matter on 16nm and forward.
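For reference, and assuming the published figures are right: GP100 is about 610 mm² with 60 SMs, GV100 about 815 mm² with 84 SMs, so 815 / 610 ≈ 1.34 (roughly 33% more area) versus 84 / 60 = 1.40 (40% more SMs). Note GV100 is also on the denser 12 nm process, and the extra area covers a bigger L2 and more NVLink as well, so isolating the TensorCore cost from die size alone is still guesswork.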
 
Their size doesn't really matter on 16nm and forward.
I find that difficult to believe. Being able to do that much maths is going to require an amount of silicon. Tensor units also aren't just FP units, but fixed-function matrix maths units. This reddit discussion goes on at length with two people arguing whether Tensor cores take up significant silicon or not, with no clear consensus. Google's Tensor processor is >300 mm² at 28 nm.

I have no idea whether they take up 3% or 30%, but it's an amount and we can't consider their efficiency until we know that amount.
 
If true, that makes DLSS as an upscale solution even more silicon inefficient. ;)

To be clear, I am only describing upscaling games, the state of the art of which DavidGraham seems to be unaware, and which we know works wonderfully well in compute. So the case for DLSS on Tensor (the Turing implementation, as this thread is about Turing) is: can it upscale better/cheaper? That means comparing results to silicon costs. And if, as you say, DLSS can be performed on compute, then its value needs to be factored into that.

The major argument you're presenting is that Tensor cores are worth it anyway, and if they're in there, they can be used for upscaling. That may be a fair argument, but it doesn't show that Tensor cores are an efficient solution just for upscaling.
lol, from this perspective, yes, I suppose you are correct. As a counterpoint I would have to bring in the analogy of dedicated hardware blocks, such as HEVC decoding/encoding vs. compute, or the SHAPE audio block vs. TrueAudio on compute. In both instances choices were made to leverage dedicated silicon for a specific function, to run at higher speeds or use fewer resources than the compute-based method. I have no doubt that we could do HEVC decoding and amazing audio over compute, we know it works, but the hardware was created nonetheless, likely because the payoff of performing these actions on fixed-function hardware, given the available silicon, made a lot of sense, even for some features that are not necessarily used all the time.

If I take that concept and draw it over to TensorCores specifically, and let's just assume that TensorCores could not support anything else but DLSS, then we're looking at tensor cores as fixed-function hardware whose main objective is to upscale and antialias, at the cost of silicon versus using that space for more compute, for instance.

So for the sake of discussion there are some assumptions I need to make: the first is that Temporal AA and Temporal Injection should be similar enough in nature that the performance cost and output should be similar; the second is that we take JHH's demo at face value and that he is not attempting to deceive the audience.

That being said, here we see TAA vs DLSS. In summary, without watching the video, they present that DLSS has better upscaling and enough of a performance increase to be considered, from what I can see, a deviation or two better than what we're seeing with TAA.

Now, I don't have enough data points to prove this, but there are other assumptions we need to make as well. One is that, because the tensor cores are separate from compute, they are doing their thing without trashing anything on the compute side; in theory they should be able to go full tilt without interfering with compute. On the TAA side, I have to assume there are, to some degree, drawbacks in the pipeline, because you're asking compute to do everything from rendering to upscaling. There may also be some implementation restrictions I'm not fully aware of, which perhaps is one reason we may not yet be seeing widespread adoption across all titles.

That being said, if we can see that DLSS is indeed better in quality and can process faster, hence the increase in frame rate, the only remaining question is how much silicon is being used to support the tensor cores, which I don't know either. The question then becomes whether that space, filled with even more compute, could perform equally well running DLSS or, say, TAA. And we still need to consider things like architecture, being able to feed the compute, shared caches, etc. It's not so simple as just beefing up the compute numbers. Whereas tensor cores are a dedicated unit that doesn't necessarily need to interact with the compute environment, as far as I understand. If we go back to the original example of dedicated HEVC and SHAPE blocks: you can't just add more CUs; the Xbox One can't go from 12 to 13 CUs just because we remove the dedicated blocks. Perhaps you'd move to 16 CUs (two groups of 8) and make 2 redundant to have 14 CUs. But perhaps you don't have space for that either, yet enough space for, say, a small dedicated block.

Quick throwback here:
But the Xbox One has 14 CUs and PS4 has 20 CUs. Even if you removed all the ESRAM, we're looking at only fitting three more pairs of CUs in there to get to 20. So I don't believe we can generalize too easily that removing the dedicated blocks for more compute is straightforward.
[Image: Xbox One SoC]

To round out the discussion, it also does more than just DLSS, and I think that's an important aspect as well. As we find more ways to leverage NNs for games, I think having an AI accelerator makes a lot of sense. And, at the risk of derailing the thread, I think there is sufficient evidence that the industry is moving in this direction.

Machine Learning Graphics Intern:
https://www.indeed.com/viewjob?jk=0...g+in+Games&tk=1d18ti15d5ico803&from=web&vjs=3

Machine Learning Engineer at Embark, a new studio founded by a number of senior devs including Johan Andersson (repi), who posts here from time to time:
Machine Learning Engineer

Machine Learning at Ubisoft MTL
https://jobs.smartrecruiters.com/Ubisoft2/105304105-machine-learning-specialist-data-scientist

Senior AI Software Engineer
https://ea.gr8people.com/index.gp?opportunityID=152736&method=cappportal.showJob&sysLayoutID=122

EA apparently has an entire AI/NN/ML group within SEED:
https://www.ibtimes.com/ea-e3-2017-ai-machine-learning-research-division-launches-2551067

This is a small list, but I expect to see it continue to grow over the years. With so much being poured into this area of research, I am not opposed to dedicating silicon to support this entirely separate function for faster performance in this area. Yes, compute can do it, but compute cannot do it faster than tensor cores can, at least when it comes to running TensorFlow-based models.
 
That being said, here we see TAA vs DLSS. In summary, without watching the video, they present that DLSS has better upscaling and enough of a performance increase to be considered, from what I can see, a deviation or two better than what we're seeing with TAA.
I assume TAA here means TAA at full resolution rendering, while DLSS means rendering at low resolution, but upscaling AND AA assisted by tensor cores.
For a fair comparison they would need to show TAA + temporal reconstruction upscaling with low res rendering. (Pretty sure quality and performance can compete easily)
Not sure, though.

Whereas tensor cores are a dedicated unit that doesn't necessarily need to interact with the compute environment, as far as I understand.
Interesting. I thought tensors are driven from compute: they only do matrix multiplications, while compute has to manage program flow, logic and data transfer as usual. So I assume CUs are not free for other tasks while performing DLSS or denoising?
Would be interesting to know more about! (Never looked it up because tensors are not exposed anyway yet.)

I see another disadvantage of DLSS that maybe nobody has mentioned yet: the need to train per game with a supercomputer. So to use it, you need to be AAA and you need NV's help, I guess. Or you at least need a lot of resources.


What can the tensor cores be used for in games, besides DLSS reconstruction?

Killer feature would be procedural character animation IMO. Although I work on this myself using only physics and no AI, at some point things become complex and fuzzy.
But that's just an example. AI might be useful and spur new innovations we cannot predict yet.
 
lol, from this perspective, yes, I suppose you are correct. As a counterpoint I would have to bring in the analogy of dedicated hardware blocks, such as HEVC decoding/encoding vs. compute, or the SHAPE audio block vs. TrueAudio on compute. In both instances choices were made to leverage dedicated silicon for a specific function, to run at higher speeds or use fewer resources than the compute-based method.
The analogy there would be dedicated upscale blocks rather than fully programmable AI cores. Tensor and compute are two different problem-solving paradigms. Both can achieve results, so it's a case of picking which is the best bang-per-buck. For some workloads compute is actually pretty ideal, especially image stuff using 'standard' (non-neural-net) algorithms, because it's been developed over 20 years to be ideal for that.

So for the sake of discussion there are some assumptions I need to make: the first is that Temporal AA and Temporal Injection should be similar enough in nature that the performance cost and output should be similar; the second is that we take JHH's demo at face value and that he is not attempting to deceive the audience.
Well, we know TAA and temporal injection aren't similar in output. TAA doesn't improve resolution. Insomniac's TI does antialiasing akin to TAA but at a relatively higher resolution, just like the DLSS demo.

Now, I don't have enough data points to prove this, but there are other assumptions we need to make as well. One is that, because the tensor cores are separate from compute, they are doing their thing without trashing anything on the compute side; in theory they should be able to go full tilt without interfering with compute.
Reconstruction has to happen after the image has been constructed, so it's a clean slice. Render, then upscale, then render the next frame.

Really, this discussion needs a comparable reference. Insomniac haven't gone into details on how their method works versus other reconstruction and it's not on any PC titles AFAIK. What games are there on PC with reconstruction options?

To round out the discussion, it also does more than just DLSS, and I think that's an important aspect as well. As we find more ways to leverage NNs for games, I think having an AI accelerator makes a lot of sense. And, at the risk of derailing the thread, I think there is sufficient evidence that the industry is moving in this direction.
That's probably very true and I'm not against them in principle. Upscaling games isn't anything new though, and DLSS isn't really doing anything special yet - DavidGraham's enthusiasm is misplaced because he wasn't aware of what was already happening with clever developers and the options fully programmable compute has opened up this gen.
 
TensorCores are nearly for free. They are just processing units. nVidia is putting them into the SM and using the upgraded scheduler hardware to feed them. You can compare GP100 with GV100 for the size penalty.
Well, not really. Someone on another forum calculated the die size, frequency and average benchmark performance of the 2080 Ti. Yes, the card is faster (nothing new here), but if you calculate performance per mm², it drops by almost 30%.
Also, DLSS has many flaws (btw, so does TAA). There are some pretty good upscaling techniques on consoles with only minor flaws. I really don't know why nvidia didn't invest in those. DLSS is really a waste of resources, not that easy to implement after all (only one title so far), and has really mixed results (from "it doesn't do anything at all" to "flickering" to "compression" artifacts) with a heavy performance hit (comparable to TAA @ 1800p vs 1440p DLSS (to 4k)).
nvidia is just trying to invent something new, something that isn't optimized for the use case, just to be the first.
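To sketch the shape of that perf/mm² calculation (die sizes as published, the benchmark gain a purely illustrative number, not that forum's data): TU102 in the 2080 Ti is about 754 mm² versus 471 mm² for GP102 in the 1080 Ti, i.e. roughly 1.6x the area. If the 2080 Ti averaged, say, 30% faster in current non-DLSS/non-RT games, then performance per mm² would be 1.30 / 1.60 ≈ 0.81, already a ~19% drop; a smaller average gain pushes it towards the ~30% figure quoted above.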
 
I assume TAA here means TAA at full resolution rendering, while DLSS means rendering at low resolution, but upscaling AND AA assisted by tensor cores.
For a fair comparison they would need to show TAA + temporal reconstruction upscaling with low res rendering. (Pretty sure quality and performance can compete easily)
Not sure, though.
In the demo, both are operating on a 1440p frame buffer, so everything is the same except the algorithm used and the hardware doing it.

Interesting. I thought tensors are driven from compute: they only do matrix multiplications, while compute has to manage program flow, logic and data transfer as usual. So I assume CUs are not free for other tasks while performing DLSS or denoising?
Would be interesting to know more about! (Never looked it up because tensors are not exposed anyway yet.)

I see another disadvantage of DLSS that maybe nobody has mentioned yet: the need to train per game with a supercomputer. So to use it, you need to be AAA and you need NV's help, I guess. Or you at least need a lot of resources.
That part is hidden to me. Tensors are just really high-performing FP16 multiply-adders. I think there are some other little things in there, like accumulators, but generally things like rasterizers or texture mapping should not be touching them. So tensors being put into a compute engine seems not as direct as it could be. I could be wrong, but I'm not a hardware guy.
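For what it's worth, NVIDIA's Volta material describes each tensor core as a 4x4 matrix multiply-accumulate per clock, D = A x B + C, with A and B in FP16 and C/D in FP16 or FP32. So "really high-performing FP16 multiply-adders" is about right, just ganged into fixed matrix blocks rather than exposed as independent FMA lanes.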

As for the disadvantage of DLSS:
Training is cheap; it could easily become a net positive here when you consider the cost of labour (time, complexity, talent, etc.). Take your finished product, hand it over to another company to do the processing, and it's done. Extrapolate this concept over a variety of different tasks, like content creation, and labour costs go down significantly.
 
The analogy there would be dedicated upscale blocks rather than fully programmable AI cores. Tensor and compute are two different problem-solving paradigms. Both can achieve results, so it's a case of picking which is the best bang-per-buck. For some workloads compute is actually pretty ideal, especially image stuff using 'standard' (non-neural-net) algorithms, because it's been developed over 20 years to be ideal for that.
Agreed. Tensor is extremely specific here, which may or may not work against its presence. We could in theory do everything using standard compute, i.e. what the MI60 or just about any non-Volta card has been doing since 2010.

Well, we know TAA and temporal injection aren't similar in output. TAA doesn't improve resolution. Insomniac's TI does antialiasing akin to TAA but at a relatively higher resolution, just like the DLSS demo.
Yeah, I'm not sure what the costs here are, but if injection takes more processing power than just AA, then the resulting performance should be worse.

Reconstruction has to happen after the image has been constructed, so it's a clean slice. Render, then upscale, then render the next frame.

Really, this discussion needs a comparable reference. Insomniac haven't gone into details on how their method works versus other reconstruction and it's not on any PC titles AFAIK. What games are there on PC with reconstruction options?
At least in the version that MS shows (I'll grab a picture in a bit), it would appear that the two need to come together. It's not upscaling after the image is completed; it's part of the render chain. I suspect Nvidia may also do it this way.

That's probably very true and I'm not against them in principle. Upscaling games isn't anything new though, and DLSS isn't really doing anything special yet - DavidGraham's enthusiasm is misplaced because he wasn't aware of what was already happening with clever developers and the options fully programmable compute has opened up this gen.
Yup, I agree that there is a time and place for everything. Unfortunately it's a little early for me to put out some real points; there just aren't enough data points on DLSS. We'll know more once the DLSS patch for BFV is released.
 
DLSS has been shown to be "not utterly crappy" only when trained on pre-recorded content (Final Fantasy bench, Infiltrator tech demo, Porsche tech demo & now the Futuremark bench)... which kind of defeats the whole point of it, unless only "playing" benchmarks on your brand new GPU is the new hip thing to do.
 
DLSS has been shown to be "not utterly crappy" only when trained on pre-recorded content (Final Fantasy bench, Infiltrator tech demo, Porsche tech demo & now the Futuremark bench)... which kind of defeats the whole point of it, unless only "playing" benchmarks on your brand new GPU is the new hip thing to do.
I don't think we have enough games with implementations to make any conclusions. We have... 1?
 
iroboto: "Yup, I agree that there is a time and place for everything. Unfortunately it's a little early for me to put out some real points; there just aren't enough data points on DLSS. We'll know more once the DLSS patch for BFV is released."

Makes sense.

Looking up here https://devblogs.nvidia.com/programming-tensor-cores-cuda-9/
"During program execution, multiple Tensor Cores are used concurrently by a full warp of execution. The threads within a warp provide a larger 16x16x16 matrix operation to be processed by the Tensor Cores. CUDA exposes these operations as warp-level matrix operations in the CUDA C++ WMMA API. These C++ interfaces provide specialized matrix load, matrix multiply and accumulate, and matrix store operations to efficiently utilize Tensor Cores in CUDA C++ programs."

This is basically my source for the assumption that compute is necessary to control the tensor cores. But I further assume tensor operations usually take a long time, so the compute warp sits idle when it could be doing other work. (If not, this option might come with future hardware.)
Unfortunately this does not answer my question of whether async compute is already possible to fill the bubbles. Likely we will see with the upcoming MS API...

I like this: "Many computational applications use GEMMs: signal processing, fluid dynamics, and many, many others."

Fluid simulation would be an interesting non-AI application for games. Propagating light through volume data might be another.
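To put some code to that WMMA quote, this is roughly what driving the tensor cores from an ordinary CUDA kernel looks like. A minimal, untested sketch (no shared-memory tiling, dimensions assumed to be multiples of 16, row-major A / column-major B); the point is just that a normal warp issues the wmma calls, so the SM is clearly involved:

#include <cuda_fp16.h>
#include <mma.h>
using namespace nvcuda;

// One warp computes one 16x16 tile of D = A * B + C using the tensor cores.
// A, B are FP16; the accumulator is FP32.
__global__ void wmmaTileGemm(const half* A, const half* B, const float* C, float* D,
                             int M, int N, int K)
{
    int warpM = (blockIdx.x * blockDim.x + threadIdx.x) / warpSize; // tile row
    int warpN =  blockIdx.y * blockDim.y + threadIdx.y;             // tile column
    if (warpM * 16 >= M || warpN * 16 >= N) return;

    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> aFrag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> bFrag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> acc;
    wmma::fill_fragment(acc, 0.0f);

    // The warp walks the K dimension 16 columns at a time; each wmma::mma_sync
    // is the part the tensor cores execute, everything else is ordinary SM work.
    for (int k = 0; k < K; k += 16) {
        wmma::load_matrix_sync(aFrag, A + warpM * 16 * K + k, K);
        wmma::load_matrix_sync(bFrag, B + warpN * 16 * K + k, K);
        wmma::mma_sync(acc, aFrag, bFrag, acc);
    }

    // Add the bias tile C and store the result tile of D.
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> cFrag;
    wmma::load_matrix_sync(cFrag, C + warpM * 16 * N + warpN * 16, N, wmma::mem_row_major);
    for (int i = 0; i < acc.num_elements; ++i) acc.x[i] += cFrag.x[i];
    wmma::store_matrix_sync(D + warpM * 16 * N + warpN * 16, acc, N, wmma::mem_row_major);
}

Everything outside the mma_sync calls (the loop, the index maths, the loads) is regular SM work, which is why whether those warps can overlap with other compute is the interesting question.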
 
That part is hidden to me. Tensors are just really high-performing FP16 multiply-adders. I think there are some other little things in there, like accumulators, but generally things like rasterizers or texture mapping should not be touching them. So tensors being put into a compute engine seems not as direct as it could be. I could be wrong, but I'm not a hardware guy.

I suppose it's a little curious that nV went to such an extreme on die size (and then R&D for no less than 3 separate ASICs), shoving in RT/tensor as opposed to more of the traditional GPU bits, unless there were mitigating factors to scaling up the GPU that way (power density, bandwidth).

Clearly they want to keep the pro-grade features/performance for the $1500USD+ market, so it's interesting where 20xx fits strategically in conjunction with the seemingly out-of-nowhere API support from MS as well, and what business case there would have to be to push game developers into the trenches here. Seems a bit much just to be the next step for GameWorks (if you know what I mean).

/AlFoilToiletRoll
 
I have no idea whether they take up 3% or 30%, but it's an amount and we can't consider their efficiency until we know that amount.

Do we also need to consider the impact on power / thermal efficiency?
DLSS has been shown to be "not utterly crappy" only when trained on pre-recorded content (Final Fantasy bench, Infiltrator tech demo, Porsche tech demo & now the Futuremark bench)... which kind of defeats the whole point of it, unless only "playing" benchmarks on your brand new GPU is the new hip thing to do.

The FF fight sequence is non-deterministic and changes each time; the assets do not change, but it's not a set sequence of images.

Another point is how difficult it is (or, I assume, is not) to add DLSS to an existing render pipeline. Could be a huge DLSS win there.
 
I suppose it's a little curious that nV went to such an extreme on die size (and then R&D for no less than 3 separate ASICs), shoving in RT/tensor as opposed to more of the traditional GPU bits, unless there were mitigating factors to scaling up the GPU that way (power density, bandwidth).

Clearly they want to keep the pro-grade features/performance for the $1500USD+ market, so it's interesting where 20xx fits strategically in conjunction with the seemingly out-of-nowhere API support from MS as well, and what business case there would have to be to push game developers into the trenches here. Seems a bit much just to be the next step for GameWorks (if you know what I mean).

/AlFoilToiletRoll
For enterprise and cloud use, we are forced to use the Volta/V100 series. Nvidia strictly forbids the use of prosumer cards, to support the resale of the functionality. We recently dialed in a bill of 62K CAD for 4x V100s. Mental, because functionality-wise it's the same. I really don't want to get into that argument here, but honestly, it's the same, just way more expensive.

So perhaps this is why? Or they are hoping more people will get into the industry and then, when enough prototyping is done on a small scale, get hit in the face with the V100 bill.
 
iroboto: "Yup, I agree that there is a time and place for everything. Unfortunately it's a little early for me to put out some real points; there just aren't enough data points on DLSS. We'll know more once the DLSS patch for BFV is released."

Makes sense.

Looking up here https://devblogs.nvidia.com/programming-tensor-cores-cuda-9/
"During program execution, multiple Tensor Cores are used concurrently by a full warp of execution. The threads within a warp provide a larger 16x16x16 matrix operation to be processed by the Tensor Cores. CUDA exposes these operations as warp-level matrix operations in the CUDA C++ WMMA API. These C++ interfaces provide specialized matrix load, matrix multiply and accumulate, and matrix store operations to efficiently utilize Tensor Cores in CUDA C++ programs."

This is basically my source for the assumption that compute is necessary to control the tensor cores. But I further assume tensor operations usually take a long time, so the compute warp sits idle when it could be doing other work. (If not, this option might come with future hardware.)
Unfortunately this does not answer my question of whether async compute is already possible to fill the bubbles. Likely we will see with the upcoming MS API...

I like this: "Many computational applications use GEMMs: signal processing, fluid dynamics, and many, many others."

Fluid simulation would be an interesting non-AI application for games. Propagating light through volume data might be another.
Yeah, those are Volta ;) I don't know if the Turing ones are different. But thank you for the reading material; I will go through it and adjust my understanding of their tech.

The Tesla V100 GPU contains 640 Tensor Cores: 8 per SM. Tensor Cores and their associated data paths are custom-crafted to dramatically increase floating-point compute throughput at only modest area and power costs. Clock gating is used extensively to maximize power savings.
Indeed, you could be entirely correct here. I don't see them moving away from this just because of Turing. I could be wrong, though.
 