[Spinoff] Technical evaluation of upscaler performance

Well, then I hope we'll get SSIM graphs vs ground truth rendering soon
Eeehhhhhh. With the number of settings we can tweak, I don't think any 3rd-party tool would work without developer assistance. I tried to make something like this for DF, but in the end artifacts ended up registering as detail.

Any thoughts on how to do it would be appreciated though.
 
It's not up to you to provide this. The vendor selling hallucinatory rendering should provide the objective metrics for the "quality" level they are offering. Maybe we'll even get SSIM per watt as a distinguishing metric between vendors some day. ;)
 
But it still seems to be on a per-title basis? SSIM works well when you have a source truth, like a digital movie, and are testing various upscalers. But what is ground truth in this case? TAA? SSAA? Various AA algorithms can preserve more data than TAA, so it gets a bit tricky to know which to use.

We would need developers to build a demo which they designate as ground truth, with settings locked like in 3DMark, then run the SSIM algorithm against that profile.

Maybe 3DMark can do it, but that would limit it to a single test.

I mean, I suppose you could hook the SSIM test into the rendering path, but I can't help thinking that's going to affect frame rate, lol. I just don't know how this could be done.
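For what it's worth, the scoring step itself is cheap to sketch. Below is a minimal, illustrative example in NumPy: a single-window SSIM approximation (the real metric uses an 11x11 sliding Gaussian window, so treat the numbers as rough) that scores an upscaled frame against a locked ground-truth render.

```python
import numpy as np

def ssim_global(img_a, img_b, data_range=255.0):
    # Single-window SSIM over the whole image: mean, variance, and
    # covariance computed globally instead of in an 11x11 sliding
    # Gaussian window. Illustrative only; a proper implementation
    # would use the windowed form.
    a = np.asarray(img_a, dtype=np.float64)
    b = np.asarray(img_b, dtype=np.float64)
    c1 = (0.01 * data_range) ** 2  # stabilizing constants from the SSIM definition
    c2 = (0.03 * data_range) ** 2
    mu_a, mu_b = a.mean(), b.mean()
    var_a, var_b = a.var(), b.var()
    cov = ((a - mu_a) * (b - mu_b)).mean()
    return ((2 * mu_a * mu_b + c1) * (2 * cov + c2)) / (
        (mu_a ** 2 + mu_b ** 2 + c1) * (var_a + var_b + c2)
    )
```

Identical frames score 1.0, and any distortion pulls the score below 1, so "closest to the designated ground truth" becomes a single comparable number per frame.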
 
Why would it be on a per-title basis? When a vendor throws around 33 dB PSNR for S3TC (DXTC, BCn) and 35 dB for ETC2, that's not per title, it's an average over a test corpus. For the approximate-rendering case you would also have a corpus (in the form of a program) and averages. The program is deterministic and parameterizable, so obviously you have a ground-truth output available. The ground truth a vendor compares to is what the vendor claims to approximate. A considerable number of the techniques deal with sampling frequencies in space and time, and this is really easy to match and check.
Obviously the corpus would be more than just a Cornell box: many diverse examples. It'd be very interesting to see how Hades II would fare under DLSS4+DG, for example. Because, obviously, you put the big problem cases in there. Then you would be able to understand whether the solution stepped over the threshold of perceivability in general, or in which situations or for which elements.

Many game developers in the past have been very diligent: soft-shadow techniques are compared against ground truth, motion blur, GTAO ofc (like other AO techniques), etc. pp. Same for raytracing techniques (in the academic field), and so on and so forth. I think Nvidia can easily afford to spend a billion dollars on "proofing"/showcasing how good their proposal is beyond feel-good vibes, and maybe inform the part of the industry that may not need ground truth but can't tolerate hallucinations (similar to the medical industry and MRI).
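Those corpus-averaged dB figures boil down to very little code. A sketch, assuming frames arrive as plain NumPy arrays:

```python
import numpy as np

def psnr_db(reference, test, peak=255.0):
    # PSNR = 10 * log10(peak^2 / MSE), in dB; higher means closer to
    # the reference. `peak` is the maximum representable pixel value.
    mse = np.mean((np.asarray(reference, dtype=np.float64)
                   - np.asarray(test, dtype=np.float64)) ** 2)
    if mse == 0:
        return float("inf")
    return float(10.0 * np.log10(peak ** 2 / mse))

def corpus_average_psnr(pairs, peak=255.0):
    # A corpus-level figure like "33 dB for S3TC" is just the mean of
    # per-item PSNR over a fixed set of (reference, approximation) pairs.
    return sum(psnr_db(r, t, peak) for r, t in pairs) / len(pairs)
```

The corpus is the fixed set of pairs; swap the encoder or upscaler, rerun, and the averages are directly comparable.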
 
Re ground truth -- yeah, 64x SSAA for spatial antialiasing + maybe 4x to 8x frame accumulation for temporal stability?

I'm not sure how good SSIM is at discerning temporal instability. It was designed to evaluate video compression, and highly compressed videos are actually quite temporally stable, aren't they? Because video codecs are based on motion + corrective deltas.
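One crude way to probe temporal instability separately from per-frame similarity is to measure the energy in frame-to-frame differences where the ground-truth scene is static. This is a hypothetical probe, not an established standard metric:

```python
import numpy as np

def flicker_score(frames):
    # Mean absolute frame-to-frame change across a sequence. Shimmer in
    # an upscaled sequence shows up as energy in the differences between
    # consecutive frames even when the ground-truth scene is static,
    # which per-frame SSIM/PSNR cannot see.
    frames = np.asarray(frames, dtype=np.float64)
    deltas = np.abs(np.diff(frames, axis=0))
    return float(deltas.mean())
```

A perfectly stable sequence scores 0; an alternating shimmer scores the amplitude of the flicker, so the two failure modes (blurry-but-stable vs. sharp-but-shimmering) stop being conflated.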
 
I'm not sure how to do what you're suggesting, unfortunately. My understanding of signal processing isn't quite there. I can't do SNR without some sort of ground truth. That seems easier for compression algorithms. Because there we know the ground truth is the uncompressed source. Defining what should be ground truth here would be interesting, to say the least, especially with temporal variables now added into the rendering pipeline and motion blur being an option. There are a lot of scenarios these algorithms could be really strong at, and others they'd be very weak at.

It does sound like we need a 3rd-party test suite of some sort that specifically checks the value of the hallucination.

I was often under the assumption that ground truth for developers was what could be pre-rendered in engine versus what could be achieved in realtime.

This is as far as I was able to get. I think you can sort of see why we didn’t pursue this any further.
 
I’m not sure how to do what you’re suggesting unfortunately.
Maybe there's a difference of scope between our two PoVs.
Reading between the lines, I presume you're interested in evaluating proposed Nvidia technology in real-world games? To me this isn't really possible. You are mixing two variables into one, and afterwards you're unable to decompose the two, or to capture samples repeatedly and deterministically.
For a game developer the product is the game as a whole. They are only (if at all) laterally interested in "competition" with other developers on how near it comes to some ideal (even shared) ground truth in specific technical aspects/features. The ground truth for a game as a whole is a very fuzzy thing in the minds of the group of people involved. Parameterization is mostly a function of the game, not a way to approach this fuzzy ground truth. In general, ofc, the field is diverse and has outliers. Likewise there's little need or interest in determinism for the game as a whole; at most for some subsystems, for the sake of regression testing and QA in general.
Game-engine vendors are a bit more involved, ofc, and because the character of the product is more technical, decoupled from the specific characteristics of games (replayability, art style, genre, ...), they generally have more defined ground truths (one per technique) that you can produce and compare against. Parameterization is often good; determinism is a bit trickier but achievable. Decomposability is ofc perfectly achievable.
Nvidia's proposed features are very isolated, specific things you can test with low effort. You can even take a 1000 fps, 50x-oversampled rendered or filmed movie file with motion and depth, and feed an undersampled subset of it to these techniques repeatedly and deterministically. You don't need a real-time "rerenderer". The file is your ground truth. Nvidia doesn't propose that it can substitute bullets with kitties in an FPS, just that it can raise the apparent sampling rate. Even if you mix ray sampling frequency into output sampling frequency, it's decomposable, and you have a well-defined ground truth (e.g. the output approximates 4 spp at 4x supersampled 4K while using 0.5 spp and 1080p upscaling).
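The "high-rate file as ground truth" idea can be sketched as: capture frames at a high rate, feed only every Nth frame to the technique under test, and score its generated in-between frames against the held-out true frames. In the sketch below, `interp` is a stand-in for whatever frame-generation technique is being evaluated; the default naive linear blend is only a placeholder.

```python
import numpy as np

def evaluate_frame_interp(truth_frames, step=2, interp=None):
    # Score a frame-generation technique against a high-rate capture:
    # hand it every `step`-th ground-truth frame and compare its
    # generated middle frame against the held-out true middle frame.
    # `interp` stands in for the technique under test; the default
    # linear blend is a placeholder, not a real frame generator.
    if interp is None:
        interp = lambda a, b: 0.5 * (a + b)
    truth = np.asarray(truth_frames, dtype=np.float64)
    errors = []
    for i in range(0, len(truth) - step, step):
        generated = interp(truth[i], truth[i + step])
        held_out = truth[i + step // 2]  # true frame the technique never saw
        errors.append(np.mean((generated - held_out) ** 2))
    return float(np.mean(errors))
```

With linear synthetic motion the placeholder blend is exact (score 0), and anything nonlinear shows up as positive error. The point is that the file, not a live renderer, serves as the reference, so every run is deterministic and repeatable.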

You don't need to test 50 games over and over again to verify an Nvidia feature. You don't need to evaluate and criticize a game through Nvidia features.
If you have a comprehensive test corpus you can extrapolate. Is it possible to double the framerate and halve the sampling rate? Yes / half the time / no. Can game XY render at resolution A with B FPS and display at double that? Yes / half the time / no.
All the cases should be in the corpus: Volumetrics? Layered geometry composition? UI? ...

To summarize: to test these things you have to adopt a systematic approach of decomposition, evaluation, and correct attribution. Not necessarily in the name of "objectivity", but at least for the sake of being able to end circular arguments of subjectivity.
My understanding of signal processing isn't quite there. I can't do SNR without some sort of ground truth. That seems easier for compression algorithms.
DLSS is in fact "just" a lossy compression algorithm. The sole side channel is the neural-net weights, and the predicted residuals have zero correctives. You could use JPEG 2000 to operate exactly the same way (it doesn't know about motion, but that's not a hard disqualifier).
Because there we know the ground truth is the uncompressed source. Defining what should be ground truth here would be interesting, to say the least, especially with temporal variables now added into the rendering pipeline and motion blur being an option. There are a lot of scenarios these algorithms could be really strong at, and others they'd be very weak at.
Do you have a particular scenario you struggle to associate a ground truth with?
I suspect you're more hung up on the problem of producing that ground-truth data than on what that ground truth is?
It does sound like we need a 3rd-party test suite of some sort that specifically checks the value of the hallucination.
Can be 1st party too. :) We could try to do it here. Ancient B3D style.
I was often under the assumption that ground truth for developers was what could be pre-rendered in engine versus what could be achieved in realtime.
That is the ground truth of only a couple of people on the team, namely the performance/config/preset people. There are others for whom it is Blender, or Unreal, or Super Mario. And others for whom it's the sketches. And so on. Not all that many of those people are hung up on a particular ground truth for the whole of the product.
This is as far as I was able to get. I think you can sort of see why we didn’t pursue this any further.
Break down your scope to something you can achieve. Get the right people on board. Make a clear design document of deliverables and limitations.
 
Haha, I mean, yeah, I would be perfectly happy to try to build it as part of a B3D suite. You're right: I was tasked with seeing if it was possible to evaluate the effectiveness of various upscalers against each other, to see which algorithm came closest to native (which I suppose would be the ground truth).

Decomposing sounds like the best way forward; I agree with you on this. We can at least define the scenarios being tested.

And yes, I have been struggling with gathering ground-truth data for what you're discussing. I'm drawing a complete blank on the steps I need to take, from gathering the ground-truth data, to gathering the test data, to evaluating test versus ground truth.
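As a starting point, the three steps above collapse into a small harness skeleton. The scenario names and the idea of passing frames in directly are hypothetical placeholders for whatever capture method the suite ends up adopting; this covers only the evaluation step:

```python
import numpy as np

def mse(a, b):
    # Per-frame-pair score; swap in SSIM, PSNR, or a temporal metric.
    return float(np.mean((np.asarray(a, dtype=np.float64)
                          - np.asarray(b, dtype=np.float64)) ** 2))

def run_suite(scenarios, metric=mse):
    # scenarios maps a scenario name (e.g. "volumetrics", "ui") to a
    # pair (ground_truth_frames, test_frames). How those frames get
    # captured is exactly the open problem; here they are just arrays.
    report = {}
    for name, (truth_frames, test_frames) in scenarios.items():
        scores = [metric(t, o) for t, o in zip(truth_frames, test_frames)]
        report[name] = float(np.mean(scores))
    return report
```

The per-scenario breakdown in the report is what decomposition buys: one number per isolated case instead of one fuzzy number per game.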

We can take this to messaging. If anyone else is interested in developing a B3D test suite, reach out!
 
Put it in a separate thread; we can talk about it as a community, maybe we should. Restrict it if you must (is that possible?). Maybe get a git repository and work on it together.
 
MOD MODE: I was reading the original thread where this discussion came from and thinking it belongs in a new thread. Then I got to @Ethatron 's message and laughed, so I'll just say the credit goes to him :) This is a great conversation and the sort of stuff B3D used to be famous for. Please enjoy your new thread for all the awesome discussion to come!
 