Why would it be on a per-title basis? When a vendor quotes 33 dB PSNR for S3TC (DXTC, BCn) and 35 dB for ETC2, that's not per title, it's an average over a test corpus. For the approximate rendering case you would likewise have a corpus (in the form of a program) and averages. The program is deterministic and parameterizable, so obviously you have a ground-truth output available. The ground truth a vendor compares against is whatever the vendor claims to approximate. A considerable share of these techniques deal with sampling frequencies in space and time, and that is really easy to match and check.
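Just to make the metric concrete, here's a minimal sketch of the per-frame comparison, assuming ground truth and approximation arrive as same-shaped numpy arrays of pixel values in [0, peak] (function and parameter names are illustrative, not any vendor's actual tooling):

```python
import numpy as np

def psnr(reference: np.ndarray, approximation: np.ndarray, peak: float = 1.0) -> float:
    """PSNR in dB between a ground-truth frame and an approximated frame."""
    mse = np.mean((reference.astype(np.float64) - approximation.astype(np.float64)) ** 2)
    if mse == 0.0:
        return float("inf")  # identical images, no error to measure
    return 10.0 * np.log10(peak ** 2 / mse)
```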
Obviously the corpus would be more than just the Cornell box; it would contain many diverse examples. It'd be very interesting to see how Hades II would fare under DLSS4+DG, for example, because obviously you put the big problem cases in there. Then you'd be able to tell whether the solution stepped over the threshold of perceivability in general, or only in certain situations or for certain elements (see the sketch below).
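And the corpus-level view is just the aggregate of those per-frame scores plus a list of the cases that fall below some quality bar. A sketch, building on the psnr helper above; the 35 dB threshold here is purely illustrative, not an established perceptual limit:

```python
def corpus_report(pairs, threshold_db: float = 35.0):
    """Average PSNR over a corpus of (name, reference, approximation)
    frame pairs, and flag the problem cases below the threshold."""
    scores = {name: psnr(ref, approx) for name, ref, approx in pairs}
    mean_db = sum(scores.values()) / len(scores)
    failures = [name for name, db in scores.items() if db < threshold_db]
    return mean_db, failures
```

The average is what ends up on the marketing slide; the failure list is what tells you in which situations or for which elements the approximation breaks down.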
Many game developers have been very diligent about this in the past: soft-shadow techniques are compared against ground truth, as are motion blur, GTAO of course (like other AO techniques), and so on. The same goes for ray-tracing techniques in the academic field. I think Nvidia can easily afford to spend a billion dollars on proving/showcasing how good their proposal is beyond feel-good vibes, and maybe inform the part of the industry that may not need ground truth but can't tolerate hallucinations (similar to the medical industry and MRI).