That video is old news.
There is quite a bit of detail about TSR in this feedback thread: details on optimisations, and on changing how they measure it, since the quality depends on the framerate as well as the input resolution.
EDIT: The 5.1 release schedule compared to Chapter 4, and all the changes needed in TSR, meant it was still not stable enough, which is not great considering absolutely all 3D rendered pixels go through it. This is where the upcoming 5.2 release is in fact a big deal for TSR: it is the first release that is production proven, stable, and comes with more debugging tools.
What we can do specifically on PS5 and XSX is that, conveniently, most of their AMD GPU architecture is public ( https://www.amd.com/system/files/TechDocs/rdna2-shader-instruction-set-architecture.pdf ), so we can go really deep into hardware details and imagine and experiment with crazy ideas. And this is what TSR did in 5.1: it heavily exploits the performance characteristics of RDNA's 16-bit instructions, which can have huge performance benefits. In UE5/Main for 5.3, a shader permutation was added ( https://github.com/EpicGames/UnrealEngine/commit/c83036de30e8ffb03abe9f9040fed899ecc94422 ) to finally tap into these instructions, which are exposed in standard HLSL in Shader Model 6.2 ( 16 Bit Scalar Types · microsoft/DirectXShaderCompiler Wiki · GitHub ), and for instance on an AMD 5700 XT the performance savings in TSR are identical to how much these consoles benefit from this optimisation too.
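As a purely illustrative sketch (not TSR's actual code; the resource and function names here are made up), this is roughly what using the SM6.2 16-bit scalar types looks like in a compute shader compiled with DXC and the -enable-16bit-types flag:

```hlsl
// Minimal sketch, assuming DXC with -enable-16bit-types (Shader Model 6.2+).
// Names are hypothetical; this is not engine code.
Texture2D<float4>   InputColor  : register(t0);
RWTexture2D<float4> OutputColor : register(u0);

[numthreads(8, 8, 1)]
void MainCS(uint2 DispatchThreadId : SV_DispatchThreadID)
{
    // Convert to 16-bit as early as possible so the bulk of the math runs
    // on half-width registers, reducing register pressure.
    float16_t4 Color  = float16_t4(InputColor[DispatchThreadId]);
    float16_t  Weight = float16_t(0.25);

    // On hardware with native FP16 this multiply can run at double rate;
    // on hardware without it, the types behave like 32-bit floats.
    OutputColor[DispatchThreadId] = float4(Color * Weight);
}
```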
What makes TSR in 5.1+ so different from 5.0 is how many more convolutions it does using these exposed hardware capabilities. The RejectShading pass does something like 15 convolutions in 5.1+, each 3x3, on current hardware, compared to 3 in 5.0, which makes TSR substantially smarter thanks to some very neat properties discovered from chaining certain very particular convolutions. And while the number of convolutions massively increased, by a factor of 5, the runtime cost of this part of TSR didn't change, and this gain in smartness of the algorithm allowed cutting a significant amount of other costs that were no longer required in the rest of the TSR algorithm, which is the core reason behind the performance saving from 3.1ms to 1.5ms on these consoles. Sadly these hardware capabilities exposed in standard HLSL do not benefit all GPUs equally, because of how each vendor decided to architect their hardware.
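To make the "3x3 convolution" part concrete, here is a minimal hedged sketch of a single such pass kept entirely in FP16. It is not the RejectShading pass itself, and the kernel weights below are placeholders, not the ones TSR derives:

```hlsl
// Illustrative sketch only: one 3x3 convolution over a colour texture,
// with all arithmetic kept in FP16. Border clamping is omitted for brevity.
Texture2D<float4>   SceneColor : register(t0);
RWTexture2D<float4> Filtered   : register(u0);

[numthreads(8, 8, 1)]
void Convolve3x3CS(uint2 PixelPos : SV_DispatchThreadID)
{
    // Hypothetical Gaussian-like 3x3 kernel; purely a placeholder.
    const float16_t Kernel[9] = {
        float16_t(0.0625), float16_t(0.125), float16_t(0.0625),
        float16_t(0.125),  float16_t(0.25),  float16_t(0.125),
        float16_t(0.0625), float16_t(0.125), float16_t(0.0625)
    };

    float16_t4 Sum = float16_t4(0.0, 0.0, 0.0, 0.0);
    for (int y = -1; y <= 1; y++)
    {
        for (int x = -1; x <= 1; x++)
        {
            float16_t4 Tap = float16_t4(SceneColor[int2(PixelPos) + int2(x, y)]);
            Sum += Tap * Kernel[(y + 1) * 3 + (x + 1)];
        }
    }
    Filtered[PixelPos] = float4(Sum);
}
```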
So they are saying TSR is super optimized for the console GPUs in the 5.2 release?
Spell it out for the dummies in the room
How much this saves in runtime cost depends largely on whether the driver says it supports it, but also on whether the hardware is able to take advantage of this optimisation or not. 16-bit often saves register pressure, which, when register bound, can give up to a 2x performance improvement; but RDNA GPUs for instance also have packed instructions like v_pk_mul_f16, capable of doing two multiplications for the price of one, which is another 2x. So the use of 16-bit instructions on that shader can bring almost a 4x perf improvement.
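To illustrate the packed-math part, a minimal sketch (assuming DXC with -enable-16bit-types and RDNA-class hardware; the function name is hypothetical): a multiply on a two-wide 16-bit vector is the kind of operation the compiler can map to a single packed instruction such as v_pk_mul_f16:

```hlsl
// Two FP16 multiplications expressed as one vector op; on RDNA the compiler
// can emit a single packed instruction (e.g. v_pk_mul_f16) for this,
// i.e. two multiplies for the price of one.
float16_t2 PackedMul(float16_t2 A, float16_t2 B)
{
    return A * B;
}
```

This is why the two effects stack: fewer, smaller registers from 16-bit types, plus two lanes per instruction from the packed ops.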
So is that good for the performance outcome of the game?
They use FP16 and dual FP16, but it is also used on PC with GPUs compatible with it.
Vega, RDNA1 and RDNA2 support Rapid Packed Math, where FP16 operations run twice as fast as FP32 operations.
For NVIDIA, Turing, Ampere and Ada support running FP16 ops on the tensor cores for huge throughput, while also allowing simultaneous FP16 (on Tensor cores) and FP32 (on ALUs) execution.
Also one last thing. It would be great if we could switch the Unreal Engine anti-aliasing pipeline to tensor cores, Xe cores, or AMD matrix cores, for performance reasons. (A console command would allow devs to let players decide to tap into unused GPU components if DLSS, XeSS, etc. is not available.)
TSR could run on tensor cores instead of the main graphics compute units that produce the game's main render frames.
Would be great! But we are limited to the APIs exposed to us: for instance, DirectML requires a round trip to main memory between a programmed shader and the matrix multiplications, which is not in the interest of performance. This is where each IHV has a lot fewer constraints compared to us, because they know exactly what their hardware can do and how, and can exploit many advantages that are not standardized, like avoiding round trips to main memory, to squeeze as much runtime GPU performance as possible and invest even more in quality.
What we can do specifically on PS5 and XSX is that, conveniently, most of their AMD GPU architecture is public ( https://www.amd.com/system/files/TechDocs/rdna2-shader-instruction-set-architecture.pdf ), so we can go really deep into hardware details and imagine and experiment with crazy ideas.
I believe FP16 ops are routed automatically to Tensor cores on NVIDIA hardware, without developer intervention. It seems they don't use Tensor cores.