To understand the compute requirements, look at the weakest NVIDIA chip that supports DLSS. To reiterate: my goal is to understand the compute requirements for an upscaler of DLSS quality.
I found this https://www.techpowerup.com/gpu-specs/geforce-rtx-2050-max-q.c4012
"Each Tensor Core can perform up to 64 floating point fused multiply-add (FMA)
operations per clock using FP16 inputs. Eight Tensor Cores in an SM perform a total of 512 FP16
multiply and accumulate operations per clock, or 1024 total FP operations per clock. The new
INT8 precision mode works at double this rate, or 2048 integer operations per clock"
From the NVIDIA Turing whitepaper.
The 2050 Max-Q uses the Ampere architecture, which doubled the throughput of each tensor core but halved their count per SM (from 8 to 4). The per-SM numbers are therefore unchanged compared with Turing (unless, of course, sparsity mode is used).
16 (SMs) × 2048 (INT8 ops per SM per clock) × 1.7 GHz (boost clock) ≈ 55.7 × 10^12 ops/s, or roughly 56 TOPS.
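The arithmetic above can be reproduced as a small sketch; the SM count, per-SM op rate, and boost clock are the figures already quoted, and this is a theoretical dense peak, not a real-world number:

```python
# Theoretical peak INT8 throughput of the RTX 2050 Max-Q (dense, no sparsity).
SM_COUNT = 16            # streaming multiprocessors
INT8_OPS_PER_SM = 2048   # INT8 ops per SM per clock (Turing-equivalent rate)
BOOST_CLOCK_HZ = 1.7e9   # 1700 MHz boost clock

peak_ops = SM_COUNT * INT8_OPS_PER_SM * BOOST_CLOCK_HZ
print(f"{peak_ops / 1e12:.1f} TOPS")  # prints "55.7 TOPS"
```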
This is the theoretical maximum INT8 throughput of the weakest NVIDIA chip that supports DLSS. In practice, sustained throughput never reaches this peak.
Again, these numbers are only meant to give very approximate requirements. As already written here, tensor cores are not limited by their raw compute performance; they are limited by memory and access to it (otherwise we would see a strong decrease in execution time on RTX 4000-series cards, which is not happening).
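A rough roofline-style sketch shows why that happens. The layer sizes below are made up purely for illustration; the peak-TOPS figure is the estimate from earlier in the post, and the ~112 GB/s memory bandwidth is an assumption based on the 2050 Max-Q's 64-bit GDDR6 bus from the linked spec page:

```python
# Roofline-style sketch: compute time vs memory time for a hypothetical layer.
PEAK_OPS = 55.7e12        # INT8 ops/s, theoretical peak estimated above
BANDWIDTH = 112e9         # bytes/s, approx. 2050 Max-Q memory bandwidth (assumed)

layer_ops = 1e9           # 1 GOP of INT8 work (hypothetical layer)
layer_bytes = 50e6        # 50 MB of tensor traffic (hypothetical layer)

t_compute = layer_ops / PEAK_OPS    # time if purely compute-bound
t_memory = layer_bytes / BANDWIDTH  # time if purely memory-bound

# Moving the data takes ~25x longer than doing the math, so the
# tensor cores sit idle waiting on memory: the layer is memory-bound.
print(f"compute: {t_compute * 1e6:.0f} us, memory: {t_memory * 1e6:.0f} us")
```

With numbers like these, doubling tensor-core throughput barely moves the total time, which matches the observation about the RTX 4000 series above.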
"Does it have lower precision accumulation than 'native' INT8 hardware?"
Check the Digital Foundry video comparing DP4a XeSS against hardware XeSS via XMX. The problem isn't accuracy, it's speed.