CarstenS: tensor cores lower register file bandwidth per math operation because they operate on whole matrices rather than scalars: an N×N×N matrix multiply performs O(N^3) math operations while reading only O(N^2) operands.
Nvidia actually shows this phenomenon in their animated tensor core cartoons in their keynotes.
So it’s likely they are close to peak RF bandwidth both at 78...
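To spell out the N^3/N^2 point (back-of-envelope arithmetic using Volta's documented 4x4x4 tile shape; the scalar comparison is my own framing):

```latex
% Multiplying two N x N matrices costs 2N^3 FLOPs but reads only
% 3N^2 operands (A, B and the accumulator C):
\frac{\text{FLOPs}}{\text{operands read}} = \frac{2N^3}{3N^2} = \frac{2N}{3}
% A scalar FMA (d = a*b + c) gets 2 FLOPs per 3 operands, so the
% tensor op needs N times less register-file traffic per FLOP.
% For Volta's 4x4x4 HMMA: 128 FLOPs vs 48 operands, i.e. 4x less.
```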
VGTech just does FPS benchmarking, right? I’ve never seen an analysis as in-depth or insightful as DF anywhere else. FPS benchmarks are a dime a dozen, but they don’t inform about the broader issues.
DLSS makes the game faster only when the frame rate is low. If the frame rate is high, the roughly fixed per-frame cost of running the neural network, even with tensor cores, dominates the frame time, so you won’t see a performance improvement.
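To put rough numbers on that (a toy model; the ~1.5 ms network cost is an assumed figure for illustration, not a measurement):

```latex
% DLSS renders at a lower internal resolution (t_low < t_native),
% then pays a roughly fixed network cost t_nn per frame. It wins iff:
t_{\mathrm{DLSS}} = t_{\mathrm{low}} + t_{\mathrm{nn}} < t_{\mathrm{native}}
\;\iff\; t_{\mathrm{native}} - t_{\mathrm{low}} > t_{\mathrm{nn}}
% At 30 fps native (33 ms/frame) the resolution saving is ~10+ ms,
% dwarfing t_nn ~ 1.5 ms. At 250 fps native (4 ms/frame) the saving
% is at most a couple of ms, so the fixed t_nn eats the gain.
```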
The full Volta memory model extends to all remote GPU memory connected by NVSwitch. You can dereference any pointer without doing any work in software to figure out where in the system that pointer points. You can use atomics.
It’s not transparent to the GPUs themselves - obviously...
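For the CUDA-level view, this is the standard peer-access pattern; NVSwitch is what lets it scale past directly-linked pairs. Device IDs and counts below are made up for illustration:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Runs on GPU 0 but atomically increments a counter that physically
// lives in GPU 1's memory. The pointer needs no software translation;
// the hardware routes the access over NVLink/NVSwitch.
__global__ void remote_atomic(int* remote_counter) {
    atomicAdd(remote_counter, 1);
}

int main() {
    int* counter = nullptr;

    cudaSetDevice(1);                    // allocate on GPU 1
    cudaMalloc(&counter, sizeof(int));
    cudaMemset(counter, 0, sizeof(int));

    cudaSetDevice(0);                    // execute on GPU 0
    cudaDeviceEnablePeerAccess(1, 0);    // map GPU 1's memory into GPU 0
    remote_atomic<<<1, 256>>>(counter);  // plain pointer, remote memory
    cudaDeviceSynchronize();

    int result = 0;                      // error checking omitted for brevity
    cudaMemcpy(&result, counter, sizeof(int), cudaMemcpyDeviceToHost);
    printf("remote counter = %d\n", result);  // expect 256
    return 0;
}
```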
The intrinsic is fine. The missing performance is because the CUDA compiler can’t optimally schedule and register allocate the code that uses the intrinsic. Hopefully that will improve with time. Getting 100% utilization of the tensor cores requires the whole chip to work at full tilt, doing...
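For reference, the interface in question: a minimal one-warp, one-tile WMMA sketch (16x16x16 fp16, the shape current CUDA exposes), not a tuned kernel. The performance discussed above depends on what the compiler emits around these calls:

```cuda
#include <mma.h>
#include <cuda_fp16.h>
using namespace nvcuda;

// One warp computes one 16x16 tile of D = A*B on a tensor core.
__global__ void wmma_tile(const half* a, const half* b, float* d) {
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> acc;

    wmma::fill_fragment(acc, 0.0f);
    wmma::load_matrix_sync(a_frag, a, 16);  // leading dimension = 16
    wmma::load_matrix_sync(b_frag, b, 16);
    wmma::mma_sync(acc, a_frag, b_frag, acc);
    wmma::store_matrix_sync(d, acc, 16, wmma::mem_row_major);
}
```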
The CUDA example is using WMMA, the CUDA abstraction for tensor cores. 50 TFLOPS is about right for the WMMA interface with current CUDA. To get full performance, use cuBLAS.
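The cuBLAS route looks roughly like this (signature as of CUDA 9/10; allocation and error checks omitted):

```cuda
#include <cublas_v2.h>
#include <cuda_fp16.h>

// Mixed-precision GEMM routed to tensor cores: fp16 inputs, fp32
// accumulation. A, B, C are device pointers prepared elsewhere.
void tensor_core_gemm(cublasHandle_t handle, int n,
                      const half* A, const half* B, float* C) {
    const float alpha = 1.0f, beta = 0.0f;
    cublasSetMathMode(handle, CUBLAS_TENSOR_OP_MATH);  // opt in to tensor cores
    cublasGemmEx(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                 n, n, n,
                 &alpha,
                 A, CUDA_R_16F, n,
                 B, CUDA_R_16F, n,
                 &beta,
                 C, CUDA_R_32F, n,
                 CUDA_R_32F,                      // accumulate in fp32
                 CUBLAS_GEMM_DEFAULT_TENSOR_OP);  // tensor-core algorithms
}
```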
GP100 has a completely different SM than GP102. The ratio of scheduling hardware to math units, and the amount of on-chip memory, are quite different. So this comparison is not as straightforward as you'd like to make it.