Something does not compute here - why make the TCs' FP64 2.5x as fast and yet keep lugging the old FP64 units around?
Please correct me if I am wrong here.
Traditional FP64 units are simply a byproduct of the FP32 CUDA cores: you combine two FP32 cores (with some additional data paths) and that's it. That's why FP64 is always half rate in these GPUs - it's the lowest-effort way to achieve high DP throughput. In other words: there are no actual dedicated FP64 ALUs on the die. You can see this clearly when running FP64 code on HPC GPUs: it takes over the entire shader core utilization.
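If you want to see that ratio yourself, here is a minimal microbenchmark sketch (kernel names, grid size, and iteration count are arbitrary choices of mine): it times a chain of dependent FMAs in both precisions, and on an HPC part (GV100/GA100) I'd expect the ratio to land near 2x, while on consumer parts it should be far larger since FP64 there is 1/32 rate.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define ITERS   100000
#define BLOCKS  1024
#define THREADS 256

// Long chain of dependent FMAs per thread, so with enough resident
// warps the timing reflects ALU throughput rather than memory traffic.
__global__ void fma_fp32(float *out) {
    float c = threadIdx.x * 1e-6f;
    for (int i = 0; i < ITERS; ++i) c = fmaf(1.000001f, c, 0.999999f);
    out[blockIdx.x * blockDim.x + threadIdx.x] = c;  // keep the loop live
}

__global__ void fma_fp64(double *out) {
    double c = threadIdx.x * 1e-6;
    for (int i = 0; i < ITERS; ++i) c = fma(1.000001, c, 0.999999);
    out[blockIdx.x * blockDim.x + threadIdx.x] = c;
}

int main() {
    float *d32; double *d64;
    cudaMalloc(&d32, BLOCKS * THREADS * sizeof(float));
    cudaMalloc(&d64, BLOCKS * THREADS * sizeof(double));

    cudaEvent_t t0, t1;
    cudaEventCreate(&t0); cudaEventCreate(&t1);
    float ms32, ms64;

    cudaEventRecord(t0);
    fma_fp32<<<BLOCKS, THREADS>>>(d32);
    cudaEventRecord(t1);
    cudaEventSynchronize(t1);
    cudaEventElapsedTime(&ms32, t0, t1);

    cudaEventRecord(t0);
    fma_fp64<<<BLOCKS, THREADS>>>(d64);
    cudaEventRecord(t1);
    cudaEventSynchronize(t1);
    cudaEventElapsedTime(&ms64, t0, t1);

    // Roughly 2x on HPC GPUs; ~32x on consumer parts.
    printf("FP32: %.1f ms  FP64: %.1f ms  ratio: %.1fx\n",
           ms32, ms64, ms64 / ms32);
    cudaFree(d32); cudaFree(d64);
    return 0;
}
```

Running the FP64 kernel under a profiler is also where you see it eating the whole shader core: the FP32 pipes are busy doing the double-precision work.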
This then comes back to the design of the chip: NVIDIA deemed they still need FP32 CUDA cores and texture units in an AI chip, so doing traditional FP64 on top of them is easy enough after that. It doesn't cost that much, and it will likely remain there for compatibility purposes and other general calculations.
For the rest of the stuff, Tensor Cores will now provide the majority of throughput. They will replace FP32 for training with TF32, automatically and without a code change, and they will deliver high matrix FP64 throughput, but that requires developers to adapt their code to them.
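In practice, "adapting your code" mostly means going through a library that knows about the new instructions. A minimal sketch, assuming cuBLAS 11+ on an A100-class GPU (the problem size is a placeholder and initialization is omitted):

```cuda
#include <cublas_v2.h>
#include <cuda_runtime.h>

int main() {
    const int n = 4096;                      // arbitrary square problem size
    double *A, *B, *C;
    cudaMalloc(&A, n * n * sizeof(double));
    cudaMalloc(&B, n * n * sizeof(double));
    cudaMalloc(&C, n * n * sizeof(double));
    // (filling A and B with data omitted)

    cublasHandle_t h;
    cublasCreate(&h);

    const double alpha = 1.0, beta = 0.0;
    // C = alpha * A * B + beta * C. On GA100, cuBLAS can dispatch this
    // FP64 GEMM to the Tensor Cores' DMMA path; hand-written scalar FP64
    // kernels keep running on the classic FP64 units.
    cublasDgemm(h, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                &alpha, A, n, B, n, &beta, C, n);
    cudaDeviceSynchronize();

    cublasDestroy(h);
    cudaFree(A); cudaFree(B); cudaFree(C);
    return 0;
}
```

(Build with nvcc and link against -lcublas.) The point is that the "code change" is usually just routing your math through GEMM-shaped calls; anything that stays scalar doesn't benefit.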
Each Tensor core in Turing was capable of 64 FP16 FMA operations per clock (with FP32 accumulation); Ampere increases that to 256 per clock! So per-core IPC has increased 4 times, and with this comes the revelation that NVIDIA can use this heightened ability to unlock running FP64 ops on the Tensor cores as well - I think each FP64 GEMM op now takes about 32 clocks on each Tensor core.
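Those per-clock figures check out against the published A100 specs (108 SMs, 4 Tensor Cores per SM, ~1.41 GHz boost, 312 TFLOPS FP16 and 19.5 TFLOPS FP64 on the Tensor Cores). A quick back-of-the-envelope check:

```cuda
#include <cstdio>

int main() {
    // Published GA100 (A100) figures: 108 SMs, 4 Tensor Cores per SM,
    // ~1.41 GHz boost clock. 256 FP16 FMAs/clock = 512 FLOPs/clock.
    const double sms = 108, tc_per_sm = 4, clock_ghz = 1.41;
    const double fp16_flops_per_tc_clock = 256 * 2;  // FMA counts as 2 FLOPs

    double fp16_tflops = sms * tc_per_sm * clock_ghz
                         * fp16_flops_per_tc_clock / 1e3;
    printf("FP16 TC peak: %.0f TFLOPS\n", fp16_tflops);  // ~312, matches spec

    // Working backwards from the quoted 19.5 TFLOPS FP64 Tensor Core peak:
    double fp64_flops_per_tc_clock = 19.5e3 / (sms * tc_per_sm * clock_ghz);
    printf("FP64 FLOPs per TC per clock: %.0f\n",
           fp64_flops_per_tc_clock);  // ~32, i.e. 16 FMAs per clock
    return 0;
}
```

Working backwards like this, the FP64 Tensor Core rate comes out to roughly 1/16 of the FP16 rate per core per clock - consistent with the idea that the wider Ampere Tensor Core has enough datapath to spend multiple clocks chewing through double-precision multiplies.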