We have a bigger table that includes comparisons with the Kepler and Maxwell generations of Tesla accelerators, but this table is too big to display. (
You can view it here in a separate window.) The FP16 (with either FP16 or FP32 accumulate), bfloat16 (BF16), and Tensor Float32 (TF32) formats used on the new Tensor Core units show performance both without the sparse matrix support and with the 2X improvement it provides when turned on.
The sparse matrix support also gooses INT4 and INT8 inference processing on the Tensor Cores by a factor of 2X when it is activated. It is not available for FP64 processing on the Tensor Cores, but the Tensor Core implementation of 64-bit matrix math can deliver 2X the throughput on FP64 math compared to the FP64 units on the GA100 and 2.5X that of the GV100, which only had plain vanilla FP64 units.
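That 2X from sparsity is not magic: the Ampere Tensor Cores exploit fine-grained 2:4 structured sparsity, in which two of every four values in a weight matrix are zeroed out so the hardware can skip half of the multiply-accumulate work. The NumPy sketch below only illustrates the pruning pattern; the prune_2_of_4 helper is our own name for it, and real deployments rely on NVIDIA's pruning and retraining tools plus the compressed storage format, not this toy code.

```python
import numpy as np

def prune_2_of_4(weights: np.ndarray) -> np.ndarray:
    """Zero the two smallest-magnitude values in every group of four.

    This mimics the 2:4 structured sparsity pattern the Ampere sparse
    Tensor Cores expect: at most two non-zeros per group of four, which
    lets the hardware skip half the multiply-accumulates for roughly 2X
    math throughput. Illustration only, not the actual hardware path.
    """
    w = weights.reshape(-1, 4).copy()
    # Indices of the two smallest |values| in each group of four.
    drop = np.argsort(np.abs(w), axis=1)[:, :2]
    np.put_along_axis(w, drop, 0.0, axis=1)
    return w.reshape(weights.shape)

# Example: prune a small FP16 weight matrix to the 2:4 pattern.
rng = np.random.default_rng(0)
dense = rng.standard_normal((4, 8)).astype(np.float16)
sparse = prune_2_of_4(dense)
assert (sparse.reshape(-1, 4) != 0).sum(axis=1).max() <= 2
```

Because only the two surviving values in each group need to be stored and multiplied, the math units do half the work per matrix tile, which is where the doubled FP16, BF16, TF32, INT8, and INT4 throughput figures come from.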
...
“It may not be obvious from the documentation, but it is a non-trivial exercise to get another 2X performance out of the SMs with the Tensor Cores,” Alben tells
The Next Platform. “We pushed it as far as we thought we could in Volta without the thing catching on fire, but with a lot of hard work, we figured out how to get another 2X out of the system, and we were able to do that and get even better utilization than we did in the Volta generation. We are definitely proud to see the results.”
By the way, here is one thing to look forward to: That extra 20 percent of memory bandwidth and memory capacity will be unlocked, and so will the remaining 18.5 percent of latent performance embodied in the 20 SMs that are left dark to increase the yield on the chips. This is a bigger block of latent capacity in the Ampere device than there was in the Volta device.
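If you want to check those percentages, the arithmetic works out under the commonly cited GA100 configuration, which we take here to be 128 SMs fabricated with 108 active and six HBM2 stacks with five enabled:

\[
\frac{128 - 108}{108} = \frac{20}{108} \approx 18.5\% \text{ more SM throughput},
\qquad
\frac{6 - 5}{5} = 20\% \text{ more HBM2 bandwidth and capacity.}
\]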