That's the only V100 benchmark published so far (by Nvidia):
I'm not keen on the chart myself as I think they skewed the P100 a bit, but then it is not far from the 8x-over-P100-standard-FP32-cores comment from their blog that I quote below.
There are others, such as the Caffe2 ResNet one, but they may not be ideally optimised (more on that below).
But for TPU2 vs Nvidia, it is best for now IMO to use what each states as its theoretical peak FMA throughput.
It seems TPU2 is 180 TFLOPs of FP16 tensor/matrix DL compute, and V100 is 30 TFLOPs FP16, or 120 TFLOPs with Tensor cores for DL/matrix work.
I mentioned it in other posts, but Nvidia has stated on their site that an SM's 8 Tensor cores have 8x greater throughput than its 64 CUDA cores operating as 'standard FP32' (their wording, I think).
Since FP16 is itself 2x the FP32 rate, that means per SM the Tensor cores are 4x the FP16 theoretical throughput, and 4x the official 30 TFLOPs FP16 comes to 120 TFLOPs Tensor for V100, lining up with the published figures.
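To make that per-SM arithmetic explicit, here is a minimal back-of-the-envelope sketch (my own numbers, assuming the commonly quoted Volta SM layout of 64 FP32 CUDA cores plus 8 Tensor cores, each Tensor core doing a 4x4x4 matrix FMA per clock):

```python
# Back-of-the-envelope per-SM throughput, assuming the commonly quoted
# Volta SM layout: 64 FP32 CUDA cores and 8 Tensor cores per SM, with
# each Tensor core performing a 4x4x4 matrix FMA (64 FMAs) per clock.

FMA_OPS = 2  # one fused multiply-add counts as 2 floating-point ops

fp32_per_sm = 64 * FMA_OPS          # 128 ops/clock (standard FP32)
fp16_per_sm = fp32_per_sm * 2       # 256 ops/clock (packed FP16 runs at 2x FP32)
tensor_per_sm = 8 * 64 * FMA_OPS    # 1024 ops/clock (8 Tensor cores x 64 FMAs)

print(tensor_per_sm / fp32_per_sm)  # 8.0 -> the "8x over standard FP32" claim
print(tensor_per_sm / fp16_per_sm)  # 4.0 -> the 4x-over-FP16 figure above
```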
Here is one of the Caffe2 ResNet charts. The part worth referencing for context in this discussion is the far right, FP16 inferencing, which is pretty close to the 4x figure and aligns with the comment above:
Yeah, I appreciate that since this is real-world performance one also has to adjust for the fact that V100 has around 31% more SMs as well, so I guess between this chart and yours it does come to maybe 4x greater performance over FP16 P100 in DL.
But I think we are digressing, although it is interesting.
Edit:
Here is the only comparison Nvidia has stated against the general CUDA FP32 cores; it is per SM and so it scales with the chip.
Nvidia blog said:
This is a dramatic 8X increase in throughput for deep learning applications per SM compared to Pascal GP100 using standard FP32 operations
So against theoretical FP16 it becomes a 4x increase for DL using Tensor cores, which ties in with the FP16 picture on the right and with your chart once allowances are made.
And that follows through with the official V100 spec:
FP32 15 TFLOPs
FP16 30 TFLOPs
Tensor 120 TFLOPs (FP16 mixed-precision matrices for DL)
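And scaling the per-SM figures above to the full chip reproduces those numbers; a quick sketch assuming the 80-SM SXM2 V100 at roughly its 1530 MHz boost clock (again my own arithmetic, not an Nvidia statement):

```python
# Scale the per-SM figures to the whole chip, assuming the SXM2 V100
# configuration: 80 SMs at ~1530 MHz boost clock (my assumption).

SMS = 80
CLOCK_HZ = 1.53e9

fp32_tflops = 128 * SMS * CLOCK_HZ / 1e12     # ~15.7
fp16_tflops = 256 * SMS * CLOCK_HZ / 1e12     # ~31.3
tensor_tflops = 1024 * SMS * CLOCK_HZ / 1e12  # ~125.3

print(fp32_tflops, fp16_tflops, tensor_tflops)
# Close to (and usually rounded to) the 15 / 30 / 120 TFLOPs quoted above.
```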
Cheers