The point is they still increased it by 41.5% while the physical side only grew by about 33%, and they did that while adding even more functionality, all within the same TDP, on an already massive die, and on essentially the same 16nm node (albeit the latest iteration, which in typical TSMC fashion is called 12nm).
They still do packed, accelerated 2xFP16 math on V100, just like on P100, btw.
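For reference, here's a minimal CUDA sketch of what that packed FP16 path looks like from the programmer's side (the kernel and variable names are just illustrative, not from any real codebase): each half2 holds two FP16 values, and intrinsics like __hfma2 operate on both lanes in a single instruction.

```cpp
// Minimal illustration of packed 2xFP16 math (requires sm_53+, e.g. P100/V100).
// Kernel/array names here are hypothetical, purely for illustration.
#include <cuda_fp16.h>

__global__ void axpy_half2(int n2, __half2 a, const __half2* x, __half2* y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n2) {
        // One fused multiply-add applied to both packed FP16 lanes at once:
        // y[i] = a * x[i] + y[i]
        y[i] = __hfma2(a, x[i], y[i]);
    }
}
```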
You get ~30 TFLOPS of FP16, plus the Tensor Cores (matrix function units). The Tensor Cores are more specialised, aimed primarily at Deep Learning frameworks/apps (in future they could in theory also be used for professional rendering/modelling, though I'm not talking about gaming here).
Those Tensor Cores take FP16 inputs but accumulate in FP32 as well, so with a DL framework/app that supports them I think that works out to around 2x faster.
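If you want to see what that mixed-precision operation actually is, CUDA exposes the Tensor Cores directly through the WMMA API. The sketch below is just a bare-bones single 16x16x16 tile (not a tuned GEMM), assuming row-major A, col-major B, and a zeroed accumulator; it shows the FP16-multiply / FP32-accumulate shape of the hardware op.

```cpp
// Bare-bones Tensor Core sketch via the WMMA API (requires sm_70, i.e. V100).
// Computes one 16x16x16 tile: D = A*B + 0, with FP16 inputs and FP32 accumulate.
// Launch with (at least) one full warp of 32 threads cooperating on the tile.
#include <mma.h>
using namespace nvcuda;

__global__ void wmma_tile(const half* A, const half* B, float* D)
{
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> acc_frag;

    wmma::fill_fragment(acc_frag, 0.0f);              // start accumulator at zero
    wmma::load_matrix_sync(a_frag, A, 16);            // leading dimension 16
    wmma::load_matrix_sync(b_frag, B, 16);
    wmma::mma_sync(acc_frag, a_frag, b_frag, acc_frag); // the Tensor Core op itself
    wmma::store_matrix_sync(D, acc_frag, 16, wmma::mem_row_major);
}
```

In practice you wouldn't write this by hand though; the DL frameworks get it through cuDNN/cuBLAS, which is where that speedup actually shows up.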
Cheers