txbob
It's based on the use of the TensorCore, a new computation engine in the Volta V100 GPU.
The TensorCore is not a general-purpose arithmetic unit like an FP ALU; it performs a specific 4x4 matrix multiply-accumulate with mixed data types (FP16 inputs, FP16 or FP32 accumulation). If your algorithm (whatever it may be) can take advantage of that, you may see a perf improvement. It has to be coded for, and the operation does not map trivially onto a C or C++ operator (like multiply), so exposure will probably come primarily through libraries; for deep learning, the library in question would be cuDNN.
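To make the operation concrete, here is a plain-C sketch of the math a single TensorCore op performs. This is a reference for the semantics only, not how the unit is actually programmed; the hardware does the whole 4x4 tile as one operation, with FP16 multiplies (shown as float here for readability) and FP32 accumulation.

void tensorcore_reference(const float A[4][4], const float B[4][4],
                          const float C[4][4], float D[4][4])
{
    /* D = A*B + C on a 4x4 tile */
    for (int i = 0; i < 4; i++)
        for (int j = 0; j < 4; j++) {
            float acc = C[i][j];            /* FP32 accumulator */
            for (int k = 0; k < 4; k++)
                acc += A[i][k] * B[k][j];   /* FP16 multiply in hardware */
            D[i][j] = acc;
        }
}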
It's likely that future versions of cuDNN will use the TensorCores on V100 (when V100 becomes available, in the future), and to the extent that those operations then become available to frameworks such as TensorFlow that use the GPU through cuDNN, it should be possible (theoretically) to achieve a speedup for certain operations in TensorFlow.
In the future you should be able to use the TensorCore much the way newer compute modes like INT8 and FP16 are currently exposed via cuDNN: you specify the right settings and format your data correctly, and after that it should "just work" for a particular cuDNN library call.
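As a rough sketch of what "specify the right settings" might look like, modeled on the per-descriptor math-type switch that cuDNN 7 introduces (treat the exact names as provisional, not a promise), it could be a single opt-in call on the operation descriptor:

#include <cudnn.h>

void request_tensor_ops(cudnnConvolutionDescriptor_t convDesc)
{
    /* Opt this convolution in to TensorCore ("tensor op") math. Data must
       already be FP16 where required, and cuDNN is free to fall back to
       ordinary math if the operation doesn't qualify. */
    cudnnSetConvolutionMathType(convDesc, CUDNN_TENSOR_OP_MATH);
}

The point is that the choice would be per operation, not global, so a framework can enable it only where the data types and layouts actually permit it.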
Using it as a standalone operation in pure CUDA C/C++ should theoretically be possible, but it remains to be seen exactly how (or whether) it will be exposed in future versions of CUDA.
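If and when it is exposed, one plausible shape, based on the warp-level wmma API NVIDIA has described for CUDA 9 (names and shapes provisional until release), is a set of warp-cooperative intrinsics where a full warp computes one 16x16x16 tile:

#include <cuda_fp16.h>
#include <mma.h>
using namespace nvcuda;

/* Provisional sketch: one warp cooperatively computes D = A*B (+0)
   on a 16x16x16 tile, FP16 inputs, FP32 accumulation. */
__global__ void tensorcore_tile(const half *A, const half *B, float *C)
{
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> aFrag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> bFrag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> accFrag;

    wmma::fill_fragment(accFrag, 0.0f);      /* start accumulator at zero */
    wmma::load_matrix_sync(aFrag, A, 16);    /* leading dimension 16 */
    wmma::load_matrix_sync(bFrag, B, 16);
    wmma::mma_sync(accFrag, aFrag, bFrag, accFrag);
    wmma::store_matrix_sync(C, accFrag, 16, wmma::mem_row_major);
}

Note the granularity: unlike an FP ALU instruction, the operation belongs to the warp as a whole, which is another reason it doesn't map onto an ordinary C/C++ operator.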