The BERT GitHub repository started with an FP32, single-precision model, which is a good starting point for converging networks to a specified accuracy level. Converting the model to mixed precision on V100 Tensor Cores, which compute in FP16 and accumulate in FP32, delivered the first speedup of 2.3x. Tensor Core mixed precision gives developers the best of both worlds: the execution speed of lower precision for significant speedups, with sufficient numerical accuracy to train networks to the same accuracy as higher precisions. More details about Tensor Cores can be found in our Programming Tensor Cores blog.
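As a rough illustration of the FP16-compute/FP32-accumulate recipe, the sketch below enables mixed precision through the Keras API; this assumes TensorFlow 2.4 or later and is not the exact mechanism used in the BERT repository.

```python
# Minimal sketch of mixed precision training in TensorFlow 2.x (Keras API).
import tensorflow as tf

# Run matrix math in FP16 on Tensor Cores while keeping variables in FP32.
tf.keras.mixed_precision.set_global_policy("mixed_float16")

model = tf.keras.Sequential([
    tf.keras.layers.Dense(1024, activation="relu", input_shape=(1024,)),
    # Keep the final softmax output in FP32 for numerical stability.
    tf.keras.layers.Dense(10, activation="softmax", dtype="float32"),
])

# Loss scaling keeps small FP16 gradients from underflowing; gradients are
# unscaled before the FP32 weight update.
optimizer = tf.keras.mixed_precision.LossScaleOptimizer(tf.keras.optimizers.Adam())
model.compile(optimizer=optimizer, loss="sparse_categorical_crossentropy")
```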
The next optimization adds an optimized layer normalization operation, "layer norm" for short, which builds on the existing cuDNN Batch Normalization primitive and netted an additional 9% speedup. Next, doubling the batch size from its initial value of 8 to 16 increased throughput by another 18%.
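For reference, the math that layer norm performs is sketched below: each token's hidden vector is normalized across the hidden dimension, then scaled and shifted by learned parameters. This is an illustrative TensorFlow sketch of the computation, not the fused kernel itself.

```python
import tensorflow as tf

def layer_norm(x, gamma, beta, epsilon=1e-12):
    # x: [batch, seq_len, hidden]; statistics are taken over the hidden axis.
    mean, variance = tf.nn.moments(x, axes=[-1], keepdims=True)
    normalized = (x - mean) * tf.math.rsqrt(variance + epsilon)
    # Learned scale (gamma) and shift (beta) restore representational power.
    return gamma * normalized + beta
```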
And finally, the team used TensorFlow's XLA, a deep learning compiler that optimizes TensorFlow computations. XLA was used to fuse pointwise operations and generate a new optimized kernel to replace multiple slower kernels. Some of the specific operations that saw speedups include the GELU activation function, the scale-and-shift operation in layer norm, the Adam weight update, attention softmax, and attention dropout. A recent blog describes how to get the most out of XLA running on GPUs. This optimization brought an additional 34% performance speedup.
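The sketch below shows the general idea of XLA fusing a chain of pointwise operations, using the tanh-approximated GELU found in BERT as the example; it assumes TensorFlow 2.5 or later and is not the team's actual code.

```python
import numpy as np
import tensorflow as tf

@tf.function(jit_compile=True)  # ask XLA to compile (and fuse ops in) this function
def gelu(x):
    # The multiply/add/pow/tanh pointwise ops below can be fused by XLA into
    # a single kernel instead of several separate kernel launches.
    cdf = 0.5 * (1.0 + tf.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * tf.pow(x, 3))))
    return x * cdf

y = gelu(tf.random.normal([8, 128, 1024]))
```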