At the International Conference on Machine Learning this week in France, Nvidia rolled out updates to its CUDA development environment for GPU-accelerated applications that will substantially improve the performance of deep learning algorithms on a couple of fronts.
The first big change is that CUDA 7.5 supports mixed precision, meaning it can work with 16-bit FP16 data as well as the usual 32-bit single-precision and 64-bit double-precision floating point types. “The change to 16-bits allows researchers to store twice as large of a network,” Ian Buck, vice president of accelerated computing at Nvidia, explains to The Platform. “And though you lose some precision with the 16-bit numbers, the larger model more than makes up for that and you actually end up getting better results.”
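To make that concrete, here is a minimal sketch of what FP16 storage looks like with the half-precision types that CUDA 7.5 exposes in its cuda_fp16.h header; the kernel and its scaling operation are our own illustration, not Nvidia's code. The data lives in memory at 16 bits per value, halving the footprint, and is converted up to 32-bit floats for the actual arithmetic.

```cuda
#include <cuda_fp16.h>

// Illustrative kernel: scale an array held in 16-bit FP16 storage.
// The __half type halves the memory footprint versus float; each
// value is converted up to FP32 for the arithmetic, then back down.
__global__ void scale_fp16(__half *data, float alpha, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float x = __half2float(data[i]);    // FP16 -> FP32
        data[i] = __float2half(alpha * x);  // FP32 -> FP16
    }
}

int main()
{
    const int n = 1 << 20;
    __half *d_data;
    cudaMalloc((void **)&d_data, n * sizeof(__half)); // 2 MB rather than 4 MB for float
    scale_fp16<<<(n + 255) / 256, 256>>>(d_data, 0.5f, n);
    cudaDeviceSynchronize();
    cudaFree(d_data);
    return 0;
}
```

Half the bytes per parameter is precisely what lets a network twice as large fit in the same GPU memory, which is the trade Buck describes.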
We also assume that the performance improvements Nvidia has announced this week for deep learning are additive, meaning that the combination of larger, lower-precision models, single-GPU training speed improvements, and multi-GPU scale-out will allow larger neural networks to be trained in much less time. By our math, the speedup from a mix of these technologies could be on the order of 4X to 5X, depending on the number of Titan X GPUs deployed in the system and the nature of the deep learning algorithms.
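That estimate falls out of simple multiplication of the per-front gains. The individual factors below are our own assumptions for the sake of illustration, not figures from Nvidia:

```cuda
#include <cstdio>

int main()
{
    // Assumed per-front gains (our estimates, not Nvidia's numbers):
    float fp16_and_cudnn_gain = 2.0f;   // larger FP16 models plus faster single-GPU training
    float multi_gpu_gain      = 2.25f;  // scale-out across several Titan X GPUs
    printf("combined speedup: %.1fX\n", fp16_and_cudnn_gain * multi_gpu_gain); // ~4.5X
    return 0;
}
```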
What is interesting to contemplate is how much of a performance improvement the future “Pascal” GPUs from Nvidia, coupled with the NVLink interconnect, will deliver on deep learning training algorithms. Back in March, Nvidia co-founder and CEO Jen-Hsun Huang went through some “CEO math,” as he called it, to show that the combination of mixed precision (FP16) floating point, 3D high bandwidth memory on the Pascal GPU card (with up to 1 TB/sec of memory bandwidth), and NVLink interconnects between the GPUs would allow Pascal GPUs to offer at least a 10X improvement in performance over the Maxwell GPUs used in the Titan X. (There is no Maxwell GPU available in the Tesla compute coprocessors, and as far as we know there will not be one, although Nvidia has never confirmed this.)
The Pascal GPUs are expected to come to market in 2016 along with CUDA 8 and the NVLink interconnect, which will tightly couple multiple GPUs so they can share data seamlessly. Each NVLink port will run at 20 GB/sec, compared to a top speed of 16 GB/sec for a PCI-Express 3.0 x16 slot, and four NVLink ports will be able to hook two Pascal GPUs together for a maximum of 80 GB/sec between them. There will be enough NVLink ports to provide 40 GB/sec between three GPUs in a single node (two ports per GPU), and with four GPUs customers will be able to cross-couple the devices using a mix of 40 GB/sec and 20 GB/sec links. Depending on how tightly the memories are coupled, such a cluster of GPUs could, in effect, look like a single GPU to any simulation or deep learning training framework, further simplifying the programming model and, we presume, providing a significant performance boost, too.
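Those point-to-point numbers all fall straight out of the 20 GB/sec per-port figure; the little model below is our own illustration of that arithmetic, not anything from Nvidia:

```cuda
#include <cstdio>

// Per-port NVLink bandwidth, as described above.
const float kLinkGBs = 20.0f;

// Bandwidth between a pair of GPUs joined by a given number of NVLink ports.
float pair_bandwidth(int links)
{
    return links * kLinkGBs;
}

int main()
{
    printf("2 GPUs, 4 links between them: %.0f GB/sec\n", pair_bandwidth(4));  // 80 GB/sec
    printf("3 GPUs, 2 links per pair:     %.0f GB/sec\n", pair_bandwidth(2));  // 40 GB/sec
    printf("4 GPUs, mixed pairs:          %.0f and %.0f GB/sec\n",
           pair_bandwidth(2), pair_bandwidth(1));                              // 40 and 20 GB/sec
    return 0;
}
```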