Although TCUs are prevalent and promise an increase in performance and/or energy efficiency, they suffer from over-specialization, since only matrix multiplication on small matrices is supported. In this paper we express both reduction and scan in terms of matrix multiplication operations and map them onto TCUs.
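At a high level, one way to see this mapping (a minimal sketch in our own notation; the tiled formulation on TCU-sized fragments is more involved) is that reduction is a product with a ones vector and scan is a product with a triangular ones matrix:

$$\mathrm{reduce}(v) = \mathbf{1}^{T} v, \qquad \mathrm{scan}(v) = L\,v, \quad \text{where } L_{ij} = \begin{cases} 1 & i \ge j \\ 0 & i < j, \end{cases}$$

so that $(Lv)_i = \sum_{j \le i} v_j$ is the inclusive prefix sum.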
To our knowledge, this paper is the first to broaden the class of algorithms expressible as TCU operations and the first to demonstrate the benefits of this mapping in terms of program simplicity, efficiency, and performance. We implemented the reduction and scan algorithms using NVIDIA's V100 TCUs and achieved 89%–98% of peak memory-copy bandwidth.
TCUs are designed to accelerate Multilayer Perceptrons (MLPs), Convolutional Neural Networks (CNNs), and Recurrent Neural Networks (RNNs), or Deep Neural Networks (DNNs) in general. TCUs come under the guise of different marketing terms, be it NVIDIA's Tensor Cores [55], Google's Tensor Processing Unit [19], Intel's DLBoost [69], Apple A11's Neural Engine [3], Tesla's HW3, or ARM's ML Processor [4]. They vary in the underlying hardware implementation [15, 27, 63, 71] and are prevalent [18, 55, 58] in both cloud and edge devices.
The objective of this paper is to expand the class of algorithms that can execute on TCUs, enabling them to be used within a wider range of non-GEMM algorithms. We choose reduction and scan, since a large body of work [7, 9, 36] has shown that they are key primitives for data-parallel implementations of radix sort, quicksort, lexical analysis, stream compaction, and polynomial evaluation.
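As a concrete but hypothetical sketch of how such a mapping can be phrased against NVIDIA's WMMA API, the fragment below reduces sixteen contiguous 16-element segments by multiplying an all-ones matrix with a 16×16 data tile on one warp's tensor core. The kernel name, tile handling, and launch configuration are illustrative assumptions, not the paper's actual implementation.

```cuda
#include <mma.h>
#include <cuda_fp16.h>
using namespace nvcuda;

// Illustrative sketch (not the paper's kernel): one warp reduces 16 segments
// of 16 half-precision values each (a 256-element tile) with a single MMA.
// `out` is assumed to be a 16x16 (256-element) float buffer.
__global__ void tcu_segmented_reduce(const half *data, float *out) {
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> ones;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> tile;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> acc;

    wmma::fill_fragment(ones, __float2half(1.0f)); // all-ones matrix
    wmma::fill_fragment(acc, 0.0f);
    wmma::load_matrix_sync(tile, data, 16);        // column j = j-th segment
    wmma::mma_sync(acc, ones, tile, acc);          // each row of acc = column sums
    wmma::store_matrix_sync(out, acc, 16, wmma::mem_row_major);
    // out[0..15] now hold the 16 segment sums (rows 1..15 are duplicates).
}

// Launch with a single warp on sm_70 or newer, e.g.:
//   tcu_segmented_reduce<<<1, 32>>>(d_data, d_out);
```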
...
We implemented the proposed algorithms on V100 TCUs, achieved up to a 100× speedup for reduction and up to 3× for scan, and showed performance that rivals state-of-the-art implementations in the worst cases. Using NVPROF, we observed up to 22% less power consumption for reduction and 16% for scan. As a result of these algorithms, we were able to make use of the otherwise idle TCUs, enabling better GPU utilization for kernels that exercise the general-purpose ALUs.