Tensors! *spawn*

So... yeah... you read that right: the same input data is present in *TWO* separate locations in the register file, i.e. it is written TWICE to the register file (and shared memory) by the load/store instructions, and read TWICE by the HMMA instructions. I have verified this to be true for both the WMMA API *and* the cuBLAS kernels with cuda-gdb.

I've got a number of theories as to why (e.g. the execution units inside a "sub-core" might be split into 4 x 4-wide ALUs, each with its own slice of the register file, to improve locality / reduce wiring distance, and they wanted to keep the data inside each slice rather than swizzle it across the entire SM on every instruction... just one of several possibilities), but either way, it makes everything a bit complicated...
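
For anyone who wants to poke at this themselves, a minimal WMMA kernel along these lines is enough to break on in cuda-gdb and dump the warp's registers (standard 16x16x16 half-in / fp32-accumulate configuration; the kernel and variable names are mine, and this is only the public API path, not whatever cuBLAS does internally):

#include <mma.h>
#include <cuda_fp16.h>
using namespace nvcuda;

// One warp computes one 16x16 output tile: C = A * B.
__global__ void wmma_16x16x16(const half *a, const half *b, float *c)
{
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> c_frag;

    wmma::fill_fragment(c_frag, 0.0f);

    // The loads scatter each input tile across the warp's registers;
    // this is the point to break at and inspect the fragment layout.
    wmma::load_matrix_sync(a_frag, a, 16);
    wmma::load_matrix_sync(b_frag, b, 16);

    // Compiles down to the HMMA instructions.
    wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);

    wmma::store_matrix_sync(c, c_frag, 16, wmma::mem_row_major);
}

Compile for sm_70 and the mma_sync is where the HMMA instructions show up in the SASS.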
I am sure I read some documentation/paper on this, and if I get the time I will try to find it. I'm pretty sure I mentioned it on here some time ago, but it is going to be a pain to track down.
 
Do you mean that each sub-core is basically divided into 4 groups of 4 FMAs? I would be very interested if there was anything hinting at that!

If you mean the Hot Chips presentation on Volta, that split each SM into 4 sub-cores, but it doesn't split the sub-cores any further, IIRC.
 
Hot Chips comes to mind, but there was also another document/paper on the subject. I've started trying to trawl through what I was given and some of the papers/presentations I read, but sadly it's like looking for a needle in a haystack.
It was more about the operation/instruction behaviour you describe, with data written/read twice; I wouldn't like to say how much detail it went into on the implementation until I can find it - it is bugging me :)
 
I'm going away for a week and don't have the time to finish the analysis and write a proper blog post about this now, and who knows if anyone will care post-GTC, where NVIDIA might or might not announce a new architecture...
NVIDIA didn't announce any new architectures at GTC 2018, and I'm still interested in what analysis and speculation you can come up with.
 
It does not seem to be using the Tensor cores if one compares the results between the V100 and the Pascal GPUs; that is understandable, as not everything can be reduced/optimised to fp16, but it is worth noting that Google's TPU2 is also mixed precision only, so that is where the direction and momentum are going.
Looking at other TensorFlow results, the gap is much larger.
Table of results at the bottom for some general testing without optimisation: https://www.pugetsystems.com/labs/h...s-and-Testing-of-FP16-for-Deep-Learning-1141/

But yeah, there are caveats to getting the most out of the Tensor cores and how they are used, and it is nice to see AMD improving from TF 1.1; the crux is how quickly they can get up to TF 1.6 support.
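
For context on what "using the Tensor cores" actually takes underneath a framework, here is a rough sketch at the cuBLAS level (CUDA 9.x era API; the calls are standard cuBLAS, but whether a given TensorFlow build routes a given op this way is a separate question):

#include <cublas_v2.h>
#include <cuda_fp16.h>

// fp16 inputs, fp32 accumulation, math mode opted in. Handle creation and
// error checking omitted; dA/dB are device __half buffers, dC is float.
// On early CUDA 9 the tensor-op path also generally wanted m/n/k (and the
// leading dimensions) to be multiples of 8.
void gemm_with_tensor_ops(cublasHandle_t handle,
                          int m, int n, int k,
                          const __half *dA, const __half *dB, float *dC)
{
    const float alpha = 1.0f, beta = 0.0f;

    // Opt in; fp16 data alone does not force the Tensor Core path.
    cublasSetMathMode(handle, CUBLAS_TENSOR_OP_MATH);

    cublasGemmEx(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                 m, n, k, &alpha,
                 dA, CUDA_R_16F, m,
                 dB, CUDA_R_16F, k,
                 &beta,
                 dC, CUDA_R_32F, m,
                 CUDA_R_32F,                    // accumulate in fp32
                 CUBLAS_GEMM_DEFAULT_TENSOR_OP);
}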
 

I think you're referring to something a research team at Citadel did, and they tracked down the tensor core threads/matrices/registers mapping (for their part, NV refers to the fragments' identity and locality as undefined). They did a presentation of this at one of the recent GTCs, though not Hot Chips IIRC.

I tried extrapolating from that paper and NVIDIA's various dev documentation on the topic, though I'm not confident in the resulting accuracy.
 
Thanks for the link, Nate.
I recognise the names and have seen some of their work before; maybe it was one of their earlier pre-publication papers without the geometry aspects *shrug*.
 
Accelerating Reduction and Scan Using Tensor Core Units
November 23, 2019
Although TCUs are prevalent and promise increase in performance and/or energy efficiency, they suffer from over specialization as only matrix multiplication on small matrices is supported. In this paper we express both reduction and scan in terms of matrix multiplication operations and map them onto TCUs.

To our knowledge, this paper is the first to try to broaden the class of algorithms expressible as TCU operations and is the first to show benefits of this mapping in terms of: program simplicity, efficiency, and performance. We implemented the reduction and scan algorithms using NVIDIA’s V100 TCUs and achieved 89%−98% of peak memory copy bandwidth.

TCUs are designed to accelerate Multilayer Perceptrons (MLP), Convolutional Neural Networks (CNN), and Recurrent Neural Networks (RNN), or Deep Neural Networks (DNN) in general. TCUs come under the guise of different marketing terms, be it NVIDIA’s Tensor Cores [55], Google’s Tensor Processing Unit [19], Intel’s DLBoost [69], Apple A11’s Neural Engine [3], Tesla’s HW3, or ARM’s ML Processor [4]. They vary in the underlying hardware implementation [15, 27, 63, 71], and are prevalent [18, 55, 58] in both cloud and edge devices.

The objective of the paper is to expand the class of algorithms that can execute on TCUs— enabling the TCUs to be used within a wider range of non-GEMM algorithms. We choose reduction and scan, since a large body of work [7, 9, 36] has shown that they are key primitives for data parallel implementations of radix sort, quicksort, lexical analysis, stream compaction, and polynomial evaluation.
...
We implemented the proposed algorithms onto V100 TCUs, achieved up to 100× speedup for reduction and up to 3× for scan, and showed performance that rivals state of the art implementation in the worst cases. We observed up to 22% less power consumption for reduction and 16% for scan using NVPROF. As a result of the algorithms, we were able to make use of the otherwise idle TCUs— enabling better GPU utilization for kernels that exercise the general purpose ALUs.
https://arxiv.org/pdf/1811.09736.pdf
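
The core trick, as I read it, is neat: multiply the data tile by a matrix of ones and the HMMA unit behaves like a bank of adders; segmenting and chaining those products is what the reduction and scan are built from. A minimal sketch of the idea (my own reading, standard 16x16x16 WMMA shapes, not their code):

#include <mma.h>
#include <cuda_fp16.h>
using namespace nvcuda;

// ones(16x16) * data(16x16): every row of the result holds the 16 column
// sums of the data tile, i.e. 16 independent 16-way reductions per mma_sync.
__global__ void tcu_partial_reduce(const half *data, float *col_sums)
{
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> ones;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::row_major> vals;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> acc;

    wmma::fill_fragment(ones, __float2half(1.0f));
    wmma::fill_fragment(acc, 0.0f);
    wmma::load_matrix_sync(vals, data, 16);

    wmma::mma_sync(acc, ones, vals, acc);   // acc(i,j) = sum over k of data(k,j)

    wmma::store_matrix_sync(col_sums, acc, 16, wmma::mem_row_major);
}

One more multiply by a ones matrix (or a plain 16-element add) collapses the column sums into the full 256-element reduction.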
 