Nvidia Volta Speculation Thread

Discussion in 'Architecture and Products' started by DSC, Mar 19, 2013.

  1. pharma

    Veteran Regular

    Joined:
    Mar 29, 2004
    Messages:
    2,933
    Likes Received:
    1,629
  2. CSI PC

    Veteran Newcomer

    Joined:
    Sep 2, 2015
    Messages:
    2,050
    Likes Received:
    844
    Guys, it was clearly talking about both mixed-precision SgemmEx and single-precision FP32 SGEMM together in the same context. The quote is also quite clear on this, and it is obvious that SGEMM is FP32, in this case matrix-matrix multiply, which fits the context in which I used it.
    The key point is the relative gain of up to 1.8x (I thought it was roughly 2x from what I heard) with SGEMM on the V100; that relative gain between the CUDA versions is not coming out of thin air.

    Also nice to see that with the newer CUDA 9/cuDNN 7 release for Volta, FP16 training is now fully supported.

    Edit:
    Going back and re-reading it, it definitely supports my reading, so here it is again in bold showing that it applies to both.
    Notice they are not separated (in fact it is 'and' when talking about SGEMM) after 'accelerated by Tesla V100 Tensor Cores', and both are clearly part of the same sentence about the Tensor Cores (which makes sense if one understands SGEMM).
    Like I said, we have not heard or read everything there is about these function units/cores.
    Cheers
     
    #242 CSI PC, May 12, 2017
    Last edited: May 12, 2017
    Razor1 and pharma like this.
  3. Kaotik

    Kaotik Drunk Member
    Legend

    Joined:
    Apr 16, 2003
    Messages:
    8,186
    Likes Received:
    1,841
    Location:
    Finland
    That's the thing: it doesn't specify that the FP32 results are accelerated by Tensor Cores. Just read it again if you don't see it.
    It states that the new cuBLAS is "Built for Volta and Accelerated by Tesla V100 Tensor Cores". That doesn't mean every function it offers is accelerated by Tesla V100 Tensor Cores; for that statement to hold true it's enough that even one function is. In this case the mixed-precision results are accelerated by Tensor Cores, the FP32 results are not.
    And if that's not enough, just look at the picture: one result says "V100", the other says "V100 Tensor Cores". Guess which is which?
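    To make the distinction concrete, here is a rough sketch of the two cuBLAS calls being argued about (illustrative only: matrix sizes, allocations and handle setup are omitted, and the tensor-op opt-in shown is the CUDA 9 cublasSetMathMode mechanism):
    Code:
    #include <cublas_v2.h>
    #include <cuda_fp16.h>

    // Sketch only: n x n matrices, column-major, error checking omitted.
    void sgemm_vs_mixed(cublasHandle_t handle, int n,
                        const float *A32, const float *B32, float *C32,
                        const __half *A16, const __half *B16, float *Cmix)
    {
        const float alpha = 1.0f, beta = 0.0f;

        // Pure FP32 SGEMM: runs on the ordinary FP32 ALUs, no tensor cores involved.
        cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                    &alpha, A32, n, B32, n, &beta, C32, n);

        // Mixed precision: FP16 storage for A/B, FP32 compute and output.
        // With tensor-op math enabled (CUDA 9 on a V100), this is the call
        // cuBLAS can route to the tensor cores.
        cublasSetMathMode(handle, CUBLAS_TENSOR_OP_MATH);
        cublasSgemmEx(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                      &alpha, A16, CUDA_R_16F, n, B16, CUDA_R_16F, n,
                      &beta, Cmix, CUDA_R_32F, n);
    }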
     
  4. Razor1

    Veteran

    Joined:
    Jul 24, 2004
    Messages:
    4,232
    Likes Received:
    749
    Location:
    NY, NY
    It's not the same as packed math. What I think they are doing is leveraging the new scheduler and the caching to keep the tensor cores better utilised, which gives the increased performance, pretty much eliminating the mixed-precision performance problems Pascal had.

    What I think AMD did with Vega is make their ALUs more robust; that is why they seem to get a full 2x, where nV is getting under 2x.

    Which way is better? I don't think it matters in the end, really; what matters is the end performance of the entire chip.
     
    pharma likes this.
  5. CSI PC

    Veteran Newcomer

    Joined:
    Sep 2, 2015
    Messages:
    2,050
    Likes Received:
    844
    I never said every function, or that it must be FP16 GEMM all the way through for the Tensor cores, or that their use must be exclusive.
     
    #245 CSI PC, May 12, 2017
    Last edited: May 12, 2017
  6. Kaotik

    Kaotik Drunk Member
    Legend

    Joined:
    Apr 16, 2003
    Messages:
    8,186
    Likes Received:
    1,841
    Location:
    Finland
    Oh ffs - the only point I've been trying to make the whole time is that, unlike you claim, the Tensor Cores have absolutely no function whatsoever if you're using pure FP32 precision like that cuBLAS FP32 SGEMM. It's always, no exceptions, an FMA with FP16 x FP16 + FP16/FP32.
     
  7. pharma

    Veteran Regular

    Joined:
    Mar 29, 2004
    Messages:
    2,933
    Likes Received:
    1,629
    I think we will need Nvidia to clarify when more info is available.
     
    Razor1 likes this.
  8. CSI PC

    Veteran Newcomer

    Joined:
    Sep 2, 2015
    Messages:
    2,050
    Likes Received:
    844
    No, you have taken my OP and my posts since then totally out of context, haven't tried to explain how SGEMM manages a 1.8x relative performance gain, and keep going on about only one aspect of the Tensor cores; several pages have now been wasted on this.
    BTW, do you also say the CUDA cores are HFMA2 with no exceptions on P100?
    Oh wait, they are FP32 cores that also do FMA/GEMM at FP16 as well as FP32/etc.

    How did mixed-precision GEMM work on P100 with CUDA 8, given that (by your logic) it obviously cannot use the same cores as the single-precision GEMM maths in the cuBLAS library (which is what was used for the context of relative gains between V100 and P100, for both FP32 and FP16)?
    Before answering: this changed compared to CUDA 7.5 and has had to change again to support Volta, now as CUDA 9 with the new/updated libraries.

    Anyway, maybe first explain the up to 1.8x relative gain (meaning both GPUs are more equalised and the focus is only on a specific function, which for their context and point was GEMM, single and mixed precision) before we continue wasting time arguing.
    That means it is not thread scheduling or cache improvements, because the scope of what I posted is very specific and the gain is too great anyway.
    The only current observation is that the Tensor cores are not exclusive when it comes to their use with the cuBLAS library and possibly other related libraries (though, as I said repeatedly earlier, there may be limitations involved in terms of flexibility).

    But I have said enough on this for now, as it is probably getting boring for everyone.
    Edit:
    All of the above comes back to matrix multiplication in linear algebra computations or convolutions.

    Edit 2:
    Late-night posting; I noticed I had used DP2A for context (technically not quite correct), but the better case is HFMA2 for P100, so I changed it.
     
    #248 CSI PC, May 13, 2017
    Last edited: May 13, 2017
  9. Anarchist4000

    Veteran Regular

    Joined:
    May 8, 2004
    Messages:
    1,439
    Likes Received:
    359
    [image]
    The tensor core is literally this. The entire performance improvement comes down to this:
    The gains are the result of lots of ALUs and predictable access patterns reducing pressure on the registers and scheduling.
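    As a scalar reference for the published op (D = A x B + C on 4x4 tiles, FP16 inputs, FP32 products and accumulation) - just a sketch of the arithmetic, not of how the hardware is organised:
    Code:
    #include <cuda_fp16.h>

    // Reference arithmetic for one tensor core op: D = A x B + C on 4x4 tiles,
    // FP16 inputs, with products and accumulation carried in FP32.
    void tensor_op_reference(const __half A[4][4], const __half B[4][4],
                             const float C[4][4], float D[4][4])
    {
        for (int i = 0; i < 4; ++i)
            for (int j = 0; j < 4; ++j) {
                float acc = C[i][j];                            // FP32 accumulator
                for (int k = 0; k < 4; ++k)
                    acc += __half2float(A[i][k]) * __half2float(B[k][j]);
                D[i][j] = acc;
            }
    }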
     
  10. pharma

    Veteran Regular

    Joined:
    Mar 29, 2004
    Messages:
    2,933
    Likes Received:
    1,629
    http://www.technewsworld.com/story/84528.html
     
  11. Voxilla

    Regular

    Joined:
    Jun 23, 2007
    Messages:
    711
    Likes Received:
    282
    In hardware, these matrix computations are typically done with a so-called 'systolic array'.
    Google's TPU is also based on this, albeit computing 256x256 matrix multiplications at 8-bit.
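    For anyone curious, here is a toy software model of the idea (purely illustrative, not how the TPU or Volta actually organise it): operands enter the edges of the array skewed in time, hop one processing element per cycle, and each PE performs one multiply-accumulate per cycle.
    Code:
    // Toy output-stationary systolic array: PE (i,j) owns output C[i][j]; the
    // operand pair for step k reaches it at cycle i + j + k, one MAC per cycle.
    #define N 4   // N x N array of PEs multiplying N x N matrices

    void systolic_matmul(const float A[N][N], const float B[N][N], float C[N][N])
    {
        for (int i = 0; i < N; ++i)
            for (int j = 0; j < N; ++j)
                C[i][j] = 0.0f;

        for (int t = 0; t < 3 * N - 2; ++t)         // total wavefront length
            for (int i = 0; i < N; ++i)
                for (int j = 0; j < N; ++j) {
                    int k = t - i - j;              // operand pair arriving at PE (i,j) this cycle
                    if (k >= 0 && k < N)
                        C[i][j] += A[i][k] * B[k][j];
                }
    }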
     
    pharma, Lightman, Razor1 and 2 others like this.
  12. Razor1

    Veteran

    Joined:
    Jul 24, 2004
    Messages:
    4,232
    Likes Received:
    749
    Location:
    NY, NY
    Hmm, always wondered why they call it systolic - does it have anything to do with pressure? Register pressure, I'm presuming.
     
  13. pharma

    Veteran Regular

    Joined:
    Mar 29, 2004
    Messages:
    2,933
    Likes Received:
    1,629
    NVIDIA Launches GPU Cloud Platform to Simplify AI Development
    http://nvidianews.nvidia.com/news/nvidia-launches-gpu-cloud-platform-to-simplify-ai-development
     
    #253 pharma, May 13, 2017
    Last edited: May 14, 2017
  14. LiXiangyang

    Newcomer

    Joined:
    Mar 4, 2013
    Messages:
    81
    Likes Received:
    47
    Volta is a really interesting design; it looks almost like an ASIC for DL.

    As for the tensor core, I believe Nvidia has already made it clear: it uses FP16 only for storage, does the multiply ops in full precision (FP32), and then adds the result to an FP32 variable.

    That's why they use SgemmEx for the benchmark against Pascal, since SgemmEx works in exactly the same way (in contrast to the Hgemm routine): it loads data of various precisions but does the computation (multiply + add) in full (FP32) precision.

    Which means the tensor core is a full-precision matrix multiplication unit with FP16 data input; that's why Nvidia is confident in using the tensor core not just for inference but for training the network as well.

    And since the computation is fully FP32, just like SgemmEx, the precision loss is limited to the FP16 storage stage, so I can think of many applications outside the DL domain that could benefit from the vast computing resources GV100 offers.
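    That description also lines up with how the CUDA 9 preview exposes the unit to programmers: warp-level matrix fragments with half inputs and a float accumulator. A minimal sketch of that WMMA API (a single 16x16x16 tile, with launch code and leading-dimension handling omitted):
    Code:
    #include <mma.h>
    #include <cuda_fp16.h>
    using namespace nvcuda;

    // One warp computes D = A x B + C for a single 16x16x16 tile on the tensor
    // cores: half fragments in, float accumulator out (sm_70, CUDA 9).
    __global__ void wmma_tile(const half *A, const half *B, float *C)
    {
        wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
        wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b_frag;
        wmma::fragment<wmma::accumulator, 16, 16, 16, float> c_frag;

        wmma::fill_fragment(c_frag, 0.0f);          // C = 0 for this sketch
        wmma::load_matrix_sync(a_frag, A, 16);
        wmma::load_matrix_sync(b_frag, B, 16);
        wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);
        wmma::store_matrix_sync(C, c_frag, 16, wmma::mem_row_major);
    }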
     
    pharma, CSI PC, xpea and 1 other person like this.
  15. Rootax

    Veteran Newcomer

    Joined:
    Jan 2, 2006
    Messages:
    1,180
    Likes Received:
    584
    Location:
    France
    How much of all this new tech do we need in a gaming GPU? (And by "need", I mean with price/power in mind.)
     
  16. pharma

    Veteran Regular

    Joined:
    Mar 29, 2004
    Messages:
    2,933
    Likes Received:
    1,629
    If the direction gaming companies take is to use FP16 for calculations, then it's significant.
    Next id Tech relies heavily on FP16 calculations
    https://www.golem.de/news/id-software-naechste-id-tech-setzt-massiv-auf-fp16-berechnungen-1704-127494.html
     
  17. CSI PC

    Veteran Newcomer

    Joined:
    Sep 2, 2015
    Messages:
    2,050
    Likes Received:
    844
    Also to add, here are Google's details on the TPU: In-Datacenter Performance Analysis of a Tensor Processing Unit
    https://arxiv.org/ftp/arxiv/papers/1704/1704.04760.pdf

    The paper was made public earlier this year.
    Edit:
    I forgot to mention that Xilinx is working towards Int8 deep learning optimisation: https://www.xilinx.com/support/documentation/white_papers/wp486-deep-learning-int8.pdf
    It also has some nice relevant links at the bottom of the paper.
    Cheers
     
    #257 CSI PC, May 14, 2017
    Last edited: May 14, 2017
  18. CSI PC

    Veteran Newcomer

    Joined:
    Sep 2, 2015
    Messages:
    2,050
    Likes Received:
    844
    What will also be interesting going forward is whether GV102 and the other lower GPUs (or some of them at least) will have an Int8/DP4A instruction version of the Tensor core, aimed more at Int8 inferencing/convolution.
    Just as with the P100, they are careful not to talk about this, and it makes sense from a differentiation perspective between GV100 and GV102.
    Quite a few of the Nvidia deep learning/GEMM libraries are also optimised for Int8 and are maturing in terms of use and development.
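    For reference, the existing DP4A path (GP102/GP104 and other sm_61 parts) boils down to a four-way int8 dot product accumulated into int32; a minimal sketch:
    Code:
    // DP4A sketch (sm_61+): each call does four int8 x int8 products summed into
    // an int32 accumulator, the building block for int8 inference GEMMs.
    __global__ void dp4a_dot(const int *a_packed, const int *b_packed, int n, int *out)
    {
        if (blockIdx.x == 0 && threadIdx.x == 0) {
            int acc = 0;
            for (int i = 0; i < n; ++i)
                acc = __dp4a(a_packed[i], b_packed[i], acc);  // 4 MACs per call
            *out = acc;
        }
    }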
    Cheers
     
  19. mczak

    Veteran

    Joined:
    Oct 24, 2002
    Messages:
    3,015
    Likes Received:
    112
    This design of course makes a lot of sense. An fp16 x fp16 multiply giving an accurate (wider) output should have just about the same hw cost as an fp16 x fp16 multiply with fp16 output. But saying this is doing the multiply ops in full precision is a bit misleading imho, even if technically true. It should indeed help quite a lot: if you did the matrix multiply with just individual fp16 FMAs, the result would be much worse.
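    That gap is easy to see numerically with a quick host-side sketch (using the CUDA half conversion helpers; host-side availability of these conversions is assumed here) comparing the same FP16 inputs accumulated at fp16 versus at fp32:
    Code:
    #include <cuda_fp16.h>
    #include <stdio.h>

    // Same FP16 inputs, two accumulation strategies: rounding to fp16 after every
    // step (individual fp16 FMAs) versus keeping the running sum in fp32.
    int main()
    {
        const int K = 4096;
        __half acc16 = __float2half(0.0f);
        float  acc32 = 0.0f;

        for (int k = 0; k < K; ++k) {
            float a = __half2float(__float2half(0.01f));   // value as stored in fp16
            acc16 = __float2half(__half2float(acc16) + a); // fp16 accumulation
            acc32 += a;                                    // fp32 accumulation
        }
        // The fp16 accumulator stalls once additions fall below half an ulp
        // (around 32 here); the fp32 accumulator stays near the expected ~41.
        printf("fp16 accumulate: %f   fp32 accumulate: %f\n",
               __half2float(acc16), acc32);
        return 0;
    }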
     
  20. CSI PC

    Veteran Newcomer

    Joined:
    Sep 2, 2015
    Messages:
    2,050
    Likes Received:
    844
    Not sure if you are actually disagreeing about the core's full-precision computation if you agree it may be technically correct; I agree denormal/subnormal numbers are a consideration, but that depends upon the operation/function requirements.

    On P100, as an example, you can use the HFMA2 instruction (and I assume on V100 as well), which is fp16/fp16/fp16 (computation at fp16 with a single rounding for accuracy).
    The Hgemm routine (also available on P100) is likewise fp16/fp16/fp16 (computation at fp16), but limited to fp16 input and output.
    Importantly (in your context), the CUDA cores since GP100 Pascal and CUDA 8 also support fp16/fp16/fp32 (computation at fp32) with SgemmEx, and in the performance chart they show the Tensor cores with the same or a comparable operation; that makes sense for Tensor cores in training.
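    For concreteness, the input/compute combinations above map roughly onto device-side pieces like these (just a sketch; the GEMM routines themselves of course live inside cuBLAS):
    Code:
    #include <cuda_fp16.h>

    // HFMA2-style element: fp16 x fp16 + fp16, two lanes, computed and stored at
    // fp16 with a single rounding (sm_53+).
    __device__ __half2 fma_fp16_compute(__half2 a, __half2 b, __half2 c)
    {
        return __hfma2(a, b, c);
    }

    // SgemmEx-style element: fp16 storage for the inputs, product and
    // accumulation carried at fp32 - the same shape as the tensor core op.
    __device__ float fma_fp32_compute(__half a, __half b, float c)
    {
        return fmaf(__half2float(a), __half2float(b), c);
    }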

    I really doubt we have been told everything yet about the Tensor cores (or the indirectly related instructions available for CUDA 9/the latest compute SM versions), especially as they are presenting performance data using the more flexible SgemmEx, which allows variable input and output types albeit with computation at fp32. The point being, the Tensor core is a full-precision-capable core with flexibility depending upon the instructions supported.
    However, it comes down to the optimised libraries, the instructions supported and the compute/CUDA level, and here Nvidia's focus is on cuBLAS GEMM. The same applies when discussing the DP4A instruction, which is not found on P100 (I doubt it exists on V100) and will again be on GV102 and lower, likely IMO in Tensor core form primarily for int8 inference.

    I just had a look at the Twitter accounts of various Nvidia engineers, and Mark Harris said earlier in the week:
    Which matches up with the SgemmEx use and also ties into the further explanation given for the chart showing up to 9x greater performance than P100 - this was in the follow-up article giving a high-level overview of CUDA 9.
    The irony, though, is that they never used fp16 on the P100 when comparing to V100 and the Tensor cores, just Sgemm/FMA (fp32) on P100.

    It seems this is the year DL moves to fp16 for training and int8 for inferencing more broadly, beyond the specialist cases to date.
    At minimum, one of the GTC presentations/workshops this year was on how to convert training for existing FP32 DL systems to FP16.
    Cheers

    Edit:
    Just being lazy; there is no difference in my post between upper- and lower-case 'fp'.
     
    #260 CSI PC, May 15, 2017
    Last edited: May 15, 2017