Nvidia Volta Speculation Thread

Discussion in 'Architecture and Products' started by DSC, Mar 19, 2013.

  1. ieldra

    Newcomer

    Joined:
    Feb 27, 2016
    Messages:
    149
    Likes Received:
    116
    Been listening to that on the train. Quite interesting, and absolutely hilarious to hear Giroux suddenly speak with a French accent when he says "capability". Good god, I did not expect that out of the blue.
     
    pharma likes this.
  2. Alexko

    Veteran Subscriber

    Joined:
    Aug 31, 2009
    Messages:
    4,515
    Likes Received:
    934
    When was that?
     
  3. rcf

    rcf
    Regular Newcomer

    Joined:
    Nov 6, 2013
    Messages:
    430
    Likes Received:
    355
  4. psurge

    Regular

    Joined:
    Feb 6, 2002
    Messages:
    948
    Likes Received:
    46
    Location:
    LA, California
    I think so, although TLS and exceptions still aren't supported. Edit: not sure about auto-parallelization, but if I understood him correctly, it can run parallel C++ programs (subject to the restrictions above).
     
  5. ieldra

    Newcomer

    Joined:
    Feb 27, 2016
    Messages:
    149
    Likes Received:
    116
    I understood it to mean it can run any form of concurrency implementable in C++, not that it automatically ports from x86-based C++ implementations. It can run anything implementable with Visual C++, not Visual C++ itself.

    What I found most interesting was how long ago the proof of concept for Volta was developed: seven years. I wonder how long ago the design decisions were set in stone; thinking seven years ahead seems like a daunting task. It ties into his talk about the risks of implementing shiny new hardware features that require particular attention from programmers, etc.
     
    pharma and silent_guy like this.
  6. rcf

    rcf
    Regular Newcomer

    Joined:
    Nov 6, 2013
    Messages:
    430
    Likes Received:
    355
    I should have written "more easily parallelize" instead of "auto-parallelize".
    Volta may run any form of concurrency implementable in C++, but I guess it would still be very inefficient with stuff like the Actor Model.
    And if, instead of parallelizing one C++ program, you're simultaneously running multiple independent C++ programs on Volta, wouldn't it be faster to run them on a multi-core CPU and pin each program to a core?
     
  7. smw

    smw
    Newcomer

    Joined:
    Sep 13, 2008
    Messages:
    113
    Likes Received:
    43
    Without having listened to the podcast, but following the development of C++17, my guess would be that he is talking mostly about the new Parallel STL algorithms. Basically, most of the algorithms in the STL will accept an execution policy that specifies whether you want the sequential, parallel, or parallel + vectorized version of the algorithm to run (a small sketch follows below). I seriously doubt that he is talking about running std::thread on the GPU. Here is the relevant reference (note the authors):
    http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2013/n3554.pdf

    This is, of course, if he is talking about C++ :)
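
    For concreteness, a minimal sketch of what those execution policies look like in C++17. Nothing Volta-specific here, just the standard <execution> header; the container size and lambdas are made up:

    Code:
    // Same STL algorithms, different execution policies (C++17).
    #include <algorithm>
    #include <execution>
    #include <numeric>
    #include <vector>

    int main() {
        std::vector<float> v(1 << 20, 1.0f);

        // Sequential: the classic single-threaded behaviour.
        std::sort(std::execution::seq, v.begin(), v.end());

        // Parallel: the implementation may split the work across threads.
        std::transform(std::execution::par, v.begin(), v.end(), v.begin(),
                       [](float x) { return x * 2.0f; });

        // Parallel + vectorized: also allows SIMD-style interleaving.
        float sum = std::reduce(std::execution::par_unseq,
                                v.begin(), v.end(), 0.0f);
        return sum > 0.0f ? 0 : 1;
    }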
     
    BRiT and pharma like this.
  8. ToTTenTranz

    Legend Veteran Subscriber

    Joined:
    Jul 7, 2008
    Messages:
    11,035
    Likes Received:
    5,576
  9. CarstenS

    Veteran Subscriber

    Joined:
    May 31, 2002
    Messages:
    4,936
    Likes Received:
    2,273
    Location:
    Germany
    #569 CarstenS, Sep 7, 2017
    Last edited: Sep 7, 2017
  10. Kaotik

    Kaotik Drunk Member
    Legend

    Joined:
    Apr 16, 2003
    Messages:
    9,088
    Likes Received:
    2,955
    Location:
    Finland
    pharma and CarstenS like this.
  11. CarstenS

    Veteran Subscriber

    Joined:
    May 31, 2002
    Messages:
    4,936
    Likes Received:
    2,273
    Location:
    Germany
    Yes, of course. :embarrased:
     
  12. silent_guy

    Veteran Subscriber

    Joined:
    Mar 7, 2006
    Messages:
    3,754
    Likes Received:
    1,380
    nnunn, Picao84 and sonen like this.
  13. CarstenS

    Veteran Subscriber

    Joined:
    May 31, 2002
    Messages:
    4,936
    Likes Received:
    2,273
    Location:
    Germany
    As of now, and AFAIK, it has not been established that you can use the Tensor Cores for anything other than DNN training and probably inferencing, i.e. they do not seem to be the freely programmable FLOPS that people reasonably expect and usually mean when comparing FLOPS.

    Or would you like another GeForce 3 with 76 GFLOPS?
     
    Lightman likes this.
  14. silent_guy

    Veteran Subscriber

    Joined:
    Mar 7, 2006
    Messages:
    3,754
    Likes Received:
    1,380
    It has been very well established that you will be able to do all that.

    https://devblogs.nvidia.com/parallelforall/inside-volta/#more-7832 has the following:

    Edit: for the gory CUDA details, check out slide 51 from this GTC session:
    http://on-demand.gputechconf.com/gtc/2017/presentation/s7798-luke-durant-inside-volta.pdf
     
  15. MDolenc

    Regular

    Joined:
    May 26, 2002
    Messages:
    692
    Likes Received:
    441
    Location:
    Slovenia
    It's just D = A * B + C, where all four operands are matrices rather than scalars as with an ordinary MAD or FMAD. Nothing particularly magical about that. Normally, if you want to transform a vertex, for example, that's C = A * B, where A is a matrix and C and B are vectors. But you can transform four vertices at the same time if C and B are also matrices (so B packs four vertices). Just an example. :wink:
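
    To make the vertex example concrete, a plain C++ sketch (not from the talk; names made up) of transforming four vertices with one matrix multiply by packing them as the columns of B:

    Code:
    // C = A * B: A is the transform, each column of B is one vertex (x, y, z, w).
    #include <array>

    using Mat4 = std::array<std::array<float, 4>, 4>;  // row-major 4x4

    Mat4 multiply(const Mat4& A, const Mat4& B) {
        Mat4 C{};
        for (int i = 0; i < 4; ++i)
            for (int j = 0; j < 4; ++j)
                for (int k = 0; k < 4; ++k)
                    C[i][j] += A[i][k] * B[k][j];  // column j of C = A * vertex j
        return C;
    }

    int main() {
        Mat4 A{};                                      // identity transform
        for (int i = 0; i < 4; ++i) A[i][i] = 1.0f;
        Mat4 B{};                                      // four packed vertices
        for (int j = 0; j < 4; ++j) { B[0][j] = float(j); B[3][j] = 1.0f; }
        Mat4 C = multiply(A, B);                       // four transforms at once
        return C[3][0] == 1.0f ? 0 : 1;
    }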
     
    ieldra and xpea like this.
  16. psurge

    Regular

    Joined:
    Feb 6, 2002
    Messages:
    948
    Likes Received:
    46
    Location:
    LA, California
    Is fp16 enough for the transform matrix?
     
  17. CarstenS

    Veteran Subscriber

    Joined:
    May 31, 2002
    Messages:
    4,936
    Likes Received:
    2,273
    Location:
    Germany
    Interesting. I must have missed this. I was under the impression that, for the time being, the Tensor Cores would be accessible only via Nvidia's DNN libraries.

    It was just an example, and if devs know what they're doing, why not? Vertex processing, though, was exempt even from DX9's minimum general precision of 24 bits and always went with FP32 minimum.
     
  18. HKS

    HKS
    Newcomer

    Joined:
    Apr 26, 2007
    Messages:
    31
    Likes Received:
    14
    Location:
    Norway
    pharma, silent_guy and CarstenS like this.
  19. RecessionCone

    Regular Subscriber

    Joined:
    Feb 27, 2010
    Messages:
    499
    Likes Received:
    177
    Note that this deck explains that the matrices need to be 16x16, not 4x4, in order to access them from CUDA. It also explains that the tensor core operations are synchronizing operations across the warp, because the inputs and outputs are striped across the threads in a warp in a pattern that is opaque to the programmer. They provide an API to make accessing these striped matrices possible, but all this means that the programming model for the tensor cores is quite different from traditional GPU shader code.
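
    For reference, a minimal sketch of what that warp-level API looks like in CUDA 9 (nvcuda::wmma, sm_70). The kernel name and the choice to start the accumulator at zero are mine; the point is that one warp cooperatively computes a 16x16x16 D = A*B + C tile while the fragment layout stays opaque:

    Code:
    // One warp computes a 16x16x16 D = A * B + C tile via the WMMA API.
    #include <mma.h>
    using namespace nvcuda;

    __global__ void wmma_16x16x16(const half* A, const half* B, float* C) {
        wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a;
        wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b;
        wmma::fragment<wmma::accumulator, 16, 16, 16, float> acc;

        wmma::fill_fragment(acc, 0.0f);        // start with C = 0 in this sketch
        wmma::load_matrix_sync(a, A, 16);      // warp-synchronizing loads
        wmma::load_matrix_sync(b, B, 16);
        wmma::mma_sync(acc, a, b, acc);        // the tensor core matrix FMA
        wmma::store_matrix_sync(C, acc, 16, wmma::mem_row_major);
    }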
     
    nnunn, psurge, pharma and 1 other person like this.
  20. psurge

    Regular

    Joined:
    Feb 6, 2002
    Messages:
    948
    Likes Received:
    46
    Location:
    LA, California
    I'm not sure that devs knowing what they are doing automatically means it's possible to get acceptable quality with fp16. Maybe it is, maybe not - that's the question. There are some obvious things one can do - with matrix-multiply-accumulate, vertex translation can happen via the fp32 addend matrix - but I'm not a graphics programmer, so I'm not really sure where the precision is needed. I also didn't realize the matrices needed to be 16x16. I guess that might help in some ways, since each component of the transform matrix could be represented as the unevaluated sum of two fp16 numbers, giving you extra mantissa precision. But I'm too dumb to figure out a way to set things up without the 16x16 transform matrix containing lots of zero blocks, so unless there's some way to skip MMAs for all-zero 4x4 sub-blocks, I'm guessing there would be no performance gain (and it's all irrelevant anyway if consumer Volta omits the tensor cores).

    Maybe skinning meshes would be a better application?
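
    For what it's worth, a hypothetical sketch (not from the slides) of the "unevaluated sum of two fp16 numbers" idea above: split each fp32 element into a high and a low half, then accumulate both partial products in the fp32 accumulator.

    Code:
    // Split one fp32 value into two fp16 halves: hi keeps the leading ~11
    // mantissa bits, lo keeps the rounding residual, so accumulating
    // A_hi*x + A_lo*x in fp32 recovers extra precision.
    #include <cuda_fp16.h>

    struct SplitHalf { __half hi; __half lo; };

    __device__ SplitHalf split_fp32(float m) {
        __half hi = __float2half(m);                     // leading bits
        __half lo = __float2half(m - __half2float(hi));  // residual bits
        return {hi, lo};
    }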
     