Nvidia Volta Speculation Thread

Been listening to that on the train. Quite interesting, and absolutely hilarious to hear Giroux suddenly speak with a French accent when he says "capability". Good god, I did not expect that out of the blue.

When was that?
 
I think so, although TLS and exceptions still aren't supported. Edit: not sure about auto-parallelization, but if I understood him correctly, it can run parallel C++ programs (subject to the restrictions above).
 
I understood it to mean it can run any form of concurrency implementable in C++, not that it automatically ports from x86-based C++ implementations. It can run anything implementable with Visual C++, not Visual C++ itself.

What I found most interesting was how long ago the proof of concept for Volta was developed: seven years. I wonder how long ago the design decisions were set in stone; thinking seven years ahead seems like a daunting task. It ties in to his talk about the risks of implementing shiny new hardware features that require particular attention from programmers, etc.
 
I should have written "more easily parallelize" instead of "auto-parallelize".
Volta may run any form of concurrency implementable in C++, but I guess it would still be very inefficient with stuff like the Actor Model.
And if, instead of parallelizing one C++ program, you're simultaneously running multiple independent C++ programs on Volta, wouldn't it be faster to run them on a multi-core CPU and pin each program to a core?
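(For reference, pinning on Linux is just a sched_setaffinity call, or taskset -c N from the shell. A minimal sketch, nothing Volta-specific; core 0 is an arbitrary choice here:)

Code:
// Minimal sketch: pin the calling process to one core on Linux (glibc).
#include <sched.h>
#include <cstdio>

int main() {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(0, &set);  // core 0 here; give each program a different index
    if (sched_setaffinity(0, sizeof(set), &set) != 0) {
        perror("sched_setaffinity");
        return 1;
    }
    // ... run the actual workload ...
    return 0;
}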
 
Without having listened to the podcast, but following the development of C++17, my guess would be that he is mostly talking about the new Parallel STL algorithms. Basically, most of the algorithms in the STL will accept an execution policy that specifies whether you want the sequential, parallel, or parallel + vectorized version of the algorithm to run (see the sketch below). I seriously doubt that he is talking about running std::threads on the GPU. Here is the relevant reference (note the authors):
http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2013/n3554.pdf

This is, of course, if he is talking about C++ :)
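For illustration, in the final C++17 spelling (the n3554 draft used different, experimental names) it boils down to passing one of three policies to the same algorithm:

Code:
#include <algorithm>
#include <execution>
#include <vector>

int main() {
    std::vector<double> v(1 << 20, 1.0);
    std::sort(std::execution::seq, v.begin(), v.end());        // sequential
    std::sort(std::execution::par, v.begin(), v.end());        // parallel
    std::sort(std::execution::par_unseq, v.begin(), v.end());  // parallel + vectorized
    return 0;
}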
 
As of now and AFAIK, it has not been established that you can use the Tensor Cores for anything other than DNN training and probably inferencing, i.e. they do not seem to be the freely programmable FLOPS people usually have in mind when comparing FLOPS.

Or would you like another GeForce 3 with 76 GFLOPS?
 
As of now and AFAIK, it has not been established that you can use the Tensor Cores for anything other than DNN training and probably inferencing, i.e. they do not seem to be the freely programmable FLOPS people usually have in mind when comparing FLOPS.
It has been very well established that you will be able to do all that.

https://devblogs.nvidia.com/parallelforall/inside-volta/#more-7832 has the following:

In addition to CUDA C++ interfaces to program Tensor Cores directly, CUDA 9 cuBLAS and cuDNN libraries include new library interfaces to make use of Tensor Cores for deep learning applications and frameworks.

Edit: for the gory CUDA details, check out slide 51 from this GTC session:
http://on-demand.gputechconf.com/gtc/2017/presentation/s7798-luke-durant-inside-volta.pdf
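On the library side, the cuBLAS route appears to be just a math-mode switch on the handle. A rough sketch going by the CUDA 9 documentation (fp16 inputs, fp32 accumulate; exact enum names per the docs):

Code:
// Rough sketch: opt a cuBLAS handle into Tensor Core math, then run a
// mixed-precision GEMM that is eligible for the Tensor Core path.
#include <cublas_v2.h>
#include <cuda_fp16.h>

void gemm_tensor_op(cublasHandle_t handle, int m, int n, int k,
                    const __half *A, const __half *B, float *C) {
    const float alpha = 1.0f, beta = 0.0f;
    cublasSetMathMode(handle, CUBLAS_TENSOR_OP_MATH);  // allow Tensor Core use
    cublasGemmEx(handle, CUBLAS_OP_N, CUBLAS_OP_N, m, n, k,
                 &alpha,
                 A, CUDA_R_16F, m,   // fp16 input matrices
                 B, CUDA_R_16F, k,
                 &beta,
                 C, CUDA_R_32F, m,   // fp32 accumulation/output
                 CUDA_R_32F,         // compute type
                 CUBLAS_GEMM_DFALT_TENSOR_OP);
}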
 
As of now and AFAIK, it has not been established that you can use the Tensor Cores for anything other than DNN training and probably inferencing, i.e. they do not seem to be the freely programmable FLOPS people usually have in mind when comparing FLOPS.
It's just D = A * B + C, where all four components are matrices rather than scalars as with an ordinary MAD or FMAD. Nothing particularly magical about that. Normally, if you want to transform a vertex for example, that's C = A * B, where A is a matrix and C and B are vectors. But you can transform 4 vertices at the same time if C and B are also matrices (so B packs 4 vertices). Just an example. ;)
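To spell that out in scalar code, a toy CPU version of D = A * B + C with 4x4 matrices (plain C++, no Tensor Cores involved):

Code:
// Column j of B holds vertex j, so one multiply-accumulate over the whole
// matrices transforms 4 vertices at once.
void mma4x4(const float A[4][4], const float B[4][4],
            const float C[4][4], float D[4][4]) {
    for (int i = 0; i < 4; ++i)
        for (int j = 0; j < 4; ++j) {
            float acc = C[i][j];           // the addend matrix
            for (int k = 0; k < 4; ++k)
                acc += A[i][k] * B[k][j];  // transform * vertex j
            D[i][j] = acc;
        }
}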
 
Interesting. I must have missed this. I was under the impression that, for the time being, the Tensor Cores would be accessible only via Nvidia's DNN libraries.

Is fp16 enough for the transform matrix?
It was just an example, and if devs know what they're doing, why not? Vertex processing, though, was exempt even from DX9's minimum general precision of 24 bits and always went with FP32 minimum.
 
If you want more details on using the Tensor Cores for general compute, check out Mark Harris's "CUDA 9 and Beyond" presentation from GTC:

http://on-demand.gputechconf.com/gt...-mark-harris-new-cuda-features-and-beyond.pdf

Note that this deck explains that the matrices need to be 16x16, not 4x4, in order to access them from CUDA. It also explains that the Tensor Core operations are synchronizing operations across the warp, because the matrix inputs and outputs are striped across the threads of a warp in a pattern that is opaque to the programmer. There is an API to make accessing these striped matrices possible, but all this means the programming model for the Tensor Cores is quite different from traditional GPU shader code.
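For a flavor of that API, here is a minimal warp-level sketch along the lines of the CUDA 9 WMMA interface (16x16x16 tiles, fp16 inputs, fp32 accumulator; check the deck above for the exact spelling):

Code:
// One warp cooperatively computes a 16x16 tile of D = A*B + C. How elements
// map to the 32 threads is opaque; the *_sync calls synchronize the warp.
#include <mma.h>
using namespace nvcuda;

__global__ void wmma_tile(const half *a, const half *b, const float *c, float *d) {
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> acc_frag;

    wmma::load_matrix_sync(a_frag, a, 16);  // leading dimension 16
    wmma::load_matrix_sync(b_frag, b, 16);
    wmma::load_matrix_sync(acc_frag, c, 16, wmma::mem_col_major);
    wmma::mma_sync(acc_frag, a_frag, b_frag, acc_frag);  // D = A*B + C
    wmma::store_matrix_sync(d, acc_frag, 16, wmma::mem_col_major);
}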
 
It was just an example, and if devs know what they're doing, why not? Vertex processing, though, was exempt even from DX9's minimum general precision of 24 bits and always went with FP32 minimum.

I'm not sure that devs knowing what they are doing automatically means it's possible to get acceptable quality with fp16. Maybe it is, maybe not; that's the question. There are some obvious things one can do: with matrix-multiply-accumulate, vertex translation can happen via the fp32 addend matrix. But I'm not a graphics programmer, so I'm not really sure where the precision is needed.

I also didn't realize the matrices needed to be 16x16. I guess that might help in some ways, since each component of the transform matrix could be represented as the unevaluated sum of two fp16 numbers, giving you extra mantissa precision. But I'm too dumb to figure out a way to set things up without the 16x16 transform matrix containing lots of zero blocks, so unless there's some way to skip MMAs for zero 4x4 sub-blocks, I'm guessing there would be no performance gain (and it's all irrelevant anyway if consumer Volta omits the Tensor Cores).
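To illustrate the unevaluated-sum trick (a hypothetical sketch; whether the low halves survive being fed through fp16 multiplies is exactly the precision question raised above):

Code:
// Hypothetical: split an fp32 value into hi + lo fp16 halves. The residual
// in lo recovers mantissa bits that hi alone drops; A*B is then approximated
// as A_hi*B + A_lo*B, with both products accumulated in fp32.
#include <cuda_fp16.h>

__device__ void split_f32(float x, __half &hi, __half &lo) {
    hi = __float2half(x);                     // keeps ~11 mantissa bits
    lo = __float2half(x - __half2float(hi));  // residual carries extra bits
}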

Maybe skinning meshes would be a better application?
 