Nvidia Volta Speculation Thread

Discussion in 'Architecture and Products' started by DSC, Mar 19, 2013.

  1. ieldra

    Newcomer

    Joined:
    Feb 27, 2016
    Messages:
    149
    Likes Received:
    116
    Been listening to that on the train. Quite interesting, and absolutely hilarious to hear Giroux suddenly speak with a French accent when he says "capability". Good god, I did not expect that out of the blue.
     
    pharma likes this.
  2. Alexko

    Veteran Subscriber

    Joined:
    Aug 31, 2009
    Messages:
    4,515
    Likes Received:
    934
    When was that?
     
  3. rcf

    rcf
    Regular Newcomer

    Joined:
    Nov 6, 2013
    Messages:
    430
    Likes Received:
    355
  4. psurge

    Regular

    Joined:
    Feb 6, 2002
    Messages:
    948
    Likes Received:
    46
    Location:
    LA, California
    I think so, although TLS and exceptions still aren't supported. Edit: not sure about auto-parallelization, but if I understood him correctly, it can run parallel C++ programs (subject to the restrictions above).
     
  5. ieldra

    Newcomer

    Joined:
    Feb 27, 2016
    Messages:
    149
    Likes Received:
    116
    I understood it to mean it can run any form of concurrency implementable in C++, not that it automatically ports from x86-based C++ implementations. It can run anything implementable with Visual C++, not Visual C++ itself.

    What I found most interesting was how long ago the proof of concept for Volta was developed: seven years. I wonder how long ago the design decisions were set in stone; thinking seven years ahead seems like a daunting task. It ties into his talk about the risks of implementing shiny new hardware features that require particular attention from programmers, etc.
     
    pharma and silent_guy like this.
  6. rcf

    rcf
    Regular Newcomer

    Joined:
    Nov 6, 2013
    Messages:
    430
    Likes Received:
    355
    I should have written "more easily parallelize" instead of "auto-parallelize".
    Volta may run any form of concurrency implementable in C++, but I guess it would still be very inefficient with stuff like the Actor Model.
    And if, instead of parallelizing one C++ program, you're simultaneously running multiple independent C++ programs on Volta, wouldn't it be faster to run them on a multi-core CPU and pin each program to a core?
     
  7. smw

    smw
    Newcomer

    Joined:
    Sep 13, 2008
    Messages:
    113
    Likes Received:
    43
    Without having listened to the podcast, but following the development of C++17, my guess would be that he is talking mostly about the new Parallel STL algorithms. Basically, most of the algorithms in the STL will accept an execution policy that specifies whether you want the sequential, parallel, or parallel + vectorized version of the algorithm to run (a small sketch follows below). I seriously doubt that he is talking about running std::thread on the GPU. Here is the relevant reference (note the authors):
    http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2013/n3554.pdf

    This is, of course, if he is talking about C++ :)
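
    For concreteness, a minimal sketch of what those execution policies look like in C++17. Nothing Volta-specific here, just the standard <execution> header; the container size and lambdas are made up:

    Code:
    // Same STL algorithms, different execution policies (C++17).
    #include <algorithm>
    #include <execution>
    #include <numeric>
    #include <vector>

    int main() {
        std::vector<float> v(1 << 20, 1.0f);

        // Sequential: the classic single-threaded behaviour.
        std::sort(std::execution::seq, v.begin(), v.end());

        // Parallel: the implementation may split the work across threads.
        std::transform(std::execution::par, v.begin(), v.end(), v.begin(),
                       [](float x) { return x * 2.0f; });

        // Parallel + vectorized: also allows SIMD-style interleaving.
        float sum = std::reduce(std::execution::par_unseq,
                                v.begin(), v.end(), 0.0f);
        return sum > 0.0f ? 0 : 1;
    }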
     
    BRiT and pharma like this.
  8. ToTTenTranz

    Legend Veteran Subscriber

    Joined:
    Jul 7, 2008
    Messages:
    11,035
    Likes Received:
    5,576
  9. CarstenS

    Veteran Subscriber

    Joined:
    May 31, 2002
    Messages:
    4,936
    Likes Received:
    2,273
    Location:
    Germany
    #569 CarstenS, Sep 7, 2017
    Last edited: Sep 7, 2017
  10. Kaotik

    Kaotik Drunk Member
    Legend

    Joined:
    Apr 16, 2003
    Messages:
    9,088
    Likes Received:
    2,955
    Location:
    Finland
    pharma and CarstenS like this.
  11. CarstenS

    Veteran Subscriber

    Joined:
    May 31, 2002
    Messages:
    4,936
    Likes Received:
    2,273
    Location:
    Germany
    Yes, of course. :embarrased:
     
  12. silent_guy

    Veteran Subscriber

    Joined:
    Mar 7, 2006
    Messages:
    3,754
    Likes Received:
    1,380
    nnunn, Picao84 and sonen like this.
  13. CarstenS

    Veteran Subscriber

    Joined:
    May 31, 2002
    Messages:
    4,936
    Likes Received:
    2,273
    Location:
    Germany
    As of now, and AFAIK, it has not been established that you can use the Tensor Cores for anything other than DNN training and probably inferencing, i.e. they do not seem to be the freely programmable FLOPS that people reasonably expect and usually mean when comparing FLOPS.

    Or would you like another GeForce 3 with 76 GFLOPS?
     
    Lightman likes this.
  14. silent_guy

    Veteran Subscriber

    Joined:
    Mar 7, 2006
    Messages:
    3,754
    Likes Received:
    1,380
    It has been very well established that you will be able to do all that.

    https://devblogs.nvidia.com/parallelforall/inside-volta/#more-7832 has the following:

    Edit: for the gory CUDA details, check out slide 51 from this GTC session:
    http://on-demand.gputechconf.com/gtc/2017/presentation/s7798-luke-durant-inside-volta.pdf
     
  15. MDolenc

    Regular

    Joined:
    May 26, 2002
    Messages:
    692
    Likes Received:
    441
    Location:
    Slovenia
    It's just D = A * B + C, where all four operands are matrices rather than scalars as with an ordinary MAD or FMAD. Nothing particularly magical about that. Normally, if you want to transform a vertex, for example, that's C = A * B, where A is a matrix and C and B are vectors. But you can transform four vertices at the same time if C and B are also matrices (so B packs four vertices). Just an example. :wink:
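
    To make the vertex example concrete, a plain C++ sketch (not from the talk; names made up) of transforming four vertices with one matrix multiply by packing them as the columns of B:

    Code:
    // C = A * B: A is the transform, each column of B is one vertex (x, y, z, w).
    #include <array>

    using Mat4 = std::array<std::array<float, 4>, 4>;  // row-major 4x4

    Mat4 multiply(const Mat4& A, const Mat4& B) {
        Mat4 C{};
        for (int i = 0; i < 4; ++i)
            for (int j = 0; j < 4; ++j)
                for (int k = 0; k < 4; ++k)
                    C[i][j] += A[i][k] * B[k][j];  // column j of C = A * vertex j
        return C;
    }

    int main() {
        Mat4 A{};                                      // identity transform
        for (int i = 0; i < 4; ++i) A[i][i] = 1.0f;
        Mat4 B{};                                      // four packed vertices
        for (int j = 0; j < 4; ++j) { B[0][j] = float(j); B[3][j] = 1.0f; }
        Mat4 C = multiply(A, B);                       // four transforms at once
        return C[3][0] == 1.0f ? 0 : 1;
    }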
     
    ieldra and xpea like this.
  16. psurge

    Regular

    Joined:
    Feb 6, 2002
    Messages:
    948
    Likes Received:
    46
    Location:
    LA, California
    Is fp16 enough for the transform matrix?
     
  17. CarstenS

    Veteran Subscriber

    Joined:
    May 31, 2002
    Messages:
    4,936
    Likes Received:
    2,273
    Location:
    Germany
    Interesting. I must have missed this. I was under the impression that, for the time being, the Tensor Cores would be accessible only via Nvidia's DNN libraries.

    It was just an example, and if devs know what they're doing, why not? Vertex processing, though, was exempt even from DX9's minimum general precision of 24 bits and always went with FP32 minimum.
     
  18. HKS

    HKS
    Newcomer

    Joined:
    Apr 26, 2007
    Messages:
    31
    Likes Received:
    14
    Location:
    Norway
    pharma, silent_guy and CarstenS like this.
  19. RecessionCone

    Regular Subscriber

    Joined:
    Feb 27, 2010
    Messages:
    499
    Likes Received:
    177
    Note that this deck explains that the matrices need to be 16x16, not 4x4, in order to access them from CUDA. It also explains that the tensor core operations are synchronizing operations across the warp, because the inputs and outputs are striped across the threads in a warp in a pattern that is opaque to the programmer. They provide an API to make accessing these striped matrices possible, but all this means that the programming model for the tensor cores is quite different from traditional GPU shader code.
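
    For reference, a minimal sketch of what that warp-level API looks like in CUDA 9 (nvcuda::wmma, sm_70). The kernel name and the choice to start the accumulator at zero are mine; the point is that one warp cooperatively computes a 16x16x16 D = A*B + C tile while the fragment layout stays opaque:

    Code:
    // One warp computes a 16x16x16 D = A * B + C tile via the WMMA API.
    #include <mma.h>
    using namespace nvcuda;

    __global__ void wmma_16x16x16(const half* A, const half* B, float* C) {
        wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a;
        wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b;
        wmma::fragment<wmma::accumulator, 16, 16, 16, float> acc;

        wmma::fill_fragment(acc, 0.0f);        // start with C = 0 in this sketch
        wmma::load_matrix_sync(a, A, 16);      // warp-synchronizing loads
        wmma::load_matrix_sync(b, B, 16);
        wmma::mma_sync(acc, a, b, acc);        // the tensor core matrix FMA
        wmma::store_matrix_sync(C, acc, 16, wmma::mem_row_major);
    }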
     
    nnunn, psurge, pharma and 1 other person like this.
  20. psurge

    Regular

    Joined:
    Feb 6, 2002
    Messages:
    948
    Likes Received:
    46
    Location:
    LA, California
    I'm not sure that devs knowing what they are doing automatically means it's possible to get acceptable quality with fp16. Maybe it is, maybe not - that's the question. There are some obvious things one can do - with matrix-multiply-accumulate, vertex translation can happen via the fp32 addend matrix - but I'm not a graphics programmer, so I'm not really sure where the precision is needed. I also didn't realize the matrices needed to be 16x16. I guess that might help in some ways, since each component of the transform matrix could be represented as the unevaluated sum of two fp16 numbers, giving you extra mantissa precision. But I'm too dumb to figure out a way to set things up without the 16x16 transform matrix containing lots of zero blocks, so unless there's some way to skip MMAs for all-zero 4x4 sub-blocks, I'm guessing there would be no performance gain (and it's all irrelevant anyway if consumer Volta omits the tensor cores).

    Maybe skinning meshes would be a better application?
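
    For what it's worth, a hypothetical sketch (not from the slides) of the "unevaluated sum of two fp16 numbers" idea above: split each fp32 element into a high and a low half, then accumulate both partial products in the fp32 accumulator.

    Code:
    // Split one fp32 value into two fp16 halves: hi keeps the leading ~11
    // mantissa bits, lo keeps the rounding residual, so accumulating
    // A_hi*x + A_lo*x in fp32 recovers extra precision.
    #include <cuda_fp16.h>

    struct SplitHalf { __half hi; __half lo; };

    __device__ SplitHalf split_fp32(float m) {
        __half hi = __float2half(m);                     // leading bits
        __half lo = __float2half(m - __half2float(hi));  // residual bits
        return {hi, lo};
    }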
     