Nvidia Volta Speculation Thread

Discussion in 'Architecture and Products' started by DSC, Mar 19, 2013.

  1. pharma

    Veteran Regular

    Joined:
    Mar 29, 2004
    Messages:
    2,933
    Likes Received:
    1,629
  2. CSI PC

    Veteran Newcomer

    Joined:
    Sep 2, 2015
    Messages:
    2,050
    Likes Received:
    844
    Guys, it was clearly talking about both mixed-precision SgemmEx and single-precision FP32 SGEMM together in the same context. The quote is also quite clear on this, and it is obvious that SGEMM is FP32, in this case matrix-matrix multiply, which fits the context in which I used it.
    The key point is the relative gain of up to 1.8x (I thought it was roughly 2x from what I heard) with SGEMM on the V100; that relative gain between the CUDA versions is not coming out of thin air.

    Also nice to see that with the newer CUDA 9/cuDNN 7 release for Volta, FP16 training is now fully supported.

    Edit:
    Going back and re-reading it, it definitely supports my reading, so here it is again in bold showing that it applies to both.
    Notice they are not separated (in fact it is 'and' when talking about SGEMM) after 'accelerated by Tesla V100 Tensor Cores', and both are clearly part of the same sentence about the Tensor Cores (which makes sense if one understands SGEMM).
    Like I said, we have not heard or read everything there is about these function units/cores.
    Cheers
     
    #242 CSI PC, May 12, 2017
    Last edited: May 12, 2017
    Razor1 and pharma like this.
  3. Kaotik

    Kaotik Drunk Member
    Legend

    Joined:
    Apr 16, 2003
    Messages:
    8,186
    Likes Received:
    1,841
    Location:
    Finland
    That's the thing: it doesn't specify that the FP32 results are accelerated by Tensor Cores. Just read it again if you don't see it.
    It states that the new cuBLAS is "Built for Volta and Accelerated by Tesla V100 Tensor Cores". That doesn't mean every function it offers is accelerated by Tesla V100 Tensor Cores; for that statement to hold true it's enough that even one function is. In this case the mixed-precision results are accelerated by Tensor Cores, the FP32 results are not.
    And if that's not enough, just look at the picture: one result says "V100", the other says "V100 Tensor Cores". Guess which is which?
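    To make the distinction concrete, here is a rough sketch of the two cuBLAS calls being argued about (illustrative only: matrix sizes, allocations and handle setup are omitted, and the tensor-op opt-in shown is the CUDA 9 cublasSetMathMode mechanism):
    Code:
    #include <cublas_v2.h>
    #include <cuda_fp16.h>

    // Sketch only: n x n matrices, column-major, error checking omitted.
    void sgemm_vs_mixed(cublasHandle_t handle, int n,
                        const float *A32, const float *B32, float *C32,
                        const __half *A16, const __half *B16, float *Cmix)
    {
        const float alpha = 1.0f, beta = 0.0f;

        // Pure FP32 SGEMM: runs on the ordinary FP32 ALUs, no tensor cores involved.
        cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                    &alpha, A32, n, B32, n, &beta, C32, n);

        // Mixed precision: FP16 storage for A/B, FP32 compute and output.
        // With tensor-op math enabled (CUDA 9 on a V100), this is the call
        // cuBLAS can route to the tensor cores.
        cublasSetMathMode(handle, CUBLAS_TENSOR_OP_MATH);
        cublasSgemmEx(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                      &alpha, A16, CUDA_R_16F, n, B16, CUDA_R_16F, n,
                      &beta, Cmix, CUDA_R_32F, n);
    }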
     
  4. Razor1

    Veteran

    Joined:
    Jul 24, 2004
    Messages:
    4,232
    Likes Received:
    749
    Location:
    NY, NY
    It's not the same as packed math. What I think they are doing is leveraging the new scheduler and the caching to keep the tensor cores better utilised, which gives the increased performance, pretty much eliminating the mixed-precision performance problems Pascal had.

    What I think AMD did with Vega is make their ALUs more robust; that is why they seem to get a full 2x, where nV is getting under 2x.

    Which way is better? I don't think it matters in the end, really; what matters is the end performance of the entire chip.
     
    pharma likes this.
  5. CSI PC

    Veteran Newcomer

    Joined:
    Sep 2, 2015
    Messages:
    2,050
    Likes Received:
    844
    I never said every function, or that it must be FP16 GEMM all the way through for the Tensor cores, or that their use must be exclusive.
     
    #245 CSI PC, May 12, 2017
    Last edited: May 12, 2017
  6. Kaotik

    Kaotik Drunk Member
    Legend

    Joined:
    Apr 16, 2003
    Messages:
    8,186
    Likes Received:
    1,841
    Location:
    Finland
    Oh ffs - the only point I've been trying to make the whole time is that, unlike you claim, the Tensor Cores have absolutely no function whatsoever if you're using pure FP32 precision like that cuBLAS FP32 SGEMM. It's always, no exceptions, an FMA with FP16 x FP16 + FP16/FP32.
     
  7. pharma

    Veteran Regular

    Joined:
    Mar 29, 2004
    Messages:
    2,933
    Likes Received:
    1,629
    I think we will need Nvidia to clarify when more info is available.
     
    Razor1 likes this.
  8. CSI PC

    Veteran Newcomer

    Joined:
    Sep 2, 2015
    Messages:
    2,050
    Likes Received:
    844
    No, you have taken my OP and my posts since then totally out of context, haven't tried to explain how SGEMM manages a 1.8x relative performance gain, and keep going on about only one aspect of the Tensor cores; several pages have now been wasted on this.
    BTW, do you also say the CUDA cores are HFMA2 with no exceptions on P100?
    Oh wait, they are FP32 cores that also do FMA/GEMM at FP16 as well as FP32/etc.

    How did mixed-precision GEMM work on P100 with CUDA 8, given that (by your logic) it obviously cannot use the same cores as the single-precision GEMM maths in the cuBLAS library (which is what was used for the context of relative gains between V100 and P100, for both FP32 and FP16)?
    Before answering: this changed compared to CUDA 7.5 and has had to change again to support Volta, now as CUDA 9 with the new/updated libraries.

    Anyway, maybe first explain the up to 1.8x relative gain (meaning both GPUs are more equalised and the focus is only on a specific function, which for their context and point was GEMM, single and mixed precision) before we continue wasting time arguing.
    That means it is not thread scheduling or cache improvements, because the scope of what I posted is very specific and the gain is too great anyway.
    The only current observation is that the Tensor cores are not exclusive when it comes to their use with the cuBLAS library and possibly other related libraries (though, as I said repeatedly earlier, there may be limitations involved in terms of flexibility).

    But I have said enough on this for now, as it is probably getting boring for everyone.
    Edit:
    All of the above comes back to matrix multiplication in linear algebra computations or convolutions.

    Edit 2:
    Late-night posting; I noticed I had used DP2A for context (technically not quite correct), but the better case is HFMA2 for P100, so I changed it.
     
    #248 CSI PC, May 13, 2017
    Last edited: May 13, 2017
  9. Anarchist4000

    Veteran Regular

    Joined:
    May 8, 2004
    Messages:
    1,439
    Likes Received:
    359
    [image]
    The tensor core is literally this. The entire performance improvement comes down to this:
    The gains are the result of lots of ALUs and predictable access patterns reducing pressure on the registers and scheduling.
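    As a scalar reference for the published op (D = A x B + C on 4x4 tiles, FP16 inputs, FP32 products and accumulation) - just a sketch of the arithmetic, not of how the hardware is organised:
    Code:
    #include <cuda_fp16.h>

    // Reference arithmetic for one tensor core op: D = A x B + C on 4x4 tiles,
    // FP16 inputs, with products and accumulation carried in FP32.
    void tensor_op_reference(const __half A[4][4], const __half B[4][4],
                             const float C[4][4], float D[4][4])
    {
        for (int i = 0; i < 4; ++i)
            for (int j = 0; j < 4; ++j) {
                float acc = C[i][j];                            // FP32 accumulator
                for (int k = 0; k < 4; ++k)
                    acc += __half2float(A[i][k]) * __half2float(B[k][j]);
                D[i][j] = acc;
            }
    }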
     
  10. pharma

    Veteran Regular

    Joined:
    Mar 29, 2004
    Messages:
    2,933
    Likes Received:
    1,629
    http://www.technewsworld.com/story/84528.html
     
  11. Voxilla

    Regular

    Joined:
    Jun 23, 2007
    Messages:
    711
    Likes Received:
    282
    In hardware, these matrix computations are typically done with a so-called 'systolic array'.
    Google's TPU is also based on this, albeit computing 256x256 matrix multiplications at 8-bit.
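    For anyone curious, here is a toy software model of the idea (purely illustrative, not how the TPU or Volta actually organise it): operands enter the edges of the array skewed in time, hop one processing element per cycle, and each PE performs one multiply-accumulate per cycle.
    Code:
    // Toy output-stationary systolic array: PE (i,j) owns output C[i][j]; the
    // operand pair for step k reaches it at cycle i + j + k, one MAC per cycle.
    #define N 4   // N x N array of PEs multiplying N x N matrices

    void systolic_matmul(const float A[N][N], const float B[N][N], float C[N][N])
    {
        for (int i = 0; i < N; ++i)
            for (int j = 0; j < N; ++j)
                C[i][j] = 0.0f;

        for (int t = 0; t < 3 * N - 2; ++t)         // total wavefront length
            for (int i = 0; i < N; ++i)
                for (int j = 0; j < N; ++j) {
                    int k = t - i - j;              // operand pair arriving at PE (i,j) this cycle
                    if (k >= 0 && k < N)
                        C[i][j] += A[i][k] * B[k][j];
                }
    }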
     
    pharma, Lightman, Razor1 and 2 others like this.
  12. Razor1

    Veteran

    Joined:
    Jul 24, 2004
    Messages:
    4,232
    Likes Received:
    749
    Location:
    NY, NY
    Hmm, always wondered why they call it systolic - does it have anything to do with pressure? Register pressure, I'm presuming.
     
  13. pharma

    Veteran Regular

    Joined:
    Mar 29, 2004
    Messages:
    2,933
    Likes Received:
    1,629
    NVIDIA Launches GPU Cloud Platform to Simplify AI Development
    http://nvidianews.nvidia.com/news/nvidia-launches-gpu-cloud-platform-to-simplify-ai-development
     
    #253 pharma, May 13, 2017
    Last edited: May 14, 2017
  14. LiXiangyang

    Newcomer

    Joined:
    Mar 4, 2013
    Messages:
    81
    Likes Received:
    47
    Volta is a really interesting design; it looks almost like an ASIC for DL.

    As for the tensor core, I believe Nvidia has already made it clear: it uses FP16 only for storage, does the multiply ops in full precision (FP32), and then adds the result to an FP32 variable.

    That's why they use SgemmEx for the benchmark against Pascal, since SgemmEx works in exactly the same way (in contrast to the Hgemm routine): it loads data of various precisions but does the computation (multiply + add) in full (FP32) precision.

    Which means the tensor core is a full-precision matrix multiplication unit with FP16 data input; that's why Nvidia is confident in using the tensor core not just for inference but for training the network as well.

    And since the computation is fully FP32, just like SgemmEx, the precision loss is limited to the FP16 storage stage, so I can think of many applications outside the DL domain that could benefit from the vast computing resources GV100 offers.
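    That description also lines up with how the CUDA 9 preview exposes the unit to programmers: warp-level matrix fragments with half inputs and a float accumulator. A minimal sketch of that WMMA API (a single 16x16x16 tile, with launch code and leading-dimension handling omitted):
    Code:
    #include <mma.h>
    #include <cuda_fp16.h>
    using namespace nvcuda;

    // One warp computes D = A x B + C for a single 16x16x16 tile on the tensor
    // cores: half fragments in, float accumulator out (sm_70, CUDA 9).
    __global__ void wmma_tile(const half *A, const half *B, float *C)
    {
        wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
        wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b_frag;
        wmma::fragment<wmma::accumulator, 16, 16, 16, float> c_frag;

        wmma::fill_fragment(c_frag, 0.0f);          // C = 0 for this sketch
        wmma::load_matrix_sync(a_frag, A, 16);
        wmma::load_matrix_sync(b_frag, B, 16);
        wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);
        wmma::store_matrix_sync(C, c_frag, 16, wmma::mem_row_major);
    }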
     
    pharma, CSI PC, xpea and 1 other person like this.
  15. Rootax

    Veteran Newcomer

    Joined:
    Jan 2, 2006
    Messages:
    1,180
    Likes Received:
    584
    Location:
    France
    How much of all this new tech do we need in a gaming GPU? (And by "need", I mean with price/power in mind.)
     
  16. pharma

    Veteran Regular

    Joined:
    Mar 29, 2004
    Messages:
    2,933
    Likes Received:
    1,629
    If the direction gaming companies take is to use FP16 for calculations, then it's significant.
    Next id Tech relies heavily on FP16 calculations
    https://www.golem.de/news/id-software-naechste-id-tech-setzt-massiv-auf-fp16-berechnungen-1704-127494.html
     
  17. CSI PC

    Veteran Newcomer

    Joined:
    Sep 2, 2015
    Messages:
    2,050
    Likes Received:
    844
    Also to add, here are Google's details on the TPU: In-Datacenter Performance Analysis of a Tensor Processing Unit
    https://arxiv.org/ftp/arxiv/papers/1704/1704.04760.pdf

    The paper was made public earlier this year.
    Edit:
    I forgot to mention that Xilinx is working towards Int8 deep learning optimisation: https://www.xilinx.com/support/documentation/white_papers/wp486-deep-learning-int8.pdf
    It also has some nice relevant links at the bottom of the paper.
    Cheers
     
    #257 CSI PC, May 14, 2017
    Last edited: May 14, 2017
  18. CSI PC

    Veteran Newcomer

    Joined:
    Sep 2, 2015
    Messages:
    2,050
    Likes Received:
    844
    What will also be interesting going forward is whether GV102 and the other lower GPUs (or some of them at least) will have an Int8/DP4A instruction version of the Tensor core, aimed more at Int8 inferencing/convolution.
    Just as with the P100, they are careful not to talk about this, and it makes sense from a differentiation perspective between GV100 and GV102.
    Quite a few of the Nvidia deep learning/GEMM libraries are also optimised for Int8 and are maturing in terms of use and development.
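    For reference, the existing DP4A path (GP102/GP104 and other sm_61 parts) boils down to a four-way int8 dot product accumulated into int32; a minimal sketch:
    Code:
    // DP4A sketch (sm_61+): each call does four int8 x int8 products summed into
    // an int32 accumulator, the building block for int8 inference GEMMs.
    __global__ void dp4a_dot(const int *a_packed, const int *b_packed, int n, int *out)
    {
        if (blockIdx.x == 0 && threadIdx.x == 0) {
            int acc = 0;
            for (int i = 0; i < n; ++i)
                acc = __dp4a(a_packed[i], b_packed[i], acc);  // 4 MACs per call
            *out = acc;
        }
    }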
    Cheers
     
  19. mczak

    Veteran

    Joined:
    Oct 24, 2002
    Messages:
    3,015
    Likes Received:
    112
    This design of course makes a lot of sense. An fp16 x fp16 multiply giving an accurate (wider) output should have just about the same hw cost as an fp16 x fp16 multiply with fp16 output. But saying this is doing the multiply ops in full precision is a bit misleading imho, even if technically true. It should indeed help quite a lot: if you did the matrix multiply with just individual fp16 FMAs, the result would be much worse.
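    That gap is easy to see numerically with a quick host-side sketch (using the CUDA half conversion helpers; host-side availability of these conversions is assumed here) comparing the same FP16 inputs accumulated at fp16 versus at fp32:
    Code:
    #include <cuda_fp16.h>
    #include <stdio.h>

    // Same FP16 inputs, two accumulation strategies: rounding to fp16 after every
    // step (individual fp16 FMAs) versus keeping the running sum in fp32.
    int main()
    {
        const int K = 4096;
        __half acc16 = __float2half(0.0f);
        float  acc32 = 0.0f;

        for (int k = 0; k < K; ++k) {
            float a = __half2float(__float2half(0.01f));   // value as stored in fp16
            acc16 = __float2half(__half2float(acc16) + a); // fp16 accumulation
            acc32 += a;                                    // fp32 accumulation
        }
        // The fp16 accumulator stalls once additions fall below half an ulp
        // (around 32 here); the fp32 accumulator stays near the expected ~41.
        printf("fp16 accumulate: %f   fp32 accumulate: %f\n",
               __half2float(acc16), acc32);
        return 0;
    }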
     
  20. CSI PC

    Veteran Newcomer

    Joined:
    Sep 2, 2015
    Messages:
    2,050
    Likes Received:
    844
    Not sure if you are actually disagreeing about the core's full-precision computation if you agree it may be technically correct; I agree denormal/subnormal numbers are a consideration, but that depends upon the operation/function requirements.

    On P100, as an example, you can use the HFMA2 instruction (and I assume on V100 as well), which is fp16/fp16/fp16 (computation at fp16 with a single rounding for accuracy).
    The Hgemm routine (also available on P100) is likewise fp16/fp16/fp16 (computation at fp16), but limited to fp16 input and output.
    Importantly (in your context), the CUDA cores since GP100 Pascal and CUDA 8 also support fp16/fp16/fp32 (computation at fp32) with SgemmEx, and in the performance chart they show the Tensor cores with the same or a comparable operation; that makes sense for Tensor cores in training.
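    For concreteness, the input/compute combinations above map roughly onto device-side pieces like these (just a sketch; the GEMM routines themselves of course live inside cuBLAS):
    Code:
    #include <cuda_fp16.h>

    // HFMA2-style element: fp16 x fp16 + fp16, two lanes, computed and stored at
    // fp16 with a single rounding (sm_53+).
    __device__ __half2 fma_fp16_compute(__half2 a, __half2 b, __half2 c)
    {
        return __hfma2(a, b, c);
    }

    // SgemmEx-style element: fp16 storage for the inputs, product and
    // accumulation carried at fp32 - the same shape as the tensor core op.
    __device__ float fma_fp32_compute(__half a, __half b, float c)
    {
        return fmaf(__half2float(a), __half2float(b), c);
    }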

    I really doubt we have been told everything yet about the Tensor cores (or the indirectly related instructions available for CUDA 9/the latest compute SM versions), especially as they are presenting performance data using the more flexible SgemmEx, which allows variable input and output types albeit with computation at fp32. The point being, the Tensor core is a full-precision-capable core with flexibility depending upon the instructions supported.
    However, it comes down to the optimised libraries, the instructions supported and the compute/CUDA level, and here Nvidia's focus is on cuBLAS GEMM. The same applies when discussing the DP4A instruction, which is not found on P100 (I doubt it exists on V100) and will again be on GV102 and lower, likely IMO in Tensor core form primarily for int8 inference.

    I just had a look at the Twitter accounts of various Nvidia engineers, and Mark Harris said earlier in the week:
    Which matches up with the SgemmEx use and also ties into the further explanation given for the chart showing up to 9x greater performance than P100 - this was in the follow-up article giving a high-level overview of CUDA 9.
    The irony, though, is that they never used fp16 on the P100 when comparing to V100 and the Tensor cores, just Sgemm/FMA (fp32) on P100.

    It seems this is the year DL moves to fp16 for training and int8 for inferencing more broadly, beyond the specialist cases to date.
    At minimum, one of the GTC presentations/workshops this year was on how to convert training for existing FP32 DL systems to FP16.
    Cheers

    Edit:
    Just being lazy; there is no difference in my post between upper- and lower-case 'fp'.
     
    #260 CSI PC, May 15, 2017
    Last edited: May 15, 2017