Nvidia Volta Speculation Thread

Discussion in 'Architecture and Products' started by DSC, Mar 19, 2013.

  1. CarstenS

    Veteran Subscriber

    Joined:
    May 31, 2002
    Messages:
    4,915
    Likes Received:
    2,237
    Location:
    Germany
    Given the cloud-based NGC, I think we are in the process of crossing the divergence threshold between gaming and specialized HPC. Maybe with Volta, one foot's already through the door.
     
  2. CSI PC

    Veteran Newcomer

    Joined:
    Sep 2, 2015
    Messages:
    2,050
    Likes Received:
    844
    At a higher level, a comparison between GP100 CUDA cores and Tensor cores using the same FP16 instruction:
    The 8 Tensor cores in a single SM have around 4x the theoretical peak throughput of 64 FP32 CUDA cores in their single SM (P100 was the reference).
    So per core it is 32x 'faster' with that mixed-precision GEMM instruction than the P100 CUDA core.
    This raises the question of just what the limit is on the number of Tensor cores per SM as implemented in Volta without further changes to the architecture, and whether they will/can in future introduce, say, a GV100b that has a reduced FP64 ratio but more Tensor cores.
    Or is 8 Tensor cores per SM currently the hard limit, in a similar way to the 64 FP32 cores per SM we see specifically with the P100/V100? Seems quite probable.

    Just as a note:
    In the Nvidia devblog they mention 8x faster than P100 at a per-SM level, but crucially that reference used the CUDA cores as 'standard' FP32 rather than with the FP16 instruction.
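    Those ratios can be checked with quick back-of-the-envelope arithmetic. A minimal sketch, assuming each Tensor core performs a 4x4x4 matrix FMA per clock (64 FMAs = 128 FLOPs) and each CUDA core does one FP32 FMA per clock, or two FP16 FMAs at double rate:

    ```python
    # Back-of-the-envelope check of the throughput ratios quoted above.
    # Assumptions (not from the post itself): a Tensor core does a 4x4x4
    # matrix FMA per clock = 64 FMAs = 128 FLOPs; a CUDA core does one
    # FP32 FMA (2 FLOPs) or a paired FP16 FMA (4 FLOPs) per clock.
    TENSOR_CORES_PER_SM = 8
    CUDA_CORES_PER_SM = 64

    tensor_flops_per_clk = TENSOR_CORES_PER_SM * 4 * 4 * 4 * 2  # 1024
    fp32_flops_per_clk = CUDA_CORES_PER_SM * 2                  # 128
    fp16_flops_per_clk = CUDA_CORES_PER_SM * 4                  # 256 (double-rate FP16)

    print(tensor_flops_per_clk / fp32_flops_per_clk)  # 8.0, Nvidia's "8x vs standard FP32"
    print(tensor_flops_per_clk / fp16_flops_per_clk)  # 4.0, the per-SM FP16 figure

    # Per core: one Tensor core vs one CUDA core running FP16
    per_tensor_core = tensor_flops_per_clk / TENSOR_CORES_PER_SM  # 128
    per_cuda_core = fp16_flops_per_clk / CUDA_CORES_PER_SM        # 4
    print(per_tensor_core / per_cuda_core)  # 32.0, the "32x per core" figure
    ```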

    Cheers
     
    #262 CSI PC, May 15, 2017
    Last edited: May 15, 2017
  3. CSI PC

    Veteran Newcomer

    Joined:
    Sep 2, 2015
    Messages:
    2,050
    Likes Received:
    844
    Yeah, possibly seeing it now (definitely a thought for the future), but such specialist clouds already exist using the P100 and DGX-1 for HPC compute to hire, and I appreciate NGC is just taking that to the next service level.
    I think the dilemma for Nvidia, sooner rather than later, is how they will continue to differentiate between their flagship multi-purpose mixed-precision DP GPU (P100, V100, and whatever comes next) and the next tier below (P40 and 'V40'), especially as Nvidia comes under more pressure from other DL/compute hardware and the requirement for complete mixed-precision support across all GPUs (meaning the loss of the current differentiation we see between GP100 and the rest of the range; it seems GV100 still has this for now).
    And the GP102/GV102 must also feed their solution (compute version/instructions/CUDA/libraries/etc.) into the lower GPU models (at least for Tesla with DL, and Quadro), as they are a viable alternative for many.
    All of this is compounded by more workload-specific cores being added to the architecture (now Tensor) and required with ever greater performance.
    Especially when one considers scale-up/out costs and the purpose of a node.
    As you say, at some point Nvidia will have to specialise this a bit more, but they will also need to consider carefully how to do that for such a broad, encompassing design that has interconnected, dependent R&D through all segments from consumer to 'Tegra'.

    Cheers
     
    #263 CSI PC, May 15, 2017
    Last edited: May 15, 2017
  4. mczak

    Veteran

    Joined:
    Oct 24, 2002
    Messages:
    3,018
    Likes Received:
    114
    No, I don't disagree with that. I suppose it's just the naming; to me this is really fp16 multiplies with higher-precision output, not fp32 multiplies with fp16 inputs, simply because that's a lot closer to what the hw is actually doing.
    So I take issue with the claim that the fp16 inputs are "only due to storage" (e.g. less register bandwidth required). The multipliers would definitely have been more expensive with fp32 inputs (if it were only due to storage, the tensor unit should support fp32 "half-matrix" multiplies at the same rate, and I very highly doubt it can do that).
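    That distinction (FP16 multiplies feeding a wider accumulator) can be emulated in NumPy. This is only an illustrative sketch of the D = A x B + C tile operation Nvidia describes, not how the hardware executes it:

    ```python
    import numpy as np

    # Illustrative emulation of the tensor-core primitive D = A @ B + C:
    # A and B are FP16 4x4 tiles; multiply and accumulate happen at FP32
    # precision ("fp16 multiplies with higher precision output").
    def tensor_core_mma(A, B, C):
        assert A.dtype == np.float16 and B.dtype == np.float16
        # The product of two FP16 values is exactly representable in FP32,
        # and the running sum stays in FP32.
        return A.astype(np.float32) @ B.astype(np.float32) + C.astype(np.float32)

    rng = np.random.default_rng(0)
    A = rng.standard_normal((4, 4)).astype(np.float16)
    B = rng.standard_normal((4, 4)).astype(np.float16)
    C = np.zeros((4, 4), dtype=np.float32)

    D = tensor_core_mma(A, B, C)
    print(D.dtype)  # float32
    ```

    The exactness comment holds because two 11-bit FP16 significands multiply into at most 22 significant bits, which fits within FP32's 24-bit significand; only the accumulation then rounds.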
     
    #264 mczak, May 15, 2017
    Last edited: May 15, 2017
  5. CSI PC

    Veteran Newcomer

    Joined:
    Sep 2, 2015
    Messages:
    2,050
    Likes Received:
    844
    It could be debatable *shrug*.
    The same can be said about the P100 with compute version SM_60 and the SGEMMEx instruction if wanting FP16/FP16/FP32; there is no real difference between that and how Nvidia manages double the FP16 TFLOPS relative to FP32 on the P100. Both are in effect FP32 cores operating on FP16 with the same or a very similar instruction, albeit the Tensor cores are more optimised/specialised for matrix multiplication, and so have 4x greater theoretical throughput in a per-SM comparison.

    Cheers
     
    #265 CSI PC, May 15, 2017
    Last edited: May 15, 2017
  6. CSI PC

    Veteran Newcomer

    Joined:
    Sep 2, 2015
    Messages:
    2,050
    Likes Received:
    844
    Probably makes sense to put this here as well as in the Vega speculation thread, as the interesting aspect is GDDR6 from SK Hynix (if one believes Nvidia has moved away from Samsung, and given the recent news brief from SK Hynix about a client using it in 2018).

    SK Hynix Q2 '17 graphics memory catalogue:
    GDDR6 8Gb, 12 & 14 Gbps, available Q4 '17
    GDDR5 8Gb, 10 Gbps, Q4 '17 (needs 1.55V)

    More relevant to the competition than to Nvidia:
    HBM2 4GB, 1.6 Gbps, only 4-Hi stack, Q2 '17 - so it looks like this is not changing anytime soon, and it has implications for others in both capacity and bandwidth, especially as Samsung is now very close to hitting 2 Gbps.


    I really cannot see 8-Hi anytime soon from any of the manufacturers, especially for GPUs.
    Anyway, it looks like in Q4 there will be a choice between 14 Gbps GDDR5X (looking that way for Micron), 12/14 Gbps GDDR6 (if SK Hynix is not being over-optimistic), or Samsung and their GDDR6.
    Cheers
     
    #266 CSI PC, May 15, 2017
    Last edited: May 15, 2017
    ImSpartacus and iMacmatician like this.
  7. ToTTenTranz

    Legend Veteran Subscriber

    Joined:
    Jul 7, 2008
    Messages:
    10,846
    Likes Received:
    5,418
    There's a Vega 10 GPU appearing in a Compubench result showing 16GB VRAM. Vega 10 uses 2 stacks so that would mean it's two stacks of 8GB each.
    8-Hi stacks might be coming sooner than you think, though probably not for consumer cards.
     
  8. CSI PC

    Veteran Newcomer

    Joined:
    Sep 2, 2015
    Messages:
    2,050
    Likes Received:
    844
    Considering all the various wild leaks we have had in the past year relating to AMD, I prefer to wait before accepting that over international manufacturing catalogues.
     
  9. Razor1

    Veteran

    Joined:
    Jul 24, 2004
    Messages:
    4,232
    Likes Received:
    749
    Location:
    NY, NY

    Well GDDR6 is in Hynix's updated catalog now but still no 8 hi HBM2 so....
     
    #269 Razor1, May 15, 2017
    Last edited: May 15, 2017
    pharma and CSI PC like this.
  10. ToTTenTranz

    Legend Veteran Subscriber

    Joined:
    Jul 7, 2008
    Messages:
    10,846
    Likes Received:
    5,418
    Of course we should all wait. I did use the word might in my previous post, didn't I?
     
  11. Voxilla

    Regular

    Joined:
    Jun 23, 2007
    Messages:
    734
    Likes Received:
    309
    Not so great news for Nvidia's Volta V100, as Google revealed some details of its TPU2:
    180 TFLOP/s, for both training and inferencing.
     
    BRiT likes this.
  12. pharma

    Veteran Regular

    Joined:
    Mar 29, 2004
    Messages:
    3,424
    Likes Received:
    2,076
    I believe they are comparing against Nvidia’s K80 GPU, not Volta. Looks like they are using 4 to get to 180 TFLOPS.
     
    Razor1 likes this.
  13. xpea

    Regular Newcomer

    Joined:
    Jun 4, 2013
    Messages:
    404
    Likes Received:
    431
    Well, it's 180/4 = 45 TFLOPS per ASIC, which is very poor performance in my opinion for dedicated silicon. The important sentence in the source article:
    GV100 is 120 TFLOPS per GPU (960 TFLOPS in an HGX rack) and can also be used for HPC (strong FP64) and any other more challenging workflows (with the new thread scheduler)
     
    Lightman, pharma and Razor1 like this.
  14. CSI PC

    Veteran Newcomer

    Joined:
    Sep 2, 2015
    Messages:
    2,050
    Likes Received:
    844
    Yeah, the TPU board size sort of reminds me of 2 GPUs in a blade.
    So one needs to consider 2x V100 as a possible comparison.

    However, this highlights why IMO Nvidia cannot wait too long for the GV102, which one could sort of expect to have 8 Tensor cores per SM but operating on INT8 with associated instructions and optimised libraries; that would mean a theoretical peak possibly double the 120 FP16 TFLOPS of the V100. Yeah, reality will not reach that, but it will be a very competitive real-world figure.
    It would also mean a more cohesive platform across Volta for training and inference when moving from V100 to V102, with regard to CUDA and library version compatibility and coding. That is one area Google commented upon, the complexity/delay of moving a design from one system to another, and the same point extends to moving between Pascal and Volta in terms of CUDA/optimised library/framework support versions and coding.
    Still not as ideal as having it all on one node as TPU2 does (though does TPU2 lose any peak throughput/optimisation by doing this?), but with the software-platform support Nvidia builds into its ecosystem it should still be acceptable until the next generation of Nvidia tech.

    Cheers
     
    #274 CSI PC, May 18, 2017
    Last edited: May 18, 2017
  15. xpea

    Regular Newcomer

    Joined:
    Jun 4, 2013
    Messages:
    404
    Likes Received:
    431
    that's the only V100 bench published so far (by Nvidia):
    [image: Nvidia's published V100 benchmark chart]
     
  16. CSI PC

    Veteran Newcomer

    Joined:
    Sep 2, 2015
    Messages:
    2,050
    Likes Received:
    844
    Not keen on that chart myself, as I think they skewed the P100 a bit, but then it is not far from the 8x-over-P100-standard-FP32 comment from their blog that I quote below.

    There are others, such as the Caffe2 ResNet one, but they may not be ideally optimised (as noted below).
    But with TPU2 and Nvidia, it is best for now IMO to use what each states as its theoretical peak FMA.
    It seems TPU2 is 180 TFLOPS FP16 Tensor DL/matrices, and V100 is 30 TFLOPS FP16, or 120 TFLOPS Tensor DL/matrices.

    I mentioned it in other posts, but Nvidia has stated on their site that 8 Tensor cores have 8x greater throughput than 64 CUDA cores operating as 'standard FP32' (their wording, I think).
    So per SM the ratio is 4x the FP16 theoretical throughput, and that also comes to 120 TFLOPS FP16 Tensor for V100, lining up with the official 30 TFLOPS FP16.

    Here is one of the Caffe2 ResNet charts; the part worth referencing for context in this discussion is the far right, which is FP16 inferencing, pretty close to the 4x figure and in line with the comment above:
    [image: Caffe2 ResNet performance chart; far-right bars show FP16 inferencing]

    Yeah, I appreciate that since this is real-world performance, one also has to adjust for the fact that the V100 has around 31% more SMs, so I guess between this chart and yours it does come to maybe 4x greater performance over FP16 P100 in DL.

    But I think we are digressing, although it is interesting.
    Edit:
    Here is the only comparison Nvidia has stated against the general CUDA mixed-precision FP32 core; it is per SM and so it scales.
    So for FP16 theoretical throughput it becomes a 4x increase with DL using Tensor cores, which ties in with the FP16 chart on the right and with your chart when making allowances.
    And that follows through with the official V100 spec:
    FP32: 15 TFLOPS
    FP16: 30 TFLOPS
    Tensor: 120 TFLOPS (FP16 mixed-precision matrices, DL)
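    Those official peaks can be reconstructed from the unit counts. A rough sketch, assuming 80 SMs and a ~1530 MHz boost clock (figures from Nvidia's Volta announcement, not from this thread), with each Tensor core doing a 4x4x4 FMA (64 FMAs) per clock:

    ```python
    # Reconstructing the V100 peak-throughput figures from unit counts.
    # Assumptions: 80 SMs, ~1530 MHz boost clock (per Nvidia's Volta
    # announcement); FMA counts as 2 FLOPs.
    SMS = 80
    BOOST_HZ = 1.53e9
    FP32_CORES_PER_SM = 64
    TENSOR_CORES_PER_SM = 8

    fp32 = SMS * FP32_CORES_PER_SM * 2 * BOOST_HZ          # one FMA/core/clock
    fp16 = fp32 * 2                                        # double-rate FP16
    tensor = SMS * TENSOR_CORES_PER_SM * 64 * 2 * BOOST_HZ  # 64 FMAs/core/clock

    print(f"FP32   ~{fp32 / 1e12:.1f} TFLOPS")    # ~15.7
    print(f"FP16   ~{fp16 / 1e12:.1f} TFLOPS")    # ~31.3
    print(f"Tensor ~{tensor / 1e12:.1f} TFLOPS")  # ~125.3
    ```

    The results land slightly above the round 15/30/120 numbers quoted, which is consistent with Nvidia rounding the marketing figures down.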

    Cheers
     
    #276 CSI PC, May 18, 2017
    Last edited: May 18, 2017
    Razor1 likes this.
  17. A1xLLcqAgt0qc2RyMz0y

    Veteran Regular

    Joined:
    Feb 6, 2010
    Messages:
    1,205
    Likes Received:
    605
    Why should Nvidia worry? The TPU2 is 45 TFLOPS per chip, not 180 TFLOPS.

    Google brings 45 teraflops tensor flow processors to its compute cloud

     
    Razor1 and ImSpartacus like this.
  18. manux

    Veteran Regular

    Joined:
    Sep 7, 2002
    Messages:
    1,959
    Likes Received:
    828
    Location:
    Earth
    Is Google going to sell TPU2 to competing cloud providers? If not, then TPU2 only increases the market for AI acceleration chips, as Microsoft, Amazon, and the rest would need to somehow compete with Google. These are fun times, as the market hasn't been divided up yet.

    TPU2 looks interesting, but there is way too little information out to really understand what the chip is useful for and what it isn't. What is the precision of computation? Is there a big difference between how inference and training are implemented/perform? Bandwidth and amount of memory are unknown (i.e. the dataset sizes TPU2 can handle).

    There is a tradeoff between flexibility and performance. My hunch is that TPU2 is not as flexible as GPUs, which in turn are not as flexible as CPUs. I don't think AI is solved to the point where the perfect algorithm and accelerator can be built. If the design is not flexible enough, it could be a dead end outside current use cases. This is not to say current use cases wouldn't be valid, but it's just a bit more research and work before Skynet is here.

    What looks most interesting to me in Volta is its flexibility. Deploy one type of GPU to the cloud and you can sell computing time for DNN training, inferencing, and also generic HPC workloads (strong 64-bit floating-point performance). Also, the scalability via NVLink 2, especially together with IBM CPUs, could be a game changer.
     
    #278 manux, May 18, 2017
    Last edited: May 18, 2017
  19. pharma

    Veteran Regular

    Joined:
    Mar 29, 2004
    Messages:
    3,424
    Likes Received:
    2,076
    https://www.nextplatform.com/2017/05/17/first-depth-look-googles-new-second-generation-tpu/
     
  20. iMacmatician

    Regular

    Joined:
    Jul 24, 2010
    Messages:
    781
    Likes Received:
    211
    What is the TDP of the TPU2?
     