Nvidia Volta Speculation Thread

Discussion in 'Architecture and Products' started by DSC, Mar 19, 2013.

  1. Voxilla

    Regular

    Joined:
    Jun 23, 2007
    Messages:
    832
    Likes Received:
    505
Right, the article title is a bit misleading ("Machine-learning ASIC doubles performance"), as TPU1 already did 90 TOPS of 8-bit inferencing. In that respect it indeed looks rather poor.
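To put a number on that (a back-of-the-envelope sketch, using only figures quoted in this thread: TPU1's 90 TOPS Int8 peak and the 45 TFLOPS FP16 per-chip figure attributed to TPU2 later in the thread):

```python
# Figures quoted in this thread, nothing else assumed:
tpu1_int8_tops = 90      # TPU1 peak 8-bit inference throughput
tpu2_fp16_tflops = 45    # TPU2 per-chip FP16 figure mentioned below

# "Doubling" only holds when comparing unlike precisions; per chip,
# the raw op count went *down* moving from Int8 to FP16.
ratio = tpu2_fp16_tflops / tpu1_int8_tops
print(ratio)  # 0.5
```

Of course Int8 ops and FP16 FLOPs are not interchangeable, which is exactly why the headline comparison is misleading in both directions.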
     
  2. xpea

    Regular

    Joined:
    Jun 4, 2013
    Messages:
    551
    Likes Received:
    783
    Location:
    EU-China
Looking at the huge heat sinks, I would say at least 80 W per ASIC:
[image]

which IMHO looks really bad relative to its performance
     
    trinibwoy likes this.
  3. Voxilla

    Regular

    Joined:
    Jun 23, 2007
    Messages:
    832
    Likes Received:
    505
    That would be 320 Watt / 180 TFLOP/s vs 300 Watt / 120 TFLOP/s for Volta, so actually better.
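A quick check of that arithmetic (a sketch only: the 80 W/ASIC figure is the estimate from the post above, not a published number, and the 120 TFLOPS V100 figure is the tensor-op peak discussed below):

```python
# Perf/W comparison from the numbers in this thread:
# a 4-chip TPU2 board at an assumed ~80 W per ASIC vs. V100 at 300 W.
tpu2_tflops = 180.0
tpu2_watts = 4 * 80.0          # 320 W total, per the estimate above
v100_tflops = 120.0            # tensor-op peak
v100_watts = 300.0

tpu2_perf_per_watt = tpu2_tflops / tpu2_watts   # 0.5625 TFLOPS/W
v100_perf_per_watt = v100_tflops / v100_watts   # 0.4 TFLOPS/W
print(tpu2_perf_per_watt > v100_perf_per_watt)  # True
```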
     
    iMacmatician likes this.
  4. CSI PC

    Veteran

    Joined:
    Sep 2, 2015
    Messages:
    2,050
    Likes Received:
    844
Well, it combines both training and inference in one node, while Nvidia has so far positioned the V100 more like the P100 (focused on training down to FP16). It also depends on board size and power demand (important in nodes/clusters), and importantly it would be a fair bit cheaper than the V100 (Nvidia needs cheaper models for inferencing).
It is also quite probable that TPU2 inference would have higher peak throughput than the 45 TFLOPS (which looks to be the FP16 figure) when it comes to Int8 inference.
I doubt it worries Nvidia, but it is more competitive and more cohesive in some ways, at least until other Volta parts back up the V100 in the DL ecosystem.

    Edit:
Yeah, as mentioned by another poster, the TPU already had 90 TOPS of Int8 peak performance.
It looks like it all comes down to how well optimised the performance is when comparing the full workflow against raw matrix throughput.
    Cheers
     
    #284 CSI PC, May 18, 2017
    Last edited: May 18, 2017
  5. Kaotik

    Kaotik Drunk Member
    Legend

    Joined:
    Apr 16, 2003
    Messages:
    10,244
    Likes Received:
    4,465
    Location:
    Finland
V100 doesn't do 120 FP16 TFLOPS; it does 120 TFLOPS only with very specific tensor ops.
     
  6. Voxilla

    Regular

    Joined:
    Jun 23, 2007
    Messages:
    832
    Likes Received:
    505
Losing a high-profile customer like Google undoubtedly must worry Nvidia.
There is also the prospect that other high-profile customers with deep pockets may get inspired by this and start fabricating their own ASICs for DL.
     
  7. CSI PC

    Veteran

    Joined:
    Sep 2, 2015
    Messages:
    2,050
    Likes Received:
    844
For reference, here is the power demand of the TPU, still 4 per board according to the * note:
[image]

Not sure how to correlate TPU2 to Nvidia in terms of size and power demand, because so far Nvidia has been coy about Int8 inference, and it looks like they will do the same differentiation as before: GV100 for FP16 training and GV102 for Int8.
That would also mean taking into account multiple nodes, increasing size and overall power demand, although I guess one could argue TPU2 is doing one or the other, so for the Nvidia environment maybe only one of the nodes should count towards power.
Edit:
And yeah, I appreciate one cannot use this as a direct reflection of the current TPU with FP16 DL training.
    Cheers
     
    #287 CSI PC, May 18, 2017
    Last edited: May 18, 2017
  8. CSI PC

    Veteran

    Joined:
    Sep 2, 2015
    Messages:
    2,050
    Likes Received:
    844
You really love arguing with me and being pedantic while ignoring context - something you have been doing to me now for the last 6-12 months.
You did notice I mentioned Tensor cores in relation to that 120 TFLOPS in the part you quoted?
I think everyone who has been following this thread understands by now what the Tensor core is used for. In theory it goes beyond just DL, potentially covering further matrix maths/algorithms and instructions, but my context was a response to others discussing the TPU, meaning DL. Anyway, it is quite clear I am not talking about general CUDA core performance.

Looking back, several of my posts over the last couple of days make my context quite clear.
BTW, you did not correct xpea or Voxilla, who themselves used the 120 TFLOPS or related figures without full semantics in every post on the subject, and yet they understood the context.
     
    #288 CSI PC, May 18, 2017
    Last edited: May 18, 2017
  9. CSI PC

    Veteran

    Joined:
    Sep 2, 2015
    Messages:
    2,050
    Likes Received:
    844
Yeah, to some extent, and now Google is also looking to offer this as a service and expand upon it with others.
But Nvidia has their own plan, and they are pretty competitive if they roll out more of the Volta GPUs for the DL ecosystem and provide cohesiveness between the different training/inference nodes-GPUs in terms of compatible environment versions and instructions (CUDA/libraries/compute SM version). That is one reason I think the Volta rollout will be faster than most expect: Nvidia is facing stiffer competition in this field, and Intel will also have their specialised solution in the future (albeit from a catch-up position).
The headache for Nvidia, like I mentioned earlier, is that at some point they will need a node able to offer both training and inference at a pretty high performance level (some will still want dedicated training/inference nodes, and others will not, just like Google and some other large-scale deployers). I have argued this for some time, and it will affect how they position the Gx100 and Gx102 in future.

    Cheers
     
    #289 CSI PC, May 18, 2017
    Last edited: May 18, 2017
  10. xpea

    Regular

    Joined:
    Jun 4, 2013
    Messages:
    551
    Likes Received:
    783
    Location:
    EU-China
Well, we can't compare like that. The GV100's 300 W covers the rasterizer, geometry engines, texture units, FP64, FP32, INT32 and the hardware scheduler, while TPU2 only does FP16 matrix work. In other words, I highly doubt GV100 will consume 300 W when only using the tensor cores...
     
  11. entity279

    Veteran Subscriber

    Joined:
    May 12, 2008
    Messages:
    1,332
    Likes Received:
    500
    Location:
    Romania
If an ASIC for tensor operations is in the same ballpark, in power consumption, as a full-fledged GPU, isn't it pretty much a perf/W failure? (Sure, there's pricing and production costs that could be factored in.)
     
  12. Voxilla

    Regular

    Joined:
    Jun 23, 2007
    Messages:
    832
    Likes Received:
    505
The 'only' tensor cores deliver 128 FP16/FP32 mixed-precision FMAs per SM quadrant, compared to 16 FP32 FMAs from the regular cores.
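To make the numerics concrete (a sketch, not Nvidia's code): each Volta tensor core performs a 4x4 matrix FMA, D = A*B + C, with FP16 inputs and FP32 accumulation, so one tensor core does 64 FMAs per clock and two of them supply the 128 FMAs discussed here. A minimal Python model of those numerics, using the `struct` module's half-precision format to emulate the FP16 rounding of the inputs:

```python
import struct

def to_fp16(x):
    # Round a Python float to IEEE 754 half precision (what the tensor
    # core takes as input), then widen back for further arithmetic.
    return struct.unpack('e', struct.pack('e', x))[0]

def tensor_core_fma(A, B, C):
    # One tensor-core-style op on 4x4 tiles: D = A*B + C, FP16 inputs,
    # accumulation kept in higher precision. Models numerics, not hardware.
    n = 4
    D = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            acc = C[i][j]            # accumulator stays wide (FP32-like)
            for k in range(n):       # 4*4*4 = 64 FMAs per op
                acc += to_fp16(A[i][k]) * to_fp16(B[k][j])
            D[i][j] = acc
    return D

I4 = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
Z4 = [[0.0] * 4 for _ in range(4)]
print(tensor_core_fma(I4, I4, Z4)[0][0])  # 1.0
```

The wide accumulator is the key design point: summing FP16 products in FP32 avoids most of the precision loss that pure-FP16 training would suffer.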
     
  13. SpaceBeer

    Newcomer

    Joined:
    Apr 15, 2017
    Messages:
    48
    Likes Received:
    22
    Location:
    The Balkans
    If you only do tensor operations, you don't care about any other things GPU can do.
     
  14. CSI PC

    Veteran

    Joined:
    Sep 2, 2015
    Messages:
    2,050
    Likes Received:
    844
I think more data is needed on the TPU2, specifically how that 180 TFLOPS FP16 applies to the DL workflow beyond the BLAS/GEMM matrix computation and instructions that Nvidia's Tensor cores target.
Maybe I'm reading too much into the article, but it seems to imply they expanded FP16 operation to more of the DL workflow, such as loading/analysing/understanding the data and data objects, computing error, and other operations: https://medium.com/the-downlinq/establishing-a-machine-learning-workflow-530628cfe67
There are quite a lot of tasks and operations associated with the DL workflow, and if TensorFlow leans on the TPU in this way it would give it a compute-number advantage.
Nvidia states they also optimise/accelerate their libraries for the TensorFlow framework on Volta, but it remains to be shown how effective that is.
    Cheers
     
    #294 CSI PC, May 18, 2017
    Last edited: May 18, 2017
  15. CarstenS

    Legend Subscriber

    Joined:
    May 31, 2002
    Messages:
    5,800
    Likes Received:
    3,920
    Location:
    Germany
I think he was referring to the usually massive advantage in raw performance or, more recently, in performance per watt that ASICs deliver, which is where their true value shows.
     
  16. Malo

    Malo Yak Mechanicum
    Legend Subscriber

    Joined:
    Feb 9, 2002
    Messages:
    8,931
    Likes Received:
    5,530
    Location:
    Pennsylvania
    Since when were general purpose accelerators compared to ASICs? Yes v100 has tensor cores now to speed up those certain types of algorithms but obviously a client who only wants to do tensor operations wouldn't be looking at GPUs anyway?
     
  17. entity279

    Veteran Subscriber

    Joined:
    May 12, 2008
    Messages:
    1,332
    Likes Received:
    500
    Location:
    Romania
Yup, this one is not an order of magnitude better, not even close. I'd be curious to know why.

Or, as Malo says above, how come we ended up comparing these two very different chips? They should each have very disjoint usages, even when both are used as accelerators for deep learning.

I'm just asking what for me is an obvious question; sorry if the answers are just as obvious to some of you.
     
    #297 entity279, May 18, 2017
    Last edited: May 18, 2017
  18. CarstenS

    Legend Subscriber

    Joined:
    May 31, 2002
    Messages:
    5,800
    Likes Received:
    3,920
    Location:
    Germany
    Since around the low to mid 280's in this thread, I'd say.

    And I think it's a fair point, especially when your alternative is to have separate installations for all special cases or if you can swat a couple of HPC-flies with the same installation. Depending on your needs, of course, the former or the latter might make more sense.
     
    Malo likes this.
  19. Razor1

    Veteran

    Joined:
    Jul 24, 2004
    Messages:
    4,232
    Likes Received:
    749
    Location:
    NY, NY

There is an overlap area for both, and Google does use Nvidia GPUs for AI tasks outside of training, so this is where Volta will come in handy.
     
  20. sebbbi

    Veteran

    Joined:
    Nov 14, 2007
    Messages:
    2,924
    Likes Received:
    5,296
    Location:
    Helsinki, Finland
Seems that people value different things. Those tensor processors don't interest me much... but I am awestruck by that configurable 128 KB L1 cache design.

Nvidia implies that their new L1 cache is as fast as groupshared memory. That's going to change the way GPUs are programmed. Nvidia showed a benchmark where they reached 93% of the performance of a groupshared-memory-optimised algorithm without using groupshared memory at all (thanks to the huge, fast L1 caches). Soon GPU compute shaders won't be as hard to program as Cell SPUs. I need to learn new tricks :)
     
    Lightman, BRiT, Heinrich4 and 3 others like this.

  • About Us

    Beyond3D has been around for over a decade and prides itself on being the best place on the web for in-depth, technically-driven discussion and analysis of 3D graphics hardware. If you love pixels and transistors, you've come to the right place!

    Beyond3D is proudly published by GPU Tools Ltd.