Nvidia Volta Speculation Thread

Discussion in 'Architecture and Products' started by DSC, Mar 19, 2013.

Tags:
  1. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,579
    Likes Received:
    4,799
    Location:
    Well within 3d
12nm is an optimization of 16nm, so it would be a tweaked version of a mature process--which is probably what makes a chip pushed to the very size limit possible.
    Nvidia would likely be fighting with Apple and other vendors for the next actual shrink, and the timing would not be good, since that process would be of uncertain availability and early in the yield-maturation curve. Even a shrunk GV100 would be a big chip compared to the SoCs that typically come out that early in a node's life.

    One thought I have is that this might minimize the risk of a process/manufacturing slip relative to Nvidia's HPC contract deadlines.
     
    xpea, CSI PC, Clukos and 4 others like this.
  2. Razor1

    Veteran

    Joined:
    Jul 24, 2004
    Messages:
    4,232
    Likes Received:
    749
    Location:
    NY, NY
    Lightman and pharma like this.
  3. xpea

    Regular

    Joined:
    Jun 4, 2013
    Messages:
    552
    Likes Received:
    787
    Location:
    EU-China
    They are high level (one would say artistic) views but the changes are important between GP100 and GV100 SMs:
    GP100:
    [image: GP100 SM block diagram]

    GV100:
    [image: GV100 SM block diagram]

    And you can't get such a massive power-efficiency gain with only a few tweaks
     
    Heinrich4, Razor1 and pharma like this.
  4. CarstenS

    Legend Subscriber

    Joined:
    May 31, 2002
    Messages:
    5,800
    Likes Received:
    3,920
    Location:
    Germany
    There are quite a few things that were not depicted in the artistic impressions (I like that phrase) of earlier SMs. Starting from a simple block diagram and trying to extrapolate the inner workings of those chips is like sailing through uncharted territory. IOW, they paint what they want you to believe.
     
    BRiT, Razor1 and lanek like this.
  5. CSI PC

    Veteran

    Joined:
    Sep 2, 2015
    Messages:
    2,050
    Likes Received:
    844
    The point is they still increased FP32 throughput by 41.5% when the physical side only grew by 33%, while also adding even more functionality, all within the same TDP, on a massive die, and on a similar 16nm node (albeit the latest iteration, which TSMC in typical fashion calls 12nm).
    They still do packed accelerated 2xFP16 math in V100, just like P100, btw.
    You get 30 TFLOPS FP16 plus the Tensor matrix function units/cores; the Tensor matrix units will generally have more specific uses, primarily towards Deep Learning frameworks/apps (in future it is in theory possible to use this for professional rendering/modelling, not talking about gaming though).
    Those Tensor function units/cores can also be used for FP32 operations as well, so I think that works out to around 2x faster with a DL-supported framework/app.
    Cheers
     
    #225 CSI PC, May 12, 2017
    Last edited: May 12, 2017
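As a quick sanity check on the percentages quoted above, here is a minimal Python sketch. The specs used are the publicly quoted launch figures (assumed here, not taken from this thread): P100 with 3584 FP32 cores at ~1480 MHz boost on a 610 mm² die, V100 with 5120 FP32 cores at ~1455 MHz boost on an 815 mm² die.

```python
# Hedged sanity check of the throughput vs. die-size percentages,
# using publicly quoted launch specs (assumptions, not thread data):
#   P100: 3584 FP32 cores @ 1480 MHz boost, 610 mm^2
#   V100: 5120 FP32 cores @ 1455 MHz boost, 815 mm^2

def fp32_tflops(cores, boost_mhz):
    """Peak FP32 TFLOPS: cores * 2 ops per FMA * clock."""
    return cores * 2 * boost_mhz * 1e6 / 1e12

p100 = fp32_tflops(3584, 1480)   # ~10.6 TFLOPS
v100 = fp32_tflops(5120, 1455)   # ~14.9 TFLOPS

# ~40.4% with these clocks; NVIDIA's rounded marketing figures
# (15 vs 10.6 TFLOPS) give the 41.5% quoted in the thread.
print(f"FP32 throughput gain: {100 * (v100 / p100 - 1):.1f}%")
print(f"Die area gain:        {100 * (815 / 610 - 1):.1f}%")  # ~33.6%
```

The small gap between 40.4% and the quoted 41.5% comes entirely from rounding in the marketing TFLOPS numbers, not from any extra units.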
  6. CSI PC

    Veteran

    Joined:
    Sep 2, 2015
    Messages:
    2,050
    Likes Received:
    844
    From what I understand it was always meant to be 16nm; the 10nm rumour came out of left field, quite possibly reinforced by WCCFT, and was then spammed around.
    Like I mentioned before, Pascal was a technical risk milestone towards Volta and the big HPC project obligations, albeit one that also added great value to Nvidia.
    Pascal only came into being once those very large and high-profile projects with IBM were agreed.
    It also explains why they started the node shrink with a 610mm2 die first, even with the massive cost/risk that carried on its own, let alone also adding HBM2/NVLink/etc to GP100.
    Cheers
     
  7. Kaotik

    Kaotik Drunk Member
    Legend

    Joined:
    Apr 16, 2003
    Messages:
    10,245
    Likes Received:
    4,465
    Location:
    Finland
    No they can't; they're explicitly FMA units which do FP16 multiplication and FP16/FP32 accumulation.
     
  8. CSI PC

    Veteran

    Joined:
    Sep 2, 2015
    Messages:
    2,050
    Likes Received:
    844
    We have not been told everything about the Tensor function units/cores, and what we have is high level.
    Pretty sure Nvidia will soon be posting actual results using the Tensor matrix cores against FP32 training and FP16 inferencing, or matrix FP32 and mixed precision, but maybe it is a sleight of hand *shrug*.

    Edit:
    The Caffe blog is showing cuDNN 6 training at FP32 on P100 versus cuDNN 7 training at FP16 on V100, a performance difference of 2.4x.

    Cheers
     
    #228 CSI PC, May 12, 2017
    Last edited: May 12, 2017
  9. CSI PC

    Veteran

    Joined:
    Sep 2, 2015
    Messages:
    2,050
    Likes Received:
    844
    Seems there is more data now posted:
    [image: benchmark chart]
    I appreciate this is not exactly the same as what you were inferring, but my point was beyond that anyway in that it does help FP32 as well (in context of my posts).
    Cheers
     
    #229 CSI PC, May 12, 2017
    Last edited: May 12, 2017
    Razor1, xpea and pharma like this.
  10. lanek

    Veteran

    Joined:
    Mar 7, 2012
    Messages:
    2,469
    Likes Received:
    315
    Location:
    Switzerland
    Effectively, at best they just represent the numbers of units and the differentiation (and again, sometimes it is even flawed).
     
    Razor1 and CSI PC like this.
  11. Arun

    Arun Unknown.
    Legend

    Joined:
    Aug 28, 2002
    Messages:
    5,023
    Likes Received:
    302
    Location:
    UK
    So if I understand correctly from their public claims:
    - 32-wide warps being scheduled on a 16-wide ALU... (G80 and Rogue say hi! :) )
    - This allows them to decode 2 instructions in the time the main FP32 pipeline executes 1, so they can run some other instructions "for free" (G80 and Rogue say hi again!)
    - Register file *might* be over-specced, it *looks* like it's still 32-wide despite the ALU being 16-wide, which allows these 2 instructions to have a lot less restrictions than they had on G80
    - Thanks to the above, you can run FP32 and INT32 in parallel - and maybe FP32 and FP64 in parallel? Or FP32 and Tensor Cores in parallel?

    Alternatively maybe the register file isn't 2x the ALU width, and they rely on their "register cache" (and/or extra register banks) to execute multiple pipelines in parallel?

    The one thing I'm most surprised by is that they have *full-speed* INT32; presumably full-speed INT32 MULs, not just INT32 ADDs? If so, that's quite expensive... More expensive than the Vec2 INT16/Vec4 INT8 they had on GP102/104. I wonder why? I can't think of any workload that needs it, the only benefit I can think of is simplifying the scheduler a bit. Are they reusing the INT32 units in some clever way - e.g. sharing them with FP64? There are 'interesting' unusual ways you could share some of the INT logic with some other pipelines (rather than just over-speccing the mantissa for FP32 and clock gating it when not doing INT32) but that wouldn't allow full generality of co-issue with all other pipelines which is what they are implying.

    Also for their tensor core performance numbers, they are comparing to "FP16 input with FP32 compute" on Pascal; I'm going to guess that's effectively using the FP32 pipeline rather than the FP16 pipeline on Pascal, so 9x isn't quite as mind-blowing (but still impressive); they could have gotten a *lot* of performance simply by supporting the same Vec2 INT16 dot product instruction they had on GP102/GP104 with FP16 instead (since INT16 accumulating to INT32 is good for inference, but not always good enough for training).

    I'm also curious about the effective parallelism required to make use of the tensor cores; it's effectively a 4x4x4 matrix multiply, but according to their blog, that's per-thread so across a warp it becomes a 16x16x16 matrix multiply (based on a 32-wide warp I'd expect 16x16x8, not sure at what level the extra 2x happens). That's a *massive* amount of parallelism required for a single instruction, which is fine for convolutional networks, but it sounds like it might not work as well for e.g. recurrent networks in which case you'd want to stick to the CUDA cores? The ideal scenario would be if the scheduler could efficiently use the tensor cores and CUDA cores in parallel for different warps on the same scheduler.

    EDIT: Actually if it's 16x16x16, that sounds like they might be running 4x4x4 matrix multiplies sequentially, so the tensor cores might be exposed with descheduling data fences with a long latency to get results back. If so it seems likely that FP32/FP16 CUDA cores and Tensor Cores can work in parallel (but for workloads where you can use the Tensor Cores, it probably makes more sense to only use them, since they should be more power efficient).
     
    #231 Arun, May 12, 2017
    Last edited: May 12, 2017
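To make the tensor-core operation Arun describes concrete, here is a pure-Python sketch of the mixed-precision FMA D = A*B + C on a 4x4 tile: A and B rounded to FP16, products accumulated in FP32 (using `struct`'s binary16/binary32 formats to emulate the rounding). This illustrates the numerics only; it says nothing about how NVIDIA actually implements the unit.

```python
import struct

def to_fp16(x):
    """Round a Python float to the nearest IEEE binary16 value."""
    return struct.unpack('e', struct.pack('e', x))[0]

def to_fp32(x):
    """Round a Python float to the nearest IEEE binary32 value."""
    return struct.unpack('f', struct.pack('f', x))[0]

def tensor_op(a, b, c):
    """D = A*B + C on 4x4 matrices: FP16 multiplicands, FP32 accumulate."""
    n = 4
    d = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            acc = c[i][j]  # FP32 accumulator seeded with C
            for k in range(n):
                # each product of two FP16 inputs is accumulated in FP32
                acc = to_fp32(acc + to_fp16(a[i][k]) * to_fp16(b[k][j]))
            d[i][j] = acc
    return d

identity = [[float(i == j) for j in range(4)] for i in range(4)]
c = [[0.5] * 4 for _ in range(4)]
# With A set to the identity, the unit degenerates to an add: D = B + C
print(tensor_op(identity, identity, c)[0])  # [1.5, 0.5, 0.5, 0.5]
```

Note the identity-matrix trick at the end: it is the only obvious way such a fixed-function unit could be pressed into doing plain FP32 adds, and even then one addend has passed through FP16.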
  12. pharma

    Veteran

    Joined:
    Mar 29, 2004
    Messages:
    4,894
    Likes Received:
    4,548
    http://it.sohu.com/20170511/n492686208.shtml
    https://translate.google.com/translate?hl=en&sl=zh-CN&tl=en&u=http://it.sohu.com/20170511/n492686208.shtml
     
  13. Arun

    Arun Unknown.
    Legend

    Joined:
    Aug 28, 2002
    Messages:
    5,023
    Likes Received:
    302
    Location:
    UK
    I really hate how NVIDIA always talks about everything "per SM" when only texturing units/local memory/L1 caches/etc are shared across schedulers inside a SM. For most intents and purposes, I feel it makes more sense to think of "units per warp scheduler", i.e. 2 tensor cores per warp scheduler.
     
    Razor1 and CarstenS like this.
  14. CarstenS

    Legend Subscriber

    Joined:
    May 31, 2002
    Messages:
    5,800
    Likes Received:
    3,920
    Location:
    Germany
    The artist's impression does credit that aspect a bit more this time. Much of it was already present with Pascal and Maxwell, IIRC.
     
    #234 CarstenS, May 12, 2017
    Last edited: May 12, 2017
  15. CSI PC

    Veteran

    Joined:
    Sep 2, 2015
    Messages:
    2,050
    Likes Received:
    844
    Yeah, I think it is parallel as well, but it comes down to register pressure/BW and more details about the design and its flexibility. If they had tried something similar with P100 they would have hit those limitations, but I thought this was the next step for the arch in Volta. Some results/figures sort of suggest it is parallel, but I guess detail about the flexibility/limitations is important.
    Cheers
     
    #235 CSI PC, May 12, 2017
    Last edited: May 12, 2017
  16. JF_Aidan_Pryde

    Regular

    Joined:
    Feb 6, 2002
    Messages:
    601
    Likes Received:
    3
    Location:
    New York
    With the new info, I think I've finally managed to do the math to get to Xavier's 30 teraops. Basically it's a sum of tensor ops, FP32 ops, and DLA ops. Initially Xavier was 20 TOPS @ 20 watts; then NVIDIA bumped it to 30 TOPS @ 30 watts, basically a 50% power bump, so I bumped the clocks by 50% as well. This gets to 29 teraops. Let me know what you think.
     

    Attached Files:

  17. Kaotik

    Kaotik Drunk Member
    Legend

    Joined:
    Apr 16, 2003
    Messages:
    10,245
    Likes Received:
    4,465
    Location:
    Finland
    It just says Volta (CUDA 9) is faster than Pascal (CUDA 8) with FP32 ops too, but it doesn't actually say it's because of the tensor cores. I simply don't see any way you could use them to speed up FP32 when they only accept FP16 x FP16 + FP16/FP32.
     
    Deleted member 13524 likes this.
  18. pharma

    Veteran

    Joined:
    Mar 29, 2004
    Messages:
    4,894
    Likes Received:
    4,548
    Published on May 12, 2017

     
  19. psurge

    Regular

    Joined:
    Feb 6, 2002
    Messages:
    955
    Likes Received:
    52
    Location:
    LA, California
    I remembered this paper: http://hwacha.org/papers/scalarization-cgo2013.pdf. Section 2.4 describes the way current GPUs work (i.e. with software or hardware divergence stacks), while section 5.2 describes "stackless temporal-SIMT". It provisions a hardware PC per thread and allows "diverged threads to truly execute independently". It even includes a syncwarp instruction for compiler-managed thread reconvergence. To me, the properties described in the paper sound basically the same as those being claimed for Volta.

    On the other hand, temporal-SIMT doesn't seem to correspond to the following blog post statement: "Note that execution is still SIMT: at any given clock cycle CUDA cores execute the same instruction for all active threads in a warp just as before, retaining the execution efficiency of previous architectures".

    (Edit) These papers also seem interesting and potentially relevant. Disclaimer: I haven't had time to read them:

    https://www.ece.ubc.ca/~aamodt/papers/eltantawy.hpca2014.pdf
    https://www.ece.ubc.ca/~aamodt/papers/eltantawy.micro2016.pdf
     
    #239 psurge, May 12, 2017
    Last edited: May 12, 2017
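The "stackless temporal-SIMT" idea from the Hwacha paper — a hardware PC per thread, independent forward progress for diverged threads, and an explicit syncwarp for compiler-managed reconvergence — can be sketched as a toy interpreter. The instruction encoding and example program here are invented for illustration; real hardware obviously schedules at instruction granularity with very different costs.

```python
SYNC = "syncwarp"  # barrier marker: all live threads must arrive here

def set_reg(key, fn, nxt):
    """Instruction: regs[key] = fn(tid, regs); then jump to pc `nxt`."""
    def op(tid, regs):
        regs[key] = fn(tid, regs)
        return nxt
    return op

def run_warp(program, n_threads=4):
    """Temporal-SIMT toy model: one PC per thread, no divergence stack."""
    pcs = [0] * n_threads                 # a hardware PC per thread
    regs = [{"out": 0} for _ in range(n_threads)]
    arrived = set()                       # threads parked at the barrier
    while True:
        live = [t for t in range(n_threads) if pcs[t] < len(program)]
        if not live:
            return regs
        for tid in live:
            op = program[pcs[tid]]
            if op == SYNC:
                arrived.add(tid)
                if arrived >= set(live):  # last arrival releases everyone
                    for t in arrived:
                        pcs[t] += 1
                    arrived.clear()
            else:
                pcs[tid] = op(tid, regs[tid])  # truly independent progress

# Divergent kernel: even threads take the long path, odd threads the
# short one; both reconverge at SYNC before the common tail.
program = [
    lambda t, r: 1 if t % 2 == 0 else 3,            # 0: branch on tid
    set_reg("out", lambda t, r: r["out"] + 10, 2),  # 1: even path
    lambda t, r: 4,                                 # 2: jump past else
    set_reg("out", lambda t, r: r["out"] + 1, 4),   # 3: odd path
    SYNC,                                           # 4: reconverge
    set_reg("out", lambda t, r: r["out"] * 2, 6),   # 5: common tail
]

print([r["out"] for r in run_warp(program)])  # [20, 2, 20, 2]
```

The common tail multiplies after the barrier, so the result only comes out right because both divergent paths reconverged first — exactly the property the compiler-inserted syncwarp is supposed to guarantee.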
  20. Source for Volta's FP32 units doing 2*FP16 packed math?
    Doesn't make much sense to have that functionality if the Tensor units can also do it; they would probably have the same per-SM throughput as the FP32 units doing packed math.
    Even less so considering that in Pascal the FP32 units will only do 2*FP16 if both halves use the same operation, so the Tensor units would probably get better utilization.


    I see nothing in that left picture claiming the Tensor units are responsible for the increase in FP32 throughput.
    There's a 43% increase in theoretical FP32 throughput, there's higher memory bandwidth, a lot more register-file cache, and there's an improved CUDA 9 stack. All of that contributes to better FP32 benchmark results, but not the Tensor cores.

    The result of the FP16*FP16 matrix multiply is an FP32 matrix, to which a second FP32 matrix is added, so I guess the tensor cores could be used for FP32 ADD operations.
    However, I doubt the Tensor cores are flexible enough to take an FP32 operand from anywhere other than the result of the two-FP16 multiply.
     