Pascal FP16/FP32/INT8 Performance

Discussion in 'Architecture and Products' started by JF_Aidan_Pryde, Aug 12, 2016.

  1. JF_Aidan_Pryde

    Regular

    Joined:
    Feb 6, 2002
    Messages:
    601
    Likes Received:
    3
    Location:
    New York
    Hey guys,

    Wanted to start a clean thread on this since it's a little hard to navigate the uber threads.

    Just want to get a clean record of the instruction rates for the various Pascal-based parts. NVIDIA has provided a sparse matrix of information. Of note are the restriction of FP16 on consumer parts and the special INT8 mode on Titan X, but how these specs fill in for the other GPUs isn't clear.

    Would love to see this table (attached) filled out.

    -James
     

    Attached Files:

  2. MDolenc

    Regular

    Joined:
    May 26, 2002
    Messages:
    690
    Likes Received:
    425
    Location:
    Slovenia
    GP100 is lacking dp4a and dp2a, so that's a 0 in the INT8 row.
    Titan X (GP102) has only compatibility fp16, same as the 1080 (GP104), so 1/64 (though it's really 1/128 vec2). It also has only compatibility double support, so 1/32, same as the 1080.
    INT8 (dp4a) is also full rate on the 1080, so 4x the fp32 rate.
    Also note that anything below fp32 is basically CUDA-only stuff.
     
  3. spworley

    Newcomer

    Joined:
    Apr 19, 2013
    Messages:
    146
    Likes Received:
    190
    That chart would be more useful if you added a DP4A row, which is the simple but useful new integer instruction in non-P100 Pascal.

    Ryan did a great writeup of the fp16x2 compute in his GTX 1080 review. He shows a similar chart that also usefully extends back to Maxwell.

    All Kepler, Maxwell, and Pascal GPUs have quad-rate Int8.
    GP100 and Tegra X1 (Maxwell) have fp16x2, but no other NVidia GPUs do.
    DP4A and DP2A (byte and word dot product accumulate) are on GP106, GP104, and GP102, but not GP100.
    GP100 has 1/2 rate FP64, but GP106, GP104, and GP102 (TitanX) have 1/32 rate.
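
    For reference, here is a minimal sketch of how DP4A would be used from CUDA, assuming the __dp4a intrinsic that the CUDA 8 release candidate exposes for sm_61 parts (the intrinsic name and signature are taken from the RC headers; the final docs aren't out yet, so treat the details as provisional):

        // Dot product of two int8 vectors of length 4*n, packed four bytes per int.
        // Build for sm_61 (GP102/GP104/GP106); GP100 (sm_60) lacks the instruction.
        #include <cuda_runtime.h>

        __global__ void dot_int8(const int* a, const int* b, int n, int* out)
        {
            int acc = 0;
            for (int i = threadIdx.x; i < n; i += blockDim.x)
                acc = __dp4a(a[i], b[i], acc);   // four byte multiplies + 32-bit accumulate per instruction
            atomicAdd(out, acc);
        }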
     
  4. MDolenc

    Regular

    Joined:
    May 26, 2002
    Messages:
    690
    Likes Received:
    425
    Location:
    Slovenia
    Source?
     
  5. spworley

    Newcomer

    Joined:
    Apr 19, 2013
    Messages:
    146
    Likes Received:
    190
    The vmad instruction in CUDA performs bytewise multiply-and-accumulate. Describing its throughput ratio isn't as straightforward as for fp32, since an int32 MAD takes multiple instructions rather than a single clock like fp32 does, which is what gives vmad its apparent throughput bonus relative to that baseline. Even more confusion comes with Kepler, which has even higher-throughput single-clock 4-way SIMD byte operations (but not MAD). No NVIDIA GPU has 4-parallel-bytes-per-clock MAD, including the GTX 1080. I don't know if vmad is exposed to graphics compute; it's not in CUDA C, but it is in CUDA PTX. Likely DP2A and DP4A will be exposed only in PTX as well, but the final CUDA 8.0 documentation hasn't been released yet.
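
    To make the vmad point concrete, here is a minimal inline-PTX sketch of a single-byte multiply-accumulate, the building block being discussed. This is my own illustration, untested, with the operand/selector syntax taken from the PTX ISA description of the scalar video instructions, so treat it as an assumption rather than verified code:

        // d = (byte 0 of a) * (byte 0 of b) + c, via the PTX vmad video instruction.
        // vmad is not exposed in CUDA C, so inline PTX is the usual way to reach it.
        __device__ unsigned int vmad_b0(unsigned int a, unsigned int b, unsigned int c)
        {
            unsigned int d;
            asm("vmad.u32.u32.u32 %0, %1.b0, %2.b0, %3;"
                : "=r"(d) : "r"(a), "r"(b), "r"(c));
            return d;
        }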
     
    sebbbi, pixelio, CSI PC and 1 other person like this.
  6. MDolenc

    Regular

    Joined:
    May 26, 2002
    Messages:
    690
    Likes Received:
    425
    Location:
    Slovenia
    I see. I hadn't realized the throughput on these was that high.
     
  7. sebbbi

    Veteran

    Joined:
    Nov 14, 2007
    Messages:
    2,924
    Likes Received:
    5,288
    Location:
    Helsinki, Finland
    Int32 multiply seems to be roughly 1/3 rate on Kepler and Pascal (compared to fp32). Results here: https://devtalk.nvidia.com/default/topic/948014/forward-looking-gpu-integer-performance/

    In comparison int32 multiply is 1/4 rate on AMD GCN. Int24 multiply is full rate on GCN (Nvidia no longer has fast int24 mul). It is very useful (as you rarely need full 32 bit muls), but unfortunately not exposed on PC DirectX. Bitwise ops and shifts are full rate on GCN. Nvidia has half rate shifts.
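
    For anyone who wants to reproduce numbers like that, a minimal sketch of the kind of throughput kernels involved (my own sketch, not the code from the linked devtalk thread; event timing and launch code omitted):

        // Dependent chains of multiply-adds; compare achieved operations/clock of the two kernels.
        __global__ void chain_int32(int* out, int seed, int iters)
        {
            int x = seed + threadIdx.x;
            for (int i = 0; i < iters; ++i)
                x = x * 0x01000193 + 1;        // 32-bit integer multiply-add chain
            out[threadIdx.x] = x;              // keep the result live so it isn't optimized away
        }

        __global__ void chain_fp32(float* out, float seed, int iters)
        {
            float x = seed + threadIdx.x;
            for (int i = 0; i < iters; ++i)
                x = x * 1.0000001f + 1.0f;     // fp32 FMA chain as the baseline
            out[threadIdx.x] = x;
        }

    (CUDA still has __mul24()/__umul24() intrinsics, but on current architectures they no longer map to a fast path, which matches the point above about int24.)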
     
    Ike Turner, BRiT, spworley and 2 others like this.
  8. lanek

    Veteran

    Joined:
    Mar 7, 2012
    Messages:
    2,469
    Likes Received:
    315
    Location:
    Switzerland
    Stupid question (it's Saturday night, so bear with me): are DP2A and DP4A Nvidia-only "instruction" terms, meaning it's impossible to compare them with anything outside Nvidia?
     
    #8 lanek, Aug 14, 2016
    Last edited: Aug 14, 2016
  9. pixelio

    Newcomer

    Joined:
    Feb 17, 2014
    Messages:
    47
    Likes Received:
    75
    Location:
    Seattle, WA
    spworley likes this.
  10. spworley

    Newcomer

    Joined:
    Apr 19, 2013
    Messages:
    146
    Likes Received:
    190
    On the CPU side, PMADDWD does four parallel 16 bit weighted pair summations, so it's like 4 DP2As. It goes all the way back to MMX, so it's been in your CPUs for 20 years.
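
    To spell that out, a minimal host-side C/C++ sketch using the SSE2 intrinsic for PMADDWD (_mm_madd_epi16); the data layout here is just illustrative:

        // PMADDWD: eight int16 lanes in, four int32 lanes out,
        // each output lane = a[2i]*b[2i] + a[2i+1]*b[2i+1] (i.e. four DP2A-like pair sums).
        #include <emmintrin.h>
        #include <stdint.h>

        int32_t dot16x8(const int16_t* a, const int16_t* b)
        {
            __m128i va = _mm_loadu_si128((const __m128i*)a);
            __m128i vb = _mm_loadu_si128((const __m128i*)b);
            __m128i p  = _mm_madd_epi16(va, vb);
            // Horizontal sum of the four 32-bit partial sums.
            int32_t s[4];
            _mm_storeu_si128((__m128i*)s, p);
            return s[0] + s[1] + s[2] + s[3];
        }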
     
    lanek and pixelio like this.
  11. Xmas

    Xmas Porous
    Veteran Subscriber

    Joined:
    Feb 6, 2002
    Messages:
    3,298
    Likes Received:
    137
    Location:
    On the path to wisdom
    Is there an exact description of DP2A/DP4A available somewhere? The names strongly hint at
    dp2a= a0*b0 + a1*b1 + c
    dp4a= a0*b0 + a1*b1 + a2*b2 + a3*b3 + c
    but are these just (unsigned?) integer multiplications with (effectively) 32-bit integer additions?
     
  12. RecessionCone

    Regular Subscriber

    Joined:
    Feb 27, 2010
    Messages:
    499
    Likes Received:
    177
    Yes. Although I think there are signed variants. The real trick is the 32-bit accumulate.
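
    To spell out those semantics, here is a scalar reference of what the signed dp4a variant is understood to compute, per the formula above (my own sketch of the described behaviour, not vendor documentation):

        // Treat each 32-bit source as four signed bytes, multiply pairwise,
        // and accumulate into a plain 32-bit integer.
        #include <stdint.h>

        int32_t dp4a_ref(int32_t a, int32_t b, int32_t c)
        {
            int32_t acc = c;
            for (int i = 0; i < 4; ++i) {
                int8_t ab = (int8_t)((uint32_t)a >> (8 * i));
                int8_t bb = (int8_t)((uint32_t)b >> (8 * i));
                acc += (int32_t)ab * (int32_t)bb;   // each product fits in 16 bits; the 32-bit add is the key part
            }
            return acc;
        }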
     
  13. lanek

    Veteran

    Joined:
    Mar 7, 2012
    Messages:
    2,469
    Likes Received:
    315
    Location:
    Switzerland
    Has anyone tested them in real code? I've had mixed reports from friends and from people at work (who are coders; I'm not, which is a good reason for me not to go much deeper into this).
     
    #13 lanek, Aug 28, 2016
    Last edited: Aug 28, 2016
  14. pharma

    Veteran Regular

    Joined:
    Mar 29, 2004
    Messages:
    2,929
    Likes Received:
    1,626
    Nvidia: Eight bits ought to be enough for anybody ... doing AI
    http://www.theregister.co.uk/2016/09/13/nvidia_p4_p40_gpu_ai/
     
  15. CSI PC

    Veteran Newcomer

    Joined:
    Sep 2, 2015
    Messages:
    2,050
    Likes Received:
    844
  16. JF_Aidan_Pryde

    Regular

    Joined:
    Feb 6, 2002
    Messages:
    601
    Likes Received:
    3
    Location:
    New York
    Seems like Nvidia is being really smart about how they segment their products:
    – If you want high DP perf, you must buy GP100.
    – GP100 is also the best card for training, since it's the only desktop card that does 2x FP16. Assume FP16 is good for training but INT8 isn't.
    – If you want really fast inference, you must get a P4/P40. Only the "Tesla" variants have 4x INT8 perf (not counting Titan X).
    – If you buy a plain gaming card, e.g. a GTX 1080, you don't get fast FP16, you don't get 4x INT8, and of course no usable FP64.
     
  17. lanek

    Veteran

    Joined:
    Mar 7, 2012
    Messages:
    2,469
    Likes Received:
    315
    Location:
    Switzerland
    - Yes and no. Outside of specific contracts, GP100 is not so easy to obtain, and it is not a "desktop" GPU (Tesla/server only).

    - It's a side effect: GP102 is an SKU created after the initial Pascal lineup, surely due to the cost and delays of HBM2. Where they have been really intelligent is in adding specific new "instructions" in the meantime that were not initially on GP100 but can be used in a specific market; it was also an incentive to bring this SKU to market (an SKU that, after the Titan X, will certainly be used for a 1080 Ti too).

    The big problem of this generation, for both AMD and Nvidia, is HBM2: both have delayed their initial roadmaps. Nvidia can release GP100 with HBM2, but this SKU is only available through selected contracts for supercomputers and other big AI/deep-learning companies (e.g. for automotive). AMD has delayed its initial Vega release to 2017 (their initial roadmaps, like Nvidia's, pointed at 2016 GPUs).

    Well, we could say that Nvidia has more or less kept to its roadmap with GP100 and 3D-stacked memory (HBM2), but honestly it is not really widely available outside supercomputer centers and other contracts (which should be up and running in 2017-2018 if everything goes well anyway).
     
    #17 lanek, Sep 13, 2016
    Last edited: Sep 13, 2016
  18. CSI PC

    Veteran Newcomer

    Joined:
    Sep 2, 2015
    Messages:
    2,050
    Likes Received:
    844
    You can buy the P100 here in the UK from at least one of the Nvidia service providers (not sure which one it was, and albeit within their own platform, though it does not need to be the max configuration), but it is available now.
    Cheers
     
  19. lanek

    Veteran

    Joined:
    Mar 7, 2012
    Messages:
    2,469
    Likes Received:
    315
    Location:
    Switzerland
    You can order it... and it will be there when it can be. I haven't said it isn't available, but it can take time.

    That's not really a problem, though: enterprises that order it often have nearly a year to put in place the processes for upgrades, tests, etc. We are not talking about gamers upgrading.

    I can order one right now and get one, but if I want more and don't have a specific contract for it... maybe in a year.

    Honestly, if it were so massively available, we would have feedback from the industry on the performance of the systems that use it, no? No tests, no leaks, no rumoured performance, no test numbers.

    But maybe you can at least provide a benchmark of how Nvidia's HBM2 implementation works? What is its performance, and does it bring something to the Pascal architecture?

    I won't even imagine how everyone, like me, could compare GP102 vs GP100, considering GP100 has supposedly been available for nearly 6 months...

    Nvidia themselves announced GP100 for Q1 2017; in between they have released GP102...
     
    #19 lanek, Sep 14, 2016
    Last edited: Sep 14, 2016
  20. CSI PC

    Veteran Newcomer

    Joined:
    Sep 2, 2015
    Messages:
    2,050
    Likes Received:
    844
    Not as bad as it used to be it seems though.
    By available I mean immediately (unless that has recently changed), but this is not all Nvidia service providers; in fact there may be only one able to do this in the UK, and I think it is part of their certified platform solutions. No idea about other regions.
    I agree most will be placing orders for scheduled projects, usually at least 3-6 months away for the initial phase, but those tend to be larger-scale deployments/research/academia.
    Also, key clients will be buying direct from Nvidia, but yeah, this is made more complex as they also need to supply their partners such as Cray and IBM, who could be ordering 1000s for each project.
    But then the P100 is a rather special case GPU.

    Cheers

    Edit:
    Also, it seems prices are coming down for quoted single-unit GPUs; one of the solution providers is listing prices before discounts/etc. as under $6k for the PCI-E 12GB, under $7.5k for the PCI-E 16GB, and under $10k for the mezzanine model. Caveat: these 'single' unit/non-platform orders (context: not DGX-1 but a provider's own certified solution platform) were not expected to be available until Sept/Oct.
     
    #20 CSI PC, Sep 14, 2016
    Last edited: Sep 14, 2016