Nvidia Pascal Speculation Thread

Discussion in 'Architecture and Products' started by DSC, Mar 25, 2014.

Thread Status:
Not open for further replies.
  1. keldor

    Newcomer

    Joined:
    Dec 22, 2011
    Messages:
    74
    Likes Received:
    107
    If we're going to be comparing Pascal/Volta to KNL, then we really need to bring up NVLink, since we're already in the specialized hardware realm. They're claiming 50-200 GB/s for it, depending on the number of links - that is, access to host memory at full bandwidth. That means that you'll get just as much DDR4 capacity connected to the Tesla card over NVLink as KNL gets.
     
    pharma and RecessionCone like this.
  2. keldor

    Newcomer

    Joined:
    Dec 22, 2011
    Messages:
    74
    Likes Received:
    107
    From what I gather, a lot of the deep learning work depends on multiplying oddly shaped matrices, that is, long skinny ones. In those cases you need a lot more bandwidth than for square matrices, since you're getting close to vector-matrix multiplication. I believe this is why batch size affects performance to the extent it does.
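    A quick back-of-envelope sketch of that effect (illustrative Python with made-up dimensions, not data from any real network): the arithmetic intensity of A @ B, with A square and B having b columns (the batch), climbs with batch size, so small batches leave the multiply bandwidth-bound.

```python
# Arithmetic intensity (FLOPs per byte of DRAM traffic) of C = A @ B
# with A of n x n and B of n x b, assuming each matrix moves once.
# Dimensions below are illustrative only.

def intensity(n, b, bytes_per_elem=4):
    flops = 2 * n * n * b                            # FMA counts as 2 FLOPs
    traffic = bytes_per_elem * (n * n + 2 * n * b)   # read A, read B, write C
    return flops / traffic

for b in (1, 16, 64, 1024):
    print(f"batch {b:4d}: {intensity(1024, b):6.1f} FLOP/byte")
```

    At batch 1 this degenerates to matrix-vector intensity (below 1 FLOP/byte), which is why small batches are so bandwidth-hungry.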
     
    pharma and RecessionCone like this.
  3. RecessionCone

    Regular Subscriber

    Joined:
    Feb 27, 2010
    Messages:
    499
    Likes Received:
    177
    Technically, this is true. In practice, though, I'm not planning to port my software to IBM POWER, which is sadly necessary if one wants to use NVLink. It's a lot of work to get software running on a different system.
     
  4. silent_guy

    Veteran Subscriber

    Joined:
    Mar 7, 2006
    Messages:
    3,754
    Likes Received:
    1,379
    Do you see a benefit in GPU-to-GPU NVLink? During the GTC keynote, fast exchange of weights between GPUs was listed as one of its major advantages.
     
  5. RecessionCone

    Regular Subscriber

    Joined:
    Feb 27, 2010
    Messages:
    499
    Likes Received:
    177
    It would be useful, for sure. My big problem with NVLink is that it's not routable, and it can't use ribbon cables, so the topology has to be burned into the motherboard. I'm concerned that this will make things low volume and expensive.
     
  6. silent_guy

    Veteran Subscriber

    Joined:
    Mar 7, 2006
    Messages:
    3,754
    Likes Received:
    1,379
    Rather than a dedicated motherboard, I think they'll do dual-GPU boards with NVLink in between.
     
  7. RecessionCone

    Regular Subscriber

    Joined:
    Feb 27, 2010
    Messages:
    499
    Likes Received:
    177
    Yeah, they'll do that. But I use lots more than 2 GPUs, and we're limited by the weakest link.
     
    nnunn likes this.
  8. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    10,873
    Likes Received:
    767
    Location:
    London
    I wonder if NVidia will lock "high performance" cuDNN on Pascal to the Tesla variants, like it did with double-precision.
     
  9. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    10,873
    Likes Received:
    767
    Location:
    London
    SGEMV wants, say, 25 times more bandwidth per FLOP than SGEMM on square matrices, so yes, skinny matrices sound like a problem.

    Are these matrices sparse too? Are the longest dimensions in the 10s or 100s of thousands?
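    A rough way to see where a figure of that order comes from (a sketch with assumed numbers, not measured GPU data): SGEMV needs about 2 bytes of DRAM traffic per FLOP no matter what, while a blocked SGEMM's traffic per FLOP shrinks with the on-chip tile size. With an assumed 64-wide tile the ratio lands around 30x, the same ballpark as the ~25x figure.

```python
# Back-of-envelope bytes-per-FLOP for SGEMV vs. tiled SGEMM.
# 4-byte floats; the tile size is an illustrative assumption,
# not a real GPU spec.

def sgemv_bytes_per_flop(n):
    # y = A @ x: read n*n + n floats, write n floats, do 2*n*n FLOPs.
    return 4 * (n * n + 2 * n) / (2 * n * n)

def sgemm_bytes_per_flop(n, tile=64):
    # With T x T blocking, each A/B element fetched from DRAM is
    # reused ~T times, so traffic is roughly 2*n^3/T + n^2 elements.
    loads = 2 * n**3 / tile + n * n
    return 4 * loads / (2 * n**3)

ratio = sgemv_bytes_per_flop(4096) / sgemm_bytes_per_flop(4096, tile=64)
print(f"SGEMV needs ~{ratio:.0f}x more bandwidth per FLOP")
```

    With ideal (infinite) reuse the ratio grows linearly with matrix size, so the exact multiplier depends on how much on-chip blocking the GEMM achieves.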
     
  10. silent_guy

    Veteran Subscriber

    Joined:
    Mar 7, 2006
    Messages:
    3,754
    Likes Received:
    1,379
    The convolutional layers are, say, 256x256 images that are multiplied, per pixel, with multiple 3x3 or 5x5 kernels. Those are definitely not sparse.
     
  11. RecessionCone

    Regular Subscriber

    Joined:
    Feb 27, 2010
    Messages:
    499
    Likes Received:
    177
    The models I train are primarily RNN (LSTM), so they have dimensions like (1024 X 1024) X (1024 X 64) => 1024 X 64.
    They are dense.
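    Those shapes can be checked directly; a small NumPy sketch of the GEMM described above, plus its (low) arithmetic intensity assuming each matrix moves to DRAM once:

```python
import numpy as np

# The GEMM shape from the post: a 1024x1024 weight matrix times a
# 1024x64 activation batch, producing a 1024x64 result.
W = np.random.rand(1024, 1024).astype(np.float32)
x = np.random.rand(1024, 64).astype(np.float32)

y = W @ x
assert y.shape == (1024, 64)

flops = 2 * 1024 * 1024 * 64                # FMA = 2 FLOPs per element pair
bytes_moved = 4 * (W.size + x.size + y.size)
print(f"{flops / bytes_moved:.1f} FLOP/byte")  # ~28, far below square-GEMM intensity
```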
     
  12. iMacmatician

    Regular

    Joined:
    Jul 24, 2010
    Messages:
    773
    Likes Received:
    200
    I noticed the following a little while ago and didn't see it mentioned here: on page 7 of the presentation, the Tesla GPU for 2012 is the K20, not the K20X, even though both parts launched that year. The same holds for 2010: the M2050 is in the graph even though both it and the M2070 were 2010 products.

    Code:
                        K20   K20X   TITAN
            Chip SMs     15     15      15
         Enabled SMs     13     14      14
    Core clock (MHz)    706    732     837
           SP GFLOPS   3520   3940    4500
           DP GFLOPS   1170   1310    1500
                 TDP    225    235     250
         SP GFLOPS/W     16     17      18
    
    If the 2016 Pascal lineup includes two Teslas then I think it's likely that the 4 DP TFLOPS refers to the lower-end one. In this case, two Pascal Teslas and a Pascal TITAN that are similar to the above Kepler parts in terms of the proportions of enabled SMs, ratios of GFLOPS, and TDPs could be along these lines:

    Code:
                        [P]   [PX]  TITAN "Y"
    
            Chip SMs     32     32         32   (case 1)
         Enabled SMs     28     30         30
    Core clock (MHz)   1100   1150       1300   (approximate)
    
            Chip SMs     36     36         36   (case 2)
         Enabled SMs     31     34         34
    Core clock (MHz)   1000   1000       1150   (approximate)
    
           SP GFLOPS   8000   9000      10000   (approximate)
           DP GFLOPS   4000   4500       5000   (approximate, assuming 1:2 DP)
                 TDP    225    235        250
         SP GFLOPS/W     36     38         40
    
    That gives a more reasonable FLOPS increase of 45-50% from the TITAN X to the higher-end Pascal Tesla and 60-65% from one TITAN to the next. (1.3 GHz seems rather high though, but if it actually has 32 SMs then I expect the die to be a fair amount smaller than the GK110, so we might see (example) 29/31/31 instead of the 28/30/30 calculated from a direct comparison.)
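    The Kepler column above can be sanity-checked from published per-SM figures (192 FP32 lanes per SMX, 2 FLOPs per lane per clock with FMA). The speculative Pascal rows would then imply roughly 128 lanes per SM, Maxwell-style, though that part is pure assumption.

```python
# Reconstructing the Kepler SP GFLOPS column from enabled SMs, the
# published 192 FP32 lanes per SMX, and the core clock.

def sp_gflops(sms, lanes, mhz):
    return sms * lanes * 2 * mhz / 1000   # x2 for FMA

for name, sms, mhz, claimed in [("K20",   13, 706, 3520),
                                ("K20X",  14, 732, 3940),
                                ("TITAN", 14, 837, 4500)]:
    calc = sp_gflops(sms, 192, mhz)
    print(f"{name}: {calc:.0f} GFLOPS (table: {claimed})")
```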
     
    ImSpartacus and nnunn like this.
  13. Frenetic Pony

    Regular Newcomer

    Joined:
    Nov 12, 2011
    Messages:
    331
    Likes Received:
    85
    It'd be odd, though, for Nvidia to brag about its lowest-performing chip. Perhaps it's the lowest-end part they're confident of shipping based on their first engineering samples and yields, and they might reveal a more fully enabled or higher-clocked chip later if yields turn out well.
     
  14. kalelovil

    Regular

    Joined:
    Sep 8, 2011
    Messages:
    555
    Likes Received:
    93
    Perhaps it was just done for marketing reasons, to make the graph's trend look better. Otherwise the 2011-2012 improvement from K20X to K40 would look rather meagre.
    The M2050 and M2070 would produce identical results in the page 7 graphs, since the M2070 just doubles the memory capacity.
     
  15. dbz

    dbz
    Newcomer

    Joined:
    Mar 21, 2012
    Messages:
    98
    Likes Received:
    41
    The M2070 isn't represented in the graphs; I think you'll find it's the M2090 that is shown. The M2050/M2070 are GF100-powered, while the M2090 is GF110, so it has a higher core count and higher frequencies, and thus more bandwidth and floating-point throughput.
     
  16. AnarchX

    Veteran

    Joined:
    Apr 19, 2007
    Messages:
    1,559
    Likes Received:
    34
    #516 AnarchX, Jan 5, 2016
    Last edited: Jan 5, 2016
  17. iMacmatician

    Regular

    Joined:
    Jul 24, 2010
    Messages:
    773
    Likes Received:
    200
    From the second slide on the TechPowerUp page (image not preserved):

    Honestly I'm surprised by the low SP FLOPS of the DRIVE PX 2. It only has slightly more SP FLOPS than the TITAN X and I can't imagine that the Tegra parts take more than a small piece of the 250 W TDP. Is there something I'm missing here?
     
    Alexko and Clukos like this.
  18. AnarchX

    Veteran

    Joined:
    Apr 19, 2007
    Messages:
    1,559
    Likes Received:
    34
    Indeed, kind of low. We're probably talking about 2x ~2B transistors for the Tegras plus 2x ~6B transistors for the Pascal GPUs.

    My guess for the 8 TFLOPs:

    Tegra: 2x 384 SPs @ 0.8 GHz = 1.2 TFLOPs
    Pascal: 2x ~2048 SPs (~4096 total) @ 0.85 GHz = 6.8 TFLOPs

    Since this supercomputer has to perform in harsh conditions (supercars that heat up past 100°C), the clock rates would have to be that low.

    The power budget could be:
    Tegra: 2x 20 W
    GDDR5: 2x 16 W
    Pascal: 2x ~80 W
    Boards/IO: ~18 W
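    Adding up those guesses against the quoted totals (a sketch: every component number is an estimate from the post, and "2x ~4096 SPs" is read as ~4096 FP32 lanes in total, which is the only reading that yields ~6.8 TFLOPS at 0.85 GHz):

```python
# Cross-check of the guessed DRIVE PX 2 budget against the quoted
# totals (~8 TFLOPS SP, 250 W). All component numbers are estimates.

tegra_tflops  = 2 * 384 * 2 * 0.8 / 1000   # 2 Tegras, FMA = 2 FLOPs/lane/clock
pascal_tflops = 4096 * 2 * 0.85 / 1000     # ~4096 lanes total @ ~0.85 GHz
watts = 2 * 20 + 2 * 16 + 2 * 80 + 18      # Tegras + GDDR5 + Pascals + board/IO

print(f"{tegra_tflops + pascal_tflops:.1f} TFLOPS, {watts} W")
```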
     
  19. LiXiangyang

    Newcomer

    Joined:
    Mar 4, 2013
    Messages:
    81
    Likes Received:
    47
    With a 250 W budget, Nvidia managed to put 2 Denver-based Tegras and 2 Pascal GPUs on a single board. Judging by this, I'd expect NV to launch dual-Pascal, HBM2-based Tesla cards from the very beginning of the Pascal life cycle. That would be 2 TB/s of memory bandwidth per GPU slot, a significant performance jump compared to the K80. Can't wait.
     
    nnunn likes this.
  20. Voxilla

    Regular

    Joined:
    Jun 23, 2007
    Messages:
    710
    Likes Received:
    282
    From that it becomes clearer what the next big Pascal, i.e. GP200, will look like:
    4096 SP / 8 TFLOPS SP / 4 TFLOPS DP
    Regarding FP16 performance, NV has created a new metric,
    DLTOPS (deep learning tera-operations per second),
    and GP200 would sit at 24 DLTOPS. The new name suggests it's not the same thing as 24 TFLOPS of FP16.

    For compute applications that need DP and deep learning it will be much better than GM200.
    But for gaming and SP it would only be about a 25% improvement, which is rather small.
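    If the 4096 SP / 8 TFLOPS SP pairing is right, the implied clock follows directly (speculative figures from the post, not published specs):

```python
# Clock implied by 4096 FP32 lanes delivering 8 TFLOPS single precision.
lanes, sp_tflops = 4096, 8.0
clock_ghz = sp_tflops * 1e12 / (lanes * 2) / 1e9   # FMA = 2 FLOPs/lane/clock
print(f"implied clock: {clock_ghz:.2f} GHz")       # just under 1 GHz
```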
     

  • About Us

    Beyond3D has been around for over a decade and prides itself on being the best place on the web for in-depth, technically-driven discussion and analysis of 3D graphics hardware. If you love pixels and transistors, you've come to the right place!

    Beyond3D is proudly published by GPU Tools Ltd.