Nvidia Pascal Speculation Thread

Discussion in 'Architecture and Products' started by DSC, Mar 25, 2014.

Thread Status:
Not open for further replies.
  1. Ext3h

    Regular

    Joined:
    Sep 4, 2015
    Messages:
    428
    Likes Received:
    497
    Not really. What changes with temperature is the required refresh rate, as the leakage of the capacitors increases slightly; apart from that, there is only the regular increase in electrical resistance. Overall, higher temperature just means even higher power consumption. You will eventually experience data corruption if you push the temperature too far.

    But the point about HBM is that you don't even aim to push the access and signal rates to the limits; instead, you achieve the data rate via a very wide parallel bus over a short distance. So even if the RAM gets a bit hotter than before, it's still a net gain in bandwidth and power efficiency, thanks to the simplified signaling. And a major increase in the price tag.
     
    Grall likes this.
  2. silent_guy

    Veteran Subscriber

    Joined:
    Mar 7, 2006
    Messages:
    3,754
    Likes Received:
    1,382
    If Apple can produce a hundred million 100mm2 chips in 16nm, I wouldn't worry too much about a 600mm2 die: the number of identical functional units will only go up, and so will the ability to disable faulty ones with increasingly lower performance impact.
    The moment you can disable 3 or more identical units, the ability to produce a functional, sellable die goes up dramatically.
    If you disable 3 out of, say, 32 shader units in a GP100, you still take less than a 10% performance hit, and you can even compensate for part of that by raising the clock within the same power budget.
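    As a quick sanity check on that math (a rough sketch; the linear-scaling assumption and the 32-unit count are illustrative, not a known GP100 configuration):

```python
# Rough yield/performance sketch for unit disabling (illustrative numbers only).

def perf_hit(disabled: int, total_units: int) -> float:
    """Fraction of peak throughput lost by fusing off `disabled` of
    `total_units` identical units, assuming performance scales linearly
    with the number of active units."""
    return disabled / total_units

# Disabling 3 of 32 hypothetical shader units:
hit = perf_hit(3, 32)
print(f"{hit:.1%}")  # 9.4% -- indeed just under a 10% hit
```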
     
  3. Voxilla

    Regular

    Joined:
    Jun 23, 2007
    Messages:
    832
    Likes Received:
    505
    That would look terrible compared to Maxwell (3072 shaders) and Fury (4096 shaders).
    Pascal's focus is also to be very good for neural-net computation (claimed to be 4 times better than Maxwell, i.e. FP16).
     
  4. gamervivek

    Regular

    Joined:
    Sep 13, 2008
    Messages:
    805
    Likes Received:
    320
    Location:
    india
    4TF DP and 24TF HP would then end up with different ratios for DP and HP performance. Or different boost clocks for different modes, but that would be a drastic difference.
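    A back-of-the-envelope check on why those two rumored numbers force that conclusion (the fixed 1:4 DP:HP per-clock ratio is my assumption for illustration):

```python
# If the ALUs had a fixed 1x DP : 4x HP rate at the same clock,
# 4 TFLOPS DP would imply only 16 TFLOPS HP -- not the rumored 24.
dp_tflops = 4.0
hp_tflops = 24.0

implied_hp_at_fixed_ratio = dp_tflops * 4                     # 16.0
ratio = hp_tflops / dp_tflops                                 # 6.0, not a power of two
clock_scaling_needed = hp_tflops / implied_hp_at_fixed_ratio  # 1.5x higher clock in HP mode

print(ratio, clock_scaling_needed)
```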
     
  5. Razor1

    Veteran

    Joined:
    Jul 24, 2004
    Messages:
    4,232
    Likes Received:
    749
    Location:
    NY, NY
    There won't be different boost clocks for different modes; I'm still leaning towards the same ALUs for all the different precisions.

    Also, I think we are overthinking all this a little bit; taking bits of information from one place and then trying to compare with another source just doesn't work too well ;)
     
  6. Ext3h

    Regular

    Joined:
    Sep 4, 2015
    Messages:
    428
    Likes Received:
    497
    About that...

    I don't think this actually means "FP16 rate @ 4x Maxwell's FP32 rate". Most neural networks still use the logistic sigmoid as the activation function, which requires one exponentiation and one division for every 5 regular (addition or multiplication) FLOPs. The speedup could just as well originate from increased performance of the exponentiation or division instructions.

    FP16 performance for MUL/ADD/FMADD might not be anywhere near as good as everyone is expecting. It could even turn out that the FP32 and FP16 rates are equal for these instructions, and Nvidia's claims would still remain true.
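    For reference, the logistic sigmoid in question; the exponentiation and the division are the special-function part that comes on top of the MAC-heavy layer math (a plain-Python sketch):

```python
import math

def logistic_sigmoid(x: float) -> float:
    """sigma(x) = 1 / (1 + e^-x): one exponentiation plus one division
    per activation value."""
    return 1.0 / (1.0 + math.exp(-x))

print(logistic_sigmoid(0.0))  # 0.5; saturates toward 0 and 1 for large |x|
```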
     
  7. silent_guy

    Veteran Subscriber

    Joined:
    Mar 7, 2006
    Messages:
    3,754
    Likes Received:
    1,382
    My apologies for being pedantic, but in most neural networks, the sigmoid function has been replaced by the computationally much simpler ReLU function f(x) = max(0, x).
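    In code, that really is just a compare-and-select, no transcendentals involved (trivial sketch):

```python
def relu(x: float) -> float:
    """f(x) = max(0, x): a single comparison, no exp or division."""
    return x if x > 0.0 else 0.0

print([relu(v) for v in (-2.0, 0.0, 3.5)])  # [0.0, 0.0, 3.5]
```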
     
  8. RecessionCone

    Regular Subscriber

    Joined:
    Feb 27, 2010
    Messages:
    505
    Likes Received:
    189
    Also, the vast majority of computation in a neural network is in GEMM or convolutions. The nonlinearity is not significant from a computational perspective, although it's important to the algorithm.
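    A rough FLOP count makes the point; the layer shape below is hypothetical, not taken from any specific network:

```python
# FLOPs for one convolutional layer vs its pointwise nonlinearity
# (illustrative layer shape).
H, W = 56, 56          # output spatial size
C_in, C_out = 64, 64   # input/output channels
K = 3                  # kernel size

conv_flops = 2 * H * W * C_out * C_in * K * K   # each multiply-accumulate = 2 FLOPs
relu_ops   = H * W * C_out                      # one max() per output element

# The convolution does 2 * C_in * K * K = 1152x more work than the nonlinearity.
print(conv_flops // relu_ops)  # 1152
```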
     
    silent_guy likes this.
  9. Ext3h

    Regular

    Joined:
    Sep 4, 2015
    Messages:
    428
    Likes Received:
    497
    Thank you, I wasn't aware of GEMM.

    Nvidia's own paper from December 2014, presenting their own convolutional network framework: http://arxiv.org/pdf/1410.0759.pdf
    That should be the framework they based their own numbers on in the original announcement: http://blogs.nvidia.com/blog/2015/03/17/pascal/

    The paper shows something interesting on pages 6 and 7: the peak SP throughput reached. Barely 50% on Maxwell, and even less on Kepler. That's not much, given how promising the approach sounded.
    Not sure what the limiting factor is, though: the transfer from global to shared memory, the bandwidth of the shared memory, or keeping the FPUs utilized. Or maybe just the size of the shared memory, limiting pipelining between wavefronts and thereby effective utilization.

    Either way, there's still a lot of headroom to reach the proclaimed performance improvements, and it's not necessarily all attributed to the FPUs.
     
  10. Voxilla

    Regular

    Joined:
    Jun 23, 2007
    Messages:
    832
    Likes Received:
    505
    Since that paper, Nvidia seems to have already come up with more optimized algorithms.
    http://devblogs.nvidia.com/parallelforall/cudnn-v2-higher-performance-deep-learning-gpus/
    i.e. "The IMPLICIT_PRECOMP_GEMM algorithm is a modification of the IMPLICIT_GEMM approach, which uses a small amount of working space (see the Release Notes for details on how much) to achieve significantly higher performance than the original IMPLICIT_GEMM for many use cases."
    Algorithmic optimizations (software), though, are not a substitute for faster hardware, i.e. a Pascal 4x FP16 GPU.
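    The GEMM formulation being discussed can be sketched with an explicit im2col lowering. The implicit cuDNN variants avoid ever materializing the patch matrix; this pure-Python toy only demonstrates the underlying equivalence, not cuDNN's actual implementation:

```python
# Explicit-GEMM convolution sketch: lower input patches to a matrix (im2col),
# after which convolution is just a matrix multiply.
# Single channel, no padding, stride 1, for brevity.

def im2col(img, k):
    h, w = len(img), len(img[0])
    rows = []
    for i in range(h - k + 1):
        for j in range(w - k + 1):
            rows.append([img[i + di][j + dj] for di in range(k) for dj in range(k)])
    return rows  # one flattened k*k patch per output pixel

def conv_as_gemm(img, kernel):
    k = len(kernel)
    flat_k = [kernel[di][dj] for di in range(k) for dj in range(k)]
    # GEMM step: (num_patches x k*k) matrix times a (k*k) filter vector.
    return [sum(p * w for p, w in zip(patch, flat_k)) for patch in im2col(img, k)]

img = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
kernel = [[1, 0], [0, 1]]  # sums top-left + bottom-right of each 2x2 patch
print(conv_as_gemm(img, kernel))  # [6, 8, 12, 14]
```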
     
  11. RecessionCone

    Regular Subscriber

    Joined:
    Feb 27, 2010
    Messages:
    505
    Likes Received:
    189
    Actually, getting within a factor of two of peak performance is the majority of the optimization; the rest is diminishing returns. cuDNN is on v4 now and is much closer to peak performance than v1. The new hotness for convolution these days is specialization for fixed kernel sizes and more work-efficient algorithms based on decompositions like FFT and Winograd, which can actually reach somewhere around 1.6x peak FMA bandwidth given a good assembly implementation.

    Another thing you might find interesting is how hard it is to get GEMM to perform well for neural networks. Here's a link showing that Nvidia's cuBLAS can be 6x slower at GEMM for recurrent neural networks than a GEMM from Nervana Systems: http://svail.github.io/rnn_perf/
    As you can see, GEMM often runs far slower than peak FMA, or even half of peak, even when written by Nvidia. Once you get to half of peak, you've done pretty well.

    Finally, this is why I won't even consider AMD GPUs for neural networks. Who would do these optimizations? Peak performance is irrelevant; sustained performance is what matters, but getting there requires a big investment in software. I don't have time to do much close-to-assembly-level optimization; I rely on libraries to do it for me. There aren't any libraries for AMD getting the kind of investment needed to reach even 1/2 of peak FMA bandwidth on the GEMM sizes that I need, let alone to implement crazy Winograd kernels.
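    For the curious, the smallest Winograd case, F(2,3), computes two outputs of a 3-tap filter with 4 multiplications instead of the 6 a direct sliding dot product needs; that multiply saving is where the beyond-peak-FMA throughput comes from (a minimal sketch of the standard transform):

```python
# Winograd F(2,3): two outputs of a 3-tap convolution in 4 multiplies
# (direct computation needs 6). Input tile d[0..3], filter g[0..2].

def winograd_f23(d, g):
    m1 = (d[0] - d[2]) * g[0]
    m2 = (d[1] + d[2]) * (g[0] + g[1] + g[2]) / 2
    m3 = (d[2] - d[1]) * (g[0] - g[1] + g[2]) / 2
    m4 = (d[1] - d[3]) * g[2]
    return [m1 + m2 + m3, m2 - m3 - m4]

def direct(d, g):
    """Reference: plain sliding dot product (6 multiplies)."""
    return [sum(d[i + j] * g[j] for j in range(3)) for i in range(2)]

d, g = [1.0, 2.0, 3.0, 4.0], [0.5, 1.0, -1.0]
print(winograd_f23(d, g), direct(d, g))  # both [-0.5, 0.0]
```

    In a real kernel the filter transform is precomputed once, so the saving is almost purely in the inner-loop multiplies.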
     
    pharma likes this.
  12. Dave Baumann

    Dave Baumann Gamerscore Wh...
    Moderator Legend

    Joined:
    Jan 29, 2002
    Messages:
    14,090
    Likes Received:
    694
    Location:
    O Canada!
  13. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,579
    Likes Received:
    4,799
    Location:
    Well within 3d
    It's not slight at the upper range, and GDDR5 chip thermal reporting basically tops out its scale at the operating temperature of some GPUs like Hawaii.
    The high-temperature refresh range for at least some GDDR5 data sheets starts at 85C, and for GDDR5 it doubles the rate of refreshes.
    That increases the number of operations that are not related to actual memory servicing, and eats into the electrical budget of the DRAM in terms of activations.

    I am not sure what HBM's high temperature rate is (it does have multiple refresh ranges based on temp), but so far that range has been avoided for Fury products.
    AMD's research into processors stacked under TSV memory took pains to keep DRAM below this threshold as well, so it can be a case where even if the cost is not ruinous, it just isn't a net win.
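    To put a rough number on what doubled refresh costs in lost command bandwidth, here is a sketch; the timing parameters are illustrative DDR-class values I'm assuming, not taken from a specific GDDR5 or HBM data sheet:

```python
# Rough refresh-overhead sketch. tREFI = average interval between refresh
# commands, tRFC = time the device is busy servicing one refresh.
# Both values are illustrative assumptions, not data-sheet figures.
tREFI_normal_ns = 7800.0   # assumed refresh interval below 85C
tRFC_ns = 160.0            # assumed per-refresh busy time

def refresh_overhead(trefi_ns):
    """Fraction of time the DRAM spends refreshing instead of servicing memory."""
    return tRFC_ns / trefi_ns

normal = refresh_overhead(tREFI_normal_ns)      # ~2% of time spent refreshing
hot    = refresh_overhead(tREFI_normal_ns / 2)  # doubled refresh rate above 85C
print(f"{normal:.1%} -> {hot:.1%}")
```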
     
  14. RecessionCone

    Regular Subscriber

    Joined:
    Feb 27, 2010
    Messages:
    505
    Likes Received:
    189
    Dave -
    Big square matrix multiplication (M==N==K) is not the problem! So performance on those sizes is not relevant!

    The matrices we have to work with in deep learning are awkwardly shaped (read the blog post I linked to earlier for more detail), and the performance penalty from poorly tuned BLAS libraries is very significant (like 6x), not 15% like in that post.

    Finally, Kepler performance for deep learning is a non sequitur. Whereas Kepler could get 70% of peak FMA from SGEMM, Maxwell can get 95%, and GM200 is also just much bigger. Kepler just isn't relevant to us.
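    The shape problem can be quantified with arithmetic intensity (FLOPs per byte moved); the RNN-style shape below is illustrative, not taken from the linked post:

```python
# Arithmetic intensity of GEMM: C[M,N] += A[M,K] @ B[K,N].
# 2*M*N*K FLOPs over reading A and B and writing C once, fp32.

def gemm_intensity(m, n, k, bytes_per_elem=4):
    flops = 2 * m * n * k
    bytes_moved = bytes_per_elem * (m * k + k * n + m * n)
    return flops / bytes_moved

square = gemm_intensity(2560, 2560, 2560)  # big square case: lots of data reuse
skinny = gemm_intensity(2560, 32, 2560)    # RNN-style small batch: ~27x less
print(round(square), round(skinny))        # 427 16
```

    A much lower FLOPs-per-byte ratio means the skinny case leans far harder on memory bandwidth and on tuning for that exact shape, which is exactly where a generic BLAS falls over.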
     
    pharma likes this.
  15. nnunn

    Newcomer

    Joined:
    Nov 27, 2014
    Messages:
    40
    Likes Received:
    31
    Sounds like AutoGemm'd clBlas covers that base. From the page Dave pointed to:

    Time to consider quad Fury X2's...
     
  16. RecessionCone

    Regular Subscriber

    Joined:
    Feb 27, 2010
    Messages:
    505
    Likes Received:
    189
    Show me the numbers. The capability to generate kernels means nothing without the numbers to back it up. Given the amount of work people are putting into hand-optimized assembly for Nvidia, it's hard for me to imagine AutoGemm can compete.

    Fury is also a non-starter because of its cooling issues. We pack 8 Titan Xs in a single 4U box. Can't do that with Fury until someone comes out with a replacement for the CLC for high-density servers.
     
    pharma and nnunn like this.
  17. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    11,708
    Likes Received:
    2,132
    Location:
    London
    The GPU route to SGEMM success is registers. Lots of them, which is why it's easier to get Maxwell 2 to high performance, and why AMD wiped the floor with NVidia for most of the history of GPU SGEMM.

    The irony is that NVidia's facing competition from Intel in the form of KNL. And the only way to compete with that is to ditch "lots of registers" and "shared memory", since the Intel ethos is smart cache algorithms in hardware and nice compilers.
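    The "lots of registers" point is about register tiling: each thread keeps a small output tile in registers so that every operand loaded from memory feeds many FMAs. A toy sketch of the idea (plain Python standing in for what would be registers on a GPU; square matrices with sizes divisible by the tile for brevity):

```python
# Register-tiling sketch: accumulate a T x T output tile "in registers",
# so each element loaded from A or B is reused T times.

def matmul_tiled(A, B, T=2):
    n = len(A)
    C = [[0.0] * n for _ in range(n)]
    for i0 in range(0, n, T):
        for j0 in range(0, n, T):
            acc = [[0.0] * T for _ in range(T)]       # lives in registers on a GPU
            for k in range(n):
                a = [A[i0 + i][k] for i in range(T)]  # 2*T loads...
                b = [B[k][j0 + j] for j in range(T)]
                for i in range(T):
                    for j in range(T):                # ...feed T*T FMAs
                        acc[i][j] += a[i] * b[j]
            for i in range(T):
                for j in range(T):
                    C[i0 + i][j0 + j] = acc[i][j]
    return C

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
print(matmul_tiled(A, B))  # [[19.0, 22.0], [43.0, 50.0]]
```

    Bigger tiles mean more reuse per load, but they need more registers per thread, which is why register-file size matters so much for SGEMM.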
     
  18. Dave Baumann

    Dave Baumann Gamerscore Wh...
    Moderator Legend

    Joined:
    Jan 29, 2002
    Messages:
    14,090
    Likes Received:
    694
    Location:
    O Canada!
  19. RecessionCone

    Regular Subscriber

    Joined:
    Feb 27, 2010
    Messages:
    505
    Likes Received:
    189
    Absolute performance suffers when power throttles; I would love to see what Nano performance looks like when running flat out 24/7 for months at a time.

    Beyond the performance issue, the most important reason Fiji is unthinkable for machine learning is the 4 GB DRAM. Our 12 GB cards are bursting at the seams.
     
    pharma and silent_guy like this.
  20. RecessionCone

    Regular Subscriber

    Joined:
    Feb 27, 2010
    Messages:
    505
    Likes Received:
    189
    Not true: the Intel ethos is handwritten intrinsics and assembly. MKL is not written by a smart cache and a nice compiler.
     
