Nvidia Pascal Speculation Thread

Maybe it allows for lower clocks, in a way, since now the RAM sits right under the heatspreader with the CPU core, meaning it gets grilled to whatever temp the core is running at... I'm no expert about these things, but isn't it said that DRAM performance characteristics are affected by its running temperature?
Not really. What changes with temperature is the required refresh rate, as the leakage of the capacitors increases slightly; apart from that, only the regular increase in electrical resistance. Overall, higher temperature just means even higher power consumption. You will eventually experience data corruption if you push the temperature too far.

But the point about HBM is that you don't even aim for pushing the access and signal rate to the limits; instead you achieve the data rate with a super wide parallel bus over a short distance. So even if the RAM gets a bit hotter than before, it's still a net gain in bandwidth and power efficiency, thanks to the simplified signaling. And a major increase in the price tag.
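As a rough back-of-envelope (using the publicly known Fiji/HBM1 and GM200/GDDR5 figures purely as an example of wide-and-slow vs. narrow-and-fast):
Code:
# bandwidth = bus width in bytes * per-pin data rate
hbm1_bw  = 4096 / 8 * 1.0   # 4096-bit bus at 1 Gbps/pin  -> 512 GB/s
gddr5_bw =  384 / 8 * 7.0   # 384-bit bus at 7 Gbps/pin   -> 336 GB/s
print(hbm1_bw, gddr5_bw)    # 512.0 336.0 -- more bandwidth despite far lower signal rates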
 
If Apple can produce a hundred million 100mm2 chips in 16nm, I wouldn't worry too much about a 600mm2 die: the number of identical functional units will only go up, and so will the ability to disable faulty ones with increasingly lower performance impact.
The moment you can disable 3 or more identical units, the ability to produce a functional sellable die goes up dramatically.
If you disable 3 out of, say, 32 shaders in a GP100, you still take less than a 10% performance hit, and you can even compensate a bit for that by increasing the clock within the same power budget.
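A minimal sketch of that back-of-envelope math (idealized: assumes power scales linearly with active units and clock, ignores voltage effects):
Code:
total_units, disabled = 32, 3
throughput_loss = disabled / total_units            # ~9.4% fewer units
clock_bump = total_units / (total_units - disabled) # ~1.10x, upper bound at equal power
net = (1 - throughput_loss) * clock_bump            # ~1.0x in the ideal case
print(throughput_loss, clock_bump, net)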
 
A 4096-shader part at a 1 GHz base clock, with 8 TF SP and 4 TF DP, seems a good bet. It'd also look much better, marketing-wise, to compare the base clock in those graphs vs. the K80.
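For reference, the arithmetic behind that guess (assuming one FMA = 2 FLOPs per ALU per clock, and the speculated 1:2 DP ratio):
Code:
shaders, base_clock_ghz = 4096, 1.0
sp_tflops = shaders * 2 * base_clock_ghz / 1000   # 8.192 TF SP
dp_tflops = sp_tflops / 2                         # 4.096 TF DP at a 1:2 ratio
print(sp_tflops, dp_tflops)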

That would look terrible compared to Maxwell (3072 shaders) and Fury (4096 shaders).
Pascal's focus is also to be very good for neural net computation (claimed to be 4 times better than Maxwell, i.e. FP16).
 
4 TF DP and 24 TF HP would then imply different ratios for DP and HP performance relative to SP. Or different boost clocks for different modes, but that would be a drastic difference.
 
There won't be different boost clocks for different modes; I'm still leaning towards the same ALUs for all the different precisions.

Also, I think we are overthinking all this a little bit; taking bits of information from one place and then trying to compare them with another source just doesn't work too well ;)
 
Pascal's focus is also to be very good for neural net computation (claimed to be 4 times better than Maxwell, i.e. FP16).
About that...

I don't think this actually means "FP16 rate @ 4x Maxwell's FP32 rate". Most neural networks still use the logistic sigmoid as the activation function, which requires one exponentiation and one division for every 5 regular (addition or multiplication) FLOPs. The speedup could just as well originate from increased performance of the exponentiation or division instructions.

FP16 performance for MUL/ADD/FMADD might not be anywhere near as good as everyone is expecting. It could even turn out that the FP32 and FP16 rates are equal for these instructions, and Nvidia's claims would still remain true.
 
Most neural networks still use the logistic sigmoid as the activation function, ...
My apologies for being pedantic, but in most neural networks, the sigmoid function has been replaced by the computationally much simpler ReLU function f(x) = max(0, x).
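For reference, the two activation functions being compared (a minimal sketch, not any particular framework's implementation):
Code:
import numpy as np

def sigmoid(x):
    # logistic sigmoid: one exponentiation and one division per element
    return 1.0 / (1.0 + np.exp(-x))

def relu(x):
    # ReLU: a single compare/select per element
    return np.maximum(0.0, x)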
 
My apologies for being pedantic, but in most neural networks, the sigmoid function has been replaced by the computationally much simpler ReLU function f(x) = max(0, x).
Also, the vast majority of computation in a neural network is in GEMM or convolutions. The nonlinearity is not significant from a computational perspective, although it's important to the algorithm.
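To put numbers on that, a rough FLOP count for one hypothetical conv layer versus its activation (layer sizes made up purely for illustration):
Code:
n, c, h, w = 128, 64, 56, 56   # batch, input channels, output height/width
k, r, s    = 64, 3, 3          # output channels, filter height/width
conv_flops = 2 * n * k * h * w * c * r * s   # multiply-add counted as 2 FLOPs
act_flops  = n * k * h * w                   # roughly one op per output element
print(conv_flops // act_flops)               # 1152x more work in the convolution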
 
Also, the vast majority of computation in a neural network is in GEMM or convolutions. The nonlinearity is not significant from a computational perspective, although it's important to the algorithm.
Thank you, wasn't aware of GEMM.

Nvidia's own paper from December 2014, presenting their own convolutional network framework: http://arxiv.org/pdf/1410.0759.pdf
That should be the framework they were basing their own numbers on in the original announcement: http://blogs.nvidia.com/blog/2015/03/17/pascal/

The paper shows something interesting on pages 6 and 7: the achieved peak SP throughput is barely 50% on Maxwell, and even less on Kepler. That's not much, given how promising the approach sounded.
Not sure what the limiting factor is, though: transfer from global to local shared memory, bandwidth of the local memory, or keeping the FPUs utilized. Or maybe just the size of the local shared memory, limiting pipelining between wavefronts and thereby effective utilization.

Either way, there's still a lot of headroom to reach the proclaimed performance improvements, and it's not necessarily all attributed to the FPUs.
 
The paper shows something interesting on pages 6 and 7: the achieved peak SP throughput is barely 50% on Maxwell, and even less on Kepler. That's not much, given how promising the approach sounded.
Not sure what the limiting factor is, though: transfer from global to local shared memory, bandwidth of the local memory, or keeping the FPUs utilized. Or maybe just the size of the local shared memory, limiting pipelining between wavefronts and thereby effective utilization.

Either way, there's still a lot of headroom to reach the proclaimed performance improvements, and it's not necessarily all attributed to the FPUs.

Since that paper, Nvidia seems to have already come up with more optimized algorithms.
http://devblogs.nvidia.com/parallelforall/cudnn-v2-higher-performance-deep-learning-gpus/
i.e. "The IMPLICIT_PRECOMP_GEMM algorithm is a modification of the IMPLICIT_GEMM approach, which uses a small amount of working space (see the Release Notes for details on how much) to achieve significantly higher performance than the original IMPLICIT_GEMM for many use cases."
Algorithmic (software) optimizations, though, are not a substitute for faster hardware, i.e. a Pascal 4x FP16 GPU.
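For anyone unfamiliar with the convolution-as-GEMM idea those algorithm names refer to, here's a minimal explicit im2col sketch (NumPy, single image, stride 1, no padding). This is just the general lowering; cuDNN's IMPLICIT_* variants avoid materializing the big column matrix.
Code:
import numpy as np

def conv2d_as_gemm(x, w):
    # x: (C, H, W) input, w: (K, C, R, S) filters
    C, H, W = x.shape
    K, _, R, S = w.shape
    Ho, Wo = H - R + 1, W - S + 1
    # im2col: one row per (c, r, s) filter tap, one column per output position
    cols = np.empty((C * R * S, Ho * Wo))
    i = 0
    for c in range(C):
        for r in range(R):
            for s in range(S):
                cols[i] = x[c, r:r + Ho, s:s + Wo].reshape(-1)
                i += 1
    # the whole convolution is now a single GEMM
    return (w.reshape(K, -1) @ cols).reshape(K, Ho, Wo)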
 
Thank you, wasn't aware of GEMM.

Nvidia's own paper from December 2014, presenting their own convolutional network framework: http://arxiv.org/pdf/1410.0759.pdf
That should be the framework they were basing their own numbers on in the original announcement: http://blogs.nvidia.com/blog/2015/03/17/pascal/

The paper shows something interesting on pages 6 and 7: the achieved peak SP throughput is barely 50% on Maxwell, and even less on Kepler. That's not much, given how promising the approach sounded.
Not sure what the limiting factor is, though: transfer from global to local shared memory, bandwidth of the local memory, or keeping the FPUs utilized. Or maybe just the size of the local shared memory, limiting pipelining between wavefronts and thereby effective utilization.

Either way, there's still a lot of headroom to reach the proclaimed performance improvements, and it's not necessarily all attributed to the FPUs.
Actually, getting within a factor of two of peak performance is the majority of the optimization. The rest is diminishing returns. cuDNN is on v4 now and is much closer to peak performance than v1. The new hotness for convolution these days is specialization for fixed kernel sizes and more work-efficient algorithms based on decompositions like FFT and Winograd, which can actually reach somewhere around 1.6x peak FMA bandwidth given a good assembly implementation.
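For the curious, the F(2,3) building block of Winograd convolution: two outputs of a 3-tap filter with 4 multiplies instead of 6 (the filter-side transforms can be precomputed). A toy 1D sketch, nothing like a real GPU kernel:
Code:
def winograd_f23(d, g):
    # d: 4 input samples, g: 3 filter taps -> 2 outputs of a valid 1D correlation
    m1 = (d[0] - d[2]) * g[0]
    m2 = (d[1] + d[2]) * (g[0] + g[1] + g[2]) / 2
    m3 = (d[2] - d[1]) * (g[0] - g[1] + g[2]) / 2
    m4 = (d[1] - d[3]) * g[2]
    return [m1 + m2 + m3, m2 - m3 - m4]   # == [d0*g0+d1*g1+d2*g2, d1*g0+d2*g1+d3*g2]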

Another thing you might find interesting is how hard it is to get GEMM to perform well for neural networks. Here's a link showing that Nvidia's CUBLAS can be 6x slower at GEMM for Recurrent Neural Networks than a GEMM from Nervana Systems: http://svail.github.io/rnn_perf/
As you can see, GEMM is often far slower than peak FMA, or even 1/2 of peak - even when written by Nvidia. Once you get to half of peak, you've done pretty well.

Finally, this is why I won't even consider AMD GPUs for neural networks. Who would do these optimizations? Peak performance is irrelevant, sustained performance is what matters - but getting there requires a big investment in software. I don't have time to do much close assembly level optimization, I rely on libraries to do it for me. There aren't any libraries for AMD getting the kind of investment to reach even 1/2 of peak FMA bandwidth on the GEMM sizes that I need. Let alone to implement crazy Winograd kernels.
 
Not really. What changes with temperature is the required refresh rate, as the leakage of the capacitors increases slightly; apart from that, only the regular increase in electrical resistance.
It's not slight at the upper range, and GDDR5 chip thermal reporting basically tops its scale at the operating temp of some GPUs like Hawaii.
The high-temperature refresh range in at least some GDDR5 data sheets starts at 85°C, and it doubles the refresh rate.
That increases the number of operations that are not related to actual memory servicing, and eats into the electrical budget of the DRAM in terms of activations.
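Rough feel for the overhead, using ballpark refresh parameters from typical DRAM datasheets (not from any specific GDDR5 part):
Code:
t_refi_normal = 7.8   # us, average refresh interval below 85C
t_refi_hot    = 3.9   # us, doubled refresh rate above 85C
t_rfc         = 0.26  # us, time a refresh keeps the bank busy (density dependent)
print(t_rfc / t_refi_normal)  # ~3% of time spent refreshing
print(t_rfc / t_refi_hot)     # ~7% above 85C, plus the extra activation power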

I am not sure what HBM's high temperature rate is (it does have multiple refresh ranges based on temp), but so far that range has been avoided for Fury products.
AMD's research into processors stacked under TSV memory took pains to keep DRAM below this threshold as well, so it can be a case where even if the cost is not ruinous, it just isn't a net win.
 
Dave -
Big square matrix multiplication (M==N==K) is not the problem! So performance on those sizes is not relevant!

The matrices we have to work with in deep learning are awkwardly shaped (read the blog post I linked to earlier for more detail), and the performance penalty from poorly tuned BLAS libraries is very significant (like 6X) - not 15% like in that post.

Finally, Kepler performance for deep learning is a non sequitur. Whereas Kepler could get 70% peak FMA from SGEMM, Maxwell can get 95%, and GM200 is also just much bigger. Kepler just isn't relevant to us.
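To make the "awkwardly shaped" point concrete, a sketch with made-up sizes (see the linked blog post for the real ones): the arithmetic intensity of a recurrent-layer GEMM with a small minibatch is a fraction of the square case benchmarks usually show.
Code:
def flops_per_element_moved(M, N, K):
    return 2 * M * N * K / (M * K + K * N + M * N)

print(flops_per_element_moved(2048, 2048, 2048))  # ~1365, big square GEMM
print(flops_per_element_moved(2048, 32, 2048))    # ~62, hidden x minibatch RNN step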
 
The matrices we have to work with in deep learning are awkwardly shaped (read the blog post I linked to earlier for more detail), and the performance penalty from poorly tuned BLAS libraries is very significant (like 6X) - not 15% like in that post.
Sounds like AutoGemm'd clBLAS covers that base. From the page Dave pointed to:
Customizability:
"For an application with unique GEMM requirements (such as very small or very skinny matrices), AutoGemm can be customized to generate application-specific kernels for additional performance."

Time to consider quad Fury X2's...
 
Sounds like AutoGemm'd clBLAS covers that base. From the page Dave pointed to:

Time to consider quad Fury X2's...
Show me the numbers. The capability to generate kernels means nothing without the numbers to back it up. Given the amount of work that people are putting into hand-optimized assembly for Nvidia, it's hard for me to imagine AutoGemm can compete.

Fury is also a non-starter because of its cooling issues. We pack 8 Titan X in a single 4U box. Can't do that with Fury until someone comes out with a replacement for the CLC for high-density servers.
 
The GPU route to SGEMM success is registers. Lots of them, which is why it's easier to get Maxwell 2 to high performance, and why AMD wiped the floor with NVidia for most of the history of GPU SGEMM.

The irony is that NVidia's facing competition from Intel in the form of KNL. And the only way to compete with that is to ditch "lots of registers" and "shared memory", since the Intel ethos is smart cache algorithms in hardware and nice compilers.
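A quick illustration of the "lots of registers" point: with a TM x TN accumulator tile held in registers per thread, each K step loads TM + TN operands but performs 2*TM*TN FLOPs, so bigger register tiles mean more reuse per byte moved.
Code:
def flops_per_operand_loaded(TM, TN):
    return 2 * TM * TN / (TM + TN)

print(flops_per_operand_loaded(2, 2))  # 2.0 -- small tile, bandwidth bound
print(flops_per_operand_loaded(4, 4))  # 4.0
print(flops_per_operand_loaded(8, 8))  # 8.0 -- needs 64+ registers just for accumulators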
 
http://www.anandtech.com/show/9621/the-amd-radeon-r9-nano-review/15

Compute Efficiency remains quite strong with Fiji, even when the PowerTune limits are scaled back (as per Nano).
Absolute performance suffers when power throttles; I would love to see what Nano performance looks like when running flat out 24/7 for months at a time.

Beyond the performance issue, the most important reason Fiji is unthinkable for machine learning is the 4 GB DRAM. Our 12 GB cards are bursting at the seams.
 
The GPU route to SGEMM success is registers. Lots of them, which is why it's easier to get Maxwell 2 to high performance, and why AMD wiped the floor with NVidia for most of the history of GPU SGEMM.

The irony is that NVidia's facing competition from Intel in the form of KNL. And the only way to compete with that is to ditch "lots of registers" and "shared memory", since the Intel ethos is smart cache algorithms in hardware and nice compilers.
Not true: the Intel ethos is handwritten intrinsics and assembly. MKL is not written by a smart cache and a nice compiler.
 