Nvidia Pascal Speculation Thread

Absolute performance suffers when power throttles, I would love to see what Nano performance looks like when running flat out 24-7 for months at a time.

Beyond the performance issue, the most important reason Fiji is unthinkable for machine learning is its 4 GB of DRAM. Our 12 GB cards are bursting at the seams.

Right now AMD actually puts out less heat, at least for Fury, than Nvidia does for a Titan X. And if you can afford the FirePro S9170 you can get 32 GB of RAM: http://www.amd.com/en-us/press-releases/Pages/amd-delivers-worlds-2015jul08.aspx

Of course it's $3,000, but it's also 32 GB of RAM.
 
Absolute performance suffers when power throttles, I would love to see what Nano performance looks like when running flat out 24-7 for months at a time.
24/7 running doesn't make much difference once you reach the limits and you can do that in fairly short order.
 
Actually, most GPUs from AMD put out less heat than Titan X. However, they are all slower than Titan X. S9170 and Fury are much slower than Titan X, for example.
Can you share some percentage differences for neural nets on some Fury variants against a Titan X?

It's almost impossible to find anything serious for AMD.
 
Not true: the Intel ethos is handwritten intrinsics and assembly. MKL is not written by a smart cache and a nice compiler.
Linear algebra libraries are the single most important libraries in HPC - of course they optimise those.

For everything else, general purpose is the one thing that Intel does far, far better than anyone else.

Anyway, wouldn't you rather have ~400GB per processor than a measly 16/32GB? Your application is surely the use case for KNL as memory density is the single biggest problem.
 
Can you share some percentage differences for neural nets on some Fury variants against a Titan X?

It's almost impossible to find anything serious for AMD.

AMD is slow because no one does AMD because Nvidia is fast because no one does AMD. Seems a bit of circular logic. Then again I'm not nearly familiar enough with neural nets to know what's going on, keep intending to take the time for it... eh someday

Regardless, the point is AMD currently has the GPU with the highest amount of RAM, and the lowest heat. If that's what you're limited by somehow, then buying Nvidia just because it's popular seems pointless. If you're really locked into Nvidia's development environment, enough that you can't switch even if another IHV has better hardware for your needs, well that's a deficiency you'd probably do best to address. If Nvidia just has the highest throughput per board/watt that you need for the price, well then you buy Nvidia.

Either way, typing this, I just realized this has probably gotten way off topic :p
 
I'm specifically asking for neural network workloads, where everybody is using highly optimized frameworks, yet none of them seem to have solid support for AMD GPU. (Caffe has nothing, Torch7 seems to have some support, Theano is unclear.)

I tried looking for comparative benchmarks and can't find anything meaningful.

So if RecessionCone has some real-life benchmarks, that'd be very interesting.

It doesn't matter that AMD has the lowest heat and highest memory if none of the popular software can use it.
 
Not true: the Intel ethos is handwritten intrinsics and assembly. MKL is not written by a smart cache and a nice compiler.

Further: Doesn't NV already kinda cache the RF? And isn't the RF like 16MB on Fiji?
Would there be a point in going to a more CPU-oriented design with a classical 32-64 registers/thread RF and a cache-backed stack? I'm sure the literature is bursting with clever stack-cache ideas, and the GPU execution model seems friendly to more stack-cache shenanigans (to avoid RAM writes), like having a POPINVALIDATE instruction that just marks the cache line as free.

I've learnt here that PPA rules the semi world. What's the outlook wrt. PPA of going from a massive RF to stack-based, cache-backed archs?

I'd assume cache management will use more (dynamic) power than a simple RF, and even if the cache is one half/one quarter of the RF you will lose on power, but stack-based archs might achieve better perf (and possibly perf/W) with complex kernels whose many registers/temporaries drive occupancy down on some RF architectures. Also, FinFETs have reduced static power (right?), which makes SRAM more attractive than complex logic that switches often (the cache logic).
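To make the occupancy point concrete, here's a minimal sketch (CUDA, with a made-up register-heavy kernel) that asks the runtime how many blocks fit on an SM given the kernel's per-thread register use; on today's RF-based designs that register count is exactly the knob that drags occupancy down when a kernel carries lots of temporaries:

```cuda
// Minimal sketch (made-up kernel name): how per-thread register count caps
// occupancy on a big-register-file GPU. Exact numbers depend on architecture.
#include <cstdio>
#include <cuda_runtime.h>

// A deliberately register-hungry kernel; the unrolled accumulator array tends
// to be kept entirely in registers.
__global__ void register_hog(float* out, const float* in, int n)
{
    float acc[32];
    #pragma unroll
    for (int i = 0; i < 32; ++i) acc[i] = 0.0f;

    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        #pragma unroll
        for (int i = 0; i < 32; ++i) acc[i] += in[idx] * (i + 1);
        float sum = 0.0f;
        #pragma unroll
        for (int i = 0; i < 32; ++i) sum += acc[i];
        out[idx] = sum;
    }
}

int main()
{
    cudaFuncAttributes attr;
    cudaFuncGetAttributes(&attr, register_hog);

    int blocksPerSM = 0;
    // How many 256-thread blocks fit per SM given this kernel's register use?
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSM, register_hog,
                                                  256, /*dynamicSmem=*/0);

    printf("registers/thread: %d, resident 256-thread blocks per SM: %d\n",
           attr.numRegs, blocksPerSM);
    return 0;
}
```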

And of course there's the software driving the machines, and it seems classical compilers should be more comfortable with stack based archs (see: rantings in B3D on GCN's compiler). A perfect compiler for RF archs can be written, for infinite money. Less than perfect compilers cost less than infinite money.
 
I'm specifically asking for neural network workloads, where everybody is using highly optimized frameworks, yet none of them seem to have solid support for AMD GPU. (Caffe has nothing, Torch7 seems to have some support, Theano is unclear.)

I tried looking for comparative benchmarks and can't find anything meaningful.

So if RecessionCone has some real-life benchmarks, that'd be very interesting.

It doesn't matter that AMD has the lowest heat and highest memory if none of the popular software can use it.

That's exactly what I meant. I honestly don't know what the comparison is, and no one else seems to, because everyone just uses Nvidia by default for Neural Networks. But if the development environment is locked into one single IHV because of software support instead of hardware performance, then that's an entire industry shooting itself in the foot (they're the ones that wrote the locked in software for the most part) and then saying that the pain isn't that bad when asked why they did it.

Also FinFETs have reduced static power (right?)

Correct. I'd also doubt anyone but Intel is interested in SRAM. For the amount of die space it takes up it's not worth the bandwidth and latency for most GPU tasks.
 
That's exactly what I meant. I honestly don't know what the comparison is, and no one else seems to, because everyone just uses Nvidia by default for Neural Networks. But if the development environment is locked into one single IHV because of software support instead of hardware performance, then that's an entire industry shooting itself in the foot (they're the ones that wrote the locked in software for the most part) and then saying that the pain isn't that bad when asked why they did it.
Well, the point is that Nvidia is the one investing money in neural networks with serious software support, not AMD. The cuDNN (CUDA Deep Neural Network) library is written by Nvidia for this market:
https://developer.nvidia.com/cudnn
Version 3 was recently launched and offers some nice speed improvements. It supports Caffe, Theano and Torch. For example, Theano support:
http://deeplearning.net/software/theano/library/sandbox/cuda/dnn.html
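As an aside, if you're not sure whether a framework build is actually picking cuDNN up, the library itself is easy to probe from a few lines of host code. A minimal sketch, assuming cudnn.h is on the include path and you link with -lcudnn:

```cuda
// Minimal sketch: verify that cuDNN is installed and loadable. The version
// numbers printed are whatever your local install happens to be.
#include <cstdio>
#include <cudnn.h>

int main()
{
    printf("compiled against cuDNN %d, runtime reports %zu\n",
           CUDNN_VERSION, cudnnGetVersion());

    cudnnHandle_t handle;
    cudnnStatus_t status = cudnnCreate(&handle);
    if (status != CUDNN_STATUS_SUCCESS) {
        printf("cudnnCreate failed: %s\n", cudnnGetErrorString(status));
        return 1;
    }
    printf("cuDNN handle created OK\n");
    cudnnDestroy(handle);
    return 0;
}
```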

cuDNN 4 is on the way and will offer FP16 support for TX1 and Pascal with another great speed bump:
http://www.nvidia.com/content/tegra/embedded-systems/pdf/jetson_tx1_whitepaper.pdf
http://devblogs.nvidia.com/parallelforall/inference-next-step-gpu-accelerated-deep-learning/

Based on CUDA and cuDNN, Nvidia also offers DIGITS, a middleware dedicated to deep learning:
https://developer.nvidia.com/digits

Finally, Nvidia offers a lot of deep learning courses:
https://developer.nvidia.com/deep-learning-courses

All in all, it's easy to understand why Nvidia and CUDA are the de facto standard and why nobody (but really nobody) uses AMD.
 
I tried looking for comparative benchmarks and can't find anything meaningful.

[Image: inference performance chart, Tegra X1 and GeForce GTX Titan X vs. Intel CPUs, from the Nvidia whitepaper]

The experiments were run on four different devices: The NVIDIA Tegra X1 and the Intel Core i7 6700K as client-side processors; and the NVIDIA GeForce GTX Titan X and a 16-core Intel Xeon E5 2698 as high-end processors. The neural networks were run on the GPUs using Caffe compiled for GPU usage using cuDNN. The Intel CPUs run the most optimized CPU inference code available, the recently released Intel Deep Learning Framework (IDLF) [17]. IDLF only supports a neural network architecture called CaffeNet that is similar to AlexNet with batch sizes of 1 and 48.

http://devblogs.nvidia.com/parallelforall/inference-next-step-gpu-accelerated-deep-learning/

Also included in the whitepaper is a discussion of new cuDNN optimizations aimed at improving inference performance (and used in the performance results shown in the paper), as well as optimizations added to the Caffe deep learning framework. Among the many optimizations in cuDNN 4 is an improved convolution algorithm that is able to split the work of smaller batches across more multiprocessors, improving the performance of small batches on larger GPUs. cuDNN 4 also adds support for FP16 arithmetic in convolution algorithms. On supported chips, such as Tegra X1 or the upcoming Pascal architecture, FP16 arithmetic delivers up to 2x the performance of equivalent FP32 arithmetic. Just like FP16 storage, using FP16 arithmetic incurs no accuracy loss compared to running neural network inference in FP32.
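The FP16 claim is mostly about packed math. A minimal sketch below (made-up kernel, built with something like nvcc -arch=sm_53 for Tegra X1) shows the half2 intrinsics that let a single instruction operate on two fp16 values at once, which is where the "up to 2x over FP32" comes from:

```cuda
#include <cuda_fp16.h>
#include <cuda_runtime.h>

// Scale a packed-fp16 buffer in place. Each __half2 holds two fp16 values,
// and __hmul2 multiplies both lanes with one instruction on sm_53+ parts.
__global__ void scale_fp16(__half2* data, float scale, int n2)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n2) {
        __half2 s = __float2half2_rn(scale);   // broadcast scale to both lanes
        data[i] = __hmul2(data[i], s);
    }
}

int main()
{
    const int n2 = 1 << 20;                    // 2M fp16 values, packed in pairs
    __half2* d_data = nullptr;
    cudaMalloc(&d_data, n2 * sizeof(__half2)); // contents would come from a real net
    scale_fp16<<<(n2 + 255) / 256, 256>>>(d_data, 0.5f, n2);
    cudaDeviceSynchronize();
    cudaFree(d_data);
    return 0;
}
```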

GPUs also benefit from an improvement contributed to the Caffe framework to allow it to use cuBLAS GEMV (matrix-vector multiplication) instead of GEMM (matrix-matrix multiplication) for inference when the batch size is 1. See the whitepaper for more details on these optimizations.
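In other words, at batch size 1 a fully connected layer collapses from a matrix-matrix product into a matrix-vector product. A minimal sketch of what that substitution looks like at the cuBLAS level (not Caffe's actual code; the column-major weight layout and function names are my own assumptions):

```cuda
#include <cublas_v2.h>
#include <cuda_runtime.h>

// Fully connected forward pass, y = W * x. With a batch it's a GEMM; with a
// single input it collapses to a GEMV, which wastes far less work. W is
// assumed column-major (out x in) purely to keep the calls simple.
void fc_forward_batch1(cublasHandle_t handle,
                       const float* d_W,   // out x in, column-major, on device
                       const float* d_x,   // in,  on device
                       float* d_y,         // out, on device
                       int out, int in)
{
    const float alpha = 1.0f, beta = 0.0f;
    cublasSgemv(handle, CUBLAS_OP_N,
                out, in,                  // W is (out x in)
                &alpha, d_W, out,         // lda = out (column-major)
                d_x, 1,
                &beta, d_y, 1);
}

// The batched equivalent (batch > 1) is a GEMM over the whole input matrix
// X (in x batch), producing Y (out x batch).
void fc_forward_batched(cublasHandle_t handle,
                        const float* d_W, const float* d_X, float* d_Y,
                        int out, int in, int batch)
{
    const float alpha = 1.0f, beta = 0.0f;
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                out, batch, in,
                &alpha, d_W, out,
                d_X, in,
                &beta, d_Y, out);
}
```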

Whitepaper
http://developer.download.nvidia.co...69494ec9b0fcad&file=jetson_tx1_whitepaper.pdf
 
Everybody knows that GPUs are an order of magnitude faster than CPUs. I was talking specifically about Nvidia vs AMD GPUs.
I gave you the answer, but I will explain it in other words: there is no comparison between NV and AMD because there is no library/API/middleware written for AMD hardware. Nvidia did all the work to support deep learning, not AMD. So people use Nvidia because their software works with Nvidia.
Until someone spends thousands of hours writing something for AMD, don't expect to see a comparison.
If AMD wants a piece of this market, they have to invest people and money and provide the same support as Nvidia. Don't expect others to do the work for you if you want to succeed...
 
But if the development environment is locked into one single IHV because of software support instead of hardware performance, then that's an entire industry shooting itself in the foot (they're the ones that wrote the locked in software for the most part) and then saying that the pain isn't that bad when asked why they did it.
All AMD needed to do was to be first with a high level language GPU development environment, to organize a few large conferences, seed universities all around the world for years with GPUs, write and support a bunch of libraries, win a few government supercomputer contracts, encourage/support a few software vendors to use CUDA, have their HW installed in AWS, and go all in on the first truly meaningful GPU compute application. ;)

For a developer it's safer to go where the puck is than to go where the puck may, but likely never will, be at some undetermined point in the future.
 
Until someone spends thousands of hours writing something for AMD, don't expect to see a comparison.
If AMD wants a piece of this market, they have to invest people and money and provide the same support as Nvidia. Don't expect others to do the work for you if you want to succeed...
Funny that you mention it. Remember the recent announcement of that "Boltzmann Initiative"? Part of that is HIP, which is nothing but a CUDA compatibility layer for AMD GPUs. It would be interesting to see how cuDNN compiled against HIP instead of CUDA fares on AMD's hardware. Anyone got enough free time on their hands?
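For the runtime-API side of a CUDA codebase the mapping is supposed to be pretty mechanical, which is why the idea isn't crazy. A minimal sketch in plain CUDA, with the HIP-side renames I'd expect noted in comments (those are my assumption about the compatibility layer, not output from the hipify tools):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Made-up example kernel; __global__, threadIdx/blockIdx etc. are unchanged
// under HIP, and the runtime calls below have 1:1 renames.
__global__ void saxpy(int n, float a, const float* x, float* y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];
}

int main()
{
    const int n = 1 << 20;
    float *d_x, *d_y;
    cudaMalloc(&d_x, n * sizeof(float));        // HIP: hipMalloc
    cudaMalloc(&d_y, n * sizeof(float));        // HIP: hipMalloc
    cudaMemset(d_x, 0, n * sizeof(float));      // HIP: hipMemset
    cudaMemset(d_y, 0, n * sizeof(float));

    // HIP: same kernel source, launched via HIP's launch macro (or hipcc).
    saxpy<<<(n + 255) / 256, 256>>>(n, 2.0f, d_x, d_y);

    cudaDeviceSynchronize();                    // HIP: hipDeviceSynchronize
    cudaFree(d_x);                              // HIP: hipFree
    cudaFree(d_y);
    printf("done\n");
    return 0;
}
```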
 
Funny that you mention it. Remember the recent announcement of that "Boltzmann Initiative"? Part of that is HIP, which is nothing but a CUDA compatibility layer for AMD GPUs. It would be interesting to see how cuDNN compiled against HIP instead of CUDA fares on AMD's hardware. Anyone got enough free time on their hands?
Does NVIDIA give developers the source code to cuDNN?
 
then that's an entire industry shooting itself in the foot (they're the ones that wrote the locked in software for the most part) and then saying that the pain isn't that bad when asked why they did it.

What the people above said. Also, it's only shooting yourself in the foot if the added hardware cost through vendor lock-in is more than the added cost of doing cross-platform software. I very much doubt that's true. I don't know almost anything about machine learning, but in the areas of HPC I've seen, I'd say the typical player has hundreds of thousands to millions invested in the hardware, and millions to tens of millions invested in the software. Development that's only ever going to be used by one company is expensive, and if work by a hardware vendor can even marginally reduce those costs, then that's easily worth paying many times the cost in hardware.

This is also why so much code that could be running on GPUs still runs on Xeons. While you could save a lot on the HW costs, the required investment in SW would be way, way more than you'd save. What happens is one big player does the grunt work, and then open-sources their stack hoping that they will get some returns from other people developing it forward. Whatever platform they decided on will be the platform that entire segment works on, simply because for any small player the cost of switching would be ridiculously more than just paying the premium.
 
I happen to work in the HPC sector, and maybe you can argue that Xeon itself is more efficient than a GPU for common computing tasks with no linear algebra involved (not many, but yes, there are some), but comparing Xeon Phi to GPUs, especially Nvidia's, I can only say it is a failure.

Linear algebra, actually not even all LA routines but just GEMM, is the only area where Xeon Phi is close to a Kepler or Maxwell in terms of performance. For anything else, working with Xeon Phi typically means you spend 5x more time on optimizing and tuning, ending up with less portable and harder-to-maintain code for typically LESS than half the performance of a Tesla, despite the fact that Xeon Phi enjoys a significant node advantage and doesn't need to waste any heat or die area on graphics stuff.

Judging by the tech specs of the next many-core part Intel is offering, I expect it will still be much slower than a GPU like the coming Pascal on compute tasks.
 