Nvidia Pascal Speculation Thread

I hope NVIDIA won't put too much weight on DNN stuff in their next-generation GPUs; FP16 is almost useless in most computing tasks other than DNN work.

As for the AMD vs NVIDIA question: in GPGPU, AMD offers almost nothing (even their BLAS is now open-sourced, which suggests AMD has all but given up on it), whilst NVIDIA offers a full toolchain. There is no comparison.
 
What the people above said. Also, that's only shooting yourself in the foot if the added hardware cost from vendor lock-in is more than the added cost of doing cross-platform software, and I very much doubt that's true. I know almost nothing about machine learning, but in the areas of HPC I've seen, the typical player has hundreds of thousands to millions invested in the hardware and millions to tens of millions invested in the software. Development that's only ever going to be used by one company is expensive, and if work by a hardware vendor can even marginally reduce those costs, then that's easily worth paying many times over in hardware.

This is also why so much code that could be running on GPUs still runs on Xeons. While you could save a lot on the HW costs, the required investment in SW would be way, way more than you'd save. What happens is one big player does the grunt work, and then open-sources their stack hoping that they will get some returns from other people developing it forward. Whatever platform they decided on will be the platform that entire segment works on, simply because for any small player the cost of switching would be ridiculously more than just paying the premium.

That's what they want you to think. A toolchain, even a complex one, could cost millions, but the cost of lock-in is potentially almost infinite. The only "reason" old code doesn't get updated is "well, it runs on the old hardware and we have the old hardware, so why bother?" By your logic, the systems still running Windows 3.1 do so because it's somehow too expensive to rewrite the software.

DNN work is extremely hardware intensive; four Titan Xs to a case is not exactly inexpensive. And the toolchain Nvidia provides wouldn't be a huge cost to replace. It's the software built on it that took all that money, and only now that the money is spent, and everyone realizes they locked themselves into a single IHV just to save a bit of time up front, do people rationalize that it was "worth it" all along.

The exact same bullshit has always applied: to 3dfx Glide, to building a DirectX-based rendering engine instead of a higher-level open abstraction, and so on. It's brilliant on Nvidia's part, and it worked. Someone, anyone, could put out a GPU with 64 GB of RAM and quadruple the performance of an Nvidia card, and DNN builders would still hesitate. It's why Nvidia did what it did in the first place. Vendor lock-in, software, hardware, etc. pretty much always costs more than it's worth. It's only after you've gone down the hole too far to get out that most people realize their mistake.
 
Here's a benchmark of AMD CNN performance.
A FirePro W9100 (5.24 TFLOPS peak) is 11x slower than an original Titan (4.5 TFLOPS peak) using cuDNN 2.0. cuDNN 3.0 is much faster, at least on the Titan X we use, so this comparison isn't super useful, other than to show that the software for AMD is just non-existent.

https://github.com/vmarkovtsev/veles-benchmark/blob/master/src/alexnet.md
 
That's what they want you to think. A toolchain, even a complex one, could cost millions, but the cost of lock-in is potentially almost infinite. The only "reason" old code doesn't get updated is "well, it runs on the old hardware and we have the old hardware, so why bother?" By your logic, the systems still running Windows 3.1 do so because it's somehow too expensive to rewrite the software.

DNN work is extremely hardware intensive; four Titan Xs to a case is not exactly inexpensive. And the toolchain Nvidia provides wouldn't be a huge cost to replace. It's the software built on it that took all that money, and only now that the money is spent, and everyone realizes they locked themselves into a single IHV just to save a bit of time up front, do people rationalize that it was "worth it" all along.

The exact same bullshit has always applied: to 3dfx Glide, to building a DirectX-based rendering engine instead of a higher-level open abstraction, and so on. It's brilliant on Nvidia's part, and it worked. Someone, anyone, could put out a GPU with 64 GB of RAM and quadruple the performance of an Nvidia card, and DNN builders would still hesitate. It's why Nvidia did what it did in the first place. Vendor lock-in, software, hardware, etc. pretty much always costs more than it's worth. It's only after you've gone down the hole too far to get out that most people realize their mistake.
It's not a mistake: AMD is not a realistic alternative. Even assuming the software existed (which it doesn't), their perf/W is not good enough to allow 8 high-performance GPUs to sit together in a box.
The fact that their highest performing GPU has 4 GB of memory also disqualifies them.

Given these constraints, of course no one will write software for them.
 
Linear algebra is the single most important library in HPC - of course they optimise that.

For everything else, general purpose is the one thing that Intel does far, far better than anyone else.

Anyway, wouldn't you rather have ~400GB per processor than a measly 16/32GB? Your application is surely the use case for KNL as memory density is the single biggest problem.
You can already have 400GB on a GPU. Just use host mapped memory. I do it all the time to exceed GPU memory limits. Of course, accessing memory over PCIe is slow.
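In case "host mapped memory" isn't familiar: here's a minimal sketch of the zero-copy approach in CUDA. The kernel and sizes are made up for illustration, but cudaHostAlloc with cudaHostAllocMapped plus cudaHostGetDevicePointer is the standard way to let a kernel read host RAM directly, at PCIe speeds:

Code:
#include <cstdio>
#include <cuda_runtime.h>

// Toy kernel: every read of in[] and write of out[] crosses PCIe,
// because both buffers live in host RAM.
__global__ void scale(const float* in, float* out, int n, float s)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] * s;
}

int main()
{
    const int n = 1 << 20;                         // size is illustrative only
    float *h_in, *h_out, *d_in, *d_out;

    cudaSetDeviceFlags(cudaDeviceMapHost);         // allow mapped host memory

    // Pinned + mapped host allocations: visible to the GPU without cudaMemcpy.
    cudaHostAlloc((void**)&h_in,  n * sizeof(float), cudaHostAllocMapped);
    cudaHostAlloc((void**)&h_out, n * sizeof(float), cudaHostAllocMapped);
    for (int i = 0; i < n; ++i) h_in[i] = (float)i;

    // Device-side aliases of the same host buffers.
    cudaHostGetDevicePointer((void**)&d_in,  h_in,  0);
    cudaHostGetDevicePointer((void**)&d_out, h_out, 0);

    scale<<<(n + 255) / 256, 256>>>(d_in, d_out, n, 2.0f);
    cudaDeviceSynchronize();

    printf("out[42] = %f\n", h_out[42]);
    cudaFreeHost(h_in);
    cudaFreeHost(h_out);
    return 0;
}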
Guess what, KNL also accesses its 400GB of memory through a narrow interface. So no, my application is not the use case for KNL. It turns out that deep learning needs a balance of compute, memory capacity, memory bandwidth, cost, and software support. KNL has worse compute than its 2016 competitors (remember, KNL specs aren't even public yet); its memory bandwidth looks to be pretty bad (~400 GB/s going up against the 800 GB/s or so I'm guessing for HBM2); the capacity of its HMC is OK but not amazing; cost is probably going to be terrible, as the other Xeon Phis were expensive; and software support is getting better but still not great. I'm expecting MKL to get deep learning support only two years after cuDNN appeared.
 
It's not a mistake: AMD is not a realistic alternative. Even assuming the software existed (which it doesn't), their perf/W is not good enough to allow 8 high-performance GPUs to sit together in a box.

There are many people running 8xAMD GPUs in a box (or more), a couple of examples:

- http://luxmark.info/node/836
- http://luxmark.info/node/1398

I agree about the absence of software for AMD GPUs, but the hardware is quite competitive with NVIDIA: it is often better in terms of perf/$ and perf/W (in GPU computing tasks). It is more a CUDA developer attitude: do they need to convince themselves that there is no alternative to vendor lock-in?
 
So you think HPC Pascal will have a 1:2 ratio of FP64:FP32 ALUs? :runaway:
I may be wrong, but I think a single mixed-precision FP16/32/64 ALU will cost less in die area/logic/registers than one FP64 ALU plus two FP16/32 ALUs.

In short, what I am thinking is that it would make the most sense to have almost completely separate chips for consumer and HPC/what have you, with the former keeping a 1/32 FP64 rate or so in order to enable programmers.

So, GP100 as a 1:2:4 ratio monster (with DP (1) in separate clusters and FP16 (4) done through the native FP32 (2) units), as that would maximize utilization of the datapaths. At the same time, there will be a smaller 1:32:64 ASIC (similar to GM204 from a ratio perspective, with dedicated FP64 units) for gaming parts and maybe dense neural network stuff, if those workloads really dig the FP16 precision.

If need be, there can be a GP102 chip as another big GPU, but optimized for gaming with 1:32:64 ratios, once the process has matured and gotten cheap enough for a refresh of GP104.

Reasoning:
The architecture cycle has slowed down and we are at the start of a new process node that is the first one where power almost completely dominates area.
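As an aside on the FP16-through-FP32 point above: on the software side this shows up as packed half2 math in CUDA. A minimal kernel sketch, assuming a toolkit and device that expose the intrinsics (CUDA 7.5+, sm_53-class parts such as the TX1); the kernel name and shapes are mine:

Code:
#include <cuda_fp16.h>

// Each thread issues one fused multiply-add on TWO fp16 values packed into a
// single __half2 register; pairing fp16 ops through the fp32-wide datapath is
// what gives the 2x fp16 rate relative to fp32.
__global__ void saxpy_half2(__half2 a, const __half2* x, __half2* y, int n2)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n2)
        y[i] = __hfma2(a, x[i], y[i]);   // needs __CUDA_ARCH__ >= 530
}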
 
That's what they want you to think. A toolchain, even a complex one, could cost millions, but the cost of lock-in is potentially almost infinite. The only "reason" old code doesn't get updated is "well, it runs on the old hardware and we have the old hardware, so why bother?" By your logic, the systems still running Windows 3.1 do so because it's somehow too expensive to rewrite the software.

DNN work is extremely hardware intensive; four Titan Xs to a case is not exactly inexpensive. And the toolchain Nvidia provides wouldn't be a huge cost to replace. It's the software built on it that took all that money, and only now that the money is spent, and everyone realizes they locked themselves into a single IHV just to save a bit of time up front, do people rationalize that it was "worth it" all along.

The exact same bullshit has always applied: to 3dfx Glide, to building a DirectX-based rendering engine instead of a higher-level open abstraction, and so on. It's brilliant on Nvidia's part, and it worked. Someone, anyone, could put out a GPU with 64 GB of RAM and quadruple the performance of an Nvidia card, and DNN builders would still hesitate. It's why Nvidia did what it did in the first place. Vendor lock-in, software, hardware, etc. pretty much always costs more than it's worth. It's only after you've gone down the hole too far to get out that most people realize their mistake.

It isn't even close to the same issue. There is no software lock. nVidia was the first to specifically target CNN applications with software libraries, but those libraries are just a single link in a long chain of development tools. Intel has also done work to specifically target neural network applications (https://software.intel.com/en-us/ar...d-training-on-intel-xeon-e5-series-processors). Nervana has also written these tools for CUDA. AMD is the one coming up short: if they want customers to buy their hardware for neural networks, they need to do the work. However, there is an OpenCL implementation of Torch out there; you can see how it performs here. (https://github.com/soumith/convnet-benchmarks)

People doing work with CNNs love nVidia because they are aggressively pursuing applications for those neural networks. TX1 is a great example.

At any rate nVidia is offering great value for the money to developers. They should be applauded for their work.
 
I'm specifically asking about neural network workloads, where everybody is using highly optimized frameworks, yet none of them seem to have solid support for AMD GPUs. (Caffe has nothing, Torch7 seems to have some support, Theano is unclear.)

I tried looking for comparative benchmarks and can't find anything meaningful.

So if RecessionCone has some real-life benchmarks, that'd be very interesting.

It doesn't matter that AMD has the lowest heat output and the most memory if none of the popular software can use it.

There's an OpenCL initiative for Caffe from AMD and some benchmarks too; not sure how it compares with NVIDIA.

https://github.com/amd/OpenCL-caffe

From here,

http://stackoverflow.com/questions/30622805/opencl-amd-deep-learning
 
https://github.com/soumith/convnet-benchmarks

As mentioned above, the benchmark includes the OpenCL implementation of Torch tested on AMD hardware for comparison purposes.


I still don't know how the OpenCL implementation was done, and, more seriously, I have no idea what this benchmark is actually testing... I have never seen an OpenCL benchmark that is slower on AMD GPUs than on Nvidia cards, which only support OpenCL 1.1 (and, to be honest, are seriously bad at 1.1 too).

So, is this particular benchmark more representative of AMD GPU performance than others, or is it a special case?
 


Only halfway (they never implemented all the features of 1.1, let alone 1.2, and their driver for it is just a damn nightmare), and we are at OpenCL 2.1 now...

And I have never seen any plan from Nvidia to support OpenCL 2.x...

So don't blame me if I laugh a little when I see a benchmark running Nvidia GPUs on OpenCL...
 
So, is this particular benchmark more representative of AMD GPU performance than others, or is it a special case?
Dude, it's not difficult.

This is just one particular kind of workload that happens to be an important one in an emerging field. Nobody is saying that AMD GPUs are worse than Nvidia GPUs in theory. It's just that, in practice, none of the smart kids can be bothered to spend a lot of precious time on the libraries and optimize them like crazy. As a result, their non-optimized libraries perform an order of magnitude worse.

That is all.
 
You don't hope to convince us with one benchmark, do you? And what the hell, I have no idea where this benchmark even comes from...


When a new market is created, like compute or hardware-supported neural nets, there will always be reasons for companies to pick one vendor's hardware over another's. When you look at the amount of money these companies are shelling out for this type of research, if the support isn't there from whoever is selling the hardware, that hardware won't be adopted. Once it isn't adopted, optimizations for it automatically get dropped, and this is what we are seeing. If AMD pushed their initiative well they could regain some of that, but it's an uphill battle and will cost more now: nV was first to market, then Intel (both nV and Intel have continued and increased their initiatives where AMD really hasn't), then AMD.
 
You can already have 400GB on a GPU. Just use host mapped memory. I do it all the time to exceed GPU memory limits. Of course, accessing memory over PCIe is slow.
KNL is about 90GB/s supposedly.

Guess what, KNL also accesses its 400GB of memory through a narrow interface. So no, my application is not the use case for KNL. It turns out that deep learning needs a balance of compute, memory capacity, memory bandwidth, cost, and software support. KNL has worse compute than its 2016 competitors (remember, KNL specs aren't even public yet); its memory bandwidth looks to be pretty bad (~400 GB/s going up against the 800 GB/s or so I'm guessing for HBM2); the capacity of its HMC is OK but not amazing; cost is probably going to be terrible, as the other Xeon Phis were expensive; and software support is getting better but still not great. I'm expecting MKL to get deep learning support only two years after cuDNN appeared.
SGEMM isn't main memory bandwidth bound in any meaningful fashion (<< 0.1 byte per FLOP in any decent implementation), so ~480GB/s should support around 7 TFLOPs without breaking a sweat.
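Quick back-of-envelope on that bytes-per-FLOP figure (my numbers, assuming a simple b×b blocking of C): each b×b tile of C needs a b×N panel of A and an N×b panel of B, so

bytes/FLOP ≈ (2 · b · N · 4 bytes) / (2 · b² · N) = 4 / b

With b = 128 that's about 0.03 bytes/FLOP, so 480 GB/s could in principle feed 480 / 0.03 ≈ 15 TFLOPS of SGEMM; 7 TFLOPS is nowhere near bandwidth limited.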

It's interesting how badly optimised Caffe is:

https://software.intel.com/en-us/ar...d-training-on-intel-xeon-e5-series-processors

With the right use of Intel MKL, vectorization, and parallelization it is possible to achieve an 11x increase in training performance and a 10x increase in classification performance compared to a non-optimized Caffe implementation.

That's with 28 cores at a worst case of 1.9GHz (AVX2 base clock, assuming thermals force it to that clock):

http://www.intel.com/content/dam/ww...on-e5-v3-advanced-vector-extensions-paper.pdf

For example, Linpack runs closer to the AVX base frequency on the Intel Xeon processor E5 v3 family, so using the Intel AVX base frequency to calculate theoretical peak FLOPS will more accurately represent Linpack efficiency.
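For reference, the single-precision peak those numbers imply (assuming Haswell-EP's two AVX2 FMA units per core):

28 cores × 1.9 GHz × 2 FMAs/cycle × 8 SP lanes × 2 FLOPs/FMA ≈ 1.7 TFLOPS

against roughly 6 TFLOPS peak for a Titan X.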

But agreed, a Titan X at ~$1000 is cheaper than KNL at ~$8000 (guess).
 