Nvidia Pascal Speculation Thread

Discussion in 'Architecture and Products' started by DSC, Mar 25, 2014.

Thread Status:
Not open for further replies.
  1. LiXiangyang

    LiXiangyang Newcomer

    I hope NVIDIA won't put too much weight on DNN stuff in their next-generation GPUs; FP16 is almost useless in most computing tasks other than DNNs.

    As for the AMD vs NVIDIA thing: in GPGPU, AMD is nothing. AMD offers almost nothing here (even their BLAS is now open-sourced, which suggests AMD has all but given up on it), while NVIDIA offers a full toolchain. There is no comparison.
     
  2. RecessionCone

    RecessionCone Regular Subscriber

    No. cuDNN is written in assembly anyway.
     
    Razor1 and pharma like this.
  3. Frenetic Pony

    Frenetic Pony Regular

    That's what they want you to think. A toolchain, even a complex one, could be millions, but the cost of lock-in is potentially almost infinite. The only "reason" old code doesn't get updated is "well, it runs on the old hardware and we have the old hardware, so why bother?" By your logic, the systems still running Windows 3.1 do so because it's somehow too expensive to rewrite the software.

    DNN work is extremely hardware intensive; it's not as if four Titan Xs in a case is somehow inexpensive. And the toolchain Nvidia provides wouldn't be a huge cost to replace. It's the software built on it that took all that money, and only now that the money is spent, and everyone realizes they locked themselves into a single IHV just to save a bit of time up front, do people rationalize that it was "worth it" the whole time.

    The exact same bullshit has always applied: to 3dfx Glide, to building a DirectX-based rendering engine instead of a higher-level open abstraction, and so on. It's brilliant on Nvidia's part; it worked. Someone, anyone, could put out a GPU with 64 GB of RAM and quadruple the performance of an Nvidia card, and DNN builders would still hesitate. It's why Nvidia did what it did in the first place. Vendor lock-in (software, hardware, etc.) pretty much always costs more than it's worth. It's only after you've gone too far down the hole to get out that most tend to realize their mistake.
     
  4. RecessionCone

    RecessionCone Regular Subscriber

    Here's a benchmark of AMD CNN performance.
    A FirePro W9100 (5.24 TFLOPS peak) is 11x slower than an original Titan (4.5 TFLOPS peak) using cuDNN 2.0. cuDNN 3.0 is much faster, at least on the Titan X we use, so this comparison isn't super useful, other than to show that the software for AMD is effectively non-existent.

    https://github.com/vmarkovtsev/veles-benchmark/blob/master/src/alexnet.md
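    For scale, you can normalize the reported runtime gap by the two cards' peak throughput. This is just a back-of-envelope sketch using the figures quoted above, not a benchmark:

```python
# Back-of-envelope: normalize the reported 11x runtime gap by peak FLOPs.
# All figures are the ones quoted in the post above; "efficiency" here
# means achieved-per-peak relative to the Titan run, nothing absolute.
w9100_peak_tflops = 5.24   # FirePro W9100 peak (quoted above)
titan_peak_tflops = 4.5    # original Titan peak (quoted above)
slowdown = 11.0            # reported runtime ratio, W9100 vs Titan

# The W9100 achieves 1/11 of the Titan's throughput despite having
# ~1.16x the peak, so its per-peak efficiency relative to the Titan is:
relative_efficiency = (1 / slowdown) / (w9100_peak_tflops / titan_peak_tflops)
print(f"W9100 per-peak efficiency vs Titan: {relative_efficiency:.3f}")

# Equivalently, the software gap once hardware headroom is factored in:
print(f"software gap: {1 / relative_efficiency:.1f}x")
```

    That is, the 11x measured gap is really a ~12.8x software gap once you credit the W9100 for its higher peak.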
     
  5. RecessionCone

    RecessionCone Regular Subscriber

    It's not a mistake: AMD is not a realistic alternative. Even assuming the software existed (which it doesn't), their perf/W is not good enough to allow 8 high-performance GPUs to sit together in a box.
    The fact that their highest performing GPU has 4 GB of memory also disqualifies them.

    Given these constraints, of course no one will write software for them.
     
  6. RecessionCone

    RecessionCone Regular Subscriber

    Less heat because it's slower. Might as well put a Radeon 6750 in the box if all you need is less heat.
     
  7. RecessionCone

    RecessionCone Regular Subscriber

    You can already have 400 GB on a GPU. Just use host-mapped memory. I do it all the time to exceed GPU memory limits. Of course, accessing memory over PCIe is slow.
    Guess what: KNL also accesses its 400 GB of memory through a narrow interface. So no, my application is not the use case for KNL. It turns out that deep learning needs a balance of compute, memory capacity, memory bandwidth, cost, and software support. KNL has worse compute than its 2016 competitors (remember, KNL specs aren't even public yet); its memory bandwidth looks to be pretty bad (~400 GB/s going up against 800 GB/s or so, I'm guessing, with HBM2); the capacity of its HMC is okay, but not that amazing; cost is probably going to be terrible, as the other Xeon Phis were expensive; and software support is getting better, but still not great. I'm expecting MKL to get deep learning support only two years after cuDNN appeared.
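    To put numbers on "accessing memory over PCIe is slow": a rough streaming-time comparison, assuming ballpark figures of ~12 GB/s effective for PCIe 3.0 x16 and ~336 GB/s for a Titan X's on-board GDDR5 (assumptions, not measurements):

```python
# Rough comparison of one streaming pass over a large working set,
# mapped host memory (over PCIe) vs on-board GDDR5.
# Bandwidth figures are ballpark assumptions, not measurements.
working_set_gb = 100.0   # hypothetical model/data size
pcie_gbps = 12.0         # effective PCIe 3.0 x16 bandwidth (assumed)
gddr5_gbps = 336.0       # Titan X on-board bandwidth (assumed)

for name, bw in [("PCIe 3.0 x16", pcie_gbps), ("on-board GDDR5", gddr5_gbps)]:
    print(f"streaming {working_set_gb:.0f} GB over {name}: "
          f"{working_set_gb / bw:.2f} s per pass")
```

    A ~28x per-pass gap under these assumptions, which is why "400 GB on a GPU" via mapped host memory is no substitute for real on-board capacity.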
     
    Razor1, silent_guy and pharma like this.
  8. silent_guy

    silent_guy Veteran Subscriber

    Very cool to hear the perspective from people in the field who run more than just toy examples.
     
  9. Dade

    Dade Newcomer

    There are many people running 8x AMD GPUs in a box (or more); a couple of examples:

    - http://luxmark.info/node/836
    - http://luxmark.info/node/1398

    I agree about the absence of software for AMD GPUs, but the hardware is quite competitive with NVIDIA: it is often better in terms of perf/$ and perf/W (in GPU computing tasks). It seems more like a CUDA developer attitude: do they need to convince themselves that there is no alternative to vendor lock-in?
     
  10. CarstenS

    CarstenS Legend Subscriber

    In short, what I am thinking is that it would make the most sense to see almost completely separate chips for consumer and HPC/what-have-you, with the former keeping a 1/32 DP rate or so in order to enable programmers.

    So, GP100 as a 1:2:4 ratio monster (with DP (1) being separate clusters and FP16 (4) being done through the native FP32 (2) units), as that would maximize utilization of the datapaths. At the same time, there would be a smaller 1:32:64 ASIC (similar to GM204 from a ratio perspective, and with dedicated FP64 units) for gaming parts, and maybe for dense neural-network stuff if those workloads really favor FP16 precision.

    If need be, there can be a GP102 chip as another big GPU, but optimized for gaming with 1:32:64, once the process has matured and gotten cheaper, to serve as a refresh of GP104.

    Reasoning:
    The architecture cycle has slowed down, and we are at the start of a new process node, the first one where power almost completely dominates area.
     
  11. Brodda Thep

    Brodda Thep Newcomer

    It isn't even close to the same issue. There is no software lock. nVidia was the first to specifically target CNN applications with software libraries, but those libraries are just a single link in a long chain of development tools. Intel has also done work to specifically target neural network applications (https://software.intel.com/en-us/ar...d-training-on-intel-xeon-e5-series-processors). Nervana has also written these tools for CUDA. AMD is the one coming up short: if they want customers to buy their hardware for neural networks, they need to do the work. That said, there is an OpenCL implementation of Torch out there; you can see how it performs here. (https://github.com/soumith/convnet-benchmarks)

    People doing work with CNNs love nVidia because they are aggressively pursuing applications for those neural networks. TX1 is a great example.

    At any rate nVidia is offering great value for the money to developers. They should be applauded for their work.
     
    homerdog, Razor1, nnunn and 2 others like this.
  12. gamervivek

    gamervivek Regular

    There's an OpenCL initiative for Caffe from AMD, and some benchmarks too; not sure how it compares with nvidia.

    https://github.com/amd/OpenCL-caffe

    From here,

    http://stackoverflow.com/questions/30622805/opencl-amd-deep-learning
     
  13. lanek

    lanek Veteran


    You don't hope to convince us with one benchmark, do you? And I have no idea where this benchmark is coming from.
     
  14. pharma

    pharma Veteran

    https://github.com/soumith/convnet-benchmarks

    As mentioned above, the benchmark contains the OpenCL implementation of Torch tested on AMD hardware for comparison purposes.
     
    Last edited: Dec 5, 2015
  15. lanek

    lanek Veteran


    I still don't know how the OpenCL implementation was done, and, to be more serious, I have no idea what this benchmark is actually testing... I have never seen an OpenCL benchmark slower on AMD GPUs than on Nvidia cards, which only support OpenCL 1.1 (and, to be honest, are seriously bad at 1.1 too).

    So, is this particular benchmark more representative of AMD GPU performance than any other, or is it a particular case?
     
  16. I.S.T.

    I.S.T. Veteran

  17. lanek

    lanek Veteran


    Only halfway (they have never implemented all the features of 1.1, let alone 1.2, and their drivers for it are just a damn nightmare), and we are at OpenCL 2.1 now...

    And I have never seen any plan from Nvidia to support OpenCL 2.x...

    So don't blame me if I laugh a little when I see a benchmark that runs Nvidia GPUs through OpenCL.
     
    Last edited: Dec 6, 2015
    Ext3h, Razor1 and BRiT like this.
  18. silent_guy

    silent_guy Veteran Subscriber

    Dude, it's not difficult.

    This is just one particular kind of workload that happens to be an important one in an emerging field. Nobody is saying that AMD GPUs are in theory worse than Nvidia GPUs. It's just that, in practice, none of the smart kids can be bothered to spend a lot of precious time on AMD's libraries and optimize them like crazy. As a result, the non-optimized libraries perform an order of magnitude worse.

    That is all.
     
  19. Razor1

    Razor1 Veteran


    When a new market is created, like compute or hardware-supported neural nets, there will always be reasons for companies to pick one vendor's hardware over another's. And when you look at the amount of money these companies are shelling out to do this type of research, if the support isn't there from the hardware vendor, the hardware won't be adopted. Once it isn't adopted, optimizations for it are automatically dropped, and that is what we are seeing. If AMD pushed their initiative well, they could regain some of that, but it's an uphill battle and will cost more now, since nV was first to market, then Intel (both nV and Intel have continued and increased their initiatives where AMD really hasn't), then AMD.
     
  20. Jawed

    Jawed Legend

    KNL is about 90 GB/s, supposedly.

    SGEMM isn't main memory bandwidth bound in any meaningful fashion (<< 0.1 byte per FLOP in any decent implementation), so ~480 GB/s should support around 7 TFLOPS without breaking a sweat.
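    The byte-per-FLOP claim can be sanity-checked with a sketch of blocked SGEMM traffic (the block size is a free parameter here; the 480 GB/s figure is the one used above):

```python
# Sanity check on SGEMM arithmetic intensity.  With b-by-b blocking,
# each element of A and B crosses main memory roughly n/b times, so
# for n x n float32 matrices:
#   FLOPs   = 2 * n^3
#   traffic = 2 * n^2 * (n / b) * 4 bytes   (C traffic is negligible)
# giving bytes-per-FLOP = 4 / b, independent of n.
def bytes_per_flop(block):
    return 4.0 / block

for b in (32, 64, 128):
    print(f"block={b:3d}: {bytes_per_flop(b):.4f} B/FLOP")

# At << 0.1 B/FLOP, bandwidth stops being the limit well before 7 TFLOPS;
# e.g. the bandwidth-limited rate at 0.07 B/FLOP with 480 GB/s is:
bw = 480e9
print(f"bandwidth-limited rate: {bw / 0.07 / 1e12:.1f} TFLOP/s")
```

    So even a modest block size leaves ~480 GB/s comfortably able to feed ~7 TFLOPS of SGEMM.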

    It's interesting how badly optimised Caffe is:

    https://software.intel.com/en-us/ar...d-training-on-intel-xeon-e5-series-processors

    That's with 28 cores at a worst case of 1.9 GHz (the AVX2 base clock, assuming thermals force it to that clock):

    http://www.intel.com/content/dam/ww...on-e5-v3-advanced-vector-extensions-paper.pdf

    But agreed, a Titan X at ~$1000 is cheaper than KNL at ~$8000 (guess).
     