Nvidia Volta Speculation Thread

Discussion in 'Architecture and Products' started by DSC, Mar 19, 2013.

  1. Ethatron

    Regular Subscriber

    Joined:
    Jan 24, 2010
    Messages:
    946
    Likes Received:
    413
    If you want to position something on a 2048 pixel screen, then yes, otherwise, no.
     
  2. RecessionCone

    Regular Subscriber

    Joined:
    Feb 27, 2010
    Messages:
    505
    Likes Received:
    189
    The output is FP32 (which has 23 bits of mantissa). So you'll be able to position things far more precisely than that.
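To put numbers on this, a minimal numpy check of the value spacing (ulp) at the 2048 boundary:

```python
import numpy as np

# Distance to the next representable value ("ulp") at magnitude 2048.
# FP16 stores a 10-bit mantissa, FP32 a 23-bit one.
print(np.spacing(np.float16(2048.0)))  # 2.0 -> FP16 can't even hit odd pixels here
print(np.spacing(np.float32(2048.0)))  # 0.000244... -> about 1/4096 of a pixel
```

So an FP32 output leaves roughly twelve bits of sub-pixel headroom on a 2048-pixel screen.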
     
    pharma likes this.
  3. Anarchist4000

    Veteran

    Joined:
    May 8, 2004
    Messages:
    1,439
    Likes Received:
    359
    Tiling perhaps? Makes sense for all the interpolation work and devs should be able to work around the precision issue in screen space easily enough. Would save a lot of space in various distributed caches.

    That won't improve accuracy much. For deep learning it works because the operations sum many consecutive FP16 multiplications, essentially accumulating a long series of individual products. In graphics I can't think of many good examples of doing that. Maybe kinematics or tessellation, where the results can be a bit more fungible with longer dependency chains. It will be interesting to see what devs come up with, as a decade ago tessellation and kinematics weren't practical enough to warrant anything less than FP24. Blending is mostly multiplication, but downsampling would involve some addition.
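    For intuition on why FP16 multiplies feeding an FP32 accumulator behave so differently from pure FP16, here's a small numpy sketch of a 4096-element dot product (illustrative only; tensor cores do the same mixed-precision trick in hardware):

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.random(4096, dtype=np.float32).astype(np.float16)
b = rng.random(4096, dtype=np.float32).astype(np.float16)

# Tensor-core style: FP16 multiplies, FP32 accumulation.
acc32 = np.float32(0.0)
for x, y in zip(a, b):
    acc32 += np.float32(x) * np.float32(y)

# Naive: accumulate in FP16 too. Once the running sum grows past ~1024,
# its ulp exceeds the typical addend and small products get dropped.
acc16 = np.float16(0.0)
for x, y in zip(a, b):
    acc16 = np.float16(acc16 + x * y)

ref = np.dot(a.astype(np.float64), b.astype(np.float64))  # exact reference
print(abs(acc32 - ref))  # tiny accumulation error
print(abs(acc16 - ref))  # far larger error
```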
     
  4. keldor314

    Newcomer

    Joined:
    Feb 23, 2010
    Messages:
    132
    Likes Received:
    13
    Normals and other directional vectors don't need that much precision. 23 bits of mantissa give you enough precision to point a laser pointer at a specific person's house... assuming you're an astronaut currently standing on the moon. To further put this in perspective, this beats out the resolving capabilities of any of the world's most powerful telescopes (such as Hubble) by half an order of magnitude.

    10 bits of mantissa lets you point to any given pixel in a 2k cubemap. This is more than enough, even by movie production standards, for any sort of lighting calculations that directional vectors are generally used in (assume that if you need direct specular reflections in a movie, you'll raytrace them).
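    A quick numpy check of that 2k figure: the coarsest step FP16 can represent anywhere inside [-1, 1] is the ulp at 1.0:

```python
import numpy as np

# FP16's worst-case spacing inside [-1, 1] is the ulp at 1.0: 2**-10.
step = float(np.spacing(np.float16(1.0)))
print(step)             # 0.0009765625 (= 2**-10)
print(int(2.0 / step))  # 2048 -> one step per texel across a 2k cubemap face edge
```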
     
    pharma likes this.
  5. Ethatron

    Regular Subscriber

    Joined:
    Jan 24, 2010
    Messages:
    946
    Likes Received:
    413
    Don't forget the perspective divide if you're talking about 3D positioning. It's very limiting when you can only target a 3x3-pixel region in the center of the view for objects at the precision boundary of your FP16 universe (90 degree FoV -> x/z, y/z -> +-2047/2047 = +-1). If you bring the exponent into it, it gets difficult to explain how things between 2047 and 65504 snap weirdly, while the targetable viewport for objects at the 2047 boundary only rises to ~65x65 (+-65504/2047 = +-32).
    I'm sure it's possible to construct a valid set of applications for this math even in 3D, but it's far from universally recommendable.
     
    #585 Ethatron, Sep 9, 2017
    Last edited: Sep 9, 2017
    Lightman and BRiT like this.
  6. xpea

    Regular

    Joined:
    Jun 4, 2013
    Messages:
    551
    Likes Received:
    783
    Location:
    EU-China
    Dayman1225 and pharma like this.
  7. pharma

    Veteran

    Joined:
    Mar 29, 2004
    Messages:
    4,891
    Likes Received:
    4,539
    NVIDIA Volta Tesla V100 GPU Accelerator Compute Performance Revealed – Features A Monumental Increase Over Pascal Based Tesla P100

    In terms of specifications, this machine rocks eight Tesla V100 GPUs with 5,120 cores each. That totals 40,960 CUDA cores and 5,120 Tensor cores. The DGX-1 houses a total of 128 GB of HBM2 memory across its eight Tesla V100 GPUs. The system features dual Intel Xeon E5-2698 V4 processors with 20 cores and 40 threads each, clocked at 2.2 GHz. There's 512 GB of DDR4 memory inside the system. Storage is provided by four 1.92 TB SSDs configured in RAID 0, and networking is dual 10 GbE with up to 4 EDR InfiniBand ports. The system comes with a 3.2 KW PSU. You can find more details here.

    The system can be compared to an HP Z8 G4 workstation, which has nine PCIe slots and scores 278706 points in the OpenCL API with a Quadro GP100, essentially a Tesla P100-spec'd card. Moving over to the fastest Tesla P100 listing, we see a total of 8 PCIe cards configured to reach a score of 320031 in the CUDA API. But let's take a look at the mind-boggling Tesla V100 scores. A DGX-1 system with 8 SXM2 Tesla V100 cards scores 481504 in the OpenCL API and a monumental 743537 points with the CUDA API.

    The score puts the Tesla V100 in an impressive lead over its predecessor, which is something we are excited to see. It also suggests we could be looking at a generational leap in the gaming GPU segment if the performance numbers from the chip architecture carry over to the mainstream markets. Another thing worth pointing out is the incredible tuning of compute output with the CUDA API and related libraries. Not only does the Tesla V100 see big gains under CUDA versus OpenCL, but the same can be seen for the Tesla P100, which means NVIDIA is really doing some hard work on their cuDNN library, and it's expected to get even better in the coming generations. So there you have it: NVIDIA's fastest GPU showing off some killer performance in its specified compute-related workloads.
    http://wccftech.com/nvidia-volta-tesla-v100-gpu-compute-benchmarks-revealed/
     
    #587 pharma, Sep 17, 2017
    Last edited: Sep 17, 2017
    DrYesterday, Lightman, xpea and 4 others like this.
  8. Grall

    Grall Invisible Member
    Legend

    Joined:
    Apr 14, 2002
    Messages:
    10,801
    Likes Received:
    2,176
    Location:
    La-la land
    Madness. lol

    Wicked, wicked madness!
     
  9. CarstenS

    Legend Subscriber

    Joined:
    May 31, 2002
    Messages:
    5,800
    Likes Received:
    3,920
    Location:
    Germany
    Are there detailed results available? Without them, this could end up like that one SPEC CPU2006 test where a single sub-result ran enormously fast and, since the sub-scores were not normalized, distorted the overall result.
     
  10. troyan

    Regular

    Joined:
    Sep 1, 2015
    Messages:
    605
    Likes Received:
    1,126
  11. pharma

    Veteran

    Joined:
    Mar 29, 2004
    Messages:
    4,891
    Likes Received:
    4,539
    #591 pharma, Sep 18, 2017
    Last edited: Sep 18, 2017
    xpea likes this.
  12. CarstenS

    Legend Subscriber

    Joined:
    May 31, 2002
    Messages:
    5,800
    Likes Received:
    3,920
    Location:
    Germany
    Weird results when comparing V100 OpenCL and CUDA
     
  13. xpea

    Regular

    Joined:
    Jun 4, 2013
    Messages:
    551
    Likes Received:
    783
    Location:
    EU-China
    Below P100 vs V100 on CUDA:
    https://browser.geekbench.com/v4/compute/compare/485469?baseline=1098943

    Code:
                                  P100            |       DGX-1 V100
    CUDA Score                   320031           |         743537
    Sobel                        528482           |        1382119
                            23.3 Gpixels/sec      |    60.9 Gpixels/sec
    Histogram Equalization       455379           |         996475
                            14.2 Gpixels/sec      |    31.1 Gpixels/sec
    SFFT                          66489           |         101670
                              165.7 Gflops        |      253.5 Gflops
    Gaussian Blur                538403           |        1897300
                            9.43 Gpixels/sec      |    33.2 Gpixels/sec
    Face Detection                49263           |         108700
                        14.4 Msubwindows/sec      |  31.7 Msubwindows/sec
    RAW                         1139825           |        2743361
                            11.0 Gpixels/sec      |    26.6 Gpixels/sec
    Depth of Field               571644           |        1499040
                            1.66 Gpixels/sec      |    4.35 Gpixels/sec
    Particle Physics             397917           |         786603
                               62904.7 FPS        |      124350.1 FPS
    We see that the performance increase is consistent across all workloads. Looks like the new Volta SMs do wonders :runaway:

    Edit: formatting

    Edit2: note that Geekbench only uses one GPU, even if a system has multiple GPUs. Thus the comparison is one P100 vs one V100. Source:
    http://support.primatelabs.com/discussions/geekbench/16171-geekbench-4-multiple-gpu-benchmark

    Edit3: the fastest Vega RX score in Geekbench 4 compute reaches 204,593 (compared to 481,504 for the V100 in OpenCL mode)

    Edit4: the V100 score is even more impressive considering that the Geekbench compute test doesn't use the Tensor cores
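    The per-workload speedups implied by the sub-scores above can be checked with a few lines of Python:

```python
# Per-workload V100/P100 speedups from the Geekbench sub-scores above.
scores = {
    "Sobel":                  (528482, 1382119),
    "Histogram Equalization": (455379, 996475),
    "SFFT":                   (66489, 101670),
    "Gaussian Blur":          (538403, 1897300),
    "Face Detection":         (49263, 108700),
    "RAW":                    (1139825, 2743361),
    "Depth of Field":         (571644, 1499040),
    "Particle Physics":       (397917, 786603),
}
for name, (p100, v100) in scores.items():
    print(f"{name:24s}{v100 / p100:5.2f}x")
```

    The spread runs from about 1.5x on SFFT to about 3.5x on Gaussian Blur.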
     
    #593 xpea, Sep 18, 2017
    Last edited: Sep 19, 2017
    Alexko, pharma, Lightman and 3 others like this.
  14. Malo

    Malo Yak Mechanicum
    Legend Subscriber

    Joined:
    Feb 9, 2002
    Messages:
    8,931
    Likes Received:
    5,530
    Location:
    Pennsylvania
    Well, you still need to take into account the large increase in CUDA cores; you could work out their potential per-SM performance increase for those types of workloads. You could also look at a perf/mm² comparison, since V100 is massive.
     
  15. xpea

    Regular

    Joined:
    Jun 4, 2013
    Messages:
    551
    Likes Received:
    783
    Location:
    EU-China
    Yes, sure, but if you look at the Sobel, Histogram and Gaussian Blur scores, they all show roughly a 3x performance improvement, which is way beyond the V100 CUDA core increase...
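    Normalizing by the published CUDA core counts (3,584 on Tesla P100, 5,120 on Tesla V100) backs this up; clock differences are ignored here, so treat it as a rough sketch:

```python
# Gaussian Blur speedup normalized by CUDA core count (3584 on Tesla P100,
# 5120 on Tesla V100 -- public spec-sheet figures; clocks ignored).
speedup = 1897300 / 538403        # ~3.52x from the sub-scores above
core_ratio = 5120 / 3584          # ~1.43x more CUDA cores
print(f"per-core gain: {speedup / core_ratio:.2f}x")  # ~2.47x
```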
     
  16. Malo

    Malo Yak Mechanicum
    Legend Subscriber

    Joined:
    Feb 9, 2002
    Messages:
    8,931
    Likes Received:
    5,530
    Location:
    Pennsylvania
    I'm not saying the numbers are wrong or in any way incomprehensible. I'm saying that you can't simply extrapolate a potential V100 performance delta from these numbers, and that additional factors such as CUDA core count need to be considered as well. It's the stupid clickbait articles from the usual sites (like the "source" here) that lead to nvidiots posting on forums claiming that GV104 will be 2-3 times as fast as Pascal.
     
  17. xpea

    Regular

    Joined:
    Jun 4, 2013
    Messages:
    551
    Likes Received:
    783
    Location:
    EU-China
    Hopefully, not a single sane person will claim that :|

    Some remarks regarding my last post:

    1/ Geekbench only uses one GPU, even if a system has multiple GPUs. Thus the comparison is one P100 vs one V100

    2/ the fastest Vega RX score in Geekbench 4 compute reaches 204,593 (compared to 481,504 for the V100 in OpenCL mode). Thus the V100 is more than 2 times faster than Vega 10 in this benchmark under OpenCL, and more than 3.5 times faster with CUDA!

    3/ the V100 score is even more impressive considering that the Geekbench compute test doesn't use the Tensor cores

    Of course, we don't know yet how close the Volta gaming SMs will come to V100's, but one thing is sure: the V100 compute uarch seems extremely solid. In fact, this kind of generational performance jump doesn't happen very often. I see something like the G80 revolution here :yep2:
     
    Grall, Malo, Lightman and 2 others like this.
  18. DavidGraham

    Veteran

    Joined:
    Dec 22, 2009
    Messages:
    3,976
    Likes Received:
    5,213
    NVIDIA often supports OpenCL only at the bare minimum level. GP100 scores 320K with CUDA, but only 278K with OpenCL.
     
    pharma and xpea like this.
  19. xpea

    Regular

    Joined:
    Jun 4, 2013
    Messages:
    551
    Likes Received:
    783
    Location:
    EU-China
    True, and it's also interesting to note that Volta shows a much bigger performance gap between OpenCL and CUDA than Pascal does. Maybe Nvidia put all their initial effort into CUDA, leaving OpenCL behind for now...
     
  20. nnunn

    Newcomer

    Joined:
    Nov 27, 2014
    Messages:
    40
    Likes Received:
    31
    That "combined L1 data cache and shared memory subsystem", plus the boost to HBM2, is all our code wanted. Since our code is bandwidth-limited (the P100 has more than enough flops), if we could dial back core speed and voltage (reducing power) while nudging main memory towards 1000 GB/sec... happy days.
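    As a rough roofline sketch using the public peak figures (FP64 flops and HBM2 bandwidth, SXM2 parts), the ridge point below which a kernel stays bandwidth-bound:

```python
# Roofline ridge point: FLOPs of arithmetic per byte of traffic below which
# a kernel is bandwidth-bound (public peak FP64 / HBM2 figures, SXM2 parts).
parts = {
    "Tesla P100": (5.3e12, 732e9),   # peak FP64 FLOP/s, bytes/s
    "Tesla V100": (7.8e12, 900e9),
}
for name, (flops, bw) in parts.items():
    print(f"{name}: bandwidth-bound below {flops / bw:.1f} FLOPs/byte")
```

    Pushing memory towards 1000 GB/sec while holding flops steady would lower that threshold, which is exactly what a bandwidth-limited code wants.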
     
    pharma and DavidGraham like this.