Is fp16 enough for the transform matrix?
If you want to position something on a 2048 pixel screen, then yes, otherwise, no.
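
For reference, fp16 has a 10-bit mantissa (11 significant bits), so integers are representable exactly only up to 2048. A quick numpy check (this is plain IEEE half precision, nothing GPU specific):

```python
import numpy as np

# fp16 value spacing: 0.5 in [512, 1024), 1 in [1024, 2048), 2 in [2048, 4096)
for x in (1023.5, 2047.0, 2048.0, 2049.0, 4095.0):
    print(f"{x:>7} -> {float(np.float16(x)):>7}")
```

2049.0 already rounds back to 2048.0, so past a 2048 pixel screen you can't even hit individual pixel centers, let alone sub-pixel positions.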

> If you want to position something on a 2048 pixel screen, then yes, otherwise, no.

Tiling, perhaps? It makes sense for all the interpolation work, and devs should be able to work around the precision issue in screen space easily enough. It would save a lot of space in various distributed caches.
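
A toy numpy sketch of that screen-space workaround (the 256 px tile size is just an assumption for illustration): store coordinates relative to a tile origin, and fp16 keeps sub-pixel precision even on a wide framebuffer:

```python
import numpy as np

TILE = 256  # hypothetical tile size for a framebuffer wider than 2048 px

for gx in (3000.25, 3000.5):                 # global x coordinates > 2048
    direct = np.float16(gx)                  # stored directly in fp16: rounds
    tile_origin = (gx // TILE) * TILE        # tile origin kept in fp32/int
    local = np.float16(gx - tile_origin)     # tile-local offset: exact in fp16
    print(f"global={gx}: fp16 direct={float(direct)}, "
          f"tile origin + fp16 local={tile_origin + float(local)}")
```

Both global values collapse to 3000.0 when stored directly in fp16, while the tile-local form round-trips exactly.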

> The output is FP32 (which has 23 bits of mantissa). So you'll be able to position things far more precisely than that.

That won't improve accuracy much. For deep learning it works because the operations sum many consecutive FP16 multiplications, essentially accumulating the results of many binary operations. In graphics I can't think of many good examples of doing that. Maybe kinematics or tessellation, where the results can be a bit more fungible with longer dependency chains. It will be interesting to see what devs come up with, since a decade ago tessellation and kinematics weren't practical enough to warrant anything less than FP24. Blending is mostly multiplication, but downsampling would involve some addition.
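
That accumulation point is easy to see on the CPU with a toy numpy dot product (just an illustration of the rounding behaviour, not actual tensor-core code):

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.random(10_000).astype(np.float16)
b = rng.random(10_000).astype(np.float16)

acc16 = np.float16(0.0)  # fp16 multiply, fp16 accumulate
acc32 = np.float32(0.0)  # fp16 multiply, fp32 accumulate (the tensor-core scheme)
for x, y in zip(a, b):
    p = x * y                      # fp16 product
    acc16 = np.float16(acc16 + p)  # small products round away as the sum grows
    acc32 += np.float32(p)
ref = np.dot(a.astype(np.float64), b.astype(np.float64))

print(f"fp16 accumulate: {float(acc16):8.1f}")  # stalls far below the true sum
print(f"fp32 accumulate: {float(acc32):8.1f}")  # tracks the fp64 reference
print(f"fp64 reference : {ref:8.1f}")
```

Once the fp16 running sum passes 2048, its value spacing is 2, so every remaining sub-1.0 product rounds to nothing; the fp32 accumulator has mantissa to spare.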

> Weird results - comparing V100 OpenCL and Cuda

Below, P100 vs V100 on CUDA:
Benchmark              | P100                         | DGX-1 V100
CUDA Score             | 320031                       | 743537
Sobel                  | 528482 (23.3 Gpixels/sec)    | 1382119 (60.9 Gpixels/sec)
Histogram Equalization | 455379 (14.2 Gpixels/sec)    | 996475 (31.1 Gpixels/sec)
SFFT                   | 66489 (165.7 Gflops)         | 101670 (253.5 Gflops)
Gaussian Blur          | 538403 (9.43 Gpixels/sec)    | 1897300 (33.2 Gpixels/sec)
Face Detection         | 49263 (14.4 Msubwindows/sec) | 108700 (31.7 Msubwindows/sec)
RAW                    | 1139825 (11.0 Gpixels/sec)   | 2743361 (26.6 Gpixels/sec)
Depth of Field         | 571644 (1.66 Gpixels/sec)    | 1499040 (4.35 Gpixels/sec)
Particle Physics       | 397917 (62904.7 FPS)         | 786603 (124350.1 FPS)

> Well, you still need to take into account the large increase in CUDA cores; you could work out the potential per-SM performance increase for those types of workloads. You can also look at a perf/mm² comparison, since V100 is massive.

Yes, sure, but if you look at the Sobel, Histogram, and Gaussian Blur scores, they all show roughly a 3x performance improvement, which is way beyond the V100 CUDA core increase...
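
Quick back-of-the-envelope from the table above (3584 vs 5120 FP32 CUDA cores are the published P100/V100 specs; boost clocks are close enough to ignore here):

```python
# Geekbench CUDA scores from the table above: (P100, V100)
scores = {
    "Sobel":                  (528482, 1382119),
    "Histogram Equalization": (455379,  996475),
    "Gaussian Blur":          (538403, 1897300),
    "SFFT":                   ( 66489,  101670),
}
core_ratio = 5120 / 3584  # V100 vs P100 FP32 core count, ~1.43x

for name, (p100, v100) in scores.items():
    total = v100 / p100
    print(f"{name:22s} {total:.2f}x total, {total / core_ratio:.2f}x per core")
```

Even after dividing out the core count, Sobel and Gaussian Blur still show large per-core gains, while SFFT barely moves past parity.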

> Yes, sure, but if you look at the Sobel, Histogram, and Gaussian Blur scores, they all show roughly a 3x performance improvement, which is way beyond the V100 CUDA core increase...

I'm not saying the numbers are wrong or in any way incomprehensible. I'm saying that you can't simply extrapolate the potential V100 performance delta from these numbers, and that additional factors, such as CUDA core count, need to be considered as well. It's stupid clickbait articles from the usual sites (like the "source" here) that lead to nvidiots posting on forums claiming that GV104 will be 2-3 times as fast as Pascal.

> It's stupid clickbait articles from the usual sites (like the "source" here) that lead to nvidiots posting on forums claiming that GV104 will be 2-3 times as fast as Pascal.

Hopefully, not a single sane person will claim that :|

> Weird results - comparing V100 OpenCL and Cuda

NVIDIA often supports OpenCL only at the bare minimum level. GP100 scores 320K with CUDA, but only 278K with OpenCL.

> NVIDIA often supports OpenCL only at the bare minimum level. GP100 scores 320K with CUDA, but only 278K with OpenCL.

True, and it's also interesting that Volta shows a much bigger gap between OpenCL and CUDA performance than Pascal does. Maybe NVIDIA put all their initial effort into CUDA, leaving OpenCL behind for now...

V100 compute uarch seems extremely solid.