Nvidia Volta Speculation Thread

Discussion in 'Architecture and Products' started by DSC, Mar 19, 2013.

  1. Rootax

    Veteran

    Joined:
    Jan 2, 2006
    Messages:
    2,401
    Likes Received:
    1,845
    Location:
    France
    Maybe CUDA allows more "hardware" functions to be exploited than OpenCL, or sits closer to the metal (whatever that means in 2017), hence the performance boost?
     
  2. CarstenS

    Legend Subscriber

    Joined:
    May 31, 2002
    Messages:
    5,800
    Likes Received:
    3,920
    Location:
    Germany
    That's why I said weird: OpenCL and CUDA results are all over the place, with RAW conversion, for example, being massively faster in OpenCL than in CUDA. DoF is also a bit faster there. Normally I'd agree: you would expect CUDA to be faster than OpenCL.
     
    Lightman likes this.
  3. BoMbY

    Newcomer

    Joined:
    Aug 31, 2017
    Messages:
    68
    Likes Received:
    31
    The NVidia DGX-1 system is supposed to have 8x Tesla V100. A shame Geekbench doesn't show how many compute devices it used. Going by the specs, 743537 seems too good for just one of them, but it would be really bad for 8.
     
  4. pharma

    Veteran

    Joined:
    Mar 29, 2004
    Messages:
    4,891
    Likes Received:
    4,539
    https://forum.beyond3d.com/posts/2001559/
     
  5. BoMbY

    Newcomer

    Joined:
    Aug 31, 2017
    Messages:
    68
    Likes Received:
    31
    Yeah, it just doesn't add up. By the specs, the V100 has maybe 150% of the raw performance of a P100, yet the V100 result is 232% of the best P100 result, and the individual subtest results are 2x to 3.5x higher than a P100's. A score in the 500k range would be believable, but not more than 700k for a single V100.
     
  6. pharma

    Veteran

    Joined:
    Mar 29, 2004
    Messages:
    4,891
    Likes Received:
    4,539
    CUDA 9 has significant improvements over CUDA 8, so that may also be coming into play with the results.
    Edit: It would be interesting to see V100 vs P100 with both using CUDA 9.
     
    CarstenS and DavidGraham like this.
  7. xpea

    Regular

    Joined:
    Jun 4, 2013
    Messages:
    551
    Likes Received:
    783
    Location:
    EU-China
    In OpenCL, so using the same API version, the V100 reaches 481k vs 278k for the P100.
    Still a MA-SSI-VE perf increase
     
    Lightman and pharma like this.
  8. BoMbY

    Newcomer

    Joined:
    Aug 31, 2017
    Messages:
    68
    Likes Received:
    31
    Yeah, but 50% raw performance plus a 23% architecture improvement is much more believable than 50% raw performance plus an 82% architecture improvement.
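This additive bookkeeping can be checked against the scores quoted earlier in the thread (a back-of-the-envelope sketch; the score figures are the ones posted above, not independent measurements):

```python
# How many percentage points of improvement are left over once the
# ~50% raw spec gain of V100 over P100 is accounted for.
RAW_SPEC_GAIN = 0.50  # V100 vs P100 raw throughput, per the specs

def extra_gain(v100_score, p100_score, raw=RAW_SPEC_GAIN):
    """Points of gain beyond the raw spec gain, in percentage points."""
    ratio = v100_score / p100_score
    return (ratio - 1.0 - raw) * 100

# OpenCL: 481k vs 278k -> roughly 23 points beyond the raw gain.
print(round(extra_gain(481_000, 278_000)))
# CUDA: the V100 result was 232% of the best P100 -> roughly 82 points.
print(round(extra_gain(2.32, 1.0)))
```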
     
  9. LiXiangyang

    Newcomer

    Joined:
    Mar 4, 2013
    Messages:
    87
    Likes Received:
    48
    The V100 has 2.85x the shared-memory bandwidth of Pascal, a greatly improved L1 cache, a greatly improved memory controller, and a more flexible warp-level SIMD execution model, and it can execute FP and INT computation (the latter usually for computing indices) at the same time. All of the above can reduce latency and instruction stalls significantly.

    You cannot just compare the GFLOPS there; tensor cores aside, GFLOPS is probably the least-improved dimension of the V100's performance.
     
    pharma, xpea, nnunn and 1 other person like this.
  10. CarstenS

    Legend Subscriber

    Joined:
    May 31, 2002
    Messages:
    5,800
    Likes Received:
    3,920
    Location:
    Germany
    Is that shared memory bandwidth figure aggregate over all SMs?
     
  11. BoMbY

    Newcomer

    Joined:
    Aug 31, 2017
    Messages:
    68
    Likes Received:
    31
    So you are saying improved cache speed and 1.2x memory bandwidth are responsible for almost double the overall performance? Okay ... I think I'll wait for independent benchmarks, with an actual, comparable baseline, before believing any of this.

    Hmm, why isn't there an edit function? I also want to add: even NVidia only claims a 1.5x HPC performance gain over the P100 on their V100 product page ...
     
    #611 BoMbY, Sep 19, 2017
    Last edited by a moderator: Sep 19, 2017
  12. Malo

    Malo Yak Mechanicum
    Legend Subscriber

    Joined:
    Feb 9, 2002
    Messages:
    8,931
    Likes Received:
    5,530
    Location:
    Pennsylvania
    You're a new member with only a few messages, you don't have any Edit rights yet.
     
  13. Matoka

    Joined:
    Sep 7, 2017
    Messages:
    1
    Likes Received:
    5
    It's per SM.

    That's because most HPC apps are memory-bandwidth bound, and 1.5x is how much the effective bandwidth improved in Volta versus Pascal. (The gain is much bigger than the ratio of peak bandwidths because Volta's efficiency improved from the high 70s percent of peak to the mid 90s percent of peak.)
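    As a sanity check of that 1.5x figure (a sketch: the peak HBM2 numbers, 732 GB/s for P100 and 900 GB/s for V100, are public specs, and the efficiency fractions are the ones quoted above):

```python
# Effective bandwidth = peak bandwidth x achieved fraction of peak.
p100_eff = 732 * 0.77  # P100: "high 70s percent" of its 732 GB/s peak
v100_eff = 900 * 0.95  # V100: "mid 90s percent" of its 900 GB/s peak
print(f"{v100_eff / p100_eff:.2f}x")  # ~1.52x, close to the quoted 1.5x
```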

    One other improvement worth mentioning is the reduction in back-to-back math op latency:
    • Kepler: 9 clocks
    • Maxwell / Pascal: 6 clocks
    • Volta: 4 clocks
     
    #613 Matoka, Sep 19, 2017
    Last edited by a moderator: Sep 19, 2017
    gamervivek, nnunn, Lightman and 2 others like this.
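    The latency figures above translate directly into dependent-chain throughput: a chain of back-to-back ops where each result feeds the next completes one op per `latency` clocks, so fewer clocks means a proportionally faster serial chain (and fewer warps needed to hide the latency). A minimal sketch using the numbers from the post:

```python
# Back-to-back math op latency, in clocks, as listed in the post above.
LATENCY_CLOCKS = {"Kepler": 9, "Maxwell/Pascal": 6, "Volta": 4}

kepler = LATENCY_CLOCKS["Kepler"]
for arch, lat in LATENCY_CLOCKS.items():
    # A fully dependent op chain issues one op every `lat` clocks.
    print(f"{arch}: {lat} clocks, serial-chain speedup vs Kepler = {kepler / lat:.2f}x")
```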
  14. xpea

    Regular

    Joined:
    Jun 4, 2013
    Messages:
    551
    Likes Received:
    783
    Location:
    EU-China
    The key word here is HPC.
    Geekbench is not an HPC bench. Maybe this light workload fits in Volta's massive caches and gets a big performance boost...
     
  15. CSI PC

    Veteran

    Joined:
    Sep 2, 2015
    Messages:
    2,050
    Likes Received:
    844
    Yeah,
    it will be worth keeping an eye on the AMBER benchmarks, which give a better general indicator for HPC; I would expect the V100 to appear there in the next month or two.
    http://ambermd.org/gpus/benchmarks.htm#Benchmarks
    The downside is that they never do a full node, so the results do not really reflect the mesh improvement. However, of interest to many, they test both PCIe and NVLink-paired GPU setups.

    Cheers
     
  16. xpea

    Regular

    Joined:
    Jun 4, 2013
    Messages:
    551
    Likes Received:
    783
    Location:
    EU-China
    SUMMIT details emerge via nextplatform. A small extract:
    Full article here at the source: https://www.nextplatform.com/2017/09/19/power9-rollout-begins-summit-sierra/
     
    sonen, Alexko, iMacmatician and 2 others like this.
  17. Alexko

    Veteran Subscriber

    Joined:
    Aug 31, 2009
    Messages:
    4,541
    Likes Received:
    964
    So with tensor cores, it could be the first machine to reach 1 exaflop—at weird, limited precision, not FP64, but still.
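    A rough scale check for that remark (an illustration, not a Summit spec; the ~120 TFLOPS figure is Nvidia's quoted per-V100 tensor throughput, FP16 multiply with FP32 accumulate):

```python
# How many V100s it takes to reach 1 exaflop at tensor-core precision.
TENSOR_TFLOPS_PER_V100 = 120       # Nvidia's quoted tensor throughput
EXAFLOP_IN_TFLOPS = 1_000_000      # 1 exaflop = 10^18 FLOPS = 10^6 TFLOPS
gpus = EXAFLOP_IN_TFLOPS / TENSOR_TFLOPS_PER_V100
print(f"~{gpus:,.0f} V100s")
```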
     
  18. Infinisearch

    Veteran

    Joined:
    Jul 22, 2004
    Messages:
    779
    Likes Received:
    146
    Location:
    USA
    Where did you read this?
     
  19. MDolenc

    Regular

    Joined:
    May 26, 2002
    Messages:
    696
    Likes Received:
    446
    Location:
    Slovenia
    Here for example:
     
  20. gamervivek

    Regular

    Joined:
    Sep 13, 2008
    Messages:
    805
    Likes Received:
    320
    Location:
    india
    Would that make much of a difference on the gaming side of things? I was more interested in the fact that they have dedicated INT32 units alongside FP32. How much can Nvidia cut down GV100 for a gaming chip?

    I don't think they'd want to put more FP32 cores in the gaming chip than GV100 has, unless they're really hurting for the GPU performance crown, so a 400-500mm2 chip perhaps.
     
    pharma, nnunn and DavidGraham like this.

