Nvidia Volta Speculation Thread

I'm basically arguing that Nvidia has added a lot of features that are, and may always be, Nvidia-only. AMD is trying to get into the market (and failing so far), and Intel's GPUs and other chips will probably be aimed heavily at training neural nets. But they probably won't have the exact features Nvidia put in, so any work you do for Volta, any research built on it, simply isn't transferable. And Nvidia's CUDA libraries are even worse in that respect. You get to set up neural nets faster, but now you're locked into Nvidia (they hope), and too bad you can't ever run those programs on anyone else's hardware! Come buy our $5k card, because you spent all those months writing for our tech and have no choice! The point is simple: Nvidia didn't build their libraries out of kindness. They built them because they gambled they'd make more money off them, by locking people in, than the libraries cost to develop in the first place.

Nvidia didn't make anything you couldn't make with OpenCL; they just made it exclusive to them and tried to lure you into their proprietary ecosystem. And that spells nothing but trouble. It's what Sony used to do: buy products that only work with other Sony products! It's what Apple and Android both did, or tried to do, with their apps; you'll hesitate to switch if you've invested hundreds in apps that suddenly won't work anymore (not that anyone buys apps other than games these days, and those are F2P, so who cares). The point is, they lure you in by making it seem easy, then trap you by locking all the work you've done to their hardware.

Although that ignores one of the big selling points for many users: CUDA and its integration with various frameworks, along with highly optimised libraries that still offer a lot of flexibility across diverse solutions and implementations.
Whichever large-scale HW solution scientists/devs use, they will have to spend a lot of time learning it and optimising their code, especially if they need both modelling/simulation and training.
Importantly, Nvidia heavily supports a broad range of frameworks.

But as a reference point, even moving from traditional Intel Xeon to Xeon Phi meant a lot of reprogramming/optimising to make it worthwhile; one of the HPC labs investigated this and published their work.
I agree CUDA will split opinions though, with some looking to avoid it while others embrace it from an HPC perspective.
 
@Ryan Smith could you elaborate more on how you did the matrix multiply? You mentioned the size earlier, but did you use the tensor core example gemm code that comes with cuda? I tried your exact size using that code and only got 50TFLOPS.
 

The CUDA example is using WMMA, the CUDA abstraction for tensor cores. 50 TFlops is about right for the WMMA interface with current CUDA. To get full performance, use CUBLAS.
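For anyone following along, here is roughly what that WMMA path looks like: a minimal single-tile sketch, assuming the nvcuda::wmma API from CUDA 9 with FP16 inputs and FP32 accumulation. The kernel name and the fixed 16x16x16 tile are purely illustrative.

```cpp
#include <cuda_fp16.h>
#include <mma.h>
using namespace nvcuda;

// One warp computes a single 16x16 output tile: C = A * B.
// A and B are FP16, the accumulator is FP32 (the Volta tensor core mode).
// Launch with one full warp, e.g. wmma_tile_gemm<<<1, 32>>>(dA, dB, dC);
__global__ void wmma_tile_gemm(const half *A, const half *B, float *C)
{
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> c_frag;

    wmma::fill_fragment(c_frag, 0.0f);                 // zero the accumulator
    wmma::load_matrix_sync(a_frag, A, 16);             // leading dimension = 16
    wmma::load_matrix_sync(b_frag, B, 16);
    wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);    // tensor core MMA
    wmma::store_matrix_sync(C, c_frag, 16, wmma::mem_row_major);
}
```

A real GEMM tiles this over the full matrices and stages data through shared memory, which is presumably where the compiler scheduling and register pressure issues discussed below come into play.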
 
Thanks. It seems to me that the wmma functions should be equivalent to an intrinsic that compiles very closely to SASS. I wouldn't have expected a more than 100% throughput difference. Do they expect that anyone who doesn't want to use their libraries will just have to pay a performance penalty?
 
Titan V and V100 scientific benchmarks.
Been keeping an eye out, as I was expecting this benchmark around now or a bit earlier.
Titan V and V100 PCIe (not Mezzanine) with Amber, various models at single precision: http://ambermd.org/gpus/benchmarks.htm#Benchmarks

Wow at the performance, and compared to the P100 there's a nice price/performance benefit, although the Titan V is crazy good in that respect.
So the Titan V is a nice buy for universities/small labs that just want a few scaled nodes; not sure how much larger Nvidia would allow this to be scaled, but they do try to support and assist academics and labs within reason.
1x Titan V is faster than a dual Quadro GP100 PCIe NVLinked setup with these SP solvents...
Insane value with the Titan V, while the V100 PCIe has top performance and is easier to build efficient nodes/clusters around.
Just to reiterate, these are single-precision solvent models.
 
@Ryan Smith could you elaborate more on how you did the matrix multiply? You mentioned the size earlier, but did you use the tensor core example gemm code that comes with cuda? I tried your exact size using that code and only got 50TFLOPS.

The CUDA example is using WMMA, the CUDA abstraction for tensor cores. 50 TFlops is about right for the WMMA interface with current CUDA. To get full performance, use CUBLAS.

RecessionCone is correct. This was all a pretty thin wrapper calling up the appropriate CUBLAS functions.
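For reference, a minimal sketch of what such a thin wrapper might boil down to, assuming the CUDA 9-era cuBLAS API (cublasGemmEx plus the tensor-op math mode). The function and variable names and the FP16-in/FP32-out layout here are illustrative, not the actual benchmark code.

```cpp
#include <cublas_v2.h>
#include <cuda_fp16.h>

// Hypothetical wrapper: C (MxN, FP32) = A (MxK, FP16) * B (KxN, FP16),
// column-major, with d_A/d_B/d_C assumed to be device pointers allocated elsewhere.
void tensor_core_gemm(cublasHandle_t handle,
                      const half *d_A, const half *d_B, float *d_C,
                      int M, int N, int K)
{
    const float alpha = 1.0f, beta = 0.0f;

    // Allow cuBLAS to dispatch to the tensor cores (CUDA 9 math mode).
    cublasSetMathMode(handle, CUBLAS_TENSOR_OP_MATH);

    // Mixed-precision GEMM: FP16 inputs, FP32 accumulation and output.
    cublasGemmEx(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                 M, N, K,
                 &alpha,
                 d_A, CUDA_R_16F, M,
                 d_B, CUDA_R_16F, K,
                 &beta,
                 d_C, CUDA_R_32F, M,
                 CUDA_R_32F,                      // compute type
                 CUBLAS_GEMM_DEFAULT_TENSOR_OP);  // let cuBLAS pick a tensor-core algo
}
```

The tensor-op algorithm hint is what lets cuBLAS use its hand-tuned tensor-core kernels, which is presumably where the gap over the plain WMMA version comes from.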
 
Thanks. It seems to me that the wmma functions should be equivalent to an intrinsic that compiles very closely to SASS. I wouldn't have expected a more than 100% throughput difference. Do they expect that anyone who doesn't want to use their libraries will just have to pay a performance penalty?

The intrinsic is fine. The missing performance is because the CUDA compiler can’t optimally schedule and register allocate the code that uses the intrinsic. Hopefully that will improve with time. Getting 100% utilization of the tensor cores requires the whole chip to work at full tilt; doing anything slightly suboptimally reduces performance measurably.
 
Thanks @RecessionCone and @Ryan Smith. I was able to get 92 TFLOPS using your matrix size, which is close enough for me.
 
Benchmarking Tensorflow Performance on Next Generation GPUs
Jan 22, 2018
To test how these modern GPUs perform on typical ML tasks, I trained a Faster R-CNN/resnet101 object detection model on Nvidia’s most recent GPUs. The object detection model was implemented in Tensorflow and operated on 300x300px image inputs, with training minibatch sizes of 10, 15, and 20 images.

The GPUs that were benchmarked:

Note: This benchmark focuses specifically on newer GPUs and thus excludes the older K80 and Quadro GPUs. These GPUs were benchmarked last April.



From a cost perspective, the Paperspace Voltas offer good value for money; adjusting for cost, Google’s P100 is approximately 10% more expensive while Amazon finishes at a full 40% more.

https://medium.com/initialized-capi...formance-on-next-generation-gpus-e68c8dd3d0d4
 
In this video from SC17 in Denver, Eric Nielsen from NASA presents: Unstructured-Grid CFD Algorithms on the NVIDIA Pascal and Volta Architectures.


“In the field of computational fluid dynamics, the Navier-Stokes equations are often solved using an unstructured-grid approach to accommodate geometric complexity. Furthermore, turbulent flows encountered in aerospace applications generally require highly anisotropic meshes, driving the need for implicit solution methodologies to efficiently solve the discrete equations. To prepare NASA Langley Research Center’s FUN3D CFD solver for the future HPC landscape, we port two representative kernels to NVIDIA Pascal and Volta GPUs and present performance comparisons with a common multi-core CPU benchmark.”

https://insidehpc.com/2018/01/unstructured-grid-cfd-algorithms-nasa-volta-gpus/


Volta architecture
 
Extract from Video:
[Slide: Unstructured-Grid CFD Algorithms on Pascal and Volta]
 
His talk on the cache is also probably an indicator of why Amber is so much more efficient/faster on Volta than Pascal, much more so than the spec sheet would suggest when considering cores/clocks.
In my earlier post, the Amber scientific solvent/model-simulation results show that a single Titan V or a single GV100 is faster than a 2x Titan Xp or 2x 1080 Ti setup in all their solvent tests.
Results are FP32.

Some applications will see quite a dramatic increase.
Edit:
I did not bother mentioning the P100 as it has lower FP32 throughput than either the Xp or the 1080 Ti, even as the DGX-1 variant (which they also tested, with the P100 as a single card).
 
Wrong thread, but I'm too lazy to start a separate, appropriately named thread for Google news. The TPU is now "out in the wild"; more concretely, one can now pay to compute on Google Cloud using TPUs.

I can only surmise that the pricing is in part due to the high costs of the extra HBM2 memory chips, widely believed to be over $300 per TPU die, (or $900 more than the 16GB needed for Volta).

The following table shows that the 4-die Google Cloud TPU costs over twice as much as the NVIDIA Volta GPU available on the Amazon AWS cloud while delivering ~67% more performance, based on the training time for the ResNet50 neural network. Net it out, and the Google part costs ~33% more to do the same work.
https://www.forbes.com/sites/moorin...e-announces-expensive-cloud-tpu-availability/

If true, HBM2 seems to be fairly expensive. I wonder if that price is accurate or not.
 
Well, chip for chip the TPU is 2 times slower, so I wonder whether efficiency is playing a part in the increased cost. From the Forbes article:
But the real disappointment comes from the pricing strategy: why would anyone pay more to get the same job done? Yes, a 4 die Cloud TPU can get the training done faster, but it is >2X slower than a 4 GPU instance on AWS. I can only surmise that the pricing is in part due to the high costs of the extra HBM2 memory chips, widely believed to be over $300 per TPU die, (or $900 more than the 16GB needed for Volta). Keep in mind, though, that Google is getting the ASIC at manufacturing costs, so even the HBM2 delta does not fully explain the higher pricing. Hopefully, these prices will come down after Google irons out whatever wrinkles are limiting the quantities.
 
It is rather ironic that the TPU2 chip is practically 2 independent processors glued together on the same chip, each with its own independent HBM controller... they could probably get away with less memory (footprint - not bandwidth - so fewer HBM2 dies per stack but the same number of stacks) if they let both controllers work for both parts of the chip with a proper bus infrastructure.

I wonder if Google has any kind of TPU roadmap now that the lead HW engineers moved to Groq (groq.com), and if not, whether they even care about making it competitive outside of internal Google projects?
 
My hunch is that the TPU is a doomed project.

It had its window when they were the first with mixed-precision tensor processors. Now that nV & Co are in, how could Google keep up?
 
Why couldn't they? NVIDIA (or any other company) isn't some almighty deity that makes everything better than the rest.
 
No, it doesn't take a semiconductor deity to "stop" someone who's new to the market in a battle of resources and project management.

Not excluding the fact that Google (or anyone) can make a breakthrough. But as I've said, I fear their window of opportunity is running out.
 