Nvidia Volta Speculation Thread

What will also be interesting going forward is whether GV102 and the other lower GPUs (or at least some of them) will have an Int8/DP4A instruction version of the Tensor core, aimed more at Int8 inferencing/convolution.
Just like with the P100 they are careful not to talk about this, and that makes sense from a differentiation perspective between GV100 and GV102.
Quite a few of the Nvidia deep-learning/GEMM libraries are also optimised for Int8 and maturing in terms of use and development.
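For reference, a rough sketch of what that Int8/DP4A path already looks like in CUDA on the sm_61 parts: the __dp4a intrinsic does a dot product of four packed int8 pairs with a 32-bit accumulator. The kernel is purely my own illustration (not Nvidia library code), just to show the shape of the operation:

```cuda
#include <cuda_runtime.h>

// Each int packs four int8 values; __dp4a(a, b, acc) multiplies the four
// int8 pairs and adds the results to the 32-bit accumulator (sm_61+).
__global__ void int8_row_dot(const int* __restrict__ a,
                             const int* __restrict__ b,
                             int* __restrict__ out,   // must be zero-initialised
                             int packedPerRow)
{
    int row = blockIdx.x;
    int acc = 0;
    // Threads stride across one row of packed int8 data.
    for (int i = threadIdx.x; i < packedPerRow; i += blockDim.x)
        acc = __dp4a(a[row * packedPerRow + i],
                     b[row * packedPerRow + i],
                     acc);
    // Fold the per-thread partial sums into one int per row.
    atomicAdd(&out[row], acc);
}
```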
Cheers

Given the cloud-based NGC, I think we are in the process of crossing the divergence threshold between gaming and specialized HPC. Maybe with Volta, one foot's already through the door.
 
On a higher level, a comparison between GP100 CUDA cores and Tensor cores using the same FP16 mixed-precision instruction:
The 8 Tensor cores in a single SM have around 4x the theoretical peak throughput of the 64 FP32 CUDA cores in a single SM (P100 was the reference).
So per core it is 32x 'faster' with that mixed-precision GEMM instruction than a P100 CUDA core.
That raises the question of just what the limit is on the number of Tensor cores per SM as implemented in Volta without further changes to the architecture, and whether they will/can in future introduce, say, a GV100b with a reduced FP64 ratio but more Tensor cores.
Or is 8 Tensor cores per SM currently the hard limit, in a similar way to the 64 FP32 cores per SM we see specifically with the P100/V100? Seems quite probable.
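A quick back-of-envelope check of those ratios, assuming the publicly stated per-clock rates (one 4x4x4 FP16 FMA per Tensor core, one packed 2-way FP16 FMA per FP32 CUDA core; treat the constants as assumptions):

```cuda
#include <cstdio>

int main()
{
    const double tensorFlopsPerClk   = 4 * 4 * 4 * 2;  // 64 FMAs = 128 flops per Tensor core
    const double cudaFp16FlopsPerClk = 2 * 2;          // 2 packed FP16 FMAs = 4 flops per FP32 core
    const double smTensor = 8  * tensorFlopsPerClk;    // 8 Tensor cores per Volta SM
    const double smCuda   = 64 * cudaFp16FlopsPerClk;  // 64 FP32 cores per P100 SM
    printf("per-SM ratio  : %.0fx\n", smTensor / smCuda);                        // ~4x
    printf("per-core ratio: %.0fx\n", tensorFlopsPerClk / cudaFp16FlopsPerClk);  // ~32x
    return 0;
}
```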

Just as a note:
In the Nvidia devblog they mention 8x faster than P100 at a per-SM level, but crucially that reference uses the P100 CUDA cores for 'standard' FP32 rather than the FP16 instruction.

Cheers
 
Given the cloud-based NGC, I think we are in the process of crossing the divergence threshold between gaming and specialized HPC. Maybe with Volta, one foot's already through the door.
Yeah, possibly seeing it now (definitely a thought for the future), but specialist clouds using the P100 and DGX-1 for HPC compute to hire already exist, and I appreciate NGC is just taking that to the next service level.
I think the dilemma for Nvidia, sooner rather than later, is how they will continue to differentiate between their flagship multi-purpose mixed-precision/DP top GPU (P100, V100, and whatever comes next) and the tier below (P40 and 'V40'), especially as Nvidia comes under more pressure from other DL/compute hardware and from the requirement for complete mixed-precision support across all GPUs (meaning the loss of the current differentiation between GP100 and the rest of the range, which GV100 seems to retain for now).
And the GP102/GV102 solution (compute versions/instructions/CUDA/libraries/etc.) must also feed into the lower GPU models (at least for Tesla with DL, and Quadro), as they are a viable alternative for many.
All of this is compounded by more specialised instruction cores (now Tensor) being added to the architecture and required with ever greater performance.
Especially when one considers scale-up/out costs and the purpose of a node.
As you say, at some point Nvidia will have to specialise this a bit more, but that will need careful consideration of how to do it for such a broad, all-encompassing design with interconnected, dependent R&D across all segments from consumer to 'Tegra'.

Cheers
 
Not sure if you are actually disagreeing about the cores' full-precision computation if you agree it may be technically correct
No, I don't disagree with that. I suppose it's just the naming; to me this is really FP16 multiplies with higher-precision output, not FP32 multiplies with FP16 inputs, simply because that's a lot closer to what the hardware is actually doing.
So I take issue with the claim that the FP16 inputs are "only due to storage" (e.g. less register bandwidth required). The multipliers would definitely have been more expensive with FP32 inputs (if it were only due to storage, the Tensor unit should support FP32 "half-matrix" multiplies at the same rate, and I very highly doubt it can do this).
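For what it's worth, that structure is visible in the warp-level WMMA API CUDA 9 exposes: FP16 A/B fragments feeding an FP32 accumulator fragment. A minimal sketch (the 16x16x16 tile is just the example shape from the docs, and the kernel needs to be launched with at least one full warp):

```cuda
#include <mma.h>
#include <cuda_fp16.h>
using namespace nvcuda;

// One warp computes a 16x16x16 tile: FP16 inputs, FP32 accumulate/output.
__global__ void wmma_16x16x16(const half* a, const half* b, float* c)
{
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> aFrag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> bFrag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> cFrag;

    wmma::fill_fragment(cFrag, 0.0f);
    wmma::load_matrix_sync(aFrag, a, 16);        // leading dimension 16
    wmma::load_matrix_sync(bFrag, b, 16);
    wmma::mma_sync(cFrag, aFrag, bFrag, cFrag);  // D = A*B + C on the Tensor cores
    wmma::store_matrix_sync(c, cFrag, 16, wmma::mem_row_major);
}
```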
 
No, I don't disagree with that. I suppose it's just the naming; to me this is really FP16 multiplies with higher-precision output, not FP32 multiplies with FP16 inputs, simply because that's a lot closer to what the hardware is actually doing.
It could be debatable *shrug*.
The same can be said about the P100 with compute version SM_60 and the use of the SgemmEx call if you want FP16/FP16/FP32; there is no difference between that and how Nvidia manages double the FP16 TFLOPs relative to FP32 on the P100. Both are in effect FP32 cores operating on FP16 with the same or a very similar instruction, albeit the Tensor cores are more optimised/specialised for matrix multiplication and so have 4x greater throughput, in theory, in a per-SM comparison.
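Roughly what that SgemmEx path looks like through cuBLAS (FP16 storage for A/B, FP32 compute and accumulate); my understanding is that on V100 with CUDA 9 the same call shape can be pointed at the Tensor cores via the math-mode setting, but treat this as a sketch with error handling omitted and device buffers assumed to be allocated already:

```cuda
#include <cublas_v2.h>
#include <cuda_fp16.h>

// FP16 inputs, FP32 compute/accumulate via cublasSgemmEx (column-major, no transpose).
void gemm_fp16_in_fp32_acc(cublasHandle_t handle, int m, int n, int k,
                           const __half* dA, const __half* dB, float* dC)
{
    const float alpha = 1.0f, beta = 0.0f;
    // CUDA 9: opt in to Tensor Core math for eligible calls (no effect on Pascal).
    cublasSetMathMode(handle, CUBLAS_TENSOR_OP_MATH);
    cublasSgemmEx(handle, CUBLAS_OP_N, CUBLAS_OP_N, m, n, k,
                  &alpha,
                  dA, CUDA_R_16F, m,   // A is m x k, stored as FP16
                  dB, CUDA_R_16F, k,   // B is k x n, stored as FP16
                  &beta,
                  dC, CUDA_R_32F, m);  // C is m x n, FP32 output
}
```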

Cheers
 
Probably makes sense to put it here as well as in the Vega speculation thread, as the interesting aspect is GDDR6 from SK Hynix (if one believes Nvidia has moved away from Samsung, and given the recent news brief from SK Hynix about a client using this in 2018).

SK Hynix Q2 '17 Graphics memory catalogue:
GDDR6 8GB 12 & 14 Gbps available Q4 '17
GDDR5 8GB 10Gbps Q4 '17 (needs 1.55V)

More relevant to the competition rather than Nvidia:
HBM2 4GB, 1.6 Gbps, only 4-Hi stacks, Q2 '17 - so it looks like this is not changing anytime soon, and it has implications for others on both capacity and BW, especially as Samsung is now very close to hitting 2 Gbps.


I really cannot see 8-Hi anytime soon from any of the manufacturers, especially for GPUs.
Anyway, it looks like in Q4 there will be a choice between 14 Gbps GDDR5X (looking that way for Micron), 12/14 Gbps GDDR6 (if SK Hynix is not being over-optimistic), or Samsung and their GDDR6.
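Rough bandwidth arithmetic behind the capacity/BW point; the bus widths (a 384-bit GDDR6 card, 1024-bit per HBM2 stack) are illustrative assumptions, not announced products:

```cuda
#include <cstdio>

int main()
{
    // GB/s = (bus width in bits) * (Gbps per pin) / 8
    printf("GDDR6 12 Gbps, 384-bit : %.0f GB/s\n", 384 * 12.0 / 8);      // 576
    printf("GDDR6 14 Gbps, 384-bit : %.0f GB/s\n", 384 * 14.0 / 8);      // 672
    printf("HBM2 1.6 Gbps, 4 stacks: %.0f GB/s\n", 4 * 1024 * 1.6 / 8);  // ~819
    printf("HBM2 2.0 Gbps, 4 stacks: %.0f GB/s\n", 4 * 1024 * 2.0 / 8);  // 1024
    return 0;
}
```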
Cheers
 
Not so great news for Nvidia Volta V100 as Google revealed some details of its TPU2.
180 TFLOP/s both for training and inferencing.
I believe they are comparing against Nvidia’s K80 GPU, not Volta. Looks like they are using 4 to get to 180 TFLOPS.
 
Not so great news for Nvidia Volta V100 as Google revealed some details of its TPU2.
180 TFLOP/s both for training and inferencing.
Well, it's 180/4 = 45 TFLOPS per ASIC; very poor performance in my opinion for dedicated silicon. The important sentence in the source article:
A server with four of the so-called Cloud TPUs delivers 180 TFlops that will be used both for training and inference tasks.
GV100 is 120 TFLOPS per GPU (960 TFLOPS in an HGX rack) and can also be used for HPC (strong FP64) and other more challenging workflows (with the new thread scheduler)
 
Well, it's 180/4 = 45 TFLOPS per ASIC; very poor performance in my opinion for dedicated silicon. The important sentence in the source article:

GV100 is 120 TFLOPS per GPU (960 TFLOPS in an HGX rack) and can also be used for HPC (strong FP64) and other more challenging workflows (with the new thread scheduler)

Yeah, the TPU board size sort of reminds me of 2 GPUs in a blade.
So one needs to consider 2x V100 as a possible comparison.

However, this highlights why IMO Nvidia cannot wait too long for the GV102, which one could sort of expect to have 8 Tensor cores per SM but operating as Int8 with the associated instructions and optimised libraries; meaning a theoretical peak possibly double the 120 FP16 Tensor TFLOPs of the V100. Yeah, reality will not be that, but it will be a very competitive real-world figure.
It would also mean a more cohesive platform across Volta for training and inference when moving from V100 to V102, in terms of CUDA and library version compatibility and coding. That is one area Google commented on regarding the complexity/delay of moving a design from one system to the other, and the same point also applies between Pascal and Volta with regard to CUDA/optimised library/framework support versions and coding.
Still not as ideal as having it all on one node as TPU2 does (though does TPU2 lose any peak throughput/optimisation by doing this?), but with the software/platform support Nvidia builds into their ecosystem it should still be acceptable until the next generation of Nvidia tech.

Cheers
 
120 FP16 Tensor TFLOPs of the V100. Yeah, reality will not be that, but it will be a very competitive real-world figure.
that's the only V100 bench published so far (by Nvidia):

Figure 6: Tesla V100 Tensor Cores and CUDA 9 deliver up to 9x higher performance for GEMM operations. (Measured on pre-production Tesla V100 using pre-release CUDA 9 software.)
 
that's the only V100 bench published so far (by Nvidia):
[Figure 6: GEMM performance chart]
Not keen on the chart myself, as I think they skewed the P100 a bit, but then it is not far from the '8x performance over P100 standard FP32 compute cores' comment from their blog that I quote below.

There are others, such as the Caffe2 ResNet one, but they may not be ideally optimised (as noted below).
But with TPU2 and Nvidia, it is best for now IMO to use what they state as their theoretical peak FMA.
TPU2 seems to be 180 TFLOPs of FP16 Tensor DL/matrices, while V100 is 30 TFLOPs FP16, or 120 TFLOPs Tensor DL/matrices.

I mentioned it in other posts, but Nvidia has stated on their site that the 8 Tensor cores have 8x greater throughput than 64 CUDA cores operating as 'standard FP32' (their wording, I think).
So per SM the ratio is 4x the FP16 theoretical throughput, and that also comes to 120 Tensor TFLOPs for V100, lining up with the official 30 TFLOPs FP16.

Here is one of the Caffe2 ResNet charts. The part worth referencing for context in this discussion is the far-right bar, which is FP16 inferencing; it is pretty close to the 4x figure and aligns with the above comment:
[Chart: ResNet-50, V100 vs. P100 performance]


Yeah, I appreciate that as this is real performance one also has to adjust for the fact that V100 has around 31% more SMs as well, so I guess between this chart and yours it does come to maybe 4x greater performance over FP16 P100 in DL.

But I think we are digressing, although it is interesting.
Edit:
Here is the only comparison comment against the general CUDA mixed-precision FP32 cores stated by Nvidia; this is per SM and so it scales.
Nvidia blog said:
This is a dramatic 8X increase in throughput for deep learning applications per SM compared to Pascal GP100 using standard FP32 operations
So for theoretical FP16 it becomes a 4x increase in DL using Tensor cores, which ties in with the right-hand FP16 bar and your chart when making allowances.
And that follows through with the official V100 spec (a quick back-of-envelope check of these peaks is sketched below):
FP32: 15 TFLOPs
FP16: 30 TFLOPs
Tensor: 120 TFLOPs (FP16 mixed-precision matrices, DL)
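And the back-of-envelope check: those peaks fall out of the public 80-SM count and the ~1455 MHz boost clock quoted at announcement (both taken as assumptions here):

```cuda
#include <cstdio>

int main()
{
    const double clkGHz = 1.455;   // announced boost clock (assumption)
    const int    sms    = 80;      // enabled SMs on V100
    const double fp32   = sms * 64 * 2 * clkGHz / 1000.0;      // 64 FP32 cores * 2 flops/FMA -> TFLOPS
    const double fp16   = 2 * fp32;                            // packed half2 rate
    const double tensor = sms * 8 * 64 * 2 * clkGHz / 1000.0;  // 8 Tensor cores * 64 FMAs * 2 flops
    printf("FP32  : %.1f TFLOPS\n", fp32);    // ~14.9
    printf("FP16  : %.1f TFLOPS\n", fp16);    // ~29.8
    printf("Tensor: %.1f TFLOPS\n", tensor);  // ~119
    return 0;
}
```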

Cheers
 
Not so great news for Nvidia Volta V100 as Google revealed some details of its TPU2.
180 TFLOP/s both for training and inferencing.

Why should Nvidia worry? The TPU2 is 45 TFLOPS, not 180 TFLOPS.

Google brings 45 teraflops tensor flow processors to its compute cloud

Google has developed its second-generation tensor processor—four 45-teraflops chips packed onto a 180 TFLOPS tensor processor unit (TPU) module

https://arstechnica.com/information...s-tensor-flow-processors-to-its-compute-cloud
 
Is Google going to sell TPU2 to competing cloud providers? If not, then TPU2 only increases the market for AI acceleration chips, as Microsoft, Amazon, ... would need to somehow compete with Google. These are fun times, as the market hasn't been divided up yet.

TPU2 looks interesting, but there is way too little information out to really understand what the chip is useful for and what it is not. What is the precision of computation? Is there a big difference between how inference and training are implemented/perform? Bandwidth and the amount of memory are unknown (i.e. the dataset sizes TPU2 can handle).

There is a tradeoff between flexibility and performance. My hunch is TPU2 is not as flexible as GPUs, which again are not as flexible as CPUs. I don't think AI is solved to the point where the perfect algorithm and accelerator can be built. If the design is not flexible enough, it could be a dead end outside current use cases. This is not to say the current use cases wouldn't be valid, but there is just a bit more research and work before Skynet is here.

What looks most interesting to me in Volta is its flexibility. Deploy one type of GPU to the cloud and you can sell computing time for DNN training, inferencing, and also generic HPC workloads (strong 64-bit floating-point performance). Also the scalability via NVLink2, especially together with IBM CPUs, could be a game changer.
 
Putting the second-generation TPU in the Google Cloud Platform will certainly send some users that way for large-scale training, but as noted, there will also be high-end GPUs as well as CPUs for those workloads. The ability for users to use TensorFlow at scale on an architecture designed just for that purpose will be compelling however. We imagine that this move will light a fire under Amazon and Microsoft with its Azure cloud to step to when it comes to offering latest-generation GPUs, something they have been slow about doing (the highest-end GPU available on Amazon is the Tesla K80, but Pascal P100s are now available on Azure).

For those who keep wondering why Google doesn’t commercialize its chips, read above and see how Google is already doing this–albeit via a less direct route (and one with less risk, for that matter). If indeed these deep learning markets expand at the level predicted, the differentiation provided by TPU and TensorFlow will be enough to give Google Cloud Platform an edge like it’s never had before. That gets them around mass production–and into a mass user base, and one that can help it build out TensorFlow in the process.

https://www.nextplatform.com/2017/05/17/first-depth-look-googles-new-second-generation-tpu/
 