Nvidia Volta Speculation Thread

Well, it's 180/4 = 45 TFLOPS per ASIC, which is very poor performance in my opinion for dedicated silicon. The important sentence in the source article:

GV100 is 120 TFLOPS per GPU (960 TFLOPS in HGX rack) and can also be used for HPC (strong FP64) and any other more challenging workflows (with the new thread scheduler)
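Just to sanity-check those figures (assuming 4 ASICs per TPU2 module and 8 GV100s per HGX chassis, which is my reading of the article):

    180 TFLOPS / 4 ASICs = 45 TFLOPS per TPU2 ASIC
    8 GPUs x 120 TFLOPS  = 960 TFLOPS per HGX rack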

Right, the article title is a bit misleading ("Machine-learning ASIC doubles performance"), as TPU1 already did 90 TOPS of 8-bit inferencing. In that respect it indeed looks rather poor.
 
What is the TDP of the TPU2?
Looking at the huge heat sinks, I would say at least 80 W per ASIC:
[Image: TPU v2 board with large heat sinks]


which IMHO looks really bad relative to the performance.
 
Why should Nvidia worry? The TPU2 is 45 TFLOPS, not 180 TFLOPS.

Google brings 45 teraflops tensor flow processors to its compute cloud

Well, it combines both training and inference in one node, while Nvidia has so far shown the V100 to be more like the P100 (focused on training down to FP16); it also depends upon the size of the board and power demand (important in nodes/clusters), and importantly it would be a fair bit cheaper than the V100 (Nvidia needs cheaper models for inferencing).
It is also quite probable that inference on the TPU2 would have higher peak throughput than the 45 TFLOPS (which looks to be the FP16 figure) when it comes to Int8 inference.
I doubt it worries Nvidia, but it is more competitive and more cohesive in some ways until other Volta parts back up the V100 in the DL ecosystem.

Edit:
Yeah, as mentioned by another poster, the original TPU had 90 TOPS Int8 peak performance.
It looks like it all comes down to how well optimised the performance is when comparing the full workflow against pure matrix operations.
Cheers
 
However, this highlights why IMO Nvidia cannot wait too long for the GV102, which one could sort of expect to have 8x Tensor cores per SM but operating as Int8, with associated instructions and optimised libraries; that would mean a theoretical peak possibly double the 120 FP16 TFLOPS of the V100. Yeah, reality will not be that, but it will be a very competitive real-world figure.
V100 doesn't do 120 FP16 TFLOPS, it does 120 TFLOPS only with very specific Tensor ops.
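For the curious, this is roughly what those "very specific" ops look like from the programmer's side. A minimal sketch using the WMMA API that CUDA 9 exposes for Volta (the 16x16x16 tile shape is the one exposed at launch; the matrix pointers and launch configuration here are purely illustrative):

    #include <mma.h>
    #include <cuda_fp16.h>
    using namespace nvcuda;

    // One warp computes one 16x16 tile of D = A*B + C with FP16 inputs and
    // FP32 accumulation - the only kind of operation that reaches the quoted
    // ~120 TFLOPS peak. Launch with e.g. wmma_gemm_tile<<<1, 32>>>(dA, dB, dD);
    __global__ void wmma_gemm_tile(const half *A, const half *B, float *D)
    {
        wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
        wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b_frag;
        wmma::fragment<wmma::accumulator, 16, 16, 16, float> acc_frag;

        wmma::fill_fragment(acc_frag, 0.0f);
        wmma::load_matrix_sync(a_frag, A, 16);                  // leading dimension 16
        wmma::load_matrix_sync(b_frag, B, 16);
        wmma::mma_sync(acc_frag, a_frag, b_frag, acc_frag);     // the Tensor core op
        wmma::store_matrix_sync(D, acc_frag, 16, wmma::mem_row_major);
    }

Anything that does not map onto those warp-wide matrix fragments falls back to the ordinary FP16/FP32 pipelines.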
 
Well, it combines both training and inference in one node, while Nvidia has so far shown the V100 to be more like the P100 (focused on training down to FP16); it also depends upon the size of the board and power demand (important in nodes/clusters), and importantly it would be a fair bit cheaper than the V100 (Nvidia needs cheaper models for inferencing).
It is also quite probable that inference on the TPU2 would have higher peak throughput than the 45 TFLOPS (which looks to be the FP16 figure) when it comes to Int8 inference.
I doubt it worries Nvidia, but it is more competitive and more cohesive in some ways until other Volta parts back up the V100 in the DL ecosystem.

Edit:
Yeah, as mentioned by another poster, the original TPU had 90 TOPS Int8 peak performance.
It looks like it all comes down to how well optimised the performance is when comparing the full workflow against pure matrix operations.
Cheers

Losing a high-profile customer like Google undoubtedly must worry Nvidia.
There is also the prospect that other high-profile customers with deep pockets may be inspired by this and start fabricating their own ASICs for DL.
 
That would be 320 W for 180 TFLOPS versus 300 W for 120 TFLOPS for Volta, so actually better perf/W.
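Spelled out, with the 80 W per ASIC being the guess from the heat sinks above and 300 W being the published V100 TDP:

    TPU2 module: 180 TFLOPS / (4 x 80 W) ~ 0.56 TFLOPS/W
    GV100:       120 TFLOPS / 300 W      = 0.40 TFLOPS/W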

For reference, here is the power demand of the original TPU, still 4 per board according to the * note.
[Image: Table 2 from the Jouppi et al. TPU paper]


Not sure how to correlate the TPU2 to Nvidia in terms of size and power demand, because so far Nvidia has been coy about Int8 inference, and it looks like they will do the same differentiation as before: GV100 for FP16 training and GV102 for Int8.
So that would also mean taking into account multiple nodes, increasing size and overall power demand, although I guess one could argue the TPU2 is doing one or the other, so for the Nvidia environment maybe only one of the nodes should be counted for power.
Edit:
And yeah, I appreciate one cannot use this as a direct reflection of the new TPU doing FP16 DL training.
Cheers
 
V100 doesn't do 120 FP16 TFLOPS, it does 120 TFLOPS only with very specific Tensor ops.
You really love arguing with me and being pedantic while ignoring context - something you have been doing to me for the last 6-12 months.
You did notice I mentioned Tensor cores in relation to that 120 TFLOPS in the part you quoted?
I think everyone who has been following this thread understands by now what the Tensor core is used for; in theory it goes beyond just DL, potentially to further matrix maths/algorithms and instructions, but my context was in response to others and the TPU, meaning DL. Anyway, it is quite clear I am not talking about general CUDA core performance.

Look back at several of my posts over the last couple of days and my context is quite clear.
BTW, you did not correct Xpea or Voxilla, who used 120 TFLOPS or related figures themselves without providing full semantics in every post on the subject, and yet they understand the context.
 
Losing a high-profile customer like Google undoubtedly must worry Nvidia.
There is also the prospect that other high-profile customers with deep pockets may be inspired by this and start fabricating their own ASICs for DL.
Yeah, to some extent, and now Google is also looking to offer this as a service and expand upon it with others.
But Nvidia has their own plan, and they are pretty competitive if they roll out more of the Volta GPUs for the DL ecosystem and provide cohesiveness between the different training/inference nodes and GPUs in terms of compatibility (CUDA/library/compute SM versions and, to some extent, instructions). That is one of the reasons I think the Volta rollout will be faster than most expect: Nvidia is facing higher competition in this field, and Intel will also have their own specialised solution in the future (albeit from a catch-up position).
The headache for Nvidia, like I mentioned earlier, is that at some point they will need a node able to offer both training and inference at a pretty high performance level (some will still want dedicated training/inference nodes, while others will not, just like Google and some other large-scale deployers). That is something I have argued for some time, and it will affect how they position the Gx100 and Gx102 in future.

Cheers
 
If an ASIC for tensor operations is in the same ballpark for power consumption as a full-fledged GPU, isn't it pretty much a perf/W failure? (Sure, there are pricing and production costs that could be factored in.)
 
Well, we can't compare like that. The GV100's 300 W covers the rasterizer, geometry engine, texture units, FP64, FP32, INT32 and the hardware scheduler, while the TPU2 only does FP16 matrix math. In other words, I highly doubt the GV100 will consume 300 W when only using the Tensor cores...

The 'only' Tensor cores amount to 128 FP16/FP32 mixed-precision FMAs per SM partition, compared to 16 FP32 FMAs.
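For context, that is also where the headline number comes from. Using Nvidia's announced V100 figures (80 SMs, 8 Tensor cores per SM, 64 FMAs per Tensor core per clock, roughly 1.46 GHz boost):

    Tensor ops: 80 SMs x 8 cores x 64 FMA x 2 ops x 1.46 GHz ~ 120 TFLOPS (FP16 in, FP32 accumulate)
    Plain FP32: 80 SMs x 64 cores x 2 ops x 1.46 GHz         ~  15 TFLOPS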
 
If an ASIC for tensor operations is in the same ballpark for power consumption as a full-fledged GPU, isn't it pretty much a perf/W failure? (Sure, there are pricing and production costs that could be factored in.)
If you only do tensor operations, you don't care about any of the other things a GPU can do.
 
If an ASIC for tensor operations is in the same ballpark for power consumption as a full-fledged GPU, isn't it pretty much a perf/W failure? (Sure, there are pricing and production costs that could be factored in.)

I think more data is needed about the TPU2, in particular how exactly that 180 TFLOPS FP16 applies to the DL workflow beyond the BLAS/GEMM matrix computation and instructions that Nvidia's Tensor cores cover.
Maybe I am reading too much into the article, but it seems to imply they expanded the FP16 operation to more of the DL workflow, such as loading/analysing/understanding the data and data objects, computing error, and other operations: https://medium.com/the-downlinq/establishing-a-machine-learning-workflow-530628cfe67
There are quite a lot of tasks and operations associated with the DL workflow, and I guess more may be made of TensorFlow in this way with the TPU, giving it a compute-number advantage if so.
Nvidia states they also optimise/accelerate their libraries for the TensorFlow framework on Volta, but it remains to be shown how effective that is.
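For what it's worth, on the Nvidia side the opt-in for that acceleration is fairly narrow: cuBLAS/cuDNN being allowed to route eligible GEMMs through the Tensor cores. A rough sketch of what that looks like with the CUDA 9 cuBLAS API (matrix sizes and pointers are placeholders, and the exact algo enum name may vary by CUDA 9.x version; this is illustrative, not anything from the article):

    #include <cublas_v2.h>
    #include <cuda_fp16.h>

    // Mixed-precision GEMM: FP16 inputs, FP32 accumulate/output.
    // CUBLAS_TENSOR_OP_MATH lets cuBLAS dispatch to the Tensor cores
    // when the problem shape allows it; otherwise it falls back.
    void gemm_fp16_tensorops(cublasHandle_t handle, int m, int n, int k,
                             const half *A, const half *B, float *C)
    {
        const float alpha = 1.0f, beta = 0.0f;
        cublasSetMathMode(handle, CUBLAS_TENSOR_OP_MATH);
        cublasGemmEx(handle, CUBLAS_OP_N, CUBLAS_OP_N, m, n, k,
                     &alpha, A, CUDA_R_16F, m,
                             B, CUDA_R_16F, k,
                     &beta,  C, CUDA_R_32F, m,
                     CUDA_R_32F, CUBLAS_GEMM_DEFAULT_TENSOR_OP);
    }

Everything in the workflow outside those GEMM/convolution calls still runs on the regular CUDA cores, which is why it matters how much of the TPU2's 180 TFLOPS covers the rest of the pipeline.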
Cheers
 
If you only do tensor operations, you don't care about any of the other things a GPU can do.
I think he was referring to the usually massive advantages in either raw performance or, in more recent times, performance per watt that arise from using ASICs and show their true value.
 
Since when were general-purpose accelerators compared to ASICs? Yes, the V100 has Tensor cores now to speed up certain types of algorithms, but obviously a client who only wants to do tensor operations wouldn't be looking at GPUs anyway?
 
I think he was referring to the usually massive advantages in either raw performance or, in more recent times, performance per watt that arise from using ASICs and show their true value.

Yup, this one is not an order of magnitude better, not even close. I'd be curious to know why.

Or, as Malo says above, how come we ended up comparing these two very different chips, which should each have quite disjoint usages, even when both are used as accelerators for deep learning?

I'm just asking what for me is an obvious question; sorry if the answers are just as obvious to some of you.
 
Since when were general-purpose accelerators compared to ASICs? Yes, the V100 has Tensor cores now to speed up certain types of algorithms, but obviously a client who only wants to do tensor operations wouldn't be looking at GPUs anyway?
Since around the low to mid 280's in this thread, I'd say.

And I think it's a fair point, especially when your alternative is either to have separate installations for all the special cases, or to swat a couple of HPC flies with the same installation. Depending on your needs, of course, the former or the latter might make more sense.
 
Yup, this one is not an order of magnitude better, not even close. I'd be curious to know why.

Or, as Malo says above, how come we ended up comparing these two very different chips, which should each have quite disjoint usages, even when both are used as accelerators for deep learning?

I'm just asking what for me is an obvious question; sorry if the answers are just as obvious to some of you.


There is an overlap area for both, and Google does use Nvidia GPUs for AI tasks outside of training, so this is where Volta will come in handy.
 
Seems that people value different things. Those tensor processors don't interest me much... but I am awestruck by that configurable 128 KB L1 cache design.

Nvidia implies that their new L1 cache is as fast as groupshared memory. That's going to change the way GPUs are programmed. Nvidia showed a benchmark where they reached 93% of the performance of a groupshared-memory-optimised algorithm without using groupshared memory (thanks to the huge, fast L1 caches). Soon GPU compute shaders won't be as hard to program as Cell SPUs. I need to learn new tricks :)
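As a toy illustration of what that would mean in practice, here is the classic groupshared/staging pattern next to the "just read through L1" version that the 93% figure suggests might now be competitive (my own CUDA example, not Nvidia's benchmark; assumes a block size of 256):

    // Classic pattern: stage a tile (plus halo) in shared memory, sync, then compute.
    __global__ void blur3_shared(const float *in, float *out, int n)
    {
        __shared__ float tile[256 + 2];                    // tile plus one halo element per side
        int gid = blockIdx.x * blockDim.x + threadIdx.x;
        int lid = threadIdx.x + 1;

        tile[lid] = (gid < n) ? in[gid] : 0.0f;
        if (threadIdx.x == 0)
            tile[0] = (gid > 0) ? in[gid - 1] : 0.0f;      // left halo
        if (threadIdx.x == blockDim.x - 1)
            tile[lid + 1] = (gid + 1 < n) ? in[gid + 1] : 0.0f;  // right halo
        __syncthreads();

        if (gid < n)
            out[gid] = (tile[lid - 1] + tile[lid] + tile[lid + 1]) / 3.0f;
    }

    // Volta-style alternative: read global memory directly and let the big
    // unified L1 catch the overlapping accesses between neighbouring threads.
    __global__ void blur3_l1(const float *in, float *out, int n)
    {
        int gid = blockIdx.x * blockDim.x + threadIdx.x;
        if (gid >= n) return;
        float l = (gid > 0)     ? in[gid - 1] : 0.0f;
        float r = (gid + 1 < n) ? in[gid + 1] : 0.0f;
        out[gid] = (l + in[gid] + r) / 3.0f;
    }

If the L1 version really lands within a few percent of the shared-memory one, a lot of staging boilerplate (and the __syncthreads() that goes with it) simply disappears.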
 