Nvidia Volta Speculation Thread

V100 doesn't do 120 FP16 TFLOPS, it does 120 TFLOPS only with very specific Tensor ops.
So in the context of a TPU2 discussion, it does 120 TFLOPS?
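The 120 TFLOPS figure can be sanity-checked with a quick back-of-the-envelope calculation from V100's published specs (640 Tensor Cores at 8 per SM, each doing one 4×4×4 matrix multiply-accumulate per clock, at the ~1455 MHz SXM2 boost clock):

```python
# Back-of-the-envelope check of the quoted 120 TFLOPS Tensor figure for V100.
tensor_cores = 640                  # 80 SMs x 8 Tensor Cores per SM
fma_per_core_per_clock = 4 * 4 * 4  # one 4x4x4 matrix multiply-accumulate
flops_per_fma = 2                   # each FMA counts as a multiply plus an add
boost_clock_hz = 1.455e9            # Tesla V100 (SXM2) boost clock

tflops = tensor_cores * fma_per_core_per_clock * flops_per_fma * boost_clock_hz / 1e12
print(f"{tflops:.0f} TFLOPS")  # ~119 TFLOPS, i.e. the marketed "120 TFLOPS"
```

Which is exactly the point: the peak only applies when the workload maps onto those 4×4 matrix ops.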

Losing a high-profile customer like Google must undoubtedly worry Nvidia.
There is also the prospect that other high-profile customers with deep pockets may be inspired by this and start fabricating their own ASICs for DL.
Worrying or not, it definitely shouldn't have been a surprise. All big non-semi tech companies are hiring semiconductor engineers. I don't think Jim Keller went to Tesla for his expertise in the suppression of rotary engine induced vibrations.
 
The headache for Nvidia, as I mentioned earlier, is that at some point they will need a node able to offer both training and inference at a fairly high performance level (some will still want dedicated training/inference nodes, while others, like Google and some other large-scale deployers, will not). This is something I have argued for some time, and it will affect how they position the Gx100 and Gx102 in future.
It's not as if those tensor cores couldn't be used for inference either. A GV100 using FP16 tensor cores will still be much faster than a GP102 using INT8 math.
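A rough feel for that gap, using the published peak figures (assumed here from the spec sheets: ~120 TFLOPS FP16 Tensor peak for Tesla V100, ~47 TOPS INT8 DP4A peak for the GP102-based Tesla P40; real workloads won't reach either peak):

```python
# Rough peak-throughput comparison behind the claim above, using
# spec-sheet numbers (assumptions; achievable rates will be lower).
gv100_tensor_fp16_tflops = 120   # Tesla V100 FP16 Tensor Core peak
gp102_int8_dp4a_tops = 47        # Tesla P40 (GP102) INT8 DP4A peak

speedup = gv100_tensor_fp16_tflops / gp102_int8_dp4a_tops
print(f"~{speedup:.1f}x peak advantage for GV100 FP16 tensor over GP102 INT8")
```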
 
It's not as if those tensor cores couldn't be used for inference either. A GV100 using FP16 tensor cores will still be much faster than a GP102 using INT8 math.
I think we are in agreement; you are thinking about the present, while my point is about the product line and what Nvidia will do when there is notable overlap between their top two Tesla GPU cards in the DL ecosystem.
My post is in the context of how Nvidia differentiates the Gx100 and Gx102, and the headache that is causing them down the road, and even a bit now, especially as some want a powerful single node doing both training and inference.
What you just said agrees with my post in some ways: you do not need both with their DL ecosystem (though I agree others will want independent nodes for training/inference). And a GV102 without DP cores would be a full uncut GPU, meaning more SMs (with 8 Tensor Cores per SM) and a slightly higher clock speed, so it would have greater performance than GV100 if one ignores DP.

Nvidia needs to find a better way to differentiate the Gx100 and Gx102 than by FP16 and INT8 down the road, probably by the next generation.
It does not make sense to limit FP16 to the GPU that has fewer SMs, lower clocks (due to also supporting DP cores) and a lower yield allowance, especially as the Gx102 is, importantly, a smaller die with greater performance in this DL and FP32 context.
Market demands and competitors will probably force them to change, IMO.
Maybe they can keep NVLink2 and its benefits as the Gx100 differentiator, but then again, at some point they may have to consider that on a Gx102 variant as well.
It would be nice, though, if they gave the general CUDA cores full Vec2 FP16 throughput on the GV102, even if they decide to limit use of the Tensor Cores in some way.
Cheers
 
It's not as if those tensor cores couldn't be used for inference either. A GV100 using FP16 tensor cores will still be much faster than a GP102 using INT8 math.
I wonder whether those Tensor Cores can be scheduled to overlap with the regular FP32/2×FP16 units, i.e. utilized together, and if so, how that would impact the power profile; or, going further, whether this is already factored into the 300 W TDP.
 
I wonder whether those Tensor Cores can be scheduled to overlap with the regular FP32/2×FP16 units, i.e. utilized together, and if so, how that would impact the power profile; or, going further, whether this is already factored into the 300 W TDP.
Wouldn't the 1:2-rate FP64 operations be the most power-hungry?
Second highest in demand might be the parallel INT32 and FP32 operation available with V100?
This was one reason Nvidia used to downclock in the past when DP was enabled; as an example, wasn't it the first Titan model where enabling 1:3-rate DP dropped the boost clock (it defaulted to something like 1:24 otherwise, with higher clocks)?
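For the Titan example, the arithmetic works out as follows (spec numbers assumed from memory of the original GTX Titan datasheet: 2688 CUDA cores, 837 MHz base, 876 MHz boost; enabling the 1:3 DP mode disabled boost):

```python
# DP throughput of the original GTX Titan (GK110) in its two DP modes.
# Clock and core counts are assumed from the published specs.
cuda_cores = 2688
base_clock_hz = 837e6   # clock with full-rate 1:3 DP enabled (boost disabled)
boost_clock_hz = 876e6  # clock in the default 1:24-rate DP mode

dp_full_tflops = cuda_cores / 3 * 2 * base_clock_hz / 1e12      # 1:3 mode
dp_default_tflops = cuda_cores / 24 * 2 * boost_clock_hz / 1e12  # 1:24 mode
print(f"1:3 mode: {dp_full_tflops:.2f} TFLOPS DP; "
      f"1:24 mode: {dp_default_tflops:.2f} TFLOPS DP")
```

So the DP toggle traded roughly 0.2 TFLOPS of DP for about 1.5 TFLOPS, at the cost of clocks, which fits the power argument above.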
Worth remembering the previous P100 Mezzanine was also 300W TDP.
Cheers
 
I wonder whether those Tensor Cores can be scheduled to overlap with the regular FP32/2×FP16 units, i.e. utilized together, and if so, how that would impact the power profile; or, going further, whether this is already factored into the 300 W TDP.
I doubt it. One way or the other, the tensor core needs to be fed with data. It's probably using the register file BW from the regular cores.
A more likely option is that the tensor core itself doesn't have all the HW for the full 4×4 multiply-accumulate, but that it's already using some of the regular units, such as the FP32 units for the final add.
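For reference, the primitive each Tensor Core performs per clock is D = A×B + C on 4×4 matrices, with FP16 multiplicands and FP32 accumulation. A minimal sketch of the math (a model of the operation only, not of the hardware datapath):

```python
# Minimal sketch of the Tensor Core primitive: D = A x B + C on 4x4 matrices,
# FP16 inputs with FP32 accumulation. Models the arithmetic, not the hardware.
import numpy as np

def tensor_core_mma(A, B, C):
    """One 4x4x4 multiply-accumulate: FP16 inputs, FP32 accumulate."""
    A16 = A.astype(np.float16)  # multiplicands are stored in half precision
    B16 = B.astype(np.float16)
    # Products of FP16 inputs are accumulated in single precision.
    return A16.astype(np.float32) @ B16.astype(np.float32) + C.astype(np.float32)

A = np.ones((4, 4)); B = np.ones((4, 4)); C = np.zeros((4, 4))
print(tensor_core_mma(A, B, C))  # each element is 4.0, dtype float32
```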
 
One of the Nvidia blog writers, who is also a senior Nvidia engineer, Olivier Giroux, confirmed the following regarding instruction scheduling, the mixing of different instruction types (including different precisions) across issue slots, and issue rate per cycle:
Question to Olivier in the Nvidia blog comments, from Bulat Ziganshin, said:

It seems that each FP32/INT32 instruction scheduled occupies a sub-SM for two cycles, so on the next cycle another type of instruction should be issued; pretty similar to LD/ST instructions on all Nvidia GPUs, as well as scheduling on SM 1.x GPUs.

So the new architecture allows FP32 instructions to run at full rate, and uses the remaining 50% of issue slots to execute all other types of instructions: INT32 for index/loop calculations, load/store, branches, SFU, FP64 and so on. And unlike Maxwell/Pascal, full GPU utilization doesn't require packing pairs of co-issued instructions into the same thread; each cycle can issue instructions from a different thread, so one thread performing a series of FP32 instructions and another thread performing a series of INT32 instructions will load both blocks to 100%.

Is my understanding correct?


Olivier Giroux replied:

That is correct.
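The behaviour Bulat describes can be illustrated with a toy issue model (my own simplification, not Nvidia's description): one issue slot per cycle per sub-SM, each issued instruction occupies its pipe (FP32 or INT32) for two cycles, and two warps with different instruction types keep both pipes busy every cycle.

```python
# Toy model (my own simplification) of Volta sub-SM dual-pipe issue:
# one issue slot per cycle; an issued FP32 or INT32 instruction occupies
# its pipe for two cycles, so warps of different types can saturate both pipes.
def simulate(cycles, warp_streams):
    """warp_streams: per-warp lists of instruction types ('fp32'/'int32').
    Returns the busy fraction of each pipe over the simulated window."""
    busy_until = {"fp32": 0, "int32": 0}  # cycle when each pipe frees up
    pos = [0] * len(warp_streams)         # next instruction index per warp
    busy_cycles = {"fp32": 0, "int32": 0}
    for cycle in range(cycles):
        # One issue slot per cycle: pick the first warp (round-robin order)
        # whose next instruction targets a free pipe.
        for w, stream in enumerate(warp_streams):
            pipe = stream[pos[w] % len(stream)]
            if busy_until[pipe] <= cycle:
                busy_until[pipe] = cycle + 2  # pipe occupied for two cycles
                busy_cycles[pipe] += 2
                pos[w] += 1
                break
    return {p: busy_cycles[p] / cycles for p in busy_cycles}

# One warp issuing only FP32 plus one issuing only INT32: both pipes stay busy.
print(simulate(100, [["fp32"], ["int32"]]))
# A single pure-FP32 warp keeps the FP32 pipe busy but uses only half the
# issue slots, and the INT32 pipe sits idle.
print(simulate(100, [["fp32"]]))
```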

Olivier's background:

Olivier Giroux has contributed to eight GPU and four SM architecture generations released by NVIDIA. Lately, he works to clarify the shapes and semantics of valid GPU programs, present and future. He is a member of WG21, the ISO C++ committee.

Cheers
 
So in the context of a TPU2 discussion, it does 120 TFLOPS?


Worrying or not, it definitely shouldn't have been a surprise. All big non-semi tech companies are hiring semiconductor engineers. I don't think Jim Keller went to Tesla for his expertise in the suppression of rotary engine induced vibrations.
I'm surprised to hear Jim is an expert in rotary engine design, even if Tesla didn't hire him for that skill. ;-)
 
… happens to be also a senior GPU Architect at Nvidia, who probably was hired for his expertise in blogging. ;)
Yeah, good point, and why I felt comfortable mentioning it as well; I should have made that clearer myself, and in fact I will do that now :)
Quite a few of the writers and contributors are heavily involved in the design process, whether GPU architecture or ecosystem/library/instruction design, such as with CUDA/DL/cuBLAS-GEMM/DIGITS/etc.
Yeah, far from being a standard blog.
But then most are reusing content they create for the presentations they give over the year on their subject matter, so it's not too much more work tbh.
Cheers
 
Yeah, this has already been tested with Amber some time ago, both with NVLink active and without.

[Chart: Amber 16 Factor IX NVE benchmark, P100 configurations with and without NVLink]

That is one example, but the performance gains with NVLink vary between 15% and 27% depending on the physics model; this of course does not represent all frameworks and applications, and optimisation is not complete (look at the 12 GB P100 as an example).
It could be interesting with NVLink 2 and Volta.
Cheers
 


To shamelessly steal from that comment section:

C.J. Muse - Evercore Group LLC

Very helpful. I guess as my follow-up, on the inventory side, that grew I think 3% sequentially. Can you walk through the moving parts there? What's driving that, and is foundry diversification part of that? Thank you.

Jen-Hsun Huang - NVIDIA Corp.

The driving reasons for inventory growth is new products, and that's probably all I ought to say for now. I would come to GTC. Come to the keynote tomorrow. I think it will be fun.

Whether or not this is an example of the aforementioned, the fact that Nvidia had been stockpiling actual inventory for new, at-the-time unannounced products is in and of itself noteworthy.
 