Nvidia Volta Speculation Thread

Well Star Citizen, that is just an exercise in bad project management lol (I don't think they have hit a single drop date yet for any of their modules). Designers should never be project managers or producers, sorry, that's just the way it is; designers want to cram as many things into a game as possible just because it's cool and a great idea. It's just like making an artist a project manager: they are never happy with their work, they always feel they can do better. Sometimes budgets and schedules are more important than perfecting every aspect.
 
Oh yeah of course, by the end of this year or so :) We're working up to a Kickstarter campaign, but we need to get certain things finished first, so it's not like what happened with Star Citizen :).

I'm the only one on my team working right now, just finishing up as many of the art assets as possible before the programmers start their work. The design is pretty much complete and on paper everything seems to be balancing out on the RPG side. I'm still working on the RTS side of things, but that module will be an expansion of the game and won't see the light of day until a couple of years after the single-player and multiplayer are done. The MMO will come around the same time too, and the math for that is pretty much balanced on the game design side.

So many moving parts, but I'm still on track with what I started last year.
 
I haven't seen this Micron press release posted yet.

https://www.micron.com/about/blogs/2017/june/what-drives-our-commitment-to-16gbps-graphics-memory
  • GDDR6 will feature 16 Gb dies running at up to 16 Gbps.
  • GDDR5X has hit 16 Gbps in their labs.
Their speed projections are below. No 16 Gbps stuff until 2019 (though I seriously doubt anyone expected it sooner).

blog_image_gddr_3.png


And then they start digging into the differences between GDDR5X and GDDR6.

blog_image_gddr_4.png


It looks like GDDR5X might have a place in future markets for Micron. We haven't seen 16Gb GDDR5X yet.

I'm interested in how Volta will use one or both technologies in varying parts of their lineup.
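For a rough sense of scale (my own back-of-the-envelope math, assuming a hypothetical 384-bit bus): 12 Gbps works out to 12 × 384 / 8 = 576 GB/s, 14 Gbps to 672 GB/s, and 16 Gbps to 768 GB/s.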

EDIT: And just to properly source things, I got this from WCCFTech. They aren't without some merit!

http://wccftech.com/micron-gddr6-gddr5x-memory-mass-production-early-2018/
 

If you stick a keyboard in front of an untrained immortal monkey and let it just hammer at it for eternity, it will eventually churn out the complete works of Shakespeare, so there's that.
 

There were experiments, and they failed. It seems this is a myth, actually. ;)
I would compare it to "countable infinities". Yes, it has a "real probability", but that doesn't mean it can be achieved in reality. You can't count infinitely many natural numbers in reality.

On the other hand: :D
18e299csvare0jpg.jpg
 
Volta DGX Station package pricing - $69,000
  • 4 NVIDIA Tesla V100 “Volta” GPUs with NVLink 2.0 links
  • 20,480 NVIDIA CUDA cores, total
  • 2,560 NVIDIA Tensor cores, total
  • Total of 64GB high-bandwidth GPU memory
  • 480 TFLOPS FP16 half-precision performance
  • NVIDIA-certified & supported software stack for Deep Learning workloads
  • One 20-core Intel Xeon E5-2698v4 CPU
  • 256GB DDR4 System Memory
  • Dual X540 10GbE Ethernet ports (10GBase-T RJ45 ports)
  • Four 1.92TB SSD (one for OS and three in RAID0 for high-speed cache)
  • Quiet, liquid-cooled tower form factor ( < 35dB for office use)
  • Power Consumption: 1500W at full load (for standard office electrical outlet)
  • Ubuntu Desktop Linux operating system
NVIDIA-DGX-Station-Deep-Learning-Workstation-680x383.jpg

https://www.microway.com/preconfiguredsystems/nvidia-dgx-station-deep-learning-workstation/
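For what it's worth, those totals are just four V100s added up: 4 × 5,120 CUDA cores = 20,480, 4 × 640 Tensor cores = 2,560, 4 × ~120 TFLOPS of tensor throughput = 480 TFLOPS, and 4 × 16GB HBM2 = 64GB.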
 
Artistic representation on top of a die shot from the looks of it, obscuring the actual logic underneath.


Nvidia didn't even list FP16 rates for Volta that I've seen, so the assumption is they don't support the packed math like prior architectures and Vega do, only Tensor ops with very limited functionality. Google's TPU only supported 5-6 instructions, as I recall. So Vega's FP16 and Volta's FP32/FP16 rates would be comparable until that changes. The rest of your post is just more of your usual mental gymnastics. I'm not sure why you're so caught up on colleges needing to teach ROCm. The end result will likely be accelerating all applicable languages through LLVM, and it seems reasonable that all languages but CUDA will hold a larger marketshare for some time. The real difference is that AMD won't be reliant on their own programming language to make their products work.
Nvidia would still need to provide a more traditional path for GEMM/cuBLAS.
That can be seen in these actual results using the updated Caffe2 framework, which supports the updates in cuDNN 7 and TensorRT, separate from the actual Tensor cores.
A key difference is that Caffe2 now supports FP16 for training with cuDNN 7 on Volta.
As these are actual results, you also need to take into account the increased core count of V100.

cuDNN_benchmark_1.png


Caffe2-FP16-Chart.png


So yeah, it looks to still be there, and tbh it would be a shocker if they removed it on the V100 as that just would not make sense; IMO they did not talk about it because the focus is on new features such as the mixed-precision Tensor cores.
The point is just to show that Vec2 looks to still be there for V100, leaving aside the facts about P100 in this instance.
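To make concrete what I mean by the "more traditional path": a minimal sketch of an FP16-in/FP32-out GEMM through cuBLAS's existing SgemmEx route (the function and type enums are real cuBLAS; the wrapper, dimensions and leading-dimension choices here are just placeholders of mine, and error checking is omitted):

```cuda
#include <cublas_v2.h>
#include <cuda_fp16.h>

// FP16 inputs, FP32 compute and output -- the SgemmEx-style mixed-precision
// path existing frameworks already target on Pascal.
// d_A, d_B are device pointers to __half matrices, d_C to float;
// m, n, k are placeholder dimensions (column-major, no transpose).
void gemm_fp16_in_fp32_out(cublasHandle_t handle,
                           const __half* d_A, const __half* d_B, float* d_C,
                           int m, int n, int k)
{
    const float alpha = 1.0f, beta = 0.0f;
    cublasSgemmEx(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                  m, n, k,
                  &alpha,
                  d_A, CUDA_R_16F, m,   // A: FP16, leading dimension m
                  d_B, CUDA_R_16F, k,   // B: FP16, leading dimension k
                  &beta,
                  d_C, CUDA_R_32F, m);  // C: FP32 result
}
```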
Cheers
 
I'd agree, but it might be separated exclusively into the tensor cores. Packed math was originally for accelerating deep learning, which a tensor core is obviously designed to do. It still seems an interesting omission, and there is still a question of whether the feature arrives on consumer products, which has bearing on the whole discussion of which architecture is targeted.

Need this discussion moved to a more relevant thread.
 


FP16 calculations are part of nV's mixed-precision capabilities in Pascal; in other words, the ALUs can do it, and they are not removing that in Volta. Why would they rip that out of an already finished pipeline and only do it with Tensor cores? They already have it, they don't need to mention it again. They even added it to the Maxwell-based Tegra X1.
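To make the "the ALUs can do it" point concrete, here's a minimal sketch of the packed-FP16 ("Vec2") path as CUDA already exposes it on compute capability 5.3+ parts (Tegra X1, GP100); the axpy kernel itself is made up for illustration, but __half2 and __hfma2 are the real intrinsics:

```cuda
#include <cuda_fp16.h>

// n2 is the number of __half2 elements (i.e. half the number of FP16 values).
__global__ void axpy_half2(int n2, __half2 alpha, const __half2* x, __half2* y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n2) {
        // One __hfma2 issues two FP16 multiply-adds per instruction --
        // this is where the "2x FP16 over FP32" rate comes from.
        y[i] = __hfma2(alpha, x[i], y[i]);
    }
}
```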
 
What about the possibility that those „cores“ (FP32, FP64, Tensor, INT32) are not as distinct as Nvidia depicts them in the first place? Most of it could be a matter of data flow.
 
I would say the Tensor cores are distinct primarily due to the much improved cycle count of their mixed-precision instruction-operation.
I do think, though, that the Tensor cores can do more than what has been presented to date, albeit still within said matrix structure (some of the highly experienced CUDA devs on the forum feel the same way). Anyway, here is one of the better-known CUDA devs' perspective on Tensor cores.
In fact, just reading it supports my previous post that those Caffe2/ResNet/etc. results are Vec2 packed FP16:

txbob

It's based on the use of TensorCore, which is a new computation engine in the Volta V100 GPU.

The TensorCore is not a general purpose arithmetic unit like an FP ALU, but performs a specific 4x4 matrix operation with hybrid data types. If your algorithm (whatever it may be) can take advantage of that, then you may witness a perf improvement. It has to be coded for, and this operation does not trivially map into a C or C++ operator (like multiply) so the exposure will probably primarily be through libraries, and the library in question for deep learning would be CuDNN.

It's likely that future versions of cuDNN will use the TensorCores on V100 (when V100 becomes available, in the future) and to the extent that these then become "available" to operations from e.g. TensorFlow that use the GPU, it should be possible (theoretically) to achieve a speed up for certain operations in Tensorflow.

You should in the future be able to use the TensorCore conceptually similarly to the way novel compute modes like INT8 and FP16 are currently exposed via CuDNN. You will have to specify the right settings, and format your data correctly, and after that it should "just work" for a particular cuDNN library call.

Using it as a standalone operation in pure CUDA C/C++ should theoretically be possible, but it remains to be seen exactly how it will be exposed (if at all) in future versions of CUDA.

His later post also shows the difference in approach, with regard to mixed-precision operation, between the traditional 'FP32 CUDA' core and the Tensor core, and how they operate with input/output/compute precision.
The Tensor core would have greater throughput than SgemmEx.

Sounds like you want a full functional spec and all the details now. I don't think all that information has been disclosed yet.

I think what I've heard so far does not map into INT8 at all, and does not map directly into an ordinary FP16 matrix-matrix multiply, because there the output datatype would be FP16.

If you had some imaginary BLAS operation that did a FP16xFP16 matrix-matrix multiply that produced a FP32 result matrix, you could probably use this feature to good effect there. You might want to look at some of the existing exposed capability in the SgemmEx function in CUBLAS.

txbob has been around a long time on the Nvidia Devtalk forum and has a lot of experience with Nvidia products and CUDA; one of the best sources of information IMO.
https://devtalk.nvidia.com/default/topic/1009558/perfomance-question-for-tesla-v100/?offset=4
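For anyone curious what a pure-CUDA exposure might look like: a rough sketch in the style of the warp-level WMMA interface previewed for CUDA 9 (sm_70), where one warp cooperatively computes a 16x16x16 tile with FP16 inputs and FP32 accumulation. The kernel name and the single-tile assumption are mine; the nvcuda::wmma names are taken from the CUDA 9 preview documentation and could still change:

```cuda
#include <cuda_fp16.h>
#include <mma.h>
using namespace nvcuda;

__global__ void wmma_tile(const half* a, const half* b, float* c)
{
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> c_frag;

    wmma::fill_fragment(c_frag, 0.0f);           // C = 0 for this sketch
    wmma::load_matrix_sync(a_frag, a, 16);       // load 16x16 FP16 A tile
    wmma::load_matrix_sync(b_frag, b, 16);       // load 16x16 FP16 B tile
    wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);   // D = A*B + C on the tensor cores
    wmma::store_matrix_sync(c, c_frag, 16, wmma::mem_row_major);
}
```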

Cheers
 
In fact, just reading it supports my previous post that those Caffe2/ResNet/etc. results are Vec2 packed FP16:
It even says so in the footnotes.
This of course again is apples and oranges, since we all deemed GP100 to also support 2×FP16, right?


But that's not what I was getting at. I am talking about shared circuits between those „cores“, i.e. the Tensor part re-using some of the adders and/or multipliers from the „regular“ FP32 and/or FP64 ALUs.
 
Well, I am emphasising the point as confirmation, albeit specific to V100 since we do not know anything about the lower models; this conversation initially came about debating whether V100 still supports Vec2 packed FP16 compared to Vega, as raised earlier by Anarchist.
The last post did say "It still seems an interesting omission", so making the point with the information provided by txbob makes sense to me.
But anyway, txbob provides further interesting background beyond this.
Cheers
 
Maybe this should be done in the Volta thread?
The rest of the posts that kicked this off are over there.
Like I mentioned and showed, the performance data is Vec2 FP16 with Caffe2 (which was only recently updated for FP16 training, but with cuDNN 7), and then I quoted one of the senior and experienced CUDA devs at Devtalk, who also mentions that cuDNN, as used in the real world by current frameworks, does not yet support the Tensor cores; that also ties into the Caffe2 results for V100 vs GP100.

You need to be able to explain this for it to have been removed: how would every current traditional HPC implementation out there using cuBLAS and GEMM, or designed around cuDNN/TensorRT, work with the totally different instruction and structure requirements of the mixed-precision Tensor cores compared to existing Vec2/SgemmEx/Hgemm, if Nvidia for some crazy reason removed Vec2/Hgemm/SgemmEx from V100?
Cheers
Higher performance doesn't imply the packed math though. It could simply come down to a wider processor with improved clocks. We already know the chip is a good deal larger with more SIMDs.

What about the possibility that those „cores“ (FP32, FP64, Tensor, INT32) are not as distinct as Nvidia depicts them in the first place? Most of it could be a matter of data flow.
That's what I've been wondering, especially with the parallel INT32 pipeline. Tensors logically might be replacing the traditional operations to keep the design compact. Ruling out certain instructions would also allow clocks to change. They could still have the packed math, but with results fed to that integer pipeline for FP32 math; Tensors and FP32 math wouldn't run concurrently under those conditions. Nvidia would just be presenting the capability differently on the diagrams.

I would say the Tensor cores are distinct primarily due to the much improved cycle count of their mixed-precision instruction-operation.
Not necessarily, given the temporal nature of the tensor operations. As I suggested as a Vega possibility, the most efficient arrangement would be an FP16 ALU alternating between two adders: FP32 takes longer to propagate than FP16, so pipe data through that arrangement, refactor the timing around FP16 with clocks doubling, and standard FP32 operations then take two or more cycles to complete. A tensor core could be two SMs working together in a special arrangement. The actual tensor math is far simpler, as established in the tensor thread; without the data sharing it should be straightforward.
 
Higher performance doesn't imply the packed math though. It could simply come down to a wider processor with improved clocks. We already know the chip is a good deal larger with more SIMDs.

RPM is just another way of saying two FP16 operations in one FP32 unit.

The only things making nV's Volta die bigger are its tensor and DP cores; take those out and it seems like it's going to be around 600mm², which by all accounts is close to what Vega is expected to be (and Volta has more ALUs).
 
What about the possibility that those „cores“ (FP32, FP64, Tensor, INT32) are not as distinct as Nvidia depicts them in the first place? Most of it could be a matter of data flow.
There may be considerations between the FP and INT units that could keep them separate, since there is now dual-issue between the unit types and possibly variations in how their scheduling hardware would work.

The data flow for the tensor unit is one area that (if we take Nvidia's statements about operations per cycle at face value) could have a significant need for specialized routing.
They do outright say it in their blog about Volta:
Tensor Cores and their associated data paths are custom-crafted to dramatically increase floating-point compute throughput at only modest area and power costs. Clock gating is used extensively to maximize power savings.

Getting 64 parallel multiplies sourced from 16-element operands, followed by a full-precision accumulate of those multiplies and a third 16-element operand into a clock cycle efficiently and at the target clocks sounds non-trivial. Shoehorning that into the existing pipelines may not be good for them, either.
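To put numbers on that: one tensor op is D = A×B + C on 4×4 tiles, FP16 multiplies with an FP32 accumulate, i.e. 4 fused multiply-adds per output element and 16 outputs, so 64 multiply-adds per core per clock. A trivial scalar reference of the semantics (nothing Volta-specific; plain floats are used just to show the operation count):

```cuda
// Reference semantics of one tensor-core op: D = A*B + C on 4x4 tiles.
// On the real unit A and B are FP16 and the accumulate is FP32.
void tensor_op_reference(const float A[4][4], const float B[4][4],
                         const float C[4][4], float D[4][4])
{
    for (int i = 0; i < 4; ++i)
        for (int j = 0; j < 4; ++j) {
            float acc = C[i][j];               // third 16-element operand
            for (int k = 0; k < 4; ++k)
                acc += A[i][k] * B[k][j];      // 4 multiply-adds per output, 64 total
            D[i][j] = acc;
        }
}
```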
 