Nvidia Volta Speculation Thread

No, it doesn't take a semiconductor deity to "stop" someone who's new to the market in a resource and project-management battle.

Not excluding the fact that Google (or anyone) can make a breakthrough. But as I've said, I fear their window of opportunity is running out.

It does feel like it. For training, GPUs are already quite good. Nvidia will no doubt keep optimizing for AI training, as will AMD, and most likely Intel with its new GPUs. There's already heavy competition there on price and advancement. The TPU doesn't feel like a particularly necessary project.
 
It probably all depends on cost. Google will likely want to install tens of thousands of TPU racks.
If they can produce them at a much lower cost than a DGX appliance, then it might still make sense?
 
Ehh, yeah. They are probably not paying more than $50-$100 per TPU, but $5,000-$10,000 per Nvidia V100 ... so it's 100x TPU or 1x V100 - you choose.
 
We’ll see about the TPU/siblings market. There are a number of interested parties at different tiers.
I’d like to remind everyone how, when Intel decided to enter the mobile market, many thought they would soon come to dominate it completely. They had enormous CPU design experience, the best fabs, coffers deep enough to soak up several billion dollars' worth of losses (which they did) in order to crack the market and gain dominance, the full weight of the x86 software stack, and the ability to strike cross-market deals with players like ASUS, Lenovo, Acer...
I think strong predictions about the TPU market are a bit premature.
 
Ehh, yeah. They are probably not paying more than $50-$100 per TPU, but $5,000-$10,000 per Nvidia V100 ... so it's 100x TPU or 1x V100 - you choose.
I expect that, apples to apples, Google’s TPU cost will be much higher than that, since you’re quoting V100 system prices (enclosure, Xeon server, InfiniBand, ...). So let’s say 10x instead of 100x.

Also, while R&D costs are something Wall Street doesn’t care about, it may be something Google takes into account internally, because there is an alternative that they can simply buy.
 
What do you think a 330 mm² chip on 28nm costs these days? Or a small board and 2x4 GB of DDR3 memory? My $50-$100 guess already includes a development-cost write-off.
 
What do you think a 330 mm² chip on 28nm costs these days? Or a small board and 2x4 GB of DDR3 memory? My $50-$100 guess already includes a development-cost write-off.

But the Google TPU 2nd gen is not a single, basic PC part; it is designed for massive scale-out, high-bandwidth tensor computations, so the costs are quite different.
Worth noting the board also carries 4x TPU processors rather than just a single accelerator.

Separately, here is a photo from Google of how it is installed, to give a sense of the scale and integration required by the design, which influences manufacturing cost.

[Image: Google TPU v2 installation]
 
What do you think a 330 mm² chip on 28nm costs these days? Or a small board and 2x4 GB of DDR3 memory? My $50-$100 guess already includes a development-cost write-off.
I don’t know. But I don’t think the question is very relevant.

One of the requirements for training is memory bandwidth, so that kind of memory just isn’t going to cut it.

And the thing must run in a system. So either you compare chip (plus memory) to chip or system to system.

A $10k V100 is part of a complex system. It makes no sense to compare that against a $100 chip.

After all, we already know that you can buy a V100 for (almost) $3k.
 
Just got my 2 Titan Vs today and have tested a few kernels. The results are good, but the boost clock seems overrated: in my tests, the GPU boost clock only reaches 1355MHz, vastly lower than my GP102, which can reach 1850+MHz.

The most interesting part is the GEMM test with CUBLAS_TENSOR_OP_MATH enabled:

With tensor cores enabled for GEMM with fp16 x fp16 = fp32, Titan V can reach 83 Tflops/sec, which is quite impressive, especially considering it has only 3/4 of the bandwidth of the V100.

And the most unexpected result is that, when tensor cores are enabled, they seem to accelerate SGEMM as well, for reasons yet unknown:

Without tensor cores, the SGEMM test on Titan V gets just ~12 Tflops.

But with tensor cores enabled, SGEMM on Titan V can reach 30-40 Tflops.

I don't know how this is possible; maybe Nvidia forgot to mention that their tensor cores can accelerate SGEMM as well? I just hope this is a hidden feature, not a bug in CUDA 9.1.
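
For reference, the tensor-op path measured above boils down to a cuBLAS call along these lines. This is only a minimal sketch assuming the CUDA 9.x cuBLAS API; the matrix size and scalars are placeholders, and the timing loop is omitted.

```cpp
// Minimal sketch: FP16-in / FP32-accumulate GEMM with Tensor Cores allowed
// via CUBLAS_TENSOR_OP_MATH (cuBLAS, CUDA 9.x API). Sizes and scalars are
// placeholders, not the actual benchmark settings.
#include <cublas_v2.h>
#include <cuda_fp16.h>
#include <cuda_runtime.h>
#include <cstdio>

int main() {
    const int n = 8192;                      // placeholder square matrix size
    const float alpha = 1.0f, beta = 0.0f;   // scalars use the compute type (FP32)

    __half *A, *B;                           // FP16 inputs
    float  *C;                               // FP32 output / accumulator
    cudaMalloc(&A, sizeof(__half) * n * n);
    cudaMalloc(&B, sizeof(__half) * n * n);
    cudaMalloc(&C, sizeof(float)  * n * n);

    cublasHandle_t handle;
    cublasCreate(&handle);
    cublasSetMathMode(handle, CUBLAS_TENSOR_OP_MATH);   // allow Tensor Core kernels

    // fp16 x fp16 -> fp32: A/B are CUDA_R_16F, C and the compute type are CUDA_R_32F
    cublasGemmEx(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                 n, n, n,
                 &alpha,
                 A, CUDA_R_16F, n,
                 B, CUDA_R_16F, n,
                 &beta,
                 C, CUDA_R_32F, n,
                 CUDA_R_32F,
                 CUBLAS_GEMM_DEFAULT_TENSOR_OP);
    cudaDeviceSynchronize();

    // Effective throughput would be 2*n^3 / elapsed_seconds / 1e12 TFLOP/s
    // (wrap the call in CUDA events or a wall-clock timer to measure it).
    printf("GEMM issued\n");

    cublasDestroy(handle);
    cudaFree(A); cudaFree(B); cudaFree(C);
    return 0;
}
```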
 
Just got my 2 Titan Vs today and have tested a few kernels. The results are good, but the boost clock seems overrated: in my tests, the GPU boost clock only reaches 1355MHz, vastly lower than my GP102, which can reach 1850+MHz.

The most interesting part is the GEMM test with CUBLAS_TENSOR_OP_MATH enabled:

With tensor cores enabled for GEMM with fp16 x fp16 = fp32, Titan V can reach 83 Tflops/sec, which is quite impressive, especially considering it has only 3/4 of the bandwidth of the V100.

And the most unexpected result is that, when tensor cores are enabled, they seem to accelerate SGEMM as well, for reasons yet unknown:

Without tensor cores, the SGEMM test on Titan V gets just ~12 Tflops.

But with tensor cores enabled, SGEMM on Titan V can reach 30-40 Tflops.

I don't know how this is possible; maybe Nvidia forgot to mention that their tensor cores can accelerate SGEMM as well? I just hope this is a hidden feature, not a bug in CUDA 9.1.
Interesting.
The SGEMM result with it enabled looks more like the FP16 CUDA mixed-precision result one could expect (with the P100 and V100/Titan V), but I assume you are not using that.
 
Interesting.
The SGEMM result with it enabled looks more like the FP16 CUDA mixed-precision result one could expect (with the P100 and V100/Titan V), but I assume you are not using that.

Never mind, I just checked the CUDA 9.1 documentation; it seems that cublasSgemm will just convert FP32 to FP16 when tensor cores are enabled:

cublasSgemm(), cublasGemmEx(), cublasSgemmEx(), cublasGemmBatchedEx(), cublasGemmStridedBatchedEx() NOTE: A conversion from CUDA_R_32F to CUDA_R_16F with round to nearest on the input values A/B is performed when Tensor Core operations are used
 
Never mind, I just checked the CUDA 9.1 documentation; it seems that cublasSgemm will just convert FP32 to FP16 when tensor cores are enabled:
Kinda makes sense for a mixed-precision function, although I can appreciate that it changes the concept of Sgemm; Tensor Cores are blurring the boundaries.

BTW, if you get the time, could you please consider trying Hgemm to compare its performance against Sgemm with Tensor ops enabled.
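
In case it helps, the comparison would look roughly like the sketch below: (a) cublasSgemm with CUBLAS_TENSOR_OP_MATH set, where cuBLAS rounds the FP32 inputs to FP16 per the note quoted above, versus (b) a plain cublasHgemm. A sketch only, assuming a CUDA toolkit with host-side __float2half; sizes are placeholders and the buffers are left uninitialised, since only the code path matters here.

```cpp
#include <cublas_v2.h>
#include <cuda_fp16.h>
#include <cuda_runtime.h>

int main() {
    const int n = 4096;                      // placeholder size
    float  *A32, *B32, *C32;                 // FP32 buffers for the Sgemm path
    __half *A16, *B16, *C16;                 // FP16 buffers for the Hgemm path
    cudaMalloc(&A32, sizeof(float)  * n * n);
    cudaMalloc(&B32, sizeof(float)  * n * n);
    cudaMalloc(&C32, sizeof(float)  * n * n);
    cudaMalloc(&A16, sizeof(__half) * n * n);
    cudaMalloc(&B16, sizeof(__half) * n * n);
    cudaMalloc(&C16, sizeof(__half) * n * n);

    cublasHandle_t handle;
    cublasCreate(&handle);

    // (a) "SGEMM" with Tensor Cores allowed: per the CUDA 9.1 note quoted
    //     above, the FP32 A/B inputs are rounded to FP16 before the Tensor
    //     Core multiply, so this is effectively mixed precision.
    const float alpha_f = 1.0f, beta_f = 0.0f;
    cublasSetMathMode(handle, CUBLAS_TENSOR_OP_MATH);
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                &alpha_f, A32, n, B32, n, &beta_f, C32, n);

    // (b) Plain HGEMM: FP16 inputs, outputs and scalars throughout.
    const __half alpha_h = __float2half(1.0f), beta_h = __float2half(0.0f);
    cublasHgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                &alpha_h, A16, n, B16, n, &beta_h, C16, n);

    cudaDeviceSynchronize();                 // time each call separately in practice
    cublasDestroy(handle);
    cudaFree(A32); cudaFree(B32); cudaFree(C32);
    cudaFree(A16); cudaFree(B16); cudaFree(C16);
    return 0;
}
```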
 
I'm curious if you have "Above 4G Decoding" enabled in the motherboard bios for 64-bit decoding above the 4G address space?
What exactly does that option do? I have it in my current main rig's UEFI, and the help text doesn't explain it (which isn't a surprise, because ASUS fairly sucks).
 
What exactly does that option do? I have it in my current main rig's UEFI, and the help text doesn't explain it (which isn't a surprise, because ASUS fairly sucks).
Supposedly you should set it to Enabled when using cards like Tesla or Quadro, to allow memory-mapped I/O for a 64-bit PCIe card/device to use address space above 4GB. Not sure if it does anything, but I noticed the same BIOS option in my ASUS motherboard as well.

Edit:
Tomshardware mentioned it recently in one of their mining articles:
http://www.tomshardware.com/news/msi-bios-cryptocurrency-mining-performance,34972.html

Interesting fact: enabling the above-4G address option on my motherboard removed some of the OS exclamation marks in Device Manager for some of my devices, similar to what happened in the Tomshardware article.
 
Contacted a local Nvidia guy; it seems the boost on the Titan V is just that low (1335MHz for my two cards), and the boost is much less flexible than on an average GeForce, more like a Tesla/Quadro, so maybe the Titan V should be renamed Tesla V80 instead. When playing games, though, the card can boost to 1800MHz or so.

I suspect that must have something to do with the FP64 side; the original Titan would also downclock significantly when full-speed FP64 was enabled.

It's a shame the driver can no longer disable full-speed FP64 on the Titan V.

@LiXiangyang
I'm curious if you have "Above 4G Decoding" enabled in the motherboard bios for 64-bit decoding above the 4G address space?

I always leave that option enabled.
 
Depending on which API you use, the boost can go well beyond that point. In OpenCL I saw 1335 MHz boost as well, but in D3D (which also has a compute part) it went up there with the Pascal cards, until of course the cooler could no longer get rid of the heat, which is also Pascal-like.
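
For anyone who wants to see what the SM clock is actually doing under these different workloads, polling NVML from a side process works. A minimal sketch, assuming device index 0 and linking against the NVML library (-lnvidia-ml); error checking is omitted.

```cpp
// Poll the reported SM clock and GPU temperature once a second while a
// workload runs on device 0 (NVML; error checking omitted for brevity).
#include <nvml.h>
#include <cstdio>
#include <chrono>
#include <thread>

int main() {
    nvmlInit();
    nvmlDevice_t dev;
    nvmlDeviceGetHandleByIndex(0, &dev);

    for (int i = 0; i < 60; ++i) {           // sample for one minute
        unsigned int sm_mhz = 0, temp_c = 0;
        nvmlDeviceGetClockInfo(dev, NVML_CLOCK_SM, &sm_mhz);
        nvmlDeviceGetTemperature(dev, NVML_TEMPERATURE_GPU, &temp_c);
        printf("SM clock: %u MHz, temperature: %u C\n", sm_mhz, temp_c);
        std::this_thread::sleep_for(std::chrono::seconds(1));
    }

    nvmlShutdown();
    return 0;
}
```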
 
Simply speculation: if Nvidia releases a series of new gaming cards that offer poor mining bang for the buck, it could create an interesting situation. At current prices, existing owners of GTX 1050s/1060s/1070s who don't mine could sell their cards for what they paid for them, or more, and use that cash to subsidize the upgrade to the new, hypothetical gaming card.

On the other hand, Nvidia would be foolish not to sell new cards to miners, so they'd likely release something that would lower the value of existing cards to miners. Interesting times indeed; waiting to see how this all turns out. Interesting times in a bad way for those who need or want a new card.
 