Nvidia Volta Speculation Thread

Found this on reddit, don't know how accurate it is:
Code:
GPU             Price     FP64 (GFLOPS)   FP64/$ (GFLOPS/$)
Titan V         $2,999    6900            2.301
Titan Black     $999      1707            1.709
Quadro GP100    $6,999    5200            0.743
Tesla P100      $7,999    5200            0.650
1060 3GB        $199      123             0.618
1070 Ti         $449      256             0.570
1060 6GB        $249      137             0.550
1070            $379      202             0.533
1080 Ti         $699      354             0.506
1080            $549      277             0.505
Titan Xp        $1,200    380             0.317
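
The FP64-per-dollar column can be reproduced from the other two (a quick sketch; prices and GFLOPS figures come straight from the table above, not independently verified):

```python
# FP64 throughput in GFLOPS and launch prices in USD, per the table above.
cards = {
    "Titan V":      (6900, 2999),
    "Titan Black":  (1707,  999),
    "Quadro GP100": (5200, 6999),
    "Tesla P100":   (5200, 7999),
    "Titan Xp":     ( 380, 1200),
}

for name, (gflops, price) in cards.items():
    # GFLOPS per dollar, rounded to match the table's three decimals.
    print(f"{name:12s} {gflops / price:.3f} GFLOPS/$")
```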
 
Hm, I gotta say that an NVLink bridge costing 600 fricken dollars for a piece of board with some plastic and a few connectors on it is beyond Monster Cable pricing territory. Never mind that it's aimed at pro users and doesn't even work with the Titan V; a thing like that should come with the card itself, not be a silly expensive extra.
 
Judging by the specs, I have a hard time believing this Titan V can achieve 110 TFLOPS of DL throughput; the memory bandwidth on V100 is barely sufficient to feed the mixed-precision computation, and now they've cut a quarter of it off.
 
If you look at the real-world workload of ResNet-50, you see a speed-up factor of 2.4x when using tensor cores for training (compared to P100 FP32) and 3.7x for inference (compared to P100 FP16).

Titan V will be slower, of course, but it should be enough of a speedup to be worth it, especially since P100 was never available as a 'cheap' Titan product to begin with.
 
How many bytes are required per DL OP?

According to their own docs, they actually implement the DL path as a 256x256 matrix multiply within a warp for a mixed-precision multiply. Note that there is insufficient storage in either the (warp-wide) register file or shared memory to hold the temporary results, so they have to write results back to main memory.

Which means, even in the best-case scenario, they can only achieve 256*256*256*2 / (256*256*4) = 128 DL ops/byte.

Therefore, to achieve 110 TFLOPS you will need roughly 1 TB/s of memory bandwidth, although the L2 cache can reduce that requirement a little (it depends a lot on the access pattern, but given Nvidia's previous GPU architectures, I doubt it will help much in most GEMM cases, since you usually need much larger matrices in the first place to achieve high efficiency).
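
That back-of-envelope calculation can be checked directly (same numbers as above; the 4 bytes per result assumes FP32 accumulator writes):

```python
# Arithmetic intensity: FLOPs per byte written back, best case.
# 2*N^3 multiply-adds for an NxN x NxN GEMM; N*N FP32 results at 4 bytes each.
ops_per_byte = (2 * 256**3) / (256 * 256 * 4)   # = 128 DL ops/byte

target_tflops = 110e12                           # claimed tensor throughput
required_bw = target_tflops / ops_per_byte       # bytes/sec needed to feed it
print(required_bw / 1e12)                        # ~0.86 TB/s
```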
 
I looked at systems with the same Skylake 6700K that this mystery 'Generic VGA' score uses. Scores can vary with RAM speed, single vs. dual channel, and other tasks running in the background.

But you confused me when you said 'than a 980 Ti even'; I thought you were saying it was GPU-dependent.
 
According to WCCFTech, the card reaches a 1.9 GHz boost clock and OCs easily to 2.0 GHz, just like Pascal. Some benchmarks:

FireStrike:
Titan V (stock): 32K
Titan Xp (stock): 28K

Unigine Superposition:
Titan V (stock): 9431
Titan Xp (stock): ~6000
1080 Ti (OC'ed to 2.6GHz): 8642

https://wccftech.com/nvidia-titan-v-volta-gaming-benchmarks/

Some gaming comparisons:

Gears of War 4:
Titan V OC - 166 fps
1080 Ti OC - 124 fps
 

Worth noting, though, that part of the performance limitation is the interconnect, which makes it a bit harder to know the real limit of the HBM2 memory in this context: the NVLink2 V100 is rated at 125 TFLOPS and the PCIe3 card at 112 TFLOPS, both with the same HBM2 bandwidth.
But I agree, 110 TFLOPS still seems optimistic given the HBM2 bandwidth/bus reduction on the Titan V.
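
If tensor throughput were purely bandwidth-bound, a rough scaling sketch would look like the following (assuming the commonly quoted 900 GB/s for the four-stack V100 and ~653 GB/s for the three-stack Titan V; the PCIe V100's 112 TFLOPS at the same bandwidth already shows linear scaling is imperfect):

```python
# Back-of-envelope: scale V100's peak tensor TFLOPS by the bandwidth cut.
v100_bw      = 900.0   # GB/s, 4 HBM2 stacks (NVLink2 V100)
titan_v_bw   = 652.8   # GB/s, 3 HBM2 stacks (Titan V)
v100_tflops  = 125.0   # peak tensor TFLOPS, NVLink2 V100

scaled = v100_tflops * titan_v_bw / v100_bw
print(f"{scaled:.1f} TFLOPS")  # ~90.7, short of the claimed 110
```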

Edit:
Good grief, I can tell I'm hung over and my brain is on planet Elsewhere; yeah, different core GPU clocks, sigh :)
Still, it shows the full bus/bandwidth manages 125 TFLOPS at full-spec core clocks (the NVLink2 card), and we can't tell what the limitation is even there.
 
Didn't Jensen Huang mention at one point this year that Volta's manufacturing cost per unit was pretty high?
Estimates put it around $600 to $1K, and that is just the hardware-related cost; I think Jensen confirmed it was higher than most expected, closer to the upper end of the $1K estimates.
 
VC have posted OC'ed synthetic benchmarks, but barring Superposition 1080p Extreme, where it's ~50% faster, it doesn't show much improvement over an OC'ed 1080 Ti. :-?

https://videocardz.com/74382/overclocked-nvidia-titan-v-benchmarks-emerge

Superposition 1080p Extreme differs only in shader quality, so maybe that's doing something. They mention it showing clocks over 2 GHz with overclocked HBM, so it's probably not throttling enough to explain the small difference.
Worth remembering, though, that this is more of a science/dev Titan than a gaming/visual card; a lot of die space is taken up by FP64 and some by the tensor cores.
While not a great indicator for gaming, it's worth noting the Titan V has 13.8 TFLOPS FP32 while the Titan Xp had 12.1 TFLOPS FP32, with both having a similar GPC/PolyMorph count; the performance difference would be much greater across the board if it were not such a huge mixed-precision GPU.
I would need to find the documentation, but I thought the higher SM count of these HPC models is not that efficient for gaming workloads; it basically doubles the SMs per GPC and TPC, or, put another way, uses 64 CUDA cores per SM rather than the 128 found in all other recent Nvidia GPUs.
 
It's 14.8 TFLOPS for the Titan V. And that number doesn't even hold when the chip is running close to 2.0 GHz, which effectively makes it a 20 TFLOPS GPU.
You can say the same about the Titan Xp when it comes to spec clocks and OCing.
You are probably right, but some reports have it at 13.8 TFLOPS, such as AnandTech's, *shrug*. Also, to break past 14 TFLOPS by a fair margin it may need to go up to the 300 W spec of the NVLink model rather than the 250 W spec it is sold at; possible, but I assume that means changing power settings.
Anyway, the point is it is not massively ahead of the Titan Xp (21.5% more relevant CUDA cores on the Titan V), which is reflected in some of those scores for the reason I mentioned; the larger gaps can possibly make better use of some of the architecture changes, but, as I mentioned, the doubled number of SMs per GPC/TPC may be detrimental to gaming-type workloads.

It is a really great card for the price, don't get me wrong, but it is not necessarily ideal for most prosumers.
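
As a sanity check on the FP32 numbers being thrown around: peak FP32 is just 2 FLOPs (one FMA) per CUDA core per clock. A small sketch using the published core counts and boost clocks (5120 cores / 1455 MHz for Titan V, 3840 / 1582 MHz for Titan Xp), which lands between the 13.8 and 14.8 figures quoted in this exchange:

```python
def fp32_tflops(cuda_cores: int, clock_ghz: float) -> float:
    # Peak FP32 = 2 FLOPs (one fused multiply-add) per core per cycle.
    return 2 * cuda_cores * clock_ghz / 1000.0

print(fp32_tflops(3840, 1.582))  # Titan Xp at boost -> ~12.1
print(fp32_tflops(5120, 1.455))  # Titan V at boost  -> ~14.9
print(fp32_tflops(5120, 2.0))    # Titan V OC'ed     -> ~20.5
```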
 