Nvidia Volta Speculation Thread

Found this on reddit, don't know how accurate it is:
Code:
GPU             Price     FP64 (GFLOPS)   FP64/$ (GFLOPS/$)
Titan V         $2,999    6900            2.301
Titan Black     $999      1707            1.709
Quadro GP100    $6,999    5200            0.743
Tesla P100      $7,999    5200            0.650
1060 3GB        $199      123             0.618
1070 Ti         $449      256             0.570
1060 6GB        $249      137             0.550
1070            $379      202             0.533
1080 Ti         $699      354             0.506
1080            $549      277             0.505
Titan Xp        $1,200    380             0.317
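
The FP64-per-dollar column can be reproduced from the other two (a quick sketch; prices and GFLOPS figures come straight from the table above, not independently verified):

```python
# FP64 throughput in GFLOPS and launch prices in USD, per the table above.
cards = {
    "Titan V":      (6900, 2999),
    "Titan Black":  (1707,  999),
    "Quadro GP100": (5200, 6999),
    "Tesla P100":   (5200, 7999),
    "Titan Xp":     ( 380, 1200),
}

for name, (gflops, price) in cards.items():
    # GFLOPS per dollar, rounded to match the table's three decimals.
    print(f"{name:12s} {gflops / price:.3f} GFLOPS/$")
```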
 
Hm, I gotta say that an NVLink bridge costing 600 fricken dollars for a piece of board with some plastic and a few connectors on it is beyond Monster Cable pricing territory. Never mind that it's aimed at pro users and doesn't even work with the Titan V; a thing like that should come with the card itself, not be a silly expensive extra.
 
Judging by the specs, I have a hard time believing this Titan V can achieve 110 TFLOPS of DL throughput; the memory bandwidth on V100 is barely sufficient to feed the mixed-precision computation, and now they've cut a quarter of it off.
 
If you look at the real-world workload of ResNet-50, you see a speed-up factor of 2.4x when using tensor cores for training (compared to P100 FP32) and 3.7x for inference (compared to P100 FP16).

Titan V will be slower, of course, but it should be enough of a speedup to be worth it, especially since P100 was never available as a 'cheap' Titan product to begin with.
 
How many bytes are required per DL OP?

According to their own docs, they actually implement the DL path as a 256x256 matrix multiply within a warp for a mixed-precision multiply. Note that there is insufficient storage in either the (warp-wide) register file or shared memory to hold the temporary results, so they have to write results back to main memory.

Which means, even in the best-case scenario, they can only achieve 256*256*256*2 / (256*256*4) = 128 DL ops/byte.

Therefore, to achieve 110 TFLOPS you will need roughly 1 TB/s of memory bandwidth, although the L2 cache can reduce that requirement a little (it depends a lot on the access pattern, but given Nvidia's previous GPU architectures, I doubt it will help much in most GEMM cases, since you usually need much larger matrices in the first place to achieve high efficiency).
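
That back-of-envelope calculation can be checked directly (same numbers as above; the 4 bytes per result assumes FP32 accumulator writes):

```python
# Arithmetic intensity: FLOPs per byte written back, best case.
# 2*N^3 multiply-adds for an NxN x NxN GEMM; N*N FP32 results at 4 bytes each.
ops_per_byte = (2 * 256**3) / (256 * 256 * 4)   # = 128 DL ops/byte

target_tflops = 110e12                           # claimed tensor throughput
required_bw = target_tflops / ops_per_byte       # bytes/sec needed to feed it
print(required_bw / 1e12)                        # ~0.86 TB/s
```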
 
I looked at systems with the same Skylake 6700K that this mystery 'Generic VGA' score uses. Scores can vary with RAM speed, single vs. dual channel, and other tasks running in the background.

But you confused me when you said 'than a 980 Ti even'; I thought you were saying it was GPU-dependent.
 
According to WCCFTech, the card reaches a 1.9 GHz boost clock and OCs easily to 2.0 GHz, just like Pascal. Some benchmarks:

FireStrike:
Titan V (stock): 32K
Titan Xp (stock): 28K

Unigine Superposition:
Titan V (stock): 9431
Titan Xp (stock): ~6000
1080 Ti (OC'ed to 2.6GHz): 8642

https://wccftech.com/nvidia-titan-v-volta-gaming-benchmarks/

Some gaming comparisons:

Gears of War 4:
Titan V OC - 166 fps
1080 Ti OC - 124 fps
 

Worth noting, though, that part of the performance limitation is the interconnect, which makes it a bit harder to know the real limit of the HBM2 memory in this context: the NVLink2 V100 is rated at 125 TFLOPS and the PCIe3 card at 112 TFLOPS, both with the same HBM2 bandwidth.
But I agree, 110 TFLOPS still seems optimistic given the HBM2 bandwidth/bus reduction on the Titan V.
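
If tensor throughput were purely bandwidth-bound, a rough scaling sketch would look like the following (assuming the commonly quoted 900 GB/s for the four-stack V100 and ~653 GB/s for the three-stack Titan V; the PCIe V100's 112 TFLOPS at the same bandwidth already shows linear scaling is imperfect):

```python
# Back-of-envelope: scale V100's peak tensor TFLOPS by the bandwidth cut.
v100_bw      = 900.0   # GB/s, 4 HBM2 stacks (NVLink2 V100)
titan_v_bw   = 652.8   # GB/s, 3 HBM2 stacks (Titan V)
v100_tflops  = 125.0   # peak tensor TFLOPS, NVLink2 V100

scaled = v100_tflops * titan_v_bw / v100_bw
print(f"{scaled:.1f} TFLOPS")  # ~90.7, short of the claimed 110
```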

Edit:
Good grief, I can tell I'm hung over and my brain is on planet Elsewhere; yeah, different core GPU clocks, sigh :)
Still, it shows the full bus/bandwidth manages 125 TFLOPS at full-spec core clocks (the NVLink2 card), and we can't tell what the limitation is even there.
 
Didn't Jensen Huang mention at one point this year that Volta's manufacturing cost per unit was pretty high?
Estimates put it around $600 to $1K, and that is just the hardware-related cost; I think Jensen confirmed it was higher than most expected, closer to the upper end of the $1K estimates.
 
VC have posted OC'ed synthetic benchmarks, but barring Superposition 1080p Extreme, where it's ~50% faster, it doesn't show much improvement over an OC'ed 1080 Ti. :-?

https://videocardz.com/74382/overclocked-nvidia-titan-v-benchmarks-emerge

Superposition 1080p Extreme differs only in shader quality, so maybe that's doing something. They mention it showing clocks over 2 GHz with overclocked HBM, so it's probably not throttling enough to explain the small difference.
Worth remembering, though, that this is more of a science/dev Titan than a gaming/visual card; a lot of die space is taken up by FP64 and some by the tensor cores.
While not a great indicator for gaming, it's worth noting the Titan V has 13.8 TFLOPS FP32 while the Titan Xp had 12.1 TFLOPS FP32, with both having a similar GPC/PolyMorph count; the performance difference would be much greater across the board if it were not such a huge mixed-precision GPU.
I would need to find the documentation, but I thought the higher SM count of these HPC models is not that efficient for gaming workloads; it basically doubles the SMs per GPC and TPC, or, put another way, uses 64 CUDA cores per SM rather than the 128 found in all other recent Nvidia GPUs.
 
It's 14.8 TFLOPS for the Titan V. And that number doesn't even hold when the chip is running close to 2.0 GHz, which effectively makes it a 20 TFLOPS GPU.
You can say the same about the Titan Xp when it comes to spec clocks and OCing.
You are probably right, but some reports have it at 13.8 TFLOPS, such as AnandTech's, *shrug*. Also, to break past 14 TFLOPS by a fair margin it may need to go up to the 300 W spec of the NVLink model rather than the 250 W spec it is sold at; possible, but I assume that means changing power settings.
Anyway, the point is it is not massively ahead of the Titan Xp (21.5% more relevant CUDA cores on the Titan V), which is reflected in some of those scores for the reason I mentioned; the larger gaps can possibly make better use of some of the architecture changes, but, as I mentioned, the doubled number of SMs per GPC/TPC may be detrimental to gaming-type workloads.

It is a really great card for the price, don't get me wrong, but it is not necessarily ideal for most prosumers.
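
As a sanity check on the FP32 numbers being thrown around: peak FP32 is just 2 FLOPs (one FMA) per CUDA core per clock. A small sketch using the published core counts and boost clocks (5120 cores / 1455 MHz for Titan V, 3840 / 1582 MHz for Titan Xp), which lands between the 13.8 and 14.8 figures quoted in this exchange:

```python
def fp32_tflops(cuda_cores: int, clock_ghz: float) -> float:
    # Peak FP32 = 2 FLOPs (one fused multiply-add) per core per cycle.
    return 2 * cuda_cores * clock_ghz / 1000.0

print(fp32_tflops(3840, 1.582))  # Titan Xp at boost -> ~12.1
print(fp32_tflops(5120, 1.455))  # Titan V at boost  -> ~14.9
print(fp32_tflops(5120, 2.0))    # Titan V OC'ed     -> ~20.5
```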
 