Nvidia Volta Speculation Thread

Just for clarity.
SXM2 32GB V100 is still 300W, shown as such in the parts list and also on sites such as AnandTech.
That won't be a static power consumption though; it's just a matter of what power profile Nvidia configures. What I'm suggesting is that the IO power consumption is pushed onto a separate chip in some configurations. Increasing the distance between chips, for example, will increase the energy expended, leaving the switch chip to absorb most of the work of driving IO over the longer distance and capacitance, effectively acting as an inline buffer/repeater. In the presence of a switch, the power usage of the GPU may drop, making the 300W figure somewhat arbitrary.
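To put rough numbers on the trace-length argument: driving a line costs dynamic power of roughly P ≈ α·C·V²·f, so more routing capacitance directly means more watts at the transmitter. A quick sketch with purely illustrative values (none of these are NVLink specs):

```python
# Dynamic power to drive a line scales with its capacitance: P ~ alpha*C*V^2*f.
# All values below are illustrative assumptions, not NVLink specifications.

V = 0.8        # signalling voltage swing, volts (assumed)
F = 25e9       # bit rate per lane, treated as the toggle frequency (assumed)
ALPHA = 0.5    # activity factor: fraction of bit times that toggle the line

for label, cap_pf in [("short hop, GPU direct to GPU", 2.0),
                      ("long hop, GPU via switch to GPU", 6.0)]:
    c = cap_pf * 1e-12                 # trace capacitance in farads (assumed)
    power = ALPHA * c * V * V * F      # watts per lane at the driver
    print(f"{label}: {power * 1e3:.1f} mW per lane")
```

An inline switch splits each long hop into two short ones, so each driver only has to charge its own short segment, which is the buffer/repeater effect described above.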
 
That won't be a static power consumption though; it's just a matter of what power profile Nvidia configures. What I'm suggesting is that the IO power consumption is pushed onto a separate chip in some configurations. Increasing the distance between chips, for example, will increase the energy expended, leaving the switch chip to absorb most of the work of driving IO over the longer distance and capacitance, effectively acting as an inline buffer/repeater. In the presence of a switch, the power usage of the GPU may drop, making the 300W figure somewhat arbitrary.
Not sure I follow, or how you feel this affects SXM2 (this is the non-NVSwitch model), although I agree they subtly change the envelope, which is why TDP does not usually increase with memory capacity; I mentioned this in response to another poster earlier.
What you're quoting from me is the TDP/TBP that Nvidia always uses for the actual GPU board or actual SXM card; it must include all aspects of the board (VRM stages, GPU, memory, IO), and recently it always has, given how Nvidia reports its TDP/TBP.
The power behaviour will not be that different between a 6-brick hybrid mesh and a 6-brick-to-NVSwitch topology, and yes, the NVSwitch will have a higher power demand.
But it will not reduce the accelerator's requirements or physical characteristics to a point of notable significance; some SXM3 modules will be very close to an NVSwitch, some further away, and it is made even more complicated in that the bricks of every dGPU are aggregated across various NVSwitches.
The point is that this would not be possible if the variance were that great from an EE physical-transmission perspective.
The NVSwitch is completely separate in the context of TDP, with its own rating.

Remember the 300W is the SXM2 and not the NVSwitch configuration (that is SXM3). That said, Nvidia quotes TDP at the maximum-spec boost configuration, and yes, it is difficult to get completely accurate due to the very dynamic nature of the power-thermal-performance/clock management and IO, but usually Nvidia is pretty good with their figures.
However, it is all relative between the Nvidia GPUs/accelerators, as one can take their methodology and apply it to all their modern dGPUs and accelerators (whether SXM2, SXM3, or PCIe).

The 300W (or 250W for the PCIe models) TDP is also complicated by how one measures it (the more sensitive the interval, say below 10ms, the more variance you will see) and by what workload (FP32/FP64 compute, rendering/gaming, etc.) one uses to measure it, but that is a different topic, covered in depth in the past discussions about the TDP-TBP of the 480 and 960.
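If anyone wants to see the interval sensitivity for themselves, a polling sketch along these lines works; the nvidia-smi query flags are standard, but note the subprocess launch overhead means truly sub-10ms sampling would need the NVML bindings (pynvml) instead:

```python
# Poll GPU board power at two different intervals and compare the spread.
# Finer intervals expose transients that a coarse average smooths away.
import subprocess, time, statistics

def sample_power(interval_s, n):
    readings = []
    for _ in range(n):
        out = subprocess.check_output(
            ["nvidia-smi", "--query-gpu=power.draw",
             "--format=csv,noheader,nounits"])
        readings.append(float(out.decode().splitlines()[0]))  # first GPU
        time.sleep(interval_s)
    return readings

for interval in (1.0, 0.01):   # 1s vs (nominally) 10ms sampling
    r = sample_power(interval, 100)
    print(f"{interval * 1000:.0f} ms interval: "
          f"mean {statistics.mean(r):.1f} W, stdev {statistics.pstdev(r):.2f} W")
```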
 
Google’s Cloud TPU Matches Volta in Machine Learning at Much Lower Prices
New benchmarks from RiseML pit Nvidia's Volta and Google's TPU head-to-head, and the cost curve strongly favors Google.
...

The comparison is between four Google TPUv2 chips (which form one Cloud TPU) and 4x Nvidia Volta GPUs. Both have 64GB of total RAM, and the data sets were trained in the same fashion. RiseML tested the ResNet-50 model (exact configuration details are available in the blog post), and the team investigated raw performance (throughput), accuracy, and convergence (an algorithm converges when its output comes closer and closer to a specific value).
[Chart: ResNet-50 throughput, Cloud TPU vs V100]

The suggested batch size for TPUs is 1024, but other batch sizes were tested at reader request; Nvidia does perform better at those lower batch sizes. In accuracy and convergence, the TPU solution is somewhat better (76.4 percent top-1 accuracy for the Cloud TPU, compared with 75.7 percent for Volta). Improvements to top-end accuracy are difficult to come by, and the RiseML team argues that the small difference between the two solutions is more important than you might think. But where Google's Cloud TPU really wins, at least right now, is on pricing.

[Chart: ResNet-50 cost to solution, Cloud TPU vs V100]

...
the current pricing of the Cloud TPU allows you to train a model to 75.7 percent on ImageNet from scratch for $55 in less than 9 hours! Training to convergence at 76.4 percent costs $73. While the V100s perform similarly fast, the higher price and slower convergence of the implementation result in a considerably higher cost-to-solution.

Google may be subsidizing its cloud processor pricing, and the exact performance characteristics of ML chips will vary depending on implementation and programmer skill. This is far from the final word on Volta’s performance, or even Volta as compared with Google’s Cloud TPU. But at least for now, in ResNet-50, Google’s cloud TPU appears to offer nearly identical performance at substantially lower prices.
https://www.extremetech.com/extreme...ance-in-machine-learning-at-much-lower-prices

Article at HPC Wire:
https://www.hpcwire.com/2018/04/30/riseml-benchmarks-google-tpuv2-against-nvidia-v100-gpu/
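For what it's worth, the cost-to-solution arithmetic behind the quoted figures is easy to reproduce: cost = hours-to-accuracy x hourly rate. A quick sketch; the hours come from the article, while the hourly rates are approximate on-demand prices from around that time and should be treated as assumptions:

```python
# cost-to-solution = training hours to reach target accuracy * $/hour
# Hours are from the RiseML article; hourly rates are assumed/approximate.
runs = {
    "Cloud TPU (4x TPUv2) to 75.7%":     (8.5, 6.50),   # ~$6.50/hr beta pricing
    "4x V100 (AWS p3.8xlarge) to 75.7%": (9.0, 12.24),  # ~$12.24/hr on-demand
}
for name, (hours, rate) in runs.items():
    print(f"{name}: ~${hours * rate:.0f}")
# -> roughly $55 for the TPU run, matching the article's figure.
```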
 
Kinda unfortunate they did their article just before Google offered V100s on GCP; although possibly they were restricted due to their location.
Would be interesting to compare the pricing structure between GCP and AWS for V100s though.
 
That's interesting, but the conclusion hardly seems final, since the V100 costs over $8000,* and could conceivably be priced much lower than that while remaining comfortably profitable. Of course, NVIDIA probably wouldn't want to take it anywhere near Titan-level prices, but $3000 doesn't sound impossible.

*No way!
 
That's interesting, but the conclusion hardly seems final, since the V100 costs over $8000,* and could conceivably be priced much lower than that while remaining comfortably profitable. Of course, NVIDIA probably wouldn't want to take it anywhere near Titan-level prices, but $3000 doesn't sound impossible.

*No way!
The AWS price has also been at that level for quite a while now without dropping, which is why it would be interesting to try to get a comparable price comparison with GCP now that it is available for the V100 (or very soon will be more generally).
Also, it would be nice if they investigated their issues with MXNet; other ML/DL companies seem to have used it fine with multiple V100s on AWS for image classification (see the Gluon sketch below for the standard multi-GPU pattern), although one should not necessarily expect a radically greater performance boost.
An important missing aspect is scaling beyond 4 GPUs, as 8-GPU is also an option (it would be nice to see comparable test pricing against a similarly scaled TPUv2 setup); but then that is more about time-to-solution than cost efficiency, since I think it usually works out slightly more expensive on a cost basis (although that comes back to the pricing structure and the best way to use such a service).
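For reference, the usual multi-GPU data-parallel pattern in MXNet/Gluon looks something like the sketch below, so it should be workable on 4x V100; train_loader, the batch size, and the hyperparameters are placeholders:

```python
# Data-parallel ResNet-50 across 4 GPUs with MXNet Gluon: split each batch
# across devices, run forward/backward per slice, then take one optimizer step.
import mxnet as mx
from mxnet import gluon, autograd

ctx = [mx.gpu(i) for i in range(4)]
net = gluon.model_zoo.vision.resnet50_v1(classes=1000)
net.initialize(mx.init.Xavier(), ctx=ctx)
loss_fn = gluon.loss.SoftmaxCrossEntropyLoss()
trainer = gluon.Trainer(net.collect_params(), "sgd", {"learning_rate": 0.1})

batch_size = 256                      # placeholder global batch size
for data, label in train_loader:      # train_loader: any iterable of batches
    xs = gluon.utils.split_and_load(data, ctx)
    ys = gluon.utils.split_and_load(label, ctx)
    with autograd.record():
        losses = [loss_fn(net(x), y) for x, y in zip(xs, ys)]
    for l in losses:                  # backward on each device's slice
        l.backward()
    trainer.step(batch_size)          # gradients aggregate across devices
```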
 
I think Alexko's point was that, with competition, BOTH Tesla and Titan prices have room to move. A lot!

I actually didn't realize that the latest Titan was so crazily expensive, but yes, that was the idea, thanks! NVIDIA can afford to be much more aggressive on pricing, if need be.
 
Although the price (in the context of the manufacturer's segment/tech perspective) also needs to take into consideration that the V100 is a mixed-precision accelerator useful for multiple functions, not just FP16 DL like the TPUv2 (which has no FP32 capability, let alone FP64).
There are also techniques available to improve accuracy for FP16 training specifically on the V100, which can be important if moving away from FP32.
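For context, the usual recipe is FP32 master weights plus loss scaling, per Nvidia's mixed-precision training guidance, so small gradients survive the FP16 format. A rough PyTorch sketch of the idea, using a toy layer and a static loss scale (not V100-specific code):

```python
# One mixed-precision step: FP16 forward/backward, FP32 master-weight update.
import torch

model = torch.nn.Linear(1024, 1024).cuda().half()        # FP16 working copy
master = [p.detach().clone().float() for p in model.parameters()]
for m in master:
    m.requires_grad = True                               # FP32 master weights
opt = torch.optim.SGD(master, lr=0.01)
scale = 1024.0                                           # static loss scale

x = torch.randn(32, 1024, device="cuda", dtype=torch.float16)
loss = model(x).float().pow(2).mean()                    # toy loss, in FP32
(loss * scale).backward()                                # scaled FP16 gradients

for p, m in zip(model.parameters(), master):
    m.grad = p.grad.float() / scale                      # unscale into FP32
opt.step()                                               # update master weights
with torch.no_grad():
    for p, m in zip(model.parameters(), master):
        p.copy_(m.half())                                # refresh the FP16 copy
model.zero_grad()
```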

If these dedicated DL accelerators start to dominate in the cloud, then Nvidia could still launch the 150W half-length V100 card designed more for this purpose; also worth noting that all new V100s are 32GB HBM2.
But Google is starting to offer the V100 as part of GCP now as well; it's just a shame the article was a bit early and could not compare both options from Google, as GCP is cheaper than AWS, albeit still not as cheap as the TPUv2.
 
Did anyone try the double-precision shootout I mentioned above?

My poor Titan X Pascal is half the speed of my 8-core CPU.
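If anyone wants to run the comparison, a quick double-precision GEMM timing is easy to script; a sketch using numpy for the CPU side and cupy for the GPU side (the matrix size is arbitrary, and note consumer Pascal parts run FP64 at 1/32 of their FP32 rate, so the result shouldn't surprise):

```python
# Time an FP64 matrix multiply on CPU (numpy/BLAS) vs GPU (cupy/cuBLAS).
import time
import numpy as np
import cupy as cp

N = 4096
a = np.random.rand(N, N)               # numpy defaults to float64
b = np.random.rand(N, N)

t0 = time.perf_counter()
np.dot(a, b)
cpu_s = time.perf_counter() - t0

ga, gb = cp.asarray(a), cp.asarray(b)
cp.dot(ga, gb)                         # warm-up: allocation, handle creation
cp.cuda.Stream.null.synchronize()
t0 = time.perf_counter()
cp.dot(ga, gb)
cp.cuda.Stream.null.synchronize()      # wait for the async GPU kernel
gpu_s = time.perf_counter() - t0

flops = 2 * N ** 3                     # multiply-adds in an NxN GEMM
print(f"CPU FP64: {flops / cpu_s / 1e9:.1f} GFLOPS")
print(f"GPU FP64: {flops / gpu_s / 1e9:.1f} GFLOPS")
```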
 