Nvidia Volta Speculation Thread

The highlight of this review is that, for the first time, we get some real numbers on the Tensor cores' performance. And these things rock!!!

[Attached chart: Tensor core performance results from the review]
 
Why not make a specialized chip with just those tensors? Google's approach makes more sense.
There is the half-length V100 at 150W, which is the closest one would get to a dedicated Tensor GPU IMO; the design is still integral to Nvidia's core CUDA GPC/SM compute- and instruction-level model.
 
Not all workloads are tensor workloads. A big driver for Volta was supercomputer deals requiring more generic compute than tensor operations.
 
Very likely not in the traditional sense. On the other hand, could there be new types of algorithms using neural networks to enhance graphics?
Nvidia did a presentation showing how it can be used professionally in that context. One example involved improving the rendering quality of a car image (I cannot remember much about it, as it was a little while ago, and I am not sure whether it was just anti-aliasing, but I thought it involved more). They also talk a lot about AI rendering in general:
https://blogs.nvidia.com/blog/2017/07/31/nvidia-research-brings-ai-to-computer-graphics/
 
There is the half-length V100 at 150W, which is the closest one would get to a dedicated Tensor GPU IMO; the design is still integral to Nvidia's core CUDA GPC/SM compute- and instruction-level model.

I mean, even if it's nice for a first release, there will soon be a faster, more power-efficient and much cheaper product. There's so much cruft there.

Maybe they can, or will, too. But I'm sure an "all-around" card is needed and wanted as well.

Plus, I guess they can get more feedback this way.

Not all workloads are tensor workloads. A big driver for Volta was supercomputer deals requiring more generic compute than tensor operations.

A dedicated product would make denser, cheaper nodes possible.

I guess they are. On the new Xavier SoC, they have a hardware block dedicated to "machine learning" called the NVIDIA Deep Learning Accelerator (NVDLA).
They have also open-sourced this implementation.

http://nvdla.org/

But that's for another market; we can't put those in the datacenter.
 
I mean, even if it's nice for a first release, there will soon be a faster, more power-efficient and much cheaper product. There's so much cruft there.
You do not see the appeal of a densely populated 150W solution from Nvidia competing against other products in the next 12 months?
Which current product do you see matching this?

Sure, it will be superseded by the next generation from Nvidia, but that is true of every generation.

In terms of profit margin, costs/logistics and R&D, it probably makes more sense for Nvidia to continue with half-length 150W GPUs going forward that target the DL side, for clients who do not require the full hybrid mixed-precision implementation; though there is a large HPC/science market that requires the full hybrid, especially as AI/DL matures.
There is also the Tegra solution, and who knows what will happen down the line with ARM tech as a server solution, since Nvidia has never given up that HPC research.
 
These supercomputers would not have happened if Volta weren't Volta:


https://blogs.nvidia.com/blog/2017/...uters-to-supercharge-ai-scientific-discovery/

Amazon & co. would have a hard time selling GPU cloud if Volta were tensor-only. FP64/HPC performance matters. I guess the misconception is people thinking of Volta as a DNN-only chip, which it isn't.

There really isn't tensor-optimized software yet. Even Google isn't yet providing TPUs to anybody outside Google. What is the publicly available and popular tensor-only accelerator Volta competes with today? A year from now the market could be different, but Volta is out today.
 
I was about to comment on how enjoyable it is to read a well-written piece of journalism. And then I stumbled into a repeat of this thing:
The same story, but it sounded more like a Crysis meme to me, like... this new GPU is cool, but does it contain hardware scheduling? :D
 
Ryan (or anyone else from Anand) - how did you set up/implement the GEMM tests? I'm guessing it's cuBLAS multiplying two matrices which are *both* very large? I agree memory bandwidth is a key question for efficiency here - I'm thinking the way the tensor cores are used for cuDNN might have different characteristics in terms of "external bandwidth required per amount of computation" (depending on what they're doing). It might also be interesting to try downclocking core and/or memory separately and see what happens.
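For what it's worth, here is a minimal sketch of the kind of cuBLAS test I'd assume that is: a large FP16 GEMM with FP32 accumulation and tensor-op math enabled, timed with CUDA events. The matrix size (8192), iteration count, and the choice of cublasGemmEx are my assumptions, not necessarily what the article actually ran; error checking is omitted for brevity.

```cpp
// Hedged sketch of a tensor-core GEMM throughput test (my assumptions, not AnandTech's code).
#include <cstdio>
#include <cublas_v2.h>
#include <cuda_fp16.h>
#include <cuda_runtime.h>

int main() {
    const int n = 8192;                                  // assumed size: large enough to be compute-bound
    const double flop = 2.0 * n * double(n) * n;         // one multiply-add counted as 2 FLOPs

    __half *A, *B; float *C;
    cudaMalloc(&A, size_t(n) * n * sizeof(__half));
    cudaMalloc(&B, size_t(n) * n * sizeof(__half));
    cudaMalloc(&C, size_t(n) * n * sizeof(float));
    cudaMemset(A, 0, size_t(n) * n * sizeof(__half));    // contents don't matter for timing
    cudaMemset(B, 0, size_t(n) * n * sizeof(__half));

    cublasHandle_t handle;
    cublasCreate(&handle);
    cublasSetMathMode(handle, CUBLAS_TENSOR_OP_MATH);    // allow the FP16 tensor-core path

    const float alpha = 1.0f, beta = 0.0f;
    const int iters = 10;
    cudaEvent_t start, stop;
    cudaEventCreate(&start); cudaEventCreate(&stop);

    // Warm-up launch so one-time setup isn't timed.
    cublasGemmEx(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n, &alpha,
                 A, CUDA_R_16F, n, B, CUDA_R_16F, n, &beta,
                 C, CUDA_R_32F, n, CUDA_R_32F, CUBLAS_GEMM_DEFAULT_TENSOR_OP);

    cudaEventRecord(start);
    for (int i = 0; i < iters; ++i)
        cublasGemmEx(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n, &alpha,
                     A, CUDA_R_16F, n, B, CUDA_R_16F, n, &beta,
                     C, CUDA_R_32F, n, CUDA_R_32F, CUBLAS_GEMM_DEFAULT_TENSOR_OP);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("%.1f TFLOPS\n", iters * flop / (ms * 1e-3) / 1e12);

    cublasDestroy(handle);
    return 0;
}
```

Comparing whatever this prints against Nvidia's quoted 110 Tensor TFLOPS peak for Titan V is exactly where the memory-bandwidth question above comes in.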

BTW on the Beyond3D "Estimated FMA latency" test - it doesn't really make sense for GCN to be 4.5 cycles :( There are possible HW explanations for non-integer latencies but they're not very likely. The test inherently has some overhead (which can be amortised by trading off how long it runs in various ways) so maybe it's just higher on GCN for some reason which makes it "look" like 4.5 cycles when it's really 4 cycles; I'm not quite sure.
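For anyone curious about the shape of such a latency test, here is a rough sketch under my own assumptions (I don't have the actual Beyond3D suite source): one thread runs a long chain of FMAs where each result feeds the next, the chain is bracketed with clock64(), and latency is estimated as total cycles divided by chain length. The chain length and unroll factor are arbitrary choices that trade off exactly the loop overhead mentioned above.

```cpp
// Hedged sketch of a dependent-FMA latency microbenchmark (not the actual Beyond3D test).
#include <cstdio>
#include <cuda_runtime.h>

__global__ void fma_latency(float *out, long long *cycles, float a, float b) {
    const int chain = 1 << 20;          // length of the dependent chain (arbitrary)
    float x = a;
    long long t0 = clock64();
    #pragma unroll 128                  // amortise loop/branch overhead over many FMAs
    for (int i = 0; i < chain; ++i)
        x = fmaf(x, a, b);              // each FMA consumes the previous result
    long long t1 = clock64();
    *out = x;                           // keep the chain live so it isn't optimised away
    *cycles = t1 - t0;
}

int main() {
    float *out; long long *cycles;
    cudaMalloc(&out, sizeof(float));
    cudaMalloc(&cycles, sizeof(long long));

    // One thread of one warp: the dependent chain is bound by latency, not throughput.
    fma_latency<<<1, 1>>>(out, cycles, 1.0000001f, 1e-7f);
    cudaDeviceSynchronize();

    long long c = 0;
    cudaMemcpy(&c, cycles, sizeof(c), cudaMemcpyDeviceToHost);
    printf("~%.2f cycles per dependent FMA (still includes residual loop overhead)\n",
           double(c) / double(1 << 20));
    return 0;
}
```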

I always thought it'd be interesting to get power consumption numbers when running that test btw (it would probably have to be changed to run in a loop) - it's effectively using the GPU as little as theoretically possible, while still keeping it active non-stop (1 lane of 1 warp/wave). So in a way it's the smallest possible step up from "idle", and it shows the minimum power when you're not allowed to just power gate (or shut down) everything!
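A sketch of that power experiment, again under my own assumptions about durations: keep exactly one lane of one warp busy with a serial FMA chain, relaunch it from the host in a loop so no single kernel runs long enough to trip a display watchdog, and sample board power from another shell (e.g. nvidia-smi --query-gpu=power.draw --format=csv -l 1) while it runs.

```cpp
// Hedged sketch of the "one lane of one warp, just above idle" power test described above.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void busy_one_lane(float *out, long long iters) {
    float x = 1.0f;
    for (long long i = 0; i < iters; ++i)
        x = fmaf(x, 1.0000001f, 1e-7f); // serial dependent chain: minimal work, never idle
    *out = x;                           // prevent the loop from being optimised away
}

int main() {
    float *out;
    cudaMalloc(&out, sizeof(float));

    // <<<1, 1>>> = one block with one thread, i.e. one lane of one warp.
    // Each launch is roughly a second on Volta-class clocks (rough estimate), and the
    // host loop keeps the GPU busy for about a minute while power is sampled externally.
    for (int rep = 0; rep < 60; ++rep) {
        busy_one_lane<<<1, 1>>>(out, 1LL << 28);
        cudaDeviceSynchronize();
    }
    printf("done\n");
    return 0;
}
```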

I'm getting my Titan V in early/mid January - I'll definitely write some microbenchmarks to test a few things I'm curious about, thinking of maybe writing articles describing the deep learning HW landscape too, we'll see...

(P.S.: Agreed with silent_guy, every instance of "scheduling hardware" should really be replaced by "dependency tracking hardware"!)

EDIT: And needless to say, thanks for the really nice article with original analysis and tests - happy to see you guys spending the time to do that! :)
 