Ryan (or anyone else from Anand) - how did you set up/implement the GEMM tests? I'm guessing it's cuBLAS multiplying two matrices which are *both* very large? I agree memory bandwidth is a key question for efficiency here - I'm thinking the way the tensor cores are used by cuDNN might have different characteristics in terms of "external bandwidth required per amount of computation" (depending on what they're doing). Might also be interesting to try downclocking core and/or memory separately and see what happens.
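For reference, something like this is roughly what I'd imagine the test looks like - just a sketch on my part (matrix size, iteration count, and plain FP32 cublasSgemm are all my assumptions, not necessarily what Anand actually ran):

[code]
// Hypothetical cuBLAS GEMM throughput test - a minimal sketch, not AnandTech's actual code.
#include <cstdio>
#include <cublas_v2.h>
#include <cuda_runtime.h>

int main() {
    const int N = 8192;                     // two *large* square matrices
    const int iters = 50;
    const float alpha = 1.0f, beta = 0.0f;

    float *dA, *dB, *dC;
    cudaMalloc((void**)&dA, sizeof(float) * N * N);
    cudaMalloc((void**)&dB, sizeof(float) * N * N);
    cudaMalloc((void**)&dC, sizeof(float) * N * N);
    cudaMemset(dA, 0, sizeof(float) * N * N);   // contents don't matter for throughput
    cudaMemset(dB, 0, sizeof(float) * N * N);

    cublasHandle_t handle;
    cublasCreate(&handle);
    // For the tensor cores one would use cublasGemmEx with FP16 inputs
    // and CUBLAS_GEMM_DEFAULT_TENSOR_OP instead of plain SGEMM.

    // Warm-up launch so clocks ramp up before we start timing.
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, N, N, N,
                &alpha, dA, N, dB, N, &beta, dC, N);
    cudaDeviceSynchronize();

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    for (int i = 0; i < iters; ++i)
        cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, N, N, N,
                    &alpha, dA, N, dB, N, &beta, dC, N);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    double flops = 2.0 * N * N * N * iters;     // 2 FLOPs per multiply-add
    printf("%.1f GFLOP/s\n", flops / (ms * 1e-3) / 1e9);

    cublasDestroy(handle);
    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    return 0;
}
[/code]

With N that large the arithmetic intensity should be high enough to be compute-bound rather than bandwidth-bound, which is exactly why the separate core/memory downclocking experiment would be so telling.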
BTW, on the Beyond3D "Estimated FMA latency" test - it doesn't really make sense for GCN to be 4.5 cycles.
There are possible HW explanations for non-integer latencies, but they're not very likely. The test inherently has some overhead (which can be amortised by trading off how long it runs in various ways) - presumably it times a long chain of dependent FMAs and divides by the chain length, so any fixed setup/timer cost inflates the per-instruction estimate. Maybe that overhead is just higher on GCN for some reason, which makes it "look" like 4.5 cycles when it's really 4 cycles; I'm not quite sure.
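For context, the standard way to measure this is a dependent FMA chain timed with the on-chip clock - I'm guessing that's roughly what the Beyond3D tool does, though this CUDA version is just my sketch of the approach (the GCN equivalent would need an OpenCL harness, but the principle is the same):

[code]
// Dependent-chain FMA latency sketch - my guess at the approach, not Beyond3D's actual code.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void fma_latency(float *out, long long *cycles, int iters) {
    float a = 1.0f, b = 1.000001f, c = 0.0f;
    long long t0 = clock64();
    // Every FMA consumes the previous result, so issue rate = 1/latency.
    #pragma unroll 64
    for (int i = 0; i < iters; ++i)
        c = fmaf(a, b, c);
    long long t1 = clock64();
    *out = c;                          // keep the chain from being optimised away
    *cycles = t1 - t0;
}

int main() {
    const int iters = 1 << 20;
    float *dOut;
    long long *dCycles;
    cudaMalloc((void**)&dOut, sizeof(float));
    cudaMalloc((void**)&dCycles, sizeof(long long));

    fma_latency<<<1, 1>>>(dOut, dCycles, iters);   // 1 lane of 1 warp
    cudaDeviceSynchronize();

    long long cycles = 0;
    cudaMemcpy(&cycles, dCycles, sizeof(cycles), cudaMemcpyDeviceToHost);
    // Loop/branch overhead and the clock64() reads are folded into the
    // total, which is exactly how a 4-cycle pipe can "look" like 4.5.
    printf("%.2f cycles/FMA\n", (double)cycles / iters);

    cudaFree(dOut);
    cudaFree(dCycles);
    return 0;
}
[/code]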
I always thought it'd be interesting to get power consumption numbers when running that test btw (it would probably have to be changed to run in a loop) - it's effectively using the GPU as little as theoretically possible while still keeping it active non-stop (1 lane of 1 warp/wave). So in a way it's the smallest possible step up from "idle", and it shows what the minimum power is when you're not allowed to just power gate (or shut down) everything!
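A rough sketch of how that measurement could work on the NVIDIA side - the spin kernel and one-second sampling interval are arbitrary choices of mine, and on a GPU driving a display the watchdog would kill a kernel that spins this long:

[code]
// Sample board power via NVML while a single-lane FMA loop keeps the GPU
// busy - a sketch, not a polished tool. Link against -lnvidia-ml.
#include <chrono>
#include <cstdio>
#include <thread>
#include <cuda_runtime.h>
#include <nvml.h>

__global__ void spin_fma(volatile int *stop, float *out) {
    // 1 lane of 1 warp: the smallest possible non-idle load.
    float a = 1.0f, b = 1.000001f, c = 0.0f;
    while (!*stop)
        for (int i = 0; i < 65536; ++i)
            c = fmaf(a, b, c);
    *out = c;                          // keep the chain live
}

int main() {
    cudaSetDeviceFlags(cudaDeviceMapHost);

    // Mapped host memory lets the host flip the stop flag while the kernel
    // is still running (a plain cudaMemcpy on the default stream would
    // deadlock waiting for the kernel to finish).
    int *hStop, *dStop;
    cudaHostAlloc((void**)&hStop, sizeof(int), cudaHostAllocMapped);
    *hStop = 0;
    cudaHostGetDevicePointer((void**)&dStop, hStop, 0);

    float *dOut;
    cudaMalloc((void**)&dOut, sizeof(float));
    spin_fma<<<1, 1>>>(dStop, dOut);   // runs in the background

    nvmlInit();
    nvmlDevice_t dev;
    nvmlDeviceGetHandleByIndex(0, &dev);
    for (int i = 0; i < 30; ++i) {
        unsigned int mw = 0;
        nvmlDeviceGetPowerUsage(dev, &mw);   // board power, in milliwatts
        printf("%.3f W\n", mw / 1000.0);
        std::this_thread::sleep_for(std::chrono::seconds(1));
    }
    nvmlShutdown();

    *hStop = 1;                        // tell the kernel to exit
    cudaDeviceSynchronize();
    cudaFreeHost(hStop);
    cudaFree(dOut);
    return 0;
}
[/code]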
I'm getting my Titan V in early/mid January - I'll definitely write some microbenchmarks to test a few things I'm curious about. I'm also thinking of maybe writing some articles describing the deep learning HW landscape; we'll see...
(P.S.: Agreed with silent_guy, every instance of "scheduling hardware" should really be replaced by "dependency tracking hardware"!)
EDIT: And needless to say, thanks for the really nice article with original analysis and tests - happy to see you guys spending the time to do that!