What's TSMC's 12FFN?
Dedicated Tensor cores apparently do FP16 MUL + FP32 ADD (each FMA counting as two ops) at a very high rate, exclusively for 4x4 matrix processing(?), hence the 120 mixed-precision TFLOPs.
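To make the "FP16 MUL + FP32 ADD" idea concrete, here is a minimal numpy sketch of that mixed-precision matrix op (D = A*B + C with FP16 inputs and an FP32 accumulator). This is only an arithmetic illustration of the precision behaviour, not the actual hardware datapath:

```python
import numpy as np

# Hypothetical sketch of one Tensor-core-style op: D = A*B + C,
# with A and B as FP16 4x4 matrices and accumulation done in FP32.
rng = np.random.default_rng(0)
A = rng.standard_normal((4, 4)).astype(np.float16)
B = rng.standard_normal((4, 4)).astype(np.float16)
C = rng.standard_normal((4, 4)).astype(np.float32)

# Promote the FP16 operands to FP32 before multiply-add, mimicking
# "multiply in FP16 precision, accumulate in FP32".
D = A.astype(np.float32) @ B.astype(np.float32) + C
print(D.dtype, D.shape)
```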
900GB/s of HBM2 bandwidth means it's using chips running at 1.8Gbps per pin, up from the ~1.4Gbps in P100.
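The bandwidth figure falls straight out of the per-pin rate; a quick back-of-envelope check, assuming the same 4 HBM2 stacks with 1024-bit interfaces each as on P100:

```python
# HBM2 bandwidth estimate: stacks * bus width (bits) * per-pin rate (Gbps) / 8
stacks, bus_bits, pin_gbps = 4, 1024, 1.8
bandwidth_gbs = stacks * bus_bits * pin_gbps / 8
print(bandwidth_gbs)  # 921.6 GB/s raw, marketed as ~900 GB/s
```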
P.S.: Obligatory bullshit moment: showing a trailer for the 1-year-old Kingsglaive CG movie as "what Takeshi Nozue thinks games will look like in the future". Yeah, ok.
P.S.2: The 5120 ALUs are apparently running at ~1.47GHz to give those 15 FP32 TFLOPs.
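That figure checks out if you count each FMA as two operations:

```python
# Peak FP32 throughput: ALUs * clock (GHz) * 2 ops per FMA, in TFLOPs
alus, clock_ghz = 5120, 1.47
tflops = alus * clock_ghz * 2 / 1000
print(round(tflops, 2))  # ~15.05 TFLOPs
```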
Quite the achievement for a >800mm^2 chip.
Yeah.
More impressive, though, is staying within 300W while keeping full FP64 and also expanding NVLink 2 performance; that is where a lot of the power demand/TDP comes from (FP64 especially, but the NVLink mezzanine form factor is pretty demanding too).
It is a very interesting design and impressive on the specs too; I mentioned to someone else a while ago that it is a bit like the Kepler->Maxwell jump repeated, this time as Pascal->Volta.
They increased the die by 33.6% while impressively keeping the same 300W, and yet went further:
FP32 compute increases by 41.5% (or up to 2x, depending upon function, with Tensor cores).
FP64 compute increases by 41.5%.
FP16 compute increases by 41.5% (or up to 4x, depending upon function, with Tensor cores).
So they squeezed an extra ~43% CUDA cores into that 33.6% die increase, and importantly with additional functions/units.
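For reference, those percentages against GP100's published numbers (die 610mm² -> 815mm², FP32 10.6 -> 15 TFLOPs, FP64 5.3 -> 7.5, FP16 21.2 -> 30, CUDA cores 3584 -> 5120):

```python
def pct(old, new):
    """Percent increase from old to new, rounded to one decimal."""
    return round((new / old - 1) * 100, 1)

print(pct(610, 815))    # die area:   ~33.6%
print(pct(10.6, 15.0))  # FP32:       ~41.5%
print(pct(5.3, 7.5))    # FP64:       ~41.5%
print(pct(21.2, 30.0))  # FP16:       ~41.5%
print(pct(3584, 5120))  # CUDA cores: ~42.9%
```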
And other important aspects such as heavily revised thread scheduling and cache performance behaviour; the specific devblog sections are:
"Independent Thread Scheduling", and for the L0/L1 caches both "Volta SM (Streaming Multiprocessor)" and "Enhanced L1 Data Cache and Shared Memory".
https://devblogs.nvidia.com/parallelforall/inside-volta/
More of a monster than I expected TBH, but it fits with what was being said quite a while ago about how it is another jump from Pascal with architecture changes (and also, critically, efficiency, looking at those specs).
It will be interesting to see how GV100 pans out as a Quadro in the 2nd half of next year; a shame no-one has yet tested the Quadro GP100 with its dual NVLink to see how well it works with the certain professional applications whose devs Nvidia works closely with for Quadros.
Cheers
Edit:
Sorry Graham, I did not read your post before posting, so I see you also referenced the additional info on the devblog.
But I think you will find a version of the Tensor cores on certain other CUDA/Volta GPUs.
Also forgot to say: NVLink 2, as thought, increases the number of links supported from 4 to 6, now at 50GB/s each rather than 40GB/s.
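In aggregate terms, using the link counts and per-link rates above:

```python
# Aggregate NVLink bandwidth per GPU: links * GB/s per link
p100_nvlink = 4 * 40  # NVLink 1 on P100
v100_nvlink = 6 * 50  # NVLink 2 on V100
print(p100_nvlink, v100_nvlink)  # 160 vs 300 GB/s
```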
Edit2:
Was tired; just corrected the Tensor specifics on proof-read.