Nvidia Volta Speculation Thread

NVIDIA also announced the DGX-1 with Volta, the DGX Station (4x V100), and the HGX-1 (8x V100 for cloud computing).

(Presentation slides from AnandTech.)

The DGX Station is 1500 W with 4 GPUs, so does that imply 375 W per GPU?

2017 Q3 for the DGX-1 Volta.

More like 300W per GPU and 300W for the rest of the system?
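
Both readings are consistent with the 1500 W figure; a quick sanity check (the 300 W "rest of the system" split is the poster's guess, not an official number):

```
#include <cstdio>

// Two ways to divide the DGX Station's 1500 W across 4 GPUs.
int main() {
    const int total_w = 1500, gpus = 4;
    printf("Even split: %d W / %d GPUs = %d W per GPU\n",
           total_w, gpus, total_w / gpus);               // 375 W
    printf("Alt split:  %d GPUs x 300 W + 300 W rest = %d W\n",
           gpus, gpus * 300 + 300);                      // 1500 W
    return 0;
}
```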
 
An 800mm² chip is clocked at 1.4-1.5GHz, so I think it's not out of the realm of reason to expect desktop chips at 2.0GHz stock. If Nvidia put out a gaming chip with the same CUDA core count as the behemoth, it should do 20 TFLOPS, about 70% more than the big gaming Pascal. Hopefully the architectural improvements add another 10-20%, and we'd have a pretty decent gaming card by next year's end.
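
Checking that speculation: peak FP32 is just cores x 2 FLOPs/clock (FMA) x clock. A quick sketch; the 12.15 TFLOPS Titan Xp figure is the usual published number, not from this thread:

```
#include <cstdio>

// Back-of-envelope peak FP32 for a hypothetical 2.0GHz full GV100-sized gaming chip.
int main() {
    const double cores = 5376;      // full GV100 CUDA core count
    const double clock_ghz = 2.0;   // speculated desktop clock
    const double titan_xp = 12.15;  // big gaming Pascal (GP102), TFLOPS
    const double tflops = cores * 2 * clock_ghz / 1000.0;
    printf("%.1f TFLOPS, +%.0f%% vs Titan Xp\n",
           tflops, (tflops / titan_xp - 1) * 100);  // ~21.5 TFLOPS, ~77%
    return 0;
}
```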
 
An 800mm² chip is clocked at 1.4-1.5GHz, so I think it's not out of the realm of reason to expect desktop chips at 2.0GHz stock.
They'll have to discard the Tensor cores and the DP units (like they usually do). Volta also has new scheduling hardware to handle all of these cores; we could see a reduction in that section as well.

A full GV100 GPU consists of six GPCs, 84 Volta SMs, 42 TPCs (each including two SMs), and eight 512-bit memory controllers (4096 bits total). Each SM has 64 FP32 Cores, 64 INT32 Cores, 32 FP64 Cores, and 8 new Tensor Cores. Each SM also includes four texture units.

With 84 SMs, a full GV100 GPU has a total of 5376 FP32 cores, 5376 INT32 cores, 2688 FP64 cores, 672 Tensor Cores, and 336 texture units. Each memory controller is attached to 768 KB of L2 cache, and each HBM2 DRAM stack is controlled by a pair of memory controllers. The full GV100 GPU includes a total of 6144 KB of L2 cache.
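
All of those totals follow directly from the per-SM and per-controller counts in the first paragraph; a quick derivation:

```
#include <cstdio>

// Deriving the GV100 totals from the per-unit counts quoted above.
int main() {
    const int sms = 84;
    printf("FP32 cores:    %d\n", sms * 64);   // 5376
    printf("INT32 cores:   %d\n", sms * 64);   // 5376
    printf("FP64 cores:    %d\n", sms * 32);   // 2688
    printf("Tensor cores:  %d\n", sms * 8);    // 672
    printf("Texture units: %d\n", sms * 4);    // 336

    const int mem_ctrls = 8;                   // 8 x 512-bit = 4096-bit bus
    printf("L2 cache:      %d KB\n", mem_ctrls * 768);  // 6144 KB
    return 0;
}
```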

Volta GV100 is the first GPU to support independent thread scheduling, which enables finer-grain synchronization and cooperation between parallel threads in a program. One of the major design goals for Volta was to reduce the effort required to get programs running on the GPU, and to enable greater flexibility in thread cooperation, leading to higher efficiency for fine-grained parallel algorithms.

https://devblogs.nvidia.com/parallelforall/inside-volta/?ncid=so-twi-vt-13918
 
The thread scheduling method now tracks thread context per work item and allows instructions belonging to both paths to issue, rather than the hardware running down one path until it reaches its end and then starting on the other. I'll have to take some time to digest the information. Nvidia seems to be describing one benefit of their solution as removing the deadlock threat of synchronization operations being split between diverged paths.
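
A concrete (hypothetical) illustration of that deadlock case: a spin lock where the acquire and the release land on diverged paths within one warp. The names below are mine, not Nvidia's:

```
// Sketch of the divergence hazard described above. Pre-Volta, a warp
// runs one divergent path to completion before starting the other, so
// the lanes spinning in the retry loop can starve the lane that holds
// the lock -- livelock. Volta's independent thread scheduling keeps a
// per-thread program counter and can interleave the two paths, so the
// holder eventually runs and releases.
__device__ int lock = 0;

__global__ void per_lane_critical_section(int *counter)
{
    bool done = false;
    while (!done) {
        if (atomicCAS(&lock, 0, 1) == 0) {  // one lane in the warp wins
            (*counter)++;                   // critical section
            atomicExch(&lock, 0);           // release for the other lanes
            done = true;
        }
        // Losing lanes fall through and retry; on pre-Volta hardware
        // this retry path can be scheduled indefinitely ahead of the
        // winner's release.
    }
}
```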
 
What's TSMC's 12FFN?

Dedicated Tensor cores apparently do 2*FP16 MUL + FP32 ADD at a very high rate (exclusively for 4x4 matrix processing?), hence the 120 mixed TFLOPs.
900GB/s HBM2 means it's using chips running at 1.8Gbps, up from the ~1.4Gbps in P100.
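
In other words, each Tensor Core computes D = A*B + C on 4x4 tiles, with FP16 multiplies feeding an FP32 accumulate. A scalar sketch of those semantics (tile sizes per the devblog; the function name is mine), with the 120 TFLOPs arithmetic in the comments:

```
#include <cuda_fp16.h>

// Scalar emulation of one Tensor Core op: D = A*B + C on 4x4 tiles,
// FP16 inputs multiplied, accumulated in FP32. That is 4*4*4 = 64 FMAs
// per op; counting each FMA as 2 ops, 640 enabled Tensor Cores (80 of
// the 84 SMs) * 128 ops/clock * ~1.47 GHz comes out to ~120 "mixed" TFLOPs.
__device__ void tensor_mma_4x4(const __half A[4][4], const __half B[4][4],
                               const float C[4][4], float D[4][4])
{
    for (int i = 0; i < 4; ++i)
        for (int j = 0; j < 4; ++j) {
            float acc = C[i][j];  // FP32 accumulator
            for (int k = 0; k < 4; ++k)
                acc += __half2float(A[i][k]) * __half2float(B[k][j]);
            D[i][j] = acc;
        }
}
```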


P.S.: Obligatory bullshit moment: showing a trailer for the 1-year-old Kingsglaive CG movie as "what Takeshi Nozue thinks games will look like in the future". Yeah ok.



P.S.2: The 5120 ALUs are apparently running at ~1.47GHz to give those 15 FP32 TFLOPs.
Quite the achievement for a >800mm² chip.
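
Backing the clock out of the headline number, using the same cores x 2 FLOPs/clock formula as earlier in the thread:

```
#include <cstdio>

// Implied clock = FLOPS / (ALUs * 2 FLOPs per clock).
int main() {
    const double tflops = 15.0, alus = 5120;
    printf("Implied clock: %.2f GHz\n",
           tflops * 1000.0 / (alus * 2));  // ~1.46 GHz, i.e. the ~1.47GHz above
    return 0;
}
```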
Yeah.
More impressive, though, is staying within 300W with the FP64 while also expanding NVLink 2 performance; that is where a lot of the power demand/TDP comes from (FP64 especially, but the NVLink mezzanine is pretty demanding too).

It is a very interesting design, and impressive spec-wise too. I mentioned to someone else a while ago that it is a bit like Kepler->Maxwell repeated, this time as Pascal->Volta.
They increase the die by 33.6% while impressively keeping to the same 300W, and yet they go further (quick arithmetic check below):
FP32 compute increases by 41.5%, or 2x depending upon function with Tensor.
FP64 compute increases by 41.5%.
FP16 compute increases by 41.5%, or 4x depending upon function with Tensor.
All while squeezing into that 33.6% die increase an extra 41% of CUDA cores, and importantly with additional functions/units.
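
Those percentages check out against the usual published Tesla P100 vs Tesla V100 figures (my numbers, not from this thread):

```
#include <cstdio>

// P100 -> V100 deltas from the commonly published specs.
int main() {
    const double die[2]  = {610, 815};    // mm^2
    const double fp32[2] = {10.6, 15.0};  // TFLOPS
    const double fp64[2] = {5.3, 7.5};    // TFLOPS
    const double fp16[2] = {21.2, 30.0};  // TFLOPS (120 via Tensor Cores)
    printf("Die:  +%.1f%%\n", (die[1]  / die[0]  - 1) * 100);  // +33.6%
    printf("FP32: +%.1f%%\n", (fp32[1] / fp32[0] - 1) * 100);  // +41.5%
    printf("FP64: +%.1f%%\n", (fp64[1] / fp64[0] - 1) * 100);  // +41.5%
    printf("FP16: +%.1f%%\n", (fp16[1] / fp16[0] - 1) * 100);  // +41.5%
    return 0;
}
```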

And there are other important aspects, such as the heavily revised thread scheduling and cache behaviour.
The specific devblog sections are "Independent Thread Scheduling" and, for the L0/L1 caches, "Volta SM (Streaming Multiprocessor)" and "Enhanced L1 Data Cache and Shared Memory":
https://devblogs.nvidia.com/parallelforall/inside-volta/

More of a monster than I expected TBH, but it fits with what was being said quite a while ago about how it is another jump from Pascal with arch changes (and also, critically, efficiency, looking at those specs).
It will be interesting to see how GV100 pans out as a Quadro in the second half of next year; shame no-one has yet tested the Quadro GP100 with the dual NVLink to see how well it works with the professional applications whose devs Nvidia works closely with for Quadros.
Cheers

Edit:
Sorry Graham, I did not read your post before posting, so I see you also referenced the additional info on the devblog.
But I think you will find a version of the Tensor cores on certain other CUDA/Volta GPUs.
Also, I forgot to say: NVLink 2, as thought, increases the number of links supported from 4 to 6, now at 50GB/s each rather than 40GB/s.
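
The aggregate-bandwidth arithmetic for that change:

```
#include <cstdio>

// NVLink aggregate bandwidth: links * per-link rate.
int main() {
    printf("NVLink 1 (P100): %d GB/s\n", 4 * 40);  // 160 GB/s
    printf("NVLink 2 (V100): %d GB/s\n", 6 * 50);  // 300 GB/s
    return 0;
}
```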

Edit2:
Was tired; just corrected the Tensor specifics on a proofread.
 
Have they confirmed this is one die? To me, at 800mm², it makes far more sense for this to be two dies. Given that it's using HBM, there is already an interposer, so having two 400mm² chips with a high-bandwidth fabric between L2 slices* makes far more sense to me.

*or similar position in the architecture
 
This is one monster of a chip. A lot of improvements over Pascal, plus pushing chip size and the manufacturing process. Nvidia definitely didn't take their foot off the gas pedal and wait for the competition to catch up.

I really liked the AnandTech piece on Volta, and the Nvidia devblog was surprisingly detailed too.

http://www.anandtech.com/show/11367...v100-gpu-and-tesla-v100-accelerator-announced

https://devblogs.nvidia.com/parallelforall/inside-volta/?ncid=so-twi-vt-13918


The AnandTech article is pretty good. Not sure why they mentioned less flexibility vs. more performance; I know it was talked about in the presentation when comparing to other chip types (FPGAs and CPUs), but I don't think Volta's architecture is going to be less flexible than past GPU architectures from nV. Maybe Ryan can explain.
 
The AnandTech article is pretty good. Not sure why they mentioned less flexibility vs. more performance; I know it was talked about in the presentation when comparing to other chip types (FPGAs and CPUs), but I don't think Volta's architecture is going to be less flexible than past GPU architectures from nV. Maybe Ryan can explain.
He was talking only about the Tensor Cores, which are less flexible; they target a specific subset of workloads.
 
But I think you will find a version of the Tensor cores on certain other CUDA/Volta GPUs.
Maybe they can retain some of them in a GV102 core for the Titan crowd? Judging by previous trends, a GV104 core will most likely discard them completely. Speaking of which, I think we can expect a full GV104 to be roughly 20-30% faster than a Titan Xp (full GP102), if NV manages high enough clocks.
 
Yep, one die. It's just insane. And that doesn't even get into the interposer (you can't get a traditional interposer large enough).

In 5 years when everyone starts throwing these out, I'm going to have to get one to add to the GPU collection...

This kinda shows how much they are expecting in sales from DL and HPC, though, to push the limits like this.
 