What's TSMC's 12FFN?
Dedicated Tensor cores apparently do FP16 MUL + FP32 ADD (each FMA counting as two ops) at a very high rate, exclusively for 4x4 matrix processing(?), hence the 120 mixed-precision TFLOPs.
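To make the "FP16 MUL + FP32 ADD" idea concrete, here is a minimal numpy sketch of that mixed-precision matrix op (D = A*B + C with FP16 inputs and an FP32 accumulator). This is only an arithmetic illustration of the precision behaviour, not the actual hardware datapath:

```python
import numpy as np

# Hypothetical sketch of one Tensor-core-style op: D = A*B + C,
# with A and B as FP16 4x4 matrices and accumulation done in FP32.
rng = np.random.default_rng(0)
A = rng.standard_normal((4, 4)).astype(np.float16)
B = rng.standard_normal((4, 4)).astype(np.float16)
C = rng.standard_normal((4, 4)).astype(np.float32)

# Promote the FP16 operands to FP32 before multiply-add, mimicking
# "multiply in FP16 precision, accumulate in FP32".
D = A.astype(np.float32) @ B.astype(np.float32) + C
print(D.dtype, D.shape)
```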
900GB/s of HBM2 bandwidth means it's using chips running at 1.8Gbps per pin, up from the ~1.4Gbps in P100.
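The bandwidth figure falls straight out of the per-pin rate; a quick back-of-envelope check, assuming the same 4 HBM2 stacks with 1024-bit interfaces each as on P100:

```python
# HBM2 bandwidth estimate: stacks * bus width (bits) * per-pin rate (Gbps) / 8
stacks, bus_bits, pin_gbps = 4, 1024, 1.8
bandwidth_gbs = stacks * bus_bits * pin_gbps / 8
print(bandwidth_gbs)  # 921.6 GB/s raw, marketed as ~900 GB/s
```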
P.S.: Obligatory bullshit moment: showing a trailer for the 1-year-old Kingsglaive CG movie as "what Takeshi Nozue thinks games will look like in the future". Yeah, ok.
P.S.2: The 5120 ALUs are apparently running at ~1.47GHz to give those 15 FP32 TFLOPs.
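That figure checks out if you count each FMA as two operations:

```python
# Peak FP32 throughput: ALUs * clock (GHz) * 2 ops per FMA, in TFLOPs
alus, clock_ghz = 5120, 1.47
tflops = alus * clock_ghz * 2 / 1000
print(round(tflops, 2))  # ~15.05 TFLOPs
```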
Quite the achievement for a >800mm^2 chip.
Yeah.
More impressive, though, is staying within 300W while keeping full FP64 and also expanding NVLink 2 performance; that is where a lot of the power demand/TDP comes from (FP64 especially, but the NVLink mezzanine form factor is pretty demanding too).
It is a very interesting design and impressive on the specs too; I mentioned to someone else a while ago that it is a bit like the Kepler->Maxwell jump repeated, this time as Pascal->Volta.
They increased the die by 33.6% while impressively keeping the same 300W, and yet went further:
FP32 compute increases by 41.5% (or up to 2x, depending upon function, with Tensor cores).
FP64 compute increases by 41.5%.
FP16 compute increases by 41.5% (or up to 4x, depending upon function, with Tensor cores).
So they squeezed an extra ~43% CUDA cores into that 33.6% die increase, and importantly with additional functions/units.
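For reference, those percentages against GP100's published numbers (die 610mm² -> 815mm², FP32 10.6 -> 15 TFLOPs, FP64 5.3 -> 7.5, FP16 21.2 -> 30, CUDA cores 3584 -> 5120):

```python
def pct(old, new):
    """Percent increase from old to new, rounded to one decimal."""
    return round((new / old - 1) * 100, 1)

print(pct(610, 815))    # die area:   ~33.6%
print(pct(10.6, 15.0))  # FP32:       ~41.5%
print(pct(5.3, 7.5))    # FP64:       ~41.5%
print(pct(21.2, 30.0))  # FP16:       ~41.5%
print(pct(3584, 5120))  # CUDA cores: ~42.9%
```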
And other important aspects such as heavily revised thread scheduling and cache performance behaviour; the specific devblog sections are:
"Independent Thread Scheduling", and for the L0/L1 caches both "Volta SM (Streaming Multiprocessor)" and "Enhanced L1 Data Cache and Shared Memory".
https://devblogs.nvidia.com/parallelforall/inside-volta/
More of a monster than I expected TBH, but it fits with what was being said quite a while ago about how it is another jump from Pascal with architecture changes (and also, critically, efficiency, looking at those specs).
It will be interesting to see how GV100 pans out as a Quadro in the 2nd half of next year; a shame no-one has yet tested the Quadro GP100 with its dual NVLink to see how well it works with the certain professional applications whose devs Nvidia works closely with for Quadros.
Cheers
Edit:
Sorry Graham, I did not read your post before posting, so I see you also referenced the additional info on the devblog.
But I think you will find a version of the Tensor cores on certain other CUDA/Volta GPUs.
Also forgot to say: NVLink 2, as thought, increases the number of links supported from 4 to 6, now at 50GB/s each rather than 40GB/s.
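In aggregate terms, using the link counts and per-link rates above:

```python
# Aggregate NVLink bandwidth per GPU: links * GB/s per link
p100_nvlink = 4 * 40  # NVLink 1 on P100
v100_nvlink = 6 * 50  # NVLink 2 on V100
print(p100_nvlink, v100_nvlink)  # 160 vs 300 GB/s
```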
Edit2:
Was tired; just corrected the Tensor specifics on proof-read.