Nvidia Volta Speculation Thread

Why 12nm and not 10nm for manufacturing Volta? Did every other TSMC customer go 10nm while Nvidia wanted (paid for) 12nm for some reason?

12nm is an optimization of 16nm, so it would be a tweaked version of a mature process, which helps when you are pushing a chip's size to the very limit of what is possible.
Nvidia would likely be fighting Apple and other vendors for capacity on the next actual shrink, and the timing would not be good since that process would be of uncertain availability and early in its yield maturation curve. Even shrunk down, a GV100 would be a big chip compared to the SoCs that typically come out that early.

One thought I have is that this might minimize the risk of a process/manufacturing slip relative to Nvidia's HPC contract deadlines.
 
The new ISA and refactored scheduler contradict some predictions for a highly iterative change from Pascal, at least for the HPC variant.
They are high-level (one might say artistic) views, but the changes between the GP100 and GV100 SMs are significant:
GP100:
gp100_SM_diagram-624x452.png


GV100:
image3.png


And you can't get such a massive power-efficiency gain with only a few tweaks.
 
There are quite a few things that were not depicted in the artistic impressions (I like that phrase) of earlier SMs. Starting from a simple block diagram and trying to extrapolate the inner workings of those chips is like sailing through uncharted territory. IOW, they paint what they want you to believe.
 
The impressive part for me is more of an "oh shit, I can't believe they were bullish enough to do this" than an actual technical achievement.
Each wafer yields only a few chips, and most probably the great majority of them come out with a defect (ending up either with disabled units or scrapped entirely). With a 300mm wafer they're probably getting around 60-65 die candidates per wafer.
They're only making this because they got clients to pay >$15k per GPU, meaning a 2% yield (practically 1 good GPU per wafer) already provides some profit.
10% yields (6 good chips) means $90K in revenue per wafer, of which they're probably keeping well over $80K in profit after putting the cards together.
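As a rough sanity check on those die counts (my own back-of-the-envelope, assuming the reported ~815 mm² die on a 300 mm wafer and the usual edge-loss approximation):

\[
\text{dies per wafer} \approx \frac{\pi\,(300/2)^2}{815} \;-\; \frac{\pi \cdot 300}{\sqrt{2 \cdot 815}} \approx 86.7 - 23.3 \approx 63,
\]
\[
\text{so a 10\% yield} \approx 6\ \text{good dies} \;\Rightarrow\; 6 \times \$15\mathrm{k} = \$90\mathrm{k}\ \text{revenue per wafer}.
\]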



The FP32 and FP64 unit increase is almost a match to the increase in die area. Unlike Pascal P100, the FP32 units don't seem to do 2*FP16 operations anymore, as the Tensor cores do that instead.
So what they saved in smaller FP32 units and general die area from the 12FF transition, they invested in the Tensor cores.



The Tensor cores are definitely unable to address values at arbitrary positions in the cubic matrices (otherwise they would just be regular FP16 ALUs). My guess is you could multiply 4x4 matrices using two 4x1 operands holding the "valid" FP16 values, fill the third dimension with 1s, and in the end just read the first row (EDIT: derp, forgot how to Algebra).
That said, this works out to 30 TFLOPs (120/4) of regular FP16 FMAD operations.
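One way to make that 120/4 accounting concrete (my own sketch, not anything NVIDIA has described): a 4x4x4 tensor-core op is 64 FMAs, and if you only care about one column of the output, e.g. a 4x4 matrix times a 4x1 vector padded out to 4x4, only a quarter of them do useful work:

\[
D = AB + C:\quad 4 \times 4\ \text{outputs} \times 4\ \text{FMAs each} = 64\ \text{FMAs per op},
\]
\[
\text{one useful column} = 4 \times 4 = 16\ \text{FMAs} \;\Rightarrow\; 120\ \mathrm{TFLOPS} \times \tfrac{16}{64} = 30\ \mathrm{TFLOPS}\ \text{effective}.
\]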

Other than being usable as dedicated FP16 units, I don't see any rendering application for the Tensor units. They could be used for AI inferencing in a game, though.

For gaming, they'd probably be better off going back to the FP32 units capable of doing 2*FP16 operations.
Or, like they did with consumer Pascal, ignore FP16 altogether, promote all FP16 variables to FP32, and call it a day. This would be risky because developers could start using a lot of FP16 in rendering in the future, but Nvidia's consumer architectures aren't exactly known for being extremely future-proof.

The point is they still increased it by 41.5% when the physical side only grew by 33%, while also adding even more functionality, all within the same TDP, on a massive die, and on essentially the same 16nm node (albeit the latest iteration, which TSMC in typical fashion calls 12nm).
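For what it's worth, those percentages line up with the announced specs, assuming the commonly quoted 15 vs. 10.6 TFLOPS peak FP32 and 815 vs. 610 mm² die sizes (both parts rated at 300 W):

\[
\frac{15\ \mathrm{TFLOPS}}{10.6\ \mathrm{TFLOPS}} \approx 1.415,
\qquad
\frac{815\ \mathrm{mm^2}}{610\ \mathrm{mm^2}} \approx 1.34.
\]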
They still do packed accelerated 2xFP16 math in V100 just like P100 btw.
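As an aside, here is a minimal sketch of what "packed 2xFP16" means in CUDA terms (the kernel and its name are mine, purely illustrative): each 32-bit register holds a half2 pair, and one __hfma2 issues two FP16 FMAs per thread.

```cuda
#include <cuda_fp16.h>

// Packed FP16 AXPY: y = a*x + y on __half2 pairs. One __hfma2 performs two
// FP16 FMAs per thread per instruction -- the 2x-rate FP16 path, separate
// from the tensor cores.
__global__ void fp16x2_axpy(const __half2* x, __half2* y, int n, __half2 a) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        y[i] = __hfma2(a, x[i], y[i]);
}
```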
You get 30 TFLOPs FP16 and also the Tensor matrix function units/cores; the Tensor units will usually have more specific uses, primarily for Deep Learning frameworks/apps (in future it is in theory possible to use them for professional rendering/modelling, though not for gaming).
Those Tensor function units/cores can also be used for FP32 operations, so I think that works out to around 2x faster with DL-supported frameworks/apps.
Cheers
 
Last edited:
12nm is an optimization of 16nm, so it would be a tweaked version of a mature process, which helps when you are pushing a chip's size to the very limit of what is possible.
(...)

From what I understand it was always meant to be 16nm; the 10nm rumour came out of left field and was quite possibly reinforced by WCCFT and then spammed around.
Like I mentioned before, Pascal was a technical risk milestone towards Volta and the big HPC project obligations, albeit one that also added great value for Nvidia.
Pascal only came into being once those very large and high profile projects with IBM were agreed.
It also explains why they started the node shrink with a 610mm² die first, despite the massive cost/risk that entailed on its own, let alone also adding HBM2/NVLink/etc. to GP100.
Cheers
 
Those Tensor function units/cores can also be used for FP32 operations, so I think that works out to around 2x faster with DL-supported frameworks/apps.
Cheers
No they can't, they're explicitly FMA units which do FP16 multiplication and FP16/32 accumulation
 
No they can't, they're explicitly FMA units which do FP16 multiplication and FP16/32 accumulation
We have not been told everything about the Tensor function units/cores; what we have seen is high level.
Pretty sure Nvidia will soon be posting actual results comparing the Tensor matrix cores against FP32 training and FP16 inferencing, or matrix FP32 and mixed precision, but maybe it is sleight of hand *shrug*.

Edit:
The Caffe blog is showing cuDNN 6 training at FP32 on P100 versus cuDNN 7 training at FP16 on V100, a performance difference of 2.4x.

Cheers
 
Last edited:
No they can't, they're explicitly FMA units which do FP16 multiplication and FP16/32 accumulation

Seems there is more data now posted:
Volta architecture support: CUDA Libraries are optimized to get the best performance on the Volta platform. Built for Volta and accelerated by Tesla V100 Tensor Cores, cuBLAS GEMMs (Generalized Matrix-Matrix Multiply) achieve up to a 9.3x speedup on mixed-precision computation (SGEMMEx with FP16 input, FP32 computation), and up to 1.8x speedup on single precision (SGEMM). Other CUDA libraries are also optimized to deliver out-of-the-box performance on Volta. A detailed performance report on CUDA 9 libraries will be available in the future.

image10.jpg
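To put the quoted cuBLAS description in code terms, here is a minimal sketch of the FP16-input / FP32-compute GEMM path (the function name and matrix shapes are mine, error handling omitted; the tensor-op math mode is the opt-in described for CUDA 9):

```cuda
#include <cublas_v2.h>
#include <cuda_fp16.h>

// C (FP32) = alpha * A (FP16) * B (FP16) + beta * C, with FP32 accumulation.
// With the math mode below, cuBLAS on V100 may route this through the tensor
// cores; on Pascal it falls back to the regular FP32 pipes.
void mixed_precision_gemm(cublasHandle_t handle, int m, int n, int k,
                          const __half* A, const __half* B, float* C) {
    const float alpha = 1.0f, beta = 0.0f;
    cublasSetMathMode(handle, CUBLAS_TENSOR_OP_MATH);   // allow tensor-core kernels
    cublasGemmEx(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                 m, n, k,
                 &alpha,
                 A, CUDA_R_16F, m,      // FP16 inputs
                 B, CUDA_R_16F, k,
                 &beta,
                 C, CUDA_R_32F, m,      // FP32 output
                 CUDA_R_32F,            // FP32 compute/accumulate
                 CUBLAS_GEMM_DEFAULT_TENSOR_OP);
}
```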

I appreciate this is not exactly the same as what you were inferring, but my point went beyond that anyway, in that it does help FP32 as well (in the context of my posts).
Cheers
 
Last edited:
There are quite a few things that were not depicted in the artistic impressions (I like that phrase) of earlier SMs. Starting from a simple block diagram and trying to extrapolate the inner workings of those chips is like sailing through uncharted territory. IOW, they paint what they want you to believe.

Effectively, at best they just represent the numbers of units and their differentiation (and again... sometimes even that is flawed).
 
So if I understand correctly from their public claims:
- 32-wide warps being scheduled on a 16-wide ALU... (G80 and Rogue say hi! :) )
- This allows them to decode 2 instructions in the time the main FP32 pipeline executes 1, so they can run some other instructions "for free" (G80 and Rogue say hi again!)
- Register file *might* be over-specced, it *looks* like it's still 32-wide despite the ALU being 16-wide, which allows these 2 instructions to have a lot less restrictions than they had on G80
- Thanks to the above, you can run FP32 and INT32 in parallel - and maybe FP32 and FP64 in parallel? Or FP32 and Tensor Cores in parallel?

Alternatively maybe the register file isn't 2x the ALU width, and they rely on their "register cache" (and/or extra register banks) to execute multiple pipelines in parallel?

The one thing I'm most surprised by is that they have *full-speed* INT32; presumably full-speed INT32 MULs, not just INT32 ADDs? If so, that's quite expensive... More expensive than the Vec2 INT16/Vec4 INT8 they had on GP102/104. I wonder why? I can't think of any workload that needs it, the only benefit I can think of is simplifying the scheduler a bit. Are they reusing the INT32 units in some clever way - e.g. sharing them with FP64? There are 'interesting' unusual ways you could share some of the INT logic with some other pipelines (rather than just over-speccing the mantissa for FP32 and clock gating it when not doing INT32) but that wouldn't allow full generality of co-issue with all other pipelines which is what they are implying.

Also for their tensor core performance numbers, they are comparing to "FP16 input with FP32 compute" on Pascal; I'm going to guess that's effectively using the FP32 pipeline rather than the FP16 pipeline on Pascal, so 9x isn't quite as mind-blowing (but still impressive); they could have gotten a *lot* of performance simply by supporting the same Vec2 INT16 dot product instruction they had on GP102/GP104 with FP16 instead (since INT16 accumulating to INT32 is good for inference, but not always good enough for training).

I'm also curious about the effective parallelism required to make use of the tensor cores; it's effectively a 4x4x4 matrix multiply, but according to their blog, that's per-thread so across a warp it becomes a 16x16x16 matrix multiply (based on a 32-wide warp I'd expect 16x16x8, not sure at what level the extra 2x happens). That's a *massive* amount of parallelism required for a single instruction, which is fine for convolutional networks, but it sounds like it might not work as well for e.g. recurrent networks in which case you'd want to stick to the CUDA cores? The ideal scenario would be if the scheduler could efficiently use the tensor cores and CUDA cores in parallel for different warps on the same scheduler.

EDIT: Actually if it's 16x16x16, that sounds like they might be running 4x4x4 matrix multiplies sequentially, so the tensor cores might be exposed with descheduling data fences with a long latency to get results back. If so it seems likely that FP32/FP16 CUDA cores and Tensor Cores can work in parallel (but for workloads where you can use the Tensor Cores, it probably makes more sense to only use them, since they should be more power efficient).
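As a trivial illustration of where a separate full-rate INT32 pipe could pay off (my example, not anything NVIDIA has shown): even in a plain strided SAXPY, the index/address arithmetic is INT32 work that, per the co-issue claims, would no longer have to serialize against the FP32 FMAs.

```cuda
__global__ void strided_saxpy(const float* x, float* y, int n, float a) {
    // INT32 pipeline: index and address arithmetic (integer mul/add)
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = gridDim.x * blockDim.x;

    // FP32 pipeline: the actual FMA. Per NVIDIA's description, Volta's
    // scheduler can keep both pipes busy at once instead of interleaving.
    for (; i < n; i += stride)
        y[i] = a * x[i] + y[i];
}
```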
 
Last edited:
Each Tensor Core can perform 64 floating-point FMA mixed-precision operations per clock (FP16 multiplication with FP32 accumulation), and the 8 Tensor Cores in one SM can perform a total of 1024 floating-point operations per clock. Compared to Pascal GP100 using standard FP32 operations, the deep learning throughput per SM increases by a factor of eight, for a total 12x increase in throughput for the Volta V100 GPU compared to the Pascal P100 GPU. The Tensor Cores operate on FP16 input data with FP32 accumulation: the FP16 multiply yields a full-precision result, which is accumulated in FP32 with the other products in a given dot product of the 4x4x4 matrix multiply, as shown in Figure 8.

Figure 8. Volta GV100 Tensor Core flow chart
980eed01c78e4ce7ab82bda3814c2fdd_th.jpg

During program execution, multiple Tensor Cores are used concurrently by a full warp of execution. The threads within a warp cooperate to provide a larger 16x16x16 matrix operation to be processed by the Tensor Cores. CUDA exposes these operations as warp-level matrix operations in the CUDA C++ API. These C++ interfaces provide specialized matrix load, matrix multiply-and-accumulate, and matrix store operations so that Tensor Cores can be used from CUDA C++ programs.

In addition to programming the Tensor Cores directly through the CUDA C++ interface, the CUDA 9 cuBLAS and cuDNN libraries include new interfaces for developing deep learning applications and frameworks that use Tensor Cores. NVIDIA has worked with many popular deep learning frameworks such as Caffe2 and MXNet to enable the use of Tensor Cores for deep learning research on Volta-based GPU systems. NVIDIA continues to work with other framework developers to enable broad use of Tensor Cores across the deep learning ecosystem.
http://it.sohu.com/20170511/n492686208.shtml
https://translate.google.com/translate?hl=en&sl=zh-CN&tl=en&u=http://it.sohu.com/20170511/n492686208.shtml
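To make the quoted description concrete, here is a minimal sketch of that warp-level CUDA 9 interface, with one warp cooperatively handling a single 16x16x16 tile (kernel name and layout choices are mine, purely illustrative):

```cuda
#include <mma.h>
#include <cuda_fp16.h>
using namespace nvcuda;

// One warp multiplies a 16x16 FP16 tile pair and accumulates into a 16x16
// FP32 tile on the tensor cores.
__global__ void wmma_tile(const half* a, const half* b, float* c) {
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> acc_frag;

    wmma::fill_fragment(acc_frag, 0.0f);                 // C = 0
    wmma::load_matrix_sync(a_frag, a, 16);               // leading dimension 16
    wmma::load_matrix_sync(b_frag, b, 16);
    wmma::mma_sync(acc_frag, a_frag, b_frag, acc_frag);  // D = A*B + C
    wmma::store_matrix_sync(c, acc_frag, 16, wmma::mem_row_major);
}
```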
 
I really hate how NVIDIA always talks about everything "per SM" when only texturing units/local memory/L1 caches/etc are shared across schedulers inside a SM. For most intents and purposes, I feel it makes more sense to think of "units per warp scheduler", i.e. 2 tensor cores per warp scheduler.
 
The artist's impression does credit that aspect a bit more this time. Much of it was already present with Pascal and Maxwell, IIRC.
 
Last edited:
(...)
I'm also curious about the effective parallelism required to make use of the tensor cores; it's effectively a 4x4x4 matrix multiply, but according to their blog, that's per-thread so across a warp it becomes a 16x16x16 matrix multiply (based on a 32-wide warp I'd expect 16x16x8, not sure at what level the extra 2x happens). That's a *massive* amount of parallelism required for a single instruction, which is fine for convolutional networks, but it sounds like it might not work as well for e.g. recurrent networks in which case you'd want to stick to the CUDA cores? The ideal scenario would be if the scheduler could efficiently use the tensor cores and CUDA cores in parallel for different warps on the same scheduler.

EDIT: Actually if it's 16x16x16, that sounds like they might be running 4x4x4 matrix multiplies sequentially, so the tensor cores might be exposed with descheduling data fences with a long latency to get results back. If so it seems likely that FP32/FP16 CUDA cores and Tensor Cores can work in parallel (but for workloads where you can use the Tensor Cores, it probably makes more sense to only use them, since they should be more power efficient).
Yeah, I think it is parallel as well, but it comes down to register pressure/BW and more details about the design and its flexibility. If they had tried something similar with P100 they would have hit those limitations, but I thought this was the next step for the architecture in Volta. Some results/figures sort of suggest it is parallel, but I guess detail about flexibility/limitations is important.
Cheers
 
Last edited:
With the new info, I think I've finally managed to do the math to get to Xavier's 30 tera ops. Basically it's a sum of tensor ops, FP32 ops, and DLA ops. Initially Xavier was 20 TOPs @ 20 watts. Then NVIDIA bumped it to 30 TOPs @ 30 watts. Basically a 50% power bump. So I bumped the clocks by 50% as well. This gets to 29 teraops. Let me know what you think.
 

Attachments

  • Screen Shot 2017-05-12 at 9.49.51 AM.png
Seems there is more data now posted:

I appreciate this is not exactly the same as what you were inferring, but my point went beyond that anyway, in that it does help FP32 as well (in the context of my posts).
Cheers
It just says Volta (CUDA 9) is faster than Pascal (CUDA 8) with FP32 ops too, but it doesn't actually say that's because of the tensor cores. I simply don't see any way you could use them to speed up FP32 when they only accept FP16 x FP16 with FP16/32 accumulation.
 
Nvidia seems to be committing more strongly to keeping up the SIMT facade, in part by correcting a major reason why Nvidia's threads weren't really threads. It's a stronger departure from GCN or x86, which are more explicitly scalar+SIMD. There are some possible hints of a decoupling of the SIMT thread from the hardware thread in some of AMD's concepts of lane or wave packing, but nothing as clearly planned as Nvidia's imminent product.
Perhaps it's time for debating how much of a thread their "thread" is again?

I remembered this paper: http://hwacha.org/papers/scalarization-cgo2013.pdf. Section 2.4 describes the way current GPUs work (i.e. with software or hardware divergence stacks), while section 5.2 describes "stackless temporal-SIMT". It provisions a hardware PC per thread and allows "diverged threads to truly execute independently". It even includes a syncwarp instruction for compiler-managed thread reconvergence. To me, the properties described in the paper sound basically the same as those being claimed for Volta.
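For reference, here is roughly what that explicit-reconvergence model looks like with the CUDA 9 intrinsics NVIDIA has announced (__syncwarp and the *_sync shuffle variants); the kernel itself is just my illustrative sketch:

```cuda
__global__ void divergent_then_shuffle(int* data) {
    int lane = threadIdx.x & 31;
    int v = data[threadIdx.x];

    // Divergent section: with independent thread scheduling, the two paths
    // are no longer guaranteed to reconverge implicitly at the closing brace.
    if (lane & 1)
        v += 1;
    else
        v *= 2;

    // Explicit reconvergence point before any warp-wide communication.
    __syncwarp();

    // Pull the value from the lane 16 places down; lanes 0-15 now hold
    // pairwise partial sums (a first reduction step).
    v += __shfl_down_sync(0xffffffffu, v, 16);

    data[threadIdx.x] = v;
}
```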

On the other hand, temporal-SIMT doesn't seem to correspond to the following blog post statement: "Note that execution is still SIMT: at any given clock cycle CUDA cores execute the same instruction for all active threads in a warp just as before, retaining the execution efficiency of previous architectures".

(Edit) These papers also seem interesting and potentially relevant. Disclaimer: I haven't had time to read them:

https://www.ece.ubc.ca/~aamodt/papers/eltantawy.hpca2014.pdf
https://www.ece.ubc.ca/~aamodt/papers/eltantawy.micro2016.pdf
 
Last edited:
They still do packed accelerated 2xFP16 math in V100 just like P100 btw.
You get 30TFLOPs FP16 and also the Tensor matrix function unit/cores
(...)

Source for Volta's FP32 units doing 2*FP16 packed math?
It doesn't make much sense to have that functionality if the Tensor units can also do it and would probably deliver the same per-SM throughput as the FP32 units would with packed math.
Even less so considering that in Pascal the FP32 units will only do 2*FP16 when both halves use the same operation, so the Tensor units would probably achieve better utilization.


I appreciate this is not exactly the same as what you were inferring, but my point was beyond that anyway in that it does help FP32 as well (in context of my posts).
I see nothing in that left picture claiming the Tensor units are responsible for the increase in FP32 throughput.
There's a 43% increase in theoretical FP32 throughput, there's higher memory bandwidth, a lot more RF cache and there's an improved CUDA 9 stack. All of that is contributing to better FP32 benchmark results, but not the Tensor cores.

I simply don't see any way you could use it to speed up FP32 when it only accepts FP16 x FP16 + FP16/32
The result of the FP16*FP16 matrix multiply is an FP32 matrix, to which a second FP32 matrix is added, so I guess the tensor cores could be used for FP32 ADD operations...
However, I doubt the Tensor cores are flexible enough to take an FP32 operand from anywhere other than the result of the two FP16 multiplies.
 