Nvidia Turing Speculation thread [2018]

You mean Tensor cores? That's actually a highly likely option (along with cutting DP speed as usual) - it was the most likely option until they revealed RTX, which seems to utilize Tensor cores.
And as part of the discussion one also needs to consider RTX/OptiX/Tensor cores from a Quadro perspective; it is very difficult to cut the overlap between GeForce/Quadro/Tesla until one drops below the Gx104 range.
And then, like you say, there is the GameWorks integration work involving the Tensor cores; that is a lot of effort and resources if they did not intend to keep this for at least the high-performance segment (Gx104 and higher), unless it is just more of a tech introduction that will evolve after this year's gaming dGPUs.
Weighing everything up, it seems quite possible to me that they would be included in some form, but one aspect to consider is the relationship between Tensor cores and the SM/GPC-TPC structure; there may also be a technical deciding factor in how far the architecture can be cut down before it stops being worthwhile.
In the context of current Nvidia architecture designs and model differentiation, that is why it could be limited to Gx104 or maybe only even higher for now, along with the price consideration for including such functionality; but then we are back to the cost/logistics/R&D implications of splitting GeForce from Quadro (which will need Tensor cores for its models) and Tesla.
 

If there were tensor cores in the G(T(V+))104/102 GPUs, I would expect there not to be many of them, similar to the low presence of double precision.
So more for compute feature compatibility / software development purposes.
 
If there are going to be similarities between Next-Gen-Mainstream GeForces and the current gen, then there is the possibility to have the Tensors stripped down to 1/4 rate pretty easily (1/2 per SM, with two HPC-SMs forming a Mainstream-SM). Pure checkbox rates, as with double precision, would not suffice IMHO if Nvidia were serious about pushing DXR or their own subset. 1/4 rate would probably be good enough, especially if the Next-Gen-Mainstream GeForces also inherit the Volta scheduler.

Now, of course it is possible for Nvidia to design variants of the tensors, maybe ones with higher latency that do not occupy as much space. But somehow I do not think this is going to happen. For cost reasons mainly. Remember, these are first-gen Tensor cores that will probably still evolve, contrary to the very much set-in-stone DP units, where they could be sure of being able to re-use the small DP unit for a while.

I do not think they will mix SM types on Next-Gen-Mainstream GeForces, as in some having tensors and most not, even though that would theoretically also be possible.
 
If there were tensor cores in the G(T(V+))104/102 GPUs, I would expect there not to be many of them, similar to the low presence of double precision.
So more for compute feature compatibility / software development purposes.
You would have fewer anyway due to the relationship of Tensor cores with the SM-TPC, registers, cache, etc., if Nvidia decided to go with it down to Gx104. Gx102 would be an interesting model, as it has the same number of GPCs but is structured subtly differently to the Gx100 beyond that; however, Nvidia positions Gx102 and Gx104 as their powerhouse efficient inferencing cards.
Yeah, double precision would be near enough non-existent, just like in the past for non-DP-specific models; look at Pascal with the P100 versus the rest of the range.
 
Now, of course it is possible for Nvidia to design variants of the tensors, maybe ones with higher latency that do not occupy as much space. But somehow I do not think this is going to happen. For cost reasons mainly.

They already have 2 types of tensor cores. V100 uses one type, Xavier uses the second type. Xavier's tensor cores are much smaller. Xavier also has 2x as many tensor cores as V100 per shader. Volta is Nvidia's AI training solution, but it's not for inference. At the moment the inference solutions from Nvidia are GP104 and GP102 using dp4a INT8. Turing will be Nvidia's inference chip using tensor cores, and it will have many of them.
 

Xavier has two ways of inferencing: one using the tensor cores at 20 TOPS (8-bit), and another using the DLA at 5 TOPS (FP16) or 10 TOPS (8-bit).
The DLA is outside of the GPU shaders, so it can be used without impacting GPU performance. The DLA die footprint is also relatively small compared to tensor cores. The DLA is fixed-function; maybe this is the reason a lot of tensor cores are still included in Xavier, as neural networks are still evolving a lot and the DLA might not be flexible enough to support new types of NNs.
If the DLA in Turing were flexible enough to support most NNs, I don't see much need for a lot of tensor cores.
As said, the tensor cores have the drawback of reducing shading/texturing/rasterizing performance.
Edit: apparently Xavier's tensor cores are 20 TOPS at 8-bit.
[Image: NVIDIA Xavier chip shot]
 
I'm not talking about the DLA, that's a different option. I'm talking about the Xavier INT8 tensor cores, unlike the FP16/FP32 TCs from V100:

[Image: 008_o.jpg]


Smaller, lower precision tensor cores, as we will see in Turing.
 

So why are there both a DLA and tensor cores in Xavier?
 
I'm not talking about the DLA, that's a different option. I'm talking about the Xavier INT8 tensor cores, unlike the FP16/FP32 TCs from V100:

[Image: 008_o.jpg]


Smaller, lower precision tensor cores, as we will see in Turing.
Are those by chance beefed-up GP104-style "tensor cores"? And can you even call them Tensor Cores with only INT capability (dp2a/dp4a)?
 
So why are there both a DLA and tensor cores in Xavier?

Because they wanted 30 TOPS and the GPU would be too big if they added more SMs. The better question is, why did they include tensor cores in the GPU if they already have a DLA? Maybe the DLA is way too inflexible, so they wanted to have TCs in the GPU. As Turing will be used for development in the inference space, it'll have TCs also and not only a DLA.

Are those by chance beefed-up GP104-style "tensor cores"? And can you even call them Tensor Cores with only INT capability (dp2a/dp4a)?

What would GP104-style tensor cores be? Pascal has no tensor cores. 4x INT8 rate vs FP32 on Pascal is not a tensor core, and it's much slower than what Xavier has. Xavier has 16x INT8 rate vs FP32 with its tensor cores.
And why should a lower-precision tensor core not be called a Tensor Core? Tensor processing units exist at different precisions.
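To make the scale difference concrete, here is a minimal CUDA sketch (placeholder values, not anything Nvidia has published) of Pascal's dp4a path: a single instruction that does four INT8 multiplies accumulated into one 32-bit integer. A tensor core instead performs an entire small matrix multiply-accumulate per operation, which is where the much larger speedup comes from.

```cuda
// Pascal-class dp4a (requires sm_61 or newer): four packed INT8 products
// accumulated into a 32-bit integer in a single instruction.
#include <cstdio>

__global__ void dp4a_demo(int *out) {
    int a = 0x01010101;      // four packed INT8 lanes, each = 1 (arbitrary values)
    int b = 0x02020202;      // four packed INT8 lanes, each = 2
    *out = __dp4a(a, b, 0);  // 1*2 + 1*2 + 1*2 + 1*2 + 0 = 8
}

int main() {
    int *d_out, h_out = 0;
    cudaMalloc(&d_out, sizeof(int));
    dp4a_demo<<<1, 1>>>(d_out);
    cudaMemcpy(&h_out, d_out, sizeof(int), cudaMemcpyDeviceToHost);
    printf("dp4a result: %d\n", h_out);  // prints 8
    cudaFree(d_out);
    return 0;
}
```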
 
The NVDLA in Xavier is a proof of concept for implementing a bigger version of the DLA. But don't ask me where I have read it...

/edit: Bill Dally talks about Xavier and why Nvidia uses the DLA:
 

Interesting, it says the DLA is only 27% better in energy efficiency compared to the 'tensor core'.
Given the sparse matrix (compression/decompression, zero skipping) and zero-activation tricks in the DLA hardware, I would expect the DLA's efficiency advantage to be much higher than 27%.
 
What would GP104-style tensor cores be? Pascal has no tensor cores. 4x INT8 rate vs FP32 on Pascal is not a tensor core, and it's much slower than what Xavier has. Xavier has 16x INT8 rate vs FP32 with its tensor cores.
And why should a lower-precision tensor core not be called a Tensor Core? Tensor processing units exist at different precisions.
So, going from 4x rate to 16x rate transforms "4xINT8" into a Tensor Core? Because that's what you're essentially saying. Is 8x rate also enough to be called a Tensor Core? Note though, that in the slide you showed and I quoted, Nvidia seems not to use the word "tensor" but rather "INT8 TOPS DL" while in the annotated artist's impression of a die shot it is used.
 

The speedup is an effect of the tensor core. 4x INT8 is a special function, but you can't get to a 16x speedup that easily from it. Therefore Nvidia has built tensor cores to get higher TOPS. But you could build fewer tensor cores and have less speedup. The speedup doesn't determine whether it's a tensor core or not. I don't get your point; do you believe that they're not tensor cores because on one slide they didn't write the words 'tensor core'?
 
Wondering if we'll get "12" nm FF this year and 7nm next year. If so, Ampere/Turing will be a bit pointless unless there's a major architectural change between Pascal and Ampere and we end up getting something similar to Kepler > Maxwell, with the Pascal equivalent following up next year on the new process.
 
So, going from 4x rate to 16x rate transforms "4xINT8" into a Tensor Core? Because that's what you're essentially saying. Is 8x rate also enough to be called a Tensor Core? Note though, that in the slide you showed and I quoted, Nvidia seems not to use the word "tensor" but rather "INT8 TOPS DL" while in the annotated artist's impression of a die shot it is used.

These NN computations are really relatively mundane ALU computations. NV marketing turns that into 'Tensor Core', which sounds that much more innovative, unique, convincing and impressive.
At 8-bit you can have a 4x4 matrix (activations, e.g. after a ReLU activation function) that you multiply by another 4x4 matrix (weights), and potentially you add the result to another 4x4 matrix.
You do all of that in one clock cycle (using a systolic array, pipelining, etc.): 64 multiply-accumulates per clock.
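Written out naively, that operation looks like the following (a plain C++ sketch with placeholder values; the point is only the 64 multiply-accumulates that a tensor core collapses into a single cycle):

```cpp
// Naive D = A*B + C for 4x4 INT8 matrices with 32-bit accumulation.
#include <cstdint>
#include <cstdio>

int main() {
    int8_t  A[4][4], B[4][4];       // activations and weights (placeholder values)
    int32_t C[4][4] = {}, D[4][4];  // accumulator input and result
    for (int i = 0; i < 4; ++i)
        for (int j = 0; j < 4; ++j) { A[i][j] = 1; B[i][j] = 2; }

    int macs = 0;
    for (int i = 0; i < 4; ++i)
        for (int j = 0; j < 4; ++j) {
            int32_t acc = C[i][j];
            for (int k = 0; k < 4; ++k) {
                acc += int32_t(A[i][k]) * int32_t(B[k][j]);
                ++macs;             // one multiply-accumulate
            }
            D[i][j] = acc;
        }
    printf("MACs per 4x4 matrix op: %d (D[0][0] = %d)\n", macs, D[0][0]);  // 64, 8
    return 0;
}
```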
 
Wondering if we'll get "12" nm FF this year and 7nm next year. If so, Ampere/Turing will be a bit pointless unless there's a major architectural change between Pascal and Ampere and we end up getting something similar to Kepler > Maxwell, with the Pascal equivalent following up next year on the new process.

Volta is already a big architectural change compared to Pascal. Turing might be even more so, but 7nm will probably be just a fast shrink architecture-wise, like Pascal was. Anyway, I wouldn't expect Nvidia's 7nm before end of 2019, so Turing will have enough time. I could imagine Turing lasting till early 2020 and skipping 7nm DUV. An EUV process sounds like a much better alternative.
 
The speedup is an effect of the tensor core. 4x INT8 is a special function, but you can't get to a 16x speedup that easily from it. Therefore Nvidia has built tensor cores to get higher TOPS. But you could build fewer tensor cores and have less speedup. The speedup doesn't determine whether it's a tensor core or not. I don't get your point; do you believe that they're not tensor cores because on one slide they didn't write the words 'tensor core'?
My point is: What will a tensor core for the next generation look like?

I am wondering because the 4x rate for "DL-Ops", i.e. in GP104, apparently is purely integer-based (1). The GV100 Tensor cores are using mixed floating-point precisions (2). WRT Xavier, they explicitly say INT8 TOPS at one point, while at another they talk about mixed precision for the Volta cores (+Tensors?).
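For reference, the GV100 mixed-precision path is exposed to developers through CUDA's WMMA API roughly like this (a minimal device-side sketch using the standard 16x16x16 fragment shape; it says nothing about the internal hardware layout): FP16 inputs feeding an FP32 accumulator.

```cuda
// Volta (sm_70+) WMMA sketch: FP16 A/B fragments, FP32 accumulator,
// i.e. the "mixed precision" of the GV100 tensor cores.
#include <mma.h>
#include <cuda_fp16.h>
using namespace nvcuda;

__global__ void tensor_core_gemm(const half *a, const half *b, float *c) {
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> c_frag;

    wmma::fill_fragment(c_frag, 0.0f);                // start from a zero accumulator
    wmma::load_matrix_sync(a_frag, a, 16);            // FP16 inputs...
    wmma::load_matrix_sync(b_frag, b, 16);
    wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);   // ...accumulated in FP32
    wmma::store_matrix_sync(c, c_frag, 16, wmma::mem_row_major);
}
```

WMMA is warp-cooperative, so this kernel has to be launched with at least one full warp (32 threads) for the fragment operations to be valid.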
 
GDDR6 Overclocking Potential Revealed – Micron States Up To 20 Gb/s Possible With Small Voltage Bump on 16 Gb/s Modules
One detail that caught my eye was that Micron has already determined in performance measurements that their GDDR6 memory could extend beyond the 16.5 Gb/s range. The results proved that with a slight but helpful bump in I/O voltage, the memory chips can push speeds as high as 20 Gb/s, which is a significant jump from the JEDEC-defined 14 Gb/s target.
....
So just what kind of performance might we expect from a 20 Gb/s clock bump? Well, to put things into perspective, a 256-bit card with such speeds would be able to deliver 640 GB/s of bandwidth, which is close to Titan V's 652.8 GB/s (HBM2). A 384-bit card would almost hit the 1 TB/s barrier with an approximate bandwidth of 960 GB/s, surpassing NVIDIA's Tesla V100 solution.
https://wccftech.com/micron-gddr6-memory-20-gbps-speeds/
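Quick sanity check of those figures (a trivial C++ sketch; the bus widths are just the ones the article mentions): bandwidth in GB/s is the bus width in bits divided by 8, times the per-pin data rate in Gb/s.

```cpp
// GDDR6 bandwidth: bus width [bits] / 8 * per-pin data rate [Gb/s] = GB/s
#include <cstdio>

int main() {
    const double rate_gbps = 20.0;        // Gb/s per pin
    const int bus_widths[] = {256, 384};  // bits
    for (int bits : bus_widths)
        printf("%d-bit @ %.0f Gb/s -> %.0f GB/s\n",
               bits, rate_gbps, bits / 8.0 * rate_gbps);
    // 256-bit -> 640 GB/s, 384-bit -> 960 GB/s
    return 0;
}
```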
 
My point is: What will a tensor core for the next generation look like?

I am wondering because the 4x rate for "DL-Ops", i.e. in GP104, apparently is purely integer-based (1). The GV100 Tensor cores are using mixed floating-point precisions (2).

For Turing? Like Xavier's INT8, because INT8 is enough for inference. You don't need high accuracy for inference; Ampere as the next-gen training architecture will get higher-precision cores like Volta is using. These are two different workloads with two different requirements.

WRT Xavier, they explicitly say INT8 TOPS at one point, while at another they talk about mixed precision for the Volta cores (+Tensors?).

I can't think of any talk where they published as many details about Xavier as in the GTC Automotive talk the slide I posted is from, so I'm sure they just have INT8 tensor cores. It's just the detailed term vs the marketing term "mixed precision". Second, everything else makes no sense for an inference platform. It would just be a waste of space, and Nvidia is very careful to only add the stuff to their chips that is needed. Inference is low-precision computing; some people even try inference with lower precision than INT8, so we might even see such stuff in the generation after.
 