"So it has no Tensor or RT cores, so no RTX or DLSS."

But it does have honest-to-goodness dedicated FP16 CUDA cores, something that TU102/104/106 don't have.
So what exactly is the point of TU116 over GP104+GDDR5 (1070 Ti)?
It's a 10% smaller chip that consumes 10% less power and performs about the same, or up to 10% worse.
Why not just reduce the price of the 1070 Ti instead of making a whole new chip with practically the same power, area and performance characteristics?
"All the other new Turing features besides ray tracing and tensors? Mesh shaders, FP16, better compute perf, coarse shading, VR rendering improvements, etc.?"

Which, as we see in reviews, does not translate into better performance, new features, or better power efficiency. At best, the 1660 Ti performs close to a 1070 Ti in games where those features are used (Far Cry 5 with FP16, Ashes with async compute). In DX11 games like The Witcher 3 it's over 20% slower.
Why not just "shrink" GP104 to 12FFN?
It can't even be considered a pipe cleaner, since this is the 4th Turing chip and the 5th 12FFN chip.
Either this TU116 runs great on laptops, or developing new chips had better be super cheap for NVIDIA.
"Fine wine effect once the features get used? Turing seems to be a very forward-looking architecture."

Fine wine with less VRAM?
"But it does have honest-to-goodness dedicated FP16 CUDA cores, something that TU102/104/106 don't have."

From "The Curious Case of FP16: Tensor Cores vs. Dedicated Cores" in the review:

"Something that escaped my attention with the original TU102 GPU and the RTX 2080 Ti was that for Turing, NVIDIA changed how standard FP16 operations were handled. Rather than processing it through their FP32 CUDA cores, as was the case for GP100 Pascal and GV100 Volta, NVIDIA instead started routing FP16 operations through their tensor cores."

https://www.anandtech.com/show/13973/nvidia-gtx-1660-ti-review-feat-evga-xc-gaming/2
With TU102/104/106 the Tensor Cores are used for FP16, so there is no need to duplicate them with dedicated ones.
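To make the quoted point concrete, here is a minimal WMMA sketch (my own illustration, not from the review) that drives the tensor cores explicitly; the review's claim is that on TU102/104/106 even plain FP16 arithmetic is routed to these same units:

```cuda
// One warp computes a 16x16x16 FP16 matrix multiply-accumulate on the
// tensor cores via CUDA's WMMA API. Build: nvcc -arch=sm_75 wmma_fp16.cu
#include <mma.h>
#include <cuda_fp16.h>
using namespace nvcuda;

__global__ void wmma_16x16x16(const half* a, const half* b, float* c) {
    // Fragments are opaque, warp-wide register tiles.
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> fa;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> fb;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> fc;

    wmma::fill_fragment(fc, 0.0f);
    wmma::load_matrix_sync(fa, a, 16);   // leading dimension = 16
    wmma::load_matrix_sync(fb, b, 16);
    wmma::mma_sync(fc, fa, fb, fc);      // executes on the tensor cores
    wmma::store_matrix_sync(c, fc, 16, wmma::mem_row_major);
}
```

Launch it with at least one full warp, e.g. wmma_16x16x16<<<1, 32>>>(a, b, c), on an sm_70 or newer part.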
"Yah, I'm really not liking this 6GB thing they did with the 2060 and the 1660. To buy a GPU right now and spend like $400-600 CAD on it, I'm not buying something that's only 6GB. 8GB is probably the minimum I want."

They're doing it because GDDR6 is more expensive and apparently a lot harder to implement on a PCB than GDDR5.
"But it does have honest-to-goodness dedicated FP16 CUDA cores, something that TU102/104/106 don't have."

I presume the function of dedicated FP16 cores is to double the rate of FP16, right? But big Turing is capable of doing double-rate FP16 without those dedicated cores, so what gives?
"I presume the function of dedicated FP16 cores is to double the rate of FP16, right? But big Turing is capable of doing double-rate FP16 without those dedicated cores, so what gives?"

Normal Turing uses Tensor cores for double-rate FP16.
"Normal Turing uses Tensor cores for double-rate FP16."

I seem to lose myself in this, allow me to explain:
-Vega has Rapid Packed Math, which is essentially running two FP16 ops on a single FP32 ALU. All Turing GPUs can do the same thing too. However, there are two additional caveats:
-Big Turing has Tensor Cores, which allow it to run a single FP16 op on a single Tensor Core, while the FP32 ALUs do something else.
-Small Turing has dedicated FP16 cores, which allow it to run a single FP16 op on a single FP16 core, while the FP32 ALUs do something else.
Did I get all of these right? Which is the better overall implementation?
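For concreteness, here is what the packed-FP16 case from the first bullet looks like in CUDA. This is a minimal sketch, not anyone's posted code; the kernel names and sizes are arbitrary, and each __hfma2 issues two FP16 fused multiply-adds through a single 32-bit lane:

```cuda
// Packed FP16 (what Vega markets as Rapid Packed Math): two FP16 values
// travel in one 32-bit register (half2), and one instruction operates on both.
// Build: nvcc -arch=sm_60 half2_saxpy.cu   (half2 arithmetic needs sm_53+)
#include <cuda_fp16.h>

// y = a*x + y over half2 pairs; each __hfma2 is two FP16 FMAs per thread.
__global__ void saxpy_half2(int n2, float a, const half2* x, half2* y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    half2 a2 = __float2half2_rn(a);        // broadcast a into both halves
    if (i < n2) y[i] = __hfma2(a2, x[i], y[i]);
}

// Device-side init so no host half-precision conversions are needed.
__global__ void init(int n2, half2* x, half2* y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n2) {
        x[i] = __float2half2_rn(1.0f);
        y[i] = __float2half2_rn(2.0f);
    }
}

int main() {
    const int n2 = 1 << 20;                // 2^20 half2 = 2^21 FP16 elements
    half2 *x, *y;
    cudaMalloc(&x, n2 * sizeof(half2));
    cudaMalloc(&y, n2 * sizeof(half2));
    init<<<(n2 + 255) / 256, 256>>>(n2, x, y);
    saxpy_half2<<<(n2 + 255) / 256, 256>>>(n2, 3.0f, x, y);
    cudaDeviceSynchronize();               // every y element is now 3*1 + 2 = 5
    cudaFree(x);
    cudaFree(y);
    return 0;
}
```

Note that this code is identical no matter which NVIDIA chip runs it; whether the two halves land on the tensor cores (big Turing) or on dedicated FP16 cores (TU116) is decided by the hardware, which is part of why the distinction is invisible in software.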
"Vega has Rapid Packed Math, which is essentially running two FP16 ops on a single FP32 ALU. All Turing GPUs can do the same thing too."

No, I don't believe that Turing GPUs can run two FP16 ops on a single FP32 ALU.

"Bug Turing has Tensor Cores"

I assume you meant Big, not Bug?
Dude, did you just comment on a quote from @Ryan Smith using a quote from @Ryan Smith?
"No, I don't believe that Turing GPUs can run two FP16 ops on a single FP32 ALU."

FP16 output is 2X FP32, there must be a connection. AnandTech says the same thing:

"Like all other Turing parts, TU116 gets NVIDIA's fast FP16 path. This means that these GPUs can process FP16 operations at twice the rate of FP32 operations."

https://www.anandtech.com/show/13973/nvidia-gtx-1660-ti-review-feat-evga-xc-gaming/2
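As a back-of-the-envelope check of "twice the rate" (my arithmetic, using the 1660 Ti's published 1,536 FP32 ALUs and ~1,770 MHz boost clock, counting an FMA as two FLOPs):

\[
\text{FP32 peak} = 1536 \times 2 \times 1.77\,\text{GHz} \approx 5.4\,\text{TFLOPS}
\]
\[
\text{FP16 peak} = 2 \times \text{FP32 peak} \approx 10.9\,\text{TFLOPS}
\]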
"I assume you meant Big, not Bug?"

Yeah, corrected. Thanks.
The article's authors are Nate Oh & Ryan Smith and it is unclear who the quote belongs to.
My post was to elaborate that FP16 is done on the Tensor Cores on the big Turing GPUs and on dedicated FP16 cores on the Turing GTX GPUs.