Nvidia Turing Product Reviews and Previews: (Super, TI, 2080, 2070, 2060, 1660, etc)

Frenetic Pony · Feb 22, 2019

A1xLLcqAgt0qc2RyMz0y said:
No I don't believe that Turing GPUs can run two FP16 ops on a single FP32 ALU.

Ow, pricepoints. That's really what Nvidia should be concentrating on for the next arch, rather than new features. There's so much overlap of function in the silicon here.

But, well, at least it's something in the $2XX price range. So they've got that going for them.

vipa899 · Feb 22, 2019

Frenetic Pony said:
Ow, pricepoints. That's really what Nvidia should be concentrating on for the next arch, rather than new features. There's so much overlap of function in the silicon here.

But, well, at least it's something in the $2XX price range. So they've got that going for them.

Agree, price is the only problem with nvidias gpu's. They need competition from AMD and Intel, if they come with products at about the same performance and features for human prices they will have to adjust.

Ryan Smith · Feb 23, 2019

DavidGraham said:
FP16 output is 2X FP32, there must be a connection. Anandtech says the same thing:

https://www.anandtech.com/show/13973/nvidia-gtx-1660-ti-review-feat-evga-xc-gaming/2

Thanks for pointing that out. I forgot to edit that after NVIDIA confirmed the dedicated FP16 cores and how they work.

There are numerous good reasons to have the FP16 rate be 2x the FP32 rate, even when using tensor cores. This includes register file bandwidth and pressure, and consistency with Turing parts that don't get tensor cores (since NV has to lay down dedicated FP16 cores on those parts).

Scott_Arm said:
Peak fp16 tflops for Turing are double fps32 according to Nvidia's turing whitepaper. That's from the SMs, not tensor cores.

IMO, the whitepaper didn't do a very good job of explaining it. But according to NVIDIA, for TU102/104/106, general (non-tensor) FP16 operations are definitely done on the tensor cores. They are part of the SMs, after all.

mczak · Feb 23, 2019

Big Turing doing fp16 as part of the tensor core is rather intriguing, but makes sense I suppose. It's just a bunch of fp16 multipliers and adders after all. For non-matrix operations you basically only need 1/4 of them, without any complex cross-lane wiring.
In that sense dedicated fp16 cores would really be just the remains of the tensor cores.
I'm wondering though actually what fp16 operations turing can do with twice the rate of single precision, that is, can they do more than mul/add/fma? Obviously for the tensor operations you don't really need anything else, but otherwise things like comparisons would be quite desirable.

DavidGraham · Feb 23, 2019

Ryan Smith said:
IMO, the whitepaper didn't do a very good job of explaining it. But according to NVIDIA, for TU102/104/106, general (non-tensor) FP16 operations are definitely done on the tensor cores. They are part of the SMs, after all.

I am still quite lost on this. Let's give an example, Far Cry 5 supports RPM, Vega does it on the ALUs, the 2080Ti does it on the Tensor Cores? If so then how is it able to maintain 2x FP32 rate? Are the tensor cores capable of such feat?

Deleted member 13524 · Feb 23, 2019

The tensor cores in "Big Turing" can do "linear" (non-matrix) FP16 at 1/4th their matrix op rate.
It looks like the dedicated FP16 units in TU116 are stripped down tensor units.

AFAIK Far Cry 5 doesn't support RPM per se, it just uses FP16 pixel shaders. Vega (and GP100/GV100) uses RPM to process FP16 at 2x FP32 rate, Turing does it differently.

DavidGraham · Feb 23, 2019

ToTTenTranz said:
The tensor cores in "Big Turing" can do "linear" (non-matrix) FP16 at 1/4th their matrix op rate.

Thanks. But If Big Turing uses only Tensor cores for FP16, and the tensor cores do it at quarter of their matrix capability then Turing isn't really capable of 2x FP32.

entity279 · Feb 23, 2019

DavidGraham said:
Thanks. But If Big Turing uses only Tensor cores for FP16, and the tensor cores do it at quarter of their matrix capability then Turing isn't really capable of 2x FP32.

Depends of just how many tensor cores there are, right ?

ninelven · Feb 23, 2019

Yeah, seems like a decent enough card, but nobody wants to buy a 6 GB card in 2019 for $280 regardless of what benchmarks show. Pretty out of touch...

DavidGraham · Feb 23, 2019

entity279 said:
Depends of just how many tensor cores there are, right ?

Precisely my point.

Kaotik · Feb 23, 2019

DavidGraham said:
Thanks. But If Big Turing uses only Tensor cores for FP16, and the tensor cores do it at quarter of their matrix capability then Turing isn't really capable of 2x FP32.

How so? Big Turings Tensor OPS rate is 8x FP32, doing FP16 on those at quarter of matrix speed would result in 2x FP32

Silent_Buddha · Feb 23, 2019

Hmmm, so the 1660 Ti is basically similar to a 1070 in performance (sometimes a little faster, sometimes a little slower) with slightly lower power consumption and slightly higher noise levels? Oh and 2 less GB of memory (6 GB vs. 8 GB).

Not bad, although you can still occasionally find 1070's at 299 USD (one on Newegg right now) which may or may not be a better deal. Of course, eventually those will all disappear leaving just the 1660 Ti's.

Regards,
SB

DavidGraham · Feb 23, 2019

Kaotik said:
How so? Big Turings Tensor OPS rate is 8x FP32, doing FP16 on those at quarter of matrix speed would result in 2x FP32

It seem I somehow missed that fact. Though this has the implication of limiting DLSS performance in games that heavily utilize FP16 shaders.

troyan · Feb 23, 2019

No, Tensor operation will always run alone. DLSS is post processing AA which runs after the creation of the frame.

Kaotik · Feb 23, 2019

DavidGraham said:
It seem I somehow missed that fact. Though this has the implication of limiting DLSS performance in games that heavily utilize FP16 shaders.

I believe this would be correct

troyan said:
No, Tensor operation will always run alone. DLSS is post processing AA which runs after the creation of the frame.

Not sure how that changes anything, the time spent on DLSS as post processing the tensor cores could be already crunching FP16 shaders for next frame - it all depends on the loads

troyan · Feb 23, 2019

Future workload would be overlapping with the current frame creation.

jlippo · Feb 24, 2019

Kaotik said:
I believe this would be correct

Not sure how that changes anything, the time spent on DLSS as post processing the tensor cores could be already crunching FP16 shaders for next frame - it all depends on the loads

Didn''t Jensen implicate that rest of the GPU would idle when tensor cores are active?

Kaotik · Feb 24, 2019

jlippo said:
Didn''t Jensen implicate that rest of the GPU would idle when tensor cores are active?

If my memory serves me correctly, this only applies to DXR denoising, not tensor cores in general?

troyan · Feb 24, 2019

It is for using the Tensor Cores with tensor operations:

aHR0cDovL21lZGlhLmJlc3RvZm1pY3JvLmNvbS83L08vNzk3MzE2L29yaWdpbmFsL3J0eC1vcHMuanBn

https://www.tomshardware.com/reviews/nvidia-turing-gpu-architecture-explored,5801-10.html

DavidGraham · Feb 24, 2019

https://twitter.com/x/status/1099379855118995458

Nvidia Turing Product Reviews and Previews: (Super, TI, 2080, 2070, 2060, 1660, etc)

Frenetic Pony

vipa899

Ryan Smith

mczak

DavidGraham

Deleted member 13524

Guest

DavidGraham

entity279

ninelven

PM

DavidGraham

Kaotik

Drunk Member

Silent_Buddha

DavidGraham

troyan

Kaotik

Drunk Member

troyan

jlippo

Kaotik

Drunk Member

troyan

DavidGraham