Nvidia Turing Product Reviews and Previews: (Super, Ti, 2080, 2070, 2060, 1660, etc.)

No, I don't believe that Turing GPUs can run two FP16 ops on a single FP32 ALU.

Ow, price points. That's really what Nvidia should be concentrating on for the next arch, rather than new features. There's so much overlap of function in the silicon here.

But, well, at least it's something in the $2XX price range. So they've got that going for them.
 
Ow, price points. That's really what Nvidia should be concentrating on for the next arch, rather than new features. There's so much overlap of function in the silicon here.

But, well, at least it's something in the $2XX price range. So they've got that going for them.

Agreed, price is the only problem with Nvidia's GPUs. They need competition from AMD and Intel; if those two ship products with about the same performance and features at sane prices, Nvidia will have to adjust.
 
FP16 output is 2x FP32, so there must be a connection. AnandTech says the same thing:


https://www.anandtech.com/show/13973/nvidia-gtx-1660-ti-review-feat-evga-xc-gaming/2
Thanks for pointing that out. I forgot to edit that after NVIDIA confirmed the dedicated FP16 cores and how they work.

There are numerous good reasons to have the FP16 rate be 2x the FP32 rate, even when using tensor cores. This includes register file bandwidth and pressure, and consistency with Turing parts that don't get tensor cores (since NV has to lay down dedicated FP16 cores on those parts).

Peak FP16 TFLOPS for Turing is double FP32, according to Nvidia's Turing whitepaper. That's from the SMs, not the tensor cores.
IMO, the whitepaper didn't do a very good job of explaining it. But according to NVIDIA, for TU102/104/106, general (non-tensor) FP16 operations are definitely done on the tensor cores. They are part of the SMs, after all.
 
Big Turing doing FP16 as part of the tensor core is rather intriguing, but it makes sense, I suppose. A tensor core is just a bunch of FP16 multipliers and adders, after all. For non-matrix operations you basically only need a quarter of them, without any complex cross-lane wiring.
In that sense, dedicated FP16 cores would really be just the remains of the tensor cores.
I'm wondering, though, what FP16 operations Turing can actually do at twice the single-precision rate; that is, can they do more than mul/add/fma? Obviously for the tensor operations you don't really need anything else, but otherwise things like comparisons would be quite desirable.
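To make that "quarter of them" figure concrete, here's a rough per-SM throughput-accounting sketch. The fixed numbers (64 FP16 FMAs per tensor core per clock for a 4x4x4 matrix step, 8 tensor cores and 64 FP32 lanes per Turing SM) come from NVIDIA's Turing whitepaper; the quarter-rate linear mode is the assumption discussed above, not a confirmed hardware detail.

```python
# Per-clock FMA accounting for one Turing SM.
TENSOR_CORES_PER_SM = 8
FP32_LANES_PER_SM = 64
MATRIX_FP16_FMA_PER_TENSOR_CORE = 64  # one 4x4x4 matrix step per clock

# Assumption from the discussion: linear (elementwise) FP16 uses
# only a quarter of the tensor core's multipliers.
linear_fp16_fma_per_tc = MATRIX_FP16_FMA_PER_TENSOR_CORE // 4  # 16

sm_fp32_fma = FP32_LANES_PER_SM                                    # 64/clock
sm_linear_fp16_fma = TENSOR_CORES_PER_SM * linear_fp16_fma_per_tc  # 128/clock

print(sm_linear_fp16_fma / sm_fp32_fma)  # -> 2.0, the advertised 2x rate
```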
 
IMO, the whitepaper didn't do a very good job of explaining it. But according to NVIDIA, for TU102/104/106, general (non-tensor) FP16 operations are definitely done on the tensor cores. They are part of the SMs, after all.
I am still quite lost on this. Let's take an example: Far Cry 5 supports RPM; Vega does it on the ALUs, but the 2080 Ti does it on the tensor cores? If so, how is it able to maintain the 2x FP32 rate? Are the tensor cores capable of such a feat?
 
The tensor cores in "Big Turing" can do "linear" (non-matrix) FP16 at 1/4th their matrix op rate.
It looks like the dedicated FP16 units in TU116 are stripped down tensor units.

AFAIK Far Cry 5 doesn't support RPM per se; it just uses FP16 pixel shaders. Vega (and GP100/GV100) uses RPM to process FP16 at 2x the FP32 rate; Turing does it differently.
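For contrast, the "packed" trick that RPM relies on can be sketched with NumPy: two FP16 values occupy one 32-bit register lane, which is why a single issue slot can retire two FP16 operations. This only illustrates the register packing, not the ALU datapath itself.

```python
import numpy as np

# Rapid Packed Math illustration: two FP16 values fit exactly in one
# 32-bit word, so a 32-bit register lane can carry an FP16 pair and
# retire two ops per issue.
pair = np.array([1.5, -2.25], dtype=np.float16)  # 2 x 16 bits = 32 bits
packed = pair.view(np.uint32)                    # one 32-bit lane's worth
unpacked = packed.view(np.float16)               # recover both halves
print(unpacked)                                  # same two values back
```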
 
Yeah, seems like a decent enough card, but nobody wants to buy a 6 GB card in 2019 for $280 regardless of what benchmarks show. Pretty out of touch...
 
Thanks. But if Big Turing uses only the tensor cores for FP16, and the tensor cores do it at a quarter of their matrix capability, then Turing isn't really capable of 2x FP32.
How so? Big Turing's tensor op rate is 8x FP32; doing FP16 on those at a quarter of the matrix speed results in exactly 2x FP32.
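The arithmetic does work out, whatever baseline you start from. As an illustration, using roughly an RTX 2080 Ti's FP32 peak (the exact TFLOPS figure is approximate and only for scale):

```python
fp32_tflops = 13.4                            # ~RTX 2080 Ti FP32 peak (approx.)
tensor_fp16_tflops = 8 * fp32_tflops          # matrix (tensor) ops: 8x FP32
linear_fp16_tflops = tensor_fp16_tflops / 4   # non-matrix FP16: quarter rate
print(linear_fp16_tflops / fp32_tflops)       # -> 2.0, i.e. 2x FP32
```

The ratio is 2.0 regardless of the baseline, since (8x)/4 = 2x cancels the FP32 figure.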
 
Hmmm, so the 1660 Ti is basically similar to a 1070 in performance (sometimes a little faster, sometimes a little slower), with slightly lower power consumption and slightly higher noise levels? Oh, and 2 GB less memory (6 GB vs. 8 GB).

Not bad, although you can still occasionally find 1070s at 299 USD (there's one on Newegg right now), which may or may not be a better deal. Of course, eventually those will all disappear, leaving just the 1660 Tis.

Regards,
SB
 
How so? Big Turing's tensor op rate is 8x FP32; doing FP16 on those at a quarter of the matrix speed results in exactly 2x FP32.
It seems I somehow missed that fact. Though this would imply limited DLSS performance in games that heavily utilize FP16 shaders.
 
No, tensor operations always run alone. DLSS is post-processing AA, which runs after the frame has been created.
 
It seems I somehow missed that fact. Though this would imply limited DLSS performance in games that heavily utilize FP16 shaders.
I believe this would be correct
No, tensor operations always run alone. DLSS is post-processing AA, which runs after the frame has been created.
Not sure how that changes anything; during the time spent on DLSS post-processing, the tensor cores could already be crunching FP16 shaders for the next frame. It all depends on the loads.
 
I believe this would be correct

Not sure how that changes anything; during the time spent on DLSS post-processing, the tensor cores could already be crunching FP16 shaders for the next frame. It all depends on the loads.
Didn't Jensen imply that the rest of the GPU would idle while the tensor cores are active?
 
It is, for using the tensor cores with tensor operations:
http://media.bestofmicro.com/7/O/797316/original/rtx-ops.jpg

https://www.tomshardware.com/reviews/nvidia-turing-gpu-architecture-explored,5801-10.html
 