Nvidia Pascal Reviews [1080XP, 1080ti, 1080, 1070ti, 1070, 1060, 1050, and 1030]

No... sm_50 and sm_52 do not have this capability.

If so you would just be able to access half-words and would achieve 64 ops/clock/SMM.



So 2*M*N fp16 vs. 1*M*N fp32 takes 6x more time?

A simulated (necessary for pre-sm_53) half2 FMA takes 11 instructions but you get 2 fp16 FMAs per thread.

Sounds like 6 ops per fp16 to me.
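Roughly, that simulated path looks like this (a sketch built from the standard cuda_fp16.h conversion intrinsics; exact SASS counts vary by toolkit):

```cuda
#include <cuda_fp16.h>

// Emulated half2 FMA for pre-sm_53 parts: unpack to fp32, FMA, repack.
// Roughly 2 F2F per unpack (6 total) + 2 FFMA + 2 F2F and a pack op on
// the way back ~= 11 SASS instructions for 2 fp16 FMAs.
__device__ __half2 emulated_hfma2(__half2 a, __half2 b, __half2 c)
{
    float2 fa = __half22float2(a);      // unpack operand a
    float2 fb = __half22float2(b);      // unpack operand b
    float2 fc = __half22float2(c);      // unpack the addend
    float2 r;
    r.x = __fmaf_rn(fa.x, fb.x, fc.x);  // fp32 FMA, low half
    r.y = __fmaf_rn(fa.y, fb.y, fc.y);  // fp32 FMA, high half
    return __float22half2_rn(r);        // pack back into a half2
}
```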

One way to discover more about what's going on under the hood is to not perform FMAs but just MULs.

If F2F conversion is happening then you'll be skipping unpacking/packing of the addend.
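Concretely, a pair of probe kernels like these would separate the cases (just a sketch; __hmul2/__hfma2 are the native sm_53+ intrinsics):

```cuda
#include <cuda_fp16.h>

// Throughput probes: if HFMA2 is secretly F2F-emulated, the MUL-only
// version should run measurably faster, since the addend never needs
// unpacking/packing. Identical timings => more likely a real fp16 path.
__global__ void probe_hmul2(__half2 *out, int iters)
{
    __half2 a = __float2half2_rn(1.0f);
    __half2 x = out[threadIdx.x];
    for (int i = 0; i < iters; ++i)
        x = __hmul2(a, x);              // multiply only, no addend
    out[threadIdx.x] = x;
}

__global__ void probe_hfma2(__half2 *out, int iters)
{
    __half2 a = __float2half2_rn(1.0f);
    __half2 x = out[threadIdx.x];
    for (int i = 0; i < iters; ++i)
        x = __hfma2(a, x, x);           // multiply-add
    out[threadIdx.x] = x;
}
```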
Wouldn't that show up in the SASS though? Or are you saying that an emulated HFMA2 would be transparent via the JIT?

If I disassemble the binaries that are generated by nvcc for my code, it's just HFMA2s, and the code doesn't even run on pre-sm_53 targets as-is, so I'm not seeing transparent emulation.

I'll check and make sure I'm not building with stripped PTX.
 
Rys' code mentions compute level 5.3 (Tegra X1) and higher.
Maxwell 2 is compute level 5.2.

IOW: Maxwell 2 doesn't have native FP16 at all.

At least that's how I understand it.
Yep, that's right.
 
Rys' code mentions compute level 5.3 (Tegra X1) and higher.
Maxwell 2 is compute level 5.2.

IOW: Maxwell 2 doesn't have native FP16 at all.

At least that's how I understand it.
Yeah, I appreciate that the only Maxwell GPU to support 5.3 compute is the Tegra X1.
I am talking more about SgemmEx with its partial support for FP16 (and also int8).

I think this was updated to HGEMM with regard to true FP16 and 5.3 compute.
Still wondering where I read the 1:1, maybe I'm confusing it with AMD. *Shrug*

Edit:
So why did NVIDIA bother artificially changing this to 1/6th, when one can just use SgemmEx instead (yeah, I appreciate it's more limited)?
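For reference, the two operations map onto separate cuBLAS entry points; a rough sketch below (enum/function names as per CUDA 8, build with -lcublas):

```cuda
// Sketch (not from the thread) of the two cuBLAS paths being contrasted.
// cublasSgemmEx: fp16 storage but fp32 math -- the "partial FP16 support".
// cublasHgemm:   true fp16 math, requiring sm_53+ (Tegra X1, Pascal).
#include <cublas_v2.h>
#include <cuda_fp16.h>

void gemm_fp16_two_ways(cublasHandle_t h, int n,
                        const __half *A, const __half *B,  // n x n, col-major
                        float *C32, __half *C16)
{
    float alpha32 = 1.0f, beta32 = 0.0f;
    // __float2half is host-callable on recent toolkits
    __half alpha16 = __float2half(1.0f), beta16 = __float2half(0.0f);

    // fp16 in, fp32 compute and output (works on pre-sm_53 GPUs too)
    cublasSgemmEx(h, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n, &alpha32,
                  A, CUDA_R_16F, n, B, CUDA_R_16F, n, &beta32,
                  C32, CUDA_R_32F, n);

    // everything fp16 -- the HGEMM path tied to native FP16 support
    cublasHgemm(h, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n, &alpha16,
                A, n, B, n, &beta16, C16, n);
}
```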
Rys, any chance you could test and compare both operations on the 1080?
Cheers
 
Wouldn't that show up in the SASS though? Or are you saying that an emulated HFMA2 would be transparent via the JIT?

You are definitely seeing an HFMA2 op in the SASS.

I was just pointing out that the 1080 throughput you report seems very close to the throughput of an FMA built out of the half/half2 data conversion intrinsics which are available on pre-sm_53 GPUs.

So perhaps it's microcoded?
 
So why did NVIDIA bother artificially changing this to 1/6th, when one can just use SgemmEx instead (yeah, I appreciate it's more limited)?
You are still assuming that the instruction is intentionally rate limited. As I said before, it's more likely that Nvidia just included 1-2 "new" CUDA cores per SMM, which handle the new instructions, while the remainder are just "old" (Maxwell?) cores.
Wouldn't that show up in the SASS though? Or are you saying that an emulated HFMA2 would be transparent via the JIT?
It would show up.

But so far your test only reveals bare throughput, not the latency of HFMA2. So you can't tell yet whether the instruction itself is slower, or if there are just fewer cores available for that instruction to be scheduled to.

Can you please run it with a single wavefront of only 2 threads, and compare that to FP32 throughput? If HFMA2 is now faster, or at least running at the same speed, then we know for sure that the limitation is not a lack of a fast path, but an asymmetric configuration of the CUDA cores.
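For concreteness, a minimal sketch of such a probe (a serially dependent HFMA2 chain timed with clock64(); constants and launch config are only illustrative):

```cuda
#include <cstdio>
#include <cuda_fp16.h>

// Latency probe: one block, 2 threads, serially dependent HFMA2 chain.
// With so few threads in flight, throughput limits from "fewer fp16
// cores" shouldn't bite; only the instruction's own latency shows up.
__global__ void hfma2_chain(__half2 *out, int iters)
{
    __half2 a = __float2half2_rn(0.5f);
    __half2 b = __float2half2_rn(0.25f);
    __half2 x = __float2half2_rn(0.5f);  // fixed point: x = 0.5*x + 0.25
    long long t0 = clock64();
    for (int i = 0; i < iters; ++i)
        x = __hfma2(a, x, b);            // each FMA depends on the last
    long long t1 = clock64();
    out[threadIdx.x] = x;                // keep the chain live
    if (threadIdx.x == 0)
        printf("~%.2f cycles per HFMA2\n", (double)(t1 - t0) / iters);
}
// launch: hfma2_chain<<<1, 2>>>(d_out, 1 << 20);
```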
 
You are still assuming that the instruction is intentionally rate limited. As I said before, it's more likely that Nvidia just included 1-2 "new" CUDA cores per SMM, which handle the new instructions, while the remainder are just "old" (Maxwell?) cores.
I am not the only one assuming it is artificially limited, though. But then we are all making assumptions here, even those who disagree that this is the issue, until further tests settle it.

Cheers
 
Part of this conversation should also consider ScottGray showing what seemed to be a full-throughput Int8 instruction (sm_61+) on the 1080.
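Presumably that's the new DP4A path; a minimal sketch of the sm_61 intrinsic (__dp4a, available from CUDA 8 — four int8 products accumulated into an int32 in one instruction):

```cuda
// Sketch of the sm_61+ int8 path presumably being referenced: DP4A does
// a 4-way int8 dot product with int32 accumulate in a single instruction.
// Compile with -arch=sm_61 or newer.
__global__ void dp4a_probe(const int *a, const int *b, int *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        // each int packs four signed 8-bit lanes
        out[i] = __dp4a(a[i], b[i], 0);  // sum of 4 lane products + 0
}
```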
Cheers
 
Overall, the general rule of thumb here is that a decent tweak and overclock can gain anywhere from 5 to 10% performance.

Pretty pathetic overclocks even from the Founders Edition.
 
With regard to MSI's GTX 1080 offerings, I think the card reviewed is their middle grade, or one step above the Founders Edition.

We found out more information on what sets MSI Gaming Z series graphics cards apart from the Gaming X series. The Gaming X series cards, such as the GeForce GTX 1080 Gaming X, which was part of the company's first wave of GTX 1080 cards, are the middle grade of the Gaming series. The Gaming Z series (such as the GTX 1080 Gaming Z, which is coming soon) is the top-tier variant.

The Gaming X GTX 1080 features double ball bearing fans, a few gamer-specific features through MSI Gaming Control Center, a back-plate, and a significant factory OC. The Gaming Z, on the other hand, gives you all the features of the Gaming X, but with higher clock speeds and a custom RGB LED setup lining the top logo and an MSI "Dragon" logo on the back-plate.
http://www.techpowerup.com/223002/msi-gaming-z-and-gaming-x-differentiated-some-more

The only card that comes to mind with truly pathetic overclocks is not an Nvidia card.
 
It's barely a third of what the 980 Ti Gaming could do, so it does look pretty pathetic. I doubt a better cooler is going to make that big of a difference unless MSI suddenly gimped their Gaming series and is taking Nvidia's FE approach. Maybe Pascal does better with voltage than Maxwell did.
 
It's barely a third of what the 980 Ti Gaming could do, so it does look pretty pathetic. I doubt a better cooler is going to make that big of a difference unless MSI suddenly gimped their Gaming series and is taking Nvidia's FE approach. Maybe Pascal does better with voltage than Maxwell did.

The MSI GeForce GTX 1080 GAMING X 8G is already factory overclocked.

GTX 1080 FE clocks:

Base: 1607 MHz
Boost: 1733 MHz
Memory: 5005/10010 MHz

MSI GTX 1080 GAMING X clocks:

Base: 1709 MHz - 6.3% higher than FE
Boost: 1848 MHz - 6.6% higher than FE
Memory: 5005/10010 MHz - same as FE

MSI GTX 1080 GAMING X Overclocked clocks:

Base: 1789 MHz - 11.3% higher than FE
Boost: 1955~2067 MHz - 12.8%-19.3% higher than FE
Memory: 5622/11244 MHz - 12.3% higher than FE
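
For reference, those percentages are just (new clock − FE clock) / FE clock, e.g. (1709 − 1607) / 1607 ≈ 6.3% for the factory base clock.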

If anything, tweaking and overclocking has become more complicated starting with Pascal. You'll see that most cards out there all will tweak to roughly the same levels due to all kinds of hardware protection kicking in.

We applied the following settings:

  • Temp Target 95 Degrees C
  • CPU clock +80 MHz
  • Mem clock +575 MHz
  • Voltage +100%
  • FAN RPM 60% (remains really silent with Twin Frozr VI)
The Boost clock will now run at roughly 1950~2067 MHz depending on the power and temperature signature. The GPU voltage and clock frequency will be continuously and dynamically adjusted to match the power and temperature targets against the increased core clock. In FireStrike we are now hovering at the 2 GHz marker on the Boost frequency, for example, but some games jumped to roughly 2.1 GHz one second and dipped below 2 GHz the next.

http://www.guru3d.com/articles_pages/msi_geforce_gtx_1080_gaming_x_8g_review,38.html
 
Well, it looks like the clock speeds are already too close to the power/temp limits (even at stock speeds anyway), so it's not very surprising. That could change with a special BIOS that allows running outside those limits.
 
What does overclocking Nvidia cards actually do? Is there a set limit to the number of boost bins above the set core clock?

If there was no limit I imagine the card would always boost to the same level regardless of the "base" clock.
 
Rys' code mentions compute level 5.3 (Tegra X1) and higher.
Maxwell 2 is compute level 5.2.

IOW: Maxwell 2 doesn't have native FP16 at all.

At least that's how I understand it.

I thought maybe I was thinking of Kepler or Tesla, so I did a bit of checking, but only found Tesla cards.

Peak workload figures:
The M40 has the same figure for both FP32 and FP16: 6,844 Gflops.
Checking the Tegra K1, it had the same figure for both FP32 and FP16: 365 Gflops.
Also, the Tesla K40 had the same figure for both FP32 and FP16: 5,040 Gflops.
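
Those peak figures are just 2 ops/clock × CUDA cores × boost clock; e.g. for the M40, 2 × 3072 × 1.114 GHz ≈ 6,844 Gflops.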

So was this only applicable to Tesla and, say, the Tegra?
The context is my earlier point about peak performance and the 1:1 ratio: does that mean some older cards may be faster than the 1080?
Cheers
 
Well, it looks like the clock speeds are already too close to the power/temp limits (even at stock speeds anyway), so it's not very surprising. That could change with a special BIOS that allows running outside those limits.
You are correct with regard to the BIOS. My EVGA 780s come with 2 different BIOSes, and the GTX 1080 Classified will have 3.
 
I thought maybe I was thinking of Kepler or Tesla, so I did a bit of checking, but only found Tesla cards.

Peak workload figures:
The M40 has the same figure for both FP32 and FP16: 6,844 Gflops.
Checking the Tegra K1, it had the same figure for both FP32 and FP16: 365 Gflops.
Also, the Tesla K40 had the same figure for both FP32 and FP16: 5,040 Gflops.

So was this only applicable to Tesla and, say, the Tegra?
The context is my earlier point about peak performance and the 1:1 ratio: does that mean some older cards may be faster than the 1080?
I tried to find some links stating FP16 numbers for older Tesla GPUs but couldn't find anything. Any links?
 
I tried to find some links stating FP16 numbers for older Tesla GPUs but couldn't find anything. Any links?
I want to double-check the couple of sources I had to be 100% sure about the upper Tesla models, as maybe that is what had me wrong in the first place.

The only one I think would be accepted for now relates to the Tegra K1, and that is page 13 of the Nvidia whitepaper: http://international.download.nvidia.com/pdf/tegra/Tegra-X1-whitepaper-v1.0.pdf
It shows the figure for the K1 (same fp16 and fp32) and also the X1 (where it is now doubled).

Cheers
 
Total War WARHAMMER DX12: PC graphics performance benchmark review w/GTX 1080/1070
This is an AMD title and the benchmark does enforce mandatory MLAA as MSAA is mysteriously missing. And yes, MLAA is better suited (well optimized) for AMD cards. So from our point of view there is a somewhat unfair performance benefit for AMD; it's the reality anno 2016, I'm afraid.
....
And yeah, Warhammer much to my surprise is INCREDIBLY CPU dependent on core clock frequency, but more importantly... the number of CPU cores. It is totally not what you expect with a DX12 title, as a lesser processor should benefit from all the DX12 goodness.

http://www.guru3d.com/articles_page..._graphics_performance_benchmark_review,1.html
 