Nvidia Pascal Announcement

Discussion in 'Architecture and Products' started by huebie, Apr 5, 2016.

  1. Ext3h

    Regular Newcomer

    Joined:
    Sep 4, 2015
    Messages:
    384
    Likes Received:
    389
    Thanks. __half2 is already the packed format (2x half precision in a 32bit word), contrary to the old __half which is just 16 significant bits in a 32 bit word.
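    For illustration, here's a minimal sketch of the two layouts using the cuda_fp16.h conversion intrinsics (the variable names are just placeholders):
    Code:
    #include <cuda_fp16.h>

    __device__ void pack_demo(float lo, float hi)
    {
        __half  s = __float2half(lo);           // one FP16 value (old-style scalar __half)
        __half2 p = __floats2half2_rn(lo, hi);  // two FP16 values packed into one 32-bit register
        // unpack again for FP32 math
        float lo_back = __low2float(p);
        float hi_back = __high2float(p);
        (void)s; (void)lo_back; (void)hi_back;  // silence unused-variable warnings
    }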
     
  2. Grall

    Grall Invisible Member
    Legend

    Joined:
    Apr 14, 2002
    Messages:
    10,801
    Likes Received:
    2,175
    Location:
    La-la land
    I've seen several posts here by sebbbi and whatnot claiming that FP16 would be beneficial to graphics tasks as well. The thing is not that they didn't make FP16 2x as fast as FP32, it's that they made it 64x slower than FP32. What in the name of fuck is that. Seriously.

    At least if FP16 had been the same rate as FP32, one could have benefited from the data storage gains even if there were no performance gains, but this is ridiculous. We can only hope AMD rides to the rescue here.
     
    Lightman, Kaarlisk and CarstenS like this.
  3. Ext3h

    Regular Newcomer

    Joined:
    Sep 4, 2015
    Messages:
    384
    Likes Received:
    389
    Why would Nvidia possibly do FP16x2 with a dedicated unit? Is that tiny efficiency gain on the pure FP32 core, compared to a mixed-precision FPU, really worth killing FP16 altogether?

    Btw.: old FP16 should still work fine; at least I don't see why it wouldn't still be handled by the FP32 CUDA cores at truncated precision. Even though it's a really strange decision to include an FP16x2 unit at all then, if you could achieve higher throughput just by unpacking and running through the regular SP cores...
     
  4. Ryan Smith

    Regular

    Joined:
    Mar 26, 2010
    Messages:
    623
    Likes Received:
    1,095
    Location:
    PCIe x16_1
    IMO, you guys are overthinking this. FP16 is a valuable feature to the deep learning community, so NVIDIA is relegating it to Tesla.

    Note that, as best as I can currently tell, FP16 performance is only limited for CUDA (and potentially OpenCL) use cases. Sandra's DX Compute Shader test currently reports that FP16 operations run at FP32 speeds, though I don't know a ton about the test or just how much native FP16 support DXCS really offers.
     


    homerdog, BRiT, fellix and 1 other person like this.
  5. CarstenS

    Veteran Subscriber

    Joined:
    May 31, 2002
    Messages:
    4,934
    Likes Received:
    2,264
    Location:
    Germany
    It's because it is not exposed as native in the other APIs, so it's probably running through FP32 mode there. Weird.
     
  6. Alessio1989

    Regular Newcomer

    Joined:
    Jun 6, 2015
    Messages:
    596
    Likes Received:
    306
    FP16 is great for tons of operations in graphics where FP16 is precise enough. But if it doesn't improve the rate over FP32, then there is no reason to support FP16. So, FP16 support in the consumer version of Pascal is a joke; "deep learning only / machine learning only" is just an excuse.

    I am talking about architecture features, not about performance. Everyone can affirm without any doubt that the performance is great.
     
  7. CarstenS

    Veteran Subscriber

    Joined:
    May 31, 2002
    Messages:
    4,934
    Likes Received:
    2,264
    Location:
    Germany
    So, that leaves only one possible reason: they crippled it in order to make sure it is not used for games (or by developers getting in cheap, like it was done for cryptocurrencies), where architectures supporting it natively could actually draw an advantage from it. Is that what you're saying?
     
  8. Ext3h

    Regular Newcomer

    Joined:
    Sep 4, 2015
    Messages:
    384
    Likes Received:
    389
    Because those are most likely the old, non-vectorized operations, which are essentially just single-precision ops where the result is truncated and 16 bits of the 32-bit register are wasted. They can still be used to pack the 16-bit floats for compact storage, but they can't be utilized to improve the ALU throughput.
     
    BRiT likes this.
  9. constant

    Newcomer

    Joined:
    Feb 9, 2014
    Messages:
    22
    Likes Received:
    8
    Honestly, claiming that FP16 throughput is crippled to 1/64th is slightly disingenuous, as in practice you would still be able to get full FP32 throughput WITH the benefit of truncated storage.

    Just do:

    //load
    __half2float conversion
    // Compute as if FP32
    __float2half conversion
    //store
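    Concretely, a minimal sketch of that pattern as a complete kernel (assuming the cuda_fp16.h intrinsics; the scaling math and names are illustrative):
    Code:
    #include <cuda_fp16.h>

    __global__ void scale_half(const __half* in, __half* out, float scale, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) {
            float x = __half2float(in[i]);   // load + up-convert
            float y = x * scale;             // compute as if FP32
            out[i] = __float2half(y);        // down-convert + store
        }
    }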
     
  10. CSI PC

    Veteran Newcomer

    Joined:
    Sep 2, 2015
    Messages:
    2,050
    Likes Received:
    844
    Also worth remembering that NVIDIA ensured some aspects of Tesla did not make it to the consumer products; there were certainly differences in Dynamic Parallelism between Tesla and consumer parts going back to the Kepler days.
    Cheers
     
  11. Ext3h

    Regular Newcomer

    Joined:
    Sep 4, 2015
    Messages:
    384
    Likes Received:
    389
    As long as you are using the __half type, you still get full SP throughput as it is. The problem only occurs with the new __half2 type, which is two 16-bit values packed into the low and high words of a 32-bit register.
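    For illustration, the scalar and packed arithmetic intrinsics look like this (a sketch; the half arithmetic intrinsics require __CUDA_ARCH__ >= 530, and the names here are placeholders):
    Code:
    #include <cuda_fp16.h>

    __device__ void add_demo(__half a, __half b, __half2 pa, __half2 pb)
    {
    #if __CUDA_ARCH__ >= 530
        __half  s = __hadd(a, b);      // scalar form: one FP16 add
        __half2 p = __hadd2(pa, pb);   // packed form: two FP16 adds in one instruction
        (void)s; (void)p;
    #endif
    }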
     
    BRiT likes this.
  12. RecessionCone

    Regular Subscriber

    Joined:
    Feb 27, 2010
    Messages:
    499
    Likes Received:
    177
    You're mistaken on this. The FP16 instructions in Nvidia's implementation use dedicated FMA units that consume a serious amount of transistors and die area. Just like FP64: the big xx0 chip has full rate units, but all the xxy chips have just enough units to allow software to run.
     
    pharma likes this.
  13. ToTTenTranz

    Legend Veteran Subscriber

    Joined:
    Jul 7, 2008
    Messages:
    11,031
    Likes Received:
    5,576
    Yet IMG/ PowerVR has had dedicated FP16 units in their mobile GPUs for years, and we've seen developers go on record saying that FP16 is enough for a number of shading effects and using the smaller variables when possible translates into lower bandwidth requirements, higher performance and lower power consumption. IIRC, GCN was initially praised for bringing dedicated instructions for FP16.

    So which one is it?
     
    pixelio likes this.
  14. CarstenS

    Veteran Subscriber

    Joined:
    May 31, 2002
    Messages:
    4,934
    Likes Received:
    2,264
    Location:
    Germany
    Are you saying that GP100 has
    - 3,840 FP32 ALUs and
    - 1,920 FP64 ALUs and
    - what, 7,680 FP16 ALUs,

    all tied to the same register files and serviced by the same schedulers? Sounds like a massive waste of resources to my layman's mind.

    Maybe the FP64 units rather act as quad-speed FP16 ones? So they could easily adjust architectures - like reducing THEM back to 1/8th speed for consumer products?
     
    Lightman, Grall and BRiT like this.
  15. Ext3h

    Regular Newcomer

    Joined:
    Sep 4, 2015
    Messages:
    384
    Likes Received:
    389
    Option B: the GP100 actually has all mixed-precision ALUs, while the GP104 has mostly single-mode FPUs and only one or two full mixed-precision cores per SMM?

    The fundamental assumption that all CUDA cores are equal is possibly wrong. It might actually not be 128 CUDA cores on the GP104, but more like 127+1, except that this time the 127 are the incomplete ones.
     
  16. CSI PC

    Veteran Newcomer

    Joined:
    Sep 2, 2015
    Messages:
    2,050
    Likes Received:
    844
    Just to add: why does the Pascal Tesla architecture only show FP64 and FP32 cores if there were dedicated FP16 ALUs?

    Also quoting NextPlatform from their Pascal Tesla article - it looks like part of this information came from discussions with Jonah Alben (involved with GPU design at NVIDIA for many years).
    Cheers
     
    Lightman, pixelio, pharma and 2 others like this.
  17. Ext3h

    Regular Newcomer

    Joined:
    Sep 4, 2015
    Messages:
    384
    Likes Received:
    389
    I don't think that quote actually refers to the CUDA cores used in the GP104.

    If my suspicion is correct, then GP104 isn't fully equipped with the "new" cores, but still uses old single-precision cores for the most part, plus only a couple of the new ones to provide at least formally the same instruction set. Essentially the same way it only includes enough double-precision cores to formally fulfill the specification, but not enough that you could actually make good use of them.

    With the GP100, Nvidia apparently didn't have to make such a compromise: every single FP32 core supports the new vectorized FP16 ops, and there are enough FP64 cores to saturate the feeds as well.

    In order to verify that, we would need to test whether the throughput is also limited for other instructions newly added with Pascal, at least for those where one could assume they would require a significant amount of additional die space to implement.
     
  18. pixelio

    Newcomer

    Joined:
    Feb 17, 2014
    Messages:
    47
    Likes Received:
    75
    Location:
    Seattle, WA
    A few points:
    1. Scalar fp16<>fp32 conversions aren't free. They're 32 ops/clock in CUDA. Note that you can get "free" type conversions on read-only texture loads but that's not what we're discussing.
    2. Type conversions really begin to add up. For example, a simulated fp16x2 FMA operation would require 8 type conversions, two fp32 FMAs and, in my example, one register "merge". That's 11 ops!
    3. A native full-rate fp16x2 FMA would have throughput of 2 FMAs per clock.
    4. I don't recall the source, but I thought I read that an fp16 unit is like 1/3rd the size of an fp32 unit. Presumably fp16x2 would be 2/3rds. That's not "free".
    5. An fp16 is suitable for representing color and performing blends. It's also useful for image processing and saves bandwidth.
    6. But why the excitement over native fp16x2 in a shader/compute kernel? Register pressure! You can squeeze more of them into a register file while doubling your throughput.
    7. An incomplete list of platforms with "wide" fp16x2 support: Intel Gen 8+, PowerVR, Adreno 530, Tegra X1, GP100, ...
    Here's a simulated fp16x2 FMA:
    Code:
    #include <cuda_fp16.h>

    // Packed FP16 FMA: one native HFMA2 on sm_53+, otherwise simulated with FP32 math.
    __device__ half2 fma_half2(half2 a, half2 b, half2 c)
    {
    #if __CUDA_ARCH__ >= 530
      return __hfma2(a, b, c);   // single fp16x2 FMA instruction
    #else
      // 8 type conversions, two fp32 FMAs and a register merge (see the SASS below)
      return __floats2half2_rn(fmaf( __low2float(a),  __low2float(b),  __low2float(c)),
                               fmaf(__high2float(a), __high2float(b), __high2float(c)));
    #endif
    }
    
    Here's the SASS for the simulated hfma2():
    Code:
    F2F.F32.F16 R3, R0;
    F2F.F32.F16 R6, R0.H1;
    F2F.F32.F16 R4, R1;
    F2F.F32.F16 R7, R1.H1;
    F2F.F32.F16 R5, R2;
    F2F.F32.F16 R8, R2.H1;
    FFMA R3, R3, R4, R5;
    FFMA R4, R6, R7, R8;
    F2F.F16.F32 R0, R3;
    F2F.F16.F32 R5, R4;
    XMAD.PSL.CLO R0, R5, 0x1, R0;
    While the native hfma2() is just a single instruction:
    Code:
    HFMA2 R3, R0, R1, R2;
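    For completeness, a hypothetical kernel using the helper above on a packed array (the kernel name and launch setup are illustrative, not from the original post):
    Code:
    __global__ void fma_half2_kernel(const half2* a, const half2* b, half2* c, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            c[i] = fma_half2(a[i], b[i], c[i]);  // two FP16 FMA lanes per call
    }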
     
    #1178 pixelio, May 29, 2016
    Last edited: May 29, 2016
    nnunn, Grall, Alessio1989 and 5 others like this.
  19. CSI PC

    Veteran Newcomer

    Joined:
    Sep 2, 2015
    Messages:
    2,050
    Likes Received:
    844
    So you do not think they can tweak the FP32 core in the P100 (Tesla in general) back to being 1:1 for FP16 when released for consumers?
    They also mention that in theory they could do FP16 over FP64, giving 1:4 in one cycle with these CUDA cores, it's just not doable due to other constraints.
    Anyway, it makes me wonder where this will be going with Volta, which could be a monster in terms of mixed-precision capability.

    Cheers
     
  20. Infinisearch

    Veteran Regular

    Joined:
    Jul 22, 2004
    Messages:
    739
    Likes Received:
    139
    Location:
    USA
    I was under the same impression: that FP16 was good enough precision-wise for certain things and there's performance to be had by using it.
     