Nvidia Pascal Announcement

Ok, let me put it this way: Even if no special FP16 goodness was in the GP104 hardware, it should be able to execute FP16 at the same rate as FP32, like Maxwell and Kepler before it, right?

Kepler and Maxwell can't perform any fp16 computations unless there are some triple-secret non-CUDA instructions that I don't know about.

I'm guessing the source of confusion here is that you can get "free" conversion from fp16 to fp32 if you pull your data through the texture hardware.

Something like the following would load 4 f16 elements via a texture and auto-magically convert them to 32-bit floats:

Code:
tex.1d.v4.f32.f16 {r0,r1,r2,r3}, [tex_a,idx];
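To make the "store as fp16, compute as fp32" idea concrete, here is a minimal host-side sketch in Python. It is not what the texture unit does internally. It only mimics the visible effect of the instruction above: four 16-bit values come in and four 32-bit floats come out. The helper name `f16_bytes_to_f32` and the sample values are made up for illustration.

```python
import struct

def f16_bytes_to_f32(raw):
    """Decode packed little-endian IEEE 754 half floats and promote them to fp32."""
    count = len(raw) // 2
    halves = struct.unpack(f"<{count}e", raw)  # 'e' = binary16 (half precision)
    # Round-trip each value through a 32-bit float: every fp16 value is
    # exactly representable in fp32, so the promotion loses nothing.
    return [struct.unpack("<f", struct.pack("<f", h))[0] for h in halves]

# Four fp16 elements occupy 8 bytes, half the footprint of four fp32 values.
texel = struct.pack("<4e", 0.5, 1.0, -2.25, 1024.0)
r0, r1, r2, r3 = f16_bytes_to_f32(texel)
print(len(texel), r0, r1, r2, r3)
```

The saving is in storage and bandwidth only. All arithmetic after the load still runs at fp32 rate.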
 
I thought there was a presentation somewhere where Nvidia gave a rough area cost for creating an FP32 unit that could split into two FP16 ops.
The cost was modest, so it might be worthwhile in a chip that needs to cater to a variety of compute workloads.

One possibility is that the overhead is not that onerous if other SM structures around the FP ALUs already provide some slack, for example data paths and neighboring units sized to feed the larger DP units and their operands, or the larger register-file-per-ALU ratio.
 
Kepler and Maxwell can't perform any fp16 computations unless there are some triple-secret non-CUDA instructions that I don't know about.

I'm guessing the source of confusion here is that you can get "free" conversion from fp16 to fp32 if you pull your data through the texture hardware.

Something like the following would load 4 f16 elements via a texture and auto-magically convert them to 32-bit floats:

Code:
tex.1d.v4.f32.f16 {r0,r1,r2,r3}, [tex_a,idx];
Like that, for example, yes. With promotion to FP32, of course; I thought that much was clear from the start. Sorry.
 
Do you have a reference? It would help to clarify the situation.
I mentioned in the past that NextPlatform quoted Jonah Alben as suggesting FP16 support is integral to the FP32 CUDA core, when they discussed the multi-precision capabilities of Pascal and the P100.


Edit:
Also, going by the whitepaper, it seems Tegra X1 only has 256 CUDA cores and the FP16 support is in those.

Cheers
Nothing in the Tegra X1 white paper or in Jonah Alben's comments contradicts what I've said, or supports the idea that the same units are used in GP100 and GP10x, or the idea that GP10x's slow FP16 rate is due to throttling.
 
Yes, you said that repeatedly now. Is there any source for it apart from what one dev divined out of assembly code? I mean, I'm not at all against those propositions, but I'm interested in whether or not they have a substantial base.
All I can do is tell the truth. You journalists should be careful to avoid publishing speculation.
 
Let's be fair: Carsten is trying to confirm what you, an anonymous poster on the internet, are claiming to be the truth. That's exactly what a good journalist is supposed to be doing.
Sure. It's all good as long as the skepticism and confirmation seeking is symmetric.

There is no evidence to suggest GP10x has 2x rate FP16 hardware, so I'd like to see some digging at those unfounded claims.
 
Cross-pollination from the Nvidia devtalk forum: provisional Amber benchmarks for 1080 and P100 appliance. Everything is FP32 or fixed point integer.

Ratios between P100 and 1080:
Number of cores: 1.4
Clocks: 1.4/1.7 = ~0.82
core ratio * clock ratio: ~1.15

Depending on the benchmark, the P100/GTX 1080 performance ratio varies between 0.98 and 1.75, with a median somewhere around 1.45.
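For anyone who wants to redo the arithmetic, here is a small Python sketch using only the ratios quoted above. The variable names are mine, and the naive model (throughput scales with cores times clock) is an assumption, not something the benchmark author claimed.

```python
# Ratios as quoted in the post (P100 relative to GTX 1080).
core_ratio = 1.4            # number of cores
clock_ratio = 1.4 / 1.7     # clocks, about 0.82

# Naive expectation if FP32 throughput scaled purely with cores * clock.
expected = core_ratio * clock_ratio

def relative_to_expected(observed):
    """How far an observed P100/1080 ratio sits above or below the naive estimate."""
    return observed / expected

for label, observed in [("low", 0.98), ("median", 1.45), ("high", 1.75)]:
    print(f"{label}: observed {observed:.2f}, "
          f"{relative_to_expected(observed):.2f}x the naive estimate")
print(f"naive estimate: {expected:.2f}")
```

With the unrounded clock ratio the naive estimate comes out near 1.15, so the median result sits well above what cores and clocks alone would predict.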
 
All I can do is tell the truth. You journalists should be careful to avoid publishing speculation.
All I can do is verify sources, and so far I see just another posting which I can choose to believe or not. Much like church, which I opted out of.

Sure. It's all good as long as the skepticism and confirmation seeking is symmetric.

There is no evidence to suggest GP10x has 2x rate FP16 hardware, so I'd like to see some digging at those unfounded claims.
And that's exactly what I wrote in my article about this topic (translated from the German):

In what are still four Graphics Processing Clusters (GPCs), there are now five SMs each instead of four, for a total of 20 instead of 16, each still with 4 x 32 FP32 ALUs, 4 x 8 special function units, 4 x 8 load/store units, 2 x 4 texture units and even four FP64 ALUs per SM. Not even the doubled FP16 throughput, identified as an important improvement for GP100, was retained.
Google Translate works reasonably well, if not in grammar then at least in general gist, for this part of the article:
http://www.pcgameshardware.de/Nvidi...598/Specials/Benchmark-Test-Video-1195464/#a3
(thx for reminding me to look this up, found and corrected a typo.)
 
Nothing in the Tegra X1 white paper or in Jonah Alben's comments contradicts what I've said, or supports the idea that the same units are used in GP100 and GP10x, or the idea that GP10x's slow FP16 rate is due to throttling.
Well, the Tegra X1 whitepaper shows there is no special FP16 core and that FP16 runs on the FP32 CUDA cores.

Jonah's comment shows that GP100 uses the same FP32 CUDA core for FP16, as the only GP100 GPU is the P100.
Cheers
 
Kepler and Maxwell can't perform any fp16 computations unless there are some triple-secret non-CUDA instructions that I don't know about.

I'm guessing the source of confusion here is that you can get "free" conversion from fp16 to fp32 if you pull your data through the texture hardware.

Something like the following would load 4 f16 elements via a texture and auto-magically convert them to 32-bit floats:

Code:
tex.1d.v4.f32.f16 {r0,r1,r2,r3}, [tex_a,idx];
It cannot perform them natively, but then what is cublasSgemmEx for (CUDA 7.5 brought in more support for FP16)?
I appreciate it is not free.
Is there any workload benchmark showing both in terms of GFLOPS?
Cheers
 
Sure. It's all good as long as the skepticism and confirmation seeking is symmetric.

There is no evidence to suggest GP10x has 2x rate FP16 hardware, so I'd like to see some digging at those unfounded claims.

So why do the whitepapers for the Tegra X1 and for Tesla Pascal not show separate FP16 hardware, and why does Jonah mention that they tweaked the FP32 CUDA cores?
Again, from an SM perspective the Tesla Pascal is nearly identical internally to what Nvidia has shown for Maxwell 2 (such as Titan X) with regard to LD/ST and SFU.

I appreciate the CUDA cores may have subtly changed between generations, but this would be an evolution towards them becoming more mixed-precision, with FP16 handled within the FP32 CUDA core.
With that in mind, I would find it strange if they used one FP32 CUDA core hardware design for the P100 and an older 'Maxwell' generation design on the lower Pascal GPUs.
Logistically and from an engineering perspective, that is asking for headaches.

Thanks
 
Like... always*? Just dropping the M in the marketing slides as well as in the products, like they did with the GTX 980 (which is actually way superior to desktop 980s, b/c of the 8 GiByte framebuffer). :)

*I mean the fact that those were the same ASICs just differently binned and configured.
 
Jonah's comment shows that GP100 uses the same FP32 CUDA core for FP16, as the only GP100 GPU is the P100.
Cheers
I don't know the context of the quote but I read it as saying they didn't use the 32 bit ALU. Or he was asked if the feature is present in parts other than GP100 and he was explaining the cost.
 
I don't know the context of the quote but I read it as saying they didn't use the 32 bit ALU. Or he was asked if the feature is present in parts other than GP100 and he was explaining the cost.
The context:
While the 32-bit CUDA cores support 32-bit and half precision 16-bit processing by crunching a pair of 16-bit instructions at the same time (which effectively doubles the floating point operations per second on 16-bit datasets), the 64-bit DP units are not able to chew through two 32-bit or four 16-bit instructions in a single clock.
When we suggested to Jonah Alben, senior vice president of GPU engineering at Nvidia, that it was a shame that these 64-bit units could not be used in such a manner

The response:
he said that the support for FP16 math required tweaks to the FP32 CUDA cores and that the register bandwidth would have been too limited to run FP16 instructions through both sets of elements at the same time.
NextPlatform added the following in the same context, which I think they understand well because of their discussion with Jonah Alben:
But it would be cool if it were possible to do this, and perhaps even cooler to be able to create a CUDA core that spoke FP16, FP32, and FP64 at the same time.
So they are saying it is great that the FP32 cores are mixed-precision, but that it would be cooler if the mixed precision also applied to the FP64 CUDA cores.
I would say there is a fair possibility that this is where Volta is heading.
Cheers
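For readers unfamiliar with what "crunching a pair of 16-bit instructions at the same time" means at the data level, here is a rough Python sketch of two fp16 values sharing one 32-bit register and being added lane by lane. This only models the data layout and the result. It says nothing about how the GP100 ALU is actually built, and the function names are invented for this example.

```python
import struct

def pack_half2(lo, hi):
    """Pack two fp16 values into one 32-bit word, with lo in the low 16 bits."""
    return struct.unpack("<I", struct.pack("<2e", lo, hi))[0]

def unpack_half2(word):
    """Split a 32-bit word back into its two fp16 lanes (lo, hi)."""
    return struct.unpack("<2e", struct.pack("<I", word))

def half2_add(a, b):
    """One 'instruction' that produces two fp16 results from two packed operands."""
    a_lo, a_hi = unpack_half2(a)
    b_lo, b_hi = unpack_half2(b)
    # Each lane is rounded back to fp16 when it is repacked.
    return pack_half2(a_lo + b_lo, a_hi + b_hi)

x = pack_half2(1.0, 2.0)
y = pack_half2(0.5, 0.25)
print(hex(x), unpack_half2(half2_add(x, y)))
```

The register footprint per operand pair is the same 32 bits as a single FP32 value, which is why the doubled FP16 rate can fit the existing register bandwidth for the FP32 cores, as the quoted comment suggests.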
 