Nvidia Pascal Announcement

With a production code of 1542, it is older than (production samples of) GP104, which are from the beginning of April 2016.
Good catch.
Wonder how many GP106 dies have so far been manufactured and set aside for the GTX 1060; it looks like Nvidia is further along with the smaller die than many publications suggest.
Cheers
 
Yes, and it ran at the same rate as FP32. Now it's slower. The question is: why not leave it as it was?
It's implemented with dedicated units now. Just like FP64.

GP100 has lots of FP16 units (for deep learning training). GP10x does not.

Conversely, GP10x has lots of Int8 dot product units (for deep learning inference). GP100 does not.

The conspiracy theories about throttling are wrong. The chips are just different, with a different balance of functional units.
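For anyone curious what those two instruction families actually look like from CUDA, here is a minimal sketch (assuming CUDA 8.0 headers; the fp16x2 intrinsics need sm_53+ and __dp4a needs sm_61+, so compile with something like nvcc -arch=sm_61). How fast each kernel runs on a given chip then comes down to how many of the corresponding units that chip has:

```
#include <cuda_fp16.h>

// One instruction does two half-precision FMAs per thread (the "fp16x2" path).
// GP100 is the Pascal part with a full-rate version of this.
__global__ void fp16x2_fma(const __half2* a, const __half2* b, __half2* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = __hfma2(a[i], b[i], __float2half2_rn(0.0f));
}

// __dp4a: dot product of four packed signed bytes plus a 32-bit accumulator,
// i.e. the int8 inference path GP10x emphasises instead.
__global__ void int8_dot(const int* a, const int* b, int* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = __dp4a(a[i], b[i], 0);
}
```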
 
It's implemented with dedicated units now. Just like FP64.

GP100 has lots of FP16 units (for deep learning training). GP10x does not.

Conversely, GP10x has lots of Int8 dot product units (for deep learning inference). GP100 does not.

The conspiracy theories about throttling are wrong. The chips are just different, with a different balance of functional units.

Do you have a reference, as it would help to clarify the situation?
I mentioned in the past that NextPlatform quoted Jonah Alben suggesting it is integral to the FP32 CUDA core when they discussed the multi-precision of Pascal and the P100.
When we suggested to Jonah Alben, senior vice president of GPU engineering at Nvidia, that it was a shame that these 64-bit units could not be used in such a manner, he said that the support for FP16 math required tweaks to the FP32 CUDA cores and that the register bandwidth would have been too limited to run FP16 instructions through both sets of elements at the same time.

Edit:
Also, it seems Tegra X1 only has 256 CUDA cores and the FP16 support is in those, going by the whitepaper.

Cheers
 
Do you have a reference for that?
I mentioned in the past that NextPlatform quoted Jonah Alben suggesting it is integral to the FP32 CUDA core when they discussed the multi-precision of Pascal and the P100.

That seems to indicate tweaks were made to the FP32 ALU implementation in the GP100.
It isn't strictly necessary that one implementation of FP32 for one model is the same as the other, just as some GCN designs that are baseline FP32 have varying FP64 ratios baked in.

I am speculating a lot on this, but it does seem like Nvidia prefers to differentiate units more readily than AMD. The differentiated units skimp on some of the overhead built into hardware that iterates repeatedly for different precisions, like GCN, and if there's less variability in the execution loop of the ALUs, they might run "tighter" than a base architecture whose critical path lets various disparate operations sneak in on the same route.
 
That seems to indicate tweaks were made to the FP32 ALU implementation in the GP100.
It isn't strictly necessary that one implementation of FP32 for one model is the same as the other, just as some GCN designs that are baseline FP32 have varying FP64 ratios baked in.

I am speculating a lot on this, but it does seem like Nvidia prefers to differentiate units more readily than AMD. The differentiated units skimp on some of the overhead built into hardware that iterates repeatedly for different precisions, like GCN, and if there's less variability in the execution loop of the ALUs, they might run "tighter" than a base architecture whose critical path lets various disparate operations sneak in on the same route.
Err I was responding to a post that specifically mentions GP100 :)

And yeah, going by historical trends with NVIDIA, we may only see the full implementation on the same die used for the Titan, top Quadro, and Tesla P102; similar to the days of Kepler GK110.
Still, I would expect the same CUDA cores, just artificially tweaked (IMO), the same way Kepler Tesla had full Dynamic Parallelism and the consumer cards did not.
Cheers
 
It's implemented with dedicated units now. Just like FP64.
GP100 has lots of FP16 units (for deep learning training). GP10x does not.
Conversely, GP10x has lots of Int8 dot product units (for deep learning inference). GP100 does not.
The conspiracy theories about throttling are wrong. The chips are just different, with a different balance of functional units.

Yes, you've said that repeatedly now. Is there any source for it apart from what one dev divined out of assembly code? I mean, I'm not at all against those propositions, but I'm interested in whether or not they have a substantial base.
 
Is there any source for it apart from what one dev divined out of assembly code? I mean, I'm not at all against those propositions, but I'm interested in whether or not they have a substantial base.
It's unlikely GP104 has the full set of fp16x2 ALUs. There is no strong basis for concluding the existence of artificially disabled fp16x2 units. We don't know the details of what is there, and some of the evidence we assembled is contradictory, but Occam's razor says the simplest explanation, that fp16x2 is a hardware feature only on P100 (and Tegra), is likely correct. We CUDA guys are especially hopeful, but it's just hope, not actual expectation anymore.
 
It's unlikely GP104 has the full set of fp16x2 ALUs. There is no strong basis for concluding the existence of artificially disabled fp16x2 units. We don't know the details of what is there, and some of the evidence we assembled is contradictory, but Occam's razor says the simplest explanation, that fp16x2 is a hardware feature only on P100 (and Tegra), is likely correct.
The strongest circumstantial evidence is probably the fact that the 8-bit operations are not restricted (has anyone actually verified this on real hardware?), even though they would be the best candidate for it, since they have no other reason to exist.
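For the "has anyone actually verified on real hardware?" part, a functional check is at least easy to write; here's a sketch, assuming CUDA 8.0 and an sm_61 part (GTX 1080/1070/1060 class), with a hypothetical file name dp4a_check.cu. Note it only confirms the instruction works and gives the right answer; checking whether its rate is restricted would need a timed loop on top of this:

```
#include <cstdio>
#include <cuda_runtime.h>

__global__ void dp4a_check(int* out)
{
    // Pack the bytes {1,2,3,4} and {5,6,7,8} into 32-bit ints.
    int a = 0x04030201;
    int b = 0x08070605;
    // __dp4a computes 1*5 + 2*6 + 3*7 + 4*8 + 0 = 70.
    *out = __dp4a(a, b, 0);
}

int main()
{
    int* d_out;
    int  h_out = 0;
    cudaMalloc(&d_out, sizeof(int));
    dp4a_check<<<1, 1>>>(d_out);
    cudaMemcpy(&h_out, d_out, sizeof(int), cudaMemcpyDeviceToHost);
    printf("dp4a(1..4, 5..8) = %d (expected 70)\n", h_out);
    cudaFree(d_out);
    return 0;
}
```

Compile with something like nvcc -arch=sm_61 dp4a_check.cu.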
 
It's implemented with dedicated units now. Just like FP64.

GP100 has lots of FP16 units (for deep learning training). GP10x does not.

Conversely, GP10x has lots of Int8 dot product units (for deep learning inference). GP100 does not.

The conspiracy theories about throttling are wrong. The chips are just different, with a different balance of functional units.

I don't say you are wrong, but that starts to add up to a lot of units: the FP32 units, half that number again for FP64 (at 1/2 rate), and 2x the number of FP32 units for FP16 (at a 2:1 FP32 rate), to say nothing of all the transistors attached to them plus redundancy (because it is not really the "ALU" itself that costs the most).

That seems to me a little bit "brute force". There is certainly another way to do it, closer to what AMD is doing with FP64 (at 1:2 on Hawaii).
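Just to put rough numbers on those ratios, here is a back-of-the-envelope sketch using the published Tesla P100 figures (3584 FP32 cores, roughly 1480 MHz boost). Whether the FP16 and FP64 rates come from separate units or shared ones is exactly the open question, but the ratios themselves are published:

```
#include <cstdio>

int main()
{
    const double fp32_cores  = 3584.0;   // Tesla P100
    const double boost_ghz   = 1.480;    // quoted boost clock
    const double fp32_tflops = fp32_cores * 2.0 * boost_ghz / 1000.0; // 2 ops per FMA

    printf("FP32: %.1f TFLOPS\n", fp32_tflops);        // ~10.6
    printf("FP16: %.1f TFLOPS\n", fp32_tflops * 2.0);  // ~21.2 (2:1 rate)
    printf("FP64: %.1f TFLOPS\n", fp32_tflops * 0.5);  // ~5.3  (1:2 rate)
    return 0;
}
```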
 
Maybe that's why it's a 610mm² 16nm die?

As I have said, I don't say it is wrong, it just seems a bit illogical to me from an engineering standpoint. Outside of cases where mixed precision will be used (and that will not be the case for traditional Tesla and Quadro workloads), it looks like a waste of die space and transistors, and so of production cost.

This could explain the "GP102" SKU name, coming without those additional capabilities, with the "units" not just disabled but completely missing.
(So this time it would be a first to have two completely separate SKUs that really differ in many aspects between the compute GPUs and the high-end GPUs, including die size, transistor count, and whatever else.)
GP102 could end up with a smaller die, with fewer transistors.
 
It's unlikely GP104 has the full set of fp16x2 ALUs. There is no strong basis for concluding the existence of artificially disabled fp16x2 units. We don't know the details of what is there, and some of the evidence we assembled is contradictory, but Occam's razor says the simplest explanation, that fp16x2 is a hardware feature only on P100 (and Tegra), is likely correct. We CUDA guys are especially hopeful, but it's just hope, not actual expectation anymore.

OK, let me put it this way: even if no special FP16 goodness was in the GP104 hardware, it should be able to execute FP16 at the same rate as FP32, like Maxwell and Kepler before it, right?
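In other words, something like the sketch below, which assumes nothing beyond the storage and conversion support in cuda_fp16.h (no fp16x2 ALUs at all): keep the data in half, promote to float, do the math on the ordinary FP32 units, and round back for storage. That path should run at FP32 rate on any chip:

```
#include <cuda_fp16.h>

__global__ void fp16_via_fp32(const __half* a, const __half* b, __half* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        // Promote to FP32, do the arithmetic on the regular FP32 units,
        // then round back to half for storage.
        float x = __half2float(a[i]);
        float y = __half2float(b[i]);
        out[i] = __float2half(x * y);
    }
}
```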
 