Nvidia Pascal Announcement

With a production code of 1542, it is older than (production samples of) GP104, which are from the beginning of April 2016.
Good catch.
Wonder how many GP106 dies have so far been manufactured and set aside for the GTX 1060; it looks like Nvidia is further along with the smaller die than many publications suggest.
Cheers
 
Yes, and it ran at the same rate as FP32. Now it's slower. The question is: why not leave it as it was?
It's implemented with dedicated units now. Just like FP64.

GP100 has lots of FP16 units (for deep learning training). GP10x does not.

Conversely, GP10x has lots of Int8 dot product units (for deep learning inference). GP100 does not.

The conspiracy theories about throttling are wrong. The chips are just different, with a different balance of functional units.
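For anyone curious what those two instruction families actually look like from CUDA, here is a minimal sketch (assuming CUDA 8.0 headers; the fp16x2 intrinsics need sm_53+ and __dp4a needs sm_61+, so compile with something like nvcc -arch=sm_61). How fast each kernel runs on a given chip then comes down to how many of the corresponding units that chip has:

```
#include <cuda_fp16.h>

// One instruction does two half-precision FMAs per thread (the "fp16x2" path).
// GP100 is the Pascal part with a full-rate version of this.
__global__ void fp16x2_fma(const __half2* a, const __half2* b, __half2* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = __hfma2(a[i], b[i], __float2half2_rn(0.0f));
}

// __dp4a: dot product of four packed signed bytes plus a 32-bit accumulator,
// i.e. the int8 inference path GP10x emphasises instead.
__global__ void int8_dot(const int* a, const int* b, int* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = __dp4a(a[i], b[i], 0);
}
```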
 
It's implemented with dedicated units now. Just like FP64.

GP100 has lots of FP16 units (for deep learning training). GP10x does not.

Conversely, GP10x has lots of Int8 dot product units (for deep learning inference). GP100 does not.

The conspiracy theories about throttling are wrong. The chips are just different, with a different balance of functional units.

Do you have a reference, as it would help to clarify the situation?
I mentioned in the past that NextPlatform quoted Jonah Alben suggesting it is integral to the FP32 CUDA core when they discussed the multi-precision of Pascal and the P100.
When we suggested to Jonah Alben, senior vice president of GPU engineering at Nvidia, that it was a shame that these 64-bit units could not be used in such a manner, he said that the support for FP16 math required tweaks to the FP32 CUDA cores and that the register bandwidth would have been too limited to run FP16 instructions through both sets of elements at the same time.

Edit:
Also, it seems Tegra X1 only has 256 CUDA cores and the FP16 support is in those, going by the whitepaper.

Cheers
 
Do you have a reference for that?
I mentioned in the past that NextPlatform quoted Jonah Alben suggesting it is integral to the FP32 CUDA core when they discussed the multi-precision of Pascal and the P100.

That seems to indicate tweaks were made to the FP32 ALU implementation in the GP100.
It isn't strictly necessary that one implementation of FP32 for one model is the same as the other, just as some GCN designs that are baseline FP32 have varying FP64 ratios baked in.

I am speculating a lot on this, but it does seem like Nvidia prefers to differentiate units more readily than AMD. The differentiated units skimp on some of the overhead built into hardware that iterates repeatedly for different precisions, like GCN, and if there's less variability in the execution loop of the ALUs, they might run "tighter" than a base architecture whose critical path lets various disparate operations sneak in on the same route.
 
That seems to indicate tweaks were made to the FP32 ALU implementation in the GP100.
It isn't strictly necessary that one implementation of FP32 for one model is the same as the other, just as some GCN designs that are baseline FP32 have varying FP64 ratios baked in.

I am speculating a lot on this, but it does seem like Nvidia prefers to differentiate units more readily than AMD. The differentiated units skimp on some of the overhead built into hardware that iterates repeatedly for different precisions, like GCN, and if there's less variability in the execution loop of the ALUs, they might run "tighter" than a base architecture whose critical path lets various disparate operations sneak in on the same route.
Err I was responding to a post that specifically mentions GP100 :)

And yeah, going by historical trends with NVIDIA, we may only see the full implementation on the same die used for the Titan, top Quadro, and Tesla P102; similar to the days of Kepler GK110.
Still, I would expect the same CUDA cores, just artificially tweaked (IMO), the same way Kepler Tesla had full Dynamic Parallelism and the consumer cards did not.
Cheers
 
It's implemented with dedicated units now. Just like FP64.
GP100 has lots of FP16 units (for deep learning training). GP10x does not.
Conversely, GP10x has lots of Int8 dot product units (for deep learning inference). GP100 does not.
The conspiracy theories about throttling are wrong. The chips are just different, with a different balance of functional units.

Yes, you've said that repeatedly now. Is there any source for it apart from what one dev divined out of assembly code? I mean, I'm not at all against those propositions, but I'm interested in whether or not they have a substantial base.
 
Is there any source for it apart from what one dev divined out of assembly code? I mean, I'm not at all against those propositions, but I'm interested in whether or not they have a substantial base.
It's unlikely GP104 has the full set of fp16x2 ALUs. There is no strong basis for concluding the existence of artificially disabled fp16x2 units. We don't know the details of what is there, and some of the evidence we assembled is contradictory, but Occam's razor says the simplest explanation, that fp16x2 is a hardware feature only on P100 (and Tegra), is likely correct. We CUDA guys are especially hopeful, but it's just hope, not actual expectation anymore.
 
It's unlikely GP104 has the full set of fp16x2 ALUs. There is no strong basis for concluding the existence of artificially disabled fp16x2 units. We don't know the details of what is there, and some of the evidence we assembled is contradictory, but Occam's razor says the simplest explanation, that fp16x2 is a hardware feature only on P100 (and Tegra), is likely correct.
The strongest circumstantial evidence is probably the fact that the 8-bit operations are not restricted (has anyone actually verified this on real hardware?), even though they would be the best candidate for it, since they have no other reason to exist.
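For the "has anyone actually verified on real hardware?" part, a functional check is at least easy to write; here's a sketch, assuming CUDA 8.0 and an sm_61 part (GTX 1080/1070/1060 class), with a hypothetical file name dp4a_check.cu. Note it only confirms the instruction works and gives the right answer; checking whether its rate is restricted would need a timed loop on top of this:

```
#include <cstdio>
#include <cuda_runtime.h>

__global__ void dp4a_check(int* out)
{
    // Pack the bytes {1,2,3,4} and {5,6,7,8} into 32-bit ints.
    int a = 0x04030201;
    int b = 0x08070605;
    // __dp4a computes 1*5 + 2*6 + 3*7 + 4*8 + 0 = 70.
    *out = __dp4a(a, b, 0);
}

int main()
{
    int* d_out;
    int  h_out = 0;
    cudaMalloc(&d_out, sizeof(int));
    dp4a_check<<<1, 1>>>(d_out);
    cudaMemcpy(&h_out, d_out, sizeof(int), cudaMemcpyDeviceToHost);
    printf("dp4a(1..4, 5..8) = %d (expected 70)\n", h_out);
    cudaFree(d_out);
    return 0;
}
```

Compile with something like nvcc -arch=sm_61 dp4a_check.cu.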
 
It's implemented with dedicated units now. Just like FP64.

GP100 has lots of FP16 units (for deep learning training). GP10x does not.

Conversely, GP10x has lots of Int8 dot product units (for deep learning inference). GP100 does not.

The conspiracy theories about throttling are wrong. The chips are just different, with a different balance of functional units.

I don't say you are wrong, but that starts to add up to a lot of units: the FP32 units, half that number again for FP64 (at 1/2 rate), and 2x the number of FP32 units for FP16 (at a 2:1 FP32 rate), to say nothing of all the transistors attached to them plus redundancy (because it is not really the "ALU" itself that costs the most).

That seems to me a little bit "brute force". There is certainly another way to do it, closer to what AMD is doing with FP64 (at 1:2 on Hawaii).
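Just to put rough numbers on those ratios, here is a back-of-the-envelope sketch using the published Tesla P100 figures (3584 FP32 cores, roughly 1480 MHz boost). Whether the FP16 and FP64 rates come from separate units or shared ones is exactly the open question, but the ratios themselves are published:

```
#include <cstdio>

int main()
{
    const double fp32_cores  = 3584.0;   // Tesla P100
    const double boost_ghz   = 1.480;    // quoted boost clock
    const double fp32_tflops = fp32_cores * 2.0 * boost_ghz / 1000.0; // 2 ops per FMA

    printf("FP32: %.1f TFLOPS\n", fp32_tflops);        // ~10.6
    printf("FP16: %.1f TFLOPS\n", fp32_tflops * 2.0);  // ~21.2 (2:1 rate)
    printf("FP64: %.1f TFLOPS\n", fp32_tflops * 0.5);  // ~5.3  (1:2 rate)
    return 0;
}
```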
 
Maybe that's why it's a 610mm² 16nm die?

As I have said, I don't say it is wrong, it just seems a bit illogical to me from an engineering standpoint. Outside of cases where mixed precision will be used (and that will not be the case for traditional Tesla and Quadro workloads), it looks like a waste of die space and transistors, and so of production cost.

This could explain the "GP102" SKU name, coming without those additional capabilities, with the "units" not just disabled but completely missing.
(So this time it would be a first to have two completely separate SKUs that really differ in many aspects between the compute GPUs and the high-end GPUs, including die size, transistor count, and whatever else.)
GP102 could end up with a smaller die, with fewer transistors.
 
It's unlikely GP104 has the full set of fp16x2 ALUs. There is no strong basis for concluding the existence of artificially disabled fp16x2 units. We don't know the details of what is there, and some of the evidence we assembled is contradictory, but Occam's razor says the simplest explanation, that fp16x2 is a hardware feature only on P100 (and Tegra), is likely correct. We CUDA guys are especially hopeful, but it's just hope, not actual expectation anymore.

OK, let me put it this way: even if no special FP16 goodness was in the GP104 hardware, it should be able to execute FP16 at the same rate as FP32, like Maxwell and Kepler before it, right?
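In other words, something like the sketch below, which assumes nothing beyond the storage and conversion support in cuda_fp16.h (no fp16x2 ALUs at all): keep the data in half, promote to float, do the math on the ordinary FP32 units, and round back for storage. That path should run at FP32 rate on any chip:

```
#include <cuda_fp16.h>

__global__ void fp16_via_fp32(const __half* a, const __half* b, __half* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        // Promote to FP32, do the arithmetic on the regular FP32 units,
        // then round back to half for storage.
        float x = __half2float(a[i]);
        float y = __half2float(b[i]);
        out[i] = __float2half(x * y);
    }
}
```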
 