More interesting is the fact that there's nothing below it, despite the fact that the AIDA64 developers have already confirmed that GP107 and GP108 exist too.
Good catch. With a production code of 1542, it's older than (production samples of) GP104, which are from the beginning of April 2016.
Yes, and it happened at the same rate as FP32. Now it's slower. The question is: why not leave it as it is?
It's implemented with dedicated units now. Just like FP64.
GP100 has lots of FP16 units (for deep learning training). GP10x does not.
Conversely, GP10x has lots of Int8 dot product units (for deep learning inference). GP100 does not.
The conspiracy theories about throttling are wrong. The chips are just different, with a different balance of functional units.
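For concreteness, here is a minimal CUDA sketch of the two instruction paths being contrasted, using the standard intrinsics; the kernels are illustrative, not from any post here. __hfma2 exercises the paired-FP16 units, __dp4a the 4-way INT8 dot product:

```
// Sketch of the two unit types as exposed in CUDA.
// __hfma2 requires sm_53+; full rate on GP100 (sm_60), much slower on GP104 (sm_61).
// __dp4a requires sm_61 (GP102/GP104/GP106) and does not exist on GP100 (sm_60).
// Build with: nvcc -arch=sm_61 units.cu
#include <cuda_fp16.h>

__global__ void fp16x2_fma(const half2* a, const half2* b, half2* c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        c[i] = __hfma2(a[i], b[i], c[i]);    // two FP16 FMAs per instruction
}

__global__ void int8_dot(const int* a, const int* b, int* acc, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        acc[i] = __dp4a(a[i], b[i], acc[i]); // 4x INT8 multiply, INT32 accumulate
}
```

Targeting sm_60 (GP100) instead, only the first kernel compiles, which mirrors exactly the split described above.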
When we suggested to Jonah Alben, senior vice president of GPU engineering at Nvidia, that it was a shame that these 64-bit units could not be used in such a manner, he said that the support for FP16 math required tweaks to the FP32 CUDA cores and that the register bandwidth would have been too limited to run FP16 instructions through both sets of elements at the same time.
Do you have a reference for that?
I mentioned in the past that NextPlatform quoted Jonah Alben suggesting it is integral to the FP32 CUDA core, when they discussed the multi-precision support of Pascal and the P100.
That seems to indicate tweaks were made to the FP32 ALU implementation in the GP100.
It isn't strictly necessary that the FP32 implementation in one model is the same as in another, just as some GCN designs that are all baseline FP32 have varying FP64 ratios baked in.
I am speculating a lot on this, but it does seem like Nvidia prefers to differentiate units more readily than AMD. The differentiated units skimp on some of the overhead built into hardware that iterates repeatedly over different precisions, like GCN, and if there's less variability in the execution loop of the ALUs, they might run "tighter" than a base architecture whose critical path lets various disparate operations sneak in on the same route.
Err, I was responding to a post that specifically mentions GP100.
I thought there was a cross-GP1xx comparison going. I may have misread the direction it was going.
Cheers
Could be me who misread the context.
Is there any source for it apart from what one dev divined out of assembly code? I mean, I'm not at all against those propositions, but I'm interested in whether or not they have a substantial basis.
It's unlikely GP104 has the full set of fp16x2 ALUs. There is no strong basis for concluding the existence of artificially disabled fp16x2 units. We don't know the details of what is there, and some evidence we assembled is contradictory, but Occam's razor says the simplest explanation, that fp16x2 is a hardware feature only on P100 (and Tegra), is likely correct. We CUDA guys are especially hopeful, but it's just hope, not actual expectation anymore.
The strongest circumstantial evidence is probably the fact that the 8-bit operations are not restricted (has anyone actually verified this on real hardware?), even though that would be the best candidate for it, since it has no other reason to exist.
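For anyone who wants to verify on real hardware, here is a minimal throughput probe, a sketch assuming an sm_53-or-newer GPU and the standard cuda_fp16.h intrinsics; kernel names and constants are placeholders, not from any post here. It times a dependent chain of half2 FMAs against the same chain in float; at full-rate fp16x2 the two should take roughly the same time while the half2 version does twice the FLOPs:

```
#include <cstdio>
#include <cuda_fp16.h>

constexpr int N = 1 << 18;    // total threads
constexpr int ITERS = 4096;   // dependent FMAs per thread

__global__ void h2_chain(half2* out)
{
    half2 v = __float2half2_rn(0.5f);
    const half2 m = __float2half2_rn(1.0f + 1.0f / 1024.0f);
    for (int i = 0; i < ITERS; ++i)
        v = __hfma2(v, m, m);              // two FP16 FMAs per instruction
    out[blockIdx.x * blockDim.x + threadIdx.x] = v;
}

__global__ void f_chain(float* out)
{
    float v = 0.5f;
    const float m = 1.0f + 1.0f / 1024.0f;
    for (int i = 0; i < ITERS; ++i)
        v = fmaf(v, m, m);                 // one FP32 FMA per instruction
    out[blockIdx.x * blockDim.x + threadIdx.x] = v;
}

int main()
{
    half2* dh; float* df;
    cudaMalloc(&dh, N * sizeof(half2));
    cudaMalloc(&df, N * sizeof(float));

    cudaEvent_t t0, t1;
    cudaEventCreate(&t0);
    cudaEventCreate(&t1);

    h2_chain<<<N / 256, 256>>>(dh);        // warm-up
    f_chain<<<N / 256, 256>>>(df);

    cudaEventRecord(t0);
    h2_chain<<<N / 256, 256>>>(dh);
    cudaEventRecord(t1);
    cudaEventSynchronize(t1);
    float ms_h = 0.0f;
    cudaEventElapsedTime(&ms_h, t0, t1);

    cudaEventRecord(t0);
    f_chain<<<N / 256, 256>>>(df);
    cudaEventRecord(t1);
    cudaEventSynchronize(t1);
    float ms_f = 0.0f;
    cudaEventElapsedTime(&ms_f, t0, t1);

    // The half2 chain does twice the FLOPs per iteration, so ms_h roughly
    // equal to ms_f means full-rate fp16x2 (as on GP100); a much larger
    // ms_h is what a vestigial or reduced fp16x2 path would look like.
    printf("half2 chain: %.3f ms, float chain: %.3f ms\n", ms_h, ms_f);

    cudaFree(dh);
    cudaFree(df);
    return 0;
}
```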
I don't say you are wrong, but that starts to add up to a lot of units ...
Maybe that's why it's a 610 mm² die on 16 nm?