Nvidia Pascal Announcement

Doesn't look like a P100, since 24GB GDDR5X possibly points to a 384bit bus, like the one on P102. I haven't heard of GDDR5X capabilities in P100, even more since the chip seems to have been finalized before GDDR5X as a standard.
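For what it's worth, the bus-width guess is just chip-count arithmetic: each GDDR5X package exposes a 32-bit interface, so twelve memory channels give a 384-bit bus. A minimal sketch (the 2GB-per-channel figure used to reach 24GB is an illustrative assumption, e.g. clamshell pairs of 1GB chips):

```python
# Sanity check on the "24GB GDDR5X -> 384-bit bus" guess.
# Each GDDR5X package has a 32-bit interface, so the total bus width
# follows directly from the number of memory channels.

def bus_width_bits(num_channels: int, bits_per_channel: int = 32) -> int:
    """Total memory bus width for a given number of 32-bit DRAM channels."""
    return num_channels * bits_per_channel

# Twelve channels -> 384-bit; at 2GB per channel that reaches 24GB.
print(bus_width_bits(12))   # 384
print(12 * 2)               # 24 (GB, assuming 2GB per channel)
```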

What this may mean is that the GP102 actually has 3840 sp and the new Titan is a cut-down chip.

Sold really fast, or manufacturing problems getting enough out there - were they Nvidia? All a matter of perspective, it seems.
I don't think any of the Polaris chips have manufacturing problems because they've been shown in working condition since January.
Just my opinion and I could be wrong, though.


Anyway: This money-making argument would make sense, if they were genuine 4 GiByte models, but they were not.
The money-making argument makes sense because apparently AMD didn't even bother to make 4GB cards. They simply flashed 8GB cards with a BIOS that won't allocate half the memory and lowers its clocks to 7Gbps, put a couple hundred of them in the market just as a placeholder and called it a day.
It's possible the amount of "4GB" cards is so tiny that they saved money by not even creating more than one product line. No quality-control, no separate order of 7Gbps VRAM chips, no separate assembly line created for a different product (other than the flashing bios part). The couple thousand dollars they lose by selling 8GB cards for cheap(er) is an investment to get the $199 placeholder faux-release.
Again: not a very consumer-friendly or honest decision.
 
It is not a problem of the cooler being unable to handle it, but of getting enough heat out of the small die. The heatpipes are the limiting factor in how much heat they can draw from the die. Maybe a vapour chamber would really benefit the RX480.
Agreed,
which is part of my 1st sentence when I mention density and die size :)
It is still an issue with the cooling solution if it is not designed to cope with this (of course IHVs will rightly mention manual OC is beyond guaranteed spec) and more of a challenge if going with air cooling rather than water.
Unfortunately vapor chamber is also only able to do so much as we see even with Nvidia cards.
If you truly OC either card beyond its performance-scaling ceiling (I posted a chart from Tom's earlier), both cards of this generation will need to be on water cooling.
Context is about not needing to run the fans at rpm that is excessive and noisy/intrusive, and even then there are scaling limits that are lower than previous gen.
Cheers
 
Nvidia Potential Roadmap Update for 2017: Volta Architecture Could Be Landing As Early As 2H 2017
Word around the grapevine is, the Volta GPU is going to be landing one year early in May next year at the GTC event held annually by Nvidia. There are two particular sources in play on this report, both as critical to the authenticity of this information as the other.
...
If Volta will indeed be manufactured on the 16nm FinFET node then we can actually expect to see it by May 2017 – which is less than a year away.
...
In fact one additional argument in favor of Volta in 2017 is the fact that the Pascal big chip has been designed for double precision as well as single precision, unlike Maxwell, which was designed from the ground up for single precision. While this is invaluable for Nvidia’s other ventures, the double precision units are a waste of space on the P100 die as far as gaming is concerned. The GeForce TITAN X only has 3584 cores and is based on a smaller die, which naturally means we haven’t seen the biggest possible chip for consumer purposes on this node yet. One thing is for sure: the 16nm FinFET process can do much better than a single precision core count of 3584, and until Nvidia has fully utilized this node, I very much doubt it will move on. So it is almost a given that there is at least one more generation of 16FF GPUs coming from Nvidia, and it is a fair bet that this will be the Volta architecture, since this only leaves behind the question of nomenclature.
http://wccftech.com/nvidia-roadmap-2017-volta-gpu/
 
I don't know how useful those comparisons/predictions are anyway. Kepler->Maxwell->Pascal all share a very similar SM structure, but afaik Volta could be very different and could make a jump in ALU count similar to Fermi->Kepler/Maxwell.
 
I've heard that Volta has a very different architecture than Maxwell/Pascal, with much higher efficiency.
Personally I would like to see Volta with a very flexible ALU that can do FP64 / 2xFP32 / 4xFP16, but it's just a wish; I have no idea how it will really be...
 
Okay, GP102/TitanX updates: slow FP16/FP64 confirmed. 471mm2 die size. Other than INT8, this really is a bigger GP104

Really sounds like 1.5x GP104, which would put the whole chip's SM count at 30. With the news of the Quadro P6000 coming with 30 SMs and a 384bit GDDR5X bus, all this points to the new Titan being a cut-down GP102.
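The arithmetic behind the 1.5x reading, assuming GP104's known layout of 20 SMs with 128 FP32 cores each:

```python
# "1.5x GP104" in numbers: GP104 has 20 SMs at 128 FP32 cores per SM.
GP104_SMS = 20
CORES_PER_SM = 128

gp102_sms = int(GP104_SMS * 1.5)               # 30 SMs for a full GP102
full_cores = gp102_sms * CORES_PER_SM          # 3840 cores on the full chip
titan_cores = (gp102_sms - 2) * CORES_PER_SM   # 28 SMs -> 3584, the new Titan X's count

print(gp102_sms, full_cores, titan_cores)  # 30 3840 3584
```

The 3584-core Titan X falling two SMs short of the 30-SM full chip is what makes it look like a cut-down GP102.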
 
Okay, GP102/TitanX updates: slow FP16/FP64 confirmed. 471mm2 die size. Other than INT8, this really is a bigger GP104
Thanks for the update.
Well Nvidia managed to create a confusing mess now with their narrative on scientific Deep learning and research operations.
You need a P100 in general for optimal performance of FP32/FP16, but a GP102 if doing Int8/dp4a; basically they have screwed any shared workload possibilities and added further cost-complexity considerations, which matters in the business-scientific world.
Made worse by how Nvidia seems recently to be pushing Int8 scientifically.

1st time in a long while I think they have made a bad business strategy here.
I think they will have no choice but to release an updated P100 with Int8/dp4a and share this through Tesla and Quadro now, possibly with a certain number of FP64 Cuda cores disabled and at a reduced price.
Makes one wonder what Pascal is going to fit in to replace the more inefficient K80, or at least be more competitively priced and efficient (obviously less overall performance than the P100).
Cheers
 
Okay, GP102/TitanX updates: slow FP16/FP64 confirmed. 471mm2 die size. Other than INT8, this really is a bigger GP104
Ryan, Is GP102's vaguely described "INT8" feature different than GP104's DP4A instruction? I suspect they're the same and you're mistaken about INT8 being a difference between GP104 and GP102.
 
Thanks for the update.
Well Nvidia managed to create a confusing mess now with their narrative on scientific Deep learning and research operations.
You need a P100 in general for optimal performance of FP32/FP16, but a GP102 if doing Int8/dp4a; basically they have screwed any shared workload possibilities and added further cost-complexity considerations, which matters in the business-scientific world.
Made worse by how Nvidia seems recently to be pushing Int8 scientifically.

1st time in a long while I think they have made a bad business strategy here.
I think they will have no choice but to release an updated P100 with Int8/dp4a and share this through Tesla and Quadro now, possibly with a certain number of FP64 Cuda cores disabled and at a reduced price.
Makes one wonder what Pascal is going to fit in to replace the more inefficient K80, or at least be more competitively priced and efficient (obviously less overall performance than the P100).
Cheers

I'm not so sure I can agree with the idea that it's bad business strategy for Nvidia to push certain customers towards higher-priced SKUs... Perhaps if they had promised this functionality then reneged at the last minute (cough>2SLIcough) but I don't think that has happened here.
 
Nvidia managed to create a confusing mess now with their narrative on scientific Deep learning. You need a P100 in general for optimal performance of FP32/FP16, but a GP102 if doing Int8/dp4a
There's a credible second-hand report that DP4A's design just didn't make it in time for P100.

I'm not a machine learning expert, but there is a big numerical difference between training a neural net and evaluating one. Training requires backpropagation and gradient information, which needs more precision because it will be used for division; FP16 is evidently enough. Evaluation of an existing net (designed for 8 bit) just needs to evaluate a lot of weighted sums of 8 bit input node values. That's a summed dot product A0*B0+A1*B1+A2*B2+A3*B3. And that weighted sum is exactly what integer DP4A does, where each element is a byte in a word. DP2A also exists; it uses two 16 bit integers instead.
So the deep learning targeting may be "Buy a P100 to train your nets, and any other Pascal GPU to evaluate them afterwards."
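To make that concrete, here is a minimal Python model of what DP4A computes (the signed variant is assumed): four packed 8-bit lanes per 32-bit word, a four-element dot product, and a 32-bit accumulate.

```python
def dp4a(a: int, b: int, c: int) -> int:
    """Software model of DP4A: treat a and b as four packed signed bytes,
    compute the dot product A0*B0 + A1*B1 + A2*B2 + A3*B3, and add
    the 32-bit accumulator c."""
    def signed_bytes(word):
        # Extract four signed 8-bit lanes from a 32-bit word.
        lanes = [(word >> (8 * i)) & 0xFF for i in range(4)]
        return [v - 256 if v >= 0x80 else v for v in lanes]
    total = c + sum(x * y for x, y in zip(signed_bytes(a), signed_bytes(b)))
    # Wrap to 32-bit two's complement like a hardware accumulator would.
    return (total + 2**31) % 2**32 - 2**31

# One tap of a quantized weighted sum, as in neural-net evaluation:
a = 1 | (2 << 8) | (3 << 16) | (4 << 24)       # inputs  [1, 2, 3, 4]
b = 10 | (20 << 8) | (30 << 16) | (40 << 24)   # weights [10, 20, 30, 40]
print(dp4a(a, b, 0))  # 1*10 + 2*20 + 3*30 + 4*40 = 300
```

A long weighted sum over many 8-bit inputs is then just this instruction chained through the accumulator, which is why one DP4A per cycle per lane amounts to a 4x Int8 rate.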
 
There's a credible second-hand report that DP4A's design just didn't make it in time for P100.

I'm not a machine learning expert, but there is a big numerical difference between training a neural net and evaluating one. Training requires backpropagation and gradient information, which needs more precision because it will be used for division; FP16 is evidently enough. Evaluation of an existing net (designed for 8 bit) just needs to evaluate a lot of weighted sums of 8 bit input node values. That's a summed dot product A0*B0+A1*B1+A2*B2+A3*B3. And that weighted sum is exactly what integer DP4A does, where each element is a byte in a word. DP2A also exists; it uses two 16 bit integers instead.
So the deep learning targeting may be "Buy a P100 to train your nets, and any other Pascal GPU to evaluate them afterwards."
I think they are looking to push the GP102 for inferencing.
But that is asking research labs/teams to add additional separate hardware, where most of them would possibly be looking to share workloads more.
Agree that ideally you would use different dedicated HW, but that adds cost and complexity, which will not suit everyone.

Yes I agree about the P100, which is why IMO there needs to be an update or a new lower product that is a true mixed-precision (with Int8/dp4a) accelerator.
It would not have been much of a headache, apart from the fact that Nvidia seems to be just starting their new push and narrative of using Int8 in research, which I doubt will be limited to inferencing.
This has impacts on the optimisation of the various research apps they work with, in terms of adding complexity and now dedicated GPUs.
IMO just a level of complexity they did not need to create for scientific research/supercomputer implementations/smaller teams/etc, and also creates certain gaps as I briefly suggested in earlier posts.
Anyway I doubt everyone in the research world is happy about the idea of buying a GP102 to do one specific DL task, or possibly having to use it for other Int8-optimal operations.
As Ryan and ToTTenTranz say, this is more GP104 version 1.5 than a real Titan.
Cheers
 
Don't know if it's available in Europe, but some of the best cash-back cards in the U.S., the Citi Double Cash Visa and the Citi Costco Visa, appear to have a feature called Price Rewind (check just in case; you must register the product online in their site's Price Rewind area): they will check hundreds of the top online sites, and if a better price is found anywhere they will give you the difference back.

So at least in the U.S., with such easy-to-get, no-annual-fee cards, it seems you'll get the card at or below the advertised price. Don't know about Europe, though.
I nominate this post for the B3D Hall Of Fame!
 
Ryan, Is GP102's vaguely described "INT8" feature different than GP104's DP4A instruction? I suspect they're the same and you're mistaken about INT8 being a difference between GP104 and GP102.
It seems that way. But I need further clarification. It may just be that there's a software throttle somewhere...
 
Update re: GP102 precision capabilities via Anandtech:
Update 07/25: NVIDIA has given us a few answers to the question above. We have confirmation that the FP64 and FP16 rates are identical to GP104, which is to say very slow, and primarily there for compatibility/debug purposes. With the exception of INT8 support, this is a bigger GP104 throughout.
 
It seems that way. But I need further clarification. It may just be that there's a software throttle somewhere...
I think I have posted this at least 10 times before on this forum. But once more.

GP100 has 2x rate FP16. It has no Int8.
GP10x (102, 104, 106, etc.) have essentially no FP16 (1/64 rate), but have 4x rate Int8 with 32 bit accumulate.

The idea that GP10x has artificially throttled FP16 is wrong, the chips just don't have FP16 besides a token amount for software compatibility.

I don't know whether today's GP104 products have artificially throttled Int8, I haven't seen evidence either way. But GP104 and 102 are the same architecture, just with different numbers of units. GP100, on the other hand, is unique.
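As a rough sketch of what those rate multipliers mean in practice, using the 2016 Titan X's advertised 3584 FP32 cores and ~1.53 GHz boost clock (peak theoretical numbers, not measured):

```python
# Peak throughput implied by the rate multipliers above for the 2016
# Titan X: 3584 FP32 cores, ~1.53 GHz boost, 2 ops per FMA per cycle.
CORES = 3584
CLOCK_GHZ = 1.53
FP32_TFLOPS = CORES * 2 * CLOCK_GHZ / 1000  # ~11 TFLOPS FP32

# Rates relative to FP32, per the post above (GP10x consumer chips).
rates = {"FP32": 1.0, "FP16": 1 / 64, "INT8 (DP4A)": 4.0}
for name, mult in rates.items():
    print(f"{name}: {FP32_TFLOPS * mult:.2f} T(FL)OPS peak")
```

The 1/64-rate FP16 line makes it obvious why GP10x's FP16 is only a compatibility token, while 4x-rate Int8 is a genuine inferencing feature.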
 