Nvidia Pascal Speculation Thread

Not open for further replies.
Charlie at it again, seemingly dissecting why Nvidia would not have working Pascal silicon in-house, based on a forum post at our very own Beyond3D.


Seems pretty legit, except for one thing: he bases his timing assumptions on the forum post linked here and concludes that Nvidia would not have Pascal silicon in-house because they only moved bring-up tools around at the end of December. The flaw in his argument: the post explicitly says "big Pascal", while the tool movements were for a 37.5×37.5 part.

Possibly, big Pascal could have taped out and been brought up first, and now Nvidia is working on bring-up for a smaller chip.

From what I could quickly gather off of some photos, GM200's package was roughly 45-46 mm wide, GM204's 40-41 and GM206's 37-38 mm - and Charlie's article focuses on a 37.5×37.5 BGA size. Go figure.
Plus, there are tools for water-cooling parts; the only water-cooled part we know of being made in-house right now is for PLX2, not GP100.
Are those really tools for water cooling, or just components of an easily detachable water cooler for the bring-up kit? I don't know.
Somebody posted a Zauba shipping manifest in the SA forums (Zauba entries seem to be the best leaks currently) for an unnamed Nvidia chip priced at roughly 2.5x that of Fiji samples. That was way back in August-September, so probably related to a June tape-out?

More interesting for me was that Polaris got demoed to the press at Sonoma in early December in similar fashion to the CES showing.

AMD showed off functional Polaris silicon with multiple working devices in early December. It wasn’t rough, it wasn’t a static demo, it had drivers, cards, and all the things you would expect from non-first silicon. A month later Nvidia did not have silicon and lied to a room of analysts and press about it, and there was no way their CEO would be unaware of such a major milestone at his company.

That would explain why Koduri is so confident about distancing AMD from Nvidia's new chip release.
AMD is the one that has to prove to OEMs and system builders that it is ready with its products; nV doesn't need to prove itself, as it has already cornered that market.

As nV has done many times, as did AMD/ATi: when they have well-selling products in the OEM markets, they tend to introduce new top-end parts to OEMs and system builders slowly (because OEMs already have certain inventory obligations under their contracts), and focus on early adopters among general consumers first. This also helps when initial yields are low.
At least AMD has shown only a to-be mobile part, thus not hurting their partners' channel inventory. Nvidia, OTOH, would probably hurt current sales much more if it already showed working silicon.
Apologized for that, and thread cleaned. I wasn't trying to start a war, even a petty one, but to provoke (by being... provocative) a response from a specific forum member (still hoping for that). Instead, I saw other people, people I respect greatly, offended by my comment, and I realized something had gone wrong...
Hello everyone.
First, I do not speak English well. I hope you understand.

This is what I found.
[Attached screenshot: 2016-02-09 17;40;05.jpg]
699-2H403 - ?
699-12914 -
699-1G411 - GM204 variant ?
699-1H400 - GP104 ?

Background Information
PG401 - GM204 Reference Card (GTX980), 699-1G401
PG600 - GM200 Reference Card, 699-1G600
PG301 - GM206 Reference Card (Maybe), 699-1G301

I think everyone knows this.
[Attached screenshot: 2016-02-09 17;57;47.jpg]
26-Oct-2015 | 85423100 | GRAPHICS PROCESSOR INTEGRATED CIRCUITS 3R08A | South Korea → Bangalore Air Cargo | NOS | Qty 10 | 89,950 | 8,995 per unit. This one is unique.

Please delete this post if it is a problem.
Earlier this week I found some NVIDIA presentations that contained details that I have not seen before (or I don't remember).

Slide 6 of "The Future of HPC and The Path to Exascale" (from 3 March 2014) gives a roadmap with a DP GFLOPS/W value for Pascal. The presentation's date is between the GTC 2013 roadmap which does not contain Pascal and the GTC 2014 roadmap which does contain Pascal.


Below are the approximate DP GFLOPS/W values for the various architectures:
  • Tesla: 0.5
  • Fermi: 2
  • Kepler: 5.5
  • Maxwell: 10.5
  • Pascal: 14
  • Volta: 22
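Taking the roadmap values at face value, the generation-to-generation perf/W multipliers are easy to derive (my arithmetic as a quick sanity check, not official figures):

```python
# Approximate DP GFLOPS/W read off the roadmap slide (see list above).
perf_per_watt = {
    "Tesla": 0.5, "Fermi": 2.0, "Kepler": 5.5,
    "Maxwell": 10.5, "Pascal": 14.0, "Volta": 22.0,
}

archs = list(perf_per_watt)
for prev, cur in zip(archs, archs[1:]):
    ratio = perf_per_watt[cur] / perf_per_watt[prev]
    print(f"{prev} -> {cur}: {ratio:.2f}x")
```

Interestingly, the Maxwell-to-Pascal step (~1.33x) is the smallest jump on the whole chart.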
Slide 43 of "Accelerators: The Changing Landscape" (from around May 2015) shows that a Pascal GPU has a peak of over 3 TFLOPS (presumably DP). I'm not sure if that includes boost; it very well could, because the GFLOPS values of existing GPU parts on page 14 are based on maximum boost clocks.


Now, I had written up a long post analyzing these two pieces of information, but I subsequently found some more presentations and I had to change most of it.

Slides 12, 14, and 17 of "GPU Accelerated Computing" (from 10 November 2014) contain a roadmap of specific Tesla parts, not just architectures. This roadmap contains a single-GPU part called "Pascal-Solo" with a 235 W TDP. [Also, you may notice something missing in slide 11, which makes sense given the date.]


The presentation doesn't specifically state what chip the "Pascal-Solo" uses, but I think the slides following the roadmap may point to the GP100 chip with HBM2. The roadmap contains all the Kepler Teslas that I previously knew of but it does not have any Maxwell Teslas (Maxwell isn't mentioned in this presentation at all), so I think it's possible that the "Pascal-Solo" isn't the only Tesla planned for 2016. But I haven't found any direct evidence of any other Pascal Teslas in 2016, or any Pascal releases before the later part of this year for that matter. EDIT: That being said, I also haven’t found any hard evidence that says there will be no Pascal parts before late 2016. There may be little reason for presentations to specifically mention a future Pascal chip, even in a Tesla part, that does not have HBM2 or NVLink. I’m still hoping for a GP102 or GP104 release in March or April.

The last piece of information I found may explain what I previously thought might be a discrepancy between the "Future of HPC" roadmap and the GTC 2015 roadmap. The GTC 2015 roadmap shows ~42 SGEMM/W for Pascal. Given how close the SGEMM/W and theoretical SP GFLOPS/W numbers are for Maxwell, and that Maxwell and Pascal seem to be architecturally similar, I guessed that Pascal has a theoretical ~43 SP GFLOPS/W. I had also assumed that fast-DP Maxwell has a 1:2 DP rate, but 43 is much higher than two times 14.

Slide 75 of "New hardware features in Kepler, SMX and Tesla K40" (from April 2014) mentions that a Pascal with stacked memory has 4 DP TFLOPS, 12 SP TFLOPS, and 1024 GB/s. It's worth noting that the DP value matches the value from a presentation linked earlier in this thread.


I don't think these FLOPS numbers automatically imply a 1:3 DP rate; the number of significant figures is small enough to mask small differences from 1:3.
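To illustrate the significant-figures point: raw throughputs that are noticeably off an exact 1:3 ratio can still round to the quoted 4/12 TFLOPS. The 3.8/11.6 numbers below are purely made up for illustration:

```python
# Hypothetical raw throughputs (illustrative only): not an exact 1:3 ratio,
# yet both round to the 4 and 12 TFLOPS quoted on the slide.
dp_tflops, sp_tflops = 3.8, 11.6

assert round(dp_tflops) == 4 and round(sp_tflops) == 12
print(f"actual rate = 1:{sp_tflops / dp_tflops:.2f}")  # 1:3.05, masked by rounding
```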

Question 1: Is it possible for a Pascal chip to consist of some SMs with a 1:2 DP rate and other SMs with no DP or a 1:32 DP rate?

Taking into account the above information, the 12 SP and 4 DP TFLOPS values more closely align with a ~280 W TDP than a 235 W TDP. So I'm thinking that either some roadmap information is outdated or there is some hidden > 235 W Tesla part that we don't know about. After all, the K20X wasn't unveiled at the same time as the K20, even though both parts launched at about the same time. So my current guess for the 2016 Tesla lineup is as follows:
  • Tesla P##: 1x GP100, ~14 DP GFLOPS/W, 235 W, ~3.3 DP TFLOPS, ~9.9 SP TFLOPS, 1 TB/s
  • Tesla P##X: 1x GP100, ~14 DP GFLOPS/W, 275-300 W, 4 DP TFLOPS, 12 SP TFLOPS, 1 TB/s [less likely?]
    • Question 2: Is it possible to have 2x GP100 on the same interposer? (Or even 2x GP102 if that chip uses HBM2.)
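The TFLOPS figures in the guess above follow directly from multiplying the roadmap's ~14 DP GFLOPS/W by the assumed TDPs (my arithmetic, not anything from the slides; the 285 W figure is just a midpoint of the guessed 275-300 W range):

```python
DP_GFLOPS_PER_WATT = 14  # from the "Future of HPC" roadmap slide

for name, tdp_w in [("Tesla P## (Pascal-Solo?)", 235), ("Tesla P##X (guess)", 285)]:
    dp_tflops = DP_GFLOPS_PER_WATT * tdp_w / 1000
    sp_tflops = dp_tflops * 3  # assuming the 1:3 DP rate discussed above
    print(f"{name}: ~{dp_tflops:.1f} DP / ~{sp_tflops:.1f} SP TFLOPS")
```

235 W lands on ~3.3 DP / ~9.9 SP TFLOPS, and ~285 W lands almost exactly on the 4 DP / 12 SP TFLOPS from the Kepler slide.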
By the way, I have collected a large number of NVIDIA roadmaps in this presentation file, including all four in this post.
Last edited:
4 TFLOPS is more like a design target for HPC products.

Traditional HPC cards are either passively or actively air-cooled, and also considering the power limit (235 W), I would expect the Pascal HPC cards to be significantly downclocked compared to their gaming (water-cooled) counterparts, more so than in the Kepler generation.

The main reason is HBM with an interposer. The interposer is made of copper, and it poses a significant cooling challenge (sorry for my limited English knowledge). Due to the different thermal-expansion coefficients of Cu and Si, the chips will experience significant stress/shearing forces between loading and unloading cycles, which will significantly shorten the chip's life compared to traditional chips if the temperature is high. For an acceptable lifetime, such an HBM2 package has to work at a much lower temperature; we are not talking about 80 °C, more like 50 °C or below. So if it is air-cooled instead of water-cooled, and its load is very stressful, Nvidia has no choice but to significantly downclock its new HPC chips, or try to persuade all the big players in the HPC market to design new water-cooled solutions (not very likely). So the frequency, and thus performance, gap between air-cooled HPC chips and water-cooled gaming chips could be huge.
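The CTE-mismatch concern can be put in rough numbers. Taking the post's copper-on-silicon premise at face value, and using textbook expansion coefficients (~17 ppm/K for copper, ~2.6 ppm/K for silicon; illustrative values, not package-specific data):

```python
# Differential thermal strain between two bonded materials:
#   strain = (alpha_a - alpha_b) * delta_T
ALPHA_CU = 17e-6   # 1/K, copper (approximate textbook value)
ALPHA_SI = 2.6e-6  # 1/K, silicon (approximate textbook value)
SPAN_M = 20e-3     # 20 mm span, roughly interposer-scale

for delta_t in (30, 60):  # e.g. idle-to-50C vs idle-to-80C temperature swings
    strain = (ALPHA_CU - ALPHA_SI) * delta_t
    mismatch_um = strain * SPAN_M * 1e6  # absolute expansion mismatch in micrometres
    print(f"dT={delta_t} K: strain {strain:.2e}, ~{mismatch_um:.1f} um over 20 mm")
```

The mismatch scales linearly with the temperature swing, which is the quantitative version of "keep the package at 50 °C instead of 80 °C".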

So 4 TFLOPS of HPC performance is not a very good indicator of Pascal's peak performance.
Why would there be an interposer made of copper? That has not been the case for AMD or others. There are materials besides silicon, but I have not seen discussion on metal ones.
With regards to thermal expansion, an interposer presents a less problematic surface for the silicon chips on it versus the organic package that a GPU like Fiji would otherwise attach to.
Thank you for that explanation (the whole one, though I'm focusing just on the quoted part here). It was stated in this thread that you would need deterministic neural networks with predictable results for use in automated driving. This would mean that most of the training should already be completed in factory-delivered cars (i.e. offline), and the question was whether or not the live AutoM8s would still require massive computing power once their neural networks have been pre-trained.

Neural networks are always deterministic. The problem is always the quality of the underlying training algorithm and methods. It is necessary to make sure that your training algorithm and training examples cover unlikely inputs and outputs. The most commonly used training algorithm is gradient-descent backpropagation (GD). Another algorithm would be Particle Swarm Optimization (PSO); the latter is better suited to training on GPU clusters. Both use labeled training examples, which are used to iteratively adjust the network parameters (biases and weights). This kind of training is also called supervised learning. Many different neural network structures exist. For automated driving and most image-processing tasks, Convolutional Neural Networks (CNNs) are used, which feature good parallel execution streams and fewer dependencies.
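A minimal sketch of the supervised, iterative parameter-adjustment loop described above, reduced to a single linear neuron trained by gradient descent on labeled examples (numpy; purely illustrative, nothing to do with any vendor's actual stack):

```python
import numpy as np

rng = np.random.default_rng(0)

# Labeled training examples: learn y = 2*x0 - x1 from noisy samples.
X = rng.normal(size=(256, 2))
y = X @ np.array([2.0, -1.0]) + 0.01 * rng.normal(size=256)

w = np.zeros(2)           # network parameters (weights)
lr = 0.1                  # learning rate
for _ in range(200):      # iterative adjustment via gradient descent
    grad = 2 * X.T @ (X @ w - y) / len(y)   # gradient of the mean squared error
    w -= lr * grad

print(w)  # converges close to [2, -1]
```

The same deterministic property holds here: given the same training data and initialization, the loop always produces the same weights; quality depends entirely on the data and the training procedure.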

I don't see online training methods being used for automated driving, especially not in the fields of image recognition and processing. The problem is that training costs a lot of processing power, and it is not guaranteed that the training will be successful. It could happen that the network parameters get worse, not better.

Of course, huge processing power is necessary to execute a CNN. Normally you set up a CNN with a fixed input region, which cannot be changed at all, and you need to train it with correspondingly labeled training cases. For example, if the input region is 32x32 pixels (which is very common), you cannot feed the network smaller or bigger input. It might be possible to scale the input data up or down, but there will be problems with the recognition rate then. For a 720p input stream you need to raster-scan the CNN over the whole input, or try to use some preprocessing to determine "where" the regions of interest are, which can then be fed into the CNN. That of course requires additional preprocessing. For traffic signs you need to use RGB pixels, and not just pixel intensity.
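The raster-scan idea (a fixed 32×32 input region swept across a 720p frame) can be sketched as follows. The `classify` function is a hypothetical stand-in for a trained CNN; the point is just to count how many evaluations one frame costs:

```python
import numpy as np

def classify(patch):
    """Hypothetical stand-in for a trained 32x32 CNN; returns a dummy score."""
    return float(patch.mean())

frame = np.zeros((720, 1280, 3), dtype=np.uint8)  # one 720p RGB frame

WIN, STRIDE = 32, 16  # fixed CNN input size; overlapping windows via stride < WIN
scores = []
for y in range(0, frame.shape[0] - WIN + 1, STRIDE):
    for x in range(0, frame.shape[1] - WIN + 1, STRIDE):
        scores.append(classify(frame[y:y + WIN, x:x + WIN]))

print(len(scores))  # thousands of CNN evaluations for a single frame
```

Even with this coarse stride, a single frame requires thousands of network evaluations, which is why region-of-interest preprocessing matters so much.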

The whole process of automated driving is critical. You don't just use CNNs, but also many other algorithms, in order to extract the necessary information. All the information needs to be gathered via a process called "sensor fusion". The whole process is very complex, and I don't have that much insight into every detail. CNNs, or artificial neural networks in general, are just one part of the whole process and not a universal remedy.


About the whole FP64/FP32 execution-rate discussion for big Pascal: maybe it will look like this.


There are 4 execution blocks in an SM. Two of them always share one FP64 SIMD, which would give a ratio of 1:0.375, which is pretty close to 3:1. Maybe the boost clocks are lowered for FP64 tasks and execution.
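Taking the post's numbers at face value, here is one hypothetical lane layout that reproduces the 0.375 figure (the lane counts are my guess at the arithmetic behind the post, not from any slide):

```python
# One hypothetical per-SM layout that yields the quoted 1:0.375 number
# (these lane counts are assumptions for illustration, not confirmed specs):
fp32_lanes = 4 * 32   # four execution blocks, 32 FP32 lanes each
fp64_lanes = 2 * 24   # two shared FP64 SIMDs, 24 lanes each

ratio = fp64_lanes / fp32_lanes
print(ratio)  # 0.375, i.e. slightly above an exact 1:3 rate (0.333...)
```

Lowering the boost clock only for FP64 work would then pull the effective rate down from 0.375 toward the 1:3 seen in the TFLOPS figures.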
I think there are a couple of other, more interesting notes on that slide deck:
- NVIDIA considered HMC first, not HBM
- NVIDIA claims they have integrated the memory to be part of the actual GPU die

I would just like to ask: what the f#¤k, and how the f#%k did this happen? TSMC isn't doing HMC nor HBM, and don't those actually require a different manufacturing process altogether?
As chip manufacturing processes shrank below a micron, more components started to be integrated on-die:
  • 1989: FPU [Intel 80486DX]
  • 1999: SRAM [Intel Pentium III]
  • 2009: GPU [AMD Fusion]
  • 2016: DRAM [Nvidia Pascal]

The end of the story is SoC (System-on-Chip).
This is from a CUDA Fellow, i.e. an independent researcher outside of Nvidia. I won't bet any vital parts on everything in there being officially sanctioned Nvidia material.