Nvidia Pascal Speculation Thread

This is from a CUDA Fellow, i.e. an independent researcher outside of NVIDIA. I wouldn't bet any vital parts on everything in there being officially sanctioned NVIDIA material.
Thanks for the info.

I checked the authors of the presentations I linked to in my above post, and as far as I can tell, all of the authors are within NVIDIA except for the CUDA Fellow.

EDIT: Also, the date of "The Future of HPC and The Path to Exascale" presentation is 2 March, not 3 March.
 
That deck is ancient and should be ignored.

The clue is in the fact that there's no mention of HBM.
 
That deck is ancient and should be ignored.

The clue is in the fact that there's no mention of HBM.
Actually, there is a slide called "HBM (High Bandwidth Memory) versus DDR3 and GDDR5", and the Pascal bandwidth obviously matches HBM, not HMC.
 
Of course huge processing power is necessary to execute a CNN. Normally you set up a CNN with a fixed input region, which cannot be changed at all. You need to train your CNN with correspondingly labeled training cases. For example, if the input region is 32x32 pixels (which is very common), you cannot feed the network with smaller or bigger input. It might be possible to scale the input data up or down, but then there will be problems with the recognition rate. For a 720p input stream you need to raster-scan the CNN over the whole input, or try to use some preprocessing to determine where the regions of interest are, which can then be fed into the CNN. Either way, you need additional preprocessing. For traffic signs you need to use RGB pixels, and not just pixel intensity.

Interesting. But then can you run another neural network on, say, 16x16 pixel data?
I have the feeling you can throw endless resources at the problem, although that may take second place to algorithms and knowing what you're doing.
 
Actually, there is a slide called "HBM (High Bandwidth Memory) versus DDR3 and GDDR5", and the Pascal bandwidth obviously matches HBM, not HMC.



This is part of an independent course at the SC event in June 2015 about memory evolution on GPUs.

I don't know if the numbers had been given to him at that time, or if it was just a projection, speculation based on a possible +50% performance gain for Pascal.

That said, we are at 8 TFLOPS SP; +50% seems plausible.
 
The big, round, power of 10 numbers for Pascal versus actual numbers for everything else obviously constitute a guesstimate, and a projected one at that given the timeframe. What's interesting is if it's an informed one. 12 teraflops is certainly doable with the new finfet nodes, I'd always just assumed it would take a die too big for Nvidia to produce, at least in the initial run and on what is essentially Maxwell in all the important parts.
 
They apparently have some transistor-level power tricks up their sleeve (see Dally's talk at SC15), so they can probably utilize a bit more of the additional areal density without severely running into the power wall. My guess is that they've also improved the RFC (Dally's pet data-locality topic) and probably the register file itself. Plus maybe larger L2 caches.
 
Interesting. But then can you run another neural network on, say, 16x16 pixel data?
I have the feeling you can throw endless resources at the problem, although that may take second place to algorithms and knowing what you're doing.

Of course, it is possible to run another ANN with a different input size. A CNN usually consists of multiple different kinds of layers: so-called Convolutional Layers (CL), Pooling Layers (PL), and Fully-Connected Layers (FL) at the end of the network. I made a post about the structure of CNNs a while ago, so don't bother searching for it.

CLs are used to recognize features within the input data. For pixel data, features are composed of adjacent pixels forming regions of certain contrasts, figures, edges, and so on. You usually have multiple CLs in parallel, as every CL is only capable of searching the input data for one single feature. Again, this happens in a raster-scan fashion, meaning every neuron in a CL is looking at a certain region of the input data. For example, if a CL has a kernel size of 5x5, it searches for features with a size of 5x5 pixels. For every feature you want to recognize, you have a dedicated CL in parallel, and these are independent from each other. That gives you parallelism, so you can get a good speed-up by using a GPGPU or other vector processing units. The general execution is also very similar to matrix multiplication, so it is possible to use tricks like tiling (as in BLAS) to achieve better performance.
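As a rough illustration of that matrix-multiplication angle (purely my own NumPy sketch, not anything from the slides or the post above; all shapes and names are made up), a bank of 5x5 kernels can be lowered to one big matrix multiply via im2col:

Code:
import numpy as np

def im2col(img, k):
    # Collect every k x k patch of a 2D image into the rows of a matrix,
    # so the convolution becomes one big matrix multiplication (BLAS-friendly).
    h, w = img.shape
    patches = [img[y:y + k, x:x + k].ravel()
               for y in range(h - k + 1)
               for x in range(w - k + 1)]
    return np.array(patches)                    # (num_positions, k*k)

def conv_layer(img, kernels):
    # 'kernels' holds one filter per feature to detect; each filter plays the
    # role of one of the parallel CLs described above.
    k = kernels.shape[-1]
    cols = im2col(img, k)                       # (num_positions, k*k)
    flat = kernels.reshape(len(kernels), -1).T  # (k*k, num_features)
    h_out, w_out = img.shape[0] - k + 1, img.shape[1] - k + 1
    out = cols @ flat                           # the actual matrix multiply
    return out.T.reshape(len(kernels), h_out, w_out)

img = np.random.rand(32, 32)                    # toy grayscale input
kernels = np.random.rand(10, 5, 5)              # 10 features, 5x5 kernel each
feature_maps = conv_layer(img, kernels)         # -> shape (10, 28, 28)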

After every CL there is a Pooling Layer, which downsamples the recognized features. This is very important, as it removes noise and feature "artifacts". It can be seen as a downsampling kernel (min, max, average, etc.). After that there are further CLs and PLs. You stack them together in order to recognize other high-level or low-level features, as features can be composed of other features, and so on. The general structure of CNNs is modeled after the visual cortex in our brain, where complex nerve cells and simple nerve cells are stacked onto each other. After multiple CLs and PLs there are FLs, which gather the recognized feature information and map it to the corresponding outputs.
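And the pooling step in the same spirit (again just my own illustrative sketch), here non-overlapping 2x2 max pooling of one feature map:

Code:
import numpy as np

def max_pool_2x2(fmap):
    # Keep the strongest response in each 2x2 block: halves the resolution
    # and suppresses small "artifacts" in the feature map.
    h, w = fmap.shape
    blocks = fmap[:h - h % 2, :w - w % 2].reshape(h // 2, 2, w // 2, 2)
    return blocks.max(axis=(1, 3))

pooled = max_pool_2x2(np.random.rand(28, 28))   # -> shape (14, 14)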

For example, if I want to recognize different traffic signs, I would use a defined input pixel size for my input image, for example 32x32 pixels RGB => 3x32x32 input data, and use multiple stacked parallel CLs and PLs. Maybe there are 10 parallel CL/PL pipes with 3 stages each, i.e. 6 layers in total per pipe. Then I include two additional FLs at the end of my network, consisting of 100 and then 30 neurons. If I want to recognize 10 different traffic signs, the CNN has 10 outputs. So I train that network with my gradient-descent (GD) algorithm until my evaluation data set achieves a very high recognition rate. For every output I get a probability for the corresponding class. So there can be problems with false-positive and false-negative errors. For example, in the recognition of handwritten digits (MNIST), a "7" can look similar to a "1", so the recognition rate for those cases might suffer.
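Purely to make that hypothetical layout concrete, here is a sketch in Keras (my choice for brevity, not the toolkit used in the post; every layer count and size is just the example from the paragraph above, and the training-data names are placeholders):

Code:
from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, Flatten, Dense

model = Sequential([
    # Stacked CL + PL stages; the 10 filters play the role of the
    # 10 parallel CL/PL pipes described above.
    Conv2D(10, (5, 5), activation='relu', input_shape=(32, 32, 3)),
    MaxPooling2D((2, 2)),
    Conv2D(10, (3, 3), activation='relu'),
    MaxPooling2D((2, 2)),
    Flatten(),
    Dense(100, activation='relu'),    # first fully-connected layer
    Dense(30, activation='relu'),     # second fully-connected layer
    Dense(10, activation='softmax'),  # one probability per traffic-sign class
])

# Plain gradient descent on labeled training data, as described above.
model.compile(optimizer='sgd', loss='categorical_crossentropy',
              metrics=['accuracy'])
# model.fit(train_images, train_labels, validation_data=(val_images, val_labels))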

And now for your question. If I create a CNN with a 32x32 pixel input size, it also covers smaller input sizes: as I mentioned, a CNN consists of stacked CLs and PLs with downsampling capabilities, so you are indirectly investigating the input for smaller features. The final answer is always "it depends"! It is possible to train a 32x32 CNN to recognize 16x16 input data, but you need to make sure that the data is transformed accordingly (scaling, zero-padding) and that the network is trained to recognize such transformed data. Maybe you then need larger CLs or more stacked CLs and PLs to recognize that data. Maybe it would be better to run an independent CNN. So the final answer is "it depends".
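The "transformed accordingly" part can be as simple as centering the smaller patch in a zero-filled 32x32 canvas (a sketch under that assumption; the helper name is made up, and the network still has to have seen such padded samples during training):

Code:
import numpy as np

def pad_to_32(patch):
    # Zero-padding: place a smaller RGB patch in the middle of a 32x32x3
    # canvas so it matches the input size the CNN was trained on.
    canvas = np.zeros((32, 32, 3), dtype=patch.dtype)
    h, w, _ = patch.shape
    y0, x0 = (32 - h) // 2, (32 - w) // 2
    canvas[y0:y0 + h, x0:x0 + w] = patch
    return canvas

small = np.random.rand(16, 16, 3)   # 16x16 input patch
padded = pad_to_32(small)           # -> (32, 32, 3), ready for the 32x32 CNN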

Creating and training a CNN is pretty simple with toolkits like Theano or Caffe. The problem is: how do I preprocess the input data, how do I postprocess the output data, how many CLs and PLs are necessary, what is the right feature size, and so on. These are the true hindrances. CNNs and ANNs are black boxes once trained; you cannot take a look inside and understand what's going on.
 
They apparently have some transistor-level power tricks up their sleeve (see Dally's talk at SC15), so they can probably utilize a bit more of the additional areal density without severely running into the power wall. My guess is that they've also improved the RFC (Dally's pet data-locality topic) and probably the register file itself. Plus maybe larger L2 caches.

The problem is that with the initial run of FinFET the limiting factor is easily going to be yields, not power. The fmax curve for FinFETs is so sharp that trying to drum up power efficiency doesn't get you much; increasing the frequency is going to hit you with a bad exponential power curve no matter what you do. Funnily enough, this also applies to trying to increase density, as efficiency versus frequency drops off the other way as well. So you can produce a huge die and run it at a lower frequency, but with efficiency dropping off exponentially the other way, you're not going to gain much that way either.

Really, with such a sharp curve, the efficient use of hardware, e.g. output vs. frequency, would be the best optimization you could expect for the current generation of FinFETs, and perhaps future ones as well. And yes, if you phrase that right, it might sound like power efficiency and output efficiency are the same thing. But while they're linked, you can still concentrate on getting less power draw versus frequency in ways that don't necessarily output more useful work (think IPC) per clock cycle, which is exactly what Nvidia did with Maxwell and what let them run at high frequencies on 28nm. But that's less useful on FinFET.
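To put toy numbers on that curve (a back-of-the-envelope model of my own; the linear voltage/frequency relation and every constant are made up, not process data): dynamic power scales roughly as C·V²·f, so once voltage has to climb with clock speed, perf per watt drops off quickly:

Code:
# Toy model of the frequency vs. power trade-off discussed above.
# Assumes dynamic power P ~ C * V^2 * f and a made-up linear V(f);
# none of these numbers describe any real process.
def power(freq_ghz, v_min=0.7, v_slope=0.25, c=1.0):
    v = v_min + v_slope * freq_ghz   # hypothetical voltage needed at this clock
    return c * v * v * freq_ghz      # relative dynamic power

for f in (1.0, 1.5, 2.0):
    p = power(f)
    print(f"{f:.1f} GHz -> relative power {p:.2f}, perf/W {f / p:.2f}")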

Still, it raises the question of what TDPs the first FinFET GPUs can hit. There is still wiggle room within the curve, so while power-efficiency gains are a bit less useful, they certainly aren't useless.
 
They apparently have some transistor-level power tricks up their sleeve (see Dally's talk at SC15), so they can probably utilize a bit more of the additional areal density without severely running into the power wall. My guess is that they've also improved the RFC (Dally's pet data-locality topic) and probably the register file itself. Plus maybe larger L2 caches.

Just watched the Dally presentation. Interesting stuff. Seems the charge recycling kung fu has been around for a while (at least on paper).

I just did an HTPC build with a GTX 960 in a Fractal Node 202 (sweet console-sized case), and I'm now firmly in need of more performance in a power-constrained chassis (~120W max GPU TDP).

Hoping for magic from this new generation.
 
I just did an HTPC build with a GTX 960 in a Fractal Node 202 (sweet console-sized case), and I'm now firmly in need of more performance in a power-constrained chassis (~120W max GPU TDP).

Isn't that coming with a 450W PSU? Why limit yourself to a 120W GPU?

Regardless, Polaris 11 is expected to be ~150W.
 
Isn't that coming with a 450W PSU? Why limit yourself to a 120W GPU?

Regardless, Polaris 11 is expected to be ~150W.

The limitation isn't the PSU; it's cooling/noise. There isn't a lot of airflow available to the GPU and temps are pretty high. I had to raise the temp limit to 95°C just to avoid throttling. Maybe a blower would work better.
 
The limitation isn't the PSU; it's cooling/noise. There isn't a lot of airflow available to the GPU and temps are pretty high. I had to raise the temp limit to 95°C just to avoid throttling. Maybe a blower would work better.

Well you definitely need a blower in that type of case. I think you could mount a GTX 970 in there if it was using the reference cooler.
 
Question:

Do we know for sure that:
  • The perf/W numbers and the performance figures belong to the same chip, or at least to chips manufactured on the same 16nm node configuration?
  • The rumored 12/4 TFLOPS GPU is actually a single die, and not already split, making good use of the interposer which is required for HBM2 anyway?

As it stands, I'm still having trouble believing that Nvidia could possibly achieve that performance number with a single die at that power target. My point is: even after replacing the GDDR5 controller with HBM2, and possibly also optimizing the global routing for less wasted space, the only way to achieve doubled throughput would be to go all the way to the high-density configuration of the 16nm FinFET process.

The problem with that is that the power characteristics when opting for that configuration are pretty close to 28nm. Even with a few smart tricks and the savings from HBM2, that's still an overall net increase in power consumption, something on the order of a 350-400W TDP.

Going for the low-power configuration with a single die doesn't sound plausible either, as that means roughly the same size as with 28nm. If they did that, there's just no way they could keep the die size below 800-900mm². That's why I suspect the announced perf/W figures are actually for a smaller chip manufactured on a low-power configuration, which appears barely doable given the characteristics of 16nm FinFET as we know them.

Finally, the only real option I see for how they could achieve both the performance and the efficiency targets while keeping the die size within bounds is that the GP100 (or whatever it is actually going to be called) is actually a dual GPU inside a single package. The interposer makes this a lot easier, as the interconnects can be attached at any arbitrary point, greatly simplifying the routing. The fact that NVLink is announced to be integrated into that GPU anyway somewhat supports this theory.
 
The rumored 12/4 TFLOPS GPU is actually a single die, and not already split, making good use of the interposer which is required for HBM2 anyway?
The CUDA Fellow's slide deck claims they've actually integrated the memory straight onto the same die; not exactly sure how that could be done, though.
 
The CUDA Fellow's slide deck claims they've actually integrated the memory straight onto the same die; not exactly sure how that could be done, though.
Yeah, I don't believe that one either.

More likely the presenter got package, interposer and die mixed up. Especially since HBM2 memory is not planar but stacked, and trying to make it planar would eliminate a huge part of the savings. Not to mention the die size required to achieve any reasonable RAM capacity.
 
The CUDA Fellow's slide deck claims they've actually integrated the memory straight onto the same die; not exactly sure how that could be done, though.
I think that guy was just trying to point out, very clumsily, how things are getting pulled closer to the main die. I wouldn't read anything more into it. It's a pretty sloppy presentation.
 