Nvidia Pascal Speculation Thread

Discussion in 'Architecture and Products' started by DSC, Mar 25, 2014.

Thread Status:
Not open for further replies.
  1. iMacmatician

    Regular

    Joined:
    Jul 24, 2010
    Messages:
    774
    Likes Received:
    202
    Thanks for the info.

    I checked the authors of the presentations I linked to in my above post, and as far as I can tell, all of the authors are within NVIDIA except for the CUDA Fellow.

    EDIT: Also, the date of "The Future of HPC and The Path to Exascale" presentation is 2 March, not 3 March.
     
    #761 iMacmatician, Feb 17, 2016
    Last edited: Feb 18, 2016
    CarstenS likes this.
  2. Jawed

    Legend

    Joined:
    Oct 2, 2004
    Messages:
    10,873
    Likes Received:
    767
    Location:
    London
    That deck is ancient and should be ignored.

    The clue is in the fact that there's no mention of HBM.
     
  3. Kaotik

    Kaotik Drunk Member
    Legend

    Joined:
    Apr 16, 2003
    Messages:
    8,254
    Likes Received:
    1,940
    Location:
    Finland
Actually, there is a slide called "HBM (High Bandwidth Memory) versus DDR3 and GDDR5", and the Pascal bandwidth obviously matches HBM, not HMC.
     
  4. Blazkowicz

    Legend Veteran

    Joined:
    Dec 24, 2004
    Messages:
    5,607
    Likes Received:
    256
Interesting. But then, can you run another neural network on, say, 16x16 pixel data?
    I have the feeling you can throw endless resources at the problem, although that may take second place to algorithms and knowing what you're doing.
     
  5. lanek

    Veteran

    Joined:
    Mar 7, 2012
    Messages:
    2,469
    Likes Received:
    315
    Location:
    Switzerland


This is part of an independent course given at the SC event in June 2015 about memory evolution on GPUs.

    I don't know if the numbers had been given to him at that time, or if it was just a projection, speculation based on a possible +50% performance gain for Pascal.

    That said, we are at 8 TFLOPS SP, so +50% seems plausible.
     
    #765 lanek, Feb 17, 2016
    Last edited: Feb 17, 2016
  6. Frenetic Pony

    Regular Newcomer

    Joined:
    Nov 12, 2011
    Messages:
    347
    Likes Received:
    93
The big, round, power-of-10 numbers for Pascal versus actual numbers for everything else obviously constitute a guesstimate, and a projected one at that given the timeframe. What's interesting is whether it's an informed one. 12 teraflops is certainly doable with the new FinFET nodes; I'd always just assumed it would take a die too big for Nvidia to produce, at least in the initial run, on what is essentially Maxwell in all the important parts.
     
  7. CarstenS

    Veteran Subscriber

    Joined:
    May 31, 2002
    Messages:
    4,803
    Likes Received:
    2,064
    Location:
    Germany
They apparently have some transistor-level power tricks up their sleeve (see Dally's talk at SC15), so they can probably utilize a bit more of the additional areal density without running severely into the power wall. My guess is that they've also improved the RFC (Dally's pet data-locality scheme) and probably the register file itself. Plus maybe larger L2 caches.
     
    nnunn likes this.
  8. Nakai

    Newcomer

    Joined:
    Nov 30, 2006
    Messages:
    46
    Likes Received:
    10
Of course, it is possible to run another ANN with a different input size. A CNN usually consists of multiple different kinds of layers. There are so-called Convolutional Layers (CL), Pooling Layers (PL), and Fully-Connected Layers (FL) at the end of the network. I made a post about the structure of CNNs a while ago, so don't bother searching for it.

CLs are used to recognize features within the input data. For pixel data, features are composed of adjacent pixels forming regions of certain contrasts, figures, edges, and so on. You usually have multiple CLs in parallel, as every CL is capable of searching the input data for one single feature. This is done in raster-scan fashion, meaning every neuron in a CL looks at a certain region of the input data. For example, if a CL has a kernel size of 5x5, it searches for features with a size of 5x5 pixels. For every feature you want to recognize, you have a dedicated CL in parallel, and these are independent of each other. That gives you parallelism, where you can get a good speed-up by using a GPGPU or other vector processing units. The general execution is also very similar to matrix multiplication, so it is possible to use tricks like tiling (as in BLAS) to achieve better performance.

After every CL there is a Pooling Layer, which downsamples the recognized features. This is very important, as it removes noise and feature "artifacts". It can be seen as a downsampling kernel (min, max, average, etc.). After that come further CLs and PLs. You stack them together in order to recognize other high-level or low-level features, as features can be composed of other features, and so on. The general structure of CNNs is modeled after the visual cortex in our brain, where complex and simple nerve cells are stacked onto each other. After multiple CLs and PLs there are FLs, which gather the recognized feature information and map it to the corresponding outputs.

For example, if I want to recognize different traffic signs, I would use a defined input pixel size for my input image, say 32x32 pixels RGB => 3x32x32 input data, and use multiple stacked parallel CLs and PLs. Maybe there are 10 parallel CL/PL pipes with 3 stacks each, consisting of 6 layers in total. Then I include two additional FLs at the end of my network, consisting of 100 and then 30 neurons. If I want to recognize 10 different traffic signs, the CNN has 10 outputs. So I train that network with my GD (gradient descent) algorithm until my evaluation data set achieves a very high recognition rate. For every output I get a probability of the corresponding feature. So there can be problems with false-positive and false-negative errors. For example, in the recognition of handwritten digits (MNIST), a "7" can look similar to a "1", so the recognition rate for those cases might suffer.

And now for your question. If I create a CNN with a 32x32-pixel input size, it also covers smaller input sizes. As I mentioned, a CNN consists of stacked CLs and PLs with downsampling capabilities, so you are indirectly investigating the input for smaller features. The final answer is always "it depends"! It is possible to train a 32x32 CNN to recognize 16x16 input data, but you need to make sure that the data is transformed accordingly (scaling, zero-padding) and that the network is trained to recognize such transformed data. Maybe you then need bigger CLs, or more stacked CLs and PLs, to recognize that data. Maybe it would be better to run an independent CNN. So the final answer is "it depends".

Creating and training a CNN is pretty simple with toolkits like Theano or Caffe. The problems are: how do I preprocess the input data, how do I postprocess the output data, how many CLs and PLs are necessary, what is the right feature size, and so on. These are the true hindrances. CNNs and ANNs are black boxes once trained; you cannot take a look inside and understand what's going on.
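    The convolution and pooling steps described above can be sketched in a few lines of plain Python. This is a toy illustration only, not tied to any toolkit; the 4x4 "image" and the 2x2 vertical-edge kernel are made-up values:

    ```python
    # Minimal sketch of one convolutional layer (CL) followed by one
    # max-pooling layer (PL), as described in the post above.
    # Pure Python, no toolkit; all input values are illustrative.

    def convolve2d(image, kernel):
        """Valid 2D convolution (cross-correlation, as CNNs actually use)."""
        ih, iw = len(image), len(image[0])
        kh, kw = len(kernel), len(kernel[0])
        out = []
        for y in range(ih - kh + 1):
            row = []
            for x in range(iw - kw + 1):
                acc = 0.0
                for dy in range(kh):
                    for dx in range(kw):
                        acc += image[y + dy][x + dx] * kernel[dy][dx]
                row.append(acc)
            out.append(row)
        return out

    def max_pool(feature_map, size=2):
        """Non-overlapping max pooling: downsamples and suppresses noise."""
        out = []
        for y in range(0, len(feature_map) - size + 1, size):
            row = []
            for x in range(0, len(feature_map[0]) - size + 1, size):
                row.append(max(feature_map[y + dy][x + dx]
                               for dy in range(size) for dx in range(size)))
            out.append(row)
        return out

    # A 4x4 image with a vertical edge, and a 2x2 edge-detecting kernel.
    image = [[0, 0, 1, 1],
             [0, 0, 1, 1],
             [0, 0, 1, 1],
             [0, 0, 1, 1]]
    kernel = [[-1, 1],
              [-1, 1]]

    fmap = convolve2d(image, kernel)   # 3x3 map; responds where the edge sits
    pooled = max_pool(fmap, 2)          # downsampled to 1x1
    ```

    A real CL would run many such kernels in parallel, which is exactly where the GPGPU speed-up mentioned above comes from.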
     
  9. Frenetic Pony

    Regular Newcomer

    Joined:
    Nov 12, 2011
    Messages:
    347
    Likes Received:
    93
Problem is, with the initial run of FinFET the limiting factor is easily going to be yields, not power. The fmax curve for FinFETs is so sharp that trying to drum up power efficiency doesn't get you much; increasing the frequency is going to hit you with a bad exponential power curve no matter what you do. Funnily enough, this also applies to trying to increase density, as efficiency versus frequency drops off the other way as well. So you can produce a huge die and run it at lower frequency, but with efficiency dropping off exponentially in the other direction you're not going to get much that way either.

Really, with such a sharp curve, the efficient use of hardware, e.g. output vs. frequency, would be the best optimization you could expect for the current generation of FinFETs, and perhaps future ones as well. And yes, if you phrase it right, it might sound like power efficiency and output efficiency are the same thing. But while they're linked, you can still concentrate on getting less power draw vs. frequency in ways that don't necessarily produce more useful work (think IPC) per clock cycle, which is exactly what Nvidia did with Maxwell and what let them run at high frequencies on 28nm. But that's less useful on FinFET.

Still, it brings up the question of what TDPs the first FinFET GPUs can hit. There is still wiggle room within the curve, so while power-efficiency gains are a bit less useful, they certainly aren't useless.
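    The "bad exponential power curve" argument can be made concrete with the classic dynamic-power relation P ≈ C·V²·f, assuming supply voltage has to rise roughly linearly with frequency near fmax. All the constants below are made up purely for illustration, not measured FinFET data:

    ```python
    # Toy model of why chasing frequency is so expensive:
    # dynamic power P = C * V^2 * f, with V assumed to scale as
    # V(f) = v0 + k*f near the top of the curve.
    # v0, k, c are illustrative constants, not real process numbers.

    def dynamic_power(freq_ghz, v0=0.7, k=0.25, c=1.0):
        """P = c * V^2 * f with a linear voltage/frequency assumption."""
        v = v0 + k * freq_ghz
        return c * v * v * freq_ghz

    p1 = dynamic_power(1.0)   # baseline clock
    p2 = dynamic_power(1.5)   # +50% clock -> much more than +50% power
    ```

    With these (made-up) constants, a 50% clock bump costs roughly 90% more power, which is the shape of the trade-off being described.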
     
  10. trinibwoy

    trinibwoy Meh
    Legend

    Joined:
    Mar 17, 2004
    Messages:
    10,436
    Likes Received:
    443
    Location:
    New York
    Just watched the Dally presentation. Interesting stuff. Seems the charge recycling kung fu has been around for a while (at least on paper).

I just did an HTPC build with a GTX 960 in a Fractal Node 202 (sweet console-sized case) and I'm now firmly in need of more performance in a power-constrained chassis (~120 W max GPU TDP).

    Hoping for magic from this new generation.
     
  11. ToTTenTranz

    Legend Veteran Subscriber

    Joined:
    Jul 7, 2008
    Messages:
    10,076
    Likes Received:
    4,651
    Isn't that coming with a 450W PSU? Why limit yourself to a 120W GPU?

    Regardless, Polaris 11 is expected to be ~150W.
     
  12. Ailuros

    Ailuros Epsilon plus three
    Legend Subscriber

    Joined:
    Feb 7, 2002
    Messages:
    9,420
    Likes Received:
    179
    Location:
    Chania
  13. trinibwoy

    trinibwoy Meh
    Legend

    Joined:
    Mar 17, 2004
    Messages:
    10,436
    Likes Received:
    443
    Location:
    New York
The limitation isn't the PSU, it's cooling/noise. There isn't a lot of airflow available to the GPU and temps are pretty high. I had to raise the temp limit to 95°C just to avoid throttling. Maybe a blower would work better.
     
  14. ToTTenTranz

    Legend Veteran Subscriber

    Joined:
    Jul 7, 2008
    Messages:
    10,076
    Likes Received:
    4,651
Well, you definitely need a blower in that type of case. I think you could mount a GTX 970 in there if it used the reference cooler.
     
  15. Nakai

    Newcomer

    Joined:
    Nov 30, 2006
    Messages:
    46
    Likes Received:
    10
  16. Ext3h

    Regular Newcomer

    Joined:
    Sep 4, 2015
    Messages:
    354
    Likes Received:
    304
    Question:

Do we know for sure that:
    • The perf./Watt numbers and the performance records belong to the same chip, or are at least manufactured on the same node configuration of 16nm?
    • The rumored 12/4 TFLOPS GPU is actually a single die, and not already split, making good use of the interposer which is required for HBM2 anyway?

As it stands, I'm still having trouble believing that Nvidia could possibly achieve that performance number with a single die at that power target. My point is, even with the GDDR5 controller replaced by HBM2, and possibly the global routing optimized for less wasted space, the only way to achieve a doubled throughput would be to go all the way to the high-density configuration of the 16nm FinFET process.

The problem with that is that the power characteristics of that configuration are pretty close to 28nm. Even with a few smart tricks, and the savings from HBM2, that still means an overall net increase in power consumption, something in the magnitude of a 350-400W TDP.

Going for the low-power configuration with a single die doesn't sound plausible either, as that means roughly the same density as 28nm. If they did that, there's just no way they could keep the die size below 800-900mm², or it might be even larger. That's why I suspect the announced perf/W figures are actually for a smaller chip manufactured on the low-power configuration, which appears barely doable given the characteristics of 16nm FinFET as we know them.

Finally, the only real option I see for how they could achieve both the performance and the efficiency targets while keeping the die size within bounds is that GP100 (or whatever it's actually going to be called) is actually a dual GPU inside a single package. The interposer makes this a lot easier, as the interconnects can be attached at any arbitrary point, greatly simplifying the routing. The announced fact of having NVLink integrated into that GPU anyway somewhat supports this theory.
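    The doubled-throughput arithmetic behind this is easy to sanity-check. A minimal sketch, assuming the usual peak-FLOPS formula (2 FLOPs per FP32 core per cycle via FMA); the 1.4 GHz clock below is my own guess, not a number from the thread:

    ```python
    # Back-of-the-envelope check of the rumored 12 TFLOPS SP figure.
    # Peak SP FLOPS = 2 (FMA) * FP32 cores * clock.
    # The 1.4 GHz Pascal clock is an assumed value for illustration.

    def peak_sp_tflops(fp32_cores, clock_ghz):
        """Theoretical peak single-precision throughput in TFLOPS."""
        return 2 * fp32_cores * clock_ghz / 1000.0

    # GM200 (big Maxwell) as the 28nm reference: 3072 cores, ~1.1 GHz boost.
    maxwell = peak_sp_tflops(3072, 1.1)

    # FP32 cores needed to hit 12 TFLOPS at the assumed 1.4 GHz clock:
    needed = 12 * 1000.0 / (2 * 1.4)
    ```

    That works out to well over 4000 FP32 cores for 12 TFLOPS, i.e. roughly 1.4x GM200's core count on top of a clock bump, which is what drives the die-size concern above.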
     
  17. Kaotik

    Kaotik Drunk Member
    Legend

    Joined:
    Apr 16, 2003
    Messages:
    8,254
    Likes Received:
    1,940
    Location:
    Finland
The CUDA Fellow's slide deck claims they've actually integrated the memories straight onto the same die; I'm not exactly sure how that could be done, though.
     
  18. Ext3h

    Regular Newcomer

    Joined:
    Sep 4, 2015
    Messages:
    354
    Likes Received:
    304
    Yeah, I don't believe that one either.

More likely the clerk got package, interposer, and die mixed up. Especially since the HBM2 memory is not planar but stacked, and trying to make it planar would eliminate a huge part of the savings, not to mention the die size required to achieve any reasonable RAM size.
     
  19. silent_guy

    Veteran Subscriber

    Joined:
    Mar 7, 2006
    Messages:
    3,754
    Likes Received:
    1,379
I think that guy was just trying to point out, very clumsily, how things are getting pulled closer to the main die. I wouldn't read anything more into it. It's a pretty sloppy presentation.
     
  20. 3dilettante

    Legend Alpha

    Joined:
    Sep 15, 2003
    Messages:
    8,137
    Likes Received:
    2,939
    Location:
    Well within 3d
    I admittedly only skimmed the deck, but I am unclear on the specific claim. Which slide is being discussed?
     