There are plenty of deep learning and machine learning algorithms for different use cases. Popular approaches are SLAM, HOG, and CNNs.
SLAM (simultaneous localization and mapping): Location and orientation in 3D space.
HOG (histogram of oriented gradients): Feature detection, i.e. whether a single feature is present or not.
CNN (convolutional neural network): Recognition and differentiation of multiple features (different traffic signs etc.).
NVidia is focusing on CNNs. CNNs have been prominent for a couple of years now, as they achieve super-human image recognition rates. CNNs are composed of multiple layers, usually convolutional layers (CL), (max) pooling layers (PL) and fully-connected layers (FL). CLs are used to extract certain features out of the underlying image. For each feature there exists a corresponding CL, where each neuron within a CL shares the same weights and bias. For example, if the input image is 28x28 (MNIST) and you want to investigate features with a size of 5x5, the resulting CL has a size of 24x24 (28 - 5 + 1 = 24) and each neuron in the CL has 5x5 (25) connections. So each neuron in a CL is investigating an area of 5x5 pixels. After a CL there is usually a PL, which is used to reduce noise and other irregularities.
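A minimal sketch of such a CL in C (sizes match the MNIST example above; the function name is my own): the point is that every output neuron applies the very same 5x5 weights, which is the weight sharing mentioned above.

    #define IN   28            /* input image size (MNIST)        */
    #define K    5             /* feature (kernel) size           */
    #define OUT  (IN - K + 1)  /* 28 - 5 + 1 = 24 valid positions */

    /* One convolutional layer for a single feature: every output
     * neuron uses the SAME 5x5 weights and the same bias. */
    void conv_layer(const float in[IN][IN], const float w[K][K],
                    float bias, float out[OUT][OUT])
    {
        for (int y = 0; y < OUT; y++)
            for (int x = 0; x < OUT; x++) {
                float sum = bias;
                for (int ky = 0; ky < K; ky++)      /* each neuron sees */
                    for (int kx = 0; kx < K; kx++)  /* a 5x5 pixel area */
                        sum += in[y + ky][x + kx] * w[ky][kx];
                out[y][x] = sum;  /* activation function omitted here */
            }
    }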
You have to use multiple parallel CLs and PLs, as you always want to search for more features, and there are also subsequent CLs and PLs in order to extract higher-level features (see the pooling sketch below). After some cascading CLs and PLs, there are usually FLs, at least one. These are used to gather the extracted features and map them to certain identification outputs. If you want to recognize single digits (0-9), you have 10 outputs in your output layer.
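To make the cascade concrete, here is a 2x2 max pooling step in C; the shape progression in the comment is illustrative, not any particular production network:

    #define CONV_OUT 24
    #define POOL_OUT (CONV_OUT / 2)

    /* 2x2 max pooling: halves the resolution and smooths local noise.
     * Illustrative shape progression for the MNIST example:
     *   28x28 input -> CL 5x5 -> 24x24 -> PL 2x2 -> 12x12
     *   -> more CLs/PLs -> FL(s) -> 10 outputs (digits 0-9) */
    void max_pool_2x2(const float in[CONV_OUT][CONV_OUT],
                      float out[POOL_OUT][POOL_OUT])
    {
        for (int y = 0; y < POOL_OUT; y++)
            for (int x = 0; x < POOL_OUT; x++) {
                float m = in[2*y][2*x];
                if (in[2*y][2*x+1]   > m) m = in[2*y][2*x+1];
                if (in[2*y+1][2*x]   > m) m = in[2*y+1][2*x];
                if (in[2*y+1][2*x+1] > m) m = in[2*y+1][2*x+1];
                out[y][x] = m;
            }
    }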
You need a lot of processing power to examine a video stream for certain objects. Still, the execution is not that dramatic, performance-wise. The training algorithms (gradient descent, PSO) consume lots of processing power and are very iterative. Usually data sets consisting of millions of data files are used to train an artificial neural network. For simpler tasks, like MNIST handwritten digit recognition, the public training data set consists of 60,000 input images and their corresponding labels.
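The iterative core of gradient descent is just a repeated weight update; a sketch (learning rate and gradients come from elsewhere):

    /* One gradient-descent step: nudge every weight against its error
     * gradient. Training repeats this over the whole data set for many
     * epochs, which is where the processing cost comes from. */
    void sgd_step(float *w, const float *grad, int n, float learning_rate)
    {
        for (int i = 0; i < n; i++)
            w[i] -= learning_rate * grad[i];
    }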
For computing neural networks you need multiplication and addition; don't expect a magical new special operation.
Typically the neuron outputs are 8-bit unsigned and the neural net weights are 8-bit signed.
For fully connected layers of, say, 1024 inputs and 1024 outputs, you have a 1024x1024 weight matrix in between.
All the computation goes into multiplying a 1024-element vector with a 1024x1024 matrix.
In the case of 8 bit, 'special' hardware can speed this up by doing, for example, n0*w0 + n1*w1 + n2*w2 + n3*w3 in one step
and accumulating the result in a 32-bit accumulator, the multiplications being 8-bit. Hence the mixed precision.
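In plain C that mixed-precision matrix-vector product looks roughly like this (array sizes follow the 1024x1024 example; real hardware would fuse the 4-wide group into a single instruction):

    #include <stdint.h>

    /* out = W (1024x1024, int8 weights) * n (1024 uint8 neuron outputs),
     * accumulated in 32 bits so the sum cannot overflow:
     * worst case 1024 * 255 * 128 < 2^31. */
    void fc_matvec(const int8_t w[1024][1024], const uint8_t n[1024],
                   int32_t out[1024])
    {
        for (int row = 0; row < 1024; row++) {
            int32_t acc = 0;
            for (int col = 0; col < 1024; col += 4) {
                /* four 8-bit multiplies, one 32-bit accumulate:
                 * the "mixed precision" operation */
                acc += (int32_t)n[col]     * w[row][col]
                     + (int32_t)n[col + 1] * w[row][col + 1]
                     + (int32_t)n[col + 2] * w[row][col + 2]
                     + (int32_t)n[col + 3] * w[row][col + 3];
            }
            out[row] = acc;
        }
    }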
To wow the crowds, sure, 24 tera Deep Learning operations per second sounds more impressive than 24 tera 8-bit mixed-precision operations per second.
1) Well, and how do you handle numeric problems like overflows and underflows? I've implemented a small OpenCL lib for training and execution, and a concept to execute ANNs on an FPGA.
I've used my lib to train with fixed-point numbers in order to migrate my networks onto my FPGA. I was using 16-bit fixed-point numbers in the Q6.10 format (range [-32, 32)) and encountered many overflows and underflows, for both addition and multiplication. Adding two 16-bit numbers can yield a 17-bit result, and multiplying them can yield a 32-bit result. In order to handle those errors, I saturated my value range: if I encountered an overflow, I set the value to the maximum, and vice versa for underflow.
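In C, that saturation logic looks roughly like this (the helper names are my own):

    #include <stdint.h>

    #define Q_FRAC 10         /* Q6.10: 10 fractional bits */
    #define Q_MAX  INT16_MAX  /*  31.999... in Q6.10       */
    #define Q_MIN  INT16_MIN  /* -32.0      in Q6.10       */

    /* Clamp a wide intermediate result back into the 16-bit range. */
    static int16_t q_saturate(int32_t v)
    {
        if (v > Q_MAX) return Q_MAX;  /* overflow  -> max */
        if (v < Q_MIN) return Q_MIN;  /* underflow -> min */
        return (int16_t)v;
    }

    /* Addition of two Q6.10 values may need 17 bits... */
    static int16_t q_add(int16_t a, int16_t b)
    {
        return q_saturate((int32_t)a + (int32_t)b);
    }

    /* ...and multiplication up to 32 bits; shift away the extra
     * fractional bits before saturating. */
    static int16_t q_mul(int16_t a, int16_t b)
    {
        int32_t p = (int32_t)a * (int32_t)b;  /* Q12.20 intermediate */
        return q_saturate(p >> Q_FRAC);       /* back to Q6.10       */
    }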
2) Which activation function did you use? Most likely ReLU. So you reduced the final 32-bit value to an 8-bit output (signed)? Were there any problems?
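Assuming ReLU and the unsigned 8-bit neuron outputs from above, that reduction could look like the sketch below; the shift amount is an illustrative, layer-dependent scale factor, not a fixed rule:

    #include <stdint.h>

    #define SHIFT 8  /* illustrative per-layer rescaling factor */

    /* ReLU + requantization: clamp negatives to zero (ReLU), rescale
     * the 32-bit accumulator, and saturate into the unsigned 8-bit
     * neuron output. */
    static uint8_t relu_requantize(int32_t acc)
    {
        if (acc < 0) return 0;                  /* ReLU kills negatives */
        acc >>= SHIFT;                          /* rescale to output range */
        return acc > 255 ? 255 : (uint8_t)acc;  /* saturate to 8 bits */
    }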