There are plenty of deep learning and machine learning algorithms for different use cases. Popular approaches are SLAM, HOG, and CNNs.
SLAM (simultaneous localization and mapping): Location and orientation in 3D space.
HOG (histogram of oriented gradients): Feature detection, i.e. whether a single feature is present or not.
CNN (convolutional neural network): Recognition and differentiation of multiple features (different traffic signs etc.).
NVidia is focusing on CNNs. CNNs have been prominent for a couple of years now, as they achieve super-human image recognition rates. CNNs are composed of multiple layers, usually convolutional layers (CL), (max) pooling layers (PL) and fully-connected layers (FL). CLs are used to extract certain features out of the underlying image. For each feature there exists a corresponding CL, where each neuron within a CL shares the same weights and bias. For example, if the input image is 28x28 (MNIST) and you want to investigate features with a size of 5x5, the resulting CL has a size of 24x24 (28 - 5 + 1 = 24) and each neuron in the CL has 5x5 (25) connections. So each neuron in a CL is investigating an area of 5x5 pixels. After a CL there is usually a PL, which is used to reduce noise and other irregularities.
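A minimal sketch of such a CL in C (sizes match the MNIST example above; the function name is my own): the point is that every output neuron applies the very same 5x5 weights, which is the weight sharing mentioned above.

    #define IN   28            /* input image size (MNIST)        */
    #define K    5             /* feature (kernel) size           */
    #define OUT  (IN - K + 1)  /* 28 - 5 + 1 = 24 valid positions */

    /* One convolutional layer for a single feature: every output
     * neuron uses the SAME 5x5 weights and the same bias. */
    void conv_layer(const float in[IN][IN], const float w[K][K],
                    float bias, float out[OUT][OUT])
    {
        for (int y = 0; y < OUT; y++)
            for (int x = 0; x < OUT; x++) {
                float sum = bias;
                for (int ky = 0; ky < K; ky++)      /* each neuron sees */
                    for (int kx = 0; kx < K; kx++)  /* a 5x5 pixel area */
                        sum += in[y + ky][x + kx] * w[ky][kx];
                out[y][x] = sum;  /* activation function omitted here */
            }
    }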
You have to use multiple parallel CLs and PLs, as you always want to search for more features, and there are also subsequent CLs and PLs in order to extract higher-level features (see the pooling sketch below). After some cascading CLs and PLs, there are usually FLs, at least one. These are used to gather the extracted features and map them to certain identification outputs. If you want to recognize single digits (0-9), you have 10 outputs in your output layer.
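To make the cascade concrete, here is a 2x2 max pooling step in C; the shape progression in the comment is illustrative, not any particular production network:

    #define CONV_OUT 24
    #define POOL_OUT (CONV_OUT / 2)

    /* 2x2 max pooling: halves the resolution and smooths local noise.
     * Illustrative shape progression for the MNIST example:
     *   28x28 input -> CL 5x5 -> 24x24 -> PL 2x2 -> 12x12
     *   -> more CLs/PLs -> FL(s) -> 10 outputs (digits 0-9) */
    void max_pool_2x2(const float in[CONV_OUT][CONV_OUT],
                      float out[POOL_OUT][POOL_OUT])
    {
        for (int y = 0; y < POOL_OUT; y++)
            for (int x = 0; x < POOL_OUT; x++) {
                float m = in[2*y][2*x];
                if (in[2*y][2*x+1]   > m) m = in[2*y][2*x+1];
                if (in[2*y+1][2*x]   > m) m = in[2*y+1][2*x];
                if (in[2*y+1][2*x+1] > m) m = in[2*y+1][2*x+1];
                out[y][x] = m;
            }
    }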
You need a lot of processing power to examine a video stream for certain objects. Still, the execution is not that dramatic, performance-wise. The training algorithms (gradient descent, PSO) consume lots of processing power and are very iterative. Usually data sets consisting of millions of data files are used to train an artificial neural network. For simpler tasks, like MNIST handwritten digit recognition, the public training data set consists of 60,000 input images and their corresponding labels.
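The iterative core of gradient descent is just a repeated weight update; a sketch (learning rate and gradients come from elsewhere):

    /* One gradient-descent step: nudge every weight against its error
     * gradient. Training repeats this over the whole data set for many
     * epochs, which is where the processing cost comes from. */
    void sgd_step(float *w, const float *grad, int n, float learning_rate)
    {
        for (int i = 0; i < n; i++)
            w[i] -= learning_rate * grad[i];
    }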
For computing neural networks you need multiplication and addition; don't expect a magical new special operation.
Typically the neuron outputs are 8-bit unsigned and the neural net weights are 8-bit signed.
For fully connected layers of, say, 1024 inputs and 1024 outputs, you have a 1024x1024 weight matrix in between.
All the computation goes into multiplying a 1024-element vector with a 1024x1024 matrix.
In the case of 8 bit, 'special' hardware can speed this up by doing, for example, n0*w0 + n1*w1 + n2*w2 + n3*w3 in one step
and accumulating the result in a 32-bit accumulator, the multiplications being 8-bit. Hence the mixed precision.
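In plain C that mixed-precision matrix-vector product looks roughly like this (array sizes follow the 1024x1024 example; real hardware would fuse the 4-wide group into a single instruction):

    #include <stdint.h>

    /* out = W (1024x1024, int8 weights) * n (1024 uint8 neuron outputs),
     * accumulated in 32 bits so the sum cannot overflow:
     * worst case 1024 * 255 * 128 < 2^31. */
    void fc_matvec(const int8_t w[1024][1024], const uint8_t n[1024],
                   int32_t out[1024])
    {
        for (int row = 0; row < 1024; row++) {
            int32_t acc = 0;
            for (int col = 0; col < 1024; col += 4) {
                /* four 8-bit multiplies, one 32-bit accumulate:
                 * the "mixed precision" operation */
                acc += (int32_t)n[col]     * w[row][col]
                     + (int32_t)n[col + 1] * w[row][col + 1]
                     + (int32_t)n[col + 2] * w[row][col + 2]
                     + (int32_t)n[col + 3] * w[row][col + 3];
            }
            out[row] = acc;
        }
    }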
To wow the crowds, sure, 24 tera Deep Learning operations per second sounds more impressive than 24 tera 8-bit mixed-precision operations per second.
1) Well, and how do you handle numeric problems like overflows and underflows? I've implemented a small OpenCL lib for training and execution, and a concept to execute ANNs on an FPGA.
I've used my lib to train with fixed-point numbers in order to migrate my networks onto my FPGA. I was using 16-bit fixed-point numbers in the Q6.10 format (range [-32, 32)) and encountered many overflows and underflows, for both addition and multiplication. Adding two 16-bit numbers can yield a 17-bit result, and multiplying them can yield a 32-bit result. In order to handle those errors, I saturated my value range: if I encountered an overflow, I set the value to the maximum, and vice versa for underflow.
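In C, that saturation logic looks roughly like this (the helper names are my own):

    #include <stdint.h>

    #define Q_FRAC 10         /* Q6.10: 10 fractional bits */
    #define Q_MAX  INT16_MAX  /*  31.999... in Q6.10       */
    #define Q_MIN  INT16_MIN  /* -32.0      in Q6.10       */

    /* Clamp a wide intermediate result back into the 16-bit range. */
    static int16_t q_saturate(int32_t v)
    {
        if (v > Q_MAX) return Q_MAX;  /* overflow  -> max */
        if (v < Q_MIN) return Q_MIN;  /* underflow -> min */
        return (int16_t)v;
    }

    /* Addition of two Q6.10 values may need 17 bits... */
    static int16_t q_add(int16_t a, int16_t b)
    {
        return q_saturate((int32_t)a + (int32_t)b);
    }

    /* ...and multiplication up to 32 bits; shift away the extra
     * fractional bits before saturating. */
    static int16_t q_mul(int16_t a, int16_t b)
    {
        int32_t p = (int32_t)a * (int32_t)b;  /* Q12.20 intermediate */
        return q_saturate(p >> Q_FRAC);       /* back to Q6.10       */
    }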
2) Which activation function did you use? Most likely ReLU. So you reduced the final 32-bit value to an 8-bit output (signed)? Were there any problems?
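Assuming ReLU and the unsigned 8-bit neuron outputs from above, that reduction could look like the sketch below; the shift amount is an illustrative, layer-dependent scale factor, not a fixed rule:

    #include <stdint.h>

    #define SHIFT 8  /* illustrative per-layer rescaling factor */

    /* ReLU + requantization: clamp negatives to zero (ReLU), rescale
     * the 32-bit accumulator, and saturate into the unsigned 8-bit
     * neuron output. */
    static uint8_t relu_requantize(int32_t acc)
    {
        if (acc < 0) return 0;                  /* ReLU kills negatives */
        acc >>= SHIFT;                          /* rescale to output range */
        return acc > 255 ? 255 : (uint8_t)acc;  /* saturate to 8 bits */
    }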