“The Transformer Engine, as it was originally invented with Hopper, tracks the accuracy and the dynamic range of every layer of every tensor in the entire neural network as it proceeds in computing,” explained Buck. “And as the model is training over time, we are constantly monitoring the ranges of every layer and adapting to stay within the bounds of the numerical precision to get the best performance. In Hopper, this bookkeeping extends to a 1,000-way history used to compute updates and scale factors that allow the entire computation to happen in only eight-bit precision.”
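To make that bookkeeping concrete, here is a minimal Python sketch of the kind of per-tensor range tracking Buck describes: keep a history of each tensor's observed absolute maximum and derive a scale factor that keeps values inside the FP8 representable range. The class name, the history handling, and the E4M3 maximum of 448 are illustrative assumptions, not the actual Transformer Engine implementation.

```python
import numpy as np

FP8_E4M3_MAX = 448.0   # largest representable magnitude in FP8 E4M3 (assumed format)
HISTORY_LEN = 1000     # the "1,000-way history" of observed ranges

class TensorScaler:
    """Tracks the observed dynamic range of one tensor and derives a scale
    factor so its values fit the FP8 representable range."""
    def __init__(self):
        self.amax_history = []

    def update(self, tensor):
        # Record the absolute maximum seen at this training step.
        self.amax_history.append(float(np.abs(tensor).max()))
        if len(self.amax_history) > HISTORY_LEN:
            self.amax_history.pop(0)

    def scale(self):
        # Use the largest recent amax so the scaled tensor stays in range.
        amax = max(self.amax_history) if self.amax_history else 1.0
        return FP8_E4M3_MAX / amax if amax > 0 else 1.0

# One scaler per layer/tensor; the scaled values are what the 8-bit math sees.
scaler = TensorScaler()
activations = np.random.randn(1024, 1024).astype(np.float32) * 30.0
scaler.update(activations)
scaled = activations * scaler.scale()   # now bounded by roughly +/- 448
```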
“With Blackwell, we take it a step further. In hardware, we can adjust the scaling on every tensor. Blackwell supports micro-tensor scaling: not just the entire tensor, which we can still monitor, but now we can look at the individual elements within the tensor. To take this a step further, Blackwell’s second-generation Transformer Engine will allow us to take AI computing to FP4, using only a four-bit floating-point representation to perform the AI calculation. That’s four zeros and ones for every neuron, every connection, literally just 16 possible values.”
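Below is a rough NumPy sketch of what per-block (micro-tensor) scaling to FP4 could look like: each small block of elements gets its own scale factor before its values are snapped to the 16 FP4 (E2M1) code points. The block size of 32 and the quantization routine are illustrative assumptions, not Blackwell’s hardware scheme.

```python
import numpy as np

# Representable magnitudes of FP4 E2M1; with sign, that is 16 code points total.
FP4_VALUES = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])
BLOCK = 32   # illustrative micro-block size; the real hardware granularity may differ

def quantize_fp4_blockwise(x):
    """Quantize a 1-D array to FP4 with one scale factor per block of elements,
    rather than a single scale factor for the whole tensor."""
    out = np.empty_like(x, dtype=np.float32)
    scales = []
    for start in range(0, len(x), BLOCK):
        block = x[start:start + BLOCK]
        amax = np.abs(block).max()
        scale = amax / FP4_VALUES[-1] if amax > 0 else 1.0   # map the block max to 6.0
        scales.append(scale)
        # Snap each scaled magnitude to the nearest representable FP4 value.
        mags = np.abs(block) / scale
        idx = np.abs(mags[:, None] - FP4_VALUES[None, :]).argmin(axis=1)
        out[start:start + BLOCK] = np.sign(block) * FP4_VALUES[idx] * scale
    return out, np.array(scales)

x = np.random.randn(4096).astype(np.float32)
xq, scales = quantize_fp4_blockwise(x)
print("mean absolute quantization error:", np.abs(x - xq).mean())
```

Because the scale factor adapts to each small block rather than to the whole tensor, outliers in one region no longer force a coarse scale onto everything else, which is what makes four bits viable at all.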
“Getting down to that level of fine granularity is a miracle in itself. The second-generation Transformer Engine does that work combined with Blackwell’s micro-tensor scaling, and that means we can deliver twice the amount of compute as before. We can double the effective bandwidth, because going from eight bits to four bits halves the data size. And of course, a model twice the size can fit on an individual GPU.”
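The bandwidth and capacity claim is simple arithmetic, sketched below for a hypothetical 70-billion-parameter model; the parameter count is illustrative and scale-factor overhead is ignored.

```python
# Halving the bits per value halves the bytes moved and stored.
params = 70e9                 # hypothetical 70B-parameter model (not from the source)
bytes_fp8 = params * 1.0      # 8 bits  = 1 byte per value
bytes_fp4 = params * 0.5      # 4 bits  = 0.5 byte per value
print(f"FP8: {bytes_fp8 / 1e9:.0f} GB, FP4: {bytes_fp4 / 1e9:.0f} GB")
```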