They increased the die size by 33.6% while impressively keeping to the same 300W, and they went even further:
The impressive part for me is more of an "oh shit, I can't believe they were bullish enough to do this" than an actual technical achievement.
Each wafer can only produce very few chips, and most probably the great majority of them come out with a defect (either one that can be worked around by disabling units, or one that kills the die outright).
With a 300mm wafer they're probably getting around 60-65 dies per wafer.
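As a sanity check on that 60-65 figure, here's a quick back-of-the-envelope script using a common die-per-wafer approximation (the ~815 mm² V100 and ~610 mm² GP100 die sizes are the published figures; the formula is just a first-order estimate that ignores scribe lines and edge exclusion):

```python
import math

def dies_per_wafer(wafer_diameter_mm, die_area_mm2):
    # Gross wafer area divided by die area, minus a correction for the
    # partial dies lost around the wafer's edge.
    radius = wafer_diameter_mm / 2
    return (math.pi * radius**2 / die_area_mm2
            - math.pi * wafer_diameter_mm / math.sqrt(2 * die_area_mm2))

print(dies_per_wafer(300, 815))  # V100  (~815 mm^2): ~63 candidate dies
print(dies_per_wafer(300, 610))  # GP100 (~610 mm^2): ~89 candidate dies
print(815 / 610 - 1)             # the 33.6% die size increase
```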
They're only making this because they got clients to pay >$15k per GPU, meaning a 2% yield (practically 1 good GPU per wafer) already provides some profit.
10% yields (6 good chips) would get them $90K in revenue per wafer, of which probably well over $80K is profit after putting the cards together.
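Continuing the per-wafer math with the ~63 candidate dies from above and the >$15k price (both straight from this post, nothing sourced beyond that):

```python
dies_per_wafer = 63   # candidate dies per 300mm wafer, from the estimate above
price = 15_000        # $ per GPU, the ">$15k" figure

for yield_pct in (2, 10):
    good = round(dies_per_wafer * yield_pct / 100)
    print(f"{yield_pct}% yield -> {good} good dies -> ${good * price:,} revenue")
# 2% yield  -> 1 good die   -> $15,000 revenue
# 10% yield -> 6 good dies  -> $90,000 revenue
```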
FP32 compute increases by 41.5%, or 2x (yeah, depends upon function with Tensor).
FP64 compute increases by 41.5%.
FP16 compute increases by 41.5%, or 4x (yeah, depends upon function with Tensor).
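Those percentages fall straight out of the published headline numbers (P100: 10.6 / 5.3 / 21.2 TFLOPS for FP32 / FP64 / FP16; V100: 15 / 7.5 / 30, plus 120 Tensor TFLOPS):

```python
p100 = {"fp32": 10.6, "fp64": 5.3, "fp16": 21.2}  # TFLOPS
v100 = {"fp32": 15.0, "fp64": 7.5, "fp16": 30.0}  # TFLOPS, non-Tensor rates
tensor_tflops = 120.0                             # V100 Tensor core rate

for prec in p100:
    print(f"{prec}: +{(v100[prec] / p100[prec] - 1) * 100:.1f}%")  # 41.5% each
print(f"fp16 via Tensor: {tensor_tflops / v100['fp16']:.0f}x the plain FP16 rate")
```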
They squeezed an extra ~41% CUDA cores into that 33.6% die increase, and importantly with additional functions/units on top.
The FP32 and FP64 unit increase almost matches the increase in die area. Unlike Pascal P100, the FP32 units don't seem to do 2*FP16 operations anymore, as the Tensor cores do that instead. So what they saved in smaller FP32 units and in general die area from the 12FF transition, they invested in the Tensor cores.
Is there a game rendering application for the Tensor units?
The Tensor cores are definitely unable to unpack the values at any position in the cubic matrices (otherwise they would just be regular FP16 ALUs). My guess is someone could multiply 4*4 matrices using two 4*1 matrices with "valid" FP16 values, fill the 3rd dimension with 1s, and in the end just read the first row (EDIT: derp, forgot how to Algebra).
That said, this results in 30 TFLOPs (120/4) of regular FP16 FMAD operations.
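To make that 120/4 concrete, here's a toy numpy model of the idea (plain host code standing in for a tensor-core op, not actual CUDA). Each op computes D = A·B + C on 4x4 tiles; if you only care about row 0 of the result, 16 of the 64 multiply-adds per op do useful work, i.e. a quarter of the throughput:

```python
import numpy as np

# Toy stand-in for one Volta tensor-core op: D = A @ B + C on 4x4 tiles,
# FP16 inputs with FP32 accumulation.
def tensor_op(A, B, C):
    return A.astype(np.float32) @ B.astype(np.float32) + C

a = np.array([1.0, 2.0, 3.0, 4.0], dtype=np.float16)   # one "real" input vector
B = np.arange(16).reshape(4, 4).astype(np.float16)     # four input vectors, as columns
C = np.zeros((4, 4), dtype=np.float32)                 # accumulator

A = np.zeros((4, 4), dtype=np.float16)
A[0, :] = a                                            # rows 1-3 are wasted work

D = tensor_op(A, B, C)
print(D[0])                                          # 4 dot products: a . B[:, j]
print(a.astype(np.float32) @ B.astype(np.float32))   # same values, computed directly

# Only row 0 is useful: 16 of the 64 multiply-adds per op do real work,
# hence the 120 / 4 = 30 TFLOPS estimate for "plain" FP16 math.
```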
Other than being usable as dedicated FP16 units, I don't see any rendering application for the Tensor units. They could be used for AI inferencing in a game, though.
For gaming, they'd probably be better off going back to the FP32 units capable of doing 2*FP16 operations.
Or, like what they did with consumer Pascal, just ignore FP16 altogether, promote all FP16 variables to FP32 and call it a day. This would be risky because in the future developers could be using a lot of FP16 in rendering, but nvidia's consumer architectures aren't exactly known for being extremely future-proof.