The second important point about the C operand is that it enables you to construct the multiplication of larger matrices from the building blocks of smaller multiplies. So if we're working with 8x8 matrices, we can break each matrix down into 4 sub-matrices: top-left, top-right, bottom-left and bottom-right.
So, the top-left sub-matrix of D is computed as follows:

D-tl = A-tl * B-tl + A-tr * B-bl

then:

D-bl = A-bl * B-tl + A-br * B-bl

and similarly for D-tr and D-br.
So the tensor operation becomes the fundamental building block of arbitrary-sized matrix multiplication. In prior GPUs, FMA was that building block. This tensor operation is essentially an FMA on matrix-blocks.
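To make the block-FMA structure concrete, here's a minimal host-side sketch of that decomposition (my own illustrative code, not anything from NVIDIA: plain scalar loops stand in for what a tensor core does in a single operation, and row-major 8x8 matrices are assumed):

enum { N = 8, BLK = 4 };

/* One "tensor op" equivalent: D = A*B + C, where A, B, C, D are 4x4 blocks
   addressed inside full 8x8 row-major matrices (leading dimension N). */
static void block_fma_4x4(const float *A, const float *B,
                          const float *C, float *D)
{
    for (int i = 0; i < BLK; ++i)
        for (int j = 0; j < BLK; ++j) {
            float acc = C[i * N + j];
            for (int k = 0; k < BLK; ++k)
                acc += A[i * N + k] * B[k * N + j];
            D[i * N + j] = acc;
        }
}

/* Full 8x8 multiply: every 4x4 block of D is two chained block FMAs,
   e.g. D-tl = (A-tl * B-tl) + (A-tr * B-bl). */
void matmul_8x8(const float *A, const float *B, float *D)
{
    static const float zero[N * N] = { 0.0f };

    for (int bi = 0; bi < N; bi += BLK)
        for (int bj = 0; bj < N; bj += BLK) {
            float *Dblk = &D[bi * N + bj];
            /* first product starts from a zero accumulator ... */
            block_fma_4x4(&A[bi * N + 0],   &B[0 * N + bj],   zero, Dblk);
            /* ... the second accumulates onto the first result. */
            block_fma_4x4(&A[bi * N + BLK], &B[BLK * N + bj], Dblk, Dblk);
        }
}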
So 4 tensor cores share the same register file as 32 FP32 cores (or 32 INT cores, or 16 FP64 cores). An FP32 core can fetch two FP16 operands per clock for each of A, B and C in a conventional FMA (D = A * B + C; note that V100 does not have ordinary, general-purpose FP16 cores), and each tensor core takes the place of eight FP32 cores, so a tensor core has access to 48 FP16 operands, which is precisely the operand count the tensor core requires in FP16 mode.
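Spelling that out: 8 FP32-core slots x 3 operands (A, B, C) x 2 FP16 values per clock = 48 FP16 values, and a 4x4 tensor op in FP16 mode needs exactly 3 x (4x4) = 48 FP16 inputs.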
In mixed FP16/32 mode, D is computed as FP32 and would then be fed forward into successive tensor operations as C. In this scenario, register forwarding would take on the strain of providing the operand bandwidth for the 32-bit C, so the GPU wouldn't be starved of bandwidth trying to fetch 2x 16-bit for A and B plus 32-bit for C from the register file.
So in the computation of D-tl, the second tensor operation is A-tr * B-bl (A * B, both FP16 4x4 matrices) added to C, which is an FP32 4x4 matrix. That C was computed by the previous tensor operation, A-tl * B-tl, as an FP32 4x4 result. So forwarding C into the second tensor operation, instead of writing it to the register file, means there is no problem with the operand bandwidth that an FP32 C requires.
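The 4x4 hardware op and its forwarding path aren't exposed directly, but the same accumulation pattern shows up in CUDA 9's warp-level WMMA API (which works on 16x16x16 tiles on V100 rather than on the 4x4 hardware op). A rough sketch, with my own tile/pointer names, of two chained multiply-accumulates where the FP32 accumulator fragment stays resident between them:

#include <cuda_fp16.h>
#include <mma.h>
using namespace nvcuda;

// Two chained tensor-core multiply-accumulates: the FP32 accumulator
// fragment stays live between them, mirroring the
// D-tl = A-tl*B-tl + A-tr*B-bl chain above (tile size here is the
// 16x16x16 warp tile the WMMA API exposes, not the hardware 4x4 op).
// Atl, Atr, Btl, Bbl, Dtl are my own names for 16x16 row-major tiles.
__global__ void dtl_from_two_mmas(const half *Atl, const half *Atr,
                                  const half *Btl, const half *Bbl,
                                  float *Dtl)
{
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::row_major> b;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> acc;

    wmma::fill_fragment(acc, 0.0f);      // C starts at zero

    wmma::load_matrix_sync(a, Atl, 16);  // first product:
    wmma::load_matrix_sync(b, Btl, 16);  //   acc = Atl * Btl
    wmma::mma_sync(acc, a, b, acc);

    wmma::load_matrix_sync(a, Atr, 16);  // second product reuses the FP32
    wmma::load_matrix_sync(b, Bbl, 16);  // accumulator as its C input:
    wmma::mma_sync(acc, a, b, acc);      //   acc = Atr * Bbl + acc

    wmma::store_matrix_sync(Dtl, acc, 16, wmma::mem_row_major);
}

This needs sm_70 or later and one warp (32 threads) cooperating per tile; the point is simply that acc is reused as the C input of the second mma_sync rather than being stored and reloaded.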