3.2 SGEMM
SGEMM is the BLAS 3 subroutine that multiplies two single precision general dense matrices. It is the key building block for many dense linear algebra algorithms. The full routine supports many variations (e.g., scaling and transposition), but we consider only the simplest case: C = A x B, where A, B, and C are NxP, PxM, and NxM matrices, respectively.
The SGEMM routine can be defined by the following C code fragment:
Code:
for (i = 0; i < N; i++)
    for (j = 0; j < M; j++) {
        C[i][j] = 0.0;
        for (k = 0; k < P; k++)
            C[i][j] += A[i][k] * B[k][j];
    }
Modern algorithms for SGEMM are block-based, decomposing the problem into smaller SGEMM problems. For matrices of order N, these algorithms perform O(N²) read/write operations against O(N³) computation. By choosing blocks large enough to exploit this ratio while still fitting in cache, the read/write operations can be overlapped with computation, allowing SGEMM to execute near the peak floating point capacity of a chip.
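To make the blocked structure concrete, the following C sketch shows one common way to organize a cache-blocked SGEMM. The block size BS, the row-major layout, and the assumption that BS evenly divides N, M, and P are ours for illustration; this is not the code of any particular library.
Code:
/* Minimal cache-blocked SGEMM sketch (illustrative, not the paper's
 * code).  Assumes BS evenly divides N, M, and P, and that C has been
 * zeroed by the caller.  Each (ib, jb, kb) iteration multiplies one
 * BS x BS block of A by one BS x BS block of B, so the blocks stay in
 * cache while O(BS^3) work amortizes O(BS^2) data movement. */
#define BS 32

void sgemm_blocked(int N, int M, int P,
                   const float *A, const float *B, float *C)
{
    for (int ib = 0; ib < N; ib += BS)
        for (int jb = 0; jb < M; jb += BS)
            for (int kb = 0; kb < P; kb += BS)
                for (int i = ib; i < ib + BS; i++)
                    for (int j = jb; j < jb + BS; j++) {
                        float acc = C[i*M + j];
                        for (int k = kb; k < kb + BS; k++)
                            acc += A[i*P + k] * B[k*M + j];
                        C[i*M + j] = acc;
                    }
}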
The 80-core Terascale processor does not have a cache, but blocking could still be useful for overlapping communication with computation, as well as for register blocking. However, block-structured algorithms require triple loops over the blocks of the matrices, and since the Terascale processor does not support nested loops, we had to use a different, less efficient algorithm based on dot products. We mapped the cores onto a ring topology. The A and C matrices were decomposed into rows, with one row per core; B was decomposed into columns, with one column per core. The algorithm proceeded as follows:
Code:
On core number i
    Loop over j = 1, M
    {
        C[i][j] = dot_product(row A_i, column B_j)
        Circular shift column B_j to the neighboring core
    }
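As a concrete illustration, the following toy C program simulates the ring algorithm sequentially. The ring size, the test data, and the trick of rotating an ownership table instead of physically moving column data are our simplifications; on the chip, the column data itself is shifted to the neighboring core.
Code:
/* Toy sequential simulation of the ring dot-product SGEMM (our
 * illustration).  For simplicity the toy uses one B column per core
 * and square dimensions; the real run used 80 cores with M = 206. */
#include <stdio.h>

#define CORES 4           /* toy ring; the Terascale chip had 80 cores */
#define N CORES           /* one row of A and C per core               */
#define P CORES
#define M CORES           /* one column of B per core in this toy      */

int main(void)
{
    float A[N][P], B[P][M], C[N][M];

    /* Simple test data (hypothetical values). */
    for (int i = 0; i < N; i++)
        for (int k = 0; k < P; k++)
            A[i][k] = (float)(i + k + 1);
    for (int k = 0; k < P; k++)
        for (int j = 0; j < M; j++)
            B[k][j] = (float)(k - j);

    /* held[c] = index of the B column currently resident on core c. */
    int held[CORES];
    for (int c = 0; c < CORES; c++)
        held[c] = c;

    /* M steps: every core computes one dot product against its
     * resident column, then the columns circular-shift one core
     * around the ring (here we just rotate the ownership table). */
    for (int step = 0; step < M; step++) {
        for (int c = 0; c < CORES; c++) {
            int j = held[c];
            float acc = 0.0f;
            for (int k = 0; k < P; k++)
                acc += A[c][k] * B[k][j];  /* dot_product(row A_c, col B_j) */
            C[c][j] = acc;
        }
        int last = held[CORES - 1];        /* circular shift of columns */
        for (int c = CORES - 1; c > 0; c--)
            held[c] = held[c - 1];
        held[0] = last;
    }

    printf("C[0][0] = %g\n", C[0][0]);
    return 0;
}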
Optimization and Performance Analysis:
Additional optimizations used in the final algorithm:
- Used diagonal wrapped storage, where the array holding each row of C started with the diagonal element and wrapped around, thereby letting us use a single indexing scheme into C on every core (see the sketch after this list).
- Unrolled the dot product loop so that a single loop could run over a column of C (also sketched below).
- Organized the code so that elements of B were shifted immediately after they were used, thereby overlapping computation with communication.
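The first two optimizations can be sketched in C as follows. The helper names and the 2-way unroll factor are our illustration, not the paper's code.
Code:
/* Diagonal wrapped storage (our sketch): core i stores its row of C
 * starting at the diagonal element, so local slot t holds global
 * column (i + t) mod M.  Every core then walks slots 0..M-1 with the
 * same local indices, regardless of which diagonal it started on. */
static inline int wrapped_col(int core_i, int t, int M)
{
    return (core_i + t) % M;
}

/* 2-way unrolled dot product (illustrative unroll factor): two
 * independent accumulators let multiply-adds overlap in the FMAC
 * pipeline.  Assumes m is even for brevity. */
static float dot_unrolled(const float *a, const float *b, int m)
{
    float acc0 = 0.0f, acc1 = 0.0f;
    for (int k = 0; k < m; k += 2) {
        acc0 += a[k]     * b[k];
        acc1 += a[k + 1] * b[k + 1];
    }
    return acc0 + acc1;
}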
We selected the matrix dimensions N, M, and P to fill the available memory: N=P=80 and M=206. Each element of C required a dot product of length P, costing P multiplies, P-1 adds, and 2P loads. Each FPU can retire one FMAC per cycle, and the FPU is a 9-stage pipeline. Keeping one pipeline full for a dot product therefore required 2*9 = 18 registers for the operands plus 1 for the accumulator, or 19 registers per FMAC and 38 registers for both FMAC units. There were not enough registers to keep both FPUs running at 100% on a dot product. The fundamental performance bottleneck, however, was that only 2 operands can be loaded from data memory per cycle. Since each FMAC needs 2 operands from memory, the two units together need 4 loads per cycle, so the loads limited peak dot product performance to 50% of the total available floating point performance.
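The register and bandwidth arithmetic can be restated as a back-of-the-envelope calculation; the constant names below are ours, and the figures simply restate those given above.
Code:
/* Back-of-the-envelope restatement of the register/bandwidth figures. */
#include <stdio.h>

int main(void)
{
    const int stages        = 9;  /* FPU pipeline depth                 */
    const int ops_per_fmac  = 2;  /* operands per multiply-accumulate   */
    const int fmacs         = 2;  /* FMAC units per core                */
    const int loads_per_cyc = 2;  /* operands loadable per cycle        */

    /* One operand pair per in-flight pipeline stage, plus the
     * accumulator, for each FMAC unit. */
    int regs_one_fmac = ops_per_fmac * stages + 1;           /* 19 */
    int regs_both     = fmacs * regs_one_fmac;               /* 38 */

    /* Both FMACs want 4 operands per cycle but only 2 can be loaded,
     * capping dot products at 50% of peak floating point performance. */
    double bound = (double)loads_per_cyc / (fmacs * ops_per_fmac);

    printf("registers needed: %d, load-bandwidth bound: %.0f%%\n",
           regs_both, 100.0 * bound);
    return 0;
}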