
1 Learning Outcomes

Before continuing, we recommend reviewing the DGEMM benchmark and the performance optimizations we tried on a SISD architecture.


Figure 4: DGEMM for 8 × 8 square matrices A and B. C_00 is computed as the dot product of row i = 0 of A and column j = 0 of B.

2 DGEMM 7: Naive SIMD DGEMM

With the ability to perform SIMD multiplication, it is tempting to use our new 256-bit-wide registers to load in blocks of row A and blocks of column B for the element-wise multiplication necessary for dot products, as shown in Figure 1.


Figure 1: “Naive” SIMD DGEMM that leverages SIMD architecture registers to parallelize multiplications within a single dot product. The outlined boxes indicate which values are loaded into the 256-bit-wide registers.[1]

This naive SIMD implementation does better than our naive DGEMM but worse than specifying narrow registers with the register keyword in C:

C               0.768672 seconds
registers:      0.277462 seconds
simd,naive:     0.416584 seconds

The non-compulsory cache misses persist in our naive SIMD implementation: gathering a column of B still pulls in a cache block from each of several rows of B, rather than streaming through B contiguously.
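To make that access pattern concrete, here is a minimal sketch of the Figure 1 idea. It uses GCC/Clang vector extensions as a stand-in for the 256-bit AVX registers; the type and function names are illustrative, not taken from the benchmark code:

```c
#include <string.h>

/* A 256-bit "register": four doubles, using GCC/Clang vector extensions
   as a portable stand-in for AVX intrinsics. */
typedef double v4df __attribute__((vector_size(32)));

/* Dot product of row i of A with column j of B (n a multiple of 4).
   The row of A is contiguous, but the column of B is strided by n:
   gathering it touches a different cache block per element, which is
   the source of the non-compulsory misses in the naive version. */
double dot_row_col(const double *A, const double *B, int n, int i, int j) {
    v4df acc = {0.0, 0.0, 0.0, 0.0};
    for (int k = 0; k < n; k += 4) {
        v4df va, vb;
        memcpy(&va, &A[i * n + k], sizeof va);            /* contiguous load */
        vb = (v4df){ B[(k + 0) * n + j], B[(k + 1) * n + j],
                     B[(k + 2) * n + j], B[(k + 3) * n + j] }; /* strided gather */
        acc += va * vb;             /* four multiplies and adds at once */
    }
    return acc[0] + acc[1] + acc[2] + acc[3];             /* horizontal sum */
}
```

The four-wide multiply parallelizes the arithmetic, but the strided gather of B is exactly the memory behavior described above.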

3 DGEMM 8: Transpose SIMD DGEMM

We next apply cache blocking by first transposing B, then leveraging our 256-bit-wide (four-double) registers for element-wise multiplication:


Figure 2: “Transposed” SIMD DGEMM that uses a transposed B to load in columns of B to streamline memory accesses. The outlined boxes indicate which values are loaded into the 256-bit-wide registers.

C               0.768672 seconds
registers:      0.277462 seconds
simd,naive:     0.416584 seconds
simd,transpose: 0.275622 seconds

This version is certainly speedier, but it does not provide much benefit beyond our register keyword approach.
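A sketch of the transposed variant, under the same illustrative vector-extension stand-in: transpose B once up front, and both operands of every dot product become unit-stride loads (the names here are hypothetical):

```c
#include <stdlib.h>
#include <string.h>

typedef double v4df __attribute__((vector_size(32)));

/* Transpose B once (O(n^2)) so that column j of B becomes row j of BT,
   making both operands of the dot product contiguous in memory. */
double *transpose(const double *B, int n) {
    double *BT = malloc((size_t)n * n * sizeof *BT);
    for (int r = 0; r < n; r++)
        for (int c = 0; c < n; c++)
            BT[c * n + r] = B[r * n + c];
    return BT;
}

/* Same dot product as the naive version, but both loads are unit-stride. */
double dot_transposed(const double *A, const double *BT, int n, int i, int j) {
    v4df acc = {0.0, 0.0, 0.0, 0.0};
    for (int k = 0; k < n; k += 4) {
        v4df va, vb;
        memcpy(&va, &A[i * n + k], sizeof va);   /* row i of A */
        memcpy(&vb, &BT[j * n + k], sizeof vb);  /* row j of BT = column j of B */
        acc += va * vb;
    }
    return acc[0] + acc[1] + acc[2] + acc[3];
}
```

The one-time O(n²) transpose is amortized over the O(n³) multiply, which is why it pays off.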

4 DGEMM 9: Tiled SIMD DGEMM

Recall that cache blocking is any redesign of our algorithm that adjusts its memory-access pattern. Earlier, we discussed a submatrix tiling approach to matrix multiplication, where we compute multiple elements of C with the current set of rows of A and columns of B.

In Figure 3, we assume that the product of a scalar with a vector can be computed as a vector operation, provided that the scalar is copied to each element in a vector of the same length.


Figure 3: Compute four elements of C (starting with C_ij) by iteratively adding the result of scaling four elements of the k-th row of B (starting with B_kj) by A_ik.
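The single update step in Figure 3 can be sketched as follows; the helper name is made up, and the vector type again stands in for a 256-bit register:

```c
#include <string.h>

typedef double v4df __attribute__((vector_size(32)));

/* One Figure-3 update step: C[i][j..j+3] += A[i][k] * B[k][j..j+3].
   The scalar A_ik is copied ("broadcast") into all four lanes, so the
   scalar-times-vector product becomes an ordinary element-wise multiply. */
void axpy4(double *C_row_j, double a_ik, const double *B_row_kj) {
    v4df vc, vb;
    v4df va = {a_ik, a_ik, a_ik, a_ik};    /* broadcast the scalar */
    memcpy(&vc, C_row_j, sizeof vc);
    memcpy(&vb, B_row_kj, sizeof vb);
    vc += va * vb;                         /* multiply-accumulate shape */
    memcpy(C_row_j, &vc, sizeof vc);
}
```

Iterating this step over k = 0 … n−1 accumulates four finished elements of C at once.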

We can extend this idea to a tiled SIMD approach shown in Figure 4.

Figure 4: SIMD DGEMM tiled matrix multiplication. The outlined boxes indicate which values are loaded into the 256-bit-wide registers.

The “tiled” cache blocking SIMD approach performs slightly worse than the transposed SIMD approach.

C               0.768672 seconds
registers:      0.277462 seconds
simd,naive:     0.416584 seconds
simd,transpose: 0.275622 seconds
simd,tiled:     0.356344 seconds
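One plausible shape for the tiled variant is to keep accumulators for several rows of C live at once, so each 4-wide load of B is reused across the tile. This 2×4-tile sketch (hypothetical names, vector-extension stand-ins, not the benchmark's actual tiling) illustrates the idea:

```c
#include <string.h>

typedef double v4df __attribute__((vector_size(32)));

/* Compute a 2x4 tile of C at once (rows i and i+1, columns j..j+3).
   Each 4-wide load of B[k][j..j+3] is reused for both rows of the tile,
   halving the B traffic relative to the one-row-at-a-time version. */
void dgemm_tile_2x4(const double *A, const double *B, double *C,
                    int n, int i, int j) {
    v4df c0, c1, vb;
    memcpy(&c0, &C[(i + 0) * n + j], sizeof c0);
    memcpy(&c1, &C[(i + 1) * n + j], sizeof c1);
    for (int k = 0; k < n; k++) {
        memcpy(&vb, &B[k * n + j], sizeof vb);   /* loaded once...   */
        double a0 = A[(i + 0) * n + k];
        double a1 = A[(i + 1) * n + k];
        c0 += (v4df){a0, a0, a0, a0} * vb;       /* ...used twice    */
        c1 += (v4df){a1, a1, a1, a1} * vb;
    }
    memcpy(&C[(i + 0) * n + j], &c0, sizeof c0);
    memcpy(&C[(i + 1) * n + j], &c1, sizeof c1);
}
```

Larger tiles reuse each load more, but only while all the accumulators still fit in registers, which is one reason tiling does not automatically beat the simpler transposed version.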

5 DGEMM 10: GCC Optimization

Finally, let’s compile using different gcc optimization flags. In Table 1, we can see that even with mild gcc optimizations like -O1, SIMD vastly outperforms any SISD approach.

Table 1: DGEMM SIMD runtime (in seconds) with different optimization flags.

Program                  -O0        -O1        -O2        -O3
DGEMM (original)         0.768672   0.197168   0.197538   0.193889
DGEMM with registers     0.277462   0.191474   0.191921   0.126439
DGEMM, naive SIMD        0.416584   0.048405   0.049970   0.049045
DGEMM, transpose SIMD    0.275622   0.031824   0.032188   0.031978
DGEMM, tiled SIMD        0.356344   0.033386   0.035177   0.037030
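To reproduce this kind of comparison, the same source is simply compiled once per flag. This shell sketch uses a tiny stand-in kernel (dgemm_demo.c is an illustrative name, not the benchmark file):

```shell
# Build the same source at each optimization level; the answer must not
# change across levels -- only the runtime does.
cat > dgemm_demo.c <<'EOF'
#include <stdio.h>
int main(void) {
    double a[4] = {1, 2, 3, 4}, b[4] = {1, 1, 1, 1}, dot = 0.0;
    for (int k = 0; k < 4; k++) dot += a[k] * b[k];
    printf("%g\n", dot);   /* prints 10 */
    return 0;
}
EOF
for opt in O0 O1 O2 O3; do
    gcc -"$opt" -o dgemm_"$opt" dgemm_demo.c
    ./dgemm_"$opt"
done
```

Timing each binary (e.g. with `time`) over the real DGEMM workload yields the columns of Table 1.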

Generally speaking, most of the speedup comes not from performing multiple math operations at a time, but from performing large memory loads and stores at a time.

6 DGEMM SIMD Code

The three algorithms discussed in this section leverage Intel SIMD extensions. You will see in the code below that the SIMD instructions are written in C as Intel intrinsics. More on intrinsics next!

DGEMM, SIMD naive
DGEMM, SIMD Transpose
DGEMM, SIMD Tiled
DGEMM (original)
AVX intrinsics
matrix_d_t *dgemm_simd(matrix_d_t *A, matrix_d_t *B) {
  if (A->ncols != B->nrows) return NULL;
  matrix_d_t *C = init_mat_d(A->nrows, B->ncols);
  for (int i = 0; i < A->nrows; i++) {
    for (int j = 0; j < B->ncols; j += 4) { // 4 doubles at a time
      avx256_t v_C = avx_load(C->data + i*C->ncols + j);
      for (int k = 0; k < A->ncols; k++) {
        // broadcast the scalar A_ik to all four lanes
        avx256_t s_A = avx_set_num(A->data[i*A->ncols + k]);
        // four consecutive doubles of row k of B
        avx256_t v_B = avx_load(B->data + k*B->ncols + j);
        // C_ij += A_ik * B_kj (for columns j..j+3)
        v_C = avx_mul_add(s_A, v_B, v_C);
      }
      avx_store(C->data + i*C->ncols + j, v_C);
    }
  }
  return C;
}
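The avx_* helpers above are wrappers defined in the AVX intrinsics tab. As an assumption about their shape, here is one portable way to stub them (together with a minimal matrix_d_t) so the listing compiles even without AVX hardware; the real versions presumably wrap intrinsics such as _mm256_loadu_pd and _mm256_fmadd_pd:

```c
#include <stdlib.h>
#include <string.h>

/* Scalar stand-ins for the avx_* wrappers used in the listing above.
   These are assumptions about the wrappers' interfaces, not the course's
   actual definitions, which wrap 256-bit Intel intrinsics. */
typedef struct { double d[4]; } avx256_t;

typedef struct { int nrows, ncols; double *data; } matrix_d_t;

matrix_d_t *init_mat_d(int nrows, int ncols) {   /* zero-initialized matrix */
    matrix_d_t *m = malloc(sizeof *m);
    m->nrows = nrows;
    m->ncols = ncols;
    m->data = calloc((size_t)nrows * ncols, sizeof *m->data);
    return m;
}

avx256_t avx_load(const double *p) {             /* ~ _mm256_loadu_pd */
    avx256_t v;
    memcpy(v.d, p, sizeof v.d);
    return v;
}

avx256_t avx_set_num(double x) {                 /* ~ _mm256_set1_pd */
    return (avx256_t){{x, x, x, x}};
}

avx256_t avx_mul_add(avx256_t a, avx256_t b, avx256_t c) { /* ~ _mm256_fmadd_pd */
    for (int l = 0; l < 4; l++) c.d[l] += a.d[l] * b.d[l];
    return c;
}

void avx_store(double *p, avx256_t v) {          /* ~ _mm256_storeu_pd */
    memcpy(p, v.d, sizeof v.d);
}
```

With these stubs in place, dgemm_simd above runs (slowly) on any machine, which is handy for checking correctness before benchmarking the real AVX build.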
Footnotes
  1. In Figure 1, we assume a cache that has 256-bit blocks. This carries over the cache assumption from our cache blocking discussion.