Permuted Layout: Batched Matrix-Multiplication

This section contains an exercise file RAJA/exercises/permuted-layout-batch-matrix-multiply.cpp for you to work through if you wish to get some practice with RAJA. The file RAJA/exercises/permuted-layout-batch-matrix-multiply_solution.cpp contains complete working code for the examples discussed in this section. You can use the solution file to check your work and for guidance if you get stuck. To build the exercises, execute make permuted-layout-batch-matrix-multiply and make permuted-layout-batch-matrix-multiply_solution from the build directory.

Key RAJA features shown in the following example:

  • RAJA::forall loop traversal template

  • RAJA execution policies

  • RAJA::View multi-dimensional data access

  • RAJA::make_permuted_layout method to permute data ordering

This example performs a “batched” matrix multiplication operation for a collection of \(3 \times 3\) matrices. Each pair of matrices \(A^{e}\) and \(B^{e}\), indexed by ‘e’, is multiplied and the product is stored in a matrix \(C^{e}\). The entries of all \(A^{e}\) matrices are stored in a single array \(A\); similarly, all \(B^{e}\) matrices are stored in an array \(B\), and all \(C^{e}\) matrices are stored in an array \(C\). In the following discussion, the notation \(A^{e}_{rc}\) indicates the row r and column c entry of the \(3 \times 3\) matrix \(A^{e}\).
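
That is, each entry of each product matrix is computed as:

\[C^{e}_{rc} = \sum_{k=0}^{2} A^{e}_{rk} B^{e}_{kc}, \qquad r, c = 0, 1, 2\]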

In the exercise, we use two different data layouts for the arrays \(A\), \(B\), and \(C\) to represent different storage patterns for the \(3 \times 3\) matrices. Below, we describe these layouts for two \(3 \times 3\) matrices. The extension to more than two matrices is straightforward as you will see in the exercise code. In the exercise code, we time the execution of the batched matrix multiplication operation to compare the performance for each layout and execution policy. These comparisons are not completely conclusive as to which layout is best since there may be additional performance to be gained by more specific tuning of the memory layouts for an architecture and execution back-end. A complete, detailed analysis of the performance implications of memory layout and access patterns is beyond the scope of the exercise.

In layout 1, the entries for each \(3 \times 3\) matrix are contiguous in memory following row-major ordering; i.e., the column index varies fastest, then the row index, then the matrix index:

\[\begin{split}A = [A^{0}_{00}, A^{0}_{01}, A^{0}_{02}, A^{0}_{10}, A^{0}_{11}, A^{0}_{12}, A^{0}_{20}, A^{0}_{21}, A^{0}_{22},\\ A^{1}_{00}, A^{1}_{01}, A^{1}_{02}, A^{1}_{10}, A^{1}_{11}, A^{1}_{12}, A^{1}_{20}, A^{1}_{21}, A^{1}_{22}]\end{split}\]

In layout 2, the matrix entries are interleaved across matrices; i.e., the matrix index varies fastest, then the column index, and finally the row index:

\[\begin{split}A = [A^{0}_{00}, A^{1}_{00}, A^{0}_{01}, A^{1}_{01}, A^{0}_{02}, A^{1}_{02}, A^{0}_{10}, A^{1}_{10}, A^{0}_{11},\\ A^{1}_{11}, A^{0}_{12}, A^{1}_{12}, A^{0}_{20}, A^{1}_{20}, A^{0}_{21}, A^{1}_{21}, A^{0}_{22}, A^{1}_{22}]\end{split}\]

Permuted Layouts

Next, we show how to construct the two data layouts described above using RAJA::View and RAJA::Layout objects. For more information on these RAJA concepts, please see View and Layout.

The views to access data for layout 1 are constructed as follows:

  std::array<RAJA::idx_t, 3> perm1 {{0, 1, 2}};
  auto layout1 =
      RAJA::make_permuted_layout( {{N, N_r, N_c}}, perm1 );

  RAJA::View<double, RAJA::Layout<3, int, 2>> Aview(A, layout1);
  RAJA::View<double, RAJA::Layout<3, int, 2>> Bview(B, layout1);
  RAJA::View<double, RAJA::Layout<3, int, 2>> Cview(C, layout1);

The first argument to RAJA::make_permuted_layout is an array whose entries correspond to the extent of each layout dimension. Here, we have \(N\) matrices, each with \(N_r\) rows and \(N_c\) columns. The second argument, the layout permutation, describes the striding order of the array indices. Note that since this case follows the default RAJA ordering convention (see View and Layout), we use the identity permutation ‘(0,1,2)’. For each matrix, the column index (index 2) has unit stride and the row index (index 1) has stride \(N_c\), the number of columns in each matrix. The matrix index (index 0) has stride \(N_r \times N_c\), the number of entries in each matrix.
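
To make the striding concrete, the following sketch (the helper function is hypothetical and not part of the exercise code) shows the flat array offset that layout 1 computes for an access Aview(e, r, c):

  // Hypothetical illustration of layout 1 indexing: Aview(e, r, c)
  // addresses A[ flat_index_layout1(e, r, c, N_r, N_c) ].
  int flat_index_layout1(int e, int r, int c, int N_r, int N_c)
  {
    return e * (N_r * N_c)   // matrix index: stride N_r * N_c
         + r * N_c           // row index:    stride N_c
         + c;                // column index: unit stride
  }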

The views for layout 2 are constructed similarly, with a different index striding order:

  std::array<RAJA::idx_t, 3> perm2 {{1, 2, 0}};
  auto layout2 =
      RAJA::make_permuted_layout( {{N, N_r, N_c}}, perm2 );

  RAJA::View<double, RAJA::Layout<3, int, 0>> Aview2(A2, layout2);
  RAJA::View<double, RAJA::Layout<3, int, 0>> Bview2(B2, layout2);
  RAJA::View<double, RAJA::Layout<3, int, 0>> Cview2(C2, layout2);

Here, the first argument to RAJA::make_permuted_layout is the same as in layout 1 since we have the same number of matrices with the same matrix dimensions, and we will use the same indexing scheme to access the matrix entries. However, the permutation we use is ‘(1,2,0)’. This makes the matrix index (index 0) have unit stride, the column index (index 2) have stride \(N\), the number of matrices, and the row index (index 1) have stride \(N \times N_c\).
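
A corresponding sketch (again, the helper function is hypothetical) of the flat array offset that layout 2 computes for an access Aview2(e, r, c) is:

  // Hypothetical illustration of layout 2 indexing: Aview2(e, r, c)
  // addresses A2[ flat_index_layout2(e, r, c, N, N_c) ].
  int flat_index_layout2(int e, int r, int c, int N, int N_c)
  {
    return r * (N * N_c)   // row index:    stride N * N_c
         + c * N           // column index: stride N
         + e;              // matrix index: unit stride
  }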

RAJA Kernel Variants

The exercise files contain RAJA variants that run the batched matrix multiplication kernel with different execution back-ends. As mentioned earlier, we print out execution timings for each so you can compare the run times of the different layouts described above. For example, the sequential CPU variant using layout 1 is:

    RAJA::forall<RAJA::seq_exec>(RAJA::TypedRangeSegment<int>(0, N),
      [=](int e) {

        Cview(e, 0, 0) = Aview(e, 0, 0) * Bview(e, 0, 0)
                         + Aview(e, 0, 1) * Bview(e, 1, 0)
                         + Aview(e, 0, 2) * Bview(e, 2, 0);
        Cview(e, 0, 1) = Aview(e, 0, 0) * Bview(e, 0, 1)
                         + Aview(e, 0, 1) * Bview(e, 1, 1)
                         + Aview(e, 0, 2) * Bview(e, 2, 1);
        Cview(e, 0, 2) = Aview(e, 0, 0) * Bview(e, 0, 2)
                         + Aview(e, 0, 1) * Bview(e, 1, 2)
                         + Aview(e, 0, 2) * Bview(e, 2, 2);

        Cview(e, 1, 0) = Aview(e, 1, 0) * Bview(e, 0, 0)
                         + Aview(e, 1, 1) * Bview(e, 1, 0)
                         + Aview(e, 1, 2) * Bview(e, 2, 0);
        Cview(e, 1, 1) = Aview(e, 1, 0) * Bview(e, 0, 1)
                         + Aview(e, 1, 1) * Bview(e, 1, 1)
                         + Aview(e, 1, 2) * Bview(e, 2, 1);
        Cview(e, 1, 2) = Aview(e, 1, 0) * Bview(e, 0, 2)
                         + Aview(e, 1, 1) * Bview(e, 1, 2)
                         + Aview(e, 1, 2) * Bview(e, 2, 2);

        Cview(e, 2, 0) = Aview(e, 2, 0) * Bview(e, 0, 0)
                         + Aview(e, 2, 1) * Bview(e, 1, 0)
                         + Aview(e, 2, 2) * Bview(e, 2, 0);
        Cview(e, 2, 1) = Aview(e, 2, 0) * Bview(e, 0, 1)
                         + Aview(e, 2, 1) * Bview(e, 1, 1)
                         + Aview(e, 2, 2) * Bview(e, 2, 1);
        Cview(e, 2, 2) = Aview(e, 2, 0) * Bview(e, 0, 2)
                         + Aview(e, 2, 1) * Bview(e, 1, 2)
                         + Aview(e, 2, 2) * Bview(e, 2, 2);
      }
    );

The sequential CPU variant using layout 2 is:

    RAJA::forall<RAJA::seq_exec>(RAJA::TypedRangeSegment<int>(0, N), 
      [=](int e) {

        Cview2(e, 0, 0) = Aview2(e, 0, 0) * Bview2(e, 0, 0)
                          + Aview2(e, 0, 1) * Bview2(e, 1, 0)
                          + Aview2(e, 0, 2) * Bview2(e, 2, 0);
        Cview2(e, 0, 1) = Aview2(e, 0, 0) * Bview2(e, 0, 1)
                          + Aview2(e, 0, 1) * Bview2(e, 1, 1)
                          + Aview2(e, 0, 2) * Bview2(e, 2, 1);
        Cview2(e, 0, 2) = Aview2(e, 0, 0) * Bview2(e, 0, 2)
                          + Aview2(e, 0, 1) * Bview2(e, 1, 2)
                          + Aview2(e, 0, 2) * Bview2(e, 2, 2);

        Cview2(e, 1, 0) = Aview2(e, 1, 0) * Bview2(e, 0, 0)
                          + Aview2(e, 1, 1) * Bview2(e, 1, 0)
                          + Aview2(e, 1, 2) * Bview2(e, 2, 0);
        Cview2(e, 1, 1) = Aview2(e, 1, 0) * Bview2(e, 0, 1)
                          + Aview2(e, 1, 1) * Bview2(e, 1, 1)
                          + Aview2(e, 1, 2) * Bview2(e, 2, 1);
        Cview2(e, 1, 2) = Aview2(e, 1, 0) * Bview2(e, 0, 2)
                          + Aview2(e, 1, 1) * Bview2(e, 1, 2)
                          + Aview2(e, 1, 2) * Bview2(e, 2, 2);

        Cview2(e, 2, 0) = Aview2(e, 2, 0) * Bview2(e, 0, 0)
                          + Aview2(e, 2, 1) * Bview2(e, 1, 0)
                          + Aview2(e, 2, 2) * Bview2(e, 2, 0);
        Cview2(e, 2, 1) = Aview2(e, 2, 0) * Bview2(e, 0, 1)
                          + Aview2(e, 2, 1) * Bview2(e, 1, 1)
                          + Aview2(e, 2, 2) * Bview2(e, 2, 1);
        Cview2(e, 2, 2) = Aview2(e, 2, 0) * Bview2(e, 0, 2)
                          + Aview2(e, 2, 1) * Bview2(e, 1, 2)
                          + Aview2(e, 2, 2) * Bview2(e, 2, 2);

      }
    );

The only difference between these two variants is the names of the views that appear in the lambda expression loop body, since a different layout is used to create the view objects in each case. To make the algorithm code identical for all cases, we could define type aliases for the view and layout types in a header file, similar to how we may abstract the execution policy out of the algorithm, and compile the code for the case we want to run, as sketched below.
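
One possible way to do this (a sketch only; the macro name and aliases below are hypothetical and do not appear in the exercise files) is:

  // Select the layout/view types at compile time so the kernel body can be
  // written once in terms of a single view type.
  #if defined(USE_LAYOUT_2)
  using matrix_layout_t = RAJA::Layout<3, int, 0>;  // matrix index has unit stride
  #else
  using matrix_layout_t = RAJA::Layout<3, int, 2>;  // column index has unit stride
  #endif

  using matrix_view_t = RAJA::View<double, matrix_layout_t>;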

For comparison, here is an OpenMP CPU variant using layout 1:

    RAJA::forall<RAJA::omp_parallel_for_exec>(RAJA::TypedRangeSegment<int>(0, N), 
      [=](int e) {

        Cview(e, 0, 0) = Aview(e, 0, 0) * Bview(e, 0, 0)
                         + Aview(e, 0, 1) * Bview(e, 1, 0)
                         + Aview(e, 0, 2) * Bview(e, 2, 0);
        Cview(e, 0, 1) = Aview(e, 0, 0) * Bview(e, 0, 1)
                         + Aview(e, 0, 1) * Bview(e, 1, 1)
                         + Aview(e, 0, 2) * Bview(e, 2, 1);
        Cview(e, 0, 2) = Aview(e, 0, 0) * Bview(e, 0, 2)
                         + Aview(e, 0, 1) * Bview(e, 1, 2)
                         + Aview(e, 0, 2) * Bview(e, 2, 2);

        Cview(e, 1, 0) = Aview(e, 1, 0) * Bview(e, 0, 0)
                         + Aview(e, 1, 1) * Bview(e, 1, 0)
                         + Aview(e, 1, 2) * Bview(e, 2, 0);
        Cview(e, 1, 1) = Aview(e, 1, 0) * Bview(e, 0, 1)
                         + Aview(e, 1, 1) * Bview(e, 1, 1)
                         + Aview(e, 1, 2) * Bview(e, 2, 1);
        Cview(e, 1, 2) = Aview(e, 1, 0) * Bview(e, 0, 2)
                         + Aview(e, 1, 1) * Bview(e, 1, 2)
                         + Aview(e, 1, 2) * Bview(e, 2, 2);

        Cview(e, 2, 0) = Aview(e, 2, 0) * Bview(e, 0, 0)
                         + Aview(e, 2, 1) * Bview(e, 1, 0)
                         + Aview(e, 2, 2) * Bview(e, 2, 0);
        Cview(e, 2, 1) = Aview(e, 2, 0) * Bview(e, 0, 1)
                         + Aview(e, 2, 1) * Bview(e, 1, 1)
                         + Aview(e, 2, 2) * Bview(e, 2, 1);
        Cview(e, 2, 2) = Aview(e, 2, 0) * Bview(e, 0, 2)
                         + Aview(e, 2, 1) * Bview(e, 1, 2)
                         + Aview(e, 2, 2) * Bview(e, 2, 2);

      }
    );

The only difference between this variant and the sequential CPU variant shown above is the execution policy. The lambda expression loop body is identical to the sequential CPU variant.

The exercise files also contain variants for the RAJA CUDA and HIP back-ends. They follow the same pattern as the variants just described, differing mainly in the execution policy and the device annotation required on the lambda expression.
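
For reference, a sketch of a CUDA variant using layout 1 might look like the following (abbreviated; CUDA_BLOCK_SIZE is a block-size constant assumed to be defined elsewhere, and the data arrays must be accessible from the device):

    RAJA::forall<RAJA::cuda_exec<CUDA_BLOCK_SIZE>>(
      RAJA::TypedRangeSegment<int>(0, N),
      [=] RAJA_DEVICE (int e) {

        Cview(e, 0, 0) = Aview(e, 0, 0) * Bview(e, 0, 0)
                         + Aview(e, 0, 1) * Bview(e, 1, 0)
                         + Aview(e, 0, 2) * Bview(e, 2, 0);

        // ... the remaining eight entries of Cview(e, :, :) are computed
        // exactly as in the sequential variant shown earlier ...
      }
    );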