Batched Matrix-Multiply (Permuted Layouts)

Key RAJA features shown in the following example:

  • RAJA::forall loop traversal template
  • RAJA execution policies
  • RAJA::View multi-dimensional data access
  • RAJA::make_permuted_layout method to permute data ordering

This example performs batched matrix multiplication for a set of \(3 \times 3\) matrices using two different data layouts.

Matrices \(A\) and \(B\) are multiplied with the product stored in matrix \(C\). The notation \(A^{e}_{rc}\) indicates the row r and column c entry of matrix e. We describe the two data layouts we use for two matrices. The extension to more than two matrices is straightforward. Using different data layouts, we can assess which performs best for a given execution policy and computing environment.

Layout 1: Entries in each matrix are grouped together with each each having row major ordering; i.e.,

\[\begin{split}A = [A^{0}_{00}, A^{0}_{01}, A^{0}_{02}, A^{0}_{10}, A^{0}_{11}, A^{0}_{12}, A^{0}_{20}, A^{0}_{21}, A^{0}_{22},\\ A^{1}_{00}, A^{1}_{01}, A^{1}_{02}, A^{1}_{10}, A^{1}_{11}, A^{1}_{12}, A^{1}_{20}, A^{1}_{21}, A^{1}_{22}];\end{split}\]

Layout 2: Matrix entries are first ordered by matrix index, then by column index, and finally by row index; i.e.,

\[\begin{split}A = [A^{0}_{00}, A^{1}_{00}, A^{0}_{01}, A^{1}_{01}, A^{0}_{02}, A^{1}_{02}, A^{0}_{10}, A^{1}_{10}, A^{0}_{11},\\ A^{1}_{11}, A^{0}_{12}, A^{1}_{12}, A^{0}_{20}, A^{1}_{20}, A^{0}_{21}, A^{1}_{21}, A^{0}_{22}, A^{1}_{22}];\end{split}\]

Permuted Layouts

Next, we show how to construct the two data layouts using RAJA::View and RAJA::Layout objects. For more details on these RAJA concepts, please refer to View and Layout.

The views for layout 1 are constructed as follows:

  std::array<RAJA::idx_t, 3> perm1 {{0, 1, 2}};
  auto layout1 =
      RAJA::make_permuted_layout( {{N, N_r, N_c}}, perm1 );

  RAJA::View<double, RAJA::Layout<3, Index_type, 2>> Aview(A, layout1);
  RAJA::View<double, RAJA::Layout<3, Index_type, 2>> Bview(B, layout1);
  RAJA::View<double, RAJA::Layout<3, Index_type, 2>> Cview(C, layout1);

The first argument to RAJA::make_permuted_layout is a C++ array whose entries correspond to the size of each array dimension; i.e., we have ‘N’ \(N_r \times N_c\) matrices. The second argument describes the striding order of the array dimensions. Note that since this case follows the default RAJA ordering convention (see View and Layout), we use the identity permutation ‘(0,1,2)’. For each matrix, the column index (index 2) has unit stride and the row index (index 1) has stride 3 (number of columns). The matrix index (index 0) has stride 9 (\(N_c \times N_r\)).

The views for layout 2 are constructed similarly:

  std::array<RAJA::idx_t, 3> perm2 {{1, 2, 0}};
  auto layout2 =
      RAJA::make_permuted_layout( {{N, N_r, N_c}}, perm2 );

  RAJA::View<double, RAJA::Layout<3, Index_type, 0>> Aview2(A2, layout2);
  RAJA::View<double, RAJA::Layout<3, Index_type, 0>> Bview2(B2, layout2);
  RAJA::View<double, RAJA::Layout<3, Index_type, 0>> Cview2(C2, layout2);

Here, the first argument to RAJA::make_permuted_layout is the same as in layout 1 since we have the same number of matrices, matrix dimensions and we will use the same indexing scheme to access the matrix entries. However, the permutation we use is ‘(1,2,0)’. This makes the matrix index (index 0) have unit stride, the column index (index 2) for each matrix has stride N, which is the number of matrices, and the row index (index 1) has stride \(N \times N_c\).

Example Code

Complete working examples that run the batched matrix-multiplication computation for both layouts and various RAJA execution policies is located in the file RAJA/examples/tut_batched-matrix-multiply.cpp.

It compares the execution run times of the two layouts described above using four RAJA back-ends (Sequential, OpenMP, CUDA, and HIP). The OpenMP version for layout 1 looks like this:

    RAJA::forall<RAJA::omp_parallel_for_exec>(
        RAJA::RangeSegment(0, N), [=](Index_type e) {

          Cview(e, 0, 0) = Aview(e, 0, 0) * Bview(e, 0, 0)
                           + Aview(e, 0, 1) * Bview(e, 1, 0)
                           + Aview(e, 0, 2) * Bview(e, 2, 0);
          Cview(e, 0, 1) = Aview(e, 0, 0) * Bview(e, 0, 1)
                           + Aview(e, 0, 1) * Bview(e, 1, 1)
                           + Aview(e, 0, 2) * Bview(e, 2, 1);
          Cview(e, 0, 2) = Aview(e, 0, 0) * Bview(e, 0, 2)
                           + Aview(e, 0, 1) * Bview(e, 1, 2)
                           + Aview(e, 0, 2) * Bview(e, 2, 2);

          Cview(e, 1, 0) = Aview(e, 1, 0) * Bview(e, 0, 0)
                           + Aview(e, 1, 1) * Bview(e, 1, 0)
                           + Aview(e, 1, 2) * Bview(e, 2, 0);
          Cview(e, 1, 1) = Aview(e, 1, 0) * Bview(e, 0, 1)
                           + Aview(e, 1, 1) * Bview(e, 1, 1)
                           + Aview(e, 1, 2) * Bview(e, 2, 1);
          Cview(e, 1, 2) = Aview(e, 1, 0) * Bview(e, 0, 2)
                           + Aview(e, 1, 1) * Bview(e, 1, 2)
                           + Aview(e, 1, 2) * Bview(e, 2, 2);

          Cview(e, 2, 0) = Aview(e, 2, 0) * Bview(e, 0, 0)
                           + Aview(e, 2, 1) * Bview(e, 1, 0)
                           + Aview(e, 2, 2) * Bview(e, 2, 0);
          Cview(e, 2, 1) = Aview(e, 2, 0) * Bview(e, 0, 1)
                           + Aview(e, 2, 1) * Bview(e, 1, 1)
                           + Aview(e, 2, 2) * Bview(e, 2, 1);
          Cview(e, 2, 2) = Aview(e, 2, 0) * Bview(e, 0, 2)
                           + Aview(e, 2, 1) * Bview(e, 1, 2)
                           + Aview(e, 2, 2) * Bview(e, 2, 2);

        });

The only differences between the lambda loop body for layout 1 and layout 2 cases are the names of the views. To make the algorithm code identical for all cases, we would use type aliases for the view and layout types in a header file similarly to how we would abstract the execution policy out of the algorithm.