Batched Matrix-Multiply (Permuted Layouts)¶
Key RAJA features shown in the following example:
RAJA::forall
loop traversal template- RAJA execution policies
RAJA::View
multi-dimensional data accessRAJA::make_permuted_layout
method to permute data ordering
This example performs batched matrix multiplication for a set of \(3 \times 3\) matrices using two different data layouts.
Matrices \(A\) and \(B\) are multiplied with the product stored in matrix \(C\). The notation \(A^{e}_{rc}\) indicates the row r and column c entry of matrix e. We describe the two data layouts we use for two matrices. The extension to more than two matrices is straightforward. Using different data layouts, we can assess which performs best for a given execution policy and computing environment.
Layout 1: Entries in each matrix are grouped together with each each having row major ordering; i.e.,
Layout 2: Matrix entries are first ordered by matrix index, then by column index, and finally by row index; i.e.,
Permuted Layouts¶
Next, we show how to construct the two data layouts using RAJA::View
and
RAJA::Layout
objects. For more details on these RAJA concepts, please
refer to View and Layout.
The views for layout 1 are constructed as follows:
std::array<RAJA::idx_t, 3> perm1 {{0, 1, 2}};
auto layout1 =
RAJA::make_permuted_layout( {{N, N_r, N_c}}, perm1 );
RAJA::View<double, RAJA::Layout<3, Index_type, 2>> Aview(A, layout1);
RAJA::View<double, RAJA::Layout<3, Index_type, 2>> Bview(B, layout1);
RAJA::View<double, RAJA::Layout<3, Index_type, 2>> Cview(C, layout1);
The first argument to RAJA::make_permuted_layout
is a C++ array
whose entries correspond to the size of each array dimension; i.e., we have
‘N’ \(N_r \times N_c\) matrices. The second argument describes the
striding order of the array dimensions. Note that since this case follows
the default RAJA ordering convention (see View and Layout), we use the
identity permutation ‘(0,1,2)’. For each matrix, the column index (index 2)
has unit stride and the row index (index 1) has stride 3 (number of columns).
The matrix index (index 0) has stride 9 (\(N_c \times N_r\)).
The views for layout 2 are constructed similarly:
std::array<RAJA::idx_t, 3> perm2 {{1, 2, 0}};
auto layout2 =
RAJA::make_permuted_layout( {{N, N_r, N_c}}, perm2 );
RAJA::View<double, RAJA::Layout<3, Index_type, 0>> Aview2(A2, layout2);
RAJA::View<double, RAJA::Layout<3, Index_type, 0>> Bview2(B2, layout2);
RAJA::View<double, RAJA::Layout<3, Index_type, 0>> Cview2(C2, layout2);
Here, the first argument to RAJA::make_permuted_layout
is the same as in
layout 1 since we have the same number of matrices, matrix dimensions and we
will use the same indexing scheme to access the matrix entries. However, the
permutation we use is ‘(1,2,0)’. This makes the matrix index (index 0) have
unit stride, the column index (index 2) for each matrix has stride N, which
is the number of matrices, and the row index (index 1) has
stride \(N \times N_c\).
Example Code¶
Complete working examples that run the batched matrix-multiplication
computation for both layouts and various RAJA execution policies is located
in the file RAJA/examples/tut_batched-matrix-multiply.cpp
.
It compares the execution run times of the two layouts described above using four RAJA back-ends (Sequential, OpenMP, CUDA, and HIP). The OpenMP version for layout 1 looks like this:
RAJA::forall<RAJA::omp_parallel_for_exec>(
RAJA::RangeSegment(0, N), [=](Index_type e) {
Cview(e, 0, 0) = Aview(e, 0, 0) * Bview(e, 0, 0)
+ Aview(e, 0, 1) * Bview(e, 1, 0)
+ Aview(e, 0, 2) * Bview(e, 2, 0);
Cview(e, 0, 1) = Aview(e, 0, 0) * Bview(e, 0, 1)
+ Aview(e, 0, 1) * Bview(e, 1, 1)
+ Aview(e, 0, 2) * Bview(e, 2, 1);
Cview(e, 0, 2) = Aview(e, 0, 0) * Bview(e, 0, 2)
+ Aview(e, 0, 1) * Bview(e, 1, 2)
+ Aview(e, 0, 2) * Bview(e, 2, 2);
Cview(e, 1, 0) = Aview(e, 1, 0) * Bview(e, 0, 0)
+ Aview(e, 1, 1) * Bview(e, 1, 0)
+ Aview(e, 1, 2) * Bview(e, 2, 0);
Cview(e, 1, 1) = Aview(e, 1, 0) * Bview(e, 0, 1)
+ Aview(e, 1, 1) * Bview(e, 1, 1)
+ Aview(e, 1, 2) * Bview(e, 2, 1);
Cview(e, 1, 2) = Aview(e, 1, 0) * Bview(e, 0, 2)
+ Aview(e, 1, 1) * Bview(e, 1, 2)
+ Aview(e, 1, 2) * Bview(e, 2, 2);
Cview(e, 2, 0) = Aview(e, 2, 0) * Bview(e, 0, 0)
+ Aview(e, 2, 1) * Bview(e, 1, 0)
+ Aview(e, 2, 2) * Bview(e, 2, 0);
Cview(e, 2, 1) = Aview(e, 2, 0) * Bview(e, 0, 1)
+ Aview(e, 2, 1) * Bview(e, 1, 1)
+ Aview(e, 2, 2) * Bview(e, 2, 1);
Cview(e, 2, 2) = Aview(e, 2, 0) * Bview(e, 0, 2)
+ Aview(e, 2, 1) * Bview(e, 1, 2)
+ Aview(e, 2, 2) * Bview(e, 2, 2);
});
The only differences between the lambda loop body for layout 1 and layout 2 cases are the names of the views. To make the algorithm code identical for all cases, we would use type aliases for the view and layout types in a header file similarly to how we would abstract the execution policy out of the algorithm.