Matrix Multiplication (Nested Loops)¶
Key RAJA features shown in the following examples:
RAJA::kernel
template for nested-loop execution- RAJA kernel execution policies
RAJA::View
multi-dimensional data access- Basic RAJA nested-loop interchange
- Specifying lambda arguments through statements
In this example, we present different ways to perform multiplication of two
square matrices ‘A’ and ‘B’ of dimension N x N and store the result in matrix
‘C’. To motivate the use of the RAJA::View
abstraction that we use,
we define the following macros to access the matrix entries in the
C-version:
#define A(r, c) A[c + N * r]
#define B(r, c) B[c + N * r]
#define C(r, c) C[c + N * r]
Then, a typical C-style sequential matrix multiplication operation looks like this:
for (int row = 0; row < N; ++row) {
for (int col = 0; col < N; ++col) {
double dot = 0.0;
for (int k = 0; k < N; ++k) {
dot += A(row, k) * B(k, col);
}
C(row, col) = dot;
}
}
For the RAJA variants of the matrix multiple operation presented below,
we use RAJA::View
objects, which allow us to access matrix
entries in a multi-dimensional manner similar to the C-style version that
uses macros. We create a two-dimensional N x N ‘view’
for each of the three matrices:
RAJA::View<double, RAJA::Layout<DIM>> Aview(A, N, N);
RAJA::View<double, RAJA::Layout<DIM>> Bview(B, N, N);
RAJA::View<double, RAJA::Layout<DIM>> Cview(C, N, N);
We show the most basic RAJA view usage here – to simplify multi-dimensional array indexing. RAJA views can be used to abstract a variety of different data layouts and access patterns, including stride permutations, offsets, etc. For more information about RAJA views, see View and Layout.
We also use the following RAJA::RangeSegment
objects to define the matrix
row and column and dot product iteration spaces:
RAJA::RangeSegment row_range(0, N);
RAJA::RangeSegment col_range(0, N);
RAJA::RangeSegment dot_range(0, N);
Should I Use RAJA::forall For Nested Loops?¶
We begin by walking through some RAJA variants of the matrix multiplication
operation that show RAJA usage that we do not recommend, but which helps
to motivate the RAJA::kernel
interface. We noted some rationale behind
this preference in Complex Loops (RAJA::kernel). Here, we discuss this
in more detail.
Starting with the C-style kernel above, we first convert the outermost
‘row’ loop to a RAJA::forall
method call with a sequential execution policy:
RAJA::forall<RAJA::loop_exec>( row_range, [=](int row) {
for (int col = 0; col < N; ++col) {
double dot = 0.0;
for (int k = 0; k < N; ++k) {
dot += Aview(row, k) * Bview(k, col);
}
Cview(row, col) = dot;
}
});
Here, the lambda expression for the loop body contains the inner ‘col’ and ‘k’ loops.
Note that changing the RAJA execution policy to an OpenMP or CUDA policy
enables the outer ‘row’ loop to run in parallel. When this is done,
each thread executes the lambda expression body, which contains the ‘col’
and ‘k’ loops. Although this enables some parallelism, there is still more
available. In a bit, we will how the RAJA::kernel
interface helps us to
expose all available parallelism.
Next, we nest a RAJA::forall
method call for the ‘column’ loop inside the
outer lambda expression:
RAJA::forall<RAJA::loop_exec>( row_range, [=](int row) {
RAJA::forall<RAJA::loop_exec>( col_range, [=](int col) {
double dot = 0.0;
for (int k = 0; k < N; ++k) {
dot += Aview(row, k) * Bview(k, col);
}
Cview(row, col) = dot;
});
});
Here, the innermost lambda expression contains the row-column dot product initialization, the inner ‘k’ loop for the dot product, and the operation that assigns the dot product to the proper location in the result matrix.
Note that we can replace either RAJA execution policy with an OpenMP
execution policy to parallelize either the ‘row’ or ‘col’ loop. For example,
we can use an OpenMP execution policy on the outer ‘row’ loop and the result
will be the same as using an OpenMP execution policy in the earlier case that
used a RAJA::forall
statement for the outer loop.
We do not recommend using a parallel execution policy for both loops in
this type of kernel as the results may not be what is expected and RAJA
provides better mechanisms for parallelizing nested loops. Also, changing
the outer loop policy to a CUDA policy will not compile. This is by design
in RAJA since nesting forall statements inside lambdas in this way has limited
utility, is inflexible, and can hinder performance when compared to
RAJA::kernel
constructs, which we describe next.
Basic RAJA::kernel Variants¶
Next, we show how to cast the matrix-multiplication operation
using the RAJA::kernel
interface, which was introduced in
Complex Loops (RAJA::kernel). We first present a complete example, and
then describe its key elements, noting important differences between
RAJA::kernel
and RAJA::forall
loop execution interfaces.
using EXEC_POL =
RAJA::KernelPolicy<
RAJA::statement::For<1, RAJA::loop_exec, // row
RAJA::statement::For<0, RAJA::loop_exec, // col
RAJA::statement::Lambda<0>
>
>
>;
RAJA::kernel<EXEC_POL>(RAJA::make_tuple(col_range, row_range),
[=](int col, int row) {
double dot = 0.0;
for (int k = 0; k < N; ++k) {
dot += Aview(row, k) * Bview(k, col);
}
Cview(row, col) = dot;
});
Here, RAJA::kernel
expresses the outer ‘row’ and ‘col’ loops;
the inner ‘k’ loop is included in the lambda expression for the loop body.
Note that the RAJA::kernel
template takes two arguments. Similar to
RAJA::forall
, the first argument describes the iteration space and the
second argument is the lambda loop body. Unlike RAJA::forall
, the
iteration space for RAJA::kernel
is defined as a tuple of ranges
(created via the RAJA::make_tuple
method), one for the ‘col’ loop and
one for the ‘row’ loop. Also, the lambda expression takes an iteration index
argument for entry in the iteration space tuple.
Note
The number and order of lambda arguments must match the number and order of the elements in the tuple for this to be correct.
Another important difference between RAJA::forall
and RAJA::kernel
involves the execution policy template parameter. The execution policy defined
by the RAJA::KernelPolicy
type used here specifies a policy for each level
in the loop nest via nested RAJA::statement::For
types. Here, the row and
column loops will both execute sequentially. The integer that appears as the
first template parameter to each ‘For’ statement corresponds to the position of
a range in the iteration space tuple and also to the associated iteration
index argument to the lambda. Here, ‘0’ is the ‘col’ range and ‘1’ is the
‘row’ range because that is the order those ranges appear in the tuple. The
innermost type RAJA::statement::Lambda<0>
indicates that the first lambda
expression (the only one in this case!) argument passed to the
RAJA::kernel
method will be invoked inside the nested loops.
The integer arguments to the RAJA::statement::For
types are needed to
enable a variety of kernel execution patterns and transformations. Since the
kernel policy is a single unified construct, it can be used to parallelize
the nested loop iterations together, which we will show later. Also, the
levels in the loop nest can be permuted by reordering the policy arguments;
this is analogous to how one would reorder C-style nested loops; i.e.,
reorder for-statements for each loop nest level. These execution patterns
and transformations can be achieved by changing only the policy and leaving the
loop kernel code as is.
If we want to execute the row loop using OpenMP multithreaded parallelism and keep the column loop sequential, the policy we would use is:
using EXEC_POL1 =
RAJA::KernelPolicy<
RAJA::statement::For<1, RAJA::omp_parallel_for_exec, // row
RAJA::statement::For<0, RAJA::loop_exec, // col
RAJA::statement::Lambda<0>
>
>
>;
To swap the loop nest ordering and keep the same execution policy on each loop,
we would use the following policy, which swaps the RAJA::statement::For
types. The inner loop is now the ‘row’ loop and is run in parallel;
the outer loop is now the ‘col’ loop and is still sequential:
using EXEC_POL2 =
RAJA::KernelPolicy<
RAJA::statement::For<0, RAJA::loop_exec, // col
RAJA::statement::For<1, RAJA::omp_parallel_for_exec, // row
RAJA::statement::Lambda<0>
>
>
>;
Note
It is important to emphasize that these kernel transformations,
and others, can be done by switching the RAJA::KernelPolicy
type with no changes to the loop kernel code.
In Nested Loop Interchange, we provide a more detailed discussion of the
mechanics of loop nest reordering. Next, we show other variations of the
matrix multiplication kernel that illustrate other RAJA::kernel
features.
More Complex RAJA::kernel Variants¶
The matrix multiplication kernel variations described in this section use execution policies to express the outer row and col loops as well as the inner dot product loop using the RAJA kernel interface. They illustrate more complex policy examples and show additional RAJA kernel features.
The first example uses sequential execution for all loops:
using EXEC_POL6a =
RAJA::KernelPolicy<
RAJA::statement::For<1, RAJA::loop_exec,
RAJA::statement::For<0, RAJA::loop_exec,
RAJA::statement::Lambda<0, RAJA::Params<0>>, // dot = 0.0
RAJA::statement::For<2, RAJA::loop_exec,
RAJA::statement::Lambda<1> // inner loop: dot += ...
>,
RAJA::statement::Lambda<2, RAJA::Segs<0, 1>, RAJA::Params<0>> // set C(row, col) = dot
>
>
>;
RAJA::kernel_param<EXEC_POL6a>(
RAJA::make_tuple(col_range, row_range, dot_range),
RAJA::tuple<double>{0.0}, // thread local variable for 'dot'
// lambda 0
[=] (double& dot) {
dot = 0.0;
},
// lambda 1
[=] (int col, int row, int k, double& dot) {
dot += Aview(row, k) * Bview(k, col);
},
// lambda 2
[=] (int col, int row, double& dot) {
Cview(row, col) = dot;
}
);
Note that we use a RAJA::kernel_param
method to execute the kernel. It is
similar to RAJA::kernel
except that it accepts a tuple as the second
argument (between the iteration space tuple and the lambda expressions). The
tuple is a set of parameters that can be used in the kernel to pass data
into lambda expressions. Here, the parameter tuple holds a single scalar
variable for the dot product.
The remaining arguments include a sequence of lambda expressions representing different parts of the inner loop body. We use three lambda expressions that: initialize the dot product variable (lambda 0), define the ‘k’ inner loop row-col dot product operation (lambda 1), and store the computed row-col dot product in the proper location in the result matrix (lambda 2). Note that all lambdas take the same arguments in the same order, which is required for the kernel to be well-formed. In addition to the loop index variables, we pass the scalar dot product variable into each lambda. This enables the same variables to be used in all three lambda expressions. However, observe that not all lambda expressions use all three index variables. They are declared, but left unnamed to prevent compiler warnings.
Alternatively, the lambda statements in the execution policy may be used to specify which arguments each lambda takes and in which order. For example:
// Alias for convenience
using RAJA::Segs;
using RAJA::Params;
using EXEC_POL6b =
RAJA::KernelPolicy<
RAJA::statement::For<1, RAJA::loop_exec,
RAJA::statement::For<0, RAJA::loop_exec,
RAJA::statement::Lambda<0, Params<0>>, // dot = 0.0
RAJA::statement::For<2, RAJA::loop_exec,
RAJA::statement::Lambda<1, Segs<0,1,2>, Params<0>> // dot += ...
>,
RAJA::statement::Lambda<2, Segs<0,1>, Params<0>> // C(row, col) = dot
>
>
>;
RAJA::kernel_param<EXEC_POL6b>(
RAJA::make_tuple(col_range, row_range, dot_range),
RAJA::tuple<double>{0.0}, // thread local variable for 'dot'
// lambda 0
[=] (double& dot) {
dot = 0.0;
},
// lambda 1
[=] (int col, int row, int k, double& dot) {
dot += Aview(row, k) * Bview(k, col);
},
// lambda 2
[=] (int col, int row, double& dot) {
Cview(row, col) = dot;
}
);
By using RAJA::statement::Lambda
parameters in this way, the code
potentially indicates more clearly which areguments are used. Of course, this
makes the execution policy more verbose, but that is typically hidden away
in a header file. Statements such as RAJA::Segs
, and
RAJA::Params
identify the positions of the segments and params
in the tuples to be used as arguments to the lambda expressions.
As we noted earlier, the execution policy type passed to the
RAJA::kernel_param
method as a template parameter describes how the
statements and lambda expressions are assembled to form the complete kernel.
To illustrate this, we describe various policies that enable the kernel to
run in different ways. In each case, the RAJA::kernel_param
method call,
including its arguments is the same. The curious reader will inspect the
example code in the file listed below to see that this is indeed the case.
In the interest of simplicity, the remaining matrix multiplication examples
do not use RAJA::statement::Lambda
parameters to specify arguments to
the lambda expressions.
Next, we show how to collapse nested loops in an OpenMP parallel region
using a RAJA::statement::Collapse
type in the execution policy. This
allows one to parallelize multiple levels in a loop nest using OpenMP
directives, for instance. The following policy will collapse the two outer
loops into one OpenMP parallel region:
using EXEC_POL7 =
RAJA::KernelPolicy<
RAJA::statement::Collapse<RAJA::omp_parallel_collapse_exec,
RAJA::ArgList<1, 0>, // row, col
RAJA::statement::Lambda<0, RAJA::Params<0>>, // dot = 0.0
RAJA::statement::For<2, RAJA::loop_exec,
RAJA::statement::Lambda<1> // inner loop: dot += ...
>,
RAJA::statement::Lambda<2, RAJA::Segs<0, 1>, RAJA::Params<0>> // set C(row, col) = dot
>
>;
The RAJA::ArgList
type indicates which loops in the nest are to be
collapsed and their nesting order within the collapse region. The integers
passed to ArgList
are indices of entries in the tuple of iteration spaces
and indicate inner to outer loop levels when read from right to left (i.e.,
here ‘1, 0’ indicates the column loop is the inner loop and the row loop is
the outer). For this transformation there are no statement::For
types
and policies for the individual loop levels inside the OpenMP collapse region.
Lastly, we show how to use RAJA::statement::CudaKernel
and
RAJA::statement::HipKernel
types to generate GPU kernels launched with
a particular thread-block decomposition. We reiterate that although the
policies are different, the kernels themselves are identical to the sequential
and OpenMP variants above.
Here is a policy that will distribute the row indices across CUDA thread blocks and column indices across threads in the x dimension of each block:
using EXEC_POL8 =
RAJA::KernelPolicy<
RAJA::statement::CudaKernel<
RAJA::statement::For<1, RAJA::cuda_block_x_loop, // row
RAJA::statement::For<0, RAJA::cuda_thread_x_loop, // col
RAJA::statement::Lambda<0, RAJA::Params<0>>, // dot = 0.0
RAJA::statement::For<2, RAJA::seq_exec,
RAJA::statement::Lambda<1> // dot += ...
>,
RAJA::statement::Lambda<2, RAJA::Segs<0, 1>, RAJA::Params<0>> // set C = ...
>
>
>
>;
This is equivalent to defining a CUDA kernel with the lambda body inside it and defining row and column indices as:
int row = blockIdx.x;
int col = threadIdx.x;
and launching the kernel with appropriate CUDA grid and thread-block dimensions.
The HIP execution policy is similar:
using EXEC_POL8 =
RAJA::KernelPolicy<
RAJA::statement::HipKernel<
RAJA::statement::For<1, RAJA::hip_block_x_loop, // row
RAJA::statement::For<0, RAJA::hip_thread_x_loop, // col
RAJA::statement::Lambda<0, RAJA::Params<0>>, // dot = 0.0
RAJA::statement::For<2, RAJA::seq_exec,
RAJA::statement::Lambda<1> // dot += ...
>,
RAJA::statement::Lambda<2,
RAJA::Segs<0,1>, RAJA::Params<0>> // set C = ...
>
>
>
>;
The following policy will tile row and col indices across two-dimensional CUDA thread blocks with ‘x’ and ‘y’ dimensions defined by a ‘CUDA_BLOCK_SIZE’ parameter that can be set at compile time. Within each tile, the kernel iterates are executed by CUDA threads.
using EXEC_POL9a =
RAJA::KernelPolicy<
RAJA::statement::CudaKernel<
RAJA::statement::Tile<1, RAJA::tile_fixed<CUDA_BLOCK_SIZE>,
RAJA::cuda_block_y_loop,
RAJA::statement::Tile<0, RAJA::tile_fixed<CUDA_BLOCK_SIZE>,
RAJA::cuda_block_x_loop,
RAJA::statement::For<1, RAJA::cuda_thread_y_loop, // row
RAJA::statement::For<0, RAJA::cuda_thread_x_loop, // col
RAJA::statement::Lambda<0, RAJA::Params<0>>, // dot = 0.0
RAJA::statement::For<2, RAJA::seq_exec,
RAJA::statement::Lambda<1> // dot += ...
>,
RAJA::statement::Lambda<2, RAJA::Segs<0, 1>, RAJA::Params<0>> // set C = ...
>
>
>
>
>
>;
Note that the tiling mechanism requires a RAJA::statement::Tile
type,
with a tile size and a tiling execution policy, plus a RAJA::statement::For
type with an execution execution policy for each tile dimension.
The analogous HIP policy is:
using EXEC_POL9b =
RAJA::KernelPolicy<
RAJA::statement::HipKernel<
RAJA::statement::Tile<1, RAJA::tile_fixed<HIP_BLOCK_SIZE>,
RAJA::hip_block_y_loop,
RAJA::statement::Tile<0, RAJA::tile_fixed<HIP_BLOCK_SIZE>,
RAJA::hip_block_x_loop,
RAJA::statement::For<1, RAJA::hip_thread_y_loop, // row
RAJA::statement::For<0, RAJA::hip_thread_x_loop, // col
RAJA::statement::Lambda<0, Params<0>>, // dot = 0.0
RAJA::statement::For<2, RAJA::seq_exec,
RAJA::statement::Lambda<1, Segs<0,1,2>, Params<0>> // dot += ...
>,
RAJA::statement::Lambda<2, Segs<0,1>, Params<0>> // set C = ...
>
>
>
>
>
>;
In Tiled Matrix Transpose and Matrix Transpose with Local Array, we will discuss loop tiling in more detail including how it can be used to improve performance of certain algorithms.
The file RAJA/examples/tut_matrix-multiply.cpp
contains the complete
working code for all examples described in this section, plus others that
show a variety of RAJA::kernel
execution policy types. It also contains
a raw CUDA version of the kernel for comparison.