Matrix Multiplication (Nested Loops)

Key RAJA features shown in the following examples:

  • RAJA::kernel template for nested-loop execution
  • RAJA kernel execution policies
  • RAJA::View multi-dimensional data access
  • Basic RAJA nested-loop interchange
  • Specifying lambda arguments through statements

In this example, we present different ways to perform multiplication of two square matrices ‘A’ and ‘B’ of dimension N x N and store the result in matrix ‘C’. To motivate the use of the RAJA::View abstraction that we use, we define the following macros to access the matrix entries in the C-version:

#define A(r, c) A[c + N * r]
#define B(r, c) B[c + N * r]
#define C(r, c) C[c + N * r]

Then, a typical C-style sequential matrix multiplication operation looks like this:

  for (int row = 0; row < N; ++row) {
    for (int col = 0; col < N; ++col) {

      double dot = 0.0;
      for (int k = 0; k < N; ++k) {
        dot += A(row, k) * B(k, col);
      }
      C(row, col) = dot;

    }
  }

For the RAJA variants of the matrix multiple operation presented below, we use RAJA::View objects, which allow us to access matrix entries in a multi-dimensional manner similar to the C-style version that uses macros. We create a two-dimensional N x N ‘view’ for each of the three matrices:

  RAJA::View<double, RAJA::Layout<DIM>> Aview(A, N, N);
  RAJA::View<double, RAJA::Layout<DIM>> Bview(B, N, N);
  RAJA::View<double, RAJA::Layout<DIM>> Cview(C, N, N);

We show the most basic RAJA view usage here – to simplify multi-dimensional array indexing. RAJA views can be used to abstract a variety of different data layouts and access patterns, including stride permutations, offsets, etc. For more information about RAJA views, see View and Layout.

We also use the following RAJA::RangeSegment objects to define the matrix row and column and dot product iteration spaces:

  RAJA::RangeSegment row_range(0, N);
  RAJA::RangeSegment col_range(0, N);
  RAJA::RangeSegment dot_range(0, N);

Should I Use RAJA::forall For Nested Loops?

We begin by walking through some RAJA variants of the matrix multiplication operation that show RAJA usage that we do not recommend, but which helps to motivate the RAJA::kernel interface. We noted some rationale behind this preference in Complex Loops (RAJA::kernel). Here, we discuss this in more detail.

Starting with the C-style kernel above, we first convert the outermost ‘row’ loop to a RAJA::forall method call with a sequential execution policy:

  RAJA::forall<RAJA::loop_exec>( row_range, [=](int row) {

    for (int col = 0; col < N; ++col) {

      double dot = 0.0;
      for (int k = 0; k < N; ++k) {
        dot += Aview(row, k) * Bview(k, col);
      }
      Cview(row, col) = dot;

    }

  });

Here, the lambda expression for the loop body contains the inner ‘col’ and ‘k’ loops.

Note that changing the RAJA execution policy to an OpenMP or CUDA policy enables the outer ‘row’ loop to run in parallel. When this is done, each thread executes the lambda expression body, which contains the ‘col’ and ‘k’ loops. Although this enables some parallelism, there is still more available. In a bit, we will how the RAJA::kernel interface helps us to expose all available parallelism.

Next, we nest a RAJA::forall method call for the ‘column’ loop inside the outer lambda expression:

  RAJA::forall<RAJA::loop_exec>( row_range, [=](int row) {

    RAJA::forall<RAJA::loop_exec>( col_range, [=](int col) {

      double dot = 0.0;
      for (int k = 0; k < N; ++k) {
        dot += Aview(row, k) * Bview(k, col);
      }
      Cview(row, col) = dot;

    });

  });

Here, the innermost lambda expression contains the row-column dot product initialization, the inner ‘k’ loop for the dot product, and the operation that assigns the dot product to the proper location in the result matrix.

Note that we can replace either RAJA execution policy with an OpenMP execution policy to parallelize either the ‘row’ or ‘col’ loop. For example, we can use an OpenMP execution policy on the outer ‘row’ loop and the result will be the same as using an OpenMP execution policy in the earlier case that used a RAJA::forall statement for the outer loop.

We do not recommend using a parallel execution policy for both loops in this type of kernel as the results may not be what is expected and RAJA provides better mechanisms for parallelizing nested loops. Also, changing the outer loop policy to a CUDA policy will not compile. This is by design in RAJA since nesting forall statements inside lambdas in this way has limited utility, is inflexible, and can hinder performance when compared to RAJA::kernel constructs, which we describe next.

Basic RAJA::kernel Variants

Next, we show how to cast the matrix-multiplication operation using the RAJA::kernel interface, which was introduced in Complex Loops (RAJA::kernel). We first present a complete example, and then describe its key elements, noting important differences between RAJA::kernel and RAJA::forall loop execution interfaces.

  using EXEC_POL =
    RAJA::KernelPolicy<
      RAJA::statement::For<1, RAJA::loop_exec,    // row
        RAJA::statement::For<0, RAJA::loop_exec,  // col
          RAJA::statement::Lambda<0>
        >
      >
    >;

  RAJA::kernel<EXEC_POL>(RAJA::make_tuple(col_range, row_range),
    [=](int col, int row) {

    double dot = 0.0;
    for (int k = 0; k < N; ++k) {
      dot += Aview(row, k) * Bview(k, col);
    }
    Cview(row, col) = dot;

  });

Here, RAJA::kernel expresses the outer ‘row’ and ‘col’ loops; the inner ‘k’ loop is included in the lambda expression for the loop body. Note that the RAJA::kernel template takes two arguments. Similar to RAJA::forall, the first argument describes the iteration space and the second argument is the lambda loop body. Unlike RAJA::forall, the iteration space for RAJA::kernel is defined as a tuple of ranges (created via the RAJA::make_tuple method), one for the ‘col’ loop and one for the ‘row’ loop. Also, the lambda expression takes an iteration index argument for entry in the iteration space tuple.

Note

The number and order of lambda arguments must match the number and order of the elements in the tuple for this to be correct.

Another important difference between RAJA::forall and RAJA::kernel involves the execution policy template parameter. The execution policy defined by the RAJA::KernelPolicy type used here specifies a policy for each level in the loop nest via nested RAJA::statement::For types. Here, the row and column loops will both execute sequentially. The integer that appears as the first template parameter to each ‘For’ statement corresponds to the position of a range in the iteration space tuple and also to the associated iteration index argument to the lambda. Here, ‘0’ is the ‘col’ range and ‘1’ is the ‘row’ range because that is the order those ranges appear in the tuple. The innermost type RAJA::statement::Lambda<0> indicates that the first lambda expression (the only one in this case!) argument passed to the RAJA::kernel method will be invoked inside the nested loops.

The integer arguments to the RAJA::statement::For types are needed to enable a variety of kernel execution patterns and transformations. Since the kernel policy is a single unified construct, it can be used to parallelize the nested loop iterations together, which we will show later. Also, the levels in the loop nest can be permuted by reordering the policy arguments; this is analogous to how one would reorder C-style nested loops; i.e., reorder for-statements for each loop nest level. These execution patterns and transformations can be achieved by changing only the policy and leaving the loop kernel code as is.

If we want to execute the row loop using OpenMP multithreaded parallelism and keep the column loop sequential, the policy we would use is:

  using EXEC_POL1 =
    RAJA::KernelPolicy<
      RAJA::statement::For<1, RAJA::omp_parallel_for_exec,  // row
        RAJA::statement::For<0, RAJA::loop_exec,            // col
          RAJA::statement::Lambda<0>
        >
      >
    >;

To swap the loop nest ordering and keep the same execution policy on each loop, we would use the following policy, which swaps the RAJA::statement::For types. The inner loop is now the ‘row’ loop and is run in parallel; the outer loop is now the ‘col’ loop and is still sequential:

  using EXEC_POL2 =
    RAJA::KernelPolicy<
      RAJA::statement::For<0, RAJA::loop_exec,                  // col
        RAJA::statement::For<1, RAJA::omp_parallel_for_exec,    // row
          RAJA::statement::Lambda<0>
        >
      >
    >;

Note

It is important to emphasize that these kernel transformations, and others, can be done by switching the RAJA::KernelPolicy type with no changes to the loop kernel code.

In Nested Loop Interchange, we provide a more detailed discussion of the mechanics of loop nest reordering. Next, we show other variations of the matrix multiplication kernel that illustrate other RAJA::kernel features.

More Complex RAJA::kernel Variants

The matrix multiplication kernel variations described in this section use execution policies to express the outer row and col loops as well as the inner dot product loop using the RAJA kernel interface. They illustrate more complex policy examples and show additional RAJA kernel features.

The first example uses sequential execution for all loops:

  using EXEC_POL6a =
    RAJA::KernelPolicy<
      RAJA::statement::For<1, RAJA::loop_exec,
        RAJA::statement::For<0, RAJA::loop_exec,
          RAJA::statement::Lambda<0, RAJA::Params<0>>,  // dot = 0.0
          RAJA::statement::For<2, RAJA::loop_exec,
            RAJA::statement::Lambda<1> // inner loop: dot += ...
          >,
          RAJA::statement::Lambda<2, RAJA::Segs<0, 1>, RAJA::Params<0>>   // set C(row, col) = dot
        >
      >
    >;

  RAJA::kernel_param<EXEC_POL6a>(
    RAJA::make_tuple(col_range, row_range, dot_range),

    RAJA::tuple<double>{0.0},    // thread local variable for 'dot'

    // lambda 0
    [=] (double& dot) {
       dot = 0.0;
    },

    // lambda 1
    [=] (int col, int row, int k, double& dot) {
       dot += Aview(row, k) * Bview(k, col);
    },

    // lambda 2
    [=] (int col, int row, double& dot) {
       Cview(row, col) = dot;
    }

  );

Note that we use a RAJA::kernel_param method to execute the kernel. It is similar to RAJA::kernel except that it accepts a tuple as the second argument (between the iteration space tuple and the lambda expressions). The tuple is a set of parameters that can be used in the kernel to pass data into lambda expressions. Here, the parameter tuple holds a single scalar variable for the dot product.

The remaining arguments include a sequence of lambda expressions representing different parts of the inner loop body. We use three lambda expressions that: initialize the dot product variable (lambda 0), define the ‘k’ inner loop row-col dot product operation (lambda 1), and store the computed row-col dot product in the proper location in the result matrix (lambda 2). Note that all lambdas take the same arguments in the same order, which is required for the kernel to be well-formed. In addition to the loop index variables, we pass the scalar dot product variable into each lambda. This enables the same variables to be used in all three lambda expressions. However, observe that not all lambda expressions use all three index variables. They are declared, but left unnamed to prevent compiler warnings.

Alternatively, the lambda statements in the execution policy may be used to specify which arguments each lambda takes and in which order. For example:

  // Alias for convenience
  using RAJA::Segs;
  using RAJA::Params;

  using EXEC_POL6b =
    RAJA::KernelPolicy<
      RAJA::statement::For<1, RAJA::loop_exec,
        RAJA::statement::For<0, RAJA::loop_exec,
          RAJA::statement::Lambda<0, Params<0>>,  // dot = 0.0
          RAJA::statement::For<2, RAJA::loop_exec,
            RAJA::statement::Lambda<1, Segs<0,1,2>, Params<0>> // dot += ...
          >,
          RAJA::statement::Lambda<2, Segs<0,1>, Params<0>>  // C(row, col) = dot
        >
      >
    >;

  RAJA::kernel_param<EXEC_POL6b>(
    RAJA::make_tuple(col_range, row_range, dot_range),

    RAJA::tuple<double>{0.0},    // thread local variable for 'dot'

    // lambda 0
    [=] (double& dot) {
       dot = 0.0;
    },

    // lambda 1
    [=] (int col, int row, int k, double& dot) {
       dot += Aview(row, k) * Bview(k, col);
    },

    // lambda 2
    [=] (int col, int row, double& dot) {
       Cview(row, col) = dot;
    }

  );

By using RAJA::statement::Lambda parameters in this way, the code potentially indicates more clearly which areguments are used. Of course, this makes the execution policy more verbose, but that is typically hidden away in a header file. Statements such as RAJA::Segs, and RAJA::Params identify the positions of the segments and params in the tuples to be used as arguments to the lambda expressions.

As we noted earlier, the execution policy type passed to the RAJA::kernel_param method as a template parameter describes how the statements and lambda expressions are assembled to form the complete kernel. To illustrate this, we describe various policies that enable the kernel to run in different ways. In each case, the RAJA::kernel_param method call, including its arguments is the same. The curious reader will inspect the example code in the file listed below to see that this is indeed the case. In the interest of simplicity, the remaining matrix multiplication examples do not use RAJA::statement::Lambda parameters to specify arguments to the lambda expressions.

Next, we show how to collapse nested loops in an OpenMP parallel region using a RAJA::statement::Collapse type in the execution policy. This allows one to parallelize multiple levels in a loop nest using OpenMP directives, for instance. The following policy will collapse the two outer loops into one OpenMP parallel region:

  using EXEC_POL7 =
    RAJA::KernelPolicy<
      RAJA::statement::Collapse<RAJA::omp_parallel_collapse_exec,
                                RAJA::ArgList<1, 0>,   // row, col
        RAJA::statement::Lambda<0, RAJA::Params<0>>,  // dot = 0.0
        RAJA::statement::For<2, RAJA::loop_exec,
          RAJA::statement::Lambda<1> // inner loop: dot += ...
        >,
        RAJA::statement::Lambda<2, RAJA::Segs<0, 1>, RAJA::Params<0>>   // set C(row, col) = dot
      >
    >;

The RAJA::ArgList type indicates which loops in the nest are to be collapsed and their nesting order within the collapse region. The integers passed to ArgList are indices of entries in the tuple of iteration spaces and indicate inner to outer loop levels when read from right to left (i.e., here ‘1, 0’ indicates the column loop is the inner loop and the row loop is the outer). For this transformation there are no statement::For types and policies for the individual loop levels inside the OpenMP collapse region.

Lastly, we show how to use RAJA::statement::CudaKernel and RAJA::statement::HipKernel types to generate GPU kernels launched with a particular thread-block decomposition. We reiterate that although the policies are different, the kernels themselves are identical to the sequential and OpenMP variants above.

Here is a policy that will distribute the row indices across CUDA thread blocks and column indices across threads in the x dimension of each block:

  using EXEC_POL8 =
    RAJA::KernelPolicy<
      RAJA::statement::CudaKernel<
        RAJA::statement::For<1, RAJA::cuda_block_x_loop,    // row
          RAJA::statement::For<0, RAJA::cuda_thread_x_loop, // col
            RAJA::statement::Lambda<0, RAJA::Params<0>>,    // dot = 0.0
            RAJA::statement::For<2, RAJA::seq_exec,
                RAJA::statement::Lambda<1>                  // dot += ...
            >,
            RAJA::statement::Lambda<2, RAJA::Segs<0, 1>, RAJA::Params<0>>   // set C = ...
          >
        >
      >
    >;

This is equivalent to defining a CUDA kernel with the lambda body inside it and defining row and column indices as:

int row = blockIdx.x;
int col = threadIdx.x;

and launching the kernel with appropriate CUDA grid and thread-block dimensions.

The HIP execution policy is similar:

  using EXEC_POL8 =
    RAJA::KernelPolicy<
      RAJA::statement::HipKernel<
        RAJA::statement::For<1, RAJA::hip_block_x_loop,    // row
          RAJA::statement::For<0, RAJA::hip_thread_x_loop, // col
            RAJA::statement::Lambda<0, RAJA::Params<0>>,   // dot = 0.0
            RAJA::statement::For<2, RAJA::seq_exec,
                RAJA::statement::Lambda<1>                 // dot += ...
            >,
            RAJA::statement::Lambda<2,
              RAJA::Segs<0,1>, RAJA::Params<0>>            // set C = ...
          >
        >
      >
    >;

The following policy will tile row and col indices across two-dimensional CUDA thread blocks with ‘x’ and ‘y’ dimensions defined by a ‘CUDA_BLOCK_SIZE’ parameter that can be set at compile time. Within each tile, the kernel iterates are executed by CUDA threads.

  using EXEC_POL9a =
    RAJA::KernelPolicy<
      RAJA::statement::CudaKernel<
        RAJA::statement::Tile<1, RAJA::tile_fixed<CUDA_BLOCK_SIZE>,
                                 RAJA::cuda_block_y_loop,
          RAJA::statement::Tile<0, RAJA::tile_fixed<CUDA_BLOCK_SIZE>,
                                   RAJA::cuda_block_x_loop,
            RAJA::statement::For<1, RAJA::cuda_thread_y_loop,   // row
              RAJA::statement::For<0, RAJA::cuda_thread_x_loop, // col
                RAJA::statement::Lambda<0, RAJA::Params<0>>,    // dot = 0.0
                RAJA::statement::For<2, RAJA::seq_exec,
                    RAJA::statement::Lambda<1>                 // dot += ...
                >,
                RAJA::statement::Lambda<2, RAJA::Segs<0, 1>, RAJA::Params<0>>   // set C = ...
              >
            >
          >
        >
      >
    >;

Note that the tiling mechanism requires a RAJA::statement::Tile type, with a tile size and a tiling execution policy, plus a RAJA::statement::For type with an execution execution policy for each tile dimension.

The analogous HIP policy is:

  using EXEC_POL9b =
    RAJA::KernelPolicy<
      RAJA::statement::HipKernel<
        RAJA::statement::Tile<1, RAJA::tile_fixed<HIP_BLOCK_SIZE>,
                                 RAJA::hip_block_y_loop,
          RAJA::statement::Tile<0, RAJA::tile_fixed<HIP_BLOCK_SIZE>,
                                   RAJA::hip_block_x_loop,
            RAJA::statement::For<1, RAJA::hip_thread_y_loop,    // row
              RAJA::statement::For<0, RAJA::hip_thread_x_loop,  // col
                RAJA::statement::Lambda<0, Params<0>>,          // dot = 0.0
                RAJA::statement::For<2, RAJA::seq_exec,
                  RAJA::statement::Lambda<1, Segs<0,1,2>, Params<0>> // dot += ...
                >,
                  RAJA::statement::Lambda<2, Segs<0,1>, Params<0>>   // set C = ...
              >
            >
          >
        >
      >
    >;

In Tiled Matrix Transpose and Matrix Transpose with Local Array, we will discuss loop tiling in more detail including how it can be used to improve performance of certain algorithms.

The file RAJA/examples/tut_matrix-multiply.cpp contains the complete working code for all examples described in this section, plus others that show a variety of RAJA::kernel execution policy types. It also contains a raw CUDA version of the kernel for comparison.