Matrix Multiplication: RAJA::kernel

The file RAJA/examples/tut_matrix-multiply.cpp contains the complete working code for all examples described in this section, plus others that show a variety of RAJA::kernel execution policy types. It also contains raw CUDA and HIP versions of the kernel for comparison.

Key RAJA features shown in the following examples:

  • RAJA::kernel template for nested-loop execution

  • RAJA kernel execution policies

  • RAJA::View multi-dimensional data access

  • RAJA nested-loop interchange

  • Specifying lambda arguments through statements

In this example, we present different ways to perform multiplication of two square matrices ‘A’ and ‘B’ of dimension N x N and store the result in matrix ‘C’. To motivate the use of the RAJA::View abstraction that we use, we define the following macros to access the matrix entries in the C-version:

#define A(r, c) A[c + N * r]
#define B(r, c) B[c + N * r]
#define C(r, c) C[c + N * r]

Then, a typical C-style sequential matrix multiplication operation might look like this:

  for (int row = 0; row < N; ++row) {
    for (int col = 0; col < N; ++col) {

      double dot = 0.0;
      for (int k = 0; k < N; ++k) {
        dot += A(row, k) * B(k, col);
      }
      C(row, col) = dot;

    }
  }

For the RAJA variants of the matrix multiplication operation presented below, we use RAJA::View objects, which allow us to access matrix entries in a multi-dimensional manner similar to the C-style version that uses macros. We create a two-dimensional N x N ‘view’ for each matrix:

  constexpr int DIM = 2;  // number of view dimensions

  RAJA::View<double, RAJA::Layout<DIM>> Aview(A, N, N);
  RAJA::View<double, RAJA::Layout<DIM>> Bview(B, N, N);
  RAJA::View<double, RAJA::Layout<DIM>> Cview(C, N, N);

We show the most basic RAJA view usage here – to simplify multi-dimensional array indexing. RAJA views can be used to abstract a variety of different data layouts and access patterns, including stride permutations, offsets, etc. For more information about RAJA views, see View and Layout.
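For example, a permuted layout changes the stride ordering of a view without changing how the view is indexed in kernel code. The following sketch (the names perm, perm_layout, and Aview_cm are illustrative, not from the example file) creates a column-major view of the ‘A’ matrix:

  // Permutation entries run from slowest- to fastest-varying dimension,
  // so dimension 0 (the row index) gets unit stride here.
  std::array<RAJA::idx_t, 2> perm {{1, 0}};
  RAJA::Layout<2> perm_layout = RAJA::make_permuted_layout({{N, N}}, perm);
  RAJA::View<double, RAJA::Layout<2>> Aview_cm(A, perm_layout);

  // Indexing is unchanged: Aview_cm(row, col) resolves to A[row + N * col].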

We also use the following RAJA::TypedRangeSegment objects to define the matrix row, column, and dot product iteration spaces:

  RAJA::TypedRangeSegment<int> row_range(0, N);
  RAJA::TypedRangeSegment<int> col_range(0, N);
  RAJA::TypedRangeSegment<int> dot_range(0, N);

Basic RAJA::kernel Variants

Next, we show how to cast the matrix-multiplication operation using the RAJA::kernel interface, which was introduced in Complex Loops (RAJA::kernel). We first present a complete example, and then describe its key elements, noting important differences between RAJA::kernel and RAJA::forall loop execution interfaces.

  using EXEC_POL =
    RAJA::KernelPolicy<
      RAJA::statement::For<1, RAJA::seq_exec,    // row
        RAJA::statement::For<0, RAJA::seq_exec,  // col
          RAJA::statement::Lambda<0>
        >
      >
    >;

  RAJA::kernel<EXEC_POL>(RAJA::make_tuple(col_range, row_range),
    [=](int col, int row) {

    double dot = 0.0;
    for (int k = 0; k < N; ++k) {
      dot += Aview(row, k) * Bview(k, col);
    }
    Cview(row, col) = dot;

  });

Here, RAJA::kernel expresses the outer ‘row’ and ‘col’ loops; the inner ‘k’ loop is included in the lambda expression for the loop body. Note that the RAJA::kernel template takes two arguments. Similar to RAJA::forall, the first argument describes the iteration space and the second argument is the lambda loop body. Unlike RAJA::forall, the iteration space for RAJA::kernel is defined as a tuple of ranges (created via the RAJA::make_tuple method), one for the ‘col’ loop and one for the ‘row’ loop. Also, the lambda expression takes an iteration index argument for each entry in the iteration space tuple.

Note

The number and order of lambda arguments must match the number and order of the elements in the tuple for this type of RAJA::kernel usage to be correct.

Another important difference between RAJA::forall and RAJA::kernel involves the execution policy template parameter. The execution policy defined by the RAJA::KernelPolicy type shown here specifies a policy for each level in the loop nest via nested RAJA::statement::For types. Here, the row and column loops will both execute sequentially. The integer that appears as the first template parameter to each ‘For’ statement corresponds to the position of a range in the iteration space tuple and also to the associated iteration index argument to the lambda. Here, ‘0’ is the ‘col’ range and ‘1’ is the ‘row’ range because that is the order in which those ranges appear in the tuple. The innermost type RAJA::statement::Lambda<0> indicates that the first lambda expression passed to the RAJA::kernel method (the only one in this case) will be invoked inside the inner loop.
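To make the mapping concrete, the policy above generates loop nesting equivalent to the following C-style sketch (not RAJA's actual generated code), where each For statement corresponds to one loop level:

  for (int row = 0; row < N; ++row) {      // RAJA::statement::For<1, RAJA::seq_exec>
    for (int col = 0; col < N; ++col) {    // RAJA::statement::For<0, RAJA::seq_exec>
      // RAJA::statement::Lambda<0> invokes the lambda body here
    }
  }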

The integer arguments to the RAJA::statement::For types are needed to enable the desired kernel execution pattern and potential transformations, without changing the kernel code. Since the kernel policy is a single unified construct, it can be used to parallelize the nested loop iterations together, which we show next.

If we want to execute the row loop using OpenMP multithreaded parallelism and keep the column loop sequential, the policy we would use is:

  using EXEC_POL1 =
    RAJA::KernelPolicy<
      RAJA::statement::For<1, RAJA::omp_parallel_for_exec,  // row
        RAJA::statement::For<0, RAJA::seq_exec,            // col
          RAJA::statement::Lambda<0>
        >
      >
    >;

To swap the loop nest ordering and keep the same execution policy on each loop, we would use the following policy, which swaps the RAJA::statement::For types. The inner loop is now the ‘row’ loop and is run in parallel; the outer loop is now the ‘col’ loop and is run sequentially:

  using EXEC_POL2 =
    RAJA::KernelPolicy<
      RAJA::statement::For<0, RAJA::seq_exec,                  // col
        RAJA::statement::For<1, RAJA::omp_parallel_for_exec,    // row
          RAJA::statement::Lambda<0>
        >
      >
    >;
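The resulting loop structure is equivalent to the following sketch:

  // Loop structure generated by EXEC_POL2 (a sketch): 'col' is now the
  // outer sequential loop; the inner 'row' loop runs in parallel.
  for (int col = 0; col < N; ++col) {
    #pragma omp parallel for
    for (int row = 0; row < N; ++row) {
      // lambda body
    }
  }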

Note

It is important to emphasize that these kernel transformations, and others, can be done by switching the RAJA::KernelPolicy type with no changes to the loop kernel code.

In Basic RAJA::kernel Mechanics and Nested Loop Ordering, we provide a more detailed discussion of the mechanics of loop nest ordering. Next, we show other variations of the matrix multiplication kernel that illustrate other RAJA::kernel features.

More Complex RAJA::kernel Variants

The matrix multiplication kernel variations described in this section use execution policies to express the outer row and col loops, as well as the inner dot product loop, within the RAJA kernel interface. They illustrate more complex policy examples and show additional RAJA kernel features.

The first example uses sequential execution for all loops:

  using EXEC_POL6a =
    RAJA::KernelPolicy<
      RAJA::statement::For<1, RAJA::seq_exec,
        RAJA::statement::For<0, RAJA::seq_exec,
          RAJA::statement::Lambda<0, RAJA::Params<0>>,  // dot = 0.0
          RAJA::statement::For<2, RAJA::seq_exec,
            RAJA::statement::Lambda<1> // inner loop: dot += ...
          >,
          RAJA::statement::Lambda<2, RAJA::Segs<0, 1>, RAJA::Params<0>>   // set C(row, col) = dot
        >
      >
    >;

  RAJA::kernel_param<EXEC_POL6a>(
    RAJA::make_tuple(col_range, row_range, dot_range),

    RAJA::tuple<double>{0.0},    // thread local variable for 'dot'

    // lambda 0
    [=] (double& dot) {
       dot = 0.0;
    },

    // lambda 1
    [=] (int col, int row, int k, double& dot) {
       dot += Aview(row, k) * Bview(k, col);
    },

    // lambda 2
    [=] (int col, int row, double& dot) {
       Cview(row, col) = dot;
    }

  );

Note that we use a RAJA::kernel_param method to execute the kernel. It is similar to RAJA::kernel except that it accepts a tuple as the second argument (between the iteration space tuple and the lambda expressions). In general, the tuple is a set of parameters that can be used in the lambda expressions comprising the kernel. Here, the parameter tuple holds a single scalar variable for the dot product of each row-column pair.

The remaining arguments include a sequence of lambda expressions representing different parts of the kernel body. We use three lambda expressions that: initialize the dot product variable (lambda 0), define the ‘k’ inner loop row-col dot product operation (lambda 1), and store the computed row-col dot product in the proper location in the result matrix (lambda 2). Passing the scalar dot product variable from the parameter tuple into the lambdas enables the same variable to be shared across all three lambda expressions. However, observe that the lambdas do not all take the same arguments; each takes only the index and parameter values it uses. This is the result of using the RAJA::Params and RAJA::Segs template parameter types in the RAJA::statement::Lambda types for lambdas ‘0’ and ‘2’. Specifically, RAJA::statement::Lambda<0, RAJA::Params<0>> indicates that lambda ‘0’ will take only the scalar parameter as an argument, and RAJA::statement::Lambda<2, RAJA::Segs<0, 1>, RAJA::Params<0>> indicates that lambda ‘2’ will take index values for the column and row ranges and the scalar parameter as arguments, in that order. Since lambda ‘1’ takes all of the index values and the scalar parameter, we do not specify its arguments.

Alternatively, the statement to invoke lambda ‘1’ could be augmented to specify the arguments it takes:

  // Alias for convenience
  using RAJA::Segs;
  using RAJA::Params;

  using EXEC_POL6b =
    RAJA::KernelPolicy<
      RAJA::statement::For<1, RAJA::seq_exec,
        RAJA::statement::For<0, RAJA::seq_exec,
          RAJA::statement::Lambda<0, Params<0>>,  // dot = 0.0
          RAJA::statement::For<2, RAJA::seq_exec,
            RAJA::statement::Lambda<1, Segs<0,1,2>, Params<0>> // dot += ...
          >,
          RAJA::statement::Lambda<2, Segs<0,1>, Params<0>>  // C(row, col) = dot
        >
      >
    >;

  RAJA::kernel_param<EXEC_POL6b>(
    RAJA::make_tuple(col_range, row_range, dot_range),

    RAJA::tuple<double>{0.0},    // thread local variable for 'dot'

    // lambda 0
    [=] (double& dot) {
       dot = 0.0;
    },

    // lambda 1
    [=] (int col, int row, int k, double& dot) {
       dot += Aview(row, k) * Bview(k, col);
    },

    // lambda 2
    [=] (int col, int row, double& dot) {
       Cview(row, col) = dot;
    }

  );

The result is the same.

By using RAJA::statement::Lambda parameters in this way, the code indicates more clearly which arguments each lambda expression uses. Of course, this makes the execution policy more verbose, but the policy is typically hidden away in a header file, so it need not make the code harder to read.

As we noted earlier, the execution policy type passed to the RAJA::kernel_param method as a template parameter describes how the statements and lambda expressions are assembled to form the complete kernel. To illustrate this, we describe various policies that enable the kernel to run in different ways. In each case, the RAJA::kernel_param method call, including its arguments, is the same. The curious reader may inspect the example code in the file noted above to see that this is indeed the case.

Next, we show how to collapse nested loops in an OpenMP parallel region using a RAJA::statement::Collapse type in the execution policy. This allows one to parallelize multiple levels in a loop nest using OpenMP directives. The following policy will collapse the two outer loops into one OpenMP parallel region:

  using EXEC_POL7 =
    RAJA::KernelPolicy<
      RAJA::statement::Collapse<RAJA::omp_parallel_collapse_exec,
                                RAJA::ArgList<1, 0>,   // row, col
        RAJA::statement::Lambda<0, RAJA::Params<0>>,  // dot = 0.0
        RAJA::statement::For<2, RAJA::seq_exec,
          RAJA::statement::Lambda<1> // inner loop: dot += ...
        >,
        RAJA::statement::Lambda<2, RAJA::Segs<0, 1>, RAJA::Params<0>>   // set C(row, col) = dot
      >
    >;

The RAJA::ArgList type indicates which loops in the nest are to be collapsed and their nesting order within the collapse region. The integers passed to ArgList are indices of entries in the tuple of iteration spaces and indicate inner to outer loop levels when read from right to left. Here, ‘1, 0’ indicates that the column loop is the inner loop and the row loop is the outer loop. For this transformation, there are no RAJA::statement::For types or policies for the individual loop levels inside the OpenMP collapse region.
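The behavior is analogous to applying an OpenMP collapse directive to the two outer loops of the C-style kernel, as in the following sketch:

  // Rough OpenMP equivalent of EXEC_POL7 (a sketch, not RAJA's generated
  // code): the row and col loops form a single collapsed parallel region.
  #pragma omp parallel for collapse(2)
  for (int row = 0; row < N; ++row) {
    for (int col = 0; col < N; ++col) {
      double dot = 0.0;
      for (int k = 0; k < N; ++k) {
        dot += Aview(row, k) * Bview(k, col);
      }
      Cview(row, col) = dot;
    }
  }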

Lastly, we show how to use RAJA::statement::CudaKernel and RAJA::statement::HipKernel types to generate GPU kernels launched with a particular thread-block decomposition. We reiterate that although the policies are different, the kernels themselves are identical to the sequential and OpenMP variants above.

Here is a policy that will distribute the row indices across CUDA thread blocks and column indices across threads in the x dimension of each block:

  using EXEC_POL8a =
    RAJA::KernelPolicy<
      RAJA::statement::CudaKernel<
        RAJA::statement::For<1, RAJA::cuda_block_x_loop,    // row
          RAJA::statement::For<0, RAJA::cuda_thread_x_loop, // col
            RAJA::statement::Lambda<0, RAJA::Params<0>>,    // dot = 0.0
            RAJA::statement::For<2, RAJA::seq_exec,
                RAJA::statement::Lambda<1>                  // dot += ...
            >,
            RAJA::statement::Lambda<2, RAJA::Segs<0, 1>, RAJA::Params<0>>   // set C = ...
          >
        >
      >
    >;

This is equivalent to defining a CUDA kernel with the lambda body inside it and defining row and column indices as:

int row = blockIdx.x;
int col = threadIdx.x;

and launching the kernel with appropriate CUDA grid and thread-block dimensions.
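For comparison, a simplified raw CUDA analogue of the CUDA policy above might look like the following sketch. The kernel name and launch configuration here are illustrative; the complete raw CUDA and HIP variants live in the example file noted at the top of this section, and the cuda_*_loop policies additionally stride when N exceeds the launch dimensions:

// Hypothetical raw CUDA kernel: one row per block, one column per
// thread in the block's x dimension (assumes N <= 1024).
__global__ void matMultSketch(int N, double* C, double* A, double* B)
{
  int row = blockIdx.x;
  int col = threadIdx.x;

  if (row < N && col < N) {
    double dot = 0.0;
    for (int k = 0; k < N; ++k) {
      dot += A[k + N * row] * B[col + N * k];
    }
    C[col + N * row] = dot;
  }
}

// Launched with one block per row and one thread per column:
//   matMultSketch<<<N, N>>>(N, C, A, B);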

The HIP execution policy is similar:

  using EXEC_POL8b =
    RAJA::KernelPolicy<
      RAJA::statement::HipKernel<
        RAJA::statement::For<1, RAJA::hip_block_x_loop,    // row
          RAJA::statement::For<0, RAJA::hip_thread_x_loop, // col
            RAJA::statement::Lambda<0, RAJA::Params<0>>,   // dot = 0.0
            RAJA::statement::For<2, RAJA::seq_exec,
                RAJA::statement::Lambda<1>                 // dot += ...
            >,
            RAJA::statement::Lambda<2,
              RAJA::Segs<0,1>, RAJA::Params<0>>            // set C = ...
          >
        >
      >
    >;

The following policy will tile row and col indices across two-dimensional CUDA thread blocks with ‘x’ and ‘y’ dimensions defined by a ‘CUDA_BLOCK_SIZE’ parameter that can be set at compile time. Within each tile, the kernel iterations are executed by CUDA threads.

  using EXEC_POL9a =
    RAJA::KernelPolicy<
      RAJA::statement::CudaKernel<
        RAJA::statement::Tile<1, RAJA::tile_fixed<CUDA_BLOCK_SIZE>,
                                 RAJA::cuda_block_y_loop,
          RAJA::statement::Tile<0, RAJA::tile_fixed<CUDA_BLOCK_SIZE>,
                                   RAJA::cuda_block_x_loop,
            RAJA::statement::For<1, RAJA::cuda_thread_y_loop,   // row
              RAJA::statement::For<0, RAJA::cuda_thread_x_loop, // col
                RAJA::statement::Lambda<0, RAJA::Params<0>>,    // dot = 0.0
                RAJA::statement::For<2, RAJA::seq_exec,
                    RAJA::statement::Lambda<1>                 // dot += ...
                >,
                RAJA::statement::Lambda<2, RAJA::Segs<0, 1>, RAJA::Params<0>>   // set C = ...
              >
            >
          >
        >
      >
    >;

Note that the tiling mechanism requires a RAJA::statement::Tile type, with a tile size and a tiling execution policy, plus a RAJA::statement::For type with an execution policy for each tile dimension.
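In raw CUDA terms, this tiling corresponds roughly to the familiar two-dimensional block decomposition. A sketch of the index mapping, assuming the grid covers the full iteration space:

  // Approximate index mapping produced by EXEC_POL9a (a sketch):
  int row = blockIdx.y * CUDA_BLOCK_SIZE + threadIdx.y;
  int col = blockIdx.x * CUDA_BLOCK_SIZE + threadIdx.x;
  // The Tile statements also mask out-of-range iterations when N is not
  // a multiple of CUDA_BLOCK_SIZE.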

The analogous HIP execution policy is:

  using EXEC_POL9b =
    RAJA::KernelPolicy<
      RAJA::statement::HipKernel<
        RAJA::statement::Tile<1, RAJA::tile_fixed<HIP_BLOCK_SIZE>,
                                 RAJA::hip_block_y_loop,
          RAJA::statement::Tile<0, RAJA::tile_fixed<HIP_BLOCK_SIZE>,
                                   RAJA::hip_block_x_loop,
            RAJA::statement::For<1, RAJA::hip_thread_y_loop,    // row
              RAJA::statement::For<0, RAJA::hip_thread_x_loop,  // col
                RAJA::statement::Lambda<0, Params<0>>,          // dot = 0.0
                RAJA::statement::For<2, RAJA::seq_exec,
                  RAJA::statement::Lambda<1, Segs<0,1,2>, Params<0>> // dot += ...
                >,
                RAJA::statement::Lambda<2, Segs<0,1>, Params<0>>   // set C = ...
              >
            >
          >
        >
      >
    >;

In Tiled Matrix Transpose and Tiled Matrix Transpose with Local Array, we discuss loop tiling in more detail.