OffsetLayout: Five-point Stencil

There is an exercise file RAJA/exercises/offset-layout-stencil.cpp for you to work through if you wish to get some practice with RAJA. The file RAJA/exercises/offset-layout-stencil_solution.cpp contains complete working code for the examples discussed in this section. You can use the solution file to check your work and for guidance if you get stuck. To build the exercises, execute make offset-layout-stencil and make offset-layout-stencil_solution from the build directory.

Key RAJA features shown in the following example:

  • RAJA::kernel loop execution template and execution policies

  • RAJA::View multi-dimensional data access

  • RAJA::make_offset_layout method to create an offset Layout

The examples in this section apply a five-point stencil to the interior cells of a two-dimensional lattice and store each resulting sum in a second lattice of the same size. The five-point stencil associated with a lattice cell accumulates the value in the cell and the values of its four neighbors. We use RAJA::View and RAJA::OffsetLayout constructs to simplify the multi-dimensional indexing so that we can write the stencil operation naturally, as follows:

output(row, col) = input(row, col) +
                   input(row - 1, col) + input(row + 1, col) +
                   input(row, col - 1) + input(row, col + 1)

A lattice is assumed to have \(N_r \times N_c\) interior cells with unit values, surrounded by a halo of cells containing zero values, for a total size of \((N_r + 2) \times (N_c + 2)\) cells. For example, when \(N_r = N_c = 3\), the input lattice and values are:

   0   0   0   0   0
   0   1   1   1   0
   0   1   1   1   0
   0   1   1   1   0
   0   0   0   0   0

After applying the stencil, the output lattice and values are:

   0   0   0   0   0
   0   3   4   3   0
   0   4   5   4   0
   0   3   4   3   0
   0   0   0   0   0
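
For example, the center interior cell sums five unit values, \(1 + 1 + 1 + 1 + 1 = 5\), while each corner interior cell has two halo neighbors and sums \(1 + 0 + 1 + 0 + 1 = 3\).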

For this \((N_r + 2) \times (N_c + 2)\) lattice, here is our (row, col) indexing scheme:

  (-1, 3)   (0, 3)   (1, 3)   (2, 3)   (3, 3)
  (-1, 2)   (0, 2)   (1, 2)   (2, 2)   (3, 2)
  (-1, 1)   (0, 1)   (1, 1)   (2, 1)   (3, 1)
  (-1, 0)   (0, 0)   (1, 0)   (2, 0)   (3, 0)
  (-1, -1)  (0, -1)  (1, -1)  (2, -1)  (3, -1)

Notably, \([0, N_r) \times [0, N_c)\) corresponds to the interior index range over which we apply the stencil, and \([-1,N_r+1) \times [-1, N_c+1)\) is the full lattice index range.

For reference and comparison to the RAJA::kernel implementations described below, we begin by walking through a C-style version of the stencil computation. First, we define the size of our lattice:

//
// Define num of interior cells in row/cols in a lattice
//
  constexpr int N_r = 5;
  constexpr int N_c = 4;

//
// Define total num of cells in rows/cols in a lattice
//
  constexpr int totCellsInRow = N_r + 2;
  constexpr int totCellsInCol = N_c + 2;

//
// Define total num of cells in a lattice
//
  constexpr int totCells = totCellsInRow * totCellsInCol;
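
The allocation of the input and output arrays is not shown here. A minimal sketch, assuming plain C++ allocation (the exercise files actually use a small memory manager helper for this), might look like:

  //
  // Hypothetical allocation sketch (not the exercise code): value-initialize
  // all cells to zero so the halo cells hold zeros.
  //
  int* input      = new int[totCells]();
  int* output     = new int[totCells]();
  int* output_ref = new int[totCells]();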

Then, after allocating input and output arrays, we initialize the input:

  for (int row = 1; row <= N_r; ++row) {
    for (int col = 1; col <= N_c; ++col) {
      int id = col + totCellsInCol * row;
      input[id] = 1;
    }
  }

and compute the reference output solution (in the flattened indexing, id ± 1 moves one cell in the column direction and id ± totCellsInCol moves one cell in the row direction):

  for (int row = 1; row <= N_r; ++row) {
    for (int col = 1; col <= N_c; ++col) {

      int id = col + totCellsInCol * row;
      output_ref[id] = input[id] + input[id + 1]
                        + input[id - 1]
                        + input[id + totCellsInCol]
                        + input[id - totCellsInCol];
    }
  }
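
Later, once a RAJA variant has filled the output array, it can be checked against this reference. A minimal sketch (the exercise files provide a similar checking helper) is:

  //
  // Illustrative check (not the exercise helper itself): compare a computed
  // output array against the C-style reference solution; assumes <iostream>.
  //
  bool match = true;
  for (int id = 0; id < totCells; ++id) {
    if (output[id] != output_ref[id]) { match = false; break; }
  }
  std::cout << (match ? "result -- PASS" : "result -- FAIL") << std::endl;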

RAJA Offset Layouts

We use the RAJA::make_offset_layout method to construct a RAJA::OffsetLayout object that we use to create RAJA::View objects for our input and output data arrays:

  const int DIM = 2;

  RAJA::OffsetLayout<DIM, int> layout =
      RAJA::make_offset_layout<DIM, int>({{-1, -1}}, {{N_r+1, N_c+1}});

  RAJA::View<int, RAJA::OffsetLayout<DIM, int>> inputView(input, layout);
  RAJA::View<int, RAJA::OffsetLayout<DIM, int>> outputView(output, layout);

Here, the row index range is \([-1, N_r+1)\), and the column index range is \([-1, N_c+1)\). The first argument to each call to the RAJA::View constructor is the pointer to the array that holds the View data. The second argument is the RAJA::OffsetLayout object.

RAJA::OffsetLayout objects allow us to write loops over data arrays using non-zero based indexing and without having to manually compute offsets into the arrays.
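
To make the connection to the C-style indexing concrete, here is an illustrative sketch (not part of the exercise code, and assuming the default row-major ordering of the layout): each access through the offset View refers to the same flat array location used in the C-style loops.

  //
  // Illustrative only: a View access at (row, col) refers to the same memory
  // location as the manual flat index used in the C-style version
  // (assumes <cassert> is included).
  //
  for (int row = -1; row < N_r + 1; ++row) {
    for (int col = -1; col < N_c + 1; ++col) {
      int id = (col + 1) + totCellsInCol * (row + 1);
      assert(&inputView(row, col) == &input[id]);
    }
  }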

For more information about RAJA View and Layout types, please see View and Layout.

RAJA Kernel Variants

For the RAJA implementations of the stencil computation, we use two RAJA::TypedRangeSegment objects to define the row and column iteration spaces for the interior cells:

  RAJA::TypedRangeSegment<int> col_range(0, N_c);
  RAJA::TypedRangeSegment<int> row_range(0, N_r);

Now, we have all the ingredients to implement the stencil computation using RAJA::kernel. Here is a sequential CPU variant:

  using NESTED_EXEC_POL1 =
    RAJA::KernelPolicy<
      RAJA::statement::For<1, RAJA::seq_exec,    // row
        RAJA::statement::For<0, RAJA::seq_exec,  // col
          RAJA::statement::Lambda<0>
        >
      >  
    >;  

  RAJA::kernel<NESTED_EXEC_POL1>(RAJA::make_tuple(col_range, row_range),
                                 [=](int col, int row) {

                                   outputView(row, col) =
                                       inputView(row, col)
                                       + inputView(row - 1, col)
                                       + inputView(row + 1, col)
                                       + inputView(row, col - 1)
                                       + inputView(row, col + 1);

                                 });

This RAJA variant performs the same computation as the C-style variant introduced above. Note that the integer template arguments 1 and 0 in the For statements refer to positions in the iteration-space tuple passed to RAJA::kernel (row_range and col_range, respectively), and the lambda arguments appear in tuple order (col, row).

Since the input and output arrays are distinct, the stencil computation is data parallel. Thus, we can use RAJA::kernel and an appropriate execution policy to run the computation in parallel. Here is an OpenMP collapse variant that maps the row-column product index space to OpenMP threads:

  using NESTED_EXEC_POL2 = 
    RAJA::KernelPolicy<
      RAJA::statement::Collapse<RAJA::omp_parallel_collapse_exec,
                                RAJA::ArgList<1, 0>,   // row, col
        RAJA::statement::Lambda<0>
      > 
    >;

  RAJA::kernel<NESTED_EXEC_POL2>(RAJA::make_tuple(col_range, row_range),
                                 [=](int col, int row) {

                                   outputView(row, col) =
                                       inputView(row, col)
                                       + inputView(row - 1, col)
                                       + inputView(row + 1, col)
                                       + inputView(row, col - 1)
                                       + inputView(row, col + 1);

                                 });

Note that the lambda expression representing the kernel body is identical to the RAJA::kernel sequential version.
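
Since the body is the same, one could, as a sketch (this is not how the exercise file is written), define the lambda once and pass it to both host variants:

  //
  // Sketch only: share one lambda object between the sequential and
  // OpenMP collapse variants (both execute on the host).
  //
  auto stencil_op = [=](int col, int row) {
    outputView(row, col) = inputView(row, col)
                         + inputView(row - 1, col) + inputView(row + 1, col)
                         + inputView(row, col - 1) + inputView(row, col + 1);
  };

  RAJA::kernel<NESTED_EXEC_POL1>(RAJA::make_tuple(col_range, row_range), stencil_op);
  RAJA::kernel<NESTED_EXEC_POL2>(RAJA::make_tuple(col_range, row_range), stencil_op);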

Here are variants for CUDA

  using NESTED_EXEC_POL3 =
    RAJA::KernelPolicy<
      RAJA::statement::CudaKernel<
        RAJA::statement::For<1, RAJA::cuda_block_x_loop, //row
          RAJA::statement::For<0, RAJA::cuda_thread_x_loop, //col
            RAJA::statement::Lambda<0>
          >
        >
      >
    >;                                                     

  RAJA::kernel<NESTED_EXEC_POL3>(RAJA::make_tuple(col_range, row_range),
                                 [=] RAJA_DEVICE(int col, int row) {

                                   outputView(row, col) =
                                       inputView(row, col)
                                       + inputView(row - 1, col)
                                       + inputView(row + 1, col)
                                       + inputView(row, col - 1)
                                       + inputView(row, col + 1);

                                 });

and HIP

  using NESTED_EXEC_POL4 =
    RAJA::KernelPolicy<
      RAJA::statement::HipKernel<
        RAJA::statement::For<1, RAJA::hip_block_x_loop, //row
          RAJA::statement::For<0, RAJA::hip_thread_x_loop, //col
            RAJA::statement::Lambda<0>
          >
        >
      >
    >;

  RAJA::kernel<NESTED_EXEC_POL4>(RAJA::make_tuple(col_range, row_range),
                                 [=] RAJA_DEVICE(int col, int row) {

                                   d_outputView(row, col) =
                                         d_inputView(row, col)
                                       + d_inputView(row - 1, col)
                                       + d_inputView(row + 1, col)
                                       + d_inputView(row, col - 1)
                                       + d_inputView(row, col + 1);
                                 });

The only difference between the CPU and GPU variants is that the RAJA_DEVICE macro is used to decorate the lambda expression with the __device__ annotation, which is required for a lambda that will execute in a GPU device environment, as discussed in other examples in this tutorial.
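
For intuition, the CUDA policy above maps rows to thread blocks and columns to threads in the x dimension, each in a strided loop. Roughly, it corresponds to a hand-written kernel like the following sketch (illustrative only; RAJA generates and launches the actual kernel, and the HIP policy is analogous):

  //
  // Illustrative sketch only (not generated code): blocks stride over rows and
  // threads stride over columns, matching cuda_block_x_loop / cuda_thread_x_loop.
  //
  __global__ void stencilSketch(RAJA::View<int, RAJA::OffsetLayout<2, int>> out,
                                RAJA::View<int, RAJA::OffsetLayout<2, int>> in,
                                int N_r, int N_c)
  {
    for (int row = blockIdx.x; row < N_r; row += gridDim.x) {
      for (int col = threadIdx.x; col < N_c; col += blockDim.x) {
        out(row, col) = in(row, col)
                      + in(row - 1, col) + in(row + 1, col)
                      + in(row, col - 1) + in(row, col + 1);
      }
    }
  }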

One other point to note is that the CUDA variant in the exercise files uses Unified Memory and the HIP variant uses distinct host and device memory arrays, with explicit host-device data copy operations. Thus, new RAJA::View objects were created for the HIP variant to wrap the device data pointers used in the HIP kernel. Please see the exercise files for this example for details.
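
As a rough illustration of what that HIP setup involves (a sketch, not the exercise code; error checking omitted), the device arrays might be prepared along these lines:

  //
  // Hypothetical HIP setup sketch (see the exercise file for the real code):
  // allocate device arrays, copy the data over, and wrap the device pointers
  // in new Views that reuse the same offset layout.
  //
  int* d_input  = nullptr;
  int* d_output = nullptr;
  hipMalloc((void**)&d_input,  totCells * sizeof(int));
  hipMalloc((void**)&d_output, totCells * sizeof(int));
  hipMemcpy(d_input,  input,  totCells * sizeof(int), hipMemcpyHostToDevice);
  hipMemcpy(d_output, output, totCells * sizeof(int), hipMemcpyHostToDevice);

  RAJA::View<int, RAJA::OffsetLayout<DIM, int>> d_inputView(d_input, layout);
  RAJA::View<int, RAJA::OffsetLayout<DIM, int>> d_outputView(d_output, layout);

  // ... run the HIP kernel variant shown above ...

  hipMemcpy(output, d_output, totCells * sizeof(int), hipMemcpyDeviceToHost);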