OffsetLayout: Five-point Stencil
There is an exercise file RAJA/exercises/offset-layout-stencil.cpp
for you to work through if you wish to get some practice with RAJA. The
file RAJA/exercises/offset-layout-stencil_solution.cpp contains
complete working code for the examples discussed in this section. You can use
the solution file to check your work and for guidance if you get stuck. To build
the exercises, execute make offset-layout-stencil
and make offset-layout-stencil_solution
from the build directory.
Key RAJA features shown in the following example:

  * RAJA::kernel loop execution template and execution policies
  * RAJA::View multi-dimensional data access
  * RAJA::make_offset_layout method to create an offset Layout
The examples in this section apply a five-point stencil to the interior cells
of a two-dimensional lattice and store a resulting sum in a second
lattice of equal size. The five-point stencil associated with a lattice cell
accumulates the value in the cell and each of its four neighbors. We use
RAJA::View
and RAJA::OffsetLayout
constructs to simplify
the multi-dimensional indexing so that we can write the stencil operation
naturally:
output(row, col) = input(row, col) +
input(row - 1, col) + input(row + 1, col) +
input(row, col - 1) + input(row, col + 1)
A lattice is assumed to have \(N_r \times N_c\) interior cells with unit values, surrounded by a halo of cells containing zero values, for a total dimension of \((N_r + 2) \times (N_c + 2)\). For example, when \(N_r = N_c = 3\), the input lattice and values are:
0 0 0 0 0
0 1 1 1 0
0 1 1 1 0
0 1 1 1 0
0 0 0 0 0
After applying the stencil, each interior cell holds the sum of its own value and its four neighbors (cells bordering the halo pick up zeros from it), so the output lattice and values are:
0 0 0 0 0
0 3 4 3 0
0 4 5 4 0
0 3 4 3 0
0 0 0 0 0
For this \((N_r + 2) \times (N_c + 2)\) lattice case, here is our (row, col) indexing scheme.
(-1, 3)  (0, 3)  (1, 3)  (2, 3)  (3, 3)
(-1, 2)  (0, 2)  (1, 2)  (2, 2)  (3, 2)
(-1, 1)  (0, 1)  (1, 1)  (2, 1)  (3, 1)
(-1, 0)  (0, 0)  (1, 0)  (2, 0)  (3, 0)
(-1, -1) (0, -1) (1, -1) (2, -1) (3, -1)
Notably, \([0, N_r) \times [0, N_c)\) corresponds to the interior index range over which we apply the stencil, and \([-1,N_r+1) \times [-1, N_c+1)\) is the full lattice index range.
For reference and comparison to the RAJA::kernel
implementations
described below, we begin by walking through a C-style version of the stencil
computation. First, we define the size of our lattice:
//
// Define number of interior cells in the row and column dimensions of a lattice
//
constexpr int N_r = 5;
constexpr int N_c = 4;
//
// Define total number of cells in the row and column dimensions of a lattice
//
constexpr int totCellsInRow = N_r + 2;
constexpr int totCellsInCol = N_c + 2;
//
// Define total number of cells in a lattice
//
constexpr int totCells = totCellsInRow * totCellsInCol;
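Before the initialization loop below, the input, output, and reference arrays must be allocated and zero-initialized so that the halo cells hold zero values. Here is a minimal sketch of that setup, assuming plain heap allocation (the exercise files handle allocation with their own utilities):

#include <cstring>  // std::memset

// Sketch only: allocate the lattices and zero them so the halo cells are zero.
int* input      = new int[totCells];
int* output     = new int[totCells];
int* output_ref = new int[totCells];

std::memset(input,      0, totCells * sizeof(int));
std::memset(output,     0, totCells * sizeof(int));
std::memset(output_ref, 0, totCells * sizeof(int));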
Then, after allocating input and output arrays, we initialize the input:
for (int row = 1; row <= N_r; ++row) {
for (int col = 1; col <= N_c; ++col) {
int id = col + totCellsInCol * row;
input[id] = 1;
}
}
and compute the reference output solution:
for (int row = 1; row <= N_r; ++row) {
for (int col = 1; col <= N_c; ++col) {
int id = col + totCellsInCol * row;
output_ref[id] = input[id] + input[id + 1]
+ input[id - 1]
+ input[id + totCellsInCol]
+ input[id - totCellsInCol];
}
}
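Each RAJA variant described below can be checked against output_ref. The exercise files provide their own result-checking routine; the loop below is a minimal comparison sketch, for illustration only:

#include <iostream>

// Sketch: verify a computed output lattice against the C-style reference.
bool match = true;
for (int i = 0; i < totCells; ++i) {
  if (output[i] != output_ref[i]) { match = false; break; }
}
std::cout << (match ? "result -- PASS" : "result -- FAIL") << std::endl;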
RAJA Offset Layouts
We use the RAJA::make_offset_layout
method to construct a
RAJA::OffsetLayout
object that we use to create RAJA::View
objects
for our input and output data arrays:
const int DIM = 2;
RAJA::OffsetLayout<DIM, int> layout =
RAJA::make_offset_layout<DIM, int>({{-1, -1}}, {{N_r+1, N_c+1}});
RAJA::View<int, RAJA::OffsetLayout<DIM, int>> inputView(input, layout);
RAJA::View<int, RAJA::OffsetLayout<DIM, int>> outputView(output, layout);
Here, the row index range is \([-1, N_r+1)\), and the column index
range is \([-1, N_c+1)\). The first argument to each call to the
RAJA::View
constructor is the pointer to the array that holds the View
data. The second argument is the RAJA::OffsetLayout
object.
RAJA::OffsetLayout
objects allow us to write loops over
data arrays using non-zero-based indexing, without having to compute offsets
into the arrays manually.
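To make the offset arithmetic concrete, here is a small sketch of the index computation an offset layout performs for this example, assuming the default row-major ordering of a two-dimensional layout; the linearIndex helper is illustrative only and not part of the RAJA API:

// Illustrative helper (not a RAJA function): map an offset (row, col) pair
// to the linear index into the underlying data array.
int linearIndex(int row, int col)
{
  constexpr int rowBegin  = -1;        // lowest valid row index
  constexpr int colBegin  = -1;        // lowest valid column index
  constexpr int rowStride = N_c + 2;   // cells per row, including the halo
  return (row - rowBegin) * rowStride + (col - colBegin);
}

For example, inputView(-1, -1) reads input[0], a halo corner, while inputView(0, 0) reads input[N_c + 3], the first interior cell.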
For more information about RAJA View and Layout types, please see View and Layout.
RAJA Kernel Variants
For the RAJA implementations of the stencil computation, we use two
RAJA::TypedRangeSegment
objects to define the row and column iteration
spaces for the interior cells:
RAJA::TypedRangeSegment<int> col_range(0, N_c);
RAJA::TypedRangeSegment<int> row_range(0, N_r);
Now, we have all the ingredients to implement the stencil computation using
RAJA::kernel
. Here is a sequential CPU variant:
using NESTED_EXEC_POL1 =
RAJA::KernelPolicy<
RAJA::statement::For<1, RAJA::seq_exec, // row
RAJA::statement::For<0, RAJA::seq_exec, // col
RAJA::statement::Lambda<0>
>
>
>;
RAJA::kernel<NESTED_EXEC_POL1>(RAJA::make_tuple(col_range, row_range),
[=](int col, int row) {
outputView(row, col) =
inputView(row, col)
+ inputView(row - 1, col)
+ inputView(row + 1, col)
+ inputView(row, col - 1)
+ inputView(row, col + 1);
});
This RAJA variant performs the same computation as the C-style variant introduced above.
Since the input and output arrays are distinct, the stencil computation is
data parallel. Thus, we can use RAJA::kernel
and an appropriate
execution policy to run the computation in parallel. Here is an OpenMP
collapse variant that maps the row-column product index space to OpenMP
threads:
using NESTED_EXEC_POL2 =
RAJA::KernelPolicy<
RAJA::statement::Collapse<RAJA::omp_parallel_collapse_exec,
RAJA::ArgList<1, 0>, // row, col
RAJA::statement::Lambda<0>
>
>;
RAJA::kernel<NESTED_EXEC_POL2>(RAJA::make_tuple(col_range, row_range),
[=](int col, int row) {
outputView(row, col) =
inputView(row, col)
+ inputView(row - 1, col)
+ inputView(row + 1, col)
+ inputView(row, col - 1)
+ inputView(row, col + 1);
});
Note that the lambda expression representing the kernel body is identical to
the RAJA::kernel
sequential version.
Here are variants for CUDA
using NESTED_EXEC_POL3 =
RAJA::KernelPolicy<
RAJA::statement::CudaKernel<
RAJA::statement::For<1, RAJA::cuda_block_x_loop, //row
RAJA::statement::For<0, RAJA::cuda_thread_x_loop, //col
RAJA::statement::Lambda<0>
>
>
>
>;
RAJA::kernel<NESTED_EXEC_POL3>(RAJA::make_tuple(col_range, row_range),
[=] RAJA_DEVICE(int col, int row) {
outputView(row, col) =
inputView(row, col)
+ inputView(row - 1, col)
+ inputView(row + 1, col)
+ inputView(row, col - 1)
+ inputView(row, col + 1);
});
and HIP
using NESTED_EXEC_POL4 =
RAJA::KernelPolicy<
RAJA::statement::HipKernel<
RAJA::statement::For<1, RAJA::hip_block_x_loop, //row
RAJA::statement::For<0, RAJA::hip_thread_x_loop, //col
RAJA::statement::Lambda<0>
>
>
>
>;
RAJA::kernel<NESTED_EXEC_POL4>(RAJA::make_tuple(col_range, row_range),
[=] RAJA_DEVICE(int col, int row) {
d_outputView(row, col) =
d_inputView(row, col)
+ d_inputView(row - 1, col)
+ d_inputView(row + 1, col)
+ d_inputView(row, col - 1)
+ d_inputView(row, col + 1);
});
The only difference between the CPU and GPU variants is that the RAJA macro
RAJA_DEVICE
is used to decorate the lambda expression with the
__device__
annotation, which is required for a lambda expression that will execute in a
GPU device environment, as we have discussed in other examples in this
tutorial.
One other point to note is that the CUDA variant in the exercise files
uses Unified Memory and the HIP variant uses distinct host and device memory
arrays, with explicit host-device data copy operations. Thus, new
RAJA::View
objects were created for the HIP variant to wrap the
device data pointers used in the HIP kernel. Please see the exercise files
for this example for details.
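As an illustration of the HIP setup only (the exercise files manage device memory with their own utilities, so the raw hipMalloc/hipMemcpy calls and the d_input/d_output names below are assumptions), the host-device data handling looks roughly like this:

#include <hip/hip_runtime.h>

// Sketch: allocate device copies of the lattices and wrap them in Views
// that reuse the same offset layout as the host Views.
int* d_input  = nullptr;
int* d_output = nullptr;
hipMalloc((void**)&d_input,  totCells * sizeof(int));
hipMalloc((void**)&d_output, totCells * sizeof(int));

// Copy the initialized host data to the device before launching the kernel.
hipMemcpy(d_input,  input,  totCells * sizeof(int), hipMemcpyHostToDevice);
hipMemcpy(d_output, output, totCells * sizeof(int), hipMemcpyHostToDevice);

RAJA::View<int, RAJA::OffsetLayout<DIM, int>> d_inputView(d_input, layout);
RAJA::View<int, RAJA::OffsetLayout<DIM, int>> d_outputView(d_output, layout);

// ... run the RAJA::kernel HIP variant shown above ...

// Copy the result back to the host for checking.
hipMemcpy(output, d_output, totCells * sizeof(int), hipMemcpyDeviceToHost);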