.. ## .. ## Copyright (c) Lawrence Livermore National Security, LLC and other .. ## RAJA Project Developers. See top-level LICENSE and COPYRIGHT .. ## files for dates and other details. No copyright assignment is required .. ## to contribute to RAJA. .. ## .. ## SPDX-License-Identifier: (BSD-3-Clause) .. ## .. _tut-offsetlayout-label: ------------------------------------------------ OffsetLayout: Five-point Stencil ------------------------------------------------ This section contains an exercise file ``RAJA/exercises/offset-layout-stencil.cpp`` for you to work through if you wish to get some practice with RAJA. The file ``RAJA/exercises/offset-layout-stencil.cpp`` contains complete working code for the examples discussed in this section. You can use the solution file to check your work and for guidance if you get stuck. To build the exercises execute ``make offset-layout-stencil`` and ``make offset-layout-stencil_solution`` from the build directory. Key RAJA features shown in the following example: * ``RAJA::kernel`` loop execution template and execution policies * ``RAJA::View`` multi-dimensional data access * ``RAJA:make_offset_layout`` method to create an offset Layout The examples in this section apply a five-point stencil to the interior cells of a two-dimensional lattice and store a resulting sum in a second lattice of equal size. The five-point stencil associated with a lattice cell accumulates the value in the cell and each of its four neighbors. We use ``RAJA::View`` and ``RAJA::OffsetLayout`` constructs to simplify the multi-dimensional indexing so that we can write the stencil operation naturally, as such:: output(row, col) = input(row, col) + input(row - 1, col) + input(row + 1, col) + input(row, col - 1) + input(row, col + 1) A lattice is assumed to have :math:`N_r \times N_c` interior cells with unit values surrounded by a halo of cells containing zero values for a total dimension of :math:`(N_r + 2) \times (N_c + 2)`. For example, when :math:`N_r = N_c = 3`, the input lattice and values are: +---+---+---+---+---+ | 0 | 0 | 0 | 0 | 0 | +---+---+---+---+---+ | 0 | 1 | 1 | 1 | 0 | +---+---+---+---+---+ | 0 | 1 | 1 | 1 | 0 | +---+---+---+---+---+ | 0 | 1 | 1 | 1 | 0 | +---+---+---+---+---+ | 0 | 0 | 0 | 0 | 0 | +---+---+---+---+---+ After applying the stencil, the output lattice and values are: +---+---+---+---+---+ | 0 | 0 | 0 | 0 | 0 | +---+---+---+---+---+ | 0 | 3 | 4 | 3 | 0 | +---+---+---+---+---+ | 0 | 4 | 5 | 4 | 0 | +---+---+---+---+---+ | 0 | 3 | 4 | 3 | 0 | +---+---+---+---+---+ | 0 | 0 | 0 | 0 | 0 | +---+---+---+---+---+ For this :math:`(N_r + 2) \times (N_c + 2)` lattice case, here is our (row, col) indexing scheme. +----------+---------+---------+---------+---------+ | (-1, 3) | (0, 3) | (1, 3) | (2, 3) | (3, 3) | +----------+---------+---------+---------+---------+ | (-1, 2) | (0, 2) | (1, 2) | (2, 2) | (3, 2) | +----------+---------+---------+---------+---------+ | (-1, 1) | (0, 1) | (1, 1) | (2, 1) | (3, 1) | +----------+---------+---------+---------+---------+ | (-1, 0) | (0, 0) | (1, 0) | (2, 0) | (3, 0) | +----------+---------+---------+---------+---------+ | (-1, -1) | (0, -1) | (1, -1) | (2, -1) | (3, -1) | +----------+---------+---------+---------+---------+ Notably, :math:`[0, N_r) \times [0, N_c)` corresponds to the interior index range over which we apply the stencil, and :math:`[-1,N_r+1) \times [-1, N_c+1)` is the full lattice index range. For reference and comparison to the ``RAJA::kernel`` implementations described below, we begin by walking through a C-style version of the stencil computation. First, we define the size of our lattice: .. literalinclude:: ../../../../exercises/offset-layout-stencil_solution.cpp :start-after: _stencil_define_start :end-before: _stencil_define_end :language: C++ Then, after allocating input and output arrays, we initialize the input: .. literalinclude:: ../../../../exercises/offset-layout-stencil_solution.cpp :start-after: _stencil_input_init_start :end-before: _stencil_input_init_end :language: C++ and compute the reference output solution: .. literalinclude:: ../../../../exercises/offset-layout-stencil_solution.cpp :start-after: _stencil_output_ref_start :end-before: _stencil_output_ref_end :language: C++ ^^^^^^^^^^^^^^^^^^^ RAJA Offset Layouts ^^^^^^^^^^^^^^^^^^^ We use the ``RAJA::make_offset_layout`` method to construct a ``RAJA::OffsetLayout`` object that we use to create ``RAJA::View`` objects for our input and output data arrays: .. literalinclude:: ../../../../exercises/offset-layout-stencil_solution.cpp :start-after: _offsetlayout_views_start :end-before: _offsetlayout_views_end :language: C++ Here, the row index range is :math:`[-1, N_r+1)`, and the column index range is :math:`[-1, N_c+1)`. The first argument to each call to the ``RAJA::View`` constructor is the pointer to the array that holds the View data. The second argument is the ``RAJA::OffsetLayout`` object. ``RAJA::OffsetLayout`` objects allow us to write loops over data arrays using non-zero based indexing and without having to manually compute offsets into the arrays. For more information about RAJA View and Layout types, please see :ref:`feat-view-label`. ^^^^^^^^^^^^^^^^^^^^^^^^^^^ RAJA Kernel Variants ^^^^^^^^^^^^^^^^^^^^^^^^^^^ For the RAJA implementations of the stencil computation, we use two ``RAJA::TypedRangeSegment`` objects to define the row and column iteration spaces for the interior cells: .. literalinclude:: ../../../../exercises/offset-layout-stencil_solution.cpp :start-after: _offsetlayout_ranges_start :end-before: _offsetlayout_ranges_end :language: C++ Now, we have all the ingredients to implement the stencil computation using ``RAJA::kernel``. Here is a sequential CPU variant: .. literalinclude:: ../../../../exercises/offset-layout-stencil_solution.cpp :start-after: _offsetlayout_rajaseq_start :end-before: _offsetlayout_rajaseq_end :language: C++ This RAJA variant does the computation as the C-style variant introduced above. Since the input and output arrays are distinct, the stencil computation is data parallel. Thus, we can use ``RAJA::kernel`` and an appropriate execution policy to run the computation in parallel. Here is an OpenMP collapse variant that maps the row-column product index space to OpenMP threads: .. literalinclude:: ../../../../exercises/offset-layout-stencil_solution.cpp :start-after: _offsetlayout_rajaomp_start :end-before: _offsetlayout_rajaomp_end :language: C++ Note that the lambda expression representing the kernel body is identical to the ``RAJA::kernel`` sequential version. Here are variants for CUDA .. literalinclude:: ../../../../exercises/offset-layout-stencil_solution.cpp :start-after: _offsetlayout_rajacuda_start :end-before: _offsetlayout_rajacuda_end :language: C++ and HIP .. literalinclude:: ../../../../exercises/offset-layout-stencil_solution.cpp :start-after: _offsetlayout_rajahip_start :end-before: _offsetlayout_rajahip_end :language: C++ The only difference between the CPU and GPU variants is that the RAJA macro ``RAJA_DEVICE`` is used to decorate the lambda expression with the ``__device__`` annotation, which is required when capturing a lambda for use in a GPU device environment as we have discussed in other examples in this tutorial. One other point to note is that the CUDA variant in the exercise files uses Unified Memory and the HIP variant uses distinct host and device memory arrays, with explicit host-device data copy operations. Thus, new ``RAJA::View`` objects were created for the HIP variant to wrap the device data pointers used in the HIP kernel. Please see the exercise files for this example for details.