.. ##
.. ## Copyright (c) Lawrence Livermore National Security, LLC and other
.. ## RAJA Project Developers. See top-level LICENSE and COPYRIGHT
.. ## files for dates and other details. No copyright assignment is required
.. ## to contribute to RAJA.
.. ##
.. ## SPDX-License-Identifier: (BSD-3-Clause)
.. ##

.. _tut-halo_exchange-label:

------------------------------------
Workgroup Constructs: Halo Exchange
------------------------------------

The example code discussed in this section can be found in the file
``RAJA/examples/tut_halo-exchange.cpp``. The file contains complete working 
code for multiple OpenMP, CUDA, and HIP RAJA variants. Here, we describe
a subset of these variants.

Key RAJA features shown in this example:

  * ``RAJA::WorkPool`` workgroup construct
  * ``RAJA::WorkGroup`` workgroup construct
  * ``RAJA::WorkSite`` workgroup construct
  * ``RAJA::TypedRangeSegment`` iteration space construct
  * RAJA workgroup policies

In this example, we show how to use the RAJA workgroup constructs to implement
buffer packing and unpacking for data halo exchange on a computational grid,
a common MPI communication operation for distributed memory applications. 
This technique may not provide a performance gain on a CPU system, but it can 
significantly speedup halo exchange on a GPU system compared to running
many individual packing/unpacking kernels, for example.

.. note:: Using an abstraction layer over RAJA can make it easy to switch
          between using individual ``RAJA::forall`` loops or the RAJA workgroup
          constructs to implement halo exchange packing and unpacking at
          compile time or run time.

We start by setting the parameters for the halo exchange by using default
values or values provided via command line input to the example code. These 
parameters determine the size of the grid, the width of the halo, the number 
of grid variables to pack/unpack, and the number of cycles; (iterations
to run).

.. literalinclude:: ../../../../examples/tut_halo-exchange.cpp
   :start-after: _halo_exchange_input_params_start
   :end-before: _halo_exchange_input_params_end
   :language: C++

Next, we allocate the variable data arrays (the memory manager in
the example uses CUDA Unified Memory if CUDA is enabled). These grid variables
are reset each cycle to allow checking the results of the packing and
unpacking.

.. literalinclude:: ../../../../examples/tut_halo-exchange.cpp
   :start-after: _halo_exchange_vars_allocate_start
   :end-before: _halo_exchange_vars_allocate_end
   :language: C++

We also allocate and initialize index lists of the grid elements to pack and
unpack:

.. literalinclude:: ../../../../examples/tut_halo-exchange.cpp
   :start-after: _halo_exchange_index_list_generate_start
   :end-before: _halo_exchange_index_list_generate_end
   :language: C++

All the code examples presented below copy the data packed from the grid 
interior: 

  +---+---+---+---+---+
  | 0 | 0 | 0 | 0 | 0 |
  +---+---+---+---+---+
  | 0 | 1 | 2 | 3 | 0 |
  +---+---+---+---+---+
  | 0 | 4 | 5 | 6 | 0 |
  +---+---+---+---+---+
  | 0 | 7 | 8 | 9 | 0 |
  +---+---+---+---+---+
  | 0 | 0 | 0 | 0 | 0 |
  +---+---+---+---+---+

into the adjacent halo cells:

  +---+---+---+---+---+
  | 1 | 1 | 2 | 3 | 3 |
  +---+---+---+---+---+
  | 1 | 1 | 2 | 3 | 3 |
  +---+---+---+---+---+
  | 4 | 4 | 5 | 6 | 6 |
  +---+---+---+---+---+
  | 7 | 7 | 8 | 9 | 9 |
  +---+---+---+---+---+
  | 7 | 7 | 8 | 9 | 9 |
  +---+---+---+---+---+

Although the example code does not use MPI and multiple domains (one per
MPI rank, for example), as would be the case in a real distributed memory
parallel application, the data copy operations represent the spirit of how
data communication would be done.


^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Packing and Unpacking (Basic Loop Execution)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

A sequential non-RAJA example of data packing and unpacking would look like:

.. literalinclude:: ../../../../examples/tut_halo-exchange.cpp
   :start-after: _halo_exchange_sequential_cstyle_packing_start
   :end-before: _halo_exchange_sequential_cstyle_packing_end
   :language: C++

and:

.. literalinclude:: ../../../../examples/tut_halo-exchange.cpp
   :start-after: _halo_exchange_sequential_cstyle_unpacking_start
   :end-before: _halo_exchange_sequential_cstyle_unpacking_end
   :language: C++


^^^^^^^^^^^^^^^^^^^^^^^^^^
RAJA Variants using forall
^^^^^^^^^^^^^^^^^^^^^^^^^^

A sequential RAJA example uses this execution policy type:

.. literalinclude:: ../../../../examples/tut_halo-exchange.cpp
   :start-after: _halo_exchange_seq_forall_policies_start
   :end-before: _halo_exchange_seq_forall_policies_end
   :language: C++

to pack the grid variable data into a buffer:

.. literalinclude:: ../../../../examples/tut_halo-exchange.cpp
   :start-after: _halo_exchange_seq_forall_packing_start
   :end-before: _halo_exchange_seq_forall_packing_end
   :language: C++

and unpack the buffer data into the grid variable array:

.. literalinclude:: ../../../../examples/tut_halo-exchange.cpp
   :start-after: _halo_exchange_seq_forall_unpacking_start
   :end-before: _halo_exchange_seq_forall_unpacking_end
   :language: C++


For parallel multithreading execution via OpenMP, the example can be run
by replacing the execution policy with:

.. literalinclude:: ../../../../examples/tut_halo-exchange.cpp
   :start-after: _halo_exchange_openmp_forall_policies_start
   :end-before: _halo_exchange_openmp_forall_policies_end
   :language: C++

Similarly, to run the loops in parallel on a CUDA GPU, we would use this 
policy:

.. literalinclude:: ../../../../examples/tut_halo-exchange.cpp
   :start-after: _halo_exchange_cuda_forall_policies_start
   :end-before: _halo_exchange_cuda_forall_policies_end
   :language: C++

Note that we can use an asynchronous execution policy because there are
no data races due to the intermediate buffer usage.


^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RAJA Variants using workgroup constructs
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Using the workgroup constructs in the example requires defining a few more
policies and types:

.. literalinclude:: ../../../../examples/tut_halo-exchange.cpp
   :start-after: _halo_exchange_seq_workgroup_policies_start
   :end-before: _halo_exchange_seq_workgroup_policies_end
   :language: C++

which are used in a slightly rearranged version of packing. See how the comment
indicating where messages are sent has been moved down after the call to
run the operations enqueued on the workgroup:

.. literalinclude:: ../../../../examples/tut_halo-exchange.cpp
   :start-after: _halo_exchange_seq_workgroup_packing_start
   :end-before: _halo_exchange_seq_workgroup_packing_end
   :language: C++

Similarly, in the unpacking we wait to receive all of the messages before
unpacking the data:

.. literalinclude:: ../../../../examples/tut_halo-exchange.cpp
   :start-after: _halo_exchange_seq_workgroup_unpacking_start
   :end-before: _halo_exchange_seq_workgroup_unpacking_end
   :language: C++

This reorganization has the downside of not overlapping the message sends with
packing and the message receives with unpacking.

For parallel multithreading execution via OpenMP, the example using workgroup
can be run by replacing the policies and types with:

.. literalinclude:: ../../../../examples/tut_halo-exchange.cpp
   :start-after: _halo_exchange_openmp_workgroup_policies_start
   :end-before: _halo_exchange_openmp_workgroup_policies_end
   :language: C++

The main differences between these types and the ones defined for the sequential
case above are the ``forall_policy`` and the ``workgroup_policy``, which use
OpenMP execution policy types.

Similarly, to run the loops in parallel on a CUDA GPU use these policies and
types, taking note of the unordered work ordering policy that allows the
enqueued loops to all be run using a single CUDA kernel:

.. literalinclude:: ../../../../examples/tut_halo-exchange.cpp
   :start-after: _halo_exchange_cuda_workgroup_policies_start
   :end-before: _halo_exchange_cuda_workgroup_policies_end
   :language: C++

The main differences between these types and the ones defined for the 
sequential and OpenMP cases above are the ``forall_policy`` and the 
``workgroup_policy``, which use different template parameters, and the
``workpool``, ``workgroup``, and ``worksite`` types which use 'pinned' 
memory allocation.

The packing is the same as the previous workgroup packing examples with the
exception of added synchronization after calling the workgroup run method
and before sending the messages. In the example code, there is a CUDA version
that uses forall to launch ``num_neighbors * num_vars`` CUDA kernels and 
performs ``num_neighbors`` synchronizations to send each message in turn. 
Here, the reorganization to pack all messages before sending lets us use an 
unordered CUDA work ordering policy in the ``workgroup_policy`` that reduces 
the number of CUDA kernel launches to one. It also allows us to need to 
synchronize only once before sending all of the messages:

.. literalinclude:: ../../../../examples/tut_halo-exchange.cpp
   :start-after: _halo_exchange_cuda_workgroup_packing_start
   :end-before: _halo_exchange_cuda_workgroup_packing_end
   :language: C++

After waiting to receive all of the messages we use workgroup constructs with
a CUDA unordered work ordering policy to unpack all of the messages using a
single kernel launch:

.. literalinclude:: ../../../../examples/tut_halo-exchange.cpp
   :start-after: _halo_exchange_cuda_workgroup_unpacking_start
   :end-before: _halo_exchange_cuda_workgroup_unpacking_end
   :language: C++

Note that the synchronization after unpacking is done to ensure that
``group_unpack`` and ``site_unpack`` survive until the unpacking loop has
finished executing.