.. ##
.. ## Copyright (c) Lawrence Livermore National Security, LLC and other
.. ## RAJA Project Developers. See top-level LICENSE and COPYRIGHT
.. ## files for dates and other details. No copyright assignment is required
.. ## to contribute to RAJA.
.. ##
.. ## SPDX-License-Identifier: (BSD-3-Clause)
.. ##

.. _tut-matrixtranspose-label:

----------------------
Matrix Transpose
----------------------

In :ref:`tut-kernelexecpols-label` and :ref:`tut-launchexecpols-label`,
we presented a simple array initialization kernel using ``RAJA::kernel`` and 
``RAJA::launch`` interfaces, respectively, and compared the two. This 
section describes the implementation of a matrix transpose kernel using both 
``RAJA::kernel`` and ``RAJA::launch`` interfaces. The intent is to 
compare and contrast the two, as well as introduce additional features of the 
interfaces.

There are exercise files 
``RAJA/exercises/kernel-matrix-transpose.cpp`` and
``RAJA/exercises/launch-matrix-transpose.cpp`` for you to work through if you 
wish to get some practice with RAJA. The files 
``RAJA/exercises/kernel-matrix-transpose_solution.cpp`` and
``RAJA/exercises/launch-matrix-transpose_solution.cpp`` contain
complete working code for the examples. You can use the solution files to 
check your work and for guidance if you get stuck. To build
the exercises execute ``make (kernel/launch)-matrix-transpose`` and ``make (kernel/launch)-matrix-transpose_solution``
from the build directory.

Key RAJA features shown in this example are:

  * ``RAJA::kernel`` method and kernel execution policies
  * ``RAJA::launch`` method and kernel execution interface

In the example, we compute the transpose of an input matrix
:math:`A` of size :math:`N_r \times N_c` and store the result in a second
matrix :math:`At` of size :math:`N_c \times N_r`.

First we define our matrix dimensions

.. literalinclude:: ../../../../exercises/kernel-matrix-transpose_solution.cpp
   :start-after: // _mattranspose_dims_start
   :end-before: // _mattranspose_dims_end
   :language: C++

and wrap the data pointers for the matrices in ``RAJA::View`` objects to
simplify the multi-dimensional indexing:

.. literalinclude:: ../../../../exercises/kernel-matrix-transpose_solution.cpp
   :start-after: // _mattranspose_views_start
   :end-before: // _mattranspose_views_end
   :language: C++

Then, a C-style for-loop implementation looks like this:

.. literalinclude:: ../../../../exercises/kernel-matrix-transpose_solution.cpp
   :start-after: // _cstyle_mattranspose_start
   :end-before: // _cstyle_mattranspose_end
   :language: C++

^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
``RAJA::kernel`` Implementation
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

For ``RAJA::kernel`` variants, we use ``RAJA::statement::For`` and 
``RAJA::statement::Lambda`` statement types in the execution policies.
The complete sequential ``RAJA::kernel`` variant is:

.. literalinclude:: ../../../../exercises/kernel-matrix-transpose_solution.cpp
   :start-after: // _raja_mattranspose_start
   :end-before: // _raja_mattranspose_end
   :language: C++

A CUDA ``RAJA::kernel`` variant for the GPU is similar with different policies
in the ``RAJA::statement::For`` statements:

.. literalinclude:: ../../../../exercises/kernel-matrix-transpose_solution.cpp
   :start-after: // _raja_mattranspose_cuda_start
   :end-before: // _raja_mattranspose_cuda_end
   :language: C++

A notable difference between the CPU and GPU execution policy is the insertion 
of the ``RAJA::statement::CudaKernel`` type in the GPU version, which indicates
that the execution will launch a CUDA device kernel.

In the CUDA ``RAJA::kernel`` variant above, the thread-block size and
and number of blocks to launch is determined by the implementation of the 
``RAJA::kernel`` execution policy constructs using the sizes of the 
``RAJA::TypedRangeSegment`` objects in the iteration space tuple.

^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
``RAJA::launch`` Implementation
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

For ``RAJA::launch`` variants, we use ``RAJA::loop`` methods 
to write a loop hierarchy within the kernel execution space. For a sequential 
implementation, we pass the ``RAJA::seq_launch_t`` template parameter
to the launch method and pass the ``RAJA::seq_exec`` parameter to the loop 
methods. The complete sequential ``RAJA::launch`` variant is:

.. literalinclude:: ../../../../exercises/launch-matrix-transpose_solution.cpp
   :start-after: // _raja_mattranspose_start
   :end-before: // _raja_mattranspose_end
   :language: C++

A CUDA ``RAJA::launch`` variant for the GPU is similar with CUDA
policies in the ``RAJA::loop`` methods. The complete 
``RAJA::launch`` variant is:

.. literalinclude:: ../../../../exercises/launch-matrix-transpose_solution.cpp
   :start-after: // _raja_mattranspose_cuda_start
   :end-before: // _raja_mattranspose_cuda_end
   :language: C++

A notable difference between the CPU and GPU ``RAJA::launch`` 
implementations is the definition of the compute grid. For the CPU
version, the argument list is empty for the ``RAJA::LaunchParams`` constructor.
For the CUDA GPU implementation, we define a 'Team' of one two-dimensional 
thread-block with 16 x 16 = 256 threads.