.. ##
.. ## Copyright (c) Lawrence Livermore National Security, LLC and other
.. ## RAJA Project Developers. See top-level LICENSE and COPYRIGHT
.. ## files for dates and other details. No copyright assignment is required
.. ## to contribute to RAJA.
.. ##
.. ## SPDX-License-Identifier: (BSD-3-Clause)
.. ##

.. _tut-tiledmatrixtranspose-label:

----------------------
Tiled Matrix Transpose
----------------------

This section describes the implementation of a tiled matrix transpose kernel
using both the ``RAJA::kernel`` and ``RAJA::launch`` interfaces. The intent is
to compare and contrast the two. The discussion builds on
:ref:`tut-matrixtranspose-label` by adding tiling to the matrix transpose
implementation.

There are exercise files
``RAJA/exercises/kernel-matrix-transpose-tiled.cpp`` and
``RAJA/exercises/launch-matrix-transpose-tiled.cpp`` for you to work through
if you wish to get some practice with RAJA. The files
``RAJA/exercises/kernel-matrix-transpose-tiled_solution.cpp`` and
``RAJA/exercises/launch-matrix-transpose-tiled_solution.cpp`` contain
complete working code for the examples. You can use the solution files to
check your work and for guidance if you get stuck. To build
the exercises, execute ``make (kernel/launch)-matrix-transpose-tiled`` and
``make (kernel/launch)-matrix-transpose-tiled_solution``
from the build directory.

Key RAJA features shown in this example are:

  * ``RAJA::kernel`` method and execution policies, and the ``RAJA::statement::Tile`` type
  * ``RAJA::launch`` method and execution policies, and the ``RAJA::tile`` type

As in :ref:`tut-matrixtranspose-label`, we compute the transpose of an input
matrix :math:`A` of size :math:`N_r \times N_c` and store the result in a
second matrix :math:`At` of size :math:`N_c \times N_r`.

We will compute the matrix transpose using a tiling algorithm, which iterates
over tiles and transposes the matrix entries in each tile.
The algorithm involves outer and inner loops to iterate over the tiles and
matrix entries within each tile, respectively.

As in :ref:`tut-matrixtranspose-label`, we start by defining the matrix
dimensions. Additionally, we define a tile size smaller than the matrix
dimensions and determine the number of tiles in each dimension. Note that we
do not assume that the tile size evenly divides the number of rows and columns
of the matrix. However, we do assume square tiles.

.. literalinclude:: ../../../../exercises/kernel-matrix-transpose-tiled_solution.cpp
   :start-after: // _tiled_mattranspose_dims_start
   :end-before: // _tiled_mattranspose_dims_end
   :language: C++

Then, we wrap the matrix data pointers in ``RAJA::View`` objects to
simplify the multi-dimensional indexing:

.. literalinclude:: ../../../../exercises/kernel-matrix-transpose-tiled_solution.cpp
   :start-after: // _tiled_mattranspose_views_start
   :end-before: // _tiled_mattranspose_views_end
   :language: C++

The C-style for-loop implementation looks like this:

.. literalinclude:: ../../../../exercises/kernel-matrix-transpose-tiled_solution.cpp
   :start-after: // _cstyle_tiled_mattranspose_start
   :end-before: // _cstyle_tiled_mattranspose_end
   :language: C++

.. note:: To prevent indexing out of bounds when the tile dimensions do not
          evenly divide the matrix dimensions, the algorithm requires a
          bounds check in the inner loops.

^^^^^^^^^^^^^^^^^^^^^^^^^^^
``RAJA::kernel`` Variants
^^^^^^^^^^^^^^^^^^^^^^^^^^^

For ``RAJA::kernel`` variants, we use ``RAJA::statement::Tile`` types
for the outer loop tiling and ``RAJA::tile_fixed`` types to
indicate the tile dimensions. The complete sequential RAJA variant is:

.. literalinclude:: ../../../../exercises/kernel-matrix-transpose-tiled_solution.cpp
   :start-after: // _raja_tiled_mattranspose_start
   :end-before: // _raja_tiled_mattranspose_end
   :language: C++

The ``RAJA::statement::Tile`` types compute the number of tiles needed to
iterate over all matrix entries in each dimension and generate iteration
index bounds for each tile, which are used to generate loops for the inner
``RAJA::statement::For`` types. Thus, the explicit bounds checking logic in the
C-style variant is not needed. Note that the integer template parameters
in the ``RAJA::statement::For`` types refer to the entries in the iteration
space tuple passed to the ``RAJA::kernel`` method.

The ``RAJA::kernel`` CUDA variant is similar, with the sequential policies
replaced by CUDA execution policies:

.. literalinclude:: ../../../../exercises/kernel-matrix-transpose-tiled_solution.cpp
   :start-after: // _raja_mattranspose_cuda_start
   :end-before: // _raja_mattranspose_cuda_end
   :language: C++

A notable difference between the CPU and GPU execution policies is the
insertion of the ``RAJA::statement::CudaKernel`` type in the GPU version,
which indicates that the execution will launch a CUDA device kernel.

The CUDA thread-block dimensions are set based on the tile dimensions, and the
iterates within each tile are mapped directly to GPU threads in each block
by the ``RAJA::cuda_thread_{x, y}_direct`` policies.

^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
``RAJA::launch`` Variants
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

For ``RAJA::launch`` variants, we use ``RAJA::tile`` methods
for the outer loop tiling and ``RAJA::loop`` methods
to iterate within the tiles. The complete sequential tiled
``RAJA::launch`` variant is:

.. literalinclude:: ../../../../exercises/launch-matrix-transpose-tiled_solution.cpp
   :start-after: // _raja_tiled_mattranspose_start
   :end-before: // _raja_tiled_mattranspose_end
   :language: C++

Similar to the ``RAJA::statement::Tile`` type in the ``RAJA::kernel`` variant
above, the ``RAJA::tile`` method computes the number of tiles needed to
iterate over all matrix entries in each dimension and generates a
corresponding iteration space for each tile, which is used to generate loops
for the inner ``RAJA::loop`` methods. Thus, the explicit bounds checking
logic in the C-style variant is not needed.

A CUDA ``RAJA::launch`` tiled variant for the GPU is similar, with CUDA
policies used in the ``RAJA::loop`` methods. The complete CUDA
``RAJA::launch`` variant is:

.. literalinclude:: ../../../../exercises/launch-matrix-transpose-tiled_solution.cpp
   :start-after: // _raja_mattranspose_cuda_start
   :end-before: // _raja_mattranspose_cuda_end
   :language: C++

A notable difference between the CPU and GPU ``RAJA::launch``
implementations is the definition of the compute grid. For the CPU
version, the argument list is empty for the ``RAJA::LaunchParams`` constructor.
For the CUDA GPU implementation, we define a 'Team' of one two-dimensional
thread-block with 16 x 16 = 256 threads.
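
For reference, the tiling scheme that the variants in this section express can
be sketched in plain C++ without RAJA. This is a minimal illustration, not the
code in the exercise files: the function name ``tiled_transpose``, the use of
``std::vector<int>`` with row-major storage, and the loop variable names are
all assumptions made for this sketch. It shows the two pieces that the
``RAJA::statement::Tile`` type and ``RAJA::tile`` method take care of: rounding
up the tile count when the tile size does not evenly divide a matrix dimension,
and keeping the inner loops within the matrix bounds.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not part of RAJA): transpose the N_r x N_c row-major
// matrix A into an N_c x N_r row-major matrix, one square tile at a time.
std::vector<int> tiled_transpose(const std::vector<int>& A,
                                 int N_r, int N_c, int tile_dim)
{
  assert(tile_dim > 0);
  assert(A.size() == static_cast<std::size_t>(N_r) * N_c);

  std::vector<int> At(A.size());

  // Round up so that a partial tile covers any leftover rows or columns.
  const int outer_dim_r = (N_r + tile_dim - 1) / tile_dim;
  const int outer_dim_c = (N_c + tile_dim - 1) / tile_dim;

  // Outer loops: iterate over tiles.
  for (int by = 0; by < outer_dim_r; ++by) {
    for (int bx = 0; bx < outer_dim_c; ++bx) {

      // Inner loops: iterate over the entries within one tile.
      for (int ty = 0; ty < tile_dim; ++ty) {
        for (int tx = 0; tx < tile_dim; ++tx) {

          const int row = by * tile_dim + ty;  // row index into A
          const int col = bx * tile_dim + tx;  // column index into A

          // Bounds check: skip entries of a partial tile that fall
          // outside the matrix.
          if (row < N_r && col < N_c) {
            At[col * N_r + row] = A[row * N_c + col];
          }
        }
      }
    }
  }

  return At;
}
```

Calling ``tiled_transpose(A, 2, 3, 2)`` on a 2 x 3 matrix uses one tile in the
row dimension and two tiles in the column dimension; the second column tile is
partial, so the bounds check skips its out-of-range entries.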