Tiled Matrix Transpose

This section describes the implementation of a tiled matrix transpose kernel using both RAJA::kernel and RAJA::launch interfaces. The intent is to compare and contrast the two. The discussion builds on Matrix Transpose by adding tiling to the matrix transpose implementation.

There are exercise files RAJA/exercises/kernel-matrix-transpose-tiled.cpp and RAJA/exercises/launch-matrix-transpose-tiled.cpp for you to work through if you wish to get some practice with RAJA. The files RAJA/exercises/kernel-matrix-transpose-tiled_solution.cpp and RAJA/exercises/launch-matrix-transpose-tiled_solution.cpp contain complete working code for the examples. You can use the solution files to check your work and for guidance if you get stuck. To build the exercises, execute make (kernel/launch)-matrix-transpose-tiled and make (kernel/launch)-matrix-transpose-tiled_solution from the build directory.

Key RAJA features shown in this example are:

  • RAJA::kernel method and execution policies, and the RAJA::statement::Tile type

  • RAJA::launch method and execution policies, and the RAJA::tile type

As in Matrix Transpose, we compute the transpose of an input matrix \(A\) of size \(N_r \times N_c\) and store the result in a second matrix \(At\) of size \(N_c \times N_r\).

We will compute the matrix transpose using a tiling algorithm, which iterates over tiles and transposes the matrix entries in each tile. The algorithm involves outer and inner loops to iterate over the tiles and matrix entries within each tile, respectively.

As in Matrix Transpose, we start by defining the matrix dimensions. Additionally, we define a tile size smaller than the matrix dimensions and determine the number of tiles in each dimension. Note that we do not assume that the tiles evenly divide the number of rows and columns of the matrix. However, we do assume square tiles.

  constexpr int N_r = 56;
  constexpr int N_c = 75;

  constexpr int TILE_DIM = 16;

  constexpr int outer_Dimc = (N_c - 1) / TILE_DIM + 1;
  constexpr int outer_Dimr = (N_r - 1) / TILE_DIM + 1;
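
The tile counts are computed with a ceiling division. For example, with \(N_c = 75\) and TILE_DIM = 16, outer_Dimc is \((75 - 1)/16 + 1 = 5\), so five 16-wide tiles cover the 75 columns, with the last tile only partially filled.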

Then, we wrap the matrix data pointers in RAJA::View objects to simplify the multi-dimensional indexing:

  constexpr int DIM = 2;  // the matrices are two-dimensional

  RAJA::View<int, RAJA::Layout<DIM>> Aview(A, N_r, N_c);
  RAJA::View<int, RAJA::Layout<DIM>> Atview(At, N_c, N_r);
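
With the default row-major RAJA::Layout, Aview(row, col) resolves to the 1D offset A[row * N_c + col] and Atview(col, row) resolves to At[col * N_r + row], so no manual index arithmetic is needed in the loop bodies.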

The C-style for-loop implementation looks like this:

  //
  // (0) Outer loops to iterate over tiles
  //
  for (int by = 0; by < outer_Dimr; ++by) {
    for (int bx = 0; bx < outer_Dimc; ++bx) {
      //
      // (1) Loops to iterate over tile entries
      //
      for (int ty = 0; ty < TILE_DIM; ++ty) {
        for (int tx = 0; tx < TILE_DIM; ++tx) {

          int col = bx * TILE_DIM + tx;  // Matrix column index
          int row = by * TILE_DIM + ty;  // Matrix row index

          // Bounds check
          if (row < N_r && col < N_c) {
            Atview(col, row) = Aview(row, col);
          }
        }
      }
      
    }
  }

Note

When the tile dimensions do not evenly divide the matrix dimensions, the algorithm requires a bounds check in the inner loops to prevent indexing out of bounds.
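
For example, with \(N_c = 75\) and TILE_DIM = 16, the last column tile spans columns 64 through 79, but only columns 64 through 74 are valid matrix entries.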

RAJA::kernel Variants

For RAJA::kernel variants, we use RAJA::statement::Tile types for the outer loop tiling and RAJA::tile_fixed types to indicate the tile dimensions. The complete sequential RAJA variant is:

  using TILED_KERNEL_EXEC_POL = 
    RAJA::KernelPolicy<
      RAJA::statement::Tile<1, RAJA::tile_fixed<TILE_DIM>, RAJA::seq_exec,
        RAJA::statement::Tile<0, RAJA::tile_fixed<TILE_DIM>, RAJA::seq_exec,
          RAJA::statement::For<1, RAJA::seq_exec, 
            RAJA::statement::For<0, RAJA::seq_exec,
              RAJA::statement::Lambda<0>
            >
          >
        >
      >
    >;

  RAJA::kernel<TILED_KERNEL_EXEC_POL>( RAJA::make_tuple(col_Range, row_Range),
    [=](int col, int row) {
      Atview(col, row) = Aview(row, col);
  });

The RAJA::statement::Tile types compute the number of tiles needed to iterate over all matrix entries in each dimension and generate iteration index bounds for each tile, which are used to generate loops for the inner RAJA::statement::For types. Thus, the explicit bounds checking logic in the C-style variant is not needed. Note that the integer template parameters in the RAJA::statement::For types refer to the entries in the iteration space tuple passed to the RAJA::kernel method.
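
The col_Range and row_Range objects passed to RAJA::kernel are the tuple entries; they are not shown above, but in the exercise files they are defined over the full matrix extents, along these lines:

  // Iteration ranges over the full matrix extents:
  // tuple entry 0 is the column range, entry 1 the row range
  RAJA::TypedRangeSegment<int> col_Range(0, N_c);
  RAJA::TypedRangeSegment<int> row_Range(0, N_r);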

The RAJA::kernel CUDA variant is similar, with the sequential policies replaced by CUDA execution policies:

  using TILED_KERNEL_EXEC_POL_CUDA = 
    RAJA::KernelPolicy<
      RAJA::statement::CudaKernel<
        RAJA::statement::Tile<1, RAJA::tile_fixed<TILE_DIM>, RAJA::cuda_block_y_loop,
          RAJA::statement::Tile<0, RAJA::tile_fixed<TILE_DIM>, RAJA::cuda_block_x_loop,
            RAJA::statement::For<1, RAJA::cuda_thread_x_direct,
              RAJA::statement::For<0, RAJA::cuda_thread_y_direct,
                RAJA::statement::Lambda<0> 
              >
            >
          >
        >
      >
    >;

  RAJA::kernel<TILED_KERNEL_EXEC_POL_CUDA>( RAJA::make_tuple(col_Range, row_Range), 
    [=] RAJA_DEVICE (int col, int row) {
      Atview(col, row) = Aview(row, col);
  });

A notable difference between the CPU and GPU execution policies is the insertion of the RAJA::statement::CudaKernel type in the GPU version, which indicates that the kernel will execute as a CUDA device kernel.

The CUDA thread-block dimensions are set based on the tile dimensions, and the iterates within each tile are mapped directly to GPU threads in each block via the RAJA::cuda_thread_{x, y}_direct policies.
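
Conceptually, this policy reproduces the familiar CUDA tiling index computation. The following comments sketch the mapping it implies; this is an illustration, not RAJA's actual implementation:

  // Illustrative mapping implied by the policy above (comments only, not RAJA code):
  //   col = blockIdx.x * TILE_DIM + threadIdx.y;  // Tile<0> on block x, For<0> on thread y
  //   row = blockIdx.y * TILE_DIM + threadIdx.x;  // Tile<1> on block y, For<1> on thread x
  // Iterates that fall outside a partial tile are masked automatically.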

RAJA::launch Variants

For RAJA::launch variants, we use RAJA::tile methods for the outer loop tiling and RAJA::loop methods to iterate within the tiles. The complete sequential tiled RAJA::launch variant is:

  using loop_pol_1 = RAJA::LoopPolicy<RAJA::seq_exec>;
  using launch_policy_1 = RAJA::LaunchPolicy<RAJA::seq_launch_t>;

  RAJA::launch<launch_policy_1>(RAJA::LaunchParams(), // LaunchParams may be empty when running on the CPU
    [=] RAJA_HOST_DEVICE (RAJA::LaunchContext ctx) {

      RAJA::tile<loop_pol_1>(ctx, TILE_DIM, row_Range, [&] (RAJA::TypedRangeSegment<int> const &row_tile) {

        RAJA::tile<loop_pol_1>(ctx, TILE_DIM, col_Range, [&] (RAJA::TypedRangeSegment<int> const &col_tile) {

          RAJA::loop<loop_pol_1>(ctx, row_tile, [&] (int row) {
            RAJA::loop<loop_pol_1>(ctx, col_tile, [&] (int col) {

              Atview(col, row) = Aview(row, col);

            });
          });

        });
      });

  });

Similar to the RAJA::statement::Tile type in the RAJA::kernel variant above, the RAJA::tile method computes the number of tiles needed to iterate over all matrix entries in each dimension and generates a corresponding iteration space for each tile, which is used to generate loops for the inner RAJA::loop methods. Thus, the explicit bounds checking logic in the C-style variant is not needed.
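
Conceptually, each RAJA::tile call plays the role of one outer loop in the C-style version, handing its lambda a segment that is trimmed for a partial final tile. A minimal sketch of that behavior (not RAJA's actual implementation):

  // Sketch: what one sequential RAJA::tile over row_Range does, conceptually
  for (int t = 0; t < (N_r - 1) / TILE_DIM + 1; ++t) {
    const int begin = t * TILE_DIM;
    const int end   = (begin + TILE_DIM < N_r) ? begin + TILE_DIM : N_r;  // trim last tile
    RAJA::TypedRangeSegment<int> row_tile(begin, end);
    // ... the tile lambda runs here; the inner RAJA::loop iterates over [begin, end) only
  }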

The CUDA RAJA::launch tiled variant for the GPU is similar, with CUDA policies used in the RAJA::tile and RAJA::loop methods. The complete RAJA::launch variant is:

  using cuda_teams_y = RAJA::LoopPolicy<RAJA::cuda_block_y_direct>;
  using cuda_teams_x = RAJA::LoopPolicy<RAJA::cuda_block_x_direct>;

  using cuda_threads_y = RAJA::LoopPolicy<RAJA::cuda_thread_y_direct>;
  using cuda_threads_x = RAJA::LoopPolicy<RAJA::cuda_thread_x_direct>;

  const bool cuda_async = false;
  using cuda_launch_policy = RAJA::LaunchPolicy<RAJA::cuda_launch_t<cuda_async>>;

  RAJA::launch<cuda_launch_policy>(
    RAJA::LaunchParams(RAJA::Teams(n_blocks_c, n_blocks_r),
                       RAJA::Threads(c_block_sz, r_block_sz)),
    [=] RAJA_HOST_DEVICE (RAJA::LaunchContext ctx) {

      RAJA::tile<cuda_teams_y>(ctx, TILE_DIM, row_Range, [&] (RAJA::TypedRangeSegment<int> const &row_tile) {

        RAJA::tile<cuda_teams_x>(ctx, TILE_DIM, col_Range, [&] (RAJA::TypedRangeSegment<int> const &col_tile) {

          RAJA::loop<cuda_threads_y>(ctx, row_tile, [&] (int row) {
            RAJA::loop<cuda_threads_x>(ctx, col_tile, [&] (int col) {

              Atview(col, row) = Aview(row, col);

            });
          });

        });
      });

  });

A notable difference between the CPU and GPU RAJA::launch implementations is the definition of the compute grid. For the CPU version, the argument list is empty for the RAJA::LaunchParams constructor. For the CUDA GPU implementation, we define a two-dimensional grid of 'Teams' (thread-blocks), one per tile, each with \(16 \times 16 = 256\) threads.
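
The grid parameters n_blocks_c, n_blocks_r, c_block_sz, and r_block_sz are not shown above; in the exercise files they are set from the tile size, along these lines (RAJA_DIVIDE_CEILING_INT is a RAJA utility macro for ceiling division):

  // Thread-block (team) dimensions match the tile dimensions
  constexpr int c_block_sz = TILE_DIM;
  constexpr int r_block_sz = TILE_DIM;

  // Number of teams in each dimension (ceiling division, as for the tile counts)
  const int n_blocks_c = RAJA_DIVIDE_CEILING_INT(N_c, c_block_sz);
  const int n_blocks_r = RAJA_DIVIDE_CEILING_INT(N_r, r_block_sz);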