Matrix Transpose

In RAJA::kernel Execution Policies and RAJA::Launch Execution Policies, we presented a simple array initialization kernel using RAJA::kernel and RAJA::launch interfaces, respectively, and compared the two. This section describes the implementation of a matrix transpose kernel using both RAJA::kernel and RAJA::launch interfaces. The intent is to compare and contrast the two, as well as introduce additional features of the interfaces.

There are exercise files RAJA/exercises/kernel-matrix-transpose.cpp and RAJA/exercises/launch-matrix-transpose.cpp for you to work through if you wish to get some practice with RAJA. The files RAJA/exercises/kernel-matrix-transpose_solution.cpp and RAJA/exercises/launch-matrix-transpose_solution.cpp contain complete working code for the examples. You can use the solution files to check your work and for guidance if you get stuck. To build the exercises execute make (kernel/launch)-matrix-transpose and make (kernel/launch)-matrix-transpose_solution from the build directory.

Key RAJA features shown in this example are:

  • RAJA::kernel method and kernel execution policies

  • RAJA::launch method and kernel execution interface

In the example, we compute the transpose of an input matrix \(A\) of size \(N_r \times N_c\) and store the result in a second matrix \(At\) of size \(N_c \times N_r\).

First we define our matrix dimensions

  constexpr int N_r = 56;
  constexpr int N_c = 75;

and wrap the data pointers for the matrices in RAJA::View objects to simplify the multi-dimensional indexing:

  RAJA::View<int, RAJA::Layout<DIM>> Aview(A, N_r, N_c);
  RAJA::View<int, RAJA::Layout<DIM>> Atview(At, N_c, N_r);

Then, a C-style for-loop implementation looks like this:

  for (int row = 0; row < N_r; ++row) {
    for (int col = 0; col < N_c; ++col) {
        Atview(col, row) = Aview(row, col);
    }
  }

RAJA::kernel Implementation

For RAJA::kernel variants, we use RAJA::statement::For and RAJA::statement::Lambda statement types in the execution policies. The complete sequential RAJA::kernel variant is:

  using KERNEL_EXEC_POL = 
    RAJA::KernelPolicy<
      RAJA::statement::For<1, RAJA::seq_exec, 
        RAJA::statement::For<0, RAJA::seq_exec,
          RAJA::statement::Lambda<0>
         >
      >
    >;

  RAJA::kernel<KERNEL_EXEC_POL>( RAJA::make_tuple(col_Range, row_Range),
    [=](int col, int row) {
      Atview(col, row) = Aview(row, col);
  });

A CUDA RAJA::kernel variant for the GPU is similar with different policies in the RAJA::statement::For statements:

  using KERNEL_EXEC_POL_CUDA = 
    RAJA::KernelPolicy<
      RAJA::statement::CudaKernel<
        RAJA::statement::For<1, RAJA::cuda_thread_x_loop,
          RAJA::statement::For<0, RAJA::cuda_thread_y_loop,
                                  RAJA::statement::Lambda<0> 
          >
        >
      >
    >;

  RAJA::kernel<KERNEL_EXEC_POL_CUDA>( RAJA::make_tuple(col_Range, row_Range), 
    [=] RAJA_DEVICE (int col, int row) {
      Atview(col, row) = Aview(row, col);
  });

A notable difference between the CPU and GPU execution policy is the insertion of the RAJA::statement::CudaKernel type in the GPU version, which indicates that the execution will launch a CUDA device kernel.

In the CUDA RAJA::kernel variant above, the thread-block size and and number of blocks to launch is determined by the implementation of the RAJA::kernel execution policy constructs using the sizes of the RAJA::TypedRangeSegment objects in the iteration space tuple.

RAJA::launch Implementation

For RAJA::launch variants, we use RAJA::loop methods to write a loop hierarchy within the kernel execution space. For a sequential implementation, we pass the RAJA::seq_launch_t template parameter to the launch method and pass the RAJA::seq_exec parameter to the loop methods. The complete sequential RAJA::launch variant is:

  using loop_policy_seq = RAJA::LoopPolicy<RAJA::seq_exec>;
  using launch_policy_seq = RAJA::LaunchPolicy<RAJA::seq_launch_t>;

  RAJA::launch<launch_policy_seq>(
    RAJA::LaunchParams(), //LaunchParams may be empty when running on the host
    [=] RAJA_HOST_DEVICE (RAJA::LaunchContext ctx) {

      RAJA::loop<loop_policy_seq>(ctx, row_Range, [&] (int row) {
        RAJA::loop<loop_policy_seq>(ctx, col_Range, [&] (int col) {

            Atview(col, row) = Aview(row, col);

        });
      });

  });

A CUDA RAJA::launch variant for the GPU is similar with CUDA policies in the RAJA::loop methods. The complete RAJA::launch variant is:

  using cuda_thread_x = RAJA::LoopPolicy<RAJA::cuda_thread_x_loop>;
  using cuda_thread_y = RAJA::LoopPolicy<RAJA::cuda_thread_y_loop>;

  const bool async = false; //execute asynchronously
  using launch_policy_cuda = RAJA::LaunchPolicy<RAJA::cuda_launch_t<async>>;

  RAJA::launch<launch_policy_cuda>
    (RAJA::LaunchParams(RAJA::Teams(1), RAJA::Threads(16,16)),
    [=] RAJA_HOST_DEVICE (RAJA::LaunchContext ctx) {

      RAJA::loop<cuda_thread_y>(ctx, row_Range, [&] (int row) {
        RAJA::loop<cuda_thread_x>(ctx, col_Range, [&] (int col) {

            Atview(col, row) = Aview(row, col);

        });
      });

  });

A notable difference between the CPU and GPU RAJA::launch implementations is the definition of the compute grid. For the CPU version, the argument list is empty for the RAJA::LaunchParams constructor. For the CUDA GPU implementation, we define a ‘Team’ of one two-dimensional thread-block with 16 x 16 = 256 threads.