Tiled Matrix Transpose

Key RAJA features shown in this example are:

  • RAJA::kernel usage with multiple lambdas
  • RAJA::statement::Tile type

In this example, we compute the transpose of an input matrix \(A\) of size \(N_r \times N_c\) and store the result in a second matrix \(At\) of size \(N_c \times N_r\).

We compute the matrix transpose using a tiling algorithm, which iterates over tiles of the matrix A and performs a transpose copy of a tile without storing the tile in another array. The algorithm is expressed as a collection of outer and inner loops. Iterations of the inner loop will transpose each tile, while outer loops iterate over the tiles.

We start with a non-RAJA C++ implementation, where we choose tile dimensions smaller than the matrix dimensions. Note that we do not assume that tiles divide evenly the number of rows and and columns of the matrix. However, we do assume square tiles. First, we define matrix dimensions:

  const int N_r = 56;
  const int N_c = 75;

  const int TILE_DIM = 16;

  const int outer_Dimc = (N_c - 1) / TILE_DIM + 1;
  const int outer_Dimr = (N_r - 1) / TILE_DIM + 1;

Then, we wrap the matrix data pointers in RAJA::View objects to simplify the multi-dimensional indexing:

  RAJA::View<int, RAJA::Layout<DIM>> Aview(A, N_r, N_c);
  RAJA::View<int, RAJA::Layout<DIM>> Atview(At, N_c, N_r);

Then, the non-RAJA C++ implementation looks like this:

  // (0) Outer loops to iterate over tiles
  for (int by = 0; by < outer_Dimr; ++by) {
    for (int bx = 0; bx < outer_Dimc; ++bx) {
      // (1) Loops to iterate over tile entries
      for (int ty = 0; ty < TILE_DIM; ++ty) {
        for (int tx = 0; tx < TILE_DIM; ++tx) {

          int col = bx * TILE_DIM + tx;  // Matrix column index
          int row = by * TILE_DIM + ty;  // Matrix row index

          // Bounds check
          if (row < N_r && col < N_c) {
            Atview(col, row) = Aview(row, col);

Note that we need to include a bounds check in the code to avoid indexing out of bounds when the tile sizes do not divide the matrix dimensions evenly.

RAJA::kernel Variants

For RAJA::kernel variants, we use RAJA::statement::Tile types for the outer loop tiling and RAJA::tile_fixed types to indicate the tile dimensions. The complete sequential RAJA variant is:

  using KERNEL_EXEC_POL = 
      RAJA::statement::Tile<1, RAJA::tile_fixed<TILE_DIM>, RAJA::seq_exec,
        RAJA::statement::Tile<0, RAJA::tile_fixed<TILE_DIM>, RAJA::seq_exec,
          RAJA::statement::For<1, RAJA::seq_exec, 
            RAJA::statement::For<0, RAJA::seq_exec,

  RAJA::kernel<KERNEL_EXEC_POL>( RAJA::make_tuple(col_Range, row_Range),
    [=](int col, int row) {
      Atview(col, row) = Aview(row, col);

The RAJA::statement::Tile types compute the number of tiles needed to iterate over all matrix entries in each dimension and generate iteration index bounds for each tile, which are used to generate loops for the inner RAJA::statement::For types. Thus, the bounds checking logic in the non-RAJA variant is not needed. Note that the integer template parameters to these statement types refer to the entries in the iteration space tuple passed to the RAJA::kernel method.

The file RAJA/examples/tut_tiled-matrix-transpose.cpp contains the complete working example code for the examples described in this section, including OpenMP, CUDA, and HIP variants.

A more advanced version using RAJA local arrays for CPU cache blocking and using GPU shared memory is discussed in Matrix Transpose with Local Array.