# Tiled Matrix Transpose¶

This section describes the implementation of a tiled matrix transpose kernel
using both `RAJA::kernel`

and `RAJA::launch`

interfaces. The intent
is to compare and contrast the two. The discussion builds on
Matrix Transpose by adding tiling to the matrix transpose
implementation.

There are exercise files
`RAJA/exercises/kernel-matrix-transpose-tiled.cpp`

and
`RAJA/exercises/launch-matrix-transpose-tiled.cpp`

for you to work through
if you wish to get some practice with RAJA. The files
`RAJA/exercises/kernel-matrix-transpose-tiled_solution.cpp`

and
`RAJA/exercises/launch-matrix-transpose-tiled_solution.cpp`

contain
complete working code for the examples. You can use the solution files to
check your work and for guidance if you get stuck. To build
the exercises execute `make (kernel/launch)-matrix-transpose-tiled`

and
`make (kernel/launch)-matrix-transpose-tiled_solution`

from the build directory.

Key RAJA features shown in this example are:

`RAJA::kernel`

method and execution policies, and the`RAJA::statement::Tile`

type`RAJA::launch`

method and execution policies, and the`RAJA::tile`

type

As in Matrix Transpose, we compute the transpose of an input matrix \(A\) of size \(N_r \times N_c\) and storing the result in a second matrix \(At\) of size \(N_c \times N_r\).

We will compute the matrix transpose using a tiling algorithm, which iterates over tiles and transposes the matrix entries in each tile. The algorithm involves outer and inner loops to iterate over the tiles and matrix entries within each tile, respectively.

As in Matrix Transpose, we start by defining the matrix dimensions. Additionally, we define a tile size smaller than the matrix dimensions and determine the number of tiles in each dimension. Note that we do not assume that tiles divide evenly the number of rows and and columns of the matrix. However, we do assume square tiles.

```
constexpr int N_r = 56;
constexpr int N_c = 75;
constexpr int TILE_DIM = 16;
constexpr int outer_Dimc = (N_c - 1) / TILE_DIM + 1;
constexpr int outer_Dimr = (N_r - 1) / TILE_DIM + 1;
```

Then, we wrap the matrix data pointers in `RAJA::View`

objects to
simplify the multi-dimensional indexing:

```
RAJA::View<int, RAJA::Layout<DIM>> Aview(A, N_r, N_c);
RAJA::View<int, RAJA::Layout<DIM>> Atview(At, N_c, N_r);
```

The C-style for-loop implementation looks like this:

```
//
// (0) Outer loops to iterate over tiles
//
for (int by = 0; by < outer_Dimr; ++by) {
for (int bx = 0; bx < outer_Dimc; ++bx) {
//
// (1) Loops to iterate over tile entries
//
for (int ty = 0; ty < TILE_DIM; ++ty) {
for (int tx = 0; tx < TILE_DIM; ++tx) {
int col = bx * TILE_DIM + tx; // Matrix column index
int row = by * TILE_DIM + ty; // Matrix row index
// Bounds check
if (row < N_r && col < N_c) {
Atview(col, row) = Aview(row, col);
}
}
}
}
}
```

Note

To prevent indexing out of bounds, when the tile dimensions do not divide evenly the matrix dimensions, the algorithm requires a bounds check in the inner loops.

`RAJA::kernel`

Variants¶

For `RAJA::kernel`

variants, we use `RAJA::statement::Tile`

types
for the outer loop tiling and `RAJA::tile_fixed`

types to
indicate the tile dimensions. The complete sequential RAJA variant is:

```
using TILED_KERNEL_EXEC_POL =
RAJA::KernelPolicy<
RAJA::statement::Tile<1, RAJA::tile_fixed<TILE_DIM>, RAJA::loop_exec,
RAJA::statement::Tile<0, RAJA::tile_fixed<TILE_DIM>, RAJA::loop_exec,
RAJA::statement::For<1, RAJA::loop_exec,
RAJA::statement::For<0, RAJA::loop_exec,
RAJA::statement::Lambda<0>
>
>
>
>
>;
RAJA::kernel<TILED_KERNEL_EXEC_POL>( RAJA::make_tuple(col_Range, row_Range),
[=](int col, int row) {
Atview(col, row) = Aview(row, col);
});
```

The `RAJA::statement::Tile`

types compute the number of tiles needed to
iterate over all matrix entries in each dimension and generate iteration
index bounds for each tile, which are used to generate loops for the inner
`RAJA::statement::For`

types. Thus, the explicit bounds checking logic in the
C-style variant is not needed. Note that the integer template parameters
in the `RAJA::statement::For`

types refer to the entries in the iteration
space tuple passed to the `RAJA::kernel`

method.

The `RAJA::kernel`

CUDA variant is similar with sequential policies replaced
with CUDA execution policies:

```
using TILED_KERNEL_EXEC_POL_CUDA =
RAJA::KernelPolicy<
RAJA::statement::CudaKernel<
RAJA::statement::Tile<1, RAJA::tile_fixed<TILE_DIM>, RAJA::cuda_block_y_loop,
RAJA::statement::Tile<0, RAJA::tile_fixed<TILE_DIM>, RAJA::cuda_block_x_loop,
RAJA::statement::For<1, RAJA::cuda_thread_x_direct,
RAJA::statement::For<0, RAJA::cuda_thread_y_direct,
RAJA::statement::Lambda<0>
>
>
>
>
>
>;
RAJA::kernel<TILED_KERNEL_EXEC_POL_CUDA>( RAJA::make_tuple(col_Range, row_Range),
[=] RAJA_DEVICE (int col, int row) {
Atview(col, row) = Aview(row, col);
});
```

A notable difference between the CPU and GPU execution policy is the insertion
of the `RAJA::statement::CudaKernel`

type in the GPU version, which indicates
that the execution will launch a CUDA device kernel.

The CUDA thread-block dimensions are set based on the tile dimensions and the
iterates withing each tile are mapped directly to GPU threads in each block
due to the `RAJA::cuda_thread_{x, y}_direct`

policies.

`RAJA::launch`

Variants¶

For `RAJA::launch`

variants, we use `RAJA::tile`

methods
for the outer loop tiling and `RAJA::loop`

methods
to iterate within the tiles. The complete sequential tiled
`RAJA::launch`

variant is:

```
using loop_pol_1 = RAJA::LoopPolicy<RAJA::loop_exec>;
using launch_policy_1 = RAJA::LaunchPolicy<RAJA::seq_launch_t>;
RAJA::launch<launch_policy_1>(RAJA::LaunchParams(), //LaunchParams may be empty when running on the cpu
[=] RAJA_HOST_DEVICE (RAJA::LaunchContext ctx) {
RAJA::tile<loop_pol_1>(ctx, TILE_DIM, row_Range, [&] (RAJA::TypedRangeSegment<int> const &row_tile) {
RAJA::tile<loop_pol_1>(ctx, TILE_DIM, col_Range, [&] (RAJA::TypedRangeSegment<int> const &col_tile) {
RAJA::loop<loop_pol_1>(ctx, row_tile, [&] (int row) {
RAJA::loop<loop_pol_1>(ctx, col_tile, [&] (int col) {
Atview(col, row) = Aview(row, col);
});
});
});
});
});
```

Similar to the `RAJA::statement::Tile`

type in the `RAJA::kernel`

variant
above, the `RAJA::tile`

method computes the number of tiles needed to
iterate over all matrix entries in each dimension and generates a corresponding
iteration space for each tile, which is used to generate loops for the inner
`RAJA::loop`

methods. Thus, the explicit bounds checking logic in the
C-style variant is not needed.

A CUDA `RAJA::launch`

tiled variant for the GPU is similar with
CUDA policies in the `RAJA::loop`

methods. The complete
`RAJA::launch`

variant is:

```
using cuda_teams_y = RAJA::LoopPolicy<RAJA::cuda_block_y_direct>;
using cuda_teams_x = RAJA::LoopPolicy<RAJA::cuda_block_x_direct>;
using cuda_threads_y = RAJA::LoopPolicy<RAJA::cuda_thread_y_direct>;
using cuda_threads_x = RAJA::LoopPolicy<RAJA::cuda_thread_x_direct>;
const bool cuda_async = false;
using cuda_launch_policy = RAJA::LaunchPolicy<RAJA::cuda_launch_t<cuda_async>>;
RAJA::launch<cuda_launch_policy>(
RAJA::LaunchParams(RAJA::Teams(n_blocks_c, n_blocks_r),
RAJA::Threads(c_block_sz, r_block_sz)),
[=] RAJA_HOST_DEVICE (RAJA::LaunchContext ctx) {
RAJA::tile<cuda_teams_y>(ctx, TILE_DIM, row_Range, [&] (RAJA::TypedRangeSegment<int> const &row_tile) {
RAJA::tile<cuda_teams_x>(ctx, TILE_DIM, col_Range, [&] (RAJA::TypedRangeSegment<int> const &col_tile) {
RAJA::loop<cuda_threads_y>(ctx, row_tile, [&] (int row) {
RAJA::loop<cuda_threads_x>(ctx, col_tile, [&] (int col) {
Atview(col, row) = Aview(row, col);
});
});
});
});
});
```

A notable difference between the CPU and GPU `RAJA::launch`

implementations is the definition of the compute grid. For the CPU
version, the argument list is empty for the `RAJA::LaunchParams`

constructor.
For the CUDA GPU implementation, we define a ‘Team’ of one two-dimensional
thread-block with 16 x 16 = 256 threads.