# Matrix Transpose¶

In RAJA::kernel Execution Policies and RAJA::Launch Execution Policies,
we presented a simple array initialization kernel using `RAJA::kernel`

and
`RAJA::launch`

interfaces, respectively, and compared the two. This
section describes the implementation of a matrix transpose kernel using both
`RAJA::kernel`

and `RAJA::launch`

interfaces. The intent is to
compare and contrast the two, as well as introduce additional features of the
interfaces.

There are exercise files
`RAJA/exercises/kernel-matrix-transpose.cpp`

and
`RAJA/exercises/launch-matrix-transpose.cpp`

for you to work through if you
wish to get some practice with RAJA. The files
`RAJA/exercises/kernel-matrix-transpose_solution.cpp`

and
`RAJA/exercises/launch-matrix-transpose_solution.cpp`

contain
complete working code for the examples. You can use the solution files to
check your work and for guidance if you get stuck. To build
the exercises execute `make (kernel/launch)-matrix-transpose`

and `make (kernel/launch)-matrix-transpose_solution`

from the build directory.

Key RAJA features shown in this example are:

`RAJA::kernel`

method and kernel execution policies`RAJA::launch`

method and kernel execution interface

In the example, we compute the transpose of an input matrix \(A\) of size \(N_r \times N_c\) and store the result in a second matrix \(At\) of size \(N_c \times N_r\).

First we define our matrix dimensions

```
constexpr int N_r = 56;
constexpr int N_c = 75;
```

and wrap the data pointers for the matrices in `RAJA::View`

objects to
simplify the multi-dimensional indexing:

```
RAJA::View<int, RAJA::Layout<DIM>> Aview(A, N_r, N_c);
RAJA::View<int, RAJA::Layout<DIM>> Atview(At, N_c, N_r);
```

Then, a C-style for-loop implementation looks like this:

```
for (int row = 0; row < N_r; ++row) {
for (int col = 0; col < N_c; ++col) {
Atview(col, row) = Aview(row, col);
}
}
```

`RAJA::kernel`

Implementation¶

For `RAJA::kernel`

variants, we use `RAJA::statement::For`

and
`RAJA::statement::Lambda`

statement types in the execution policies.
The complete sequential `RAJA::kernel`

variant is:

```
using KERNEL_EXEC_POL =
RAJA::KernelPolicy<
RAJA::statement::For<1, RAJA::loop_exec,
RAJA::statement::For<0, RAJA::loop_exec,
RAJA::statement::Lambda<0>
>
>
>;
RAJA::kernel<KERNEL_EXEC_POL>( RAJA::make_tuple(col_Range, row_Range),
[=](int col, int row) {
Atview(col, row) = Aview(row, col);
});
```

A CUDA `RAJA::kernel`

variant for the GPU is similar with different policies
in the `RAJA::statement::For`

statements:

```
using KERNEL_EXEC_POL_CUDA =
RAJA::KernelPolicy<
RAJA::statement::CudaKernel<
RAJA::statement::For<1, RAJA::cuda_thread_x_loop,
RAJA::statement::For<0, RAJA::cuda_thread_y_loop,
RAJA::statement::Lambda<0>
>
>
>
>;
RAJA::kernel<KERNEL_EXEC_POL_CUDA>( RAJA::make_tuple(col_Range, row_Range),
[=] RAJA_DEVICE (int col, int row) {
Atview(col, row) = Aview(row, col);
});
```

A notable difference between the CPU and GPU execution policy is the insertion
of the `RAJA::statement::CudaKernel`

type in the GPU version, which indicates
that the execution will launch a CUDA device kernel.

In the CUDA `RAJA::kernel`

variant above, the thread-block size and
and number of blocks to launch is determined by the implementation of the
`RAJA::kernel`

execution policy constructs using the sizes of the
`RAJA::TypedRangeSegment`

objects in the iteration space tuple.

`RAJA::launch`

Implementation¶

For `RAJA::launch`

variants, we use `RAJA::loop`

methods
to write a loop hierarchy within the kernel execution space. For a sequential
implementation, we pass the `RAJA::seq_launch_t`

template parameter
to the launch method and pass the `RAJA::loop_exec`

parameter to the loop
methods. The complete sequential `RAJA::launch`

variant is:

```
using loop_policy_seq = RAJA::LoopPolicy<RAJA::loop_exec>;
using launch_policy_seq = RAJA::LaunchPolicy<RAJA::seq_launch_t>;
RAJA::launch<launch_policy_seq>(
RAJA::LaunchParams(), //LaunchParams may be empty when running on the host
[=] RAJA_HOST_DEVICE (RAJA::LaunchContext ctx) {
RAJA::loop<loop_policy_seq>(ctx, row_Range, [&] (int row) {
RAJA::loop<loop_policy_seq>(ctx, col_Range, [&] (int col) {
Atview(col, row) = Aview(row, col);
});
});
});
```

A CUDA `RAJA::launch`

variant for the GPU is similar with CUDA
policies in the `RAJA::loop`

methods. The complete
`RAJA::launch`

variant is:

```
using cuda_thread_x = RAJA::LoopPolicy<RAJA::cuda_thread_x_loop>;
using cuda_thread_y = RAJA::LoopPolicy<RAJA::cuda_thread_y_loop>;
const bool async = false; //execute asynchronously
using launch_policy_cuda = RAJA::LaunchPolicy<RAJA::cuda_launch_t<async>>;
RAJA::launch<launch_policy_cuda>
(RAJA::LaunchParams(RAJA::Teams(1), RAJA::Threads(16,16)),
[=] RAJA_HOST_DEVICE (RAJA::LaunchContext ctx) {
RAJA::loop<cuda_thread_y>(ctx, row_Range, [&] (int row) {
RAJA::loop<cuda_thread_x>(ctx, col_Range, [&] (int col) {
Atview(col, row) = Aview(row, col);
});
});
});
```

A notable difference between the CPU and GPU `RAJA::launch`

implementations is the definition of the compute grid. For the CPU
version, the argument list is empty for the `RAJA::LaunchParams`

constructor.
For the CUDA GPU implementation, we define a ‘Team’ of one two-dimensional
thread-block with 16 x 16 = 256 threads.