# Matrix Transpose with Local Array¶

This section extends the discussion in Tiled Matrix Transpose,
where only loop tiling is considered. Here, we combine loop tiling with
`RAJA::LocalArray`

objects which enable us to store data for each tile in
CPU stack-allocated arrays or GPU thread local and shared memory to be used
within kernels. For more information about `RAJA::LocalArray`

, please
see Local Array.

Key RAJA features shown in this example include:

`RAJA::kernel_param`

method with multiple lambda expressions`RAJA::statement::Tile`

type`RAJA::statement::ForICount`

type`RAJA::LocalArray`

- Specifying lambda arguments through statements

As in Tiled Matrix Transpose, this example computes the transpose of an input matrix \(A\) of size \(N_r \times N_c\) and stores the result in a second matrix \(At\) of size \(N_c \times N_r\). The operation uses a local memory tiling algorithm. The algorithm tiles the outer loops and iterates over tiles in inner loops. The algorithm first loads input matrix entries into a local two-dimensional array for a tile, and then reads from the tile swapping the row and column indices to generate the output matrix.

We start with a non-RAJA C++ implementation to show the algorithm pattern. We choose tile dimensions smaller than the dimensions of the matrix and note that it is not necessary for the tile dimensions to divide evenly the number of rows and columns in the matrix A. As in the Tiled Matrix Transpose example, we start by defining the number of rows and columns in the matrices, the tile dimensions, and the number of tiles.

```
const int N_r = 267;
const int N_c = 251;
const int TILE_DIM = 16;
const int outer_Dimc = (N_c - 1) / TILE_DIM + 1;
const int outer_Dimr = (N_r - 1) / TILE_DIM + 1;
```

We also use RAJA View objects to simplify the multi-dimensional indexing as in the Tiled Matrix Transpose example.

```
RAJA::View<int, RAJA::Layout<DIM>> Aview(A, N_r, N_c);
RAJA::View<int, RAJA::Layout<DIM>> Atview(At, N_c, N_r);
```

The complete sequential C++ implementation of the tiled transpose operation using a stack-allocated local array for the tiles is:

```
//
// (0) Outer loops to iterate over tiles
//
for (int by = 0; by < outer_Dimr; ++by) {
for (int bx = 0; bx < outer_Dimc; ++bx) {
// Stack-allocated local array for data on a tile
int Tile[TILE_DIM][TILE_DIM];
//
// (1) Inner loops to read input matrix tile data into the array
//
// Note: loops are ordered so that input matrix data access
// is stride-1.
//
for (int ty = 0; ty < TILE_DIM; ++ty) {
for (int tx = 0; tx < TILE_DIM; ++tx) {
int col = bx * TILE_DIM + tx; // Matrix column index
int row = by * TILE_DIM + ty; // Matrix row index
// Bounds check
if (row < N_r && col < N_c) {
Tile[ty][tx] = Aview(row, col);
}
}
}
//
// (2) Inner loops to write array data into output array tile
//
// Note: loop order is swapped from above so that output matrix
// data access is stride-1.
//
for (int tx = 0; tx < TILE_DIM; ++tx) {
for (int ty = 0; ty < TILE_DIM; ++ty) {
int col = bx * TILE_DIM + tx; // Matrix column index
int row = by * TILE_DIM + ty; // Matrix row index
// Bounds check
if (row < N_r && col < N_c) {
Atview(col, row) = Tile[ty][tx];
}
}
}
}
}
```

Note

- To prevent indexing out of bounds, when the tile dimensions do not divide evenly the matrix dimensions, we use a bounds check in the inner loops.
- For efficiency, we order the inner loops so that reading from the input matrix and writing to the output matrix both use stride-1 data access.

## RAJA::kernel Version of Tiled Loops with Local Array¶

RAJA provides mechanisms to tile loops and use *local arrays*
in kernels so that algorithm patterns like we just described can be
implemented with RAJA. A `RAJA::LocalArray`

type specifies an object whose
memory is created inside a kernel using a `RAJA::statement`

type in a RAJA
kernel execution policy. The local array data is only usable within the kernel.
See Local Array for more information.

`RAJA::kernel`

methods also support loop tiling statements which determine
the number of tiles needed to perform an operation based on tile size and
extent of the corresponding iteration space. Moreover, lambda expressions for
the kernel will not be invoked for iterations outside the bounds of an
iteration space when tile dimensions do not divide evenly the size of the
iteration space; thus, no conditional checks on loop bounds are needed
inside inner loops.

For the RAJA version of the matrix transpose kernel above, we define the
type of the `RAJA::LocalArray`

used for matrix entries in a tile and
create an object to represent it:

```
using TILE_MEM =
RAJA::LocalArray<int, RAJA::Perm<0, 1>, RAJA::SizeList<TILE_DIM, TILE_DIM>>;
TILE_MEM Tile_Array;
```

The template parameters that define the type are: array data type, data stride permutation for the array indices (here the identity permutation is given, so the default RAJA conventions apply; i.e., the rightmost array index will be stride-1), and the array dimensions. Next, we compare two RAJA implementations of matrix transpose with RAJA.

The complete RAJA sequential CPU variant with kernel execution policy and kernel is:

```
using SEQ_EXEC_POL_I =
RAJA::KernelPolicy<
RAJA::statement::Tile<1, RAJA::tile_fixed<TILE_DIM>, RAJA::loop_exec,
RAJA::statement::Tile<0, RAJA::tile_fixed<TILE_DIM>, RAJA::loop_exec,
RAJA::statement::InitLocalMem<RAJA::cpu_tile_mem, RAJA::ParamList<2>,
RAJA::statement::ForICount<1, RAJA::statement::Param<0>, RAJA::loop_exec,
RAJA::statement::ForICount<0, RAJA::statement::Param<1>, RAJA::loop_exec,
RAJA::statement::Lambda<0>
>
>,
RAJA::statement::ForICount<0, RAJA::statement::Param<1>, RAJA::loop_exec,
RAJA::statement::ForICount<1, RAJA::statement::Param<0>, RAJA::loop_exec,
RAJA::statement::Lambda<1>
>
>
>
>
>
>;
RAJA::kernel_param<SEQ_EXEC_POL_I>( RAJA::make_tuple(RAJA::RangeSegment(0, N_c),
RAJA::RangeSegment(0, N_r)),
RAJA::make_tuple((int)0, (int)0, Tile_Array),
[=](int col, int row, int tx, int ty, TILE_MEM &Tile_Array) {
Tile_Array(ty, tx) = Aview(row, col);
},
[=](int col, int row, int tx, int ty, TILE_MEM &Tile_Array) {
Atview(col, row) = Tile_Array(ty, tx);
});
```

The `RAJA::statement::Tile`

types in the execution policy define
tiling of the outer ‘row’ (iteration space tuple index ‘1’) and ‘col’
(iteration space tuple index ‘0’) loops, including tile sizes
(`RAJA::tile_fixed`

types) and loop execution policies. Next,
the `RAJA::statement::InitLocalMem`

type initializes the local stack array
based on the memory policy type (here, we use `RAJA::cpu_tile_mem`

for
a CPU stack-allocated array). The `RAJA::ParamList<2>`

parameter indicates
that the local array object is associated with position ‘2’ in the parameter
tuple argument passed to the `RAJA::kernel_param`

method. The first two
entries in the parameter tuple indicate storage for the local tile indices
which can be used in multiple lambdas in the kernel. Finally, we have two sets
of nested inner loops for reading the input matrix entries into the local
array and writing them out to the output matrix transpose. The inner bodies of
each of these loop nests are identified by lambda expression arguments
‘0’ and ‘1’, respectively.

Note that the loops over tiles use `RAJA::statement::ForICount`

types
rather than `RAJA::statement::For`

types that we have seen in other
nested loop examples. The `RAJA::statement::ForICount`

type generates
local tile indices that are passed to lambda loop body expressions. As
the reader will observe, there is no local tile index computation
needed in the lambdas for the RAJA version of the kernel as a result. The
first integer template parameter for each `RAJA::statement::ForICount`

type
indicates the item in the iteration space tuple passed to the
`RAJA::kernel_param`

method to which it applies; this is similar to
`RAJA::statement::For`

usage. The second template parameter for each
`RAJA::statement::ForICount`

type indicates the position in the parameter
tuple passed to the `RAJA::kernel_param`

method that will hold the
associated local tile index. The loop execution policy template
argument that follows works the same as in `RAJA::statement::For`

usage.
For more detailed discussion of RAJA loop tiling statement types, please see
Loop Tiling.

Now that we have described the execution policy in some detail, let’s pull
everything together by briefly walking though the call to the
`RAJA::kernel_param`

method. The first argument is a tuple of iteration
spaces that define the iteration ranges for the level in the loop nest.
Again, the first integer parameters given to the `RAJA::statement::Tile`

and
`RAJA::statement::ForICount`

types identify the tuple entry they apply to.
The second argument is a tuple of data parameters that will hold the local
tile indices and `RAJA::LocalArray`

tile memory. The tuple entries are
associated with various statements in the execution policy as we described
earlier. Next, two lambda expression arguments are passed to the
`RAJA::kernel_param`

method for reading and writing the input and output
matrix entries, respectively.

Note that each lambda expression takes five arguments. The first two are
the matrix column and row indices associated with the iteration space tuple.
The next three arguments correspond to the parameter tuple entries. The first
two of these are the local tile indices used to access entries in the
`RAJA::LocalArray`

object memory. The last argument is a reference to the
`RAJA::LocalArray`

object itself.

## RAJA::kernel Version of Tiled Loops with Local Array Specifying Lambda Arguments¶

The second RAJA variant works the same as the one above. The main differences
between the two variants is due to the fact that in this second one, we use
`RAJA::statement::Lambda`

types to indicate which arguments each lambda
takes and in which order. Here is the complete version including
execution policy and kernel:

```
using SEQ_EXEC_POL_II =
RAJA::KernelPolicy<
RAJA::statement::Tile<1, RAJA::tile_fixed<TILE_DIM>, RAJA::loop_exec,
RAJA::statement::Tile<0, RAJA::tile_fixed<TILE_DIM>, RAJA::loop_exec,
RAJA::statement::InitLocalMem<RAJA::cpu_tile_mem, RAJA::ParamList<0>,
RAJA::statement::For<1, RAJA::loop_exec,
RAJA::statement::For<0, RAJA::loop_exec,
RAJA::statement::Lambda<0, Segs<0>, Segs<1>, Offsets<0>, Offsets<1>, Params<0> >
>
>,
RAJA::statement::For<0, RAJA::loop_exec,
RAJA::statement::For<1, RAJA::loop_exec,
RAJA::statement::Lambda<1, Segs<0, 1>, Offsets<0, 1>, Params<0> >
>
>
>
>
>
>;
RAJA::kernel_param<SEQ_EXEC_POL_II>( RAJA::make_tuple(RAJA::RangeSegment(0, N_c),
RAJA::RangeSegment(0, N_r)),
RAJA::make_tuple(Tile_Array),
[=](int col, int row, int tx, int ty, TILE_MEM &Tile_Array) {
Tile_Array(ty, tx) = Aview(row, col);
},
[=](int col, int row, int tx, int ty, TILE_MEM &Tile_Array) {
Atview(col, row) = Tile_Array(ty, tx);
});
```

Here, the two `RAJA::statement::Lambda`

types in the execution policy show
two different ways to specify the segments (`RAJA::Segs`

)
associated with the matrix column and row indices. That is, we can use a
`Segs`

statement for each argument, or include multiple segment ids in one
statement.

Note that we are using `RAJA::statement::For`

types for the inner tile
loops instead of RAJA::statement::ForICount` types used in the first variant.
As a consequence of specifying lambda arguments, there are two main differences.
The local tile indices are properly computed and passed to the lambda
expressions as a result of the `RAJA::Offsets`

types that appear
in the lambda statement types. The `RAJA::statement::Lambda`

type for each
lambda shows the two ways to specify the local tile index args; we can use an
`Offsets`

statement for each argument, or include multiple segment ids in one
statement. Lastly, there is only one entry in the parameter
tuple in this case, the local tile array. The placeholders are not needed.

The file `RAJA/examples/tut_matrix-transpose-local-array.cpp`

contains the
complete working example code for the examples described in this section along
with OpenMP and CUDA variants.