Tiled Matrix Transpose¶
Key RAJA features shown in this example are:
RAJA::kernel
usage with multiple lambdasRAJA::statement::Tile
type
In this example, we compute the transpose of an input matrix \(A\) of size \(N_r \times N_c\) and store the result in a second matrix \(At\) of size \(N_c \times N_r\).
We compute the matrix transpose using a tiling algorithm, which iterates over tiles of the matrix A and performs a transpose copy of a tile without storing the tile in another array. The algorithm is expressed as a collection of outer and inner loops. Iterations of the inner loop will transpose each tile, while outer loops iterate over the tiles.
We start with a non-RAJA C++ implementation, where we choose tile dimensions smaller than the matrix dimensions. Note that we do not assume that tiles divide evenly the number of rows and and columns of the matrix. However, we do assume square tiles. First, we define matrix dimensions:
const int N_r = 56;
const int N_c = 75;
const int TILE_DIM = 16;
const int outer_Dimc = (N_c - 1) / TILE_DIM + 1;
const int outer_Dimr = (N_r - 1) / TILE_DIM + 1;
Then, we wrap the matrix data pointers in RAJA::View
objects to
simplify the multi-dimensional indexing:
RAJA::View<int, RAJA::Layout<DIM>> Aview(A, N_r, N_c);
RAJA::View<int, RAJA::Layout<DIM>> Atview(At, N_c, N_r);
Then, the non-RAJA C++ implementation looks like this:
//
// (0) Outer loops to iterate over tiles
//
for (int by = 0; by < outer_Dimr; ++by) {
for (int bx = 0; bx < outer_Dimc; ++bx) {
//
// (1) Loops to iterate over tile entries
//
for (int ty = 0; ty < TILE_DIM; ++ty) {
for (int tx = 0; tx < TILE_DIM; ++tx) {
int col = bx * TILE_DIM + tx; // Matrix column index
int row = by * TILE_DIM + ty; // Matrix row index
// Bounds check
if (row < N_r && col < N_c) {
Atview(col, row) = Aview(row, col);
}
}
}
}
}
Note that we need to include a bounds check in the code to avoid indexing out of bounds when the tile sizes do not divide the matrix dimensions evenly.
RAJA::kernel Variants¶
For RAJA::kernel
variants, we use RAJA::statement::Tile
types
for the outer loop tiling and RAJA::tile_fixed
types to
indicate the tile dimensions. The complete sequential RAJA variant is:
using KERNEL_EXEC_POL =
RAJA::KernelPolicy<
RAJA::statement::Tile<1, RAJA::tile_fixed<TILE_DIM>, RAJA::seq_exec,
RAJA::statement::Tile<0, RAJA::tile_fixed<TILE_DIM>, RAJA::seq_exec,
RAJA::statement::For<1, RAJA::seq_exec,
RAJA::statement::For<0, RAJA::seq_exec,
RAJA::statement::Lambda<0>
>
>
>
>
>;
RAJA::kernel<KERNEL_EXEC_POL>( RAJA::make_tuple(col_Range, row_Range),
[=](int col, int row) {
Atview(col, row) = Aview(row, col);
});
The RAJA::statement::Tile
types compute the number of tiles needed to
iterate over all matrix entries in each dimension and generate iteration
index bounds for each tile, which are used to generate loops for the inner
RAJA::statement::For
types. Thus, the bounds checking logic in the
non-RAJA variant is not needed. Note that the integer template parameters
to these statement types refer to the entries in the iteration space tuple
passed to the RAJA::kernel
method.
The file RAJA/examples/tut_tiled-matrix-transpose.cpp
contains the complete working example code for the examples described in this section, including
OpenMP, CUDA, and HIP variants.
A more advanced version using RAJA local arrays for CPU cache blocking and using GPU shared memory is discussed in Matrix Transpose with Local Array.