Tiled Matrix Transpose
This section describes the implementation of a tiled matrix transpose kernel using both RAJA::kernel and RAJA::launch interfaces. The intent is to compare and contrast the two. The discussion builds on Matrix Transpose by adding tiling to the matrix transpose implementation.
There are exercise files RAJA/exercises/kernel-matrix-transpose-tiled.cpp and RAJA/exercises/launch-matrix-transpose-tiled.cpp for you to work through if you wish to get some practice with RAJA. The files RAJA/exercises/kernel-matrix-transpose-tiled_solution.cpp and RAJA/exercises/launch-matrix-transpose-tiled_solution.cpp contain complete working code for the examples. You can use the solution files to check your work and for guidance if you get stuck. To build the exercises, execute make (kernel/launch)-matrix-transpose-tiled and make (kernel/launch)-matrix-transpose-tiled_solution from the build directory.
Key RAJA features shown in this example are:

- RAJA::kernel method and execution policies, and the RAJA::statement::Tile type
- RAJA::launch method and execution policies, and the RAJA::tile type
As in Matrix Transpose, we compute the transpose of an input matrix \(A\) of size \(N_r \times N_c\) and store the result in a second matrix \(At\) of size \(N_c \times N_r\).
We will compute the matrix transpose using a tiling algorithm, which iterates over tiles and transposes the matrix entries in each tile. The algorithm involves outer and inner loops to iterate over the tiles and matrix entries within each tile, respectively.
As in Matrix Transpose, we start by defining the matrix dimensions. Additionally, we define a tile size smaller than the matrix dimensions and determine the number of tiles in each dimension. Note that we do not assume that the tile dimensions evenly divide the number of rows and columns of the matrix. However, we do assume square tiles.
constexpr int N_r = 56;
constexpr int N_c = 75;
constexpr int TILE_DIM = 16;
constexpr int outer_Dimc = (N_c - 1) / TILE_DIM + 1;
constexpr int outer_Dimr = (N_r - 1) / TILE_DIM + 1;
Then, we wrap the matrix data pointers in RAJA::View objects to simplify the multi-dimensional indexing:
RAJA::View<int, RAJA::Layout<DIM>> Aview(A, N_r, N_c);
RAJA::View<int, RAJA::Layout<DIM>> Atview(At, N_c, N_r);
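The excerpts in this section assume the matrices A and At are already allocated, that DIM is 2 (the rank of the Views), and that iteration spaces col_Range and row_Range, used by the RAJA variants below, are defined over the matrix columns and rows. A minimal setup sketch consistent with those excerpts is shown here; the exercise files use their own memory manager and initialization, so plain new and the fill pattern below are assumptions:

// Setup assumed by the excerpts in this section (sketch; would precede the Views above)
int* A  = new int[N_r * N_c];   // input matrix
int* At = new int[N_c * N_r];   // transpose result

// Fill A with a simple pattern so the transpose is easy to verify (assumption)
for (int row = 0; row < N_r; ++row) {
  for (int col = 0; col < N_c; ++col) {
    A[col + N_c * row] = col;
  }
}

// Iteration spaces over matrix columns and rows used by the RAJA variants below
RAJA::TypedRangeSegment<int> col_Range(0, N_c);
RAJA::TypedRangeSegment<int> row_Range(0, N_r);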
The C-style for-loop implementation looks like this:
//
// (0) Outer loops to iterate over tiles
//
for (int by = 0; by < outer_Dimr; ++by) {
  for (int bx = 0; bx < outer_Dimc; ++bx) {

    //
    // (1) Loops to iterate over tile entries
    //
    for (int ty = 0; ty < TILE_DIM; ++ty) {
      for (int tx = 0; tx < TILE_DIM; ++tx) {

        int col = bx * TILE_DIM + tx;  // Matrix column index
        int row = by * TILE_DIM + ty;  // Matrix row index

        // Bounds check
        if (row < N_r && col < N_c) {
          Atview(col, row) = Aview(row, col);
        }
      }
    }
  }
}
Note
To prevent indexing out of bounds when the tile dimensions do not evenly divide the matrix dimensions, the algorithm requires a bounds check in the inner loops. For example, with N_c = 75 and TILE_DIM = 16, the five column tiles cover 80 columns, so the column indices 75-79 in the last tile must be skipped.
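After any of the variants in this section runs, the result can be checked by comparing At against A entry by entry. The exercise files include a similar check; the standalone version below is only a sketch:

// Verify that At holds the transpose of A (sketch; reporting requires <iostream>)
bool passed = true;
for (int row = 0; row < N_r; ++row) {
  for (int col = 0; col < N_c; ++col) {
    if (Atview(col, row) != Aview(row, col)) {
      passed = false;
    }
  }
}
std::cout << (passed ? "transpose PASS" : "transpose FAIL") << std::endl;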
RAJA::kernel Variants
For RAJA::kernel variants, we use RAJA::statement::Tile types for the outer loop tiling and RAJA::tile_fixed types to indicate the tile dimensions. The complete sequential RAJA variant is:
using TILED_KERNEL_EXEC_POL =
  RAJA::KernelPolicy<
    RAJA::statement::Tile<1, RAJA::tile_fixed<TILE_DIM>, RAJA::seq_exec,
      RAJA::statement::Tile<0, RAJA::tile_fixed<TILE_DIM>, RAJA::seq_exec,
        RAJA::statement::For<1, RAJA::seq_exec,
          RAJA::statement::For<0, RAJA::seq_exec,
            RAJA::statement::Lambda<0>
          >
        >
      >
    >
  >;
RAJA::kernel<TILED_KERNEL_EXEC_POL>( RAJA::make_tuple(col_Range, row_Range),
  [=](int col, int row) {
    Atview(col, row) = Aview(row, col);
});
The RAJA::statement::Tile types compute the number of tiles needed to iterate over all matrix entries in each dimension and generate iteration index bounds for each tile, which are used to generate loops for the inner RAJA::statement::For types. Thus, the explicit bounds checking logic in the C-style variant is not needed. Note that the integer template parameters in the RAJA::statement::For types refer to the entries in the iteration space tuple passed to the RAJA::kernel method.
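Conceptually, the Tile and For statements in this policy expand into a clamped loop nest along the following lines. This is only an illustrative sketch of the iteration pattern, not the code RAJA actually generates, and it shows why no explicit bounds check is needed:

// Illustrative expansion of TILED_KERNEL_EXEC_POL (sketch; std::min requires <algorithm>)
for (int row_tile = 0; row_tile < N_r; row_tile += TILE_DIM) {      // statement::Tile<1>
  for (int col_tile = 0; col_tile < N_c; col_tile += TILE_DIM) {    // statement::Tile<0>
    const int row_end = std::min(row_tile + TILE_DIM, N_r);         // tile clamped to matrix edge
    const int col_end = std::min(col_tile + TILE_DIM, N_c);
    for (int row = row_tile; row < row_end; ++row) {                // statement::For<1>
      for (int col = col_tile; col < col_end; ++col) {              // statement::For<0>
        Atview(col, row) = Aview(row, col);                         // statement::Lambda<0>
      }
    }
  }
}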
The RAJA::kernel CUDA variant is similar, with sequential policies replaced with CUDA execution policies:
using TILED_KERNEL_EXEC_POL_CUDA =
  RAJA::KernelPolicy<
    RAJA::statement::CudaKernel<
      RAJA::statement::Tile<1, RAJA::tile_fixed<TILE_DIM>, RAJA::cuda_block_y_loop,
        RAJA::statement::Tile<0, RAJA::tile_fixed<TILE_DIM>, RAJA::cuda_block_x_loop,
          RAJA::statement::For<1, RAJA::cuda_thread_x_direct,
            RAJA::statement::For<0, RAJA::cuda_thread_y_direct,
              RAJA::statement::Lambda<0>
            >
          >
        >
      >
    >
  >;
RAJA::kernel<TILED_KERNEL_EXEC_POL_CUDA>( RAJA::make_tuple(col_Range, row_Range),
  [=] RAJA_DEVICE (int col, int row) {
    Atview(col, row) = Aview(row, col);
});
A notable difference between the CPU and GPU execution policies is the insertion of the RAJA::statement::CudaKernel type in the GPU version, which indicates that the execution will launch a CUDA device kernel. The CUDA thread-block dimensions are set based on the tile dimensions, and the iterates within each tile are mapped directly to GPU threads in each block due to the RAJA::cuda_thread_{x, y}_direct policies.
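In plain CUDA terms, this policy corresponds to launching one thread-block per tile with one thread per tile entry. The following sketch shows the implied launch dimensions only as an illustration; the exact assignment of rows and columns to x and y follows the block and thread statements in the policy:

// Launch configuration implied by the tiled CUDA policy (illustrative sketch)
dim3 block_dim(TILE_DIM, TILE_DIM);      // 16 x 16 = 256 threads per block, one per tile entry
dim3 grid_dim(outer_Dimc, outer_Dimr);   // one block per tile in each dimension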
RAJA::launch Variants
For RAJA::launch variants, we use RAJA::tile methods for the outer loop tiling and RAJA::loop methods to iterate within the tiles. The complete sequential tiled RAJA::launch variant is:
using loop_pol_1 = RAJA::LoopPolicy<RAJA::seq_exec>;
using launch_policy_1 = RAJA::LaunchPolicy<RAJA::seq_launch_t>;

RAJA::launch<launch_policy_1>(RAJA::LaunchParams(), // LaunchParams may be empty when running on the CPU
  [=] RAJA_HOST_DEVICE (RAJA::LaunchContext ctx) {

    RAJA::tile<loop_pol_1>(ctx, TILE_DIM, row_Range, [&] (RAJA::TypedRangeSegment<int> const &row_tile) {
      RAJA::tile<loop_pol_1>(ctx, TILE_DIM, col_Range, [&] (RAJA::TypedRangeSegment<int> const &col_tile) {

        RAJA::loop<loop_pol_1>(ctx, row_tile, [&] (int row) {
          RAJA::loop<loop_pol_1>(ctx, col_tile, [&] (int col) {

            Atview(col, row) = Aview(row, col);

          });
        });

      });
    });

  });
Similar to the RAJA::statement::Tile type in the RAJA::kernel variant above, the RAJA::tile method computes the number of tiles needed to iterate over all matrix entries in each dimension and generates a corresponding iteration space for each tile, which is used to generate loops for the inner RAJA::loop methods. Thus, the explicit bounds checking logic in the C-style variant is not needed.
A CUDA RAJA::launch tiled variant for the GPU is similar, with CUDA policies used in the RAJA::tile and RAJA::loop methods. The complete RAJA::launch variant is:
using cuda_teams_y = RAJA::LoopPolicy<RAJA::cuda_block_y_direct>;
using cuda_teams_x = RAJA::LoopPolicy<RAJA::cuda_block_x_direct>;

using cuda_threads_y = RAJA::LoopPolicy<RAJA::cuda_thread_y_direct>;
using cuda_threads_x = RAJA::LoopPolicy<RAJA::cuda_thread_x_direct>;

const bool cuda_async = false;
using cuda_launch_policy = RAJA::LaunchPolicy<RAJA::cuda_launch_t<cuda_async>>;

RAJA::launch<cuda_launch_policy>(
  RAJA::LaunchParams(RAJA::Teams(n_blocks_c, n_blocks_r),
                     RAJA::Threads(c_block_sz, r_block_sz)),
  [=] RAJA_HOST_DEVICE (RAJA::LaunchContext ctx) {

    RAJA::tile<cuda_teams_y>(ctx, TILE_DIM, row_Range, [&] (RAJA::TypedRangeSegment<int> const &row_tile) {
      RAJA::tile<cuda_teams_x>(ctx, TILE_DIM, col_Range, [&] (RAJA::TypedRangeSegment<int> const &col_tile) {

        RAJA::loop<cuda_threads_y>(ctx, row_tile, [&] (int row) {
          RAJA::loop<cuda_threads_x>(ctx, col_tile, [&] (int col) {

            Atview(col, row) = Aview(row, col);

          });
        });

      });
    });

  });
A notable difference between the CPU and GPU RAJA::launch implementations is the definition of the compute grid. For the CPU version, the argument list is empty for the RAJA::LaunchParams constructor. For the CUDA GPU implementation, we define a grid of two-dimensional thread-teams, one per tile, each with 16 x 16 = 256 threads.
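The grid parameters n_blocks_c, n_blocks_r, c_block_sz, and r_block_sz are defined in the exercise file. Definitions consistent with the code above would look like the following sketch; the exact expressions in the exercise file may differ:

// Thread-team dimensions match the tile dimensions,
// and the grid contains one team per tile (sketch)
constexpr int c_block_sz = TILE_DIM;                    // threads per team spanning columns
constexpr int r_block_sz = TILE_DIM;                    // threads per team spanning rows
constexpr int n_blocks_c = (N_c - 1) / c_block_sz + 1;  // number of column tiles
constexpr int n_blocks_r = (N_r - 1) / r_block_sz + 1;  // number of row tiles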