Matrix Transpose
In RAJA::kernel Execution Policies and RAJA::Launch Execution Policies, we presented a simple array initialization kernel using the RAJA::kernel and RAJA::launch interfaces, respectively, and compared the two. This section describes the implementation of a matrix transpose kernel using both the RAJA::kernel and RAJA::launch interfaces. The intent is to compare and contrast the two, as well as introduce additional features of the interfaces.
There are exercise files RAJA/exercises/kernel-matrix-transpose.cpp and RAJA/exercises/launch-matrix-transpose.cpp for you to work through if you wish to get some practice with RAJA. The files RAJA/exercises/kernel-matrix-transpose_solution.cpp and RAJA/exercises/launch-matrix-transpose_solution.cpp contain complete working code for the examples. You can use the solution files to check your work and for guidance if you get stuck. To build the exercises, execute make (kernel/launch)-matrix-transpose and make (kernel/launch)-matrix-transpose_solution from the build directory.
Key RAJA features shown in this example are:
- RAJA::kernel method and kernel execution policies
- RAJA::launch method and kernel execution interface
In the example, we compute the transpose of an input matrix \(A\) of size \(N_r \times N_c\) and store the result in a second matrix \(At\) of size \(N_c \times N_r\).
First, we define our matrix dimensions:
constexpr int N_r = 56;
constexpr int N_c = 75;
and wrap the data pointers for the matrices in RAJA::View objects to simplify the multi-dimensional indexing:
RAJA::View<int, RAJA::Layout<DIM>> Aview(A, N_r, N_c);
RAJA::View<int, RAJA::Layout<DIM>> Atview(At, N_c, N_r);
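Here, DIM is the number of View dimensions and A and At point to the underlying matrix storage. These are set up before the snippet above; a minimal sketch of the assumed setup is shown below (the exercise files may allocate the arrays differently):
constexpr int DIM = 2;          // number of View dimensions (rows, columns)
int *A  = new int[N_r * N_c];   // input matrix storage (hypothetical allocation)
int *At = new int[N_c * N_r];   // transposed matrix storage (hypothetical allocation)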
Then, a C-style for-loop implementation looks like this:
for (int row = 0; row < N_r; ++row) {
  for (int col = 0; col < N_c; ++col) {
    Atview(col, row) = Aview(row, col);
  }
}
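The RAJA variants below iterate over RAJA::TypedRangeSegment objects named row_Range and col_Range, which we assume are defined over the matrix dimensions, for example:
RAJA::TypedRangeSegment<int> row_Range(0, N_r);  // iteration space for the rows
RAJA::TypedRangeSegment<int> col_Range(0, N_c);  // iteration space for the columns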
RAJA::kernel Implementation
For the RAJA::kernel variants, we use RAJA::statement::For and RAJA::statement::Lambda statement types in the execution policies. The complete sequential RAJA::kernel variant is:
using KERNEL_EXEC_POL =
  RAJA::KernelPolicy<
    RAJA::statement::For<1, RAJA::seq_exec,    // outer loop over rows (tuple index 1)
      RAJA::statement::For<0, RAJA::seq_exec,  // inner loop over columns (tuple index 0)
        RAJA::statement::Lambda<0>
      >
    >
  >;
RAJA::kernel<KERNEL_EXEC_POL>( RAJA::make_tuple(col_Range, row_Range),
  [=](int col, int row) {
    Atview(col, row) = Aview(row, col);
});
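As a quick sanity check after running the kernel, one could compare the two Views element by element; this is an illustrative sketch, not code from the exercise files:
bool match = true;
for (int row = 0; row < N_r; ++row) {
  for (int col = 0; col < N_c; ++col) {
    if (Atview(col, row) != Aview(row, col)) { match = false; }
  }
}
// match == true indicates the transpose was computed correctly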
A CUDA RAJA::kernel variant for the GPU is similar, with different policies in the RAJA::statement::For statements:
using KERNEL_EXEC_POL_CUDA =
  RAJA::KernelPolicy<
    RAJA::statement::CudaKernel<
      RAJA::statement::For<1, RAJA::cuda_thread_x_loop,    // rows (tuple index 1) mapped to CUDA thread x-dimension
        RAJA::statement::For<0, RAJA::cuda_thread_y_loop,  // columns (tuple index 0) mapped to CUDA thread y-dimension
          RAJA::statement::Lambda<0>
        >
      >
    >
  >;
RAJA::kernel<KERNEL_EXEC_POL_CUDA>( RAJA::make_tuple(col_Range, row_Range),
  [=] RAJA_DEVICE (int col, int row) {
    Atview(col, row) = Aview(row, col);
});
A notable difference between the CPU and GPU execution policies is the insertion of the RAJA::statement::CudaKernel type in the GPU version, which indicates that the execution will launch a CUDA device kernel.
In the CUDA RAJA::kernel variant above, the thread-block size and the number of blocks to launch are determined by the implementation of the RAJA::kernel execution policy constructs, using the sizes of the RAJA::TypedRangeSegment objects in the iteration space tuple.
RAJA::launch Implementation
For the RAJA::launch variants, we use RAJA::loop methods to write a loop hierarchy within the kernel execution space. For a sequential implementation, we pass the RAJA::seq_launch_t template parameter to the launch method and the RAJA::seq_exec parameter to the loop methods, matching the code below. The complete sequential RAJA::launch variant is:
using loop_policy_seq = RAJA::LoopPolicy<RAJA::seq_exec>;
using launch_policy_seq = RAJA::LaunchPolicy<RAJA::seq_launch_t>;
RAJA::launch<launch_policy_seq>(
  RAJA::LaunchParams(), // LaunchParams may be empty when running on the host
  [=] RAJA_HOST_DEVICE (RAJA::LaunchContext ctx) {
    RAJA::loop<loop_policy_seq>(ctx, row_Range, [&] (int row) {
      RAJA::loop<loop_policy_seq>(ctx, col_Range, [&] (int col) {
        Atview(col, row) = Aview(row, col);
      });
    });
  });
A CUDA RAJA::launch variant for the GPU is similar, with CUDA policies in the RAJA::loop methods. The complete RAJA::launch variant is:
using cuda_thread_x = RAJA::LoopPolicy<RAJA::cuda_thread_x_loop>;
using cuda_thread_y = RAJA::LoopPolicy<RAJA::cuda_thread_y_loop>;
const bool async = false; // launch the kernel synchronously (set true for asynchronous execution)
using launch_policy_cuda = RAJA::LaunchPolicy<RAJA::cuda_launch_t<async>>;
RAJA::launch<launch_policy_cuda>(
  RAJA::LaunchParams(RAJA::Teams(1), RAJA::Threads(16,16)),
  [=] RAJA_HOST_DEVICE (RAJA::LaunchContext ctx) {
    RAJA::loop<cuda_thread_y>(ctx, row_Range, [&] (int row) {
      RAJA::loop<cuda_thread_x>(ctx, col_Range, [&] (int col) {
        Atview(col, row) = Aview(row, col);
      });
    });
  });
A notable difference between the CPU and GPU RAJA::launch implementations is the definition of the compute grid. For the CPU version, the argument list is empty for the RAJA::LaunchParams constructor. For the CUDA GPU implementation, we define a 'Team' of one two-dimensional thread-block with 16 x 16 = 256 threads.
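Conceptually, and only as a rough illustration rather than a statement about what the RAJA implementation does internally, the launch parameters above correspond to a raw CUDA launch configuration along these lines:
dim3 blocks(1);        // RAJA::Teams(1): a single thread-block (team)
dim3 threads(16, 16);  // RAJA::Threads(16,16): 256 threads per block
// transpose_kernel<<<blocks, threads>>>(...);  // hypothetical equivalent raw CUDA launch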