Matrix Transpose¶
In RAJA::kernel Execution Policies and RAJA::Launch Execution Policies,
we presented a simple array initialization kernel using RAJA::kernel and
RAJA::launch interfaces, respectively, and compared the two. This
section describes the implementation of a matrix transpose kernel using both
RAJA::kernel and RAJA::launch interfaces. The intent is to
compare and contrast the two, as well as introduce additional features of the
interfaces.
There are exercise files
RAJA/exercises/kernel-matrix-transpose.cpp and
RAJA/exercises/launch-matrix-transpose.cpp for you to work through if you
wish to get some practice with RAJA. The files
RAJA/exercises/kernel-matrix-transpose_solution.cpp and
RAJA/exercises/launch-matrix-transpose_solution.cpp contain
complete working code for the examples. You can use the solution files to
check your work and for guidance if you get stuck. To build
the exercises execute make (kernel/launch)-matrix-transpose and make (kernel/launch)-matrix-transpose_solution
from the build directory.
Key RAJA features shown in this example are:
RAJA::kernelmethod and kernel execution policies
RAJA::launchmethod and kernel execution interface
In the example, we compute the transpose of an input matrix \(A\) of size \(N_r \times N_c\) and store the result in a second matrix \(At\) of size \(N_c \times N_r\).
First we define our matrix dimensions
constexpr int N_r = 56;
constexpr int N_c = 75;
and wrap the data pointers for the matrices in RAJA::View objects to
simplify the multi-dimensional indexing:
RAJA::View<int, RAJA::Layout<DIM>> Aview(A, N_r, N_c);
RAJA::View<int, RAJA::Layout<DIM>> Atview(At, N_c, N_r);
Then, a C-style for-loop implementation looks like this:
for (int row = 0; row < N_r; ++row) {
for (int col = 0; col < N_c; ++col) {
Atview(col, row) = Aview(row, col);
}
}
RAJA::kernel Implementation¶
For RAJA::kernel variants, we use RAJA::statement::For and
RAJA::statement::Lambda statement types in the execution policies.
The complete sequential RAJA::kernel variant is:
using KERNEL_EXEC_POL =
RAJA::KernelPolicy<
RAJA::statement::For<1, RAJA::seq_exec,
RAJA::statement::For<0, RAJA::seq_exec,
RAJA::statement::Lambda<0>
>
>
>;
RAJA::kernel<KERNEL_EXEC_POL>( RAJA::make_tuple(col_Range, row_Range),
[=](int col, int row) {
Atview(col, row) = Aview(row, col);
});
A CUDA RAJA::kernel variant for the GPU is similar with different policies
in the RAJA::statement::For statements:
using KERNEL_EXEC_POL_CUDA =
RAJA::KernelPolicy<
RAJA::statement::CudaKernel<
RAJA::statement::For<1, RAJA::cuda_thread_x_loop,
RAJA::statement::For<0, RAJA::cuda_thread_y_loop,
RAJA::statement::Lambda<0>
>
>
>
>;
RAJA::kernel<KERNEL_EXEC_POL_CUDA>( RAJA::make_tuple(col_Range, row_Range),
[=] RAJA_DEVICE (int col, int row) {
Atview(col, row) = Aview(row, col);
});
A notable difference between the CPU and GPU execution policy is the insertion
of the RAJA::statement::CudaKernel type in the GPU version, which indicates
that the execution will launch a CUDA device kernel.
In the CUDA RAJA::kernel variant above, the thread-block size and
and number of blocks to launch is determined by the implementation of the
RAJA::kernel execution policy constructs using the sizes of the
RAJA::TypedRangeSegment objects in the iteration space tuple.
RAJA::launch Implementation¶
For RAJA::launch variants, we use RAJA::loop methods
to write a loop hierarchy within the kernel execution space. For a sequential
implementation, we pass the RAJA::seq_launch_t template parameter
to the launch method and pass the RAJA::seq_exec parameter to the loop
methods. The complete sequential RAJA::launch variant is:
using loop_policy_seq = RAJA::LoopPolicy<RAJA::seq_exec>;
using launch_policy_seq = RAJA::LaunchPolicy<RAJA::seq_launch_t>;
RAJA::launch<launch_policy_seq>(
RAJA::LaunchParams(), //LaunchParams may be empty when running on the host
[=] RAJA_HOST_DEVICE (RAJA::LaunchContext ctx) {
RAJA::loop<loop_policy_seq>(ctx, row_Range, [&] (int row) {
RAJA::loop<loop_policy_seq>(ctx, col_Range, [&] (int col) {
Atview(col, row) = Aview(row, col);
});
});
});
A CUDA RAJA::launch variant for the GPU is similar with CUDA
policies in the RAJA::loop methods. The complete
RAJA::launch variant is:
using cuda_thread_x = RAJA::LoopPolicy<RAJA::cuda_thread_x_loop>;
using cuda_thread_y = RAJA::LoopPolicy<RAJA::cuda_thread_y_loop>;
const bool async = false; //execute asynchronously
using launch_policy_cuda = RAJA::LaunchPolicy<RAJA::cuda_launch_t<async>>;
RAJA::launch<launch_policy_cuda>
(RAJA::LaunchParams(RAJA::Teams(1), RAJA::Threads(16,16)),
[=] RAJA_HOST_DEVICE (RAJA::LaunchContext ctx) {
RAJA::loop<cuda_thread_y>(ctx, row_Range, [&] (int row) {
RAJA::loop<cuda_thread_x>(ctx, col_Range, [&] (int col) {
Atview(col, row) = Aview(row, col);
});
});
});
A notable difference between the CPU and GPU RAJA::launch
implementations is the definition of the compute grid. For the CPU
version, the argument list is empty for the RAJA::LaunchParams constructor.
For the CUDA GPU implementation, we define a ‘Team’ of one two-dimensional
thread-block with 16 x 16 = 256 threads.