RAJA::Launch Execution Policies
This section contains an exercise file RAJA/exercises/launchintro-execpols.cpp for you to work through if you wish to get some practice with RAJA. The file RAJA/exercises/launchintro-execpols_solution.cpp contains complete working code for the examples discussed in this section. You can use the solution file to check your work and for guidance if you get stuck. To build the exercises, execute make launchintro-execpols and make launchintro-execpols_solution from the build directory.
Key RAJA features shown in this section are:
RAJA::launch kernel execution environment template
RAJA::loop loop execution template and execution policies
The examples in this section illustrate how to construct nested loop kernels inside an RAJA::launch execution environment. In particular, the goal is for you to gain an understanding of how to use execution policies with nested RAJA::loop method calls to perform various nested loop execution patterns. All examples use the same simple kernel, which is a three-level loop nest to initialize the entries in an array. The kernels perform the same operations as the examples in RAJA::kernel Execution Policies. By comparing the two sets of examples, you will gain an understanding of the differences between the RAJA::kernel and the RAJA::launch interfaces.
We begin by defining some constants used throughout the examples and allocating arrays to represent the array data:
//
// 3D tensor has N^3 entries
//
constexpr int N = 100;
constexpr int N_tot = N * N * N;
constexpr double c = 0.0001;
double* a = memoryManager::allocate<double>(N_tot);
double* a_ref = memoryManager::allocate<double>(N_tot);
Note that we use the ‘memory manager’ routines contained in the exercise directory to simplify the allocation process. In particular, CUDA unified memory is used when CUDA is enabled to simplify accessing the data on the host or device.
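For context, a minimal sketch of what such an allocation routine might look like when CUDA is enabled is shown below. This is an illustration only; the exercise's actual memoryManager may differ in details such as error checking and device-only allocation paths.
#include <cstddef>

// Hypothetical sketch of a unified-memory allocator (not the exercise's
// actual memoryManager implementation).
template <typename T>
T* allocate_unified(std::size_t count)
{
#if defined(RAJA_ENABLE_CUDA)
  T* ptr = nullptr;
  cudaMallocManaged(&ptr, count * sizeof(T));  // accessible from host and device
  return ptr;
#else
  return new T[count];                         // plain host allocation otherwise
#endif
}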
Next, we execute a C-style nested for-loop version of the kernel to initialize the entries in the ‘reference’ array that we will use to compare the results of other variants for correctness:
for (int k = 0; k < N; ++k ) {
  for (int j = 0; j < N; ++j ) {
    for (int i = 0; i < N; ++i ) {
      a_ref[i+N*(j+N*k)] = c * i * j * k;
    }
  }
}
Note that we manually compute the pointer offsets for the (i,j,k) indices.
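For illustration only, the flattened offset used above could be factored into a small helper like the one below; this helper is not part of the exercise code, which simply writes the computation inline.
// Hypothetical helper (not in the exercise): flatten (i, j, k) into the
// 1D offset used for a_ref above, with 'i' as the stride-1 index.
inline int offset3d(int i, int j, int k, int N)
{
  return i + N * (j + N * k);
}

// Usage: a_ref[offset3d(i, j, k, N)] = c * i * j * k;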
To simplify the remaining kernel variants, we introduce a RAJA::View object, which wraps the array data pointer and simplifies the multi-dimensional indexing:
RAJA::View< double, RAJA::Layout<3, int> > aView(a, N, N, N);
Here, ‘aView’ is a three-dimensional View with extent ‘N’ in each coordinate, based on a three-dimensional RAJA::Layout object, where the array entries are accessed using indices of type ‘int’. Please see View and Layout for more information about the View and Layout types that RAJA provides for various indexing patterns and data layouts.
Using the View, the C-style kernel looks like:
for (int k = 0; k < N; ++k ) {
  for (int j = 0; j < N; ++j ) {
    for (int i = 0; i < N; ++i ) {
      aView(i, j, k) = c * i * j * k;
    }
  }
}
Notice how accessing each (i,j,k) entry in the array is more natural, and less error prone, using the View.
The corresponding RAJA sequential version using RAJA::launch is:
using loop_policy_1 = RAJA::LoopPolicy<RAJA::seq_exec>;
using launch_policy_1 = RAJA::LaunchPolicy<RAJA::seq_launch_t>;

RAJA::launch<launch_policy_1>
  (RAJA::LaunchParams(),  // LaunchParams may be empty when running on the host
   [=] RAJA_HOST_DEVICE (RAJA::LaunchContext ctx) {

    RAJA::loop<loop_policy_1>(ctx, RAJA::TypedRangeSegment<int>(0, N), [&] (int k) {
      RAJA::loop<loop_policy_1>(ctx, RAJA::TypedRangeSegment<int>(0, N), [&] (int j) {
        RAJA::loop<loop_policy_1>(ctx, RAJA::TypedRangeSegment<int>(0, N), [&] (int i) {

          aView(i, j, k) = c * i * j * k;

        });
      });
    });

  });
This should be familiar to the reader who has read through the preceding RAJA::Launch Basics section of this tutorial. Since the RAJA::launch method is templated on a host execution policy, the RAJA::LaunchParams object can be defined without arguments; the loop methods are then dispatched as standard C-style for-loops.
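Each variant can be checked against the reference array initialized earlier. A minimal sketch of such a check is shown below; the exercise file provides its own checking routine, so the loop here is only illustrative.
#include <cmath>

// Element-wise comparison of the variant's output against the reference array
// (sketch only; tolerance chosen arbitrarily for illustration).
bool match = true;
for (int idx = 0; idx < N_tot; ++idx) {
  if (std::abs(a[idx] - a_ref[idx]) > 1.0e-12) {
    match = false;
    break;
  }
}
// 'match' is true when the variant reproduces the reference result.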
Suppose we wanted to parallelize the outer ‘k’ loop using OpenMP multithreading. A C-style version of this is:
#pragma omp parallel for
for (int k = 0; k < N; ++k ) {
  for (int j = 0; j < N; ++j ) {
    for (int i = 0; i < N; ++i ) {
      aView(i, j, k) = c * i * j * k;
    }
  }
}
where we have placed the OpenMP directive #pragma omp parallel for before the outer loop of the kernel.
The corresponding RAJA version of the C-style OpenMP variant is:
using omp_policy_2 = RAJA::LoopPolicy<RAJA::omp_for_exec>;
using loop_policy_2 = RAJA::LoopPolicy<RAJA::seq_exec>;
using launch_policy_2 = RAJA::LaunchPolicy<RAJA::omp_launch_t>;

RAJA::launch<launch_policy_2>
  (RAJA::LaunchParams(),  // LaunchParams may be empty when running on the host
   [=] RAJA_HOST_DEVICE (RAJA::LaunchContext ctx) {

    RAJA::loop<omp_policy_2>(ctx, RAJA::TypedRangeSegment<int>(0, N), [&] (int k) {
      RAJA::loop<loop_policy_2>(ctx, RAJA::TypedRangeSegment<int>(0, N), [&] (int j) {
        RAJA::loop<loop_policy_2>(ctx, RAJA::TypedRangeSegment<int>(0, N), [&] (int i) {

          aView(i, j, k) = c * i * j * k;

        });
      });
    });

  });
In the OpenMP version above, the RAJA::launch method is templated on the RAJA::omp_launch_t execution policy, which is used to create an OpenMP parallel region. Loop iterations may then be distributed across threads using RAJA::loop methods templated on the RAJA::omp_for_exec execution policy. As before, the RAJA::LaunchParams object may be initialized without grid dimensions, since execution on the CPU does not require specifying a compute grid.
The first RAJA-based kernel for parallel GPU execution using the RAJA CUDA back-end we introduce is:
using cuda_teams_z_3 = RAJA::LoopPolicy<RAJA::cuda_block_z_direct>;
using cuda_global_thread_y_3 = RAJA::LoopPolicy<RAJA::cuda_global_thread_y>;
using cuda_global_thread_x_3 = RAJA::LoopPolicy<RAJA::cuda_global_thread_x>;

const bool async_3 = false;
using launch_policy_3 = RAJA::LaunchPolicy<RAJA::cuda_launch_t<async_3>>;

RAJA::launch<launch_policy_3>
  (RAJA::LaunchParams(RAJA::Teams(n_blocks_i, n_blocks_j, n_blocks_k),
                      RAJA::Threads(i_block_sz, j_block_sz, k_block_sz)),
   [=] RAJA_HOST_DEVICE (RAJA::LaunchContext ctx) {

    RAJA::loop<cuda_teams_z_3>(ctx, RAJA::TypedRangeSegment<int>(0, N), [&] (int k) {
      RAJA::loop<cuda_global_thread_y_3>(ctx, RAJA::TypedRangeSegment<int>(0, N), [&] (int j) {
        RAJA::loop<cuda_global_thread_x_3>(ctx, RAJA::TypedRangeSegment<int>(0, N), [&] (int i) {

          aView(i, j, k) = c * i * j * k;

        });
      });
    });

  });
where we have defined the CUDA thread-block dimensions as:
constexpr int block_size = 256;
constexpr int i_block_sz = 32;
constexpr int j_block_sz = block_size / i_block_sz;
constexpr int k_block_sz = 1;
const int n_blocks_i = RAJA_DIVIDE_CEILING_INT(N,i_block_sz);
const int n_blocks_j = RAJA_DIVIDE_CEILING_INT(N,j_block_sz);
const int n_blocks_k = RAJA_DIVIDE_CEILING_INT(N,k_block_sz);
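With N = 100, for example, RAJA_DIVIDE_CEILING_INT(100, 32) gives n_blocks_i = 4, RAJA_DIVIDE_CEILING_INT(100, 8) gives n_blocks_j = 13, and RAJA_DIVIDE_CEILING_INT(100, 1) gives n_blocks_k = 100, so the launch uses a 4 x 13 x 100 grid of thread-blocks with 32 x 8 x 1 = 256 threads per block.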
Here, we use the RAJA::cuda_launch_t policy type to indicate that we want a CUDA kernel to be launched. The ‘k’, ‘j’, ‘i’ iteration variables are mapped to CUDA threads and blocks using the CUDA execution policy types RAJA::cuda_block_z_direct, RAJA::cuda_global_thread_y, and RAJA::cuda_global_thread_x, respectively. Thus, we use a two-dimensional CUDA thread-block and a three-dimensional compute grid to map the loop iterations to CUDA threads. In comparison to the RAJA CUDA example in RAJA::kernel Execution Policies, RAJA::loop methods support execution policies that map loop iterations directly to the global thread ID of the compute grid.
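Conceptually, these loop policies correspond to the familiar raw CUDA index calculations, roughly as sketched below (compare with the nested_init kernel shown later in this section):
// Rough CUDA equivalents of the loop policies above (sketch only):
int k = blockIdx.z;                              // cuda_block_z_direct
int j = blockIdx.y * blockDim.y + threadIdx.y;   // cuda_global_thread_y
int i = blockIdx.x * blockDim.x + threadIdx.x;   // cuda_global_thread_x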
Using a combination of RAJA::tile and RAJA::loop methods, we can create a platform-portable loop tiling implementation. Here is a CUDA variant:
using cuda_teams_z_4 = RAJA::LoopPolicy<RAJA::cuda_block_z_direct>;
using cuda_teams_y_4 = RAJA::LoopPolicy<RAJA::cuda_block_y_direct>;
using cuda_teams_x_4 = RAJA::LoopPolicy<RAJA::cuda_block_x_direct>;
using cuda_threads_y_4 = RAJA::LoopPolicy<RAJA::cuda_thread_y_direct>;
using cuda_threads_x_4 = RAJA::LoopPolicy<RAJA::cuda_thread_x_direct>;

const bool async_4 = false;
using launch_policy_4 = RAJA::LaunchPolicy<RAJA::cuda_launch_t<async_4>>;

RAJA::launch<launch_policy_4>
  (RAJA::LaunchParams(RAJA::Teams(n_blocks_i, n_blocks_j, n_blocks_k),
                      RAJA::Threads(i_block_sz, j_block_sz, k_block_sz)),
   [=] RAJA_HOST_DEVICE (RAJA::LaunchContext ctx) {

    RAJA::loop<cuda_teams_z_4>(ctx, RAJA::TypedRangeSegment<int>(0, N), [&] (int k) {

      RAJA::tile<cuda_teams_y_4>
        (ctx, j_block_sz, RAJA::TypedRangeSegment<int>(0, N), [&] (RAJA::TypedRangeSegment<int> const &j_tile) {

        RAJA::tile<cuda_teams_x_4>
          (ctx, i_block_sz, RAJA::TypedRangeSegment<int>(0, N), [&] (RAJA::TypedRangeSegment<int> const &i_tile) {

          RAJA::loop<cuda_threads_y_4>(ctx, j_tile, [&] (int j) {
            RAJA::loop<cuda_threads_x_4>(ctx, i_tile, [&] (int i) {

              aView(i, j, k) = c * i * j * k;

            });
          });

        });
      });

    });

  });
We consider the kernel to be portable, because all of the execution policy types and execution parameters can be replaced by other types and values without changing the kernel code directly.
The RAJA::tile methods are used to partition an iteration space into tiles to be used within a RAJA::loop method. The ‘j_block_sz’ and ‘i_block_sz’ arguments passed to the RAJA::tile methods specify the tile size for each tiled loop. In the case of GPU programming models, we define the tile size to correspond to the number of threads in a given dimension. The tile and loop execution policies are chosen so that CUDA blocks map directly to tiles and CUDA threads map directly to the entries within a tile.
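To make the tiling pattern concrete, here is a rough C-style sketch of what the j/i tile and loop pairs do for a single ‘k’ iteration. This is an illustration only; in the RAJA variant above, tiles additionally map to CUDA blocks and tile entries map to CUDA threads.
// C-style sketch of the tile/loop pattern for one value of k (illustration only).
for (int j0 = 0; j0 < N; j0 += j_block_sz) {      // RAJA::tile over j
  for (int i0 = 0; i0 < N; i0 += i_block_sz) {    // RAJA::tile over i
    int j_end = (j0 + j_block_sz < N) ? j0 + j_block_sz : N;
    int i_end = (i0 + i_block_sz < N) ? i0 + i_block_sz : N;
    for (int j = j0; j < j_end; ++j) {            // RAJA::loop over j_tile
      for (int i = i0; i < i_end; ++i) {          // RAJA::loop over i_tile
        aView(i, j, k) = c * i * j * k;
      }
    }
  }
}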
For context and comparison, here is the same kernel implementation using CUDA directly:
dim3 nthreads_per_block(i_block_sz, j_block_sz, k_block_sz);
static_assert(i_block_sz*j_block_sz*k_block_sz == block_size,
              "Invalid block_size");

dim3 nblocks(static_cast<size_t>(RAJA_DIVIDE_CEILING_INT(N, i_block_sz)),
             static_cast<size_t>(RAJA_DIVIDE_CEILING_INT(N, j_block_sz)),
             static_cast<size_t>(RAJA_DIVIDE_CEILING_INT(N, k_block_sz)));

nested_init<i_block_sz, j_block_sz, k_block_sz>
  <<<nblocks, nthreads_per_block>>>(a, c, N);
cudaErrchk( cudaGetLastError() );
cudaErrchk( cudaDeviceSynchronize() );
The nested_init device kernel used here is:
template< int i_block_size, int j_block_size, int k_block_size >
__launch_bounds__(i_block_size*j_block_size*k_block_size)
__global__ void nested_init(double* a, double c, int N)
{
  int i = blockIdx.x * i_block_size + threadIdx.x;
  int j = blockIdx.y * j_block_size + threadIdx.y;
  int k = blockIdx.z;

  if ( i < N && j < N && k < N ) {
    a[i+N*(j+N*k)] = c * i * j * k;
  }
}
A few differences between the CUDA and RAJA-CUDA versions are worth noting. First, the CUDA version uses the CUDA dim3 construct to express the threads-per-block and number of thread-blocks to use: i.e., the nthreads_per_block and nblocks variable definitions. The RAJA::launch interface takes compute dimensions through a RAJA::LaunchParams object. RAJA provides the macro RAJA_DIVIDE_CEILING_INT to perform the proper integer arithmetic to calculate the number of blocks based on the size of the array and the block size in each dimension. Second, the mapping of thread identifiers to the (i,j,k) indices is explicit in the device kernel. Third, an explicit check of the (i,j,k) values is required in the CUDA implementation to avoid addressing memory out-of-bounds; i.e., if ( i < N && j < N && k < N ).... The RAJA variants set similar definitions internally and mask out indices that would be out-of-bounds. Note that we also inserted additional error checking with static_assert and cudaErrchk, a RAJA macro that prints CUDA device error codes, to catch device errors if any occur.
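A typical error-checking macro of this kind looks roughly like the sketch below. This is a generic illustration with a hypothetical name, not RAJA's actual cudaErrchk definition.
#include <cstdio>

// Generic CUDA error-check sketch (hypothetical CUDA_CHECK; not RAJA's cudaErrchk).
#define CUDA_CHECK(call)                                            \
  do {                                                              \
    cudaError_t err = (call);                                       \
    if (err != cudaSuccess) {                                       \
      std::fprintf(stderr, "CUDA error '%s' at %s:%d\n",            \
                   cudaGetErrorString(err), __FILE__, __LINE__);    \
    }                                                               \
  } while (0)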
Lastly, we show the RAJA HIP variants of the kernel, which are semantically identical to the RAJA CUDA variants. First, the RAJA-HIP global-thread variant:
using hip_teams_z_5 = RAJA::LoopPolicy<RAJA::hip_block_z_direct>;
using hip_global_thread_y_5 = RAJA::LoopPolicy<RAJA::hip_global_thread_y>;
using hip_global_thread_x_5 = RAJA::LoopPolicy<RAJA::hip_global_thread_x>;

const bool async_5 = false;
using launch_policy_5 = RAJA::LaunchPolicy<RAJA::hip_launch_t<async_5>>;

RAJA::launch<launch_policy_5>
  (RAJA::LaunchParams(RAJA::Teams(n_blocks_i, n_blocks_j, n_blocks_k),
                      RAJA::Threads(i_block_sz, j_block_sz, k_block_sz)),
   [=] RAJA_HOST_DEVICE (RAJA::LaunchContext ctx) {

    RAJA::loop<hip_teams_z_5>(ctx, RAJA::TypedRangeSegment<int>(0, N), [&] (int k) {
      RAJA::loop<hip_global_thread_y_5>(ctx, RAJA::TypedRangeSegment<int>(0, N), [&] (int j) {
        RAJA::loop<hip_global_thread_x_5>(ctx, RAJA::TypedRangeSegment<int>(0, N), [&] (int i) {

          d_aView(i, j, k) = c * i * j * k;

        });
      });
    });

  });
and then the RAJA Launch HIP variant with a fixed thread-block size, tiling, and direct thread mapping:
using hip_teams_z_6 = RAJA::LoopPolicy<RAJA::hip_block_z_direct>;
using hip_teams_y_6 = RAJA::LoopPolicy<RAJA::hip_block_y_direct>;
using hip_teams_x_6 = RAJA::LoopPolicy<RAJA::hip_block_x_direct>;
using hip_threads_y_6 = RAJA::LoopPolicy<RAJA::hip_thread_y_direct>;
using hip_threads_x_6 = RAJA::LoopPolicy<RAJA::hip_thread_x_direct>;

const bool async_6 = false;
using launch_policy_6 = RAJA::LaunchPolicy<RAJA::hip_launch_t<async_6>>;

RAJA::launch<launch_policy_6>
  (RAJA::LaunchParams(RAJA::Teams(n_blocks_i, n_blocks_j, n_blocks_k),
                      RAJA::Threads(i_block_sz, j_block_sz, k_block_sz)),
   [=] RAJA_HOST_DEVICE (RAJA::LaunchContext ctx) {

    RAJA::loop<hip_teams_z_6>(ctx, RAJA::TypedRangeSegment<int>(0, N), [&] (int k) {

      RAJA::tile<hip_teams_y_6>
        (ctx, j_block_sz, RAJA::TypedRangeSegment<int>(0, N), [&] (RAJA::TypedRangeSegment<int> const &j_tile) {

        RAJA::tile<hip_teams_x_6>
          (ctx, i_block_sz, RAJA::TypedRangeSegment<int>(0, N), [&] (RAJA::TypedRangeSegment<int> const &i_tile) {

          RAJA::loop<hip_threads_y_6>(ctx, j_tile, [&] (int j) {
            RAJA::loop<hip_threads_x_6>(ctx, i_tile, [&] (int i) {

              d_aView(i, j, k) = c * i * j * k;

            });
          });

        });
      });

    });

  });
The only differences are that the ‘CUDA’ policy type names are replaced with the corresponding ‘HIP’ type names to use the RAJA HIP back-end.
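Since the two sets of GPU variants differ only in these policy type names, one common approach (not shown in the exercise) is to alias the policies once based on the enabled back-end so the kernel body itself never changes. A sketch, using hypothetical ‘gpu_*’ alias names:
// Select GPU policies at compile time so the kernel body is back-end agnostic.
// RAJA_ENABLE_CUDA / RAJA_ENABLE_HIP are RAJA configuration macros; the
// 'gpu_*' aliases are hypothetical names used only for illustration.
#if defined(RAJA_ENABLE_CUDA)
using gpu_launch_t        = RAJA::cuda_launch_t<false>;
using gpu_teams_z         = RAJA::LoopPolicy<RAJA::cuda_block_z_direct>;
using gpu_global_thread_y = RAJA::LoopPolicy<RAJA::cuda_global_thread_y>;
using gpu_global_thread_x = RAJA::LoopPolicy<RAJA::cuda_global_thread_x>;
#elif defined(RAJA_ENABLE_HIP)
using gpu_launch_t        = RAJA::hip_launch_t<false>;
using gpu_teams_z         = RAJA::LoopPolicy<RAJA::hip_block_z_direct>;
using gpu_global_thread_y = RAJA::LoopPolicy<RAJA::hip_global_thread_y>;
using gpu_global_thread_x = RAJA::LoopPolicy<RAJA::hip_global_thread_x>;
#endif

using gpu_launch_policy = RAJA::LaunchPolicy<gpu_launch_t>;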