RAJA::Launch
BasicsΒΆ
There are no exercise files to work through for this section. Instead, there
is an example source file RAJA/examples/tut_launch_basic.cpp
which
contains complete code examples of the concepts described here.
Key RAJA features shown in the following examples are:
RAJA::launch
method to create a run-time selectable host/device execution space.
RAJA::loop
methods to express algorithms in terms of nested for loops.
In this example, we introduce the RAJA Launch framework and discuss
hierarchical loop-based parallelism. Kernel execution details
with RAJA Launch occur inside the lambda expression
passed to the RAJA::launch
method, which defines an execution
space:
RAJA::launch<launch_policy>(RAJA::ExecPlace ,
RAJA::LaunchParams(RAJA::Teams(Nteams,Nteams),
RAJA::Threads(Nthreads,Nthreads)),
[=] RAJA_HOST_DEVICE (RAJA::LaunchContext ctx) {
/* Kernel code goes here */
});
The RAJA::launch
method accepts a RAJA::LaunchPolicy
template parameter that can be defined using up to two policies
(host and device). For example, the following creates an execution space
for a sequential and CUDA kernel dispatch:
using launch_policy = RAJA::LaunchPolicy
<RAJA::seq_launch_t, RAJA::cuda_launch_t<false>>;
Whether a kernel executes on the host or device is determined by the first
argument passed to the RAJA::launch
method, which is a
RAJA::ExecPlace
enum value, either HOST
or DEVICE
.
Similar to GPU thread and block programming models, RAJA Launch carries out
computation in a predefined compute grid made up of threads which are
then grouped into teams when executing on the device. The execution space is
then enclosed by a host/device lambda which takes a
RAJA::LaunchContext
object, which may be used to control the flow
within the kernel, for example by creating thread-team synchronization points.
Inside the execution space, developers write a kernel using nested
RAJA::loop
methods. The manner in which each loop is executed
is determined by a template parameter type, which
indicates how the corresponding iterates are mapped to the Teams/Threads
configuration defined by the RAJA::LaunchParams
type passed as the second
argument to the RAJA::launch
method. Following the CUDA and HIP
programming models, this defines an hierarchical structure in which outer loops
are executed by thread-teams and inner loops are executed by threads in a team.
RAJA::loop<teams_y>(ctx, RAJA::TypedRangeSegment<int>(0, Nteams), [&] (int by) {
RAJA::loop<teams_x>(ctx, RAJA::TypedRangeSegment<int>(0, Nteams), [&] (int bx) {
RAJA::loop<threads_y>(ctx, RAJA::TypedRangeSegment<int>(0, Nthreads), [&] (int ty) {
RAJA::loop<threads_x>(ctx, RAJA::TypedRangeSegment<int>(0, Nthreads), [&] (int tx) {
printf("RAJA Teams: threadId_x %d threadId_y %d teamId_x %d teamId_y %d \n",
tx, ty, bx, by);
});
});
});
});
The mapping between teams and threads to the underlying programming
model depends on how the RAJA::loop
template parameter types are
defined. For example, we may define host and device mapping strategies as:
using teams_x = RAJA::LoopPolicy< RAJA::seq_exec,
RAJA::cuda_block_x_direct >;
using thread_x = RAJA::LoopPolicy< RAJA::seq_exec,
RAJA::cuda_block_x_direct >;
Here, the RAJA::LoopPolicy
type holds both the host (CPU) and
device (CUDA GPU) loop mapping strategies. On the host, both the team/thread
strategies expand out to standard C-style loops for execution:
for (int by=0; by<Nteams; ++by) {
for (int bx=0; bx<Nteams; ++bx) {
for (int ty=0; ty<Nthreads; ++ty) {
for (int tx=0; tx<Nthreads; ++tx) {
printf("c-iter: iter_tx %d iter_ty %d iter_bx %d iter_by %d \n",
tx, ty, bx, by);
}
}
}
}
On the device the teams_x/y
policies will map loop iterations directly to
CUDA (or HIP) thread blocks, while the thread_x/y
policies will map loop
iterations directly to threads in a CUDA (or HIP) thread block. The direct CUDA
equivalent of the kernel body using the policy shown above is:
{int by = blockIdx.y;
{int bx = blockIdx.x;
{int ty = threadIdx.y;
{int tx = blockIdx.x;
printf("device-iter: threadIdx_tx %d threadIdx_ty %d block_bx %d block_by %d \n",
tx, ty, bx, by);
}
}
}
}