Basic RAJA::kernel
Mechanics and Nested Loop Ordering¶
This section contains an exercise file RAJA/exercises/kernelintro-nested-loop-reorder.cpp
for you to work through if you wish to get some practice with RAJA. The
file RAJA/exercises/kernelintro-nested-loop-reorder_solution.cpp
contains
complete working code for the examples discussed in this section. You can use
the solution file to check your work and for guidance if you get stuck. To build
the exercises execute make kernelintro-nested-loop-reorder
and make kernelintro-nested-loop-reorder_solution
from the build directory.
Key RAJA features shown in this section are:
RAJA::kernel
loop iteration templates and execution policiesNested loop reordering
RAJA strongly-types indices
The examples in this section show the nested loop reordering process in more detail. Specifically, we describe how to reorder execution policy statements, which is conceptually analogous to how one would reorder for-loops in a C-style loop nest. We also introduce strongly-typed index variables that can help users write correct nested loop code with RAJA. The examples do not perform any computation; each kernel simply prints out the loop indices in the order that the iteration spaces are traversed. Thus, only sequential execution policies are used to avoid complications resulting from print statements used in parallel programs. The mechanics shown here work the same way for parallel RAJA execution policies.
Before we dive into code, we reiterate important features that
represent the main differences between nested-loop RAJA and the
RAJA::forall
construct for simple, non-nested loop kernels:
An index space (e.g., range segment) and lambda index argument are required for each level in a loop nest. This example contains triply-nested loops, so there will be three ranges and three index arguments.
The index spaces for the nested loop levels are specified in a RAJA tuple object. The order of spaces in the tuple must match the order of index arguments to the lambda for this to be correct in general. RAJA provides strongly-typed indices to help with this, which we show below.
An execution policy is required for each level in a loop nest. These are specified as nested statements in the
RAJA::KernelPolicy
type.The loop nest ordering is specified in the nested kernel policy – the first
statement::For
type identifies the outermost loop, the secondstatement::For
type identifies the loop nested inside the outermost loop, and so on.
We begin by defining three named strongly-typed variables for the loop index variables (i, j, k):
RAJA_INDEX_VALUE_T(KIDX, int, "KIDX");
RAJA_INDEX_VALUE_T(JIDX, int, "JIDX");
RAJA_INDEX_VALUE_T(IIDX, int, "IIDX");
Specifically, the ‘i’ index variable type is IIDX
, the ‘j’ index variable
is JIDX
, and the ‘k’ variable is KIDX
, which are aliases to
int
type.
We also define [min, max) intervals for each loop index:
constexpr int imin = 0;
constexpr int imax = 2;
constexpr int jmin = 1;
constexpr int jmax = 3;
constexpr int kmin = 2;
constexpr int kmax = 4;
and three corresponding typed range segments which bind the ranges to the index variable types via template specialization:
RAJA::TypedRangeSegment<KIDX> KRange(kmin, kmax);
RAJA::TypedRangeSegment<JIDX> JRange(jmin, jmax);
RAJA::TypedRangeSegment<IIDX> IRange(imin, imax);
When these features are used as in this example, the compiler will generate error messages if the lambda expression index argument ordering and types do not match the index ordering in the tuple. This is illustrated at the end of this section.
We begin with a C-style loop nest with ‘i’ in the inner loop, ‘j’ in the middle loop, and ‘k’ in the outer loop, which prints the (i, j, k) triple in the inner loop body:
for (int k = kmin; k < kmax; ++k) {
for (int j = jmin; j < jmax; ++j) {
for (int i = imin; i < imax; ++i) {
printf( " (%d, %d, %d) \n", i, j, k);
}
}
}
The RAJA::kernel
version of this is:
using KJI_EXECPOL = RAJA::KernelPolicy<
RAJA::statement::For<2, RAJA::seq_exec, // k
RAJA::statement::For<1, RAJA::seq_exec, // j
RAJA::statement::For<0, RAJA::seq_exec,// i
RAJA::statement::Lambda<0>
>
>
>
>;
RAJA::kernel<KJI_EXECPOL>( RAJA::make_tuple(IRange, JRange, KRange),
[=] (IIDX i, JIDX j, KIDX k) {
printf( " (%d, %d, %d) \n", (int)(*i), (int)(*j), (int)(*k));
});
The integer template parameters in the RAJA::statement::For
types
represent the lambda expression index argument and the range types in the
iteration space tuple argument to RAJA::kernel
.
Both kernels generate the same output, as expected:
(I, J, K)
---------
(0, 1, 2)
(1, 1, 2)
(0, 2, 2)
(1, 2, 2)
(0, 1, 3)
(1, 1, 3)
(0, 2, 3)
(1, 2, 3)
which you can see by running the exercise code.
Here, the RAJA::kernel
execution template takes two arguments: a tuple of
ranges, one for each of the three levels in the loop nest, and the lambda
expression loop body. Note that the lambda has an index argument for each
range and that their order and types match. This is required for the code to
compile.
Note
RAJA provides mechanisms to explicitly specify which loop variables, for example, and in which order they appear in a lambda expression argument list. Please refer to Complex Loops (RAJA::kernel) for more information.
The execution policy for the loop nest is specified in the
RAJA::KernelPolicy
type. The policy uses two statement types:
RAJA::statement::For
and RAJA::statement::Lambda
.
The RAJA::statement::Lambda
is used to generate code that invokes the
lambda expression. The ‘0’ template parameter refers to the index of the
lambda expression in the RAJA::kernel
argument list following the
iteration space tuple. Since there is only one lambda expression, we reference
it with the ‘0’ identifier. Sometimes more complicated kernels require multiple
lambda expressions, so we need a way to specify where they will appear in the
generated executable code. We show examples of this in the matrix transpose
discussion later in the tutorial.
Each level in the loop nest is identified by a
RAJA::statement::For
type, which identifies the iteration space and
execution policy for the level. Here, each level uses a
sequential execution policy, which is for illustration purposes.
The integer that appears as the first template argument to each
RAJA::statement::For
type corresponds to the index of a range in the tuple
and also to the associated lambda index argument; i.e., ‘0’ for ‘i’,
‘1’ for ‘j’, and ‘2’ for ‘k’.
The integer argument to each RAJA::statement::For
type is needed so
that the levels in the loop nest can be reordered by changing the policy
while the kernel remains the same. To illustrate, we permute the loop nest
ordering so that the ‘j’ loop is the outermost, the ‘i’ loop is in the middle,
and the ‘k’ loop is the innermost with the following policy:
using JIK_EXECPOL = RAJA::KernelPolicy<
RAJA::statement::For<1, RAJA::seq_exec, // j
RAJA::statement::For<0, RAJA::seq_exec, // i
RAJA::statement::For<2, RAJA::seq_exec,// k
RAJA::statement::Lambda<0>
>
>
>
>;
RAJA::kernel<JIK_EXECPOL>( RAJA::make_tuple(IRange, JRange, KRange),
[=] (IIDX i, JIDX j, KIDX k) {
printf( " (%d, %d, %d) \n", (int)(*i), (int)(*j), (int)(*k));
});
This generates the following output:
(I, J, K)
---------
(0, 1, 2)
(0, 1, 3)
(1, 1, 2)
(1, 1, 3)
(0, 2, 2)
(0, 2, 3)
(1, 2, 2)
(1, 2, 3)
which is the same as the corresponding C-style version:
for (int j = jmin; j < jmax; ++j) {
for (int i = imin; i < imax; ++i) {
for (int k = kmin; k < kmax; ++k) {
printf( " (%d, %d, %d) \n", i, j, k);
}
}
}
Note that we have simply reordered the nesting of the RAJA::statement::For
types in the execution policy. This is analogous to reordering the for-loops
in C-style version.
For completeness, we permute the loops again so that the ‘i’ loop is the outermost, the ‘k’ loop is in the middle, and the ‘j’ loop is the innermost with the following policy:
using IKJ_EXECPOL = RAJA::KernelPolicy<
RAJA::statement::For<0, RAJA::seq_exec, // i
RAJA::statement::For<2, RAJA::seq_exec, // k
RAJA::statement::For<1, RAJA::seq_exec,// j
RAJA::statement::Lambda<0>
>
>
>
>;
RAJA::kernel<IKJ_EXECPOL>( RAJA::make_tuple(IRange, JRange, KRange),
[=] (IIDX i, JIDX j, KIDX k) {
printf( " (%d, %d, %d) \n", (int)(*i), (int)(*j), (int)(*k));
});
The analogous C-style loop nest is:
for (int i = imin; i < imax; ++i) {
for (int k = kmin; k < kmax; ++k) {
for (int j = jmin; j < jmax; ++j) {
printf( " (%d, %d, %d) \n", i, j, k);
}
}
}
The output generated by these two kernels is:
(I, J, K)
---------
(0, 1, 2)
(0, 2, 2)
(0, 1, 3)
(0, 2, 3)
(1, 1, 2)
(1, 2, 2)
(1, 1, 3)
(1, 2, 3)
Finally, we show an example that will generate a compilation error because there is a type mismatch in the ordering of the range segments in the tuple and the lambda expression argument list.
RAJA::kernel<IKJ_EXECPOL>( RAJA::make_tuple(IRange, JRange, KRange),
[=] (JIDX i, IIDX j, KIDX k) {
printf( " (%d, %d, %d) \n", (int)(*i), (int)(*j), (int)(*k));
});
Do you see the problem? The last kernel is included in the exercise source file, so you can see what happens when you attempt to compile it.