Elements of Loop Execution

The RAJA::forall, RAJA::expt::dynamic_forall, RAJA::kernel, and RAJA::launch template methods comprise the RAJA interface for kernel execution. forall methods execute simple, non-nested loops, RAJA::kernel methods support nested loops and other complex loop kernels and transformations, and RAJA::launch creates an execution space in which kernels are written in terms of nested loops using the RAJA::loop method.

Note

The forall, kernel, and launch methods are in the RAJA namespace, while dynamic_forall is in the namespace for experimental RAJA features, RAJA::expt. RAJA::expt::dynamic_forall will be moved to the RAJA namespace in a future RAJA release.

For more information on RAJA execution policies and iteration space constructs, see Policies and Indices, Segments, and IndexSets, respectively.

The following sections describe the basic aspects of these methods. Detailed examples showing how to use RAJA::forall, RAJA::kernel, RAJA::launch methods may be found in the RAJA Tutorial and Examples. Links to specific RAJA tutorial sections are provided in the sections below.

Simple Loops (RAJA::forall)

Consider a C-style loop that adds two vectors:

for (int i = 0; i < N; ++i) {
  c[i] = a[i] + b[i];
}

This may be written using RAJA::forall as:

RAJA::forall<exec_policy>(RAJA::TypedRangeSegment<int>(0, N), [=] (int i) {
  c[i] = a[i] + b[i];
});

A RAJA::forall loop execution method is a template that takes an execution policy type template parameter. A RAJA::forall method takes two arguments: an iteration space object, such as a contiguous range of loop indices as shown here, and a single lambda expression representing the loop kernel body.

Applying different loop execution policies enables the loop to run in different ways; e.g., using different programming model back-ends. Different iteration space objects enable the loop iterates to be partitioned, reordered, run in different threads, etc. Please see Indices, Segments, and IndexSets for details about RAJA iteration spaces.

Note

Changing loop execution policy types and iteration space constructs enables loops to run in different ways simply by recompiling the code, without modifying the loop kernel code.

As an extension of RAJA::forall, the RAJA::expt::dynamic_forall method enables users to compile using a list of execution policies and choose the execution policy at run-time. For example, a user may want to have N policies available and at run-time choose which policy to use:

using exec_pol_list = camp::list<pol_0,
                                 ...
                                 pol_N>;

int pol = i; // run-time value

RAJA::expt::dynamic_forall<exec_pol_list>(pol, RAJA::TypedRangeSegment<int>(0, N), [=] (int i) {
  c[i] = a[i] + b[i];
});

While static loop execution using forall methods is a subset of RAJA::kernel functionality, described next, we maintain the forall interfaces for simple loop execution because the syntax is simpler and less verbose for that use case.

Note

Data arrays in lambda expressions used with RAJA are typically RAJA Views (see View and Layout) or bare pointers as shown in the code snippets above. Using something like ‘std::vector’ is non-portable (won’t work in GPU kernels, generally) and would add excessive overhead for copying data into the lambda data environment when captured by value.

Please see the following tutorial sections for detailed examples that use RAJA::forall:

Complex Loops (RAJA::kernel)

A RAJA::kernel template provides ways to compose and execute arbitrary loop nests and other complex kernels. The RAJA::kernel interface employs similar concepts to RAJA::forall but extends it to support much more complex kernel structures. Each RAJA::kernel method is a template that takes an execution policy type template parameter. The execution policy can be an arbitrarily complex sequence of nested templates that define a kernel execution pattern. In its simplest form, RAJA::kernel takes two arguments: a tuple of iteration space objects, and a lambda expression representing the kernel inner loop body. In more complex usage, RAJA::kernel can take multiple lambda expressions representing different portions of the loop kernel body.

To introduce the RAJA kernel interface, consider a (N+1)-level C-style loop nest:

for (int iN = 0; iN < NN; ++iN) {
  ...
     for (int i0 = 0; i0 < N0; ++i0) {
       // inner loop body
     }
}

It is important to note that we do not recommend writing a RAJA version of this by nesting RAJA::forall statements. For example:

RAJA::forall<exec_policyN>(IN, [=] (int iN) {
  ...
     RAJA::forall<exec_policy0>(I0, [=] (int i0) {
       // inner loop body
     });
  ...
});

This would work for some execution policy choices, but not in general. Also, this approach treats each loop level as an independent entity, which makes it difficult to parallelize the levels in the loop nest together. So it may limit the amount of parallelism that can be exposed and the types of parallelism that may be used. For example, if an OpenMP or CUDA parallel execution policy is used on the outermost loop, then all inner loops would be run sequentially in each thread. It also makes it difficult to perform transformations like loop interchange and loop collapse without changing the source code, which breaks RAJA encapsulation.

Note

We do not recommend using nested RAJA::forall statements.

The RAJA::kernel interface facilitates parallel execution and compile-time transformation of arbitrary loop nests and other complex loop structures. It treats a complex loop structure as a single entity, which, in many cases, makes it possible to transform a kernel and apply different parallel execution patterns by changing the execution policy type and not the kernel code.

The C-style loop nest above may be written using RAJA::kernel as:

using KERNEL_POL =
  RAJA::KernelPolicy< RAJA::statement::For<N, exec_policyN,
                        ...
                          RAJA::statement::For<0, exec_policy0,
                            RAJA::statement::Lambda<0>
                          >
                        ...
                      >
                    >;

RAJA::kernel< KERNEL_POL >(
  RAJA::make_tuple(RAJA::TypedRangeSegment<int>(0, NN),
                   ...,
                   RAJA::TypedRangeSegment<int>(0, N0)),

  [=] (int iN, ... , int i0) {
     // inner loop body
  }

);

In the case we discuss here, the execution policy contains a nested sequence of RAJA::statement::For types, indicating an iteration over each level in the loop nest. Each of these statement types takes three template parameters:

  • an integral index parameter that binds the statement to the item in the iteration space tuple corresponding to that index

  • an execution policy type for the associated loop nest level

  • an enclosed statement list (described in RAJA Kernel Execution Policies).

Note

The nesting of RAJA::statement::For types is analogous to the nesting of for-statements in the C-style version of the loop nest. One can think of the ‘<, >’ symbols enclosing the template parameter lists as being similar to the curly braces in C-style code.

Here, the innermost type in the kernel policy is a RAJA::statement::Lambda<0> type indicating that the first lambda expression (argument zero of a sequence of lambdas passed to the RAJA::kernel method) will comprise the inner loop body. We only have one lambda in this example but, in general, we can have any number of lambdas and we can use any subset of them, with RAJA::statement::Lambda types placed appropriately in the execution policy, to construct a loop kernel. For example, placing RAJA::statement::Lambda types between RAJA::statement::For statements enables non-perfectly nested loops.

RAJA offers two types of RAJA::statement::Lambda statements. The simplest form, shown above, requires that each lambda expression passed to a RAJA::kernel method must take an index argument for each iteration space. With this type of lambda statement, the entire iteration space must be active in a surrounding For construct. A compile time static_assert will be triggered if any of the arguments are undefined, indicating that something is not correct.

A second RAJA::statement::Lambda type, which is an extension of the first, takes additional template parameters which specify which iteration spaces are passed as lambda arguments. The result is that a kernel lambda only needs to accept iteration space index arguments that are used in the lambda body.

The kernel policy list with lambda arguments may be written as:

using KERNEL_POL =
  RAJA::KernelPolicy< RAJA::statement::For<N, exec_policyN,
                        ...
                          RAJA::statement::For<0, exec_policy0,
                            RAJA::statement::Lambda<0, RAJA::Segs<N,...,0>>
                          >
                        ...
                      >
                    >;

The template parameter RAJA::Segs is used to specify the indices of the elements in the segment tuple that are passed as arguments to the lambda, and in which argument order. Here, we pass all segment indices, so the lambda kernel body definition could be identical to the one passed in the previous RAJA::kernel version. RAJA offers other types, such as RAJA::Offsets and RAJA::Params, to identify offsets and parameters in segment and parameter tuples that may be passed to RAJA::kernel methods. See Matrix Multiplication: RAJA::kernel for an example.

Note

Unless lambda arguments are specified in RAJA lambda statements, the loop index arguments for each lambda expression used in a RAJA kernel loop body must match the contents of the iteration space tuple in number, order, and type. Not all index arguments must be used in a lambda, but they all must appear in the lambda argument list and all must be in active loops to be well-formed. In particular, your code will not compile if this is not done correctly. If an argument is unused in a lambda expression, you may include its type and omit its name in the argument list to avoid compiler warnings just as one would do for a regular C++ method with unused arguments.

For RAJA nested loops implemented with RAJA::kernel, as shown here, the loop nest ordering is determined by the order of the nested policies, starting with the outermost loop and ending with the innermost loop.

Note

The integer value that appears as the first parameter in each RAJA::statement::For template indicates which iteration space tuple entry or lambda index argument it corresponds to. This allows loop nesting order to be changed simply by changing the ordering of the nested policy statements. This is analogous to changing the order of ‘for-loop’ statements in C-style nested loop code.

Note

In general, RAJA execution policies for RAJA::forall and RAJA::kernel are different. A summary of all RAJA execution policies that may be used with RAJA::forall or RAJA::kernel may be found in Policies.

A discussion of how to construct RAJA::KernelPolicy types and available RAJA::statement types can be found in RAJA Kernel Execution Policies.

Please see the following tutorial sections for detailed examples that use RAJA::kernel:

Hierarchical loops (RAJA::launch)

The RAJA::launch template is an alternative interface to RAJA::kernel that may be preferred for certain types of complex kernels or based on coding style preferences.

RAJA::launch optionally allows either host or device execution to be chosen at run time. The method takes an execution policy type that defines the execution environment inside a lambda expression for a kernel to be run on a host, device, or either. Kernel algorithms are written inside the main lambda expression using RAJA::loop methods.

The RAJA::launch framework aims to unify thread/block based programming models such as CUDA/HIP/SYCL while maintaining portability on host back-ends (OpenMP, sequential). As we showed earlier, when using the RAJA::kernel interface, developers express all aspects of nested loop execution in an execution policy type on which the RAJA::kernel method is templated. In contrast, the RAJA::launch interface allows users to express nested loop execution in a manner that more closely reflects how one would write conventional nested C-style for-loop code. For example, here is a RAJA::launch kernel that copies values into a shared memory array:

RAJA::launch<launch_policy>(select_CPU_or_GPU,
  RAJA::LaunchParams(RAJA::Teams(NE), RAJA::Threads(Q1D)),
  [=] RAJA_HOST_DEVICE (RAJA::LaunchContext ctx) {

    RAJA::loop<team_x>(ctx, RAJA::TypedRangeSegment<int>(0, teamRange), [&] (int bx) {

      RAJA_TEAM_SHARED double s_A[SHARE_MEM_SIZE];

      RAJA::loop<thread_x>(ctx, RAJA::TypedRangeSegment<int>(0, threadRange), [&] (int tx) {
        s_A[tx] = tx;
      });

      ctx.teamSync();

    });

});

The idea underlying RAJA::launch is to enable developers to express hierarchical parallelism in terms of teams and threads. Similar to the CUDA programming model, development is done using a collection of threads, and threads are grouped into teams. Using the RAJA::loop methods iterations of the loop may be executed by threads or teams depending on the execution policy type. The launch context serves to synchronize threads within the same team. The RAJA::launch interface has three main concepts:

  • RAJA::launch template. This creates an execution environment in which a kernel implementation is written using nested RAJA::loop statements. The launch policy template parameter used with the RAJA::launch method enables specification of both a host and device execution environment, which enables run time selection of kernel execution.

  • RAJA::LaunchParams type. This type takes a number of teams, threads per team, and optionally the size of dynamic shared memory in bytes.

  • RAJA::loop template. These are used to define hierarchical parallel execution of a kernel. Operations within a loop are mapped to either teams or threads based on the execution policy template parameter provided.

Team shared memory can be allocated by using the RAJA_TEAM_SHARED macro on statically sized arrays or via dynamic allocation in the RAJA::LaunchParams method. Team shared memory enables threads in a given team to have shared access to a shared memory buffer. Loops are then assigned to either teams or threads based on the GPU execution policy. Under the CUDA/HIP nomenclature, teams correspond to blocks while, in SYCL nomenclature, teams correspond to workgroups.

In a host execution environment, the team and thread parameters in the RAJA::LaunchParams struct have no effect on execution and may be omitted if a kernel will only run on the host.

Please see the following tutorial sections for detailed examples that use RAJA::launch:

Multi-dimensional loops using simple loop APIs (RAJA::CombiningAdapter)

A RAJA::CombiningAdapter object provides ways to run perfectly nested loops with simple loop APIs like RAJA::forall and those described in WorkGroup. To introduce the RAJA::CombiningAdapter interface, consider a (N+1)-level C-style loop nest:

for (int iN = 0; iN < NN; ++iN) {
  ...
     for (int i0 = 0; i0 < N0; ++i0) {
       // inner loop body
     }
}

We can use a RAJA::CombiningAdapter to combine the iteration spaces of the loops and pass the adapter to a RAJA::forall statement to execute them:

auto adapter = RAJA::make_CombiningAdapter(
    [=] (int iN, ..., int i0) {
      // inner loop body
    }, IN, ..., I0);

RAJA::forall<exec_policy>(adapter.getRange(), adapter);

A RAJA::CombiningAdapter object is a template combining a loop body and iteration spaces. The RAJA::make_CombiningAdapter template method takes a lambda expression for the loop body and an arbitrary number of index arguments. It provides a flattened iteration space via the getRange method that can be passed as the iteration space to the RAJA::forall method, for example. The object's call operator converts the flat one-dimensional index into the multi-dimensional index space and calls the provided lambda with the appropriate indices.

Note

CombiningAdapter currently only supports RAJA::TypedRangeSegment segments.