RAJA::Launch Execution Policies

This section contains an exercise file RAJA/exercises/launchintro-execpols.cpp for you to work through if you wish to get some practice with RAJA. The file RAJA/exercises/launchintro-execpols_solution.cpp contains complete working code for the examples discussed in this section. You can use the solution file to check your work and for guidance if you get stuck. To build the exercises execute make launchintro-execpols and make launchintro-execpols_solution from the build directory.

Key RAJA features shown in this section are:

  • RAJA::launch kernel execution environment template

  • RAJA::loop loop execution template and execution policies

The examples in this section illustrate how to construct nested loop kernels inside a RAJA::launch execution environment. In particular, the goal is for you to gain an understanding of how to use execution policies with nested RAJA::loop method calls to perform various nested loop execution patterns. All examples use the same simple kernel, which is a three-level loop nest to initialize the entries in an array. The kernels perform the same operations as the examples in RAJA::kernel Execution Policies. By comparing the two sets of examples, you will see the differences between the RAJA::kernel and RAJA::launch interfaces.

We begin by defining some constants used throughout the examples and allocating two arrays to hold the data:

//
// 3D tensor has N^3 entries
//
  constexpr int N = 100;
  constexpr int N_tot = N * N * N;
  constexpr double c = 0.0001;
  double* a = memoryManager::allocate<double>(N_tot);
  double* a_ref = memoryManager::allocate<double>(N_tot);

Note that we use the ‘memory manager’ routines contained in the exercise directory to simplify the allocation process. In particular, CUDA unified memory is used when CUDA is enabled to simplify accessing the data on the host or device.
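
For illustration, here is a minimal sketch of what such an allocation routine might look like. This is an assumption for illustration only; the memory manager in the exercise directory may differ in detail.

  #include <cstddef>
  #if defined(RAJA_ENABLE_CUDA)
  #include <cuda_runtime.h>
  #endif

  // Hypothetical allocation helper in the spirit of memoryManager::allocate.
  // With CUDA enabled, unified memory yields a single pointer that is valid
  // on both the host and the device.
  template <typename T>
  T* allocate(std::size_t size)
  {
    T* ptr = nullptr;
  #if defined(RAJA_ENABLE_CUDA)
    cudaMallocManaged((void**)&ptr, sizeof(T) * size);  // accessible on host and device
  #else
    ptr = new T[size];                                  // plain host allocation
  #endif
    return ptr;
  }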

Next, we execute a C-style nested for-loop version of the kernel to initialize the entries in the ‘reference’ array that we will use to compare the results of other variants for correctness:

  for (int k = 0; k < N; ++k ) {
    for (int j = 0; j < N; ++j ) {
      for (int i = 0; i < N; ++i ) {
        a_ref[i+N*(j+N*k)] = c * i * j * k ;
      }
    }
  }

Note that we manually compute the pointer offsets for the (i,j,k) indices. For the remaining kernel variants, we introduce a RAJA::View object, which wraps the array data pointer and simplifies the multi-dimensional indexing:

  RAJA::View< double, RAJA::Layout<3, int> > aView(a, N, N, N);

Here ‘aView’ is a three-dimensional View with extent ‘N’ in each coordinate, based on a three-dimensional RAJA::Layout object, where the array entries are accessed using indices of type ‘int’. Please see View and Layout for more information about the View and Layout types that RAJA provides for various indexing patterns and data layouts.

Using the View, the C-style kernel looks like:

  for (int k = 0; k < N; ++k ) {
    for (int j = 0; j < N; ++j ) {
      for (int i = 0; i < N; ++i ) {
        aView(i, j, k) = c * i * j * k ;
      }
    }
  }

Notice how accessing each (i,j,k) entry in the array is more natural, and less error prone, using the View.

The corresponding RAJA sequential version using RAJA::launch is:

  using loop_policy_1 = RAJA::LoopPolicy<RAJA::seq_exec>;
  using launch_policy_1 = RAJA::LaunchPolicy<RAJA::seq_launch_t>;

  RAJA::launch<launch_policy_1>
    (RAJA::LaunchParams(), //LaunchParams may be empty when running on the host
    [=] RAJA_HOST_DEVICE (RAJA::LaunchContext ctx) {

      RAJA::loop<loop_policy_1>(ctx, RAJA::TypedRangeSegment<int>(0, N), [&] (int k) {
         RAJA::loop<loop_policy_1>(ctx, RAJA::TypedRangeSegment<int>(0, N), [&] (int j) {
            RAJA::loop<loop_policy_1>(ctx, RAJA::TypedRangeSegment<int>(0, N), [&] (int i) {

                aView(i, j, k) = c * i * j * k ;

            });
         });
      });
  });

This should be familiar to the reader who has worked through the preceding RAJA::Launch Basics section of this tutorial. Since the RAJA::launch method is templated on a host execution policy, the RAJA::LaunchParams object can be constructed without arguments and the loop methods are dispatched as standard C-style for-loops.

Suppose we wanted to parallelize the outer ‘k’ loop using OpenMP multithreading. A C-style version of this is:

  #pragma omp parallel for
  for (int k = 0; k < N; ++k ) {
    for (int j = 0; j < N; ++j ) {
      for (int i = 0; i < N; ++i ) {
        aView(i, j, k) = c * i * j * k ;
      }
    }
  }

where we have placed the OpenMP directive #pragma omp parallel for before the outer loop of the kernel.

The corresponding RAJA version of the C-style OpenMP variant is:

  using omp_policy_2 = RAJA::LoopPolicy<RAJA::omp_for_exec>;
  using loop_policy_2 = RAJA::LoopPolicy<RAJA::seq_exec>;
  using launch_policy_2 = RAJA::LaunchPolicy<RAJA::omp_launch_t>;

  RAJA::launch<launch_policy_2>
    (RAJA::LaunchParams(), //LaunchParams may be empty when running on the host
    [=] RAJA_HOST_DEVICE (RAJA::LaunchContext ctx) {

      RAJA::loop<omp_policy_2>(ctx, RAJA::TypedRangeSegment<int>(0, N), [&] (int k) {
         RAJA::loop<loop_policy_2>(ctx, RAJA::TypedRangeSegment<int>(0, N), [&] (int j) {
            RAJA::loop<loop_policy_2>(ctx, RAJA::TypedRangeSegment<int>(0, N), [&] (int i) {

                aView(i, j, k) = c * i * j * k ;

            });
         });
      });
  });

With the OpenMP version above, the RAJA::launch method is templated on the RAJA::omp_launch_t execution policy, which is used to create an OpenMP parallel region. Loop iterations may then be distributed across threads using a RAJA::loop method templated on the RAJA::omp_for_exec execution policy; here, this is applied to the outer ‘k’ loop, while the inner loops use sequential policies. As before, the RAJA::LaunchParams object may be initialized without grid dimensions, since the CPU does not require a compute grid.
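
Conceptually, this pairing of launch and loop policies corresponds to the following raw OpenMP structure (a sketch for illustration; it is not literally the code RAJA generates):

  #pragma omp parallel                  // parallel region created by the omp_launch_t launch policy
  {
    #pragma omp for                     // work-sharing applied by the omp_for_exec policy on the 'k' loop
    for (int k = 0; k < N; ++k) {
      for (int j = 0; j < N; ++j) {
        for (int i = 0; i < N; ++i) {
          aView(i, j, k) = c * i * j * k ;
        }
      }
    }
  }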

The first RAJA kernel variant for parallel GPU execution uses the RAJA CUDA back-end:

  using cuda_teams_z_3 = RAJA::LoopPolicy<RAJA::cuda_block_z_direct>;
  using cuda_global_thread_y_3 = RAJA::LoopPolicy<RAJA::cuda_global_thread_y>;
  using cuda_global_thread_x_3 = RAJA::LoopPolicy<RAJA::cuda_global_thread_x>;

  const bool async_3 = false;
  using launch_policy_3 = RAJA::LaunchPolicy<RAJA::cuda_launch_t<async_3>>;

  RAJA::launch<launch_policy_3>
    (RAJA::LaunchParams(RAJA::Teams(n_blocks_i ,n_blocks_j, n_blocks_k),
                      RAJA::Threads(i_block_sz, j_block_sz, k_block_sz)),
    [=] RAJA_HOST_DEVICE (RAJA::LaunchContext ctx) {

      RAJA::loop<cuda_teams_z_3>(ctx, RAJA::TypedRangeSegment<int>(0, N), [&] (int k) {
        RAJA::loop<cuda_global_thread_y_3>(ctx, RAJA::TypedRangeSegment<int>(0, N), [&] (int j) {
          RAJA::loop<cuda_global_thread_x_3>(ctx, RAJA::TypedRangeSegment<int>(0, N), [&] (int i) {

            aView(i, j, k) = c * i * j * k ;

          });
        });
      });
  });

where we have defined the CUDA thread-block dimensions as:

  constexpr int block_size = 256;
  constexpr int i_block_sz = 32;
  constexpr int j_block_sz = block_size / i_block_sz;
  constexpr int k_block_sz = 1;

  const int n_blocks_i = RAJA_DIVIDE_CEILING_INT(N,i_block_sz);
  const int n_blocks_j = RAJA_DIVIDE_CEILING_INT(N,j_block_sz);
  const int n_blocks_k = RAJA_DIVIDE_CEILING_INT(N,k_block_sz);

Here, we use the RAJA::cuda_launch_t policy type to indicate that we want to launch a CUDA kernel. The ‘k’, ‘j’, and ‘i’ iteration variables are mapped to CUDA blocks and threads using the execution policy types RAJA::cuda_block_z_direct, RAJA::cuda_global_thread_y, and RAJA::cuda_global_thread_x, respectively. Thus, we use a two-dimensional CUDA thread-block and a three-dimensional compute grid to map the loop iterations to CUDA threads. In comparison to the RAJA CUDA example in RAJA::kernel Execution Policies, RAJA::loop methods support execution policies that map loop iterations directly to global thread IDs of the compute grid.
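
As a rough mental model, the mapping implied by these policies can be written as a raw CUDA kernel along the following lines (an illustration only, not RAJA's internal implementation; the kernel name is hypothetical):

  // Illustration of the index mapping implied by the loop policies above.
  __global__ void global_thread_mapping(double* a, double c, int N)
  {
    int k = blockIdx.z;                              // cuda_block_z_direct
    int j = blockIdx.y * blockDim.y + threadIdx.y;   // cuda_global_thread_y
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // cuda_global_thread_x

    if (i < N && j < N && k < N) {                   // guard against partial blocks
      a[i + N * (j + N * k)] = c * i * j * k;
    }
  }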

Using a combination of RAJA::tile and RAJA::loop methods, we can create a platform-portable loop tiling implementation. Here is a CUDA variant:

  using cuda_teams_z_4 = RAJA::LoopPolicy<RAJA::cuda_block_z_direct>;
  using cuda_teams_y_4 = RAJA::LoopPolicy<RAJA::cuda_block_y_direct>;
  using cuda_teams_x_4 = RAJA::LoopPolicy<RAJA::cuda_block_x_direct>;

  using cuda_threads_y_4 = RAJA::LoopPolicy<RAJA::cuda_thread_y_direct>;
  using cuda_threads_x_4 = RAJA::LoopPolicy<RAJA::cuda_thread_x_direct>;

  const bool async_4 = false;
  using launch_policy_4 = RAJA::LaunchPolicy<RAJA::cuda_launch_t<async_4>>;

  RAJA::launch<launch_policy_4>
    (RAJA::LaunchParams(RAJA::Teams(n_blocks_i, n_blocks_j, n_blocks_k),
                        RAJA::Threads(i_block_sz, j_block_sz, k_block_sz)),
    [=] RAJA_HOST_DEVICE (RAJA::LaunchContext ctx) {

      RAJA::loop<cuda_teams_z_4>(ctx, RAJA::TypedRangeSegment<int>(0, N), [&] (int k) {

        RAJA::tile<cuda_teams_y_4>
          (ctx, j_block_sz, RAJA::TypedRangeSegment<int>(0, N), [&] (RAJA::TypedRangeSegment<int> const &j_tile) {

          RAJA::tile<cuda_teams_x_4>
            (ctx, i_block_sz, RAJA::TypedRangeSegment<int>(0, N), [&] (RAJA::TypedRangeSegment<int> const &i_tile) {

            RAJA::loop<cuda_threads_y_4>(ctx, j_tile, [&] (int j) {
                RAJA::loop<cuda_threads_x_4>(ctx, i_tile, [&] (int i) {

                    aView(i, j, k) = c * i * j * k ;

                  });
              });

            });
          });

      });
    });

We consider the kernel to be portable, because all of the execution policy types and execution parameters can be replaced by other types and values without changing the kernel code directly.

The RAJA::tile methods are used to partition an iteration space into tiles that are then iterated over with RAJA::loop methods. The ‘j_block_sz’ and ‘i_block_sz’ arguments passed to the RAJA::tile methods specify the tile size for each tiled loop. For GPU programming models, we choose the tile size to match the number of threads in the corresponding dimension. The tile and loop execution policies are chosen so that CUDA blocks map directly to tiles and CUDA threads map to the entries within a tile.
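
To make the tiling pattern concrete, here is a C-style sketch of the same partitioning of the ‘j’ and ‘i’ iteration spaces (illustration only; in the RAJA variant above, the tile loops map to CUDA blocks and the loops over tile entries map to CUDA threads):

  #include <algorithm>  // std::min

  for (int k = 0; k < N; ++k) {
    for (int j0 = 0; j0 < N; j0 += j_block_sz) {      // tile origin in 'j' (a CUDA block)
      for (int i0 = 0; i0 < N; i0 += i_block_sz) {    // tile origin in 'i' (a CUDA block)
        for (int j = j0; j < std::min(N, j0 + j_block_sz); ++j) {    // entries in the 'j' tile (threads)
          for (int i = i0; i < std::min(N, i0 + i_block_sz); ++i) {  // entries in the 'i' tile (threads)
            aView(i, j, k) = c * i * j * k ;
          }
        }
      }
    }
  }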

For context and comparison, here is the same kernel implementation using CUDA directly:

  dim3 nthreads_per_block(i_block_sz, j_block_sz, k_block_sz);
  static_assert(i_block_sz*j_block_sz*k_block_sz == block_size,
                "Invalid block_size");

  dim3 nblocks(static_cast<size_t>(RAJA_DIVIDE_CEILING_INT(N, i_block_sz)),
               static_cast<size_t>(RAJA_DIVIDE_CEILING_INT(N, j_block_sz)),
               static_cast<size_t>(RAJA_DIVIDE_CEILING_INT(N, k_block_sz)));

  nested_init<i_block_sz, j_block_sz, k_block_sz>
    <<<nblocks, nthreads_per_block>>>(a, c, N);
  cudaErrchk( cudaGetLastError() );
  cudaErrchk(cudaDeviceSynchronize());

The nested_init device kernel used here is:

template< int i_block_size, int j_block_size, int k_block_size >
__launch_bounds__(i_block_size*j_block_size*k_block_size)
__global__ void nested_init(double* a, double c, int N)
{
  int i = blockIdx.x * i_block_size + threadIdx.x;
  int j = blockIdx.y * j_block_size + threadIdx.y;
  int k = blockIdx.z;

  if ( i < N && j < N && k < N ) {
    a[i+N*(j+N*k)] = c * i * j * k ;
  }
}

A few differences between the CUDA and RAJA-CUDA versions are worth noting. First, the CUDA version uses the CUDA dim3 construct to express the threads-per-block and the number of thread-blocks; i.e., the nthreads_per_block and nblocks variables. The RAJA::launch interface takes these compute dimensions through a RAJA::LaunchParams object, and RAJA provides the RAJA_DIVIDE_CEILING_INT macro to perform the integer arithmetic that computes the number of blocks from the array extent and the block size in each dimension. Second, the mapping of thread identifiers to the (i,j,k) indices is explicit in the device kernel. Third, an explicit check of the (i,j,k) values is required in the CUDA implementation to avoid addressing memory out-of-bounds; i.e., if ( i < N && j < N && k < N ). The RAJA variants apply similar bounds internally and mask out indices that would be out-of-bounds. Note that we have also added error checking with static_assert and cudaErrchk, a RAJA macro that reports CUDA device error codes, to catch any device errors.
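
For reference, the ceiling divide amounts to the following integer arithmetic (a sketch of the computation; see the RAJA source for the actual macro definition):

  // Smallest number of blocks of size 'block' that covers 'n' iterations.
  // For example, with N = 100 and j_block_sz = 8 this yields 13 blocks.
  constexpr int divide_ceiling(int n, int block)
  {
    return (n + block - 1) / block;
  }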

Lastly, we show the RAJA HIP variants of the kernel, which are semantically identical to the RAJA CUDA variants. First, the RAJA-HIP global-thread variant:

  using hip_teams_z_5 = RAJA::LoopPolicy<RAJA::hip_block_z_direct>;
  using hip_global_thread_y_5 = RAJA::LoopPolicy<RAJA::hip_global_thread_y>;
  using hip_global_thread_x_5 = RAJA::LoopPolicy<RAJA::hip_global_thread_x>;

  const bool async_5 = false;
  using launch_policy_5 = RAJA::LaunchPolicy<RAJA::hip_launch_t<async_5>>;

  RAJA::launch<launch_policy_5>
    (RAJA::LaunchParams(RAJA::Teams(n_blocks_i, n_blocks_j, n_blocks_k),
                        RAJA::Threads(i_block_sz, j_block_sz, k_block_sz)),
    [=] RAJA_HOST_DEVICE (RAJA::LaunchContext ctx) {

       RAJA::loop<hip_teams_z_5>(ctx, RAJA::TypedRangeSegment<int>(0, N), [&] (int k) {
           RAJA::loop<hip_global_thread_y_5>(ctx, RAJA::TypedRangeSegment<int>(0, N), [&] (int j) {
               RAJA::loop<hip_global_thread_x_5>(ctx, RAJA::TypedRangeSegment<int>(0, N), [&] (int i) {

                   d_aView(i, j, k) = c * i * j * k ;

           });
         });
       });

  });

and then the tiled, direct thread mapping RAJA HIP variant with a fixed thread-block size:

  using hip_teams_z_6 = RAJA::LoopPolicy<RAJA::hip_block_z_direct>;
  using hip_teams_y_6 = RAJA::LoopPolicy<RAJA::hip_block_y_direct>;
  using hip_teams_x_6 = RAJA::LoopPolicy<RAJA::hip_block_x_direct>;

  using hip_threads_y_6 = RAJA::LoopPolicy<RAJA::hip_thread_y_direct>;
  using hip_threads_x_6 = RAJA::LoopPolicy<RAJA::hip_thread_x_direct>;

  const bool async_6 = false;
  using launch_policy_6 = RAJA::LaunchPolicy<RAJA::hip_launch_t<async_6>>;

  RAJA::launch<launch_policy_6>
    (RAJA::LaunchParams(RAJA::Teams(n_blocks_i, n_blocks_j, n_blocks_k),
                      RAJA::Threads(i_block_sz, j_block_sz, k_block_sz)),
    [=] RAJA_HOST_DEVICE (RAJA::LaunchContext ctx) {

      RAJA::loop<hip_teams_z_6>(ctx, RAJA::TypedRangeSegment<int>(0, N), [&] (int k) {

        RAJA::tile<hip_teams_y_6>
          (ctx, j_block_sz, RAJA::TypedRangeSegment<int>(0, N), [&] (RAJA::TypedRangeSegment<int> const &j_tile) {

          RAJA::tile<hip_teams_x_6>
            (ctx, i_block_sz, RAJA::TypedRangeSegment<int>(0, N), [&] (RAJA::TypedRangeSegment<int> const &i_tile) {

            RAJA::loop<hip_threads_y_6>(ctx, j_tile, [&] (int j) {
                RAJA::loop<hip_threads_x_6>(ctx, i_tile, [&] (int i) {

                    d_aView(i, j, k) = c * i * j * k ;

                  });
              });

            });
          });

      });
    });

The only differences are that the ‘CUDA’ policy types are replaced with the corresponding ‘HIP’ types to use the RAJA HIP back-end, and the kernels access the array through a View object named ‘d_aView’.
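
Since the HIP variants access the data through ‘d_aView’, a minimal sketch of setting up such a device-side View with explicit HIP allocation and copies might look like the following. This is an assumption for illustration; the exercise's memory manager may handle device allocation and copies differently.

  #include <hip/hip_runtime.h>

  // Hypothetical setup for a device-side View (illustration only).
  double* d_a = nullptr;
  hipMalloc((void**)&d_a, N_tot * sizeof(double));                    // device allocation
  hipMemcpy(d_a, a, N_tot * sizeof(double), hipMemcpyHostToDevice);   // host -> device copy

  // View over the device pointer, used inside the RAJA::launch kernels above
  RAJA::View< double, RAJA::Layout<3, int> > d_aView(d_a, N, N, N);

  // ... launch the kernels shown above ...

  hipMemcpy(a, d_a, N_tot * sizeof(double), hipMemcpyDeviceToHost);   // device -> host copy
  hipFree(d_a);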