Vector Addition (Basic Loop Execution)

Key RAJA features shown in this example:

  • RAJA::forall loop execution template
  • RAJA::RangeSegment iteration space construct
  • RAJA execution policies

In the example, we add two vectors ‘a’ and ‘b’ of length N and store the result in vector ‘c’. A simple C-style loop that does this is:

  for (int i = 0; i < N; ++i) {
    c[i] = a[i] + b[i];
  }

RAJA Variants

The RAJA variants of the vector addition operation illustrate how the same kernel can be run with a variety of different programming model back-ends by simply swapping out the execution policy. This can be done by defining type aliases in a header file so that execution policy types can be switched easily and the code recompiled to run differently, without changing the loop kernel code. In the example code, we make all execution policy types explicit for clarity.
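
For instance, a header file might select the policy at build time. The following is a minimal sketch, where the alias name EXEC_POL is illustrative and RAJA_ENABLE_OPENMP is the configuration macro RAJA defines when its OpenMP back-end is enabled:

  // Choose the execution policy once; the kernel code below never changes.
  #if defined(RAJA_ENABLE_OPENMP)
  using EXEC_POL = RAJA::omp_parallel_for_exec;
  #else
  using EXEC_POL = RAJA::seq_exec;
  #endif

  RAJA::forall<EXEC_POL>(RAJA::RangeSegment(0, N), [=] (int i) {
    c[i] = a[i] + b[i];
  });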

For the RAJA variants, we replace the C-style for-loop with a call to the RAJA::forall loop execution template method. The method takes an iteration space and the vector addition loop body as a C++ lambda expression. We pass a RAJA::RangeSegment object, which describes a contiguous sequence of integral values [0, N) for the iteration space (for more information about RAJA loop indexing concepts, see Indices, Segments, and IndexSets). The loop execution template method requires an execution policy template type that specifies how the loop is to run (for more information about RAJA execution policies, see Policies).

For the RAJA sequential variant, we use the RAJA::seq_exec execution policy type:

  RAJA::forall<RAJA::seq_exec>(RAJA::RangeSegment(0, N), [=] (int i) { 
    c[i] = a[i] + b[i]; 
  });

The RAJA sequential execution policy enforces strictly sequential execution; in particular, no SIMD vectorization instructions or other substantial optimizations will be generated by the compiler. To attempt to force the compiler to generate SIMD vector instructions, we would use the RAJA SIMD execution policy:

RAJA::simd_exec
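
Used with RAJA::forall, only the policy type changes from the sequential variant:

  RAJA::forall<RAJA::simd_exec>(RAJA::RangeSegment(0, N), [=] (int i) {
    c[i] = a[i] + b[i];
  });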

Alternatively, RAJA provides a loop execution policy:

RAJA::loop_exec

This policy allows the compiler to apply optimizations, such as SIMD vectorization, when its heuristics indicate that they are safe and potentially beneficial for performance, but it does not force them.
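
Again, the kernel is unchanged; only the policy type differs:

  RAJA::forall<RAJA::loop_exec>(RAJA::RangeSegment(0, N), [=] (int i) {
    c[i] = a[i] + b[i];
  });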

To run the kernel with OpenMP multithreaded parallelism on a CPU, we use the RAJA::omp_parallel_for_exec execution policy:

  RAJA::forall<RAJA::omp_parallel_for_exec>(RAJA::RangeSegment(0, N), [=] (int i) { 
    c[i] = a[i] + b[i]; 
  });

This distributes the loop iterations across CPU threads so that they execute in parallel.

To run the kernel on a CUDA GPU device, we use the RAJA::cuda_exec policy:

  RAJA::forall<RAJA::cuda_exec<CUDA_BLOCK_SIZE>>(RAJA::RangeSegment(0, N), 
    [=] RAJA_DEVICE (int i) { 
    c[i] = a[i] + b[i]; 
  });    

Note that the CUDA execution policy type accepts a template argument CUDA_BLOCK_SIZE, which specifies the number of threads in each CUDA thread block launched to execute the kernel.
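
In the complete example, CUDA_BLOCK_SIZE is an ordinary compile-time constant; the value 256 below is illustrative:

  const int CUDA_BLOCK_SIZE = 256;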

For performance tuning, the RAJA::cuda_exec_explicit policy is also provided. It allows the user to specify the number of blocks allocated per streaming multiprocessor (SM), enabling additional block-level parallelism. Note that the third, boolean template argument, which indicates asynchronous execution, can be omitted and is false by default:

  RAJA::forall<RAJA::cuda_exec_explicit<CUDA_BLOCK_SIZE, 2, Asynchronous>>(RAJA::RangeSegment(0, N), 
    [=] RAJA_DEVICE (int i) { 
    c[i] = a[i] + b[i]; 
  });    

Since the lambda defining the loop body will be passed to a device kernel, it must be decorated with the __device__ attribute when it is defined. This can be done directly or by using the RAJA_DEVICE macro.
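
That is, the two forms below are equivalent when compiling for the device, since the RAJA_DEVICE macro expands to the __device__ attribute under a device compiler:

  [=] __device__  (int i) { c[i] = a[i] + b[i]; }  // attribute written directly
  [=] RAJA_DEVICE (int i) { c[i] = a[i] + b[i]; }  // via the RAJA macro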

Similarly, to run the kernel on a GPU using the RAJA HIP back-end, we use the RAJA::hip_exec policy:

  RAJA::forall<RAJA::hip_exec<HIP_BLOCK_SIZE>>(RAJA::RangeSegment(0, N),
    [=] RAJA_DEVICE (int i) {
    d_c[i] = d_a[i] + d_b[i];
  });
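
Here d_a, d_b, and d_c are device copies of the host arrays, which the HIP variant allocates and fills explicitly. A minimal sketch of that setup with the HIP runtime API (error checking omitted) might look like:

  int *d_a, *d_b, *d_c;
  hipMalloc((void**)&d_a, N * sizeof(int));
  hipMalloc((void**)&d_b, N * sizeof(int));
  hipMalloc((void**)&d_c, N * sizeof(int));

  // Copy the input vectors to the device before launching the kernel.
  hipMemcpy(d_a, a, N * sizeof(int), hipMemcpyHostToDevice);
  hipMemcpy(d_b, b, N * sizeof(int), hipMemcpyHostToDevice);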

The file RAJA/examples/tut_add-vectors.cpp contains the complete working example code.