Basic Loop Execution: Vector Addition¶
This section contains an exercise file RAJA/exercises/vector-addition.cpp
for you to work through if you wish to get some practice with RAJA. The
file RAJA/exercises/vector-addition_solution.cpp
contains complete
working code for the examples discussed in this section. You can use the
solution file to check your work and for guidance if you get stuck. To build
the exercises execute make vector-addition
and make vector-addition_solution
from the build directory.
Key RAJA features shown in this example are:
RAJA::forall
loop execution template and execution policies
RAJA::TypedRangeSegment
iteration space construct
In the example, we add two vectors ‘a’ and ‘b’ of length N and store the result in vector ‘c’. A simple C-style loop that does this is:
for (int i = 0; i < N; ++i) {
c_ref[i] = a[i] + b[i];
}
RAJA Variants¶
For the RAJA variants of the vector addition kernel, we replace the C-style
for-loop with a call to the RAJA::forall
loop execution template method.
The method takes an iteration space and the vector addition loop body as
a C++ lambda expression. We pass the object:
RAJA::TypedRangeSegment<int>(0, N)
for the iteration space, which is contiguous sequence of integral values [0, N) (for more information about RAJA loop indexing concepts, see Indices, Segments, and IndexSets). The loop execution template method requires an execution policy template type that specifies how the loop is to run (for more information about RAJA execution policies, see Policies).
For a RAJA sequential variant, we use the RAJA::seq_exec
execution
policy type:
RAJA::forall< RAJA::seq_exec >(
RAJA::TypedRangeSegment<int>(0, N), [=] (int i) {
c[i] = a[i] + b[i];
}
);
When using the RAJA sequential execution policy, the resulting loop implementation is essentially the same as writing a C-style for-loop with no directives applied to the loop. The compiler is allowed to perform any optimizations that its heuristics deem are safe and potentially beneficial for performance. To attempt to force the compiler to generate SIMD vector instructions, we would use the RAJA SIMD execution policy:
RAJA::simd_exec
To run the kernel with OpenMP multithreaded parallelism on a CPU, we use the
RAJA::omp_parallel_for_exec
execution policy:
RAJA::forall< RAJA::omp_parallel_for_exec >(
RAJA::TypedRangeSegment<int>(0, N), [=] (int i) {
c[i] = a[i] + b[i];
}
);
This will distribute the loop iterations across CPU threads and run the loop over threads in parallel. In particular, this is what you would get if you wrote the kernel using a C-style loop with an OpenMP pragma directly:
#pragma omp parallel for
for (int i = 0; i < N; ++i) {
c[i] = a[i] + b[i];
}
To run the kernel on a CUDA GPU device, we use the RAJA::cuda_exec
policy:
RAJA::forall< RAJA::cuda_exec<CUDA_BLOCK_SIZE> >(RAJA::TypedRangeSegment<int>(0, N),
[=] RAJA_DEVICE (int i) {
d_c[i] = d_a[i] + d_b[i];
});
Since the lambda defining the loop body will be passed to a device kernel,
it must be decorated with the __device__
attribute.
This can be done directly or by using the RAJA_DEVICE
macro.
Note that the CUDA execution policy type requires a template argument
CUDA_BLOCK_SIZE
, which specifies the number of threads to run in each
CUDA thread block launched to run the kernel.
For additional performance tuning options, the RAJA::cuda_exec_explicit
policy is also provided, which allows a user to specify the minimum number
of thread blocks to launch at a time on each streaming multiprocessor (SM):
const bool Asynchronous = true;
RAJA::forall<RAJA::cuda_exec_explicit<CUDA_BLOCK_SIZE, 2, Asynchronous>>(RAJA::TypedRangeSegment<int>(0, N),
[=] RAJA_DEVICE (int i) {
d_c[i] = d_a[i] + d_b[i];
});
Note that the third boolean template argument is used to express whether the kernel launch is synchronous or asynchronous. This is optional and is ‘false’ by default. A similar defaulted optional argument is supported for other RAJA GPU (e.g., CUDA or HIP) policies.
Lastly, to run the kernel on a GPU using the RAJA HIP back-end,
we use the RAJA::hip_exec
policy:
RAJA::forall<RAJA::hip_exec<HIP_BLOCK_SIZE>>(RAJA::TypedRangeSegment<int>(0, N),
[=] RAJA_DEVICE (int i) {
d_c[i] = d_a[i] + d_b[i];
});