Vector Dot Product (Sum Reduction)¶
Key RAJA features shown in this example:
RAJA::forall
loop execution templateRAJA::RangeSegment
iteration space construct- RAJA execution policies
RAJA::ReduceSum
sum reduction template- RAJA reduction policies
In the example, we compute a vector dot product, ‘dot = (a,b)’, where ‘a’ and ‘b’ are two vectors length N and ‘dot’ is a scalar. Typical C-style code to compute the dot product and print its value afterward is:
double dot = 0.0;
for (int i = 0; i < N; ++i) {
dot += a[i] * b[i];
}
Note that this operation performs a reduction, a computational pattern that produces a single result from a set of values. Reductions present a variety of issues that must be addressed to operate properly in parallel.
RAJA Variants¶
Different programming models support parallel reduction operations differently.
Some models, such as CUDA, do not provide support for reductions at all and
so such operations must be explicitly coded by users. It can be challenging
to generate a correct and high performance implementation. RAJA provides
portable reduction types that make it easy to perform reduction operations
in loop kernels. The RAJA variants of the dot product computation show how
to use the RAJA::ReduceSum
sum reduction template type. RAJA provides
other reduction types and also allows multiple reduction operations to be
performed in a single kernel along with other computation. Please see
Reductions for an example that does this.
Each RAJA reduction type takes a reduce policy template argument, which must be compatible with the execution policy applied to the kernel in which the reduction is used. Here is the RAJA sequential variant of the dot product computation:
The sum reduction object is defined by specifying the reduction
policy RAJA::seq_reduce
, which matches the loop execution policy, and
a reduction value type (i.e., ‘double’). An initial value of zero for the
sum is passed to the reduction object constructor. After the kernel executes,
we use the ‘get’ method to retrieve the reduced value.
The OpenMP multithreaded variant of the loop is implemented similarly:
RAJA::ReduceSum<RAJA::omp_reduce, double> ompdot(0.0);
RAJA::forall<RAJA::omp_parallel_for_exec>(RAJA::RangeSegment(0, N), [=] (int i) {
ompdot += a[i] * b[i];
});
dot = ompdot.get();
Here, we use the RAJA::omp_reduce
reduce policy to match the OpenMP
loop execution policy.
The RAJA CUDA variant is achieved by using appropriate loop execution and reduction policies:
RAJA::ReduceSum<RAJA::cuda_reduce, double> cudot(0.0);
RAJA::forall<RAJA::cuda_exec<CUDA_BLOCK_SIZE>>(RAJA::RangeSegment(0, N),
[=] RAJA_DEVICE (int i) {
cudot += a[i] * b[i];
});
dot = cudot.get();
Here, the CUDA reduce policy RAJA::cuda_reduce
matches the CUDA
loop execution policy. Note that the CUDA thread block size is not
specified in the reduce policy as it will use the same value as the
loop execution policy.
Similarly, for the RAJA HIP variant:
RAJA::ReduceSum<RAJA::hip_reduce, double> hpdot(0.0);
RAJA::forall<RAJA::hip_exec<HIP_BLOCK_SIZE>>(RAJA::RangeSegment(0, N),
[=] RAJA_DEVICE (int i) {
hpdot += d_a[i] * d_b[i];
});
dot = hpdot.get();
It is worth noting how similar the code looks for each of these variants. The loop body is identical for each and only the loop execution policy and reduce policy types change.
The file RAJA/examples/tut_dot-product.cpp
contains the complete
working example code.