Sum Reduction: Vector Dot Product
This section contains an exercise file RAJA/exercises/dot-product.cpp
for you to work through if you wish to get some practice with RAJA. The
file RAJA/exercises/dot-product_solution.cpp contains complete
working code for the examples discussed in this section. You can use the
solution file to check your work and for guidance if you get stuck. To build
the exercises, execute make dot-product and make dot-product_solution
from the build directory.
Key RAJA features shown in this example are:
RAJA::forall loop execution template and execution policies
RAJA::TypedRangeSegment iteration space construct
RAJA::ReduceSum sum reduction template and reduction policies
In the example, we compute a vector dot product, ‘dot = (a,b)’, where ‘a’ and ‘b’ are two vectors of length N and ‘dot’ is a scalar. Typical C-style code to compute the dot product and print its value afterward is:
double dot = 0.0;
for (int i = 0; i < N; ++i) {
  dot += a[i] * b[i];
}
std::cout << "\t (a, b) = " << dot << std::endl;
Although this kernel is serial, it is representative of a reduction operation, a common algorithm pattern that produces a single result from a set of values. Running a reduction correctly in parallel raises several issues, such as avoiding data races when multiple threads update the same result.
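To make the decomposition behind a reduction concrete, here is a minimal sketch in plain C++ (no RAJA). It splits the index range into chunks, accumulates a private partial sum per chunk, and then combines the partials. The function name `chunked_dot` and the `num_chunks` parameter are hypothetical, and the chunks are processed one after another here, so this illustrates the structure of a reduction rather than actual parallel execution.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not part of RAJA): dot product computed as a
// reduction over per-chunk partial sums.
double chunked_dot(const std::vector<double>& a,
                   const std::vector<double>& b,
                   std::size_t num_chunks)
{
  assert(a.size() == b.size());
  assert(num_chunks > 0);

  const std::size_t n = a.size();
  const std::size_t chunk = (n + num_chunks - 1) / num_chunks;  // ceiling division

  // Step 1: each chunk accumulates into its own private partial sum,
  // so no two chunks ever write to the same variable.
  std::vector<double> partial(num_chunks, 0.0);
  for (std::size_t c = 0; c < num_chunks; ++c) {
    const std::size_t begin = std::min(n, c * chunk);
    const std::size_t end = std::min(n, begin + chunk);
    for (std::size_t i = begin; i < end; ++i) {
      partial[c] += a[i] * b[i];
    }
  }

  // Step 2: combine the partial sums into the single result.
  double dot = 0.0;
  for (double p : partial) {
    dot += p;
  }
  return dot;
}
```

Because the partial sums are added in a different order than in the serial loop, the result for general floating-point data may differ from the serial result in the last few bits.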
RAJA Variants
Different programming models support parallel reduction operations differently.
Some models, such as CUDA, do not provide direct support for reductions, so
users must code such operations explicitly. It can be challenging to write a
correct, high-performance implementation. RAJA provides
portable reduction types that make it easy to perform reduction operations
in kernels. The RAJA variants of the dot product computation show how
to use the RAJA::ReduceSum
sum reduction template type. RAJA provides
other reduction types and allows multiple reduction operations to be
performed in a single kernel alongside other computations. Please see
Reduction Operations for more information.
Each RAJA reduction type takes a reduce policy template argument, which must be compatible with the execution policy applied to the kernel in which the reduction is used. Here is the RAJA sequential variant of the dot product computation:
RAJA::ReduceSum<RAJA::seq_reduce, double> seqdot(0.0);
RAJA::forall<RAJA::seq_exec>(RAJA::TypedRangeSegment<int>(0, N), [=] (int i) {
  seqdot += a[i] * b[i];
});
dot = seqdot.get();
The sum reduction object is defined by specifying the reduction
policy RAJA::seq_reduce, which matches the kernel execution policy
RAJA::seq_exec, and a reduction value type (i.e., ‘double’). An initial
value of zero for the sum is passed to the reduction object constructor. After
the kernel executes, we use the ‘get’ method to retrieve the reduced value.
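The usage pattern of the reduction object (construct it with an initial value, accumulate with `+=` inside a lambda that captures it by value, then call `get`) can be mimicked with a small toy class in plain C++. The sketch below is not RAJA's implementation; `ToySum`, `toy_forall`, and `toy_dot` are hypothetical names, and sharing the accumulator through a `std::shared_ptr` is an assumption made purely to keep the example short.

```cpp
#include <cassert>
#include <memory>

// Toy stand-in for a sum reducer (NOT RAJA's implementation).
// All copies of a ToySum share one accumulator, so a copy captured
// by value in a lambda still updates the value that get() returns.
template <typename T>
class ToySum {
public:
  explicit ToySum(T init) : value_(std::make_shared<T>(init)) {}

  // Declared const so it can be called on the by-value copy held
  // inside a non-mutable lambda.
  const ToySum& operator+=(T rhs) const
  {
    *value_ += rhs;
    return *this;
  }

  T get() const { return *value_; }

private:
  std::shared_ptr<T> value_;
};

// Hypothetical sequential loop driver: calls body(i) for i in [begin, end).
template <typename Body>
void toy_forall(int begin, int end, Body body)
{
  for (int i = begin; i < end; ++i) {
    body(i);
  }
}

// Dot product written in the same shape as the sequential variant above.
double toy_dot(const double* a, const double* b, int N)
{
  assert(N >= 0);
  ToySum<double> sum(0.0);  // initial value passed to the constructor
  toy_forall(0, N, [=](int i) { sum += a[i] * b[i]; });
  return sum.get();         // retrieve the reduced value after the loop
}
```

The point of the sketch is the lifecycle: the initial value goes in at construction, the loop body only ever uses `+=`, and the final value is read once after the loop finishes.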
The OpenMP multithreaded variant of the kernel is implemented similarly:
RAJA::ReduceSum<RAJA::omp_reduce, double> ompdot(0.0);
RAJA::forall<RAJA::omp_parallel_for_exec>(RAJA::RangeSegment(0, N), [=] (int i) {
  ompdot += a[i] * b[i];
});
dot = ompdot.get();
Here, we use the RAJA::omp_reduce
reduce policy to match the OpenMP
kernel execution policy.
The RAJA CUDA variant is achieved by using appropriate kernel execution and reduction policies:
RAJA::ReduceSum<RAJA::cuda_reduce, double> cudot(0.0);
RAJA::forall<RAJA::cuda_exec<CUDA_BLOCK_SIZE>>(RAJA::RangeSegment(0, N),
  [=] RAJA_DEVICE (int i) {
    cudot += a[i] * b[i];
});
dot = cudot.get();
Here, the CUDA reduce policy RAJA::cuda_reduce
matches the CUDA
kernel execution policy. Note that the CUDA thread block size is not
specified in the reduce policy as it will use the same value as the
loop execution policy.
Similarly, for the RAJA HIP variant:
RAJA::ReduceSum<RAJA::hip_reduce, double> hpdot(0.0);
RAJA::forall<RAJA::hip_exec<HIP_BLOCK_SIZE>>(RAJA::RangeSegment(0, N),
  [=] RAJA_DEVICE (int i) {
    hpdot += d_a[i] * d_b[i];
});
dot = hpdot.get();
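It is worth repeating how similar the code looks for each of these variants. The loop body has the same form in each case (the HIP variant reads from arrays named d_a and d_b, presumably device copies of a and b), and only the loop execution policy and reduce policy types change.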
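The idea that only a policy type changes while the kernel body stays fixed can be illustrated with ordinary C++ templates. The following is a toy sketch, not RAJA's mechanism: the policy types `ToyForward` and `ToyTwoLane` and the function `policy_dot` are hypothetical, and both policies run serially.

```cpp
#include <cassert>
#include <vector>

// Hypothetical policy: one running sum, indices visited in order.
struct ToyForward {
  template <typename Body>
  static double sum(int n, Body body)
  {
    double s = 0.0;
    for (int i = 0; i < n; ++i) {
      s += body(i);
    }
    return s;
  }
};

// Hypothetical policy: two independent partial sums (even and odd
// indices), combined at the end.
struct ToyTwoLane {
  template <typename Body>
  static double sum(int n, Body body)
  {
    double lane[2] = {0.0, 0.0};
    for (int i = 0; i < n; ++i) {
      lane[i % 2] += body(i);
    }
    return lane[0] + lane[1];
  }
};

// The kernel body (the lambda) is written once; the policy template
// argument alone selects how the reduction is carried out.
template <typename Policy>
double policy_dot(const std::vector<double>& a, const std::vector<double>& b)
{
  assert(a.size() == b.size());
  const int n = static_cast<int>(a.size());
  return Policy::sum(n, [&](int i) { return a[i] * b[i]; });
}
```

A caller would switch strategies by changing only the template argument, for example `policy_dot<ToyForward>(a, b)` versus `policy_dot<ToyTwoLane>(a, b)`.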
Note
Currently available reduction capabilities in RAJA require a reduction policy type that is compatible with the execution policy for the kernel in which the reduction is used. We are developing a new reduction interface for RAJA that will provide an alternative for which the reduction policy is not required.