# Sum Reduction: Vector Dot Product

This section contains an exercise file, `RAJA/exercises/dot-product.cpp`,
for you to work through if you wish to get some practice with RAJA. The
file `RAJA/exercises/dot-product_solution.cpp` contains complete
working code for the examples discussed in this section. You can use the
solution file to check your work and for guidance if you get stuck. To build
the exercises, execute `make dot-product` and `make dot-product_solution`
from the build directory.

Key RAJA features shown in this example are:

- `RAJA::forall` loop execution template and execution policies
- `RAJA::TypedRangeSegment` iteration space construct
- `RAJA::ReduceSum` sum reduction template and reduction policies

In the example, we compute a vector dot product, ‘dot = (a,b)’, where ‘a’ and ‘b’ are two vectors of length N and ‘dot’ is a scalar. Typical C-style code to compute the dot product and print its value afterward is:

```
double dot = 0.0;
for (int i = 0; i < N; ++i) {
  dot += a[i] * b[i];
}
std::cout << "\t (a, b) = " << dot << std::endl;
```

Although this kernel is serial, it is representative of a *reduction*
operation, a common algorithm pattern that produces a single result from a
set of values. Reductions raise issues that must be addressed for them to run
correctly in parallel; in particular, concurrent updates to the shared result
must be coordinated to avoid data races.
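To make the parallel issue concrete, here is a minimal sketch in standard C++ (no RAJA) of one common way to reduce safely across threads: each thread accumulates into a private partial sum, and the partial sums are combined after all threads finish. The function name `dot_partial_sums` and its `num_threads` parameter are hypothetical and chosen only for illustration; this is not how RAJA implements its reductions.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical helper (not part of RAJA): dot product computed with one
// private partial sum per thread, combined serially at the end.
double dot_partial_sums(const std::vector<double>& a,
                        const std::vector<double>& b,
                        std::size_t num_threads)
{
  assert(a.size() == b.size());
  assert(num_threads > 0);

  const std::size_t n = a.size();
  const std::size_t chunk = (n + num_threads - 1) / num_threads;
  std::vector<double> partial(num_threads, 0.0);
  std::vector<std::thread> workers;

  for (std::size_t t = 0; t < num_threads; ++t) {
    const std::size_t begin = std::min(n, t * chunk);
    const std::size_t end = std::min(n, begin + chunk);
    workers.emplace_back([&a, &b, &partial, t, begin, end] {
      double local = 0.0;  // private accumulator: no shared writes in the loop
      for (std::size_t i = begin; i < end; ++i) {
        local += a[i] * b[i];
      }
      partial[t] = local;  // each thread writes only its own slot
    });
  }
  for (std::thread& w : workers) {
    w.join();
  }

  // Serial combine step; runs after every worker has finished.
  double dot = 0.0;
  for (double p : partial) {
    dot += p;
  }
  return dot;
}
```

Because the partial sums are added in a different order than in the serial loop, the floating-point result can differ slightly from the serial one for general inputs.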

## RAJA Variants

Different programming models support parallel reduction operations differently.
Some models, such as CUDA, do not provide direct support for reductions, so
users must code such operations explicitly. It can be challenging
to write a correct, high-performance implementation. RAJA provides
portable reduction types that make it easy to perform reduction operations
in kernels. The RAJA variants of the dot product computation show how
to use the `RAJA::ReduceSum` sum reduction template type. RAJA provides
other reduction types and allows multiple reduction operations to be
performed in a single kernel alongside other computations. Please see
Reduction Operations for more information.

Each RAJA reduction type takes a reduce policy template argument, which
**must be compatible with the execution policy** applied to the kernel
in which the reduction is used. Here is the RAJA sequential variant of the dot
product computation:

```
RAJA::ReduceSum<RAJA::seq_reduce, double> seqdot(0.0);
RAJA::forall<RAJA::seq_exec>(RAJA::TypedRangeSegment<int>(0, N), [=] (int i) {
  seqdot += a[i] * b[i];
});
dot = seqdot.get();
```

The sum reduction object is defined by specifying the reduction
policy `RAJA::seq_reduce`, matching the kernel execution policy
`RAJA::seq_exec`, and a reduction value type (i.e., ‘double’). An initial
value of zero for the sum is passed to the reduction object constructor. After
the kernel executes, we use the ‘get’ method to retrieve the reduced value.

The OpenMP multithreaded variant of the kernel is implemented similarly:

```
RAJA::ReduceSum<RAJA::omp_reduce, double> ompdot(0.0);
RAJA::forall<RAJA::omp_parallel_for_exec>(RAJA::RangeSegment(0, N), [=] (int i) {
  ompdot += a[i] * b[i];
});
dot = ompdot.get();
```

Here, we use the `RAJA::omp_reduce` reduce policy to match the OpenMP
kernel execution policy.
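For comparison, a directive-based OpenMP version of the same loop, written without RAJA, might look like the sketch below. The function name `dot_openmp` is hypothetical, and the sketch makes no claim about how `RAJA::omp_reduce` is implemented internally; it only shows the native construct that this RAJA variant lets you avoid writing by hand.

```cpp
#include <cassert>
#include <vector>

// Hypothetical helper (not part of RAJA): dot product using the OpenMP
// reduction clause directly.
double dot_openmp(const std::vector<double>& a, const std::vector<double>& b)
{
  assert(a.size() == b.size());
  const int n = static_cast<int>(a.size());

  double dot = 0.0;
  // With OpenMP enabled, each thread accumulates into a private copy of
  // dot, and the copies are summed into dot when the loop ends.
  #pragma omp parallel for reduction(+:dot)
  for (int i = 0; i < n; ++i) {
    dot += a[i] * b[i];
  }
  return dot;
}
```

If the code is compiled without OpenMP enabled (for example, without `-fopenmp` on GCC or Clang), the pragma is ignored and the loop simply runs serially.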

The RAJA CUDA variant is achieved by using appropriate kernel execution and reduction policies:

```
RAJA::ReduceSum<RAJA::cuda_reduce, double> cudot(0.0);
RAJA::forall<RAJA::cuda_exec<CUDA_BLOCK_SIZE>>(RAJA::RangeSegment(0, N),
  [=] RAJA_DEVICE (int i) {
    cudot += a[i] * b[i];
});
dot = cudot.get();
```

Here, the CUDA reduce policy `RAJA::cuda_reduce` matches the CUDA
kernel execution policy. Note that the CUDA thread block size is not
specified in the reduce policy because the reduction uses the same value as
the loop execution policy.

Similarly, for the RAJA HIP variant:

```
RAJA::ReduceSum<RAJA::hip_reduce, double> hpdot(0.0);
RAJA::forall<RAJA::hip_exec<HIP_BLOCK_SIZE>>(RAJA::RangeSegment(0, N),
  [=] RAJA_DEVICE (int i) {
    hpdot += d_a[i] * d_b[i];
});
dot = hpdot.get();
```

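It is worth repeating how similar the code looks for each of these variants. The loop body has the same form in each (the HIP variant reads from `d_a` and `d_b`, which appear to be device copies of the arrays). Only the loop execution policy and reduce policy types change, along with the `RAJA_DEVICE` lambda annotation in the CUDA and HIP variants.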
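The idea of keeping the loop body fixed while swapping policy types can be sketched in plain C++ with tag types and template specialization. Everything below is a toy stand-in with hypothetical names (`toy_seq_exec`, `toy_thread_exec`, `toy_seq_reduce`, `toy_locked_reduce`, `ToyReduceSum`, `toy_forall`, `toy_dot`); it is not RAJA code and does not reflect RAJA's internals. It only illustrates why a reduce policy has to suit the execution policy: the threaded loop needs a reducer whose updates are safe under concurrency.

```cpp
#include <cassert>
#include <memory>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical policy tags (not RAJA types).
struct toy_seq_exec {};
struct toy_thread_exec {};
struct toy_seq_reduce {};
struct toy_locked_reduce {};

// Serial loop.
template <typename Body>
void toy_forall(toy_seq_exec, int begin, int end, Body body) {
  for (int i = begin; i < end; ++i) body(i);
}

// Two worker threads, each running half of the range.
template <typename Body>
void toy_forall(toy_thread_exec, int begin, int end, Body body) {
  const int mid = begin + (end - begin) / 2;
  std::thread lo([=] { for (int i = begin; i < mid; ++i) body(i); });
  std::thread hi([=] { for (int i = mid; i < end; ++i) body(i); });
  lo.join();
  hi.join();
}

template <typename ReducePolicy, typename T>
class ToyReduceSum;

// Sequential reducer: copies share one unsynchronized total.
template <typename T>
class ToyReduceSum<toy_seq_reduce, T> {
public:
  explicit ToyReduceSum(T init) : total_(std::make_shared<T>(init)) {}
  // const so that a by-value capture in a non-mutable lambda can call it.
  const ToyReduceSum& operator+=(T rhs) const { *total_ += rhs; return *this; }
  T get() const { return *total_; }
private:
  std::shared_ptr<T> total_;
};

// Thread-safe reducer: every update takes a lock (simple, not fast).
template <typename T>
class ToyReduceSum<toy_locked_reduce, T> {
public:
  explicit ToyReduceSum(T init) : state_(std::make_shared<State>(init)) {}
  const ToyReduceSum& operator+=(T rhs) const {
    std::lock_guard<std::mutex> guard(state_->lock);
    state_->total += rhs;
    return *this;
  }
  T get() const {
    std::lock_guard<std::mutex> guard(state_->lock);
    return state_->total;
  }
private:
  struct State {
    explicit State(T init) : total(init) {}
    T total;
    std::mutex lock;
  };
  std::shared_ptr<State> state_;
};

// The loop body is written once; only the two policy types vary.
template <typename ExecPolicy, typename ReducePolicy>
double toy_dot(const std::vector<double>& a, const std::vector<double>& b) {
  assert(a.size() == b.size());
  const double* pa = a.data();
  const double* pb = b.data();
  ToyReduceSum<ReducePolicy, double> sum(0.0);
  toy_forall(ExecPolicy{}, 0, static_cast<int>(a.size()), [=](int i) {
    sum += pa[i] * pb[i];
  });
  return sum.get();
}
```

In this sketch, `toy_dot<toy_seq_exec, toy_seq_reduce>(a, b)` and `toy_dot<toy_thread_exec, toy_locked_reduce>(a, b)` share the same lambda. Pairing `toy_thread_exec` with `toy_seq_reduce` would compile but would race on the shared total, which is the kind of mismatch that the compatibility requirement guards against.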

Note

Currently available reduction capabilities in RAJA require a
*reduction policy* type that is compatible with the execution
policy for the kernel in which the reduction is used. We
are developing a new reduction interface for RAJA that will
provide an alternative for which the reduction policy is not
required.