Computing a Histogram with Atomic Operations¶
Key RAJA features shown in this example:
RAJA::forall
loop execution templateRAJA::RangeSegment
iteration space construct- RAJA atomic add operation
The example uses an integer array of length ‘N’ randomly initialized with values in the interval [0, M). While iterating over the array, the kernel accumulates the number of occurrences of each value in the array using atomic add operations. Atomic operations allow one to update a memory location referenced by a specific address in parallel without data races. The example shows how to use RAJA portable atomic operations and that they are used similarly for different programming model back-ends.
Note
Each RAJA reduction operation requires an atomic policy type parameter that must be compatible with the execution policy for the kernel in which it is used.
For a complete description of supported RAJA atomic operations and atomic policies, please see Atomics.
All code snippets described below use the loop range:
RAJA::TypedRangeSegment<int> array_range(0, N);
and the integer array ‘bins’ of length ‘M’ to accumulate the number of occurrences of each value in the array.
Here is the OpenMP version:
RAJA::forall<RAJA::omp_parallel_for_exec>(array_range, [=](int i) {
RAJA::atomicAdd<RAJA::omp_atomic>(&bins[array[i]], 1);
});
Each slot in the ‘bins’ array is incremented by one when a value associated
with that slot is encountered. Note that the RAJA::atomicAdd
operation uses an OpenMP atomic policy, which is compatible with the OpenMP
loop execution policy.
The CUDA version is similar:
RAJA::forall< RAJA::cuda_exec<CUDA_BLOCK_SIZE> >(array_range,
[=] RAJA_DEVICE(int i) {
RAJA::atomicAdd<RAJA::cuda_atomic>(&bins[array[i]], 1);
});
Here, the atomic add operation uses a CUDA atomic policy, which is compatible with the CUDA loop execution policy.
Note that RAJA provides an auto_atomic
policy for easier usage and
improved portability. This policy will do the right thing in most
circumstances. Specifically, if it is encountered in a CUDA execution
context, the CUDA atomic policy will be applied. If OpenMP is enabled, the
OpenMP atomic policy will be used, which should be correct in a sequential
execution context as well. Otherwise, the sequential atomic policy will be
applied.
For example, here is the CUDA version that uses the ‘auto’ atomic policy:
RAJA::forall< RAJA::cuda_exec<CUDA_BLOCK_SIZE> >(array_range,
[=] RAJA_DEVICE(int i) {
RAJA::atomicAdd<RAJA::auto_atomic>(&bins[array[i]], 1);
});
The same CUDA loop execution policy as in the previous example is used.
The file RAJA/examples/tut_atomic-histogram.cpp
contains the complete
working example code.