Computing a Histogram with Atomic Operations

Key RAJA features shown in this example:

  • RAJA::forall loop execution template
  • RAJA::TypedRangeSegment iteration space construct
  • RAJA atomic add operation

The example uses an integer array of length ‘N’ randomly initialized with values in the interval [0, M). While iterating over the array, the kernel counts the number of occurrences of each value using atomic add operations. Atomic operations allow a memory location referenced by a specific address to be updated in parallel without data races. The example shows how to use RAJA portable atomic operations and demonstrates that they are used similarly across different programming model back-ends.

Note

Each RAJA atomic operation requires an atomic policy type parameter that must be compatible with the execution policy for the kernel in which it is used.

For a complete description of supported RAJA atomic operations and atomic policies, please see Atomics.

All code snippets described below use the loop range:

  RAJA::TypedRangeSegment<int> array_range(0, N);

and the integer array ‘bins’ of length ‘M’ to accumulate the number of occurrences of each value in the array.
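
For reference, here is a minimal setup sketch (the variable names follow the text, but the allocation and initialization details are assumptions; the tutorial's actual code may differ, e.g., by using unified memory for the GPU variants):

  #include <cstdlib>   // std::rand
  #include <cstring>   // std::memset

  int* array = new int[N];  // values to histogram
  int* bins  = new int[M];  // one counter per value in [0, M)

  std::memset(bins, 0, M * sizeof(int));
  for (int i = 0; i < N; ++i) {
    array[i] = std::rand() % M;  // random value in [0, M)
  }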

Here is the OpenMP version:

  RAJA::forall<RAJA::omp_parallel_for_exec>(array_range, [=](int i) {
    RAJA::atomicAdd<RAJA::omp_atomic>(&bins[array[i]], 1);
  });

Each slot in the ‘bins’ array is incremented by one when a value associated with that slot is encountered. Note that the RAJA::atomicAdd operation uses an OpenMP atomic policy, which is compatible with the OpenMP loop execution policy.
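
For comparison, a sequential variant follows the same pattern (a sketch using RAJA's standard sequential policies; it is not one of the snippets listed in this tutorial):

  RAJA::forall<RAJA::seq_exec>(array_range, [=](int i) {
    RAJA::atomicAdd<RAJA::seq_atomic>(&bins[array[i]], 1);
  });

With RAJA::seq_atomic, the add compiles to an ordinary update, since no thread safety is required in a sequential context.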

The CUDA version is similar:

  RAJA::forall< RAJA::cuda_exec<CUDA_BLOCK_SIZE> >(array_range,
    [=] RAJA_DEVICE(int i) {
    RAJA::atomicAdd<RAJA::cuda_atomic>(&bins[array[i]], 1);
  });

Here, the atomic add operation uses a CUDA atomic policy, which is compatible with the CUDA loop execution policy.

Note that RAJA provides an auto_atomic policy for easier usage and improved portability. This policy will do the right thing in most circumstances. Specifically, if it is encountered in a CUDA execution context, the CUDA atomic policy will be applied. If OpenMP is enabled, the OpenMP atomic policy will be used, which should be correct in a sequential execution context as well. Otherwise, the sequential atomic policy will be applied.

For example, here is the CUDA version that uses the ‘auto’ atomic policy:

  RAJA::forall< RAJA::cuda_exec<CUDA_BLOCK_SIZE> >(array_range,
    [=] RAJA_DEVICE(int i) {
    RAJA::atomicAdd<RAJA::auto_atomic>(&bins[array[i]], 1);
  });

The same CUDA loop execution policy as in the previous example is used.
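
Similarly, the OpenMP version shown earlier could use RAJA::auto_atomic in place of RAJA::omp_atomic with no other changes (a sketch illustrating the portability described above; this variant is not part of the tutorial listing):

  RAJA::forall<RAJA::omp_parallel_for_exec>(array_range, [=](int i) {
    RAJA::atomicAdd<RAJA::auto_atomic>(&bins[array[i]], 1);
  });

Here, auto_atomic resolves to the OpenMP atomic policy because the lambda executes in an OpenMP context. Whatever back-end is used, a simple sanity check (an assumed verification step, not part of the tutorial) is that the bin counts sum to N:

  int total = 0;
  for (int j = 0; j < M; ++j) {
    total += bins[j];  // every array element lands in exactly one bin
  }
  assert(total == N);  // requires <cassert>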

The file RAJA/examples/tut_atomic-histogram.cpp contains the complete working example code.