Computing a Histogram with Atomic Operations

Key RAJA features shown in this example:

  • RAJA::forall loop execution template
  • RAJA::RangeSegment iteration space construct
  • RAJA atomic add operation

The example uses an integer array of length ‘N’ randomly initialized with values in the interval [0, M). While iterating over the array, the kernel accumulates the number of occurrences of each value in the array using atomic add operations. Atomic operations allow one to update a memory location referenced by a specific address in parallel without data races. The example shows how to use RAJA portable atomic operations and that they are used similarly for different programming model back-ends.

Note

Each RAJA reduction operation requires an atomic policy type parameter that must be compatible with the execution policy for the kernel in which it is used.

For a complete description of supported RAJA atomic operations and atomic policies, please see Atomics.

All code snippets described below use the loop range:

  RAJA::TypedRangeSegment<int> array_range(0, N);

and the integer array ‘bins’ of length ‘M’ to accumulate the number of occurrences of each value in the array.

Here is the OpenMP version:

  using EXEC_POL2 = RAJA::omp_parallel_for_exec;
  using ATOMIC_POL2 = RAJA::atomic::omp_atomic;

  RAJA::forall<EXEC_POL2>(array_range, [=](int i) {
                          
    RAJA::atomic::atomicAdd<ATOMIC_POL2>(&bins[array[i]], 1);
                                           
  });

Each slot in the ‘bins’ array is incremented by one when a value associated with that slot is encountered. Note that the RAJA::atomic::atomicAdd operation uses an OpenMP atomic policy, which is compatible with the OpenMP loop execution policy.

The CUDA version is similar:

  using EXEC_POL4 = RAJA::cuda_exec<CUDA_BLOCK_SIZE>;
  using ATOMIC_POL4 = RAJA::atomic::cuda_atomic;

  RAJA::forall<EXEC_POL4>(array_range, [=] RAJA_DEVICE(int i) {
                          
    RAJA::atomic::atomicAdd<ATOMIC_POL4>(&bins[array[i]], 1);
                                                 
  });

Here, the atomic add operation uses a CUDA atomic policy, which is compatible with the CUDA loop execution policy.

Note that RAJA provides an auto_atomic policy for easier usage and improved portability. This policy will do the right thing in most circumstances. Specifically, if it is encountered in a CUDA execution context, the CUDA atomic policy will be applied. If OpenMP is enabled, the OpenMP atomic policy will be used, which should be correct in a sequential execution context as well. Otherwise, the sequential atomic policy will be applied.

For example, here is the CUDA version that uses the ‘auto’ atomic policy:

  using ATOMIC_POL5 = RAJA::atomic::auto_atomic;

  RAJA::forall<EXEC_POL4>(array_range, [=] RAJA_DEVICE(int i) {

    RAJA::atomic::atomicAdd<ATOMIC_POL5>(&bins[array[i]], 1);

  });

The same CUDA loop execution policy as in the previous example is used.

The file RAJA/examples/tut_atomic-binning.cpp contains the complete working example code.