Computing a Histogram with Atomic Operations

Key RAJA features shown in this example:

  • RAJA::forall loop execution template
  • RAJA::TypedRangeSegment iteration space construct
  • RAJA atomic add operation

The example uses an integer array of length ‘N’ randomly initialized with values in the interval [0, M). While iterating over the array, the kernel counts the number of occurrences of each value using atomic add operations. Atomic operations allow multiple threads to update a memory location referenced by a specific address concurrently without data races. The example shows how to use RAJA portable atomic operations and that they are used in essentially the same way across programming model back-ends.
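For concreteness, the setup might look like the following sketch. The names ‘N’, ‘M’, ‘array’, and ‘bins’ match the description above, but the allocation and initialization shown here are an assumption, not a copy of the example file:

  constexpr int N = 100000;  // length of the data array (assumed value)
  constexpr int M = 10;      // number of bins; values lie in [0, M) (assumed value)

  int* array = new int[N];
  int* bins  = new int[M];

  for (int i = 0; i < N; ++i) {
    array[i] = std::rand() % M;  // random value in [0, M); requires <cstdlib>
  }
  for (int j = 0; j < M; ++j) {
    bins[j] = 0;                 // counts start at zero
  }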

Note

Each RAJA atomic operation requires an atomic policy type parameter that must be compatible with the execution policy for the kernel in which it is used.

For a complete description of supported RAJA atomic operations and atomic policies, please see Atomics.

All code snippets described below use the loop range:

  RAJA::TypedRangeSegment<int> array_range(0, N);

and the integer array ‘bins’ of length ‘M’ to accumulate the number of occurrences of each value in the array.
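For reference, a purely sequential version has the same form as the parallel variants below. This sketch pairs RAJA::seq_exec with the sequential atomic policy, which compiles to a plain (non-atomic) update since there is no concurrency:

  RAJA::forall<RAJA::seq_exec>(array_range, [=](int i) {

    RAJA::atomicAdd<RAJA::seq_atomic>(&bins[array[i]], 1);

  });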

Here is the OpenMP version:

  RAJA::forall<RAJA::omp_parallel_for_exec>(array_range, [=](int i) {

    RAJA::atomicAdd<RAJA::omp_atomic>(&bins[array[i]], 1);

  });

Each slot in the ‘bins’ array is incremented by one when a value associated with that slot is encountered. Note that the RAJA::atomicAdd operation uses an OpenMP atomic policy, which is compatible with the OpenMP loop execution policy.

The CUDA and HIP versions are similar:

  RAJA::forall< RAJA::cuda_exec<CUDA_BLOCK_SIZE> >(array_range,
    [=] RAJA_DEVICE(int i) {

    RAJA::atomicAdd<RAJA::cuda_atomic>(&bins[array[i]], 1);

  });

and:

  RAJA::forall< RAJA::hip_exec<HIP_BLOCK_SIZE> >(array_range, 
    [=] RAJA_DEVICE(int i) {

    RAJA::atomicAdd<RAJA::hip_atomic>(&d_bins[d_array[i]], 1);

  });

Here, the atomic add operations use CUDA and HIP atomic policies, which are compatible with the CUDA and HIP loop execution policies, respectively.

Note that RAJA provides an auto_atomic policy for easier usage and improved portability. This policy does the right thing in most circumstances. In host code, if OpenMP is enabled, the OpenMP atomic policy is used, which is also correct in a sequential execution context; otherwise, the sequential atomic policy is applied. Similarly, if the operation is encountered in a CUDA or HIP execution context, the corresponding GPU back-end atomic policy is applied.
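For instance, the OpenMP kernel shown earlier could be written with the ‘auto’ policy instead of RAJA::omp_atomic. This variant is a sketch and does not appear in the snippets above:

  RAJA::forall<RAJA::omp_parallel_for_exec>(array_range, [=](int i) {

    RAJA::atomicAdd<RAJA::auto_atomic>(&bins[array[i]], 1);

  });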

Likewise, here is the CUDA version that uses the ‘auto’ atomic policy:

  RAJA::forall< RAJA::cuda_exec<CUDA_BLOCK_SIZE> >(array_range, 
    [=] RAJA_DEVICE(int i) {

    RAJA::atomicAdd<RAJA::auto_atomic>(&bins[array[i]], 1);

  });

and the HIP version:

  RAJA::forall< RAJA::hip_exec<HIP_BLOCK_SIZE> >(array_range, 
    [=] RAJA_DEVICE(int i) {

    RAJA::atomicAdd<RAJA::auto_atomic>(&d_bins[d_array[i]], 1);

  });

The same CUDA and HIP loop execution policies as in the previous examples are used.
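Whichever back-end runs the kernel, the result can be validated the same way: every array element lands in exactly one bin, so the bin counts must sum to ‘N’. A minimal host-side sketch of such a check, assuming the histogram has been copied back to host memory where necessary:

  int total = 0;
  for (int j = 0; j < M; ++j) {
    total += bins[j];
  }
  assert(total == N);  // requires <cassert>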

The file RAJA/examples/tut_atomic-histogram.cpp contains the complete working example code.