Atomic Operations: Computing a Histogram

This section contains an exercise file RAJA/exercises/atomic-histogram.cpp for you to work through if you wish to get some practice with RAJA. The file RAJA/exercises/atomic-histogram_solution.cpp contains complete working code for the examples discussed in this section. You can use the solution file to check your work and for guidance if you get stuck. To build the exercises, execute make atomic-histogram and make atomic-histogram_solution from the build directory.

Key RAJA features shown in this exercise are:

  • RAJA::forall kernel execution template and execution policies

  • RAJA::TypedRangeSegment iteration space construct

  • RAJA atomic add operation and RAJA atomic operation policies

The example uses an integer array of length ‘N’ randomly initialized with values in the interval [0, M).

  constexpr int M = 20;
  constexpr int N = 100000;

  int* array = memoryManager::allocate<int>(N);
  int* hist = memoryManager::allocate<int>(M);

  for (int i = 0; i < N; ++i) { 
    array[i] = rand() % M;
  }
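
Note that the counts must start from zero, so the ‘hist’ array should be zero-initialized before any of the kernels below run (a one-line sketch, assuming the allocation does not return zeroed memory):

  std::memset(hist, 0, M * sizeof(int));  // zero all M bins; from <cstring>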

Each kernel iterates over the array and accumulates the number of occurrences of each value in [0, M) in another array named ‘hist’. The kernels use atomic operations for the accumulation, which allow multiple threads to update the memory location referenced by a given address concurrently without data races. The example shows how to use RAJA portable atomic operations and illustrates that they are used similarly across programming model back-ends.

Note

Each RAJA atomic operation requires an atomic policy type parameter that must be compatible with the execution policy for the kernel in which it is used. This is similar to the reduction policies we described in Sum Reduction: Vector Dot Product.

For a complete description of supported RAJA atomic operations and atomic policies, please see Atomic Operations.

All code snippets described below use the stride-1 iteration space range:

  RAJA::TypedRangeSegment<int> array_range(0,N);
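
For reference, a sequential RAJA variant has the same shape as the back-end versions below (a minimal sketch, assuming the RAJA::seq_exec execution policy and matching RAJA::seq_atomic atomic policy):

  RAJA::forall<RAJA::seq_exec>(array_range, [=](int i) {
    // seq_atomic pairs with the sequential execution policy
    RAJA::atomicAdd<RAJA::seq_atomic>(&hist[array[i]], 1);
  });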

Here is the OpenMP version:

  RAJA::forall<RAJA::omp_parallel_for_exec>(array_range, [=](int i) {
    RAJA::atomicAdd<RAJA::omp_atomic>(&hist[array[i]], 1);
  });

One is added to a slot in the ‘hist’ array each time a value associated with that slot is encountered. Note that the RAJA::atomicAdd operation uses an OpenMP atomic policy, which is compatible with the OpenMP kernel execution policy.
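
One way to check the result is to recompute the histogram serially and compare (a hypothetical verification loop, not part of the exercise code):

  // Hypothetical check: serial recount, then bin-by-bin comparison.
  int* hist_ref = memoryManager::allocate<int>(M);
  std::memset(hist_ref, 0, M * sizeof(int));
  for (int i = 0; i < N; ++i) {
    hist_ref[array[i]]++;
  }
  for (int b = 0; b < M; ++b) {
    assert(hist[b] == hist_ref[b]);  // from <cassert>
  }
  memoryManager::deallocate(hist_ref);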

The CUDA and HIP versions are similar:

  RAJA::forall<RAJA::cuda_exec<CUDA_BLOCK_SIZE>>(array_range, [=] RAJA_DEVICE (int i) {
    RAJA::atomicAdd<RAJA::cuda_atomic>(&hist[array[i]], 1);
  });

and:

  RAJA::forall<RAJA::hip_exec<HIP_BLOCK_SIZE>>(array_range, [=] RAJA_DEVICE (int i) {
    RAJA::atomicAdd<RAJA::hip_atomic>(&hist[array[i]], 1);
  });

Here, the atomic add operations use CUDA and HIP atomic policies, which are compatible with the CUDA and HIP kernel execution policies, respectively.

Note that RAJA provides an auto_atomic policy for easier usage and improved portability. This policy selects the proper atomic implementation for the execution context in which the kernel runs. Specifically, in a CUDA or HIP execution context, the corresponding GPU back-end atomic policy is applied. Otherwise, when OpenMP is enabled, the OpenMP atomic policy is used, which is correct in either a sequential or an OpenMP execution context. If OpenMP is not enabled, the sequential atomic policy is applied.

For example, here is the CUDA version that uses the ‘auto’ atomic policy:

  RAJA::forall<RAJA::cuda_exec<CUDA_BLOCK_SIZE>>(array_range, [=] RAJA_DEVICE (int i) {
    RAJA::atomicAdd<RAJA::auto_atomic>(&hist[array[i]], 1);
  });

and the HIP version:

  RAJA::forall<RAJA::hip_exec<HIP_BLOCK_SIZE>>(array_range, [=] RAJA_DEVICE (int i) {
    RAJA::atomicAdd<RAJA::auto_atomic>(&hist[array[i]], 1);
  });

The same CUDA and HIP kernel execution policies as in the previous examples are used.
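
A practical consequence of using auto_atomic is that a single kernel body can be shared across back-ends. For instance, one could wrap the kernel in a function templated on the execution policy (a hypothetical helper, not part of the exercise code):

  // Hypothetical helper: the same body can run with seq_exec,
  // omp_parallel_for_exec, cuda_exec, or hip_exec, and auto_atomic
  // picks the atomic implementation matching the execution context.
  template <typename ExecPol>
  void computeHistogram(const int* array, int* hist, int N)
  {
    RAJA::forall<ExecPol>(RAJA::TypedRangeSegment<int>(0, N),
      [=] RAJA_HOST_DEVICE (int i) {
        RAJA::atomicAdd<RAJA::auto_atomic>(&hist[array[i]], 1);
      });
  }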