In this section, we describe the basics of RAJA workgroups. RAJA::WorkPool, RAJA::WorkGroup, and RAJA::WorkSite class templates comprise the RAJA interface for grouped loop execution. RAJA::WorkPool takes a set of simple loops (e.g., non-nested loops) and instantiates a RAJA::WorkGroup. RAJA::WorkGroup represents an executable form of those loops and when run makes a RAJA::WorkSite. RAJA::WorkSite holds all of the resources used for a single run of the loops. Be aware that the RAJA workgroup constructs API is still being developed and may change in later RAJA releases.


  • All workgroup constructs are in the namespace RAJA.
  • The RAJA::WorkPool, RAJA::WorkGroup, and RAJA::WorkSite class templates are templated on:
    • a WorkGroup policy which is composed of:
      • a work execution policy
      • a work ordering policy
      • a work storage policy
      • a work dispatch policy
    • an index type that is the first argument to the loop bodies.
    • a list of extra argument types that are the rest of the arguments to the loop bodies.
    • an allocator type to be used for the memory used to store and manage the loop bodies.
  • The RAJA::WorkPool::enqueue method takes two arguments:
    • an iteration space object, and
    • a lambda expression representing the loop body.
  • Multi-dimensional loops can be used with RAJA::CombiningAdapter see, Multi-dimensional loops using simple loop APIs (RAJA::CombiningAdapter).

Examples showing how to use RAJA workgroup methods may be found in the RAJA Tutorial and Examples.

For more information on RAJA work policies and iteration space constructs, see Policies and Indices, Segments, and IndexSets, respectively.


The behavior of the RAJA workgroup constructs is determined by a policy. The RAJA::WorkGroupPolicy has four components, a work execution policy, a work ordering policy, a work storage policy, and a work dispatch policy. RAJA::WorkPool, RAJA::WorkGroup, and RAJA::WorkSite class templates all take the same policy and template arguments. For example:

using workgroup_policy = RAJA::WorkGroupPolicy <
                             RAJA::indirect_function_call_dispatch >;

is a workgroup policy that will run loops sequentially on the host in the order they were enqueued, stores the loop bodies sequentially in single buffer in memory, and dispatches each loop using a function pointer.

The work execution policy acts like the execution policies used with RAJA::forall and determines the backend used to run the loops and the parallelism within each loop.

Work Execution Policies Brief description
seq_work Execute loop iterations strictly sequentially.
simd_work Execute loop iterations sequentially and try to force generation of SIMD instructions via compiler hints in RAJA internal implementation.
loop_work Execute loop iterations sequentially and allow compiler to generate any optimizations.
omp_work Execute loop iterations in parallel using OpenMP.
tbb_work Execute loop iterations in parallel using TBB.
cuda_work<BLOCK_SIZE>, Execute loop iterations in parallel
cuda_work_async<BLOCK_SISZE> using a CUDA kernel launched with given thread-block size.
omp_target_work Execute loop iterations in parallel using OpenMP target.

The work ordering policy acts like the segment iteration execution policies when RAJA::forall is used with a RAJA::IndexSet and determines the backend used when iterating over the loops and the parallelism between each loop.

Work Ordering Policies Brief description
ordered Execute loops sequentially in the order they were enqueued using forall.
reverse_ordered Execute loops sequentially in the reverse of the order order they were enqueued using forall.
unordered_cuda_loop_y_block_iter_x_threadblock_average Execute loops in parallel by mapping each loop to a set of cuda blocks with the same index in the y direction in a cuda kernel. Each loop is given a number of threads over one of more blocks in the x direction equal to the average number of iterations of all the loops rounded up to a multiple of the block size.

The work storage policy determines the strategy used to allocate and layout the storage used to store the ranges, loop bodies, and other data necessary to implement the workstorage constructs.

Work Storage Policies Brief description
array_of_pointers Store loop data in individual allocations and keep an array of pointers to the individual loop data allocations.
ragged_array_of_objects Store loops sequentially in a single allocation, reallocating and moving the loop data items as needed, and keep an array of offsets to the individual loop data items.
constant_stride_array_of_objects Store loops sequentially in a single allocation with a consistent stride between loop data items, reallocating and/or changing the stride and moving the loop data items as needed.

The work dispatch policy determines the technique used to dispatch from type erased storage to the loops or iterations of each range and loop body pair.

Work Dispatch Policies Brief description
indirect_function_call_dispatch Dispatch using function pointers.
indirect_virtual_function_dispatch Dispatch using virtual functions in a class hierarchy.
direct_dispatch< Dispatch using a switch statement like
camp::list<Range, Callable>…> coding to pick the right pair of Range and Callable types from the template parameter pack. You may only enqueue a range and callable pair that is in the list of types in the policy.


The next two template arguments to the workgroup constructs determine the call signature of the loop bodies that may be added to the workgroup. The first is an index type which is the first parameter in the call signature. Next is a list of types called RAJA::xargs, short for extra arguments, that gives the rest of the types of the parameters in the call signature. The values of the extra arguments are passed in when the loops are run, see WorkGroup. For example:

int, RAJA::xargs<>

can be used with lambdas with the following signature:

[=](int) { ... }


int, RAJA::xargs<int*, double>

can be used with lambdas with the following signature:

[=](int, int*, double) { ... }


The last template argument to the workgroup constructs is an allocator type that conforms to the allocator named requirement used in the standard library. This gives you control over how memory is allocated, for example with umpire, and what memory space is used, both of which have poerformance implications. Find the requirements for allocator types along with a simple example here The default allocator used by the standard template library may be used with ordered and non-GPU policies:

using Allocator = std::allocator<char>;


  • The allocator type must use template argument char.

  • Allocators must provide memory that is accessible where it is used.
    • Ordered work order policies only require memory that is accessible where loop bodies are enqueued.

    • Unordered work order policies require memory that is accessible from both where the loop bodies are enqueued and from where the loop is executed based on the work execution policy.

      For example, when using cuda work exeution policies with CUDA unordered work order policies, pinned memory is a good choice because it is always accessible on the host and device.


The RAJA::WorkPool class template holds a set of simple (e.g., non-nested) loops that are enqueued one at a time. Note that simple multi-dimensional loops can be adapted into simple loops via RAJA::CombiningAdapter, see Multi-dimensional loops using simple loop APIs (RAJA::CombiningAdapter). For example, to enqueue a C-style loop that adds two vectors, like:

for (int i = 0; i < N; ++i) {
  c[i] = a[i] + b[i];

is as simple as calling enqueue on a RAJA::WorkPool object and passing the same arguments you would pass to RAJA::forall.:

using WorkPool_type = RAJA::WorkPool< workgroup_policy,
                                      int, RAJA::xargs<>,
                                      Allocator >;
WorkPool_type workpool(Allocator{});

workpool.enqueue(RAJA::RangeSegment(0, N), [=] (int i) {
  c[i] = a[i] + b[i];

Note that WorkPool may have to allocate and reallocate multiple times to store a set of loops depending on the work storage policy. Reallocation can be avoided by reserving enough memory before adding any loops.:

workpool.reserve(num_loops, storage_bytes);

Here num_loops is the number of loops to allocate space for and num_storage_bytes is the amount of storage to allocate. These may be used differently depending on the work storage policy. The number of loops enqueued in a RAJA::WorkPool and the amount of storage used may be queried using:

size_t num_loops     = workpool.num_loops();
size_t storage_bytes = workpool.storage_bytes();

Storage will automatically reserved when reusing a RAJA::WorkPool` object based on the maximum seen values for num_loops and storage_bytes.

When you’ve added all the loops you want to the set, you can call instantiate on the RAJA::WorkPool to generate a RAJA::WorkGroup.:

WorkGroup_type workgroup = workpool.instantiate();


The RAJA::WorkGroup class template is responsible for hanging onto the set of loops and running the loops. The RAJA::WorkGroup owns its loops and must not be destroyed before any loops run asynchronously using it have completed. It is instantiated from a RAJA::WorkPool object which transfers ownership of a set of loops to the RAJA::WorkGroup and prepares the loops to be run. For example:

using WorkGroup_type = RAJA::WorkGroup< workgroup_policy,
                                        int, RAJA::xargs<>,
                                        Allocator >;
WorkGroup_type workgroup = workpool.instantiate();

creates a RAJA::WorkGroup workgroup from the loops in workpool and leaves workpool empty and ready for reuse. When you want to run the loops simply call run on workgroup and pass in the extra arguments:

WorkSite_type worksite =;

In this case no extra arguments were passed to run because the RAJA::WorkGroup specified no extra arguments RAJA::xargs<>. Passing extra arguments when the loops are run lets you delay creation of those arguments until you plan to run the loops. This lets the value of the arguments depend on the loops in the set. A simple example of this may be found in the tutorial here RAJA Tutorial and Examples. Run produces a RAJA::WorkSite object.


The RAJA::WorkSite class template is responsible for extending the lifespan of objects used when running loops asynchronously. This means that the RAJA::WorkSite object must remain alive until the call to run has been synchronized. For example the scoping here:

  using WorkSite_type = RAJA::WorkSite< workgroup_policy,
                                        int, RAJA::xargs<>,
                                        Allocator >;
  WorkSite_type worksite =;

  // do other things


ensures that worksite survives until after synchronize is called.