Policies¶
RAJA kernel execution methods take an execution policy type template parameter to specialize execution behavior. Typically, the policy indicates which programming model back-end to use and other information about the execution pattern, such as number of CUDA threads per thread block, whether execution is synchronous or asynchronous, etc. This section describes RAJA policies for loop kernel execution, scans, sorts, reductions, atomics, etc. Please detailed examples in RAJA Tutorial and Examples for a variety of use cases.
As RAJA functionality evolves, new policies are added and some may be redefined and to work in new ways.
Note
All RAJA policies are in the namespace
RAJA.All RAJA policies have a prefix indicating the back-end implementation that they use; e.g.,
omp_for OpenMP,cuda_for CUDA, etc.
RAJA Loop/Kernel Execution Policies¶
The following tables summarize RAJA policies for executing kernels. Please see notes below policy descriptions for additional usage details and caveats.
Sequential CPU Policies¶
For the sequential CPU back-end, RAJA provides policies that allow developers to have some control over the optimizations that compilers are allow to apply during code compilation.
Sequential/SIMD Execution Policies
Works with
Brief description
seq_launch_t
launch
Creates a sequential execution space.
seq_exec
forall, kernel (For), scan, sort
Sequential execution, where the compiler is allowed to apply any any optimizations that its heuristics deem beneficial; i.e., no loop decorations (pragmas or intrinsics) in the RAJA implementation.
simd_exec
forall, kernel (For), scan
Try to force generation of SIMD instructions via compiler hints in RAJA’s internal implementation.
OpenMP Parallel CPU Policies¶
For the OpenMP CPU multithreading back-end, RAJA has policies that can be used by themselves to execute kernels. In particular, they create an OpenMP parallel region and execute a kernel within it. To distinguish these in this discussion, we refer to these as full policies. These policies are provided to users for convenience in common use cases.
RAJA also provides other OpenMP policies, which we refer to as partial policies, since they need to be used in combination with other policies. Typically, they work by providing an outer policy and an inner policy as a template parameter to the outer policy. These give users flexibility to create more complex execution patterns.
Note
To control the number of threads used by OpenMP policies, set the value of the environment variable ‘OMP_NUM_THREADS’ (which is fixed for duration of run), or call the OpenMP routine ‘omp_set_num_threads(nthreads)’ in your application, which allows one to change the number of threads at run time.
The full policies are described in the following table. Partial policies are described in other tables below.
OpenMP CPU Full Policies
Works with
Brief description
omp_parallel_for_exec
forall, kernel (For), launch (loop), scan, sort
Same as applying the OpenMP pragma ‘omp parallel for schedule(auto)’
omp_parallel_for_static_exec<ChunkSize>
forall, kernel (For)
Same as applying ‘omp parallel for schedule(static, ChunkSize)’
omp_parallel_for_dynamic_exec<ChunkSize>
forall, kernel (For)
Same as applying ‘omp parallel for schedule(dynamic, ChunkSize)’
omp_parallel_for_guided_exec<ChunkSize>
forall, kernel (For)
Same as applying ‘omp parallel for schedule(guided, ChunkSize)’
omp_parallel_for_runtime_exec
forall, kernel (For)
Same as applying ‘omp parallel for schedule(runtime)’
Note
For the OpenMP scheduling policies above that take a ChunkSize
parameter, the chunk size is optional. If not provided, the
default chunk size that OpenMP applies will be used, which may
be specific to the OpenMP implementation in use. For this case,
the RAJA policy syntax is
omp_parallel_for_{static|dynamic|guided}_exec< >, which will
result in the OpenMP pragma
omp parallel for schedule({static|dynamic|guided}) being applied.
RAJA provides an (outer) OpenMP CPU policy to create a parallel region in which to execute a kernel. It requires an inner policy that defines how a kernel will execute in parallel inside the region.
OpenMP CPU Outer Policies
Works with
Brief description
omp_launch_t
launch
Creates an OpenMP parallel region. Same as applying ‘omp parallel’ pragma.
omp_parallel_exec<InnerPolicy>
forall, kernel (For), scan
Creates OpenMP parallel region and requires an InnerPolicy. Same as applying ‘omp parallel’ pragma.
Finally, we summarize the inner policies that RAJA provides for OpenMP.
These policies are passed to the RAJA omp_parallel_exec outer policy as
a template argument as described above.
OpenMP CPU Inner Policies
Works with
Brief description
omp_for_exec
forall, kernel (For), launch (loop) scan
Parallel execution within existing parallel region, specifically apply the OpenMP pragma ‘omp for schedule (auto)’ pragma.
omp_for_static_exec<ChunkSize>
forall, kernel (For)
Same as applying ‘omp for schedule(static, ChunkSize)’
omp_for_nowait_static_exec<ChunkSize>
forall, kernel (For)
Same as applying ‘omp for schedule(static, ChunkSize) nowait’
omp_for_dynamic_exec<ChunkSize>
forall, kernel (For)
Same as applying ‘omp for schedule(dynamic, ChunkSize)’
omp_for_guided_exec<ChunkSize>
forall, kernel (For)
Same as applying ‘omp for schedule(guided, ChunkSize)’
omp_for_runtime_exec
forall, kernel (For)
Same as applying ‘omp for schedule(runtime)’
omp_parallel_collapse_exec
kernel (Collapse + ArgList)
Use in Collapse statement to parallelize multiple loop levels in loop nest indicated using ArgList
Important
RAJA only provides a nowait policy option for static scheduling since that is the only schedule case that can be used with nowait and be correct in general when executing multiple loops in a single parallel region. Paraphrasing the OpenMP standard: programs that depend on which thread executes a particular loop iteration under any circumstance other than static schedule are non-conforming.
Note
As in the RAJA full policies for OpenMP scheduling, the ChunkSize
is optional. If not provided, the default chunk size that the OpenMP
implementation applies will be used.
Note
As noted above, RAJA inner OpenMP policies must be used within an
existing parallel region to work properly. Embedding an inner
policy inside the RAJA outer omp_parallel_exec will allow you to
apply the OpenMP execution prescription specified by the policies to
a single kernel. To support use cases with multiple kernels inside an
OpenMP parallel region, RAJA provides a region construct that
takes a template argument to specify the execution back-end. For
example:
RAJA::region<RAJA::omp_parallel_region>([=]() {
RAJA::forall<RAJA::omp_for_nowait_static_exec< > >(segment,
[=] (int idx) {
// do something at iterate 'idx'
}
);
RAJA::forall<RAJA::omp_for_static_exec< > >(segment,
[=] (int idx) {
// do something else at iterate 'idx'
}
);
});
Here, the RAJA::region<RAJA::omp_parallel_region> method call
creates an OpenMP parallel region, which contains two RAJA::forall
kernels. The first uses the RAJA::omp_for_nowait_static_exec< >
policy, meaning that no thread synchronization is needed after the
kernel. Thus, threads can start working on the second kernel while
others are still working on the first kernel. In general, this will
be correct when the iteration segments used in the two kernels are
the same and each kernel is data parallel. Static scheduling is
applied to both kernels. The second kernel uses the
RAJA::omp_for_static_exec policy (NO ‘no wait’ clause), which
means that all threads will complete before the kernel exits. In
this example, this is not really needed since there is no
more code to execute in the parallel region and the
RAJA::omp_parallel_region construct applies a barrier
at the end of it.
GPU Policies for CUDA and HIP¶
RAJA policies for GPU execution using CUDA or HIP are essentially identical.
The only difference is that CUDA policies have the prefix cuda_ and HIP
policies have the prefix hip_.
CUDA/HIP Execution Policies |
Works with |
Brief description |
|---|---|---|
cuda/hip_exec<BLOCK_SIZE> |
forall, scan, sort |
Execute loop iterations directly mapped to global threads in a GPU kernel launched with given threadblock size and unbounded grid size. Note that the threadblock size must be provided. There is no default. |
cuda/hip_exec_with_reduce<BLOCK_SIZE> |
forall |
The cuda/hip exec policy recommended for use with kernels containing reductions. In general, using the occupancy calculator policies improves performance of kernels with reductions. Exactly how much occupancy to use differs by platform. This policy provides a simple way to get what works well for a platform without having to know the details. |
cuda/hip_exec_base<with_reduce, BLOCK_SIZE> |
forall |
Choose between cuda/hip_exec and cuda/hip_exec_with_reduce policies based on the boolean template parameter ‘with_reduce’ |
cuda/hip_exec_grid<BLOCK_SIZE, GRID_SIZE> |
forall |
Execute loop iterations mapped to global threads via grid striding with multiple iterations per global thread in a GPU kernel launched with given thread-block size and grid size. Note that the thread-block size and grid size must be provided, there is no default. |
cuda/hip_exec_occ_max<BLOCK_SIZE> |
forall |
Execute loop iterations mapped to global threads via grid striding with multiple iterations per global thread in a GPU kernel launched with given thread-block size and grid size bounded by the maximum occupancy of the kernel. |
cuda/hip_exec_occ_calc<BLOCK_SIZE> |
forall |
Similar to the occ_max policy but may use less than the maximum occupancy determined by the occupancy calculator of the kernel for performance reasons. |
cuda/hip_exec_occ_fraction<BLOCK_SIZE, Fraction<size_t, numerator, denominator>> |
forall |
Similar to the occ_max policy but use a fraction of the maximum occupancy of the kernel. |
cuda/hip_exec_occ_custom<BLOCK_SIZE, Concretizer> |
forall |
Similar to the occ_max policy policy but the grid size is is determined by concretizer. |
cuda/hip_launch_t |
launch |
Launches a device kernel, any code inside the lambda expression is executed on the device. |
cuda/hip_thread_x_direct_unchecked |
kernel (For) launch (loop) |
Map loop iterates directly without checking loop bounds to GPU threads in x-dimension, one iterate per thread. See note below about limitations. |
cuda/hip_thread_y_direct_unchecked |
kernel (For) launch (loop) |
Same as above, but map to threads in y-dimension. |
cuda/hip_thread_z_direct_unchecked |
kernel (For) launch (loop) |
Same as above, but map to threads in z-dimension. |
cuda/hip_thread_x_direct |
kernel (For) launch (loop) |
Map loop iterates directly to GPU threads in x-dimension, one or no iterates per thread. See note below about limitations. |
cuda/hip_thread_y_direct |
kernel (For) launch (loop) |
Same as above, but map to threads in y-dimension. |
cuda/hip_thread_z_direct |
kernel (For) launch (loop) |
Same as above, but map to threads in z-dimension. |
cuda/hip_thread_x_loop |
kernel (For) launch (loop) |
Similar to thread-x-direct policy, but use a block-stride loop which doesn’t limit total number of loop iterates. |
cuda/hip_thread_y_loop |
kernel (For) launch (loop) |
Same as above, but for threads in y-dimension. |
cuda/hip_thread_z_loop |
kernel (For) launch (loop) |
Same as above, but for threads in z-dimension. |
cuda/hip_thread_syncable_loop<dims…> |
kernel (For) launch (loop) |
Similar to thread-loop policy, but safe to use with Cuda/HipSyncThreads. |
cuda/hip_thread_size_x_direct_unchecked<nx_threads> |
kernel (For) launch (loop) |
Same as thread_x_direct_unchecked policy above but with a compile time number of threads. |
cuda/hip_thread_size_y_direct_unchecked<ny_threads> |
kernel (For) launch (loop) |
Same as above, but map to threads in y-dimension |
cuda/hip_thread_size_z_direct_unchecked<nz_threads> |
kernel (For) launch (loop) |
Same as above, but map to threads in z-dimension. |
cuda/hip_thread_size_x_direct<nx_threads> |
kernel (For) launch (loop) |
Same as thread_x_direct policy above but with a compile time number of threads. |
cuda/hip_thread_size_y_direct<ny_threads> |
kernel (For) launch (loop) |
Same as above, but map to threads in y-dimension |
cuda/hip_thread_size_z_direct<nz_threads> |
kernel (For) launch (loop) |
Same as above, but map to threads in z-dimension. |
cuda/hip_flatten_threads_{xyz}_direct_unchecked |
launch (loop) |
Reshapes threads in a multi-dimensional thread team into one-dimension. Accepts any permutation of one, two, or three dimensions. |
cuda/hip_flatten_threads_{xyz}_direct |
launch (loop) |
Same as above, but with direct mapping. |
cuda/hip_flatten_threads_{xyz}_loop |
launch (loop) |
Same as above, but with loop mapping. |
cuda/hip_block_x_direct_unchecked |
kernel (For) launch (loop) |
Map loop iterates directly without checking loop bounds to GPU thread blocks in the x-dimension, one iterate per block. |
cuda/hip_block_y_direct_unchecked |
kernel (For) launch (loop) |
Same as above, but map to blocks in y-dimension |
cuda/hip_block_z_direct_unchecked |
kernel (For) launch (loop) |
Same as above, but map to blocks in z-dimension |
cuda/hip_block_x_direct |
kernel (For) launch (loop) |
Map loop iterates directly to GPU thread blocks in the x-dimension, one or no iterates per block. |
cuda/hip_block_y_direct |
kernel (For) launch (loop) |
Same as above, but map to blocks in y-dimension |
cuda/hip_block_z_direct |
kernel (For) launch (loop) |
Same as above, but map to blocks in z-dimension |
cuda/hip_block_x_loop |
kernel (For) launch (loop) |
Similar to block-x-direct policy, but use a grid-stride loop. |
cuda/hip_block_y_loop |
kernel (For) launch (loop) |
Same as above, but use blocks in y-dimension |
cuda/hip_block_z_loop |
kernel (For) launch (loop) |
Same as above, but use blocks in z-dimension |
cuda/hip_block_size_x_direct_unchecked<nx_blocks> |
kernel (For) launch (loop) |
Same as block_x_direct_unchecked policy above but with a compile time number of blocks |
cuda/hip_block_size_y_direct_unchecked<ny_blocks> |
kernel (For) launch (loop) |
Same as above, but map to blocks in y-dim |
cuda/hip_block_size_z_direct_unchecked<nz_blocks> |
kernel (For) launch (loop) |
Same as above, but map to blocks in z-dim |
cuda/hip_block_size_x_direct<nx_blocks> |
kernel (For) launch (loop) |
Same as block_x_direct policy above but with a compile time number of blocks |
cuda/hip_block_size_y_direct<ny_blocks> |
kernel (For) launch (loop) |
Same as above, but map to blocks in y-dim |
cuda/hip_block_size_z_direct<nz_blocks> |
kernel (For) launch (loop) |
Same as above, but map to blocks in z-dim |
cuda/hip_block_size_x_loop<nx_blocks> |
kernel (For) launch (loop) |
Same as block_x_loop policy above but with a compile time number of blocks |
cuda/hip_block_size_y_loop<ny_blocks> |
kernel (For) launch (loop) |
Same as above, but map to blocks in y-dim |
cuda/hip_block_size_z_loop<nz_blocks> |
kernel (For) launch (loop) |
Same as above, but map to blocks in z-dim |
cuda/hip_global_x_direct_unchecked |
kernel (For) launch (loop) |
Map loop iterates directly without checking loop bounds to GPU threads in the grid in the x-dimension, one iterate per thread. Creates a unique thread id for each thread on the x-dimension of the grid. Same as computing threadIdx.x + threadDim.x * blockIdx.x. |
cuda/hip_global_y_direct_unchecked |
kernel (For) launch (loop) |
Same as above, but uses globals in y-dimension. |
cuda/hip_global_z_direct_unchecked |
kernel (For) launch (loop) |
Same as above, but uses globals in z-dimension. |
cuda/hip_global_x_direct |
kernel (For) launch (loop) launch (loop) |
Same as global_x_direct_unchecked above, but maps loop iterates directly to GPU threads in the grid, one or no iterates per thread. |
cuda/hip_global_y_direct |
kernel (For) launch (loop) |
Same as above, but uses globals in y-dimension. |
cuda/hip_global_z_direct |
kernel (For) launch (loop) |
Same as above, but uses globals in z-dimension. |
cuda/hip_global_x_loop |
kernel (For) launch (loop) |
Similar to global-x-direct policy, but use a grid-stride loop. |
cuda/hip_global_y_loop |
kernel (For) launch (loop) |
Same as above, but use globals in y-dimension |
cuda/hip_global_z_loop |
kernel (For) launch (loop) |
Same as above, but use globals in z-dimension |
cuda/hip_global_size_x_direct_unchecked<nx_threads> |
kernel (For) launch (loop) |
Same as global_x_direct_unchecked policy above but with a compile time block size. |
cuda/hip_global_size_y_direct_unchecked<ny_threads> |
kernel (For) launch (loop) |
Same as above, but map to globals in y-dim |
cuda/hip_global_size_z_direct_unchecked<nz_threads> |
kernel (For) launch (loop) |
Same as above, but map to globals in z-dim |
cuda/hip_global_size_x_direct<nx_threads> |
kernel (For) launch (loop) |
Same as global_x_direct policy above but with a compile time block size |
cuda/hip_global_size_y_direct<ny_threads> |
kernel (For) launch (loop) |
Same as above, but map to globals in y-dim |
cuda/hip_global_size_z_direct<nz_threads> |
kernel (For) launch (loop) |
Same as above, but map to globals in z-dim |
cuda/hip_global_size_x_loop<nx_threads> |
kernel (For) launch (loop) |
Same as global_x_loop policy above but with a compile time block size. |
cuda/hip_global_size_y_loop<ny_threads> |
kernel (For) launch (loop) |
Same as above, but map to globals in y-dim |
cuda/hip_global_size_z_loop<nz_threads> |
kernel (For) launch (loop) |
Same as above, but map to globals in z-dim |
cuda/hip_warp_direct_unchecked |
kernel (For) |
Map work to threads in a warp directly without checking loop bounds. Cannot be used in conjunction with cuda/hip_thread_x_* policies. Multiple warps can be created by using cuda/hip_thread_y/z_* policies. |
cuda/hip_warp_direct |
kernel (For) |
Similar to warp_direct_unchecked, but map work to threads in a warp directly. |
cuda/hip_warp_loop |
kernel (For) |
Similar to warp_direct, but map work to threads in a warp using a warp-stride loop. |
cuda/hip_warp_masked_direct<BitMask<..>> |
kernel (For) |
Mmap work directly to threads in a warp using a bit mask. Cannot be used with cuda/hip_thread_x_* policies. Multiple warps can be created by using cuda/hip_thread_y/z_* policies. |
cuda/hip_warp_masked_loop<BitMask<..>> |
kernel (For) |
Map work to threads in a warp using a bit mask and a warp- stride loop. Cannot be used with cuda/hip_thread_x_* policies. Multiple warps can be created by using cuda/hip_thread_y/z_* policies. |
cuda/hip_block_reduce |
kernel (Reduce) |
Perform a reduction across a single GPU thread block. |
cuda/hip_warp_reduce |
kernel (Reduce) |
Perform a reduction across a single GPU thread warp. thread warp. |
When a CUDA or HIP policy leaves parameters like the block size and/or grid size
unspecified a concretizer object is used to decide those parameters. The
following concretizers are available to use in the cuda/hip_exec_occ_custom
policies:
Execution Policy |
Brief description |
|---|---|
Cuda/HipDefaultConcretizer |
The default concretizer, expected to provide good performance in general. Note that it may not use max occupancy. |
Cuda/HipRecForReduceConcretizer |
Expected to provide good performance in loops with reducers. Note that it may not use max occupancy. |
Cuda/HipMaxOccupancyConcretizer |
Uses max occupancy. |
Cuda/HipAvoidDeviceMaxThreadOccupancyConcretizer |
Avoids using the max occupancy of the device in terms of threads. Note that it may use the max occupancy of the kernel if that is below the max occupancy of the device. |
Cuda/HipFractionOffsetOccupancyConcretizer< Fraction<size_t, numerator, denomenator>, BLOCKS_PER_SM_OFFSET> |
Uses a fraction and offset to choose an occupancy based on the max occupancy Using the following formula: (Fraction * kernel_max_blocks_per_sm + BLOCKS_PER_SM_OFFSET) * sm_per_device |
Several notable constraints apply to RAJA CUDA/HIP direct_unchecked policies.
Note
DirectUnchecked policies do not mask out threads that are out-of-range. So they should only be used when the size of the range matches the size of the block or grid.
Repeating direct_unchecked policies with the same dimension in perfectly nested loops is not recommended. Your code may do something, but likely will not do what you expect and/or be correct.
If multiple direct_unchecked policies are used in a kernel (using different dimensions), the product of sizes of the corresponding iteration spaces cannot be greater than the maximum allowable threads per block or blocks per grid. Typically, this is 1024 threads per block. Attempting to execute a kernel with more than the maximum allowed causes the CUDA/HIP runtime to complain about illegal launch parameters.
Block-direct-unchecked policies are recommended for most tiled loop patterns. In these cases the CUDA/HIP kernel is launched with the exact number of blocks needed so no checking is necessary.
Several notable constraints apply to RAJA CUDA/HIP direct policies.
Note
Direct policies mask out threads that are out-of-range. So they should only be used when the size of the range is less than or equal to the size of the block or grid.
Repeating direct policies with the same dimension in perfectly nested loops is not recommended. Your code may do something, but likely will not do what you expect and/or be correct.
If multiple direct policies are used in a kernel (using different dimensions), the product of sizes of the corresponding iteration spaces cannot be greater than the maximum allowable threads per block or blocks per grid. Typically, this is 1024 threads per block. Attempting to execute a kernel with more than the maximum allowed causes the CUDA/HIP runtime to complain about illegal launch parameters.
Global-direct-sized policies are recommended for most loop patterns, but may be inappropriate for kernels using block level synchronization.
Thread-direct policies are recommended only for certain loop patterns, such as block tilings that produce small fixed size iteration spaces within each block.
Several notes regarding CUDA/HIP loop policies are also good to know.
Note
Loop policies perform a block or grid stride loop. So they can be used when the size of the range exceeds the size of the block or grid.
There is no constraint on the product of sizes of the associated loop iteration space.
These polices allow having a larger number of iterates than threads/blocks in the x, y, or z dimension.
The cuda/hip_thread_loop policies are not safe to use with Cuda/HipSyncThreads, use the cuda/hip_thread_syncable_loop<dims…> policies instead. For example cuda_thread_x_loop -> cuda_thread_syncable_loop<named_dim::x>.
CUDA/HIP loop policies are recommended for some loop patterns where a large or unknown sized iteration space is mapped to a small or fixed number of threads.
Finally
Note
CUDA/HIP block-direct-unchecked or block-direct policies may be preferable to block-loop policies in situations where block load balancing may be an issue as the block-direct-unchecked or block-direct policies may yield better performance.
Several notes regarding the CUDA/HIP policy implementation that allow you to write more explicit policies.
Note
Policies are a class template like cuda/hip_exec_explicit or cuda/hip_indexer. The various template parameters specify the behavior of the policy.
Policies have a mapping from loop iterations to iterates in the index set via a iteration_mapping enum template parameter. The possible values are DirectUnchecked, Direct, and StridedLoop.
Policies can be safely used with some synchronization constructs via a kernel_sync_requirement enum template parameter. The possible values are none and sync.
Policies get their indices via an iteration getter class template like cuda/hip::IndexGlobal.
Iteration getters can be used with different dimensions via the named_dim enum. The possible values are x, y and z.
Iteration getters know the number of threads per block (block_size) and number of blocks per grid (grid_size) via integer template parameters. These can be positive integers, in this case they must match the number used in the kernel launch. These can also be values of the named_usage enum. The possible values are unspecified and ignored. For example in cuda_thread_x_direct block_size is unspecified so a runtime number of threads is used, but grid_size is ignored so blocks are ignored when getting indices.
GPU Policies for SYCL¶
Note
SYCL uses C++-style ordering for its work group and global thread dimension/indexing types. This is due, in part, to SYCL’s closer alignment with C++ multi-dimensional indexing, which is “row-major”. This is the reverse of the thread indexing used in CUDA or HIP, which is “column-major”. For example, suppose we have a thread-block or work-group where we specify the shape as (nx, ny, nz). Consider an element in the thread-block or work-group with id (x, y, z). In CUDA or HIP, the element index is x + y * nx + z * nx * ny. In SYCL, the element index is z + y * nz + x * nz * ny.
In terms of the CUDA or HIP built-in variables to support threads, we have:
Thread ID: threadIdx.x/y/z
Block ID: blockIdx.x/y/z
Block dimension: blockDim.x/y/z
Grid dimension: gridDim.x/y/z
The analogues in SYCL are:
Thread ID: sycl::nd_item.get_local_id(2/1/0)
Work-group ID: sycl::nd_item.get_group(2/1/0)
Work-group dimensions: sycl::nd_item.get_local_range().get(2/1/0)
ND-range dimensions: sycl::nd_item.get_group_range(2/1/0)
When using RAJA::launch, thread and block configuration
follows CUDA and HIP programming models and is always
configured in three-dimensions. This means that SYCL dimension
2 always exists and should be used as one would use the
x dimension for CUDA and HIP.
Similarly, RAJA::kernel uses a three-dimensional work-group
configuration. SYCL dimension 2 always exists and should be used as
one would use the x dimension in CUDA and HIP.
SYCL Execution Policies |
Works with |
Brief description |
|---|---|---|
sycl_exec<WORK_GROUP_SIZE> |
forall, |
Execute loop iterations in a GPU kernel launched with given work group size. |
sycl_launch_t |
launch |
Launches a sycl kernel, any code express within the lambda is executed on the device. |
sycl_global_0<WORK_GROUP_SIZE> |
kernel (For) |
Map loop iterates directly to GPU global ids in first dimension, one iterate per work item. Group execution into work groups of given size. |
sycl_global_1<WORK_GROUP_SIZE> |
kernel (For) |
Same as above, but map to global ids in second dim |
sycl_global_2<WORK_GROUP_SIZE> |
kernel (For) |
Same as above, but map to global ids in third dim |
sycl_global_item_0 |
launch (loop) |
Creates a unique thread id for each thread for dimension 0 of the grid. Same as computing itm.get_group(0) * itm.get_local_range(0) + itm.get_local_id(0). |
sycl_global_item_1 |
launch (loop) |
Same as above, but uses threads in dimension 1 Same as computing itm.get_group(1) + itm.get_local_range(1) * itm.get_local_id(1). |
sycl_global_item_2 |
launch (loop) |
Same as above, but uses threads in dimension 2 Same as computing itm.get_group(2) + itm.get_local_range(2) * itm.get_local_id(2). |
sycl_local_0_direct |
kernel (For) launch (loop) |
Map loop iterates directly to GPU work items in first dimension, one iterate per work item (see note below about limitations) |
sycl_local_1_direct |
kernel (For) launch (loop) |
Same as above, but map to work items in second dim |
sycl_local_2_direct |
kernel (For) launch (loop) |
Same as above, but map to work items in third dim |
sycl_local_0_loop |
kernel (For) launch (loop) |
Similar to local-1-direct policy, but use a work group-stride loop which doesn’t limit number of loop iterates |
sycl_local_1_loop |
kernel (For) launch (loop) |
Same as above, but for work items in second dimension |
sycl_local_2_loop |
kernel (For) launch (loop) |
Same as above, but for work items in third dimension |
sycl_group_0_direct |
kernel (For) launch (loop) |
Map loop iterates directly to GPU group ids in first dimension, one iterate per group |
sycl_group_1_direct |
kernel (For) launch (loop) |
Same as above, but map to groups in second dimension |
sycl_group_2_direct |
kernel (For) launch (loop) |
Same as above, but map to groups in third dimension |
sycl_group_0_loop |
kernel (For) launch (loop) |
Similar to group-1-direct policy, but use a group-stride loop. |
sycl_group_1_loop |
kernel (For) launch (loop) |
Same as above, but use groups in second dimension |
sycl_group_2_loop |
kernel (For) launch (loop) |
Same as above, but use groups in third dimension |
OpenMP Target Offload Policies¶
RAJA provides policies to use OpenMP to offload kernel execution to a GPU device, for example. They are summarized in the following table.
OpenMP Target Execution Policies
Works with
Brief description
omp_target_parallel_for_exec<#>
forall, kernel(For)
Create parallel target region and execute with given number of threads per team inside it. Number of teams is calculated internally; i.e., apply
omp teams distribute parallel for num_teams(iteration space size/#) thread_limit(#)pragmaomp_target_parallel_collapse_exec
kernel (Collapse)
Similar to above, but collapse perfectly-nested loops, indicated in arguments to RAJA Collapse statement. Note: compiler determines number of thread teams and threads per team
RAJA IndexSet Execution Policies¶
When an IndexSet iteration space is used in RAJA by passing an IndexSet
to a RAJA::forall method, for example, an index set execution policy is
required. An index set execution policy is a two-level policy: an ‘outer’
policy for iterating over segments in the index set, and an ‘inner’ policy
used to execute the iterations defined by each segment. An index set execution
policy type has the form:
RAJA::ExecPolicy< segment_iteration_policy, segment_execution_policy >
In general, any policy that can be used with a RAJA::forall method
can be used as the segment execution policy. The following policies are
available to use for the outer segment iteration policy:
Execution Policy |
Brief description |
|---|---|
Serial |
|
seq_segit |
Iterate over index set segments sequentially. |
OpenMP CPU multithreading |
|
omp_parallel_segit |
Create OpenMP parallel region and
iterate over segments in parallel inside it; i.e., apply |
omp_parallel_for_segit |
Same as above. |
Parallel Region Policies¶
Earlier, we discussed using the RAJA::region construct to
execute multiple kernels in an OpenMP parallel region. To support source code
portability, RAJA provides a sequential region concept that can be used to
surround code that uses execution back-ends other than OpenMP. For example:
RAJA::region<RAJA::seq_region>([=]() {
RAJA::forall<RAJA::seq_exec>(segment, [=] (int idx) {
// do something at iterate 'idx'
} );
RAJA::forall<RAJA::seq_exec>(segment, [=] (int idx) {
// do something else at iterate 'idx'
} );
});
Note
The sequential region specialization is essentially a pass through operation. It is provided so that if you want to turn off OpenMP in your code, for example, you can simply replace the region policy type and you do not have to change your algorithm source code.
Reduction Policies¶
Each RAJA reduction object must be defined with a ‘reduction policy’ type. Reduction policy types are distinct from loop execution policy types. It is important to note the following constraints about RAJA reduction usage:
Note
To guarantee correctness, a reduction policy must be consistent with the loop execution policy used. For example, a CUDA reduction policy must be used when the execution policy is a CUDA policy, an OpenMP reduction policy must be used when the execution policy is an OpenMP policy, and so on.
The following table summarizes RAJA reduction policy types:
Reduction Policy |
Loop Policies to Use With |
Brief description |
|---|---|---|
seq_reduce |
seq_exec, |
Non-parallel (sequential) reduction. |
omp_reduce |
any OpenMP policy |
OpenMP parallel reduction. |
omp_reduce_ordered |
any OpenMP policy |
OpenMP parallel reduction with result guaranteed to be reproducible. |
omp_target_reduce |
any OpenMP target policy |
OpenMP parallel target offload reduction. |
cuda/hip_reduce |
any CUDA/HIP policy |
Parallel reduction in a CUDA/HIP kernel (device synchronization will occur when reduction value is finalized). |
cuda/hip_reduce_atomic |
any CUDA/HIP policy |
Same as above, but reduction may use atomic operations leading to run to run variability in the results. |
cuda/hip_reduce_base<with_atomic> |
any CUDA/HIP policy |
Choose between cuda/hip_reduce and cuda/hip_reduce_atomic policies based on the with_atomic boolean. |
cuda/hip_reduce_device_fence |
any CUDA/HIP policy |
Same as above, and reduction uses normal memory accesses that are not visible across the whole device and device scope fences to ensure visibility and ordering. This works on all architectures but incurs higher overheads on some architectures. |
cuda/hip_reduce_block_fence |
any CUDA/HIP policy |
Same as above, and reduction uses special memory accesses to a level of cache visible to the whole device and block scope fences to ensure ordering. This improves performance on some architectures. |
cuda/hip_reduce_atomic_host_init_device_fence |
any CUDA/HIP policy |
Same as above with device fence, but initializes the memory used for atomics on the host. This works well on recent architectures and incurs lower overheads. |
cuda/hip_reduce_atomic_host_init_block_fence |
any CUDA/HIP policy |
Same as above with block fence, but initializes the memory used for atomics on the host. This works well on recent architectures and incurs lower overheads. |
cuda/hip_reduce_atomic_device_init_device_fence |
any CUDA/HIP policy |
Same as above with device fence, but initializes the memory used for atomics on the device. This works on all architectures but incurs higher overheads. |
cuda/hip_reduce_atomic_device_init_block_fence |
any CUDA/HIP policy |
Same as above with block fence, but initializes the memory used for atomics on the device. This works on all architectures but incurs higher overheads. |
sycl_reduce |
any SYCL policy |
Reduction in a SYCL kernel (device synchronization will occur when the reduction value is finalized). |
Note
RAJA reductions used with SIMD execution policies are not guaranteed to generate correct results. So they should not be used for kernels containing reductions.
MultiReduction Policies¶
Each RAJA multi-reduction object must be defined with a ‘multi-reduction policy’ type. Multi-reduction policy types are distinct from loop execution policy types. It is important to note the following constraints about RAJA multi-reduction usage:
Note
To guarantee correctness, a multi-reduction policy must be compatible with the loop execution policy used. For example, a CUDA multi-reduction policy must be used when the execution policy is a CUDA policy, an OpenMP multi-reduction policy must be used when the execution policy is an OpenMP policy, and so on.
The following table summarizes RAJA multi-reduction policy types:
MultiReduction Policy |
Loop Policies to Use With |
Brief description |
|---|---|---|
seq_multi_reduce |
seq_exec, |
Non-parallel (sequential) multi-reduction. |
omp_multi_reduce |
any OpenMP policy |
OpenMP parallel multi-reduction. |
omp_multi_reduce_ordered |
any OpenMP policy |
OpenMP parallel multi-reduction with result guaranteed to be reproducible. |
cuda/hip_multi_reduce_atomic |
any CUDA/HIP policy |
Parallel multi-reduction in a CUDA/HIP kernel. Multi-reduction may use atomic operations leading to run to run variability in the results. (device synchronization will occur when reduction value is finalized) |
cuda/hip_multi_reduce_atomic_low_performance_low_overhead |
any CUDA/HIP policy |
Same as above, but multi-reduction uses a low overhead algorithm with a minimal set of resources. This minimally effects the performance of loops containing the multi-reducer though it may cause the multi-reducer itself to perform poorly if it is used. |
cuda/hip_multi_reduce_atomic_block_then_atomic_grid_host_init |
any CUDA/HIP policy |
The multi-reduction uses atomics into shared memory and global memory. Atomics into shared memory are used each time a value is combined into the multi-reducer and at the end of the life of the block the shared values are combined into global memory with atomics. If there is not enough shared memory available this will fall back to using atomics into global memory only, which may have a performance penalty. The memory for global atomics is initialized on the host. |
cuda/hip_multi_reduce_atomic_global_host_init |
any CUDA/HIP policy |
The multi-reduction uses atomics into global global memory only. Atomics into global memory are used each time a value is combined into the multi-reducer. The memory for global atomics is initialized on the host. |
cuda/hip_multi_reduce_atomic_global_no_replication_host_init |
any CUDA/HIP |
Same as above, but uses minimal memory by not replicating global atomics. |
Note
RAJA multi-reductions used with SIMD execution policies are not guaranteed to generate correct results. So they should not be used for kernels containing multi-reductions.
Atomic Policies¶
Each RAJA atomic operation must be defined with an ‘atomic policy’ type. Atomic policy types are distinct from loop execution policy types.
Note
An atomic policy type must be consistent with the loop execution policy for the kernel in which the atomic operation is used. The following table summarizes RAJA atomic policies and usage.
Atomic Policy |
Loop Policies to Use With |
Brief description |
|---|---|---|
seq_atomic |
seq_exec, |
Atomic operation performed in a non-parallel (sequential) kernel. |
omp_atomic |
any OpenMP policy |
Atomic operation in OpenMP
multithreading or target kernel;
i.e., apply |
cuda/hip/sycl_atomic |
any CUDA/HIP/SYCL policy |
Atomic operation performed in a CUDA/HIP/SYCL kernel. |
cuda/hip_atomic_explicit |
any CUDA/HIP policy |
Atomic operation performed in a CUDA/HIP kernel that may also be used in a host execution context. The atomic policy takes a host atomic policy template argument. See additional explanation and example below. |
builtin_atomic |
seq_exec, any OpenMP policy |
Compiler builtin atomic operation. |
auto_atomic |
seq_exec, any OpenMP policy, any CUDA/HIP/SYCL policy |
Atomic operation compatible with loop execution policy. See example below. Cannot be used inside CUDA or HIP explicit atomic policies. |
Note
The cuda_atomic_explicit and hip_atomic_explicit policies
take a host atomic policy template parameter. They are intended to
be used with kernels that are host-device decorated to be used in
either a host or device execution context.
Here is an example illustrating use of the cuda_atomic_explicit policy:
auto kernel = [=] RAJA_HOST_DEVICE (RAJA::Index_type i) {
RAJA::atomicAdd< RAJA::cuda_atomic_explicit<omp_atomic> >(&sum, 1);
};
RAJA::forall< RAJA::cuda_exec<BLOCK_SIZE> >(RAJA::TypedRangeSegment<int> seg(0, N), kernel);
RAJA::forall< RAJA::omp_parallel_for_exec >(RAJA::TypedRangeSegment<int> seg(0, N), kernel);
In this case, the atomic operation knows when it is compiled for the device in a CUDA kernel context and the CUDA atomic operation is applied. Similarly when it is compiled for the host in an OpenMP kernel the omp_atomic policy is used and the OpenMP version of the atomic operation is applied.
Here is an example illustrating use of the auto_atomic policy:
RAJA::forall< RAJA::cuda_exec<BLOCK_SIZE> >(RAJA::TypedRangeSegment<int> seg(0, N),
[=] RAJA_DEVICE (RAJA::Index_type i) {
RAJA::atomicAdd< RAJA::auto_atomic >(&sum, 1);
});
In this case, the atomic operation knows that it is used in a CUDA kernel context and the CUDA atomic operation is applied. Similarly, if an OpenMP execution policy was used, the OpenMP version of the atomic operation would be used.
Note
The builtin_atomic policy may be preferable to the
omp_atomic policy in terms of performance.
Local Array Memory Policies¶
RAJA::LocalArray types must use a memory policy indicating
where the memory for the local array will live. These policies are described
in Local Array.
The following memory policies are available to specify memory allocation
for RAJA::LocalArray objects:
RAJA::cpu_tile_mem- Allocate CPU memory on the stack
RAJA::cuda/hip_shared_mem- Allocate CUDA or HIP shared memory
RAJA::cuda/hip_thread_mem- Allocate CUDA or HIP thread private memory
RAJA Kernel Execution Policies¶
RAJA kernel execution policy constructs form a simple domain specific language
for composing and transforming complex loops that relies
solely on standard C++17 template support.
RAJA kernel policies are constructed using a combination of Statements and
Statement Lists. A RAJA Statement is an action, such as execute a loop,
invoke a lambda, set a thread barrier, etc. A StatementList is an ordered list
of Statements that are composed in the order that they appear in the kernel
policy to construct a kernel. A Statement may contain an enclosed StatementList. Thus, a RAJA::KernelPolicy type is really just a StatementList.
The main Statement types provided by RAJA are RAJA::statement::For and
RAJA::statement::Lambda, that we discussed in
Complex Loops (RAJA::kernel).
A RAJA::statement::For<ArgID, ExecPolicy, Enclosed Statements> type
indicates a for-loop structure. The ArgID parameter is an integral constant
that identifies the position of the iteration space in the iteration space
tuple passed to the RAJA::kernel method to be used for the loop. The
ExecPolicy is the RAJA execution policy to use on the loop, which is
similar to RAJA::forall usage. The EnclosedStatements type is a
nested template parameter that contains whatever is needed to execute the
kernel and which forms a valid StatementList. The
RAJA::statement::Lambda<LambdaID>
type invokes the lambda expression corresponding to its position ‘LambdaID’
in the sequence of lambda expressions in the RAJA::kernel argument list.
For example, a simple sequential for-loop:
for (int i = 0; i < N; ++i) {
// loop body
}
can be represented using the RAJA kernel interface as:
using KERNEL_POLICY =
RAJA::KernelPolicy<
RAJA::statement::For<0, RAJA::seq_exec,
RAJA::statement::Lambda<0>
>
>;
RAJA::kernel<KERNEL_POLICY>(
RAJA::make_tuple(range),
[=](int i) {
// loop body
}
);
Note
All RAJA::forall functionality can be done using the
RAJA::kernel interface. We maintain the RAJA::forall
interface since it is less verbose and thus more convenient
for users.
RAJA::kernel Statement Types¶
The list below summarizes the current collection of statement types that
can be used with RAJA::kernel and RAJA::kernel_param. More detailed
explanation along with examples of how they are used can be found in
the RAJA::kernel examples in RAJA Tutorial and Examples.
Note
All of the statement types described below are in the namespace
RAJA::statement. For brevity, we omit the namespaces in
the discussion in this section.
Note
RAJA::kernel_param functions similarly to RAJA::kernel
except that the second argument is a tuple of parameters used
in a kernel for local arrays, thread local variables, tiling
information, etc.
Several RAJA statements can be specialized with auxiliary types, which are described in Auxilliary Types.
The following list contains the most commonly used statement types.
For< ArgId, ExecPolicy, EnclosedStatements >abstracts a for-loop associated with kernel iteration space at tuple indexArgId, to be run withExecPolicyexecution policy, and containing theEnclosedStatementswhich are executed for each loop iteration.Lambda< LambdaId >invokes the lambda expression that appears at position ‘LambdaId’ in the sequence of lambda arguments. With this statement, the lambda expression must accept all arguments associated with the tuple of iteration space segments and tuple of parameters (if kernel_param is used).Lambda< LambdaId, Args...>extends the Lambda statement. The second template parameter indicates which arguments (e.g., which segment iteration variables) are passed to the lambda expression.Collapse< ExecPolicy, ArgList<...>, EnclosedStatements >collapses multiple perfectly nested loops specified by tuple iteration space indices inArgList, using theExecPolicyexecution policy, and placesEnclosedStatementsinside the collapsed loops which are executed for each iteration. Note that this only works for CPU execution policies (e.g., sequential, OpenMP). It may be available for CUDA in the future if such use cases arise.
There is one statement specific to OpenMP kernels.
OmpSyncThreadsapplies the OpenMP#pragma omp barrierdirective.
Statement types that launch CUDA or HIP GPU kernels are listed next. They work
similarly for each back-end and their names are distinguished by the prefix
Cuda or Hip. For example, CudaKernel or HipKernel.
Cuda/HipKernel< EnclosedStatements>launchesEnclosedStatementsas a GPU kernel; e.g., a loop nest where the iteration spaces of each loop level are associated with threads and/or thread blocks as described by the execution policies applied to them. This kernel launch is synchronous.Cuda/HipKernelAsync< EnclosedStatements>asynchronous version of Cuda/HipKernel.Cuda/HipKernelFixed<num_threads, EnclosedStatements>similar to Cuda/HipKernel but enables a fixed number of threads (specified by num_threads). This kernel launch is synchronous.Cuda/HipKernelFixedAsync<num_threads, EnclosedStatements>asynchronous version of Cuda/HipKernelFixed.CudaKernelFixedSM<num_threads, min_blocks_per_sm, EnclosedStatements>similar to CudaKernelFixed but enables a minimum number of blocks per sm (specified by min_blocks_per_sm), this can help increase occupancy. This kernel launch is synchronous. Note: there is no HIP variant of this statement.CudaKernelFixedSMAsync<num_threads, min_blocks_per_sm, EnclosedStatements>asynchronous version of CudaKernelFixedSM. Note: there is no HIP variant of this statement.Cuda/HipKernelOcc<EnclosedStatements>similar to CudaKernel but uses the CUDA occupancy calculator to determine the optimal number of threads/blocks. Statement is intended for use with RAJA::cuda/hip_block_{xyz}_loop policies. This kernel launch is synchronous.Cuda/HipKernelOccAsync<EnclosedStatements>asynchronous version of Cuda/HipKernelOcc.Cuda/HipKernelExp<num_blocks, num_threads, EnclosedStatements>similar to CudaKernelOcc but with the flexibility to fix the number of threads and/or blocks and let the CUDA occupancy calculator determine the unspecified values. This kernel launch is synchronous.Cuda/HipKernelExpAsync<num_blocks, num_threads, EnclosedStatements>asynchronous version of Cuda/HipKernelExp.Cuda/HipSyncThreadsinvokes CUDA or HIP__syncthreads()barrier.Cuda/HipSyncWarpinvokes CUDA__syncwarp()barrier. Warp sync is not supported in HIP, so the HIP variant is a no-op.
Statement types that launch SYCL kernels are listed next.
SyclKernel<EnclosedStatements>launchesEnclosedStatementsas a SYCL kernel. This kernel launch is synchronous.SyclKernelAsync<EnclosedStatements>asynchronous version of SyclKernel.
RAJA provides statements to define loop tiling which can improve performance; e.g., by allowing CPU cache blocking or use of GPU shared memory.
Tile< ArgId, TilePolicy, ExecPolicy, EnclosedStatements >abstracts an outer tiling loop containing an inner for-loop over each tile. TheArgIdindicates which entry in the iteration space tuple to which the tiling loop applies and theTilePolicyspecifies the tiling pattern to use, including its dimension. TheExecPolicyandEnclosedStatementsare similar to what they represent in astatement::Fortype.TileTCount< ArgId, ParamId, TilePolicy, ExecPolicy, EnclosedStatements >abstracts an outer tiling loop containing an inner for-loop over each tile, where it is necessary to obtain the tile number in each tile. TheArgIdindicates which entry in the iteration space tuple to which the loop applies and theParamIdindicates the position of the tile number in the parameter tuple. TheTilePolicyspecifies the tiling pattern to use, including its dimension. TheExecPolicyandEnclosedStatementsare similar to what they represent in astatement::Fortype.ForICount< ArgId, ParamId, ExecPolicy, EnclosedStatements >abstracts an inner for-loop within an outer tiling loop where it is necessary to obtain the local iteration index in each tile. TheArgIdindicates which entry in the iteration space tuple to which the loop applies and theParamIdindicates the position of the tile index parameter in the parameter tuple. TheExecPolicyandEnclosedStatementsare similar to what they represent in astatement::Fortype.
It is often advantageous to use local arrays for data accessed in tiled loops. RAJA provides a statement for allocating data in a Local Array object according to a memory policy. See Local Array Memory Policies for more information about such policies.
InitLocalMem< MemPolicy, ParamList<...>, EnclosedStatements >allocates memory for aRAJA::LocalArrayobject used in kernel. TheParamListentries indicate which local array objects in a tuple will be initialized. TheEnclosedStatementscontain the code in which the local array will be accessed; e.g., initialization operations.
RAJA provides some statement types that apply in specific kernel scenarios.
Reduce< ReducePolicy, Operator, ParamId, EnclosedStatements >reduces a value across threads in a multithreaded code region to a single thread. TheReducePolicyis similar to what it represents for RAJA reduction types.ParamIdspecifies the position of the reduction value in the parameter tuple passed to theRAJA::kernel_parammethod.Operatoris the binary operator used in the reduction; typically, this will be one of the operators that can be used with RAJA scans (see RAJA Scan Operators). After the reduction is complete, theEnclosedStatementsexecute on the thread that received the final reduced value.If< Conditional >chooses which portions of a policy to run based on run-time evaluation of conditional statement; e.g., true or false, equal to some value, etc.Hyperplane< ArgId, HpExecPolicy, ArgList<...>, ExecPolicy, EnclosedStatements >provides a hyperplane (or wavefront) iteration pattern over multiple indices. A hyperplane is a set of multi-dimensional index values: i0, i1, … such that h = i0 + i1 + … for a given h. Here,ArgIdis the position of the loop argument we will iterate on (defines the order of hyperplanes),HpExecPolicyis the execution policy used to iterate over the iteration space specified by ArgId (often sequential),ArgListis a list of other indices that along with ArgId define a hyperplane, andExecPolicyis the execution policy that applies to the loops inArgList. Then, for each iteration, everything in theEnclosedStatementsis executed.
Auxilliary Types¶
The following list summarizes auxiliary types used in the above statements. These
types live in the RAJA namespace.
tile_fixed<TileSize>tile policy argument to aTileorTileTCountstatement; partitions loop iterations into tiles of a fixed size specified byTileSize. This statement type can be used as theTilePolicytemplate parameter in theTilestatements above.
tile_dynamic<ParamIdx>TilePolicy argument to a Tile or TileTCount statement; partitions loop iterations into tiles of a size specified by aTileSize{}positional parameter argument. This statement type can be used as theTilePolicytemplate parameter in theTilestatements above.
Segs<...>argument to a Lambda statement; used to specify which segments in a tuple will be used as lambda arguments.
Offsets<...>argument to a Lambda statement; used to specify which segment offsets in a tuple will be used as lambda arguments.
Params<...>argument to a Lambda statement; used to specify which params in a tuple will be used as lambda arguments.
ValuesT<T, ...>argument to a Lambda statement; used to specify compile time constants, of type T, that will be used as lambda arguments.
Examples that show how to use a variety of these statement types can be found in Complex Loops (RAJA::kernel).