Resources

This section describes the basic concepts of resource types and how to use them with RAJA-based kernels launched via RAJA::forall, RAJA::kernel, RAJA::launch, etc. Resources provide an interface to the various RAJA back-ends and their respective hardware. Currently, resource types exist for Cuda, Hip, Omp (target), and Host. Resource objects allow one to allocate and deallocate storage in memory spaces associated with RAJA back-ends and to copy data between memory spaces. They also allow one to execute RAJA kernels asynchronously on a respective thread/stream. Resource support in RAJA is rudimentary at this point, and its functionality and behavior may change as it is developed.

Note

  • Currently, feature-complete asynchronous behavior and streamed/threaded support are available only for Cuda and Hip resources.

  • RAJA resource support is based on camp resource support. The RAJA::resources namespace aliases the camp::resources namespace.

Each resource type provides a common set of core methods whose behavior is analogous across resource types:

  • get_platform: Returns the underlying camp platform associated with the resource.

  • get_event: Returns an event object for the last operation enqueued on the resource.

  • allocate: Allocate data in the memory space associated with the resource back-end.

  • deallocate: Deallocate data in the memory space associated with the resource back-end.

  • memcpy: Perform a memory copy from a source location to a destination location on the resource back-end.

  • memset: Set the value of memory in an allocation on the resource back-end.

  • wait: Wait for all operations enqueued on the resource to complete before proceeding.

  • wait_for: Enqueue a wait on the resource's stream/thread for a user-supplied event to complete.

Note

The deallocate, memcpy, and memset operations only work with pointers that correspond to memory locations allocated on the resource's respective device.
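For example, here is a minimal sketch exercising several of these methods on a Host resource (the array size and variable names are illustrative):

RAJA::resources::Host host_res;

double* a = host_res.allocate<double>(100);   // allocate 100 doubles in host memory
host_res.memset(a, 0, sizeof(double) * 100);  // zero the allocation
auto platform = host_res.get_platform();      // underlying camp platform for this resource
host_res.wait();                              // block until all enqueued operations complete
host_res.deallocate(a);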

Each resource type also provides back-end-specific information and functionality. For example, each CUDA resource contains a cudaStream_t value with an associated get method. The basic interface for each resource type is summarized in Camp resource.
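For instance, assuming a CUDA build, the stream can be retrieved as follows (a sketch; get_stream() is the accessor provided by the underlying camp Cuda resource):

RAJA::resources::Cuda my_cuda_res;
cudaStream_t stream = my_cuda_res.get_stream();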

Note

Stream IDs are assigned to resources in a round-robin fashion. The number of independent streams for a given back-end is limited to the maximum number of concurrent streams that the back-end supports.
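For example, default-constructing multiple CUDA resources assigns each one the next stream in the rotation (a sketch; the stream identities in the comments are illustrative):

RAJA::resources::Cuda r1;  // assigned one stream from the pool
RAJA::resources::Cuda r2;  // assigned the next stream in the round-robin rotation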

Type-Erasure

Resources can be declared in two ways: as a type-erased resource or as a concrete resource. The underlying run-time functionality is the same for both.

Here is one way to construct a concrete CUDA resource type:

RAJA::resources::Cuda my_cuda_res;

A type-erased resource allows a user the ability to change the resource back-end at run time. For example, to choose a CUDA GPU device resource or host resource at run time, one could do the following:

RAJA::resources::Resource* my_res = nullptr;

if (use_gpu)
  my_res = new RAJA::resources::Resource{RAJA::resources::Cuda()};
else
  my_res = new RAJA::resources::Resource{RAJA::resources::Host()};

When use_gpu is true, my_res will be a CUDA GPU device resource. Otherwise, it will be a host CPU resource.
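If the concrete type is needed later, it can be recovered from the type-erased object; for example, a sketch using camp's get<T>() accessor, which assumes my_res actually holds a Cuda resource:

RAJA::resources::Cuda cuda_res = my_res->get<RAJA::resources::Cuda>();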

Memory Operations

The example discussed in this section illustrates most of the memory operations that can be performed with resource objects. A common use case for a resource is to manage arrays in the appropriate memory space for use in a kernel. Consider the following code example:

// create a resource for a host CPU and a CUDA GPU device
RAJA::resources::Resource host_res{RAJA::resources::Host()};
RAJA::resources::Resource cuda_res{RAJA::resources::Cuda()};

// allocate arrays in host memory and device memory
int N = 100;

int* host_array = host_res.allocate<int>(N);
int* gpu_array = cuda_res.allocate<int>(N);

// initialize values in host_array....

// initialize gpu_array values to zero
cuda_res.memset(gpu_array, 0, sizeof(int) * N);

// copy host_array values to gpu_array
cuda_res.memcpy(gpu_array, host_array, sizeof(int) * N);

// execute a CUDA kernel that uses gpu_array data
RAJA::forall<RAJA::cuda_exec<128>>(RAJA::TypedRangeSegment<int>(0, N),
  [=] RAJA_DEVICE(int i) {
    // modify values of gpu_array...
  }
);

// copy gpu_array values to host_array
cuda_res.memcpy(host_array, gpu_array, sizeof(int) * N);

// do something with host_array on CPU...

// de-allocate array storage
host_res.deallocate(host_array);
cuda_res.deallocate(gpu_array);

Here, we create a CUDA GPU device resource and a host CPU resource and use them to allocate an array in GPU memory and one in host memory, respectively. Then, after initializing the host array, we use the CUDA resource to copy the host array to the GPU array storage. Next, we run a CUDA device kernel which modifies the GPU array. After using the CUDA resource to copy the GPU array values into the host array, we can do something with the values generated in the GPU kernel on the CPU host. Lastly, we de-allocate the arrays.

Kernel Execution and Resources

Resources can be used with the following RAJA kernel execution interfaces:

  • RAJA::forall

  • RAJA::kernel

  • RAJA::launch

  • RAJA::sort

  • RAJA::scan

Although the examples in the following discussion mainly use RAJA::forall, resource usage with the other methods listed above is similar, as is the resulting behavior.

Usage

A resource can optionally be passed as the first argument in a call to any of these methods. For example:

RAJA::forall<ExecPol>(my_res, .... );

Note

When a resource is not passed when calling one of the methods listed above, the default resource type associated with the execution policy is used in the internal implementation.

When a CUDA or HIP resource is passed, the method will execute asynchronously on a GPU stream. Currently, CUDA and HIP are the only resource types that enable asynchronous execution.
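For example, here is a minimal sketch of an asynchronous kernel launch followed by a blocking wait on the resource (the block size and N are illustrative):

RAJA::resources::Cuda res;
int N = 100;

RAJA::forall<RAJA::cuda_exec_async<256>>(res, RAJA::TypedRangeSegment<int>(0, N),
  [=] RAJA_DEVICE (int i) {
    // work enqueued asynchronously on the stream held by res...
  }
);

res.wait();  // block the calling thread until the kernel completes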

Note

Support for OpenMP CPU multithreading, which would use the RAJA::resources::Host resource type, and OpenMP target offload which would use the RAJA::resources::Omp resource type, is incomplete and under development.

The resource type passed to one of the methods listed above must be a concrete type; i.e., not type-erased. This allows consistency checking via a compile-time assertion to ensure that the passed resource is compatible with the given execution policy. For example:

using ExecPol = RAJA::cuda_exec_async<BLOCK_SIZE>;

RAJA::resources::Cuda my_cuda_res;
RAJA::forall<ExecPol>(my_cuda_res, .... );  // compiles successfully

RAJA::resources::Resource my_res{RAJA::resources::Cuda()};
RAJA::forall<ExecPol>(my_res, .... );       // compilation error: resource type is not concrete

RAJA::resources::Host my_host_res;
RAJA::forall<ExecPol>(my_host_res, .... );  // compilation error: resource type is incompatible with the execution policy

IndexSet Usage

Recall that a kernel that uses a RAJA IndexSet to describe the kernel iteration space requires two execution policies (see RAJA IndexSet Execution Policies). Currently, a user may pass only a single resource to a method taking an IndexSet argument. The resource is used for the inner execution over each segment in the IndexSet, not for the outer iteration over segments. For example:

using ExecPol = RAJA::ExecPolicy<RAJA::seq_segit, RAJA::cuda_exec<256>>;
RAJA::forall<ExecPol>(my_cuda_res, iset,  .... );

Default Resources

When a resource is not provided by the user, a default resource that corresponds to the execution policy is used. The default resource can be accessed in multiple ways. It can be accessed directly from the concrete resource type:

RAJA::resources::Cuda my_default_cuda = RAJA::resources::Cuda::get_default();

The resource type can also be deduced in two different ways from an execution policy:

using Res = RAJA::resources::get_resource<ExecPol>::type;
Res r = Res::get_default();

Or:

auto my_resource = RAJA::resources::get_default_resource<ExecPol>();
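Passing the default resource explicitly is equivalent to omitting the resource argument; for example (a sketch, assuming N is defined):

using ExecPol = RAJA::cuda_exec<256>;

auto def_res = RAJA::resources::get_default_resource<ExecPol>();

// Both calls execute on the same default resource:
RAJA::forall<ExecPol>(def_res, RAJA::TypedRangeSegment<int>(0, N),
  [=] RAJA_DEVICE (int i) { /* ... */ });

RAJA::forall<ExecPol>(RAJA::TypedRangeSegment<int>(0, N),
  [=] RAJA_DEVICE (int i) { /* ... */ });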

Note

For CUDA and HIP, the default resource is NOT associated with the default CUDA or HIP stream. It is its own stream defined by the underlying camp resource. This is intentional, to avoid issues that arise from the synchronization behavior of the CUDA and HIP default streams. It is still possible to use the CUDA and HIP default streams as the default resource. This can be enabled by defining the environment variable CAMP_USE_PLATFORM_DEFAULT_STREAM before compiling RAJA in a project.

Events

Event objects allow users to wait or query the status of a resource’s action. An event can be returned from a resource:

RAJA::resources::Event e = my_res.get_event();

Getting an event like this enqueues an event object for the given back-end.

Users can call the blocking wait function on the event:

e.wait();

This wait call blocks the calling thread until all operations enqueued on the resource have completed.

Alternatively, a user can enqueue the event on a specific resource, forcing only the resource to wait for the operation associated with the event to complete:

my_res.wait_for(&e);

All of the resource methods listed near the beginning of this section return an event object, so users can access the event associated with each method call. This allows one to set up dependencies between resource objects and operations, as well as to define and control asynchronous execution patterns.

Note

An Event object is only created if a user explicitly assigns the event returned by the RAJA::forall call to a variable. This avoids creating unnecessary event objects when they are not needed. For example:

RAJA::forall<cuda_exec_async<BLOCK_SIZE>>(my_cuda_res, ...);

will not generate a CUDA stream event, whereas:

RAJA::resources::Event e = RAJA::forall<cuda_exec_async<BLOCK_SIZE>>(my_cuda_res, ...);

will generate a CUDA stream event.

Example

The example presented here executes three kernels across two CUDA streams on a GPU, with the requirement that the first and second kernels finish execution before the third is launched. It also shows how to copy memory from device to host using a resource, as described earlier.

First, we define two concrete CUDA resources and one concrete host resource, and define an asynchronous CUDA execution policy type:

  RAJA::resources::Cuda res_gpu1;
  RAJA::resources::Cuda res_gpu2;
  RAJA::resources::Host res_host;

  using EXEC_POLICY = RAJA::cuda_exec_async<GPU_BLOCK_SIZE>;

Next, we allocate data for two GPU arrays and one host array, all of length N:

  int* d_array1 = res_gpu1.allocate<int>(N);
  int* d_array2 = res_gpu2.allocate<int>(N);
  int* h_array  = res_host.allocate<int>(N);

Then, we launch a GPU kernel on the CUDA stream associated with the resource res_gpu1, without keeping a handle to the associated event:

  RAJA::forall<EXEC_POLICY>(res_gpu1, RAJA::RangeSegment(0,N),
    [=] RAJA_HOST_DEVICE (int i) {
      d_array1[i] = i;
    }
  );

Next, we execute another GPU kernel on the CUDA stream associated with the resource res_gpu2 and keep a handle to the corresponding event object by assigning it to a local variable e:

  RAJA::resources::Event e = RAJA::forall<EXEC_POLICY>(res_gpu2, RAJA::RangeSegment(0,N),
    [=] RAJA_HOST_DEVICE (int i) {
      d_array2[i] = -1;
    }
  );

We require the next kernel we launch to wait for the kernel launched on the stream associated with the resource res_gpu2 to complete. Therefore, we enqueue a wait on that event on the res_gpu1 resource:

  res_gpu1.wait_for(&e);

With the wait enqueued, we launch a second kernel on the stream associated with the resource res_gpu1; it will not begin executing until the kernel launched on res_gpu2 completes:

  RAJA::forall<EXEC_POLICY>(res_gpu1, RAJA::RangeSegment(0,N),
    [=] RAJA_HOST_DEVICE (int i) {
      d_array1[i] *= d_array2[i];
    }
  );

Next, we enqueue a memcpy operation on the resource res_gpu1 to copy the GPU array d_array1 to the host array h_array:

  res_gpu1.memcpy(h_array, d_array1, sizeof(int) * N);

Lastly, we use the copied data in a kernel executed on the host:

  bool check = true;
  RAJA::forall<RAJA::seq_exec>(res_host, RAJA::RangeSegment(0,N),
    [&check, h_array] (int i) {
      if (h_array[i] != -i) { check = false; }
    }
  );
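
Finally, the array storage can be released through the resources that allocated it, as in the earlier memory operations example (a short sketch mirroring the allocation calls above):

  res_gpu1.deallocate(d_array1);
  res_gpu2.deallocate(d_array2);
  res_host.deallocate(h_array);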