.. ## .. ## Copyright (c) Lawrence Livermore National Security, LLC and other .. ## RAJA Project Developers. See top-level LICENSE and COPYRIGHT .. ## files for dates and other details. No copyright assignment is required .. ## to contribute to RAJA. .. ## .. ## SPDX-License-Identifier: (BSD-3-Clause) .. ## .. _feat-resource-label: ========= Resources ========= This section describes the basic concepts of resource types and how to use them with RAJA-based kernels using ``RAJA::forall``, ``RAJA::kernel``, `` RAJA::launch``, etc. Resources are used as an interface to various RAJA back-end constructs and their respective hardware. Currently there exist resource types for ``Cuda``, ``Hip``, ``Omp`` (target) and ``Host``. Resource objects allow one to allocate and deallocate storage in memory spaces associated with RAJA back-ends and copy data between memory spaces. They also allow one to execute RAJA kernels asynchronously on a respective thread/stream. Resource support in RAJA is rudimentary at this point and its functionality / behavior may change as it is developed. .. note:: * Currently feature complete asynchronous behavior and streamed/threaded support is available only for ``Cuda`` and ``Hip`` resources. * RAJA resource support is based on camp resource support. The ``RAJA::resources`` namespace aliases the ``camp::resources`` namespace. Each resource has a set of underlying functionality that is synonymous across each resource type. ===================== =============================================== Methods Brief description ===================== =============================================== get_platform Returns the underlying camp platform associated with the resource. get_event Return an event object for the resource from the last resource call. allocate Allocate data on a resource back-end. deallocate Deallocate data on a resource back-end. memcpy Perform a memory copy from a source location to a destination location on a resource back-end. memset Set memory value in an allocation on a resource back-end. wait Wait for all operations enqueued on a resource to complete before proceeding. wait_for Enqueue a wait on a resource stream/thread for a user passed event to complete. ===================== =============================================== .. note:: ``deallocate``, ``memcpy`` and ``memset`` operations only work with pointers that correspond to memory locations that have been allocated on the resource's respective device. Each resource type also defines specific back-end information/functionality. For example, each CUDA resource contains a ``cudaStream_t`` value with an associated get method. The basic interface for each resource type is summarized in `Camp resource `_. .. note:: Stream IDs are assigned to resources in a round robin fashion. The number of independent streams for a given back-end is limited to the maximum number of concurrent streams that the back-end supports. ------------ Type-Erasure ------------ Resources can be declared in two ways, as a type-erased resource or as a concrete resource. The underlying run time functionality is the same for both. Here is one way to construct a concrete CUDA resource type:: RAJA::resources::Cuda my_cuda_res; A type-erased resource allows a user the ability to change the resource back-end at run time. For example, to choose a CUDA GPU device resource or host resource at run time, one could do the following:: RAJA::resources::Resource* my_res = nullptr; if (use_gpu) my_res = new RAJA::resources::Resource{RAJA::resources::Cuda()}; else my_res = new RAJA::resources::Resource{RAJA::resources::Host()}; When ``use_gpu`` is true, ``my_res`` will be a CUDA GPU device resource. Otherwise, it will be a host CPU resource. ------------------- Memory Operations ------------------- The example discussed in this section illustrates most of the memory operations that can be performed with RAJA resource objects. A common use case for a resource is to manage arrays in the appropriate memory space to use in a kernel. Consider the following code example:: // create a resource for a host CPU and a CUDA GPU device RAJA::resources::Resource host_res{RAJA::resources::Host()}; RAJA::resources::Resource cuda_res{RAJA::resources::Cuda()}; // allocate arrays in host memory and device memory int N = 100; int* host_array = host_res.allocate(N); int* gpu_array = cuda_res.allocate(N); // initialize values in host_array.... // initialize gpu_array values to zero cuda_res.memset(gpu_array, 0, sizeof(int) * N); // copy host_array values to gpu_array cuda_res.memcpy(gpu_array, host_array, sizeof(int) * N); // execute a CUDA kernel that uses gpu_array data RAJA::forall>(RAJA::TypedRangeSegment(0, N), [=] RAJA_DEVICE(int i) { // modify values of gpu_array... } ); // copy gpu_array values to host_array cuda_res.memcpy(host_array, gpu_array, sizeof(int) * N); // do something with host_array on CPU... // de-allocate array storage host_res.deallocate(host_array); cuda_res.deallocate(gpu_array); Here, we create a CUDA GPU device resource and a host CPU resource and use them to allocate an array in GPU memory and one in host memory, respectively. Then, after initializing the host array, we use the CUDA resource to copy the host array to the GPU array storage. Next, we run a CUDA device kernel which modifies the GPU array. After using the CUDA resource to copy the GPU array values into the host array, we can do something with the values generated in the GPU kernel on the CPU host. Lastly, we de-allocate the arrays. -------------------------------- Kernel Execution and Resources -------------------------------- Resources can be used with the following RAJA kernel execution interfaces: * ``RAJA::forall`` * ``RAJA::kernel`` * ``RAJA::launch`` * ``RAJA::sort`` * ``RAJA::scan`` Although we show examples using mainly ``RAJA::forall`` in the following discussion, resource usage with the other methods listed is similar and provides similar behavior. Usage ^^^^^ Specifically, a resource can be passed optionally as the first argument in a call to one of these methods. For example:: RAJA::forall(my_res, .... ); .. note:: When a resource is not passed when calling one of the methods listed above, the *default* resource type associated with the execution policy is used in the internal implementation. When passing a CUDA or HIP resource, the method will execute asynchronously on a GPU stream. Currently, CUDA and HIP are the only resource types that enable asynchronous threading. .. note:: Support for OpenMP CPU multithreading, which would use the ``RAJA::resources::Host`` resource type, and OpenMP target offload which would use the ``RAJA::resources::Omp`` resource type, is incomplete and under development. The resource type passed to one of the methods listed above must be a concrete type; i.e., not type erased. The reason is that this allows consistency checking via a compile-time assertion to ensure that the passed resource is compatible with the given execution policy. For example:: using ExecPol = RAJA::cuda_exec_async; RAJA::resources::Cuda my_cuda_res; RAJA::forall(my_cuda_res, .... ); // Successfully compiles RAJA::resources::Resource my_res{RAJA::resources::Cuda()}; RAJA::forall(my_res, .... ) // Compilation error since resource type is not concrete RAJA::resources::Host my_host_res; RAJA::forall(my_host_res, .... ) // Compilation error since resource type is incompatible with the execution policy IndexSet Usage ^^^^^^^^^^^^^^^ Recall that a kernel that uses a RAJA IndexSet to describe the kernel iteration space, require a two execution policies (see :ref:`indexsetpolicy-label`). Currently, a user may only pass a single resource to a method taking an IndexSet argument. The resource is used for the *inner* execution over each segment in the IndexSet, not for the *outer* iteration over segments. For example:: using ExecPol = RAJA::ExecPolicy>; RAJA::forall(my_cuda_res, iset, .... ); Default Resources ^^^^^^^^^^^^^^^^^^^^^^ When a resource is not provided by the user, a *default* resource that corresponds to the execution policy is used. The default resource can be accessed in multiple ways. It can be accessed directly from the concrete resource type:: RAJA::resources::Cuda my_default_cuda = RAJA::resources::Cuda::get_default(); The resource type can also be deduced in two different ways from an execution policy:: using Res = RAJA::resources::get_resource::type; Res r = Res::get_default(); Or:: auto my_resource = RAJA::resources::get_default_resource(); .. note:: For CUDA and HIP, the default resource is *NOT* associated with the default CUDA or HIP stream. It is its own stream defined by the underlying camp resource. This is intentional to break away from some issues that arise from the synchronization behavior of the CUDA and HIP default streams. It is still possible to use the CUDA and HIP default streams as the default resource. This can be enabled by defining the environment variable ``CAMP_USE_PLATFORM_DEFAULT_STREAM`` before compiling RAJA in a project. ------ Events ------ Event objects allow users to wait or query the status of a resource's action. An event can be returned from a resource:: RAJA::resources::Event e = my_res.get_event(); Getting an event like this enqueues an event object for the given back-end. Users can call the *blocking* ``wait`` function on the event:: e.wait(); This wait call will block all execution until all operations enqueued on a resource complete. Alternatively, a user can enqueue the event on a specific resource, forcing only the resource to wait for the operation associated with the event to complete:: my_res.wait_for(&e); All methods listed above near the beginning of the RAJA resource discussion return an event object so users can access the event associated with the method call. This allows one to set up dependencies between resource objects and operations, as well as define and control asynchronous execution patterns. .. note:: An Event object is only created if a user explicitly sets the event returned by the ``RAJA::forall`` call to a variable. This avoids unnecessary event objects being created when not needed. For example:: RAJA::forall>(my_cuda_res, ...); will *not* generate a cudaStreamEvent, whereas:: RAJA::resources::Event e = RAJA::forall>(my_cuda_res, ...); will generate a cudaStreamEvent. ------- Example ------- The example presented here executes three kernels across two CUDA streams on a GPU with a requirement that the first and second kernel finish execution before the third is launched. It also shows copying memory from the device to host on a resource that we described earlier. First, we define two concrete CUDA resources and one concrete host resource, and define an asynchronous CUDA execution policy type: .. literalinclude:: ../../../../examples/resource-forall.cpp :start-after: _raja_res_defres_start :end-before: _raja_res_defres_end :language: C++ Next, we allocate data for two GPU arrays and one host array, all of length 'N': .. literalinclude:: ../../../../examples/resource-forall.cpp :start-after: _raja_res_alloc_start :end-before: _raja_res_alloc_end :language: C++ Then, we launch a GPU kernel on the CUDA stream associated with the resource ``res_gpu1``, without keeping a handle to the associated event: .. literalinclude:: ../../../../examples/resource-forall.cpp :start-after: _raja_res_k1_start :end-before: _raja_res_k1_end :language: C++ Next, we execute another GPU kernel on the CUDA stream associated with the resource ``res_gpu2`` and keep a handle to the corresponding event object by assigning it to a local variable ``e``: .. literalinclude:: ../../../../examples/resource-forall.cpp :start-after: _raja_res_k2_start :end-before: _raja_res_k2_end :language: C++ We require that the next kernel we launch to wait for the kernel launched on the stream associated with the resource ``res_gpu2`` to complete. Therefore, we enqueue a wait on that event on the ``res_gpu1`` resource: .. literalinclude:: ../../../../examples/resource-forall.cpp :start-after: _raja_res_wait_start :end-before: _raja_res_wait_end :language: C++ Now that the second GPU kernel is complete, we launch a second kernel on the stream associated with the resource ``res_gpu1``: .. literalinclude:: ../../../../examples/resource-forall.cpp :start-after: _raja_res_k3_start :end-before: _raja_res_k3_end :language: C++ Next, we enqueue a memcpy operation on the resource ``res_gpu1`` to copy the GPU array ``d_array`` to the host array ``h_array``: .. literalinclude:: ../../../../examples/resource-forall.cpp :start-after: _raja_res_memcpy_start :end-before: _raja_res_memcpy_end :language: C++ Lastly, we use the copied data in a kernel executed on the host: .. literalinclude:: ../../../../examples/resource-forall.cpp :start-after: _raja_res_k4_start :end-before: _raja_res_k4_end :language: C++