.. ##
.. ## Copyright (c) Lawrence Livermore National Security, LLC and other
.. ## RAJA Project Developers. See top-level LICENSE and COPYRIGHT
.. ## files for dates and other details. No copyright assignment is required
.. ## to contribute to RAJA.
.. ##
.. ## SPDX-License-Identifier: (BSD-3-Clause)
.. ##

.. _tut-launchintro-label:

------------------------------
``RAJA::Launch`` Basics
------------------------------

There are no exercise files to work through for this section. Instead, there
is an example source file ``RAJA/examples/tut_launch_basic.cpp`` which
contains complete code examples of the concepts described here.

Key RAJA features shown in the following examples are:

  * ``RAJA::launch`` method to create a run-time selectable host/device
    execution space.
  * ``RAJA::loop`` methods to express algorithms in terms of nested for loops.

In this example, we introduce the RAJA Launch framework and discuss
hierarchical loop-based parallelism. Kernel execution details with RAJA Launch
occur inside the lambda expression passed to the ``RAJA::launch`` method,
which defines an execution space::

  RAJA::launch<launch_policy>(RAJA::ExecPlace ,
  RAJA::LaunchParams(RAJA::Teams(Nteams,Nteams),
                     RAJA::Threads(Nthreads,Nthreads)),
  [=] RAJA_HOST_DEVICE (RAJA::LaunchContext ctx) {

    /* Kernel code goes here */

  });

The ``RAJA::launch`` method accepts a ``RAJA::LaunchPolicy`` template
parameter that can be defined using up to two policies (host and device).
For example, the following creates an execution space for a sequential
and CUDA kernel dispatch::

  using launch_policy = RAJA::LaunchPolicy
    <RAJA::seq_launch_t, RAJA::cuda_launch_t<false>>;

Whether a kernel executes on the host or device is determined by the first
argument passed to the ``RAJA::launch`` method, which is a
``RAJA::ExecPlace`` enum value, either ``HOST`` or ``DEVICE``.
Similar to GPU thread and block programming models, RAJA Launch carries out
computation in a predefined compute grid made up of threads, which are
grouped into teams when executing on the device.

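The run-time choice between host and device described above can be pictured
with a small stand-alone model. The sketch below does not use RAJA at all:
``ExecPlace``, ``HostBackend``, ``DeviceBackend`` and ``launch_model`` are
hypothetical names invented for illustration, and the "device" backend is only
a tag type, so nothing here runs on a GPU. It shows only the idea of a policy
that carries two backends and an enum argument that picks one of them when the
kernel is launched.

```cpp
#include <cassert>
#include <string>
#include <utility>

// Hypothetical mirror of a host/device selector; NOT the RAJA type.
enum class ExecPlace { HOST, DEVICE };

// Hypothetical backend tags. A real framework would attach the actual
// dispatch mechanism (plain loops, a GPU kernel launch) to each one.
struct HostBackend {
  static std::string name() { return "host"; }
};
struct DeviceBackend {
  static std::string name() { return "device"; }
};

// Both backends are fixed at compile time through the template parameters,
// while the one that is actually used is chosen at run time by `place`.
template <typename Host, typename Device, typename Body>
void launch_model(ExecPlace place, Body&& body) {
  if (place == ExecPlace::HOST) {
    std::forward<Body>(body)(Host{});
  } else {
    assert(place == ExecPlace::DEVICE);
    std::forward<Body>(body)(Device{});
  }
}
```

Calling ``launch_model<HostBackend, DeviceBackend>(ExecPlace::HOST, body)``
with a generic lambda as ``body`` hands that lambda a ``HostBackend`` tag, and
passing ``ExecPlace::DEVICE`` hands it a ``DeviceBackend`` tag; the same kernel
body is reused in both cases.
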
The execution space is enclosed by a host/device lambda that takes a
``RAJA::LaunchContext`` object, which may be used to control the flow within
the kernel, for example by creating thread-team synchronization points.

Inside the execution space, developers write a kernel using nested
``RAJA::loop`` methods. The manner in which each loop is executed is
determined by a template parameter type, which indicates how the corresponding
iterates are mapped to the Teams/Threads configuration defined by the
``RAJA::LaunchParams`` type passed as the second argument to the
``RAJA::launch`` method. Following the CUDA and HIP programming models, this
defines a hierarchical structure in which outer loops are executed by
thread-teams and inner loops are executed by threads in a team.

.. literalinclude:: ../../../../examples/tut_launch_basic.cpp
   :start-after: // _team_loops_start
   :end-before: // _team_loops_end
   :language: C++

How teams and threads map to the underlying programming model depends on how
the ``RAJA::loop`` template parameter types are defined. For example, we may
define host and device mapping strategies as::

  using teams_x = RAJA::LoopPolicy< RAJA::seq_exec,
                                    RAJA::cuda_block_x_direct >;
  using thread_x = RAJA::LoopPolicy< RAJA::seq_exec,
                                     RAJA::cuda_thread_x_direct >;

Here, the ``RAJA::LoopPolicy`` type holds both the host (CPU) and device
(CUDA GPU) loop mapping strategies. On the host, both the team and thread
strategies expand out to standard C-style loops for execution:

.. literalinclude:: ../../../../examples/tut_launch_basic.cpp
   :start-after: // _c_style_loops_start
   :end-before: // _c_style_loops_end
   :language: C++

On the device, the ``teams_x/y`` policies will map loop iterations directly to
CUDA (or HIP) thread blocks, while the ``thread_x/y`` policies will map loop
iterations directly to threads in a CUDA (or HIP) thread block. The direct CUDA
equivalent of the kernel body using the policy shown above is:

.. literalinclude:: ../../../../examples/tut_launch_basic.cpp
   :start-after: // _device_loop_start
   :end-before: // _device_loop_end
   :language: C++
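
To make the team/thread hierarchy concrete without a RAJA build, the
following is a minimal host-only sketch in plain C++. It is a one-dimensional
simplification, and ``GridModel``, ``for_each_team_thread``, ``global_index``
and ``visit_order`` are hypothetical helpers introduced here, not part of
RAJA. The outer loop plays the role of the team loop and the inner loop the
role of the thread loop, which is how sequential host policies behave
conceptually; on a GPU the same pair of indices would come from the block and
thread identifiers instead of loop counters.

```cpp
#include <cassert>
#include <vector>

// Hypothetical 1D model of a launch grid: `teams` teams of `threads` threads.
struct GridModel {
  int teams;
  int threads;
};

// Visit every (team, thread) pair in the order a sequential host mapping
// would: outer loop over teams, inner loop over the threads of one team.
template <typename Body>
void for_each_team_thread(const GridModel& grid, Body&& body) {
  assert(grid.teams >= 0 && grid.threads >= 0);
  for (int team = 0; team < grid.teams; ++team) {
    for (int thread = 0; thread < grid.threads; ++thread) {
      body(team, thread);
    }
  }
}

// Flattened index of one iterate, analogous to the familiar
// blockIdx.x * blockDim.x + threadIdx.x expression in CUDA.
inline int global_index(const GridModel& grid, int team, int thread) {
  return team * grid.threads + thread;
}

// Collect the flattened indices in the order the host loops visit them.
inline std::vector<int> visit_order(const GridModel& grid) {
  std::vector<int> order;
  for_each_team_thread(grid, [&](int team, int thread) {
    order.push_back(global_index(grid, team, thread));
  });
  return order;
}
```

For a grid of 2 teams with 3 threads each, ``visit_order`` touches the six
flattened indices 0 through 5 exactly once, team by team. A direct device
mapping covers the same set of iterates but without any guaranteed ordering
between them.
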