Team based loops (RAJA Teams)

Key RAJA features shown in the following examples:

  • RAJA::expt::launch method to create a run-time selectable host/device execution space.
  • RAJA::expt::loop methods to express algorithms in terms of nested for loops.

In this example, we introduce the RAJA Teams framework and discuss hierarchical loop-based parallelism. Development with RAJA Teams occurs inside an execution space, which is launched using the RAJA::expt::launch method:

RAJA::expt::launch<launch_policy>(RAJA::expt::ExecPlace /* HOST or DEVICE */,
    RAJA::expt::Grid(RAJA::expt::Teams(Nteams, Nteams),
                     RAJA::expt::Threads(Nthreads, Nthreads)),
    [=] RAJA_HOST_DEVICE (RAJA::expt::LaunchContext ctx) {

      /* Express code here */

});

The RAJA::expt::launch method is templated on both a host and a device launch policy. As an example, the following constructs an execution space for sequential execution on the host and CUDA execution on the device:

using launch_policy = RAJA::expt::LaunchPolicy
  <RAJA::expt::seq_launch_t, RAJA::expt::cuda_launch_t<false>>;

Kernel execution on either the host or device is driven by the first argument of the method, which takes a RAJA::expt::ExecPlace enum value, either HOST or DEVICE. As in thread/block programming models, RAJA Teams carries out computation on a predefined compute grid of threads, which are grouped into teams. The execution space is enclosed in a host/device lambda, which takes a RAJA::expt::LaunchContext object. The RAJA::expt::LaunchContext may then be used to control flow within the kernel, for example to create thread-team synchronization points.
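For instance, the execution place can be chosen at run time and the LaunchContext can be used to synchronize the threads in a team. The following is a minimal sketch rather than a complete program: the use_gpu flag is a hypothetical runtime switch, and ctx.teamSync() is assumed to be the LaunchContext team-synchronization call.

// A minimal sketch (not from the tutorial): `use_gpu` is a hypothetical
// runtime flag, and teamSync() is assumed to be the LaunchContext
// team-synchronization method.
RAJA::expt::ExecPlace select_cpu_or_gpu =
    use_gpu ? RAJA::expt::ExecPlace::DEVICE : RAJA::expt::ExecPlace::HOST;

RAJA::expt::launch<launch_policy>(select_cpu_or_gpu,
    RAJA::expt::Grid(RAJA::expt::Teams(Nteams, Nteams),
                     RAJA::expt::Threads(Nthreads, Nthreads)),
    [=] RAJA_HOST_DEVICE (RAJA::expt::LaunchContext ctx) {

      // ... first phase of work performed by the threads in each team ...

      // Wait for all threads in the team to finish the first phase
      // (a thread-team synchronization point, analogous to CUDA's
      // __syncthreads()).
      ctx.teamSync();

      // ... second phase that depends on results of the first ...
});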

Inside the execution space, the RAJA::expt::loop methods enable developers to express their code in terms of nested loops. How each loop is executed is determined by its loop policy template parameter. Following the CUDA/HIP programming models, the loops form a hierarchical structure in which outer loops are executed by thread-teams and inner loops are executed by the threads within a team.

  RAJA::expt::loop<teams_y>(ctx, RAJA::RangeSegment(0, Nteams), [&] (int by) {
    RAJA::expt::loop<teams_x>(ctx, RAJA::RangeSegment(0, Nteams), [&] (int bx) {

      RAJA::expt::loop<threads_y>(ctx, RAJA::RangeSegment(0, Nthreads), [&] (int ty) {
        RAJA::expt::loop<threads_x>(ctx, RAJA::RangeSegment(0, Nthreads), [&] (int tx) {

          printf("RAJA Teams: threadId_x %d threadId_y %d teamId_x %d teamId_y %d \n",
                 tx, ty, bx, by);

        });
      });

    });
  });

How teams and threads map to the underlying programming model depends on how the loop policies are defined. For example, we may define host and device mapping strategies as follows:

using teams_x   = RAJA::expt::LoopPolicy<RAJA::loop_exec,
                                         RAJA::cuda_block_x_direct>;
using threads_x = RAJA::expt::LoopPolicy<RAJA::loop_exec,
                                         RAJA::cuda_thread_x_direct>;
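The nested loop example above also uses teams_y and threads_y policies; these are not shown in the tutorial snippet, but they would presumably mirror the x-direction policies with the corresponding y-direction CUDA mappings:

// Assumed y-direction counterparts of the policies above.
using teams_y   = RAJA::expt::LoopPolicy<RAJA::loop_exec,
                                         RAJA::cuda_block_y_direct>;
using threads_y = RAJA::expt::LoopPolicy<RAJA::loop_exec,
                                         RAJA::cuda_thread_y_direct>;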

In the example above, the RAJA::expt::LoopPolicy struct holds both the host and device loop mapping strategies. On the host, both the team and thread strategies expand to standard C-style for loops:

    for(int by=0; by<Nteams; ++by) {
      for(int bx=0; bx<Nteams; ++bx) {

        for(int ty=0; ty<Nthreads; ++ty) {
          for(int tx=0; tx<Nthreads; ++tx) {

            printf("c-iter: iter_tx %d iter_ty %d iter_bx %d iter_by %d \n",
                   tx, ty, bx, by);
          }
        }

      }
    }

On the device, the teams_x/y policies map loop iterations directly to CUDA thread blocks, while the threads_x/y policies map loop iterations directly to threads within a CUDA thread block. The CUDA equivalent is illustrated below:

  {int by = blockIdx.y;
    {int bx = blockIdx.x;

      {int ty = threadIdx.y;
        {int tx = threadIdx.x;

          printf("device-iter: threadIdx_tx %d threadIdx_ty %d block_bx %d block_by %d \n",
                 tx, ty, bx, by);

        }
      }

    }
  }

The file RAJA/examples/tut_teams_basic.cpp contains the complete working example code.