Nested Loop Interchange

Key RAJA features shown in this example:

  • RAJA::kernel loop iteration templates
  • RAJA nested loop execution policies
  • Nested loop reordering (i.e., loop interchange)
  • RAJA strongly-types indices

In Complex Loops (RAJA::kernel), we introduced the basic mechanics in RAJA for representing nested loops. In Matrix Multiplication (Nested Loops), we presented a complete example using RAJA nested loop features. The following example shows the nested loop interchange process in more detail. Specifically, we describe how to reorder nested policy arguments and introduce strongly-typed index variables that can help users write correct nested loop code with RAJA. The example does not perform any actual computation; each kernel simply prints out the loop indices in the order that the iteration spaces are traversed. Thus, only sequential execution policies are used. However, the mechanics work the same way for other RAJA execution policies.

Before we dive into the example, we note important features applied here that represent the main differences between nested-loop RAJA and the RAJA::forall loop construct for simple (i.e., non-nested) loops:

  • An index space (e.g., range segment) and lambda index argument are required for each level in a loop nest. This example contains triply-nested loops, so there will be three ranges and three index arguments.
  • The index spaces for the nested loop levels are specified in a RAJA tuple object. The order of spaces in the tuple must match the order of index arguments to the lambda for this to be correct, in general. RAJA provides strongly-typed indices to help with this, which we show here.
  • An execution policy is required for each level in a loop nest. These are specified as nested statements in the RAJA::KernelPolicy type.
  • The loop nest ordering is specified in the nested kernel policy – the first statement::For type identifies the outermost loop, the second statement::For type identifies the loop nested inside the outermost loop, and so on.

We begin by defining three named strongly-typed variables for the loop index variables.

RAJA_INDEX_VALUE(KIDX, "KIDX");
RAJA_INDEX_VALUE(JIDX, "JIDX"); 
RAJA_INDEX_VALUE(IIDX, "IIDX"); 

We also define three typed range segments which bind the ranges to the index variable types via template specialization:

  RAJA::TypedRangeSegment<KIDX> KRange(2, 4);
  RAJA::TypedRangeSegment<JIDX> JRange(1, 3);
  RAJA::TypedRangeSegment<IIDX> IRange(0, 2);

When these features are used as in this example, the compiler will generate error messages if the lambda expression index argument ordering and types do not match the index ordering in the tuple.

We present a complete example, and then describe its key elements:

  using KJI_EXECPOL = RAJA::KernelPolicy<
                        RAJA::statement::For<2, RAJA::seq_exec,    // k
                          RAJA::statement::For<1, RAJA::seq_exec,  // j
                            RAJA::statement::For<0, RAJA::seq_exec,// i 
                              RAJA::statement::Lambda<0>
                            > 
                          > 
                        > 
                      >;

  RAJA::kernel<KJI_EXECPOL>( RAJA::make_tuple(IRange, JRange, KRange),
  [=] (IIDX i, JIDX j, KIDX k) { 
     printf( " (%d, %d, %d) \n", (int)(*i), (int)(*j), (int)(*k));
  });

Here, the RAJA::kernel execution template takes two arguments: a tuple of ranges, one for each of the three levels in the loop nest, and the lambda expression loop body. Note that the lambda has an index argument for each range and that their order and types match.

The execution policy for the loop nest is specified in the RAJA::KernelPolicy type. Each level in the loop nest is identified by a statement::For type, which identifies the iteration space and execution policy for the level. Here, each level uses a sequential execution policy. This is for illustration purposes; if you run the example code, you will see the loop index triple printed in the exact order in which the kernel executes. The integer that appears as the first template argument to each statement::For type corresponds to the index of a range in the tuple and also to the associated lambda index argument; i.e., ‘0’ is for ‘i’, ‘1’ is for ‘j’, and ‘2’ is for ‘k’.

Here, the ‘k’ index corresponds to the outermost loop (slowest index), the ‘j’ index corresponds to the middle loop, and the ‘i’ index is for the innermost loop (fastest index). In other words, if written using C-style for-loops, the loop would appear as:

for (int k = 2; k< 4; ++k) {
  for (int j = 1; j < 3; ++j) {
    for (int i = 0; i < 2; ++i) {
      // print loop index triple...
    }
  }
}

The integer argument to each statement::For type is needed so that the levels in the loop nest can be reordered by changing the policy while the kernel remains the same. Next, we permute the loop nest ordering so that the ‘j’ loop is the outermost, the ‘i’ loop is in the middle, and the ‘k’ loop is the innermost with the following policy:

  using JIK_EXECPOL = RAJA::KernelPolicy<
                        RAJA::statement::For<1, RAJA::seq_exec,    // j
                          RAJA::statement::For<0, RAJA::seq_exec,  // i
                            RAJA::statement::For<2, RAJA::seq_exec,// k 
                              RAJA::statement::Lambda<0>
                            > 
                          > 
                        > 
                      >;

Note that we have simply reordered the nesting of the RAJA::statement::For types. This is analogous to reordering ‘for’ statements in traditional C-style nested loops. Here, the analogous C-style loop nest would appear as:

for (int j = 1; j < 3; ++j) {
  for (int i = 0; i < 2; ++i) {
    for (int k = 2; k< 4; ++k) {
      // print loop index triple...
    }
  }
}

Finally, for completeness, we permute the loops again so that the ‘i’ loop is the outermost, the ‘k’ loop is in the middle, and the ‘j’ loop is the innermost with the following policy:

  using IKJ_EXECPOL = RAJA::KernelPolicy<
                        RAJA::statement::For<0, RAJA::seq_exec,    // i
                          RAJA::statement::For<2, RAJA::seq_exec,  // k
                            RAJA::statement::For<1, RAJA::seq_exec,// j 
                              RAJA::statement::Lambda<0>
                            > 
                          > 
                        > 
                      >;

The analogous C-style loop nest would appear as:

for (int i = 0; j < 2; ++i) {
  for (int k = 2; k< 4; ++k) {
    for (int j = 1; j < 3; ++j) {
      // print loop index triple...
    }
  }
}

Hopefully, it should be clear how this works at this point. If not, the typed indices and typed range segments can help by enabling the compiler to let you know when something is not correct.

For example, this version of the loop will generate a compilation error (note that the kernel execution policy is the same as in the previous example):

  RAJA::kernel<IKJ_EXECPOL>( RAJA::make_tuple(IRange, JRange, KRange),
  [=] (JIDX i, IIDX j, KIDX k) {
     printf( " (%d, %d, %d) \n", (int)(*i), (int)(*j), (int)(*k));
  });

If you carefully compare the range ordering in the tuple to the lambda argument types, you will see what’s wrong.

Do you see the problem?

The file RAJA/examples/tut_nested-loop-reorder.cpp contains the complete working example code.