Loop Tiling¶
In this section, we discuss RAJA statements that can be used to tile nested for-loops. Typical loop tiling involves partitioning an iteration space into a collection of “tiles” and then iterating over tiles in outer loops and entries within each tile in inner loops. Many scientific computing algorithms can benefit from loop tiling due to more efficient cache usage on a CPU or use of GPU shared memory.
For example, an operation performed using a for-loop with a range of [0, 10):
for (int i=0; i<10; ++i) {
// loop body using index 'i'
}
May be expressed as a loop nest that iterates over five tiles of size two:
int numTiles = 5;
int tileDim = 2;
for (int t=0; t<numTiles; ++t) {
for (int j=0; j<tileDim; ++j) {
int i = j + tileDim*t; // Calculate global index 'i'
// loop body using index 'i'
}
}
Next, we show how this tiled loop can be represented using RAJA. Then, we present variations on it that illustrate the usage of different RAJA kernel statement types.
using KERNEL_EXEC_POL =
RAJA::KernelPolicy<
RAJA::statement::Tile<0, RAJA::tile_fixed<2>, RAJA::seq_exec,
RAJA::statement::For<0, RAJA::seq_exec,
RAJA::statement::Lambda<0>
>
>
>;
RAJA::kernel<KERNEL_EXEC_POL>(RAJA::make_tuple(RAJA::RangeSegment(0,10)),
[=] (int i) {
// loop body using index 'i'
});
In RAJA, the simplest way to tile an iteration space is to use RAJA
statement::Tile
and statement::For
statement types. A
statement::Tile
type is similar to a statement::For
type, but takes
a tile size as the second template argument. The statement::Tile
construct generates the outer loop over tiles and the statement::For
statement iterates over each tile. Nested together, as in the example, these
statements will pass the global index ‘i’ to the loop body in the lambda
expression as in the non-tiled version above.
Note
When using statement::Tile
and statement::For
types together
to define a tiled loop structure, the integer passed as the first
template argument to each statement type must be the same. This
indicates that they both apply to the same item in the iteration
space tuple passed to the RAJA::kernel
methods.
RAJA also provides alternative tiling and for statements that provide the tile number and local tile index, if needed inside the kernel body, as shown below:
using KERNEL_EXEC_POL2 =
RAJA::KernelPolicy<
RAJA::statement::TileTCount<0, RAJA::statement::Param<0>,
RAJA::tile_fixed<2>, RAJA::seq_exec,
RAJA::statement::ForICount<0, RAJA::statement::Param<1>,
RAJA::seq_exec,
RAJA::statement::Lambda<0>
>
>
>;
RAJA::kernel_param<KERNEL_EXEC_POL2>(RAJA::make_tuple(RAJA::RangeSegment(0,10)),
RAJA::make_tuple((int)0, (int)0),
[=](int i, int t, int j) {
// i - global index
// t - tile number
// j - index within tile
// Then, i = j + 2*t (2 is tile size)
});
The statement::TileTCount
type allows the tile number to be accessed as a
lambda argument and the statement::ForICount
type allows the local tile
loop index to be accessed as a lambda argument. These values are specified in
the tuple, which is the second argument passed to the RAJA::kernel_param
method above. The statement::Param<#>
type appearing as the second
template parameter for each statement type indicates which parameter tuple
entry the tile number or local tile loop index is passed to the lambda, and
in which order. Here, the tile number is the second lambda argument (tuple
parameter ‘0’) and the local tile loop index is the third lambda argument
(tuple parameter ‘1’).
Note
The global loop indices always appear as the first lambda expression
arguments. Then, the parameter tuples identified by the integers
in the Param
statement types given for the loop statement
types follow.