Tiled Matrix Transpose with Local Array¶
This section extends the discussion in Tiled Matrix Transpose by adding local array objects which are used to store data for each tile in CPU stack-allocated arrays or GPU thread local and shared memory to be used within kernels.
There are exercise files
RAJA/exercises/kernel-matrix-transpose-local-array.cpp
and
RAJA/exercises/launch-matrix-transpose-local-array.cpp
for you to work
through if you wish to get some practice with RAJA. The files
RAJA/exercises/kernel-matrix-transpose-local-array._solutioncpp
and
RAJA/exercises/launch-matrix-transpose-local-array_solution.cpp
contain
complete working code for the examples. You can use the solution files to
check your work and for guidance if you get stuck. To build
the exercises execute make (kernel/launch)-matrix-transpose-local-array
and make (kernel/launch)-matrix-transpose-local-array_solution
from the build directory.
Key RAJA features shown in this example are:
RAJA::kernel_param
method and execution policy usage with multiple lambda expressions
RAJA::statement::Tile
type for loop tiling
RAJA::statement::ForICount
type for generating local tile indices
RAJA::LocalArray
type for thread-local tile memory arrays
RAJA::launch
kernel execution interface
RAJA::tile
type for loop tiling
RAJA::loop_icount
method to generate local tile indices for Launch
RAJA_TEAM_SHARED
macro for thread-local tile memory arrays
As in Tiled Matrix Transpose, this example computes the transpose of an input matrix \(A\) of size \(N_r \times N_c\) and stores the result in a second matrix \(At\) of size \(N_c \times N_r\). The operation uses a local memory tiling algorithm, which tiles the outer loops and iterates over tiles in inner loops. The algorithm first loads input matrix entries into a local two-dimensional array for a tile, and then reads from the tile swapping the row and column indices to generate the output matrix.
We choose tile dimensions smaller than the dimensions of the matrix and note that it is not necessary for the tile dimensions to divide evenly the number of rows and columns in the matrix. As in the Tiled Matrix Transpose example, we start by defining the number of rows and columns in the matrices, the tile dimensions, and the number of tiles.
constexpr int N_r = 267;
constexpr int N_c = 251;
constexpr int TILE_DIM = 16;
constexpr int outer_Dimc = (N_c - 1) / TILE_DIM + 1;
constexpr int outer_Dimr = (N_r - 1) / TILE_DIM + 1;
We also use RAJA View objects to simplify the multi-dimensional indexing as in the Tiled Matrix Transpose example.
RAJA::View<int, RAJA::Layout<DIM>> Aview(A, N_r, N_c);
RAJA::View<int, RAJA::Layout<DIM>> Atview(At, N_c, N_r);
The complete sequential C-style implementation of the tiled transpose operation using a stack-allocated local array for the tiles is:
//
// (0) Outer loops to iterate over tiles
//
for (int by = 0; by < outer_Dimr; ++by) {
for (int bx = 0; bx < outer_Dimc; ++bx) {
// Stack-allocated local array for data on a tile
int Tile[TILE_DIM][TILE_DIM];
//
// (1) Inner loops to read input matrix tile data into the array
//
// Note: loops are ordered so that input matrix data access
// is stride-1.
//
for (int ty = 0; ty < TILE_DIM; ++ty) {
for (int tx = 0; tx < TILE_DIM; ++tx) {
int col = bx * TILE_DIM + tx; // Matrix column index
int row = by * TILE_DIM + ty; // Matrix row index
// Bounds check
if (row < N_r && col < N_c) {
Tile[ty][tx] = Aview(row, col);
}
}
}
//
// (2) Inner loops to write array data into output array tile
//
// Note: loop order is swapped from above so that output matrix
// data access is stride-1.
//
for (int tx = 0; tx < TILE_DIM; ++tx) {
for (int ty = 0; ty < TILE_DIM; ++ty) {
int col = bx * TILE_DIM + tx; // Matrix column index
int row = by * TILE_DIM + ty; // Matrix row index
// Bounds check
if (row < N_r && col < N_c) {
Atview(col, row) = Tile[ty][tx];
}
}
}
}
}
Note
To prevent indexing out of bounds, when the tile dimensions do not divide evenly the matrix dimensions, we use a bounds check in the inner loops.
For efficiency, we order the inner loops so that reading from the input matrix and writing to the output matrix both use stride-1 data access.
RAJA::kernel
Variants¶
The RAJA::kernel
interface provides mechanisms to tile loops and use
local arrays in kernels so that algorithm patterns like the C-style kernel
above can be implemented with RAJA. When using RAJA::kernel
, a
RAJA::LocalArray
type specifies an object whose memory is created inside
a kernel using a statement type in a RAJA kernel execution policy. The local
array data is only usable within the kernel. See Local Array
for more information.
RAJA::kernel
methods also support loop tiling statements which determine
the number of tiles needed to perform an operation based on tile size and
extent of the corresponding iteration space. Moreover, lambda expressions for
the kernel will not be invoked for iterations outside the bounds of an
iteration space when tile dimensions do not divide evenly the size of the
iteration space; thus, no conditional checks on loop bounds are needed
inside inner loops.
For the RAJA version of the matrix transpose kernel above, we define the
type of the RAJA::LocalArray
used for matrix entries in a tile and
create an object to represent it:
using TILE_MEM =
RAJA::LocalArray<int, RAJA::Perm<0, 1>, RAJA::SizeList<TILE_DIM, TILE_DIM>>;
TILE_MEM Tile_Array;
The template parameters that define the type are: the array data type, the
data stride permutation for the array indices (here the identity permutation
is given, so the default RAJA conventions apply; i.e., the rightmost array
index will be stride-1), and the array dimensions. Next, we compare two
RAJA::kernel
implementations of the matrix transpose operation.
The complete RAJA sequential CPU variant with kernel execution policy and kernel is:
using SEQ_EXEC_POL_I =
RAJA::KernelPolicy<
RAJA::statement::Tile<1, RAJA::tile_fixed<TILE_DIM>, RAJA::seq_exec,
RAJA::statement::Tile<0, RAJA::tile_fixed<TILE_DIM>, RAJA::seq_exec,
RAJA::statement::InitLocalMem<RAJA::cpu_tile_mem, RAJA::ParamList<2>,
RAJA::statement::ForICount<1, RAJA::statement::Param<0>, RAJA::seq_exec,
RAJA::statement::ForICount<0, RAJA::statement::Param<1>, RAJA::seq_exec,
RAJA::statement::Lambda<0>
>
>,
RAJA::statement::ForICount<0, RAJA::statement::Param<1>, RAJA::seq_exec,
RAJA::statement::ForICount<1, RAJA::statement::Param<0>, RAJA::seq_exec,
RAJA::statement::Lambda<1>
>
>
>
>
>
>;
RAJA::kernel_param<SEQ_EXEC_POL_I>(
RAJA::make_tuple(RAJA::TypedRangeSegment<int>(0, N_c),
RAJA::TypedRangeSegment<int>(0, N_r)),
RAJA::make_tuple((int)0, (int)0, Tile_Array),
[=](int col, int row, int tx, int ty, TILE_MEM &Tile_Array) {
Tile_Array(ty, tx) = Aview(row, col);
},
[=](int col, int row, int tx, int ty, TILE_MEM &Tile_Array) {
Atview(col, row) = Tile_Array(ty, tx);
}
);
In the execution policy, the RAJA::statement::Tile
types define
tiling of the outer ‘row’ (iteration space tuple index ‘1’) and ‘col’
(iteration space tuple index ‘0’) loops, as well as tile sizes
(RAJA::tile_fixed
types) and loop execution policies. Next,
the RAJA::statement::InitLocalMem
type allocates the local tile array
based on the memory policy type (here, we use RAJA::cpu_tile_mem
for
a CPU stack-allocated array). The RAJA::ParamList<2>
parameter indicates
that the local array object is associated with position ‘2’ in the parameter
tuple argument passed to the RAJA::kernel_param
method. The first two
entries in the parameter tuple indicate storage for the local tile indices
that are used in the two lambda expressions that comprise the kernel body.
Finally, we have two sets of nested inner loops for reading the input matrix
entries into the local tile array and writing them out to the output matrix
transpose. The inner bodies of each of these loop nests are identified by
lambda expression invocation statements RAJA::statement::Lambda<0>
for
the first lambda passed as an argument to the RAJA::kernel_param
method
and RAJA::statement::Lambda<1>
for the second lambda argument.
Note that the loops within tiles use RAJA::statement::ForICount
types
rather than RAJA::statement::For
types that we saw in the
tiled matrix transpose example in Tiled Matrix Transpose.
The RAJA::statement::ForICount
type generates local tile indices that
are passed to lambda loop body expressions to index into the local tile
memory array. As the reader will observe, there is no local tile index
computation needed in the lambdas for the RAJA version of the kernel as a
result. The first integer template parameter for each
RAJA::statement::ForICount
type indicates the item in the iteration space
tuple passed to the RAJA::kernel_param
method to which it applies.
The second template parameter for each
RAJA::statement::ForICount
type indicates the position in the parameter
tuple passed to the RAJA::kernel_param
method that will hold the
associated local tile index. For more detailed discussion of RAJA loop tiling
statement types, please see Loop Tiling.
Now that we have described the execution policy in some detail, let’s pull
everything together by briefly walking though the call to the
RAJA::kernel_param
method, which is similar to RAJA::kernel
but takes
additional arguments needed to execute the operations involving local
tile indices and the local memory array. The first argument is a tuple of
iteration spaces that define the iteration ranges for the levels in the loop
nest. Again, the first integer parameters given to the RAJA::statement::Tile
and RAJA::statement::ForICount
types identify the tuple entry to which
they apply. The second argument:
RAJA::make_tuple((int)0, (int)0, Tile_Array)
is a tuple of data parameters that will hold the local tile indices and
RAJA::LocalArray
tile memory. The tuple entries are
associated with various statements in the execution policy as we described
earlier. Next, two lambda expression arguments are passed to the
RAJA::kernel_param
method for reading and writing the input and output
matrix entries, respectively.
Note
RAJA::kernel_param
accepts a parameter tuple argument after
the iteration space tuple, which enables the parameters to be
used in multiple lambda expressions in a kernel.
In the kernel, both lambda expressions take the same five arguments. The first
two are the matrix global column and row indices associated with the iteration
space tuple. The next three arguments correspond to the parameter tuple entries.
The first two of these are the local tile indices used to access entries in the
RAJA::LocalArray
object memory. The last argument is a reference to the
RAJA::LocalArray
object itself.
The next RAJA::kernel_param
variant we present works the same as the one
above. It is different from the previous version since we include
additional template parameters in the RAJA::statement::Lambda
types to
indicate which arguments each lambda expression takes and in which order.
Here is the complete version including execution policy and kernel:
using SEQ_EXEC_POL_II =
RAJA::KernelPolicy<
RAJA::statement::Tile<1, RAJA::tile_fixed<TILE_DIM>, RAJA::seq_exec,
RAJA::statement::Tile<0, RAJA::tile_fixed<TILE_DIM>, RAJA::seq_exec,
RAJA::statement::InitLocalMem<RAJA::cpu_tile_mem, RAJA::ParamList<0>,
RAJA::statement::For<1, RAJA::seq_exec,
RAJA::statement::For<0, RAJA::seq_exec,
RAJA::statement::Lambda<0, Segs<0>, Segs<1>, Offsets<0>, Offsets<1>, Params<0> >
>
>,
RAJA::statement::For<0, RAJA::seq_exec,
RAJA::statement::For<1, RAJA::seq_exec,
RAJA::statement::Lambda<1, Segs<0, 1>, Offsets<0, 1>, Params<0> >
>
>
>
>
>
>;
RAJA::kernel_param<SEQ_EXEC_POL_II>(
RAJA::make_tuple(RAJA::TypedRangeSegment<int>(0, N_c),
RAJA::TypedRangeSegment<int>(0, N_r)),
RAJA::make_tuple(Tile_Array),
[=](int col, int row, int tx, int ty, TILE_MEM &Tile_Array) {
Tile_Array(ty, tx) = Aview(row, col);
},
[=](int col, int row, int tx, int ty, TILE_MEM &Tile_Array) {
Atview(col, row) = Tile_Array(ty, tx);
}
);
Here, the two RAJA::statement::Lambda
types in the execution policy show
two different ways to specify the segments (RAJA::Segs
)
associated with the matrix column and row indices. That is, we can use a
Segs
statement for each argument, or include multiple segment ids in one
statement.
Note that we are using RAJA::statement::For
types for the inner tile
loops instead of RAJA::statement::ForICount` types used in the first variant.
As a consequence of specifying lambda arguments, there are two main differences.
The local tile indices are properly computed and passed to the lambda
expressions as a result of the RAJA::Offsets
types that appear
in the lambda statement types. The RAJA::statement::Lambda
type for each
lambda shows the two ways to specify the local tile index arguments; we can
use an Offsets
statement for each argument, or include multiple segment
ids in one statement. Lastly, there is only one entry in the parameter
tuple in this case, the local tile array. The placeholders in the
previous example are not needed.
Note
In this example, we need all five arguments in each lambda expression so the lambda expression argument lists are the same. Another use case for the template parameter argument specification described here is to be able to pass only the arguments used in a lambda expression. In particular when we use multiple lambda expressions to represent a kernel, each lambda can have a different argument lists from the others.
RAJA::launch
Variants¶
The RAJA::launch
interface provides mechanisms to tile loops and use
local arrays in kernels to support algorithm patterns like the C-style kernel
above. When, using RAJA::launch
, the RAJA_TEAM_SHARED
macro is
used to create a GPU a static sized shared memory array when using the CUDA and HIP backends,
static shared memory using SYCL is currently not supported. On the CPU, allocating
RAJA_TEAM_SHARED
corresponds to allocating memory on the stack.
Alternatively, one can allocated dynamic shared memory by specifying the amount of shared memory in
the RAJA::LaunchParams
struct. Dynamic shared memory is supported with all
backends and will be demonstrated as our second example. On the CPU dynamic
shared memory is mapped to heap memory and allocated via malloc at kernel launch.
As a first example, we illustrate the usage of static shared memory
and the use of RAJA::launch
tiling methods. RAJA tiling methods
take a iteration space in RAJA::tile
and output tiles which
the RAJA::loop_icount
method can iterate over and generate
global and local tile index offsets. Moreover, lambda expressions for these
methods will not be invoked for iterations outside the bounds of an iteration
space when tile dimensions do not evenly divide the size of the iteration space;
thus, no conditional checks on loop bounds are needed inside inner loops.
A complete RAJA sequential CPU variant with kernel execution policy and kernel is:
using loop_pol_1 = RAJA::LoopPolicy<RAJA::seq_exec>;
using launch_policy_1 = RAJA::LaunchPolicy<RAJA::seq_launch_t>;
RAJA::launch<launch_policy_1>(
RAJA::LaunchParams(), //LaunchParams may be empty when only running on the cpu
[=] RAJA_HOST_DEVICE (RAJA::LaunchContext ctx) {
RAJA::tile<loop_pol_1>(ctx, TILE_DIM, RAJA::TypedRangeSegment<int>(0, N_r), [&] (RAJA::TypedRangeSegment<int> const &row_tile) {
RAJA::tile<loop_pol_1>(ctx, TILE_DIM, RAJA::TypedRangeSegment<int>(0, N_c), [&] (RAJA::TypedRangeSegment<int> const &col_tile) {
RAJA_TEAM_SHARED double Tile_Array[TILE_DIM][TILE_DIM];
RAJA::loop_icount<loop_pol_1>(ctx, row_tile, [&] (int row, int ty) {
RAJA::loop_icount<loop_pol_1>(ctx, col_tile, [&] (int col, int tx) {
Tile_Array[ty][tx] = Aview(row, col);
});
});
RAJA::loop_icount<loop_pol_1>(ctx, col_tile, [&] (int col, int tx) {
RAJA::loop_icount<loop_pol_1>(ctx, row_tile, [&] (int row, int ty) {
Atview(col, row) = Tile_Array[ty][tx];
});
});
});
});
});
Here, the RAJA::tile
method is used to create tilings of the outer
‘row’ and ‘col’ iteration spaces. The RAJA::tile
method
takes an additional argument specifying the tile size for the corresponding
loop. To traverse the tile, we use the RAJA::loop_icount
method,
which is similar to the RAJA::ForICount
statement used in a
RAJA::kernel
execution policy as shown above. A
RAJA::loop_icount
method call
will generate local tile index associated with the outer global index.
The local tile index is necessary as we use it to read and write entries
from/to global memory to RAJA_TEAM_SHARED
memory array.
As an alternative to static shared memory, the matrix transpose kernel may be express using dynamic shared memory. Prior to invoking the amount of shared memory must be specified
constexpr size_t dynamic_shared_mem_size = TILE_DIM * TILE_DIM * sizeof(int);
The amount of shared memory is then specifed in the RAJA::LaunchParams
struct
and then accessed within the kernel using the LaunchContext’s getSharedMemory
method.
The getSharedMemory
method may be invoked multiple times each time returning an
offset to the shared memory buffer. Since the offset moves per getSharedMemory
call (also known as bump style allocation), it becomes necessary to reset the allocator offset
count at the end of the shared memory scope to avoid going beyond the buffer size.
The full example of matrix transpose with dynamic shared memory is provided below
RAJA::launch<launch_policy>
(select_cpu_or_gpu,
RAJA::LaunchParams(RAJA::Teams(outer_Dimr, outer_Dimc),
RAJA::Threads(TILE_DIM, TILE_DIM), dynamic_shared_mem_size),
"Matrix tranpose with dynamic shared memory kernel",
[=] RAJA_HOST_DEVICE(RAJA::LaunchContext ctx)
{
RAJA::loop<outer1>(ctx, RAJA::RangeSegment(0, outer_Dimr), [&] (int by){
RAJA::loop<outer0>(ctx, RAJA::RangeSegment(0, outer_Dimc), [&] (int bx){
//Request memory from shared memory pool
int * tile_ptr = ctx.getSharedMemory<int>(TILE_DIM * TILE_DIM);
//Use RAJA View for simplified indexing
RAJA::View<int, RAJA::Layout<2>> Tile(tile_ptr, TILE_DIM, TILE_DIM);
RAJA::loop<inner1>(ctx, RAJA::RangeSegment(0, TILE_DIM), [&] (int ty){
RAJA::loop<inner0>(ctx, RAJA::RangeSegment(0, TILE_DIM), [&] (int tx){
int col = bx * TILE_DIM + tx; // Matrix column index
int row = by * TILE_DIM + ty; // Matrix row index
// Bounds check
if (row < N_r && col < N_c) {
Tile(ty,tx) = Aview(row, col);
}
});
});
//Barrier is needed to ensure all threads have written to Tile
ctx.teamSync();
RAJA::loop<inner1>(ctx, RAJA::RangeSegment(0, TILE_DIM), [&] (int ty){
RAJA::loop<inner0>(ctx, RAJA::RangeSegment(0, TILE_DIM), [&] (int tx){
int col = bx * TILE_DIM + tx; // Matrix column index
int row = by * TILE_DIM + ty; // Matrix row index
// Bounds check
if (row < N_r && col < N_c) {
Atview(col, row) = Tile(ty, tx);
}
});
});
//The launch context uses bump style allocator in which calls
//to getSharedMemory moves a memory buffer pointer to return
//different segments of shared memory. To avoid requesting beyond
//the pre-allocated memory quantity we reset the allocator offset counter
//in the launch context effectively releasing shared memory.
ctx.releaseSharedMemory();
});
});
});