`main` method.

The `parallel_for_each` method starts one thread for each element in `product.extent`, and replaces the `for` loops for row and column. The value of the cell at the row and column is available in `idx`. You can access the elements of an `array_view` object by using either the `[]` operator and an index variable, or the `()` operator and the row and column variables. The example demonstrates both methods. The `array_view::synchronize` method copies the values of the `product` variable back to the `productMatrix` variable.

Add the `include` and `using` statements at the top of MatrixMultiply.cpp.

Modify the `main` method to call the `MultiplyWithAMP` method.
Three things change when you use tiling:

- You can create `tile_static` variables. Access to data in `tile_static` space can be many times faster than access to data in the global space. An instance of a `tile_static` variable is created for each tile, and all threads in the tile have access to the variable. The primary benefit of tiling is the performance gain due to `tile_static` access.
- You can call the `tile_barrier::wait` method to stop all of the threads in a tile at one point in the code. You cannot guarantee the order in which the threads run, only that all of the threads in a tile stop at the call to `tile_barrier::wait` before they continue execution.
- You have access to the index of the thread relative to the entire `array_view` object and the index relative to the tile. By using the local index, you can make your code easier to read and debug.
To take advantage of tiling, the algorithm must partition the matrix into tiles and then copy the tile data into `tile_static` variables for faster access. In this example, the matrix is partitioned into submatrices of equal size, and the product is found by multiplying the submatrices. In block form, the two matrices and their product are A = [[a, b], [c, d]], B = [[e, f], [g, h]], and AB = [[ae + bg, af + bh], [ce + dg, cf + dh]]. Because a through h are 2x2 matrices, all of their products and sums are also 2x2 matrices. It also follows that the product of A and B is a 4x4 matrix, as expected. To quickly check the algorithm, calculate the value of the element in the first row, first column of the product. That is the value of the element in the first row and first column of ae + bg, so you only have to calculate the first row, first column of ae and of bg. That value for ae is (1 * 1) + (2 * 5) = 11. The value for bg is (3 * 1) + (4 * 5) = 23. The final value is 11 + 23 = 34, which is correct.
The tiled algorithm differs from the non-tiled version in the following ways:

- It uses a `tiled_extent` object instead of an `extent` object in the `parallel_for_each` call.
- It uses a `tiled_index` object instead of an `index` object in the `parallel_for_each` call.
- It uses `tile_static` variables to hold the submatrices.
- It uses the `tile_barrier::wait` method to stop the threads for the calculation of the products of the submatrices.
Modify the `main` method.

To trace the algorithm, consider one of the threads that computes tile [0,0] of `product`. For that thread:

1. Copy the elements of tile [0,0] of `a` into `locA`. Copy the elements of tile [0,0] of `b` into `locB`. Notice that `product` is tiled, not `a` and `b`. Therefore, you use global indices to access `a`, `b`, and `product`. The call to `tile_barrier::wait` is essential. It stops all of the threads in the tile until both `locA` and `locB` are filled.

2. Multiply `locA` and `locB` and put the results in `product`.
3. Copy the elements of tile [0,1] of `a` into `locA`. Copy the elements of tile [1,0] of `b` into `locB`.

4. Multiply `locA` and `locB` and add the results to the values that are already in `product`.

The `tile_static` variables are created for each tile, and the call to `tile_barrier::wait` controls the program flow.
When you examine the algorithm, notice that each element is copied into `tile_static` memory twice. That data transfer does take time. However, once the data is in `tile_static` memory, access to it is much faster. Because calculating the products requires repeated access to the values in the submatrices, there is an overall performance gain. For each algorithm, experimentation is required to find the optimal tile size.

In this example, each element is copied into `tile_static` memory only twice, which is not a significant performance gain. However, if A and B were 1024x1024 matrices and the tile size were 16, there would be a significant performance gain. In that case, each element would be copied into `tile_static` memory only 16 times and accessed from `tile_static` memory 1024 times.

Modify the `main` method to call the `MultiplyWithTiling` method, as shown.