6.2 Tile Low-Rank Matrix Vector Multiplication (TLR-MVM)
We introduce here the Tile Low-Rank Matrix Vector Multiplication (TLR-MVM).
We also compare it with dense MVM and show the resulting performance improvements on various architectures.
6.2.1 Algorithm Description
The main idea behind TLR-MVM is to perform data compression on a tiled matrix using linear algebra compression algorithms (e.g., rank-revealing QR [93], regular/randomized SVD [94], etc.) and to express the new MVM algorithm in terms of standard BLAS calls.
Figure 6.1: Original dense MVM.
Figure 6.2: Dense vs. TLR-compressed data structure for MVM.
Fig. 6.1 shows a rectangular matrix split into 4 × 6 tiles. The rectangular matrix A is of size m × n. The tile size is nb, so each tile Ai,j is of size nb × nb.
We use a standard SVD algorithm to obtain, for each tile Ai,j, the bases Ui,j (left singular vectors), Vi,j (right singular vectors), and the singular values Σi,j. Ui,j is of size nb × nb, Σi,j is of size nb × nb, and Vi,j is of size nb × nb. Σi,j is a diagonal matrix whose diagonal entries are the singular values of Ai,j, sorted in descending order. We define the rank-k approximation of tile Ai,j as Ãi,j(k), with

Ãi,j(k) = Ui,j[:, :k] Σi,j[:k, :k] Vi,j[:k, :] (6.1)
Here we take the first k column vectors of Ui,j, the leading k × k sub-block of Σi,j, and the first k row vectors of Vi,j. We use a singular value selection algorithm based on the Frobenius norm. The algorithm defines an error threshold ϵ: singular values and singular vectors are added until

∥Ai,j − Ãi,j(k)∥F / ∥Ai,j∥F ≤ ϵ.

We iterate through all tiles of the matrix A and obtain the result shown in Fig. 6.2. The diagonal singular value blocks are multiplied into the V bases. Theoretical details may be found in [95].
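For illustration purposes, a minimal NumPy sketch of this per-tile compression step could look as follows; the helper name compress_tile and its truncation logic are our own assumptions consistent with the Frobenius-norm criterion above, not the reference implementation.

```python
import numpy as np

def compress_tile(A_ij, eps):
    """Rank-k truncation of one nb x nb tile (illustrative sketch).

    Returns U_ij[:, :k] and Sigma_ij[:k, :k] @ V_ij[:k, :], i.e. the singular
    values are folded into the V basis, with k chosen as the smallest rank
    whose relative Frobenius error does not exceed eps.
    """
    U, s, Vt = np.linalg.svd(A_ij, full_matrices=False)
    # ||A - A~(k)||_F^2 is the sum of the squared discarded singular values,
    # so tail[k] is the Frobenius error of the rank-k approximation.
    tail = np.sqrt(np.cumsum((s ** 2)[::-1]))[::-1]
    admissible = np.nonzero(tail <= eps * np.linalg.norm(A_ij, "fro"))[0]
    k = int(admissible[0]) if admissible.size else len(s)
    k = max(k, 1)                       # keep at least one singular triplet
    return U[:, :k], s[:k, None] * Vt[:k, :]
```

Folding Σ into the V basis reproduces the compressed layout of Fig. 6.2, where the diagonal singular value blocks are already multiplied into the V bases.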
Figure 6.3: Stacked bases for TLR-compressed format.
Figure 6.4: Batch of MVM in V bases (Phase 1).
Fig. 6.3 shows the U and V bases in blue and green, respectively. For simplicity, we assume k is constant, but in real datasets k varies from tile to tile. Note that this compression step needs to be done only once: the compressed U/V bases can be reused to perform the computation multiple times. For example, in the astronomy application, the MVM operation A × x = y needs to be performed successively thousands of times at a regular and constant pace to cope with the evolving wind and atmospheric conditions.
Once each tile is compressed, we stack the bases together to ensure contiguous memory access during the computation, as pictured in Fig. 6.4. After the TLR-compressed data structure is formed, TLR-MVM is expressed in three phases:
(1) a batch of regular MVMs involving only the V bases and the input vector x, resulting in the set of output vectors Yv (Fig. 6.4); (2) a reshuffle phase that projects the ranks (identified by an orange delimiter) within each set of vectors Yv from phase 1 onto the U bases, resulting in the set of output vectors Yu (Fig. 6.5); and (3) a batch of regular MVMs involving only the U bases and the set of vectors Yu (Fig. 6.6), which finally computes the approximated command vector y.
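As an illustration of these three phases, the following NumPy sketch runs them on tiles stored as nested lists; the function name tlr_mvm and this storage layout are our own assumptions, whereas the optimized implementation maps phases 1 and 3 to batches of BLAS GEMV calls and phase 2 to pure data movement.

```python
import numpy as np

def tlr_mvm(U, V, x, nb):
    """Approximate y = A @ x from the TLR-compressed tiles.

    U[i][j] is the nb x k_ij left basis of tile (i, j); V[i][j] is the
    k_ij x nb right basis with the singular values already folded in.
    """
    mt, nt = len(U), len(U[0])          # number of tile rows / columns

    # Phase 1: one GEMV per tile column with the stacked V bases.
    Yv = [np.concatenate([V[i][j] for i in range(mt)], axis=0)
          @ x[j * nb:(j + 1) * nb] for j in range(nt)]

    # Phase 2: reshuffle -- regroup the rank segments of Yv by tile row.
    Yu = []
    for i in range(mt):
        parts = []
        for j in range(nt):
            offset = sum(V[p][j].shape[0] for p in range(i))  # start of segment (i, j)
            parts.append(Yv[j][offset:offset + V[i][j].shape[0]])
        Yu.append(np.concatenate(parts))

    # Phase 3: one GEMV per tile row with the stacked U bases.
    y = np.concatenate([np.concatenate([U[i][j] for j in range(nt)], axis=1)
                        @ Yu[i] for i in range(mt)])
    return y
```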
Figure 6.5: Intermediate vector projected from V to U bases (Phase 2).
Figure 6.6: Batch of MVM with U bases and output command vector y (Phase 3).
A similar concept of stacking the bases has been applied to sparse direct solvers to further increase the arithmetic intensity of the factorization [96]. Since MVM is inherently memory-bound, we stack the bases in order to increase data locality in the high-level caches of the memory subsystem. This allows our implementation to potentially decouple from main memory and, depending on the underlying hardware architecture, to occasionally benefit from the higher cache bandwidth.
6.2.2 Arithmetic Complexity of TLR-MVM
We describe the arithmetic complexity and memory bandwidth formulas of TLR-MVM.
To begin with, we first show how to calculate the arithmetic complexity and memory bandwidth of dense matrix-vector multiplication. Given a dense matrix A of size m × n, the floating-point operations (flops) and the memory bandwidth of the dense MVM are 2mn and B(mn + n + m)/t, respectively, where B is the number of bytes per element and t is the execution time. In TLR-MVM, the flops are counted as follows. In phase 1, the column size of each GEMV is nb and the row size of each GEMV is the sum of the ranks along the corresponding tile column, so the flops for the V phase amount to 2Rnb, where R is the sum of the ranks across all tiles of the matrix. In the reshuffle phase 2, there are no computations, only data movement. In phase 3, the row size of each GEMV is fixed to nb and the column size is the sum of the ranks along the tile row; therefore, the total flops for phase 3 is also 2Rnb. The overall flop count of TLR-MVM is thus
FLOPS_TLR-MVM = 2Rnb + 2Rnb = 4Rnb (6.2)
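For concreteness, the flop counts of dense MVM and Eq. (6.2) can be compared on made-up sizes; the values of m, n, nb, and the average rank below are hypothetical and only serve to illustrate the formula.

```python
# Hypothetical sizes, not taken from any dataset in this chapter.
m, n, nb = 4096, 4096, 256
mt, nt = m // nb, n // nb          # 16 x 16 tiles
avg_rank = 16                      # assumed average rank per tile
R = avg_rank * mt * nt             # sum of the ranks over all tiles

flops_dense = 2 * m * n            # dense MVM
flops_tlr = 4 * R * nb             # Eq. (6.2)
print(flops_dense / flops_tlr)     # 8.0 for these made-up ranks
```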
In order to compute the memory bandwidth of TLR-MVM, one needs to count the bytes read from and written back to main memory. During phase 1, the total bytes read and written are B(Rnb + n + R). In phase 2, 2BR bytes are moved through memory. In phase 3, the total bytes read and written are B(Rnb + R + m). The memory bandwidth is therefore
BW_TLR-MVM = B(2Rnb + 4R + n + m) / (t_v + t_u) (6.3)

where t_v and t_u denote the execution times of the phases involving the V and U bases, respectively.
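A similarly hypothetical evaluation of Eq. (6.3), with single-precision data (B = 4 bytes) and a placeholder timing for t_v + t_u, illustrates how the sustained bandwidth would be derived from a measured run.

```python
# Reuses the made-up sizes from the previous snippet; the timing is a placeholder.
B = 4                                            # bytes per element (fp32)
m, n, nb, R = 4096, 4096, 256, 4096
t_v_plus_t_u = 1e-3                              # measured phase times, seconds (placeholder)

bytes_moved = B * (2 * R * nb + 4 * R + n + m)   # numerator of Eq. (6.3)
print(bytes_moved / t_v_plus_t_u / 1e9, "GB/s")  # sustained memory bandwidth estimate
```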