size of each GEMV fixed to $n_b$ and the column size is the sum of the ranks along the tile row. Therefore, the total flops for phase 3 is $2Rn_b$. The overall flop count of TLR-MVM is thus
\begin{equation}
\mathrm{FLOPS}_{\text{TLR-MVM}} = 2Rn_b + 2Rn_b = 4Rn_b
\tag{6.2}
\end{equation}
In order to compute the memory bandwidth of TLR-MVM, one needs to count the bytes read from and written back to main memory. During phase 1, the total bytes read and written are $B(Rn_b + n + R)$. In phase 2, $2BR$ bytes move through memory. In phase 3, the total bytes read and written are $B(Rn_b + R + m)$. The sustained memory bandwidth is therefore
\begin{equation}
\mathrm{BW}_{\text{TLR-MVM}} = \frac{B\,(2Rn_b + 4R + n + m)}{t_u + t_v}
\tag{6.3}
\end{equation}
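To make the model concrete, here is a minimal C++ sketch that evaluates Equations (6.2) and (6.3). It assumes $B$ denotes the bytes per matrix element (8 for single complex) and takes $t_u$ and $t_v$ as the measured phase times in seconds; all sizes in main() are hypothetical.

#include <cstdio>

// Sketch of the TLR-MVM performance model (Eqs. 6.2 and 6.3).
// R: sum of ranks, nb: tile size, m x n: matrix size,
// B: bytes per element, tu/tv: measured phase times in seconds.
double tlr_mvm_flops(double R, double nb) {
    return 2.0 * R * nb + 2.0 * R * nb;   // phases 1 and 3: 4*R*nb
}

double tlr_mvm_bandwidth(double R, double nb, double m, double n,
                         double B, double tu, double tv) {
    double bytes = B * (2.0 * R * nb + 4.0 * R + n + m);
    return bytes / (tu + tv);             // sustained bytes per second
}

int main() {
    double R = 1.0e5, nb = 256, m = 1.0e4, n = 1.0e4, B = 8;
    double tu = 1.0e-3, tv = 1.0e-3;
    std::printf("GFlop/s: %.2f\n", tlr_mvm_flops(R, nb) / (tu + tv) / 1e9);
    std::printf("GB/s:    %.2f\n",
                tlr_mvm_bandwidth(R, nb, m, n, B, tu, tv) / 1e9);
}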
6.3.2 MPI+OpenMP Implementation
Algorithm 4 MPI+OpenMP pseudocode of TLR-MVM
1: Compress A
2: Get mpirank, mpisize ▷ rank and world size
3: nt = 0 ▷ number of tiles on each process
4: for each tile column index j do ▷ get number of tiles
5:   opnode = j mod mpisize
6:   if mpirank = opnode then
7:     nt = nt + 1
8:   end if
9: end for
10: for each tile row index i do
11:   for each tile column index j do
12:     opnode = j mod mpisize
13:     if root then
14:       send tiles U_{i,j}, V_{i,j} and x_j to opnode
15:     else
16:       opnode receives U_{i,j}, V_{i,j} and x_j from root process
17:     end if
18:   end for
19: end for ▷ 1D cyclic data splitting
20: Call OpenMP TLR-MVM from Algorithm 3
21: Reduce the partial results from each MPI process to root
The two pseudocodes assume constant ranks for readability. The generalized code for variable ranks contains additional pointer arithmetic to ensure proper offsetting when traversing the U and V bases and the intermediate sets of vectors Yv and Yu.
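As an illustration of this offsetting, the following sketch computes the traversal offsets from a per-tile-column rank array via exclusive prefix sums. The packed layout (each V basis block of tile column j holding R_j * nb contiguous elements, and Yv holding R_j entries per tile column) and all identifiers are assumptions for illustration, not the actual implementation.

#include <vector>
#include <cstddef>

// Exclusive prefix sums over per-tile ranks yield the start offsets
// for traversing the packed V bases and the Yv buffer when ranks vary.
struct VarRankOffsets {
    std::vector<std::size_t> vbase;  // start of tile column j in packed V
    std::vector<std::size_t> yv;     // start of Yv_j in packed Yv
};

VarRankOffsets compute_offsets(const std::vector<int>& ranks, int nb) {
    VarRankOffsets off;
    std::size_t v = 0, y = 0;
    for (int r : ranks) {
        off.vbase.push_back(v);
        off.yv.push_back(y);
        v += static_cast<std::size_t>(r) * nb;  // elements in V_j
        y += static_cast<std::size_t>(r);       // entries in Yv_j
    }
    return off;
}

With these offsets, the phase-1 GEMV of tile column j reads ranks[j] * nb elements starting at vbase[j] and writes ranks[j] results starting at yv[j]; the same pattern applies to the U bases and Yu.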
Algorithm 4 highlights the MPI+OpenMP version of the TLR-MVM implementation. We use a 1D cyclic block data distribution similar to ScaLAPACK [97] to mitigate the load imbalance that may appear with variable ranks. We split the U and V bases vertically among the MPI processes. While the splitting of the U bases eventually yields workloads that can be operated on in an embarrassingly parallel fashion, the vertical splitting of the V bases requires an MPI reduction to sum the partial results onto the root process. However, the three computational phases powered by OpenMP (i.e., Algorithm 3) can run independently on each MPI process, as sketched below.
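Here is a minimal sketch of the 1D cyclic ownership test and the final reduction, assuming each process accumulates a full-length partial result in y_local; the tile count and vector length are placeholders.

#include <mpi.h>
#include <vector>
#include <complex>

// Sketch of Algorithm 4: 1D cyclic tile-column ownership plus the
// final reduction onto the root. Identifiers are illustrative only.
int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int mpirank, mpisize;
    MPI_Comm_rank(MPI_COMM_WORLD, &mpirank);
    MPI_Comm_size(MPI_COMM_WORLD, &mpisize);

    const int ntilecols = 64;   // hypothetical number of tile columns
    const int m = 4096;         // hypothetical global output length
    std::vector<std::complex<float>> y_local(m, {0.0f, 0.0f});

    for (int j = 0; j < ntilecols; ++j) {
        if (j % mpisize != mpirank) continue;  // 1D cyclic ownership
        // ... run the three OpenMP phases (Algorithm 3) on the owned
        // tile column and accumulate the partial result into y_local ...
    }

    // Sum the partial results from all processes onto the root.
    std::vector<std::complex<float>> y(m);
    MPI_Reduce(y_local.data(), y.data(), m, MPI_CXX_FLOAT_COMPLEX,
               MPI_SUM, /*root=*/0, MPI_COMM_WORLD);

    MPI_Finalize();
}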
6.3.3 CUDA Implementation
The authors in [67] implement TLR-MVM in C++ on the NEC SX-Aurora TSUBASA vector engine using standard MPI+OpenMP programming models. They introduce a load balancing technique to reduce the idle times encountered when simultaneously processing all frequency-space seismic operators in single complex precision arithmetic.
Algorithm 5 Pseudocode of a single TLR-MVM using CUDA Graph
Initialization:
1) Compress A to obtain the compressed U bases and V bases.
2) Initialize the CUDA streams; each stream gets its own rank (streamrank) and world size (streamsize).
LOOP Process
for each stacked tile column V_j do
  if (j mod streamsize) = streamrank then
    Compute the MVM of column V_j on stream (j mod streamsize)
  end if
end for
Synchronize all streams.
Phase 2: Reshuffle on stream 0.
Synchronize all streams.
for each concatenated tile row U_i do
  if (i mod streamsize) = streamrank then
    Compute the MVM of row U_i on stream (i mod streamsize)
  end if
end for
Synchronize all streams.
return Y
[Figure: TLR-MVM execution DAG on GPU using streams and the Graph API. Start of Graph; Phase 1: batch MVMs on 20 streams (streams 0 to 19); Phase 2 on stream 0; Phase 3: batch MVMs on 20 streams (streams 0 to 19); End of Graph.]
Figure 6.7: CUDA Graph of a single TLR-MVM with 20 streams.
Since the single complex datatype for MVM (i.e., CGEMV) is widely supported by vendor numerical libraries, we propose to further extend TLR-MVM to a wider family of systems, including x86, ARM, and GPU. For x86 and ARM, MPI+OpenMP is natively supported and portability is attained with modest effort. However, to achieve performance, the MPI processes and the corresponding spawned OpenMP threads must be mapped carefully onto the underlying core/socket packaging. For GPUs (NVIDIA studied herein), MPI remains the bridge for communication across GPUs, while the CUDA software ecosystem is needed inside each GPU. In particular, we rely on streams to launch the batched MVMs with variable sizes and monitor the data dependencies inside TLR-MVM.
These streams are then captured by the CUDA Graph framework for asynchronous execution, reducing the kernel launch overheads observed when operating on small data structures. Figure 6.7 shows the DAG of TLR-MVM when processing a single frequency matrix with 20 streams. Phases 1 and 3 are embarrassingly parallel and benefit the most from streams. Since SRI involves many frequency matrices, we merge each phase of TLR-MVM across all frequency matrices to further expose parallelism, reduce CUDA Graph overheads, and increase hardware occupancy. Inspired by the merged-phase strategy of Section 7.2, we build a single DAG that launches operations on all frequency matrices at the same time. We execute phase 1 of all frequency matrices in a round-robin fashion and synchronize before phase 2. We then execute phase 3 of all frequency matrices, again in a round-robin fashion, until all MVM tasks (i.e., CGEMV) are completed.
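A minimal sketch of this capture-and-replay pattern with the CUDA Graph API is shown below; the stand-in kernel and all sizes are placeholders for the actual batched CGEMV launches, not the implementation itself.

#include <cuda_runtime.h>

// Stand-in kernel for one batched tile MVM; the actual code launches
// vendor-library CGEMVs. All names and sizes here are hypothetical.
__global__ void tile_mvm(float* buf, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) buf[i] += 1.0f;
}

int main() {
    const int NS = 20;                       // 20 streams, as in Figure 6.7
    cudaStream_t s[NS];
    cudaEvent_t fork1, fork2, e[NS];
    for (int i = 0; i < NS; ++i) { cudaStreamCreate(&s[i]); cudaEventCreate(&e[i]); }
    cudaEventCreate(&fork1); cudaEventCreate(&fork2);

    int n = 1 << 16;
    float* buf;
    cudaMalloc(&buf, n * sizeof(float));

    // Record the whole three-phase DAG once via stream capture.
    cudaStreamBeginCapture(s[0], cudaStreamCaptureModeGlobal);
    cudaEventRecord(fork1, s[0]);
    for (int i = 1; i < NS; ++i) cudaStreamWaitEvent(s[i], fork1, 0);
    for (int j = 0; j < NS; ++j)             // phase 1: independent MVMs
        tile_mvm<<<(n + 255) / 256, 256, 0, s[j]>>>(buf, n);
    for (int i = 1; i < NS; ++i) {           // join all streams on stream 0
        cudaEventRecord(e[i], s[i]);
        cudaStreamWaitEvent(s[0], e[i], 0);
    }
    tile_mvm<<<(n + 255) / 256, 256, 0, s[0]>>>(buf, n);  // phase 2: reshuffle
    cudaEventRecord(fork2, s[0]);
    for (int i = 1; i < NS; ++i) cudaStreamWaitEvent(s[i], fork2, 0);
    for (int j = 0; j < NS; ++j)             // phase 3: independent MVMs
        tile_mvm<<<(n + 255) / 256, 256, 0, s[j]>>>(buf, n);
    for (int i = 1; i < NS; ++i) {
        cudaEventRecord(e[i], s[i]);
        cudaStreamWaitEvent(s[0], e[i], 0);
    }
    cudaGraph_t graph;
    cudaStreamEndCapture(s[0], &graph);

    cudaGraphExec_t exec;
    cudaGraphInstantiate(&exec, graph, 0);   // CUDA 12 signature
    cudaGraphLaunch(exec, s[0]);             // one launch replays the DAG
    cudaStreamSynchronize(s[0]);
    return 0;
}

A single cudaGraphLaunch then replays all recorded phase-1/2/3 kernels with their dependencies intact, which is what amortizes the per-kernel launch overhead across the many small CGEMVs.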