size of each GEMV fixed to $n_b$ and the column size is the sum of the ranks along the tile row. Therefore, the total flops for phase 3 is $2Rn_b$. The overall flop count of TLR-MVM is thus
\begin{equation}
\mathrm{FLOPS}_{\text{TLR-MVM}} = 2Rn_b + 2Rn_b = 4Rn_b
\tag{6.2}
\end{equation}
In order to compute the memory bandwidth of TLR-MVM, one needs to count the bytes read from and written back to main memory. During phase 1, the total bytes read and written are $B(Rn_b + n + R)$. In phase 2, $2BR$ bytes move through memory. In phase 3, the total bytes read and written are $B(Rn_b + R + m)$. The sustained memory bandwidth is therefore
\begin{equation}
\mathrm{BW}_{\text{TLR-MVM}} = \frac{B\,(2Rn_b + 4R + n + m)}{t_u + t_v}
\tag{6.3}
\end{equation}
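To make the model concrete, here is a minimal C++ sketch that evaluates Equations (6.2) and (6.3). It assumes $B$ denotes the bytes per matrix element (8 for single complex) and takes $t_u$ and $t_v$ as the measured phase times in seconds; all sizes in main() are hypothetical.

#include <cstdio>

// Sketch of the TLR-MVM performance model (Eqs. 6.2 and 6.3).
// R: sum of ranks, nb: tile size, m x n: matrix size,
// B: bytes per element, tu/tv: measured phase times in seconds.
double tlr_mvm_flops(double R, double nb) {
    return 2.0 * R * nb + 2.0 * R * nb;   // phases 1 and 3: 4*R*nb
}

double tlr_mvm_bandwidth(double R, double nb, double m, double n,
                         double B, double tu, double tv) {
    double bytes = B * (2.0 * R * nb + 4.0 * R + n + m);
    return bytes / (tu + tv);             // sustained bytes per second
}

int main() {
    double R = 1.0e5, nb = 256, m = 1.0e4, n = 1.0e4, B = 8;
    double tu = 1.0e-3, tv = 1.0e-3;
    std::printf("GFlop/s: %.2f\n", tlr_mvm_flops(R, nb) / (tu + tv) / 1e9);
    std::printf("GB/s:    %.2f\n",
                tlr_mvm_bandwidth(R, nb, m, n, B, tu, tv) / 1e9);
}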
6.3.2 MPI+OpenMP Implementation
Algorithm 4 MPI+OpenMP pseudocode of TLR-MVM
1: Compress A
2: Get mpirank, mpisize ▷ rank and world size
3: nt = 0 ▷ number of tiles on each process
4: for each tile column index j do ▷ get number of tiles
5:   opnode = j mod mpisize
6:   if mpirank = opnode then
7:     nt = nt + 1
8:   end if
9: end for
10: for each tile row index i do
11:   for each tile column index j do
12:     opnode = j mod mpisize
13:     if root then
14:       send tiles U_{i,j}, V_{i,j} and x_j to opnode
15:     else
16:       opnode receives U_{i,j}, V_{i,j} and x_j from root process
17:     end if
18:   end for
19: end for ▷ 1D cyclic data splitting
20: Call OpenMP TLR-MVM from Algorithm 3
21: Reduce the partial results from each MPI process to root
The two pseudocodes assume constant ranks for readability. The generalized code for variable ranks contains additional pointer arithmetic to ensure proper offsetting when traversing the U and V bases and the intermediate sets of vectors Yv and Yu.
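As an illustration of this offsetting, the following sketch computes the traversal offsets from a per-tile-column rank array via exclusive prefix sums. The packed layout (each V basis block of tile column j holding R_j * nb contiguous elements, and Yv holding R_j entries per tile column) and all identifiers are assumptions for illustration, not the actual implementation.

#include <vector>
#include <cstddef>

// Exclusive prefix sums over per-tile ranks yield the start offsets
// for traversing the packed V bases and the Yv buffer when ranks vary.
struct VarRankOffsets {
    std::vector<std::size_t> vbase;  // start of tile column j in packed V
    std::vector<std::size_t> yv;     // start of Yv_j in packed Yv
};

VarRankOffsets compute_offsets(const std::vector<int>& ranks, int nb) {
    VarRankOffsets off;
    std::size_t v = 0, y = 0;
    for (int r : ranks) {
        off.vbase.push_back(v);
        off.yv.push_back(y);
        v += static_cast<std::size_t>(r) * nb;  // elements in V_j
        y += static_cast<std::size_t>(r);       // entries in Yv_j
    }
    return off;
}

With these offsets, the phase-1 GEMV of tile column j reads ranks[j] * nb elements starting at vbase[j] and writes ranks[j] results starting at yv[j]; the same pattern applies to the U bases and Yu.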
Algorithm 4 highlights the MPI+OpenMP version of the TLR-MVM implementation. We use a 1D cyclic block data distribution similar to ScaLAPACK [97] to mitigate the load imbalance that may appear with variable ranks. We split the U and V bases vertically among the MPI processes. While the splitting of the U bases eventually yields workloads that can be operated on in an embarrassingly parallel fashion, the vertical splitting of the V bases requires an MPI reduction to sum the partial results onto the root process. However, the three computational phases powered by OpenMP (i.e., Algorithm 3) can run independently on each MPI process, as sketched below.
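Here is a minimal sketch of the 1D cyclic ownership test and the final reduction, assuming each process accumulates a full-length partial result in y_local; the tile count and vector length are placeholders.

#include <mpi.h>
#include <vector>
#include <complex>

// Sketch of Algorithm 4: 1D cyclic tile-column ownership plus the
// final reduction onto the root. Identifiers are illustrative only.
int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int mpirank, mpisize;
    MPI_Comm_rank(MPI_COMM_WORLD, &mpirank);
    MPI_Comm_size(MPI_COMM_WORLD, &mpisize);

    const int ntilecols = 64;   // hypothetical number of tile columns
    const int m = 4096;         // hypothetical global output length
    std::vector<std::complex<float>> y_local(m, {0.0f, 0.0f});

    for (int j = 0; j < ntilecols; ++j) {
        if (j % mpisize != mpirank) continue;  // 1D cyclic ownership
        // ... run the three OpenMP phases (Algorithm 3) on the owned
        // tile column and accumulate the partial result into y_local ...
    }

    // Sum the partial results from all processes onto the root.
    std::vector<std::complex<float>> y(m);
    MPI_Reduce(y_local.data(), y.data(), m, MPI_CXX_FLOAT_COMPLEX,
               MPI_SUM, /*root=*/0, MPI_COMM_WORLD);

    MPI_Finalize();
}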
6.3.3 CUDA Implementation
The authors in [67] implement TLR-MVM in C++ on the NEC SX-Aurora TSUBASA vector engine using standard MPI+OpenMP programming models. They introduce a load balancing technique to reduce the idle times encountered when simultaneously processing all frequency-space seismic operators in single complex precision arithmetic.
Algorithm 5 Pseudocode of a single TLR-MVM using CUDA Graph
Initialization:
1) Compress A to obtain the compressed U bases and V bases.
2) Initialize the CUDA streams; each stream gets its own rank (streamrank) and world size (streamsize).
LOOP Process
for each stacked tile column V_j do
  if (j mod streamsize) = streamrank then
    Compute the MVM of column V_j on stream (j mod streamsize)
  end if
end for
Synchronize all streams.
Phase 2: Reshuffle on stream 0.
Synchronize all streams.
for each concatenated tile row U_i do
  if (i mod streamsize) = streamrank then
    Compute the MVM of row U_i on stream (i mod streamsize)
  end if
end for
Synchronize all streams.
return Y
[Figure: TLR-MVM execution DAG on GPU using streams and the Graph API. Start of Graph; Phase 1: batch MVMs on 20 streams (streams 0 to 19); Phase 2 on stream 0; Phase 3: batch MVMs on 20 streams (streams 0 to 19); End of Graph.]
Figure 6.7: CUDA Graph of a single TLR-MVM with 20 streams.
Since the single complex datatype for MVM (i.e., CGEMV) is widely supported by vendor numerical libraries, we propose to further extend TLR-MVM to a wider family of systems, including x86, ARM, and GPU. For x86 and ARM, MPI+OpenMP is natively supported and portability is attained with modest effort. However, to achieve performance, the MPI processes and the corresponding spawned OpenMP threads must be mapped carefully onto the underlying core/socket packaging. For GPUs (NVIDIA studied herein), MPI remains the bridge for communication across GPUs, while the CUDA software ecosystem is needed inside each GPU. In particular, we rely on streams to launch the batched MVMs with variable sizes and monitor the data dependencies inside TLR-MVM.
These streams are then captured by the CUDA Graph framework for asynchronous execution, reducing the kernel launch overheads observed when operating on small data structures. Figure 6.7 shows the DAG of TLR-MVM when processing a single frequency matrix with 20 streams. Phases 1 and 3 are embarrassingly parallel and benefit the most from streams. Since SRI involves many frequency matrices, we merge each phase of TLR-MVM across all frequency matrices to further expose parallelism, reduce CUDA Graph overheads, and increase hardware occupancy. Inspired by the merged-phase strategy of Section 7.2, we build a single DAG that launches operations on all frequency matrices at the same time. We execute phase 1 of all frequency matrices in a round-robin fashion and synchronize before phase 2. We then execute phase 3 of all frequency matrices, again in a round-robin fashion, until all MVM tasks (i.e., CGEMV) are completed.
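A minimal sketch of this capture-and-replay pattern with the CUDA Graph API is shown below; the stand-in kernel and all sizes are placeholders for the actual batched CGEMV launches, not the implementation itself.

#include <cuda_runtime.h>

// Stand-in kernel for one batched tile MVM; the actual code launches
// vendor-library CGEMVs. All names and sizes here are hypothetical.
__global__ void tile_mvm(float* buf, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) buf[i] += 1.0f;
}

int main() {
    const int NS = 20;                       // 20 streams, as in Figure 6.7
    cudaStream_t s[NS];
    cudaEvent_t fork1, fork2, e[NS];
    for (int i = 0; i < NS; ++i) { cudaStreamCreate(&s[i]); cudaEventCreate(&e[i]); }
    cudaEventCreate(&fork1); cudaEventCreate(&fork2);

    int n = 1 << 16;
    float* buf;
    cudaMalloc(&buf, n * sizeof(float));

    // Record the whole three-phase DAG once via stream capture.
    cudaStreamBeginCapture(s[0], cudaStreamCaptureModeGlobal);
    cudaEventRecord(fork1, s[0]);
    for (int i = 1; i < NS; ++i) cudaStreamWaitEvent(s[i], fork1, 0);
    for (int j = 0; j < NS; ++j)             // phase 1: independent MVMs
        tile_mvm<<<(n + 255) / 256, 256, 0, s[j]>>>(buf, n);
    for (int i = 1; i < NS; ++i) {           // join all streams on stream 0
        cudaEventRecord(e[i], s[i]);
        cudaStreamWaitEvent(s[0], e[i], 0);
    }
    tile_mvm<<<(n + 255) / 256, 256, 0, s[0]>>>(buf, n);  // phase 2: reshuffle
    cudaEventRecord(fork2, s[0]);
    for (int i = 1; i < NS; ++i) cudaStreamWaitEvent(s[i], fork2, 0);
    for (int j = 0; j < NS; ++j)             // phase 3: independent MVMs
        tile_mvm<<<(n + 255) / 256, 256, 0, s[j]>>>(buf, n);
    for (int i = 1; i < NS; ++i) {
        cudaEventRecord(e[i], s[i]);
        cudaStreamWaitEvent(s[0], e[i], 0);
    }
    cudaGraph_t graph;
    cudaStreamEndCapture(s[0], &graph);

    cudaGraphExec_t exec;
    cudaGraphInstantiate(&exec, graph, 0);   // CUDA 12 signature
    cudaGraphLaunch(exec, s[0]);             // one launch replays the DAG
    cudaStreamSynchronize(s[0]);
    return 0;
}

A single cudaGraphLaunch then replays all recorded phase-1/2/3 kernels with their dependencies intact, which is what amortizes the per-kernel launch overhead across the many small CGEMVs.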