

Chapter 3 PARALLELIZATION STRATEGIES

3.4 Quantitative Study on the Advanced Parallelization Strategies

Parallelization is essentially problem dependent, and how successfully an algorithm embraces the distinct features of the problem under consideration is crucial. Although qualitative analyses of individual parallel algorithms are widely available, quantitative comparisons among advanced parallel algorithms are not well documented so far. Therefore, a comparative study of parallel strategies, particularly for factorization, has been conducted. Optimizations of the selected parallel algorithm are then suggested to exploit two unique features of the present problem: the highly banded nature of the stiffness matrix and the fact that only a small portion of it is affected by the addition of penalty elements.

Among a multitude of advanced parallel strategies, three representative ones for factorization are studied herein: (1) the broadcasting method, (2) the pipelined method, and (3) the look-ahead method (e.g., Casanova et al. 2009). Associated pseudocodes are provided in appendix C.

Before moving forward, it is useful to define the key procedures of the serial factorization.

Let P(k) be the preparation procedure at step k: the preparation of the factors for the subrows below the kth diagonal term, performed on processor Pk (Pk = the processor holding the kth column and diagonal term). Let U(k) be the update procedure at step k: the update of the submatrix entries Ai,j for i, j > k with the factors precalculated from the kth diagonal term.
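These two procedures can be made concrete with a minimal serial sketch. The code below is an illustrative in-place LU factorization without pivoting (an assumption made here for brevity; the actual routine and its data layout are those of appendix C):

```python
def lu_factorize(A):
    """In-place LU factorization without pivoting (illustrative sketch).

    At each step k:
      P(k): compute the multipliers A[i][k] /= A[k][k] for i > k
            (the factors that the parallel versions communicate);
      U(k): rank-1 update of the trailing submatrix A[i][j], i, j > k.
    On return, U sits on and above the diagonal and the multipliers of L
    sit strictly below it.
    """
    n = len(A)
    for k in range(n):
        # P(k): prepare the factors below the kth diagonal term
        for i in range(k + 1, n):
            A[i][k] /= A[k][k]
        # U(k): update the trailing submatrix with the precalculated factors
        for i in range(k + 1, n):
            for j in range(k + 1, n):
                A[i][j] -= A[i][k] * A[k][j]
    return A
```

The parallel strategies discussed below differ only in how the P(k) output is moved between processors; the arithmetic of U(k) is identical in all of them.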

The first and simplest parallelization strategy is the broadcasting scheme, which uses the direct broadcasting command in MPI (i.e., MPI_Bcast) at each step. The key stream can be summarized as: P(k) → broadcast to all processors → U(k). It is remarkably easy to understand and implement, and indeed the broadcasting command can be almost freely interleaved into the routine. The key drawback of this approach, however, is that all other processors must wait for the data from the sender processor Pk until P(k) is fully finished on Pk at each step, causing unnecessary waiting cost between processors. Furthermore, the broadcasting command itself suffers communication inefficiency as the number of processors increases.
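The data flow of the broadcasting scheme can be sketched on a single process by treating the p processors as virtual ranks; the buffer copy below stands in for the MPI_Bcast call. The cyclic column distribution and all names here are illustrative assumptions, not the actual layout of the present platform:

```python
def broadcast_factorize(A, p):
    """Single-process simulation of the broadcast scheme (illustrative).

    Columns are dealt cyclically to p virtual ranks.  At step k the owner
    (rank k % p) performs P(k); the factor column is then 'broadcast' --
    here simply copied into a buffer -- before every rank applies U(k)
    to the columns it owns.
    """
    n = len(A)
    for k in range(n):
        # P(k), performed only by the owner rank k % p
        for i in range(k + 1, n):
            A[i][k] /= A[k][k]
        buf = [A[i][k] for i in range(k + 1, n)]   # MPI_Bcast payload
        # U(k): every rank updates its own columns j > k
        for rank in range(p):
            for j in range(k + 1, n):
                if j % p != rank:
                    continue
                for i in range(k + 1, n):
                    A[i][j] -= buf[i - k - 1] * A[k][j]
    return A
```

Because every rank blocks on the broadcast before starting U(k), the per-step critical path contains the full broadcast latency, which is the waiting cost discussed above.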

To expand on this adverse nature, it is instructive to review the cost analysis of two algorithms, both based on the simple broadcasting approach: (1) parallel factorization followed by triangular system solving (Karniadakis and Kirby 2003) and (2) parallel Gaussian elimination (Casanova et al. 2009). The total cost generally reads

Total running time ≈ α′/p + β′p, (3.1)

where α′ = αn³/3; β′ = (βn²/2 + L) or (βn² + L) for the former and the latter, respectively; α is the basic operation cost per element; β is the transfer cost per element; L is the communication startup cost; n is the system size; and p is the total number of processors.

The detailed derivation of the cost model given in eq. (3.1) is addressed in section 3.5.

If n ≫ p, the first term in eq. (3.1) governs the total running time, and an asymptotic parallel efficiency of order 1 can be achieved. For a moderate size of n, however, the effect of the second term in eq. (3.1) cannot be ignored, and simply increasing the total number of processors cannot guarantee parallel efficiency. Indeed, owing to the second term of eq. (3.1), the total cost undesirably increases as the number of processors grows, as shown in figure 3.4.
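The trade-off between the two terms of eq. (3.1) can be checked numerically. The sketch below evaluates the model for the factorization-plus-triangular-solve variant (β′ = βn²/2 + L); the numerical values of α, β, and L are illustrative assumptions only, not measured constants:

```python
import math

def total_cost(p, n, alpha=1e-9, beta=1e-8, L=1e-5):
    """Cost model of eq. (3.1) for the simple broadcasting scheme.

    alpha: basic operation cost per element, beta: transfer cost per
    element, L: communication startup cost (all illustrative values).
    """
    a = alpha * n**3 / 3       # alpha' : total arithmetic work
    b = beta * n**2 / 2 + L    # beta'  : per-step communication cost
    return a / p + b * p

def optimal_p(n, alpha=1e-9, beta=1e-8, L=1e-5):
    """The model alpha'/p + beta'*p is minimized at p* = sqrt(alpha'/beta');
    beyond p*, adding processors makes the run slower, not faster."""
    return math.sqrt((alpha * n**3 / 3) / (beta * n**2 / 2 + L))
```

For the test-system size of figure 3.4 (n = 2040) and these assumed constants, the model turns upward after a few tens of processors, reproducing the qualitative shape of the figure.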

Hence, the simple broadcasting scheme is regarded as the simplest yet poorest one in the later discussion and is used as the comparison base for the other advanced parallel strategies.

Figure 3.4. Costs of parallel factorizations attained from numerical simulations of a test system (size = 2040): parallel Gaussian elimination (dashed line) and parallel factorization followed by triangular system solving (solid line).

The second strategy is the pipelined algorithm, which follows the key notion of “pipelining” and incorporates the logical topology. In this scheme, every processor knows its logically closest neighbor; upon receiving the crucial data, a processor always passes the buffer to that neighbor first and only then performs U(k). In this fashion, the waiting cost between processors is remarkably reduced and the communication is efficiently accelerated. However, some latency still exists, the so-called “pipeline bubble,” due to the difference in computation time between predecessor and successor and the strictly fixed stream, namely P(k) → receive/immediately send → U(k).
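The reduced waiting cost of pipelining can be illustrated with a toy timing model: each step's buffer travels hop by hop around a logical ring, and a processor forwards before updating, so the communication of step k overlaps the computation of earlier steps. The broadcast version, by contrast, synchronizes on a tree broadcast of about ⌈log₂ p⌉ message hops per step. All timing constants and the ring/tree assumptions below are illustrative, not measurements of the present platform:

```python
import math

def broadcast_time(n, p, t_prep=1.0, t_msg=0.5, t_upd=1.0):
    """Per-step cost of the broadcasting scheme: everyone waits for P(k),
    then a tree broadcast (~ceil(log2 p) hops), then U(k) in lockstep."""
    return n * (t_prep + math.ceil(math.log2(p)) * t_msg + t_upd)

def pipelined_time(n, p, t_prep=1.0, t_msg=0.5, t_upd=1.0):
    """Toy model of the pipelined scheme on a logical ring: upon receiving
    the step-k buffer, a processor forwards it first and only then applies
    U(k), so sends of step k overlap updates of earlier steps."""
    busy = [0.0] * p                    # when each processor is next free
    for k in range(n):
        owner = k % p
        busy[owner] += t_prep           # P(k) on the owner
        recv = busy[owner] + t_msg      # first neighbor receives the buffer
        busy[owner] += t_msg + t_upd    # owner: send, then its own U(k)
        for hop in range(1, p - 1):
            r = (owner + hop) % p
            start = max(recv, busy[r])  # wait for buffer and local work
            recv = start + t_msg        # forward to the next neighbor first
            busy[r] = recv + t_upd      # then apply U(k) locally
        last = (owner + p - 1) % p
        busy[last] = max(recv, busy[last]) + t_upd  # last one only updates
    return max(busy)
```

Under this model the broadcast cost per step grows with log₂ p, while the pipelined cost per step stays near the per-processor work, which is the essence of the improvement reported below.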

The third and most advanced parallel algorithm is the so-called “look-ahead” scheme, which not only considers the logical topology but also reduces the pipeline bubble by placing top priority on communication over computation; when necessary for fast communication, a consecutive computation is often sacrificed. It should be stressed, however, that with a small number of processors the total time cost of the look-ahead method can become very expensive, even worse than that of the simple broadcasting method, since with few processors the serial computation, rather than the communication, usually governs the total cost.

Figure 3.5. Costs of look-ahead and pipelined factorization normalized by that of broadcasting method, all attained from simulations of a test system (size = 19176).

Figure 3.5 shows that both the pipelined and look-ahead methods exhibit highly improved efficiency compared to the simple broadcasting method. As pointed out, however, the poor performance of the look-ahead method for a small number of processors is noteworthy (e.g., for fewer than 24 processors). Unlike the look-ahead approach, the pipelined factorization method does not perform poorly relative to the simple broadcasting method even for a small number of processors.

In sum, provided that a sufficiently large number of processors is available, the look-ahead method can be regarded as the best parallel algorithm, but its advantage over the pipelined algorithm is not significant for a moderate-size system. Furthermore, the look-ahead algorithm tends to perform badly for a small number of processors. Therefore, the present parallel platform adopted the pipelined strategy as the starting point, on which optimizations in accordance with the key features of the program of interest were carried out.