


6.4 Comparison with Related Methods

Table 6.1: A summary of the convergence rates of different methods. DGD: Distributed Gradient Descent; D-NAG: Distributed Nesterov's Accelerated Gradient Descent; D-HBM: Distributed Heavy-Ball Method; Mou et al.: consensus algorithm of [146]; B-Cimmino: Block Cimmino Method; APC: Accelerated Projection-based Consensus. The smaller the convergence rate, the faster the method. Note that $\rho_{\text{GD}} \ge \rho_{\text{NAG}} \ge \rho_{\text{HBM}}$ and $\rho_{\text{Mou}} \ge \rho_{\text{Cim}} \ge \rho_{\text{APC}}$.

DGD: $\frac{\kappa(A^TA)-1}{\kappa(A^TA)+1} \approx 1 - \frac{2}{\kappa(A^TA)}$
D-NAG: $1 - \frac{2}{\sqrt{3\kappa(A^TA)+1}}$
D-HBM: $\frac{\sqrt{\kappa(A^TA)}-1}{\sqrt{\kappa(A^TA)}+1} \approx 1 - \frac{2}{\sqrt{\kappa(A^TA)}}$
Mou et al.: $1 - \mu_{\min}(X)$
B-Cimmino: $\frac{\kappa(X)-1}{\kappa(X)+1} \approx 1 - \frac{2}{\kappa(X)}$
APC (proposed): $\frac{\sqrt{\kappa(X)}-1}{\sqrt{\kappa(X)}+1} \approx 1 - \frac{2}{\sqrt{\kappa(X)}}$

6.4.1 Distributed Gradient Descent (DGD)

The regular or full gradient descent has the update rule $x(t+1) = x(t) - \alpha A^T(Ax(t) - b)$, where $\alpha > 0$ is the step size or learning rate. The distributed version of gradient descent is one in which each machine $i$ has only a subset of the equations, $[A_i, b_i]$, and computes its own part of the gradient, which is $A_i^T(A_i x(t) - b_i)$. The updates are then collectively done as

๐‘ฅ(๐‘ก+1)=๐‘ฅ(๐‘ก) โˆ’๐›ผ

๐‘š

ร•

๐‘–=1

๐ด๐‘‡

๐‘– (๐ด๐‘–๐‘ฅ(๐‘ก) โˆ’๐‘๐‘–). (6.11) One can show that this also has linear convergence, and the rate of convergence is

๐œŒGD = ๐œ…(๐ด๐‘‡๐ด) โˆ’1

๐œ…(๐ด๐‘‡๐ด) +1 โ‰ˆ 1โˆ’ 2

๐œ…(๐ด๐‘‡๐ด). (6.12)

We should mention that since each machine needs to compute $A_i^T(A_i x(t) - b_i)$ at each iteration $t$, the computational complexity per iteration is $2pn$, which is identical to that of APC.

6.4.2 Distributed Nesterov's Accelerated Gradient Descent (D-NAG)

A popular variant of gradient descent is Nesterov's accelerated gradient descent [151], which has a memory term, and works as follows:

๐‘ฆ(๐‘ก+1) =๐‘ฅ(๐‘ก) โˆ’๐›ผ

๐‘š

ร•

๐‘–=1

๐ด๐‘‡

๐‘– (๐ด๐‘–๐‘ฅ(๐‘ก) โˆ’๐‘๐‘–), (6.13a) ๐‘ฅ(๐‘ก+1) =(1+๐›ฝ)๐‘ฆ(๐‘ก+1) โˆ’ ๐›ฝ ๐‘ฆ(๐‘ก). (6.13b) One can show [125] that the optimal convergence rate of this method is

$$\rho_{\text{NAG}} = 1 - \frac{2}{\sqrt{3\kappa(A^TA)+1}}, \tag{6.14}$$

which is improved over the regular distributed gradient descent (one can check that $\frac{\kappa(A^TA)-1}{\kappa(A^TA)+1} \ge 1 - \frac{2}{\sqrt{3\kappa(A^TA)+1}}$).

6.4.3 Distributed Heavy-Ball Method (D-HBM)

The heavy-ball method [169], otherwise known as gradient descent with momentum, is another accelerated variant of gradient descent, defined as follows:

๐‘ง(๐‘ก+1) =๐›ฝ ๐‘ง(๐‘ก) +

๐‘š

ร•

๐‘–=1

๐ด๐‘‡

๐‘– (๐ด๐‘–๐‘ฅ(๐‘ก) โˆ’๐‘๐‘–), (6.15a) ๐‘ฅ(๐‘ก+1) =๐‘ฅ(๐‘ก) โˆ’๐›ผ ๐‘ง(๐‘ก+1). (6.15b)

It can be shown [125] that the optimal rate of convergence of this method is

$$\rho_{\text{HBM}} = \frac{\sqrt{\kappa(A^TA)}-1}{\sqrt{\kappa(A^TA)}+1} \approx 1 - \frac{2}{\sqrt{\kappa(A^TA)}}, \tag{6.16}$$

which is further improved over DGD and D-NAG ($\frac{\kappa(A^TA)-1}{\kappa(A^TA)+1} \ge 1 - \frac{2}{\sqrt{3\kappa(A^TA)+1}} \ge \frac{\sqrt{\kappa(A^TA)}-1}{\sqrt{\kappa(A^TA)}+1}$). This is similar to, but not the same as, the rate of convergence of APC.

The difference is that the condition number of $A^TA = \sum_{i=1}^{m} A_i^T A_i$ is replaced with the condition number of $X = \sum_{i=1}^{m} A_i^T (A_i A_i^T)^{-1} A_i$ in APC. Given its structure as a sum of projection matrices, one may speculate that $X$ has a much better condition number than $A^TA$. Indeed, our experiments with random, as well as real, data sets suggest that this is the case and that the condition number of $X$ is often significantly better (see Table 6.2).

6.4.4 Alternating Direction Method of Multipliers (ADMM)

Alternating Direction Method of Multipliers (more specifically, consensus ADMM [185, 44]) is another generic method for distributedly solving optimization problems with a separable cost function $f(x) = \sum_{i=1}^{m} f_i(x)$, by defining additional local variables. Each machine $i$ holds local variables $x_i(t) \in \mathbb{R}^n$ and $y_i(t) \in \mathbb{R}^n$, and the master's value is $\bar{x}(t) \in \mathbb{R}^n$, for any time $t$. For $f_i(x) = \frac{1}{2}\|A_i x - b_i\|^2$, the update rule of ADMM simplifies to

๐‘ฅ๐‘–(๐‘ก+1) = (๐ด๐‘‡

๐‘– ๐ด๐‘–+๐œ‰ ๐ผ๐‘›)โˆ’1(๐ด๐‘‡

๐‘– ๐‘๐‘–โˆ’๐‘ฆ๐‘–(๐‘ก) +๐œ‰๐‘ฅยฏ(๐‘ก)), ๐‘– โˆˆ [๐‘š] (6.17a)

ยฏ

๐‘ฅ(๐‘ก+1) = 1 ๐‘š

๐‘š

ร•

๐‘–=1

๐‘ฅ๐‘–(๐‘ก+1) (6.17b)

๐‘ฆ๐‘–(๐‘ก+1) = ๐‘ฆ๐‘–(๐‘ก) +๐œ‰(๐‘ฅ๐‘–(๐‘ก+1) โˆ’๐‘ฅยฏ(๐‘ก+1)), ๐‘– โˆˆ [๐‘š] (6.17c) It turns out that this method is very slow (and often unstable) in its native form for the application in hand. One can check that when system (6.1) has a solution, all the ๐‘ฆ๐‘–variables converge to zero in steady state. Therefore, setting ๐‘ฆ๐‘–โ€™s to zero can speed up the convergence significantly. We use this modified version in Section 6.6 for comparison.

We should also note that the computational complexity of ADMM is $O(pn)$ per iteration (the inverse is computed using the matrix inversion lemma), which is again the same as that of gradient-type methods and APC.
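For completeness, here is a sketch of (6.17) with the $y_i \equiv 0$ modification available as an option. The dense inverse below is for readability only; an efficient implementation would apply the matrix inversion lemma, as noted above. The function name and flag are illustrative.

```python
import numpy as np

def consensus_admm(A_blocks, b_blocks, xi, iters, pin_y_to_zero=True):
    """Consensus ADMM for f_i(x) = 0.5*||A_i x - b_i||^2, Eq. (6.17).
    pin_y_to_zero=True applies the modification described in the text."""
    n = A_blocks[0].shape[1]
    m = len(A_blocks)
    xbar = np.zeros(n)
    ys = [np.zeros(n) for _ in range(m)]
    # Each machine factors (A_i^T A_i + xi*I) once; for O(pn) cost one would
    # use the matrix inversion lemma instead of this dense inverse.
    invs = [np.linalg.inv(Ai.T @ Ai + xi * np.eye(n)) for Ai in A_blocks]
    for _ in range(iters):
        xs = [M @ (Ai.T @ bi - y + xi * xbar)                       # (6.17a)
              for M, Ai, bi, y in zip(invs, A_blocks, b_blocks, ys)]
        xbar = sum(xs) / m                                          # (6.17b)
        if not pin_y_to_zero:
            ys = [y + xi * (x - xbar) for y, x in zip(ys, xs)]      # (6.17c)
    return xbar
```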

6.4.5 Block Cimmino Method

The Block Cimmino method [69, 191, 11], which is a parallel method specifically for solving linear systems of equations, is perhaps the closest algorithm in spirit to APC. It is, in a way, a distributed implementation of the so-called Kaczmarz method [111]. The convergence of the Cimmino method is slower by an order in comparison with APC (its convergence time is the square of that of APC), and it turns out that APC includes this method as a special case when $\gamma = 1$.

The block Cimmino method is the following:

๐‘Ÿ๐‘–(๐‘ก) = ๐ด+

๐‘– (๐‘๐‘–โˆ’ ๐ด๐‘–๐‘ฅยฏ(๐‘ก)), ๐‘– โˆˆ [๐‘š] (6.18a)

ยฏ

๐‘ฅ(๐‘ก+1) =๐‘ฅยฏ(๐‘ก) +๐œˆ

๐‘š

ร•

๐‘–=1

๐‘Ÿ๐‘–(๐‘ก), (6.18b)

where ๐ด+

๐‘– = ๐ด๐‘‡

๐‘– (๐ด๐‘–๐ด๐‘‡

๐‘–)โˆ’1is the pseudoinverse of ๐ด๐‘–.
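In the toy NumPy setup used earlier, the iteration (6.18) can be sketched as follows. For a full-row-rank block $A_i$, NumPy's `pinv` coincides with $A_i^T(A_iA_i^T)^{-1}$; the relaxation parameter $\nu$ is left as an input, since its optimal value depends on the spectrum of $X$.

```python
import numpy as np

def block_cimmino(A_blocks, b_blocks, nu, iters):
    """Block Cimmino iteration, Eq. (6.18)."""
    n = A_blocks[0].shape[1]
    pinvs = [np.linalg.pinv(Ai) for Ai in A_blocks]  # A_i^+ = A_i^T (A_i A_i^T)^{-1}
    xbar = np.zeros(n)
    for _ in range(iters):
        # (6.18a): each machine maps its residual through its pseudoinverse.
        rs = [Pi @ (bi - Ai @ xbar)
              for Pi, Ai, bi in zip(pinvs, A_blocks, b_blocks)]
        xbar = xbar + nu * sum(rs)                   # (6.18b)
    return xbar
```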

Proposition 32. The APC method (Algorithm 4) includes the block Cimmino method as a special case for $\gamma = 1$.

Proof. When $\gamma = 1$, Eq. (6.3a) becomes

๐‘ฅ๐‘–(๐‘ก+1)=๐‘ฅ๐‘–(๐‘ก) โˆ’๐‘ƒ๐‘–(๐‘ฅ๐‘–(๐‘ก) โˆ’๐‘ฅยฏ(๐‘ก))

=๐‘ฅ๐‘–(๐‘ก) โˆ’ ๐ผโˆ’ ๐ด๐‘‡

๐‘– (๐ด๐‘–๐ด๐‘‡

๐‘–)โˆ’1๐ด๐‘–

(๐‘ฅ๐‘–(๐‘ก) โˆ’๐‘ฅยฏ(๐‘ก))

=๐‘ฅยฏ(๐‘ก) + ๐ด๐‘‡

๐‘– (๐ด๐‘–๐ด๐‘‡

๐‘– )โˆ’1๐ด๐‘–(๐‘ฅ๐‘–โˆ’๐‘ฅยฏ(๐‘ก))

=๐‘ฅยฏ(๐‘ก) + ๐ด๐‘‡

๐‘– (๐ด๐‘–๐ด๐‘‡

๐‘– )โˆ’1(๐‘๐‘–โˆ’๐ด๐‘–๐‘ฅยฏ(๐‘ก))

In the last equation, we used the fact that $x_i$ is always a solution to $A_i x = b_i$. Notice that the above equation is no longer an "update" in the usual sense, i.e., $x_i(t+1)$ does not depend on $x_i(t)$ directly. It can be further simplified using the pseudoinverse of $A_i$, $A_i^{+} = A_i^T (A_i A_i^T)^{-1}$, as

$$x_i(t+1) = \bar{x}(t) + A_i^{+}\left(b_i - A_i \bar{x}(t)\right).$$

It is then easy to see from Cimmino's equation (6.18a) that

๐‘Ÿ๐‘–(๐‘ก) =๐‘ฅ๐‘–(๐‘ก+1) โˆ’๐‘ฅยฏ(๐‘ก).

Therefore, the update (6.18b) can be expressed as

$$\begin{aligned}
\bar{x}(t+1) &= \bar{x}(t) + \nu \sum_{i=1}^{m} r_i(t) \\
&= \bar{x}(t) + \nu \sum_{i=1}^{m}\left(x_i(t+1) - \bar{x}(t)\right) \\
&= (1 - m\nu)\,\bar{x}(t) + \nu \sum_{i=1}^{m} x_i(t+1),
\end{aligned}$$

which is nothing but the same update rule as in (6.3b) with $\eta = m\nu$.

It is not hard to show that the optimal rate of convergence of the Cimmino method is

๐œŒCim = ๐œ…(๐‘‹) โˆ’1

๐œ…(๐‘‹) +1 โ‰ˆ 1โˆ’ 2

๐œ…(๐‘‹), (6.19)

which is slower (by an order) than that of APC ($\frac{\sqrt{\kappa(X)}-1}{\sqrt{\kappa(X)}+1} \approx 1 - \frac{2}{\sqrt{\kappa(X)}}$).

6.4.6 Consensus Algorithm of Mou et al.

As mentioned earlier, a projection-based consensus algorithm for solving linear systems over a network was recently proposed by Mou et al. [146, 134]. For the master-worker setting studied here, the corresponding network would be a clique, and the algorithm reduces to

๐‘ฅ๐‘–(๐‘ก+1) =๐‘ฅ๐‘–(๐‘ก) +๐‘ƒ๐‘–ยฉ

ยญ

ยซ 1 ๐‘š

ยฉ

ยญ

ยซ

๐‘š

ร•

๐‘—=1

๐‘ฅ๐‘—(๐‘ก)ยช

ยฎ

ยฌ

โˆ’๐‘ฅ๐‘–(๐‘ก)ยช

ยฎ

ยฌ

, ๐‘– โˆˆ [๐‘š], (6.20) which is transparently equivalent to APC with๐›พ =๐œ‚ =1:

๐‘ฅ๐‘–(๐‘ก+1) =๐‘ฅ๐‘–(๐‘ก) +๐‘ƒ๐‘–(๐‘ฅยฏ(๐‘ก) โˆ’๐‘ฅ๐‘–(๐‘ก)), ๐‘– โˆˆ [๐‘š],

ยฏ

๐‘ฅ(๐‘ก+1) = 1 ๐‘š

๐‘š

ร•

๐‘–=1

๐‘ฅ๐‘–(๐‘ก+1),

It is straightforward to show that the rate of convergence in this case is

๐œŒMou =1โˆ’๐œ‡min(๐‘‹) (6.21)

which is much slower than the block Cimmino method and APC. One can easily check that

1โˆ’๐œ‡min(๐‘‹) โ‰ฅ ๐œ…(๐‘‹) โˆ’1 ๐œ…(๐‘‹) +1 โ‰ฅ

p

๐œ…(๐‘‹) โˆ’1 p

๐œ…(๐‘‹) +1 .

Even though this algorithm is slow, it is useful for applications where a fully-distributed (networked) solution is desired. Another networked algorithm for solving a least-squares problem has been recently proposed in [207].

A summary of the convergence rates of all the related methods discussed in this section is provided in Table 6.1.