6.4 Comparison with Related Methods
Table 6.1: A summary of the convergence rates of different methods. DGD: Distributed Gradient Descent, D-NAG: Distributed Nesterov's Accelerated Gradient Descent, D-HBM: Distributed Heavy-Ball Method, Mou et al.: consensus algorithm of [146], B-Cimmino: Block Cimmino Method, APC: Accelerated Projection-based Consensus. The smaller the convergence rate, the faster the method. Note that $\rho_{\mathrm{GD}} \ge \rho_{\mathrm{NAG}} \ge \rho_{\mathrm{HBM}}$ and $\rho_{\mathrm{Mou}} \ge \rho_{\mathrm{Cim}} \ge \rho_{\mathrm{APC}}$.

DGD: $\frac{\kappa(A^T A)-1}{\kappa(A^T A)+1} \approx 1-\frac{2}{\kappa(A^T A)}$
D-NAG: $1-\frac{2}{\sqrt{3\kappa(A^T A)+1}}$
D-HBM: $\frac{\sqrt{\kappa(A^T A)}-1}{\sqrt{\kappa(A^T A)}+1} \approx 1-\frac{2}{\sqrt{\kappa(A^T A)}}$
Mou et al.: $1-\lambda_{\min}(X)$
B-Cimmino: $\frac{\kappa(X)-1}{\kappa(X)+1} \approx 1-\frac{2}{\kappa(X)}$
APC (proposed): $\frac{\sqrt{\kappa(X)}-1}{\sqrt{\kappa(X)}+1} \approx 1-\frac{2}{\sqrt{\kappa(X)}}$
6.4.1 Distributed Gradient Descent (DGD)

Since solving (6.1) can be cast as the least-squares problem of minimizing $\frac{1}{2}\|Ax-b\|^2$, gradient descent lends itself well to the problem. Regular (full) gradient descent has the update rule
$$x(t+1) = x(t) - \alpha A^T\left(A x(t) - b\right),$$
where $\alpha > 0$ is the step size or learning rate. The distributed version of gradient descent is one in which each machine $i$ has only a subset of the equations, $[A_i, b_i]$, and computes its own part of the gradient, which is $A_i^T(A_i x(t) - b_i)$. The updates are then done collectively as
$$x(t+1) = x(t) - \alpha \sum_{i=1}^{m} A_i^T\left(A_i x(t) - b_i\right). \qquad (6.11)$$
One can show that this also has linear convergence, and the rate of convergence is
$$\rho_{\mathrm{GD}} = \frac{\kappa(A^T A)-1}{\kappa(A^T A)+1} \approx 1 - \frac{2}{\kappa(A^T A)}. \qquad (6.12)$$
We should mention that since each machine needs to compute $A_i^T(A_i x(t) - b_i)$ at each iteration $t$, the computational complexity per iteration is $2np$, which is identical to that of APC.
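To make the iteration concrete, here is a minimal NumPy sketch of the DGD update (6.11), simulated serially on one machine; the block partition, step size, and iteration count are illustrative assumptions rather than part of the method's specification.

```python
import numpy as np

def distributed_gradient_descent(A_blocks, b_blocks, alpha, num_iters):
    """Sketch of DGD (6.11): machine i computes A_i^T (A_i x - b_i);
    the master sums the partial gradients and takes a step of size alpha."""
    n = A_blocks[0].shape[1]
    x = np.zeros(n)
    for _ in range(num_iters):
        # each machine forms its partial gradient from its own block [A_i, b_i]
        partials = [Ai.T @ (Ai @ x - bi) for Ai, bi in zip(A_blocks, b_blocks)]
        # the master aggregates the partial gradients and updates the iterate
        x = x - alpha * sum(partials)
    return x
```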
6.4.2 Distributed Nesterov's Accelerated Gradient Descent (D-NAG)
A popular variant of gradient descent is Nesterov's accelerated gradient descent [151], which has a memory term and works as follows:
$$y(t+1) = x(t) - \alpha \sum_{i=1}^{m} A_i^T\left(A_i x(t) - b_i\right), \qquad (6.13a)$$
$$x(t+1) = (1+\beta)\, y(t+1) - \beta\, y(t). \qquad (6.13b)$$
One can show [125] that the optimal convergence rate of this method is
$$\rho_{\mathrm{NAG}} = 1 - \frac{2}{\sqrt{3\kappa(A^T A)+1}}, \qquad (6.14)$$
which is an improvement over regular distributed gradient descent (one can check that $\frac{\kappa(A^T A)-1}{\kappa(A^T A)+1} \ge 1 - \frac{2}{\sqrt{3\kappa(A^T A)+1}}$).
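Under the same illustrative assumptions as before, the D-NAG iteration (6.13) can be sketched as follows; in practice $\alpha$ and $\beta$ would be tuned from $\kappa(A^T A)$ to attain the rate (6.14).

```python
import numpy as np

def distributed_nesterov(A_blocks, b_blocks, alpha, beta, num_iters):
    """Sketch of D-NAG (6.13): a distributed gradient step produces y(t+1),
    followed by an extrapolation between the last two y-iterates."""
    n = A_blocks[0].shape[1]
    x = np.zeros(n)
    y = np.zeros(n)
    for _ in range(num_iters):
        grad = sum(Ai.T @ (Ai @ x - bi) for Ai, bi in zip(A_blocks, b_blocks))
        y_next = x - alpha * grad            # (6.13a)
        x = (1 + beta) * y_next - beta * y   # (6.13b)
        y = y_next
    return x
```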
6.4.3 Distributed Heavy-Ball Method (D-HBM)
The heavy-ball method [169], otherwise known as gradient descent with momentum, is another accelerated variant of gradient descent:
$$z(t+1) = \beta z(t) + \sum_{i=1}^{m} A_i^T\left(A_i x(t) - b_i\right), \qquad (6.15a)$$
$$x(t+1) = x(t) - \alpha z(t+1). \qquad (6.15b)$$
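A minimal sketch of the heavy-ball iteration (6.15), again with illustrative parameter choices (the optimal $\alpha$ and $\beta$ depend on the extreme eigenvalues of $A^T A$):

```python
import numpy as np

def distributed_heavy_ball(A_blocks, b_blocks, alpha, beta, num_iters):
    """Sketch of D-HBM (6.15): the aggregated gradient is folded into a
    momentum term z, and x moves along -alpha * z."""
    n = A_blocks[0].shape[1]
    x = np.zeros(n)
    z = np.zeros(n)
    for _ in range(num_iters):
        grad = sum(Ai.T @ (Ai @ x - bi) for Ai, bi in zip(A_blocks, b_blocks))
        z = beta * z + grad    # (6.15a)
        x = x - alpha * z      # (6.15b)
    return x
```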
It can be shown [125] that the optimal rate of convergence of this method is
$$\rho_{\mathrm{HBM}} = \frac{\sqrt{\kappa(A^T A)}-1}{\sqrt{\kappa(A^T A)}+1} \approx 1 - \frac{2}{\sqrt{\kappa(A^T A)}}, \qquad (6.16)$$
which is a further improvement over DGD and D-NAG ($\frac{\kappa(A^T A)-1}{\kappa(A^T A)+1} \ge 1 - \frac{2}{\sqrt{3\kappa(A^T A)+1}} \ge \frac{\sqrt{\kappa(A^T A)}-1}{\sqrt{\kappa(A^T A)}+1}$). This is similar to, but not the same as, the rate of convergence of APC.
The difference is that the condition number of $A^T A = \sum_{i=1}^{m} A_i^T A_i$ is replaced with the condition number of $X = \sum_{i=1}^{m} A_i^T (A_i A_i^T)^{-1} A_i$ in APC. Given its structure as a sum of projection matrices, one may speculate that $X$ has a much better condition number than $A^T A$. Indeed, our experiments with random, as well as real, data sets suggest that this is the case, and that the condition number of $X$ is often significantly better (see Table 6.2).
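One quick, purely illustrative way to probe this claim is to compare the two condition numbers on randomly generated data; the dimensions and the Gaussian ensemble below are arbitrary choices, and the size of the gap will vary with the data.

```python
import numpy as np

rng = np.random.default_rng(0)
N, n, m = 200, 50, 10                 # N equations, n unknowns, m machines
A = rng.standard_normal((N, n))
blocks = np.split(A, m)               # p = N/m rows per machine (p < n here)

# X = sum_i A_i^T (A_i A_i^T)^{-1} A_i, a sum of projection matrices
X = sum(Ai.T @ np.linalg.solve(Ai @ Ai.T, Ai) for Ai in blocks)

print("cond(A^T A):", np.linalg.cond(A.T @ A))
print("cond(X):    ", np.linalg.cond(X))
```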
6.4.4 Alternating Direction Method of Multipliers (ADMM)
The Alternating Direction Method of Multipliers (more specifically, consensus ADMM [185, 44]) is another generic method for solving optimization problems with a separable cost function $f(x) = \sum_{i=1}^{m} f_i(x)$ in a distributed fashion, by introducing additional local variables. Each machine $i$ holds local variables $x_i(t) \in \mathbb{R}^n$ and $y_i(t) \in \mathbb{R}^n$, and the master's value is $\bar{x}(t) \in \mathbb{R}^n$, at any time $t$. For $f_i(x) = \frac{1}{2}\|A_i x - b_i\|^2$, the update rule of ADMM simplifies to
$$x_i(t+1) = \left(A_i^T A_i + \zeta I_n\right)^{-1}\left(A_i^T b_i - y_i(t) + \zeta \bar{x}(t)\right), \quad i \in [m], \qquad (6.17a)$$
$$\bar{x}(t+1) = \frac{1}{m} \sum_{i=1}^{m} x_i(t+1), \qquad (6.17b)$$
$$y_i(t+1) = y_i(t) + \zeta\left(x_i(t+1) - \bar{x}(t+1)\right), \quad i \in [m], \qquad (6.17c)$$
where $\zeta > 0$ is the penalty parameter. It turns out that this method is very slow (and often unstable) in its native form for the application at hand. One can check that when system (6.1) has a solution, all the $y_i$ variables converge to zero in steady state. Therefore, setting the $y_i$'s to zero can speed up the convergence significantly. We use this modified version in Section 6.6 for comparison.
We should also note that the computational complexity of ADMM is $O(np)$ per iteration (the inverse is computed using the matrix inversion lemma), which is again the same as that of the gradient-type methods and APC.
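For concreteness, a naive sketch of the consensus-ADMM updates (6.17) is given below; it forms the $n \times n$ solve directly rather than using the matrix inversion lemma mentioned above, and the penalty parameter value and zero initialization are illustrative choices.

```python
import numpy as np

def consensus_admm(A_blocks, b_blocks, zeta, num_iters):
    """Sketch of consensus ADMM (6.17) for f_i(x) = 0.5 * ||A_i x - b_i||^2.
    Each machine keeps (x_i, y_i); the master keeps the average xbar."""
    n = A_blocks[0].shape[1]
    m = len(A_blocks)
    x = [np.zeros(n) for _ in range(m)]
    y = [np.zeros(n) for _ in range(m)]
    xbar = np.zeros(n)
    for _ in range(num_iters):
        for i, (Ai, bi) in enumerate(zip(A_blocks, b_blocks)):
            x[i] = np.linalg.solve(Ai.T @ Ai + zeta * np.eye(n),
                                   Ai.T @ bi - y[i] + zeta * xbar)  # (6.17a)
        xbar = sum(x) / m                                           # (6.17b)
        for i in range(m):
            y[i] = y[i] + zeta * (x[i] - xbar)                      # (6.17c)
    return xbar
```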
6.4.5 Block Cimmino Method
The Block Cimmino method [69, 191, 11], which is a parallel method specifically for solving linear systems of equations, is perhaps the closest algorithm in spirit to APC. It is, in a way, a distributed implementation of the so-called Kaczmarz method [111]. The convergence of the Cimmino method is slower by an order in comparison with APC (its convergence time is the square of that of APC), and it turns out that APC includes this method as a special case when $\gamma = 1$.
The block Cimmino method is the following:
$$d_i(t) = A_i^+\left(b_i - A_i \bar{x}(t)\right), \quad i \in [m], \qquad (6.18a)$$
$$\bar{x}(t+1) = \bar{x}(t) + \nu \sum_{i=1}^{m} d_i(t), \qquad (6.18b)$$
where $A_i^+ = A_i^T (A_i A_i^T)^{-1}$ is the pseudoinverse of $A_i$.
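In code, the iteration (6.18) might be sketched as follows; the relaxation parameter value and the assumption that each $A_i$ has full row rank (so that $A_i A_i^T$ is invertible) are illustrative.

```python
import numpy as np

def block_cimmino(A_blocks, b_blocks, nu, num_iters):
    """Sketch of the block Cimmino iteration (6.18): each machine applies its
    block pseudoinverse to its local residual, and the master adds the sum of
    these corrections, scaled by nu."""
    n = A_blocks[0].shape[1]
    xbar = np.zeros(n)
    # A_i^+ = A_i^T (A_i A_i^T)^{-1}, assuming each block has full row rank
    pinvs = [Ai.T @ np.linalg.inv(Ai @ Ai.T) for Ai in A_blocks]
    for _ in range(num_iters):
        d = [Aip @ (bi - Ai @ xbar)
             for Aip, Ai, bi in zip(pinvs, A_blocks, b_blocks)]  # (6.18a)
        xbar = xbar + nu * sum(d)                                # (6.18b)
    return xbar
```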
Proposition 32. The APC method (Algorithm 4) includes the block Cimmino method as a special case for $\gamma = 1$.
Proof. When $\gamma = 1$, Eq. (6.3a) becomes
$$\begin{aligned}
x_i(t+1) &= x_i(t) - P_i\left(x_i(t) - \bar{x}(t)\right) \\
&= x_i(t) - \left(I - A_i^T (A_i A_i^T)^{-1} A_i\right)\left(x_i(t) - \bar{x}(t)\right) \\
&= \bar{x}(t) + A_i^T (A_i A_i^T)^{-1} A_i\left(x_i(t) - \bar{x}(t)\right) \\
&= \bar{x}(t) + A_i^T (A_i A_i^T)^{-1}\left(b_i - A_i \bar{x}(t)\right).
\end{aligned}$$
In the last step, we used the fact that $x_i(t)$ is always a solution to $A_i x = b_i$. Notice that the above equation is no longer an "update" in the usual sense, i.e., $x_i(t+1)$ does not depend on $x_i(t)$ directly. Using the pseudoinverse of $A_i$, $A_i^+ = A_i^T (A_i A_i^T)^{-1}$, this can be further simplified as
$$x_i(t+1) = \bar{x}(t) + A_i^+\left(b_i - A_i \bar{x}(t)\right).$$
It is then easy to see from Cimmino's equation (6.18a) that
$$d_i(t) = x_i(t+1) - \bar{x}(t).$$
Therefore, the update (6.18b) can be expressed as
$$\begin{aligned}
\bar{x}(t+1) &= \bar{x}(t) + \nu \sum_{i=1}^{m} d_i(t) \\
&= \bar{x}(t) + \nu \sum_{i=1}^{m} \left(x_i(t+1) - \bar{x}(t)\right) \\
&= (1 - m\nu)\,\bar{x}(t) + \nu \sum_{i=1}^{m} x_i(t+1),
\end{aligned}$$
which is nothing but the same update rule as in (6.3b) with $\eta = m\nu$.

It is not hard to show that the optimal rate of convergence of the Cimmino method is
$$\rho_{\mathrm{Cim}} = \frac{\kappa(X)-1}{\kappa(X)+1} \approx 1 - \frac{2}{\kappa(X)}, \qquad (6.19)$$
which is slower (by an order) than that of APC ($\frac{\sqrt{\kappa(X)}-1}{\sqrt{\kappa(X)}+1} \approx 1 - \frac{2}{\sqrt{\kappa(X)}}$).
6.4.6 Consensus Algorithm of Mou et al.
As mentioned earlier, a projection-based consensus algorithm for solving linear systems over a network was recently proposed by Mou et al. [146, 134]. For the master-worker setting studied here, the corresponding network would be a clique, and the algorithm reduces to
$$x_i(t+1) = x_i(t) + P_i\left(\frac{1}{m}\sum_{j=1}^{m} x_j(t) - x_i(t)\right), \quad i \in [m], \qquad (6.20)$$
which is transparently equivalent to APC with $\gamma = \eta = 1$:
$$x_i(t+1) = x_i(t) + P_i\left(\bar{x}(t) - x_i(t)\right), \quad i \in [m],$$
$$\bar{x}(t+1) = \frac{1}{m}\sum_{i=1}^{m} x_i(t+1).$$
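A sketch of this special case in code, where $P_i = I - A_i^T (A_i A_i^T)^{-1} A_i$ is the projection onto the nullspace of $A_i$; the minimum-norm initialization $x_i(0) = A_i^+ b_i$ is an illustrative choice (any $x_i(0)$ satisfying $A_i x_i(0) = b_i$ works).

```python
import numpy as np

def mou_projection_consensus(A_blocks, b_blocks, num_iters):
    """Sketch of the update (6.20) in the master-worker (clique) setting:
    each machine moves toward the average, projected onto the nullspace of
    its own A_i, so A_i x_i = b_i is preserved at every iteration."""
    m = len(A_blocks)
    n = A_blocks[0].shape[1]
    # P_i = I - A_i^T (A_i A_i^T)^{-1} A_i  (projection onto nullspace of A_i)
    P = [np.eye(n) - Ai.T @ np.linalg.solve(Ai @ Ai.T, Ai) for Ai in A_blocks]
    # start each x_i at a solution of A_i x = b_i (here, the minimum-norm one)
    x = [Ai.T @ np.linalg.solve(Ai @ Ai.T, bi)
         for Ai, bi in zip(A_blocks, b_blocks)]
    for _ in range(num_iters):
        xbar = sum(x) / m
        x = [xi + Pi @ (xbar - xi) for xi, Pi in zip(x, P)]
    return sum(x) / m
```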
It is straightforward to show that the rate of convergence in this case is
$$\rho_{\mathrm{Mou}} = 1 - \lambda_{\min}(X), \qquad (6.21)$$
which is much slower than those of the block Cimmino method and APC. One can easily check that
$$1 - \lambda_{\min}(X) \ \ge\ \frac{\kappa(X)-1}{\kappa(X)+1} \ \ge\ \frac{\sqrt{\kappa(X)}-1}{\sqrt{\kappa(X)}+1}.$$
Even though this algorithm is slow, it is useful for applications where a fully distributed (networked) solution is desired. Another networked algorithm for solving a least-squares problem was recently proposed in [207].
A summary of the convergence rates of all the related methods discussed in this section is provided in Table 6.1.