7.2.2 Computational Trade-offs
In a distributed scheme that does not employ redundancy, the taskmaster has to wait for all the workers to finish in order to compute the full gradient. However, in the scheme outlined above, the taskmaster needs to wait for the fastest $f$ machines to recover the full gradient. Clearly, this requires more computation by each machine.
Note that in the uncoded setting, the amount of computation that each worker does is $1/n$ of the total work, whereas in the coded setting each machine performs a $w/k$ fraction of the total work. From (7.2), we know that if a scheme can tolerate $s$ stragglers, the fraction of computation that each worker does is $w/k \geq (s+1)/n$. Therefore, the computation load of each worker increases by a factor of at least $(s+1)$. As will be explained further in Section 7.5, there is a sweet spot for $w/k$ (and consequently $s$) that minimizes the expected total time that the master waits in order to recover the full gradient update.
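For concreteness, consider the parameters of the example in Section 7.3.1 (used here purely for illustration): $n = 8$ workers, $k = 4$ partitions, and row weight $w = 3$. Each worker then computes a $w/k = 3/4$ fraction of the partial gradients instead of $1/8$, and the bound $w/k \geq (s+1)/n$ gives $s + 1 \leq nw/k = 6$, so such a scheme can tolerate at most $s = 5$ stragglers.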
It is worth noting that it is often assumed [197, 123, 70] that the decoding vectors are precomputed for all possible combinations of returning machines, and that the decoding cost is not taken into account in the total computation time. In a practical system, however, it is not reasonable to compute and store all the decoding vectors, especially as there are $\binom{n}{f}$ such vectors, a number that grows quickly with $n$. In this work, we introduce an online algorithm that computes the decoding vector on the fly for the indices of the $f$ workers that respond first. The approach is based on the idea of inverting Vandermonde matrices, which can be done very efficiently. In the sequel, we show how to construct an encoding matrix $\mathbf{B}$ for any $w$, $k$ and $n$, such that the system is resilient to $\lfloor nw/k \rfloor - 1$ stragglers, along with an efficient algorithm for computing the decoding vectors $\{\mathbf{a}_{\mathcal{F}} : \mathcal{F} \subset [n], |\mathcal{F}| = f\}$.
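To get a sense of the numbers (an illustrative calculation, not taken from the text): already for $n = 20$ workers and $f = 15$ responders there are $\binom{20}{15} = 15504$ decoding vectors to precompute and store, and the count grows combinatorially with $n$; the online approach instead computes only the single vector $\mathbf{a}_{\mathcal{F}}$ for the set $\mathcal{F}$ that actually responds.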
7.3 Code Construction
7.3.1 Balanced Mask Matrices
We will utilize techniques from [90, 91] to construct the matrix $\mathbf{M}$ (and then $\mathbf{B}$). For that, we present the following definition.
Definition 7 (Balanced Matrix). A matrix $\mathbf{M} \in \{0,1\}^{n \times k}$ is column- (row-) balanced if, for a fixed row (column) weight, the weights of any two columns (rows) differ by at most 1.
Ultimately, we are interested in a matrix $\mathbf{M}$ with row weight $w$ that prescribes a mask for the encoding matrix $\mathbf{B}$. As an example, let $n = 8$, $k = 4$ and $w = 3$. Then, $\mathbf{M}$ is given by
\[
\mathbf{M} =
\begin{bmatrix}
1 & 1 & 1 & 0 \\
1 & 1 & 1 & 0 \\
1 & 1 & 0 & 1 \\
1 & 1 & 0 & 1 \\
1 & 0 & 1 & 1 \\
1 & 0 & 1 & 1 \\
0 & 1 & 1 & 1 \\
0 & 1 & 1 & 1
\end{bmatrix},
\tag{7.5}
\]
where each column is of weight $nw/k = 6$. The following algorithm produces a balanced mask matrix. For a fixed column weight $d$, each row has weight either $\lfloor kd/n \rfloor$ or $\lceil kd/n \rceil$.

Algorithm 6 RowBalancedMaskMatrix($n$, $k$, $d$, $t$)
Input:
$n$: Number of rows
$k$: Number of columns
$d$: Weight of each column
$t$: Offset parameter
Output: Row-balanced $\mathbf{M} \in \{0,1\}^{n \times k}$
$\mathbf{M} \leftarrow \mathbf{0}_{n \times k}$
for $j = 0$ to $k - 1$ do
  for $i = 0$ to $d - 1$ do
    $r = (i + jd + t)_n$   ▷ The quantity $(x)_n$ denotes $x$ modulo $n$.
    $M_{r,j} = 1$
  end for
end for
return $\mathbf{M}$
As a result, when $d$ is chosen as $nw/k \in \mathbb{Z}$, all rows will be of weight $w$. As an example, the matrix $\mathbf{M}$ in (7.5) is generated by calling RowBalancedMaskMatrix(8, 4, 6, 0).
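The following Python sketch is a direct transcription of Algorithm 6 (the function name and the use of NumPy are incidental choices, not part of the original construction).

import numpy as np

def row_balanced_mask_matrix(n, k, d, t=0):
    """Algorithm 6: place d ones in each of the k columns of an n-by-k binary
    matrix, assigning row indices cyclically so that row weights differ by at
    most one."""
    M = np.zeros((n, k), dtype=int)
    for j in range(k):            # column index
        for i in range(d):        # i-th nonzero entry of column j
            r = (i + j * d + t) % n
            M[r, j] = 1
    return M

# Reproduces the matrix in (7.5).
M = row_balanced_mask_matrix(8, 4, 6, 0)
print(M)                 # the 8-by-4 mask of (7.5)
print(M.sum(axis=1))     # row weights: all equal to w = 3
print(M.sum(axis=0))     # column weights: all equal to d = 6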
Algorithm 6 can be used to generate a mask matrix $\mathbf{M}$ for the encoding matrix $\mathbf{B}$: the $j$th column of $\mathbf{B}$ will be chosen as a Reed–Solomon codeword whose support is that of the $j$th column of $\mathbf{M}$.
7.3.2 Correctness of Algorithm 6
To lighten notation, we prove correctness for $t = 0$. The general case follows immediately.
Proposition 34. Let $n$, $k$ and $d$ be integers where $d < n$. The row weights of the matrix $\mathbf{M} \in \{0,1\}^{n \times k}$ produced by Algorithm 6 for $t = 0$ are
\[
w_i =
\begin{cases}
\left\lceil \frac{kd}{n} \right\rceil, & i \in \{0, \ldots, (kd - 1)_n\}, \\
\left\lfloor \frac{kd}{n} \right\rfloor, & i \in \{(kd)_n, \ldots, n - 1\}.
\end{cases}
\]
Proof. The nonzero entries in column $j$ of $\mathbf{M}$ are given by
\[
\mathcal{S}_j = \{jd, \ldots, (j+1)d - 1\}_n,
\]
where the subscript $n$ denotes reducing the elements of the set modulo $n$. Collectively, the nonzero indices in all columns are given by
\[
\mathcal{S} = \{0, \ldots, d - 1, \ldots, (k-1)d, \ldots, kd - 1\}_n.
\]
In case $n \mid kd$, each element in $\mathcal{S}$, after reducing modulo $n$, appears the same number of times. As a result, those indices correspond to rows of equal weight, namely $kd/n$. Hence, the two cases of $w_i$ are identical, along with their corresponding index sets.
In the case where $n \nmid kd$, each of the first $\lfloor \frac{kd}{n} \rfloor n$ elements, after reducing modulo $n$, appears the same number of times. As a result, the nonzero entries corresponding to those indices are distributed evenly amongst the $n$ rows, each of which is of weight $\lfloor \frac{kd}{n} \rfloor$. The remaining indices $\{\lfloor \frac{kd}{n} \rfloor n, \ldots, kd - 1\}_n$ contribute an additional nonzero entry to their respective rows, namely those indexed by $\{0, \ldots, (kd - 1)_n\}$. Finally, we have that the first $(kd)_n$ rows are of weight $\lfloor \frac{kd}{n} \rfloor + 1 = \lceil \frac{kd}{n} \rceil$, while the remaining ones are of weight $\lfloor \frac{kd}{n} \rfloor$.
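As a quick numerical sanity check of Proposition 34 (reusing the row_balanced_mask_matrix sketch above; the specific parameters are chosen for illustration), take $n = 8$, $k = 3$, $d = 6$, so that $kd = 18$, $(kd)_n = 2$, $\lceil kd/n \rceil = 3$ and $\lfloor kd/n \rfloor = 2$:

M = row_balanced_mask_matrix(8, 3, 6, 0)
print(M.sum(axis=1))   # row weights: [3 3 2 2 2 2 2 2], i.e., the first (kd)_n = 2 rows are heavier by one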
Now consider the case when $t$ is not necessarily equal to zero. This amounts to shifting (cyclically) the entries in each column by $t$ positions downwards. As a result, the rows themselves are shifted by the same amount, allowing us to conclude the following.
Corollary 35. Let $n$, $k$, and $d$ be integers where $d < n$. The row weights of the matrix $\mathbf{M} \in \{0,1\}^{n \times k}$ produced by Algorithm 6 are
\[
w_i =
\begin{cases}
\left\lceil \frac{kd}{n} \right\rceil, & i \in \{t, \ldots, (t + kd - 1)_n\}, \\
\left\lfloor \frac{kd}{n} \right\rfloor, & i \in \{0, \ldots, t - 1\} \cup \{(t + kd)_n, \ldots, n - 1\}.
\end{cases}
\]
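As a brief illustration of the corollary (continuing the illustrative parameters $n = 8$, $k = 3$, $d = 6$ from above, now with offset $t = 2$): since $(kd)_n = 2$, the heavy rows are $\{t, \ldots, (t + kd - 1)_n\} = \{2, 3\}$ with weight $\lceil 18/8 \rceil = 3$, while rows $\{0, 1\} \cup \{4, \ldots, 7\}$ have weight $\lfloor 18/8 \rfloor = 2$.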
7.3.3 Reed–Solomon Codes
This subsection provides a quick overview of Reed–Solomon codes. A Reed–Solomon code of length $n$ and dimension $k$ is a linear subspace $\mathrm{RS}[n, k]$ of $\mathbb{C}^n$ corresponding to the evaluation of polynomials of degree less than $k$, with coefficients in $\mathbb{C}$, on a set of $n$ distinct points $\{\alpha_1, \ldots, \alpha_n\}$, also chosen from $\mathbb{C}$. When $\alpha_i = \alpha^i$, where $\alpha \in \mathbb{C}$ is an $n$th root of unity, the evaluations of the polynomial $t(x) = \sum_{i=0}^{k-1} t_i x^i$ on $\{1, \alpha, \ldots, \alpha^{n-1}\}$ correspond to
\[
\begin{bmatrix}
t(1) \\
t(\alpha) \\
\vdots \\
t(\alpha^{n-1})
\end{bmatrix}
=
\begin{bmatrix}
1 & 1 & \cdots & 1 \\
1 & \alpha & \cdots & \alpha^{k-1} \\
\vdots & \vdots & \ddots & \vdots \\
1 & \alpha^{n-1} & \cdots & \alpha^{(n-1)(k-1)}
\end{bmatrix}
\begin{bmatrix}
t_0 \\
t_1 \\
\vdots \\
t_{k-1}
\end{bmatrix}
= \mathbf{G}\mathbf{t}.
\tag{7.6}
\]
It is well known that any $k$ rows of $\mathbf{G}$ form an invertible matrix, which implies that specifying any $k$ evaluations $\{t(\alpha_{i_1}), \ldots, t(\alpha_{i_k})\}$ of a polynomial $t(x)$ of degree at most $k - 1$ characterizes it. In particular, fixing $k - 1$ evaluations of the polynomial to zero characterizes $t(x)$ uniquely up to scaling. This property will give us the ability to construct $\mathbf{B}$ from $\mathbf{M}$.
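The following Python sketch illustrates this property numerically: a codeword that must vanish on a prescribed set of positions is obtained by evaluating the polynomial whose roots are exactly the evaluation points of those positions. The function name and the use of complex floating-point arithmetic are illustrative assumptions, not part of the construction in the text.

import numpy as np

def codeword_with_prescribed_zeros(n, zero_positions):
    # Build a length-n Reed-Solomon codeword (evaluations at the n-th roots of
    # unity) that vanishes exactly on zero_positions. The polynomial
    # t(x) = prod_{m in zero_positions} (x - alpha^m) has degree
    # |zero_positions|, and by the uniqueness-up-to-scaling argument above it
    # is essentially the only such codeword.
    alpha = np.exp(2j * np.pi / n)                 # primitive n-th root of unity
    roots = alpha ** np.array(zero_positions)
    points = alpha ** np.arange(n)                 # evaluation points 1, alpha, ..., alpha^{n-1}
    return np.prod(points[:, None] - roots[None, :], axis=1)

# Example: a candidate first column of B for the mask in (7.5); that column of M
# is supported on rows {0,...,5}, so the codeword must vanish on rows {6, 7}.
b0 = codeword_with_prescribed_zeros(8, [6, 7])
print(np.round(np.abs(b0), 3))                     # zero exactly in positions 6 and 7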
7.3.4 General Construction
In case $d = nw/k \notin \mathbb{Z}$, the chosen row weight $w$ prevents the existence of an $\mathbf{M}$ in which all columns have the same weight. We resort to Algorithm 7, which yields an $\mathbf{M}$ comprised of two matrices $\mathbf{M}_\ell$ and $\mathbf{M}_r$ according to
\[
\mathbf{M} =
\begin{bmatrix}
\mathbf{M}_\ell & \mathbf{M}_r
\end{bmatrix}.
\]
The matrices $\mathbf{M}_\ell$ and $\mathbf{M}_r$ are constructed using Algorithm 6. Each column of $\mathbf{M}_\ell$ has weight $d_\ell := \lceil \frac{nw}{k} \rceil$, and each column of $\mathbf{M}_r$ has weight $d_r := \lfloor \frac{nw}{k} \rfloor$. Note that according to (7.2), we require $d_r \geq 2$ in order to tolerate a positive number of stragglers.
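For instance (an illustrative choice of parameters, not from the text), with $n = 8$, $k = 5$ and $w = 3$ we have $nw/k = 24/5 \notin \mathbb{Z}$, so the columns of $\mathbf{M}_\ell$ have weight $d_\ell = \lceil 24/5 \rceil = 5$ and the columns of $\mathbf{M}_r$ have weight $d_r = \lfloor 24/5 \rfloor = 4 \geq 2$, consistent with the resilience to $\lfloor nw/k \rfloor - 1 = 3$ stragglers claimed earlier.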
Algorithm 7 Column-balanced Mask Matrix $\mathbf{M}$