
CODED COMPUTATION FOR DISTRIBUTED GRADIENT DESCENT


machines, 𝑓, given a prespecified computational effort per machine.

2. We provide an efficient online decoder, with time complexity 𝑂(𝑓²), for recovering the gradient from any 𝑓 machines, which is faster than the 𝑂(𝑓³) of the best known method [197].

3. We analyze the total computation time, and provide a method for finding the optimal coding parameters. We consider heavy-tailed delays, which have been widely observed in CPU job runtimes in practice [124, 93, 94].

The rest of the chapter is organized as follows. In Section 7.2, we describe the problem setup and explain the design objectives in detail. Section 7.3 provides the construction of our coding scheme, using the idea of balanced Reed-Solomon codes.

Our efficient online decoder is presented in Section 7.4. We then characterize the total computation time, and describe the optimal choice of coding parameters, in Section 7.5. Finally, we provide our numerical results in Section 7.6, and conclude in Section 7.7.

7.2 Preliminaries

7.2.1 Problem Setup

master intends to train a model using gradient descent by distributing the gradient updates amongst the workers. More precisely, consider a typical scenario, where we want to learn parameters $\beta \in \mathbb{R}^p$ by minimizing a generic loss function $L(\mathcal{D};\beta)$ over a given dataset $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^{N}$, where $x_i \in \mathbb{R}^p$ and $y_i \in \mathbb{R}$. The loss function can be expressed as the sum of the losses for individual data points, i.e., $L(\mathcal{D};\beta) = \sum_{i=1}^{N} \ell(x_i, y_i;\beta)$. Therefore, the full gradient with respect to $\beta$ is given by

$$\nabla L(\mathcal{D};\beta) = \sum_{i=1}^{N} \nabla \ell(x_i, y_i;\beta). \tag{7.1}$$

The data can be divided into $k$ (disjoint) chunks $\{\mathcal{D}_1, \dots, \mathcal{D}_k\}$ of size $\frac{N}{k}$, and clearly the gradient can also be written as $\nabla L(\mathcal{D};\beta) = \sum_{i=1}^{k} \sum_{(x,y)\in\mathcal{D}_i} \nabla \ell(x, y;\beta)$.

Define $g_i := \sum_{(x,y)\in\mathcal{D}_i} \nabla \ell(x, y;\beta)$ as the partial gradient of chunk $i$ for every $i$, and $g^T := [g_1^T, g_2^T, \dots, g_k^T]$, where $g_i$ is a row vector of length $p$. Therefore $\nabla L(\mathcal{D};\beta) = \mathbf{1}_{1\times k}\, g$.

Now suppose each worker $W_i$ is assigned $w$ data partitions $\{\mathcal{D}_{i_1}, \dots, \mathcal{D}_{i_w}\}$, on which it computes the partial gradients $\{g_{i_1}, g_{i_2}, \dots, g_{i_w}\}$. Note that the "redundancy" in computation is introduced here, since each chunk is allowed to be assigned to multiple workers. Each worker then has to compute its partial gradients, and return a prespecified linear combination of them to the master.
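The chunk decomposition above can be checked numerically. The following is a minimal sketch, assuming a squared loss $\ell(x, y;\beta) = \frac{1}{2}(x^T\beta - y)^2$ and small illustrative sizes; the loss, the sizes, and the helper name `squared_loss_grad` are assumptions made for this example and are not part of the chapter.

```python
import numpy as np

def squared_loss_grad(X, y, beta):
    """Gradient of sum_i 0.5 * (x_i . beta - y_i)^2 with respect to beta."""
    return X.T @ (X @ beta - y)

rng = np.random.default_rng(0)
N, p, k = 12, 4, 3                      # illustrative sizes; k divides N
X = rng.standard_normal((N, p))
y = rng.standard_normal(N)
beta = rng.standard_normal(p)

# Split the data into k disjoint chunks of size N / k.
X_chunks = np.split(X, k)
y_chunks = np.split(y, k)

# g stacks the partial gradients: row i is the partial gradient of chunk i.
g = np.stack([squared_loss_grad(Xc, yc, beta)
              for Xc, yc in zip(X_chunks, y_chunks)])

full_gradient = squared_loss_grad(X, y, beta)
recovered = np.ones((1, k)) @ g         # the product 1_{1 x k} g

assert np.allclose(recovered.ravel(), full_gradient)
```

Here `g` has shape $k \times p$, matching the convention that each $g_i$ is a row vector of length $p$.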

As will be explained in detail, $k$ and $w$ are to be chosen in such a way that the total computation time is minimized. For fixed $k$ and $w$, we want to be able to recover the gradient using the linear combinations received at the master from the fastest $f$ machines (or, equivalently, to tolerate $s := n - f$ stragglers). Note that we do not assume any prior knowledge about the stragglers, i.e., we shall design a scheme that enables the master to recover the gradient from any set of $f$ machines. It is known [197] that for any fixed $k$ and $w$, an upper bound on the number of stragglers that any scheme can tolerate is

$$s \le \left\lfloor \frac{wn}{k} \right\rfloor - 1. \tag{7.2}$$
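As a quick illustration of the bound (7.2), the sketch below evaluates it for a few parameter choices. The function names and the sample values are hypothetical and chosen only for this example.

```python
def max_stragglers(w: int, n: int, k: int) -> int:
    """Upper bound (7.2) on tolerable stragglers: floor(w * n / k) - 1."""
    if not (1 <= w <= k) or n < 1:
        raise ValueError("expected 1 <= w <= k and n >= 1")
    return (w * n) // k - 1

def min_returning_machines(w: int, n: int, k: int) -> int:
    """Smallest f = n - s permitted by the bound."""
    return n - max_stragglers(w, n, k)

# Uncoded baseline: one chunk per worker, no straggler can be tolerated.
assert max_stragglers(w=1, n=10, k=10) == 0
# Tripling the per-worker load allows up to two stragglers.
assert max_stragglers(w=3, n=10, k=10) == 2
assert min_returning_machines(w=3, n=10, k=10) == 8
```

A negative return value (for instance $w = 1$, $n = 3$, $k = 5$) indicates that $wn < k$, so the workers cannot cover all chunks even when every machine returns.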

The scheme proposed in this work achieves this bound. A coding scheme designed to tolerate $s$ stragglers consists of an encoding matrix $\mathbf{B}$, and a collection of decoding vectors $\{\mathbf{a}_{\mathcal{F}} : \mathcal{F} \subset [n], |\mathcal{F}| = n - s\}$. The matrix $\mathbf{B}$ should satisfy:

1. Each row of $\mathbf{B}$ contains exactly $w$ nonzero entries.

2. The linear space generated by any $f$ rows of $\mathbf{B}$ contains the all-one vector of length $k$, $\mathbf{1}_{1\times k}$.

The values of these nonzero entries prescribe the linear combination sent by $W_i$. In other words, the coded partial gradient sent from $W_i$ to $M$ is given by

$$c_i = \sum_{j=1}^{k} \mathbf{B}_{i,j}\, g_j = \mathbf{B}_i\, g, \tag{7.3}$$

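where $\mathbf{B}_i$ denotes the $i$-th row of $\mathbf{B}$. The $c_i$'s define the encoded computation matrix $\mathbf{C} \in \mathbb{C}^{n\times p}$ as $\mathbf{C} = \begin{bmatrix} c_1^T & c_2^T & \cdots & c_n^T \end{bmatrix}^T = \mathbf{B} g$, where $c_i \in \mathbb{C}^{1\times p}$. The decoding vectors are chosen as follows: let $\mathcal{F} = \{i_1, \dots, i_f\}$ be the indices of the returning machines, and let $\mathbf{B}_{\mathcal{F}}$ and $\mathbf{C}_{\mathcal{F}}$ be the sub-matrices of $\mathbf{B}$ and $\mathbf{C}$ with rows indexed by $\mathcal{F}$. Since $\mathbf{1}_{1\times k}$ is in the linear space generated by the rows of $\mathbf{B}_{\mathcal{F}}$, as the second property requires, $\mathbf{a}_{\mathcal{F}}$ is chosen such that $\mathbf{a}_{\mathcal{F}} \mathbf{B}_{\mathcal{F}} = \mathbf{1}_{1\times k}$. As a result, we have

$$\mathbf{a}_{\mathcal{F}} \mathbf{C}_{\mathcal{F}} = \mathbf{a}_{\mathcal{F}} \mathbf{B}_{\mathcal{F}}\, g = \mathbf{1}_{1\times k}\, g = \nabla L(\mathcal{D};\beta). \tag{7.4}$$

When this holds for any set of indices $\mathcal{F} \subset [n]$ of size $f$, the gradient can be recovered from the $f$ machines that return fastest.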
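The encoding and decoding relations (7.3) and (7.4) can be illustrated on a toy instance. The sketch below assumes a small hand-checked matrix with $n = 3$, $k = 3$, $w = 2$ and $f = 2$; it is not the balanced Reed-Solomon construction of Section 7.3. The decoding vector is obtained with a generic least-squares solve, which stands in for, and is not, the efficient online decoder of Section 7.4. The helper name `decoding_vector` is hypothetical.

```python
import itertools
import numpy as np

# Toy encoding matrix (an assumption for illustration):
# n = 3 workers, k = 3 chunks, exactly w = 2 nonzero entries per row.
B = np.array([[0.5, 1.0,  0.0],
              [0.0, 1.0, -1.0],
              [0.5, 0.0,  1.0]])
n, k = B.shape
f = 2  # wait for f machines, i.e. tolerate s = n - f = 1 straggler

def decoding_vector(B, F):
    """Solve a_F B_F = 1_{1 x k} with a generic least-squares solver."""
    B_F = B[list(F), :]
    a_F, *_ = np.linalg.lstsq(B_F.T, np.ones(B.shape[1]), rcond=None)
    if not np.allclose(a_F @ B_F, 1.0):
        raise ValueError(f"all-one vector not in the row space for F={F}")
    return a_F

rng = np.random.default_rng(1)
p = 5
g = rng.standard_normal((k, p))   # stand-in partial gradients, one row per chunk
C = B @ g                         # coded partial gradients, one row per worker

# Any f returning machines suffice to recover the sum of the partial gradients.
for F in itertools.combinations(range(n), f):
    a_F = decoding_vector(B, F)
    assert np.allclose(a_F @ C[list(F), :], g.sum(axis=0))
```

With this matrix a single machine is not enough: calling `decoding_vector(B, (0,))` raises `ValueError`, consistent with the bound $s \le \lfloor wn/k \rfloor - 1 = 1$.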

The pseudocode listing of Algorithm 5 outlines the overall procedure for implementing our scheme, as just described.

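Algorithm 5 Pseudocode of the proposed scheme

Require:
    $\{\mathcal{D}_1, \dots, \mathcal{D}_k\}$: Dataset partition
    $\mathbf{B}_1, \dots, \mathbf{B}_n$: Encoding vectors
    $T$: Number of iterations
    $\{\eta_t\}_{t=1}^{T}$: Learning rates
Ensure:
    $\beta_T$: Parameters

Partition $\mathcal{D}$ into $\{\mathcal{D}_1, \dots, \mathcal{D}_k\}$
Assign to $W_i$ the partitions $\{\mathcal{D}_{i_1}, \dots, \mathcal{D}_{i_w}\}$
Assign to $W_i$ the encoding vector $\mathbf{B}_i$
$\beta_0 \leftarrow 0_{p\times 1}$
while $t < T$ do
    $M$ sends out $\beta_t$ to $W_1, \dots, W_n$
    $W_i$ computes partial gradients $g_{i_1}, \dots, g_{i_w}$
    $W_i$ encodes $g_{i_1}, \dots, g_{i_w}$ to $c_i$ using $\mathbf{B}_i$
    $M$ computes $\mathbf{a}_{\mathcal{F}}$ corresponding to the first $f$ returning machines
    $M$ recovers $\nabla L(\mathcal{D};\beta_t)$ using $\{c_i\}_{i\in\mathcal{F}}$ and $\mathbf{a}_{\mathcal{F}}$
    $M$ updates model: $\beta_{t+1} = \beta_t - \eta_t \nabla L(\mathcal{D};\beta_t)$
end while
return $\beta_T$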
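The loop of Algorithm 5 can be simulated on a single machine. The following is a minimal sketch under several assumptions that are not from the chapter: a squared loss, a constant learning rate, a toy $3 \times 3$ encoding matrix, stragglers simulated by drawing a random subset of $f$ workers in each iteration, and a least-squares solve in place of the efficient decoder. The function name `train_coded_gd` is hypothetical.

```python
import numpy as np

def train_coded_gd(X, y, B, f, num_iters, lr, seed=0):
    """Simulate coded gradient descent with a squared loss (illustrative only)."""
    rng = np.random.default_rng(seed)
    n, k = B.shape
    p = X.shape[1]
    X_chunks, y_chunks = np.array_split(X, k), np.array_split(y, k)
    beta = np.zeros(p)
    for _ in range(num_iters):
        # Each worker computes partial gradients on its assigned chunks
        # (the support of its row of B) and encodes them as in (7.3).
        coded = np.zeros((n, p))
        for i in range(n):
            for j in np.flatnonzero(B[i]):
                g_j = X_chunks[j].T @ (X_chunks[j] @ beta - y_chunks[j])
                coded[i] += B[i, j] * g_j
        # Simulated stragglers: a random set of f workers returns first.
        F = rng.choice(n, size=f, replace=False)
        # Generic least-squares decoder for a_F B_F = 1_{1 x k}.
        a_F, *_ = np.linalg.lstsq(B[F].T, np.ones(k), rcond=None)
        grad = a_F @ coded[F]          # recovers the full gradient, as in (7.4)
        beta = beta - lr * grad
    return beta

# Toy encoding matrix (assumption): any 2 of its 3 rows span the all-one vector.
B = np.array([[0.5, 1.0, 0.0], [0.0, 1.0, -1.0], [0.5, 0.0, 1.0]])
rng = np.random.default_rng(0)
X = rng.standard_normal((30, 4))
beta_true = np.array([1.0, -2.0, 0.5, 3.0])
y = X @ beta_true

beta_coded = train_coded_gd(X, y, B, f=2, num_iters=300, lr=0.02)

# Reference: plain (uncoded) gradient descent with the same step size.
beta_ref = np.zeros(4)
for _ in range(300):
    beta_ref = beta_ref - 0.02 * (X.T @ (X @ beta_ref - y))

assert np.allclose(beta_coded, beta_ref, atol=1e-6)
```

Because the decoded gradient equals the full gradient in every iteration, the coded iterates track those of plain gradient descent regardless of which $f$ workers respond.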

7.2.2 Computational Trade-offs

In a distributed scheme that does not employ redundancy, the taskmaster has to wait for all the workers to finish in order to compute the full gradient. However, in the scheme outlined above, the taskmaster needs to wait only for the fastest 𝑓 machines to recover the full gradient. Clearly, this requires more computation by each machine.

Note that in the uncoded setting, the amount of computation that each worker does is $\frac{1}{n}$ of the total work, whereas in the coded setting each machine performs a $\frac{w}{k}$ fraction of the total work. From (7.2), we know that if a scheme can tolerate $s$ stragglers, the fraction of computation that each worker does is $\frac{w}{k} \ge \frac{s+1}{n}$. Therefore, the computation load of each worker increases by a factor of at least $(s+1)$. As will be explained further in Section 7.5, there is a sweet spot for $\frac{w}{k}$ (and consequently $s$) that minimizes the expected total time that the master waits in order to recover the full gradient update.
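As a worked instance of this trade-off, take the illustrative values $n = 10$, $k = 10$ and $w = 3$ (chosen for this example only). The bound (7.2) gives

$$s \le \left\lfloor \frac{3 \cdot 10}{10} \right\rfloor - 1 = 2, \qquad f = n - s = 8 \ \text{ when } s = 2,$$

and the per-worker load compares to the uncoded load as

$$\frac{w}{k} = \frac{3}{10} \ge \frac{s+1}{n} = \frac{3}{10}, \qquad \frac{w/k}{1/n} = 3 = s + 1.$$

In this case each worker does three times the uncoded work, and in exchange the master can ignore the two slowest machines.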

It is worth noting that it is often assumed [197, 123, 70] that the decoding vectors are precomputed for all possible combinations of returning machines, and the decoding cost is not taken into account in the total computation time. In a practical system, however, it is not very reasonable to compute and store all the decoding vectors, especially as there are $\binom{n}{f}$ such vectors, a number that grows quickly with $n$. In this work, we introduce an online algorithm for computing the decoding vectors on the fly, for the indices of the $f$ workers that respond first. The approach is based on the idea of inverting Vandermonde matrices, which can be done very efficiently. In the sequel, we show how to construct an encoding matrix $\mathbf{B}$ for any $w$, $k$ and $n$, such that the system is resilient to $\left\lfloor \frac{wn}{k} \right\rfloor - 1$ stragglers, along with an efficient algorithm for computing the decoding vectors $\{\mathbf{a}_{\mathcal{F}} : \mathcal{F} \subset [n], |\mathcal{F}| = f\}$.
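To give a sense of why precomputing every decoding vector becomes impractical, the sketch below counts the $\binom{n}{f}$ possible sets of returning machines for a few parameter pairs. The pairs and the function name are hypothetical and serve only as an illustration.

```python
import math

def num_decoding_vectors(n: int, f: int) -> int:
    """Number of size-f subsets of n machines, one decoding vector per subset."""
    return math.comb(n, f)

for n, f in [(10, 8), (20, 15), (50, 40), (100, 90)]:
    print(f"n={n:3d}, f={f:3d}: {num_decoding_vectors(n, f):,} decoding vectors")

assert num_decoding_vectors(10, 8) == 45
assert num_decoding_vectors(20, 15) == 15504
```

For $n = 100$ and $f = 90$ the count already exceeds $10^{13}$, so storing one length-$f$ vector per subset is not feasible, whereas an online decoder only ever needs the single vector for the subset that actually returns.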
