
CODED COMPUTATION FOR DISTRIBUTED GRADIENT DESCENT

machines, 𝑓, given a prespecified computational effort per machine.

2. We provide an efficient online decoder, with time complexity $O(f^2)$, for recovering the gradient from any $f$ machines. This is faster than the best known method [197], which requires $O(f^3)$ time.

3. We analyze the total computation time, and provide a method for finding the optimal coding parameters. We consider heavy-tailed delays, which have been widely observed in CPU job runtimes in practice [124, 93, 94].

The rest of the chapter is organized as follows. In Section 7.2, we describe the problem setup and explain the design objectives in detail. Section 7.3 provides the construction of our coding scheme, using the idea of balanced Reed-Solomon codes.

Our efficient online decoder is presented in Section 7.4. We then characterize the total computation time, and describe the optimal choice of coding parameters, in Section 7.5. Finally, we provide our numerical results in Section 7.6, and conclude in Section 7.7.

7.2 Preliminaries

7.2.1 Problem Setup

Consider a setting in which a master $M$ intends to train a model using gradient descent by distributing the gradient updates amongst $n$ workers $W_1, \dots, W_n$. More precisely, consider a typical scenario in which we want to learn parameters $\beta \in \mathbb{R}^p$ by minimizing a generic loss function $L(\mathcal{D}; \beta)$ over a given dataset $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^{N}$, where $x_i \in \mathbb{R}^p$ and $y_i \in \mathbb{R}$. The loss function can be expressed as the sum of the losses for individual data points, i.e., $L(\mathcal{D}; \beta) = \sum_{i=1}^{N} \ell(x_i, y_i; \beta)$. Therefore, the full gradient with respect to $\beta$ is given by

$$\nabla L(\mathcal{D}; \beta) = \sum_{i=1}^{N} \nabla \ell(x_i, y_i; \beta). \tag{7.1}$$

The data can be divided into $k$ disjoint chunks $\{\mathcal{D}_1, \dots, \mathcal{D}_k\}$ of size $N/k$, and the gradient can then also be written as $\nabla L(\mathcal{D}; \beta) = \sum_{i=1}^{k} \sum_{(x, y) \in \mathcal{D}_i} \nabla \ell(x, y; \beta)$.

Define $g_i := \sum_{(x, y) \in \mathcal{D}_i} \nabla \ell(x, y; \beta)$ as the partial gradient of chunk $i$ for every $i$, and $g^T := [g_1^T, g_2^T, \dots, g_k^T]$, where $g_i$ is a row vector of length $p$. Therefore $\nabla L(\mathcal{D}; \beta) = \mathbf{1}_{1 \times k}\, g$.

Now suppose each worker $W_i$ is assigned $w$ data partitions $\{\mathcal{D}_{i_1}, \dots, \mathcal{D}_{i_w}\}$, on which it computes the partial gradients $\{g_{i_1}, g_{i_2}, \dots, g_{i_w}\}$. Note that the "redundancy" in computation is introduced here, since each chunk may be assigned to multiple workers. Each worker then computes its partial gradients and returns a prespecified linear combination of them to the master.
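The chunk decomposition above can be checked numerically. The following is a minimal sketch, assuming a squared-error loss $\ell(x, y; \beta) = \tfrac{1}{2}(x\beta - y)^2$ (the text leaves the loss generic) and a hypothetical helper `partial_gradients`; it stacks the $k$ partial gradients into $g$ and confirms that $\mathbf{1}_{1 \times k}\, g$ equals the full gradient in (7.1).

```python
import numpy as np

def partial_gradients(X, y, beta, k):
    """Return the k x p matrix g whose i-th row is the gradient summed over chunk i.

    Assumes the squared-error loss 0.5 * (x @ beta - y)**2, chosen only
    for illustration; the chapter itself keeps the loss generic.
    """
    chunks_X = np.array_split(X, k)
    chunks_y = np.array_split(y, k)
    rows = [Xi.T @ (Xi @ beta - yi) for Xi, yi in zip(chunks_X, chunks_y)]
    return np.vstack(rows)

rng = np.random.default_rng(0)
N, p, k = 12, 3, 4
X = rng.normal(size=(N, p))
y = rng.normal(size=N)
beta = rng.normal(size=p)

g = partial_gradients(X, y, beta, k)      # shape (k, p)
full_gradient = X.T @ (X @ beta - y)      # equation (7.1) for this loss
recovered = np.ones((1, k)) @ g           # 1_{1 x k} g, shape (1, p)
```

Here `recovered` and `full_gradient` agree up to floating-point rounding, which is the identity that the coding scheme relies on.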

As will be explained in detail, the number of chunks $k$ and the number of partitions per worker $w$ are to be chosen such that the total computation time is minimized. For fixed $k$ and $w$, we want to be able to recover the gradient at the master using the linear combinations received from the fastest $f$ machines (or, equivalently, to tolerate $s := n - f$ stragglers). Note that we do not assume any prior knowledge about the stragglers, i.e., we shall design a scheme that enables the master to recover the gradient from any set of $f$ machines. It is known [197] that for any fixed $k$ and $w$, an upper bound on the number of stragglers that any scheme can tolerate is

$$s \le \left\lfloor \frac{wn}{k} \right\rfloor - 1. \tag{7.2}$$

The scheme proposed in this work achieves this bound. A coding scheme designed to tolerate $s$ stragglers consists of an encoding matrix $\mathbf{B}$ and a collection of decoding vectors $\{\mathbf{a}_{\mathcal{F}} : \mathcal{F} \subset [n], |\mathcal{F}| = n - s\}$. The matrix $\mathbf{B}$ should satisfy:

1. Each row of $\mathbf{B}$ contains exactly $w$ nonzero entries.

2. The linear space generated by any $f$ rows of $\mathbf{B}$ contains the all-one vector of length $k$, $\mathbf{1}_{1 \times k}$.

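The values of these nonzero entries prescribe the linear combination sent by $W_i$. In other words, the coded partial gradient sent from $W_i$ to the master $M$ is given by

$$c_i = \sum_{j=1}^{k} \mathbf{B}_{i,j}\, g_j = \mathbf{B}_i\, g, \tag{7.3}$$

where $\mathbf{B}_i$ denotes the $i$th row of $\mathbf{B}$. The $c_i$'s define the encoded computation matrix $\mathbf{C} \in \mathbb{C}^{n \times p}$ as $\mathbf{C} = [c_1^T \; c_2^T \; \cdots \; c_n^T]^T = \mathbf{B} g$, where $c_i \in \mathbb{C}^{1 \times p}$. The decoding vectors are chosen as follows: let $\mathcal{F} = \{i_1, \dots, i_f\}$ be the indices of the returning machines and let $\mathbf{B}_{\mathcal{F}}$ be the sub-matrix of $\mathbf{B}$ with rows indexed by $\mathcal{F}$. If $\mathbf{1}_{1 \times k}$ is in the linear space generated by the rows of $\mathbf{B}_{\mathcal{F}}$, as the second property requires, $\mathbf{a}_{\mathcal{F}}$ is chosen such that $\mathbf{a}_{\mathcal{F}} \mathbf{B}_{\mathcal{F}} = \mathbf{1}_{1 \times k}$. As a result, we have

$$\mathbf{a}_{\mathcal{F}} \mathbf{C}_{\mathcal{F}} = \mathbf{a}_{\mathcal{F}} \mathbf{B}_{\mathcal{F}}\, g = \mathbf{1}_{1 \times k}\, g = \nabla L(\mathcal{D}; \beta). \tag{7.4}$$

When this holds for any set of indices $\mathcal{F} \subset [n]$ of size $f$, the gradient can be recovered from the set of $f$ machines that return fastest.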
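To make the two properties of $\mathbf{B}$ and the decoding identity (7.4) concrete, here is a small sketch for $n = 3$, $k = 3$, $w = 2$, $s = 1$. The particular matrix `B` is a hand-picked illustrative example, not the balanced Reed-Solomon construction of this chapter, and the hypothetical helper `decoding_vector` uses a generic least-squares solve rather than the efficient online decoder of Section 7.4.

```python
import itertools
import numpy as np

# Illustrative encoding matrix: n = 3 workers, k = 3 chunks, w = 2 nonzeros per row.
B = np.array([[0.5, 1.0,  0.0],
              [0.0, 1.0, -1.0],
              [0.5, 0.0,  1.0]])
n, k = B.shape
f = 2  # any f = n - s = 2 returning workers should suffice (s = 1)

def decoding_vector(B, F):
    """Solve a_F B_F = 1_{1 x k} by least squares; fail if no exact solution exists."""
    B_F = B[list(F), :]
    ones = np.ones(B.shape[1])
    a_F, *_ = np.linalg.lstsq(B_F.T, ones, rcond=None)
    if not np.allclose(a_F @ B_F, ones):
        raise ValueError(f"all-ones vector is not in the row space of B_F for F={F}")
    return a_F

rng = np.random.default_rng(1)
p = 4
g = rng.normal(size=(k, p))     # stand-in partial gradients, one row per chunk
C = B @ g                       # row i is the coded gradient c_i = B_i g, as in (7.3)
full_gradient = g.sum(axis=0)   # 1_{1 x k} g

for F in itertools.combinations(range(n), f):
    a_F = decoding_vector(B, F)
    # Identity (7.4): a_F C_F recovers the full gradient for every F of size f.
    assert np.allclose(a_F @ C[list(F), :], full_gradient)
```

Because every pair of rows of this `B` spans the all-one vector, the loop succeeds for all three choices of $\mathcal{F}$, so the master never needs to know in advance which worker straggles.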

The pseudocode listing of Algorithm 5 outlines the overall procedure for implementing our scheme, as just described.

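Algorithm 5 Pseudocode of the proposed scheme

Require:
    $\{\mathcal{D}_1, \dots, \mathcal{D}_k\}$: Dataset partition
    $\mathbf{B}_1, \dots, \mathbf{B}_n$: Encoding vectors
    $T$: Number of iterations
    $\{\eta_t\}_{t=1}^{T}$: Learning rates
Ensure:
    $\beta_T$: Parameters

Partition $\mathcal{D}$ into $\{\mathcal{D}_1, \dots, \mathcal{D}_k\}$
Assign to $W_i$ the partitions $\{\mathcal{D}_{i_1}, \dots, \mathcal{D}_{i_w}\}$
Assign to $W_i$ the encoding vector $\mathbf{B}_i$
$\beta_0 \leftarrow 0_{p \times 1}$
while $t < T$ do
    $M$ sends out $\beta_t$ to $W_1, \dots, W_n$
    $W_i$ computes partial gradients $g_{i_1}, \dots, g_{i_w}$
    $W_i$ encodes $g_{i_1}, \dots, g_{i_w}$ to $c_i$ using $\mathbf{B}_i$
    $M$ computes $\mathbf{a}_{\mathcal{F}}$ corresponding to the first $f$ returning machines
    $M$ recovers $\nabla L(\mathcal{D}; \beta_t)$ using $\{c_i\}_{i \in \mathcal{F}}$ and $\mathbf{a}_{\mathcal{F}}$
    $M$ updates model: $\beta_{t+1} = \beta_t - \eta_t \nabla L(\mathcal{D}; \beta_t)$
end while
return $\beta_T$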
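The loop of Algorithm 5 can be simulated on a single machine. The sketch below rests on several assumptions that are not part of the text: a squared-error loss, a constant learning rate `eta` in place of $\{\eta_t\}$, a uniformly random subset of $f$ workers standing in for "the first $f$ returning machines", and a least-squares solve for $\mathbf{a}_{\mathcal{F}}$ instead of the decoder of Section 7.4. The function name `coded_gradient_descent` is hypothetical.

```python
import numpy as np

def coded_gradient_descent(X, y, B, f, T, eta, seed=0):
    """Simulate Algorithm 5 for the loss 0.5 * ||X beta - y||^2 (an assumption)."""
    rng = np.random.default_rng(seed)
    n, k = B.shape
    chunks = list(zip(np.array_split(X, k), np.array_split(y, k)))
    beta = np.zeros(X.shape[1])
    for _ in range(T):
        # Random f workers play the role of the fastest ones; the rest straggle.
        F = sorted(rng.choice(n, size=f, replace=False))
        coded = {}
        for i in F:
            # Worker i only touches the chunks in the support of its row B_i.
            support = np.flatnonzero(B[i])
            coded[i] = sum(
                B[i, j] * (chunks[j][0].T @ (chunks[j][0] @ beta - chunks[j][1]))
                for j in support
            )
        # Master: find a_F with a_F B_F = 1_{1 x k}, then decode and update.
        a_F, *_ = np.linalg.lstsq(B[F, :].T, np.ones(k), rcond=None)
        grad = sum(a * coded[i] for a, i in zip(a_F, F))
        beta = beta - eta * grad
    return beta

# Usage with an illustrative B (n = 3, k = 3, w = 2, tolerating s = 1 straggler).
B = np.array([[0.5, 1.0, 0.0], [0.0, 1.0, -1.0], [0.5, 0.0, 1.0]])
rng = np.random.default_rng(2)
X = rng.normal(size=(30, 3))
beta_true = np.array([1.0, -2.0, 0.5])
y = X @ beta_true
beta_coded = coded_gradient_descent(X, y, B, f=2, T=200, eta=0.01)

# Reference: ordinary gradient descent that waits for the full gradient.
beta_plain = np.zeros(3)
for _ in range(200):
    beta_plain = beta_plain - 0.01 * (X.T @ (X @ beta_plain - y))
```

Since the decoded gradient equals the full gradient in every iteration, `beta_coded` tracks `beta_plain` up to rounding error regardless of which worker is dropped.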

7.2.2 Computational Trade-offs

In a distributed scheme that does not employ redundancy, the taskmaster has to wait for all the workers to finish in order to compute the full gradient. In the scheme outlined above, however, the taskmaster needs to wait only for the fastest $f$ machines to recover the full gradient. Clearly, this requires more computation by each machine.

Note that in the uncoded setting, the amount of computation that each worker does is $\frac{1}{n}$ of the total work, whereas in the coded setting each machine performs a $\frac{w}{k}$ fraction of the total work. From (7.2), we know that if a scheme can tolerate $s$ stragglers, the fraction of computation that each worker does is $\frac{w}{k} \ge \frac{s+1}{n}$. Therefore, the computation load of each worker increases by a factor of at least $(s+1)$ relative to the uncoded setting. As will be explained further in Section 7.5, there is a sweet spot for $\frac{w}{k}$ (and consequently $s$) that minimizes the expected total time that the master waits in order to recover the full gradient update.
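The trade-off between straggler tolerance and per-worker load can be tabulated directly from (7.2). The following sketch uses hypothetical helper names (`max_stragglers`, `load_summary`) and exact rational arithmetic; it assumes the scheme meets the bound (7.2) with equality, as claimed for the proposed construction.

```python
from fractions import Fraction

def max_stragglers(n: int, k: int, w: int) -> int:
    """Upper bound (7.2) on tolerable stragglers: floor(w * n / k) - 1.

    A negative result means w * n < k, i.e. the chunks cannot all be covered.
    """
    return (w * n) // k - 1

def load_summary(n: int, k: int, w: int) -> dict:
    """Compare per-worker load in the uncoded (1/n) and coded (w/k) settings."""
    s = max_stragglers(n, k, w)
    uncoded = Fraction(1, n)
    coded = Fraction(w, k)
    return {
        "s": s,                       # stragglers tolerated at the bound
        "f": n - s,                   # machines the master must wait for
        "uncoded_load": uncoded,
        "coded_load": coded,
        "load_factor": coded / uncoded,
    }

for w in (1, 2, 3, 5):
    print(w, load_summary(n=10, k=10, w=w))
```

For $n = k = 10$ and $w = 3$, for instance, the scheme tolerates $s = 2$ stragglers while each worker does three times the uncoded work, matching the factor $s + 1$.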

It is worth noting that it is often assumed [197, 123, 70] that the decoding vectors are precomputed for all possible combinations of returning machines, and the decoding cost is not taken into account in the total computation time. In a practical system, however, it is not very reasonable to compute and store all the decoding vectors, especially as there are $\binom{n}{f}$ such vectors, a number that grows quickly with $n$. In this work, we introduce an online algorithm for computing the decoding vectors on the fly, for the indices of the $f$ workers that respond first. The approach is based on the idea of inverting Vandermonde matrices, which can be done very efficiently. In the sequel, we show how to construct an encoding matrix $\mathbf{B}$ for any $w$, $k$ and $n$, such that the system is resilient to $\left\lfloor \frac{wn}{k} \right\rfloor - 1$ stragglers, along with an efficient algorithm for computing the decoding vectors $\{\mathbf{a}_{\mathcal{F}} : \mathcal{F} \subset [n], |\mathcal{F}| = f\}$.
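To give a sense of why precomputing every decoding vector is impractical, the short sketch below counts the $\binom{n}{f}$ subsets for a few system sizes. The specific values of $n$ and $s$ are illustrative assumptions, and `num_decoding_vectors` is a hypothetical helper.

```python
import math

def num_decoding_vectors(n: int, f: int) -> int:
    """Number of size-f subsets of n workers, one decoding vector per subset."""
    return math.comb(n, f)

# Illustrative sizes: tolerate s stragglers out of n workers, so f = n - s.
for n, s in [(10, 2), (20, 5), (40, 10), (100, 20)]:
    f = n - s
    print(f"n={n:3d}  f={f:3d}  decoding vectors: {num_decoding_vectors(n, f)}")
```

Each stored vector also has $f$ entries, so the storage cost grows even faster than the count itself; computing $\mathbf{a}_{\mathcal{F}}$ online avoids this table entirely.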