CODED COMPUTATION FOR DISTRIBUTED GRADIENT DESCENT
machines, $f$, given a prespecified computational effort per machine.
2. We provide an efficient online decoder, with time complexity $O(f^2)$, for recovering the gradient from any $f$ machines, which is faster than the best known method [197], $O(f^3)$.
3. We analyze the total computation time, and provide a method for finding the optimal coding parameters. We consider heavy-tailed delays, which have been widely observed in CPU job runtimes in practice [124, 93, 94].
The rest of the chapter is organized as follows. In Section 7.2, we describe the problem setup and explain the design objectives in detail. Section 7.3 provides the construction of our coding scheme, using the idea of balanced Reed-Solomon codes.
Our efficient online decoder is presented in Section 7.4. We then characterize the total computation time, and describe the optimal choice of coding parameters, in Section 7.5. Finally, we provide our numerical results in Section 7.6, and conclude in Section 7.7.
7.2 Preliminaries
7.2.1 Problem Setup
The master $M$ intends to train a model using gradient descent by distributing the gradient updates amongst the $n$ workers $W_1, \ldots, W_n$. More precisely, consider a typical scenario, where we want to learn parameters $\beta \in \mathbb{R}^p$ by minimizing a generic loss function $L(\mathcal{D}; \beta)$ over a given dataset $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^{N}$, where $x_i \in \mathbb{R}^p$ and $y_i \in \mathbb{R}$. The loss function can be expressed as the sum of the losses for individual data points, i.e., $L(\mathcal{D}; \beta) = \sum_{i=1}^{N} \ell(x_i, y_i; \beta)$. Therefore, the full gradient, with respect to $\beta$, is given by
$$\nabla L(\mathcal{D}; \beta) = \sum_{i=1}^{N} \nabla \ell(x_i, y_i; \beta). \tag{7.1}$$
The data can be divided into $k$ (disjoint) chunks $\{\mathcal{D}_1, \ldots, \mathcal{D}_k\}$ of size $N/k$, and clearly the gradient can also be written as $\nabla L(\mathcal{D}; \beta) = \sum_{i=1}^{k} \sum_{(x, y) \in \mathcal{D}_i} \nabla \ell(x, y; \beta)$.
Define $g_i := \sum_{(x, y) \in \mathcal{D}_i} \nabla \ell(x, y; \beta)$ as the partial gradient of chunk $i$ for every $i$, and $g^T := [g_1^T, g_2^T, \ldots, g_k^T]$, where $g_i$ is a row vector of length $p$. Therefore $\nabla L(\mathcal{D}; \beta) = \mathbf{1}_{1 \times k}\, g$. Now suppose each worker $W_i$ is assigned $w$ data partitions $\{\mathcal{D}_{i_1}, \ldots, \mathcal{D}_{i_w}\}$, on which it computes the partial gradients $\{g_{i_1}, g_{i_2}, \ldots, g_{i_w}\}$. Note that the "redundancy" in computation is introduced here, since each chunk is allowed to be assigned to multiple workers. Each worker then has to compute its partial gradients, and return a prespecified linear combination of them to the master.
As will be explained in detail, $f$ and $w$ are to be chosen in such a way that the total computation time is minimized. For a fixed $f$ and $w$, we want to be able to recover the gradient using the linear combinations received from the fastest $f$ machines at the master (or equivalently tolerate $s := n - f$ stragglers). Note that we do not assume any prior knowledge about the stragglers, i.e., we shall design a scheme that enables the master to recover the gradient from any set of $f$ machines. It is known [197] that for any fixed $n$ and $w$, an upper bound on the number of stragglers that any scheme can tolerate is:
$$s \le \left\lfloor \frac{wn}{k} \right\rfloor - 1. \tag{7.2}$$
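To make the bound concrete, here is a small sketch that evaluates the right-hand side of (7.2) for a few values of $w$. The function name `max_stragglers` and the choice $n = k = 10$ are assumptions made for illustration only.

```python
def max_stragglers(w, n, k):
    """Right-hand side of (7.2): floor(w * n / k) - 1."""
    return (w * n) // k - 1

n, k = 10, 10  # hypothetical number of workers and data chunks
for w in range(1, k + 1):
    s = max_stragglers(w, n, k)
    # With s stragglers tolerated, the master waits for f = n - s machines.
    print(f"w = {w}: s <= {s}, so f = n - s >= {n - s}")
```

With $w = 1$ no straggler can be tolerated, and each additional chunk per worker buys tolerance to one more straggler in this particular setting.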
The scheme proposed in this work achieves this bound. A coding scheme designed to tolerate $s$ stragglers consists of an encoding matrix $B$, and a collection of decoding vectors $\{a_{\mathcal{F}} : \mathcal{F} \subset [n], |\mathcal{F}| = n - s\}$. The matrix $B$ should satisfy:
1. Each row of $B$ contains exactly $w$ nonzero entries.
2. The linear space generated by any $f$ rows of $B$ contains the all-one vector of length $k$, $\mathbf{1}_{1 \times k}$.
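The values of these nonzero entries prescribe the linear combination sent by $W_i$. In other words, the coded partial gradient sent from $W_i$ to $M$ is given by
$$c_i = \sum_{j=1}^{k} B_{i,j}\, g_j = B_i g, \tag{7.3}$$
where $B_i$ denotes the $i$th row of $B$. The $c_i$'s define the encoded computation matrix $C \in \mathbb{C}^{n \times p}$ as $C = [c_1^T \; c_2^T \; \cdots \; c_n^T]^T = Bg$, where $c_i \in \mathbb{C}^{1 \times p}$. The decoding vectors are chosen as follows: let $\mathcal{F} = \{i_1, \ldots, i_f\}$ be the indices of the returning machines and let $B_{\mathcal{F}}$ be the sub-matrix of $B$ with rows indexed by $\mathcal{F}$. If $\mathbf{1}_{1 \times k}$ is in the linear space generated by the rows of $B_{\mathcal{F}}$, as the second property suggests, $a_{\mathcal{F}}$ is chosen such that $a_{\mathcal{F}} B_{\mathcal{F}} = \mathbf{1}_{1 \times k}$. As a result, we have
$$a_{\mathcal{F}} C_{\mathcal{F}} = a_{\mathcal{F}} B_{\mathcal{F}}\, g = \mathbf{1}_{1 \times k}\, g = \nabla L(\mathcal{D}; \beta). \tag{7.4}$$
When this holds for any set of indices $\mathcal{F} \subset [n]$ of size $f$, it means that the gradient can be recovered from the set of $f$ machines that return fastest.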
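The encoding and decoding steps in (7.3) and (7.4) can be illustrated on a toy instance. The sketch below uses a small hand-picked matrix $B$ with $n = 3$, $k = 3$, $w = 2$ and $f = 2$, which is an assumption for illustration and not the balanced Reed-Solomon construction of this chapter. The helper `decoding_vector` is likewise hypothetical: it solves $a_{\mathcal{F}} B_{\mathcal{F}} = \mathbf{1}_{1 \times k}$ by generic Gauss-Jordan elimination over exact fractions, not by the $O(f^2)$ decoder described later.

```python
from fractions import Fraction as Fr
from itertools import combinations

# Toy encoding matrix (assumed): every row has exactly w = 2 nonzero entries.
B = [[Fr(1, 2), Fr(1), Fr(0)],
     [Fr(0), Fr(1), Fr(-1)],
     [Fr(1, 2), Fr(0), Fr(1)]]
n, k, f, p = 3, 3, 2, 2

def decoding_vector(B_F):
    """Return a with a * B_F = all-ones, or raise if no such vector exists."""
    rows, cols = len(B_F), len(B_F[0])
    # One equation per column j: sum_i a_i * B_F[i][j] = 1.
    A = [[B_F[i][j] for i in range(rows)] + [Fr(1)] for j in range(cols)]
    pivots, row = [], 0
    for col in range(rows):
        piv = next((r for r in range(row, cols) if A[r][col] != 0), None)
        if piv is None:
            continue
        A[row], A[piv] = A[piv], A[row]
        A[row] = [v / A[row][col] for v in A[row]]
        for r in range(cols):
            if r != row and A[r][col] != 0:
                factor = A[r][col]
                A[r] = [vr - factor * vp for vr, vp in zip(A[r], A[row])]
        pivots.append(col)
        row += 1
    if any(all(v == 0 for v in r[:-1]) and r[-1] != 0 for r in A):
        raise ValueError("all-one vector is not in the row space")
    a = [Fr(0)] * rows
    for r, col in enumerate(pivots):
        a[col] = A[r][-1]
    return a

# Partial gradients g (k x p) and coded gradients C = B g (n x p).
g = [[Fr(1), Fr(2)], [Fr(3), Fr(4)], [Fr(5), Fr(6)]]
C = [[sum(B[i][j] * g[j][c] for j in range(k)) for c in range(p)]
     for i in range(n)]
full = [sum(column) for column in zip(*g)]

# Any f = 2 returning workers suffice to recover the full gradient.
for F in combinations(range(n), f):
    a = decoding_vector([B[i] for i in F])
    recovered = [sum(a[m] * C[i][c] for m, i in enumerate(F)) for c in range(p)]
    assert recovered == full
```

Exact rational arithmetic is used here only so that the equality check is not affected by floating-point rounding.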
The pseudocode listing of Algorithm 5 outlines the overall procedure for implementing our scheme, as just described.
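Algorithm 5 Pseudocode of the proposed scheme
Require:
    $\{\mathcal{D}_1, \ldots, \mathcal{D}_k\}$: Dataset partition
    $B_1, \ldots, B_n$: Encoding vectors
    $T$: Number of iterations
    $\{\eta_t\}_{t=1}^{T}$: Learning rates
Ensure:
    $\beta_T$: Parameters
Partition $\mathcal{D}$ into $\{\mathcal{D}_1, \ldots, \mathcal{D}_k\}$
Assign to $W_i$ the partitions $\{\mathcal{D}_{i_1}, \ldots, \mathcal{D}_{i_w}\}$
Assign to $W_i$ the encoding vector $B_i$
$\beta_0 \leftarrow 0_{p \times 1}$
$t \leftarrow 0$
while $t < T$ do
    $M$ sends out $\beta_t$ to $W_1, \ldots, W_n$
    $W_i$ computes partial gradients $g_{i_1}, \ldots, g_{i_w}$
    $W_i$ encodes $g_{i_1}, \ldots, g_{i_w}$ to $c_i$ using $B_i$
    $M$ computes $a_{\mathcal{F}}$ corresponding to the first $f$ returning machines
    $M$ recovers $\nabla L(\mathcal{D}; \beta_t)$ using $\{c_i\}_{i \in \mathcal{F}}$ and $a_{\mathcal{F}}$
    $M$ updates model: $\beta_{t+1} = \beta_t - \eta_t \nabla L(\mathcal{D}; \beta_t)$
    $t \leftarrow t + 1$
end while
return $\beta_T$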
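The loop of Algorithm 5 can be mimicked in a single process. The following is a rough sketch under several assumptions that are not part of the chapter: a toy encoding matrix with $n = 3$, $k = 3$, $w = 2$, $f = 2$; a squared loss on noiseless synthetic data; a constant learning rate; stragglers simulated by a random completion order; and decoding vectors stored in a lookup table `A`, which is only practical because the toy instance has three subsets. The names `partial_gradient`, `worker` and `train` are hypothetical.

```python
import random

# Toy encoding matrix (assumed) and decoding vectors with a_F B_F = [1, 1, 1].
B = [[0.5, 1.0, 0.0],
     [0.0, 1.0, -1.0],
     [0.5, 0.0, 1.0]]
A = {(0, 1): [2.0, -1.0], (0, 2): [1.0, 1.0], (1, 2): [1.0, 2.0]}
n, k, f = 3, 3, 2

def partial_gradient(chunk, beta):
    """Sum of squared-loss gradients (x . beta - y) * x over one chunk."""
    g = [0.0] * len(beta)
    for x, y in chunk:
        r = sum(xj * bj for xj, bj in zip(x, beta)) - y
        g = [gj + r * xj for gj, xj in zip(g, x)]
    return g

def worker(i, chunks, beta):
    """Worker W_i: compute its assigned partial gradients and encode with B_i."""
    c = [0.0] * len(beta)
    for j in range(k):
        if B[i][j] != 0.0:
            gj = partial_gradient(chunks[j], beta)
            c = [ci + B[i][j] * v for ci, v in zip(c, gj)]
    return c

def train(chunks, p, T, eta, rng):
    """Master loop: decode the gradient from the first f workers to return."""
    beta = [0.0] * p
    for _ in range(T):
        order = list(range(n))
        rng.shuffle(order)              # simulated completion order
        F = tuple(sorted(order[:f]))    # indices of the f fastest workers
        coded = {i: worker(i, chunks, beta) for i in F}
        a = A[F]
        grad = [sum(a[m] * coded[i][d] for m, i in enumerate(F))
                for d in range(p)]
        beta = [b - eta * gd for b, gd in zip(beta, grad)]
    return beta

rng = random.Random(1)
true_beta = [1.0, -2.0]
xs = [[rng.uniform(-1, 1) for _ in true_beta] for _ in range(30)]
data = [(x, sum(xj * bj for xj, bj in zip(x, true_beta))) for x in xs]
chunks = [data[0:10], data[10:20], data[20:30]]

beta_hat = train(chunks, p=2, T=300, eta=0.02, rng=rng)
print(beta_hat)
```

Because the decoded vector equals the full gradient regardless of which worker is slow, the iterates follow ordinary gradient descent and `beta_hat` approaches `true_beta` on this noiseless data.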
7.2.2 Computational Trade-offs
In a distributed scheme that does not employ redundancy, the taskmaster has to wait for all the workers to finish in order to compute the full gradient. However, in the scheme outlined above, the taskmaster needs to wait only for the fastest $f$ machines to recover the full gradient. Clearly, this requires more computation by each machine.
Note that in the uncoded setting, the amount of computation that each worker does is $1/n$ of the total work, whereas in the coded setting each machine performs a $w/k$ fraction of the total work. From (7.2), we know that if a scheme can tolerate $s$ stragglers, the fraction of computation that each worker does is $w/k \ge (s+1)/n$. Therefore, the computation load of each worker increases by a factor of at least $(s+1)$. As will be explained further in Section 7.5, there is a sweet spot for $w/k$ (and consequently $s$) that minimizes the expected total time that the master waits in order to recover the full gradient update.
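For completeness, the step from (7.2) to the per-worker load can be written out as follows. This is a short derivation using only the symbols defined above; the numerical values at the end are a hypothetical example.

$$s \le \left\lfloor \frac{wn}{k} \right\rfloor - 1 \le \frac{wn}{k} - 1 \;\Longrightarrow\; s + 1 \le \frac{wn}{k} \;\Longrightarrow\; \frac{w}{k} \ge \frac{s+1}{n}, \qquad \frac{w/k}{1/n} = \frac{wn}{k} \ge s + 1.$$

For instance, with $n = 10$ workers and a target of $s = 2$ stragglers, each worker must process at least $3/10$ of the data, compared with $1/10$ in the uncoded setting.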
It is worth noting that it is often assumed [197, 123, 70] that the decoding vectors are precomputed for all possible combinations of returning machines, and the decoding cost is not taken into account in the total computation time. In a practical system, however, it is not very reasonable to compute and store all the decoding vectors, especially as there are $\binom{n}{f}$ such vectors, a number that grows quickly with $n$. In this work, we introduce an online algorithm for computing the decoding vectors on the fly, for the indices of the $f$ workers that respond first. The approach is based on the idea of inverting Vandermonde matrices, which can be done very efficiently. In the sequel, we show how to construct an encoding matrix $B$ for any $w$, $k$ and $n$, such that the system is resilient to $\lfloor wn/k \rfloor - 1$ stragglers, along with an efficient algorithm for computing the decoding vectors $\{a_{\mathcal{F}} : \mathcal{F} \subset [n], |\mathcal{F}| = f\}$.
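To give a feel for why precomputing every decoding vector becomes impractical, the sketch below counts the size-$f$ subsets of $n$ workers. The function name `num_decoding_vectors` and the choice of waiting for 80% of the workers are assumptions made for illustration.

```python
import math

def num_decoding_vectors(n, f):
    """Number of size-f subsets of n workers, i.e. the binomial coefficient."""
    return math.comb(n, f)

for n in (10, 20, 50, 100):
    f = (4 * n) // 5  # hypothetical choice: wait for 80% of the workers
    print(f"n = {n}, f = {f}: {num_decoding_vectors(n, f)} decoding vectors")
```

Each stored vector has $f$ entries, so the table size grows even faster than the count printed here.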