Chapter VI: Reed–Solomon Codes for Distributed Computation
6.2 Problem Setup
We consider the problem of straggler mitigation in a distributed gradient descent setting. We focus on the task of fitting a vector of parameters $\beta \in \mathbb{R}^p$ to a dataset $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^{N}$, where $x_i \in \mathbb{R}^p$ and $y_i \in \mathbb{R}$, with the goal of minimizing a loss function of the form
$$L(\mathcal{D}; \beta) = \sum_{i=1}^{N} \ell(x_i, y_i; \beta), \tag{6.1}$$
where $\ell(x_i, y_i; \beta)$ is the loss function of the model when evaluated at $x_i$. Gradient descent algorithms are commonly used to solve such machine learning problems.
The gradient of the loss function with respect to $\beta$ is given by
$$\nabla L(\mathcal{D}; \beta) = \sum_{i=1}^{N} \nabla \ell(x_i, y_i; \beta). \tag{6.2}$$
The model $\beta$ evolves as a function of time according to the gradient descent rule:
$$\beta_{t+1} = \beta_t - \eta_t \nabla L(\mathcal{D}; \beta_t),$$
where $\eta_t$ is known as the learning rate and is allowed to depend on time.
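The update rule can be sketched in a few lines of plain Python. This is only an illustration: it assumes a squared loss $\ell(x, y; \beta) = \frac{1}{2}(x^t \beta - y)^2$ (the chapter leaves $\ell$ generic), a constant learning rate, and hypothetical helper names `grad_point`, `full_gradient` and `gradient_descent`.

```python
def grad_point(x, y, beta):
    """Gradient of the assumed loss 0.5 * (x . beta - y)**2 with respect to beta."""
    residual = sum(xj * bj for xj, bj in zip(x, beta)) - y
    return [residual * xj for xj in x]


def full_gradient(data, beta):
    """Sum of the per-sample gradients, as in (6.2)."""
    total = [0.0] * len(beta)
    for x, y in data:
        for j, gj in enumerate(grad_point(x, y, beta)):
            total[j] += gj
    return total


def gradient_descent(data, beta0, eta, steps):
    """Iterate beta <- beta - eta * grad L(D; beta) with a constant step eta."""
    beta = list(beta0)
    for _ in range(steps):
        grad = full_gradient(data, beta)
        beta = [bj - eta * gj for bj, gj in zip(beta, grad)]
    return beta


# Toy data generated exactly by y = 2*x1 - x2, so the minimizer is (2, -1).
data = [((1.0, 0.0), 2.0), ((0.0, 1.0), -1.0), ((1.0, 1.0), 1.0)]
beta = gradient_descent(data, beta0=(0.0, 0.0), eta=0.1, steps=500)
print(beta)
```

Every step needs the full sum over the dataset, which is the quantity the distributed scheme has to deliver to the master.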
We assume a setting in which the computation is distributed equally among $n$ machines and the master must recover the gradient from any $n - s$ machines in order to update the model $\beta_t$, where $s$ depends on the computational load. See Fig. 6.1.
Figure 6.1: Schematic representation of the taskmaster and the $n$ workers.
More formally, consider the partition of the dataset into $\{\mathcal{D}_1, \ldots, \mathcal{D}_k\}$, where the $\mathcal{D}_i$'s are disjoint subsets of $\mathcal{D}$ of size $N/k$ (assume $k$ divides $N$). This allows us to rewrite (6.2) as
$$\nabla L(\mathcal{D}; \beta) = \sum_{i=1}^{k} \sum_{(x,y) \in \mathcal{D}_i} \nabla \ell(x, y; \beta). \tag{6.3}$$
Define $g_i := \sum_{(x,y) \in \mathcal{D}_i} \nabla \ell(x, y; \beta)$ and $g := [g_1^t, g_2^t, \ldots, g_k^t]^t$, where $g_i$ is a row vector of size $p$. Let $\mathbf{1}$ denote the all-one column vector of length $k$. The goal is to recover $\nabla L(\mathcal{D}; \beta) = \mathbf{1}^t g$ in a distributed fashion. We have a set of $n$ workers $\{W_1, W_2, \ldots, W_n\}$. We also define the taskmaster, $M$, as the entity responsible for computing the gradient from the (partial) results of the computations done by the workers.
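The decomposition in (6.3) can be checked numerically on a toy instance. The sketch below again assumes a squared loss and uses hypothetical helper names (`partition`, `partial_gradient`); integer data keep the arithmetic exact. The list `g` plays the role of the $k \times p$ matrix $g$, and summing its rows gives $\mathbf{1}^t g$.

```python
def grad_point(x, y, beta):
    """Gradient of the assumed loss 0.5 * (x . beta - y)**2 with respect to beta."""
    residual = sum(xj * bj for xj, bj in zip(x, beta)) - y
    return [residual * xj for xj in x]


def partial_gradient(part, beta):
    """g_i: sum of per-sample gradients over one partition D_i."""
    g_i = [0] * len(beta)
    for x, y in part:
        for j, v in enumerate(grad_point(x, y, beta)):
            g_i[j] += v
    return g_i


def partition(data, k):
    """Split the data into k contiguous disjoint parts of size N/k (k must divide N)."""
    size = len(data) // k
    return [data[i * size:(i + 1) * size] for i in range(k)]


data = [((1, 0), 2), ((0, 1), -1), ((1, 1), 1), ((2, 1), 0)]
beta = [1, -2]
k = 2

g = [partial_gradient(part, beta) for part in partition(data, k)]  # k rows of length p
ones_t_g = [sum(row[j] for row in g) for j in range(len(beta))]    # 1^t g
full = partial_gradient(data, beta)                                # gradient over all of D

print(g, ones_t_g, full)
```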
In principle, each $W_i$ can compute a partial gradient and send it back to $M$. Once $M$ receives $\{g_1, \ldots, g_k\}$, it computes the gradient by simply adding the results. To avoid the impact of stragglers, it is crucial to design a flexible scheme that can compute the gradient from the results of some (but not all) machines.
In order to have a scheme that can tolerate any set of stragglers of size at most $s$, the taskmaster should be able to recover the gradient vector given the results of any $f := n - s$ machines. This set of machines can vary from iteration to iteration, and our scheme should accommodate this. Suppose each worker $W_i$ is provided with $w$ data partitions, $\{\mathcal{D}_{i_1}, \ldots, \mathcal{D}_{i_w}\}$, on which it computes the partial gradients $\{g_{i_1}, g_{i_2}, \ldots, g_{i_w}\}$. Here, we assume that all workers do the same amount of computation, i.e., $w$ is fixed for all workers. This is where redundancy is introduced, as the quantity $wn$ is usually greater than (or equal to) $k$. The following proposition gives an upper bound on the number of stragglers in terms of the system parameters. We refer the reader to [Tan+16] for the proof.
Proposition 6.1. In a distributed system with $n$ workers, if we partition the data into $k$ disjoint sets and each worker computes the gradient on exactly $w$ sets, the number of stragglers, $s$, must satisfy
$$s \le \left\lfloor \frac{wn}{k} \right\rfloor - 1. \tag{6.4}$$
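To get a feel for the bound in (6.4), it can be evaluated for a few parameter choices. The function name `max_stragglers` below is hypothetical and the parameter values are arbitrary examples.

```python
def max_stragglers(n, k, w):
    """Right-hand side of (6.4): floor(w * n / k) - 1."""
    return (w * n) // k - 1


print(max_stragglers(n=3, k=3, w=2))    # 1
print(max_stragglers(n=10, k=10, w=3))  # 2
print(max_stragglers(n=5, k=10, w=4))   # 1
print(max_stragglers(n=10, k=10, w=1))  # 0: without redundancy no straggler can be tolerated
```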
After computing the corresponding partial gradients, $\{g_{i_1}, g_{i_2}, \ldots, g_{i_w}\}$, each worker $W_i$ can send all these vectors to $M$. However, there is a benefit in $W_i$ sending a linear combination of $\{g_{i_1}, \ldots, g_{i_w}\}$ instead. Indeed, the amount of data being sent (the communication load) is reduced by a factor of $w$, and we can show that, with an appropriate choice of the encoding coefficients, there is no loss in the number of stragglers that can be tolerated.
This choice amounts to a careful distribution of the data partitions among the $n$ workers, and prescribing a linear combination to each $W_i$. Mathematically, this can be expressed as designing a matrix $B \in \mathbb{C}^{n \times k}$ such that:
• Each row of $B$ contains exactly $w$ nonzero entries.
• The linear space generated by any $f$ rows of $B$ contains the all-one vector, $\mathbf{1}^t$.
The nonzero locations in $b_i^t$, the $i$th row of $B$, determine the indices of the data partitions assigned to worker $i$. The first property ensures that each worker computes partial gradients on exactly $w$ chunks of data. Furthermore, the values of these nonzero entries prescribe the linear combination.
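As a small illustration of how the support of $B$ encodes the data assignment, the sketch below uses a hand-picked $3 \times 3$ matrix with $n = k = 3$ and $w = 2$. This matrix is an illustrative assumption, not the construction developed in this chapter, and the helper names are hypothetical. Indices are 0-based in the code.

```python
from fractions import Fraction as Fr

# Illustrative encoding matrix (assumed for this example): n = k = 3, w = 2.
B = [
    [Fr(1, 2), Fr(1), Fr(0)],
    [Fr(0), Fr(1), Fr(-1)],
    [Fr(1, 2), Fr(0), Fr(1)],
]


def assignments(B):
    """For each worker (row), the partitions (columns) with a nonzero coefficient."""
    return [[j for j, v in enumerate(row) if v != 0] for row in B]


def has_row_sparsity(B, w):
    """First property: every row has exactly w nonzero entries."""
    return all(len(cols) == w for cols in assignments(B))


def replication(B):
    """Number of workers that hold each partition."""
    return [sum(1 for row in B if row[j] != 0) for j in range(len(B[0]))]


print(assignments(B))          # [[0, 1], [1, 2], [0, 2]]
print(has_row_sparsity(B, 2))  # True
print(replication(B))          # [2, 2, 2]
```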
The coded partial gradient sent by $W_i$ is given by
$$c_i = \sum_{j=1}^{k} B_{i,j} g_j = b_i^t g, \tag{6.5}$$
where $c_i \in \mathbb{C}^{1 \times p}$ is a row vector denoting the linear combination of partial gradients sent from $W_i$ to $M$. Since $b_i$ has sparsity $w$, each worker essentially computes a weighted sum of the gradients of $w$ chunks of the data. Define the encoded computation matrix $C \in \mathbb{C}^{n \times p}$ as
$$C = [c_1^t, c_2^t, \ldots, c_n^t]^t = Bg, \tag{6.6}$$
where the $i$th row of $C$ is equal to $c_i$. The goal is to design the matrix $B$ so that the taskmaster can recover the gradient $\nabla L(\mathcal{D}; \beta) = \mathbf{1}^t g$ from any $f$ rows of $C$. In particular, let $F = \{i_1, \ldots, i_f\}$ be the index set of surviving machines, and let $B_F$ and $C_F$ be the sub-matrices of $B$ and $C$ whose rows are indexed by $F$. If $\mathbf{1}^t$ is in the linear space generated by the rows of $B_F$, there is a corresponding vector $a_F$ such that:
$$a_F^t C_F = a_F^t B_F g = \mathbf{1}^t g = \nabla L(\mathcal{D}; \beta). \tag{6.7}$$
This has to hold for any set of indices $F \subset [n]$ of size $f$. Using the theory of maximum-distance separable (MDS) codes, the authors of [Tan+16] construct an encoding matrix $B$, along with the set of decoding vectors $\{a_F : F \subset [n], |F| = f\}$. By Proposition 6.1, the encoding matrix they design must satisfy $w \ge \frac{k}{n}(s + 1)$.
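The encode and decode steps in (6.5) to (6.7) can be traced on a toy instance with $n = k = 3$, $w = 2$, $s = 1$ and $f = 2$. The matrix `B`, the decoding vectors in `A`, and the partial gradients `g` below are hand-picked for illustration (they are assumptions of this sketch, not the chapter's construction). Exact rational arithmetic is used so that equality can be checked directly.

```python
from fractions import Fraction as Fr
from itertools import combinations

# Illustrative encoding matrix (assumed): each row has w = 2 nonzeros.
B = [
    [Fr(1, 2), Fr(1), Fr(0)],
    [Fr(0), Fr(1), Fr(-1)],
    [Fr(1, 2), Fr(0), Fr(1)],
]

# Hand-derived decoding vectors a_F for every surviving set F of size f = 2
# (0-based worker indices); each satisfies a_F^t B_F = 1^t for this B.
A = {
    (0, 1): [Fr(2), Fr(-1)],
    (0, 2): [Fr(1), Fr(1)],
    (1, 2): [Fr(1), Fr(2)],
}

# Arbitrary partial gradients g_1, g_2, g_3 (k = 3 rows, p = 2 columns).
g = [[Fr(1), Fr(2)], [Fr(3), Fr(-1)], [Fr(0), Fr(5)]]


def encode(B, g):
    """C = B g: row i is the coded partial gradient c_i sent by worker i."""
    p = len(g[0])
    return [[sum(B[i][j] * g[j][c] for j in range(len(g))) for c in range(p)]
            for i in range(len(B))]


def decode(C, F, a_F):
    """a_F^t C_F: combine the rows of C returned by the workers in F."""
    p = len(C[0])
    return [sum(a * C[i][c] for a, i in zip(a_F, F)) for c in range(p)]


C = encode(B, g)
target = [sum(row[c] for row in g) for c in range(len(g[0]))]  # 1^t g
recovered = {F: decode(C, F, A[F]) for F in combinations(range(3), 2)}

for F, value in recovered.items():
    print(F, value == target)
```

Whichever single worker is dropped, the remaining two rows of `C` suffice to recover the full gradient, while each worker sends only one vector of length $p$.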
Preliminaries
In a distributed scheme that does not employ redundancy, the taskmaster has to wait for all the workers to finish in order to compute the full gradient. In the scheme outlined above, however, the taskmaster needs to wait only for the fastest $f$ machines to recover the full gradient. Clearly, this requires more computation by each machine.
Note that in the uncoded setting, each worker performs a $\frac{1}{n}$ fraction of the total work, whereas in the coded setting each machine performs a $\frac{w}{k}$ fraction of the total work. From (6.4), we know that if a scheme can tolerate $s$ stragglers, the fraction of computation that each worker does satisfies
$$\frac{w}{k} \ge \frac{s + 1}{n}. \tag{6.8}$$
Therefore, the computation load of each worker increases by a factor of at least $(s + 1)$ relative to the uncoded setting. At this point, it is not immediately clear what the optimal value of $\frac{w}{k}$ is. This optimization will be analyzed in detail in Section 6.5.
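The load bound (6.8) can be made concrete with a short calculation. The helpers below are hypothetical names; `min_partitions_per_worker` rounds $\frac{k}{n}(s + 1)$ up because $w$ must be an integer, which is an observation added for this sketch rather than a statement from the text.

```python
from fractions import Fraction


def min_partitions_per_worker(n, k, s):
    """Smallest integer w with w / k >= (s + 1) / n, i.e. ceil(k * (s + 1) / n)."""
    return (k * (s + 1) + n - 1) // n


def load_increase(n, k, w):
    """Per-worker load w/k in the coded setting divided by the uncoded load 1/n."""
    return Fraction(w, k) / Fraction(1, n)


for n, k, s in [(10, 10, 2), (4, 6, 1), (5, 3, 1)]:
    w = min_partitions_per_worker(n, k, s)
    print(n, k, s, w, load_increase(n, k, w))
```

In the first two cases the increase equals $s + 1$ exactly. In the third, $\frac{k}{n}(s + 1)$ is not an integer, so the increase is strictly larger than $s + 1$.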
It is worth noting that it is often assumed [Tan+16; Lee+16; DCG16] that the decoding vectors are precomputed for all possible combinations of returning machines, and the decoding cost is not taken into account in the total computation time. In a practical system, however, it is not reasonable to compute and store all the decoding vectors, especially as there are $\binom{n}{f}$ such vectors, and this number grows quickly with $n$. A contribution of this chapter is the design of an online algorithm for computing the decoding vectors on the fly, for the indices of the $f$ workers that respond first.
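The growth of $\binom{n}{f}$ is easy to tabulate with the standard library. The function name and the parameter values below are arbitrary choices for illustration.

```python
from math import comb


def num_decoding_vectors(n, s):
    """Number of surviving sets F of size f = n - s, i.e. the binomial coefficient C(n, f)."""
    return comb(n, n - s)


for n, s in [(10, 3), (20, 5), (40, 10)]:
    print(n, s, num_decoding_vectors(n, s))
```

Each stored vector $a_F$ has $f$ entries, so the storage cost of a full lookup table is $f$ times the counts printed above.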
The approach is based on the idea of inverting Vandermonde matrices, which can be done very efficiently. In the sequel, we show how to construct an encoding matrix $B$ for any $w$, $k$ and $n$, such that the system is resilient to $\left\lfloor \frac{wn}{k} \right\rfloor - 1$ stragglers, along with an efficient algorithm for computing the decoding vectors $\{a_F : F \subset [n], |F| = f\}$.
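To illustrate why Vandermonde structure helps, the sketch below solves a square Vandermonde system $Vc = y$, with $V_{i,j} = x_i^j$ and distinct nodes $x_i$, in $O(m^2)$ operations using Newton divided differences, compared with $O(m^3)$ for generic Gaussian elimination. This is a generic textbook technique shown for intuition only. It is not the decoding algorithm of this chapter, and the function name is hypothetical.

```python
from fractions import Fraction as Fr


def solve_vandermonde(nodes, values):
    """Return c with sum_j c[j] * x**j == values[i] at x = nodes[i] (nodes must be distinct)."""
    m = len(nodes)
    x = [Fr(v) for v in nodes]
    d = [Fr(v) for v in values]

    # Newton divided differences, in place: d[i] becomes f[x_0, ..., x_i].
    for level in range(1, m):
        for i in range(m - 1, level - 1, -1):
            d[i] = (d[i] - d[i - 1]) / (x[i] - x[i - level])

    # Expand the Newton form into monomial coefficients:
    # start from q = d[m-1] and repeatedly set q <- q * (x - x_i) + d[i].
    c = [Fr(0)] * m
    c[0] = d[m - 1]
    for i in range(m - 2, -1, -1):
        for j in range(m - 1, 0, -1):
            c[j] = c[j - 1] - x[i] * c[j]
        c[0] = d[i] - x[i] * c[0]
    return c


# The polynomial 1 + x + x^2 takes the values 1, 3, 7 at x = 0, 1, 2.
coeffs = solve_vandermonde([0, 1, 2], [1, 3, 7])
print(coeffs)
```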