Problem Setup - Reed–Solomon Codes for Distributed Computation

Chapter VI: Reed–Solomon Codes for Distributed Computation

6.2 Problem Setup

We consider the problem of straggler mitigation in a distributed gradient descent setting. We focus on the task of fitting a vector of parameters $\beta \in \mathbb{R}^p$ to a dataset $D = \{(x_i, y_i)\}_{i=1}^{N}$, where $x_i \in \mathbb{R}^p$ and $y_i \in \mathbb{R}$, and the goal is to minimize a loss function of the form

$$L(D; \beta) = \sum_{i=1}^{N} \ell(x_i, y_i; \beta), \tag{6.1}$$

where $\ell(x_i, y_i; \beta)$ is the loss of the model evaluated at the data point $(x_i, y_i)$. Gradient descent algorithms are commonly used to solve such machine learning problems.

The gradient of the loss function with respect to $\beta$ is given by

$$\nabla L(D; \beta) = \sum_{i=1}^{N} \nabla \ell(x_i, y_i; \beta). \tag{6.2}$$

The model $\beta$ evolves as a function of time according to the gradient descent rule:

$$\beta_{t+1} = \beta_t - \eta_t \nabla L(D; \beta_t),$$

where $\eta_t$ is known as the learning rate and is allowed to depend on time.
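As a concrete illustration of (6.1), (6.2) and the update rule, the following minimal Python sketch runs gradient descent under the assumption of a squared-error loss $\ell(x, y; \beta) = \frac{1}{2}(x^t\beta - y)^2$. The loss, the toy dataset, the constant learning rate and all function names are illustrative assumptions and do not come from the chapter.

```python
def grad_point(x, y, beta):
    """Gradient of the assumed loss 0.5 * (x . beta - y)^2 with respect to beta."""
    residual = sum(xj * bj for xj, bj in zip(x, beta)) - y
    return [residual * xj for xj in x]


def full_gradient(data, beta):
    """Sum of the per-sample gradients, as in (6.2)."""
    total = [0.0] * len(beta)
    for x, y in data:
        for j, gj in enumerate(grad_point(x, y, beta)):
            total[j] += gj
    return total


def gradient_descent(data, beta0, eta, steps):
    """Iterate beta <- beta - eta * grad L(D; beta) with a constant learning rate."""
    beta = list(beta0)
    for _ in range(steps):
        grad = full_gradient(data, beta)
        beta = [bj - eta * gj for bj, gj in zip(beta, grad)]
    return beta


# Hypothetical toy dataset generated by y = 2 * x1 - 1 * x2.
data = [((1.0, 0.0), 2.0), ((0.0, 1.0), -1.0), ((1.0, 1.0), 1.0), ((2.0, 1.0), 3.0)]
beta_hat = gradient_descent(data, beta0=[0.0, 0.0], eta=0.1, steps=500)
print(beta_hat)  # should be close to [2, -1] for this toy problem
```

Every iteration requires the full sum in `full_gradient`, which is the quantity the rest of this section distributes across workers.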

We assume the setting where the computation is distributed equally amongst $n$ machines and the master is to recover the gradient from any $n - s$ machines in order to update the model $\beta_t$, where $s$ depends on the computational load. See Fig. 6.1.

Figure 6.1: Schematic representation of the taskmaster and the $n$ workers.

More formally, consider the partition of the dataset into $\{D_1, \ldots, D_k\}$, where the $D_i$'s are disjoint subsets of $D$ of size $\frac{N}{k}$ (assume $k$ divides $N$). This allows us to rewrite (6.2) as

$$\nabla L(D; \beta) = \sum_{i=1}^{k} \sum_{(x,y) \in D_i} \nabla \ell(x, y; \beta). \tag{6.3}$$

Define $g_i := \sum_{(x,y) \in D_i} \nabla \ell(x, y; \beta)$ and $g := [g_1^t, g_2^t, \ldots, g_k^t]^t$, where $g_i$ is a row vector of size $p$. Let $\mathbf{1}$ denote the all-one column vector of length $k$. The goal is to recover $\nabla L(D; \beta) = \mathbf{1}^t g$ in a distributed fashion. We have a set of $n$ workers $\{W_1, W_2, \ldots, W_n\}$. Also define the taskmaster, $M$, as the entity responsible for computing the gradient from the (partial) results of the computations done by the workers.

In principle, each $W_i$ can compute a partial gradient and send it back to $M$. Once $M$ receives $\{g_1, \ldots, g_k\}$, it computes the gradient by simply adding the results. In order to avoid the impact of stragglers, it is crucial to design a flexible scheme that can compute the gradient using the results of some (but not all) machines.
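The decomposition (6.3) can be checked numerically. The sketch below again assumes a squared-error loss and a hypothetical toy dataset; it splits the data into $k$ equal chunks, computes each partial gradient $g_i$, and confirms that their sum $\mathbf{1}^t g$ matches the gradient computed over the whole dataset.

```python
def grad_point(x, y, beta):
    """Gradient of the assumed loss 0.5 * (x . beta - y)^2 with respect to beta."""
    residual = sum(xj * bj for xj, bj in zip(x, beta)) - y
    return [residual * xj for xj in x]


def partial_gradient(chunk, beta):
    """g_i: sum of the per-sample gradients over one partition D_i."""
    g = [0.0] * len(beta)
    for x, y in chunk:
        for j, gj in enumerate(grad_point(x, y, beta)):
            g[j] += gj
    return g


def partition(data, k):
    """Split the data into k disjoint chunks of size N / k (k must divide N)."""
    if len(data) % k != 0:
        raise ValueError("k must divide the number of samples")
    size = len(data) // k
    return [data[i * size:(i + 1) * size] for i in range(k)]


# Hypothetical dataset with N = 6 samples and p = 2 parameters.
data = [((1.0, 0.0), 2.0), ((0.0, 1.0), -1.0), ((1.0, 1.0), 1.0),
        ((2.0, 1.0), 3.0), ((1.0, 2.0), 0.0), ((3.0, 1.0), 5.0)]
beta = [0.5, 0.25]
k = 3

g = [partial_gradient(chunk, beta) for chunk in partition(data, k)]  # k rows of size p
gradient_from_parts = [sum(col) for col in zip(*g)]                  # 1^t g
gradient_direct = partial_gradient(data, beta)                       # sum over all of D
print(gradient_from_parts, gradient_direct)
```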

In order to have a scheme that can tolerate any set of stragglers of size at most $s$, the taskmaster should be able to recover the gradient vector given the results of any $f := n - s$ machines. This set of machines can vary from iteration to iteration, and so our scheme should accommodate this fact. Suppose each worker $W_i$ is provided with $w$ data partitions, $\{D_{i_1}, \ldots, D_{i_w}\}$, on which it computes the partial gradients $\{g_{i_1}, g_{i_2}, \ldots, g_{i_w}\}$. Here, we assume that the workers perform the same amount of computation, i.e., $w$ is fixed for all workers. This is where redundancy is introduced, as the quantity $wn$ is usually greater than (or equal to) $k$. The following proposition gives an upper bound on the number of stragglers in terms of the system parameters. We refer the reader to [Tan+16] for the proof.

Proposition 6.1. In a distributed system with $n$ workers, if we partition the data into $k$ disjoint sets and each worker computes the gradient on exactly $w$ sets, the number of stragglers, $s$, must satisfy

$$s \leq \left\lfloor \frac{wn}{k} \right\rfloor - 1. \tag{6.4}$$
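To get a feel for the bound (6.4), the short sketch below evaluates it for a few parameter choices. The function name and the example values are illustrative assumptions.

```python
def max_stragglers(w, n, k):
    """Upper bound (6.4) on the number of stragglers: floor(w * n / k) - 1."""
    return (w * n) // k - 1


print(max_stragglers(w=1, n=5, k=5))  # prints 0: no redundancy, no straggler tolerated
print(max_stragglers(w=3, n=5, k=5))  # prints 2
print(max_stragglers(w=2, n=3, k=3))  # prints 1
```

With $w = 1$ and $n = k$ every partition is held by exactly one worker, so losing any worker loses its partial gradient, which is why the bound evaluates to zero.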

After computing the corresponding partial gradients, $\{g_{i_1}, g_{i_2}, \ldots, g_{i_w}\}$, each worker $W_i$ can send all these vectors to $M$. However, there is a benefit in $W_i$ sending a linear combination of $\{g_{i_1}, \ldots, g_{i_w}\}$. Indeed, the amount of data being sent (the communication load) is reduced by a factor of $w$, and we can show that with an appropriate choice of the encoding coefficients, there is no loss in the number of stragglers that can be tolerated.

This choice amounts to a careful distribution of the data partitions among the $n$ workers, and prescribing a linear combination to each $W_i$. Mathematically, this can be expressed as designing a matrix $B \in \mathbb{C}^{n \times k}$ such that:

• Each row of $B$ contains exactly $w$ nonzero entries.

• The linear space generated by any $f$ rows of $B$ contains the all-one vector, $\mathbf{1}^t$.

The nonzero locations in $b_i^t$, the $i$th row of $B$, determine the indices of the data partitions assigned to worker $i$. The first property ensures that each worker computes partial gradients on exactly $w$ chunks of data. Furthermore, the values of these nonzero entries prescribe the linear combination.
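The two conditions on $B$ can be tested mechanically for small instances. The sketch below is a brute-force check under stated assumptions: it uses a small rational $3 \times 3$ matrix chosen purely for illustration (with $n = k = 3$, $w = 2$, $f = 2$), exact arithmetic via `fractions`, and a rank comparison to decide whether $\mathbf{1}^t$ lies in the row space of each $B_F$. It enumerates all $f$-subsets, so it is meant for intuition only and is not the construction developed in this chapter.

```python
from fractions import Fraction
from itertools import combinations


def rank(rows):
    """Rank of a matrix (given as a list of rows) via exact Gaussian elimination."""
    m = [[Fraction(v) for v in row] for row in rows]
    r = 0
    for col in range(len(m[0])):
        pivot = next((i for i in range(r, len(m)) if m[i][col] != 0), None)
        if pivot is None:
            continue
        m[r], m[pivot] = m[pivot], m[r]
        for i in range(len(m)):
            if i != r and m[i][col] != 0:
                factor = m[i][col] / m[r][col]
                m[i] = [a - factor * b for a, b in zip(m[i], m[r])]
        r += 1
        if r == len(m):
            break
    return r


def is_valid_encoding(B, w, f):
    """Check both properties: row sparsity w, and 1^t in the span of any f rows."""
    ones = [1] * len(B[0])
    if any(sum(1 for v in row if v != 0) != w for row in B):
        return False
    for F in combinations(range(len(B)), f):
        sub = [B[i] for i in F]
        # 1^t is in the row space of B_F iff appending it does not raise the rank.
        if rank(sub + [ones]) != rank(sub):
            return False
    return True


# Illustrative matrix (an assumption, not taken from the chapter).
B = [[Fraction(1, 2), 1, 0],
     [0, 1, -1],
     [Fraction(1, 2), 0, 1]]
print(is_valid_encoding(B, w=2, f=2))
```

For comparison, the identity matrix with $w = 1$ fails the second property for $f = 2$, since two standard basis vectors cannot combine to the all-one vector of length 3.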

The coded partial gradient sent by $W_i$ is given by

$$c_i = \sum_{j=1}^{k} B_{i,j} g_j = b_i^t g, \tag{6.5}$$

where $c_i \in \mathbb{C}^{1 \times p}$ is a row vector denoting the linear combination of partial gradients sent from $W_i$ to $M$. Since $b_i$ has sparsity $w$, each worker essentially computes a weighted sum of the gradients of $w$ chunks of the data. Define the encoded computation matrix $C \in \mathbb{C}^{n \times p}$ as

$$C = [c_1^t, c_2^t, \ldots, c_n^t]^t = Bg, \tag{6.6}$$

where the $i$th row of $C$ is equal to $c_i$. The goal is to design the matrix $B$ in such a way that the taskmaster is able to recover the gradient $\nabla L(D; \beta) = \mathbf{1}^t g$ from any $f$ rows of $C$. In particular, let $F = \{i_1, \ldots, i_f\}$ be the index set of surviving machines, and let $B_F$ be the sub-matrix of $B$ whose rows are indexed by $F$. If $\mathbf{1}^t$ is in the linear space generated by the rows of $B_F$, there is a corresponding vector $a_F$ such that:

$$a_F^t C_F = a_F^t B_F g = \mathbf{1}^t g = \nabla L(D; \beta). \tag{6.7}$$

This has to hold for any set of indices $F \subset [n]$ of size $f$. Using the theory of maximum-distance separable (MDS) codes, the authors of [Tan+16] construct an encoding matrix $B$, along with the set of decoding vectors $\{a_F : F \subset [n], |F| = f\}$. The encoding matrix they designed must satisfy $w \geq \frac{k}{n}(s+1)$ according to Proposition 6.1.
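The encoding step (6.6) and the decoding step (6.7) can be traced end to end on a toy instance. The sketch below assumes an illustrative $3 \times 3$ matrix $B$ with $w = 2$ and $f = 2$, hypothetical partial gradients with $p = 2$, and decoding vectors worked out by hand for this particular $B$. Indices in the code are 0-based.

```python
from fractions import Fraction
from itertools import combinations

# Illustrative encoding matrix (an assumption, not the chapter's construction).
B = [[Fraction(1, 2), 1, 0],
     [0, 1, -1],
     [Fraction(1, 2), 0, 1]]

# Hypothetical partial gradients g_1, g_2, g_3, stacked as the rows of g.
g = [[1, 4], [2, 5], [3, 6]]

# Decoding vectors a_F for this B, chosen so that a_F^t B_F = 1^t.
decoders = {(0, 1): [2, -1], (0, 2): [1, 1], (1, 2): [1, 2]}


def matmul(A, X):
    """Plain matrix product of two lists of rows."""
    return [[sum(a * x for a, x in zip(row, col)) for col in zip(*X)] for row in A]


C = matmul(B, g)                               # coded partial gradients, C = B g
true_gradient = [sum(col) for col in zip(*g)]  # 1^t g

results = {}
for F in combinations(range(len(B)), 2):       # every possible set of survivors
    C_F = [C[i] for i in F]
    results[F] = matmul([decoders[F]], C_F)[0]  # a_F^t C_F

print(true_gradient)
print(results)
```

Each worker sends a single row of $C$ rather than its $w$ partial gradients, and any two of the three rows suffice to reproduce $\mathbf{1}^t g$ in this example.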

Preliminaries

In a distributed scheme that does not employ redundancy, the taskmaster has to wait for all the workers to finish in order to compute the full gradient. However, in the scheme outlined above, the taskmaster needs to wait only for the fastest $f$ machines to recover the full gradient. Clearly, this requires more computation by each machine.

Note that in the uncoded setting, the amount of computation that each worker does is $\frac{1}{n}$ of the total work, whereas in the coded setting each machine performs a $\frac{w}{k}$ fraction of the total work. From (6.4), we know that if a scheme can tolerate $s$ stragglers, the fraction of computation that each worker does is

$$\frac{w}{k} \geq \frac{s+1}{n}. \tag{6.8}$$

Therefore, the computation load of each worker increases by a factor of at least $(s+1)$ compared with the uncoded setting. At this point, it is not immediately clear what the optimal value for $\frac{w}{k}$ is. This optimization will be analyzed in detail in Section 6.5.
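The trade-off in (6.8) can be made concrete with a few numbers. The sketch below, whose function names and parameter values are illustrative assumptions, computes the smallest integer $w$ allowed by (6.4) for a target $s$ and the resulting ratio between the coded load $\frac{w}{k}$ and the uncoded load $\frac{1}{n}$.

```python
from fractions import Fraction
from math import ceil


def min_partitions_per_worker(n, k, s):
    """Smallest integer w with w >= k * (s + 1) / n, as required by (6.4)."""
    return ceil(Fraction(k * (s + 1), n))


def load_factor(n, k, s):
    """Coded per-worker load w / k divided by the uncoded load 1 / n, at minimal w."""
    w = min_partitions_per_worker(n, k, s)
    return Fraction(w, k) / Fraction(1, n)


print(min_partitions_per_worker(n=10, k=10, s=2))  # prints 3
print(load_factor(n=10, k=10, s=2))                # prints 3
print(load_factor(n=5, k=3, s=1))                  # prints 10/3
```

When $n$ does not divide $k(s+1)$, rounding $w$ up to an integer pushes the factor strictly above $s + 1$, as in the last call.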

It is worth noting that it is often assumed [Tan+16; Lee+16; DCG16] that the decoding vectors are precomputed for all possible combinations of returning machines, and the decoding cost is not taken into account in the total computation time. In a practical system, however, it is not reasonable to compute and store all the decoding vectors, especially as there are $\binom{n}{f}$ such vectors, and this number grows quickly with $n$. A contribution of this chapter is the design of an online algorithm for computing the decoding vectors on the fly, for the indices of the $f$ workers that respond first.
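A quick count shows how fast the number of precomputed decoding vectors grows. The sketch below uses `math.comb` from the Python standard library (Python 3.8 or later); the parameter values are arbitrary examples.

```python
from math import comb


def num_decoding_vectors(n, s):
    """Number of index sets F of size f = n - s, i.e. binomial(n, f)."""
    f = n - s
    return comb(n, f)


for n, s in [(10, 2), (20, 5), (50, 10)]:
    print(n, s, num_decoding_vectors(n, s))
# prints:
# 10 2 45
# 20 5 15504
# 50 10 10272278170
```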

The approach is based on the idea of inverting Vandermonde matrices, which can be done very efficiently. In the sequel, we show how to construct an encoding matrix $B$ for any $w$, $k$ and $n$, such that the system is resilient to $\left\lfloor \frac{wn}{k} \right\rfloor - 1$ stragglers, along with an efficient algorithm for computing the decoding vectors $\{a_F : F \subset [n], |F| = f\}$.
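The chapter's own decoding algorithm is developed later. As background only, the sketch below illustrates why Vandermonde systems are cheap to solve: it uses Newton divided differences, a standard textbook technique swapped in here for illustration, to solve $Vc = y$ with $V_{i,j} = x_i^{\,j}$ in $O(n^2)$ operations instead of the $O(n^3)$ of generic elimination. The evaluation points, the right-hand side and the function name are hypothetical, and exact rational arithmetic is assumed.

```python
from fractions import Fraction


def solve_vandermonde(xs, ys):
    """Solve V c = y, where V[i][j] = xs[i] ** j and the xs are distinct."""
    n = len(xs)
    xs = [Fraction(x) for x in xs]
    newton = [Fraction(y) for y in ys]
    # In-place divided differences: newton[i] becomes f[x_0, ..., x_i].
    for j in range(1, n):
        for i in range(n - 1, j - 1, -1):
            newton[i] = (newton[i] - newton[i - 1]) / (xs[i] - xs[i - j])
    # Expand the Newton form into monomial coefficients, lowest degree first.
    coeffs = [newton[n - 1]]
    for i in range(n - 2, -1, -1):
        shifted = [Fraction(0)] * (len(coeffs) + 1)
        for d, c in enumerate(coeffs):
            shifted[d + 1] += c       # contribution of c * x
            shifted[d] -= c * xs[i]   # contribution of -c * x_i
        shifted[0] += newton[i]
        coeffs = shifted
    return coeffs


xs = [0, 1, 2, 3]        # distinct evaluation points (hypothetical)
true_c = [1, -2, 0, 3]   # coefficients of p(x) = 1 - 2x + 3x^3
ys = [sum(c * x ** j for j, c in enumerate(true_c)) for x in xs]
solution = solve_vandermonde(xs, ys)
print(solution == true_c)
```

Solving the system is equivalent to polynomial interpolation through the points $(x_i, y_i)$, which is why the structure of $V$ can be exploited.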

Source document: Error-Correcting Codes for Networks, Storage and Computation (pages 117-121)