Off-line scheduler - Colocation-aware Resource Management for Distributed Parallel Applications

The off-line scheduler searches for the best placement of the CNN models under a given goal. The other modules predict how the performance of CNN models changes when they are co-located. The placement algorithm of our scheduler uses these predictions to decide which CNN models should be co-located for efficiency.

Performance goals and metrics

We focused on two widely used goals for resource management: guaranteeing the Quality of Service (QoS) and maximizing the throughput.

Guaranteeing Quality of Service. The placement scheduler needs to guarantee the performance of latency-sensitive applications. We express the QoS threshold as the slowdown of the CNN model. For example, if the QoS threshold is 10%, the slowdown of each latency-sensitive CNN model must be less than 10%.
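As a minimal sketch of this check, the Python snippet below assumes that slowdown is the relative increase in execution time over the stand-alone run, which is consistent with the example in which a 100% slowdown halves the throughput. The function names and the default threshold are illustrative and are not taken from the scheduler itself.

```python
def slowdown(original_time: float, actual_time: float) -> float:
    """Relative slowdown; 0.10 means the co-located run takes 10% longer.

    Assumed definition: (actual - original) / original.
    """
    return (actual_time - original_time) / original_time


def meets_qos(original_time: float, actual_time: float,
              threshold: float = 0.10) -> bool:
    """True when the slowdown stays strictly below the QoS threshold."""
    return slowdown(original_time, actual_time) < threshold
```

With a 10% threshold, a model that runs in 100 ms alone and 108 ms when co-located passes the check, while one that needs 115 ms does not.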

Maximizing throughput. The placement scheduler needs to maximize the throughput of the applications, which implies using the resources efficiently. In general, the throughput of an application can be computed as the reciprocal of its execution time. However, CNN models may have different execution times, so we define the normalized throughput as (original execution time / actual execution time). For example, if the slowdown of a model is 100%, the normalized throughput is 0.5.
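The normalized throughput can be written out as a short Python sketch. The conversion from slowdown assumes that a slowdown of s means the execution time grows by a factor of (1 + s); the function names are illustrative.

```python
def normalized_throughput(original_time: float, actual_time: float) -> float:
    """Original execution time divided by actual execution time."""
    return original_time / actual_time


def throughput_from_slowdown(slowdown: float) -> float:
    """Normalized throughput for a relative slowdown (1.0 means 100%).

    Assumes actual_time = original_time * (1 + slowdown).
    """
    return 1.0 / (1.0 + slowdown)
```

Both forms agree on the example above: a model whose execution time doubles has a slowdown of 100% and a normalized throughput of 0.5.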

Min-max pairing (MMP) algorithm

We propose a heuristic placement algorithm called Min-Max Pairing (MMP). The algorithm pairs the CNN models and co-locates the models of each pair on the same resources. To minimize the slowdown of the CNN models, we have to pair models that show a low slowdown when they are co-located. The interference predictor provides the performance effect of every possible pair of CNN models, so we can pair the models on the basis of this information. The MMP algorithm has two parts, namely generating a matrix and pairing the models.

Generating Matrix. Before pairing the CNN models, MMP generates a matrix that holds the information of every possible pair of models. Let the number of latency-sensitive CNN models be N, and the number of batch CNN models be M. We assume that M is equal to or greater than N. Then, the size of the matrix is N × M. For each pair, the GPU pattern manager generates one input for the latency-sensitive CNN model, to which the kernel-split is not applied, and many inputs for the batch CNN model with various split factors. Then, the interference predictor predicts the slowdown of the CNN models for each split factor of the batch CNN model. From the results, the algorithm selects the case in which the slowdown of the latency-sensitive CNN model is less than the QoS threshold, and places the slowdown of the batch CNN model in this case into the matrix. If there is no case in which the slowdown of the latency-sensitive CNN model is less than the QoS threshold, the algorithm leaves the element of the matrix for this pair empty. If two or more cases satisfy the QoS threshold, the algorithm selects the one with the minimum slowdown of the batch CNN model.
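The matrix generation step can be sketched in Python as follows. The `predict` callable stands in for the interference predictor and its signature is hypothetical: it is assumed to take a latency-sensitive model, a batch model, and a split factor, and to return the predicted slowdowns of both models. Empty matrix elements are represented as `None`.

```python
from typing import Callable, List, Optional, Sequence, Tuple

# Hypothetical predictor interface (not the real module's API):
# (ls_model, batch_model, split_factor) -> (ls_slowdown, batch_slowdown)
Predictor = Callable[[str, str, int], Tuple[float, float]]


def build_slowdown_matrix(
    ls_models: Sequence[str],
    batch_models: Sequence[str],
    split_factors: Sequence[int],
    predict: Predictor,
    qos_threshold: float,
) -> List[List[Optional[float]]]:
    """Build the N x M matrix of batch-model slowdowns.

    Element [i][j] is the smallest predicted batch slowdown among the split
    factors that keep ls_models[i] below the QoS threshold, or None when no
    split factor satisfies the threshold.
    """
    matrix: List[List[Optional[float]]] = []
    for ls in ls_models:
        row: List[Optional[float]] = []
        for batch in batch_models:
            feasible = []
            for k in split_factors:
                ls_slow, batch_slow = predict(ls, batch, k)
                if ls_slow < qos_threshold:
                    feasible.append(batch_slow)
            row.append(min(feasible) if feasible else None)
        matrix.append(row)
    return matrix
```

Keeping only the minimum feasible batch slowdown per pair means the pairing phase never needs to know which split factor produced it; a fuller version would presumably record the chosen split factor alongside the value.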

Min-Max Pairing. After the matrix generation, our MMP algorithm chooses appropriate pairs to co-locate. First, it removes the completely empty rows and columns. If a row of the matrix is empty, there is no way to guarantee the QoS of the latency-sensitive CNN model while co-locating it with any batch CNN model; therefore, this model is scheduled alone. If a column of the matrix is empty, co-locating the batch CNN model makes it impossible to guarantee the QoS of any latency-sensitive CNN model; therefore, this model is not used. Then, the algorithm looks for a row that has only one element left. If such a row exists, there is only one way left to schedule the latency-sensitive CNN model. Therefore, the algorithm selects the pair and removes all the elements in the same row and the same column from the matrix. Thereafter, the algorithm looks for a column that has only one element left. If such a column exists, there is only one way left to schedule the batch CNN model. However, if there is a sufficient number of remaining batch CNN models, we do not have to use this model. Therefore, the algorithm checks the number of remaining batch and latency-sensitive CNN models. If the two numbers are the same, the algorithm selects the pair and removes all the elements in the same row and the same column from the matrix. Next, the algorithm finds the maximum slowdown in the matrix, removes that element, and repeats the procedure from the row search. Algorithm 2 shows the second part of this algorithm.

Algorithm 2 Min-Max Pairing Algorithm

input: M = m_11, m_12, ..., m_21, ..., m_NM: slowdown matrix
       A = a_1, ..., a_N: latency-sensitive CNN models
       B = b_1, ..., b_M: batch CNN models
x, y = N, M
Remove empty columns of M and decrease y by the number of empty columns
foreach row R_i in M do
    if R_i is empty then
        Schedule a_i alone
        Remove R_i and decrease x by 1
    end if
end for
while x > 0 do
    foreach row R_i in M do
        if R_i has only one element then
            j = the index of the element in R_i
            Schedule a_i with b_j, decrease x and y by 1
            Remove all elements in row i and column j in M
            break
        end if
    end for
    foreach column C_j in M do
        if C_j has only one element and x == y then
            i = the index of the element in C_j
            Schedule a_i with b_j, decrease x and y by 1
            Remove all elements in row i and column j in M
            break
        end if
    end for
    Remove the maximum element m_ij in M
    if column C_j is empty then
        Decrease y by 1
    end if
end while

Table 5.2: Configuration of the GPU cluster

Resource    Specification
CPU         Intel(R) Xeon(R) octa-core CPU E5-2640 v4 @ 2.60GHz * 2
GPU         Tesla K80 (Kepler) * 2
Memory      32GB for CPU, 12GB for each GPU
# nodes     4
Network     InfiniBand
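The pairing phase of the MMP algorithm can be sketched in Python as shown below. This is one possible reading of the procedure, not the scheduler's actual implementation, and it makes two assumptions that the text does not state: after a forced pair is selected, the scan restarts from the row search instead of also removing the maximum element in the same pass, and a row that becomes empty during pairing is scheduled alone. The matrix uses `None` for empty elements, and the function name is illustrative.

```python
from typing import Dict, List, Optional, Tuple


def min_max_pairing(
    matrix: List[List[Optional[float]]],
) -> Tuple[Dict[int, int], List[int]]:
    """Return (pairs, alone): pairs maps row i -> column j, and alone lists
    latency-sensitive rows scheduled without a batch model."""
    # cells[i][j] = batch slowdown of the feasible pair (a_i, b_j).
    cells = {
        i: {j: v for j, v in enumerate(row) if v is not None}
        for i, row in enumerate(matrix)
    }
    pairs: Dict[int, int] = {}
    alone: List[int] = []

    def select(i: int, j: int) -> None:
        """Schedule a_i with b_j and clear row i and column j."""
        pairs[i] = j
        del cells[i]
        for row in cells.values():
            row.pop(j, None)

    while cells:
        # Rows with no feasible partner run alone.
        # Assumption: re-checked on every pass, not only before the loop.
        empty = [i for i, row in cells.items() if not row]
        if empty:
            for i in empty:
                alone.append(i)
                del cells[i]
            continue

        # A row with one element has only one way to be scheduled.
        forced_row = next(
            (i for i in sorted(cells) if len(cells[i]) == 1), None
        )
        if forced_row is not None:
            select(forced_row, next(iter(cells[forced_row])))
            continue

        # A column with one element is forced only when no batch model
        # is spare, i.e. remaining rows == remaining non-empty columns.
        columns = sorted({j for row in cells.values() for j in row})
        if len(cells) == len(columns):
            forced = None
            for j in columns:
                owners = [i for i in sorted(cells) if j in cells[i]]
                if len(owners) == 1:
                    forced = (owners[0], j)
                    break
            if forced is not None:
                select(*forced)
                continue

        # Otherwise discard the pair with the largest batch slowdown.
        _, i_max, j_max = max(
            (v, i, j) for i, row in cells.items() for j, v in row.items()
        )
        del cells[i_max][j_max]

    return pairs, alone
```

For the 2 × 2 matrix `[[0.1, 0.5], [0.2, 0.3]]`, the sketch first removes the maximum element 0.5, which leaves the first row with a single element, and then pairs a_0 with b_0 and a_1 with b_1. Storing the matrix as per-row dictionaries keeps the "only one element left" tests cheap and makes removing a column a single `pop` per row.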
