The off-line scheduler searches for the best placement of the CNN models for a given goal. The other modules predict how the performance of the CNN models changes when they are co-located. The placement algorithm of our scheduler uses these predictions to decide which CNN models should be co-located for efficiency.
Performance goals and metrics
We focused on two widely used goals for resource management, namely guaranteeing the Quality of Service (QoS) and maximizing the throughput.
Guaranteeing Quality of Service The placement scheduler needs to guarantee the performance of latency-sensitive applications. We express the QoS threshold as a bound on the slowdown of the CNN model. For example, if the QoS threshold is 10%, the slowdown of each latency-sensitive CNN model should be less than 10%.
Maximizing throughput The placement scheduler needs to maximize the throughput of the applications, which implies using the resources efficiently. In general, the throughput of an application can be computed as the reciprocal of its execution time. However, CNN models have different execution times, so we define the normalized throughput as the original execution time divided by the actual execution time. For example, if the slowdown of a model is 100%, the normalized throughput is 0.5.
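The two metrics can be illustrated with a minimal sketch. It assumes that the slowdown is the relative increase in execution time (actual / original - 1), which is consistent with the 100% slowdown example above; the function names are hypothetical and not part of the scheduler.

```python
def slowdown(original_time, actual_time):
    """Relative increase in execution time (assumed definition), e.g. 0.10 for 10%."""
    return actual_time / original_time - 1.0


def normalized_throughput(original_time, actual_time):
    """Original execution time divided by the actual (co-located) execution time."""
    return original_time / actual_time


def meets_qos(original_time, actual_time, qos_threshold):
    """True if the slowdown stays strictly below the QoS threshold."""
    return slowdown(original_time, actual_time) < qos_threshold


# A model whose execution time doubles has a 100% slowdown
# and a normalized throughput of 0.5.
print(slowdown(1.0, 2.0), normalized_throughput(1.0, 2.0))
```

Because the throughput is normalized per model, the values of models with very different execution times can be summed or averaged on the same scale.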
Min-max pairing (MMP) algorithm
We propose a heuristic placement algorithm called Min-Max Pairing (MMP). The algorithm pairs the CNN models and co-locates the models of each pair on the same resources. To minimize the slowdown of the CNN models, we have to pair models that exhibit a low slowdown when they are co-located. The interference predictor provides the performance effect of every possible pair of CNN models, so we can pair the models on the basis of this information. The MMP algorithm has two parts, namely generating a matrix and pairing the models.
Generating Matrix Before pairing the CNN models, MMP generates a matrix that holds the information of every possible pair of models. Let the number of latency-sensitive CNN models be N, and the number of batch CNN models be M. We assume that M is equal to or greater than N. Then, the size of the matrix is N × M. For each pair, the GPU pattern manager generates one input for the latency-sensitive CNN model, to which the kernel-split is not applied, and many inputs for the batch CNN model with various split factors. Then, the interference predictor predicts the slowdown of both CNN models for each split factor of the batch CNN model. From the result, the algorithm selects the case in which the slowdown of the latency-sensitive CNN model is less than the QoS threshold, and places the slowdown of the batch CNN model in this case into the matrix. If there is no case in which the slowdown of the latency-sensitive CNN model is less than the QoS threshold, the algorithm leaves the element of the matrix for this pair empty. If there are two or more cases in which the QoS threshold can be satisfied, the algorithm selects the one with the minimum slowdown of the batch CNN model.
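The matrix generation can be sketched as follows. This is only an illustration under stated assumptions: `predict` is a hypothetical stand-in for the GPU pattern manager and the interference predictor combined, returning the predicted slowdowns of both models for one split factor, and an empty matrix element is represented by `None`.

```python
def build_slowdown_matrix(ls_models, batch_models, split_factors,
                          predict, qos_threshold):
    """Return an N x M matrix of batch-model slowdowns (None = empty element).

    predict(ls, batch, split) is a hypothetical callable that returns
    (latency_sensitive_slowdown, batch_slowdown) for one split factor.
    """
    matrix = []
    for ls in ls_models:
        row = []
        for batch in batch_models:
            best = None  # stays None if no split factor satisfies the QoS
            for split in split_factors:
                ls_slow, batch_slow = predict(ls, batch, split)
                # Keep only the cases that satisfy the QoS threshold, and
                # among them the minimum slowdown of the batch model.
                if ls_slow < qos_threshold and (best is None or batch_slow < best):
                    best = batch_slow
            row.append(best)
        matrix.append(row)
    return matrix
```

Each element therefore records the cheapest way, in terms of batch-model slowdown, to co-locate the pair without violating the QoS of the latency-sensitive model.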
Min-Max Pairing After the matrix is generated, the MMP algorithm chooses the pairs to co-locate. First, it removes the rows and columns that are completely empty. If a row of the matrix is empty, there is no way to guarantee the QoS of the latency-sensitive CNN model while co-locating it with any batch CNN model; therefore, this model is scheduled alone. If a column of the matrix is empty, co-locating the batch CNN model makes it impossible to guarantee the QoS of any latency-sensitive CNN model; therefore, this batch model is not used. Next, the algorithm looks for a row that has only one element left. If such a row exists, there is only one way left to schedule the latency-sensitive CNN model. Therefore, the algorithm selects the pair and removes all the elements in the same row and the same column from the matrix. Thereafter, the algorithm looks for a column that has only one element left. If such a column exists, there is only one way left to schedule the batch CNN model. However, if a sufficient number of batch CNN models remain, this model does not have to be used. Therefore, the algorithm compares the numbers of remaining batch and latency-sensitive CNN models. If the numbers are equal, the algorithm selects the pair and removes all the elements in the same row and the same column from the matrix. Finally, the algorithm finds the maximum slowdown in the matrix, removes that element, and repeats the procedure from the row search. Algorithm 2 shows this second part of the algorithm.
Algorithm 2 Min-Max Pairing Algorithm
input: M = m11, m12, ..., m21, ..., mNM: slowdown matrix
       A = a1, ..., aN: latency-sensitive CNN models
       B = b1, ..., bM: batch CNN models
x, y = N, M
Remove empty columns of M and decrease y by the number of empty columns
for each row Ri in M do
    if Ri is empty then
        Schedule ai alone
        Remove Ri and decrease x by 1
    end if
end for
while x > 0 do
    for each row Ri in M do
        if Ri has only one element then
            j = the index of the element in Ri
            Schedule ai with bj, decrease x and y by 1
            Remove all elements in row i and column j in M
            break
        end if
    end for
    for each column Cj in M do
        if Cj has only one element and x == y then
            i = the index of the element in Cj
            Schedule ai with bj, decrease x and y by 1
            Remove all elements in row i and column j in M
            break
        end if
    end for
    Remove the maximum element mij in M
    if column Cj is empty then
        Decrease y by 1
    end if
end while
Table 5.2: Configuration of the GPU cluster
Resource    Specification
CPU         Intel(R) Xeon(R) octa-core CPU E5-2640 v4 @ 2.60GHz × 2
GPU         Tesla K80 (Kepler) × 2
Memory      32GB for CPU, 12GB for each GPU
# nodes     4
Network     InfiniBand
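The pairing procedure of Algorithm 2 can be sketched in Python as follows. This is a hedged illustration rather than the implementation used in our scheduler. It makes three assumptions that the pseudocode leaves open: the search restarts from the rows after every selection or removal, a row that becomes empty during pairing is scheduled alone, and y is taken to be the number of columns that still hold at least one element. The function name and the `None` encoding of empty elements are hypothetical.

```python
def min_max_pairing(matrix):
    """Pair latency-sensitive models (rows) with batch models (columns).

    matrix[i][j] is the batch-model slowdown of pair (a_i, b_j),
    or None if the pair cannot satisfy the QoS threshold.
    Returns (pairs, alone): a list of (i, j) pairs and a list of row indices.
    """
    # Remaining candidate pairs; empty elements are simply absent.
    cells = {(i, j): v
             for i, row in enumerate(matrix)
             for j, v in enumerate(row) if v is not None}
    rows = set(range(len(matrix)))  # latency-sensitive models not yet scheduled
    pairs, alone = [], []

    def in_row(i):
        return [c for c in cells if c[0] == i]

    def in_col(j):
        return [c for c in cells if c[1] == j]

    def select(i, j):
        pairs.append((i, j))
        rows.discard(i)
        for c in [c for c in cells if c[0] == i or c[1] == j]:
            del cells[c]  # remove row i and column j

    while rows:
        # Rows without any candidate are scheduled alone (assumption: this
        # also applies to rows that become empty during pairing).
        empty = [i for i in sorted(rows) if not in_row(i)]
        if empty:
            for i in empty:
                alone.append(i)
                rows.discard(i)
            continue
        # A row with a single element has only one pairing left.
        forced = next((i for i in sorted(rows) if len(in_row(i)) == 1), None)
        if forced is not None:
            select(*in_row(forced)[0])
            continue
        # A column with a single element is forced only when x == y.
        live_cols = {j for (_, j) in cells}
        if len(rows) == len(live_cols):
            single = next((j for j in sorted(live_cols)
                           if len(in_col(j)) == 1), None)
            if single is not None:
                select(*in_col(single)[0])
                continue
        # Otherwise discard the pair with the maximum slowdown.
        del cells[max(cells, key=cells.get)]
    return pairs, alone


slowdowns = [[0.1, 0.5, None],
             [0.2, 0.3, 0.9]]
print(min_max_pairing(slowdowns))  # ([(0, 0), (1, 1)], [])
```

In this small example the elements 0.9 and 0.5 are removed first, which forces a1 onto b1 and then a2 onto b2, so the largest slowdown among the selected pairs is 0.3 instead of the 0.5 of the alternative pairing.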