Methodology

We experimentally evaluated the performance of our scheduling algorithm. Our min-max pairing (MMP) algorithm uses the GPU pattern manager and the interference predictor, which contains the GPU pattern simulator described in Section II, to find the best placement for the CNN models and schedule them. The scheduler focuses on guaranteeing the QoS threshold for the latency-sensitive CNN models, while the batch CNN models are scheduled according to the pairing decision. We compared our algorithm with the following heuristics (a sketch of these pairing strategies follows the list):

• Random pairing (Random): This heuristic pairs each latency-sensitive CNN model with a randomly chosen batch CNN model. It can apply the kernel split to the batch CNN model to reduce the slowdown of the latency-sensitive CNN model. If the slowdown of the latency-sensitive CNN model is below the QoS threshold, the two CNN models in the pair are co-located; otherwise, the latency-sensitive CNN model is scheduled alone.

• MMP without split (W/o Split): This algorithm is identical to our min-max pairing except that it cannot use the kernel split.

• Greedy placement (Greedy): This heuristic uses the slowdown matrix, generated in the same manner as MMP, and then greedily chooses the pair with the lowest slowdown.
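
The following sketch summarizes these pairing strategies over a slowdown matrix. It is illustrative only: the names slowdown, QOS, ls_models, and batch_models are our assumptions, slowdown[i][j] stands for the predicted slowdown of latency-sensitive model i co-located with batch model j (already assuming the best kernel split for that pair), and the exhaustive stand-in for MMP approximates the actual algorithm, which is defined in Section II.

```python
import itertools
import random

QOS = 0.10  # example QoS threshold: at most 10% slowdown allowed

def feasible(i, j, slowdown):
    # j is None means the latency-sensitive model runs alone.
    return j is None or slowdown[i][j] <= QOS

def random_pairing(ls_models, batch_models, slowdown):
    """Random heuristic: pair each latency-sensitive model with a
    random batch model; run it alone if QoS would be violated."""
    pool = list(batch_models)
    random.shuffle(pool)
    pairs = []
    for i in ls_models:
        j = pool.pop() if pool else None
        pairs.append((i, j) if feasible(i, j, slowdown) else (i, None))
    return pairs

def greedy_pairing(ls_models, batch_models, slowdown):
    """Greedy heuristic: repeatedly commit to the QoS-feasible pair
    with the lowest predicted slowdown in the matrix."""
    ls_left, batch_left = list(ls_models), list(batch_models)
    pairs = []
    while ls_left:
        candidates = [(slowdown[i][j], i, j)
                      for i in ls_left for j in batch_left
                      if slowdown[i][j] <= QOS]
        if not candidates:
            pairs.extend((i, None) for i in ls_left)
            break
        _, i, j = min(candidates)
        pairs.append((i, j))
        ls_left.remove(i)
        batch_left.remove(j)
    return pairs

def minmax_pairing(ls_models, batch_models, slowdown):
    """Exhaustive stand-in for MMP on small instances (assumes at least
    as many batch models as latency-sensitive ones): among all
    assignments, pick the one whose worst pair slowdown is smallest,
    scheduling infeasible pairs alone."""
    best, best_worst = None, float("inf")
    for perm in itertools.permutations(batch_models, len(ls_models)):
        assignment = [(i, j) if feasible(i, j, slowdown) else (i, None)
                      for i, j in zip(ls_models, perm)]
        worst = max((slowdown[i][j] for i, j in assignment if j is not None),
                    default=0.0)
        if worst < best_worst:
            best, best_worst = assignment, worst
    return best
```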

We used three QoS thresholds, namely 10%, 20%, and 30%, to evaluate the scheduling algorithms. The split factor for the kernel split is limited to 8 for our algorithm and the other heuristics, because a higher split factor makes the batch CNN models excessively slow. Our GPU cluster is too small to execute all the CNN models simultaneously, so we executed the models in each pair (or a single unpaired model) concurrently, and started the models in the next pair only after the earlier models had completed.
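The cap on the split factor could be enforced as in the sketch below. Here predict_slowdown is a hypothetical hook into the interference predictor, and the preference for the smallest feasible factor is our assumption, based on the observation above that higher split factors trade batch throughput for isolation.

```python
MAX_SPLIT = 8  # cap from the text: higher factors make batch models too slow

def best_split_factor(ls_model, batch_model, predict_slowdown, qos):
    """Return the smallest split factor (1 = no split) whose predicted
    slowdown of the latency-sensitive model meets the QoS threshold,
    or None if the pair is infeasible even at the maximum split."""
    for k in range(1, MAX_SPLIT + 1):
        if predict_slowdown(ls_model, batch_model, split_factor=k) <= qos:
            return k
    return None  # no feasible split: schedule the model alone
```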

Experimental Results

Figure 8: Normalized Throughput of Algorithms (Random, W/o Split, Greedy, MMP, and Oracle; QoS thresholds of 10%, 20%, and 30%)

Figure 8 shows the normalized throughput with the QoS threshold varying from 10% to 30%. We compared our MMP algorithm with the three heuristics and an oracle (the best placement).

We calculated the sum of the normalized throughputs of the models in each pair and computed the average of these values across pairs. When a latency-sensitive CNN model was scheduled alone without a pair, its normalized throughput was 1. Our MMP algorithm achieves a higher normalized throughput than the other heuristics and, on average, achieves 99.26% of the oracle.
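For concreteness, this metric can be computed as in the following sketch; colocated_tput and solo_tput are illustrative names for the measured throughputs, not identifiers from our implementation.

```python
def normalized_throughput(pairs, colocated_tput, solo_tput):
    """pairs: list of (ls_model, batch_model), with batch_model None when
    the latency-sensitive model runs alone. colocated_tput[m] and
    solo_tput[m] are the throughputs of model m when co-located and
    when running alone, respectively."""
    per_pair = []
    for ls, batch in pairs:
        if batch is None:
            per_pair.append(1.0)  # a model scheduled alone counts as 1
        else:
            per_pair.append(colocated_tput[ls] / solo_tput[ls]
                            + colocated_tput[batch] / solo_tput[batch])
    return sum(per_pair) / len(per_pair)  # average over all pairs
```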

The greedy heuristic achieved the same normalized throughput as MMP when the QoS threshold was 10%, because QoS violations left only a small number of feasible pairs to choose from. When the QoS threshold was relaxed, the greedy heuristic exhibited lower performance than our MMP algorithm. The normalized throughput of W/o Split was only 1 when the QoS threshold was 10%, because no latency-sensitive CNN model can be co-located with less than 10% slowdown without the kernel split. When the QoS threshold was relaxed, we obtained a higher normalized throughput, because the batch CNN models could then be co-located with the latency-sensitive CNN models without a kernel split (or, for the split-capable algorithms, with a lower split factor).

Figure 9: Normalized Throughput of Algorithms in Restricted Scenario (Random, W/o Split, Greedy, MMP, MMP+Net, and Oracle; QoS thresholds of 10%, 20%, and 30%)

In the first result, none of the latency-sensitive applications was paired with the VGG16 model except under the random heuristic. The VGG16 model has a large number of parameters and consumes high network bandwidth, so the first result shows no performance degradation from network interference. Unfortunately, we cannot co-locate two CNN models with large parameter sizes, because they consume a large amount of device memory, and our GPUs do not have enough device memory to hold both simultaneously. However, we can use the Trivial model, which consumes high network bandwidth without a large parameter size. To show the effect of network communication with distributed parallel models, we conducted an experiment with a restricted scenario in which only three batch models exist. In this scenario, the number of latency-sensitive models equals the number of batch models, meaning there are no extra batch models.

We excluded the GoogLeNet and ResNet50 batch models in this scenario.

Figure 9 shows the normalized throughput with varying QoS thresholds in the restricted scenario. In this scenario, the VGG16 model is paired with latency-sensitive models in some cases. Our MMP algorithm (including W/o Split) and the greedy heuristic use the prediction results of the simulator, which exhibits a higher error rate when a latency-sensitive model is co-located with the VGG16 model than in other cases, because our simulator does not consider network interference. In this scenario, our MMP algorithm achieves 94.48% of the oracle on average. As mentioned in Section 4.3, we also have a prediction model that considers the effect of the network. We use this model in our MMP algorithm to schedule the CNN models, and we call the result MMP+Net.

Our MMP+Net makes exactly the same scheduling decisions as the oracle for all QoS thresholds.
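A minimal sketch of the distinction: MMP+Net differs from MMP only in the slowdown estimate fed into the pairing step. The additive link-contention penalty below is purely an illustrative assumption, as are the names net_bw and link_capacity; the actual network-aware prediction model is the one described in Section 4.3.

```python
def predicted_slowdown_net(ls, batch, sim_slowdown, net_bw, link_capacity):
    """sim_slowdown[(ls, batch)]: slowdown predicted by the GPU pattern
    simulator, which ignores the network. net_bw[m]: network bandwidth
    demand of model m (high for VGG16's parameter exchanges)."""
    demand = net_bw[ls] + net_bw[batch]
    # Penalize only when the combined demand oversubscribes the link.
    net_penalty = max(0.0, demand / link_capacity - 1.0)
    return sim_slowdown[(ls, batch)] + net_penalty
```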

V Discussion