
Experimental Results on a Cluster with Two Types

Methodology

We experimentally evaluate the performance of our ML-based placement technique using a real cluster with two node types, as described in Section I. When placing parallel applications on the cluster, our simulated-annealing-based placement algorithm uses the performance model built in Section 4.1 to estimate the runtimes of applications under a hypothetical placement. We compare our technique (ML-G) with the following five heuristics:

• Random placement (Random): This heuristic randomly places parallel applications among the candidate VC configurations.

• Greedy placement (Greedy): This algorithm places a parallel application in one of the most consolidated homogeneous VC configurations, where the application has the smallest solo run runtime, if it is available. This is inspired by the greedy algorithm in Quasar [15], where, for an application, the scheduler first sorts server types according to their performance and then selects servers capable of higher performance in the sorted order, while attempting to pack the application within a few servers.

• Heterogeneity-aware interference-ignorant placement (HA): This algorithm considers the effect of different VC configurations with heterogeneous resources on the performance of a parallel application. It places a parallel application in one of its candidate VC configurations in a greedy manner, based on their solo run runtimes: it assigns the application the available VC configuration with the smallest runtime, without accounting for possible interference effects caused by co-runners.

• Interference-aware heterogeneity-ignorant placement (IA): If two parallel applications with the same type of dominant resource are placed together on the same set of nodes, their performance will degrade, as they compete for the same type of resources in those nodes. To reduce this interference effect, this algorithm pairs up the given applications such that two applications with different dominant resource types are grouped together whenever possible, and then places each pair randomly (on C1, C3, or C5 in our setup), oblivious to the heterogeneity of resources. Note that this approach can be used only in the batch mode.

• ML-based placement with individual models (ML-I): This algorithm places a parallel application in the same manner as ML-G, except that the individual modeling approach is also used for Hadoop and Spark applications. For target Hadoop and Spark applications, the profiling overhead of ML-I can exceed that of ML-G, which can use a small input size for profiling.
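The core of ML-G described above can be sketched as a simulated-annealing loop over placements, guided by a runtime-prediction model. The sketch below is illustrative only: the function `predict_runtime` stands in for the performance model of Section 4.1 and is an assumed interface, not the actual implementation.

```python
import math
import random

def anneal_placement(apps, configs, predict_runtime,
                     steps=1000, t0=1.0, cooling=0.995):
    """Simulated-annealing search over placements (illustrative sketch).

    apps: list of application identifiers
    configs: list of candidate VC configurations
    predict_runtime(app, config, placement): estimated runtime under the
        hypothetical placement (assumed model interface)
    """
    def cost(placement):
        # Minimize the total estimated runtime of the co-placed applications.
        return sum(predict_runtime(a, c, placement) for a, c in placement.items())

    current = {a: random.choice(configs) for a in apps}
    cur_cost = cost(current)
    best, best_cost = dict(current), cur_cost
    temp = t0
    for _ in range(steps):
        # Perturb the assignment of one randomly chosen application.
        cand = dict(current)
        cand[random.choice(apps)] = random.choice(configs)
        delta = cost(cand) - cur_cost
        # Accept improvements always; accept regressions with decaying probability.
        if delta < 0 or random.random() < math.exp(-delta / temp):
            current, cur_cost = cand, cost(cand)
            if cur_cost < best_cost:
                best, best_cost = dict(current), cur_cost
        temp *= cooling
    return best
```

With a separable cost like the one above the search trivially reaches the optimum; the value of the annealing step is that it also handles interference-dependent costs, where one application's predicted runtime changes with its co-runners.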

For the experiments in this subsection, we provide a set of four applications as input to the placement algorithms (i.e., batch mode). In Greedy and HA, we need to order the multiple applications in the set. For each application, we compute the maximum speedup as r_worst/r_best, where r_worst and r_best are the runtimes of its worst (i.e., longest) and best (i.e., shortest) solo runs, between the C2 and C6 configurations for Greedy and among the six candidate VC configurations for HA. Then, we sort the parallel applications in a workload in decreasing order of their computed maximum speedups. For each application in the sorted order, Greedy and HA make a placement decision. For Random and IA, the average performance of five random placements is shown in the results.
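The maximum-speedup ordering used by Greedy and HA can be sketched as follows; the dictionary layout of the solo-run runtimes is an assumption made for illustration.

```python
def order_by_max_speedup(solo_runtimes):
    """Sort applications in decreasing order of maximum speedup r_worst / r_best.

    solo_runtimes: {app: {config: runtime}}, where the inner dict holds the
    solo-run runtimes over the relevant candidate configurations
    (C2/C6 for Greedy, all six candidates for HA).
    """
    def max_speedup(runs):
        # Worst (longest) solo runtime divided by best (shortest) solo runtime.
        return max(runs.values()) / min(runs.values())

    return sorted(solo_runtimes,
                  key=lambda a: max_speedup(solo_runtimes[a]),
                  reverse=True)
```

Applications with the largest gap between their worst and best configurations are placed first, since a poor placement decision costs them the most.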

Table 2.6 presents the nine workloads used in our experiments. The ratio of the number of storage-intensive applications (denoted as "S"), network-intensive applications (denoted as "N"), and applications utilizing both network and storage resources (denoted as "B") in each workload is also given in the table. For the big data analytics applications, the input sizes (i.e., 1x) described in Table 2.2 are used in these workloads. The workloads are selected so that there are various ratios of storage-intensive, network-intensive, and storage-network-intensive applications, including two homogeneous workloads composed of only one type of application. In our experimental setup, the total number of candidate placements for four applications is 74. For each workload, we explore all of the candidate placements to find the best placement of the four applications (Best), which has the maximum geometric mean of the speedups.
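Finding Best amounts to an exhaustive search that scores each candidate placement by the geometric mean of the per-application speedups. A minimal sketch, assuming a hypothetical `speedup` function:

```python
import math

def best_placement(candidates, speedup):
    """Return the candidate placement with the maximum geometric mean of speedups.

    candidates: iterable of placements, each a dict mapping app -> VC config
    speedup(app, config, placement): per-application speedup under the given
        placement (assumed interface)
    """
    def geomean(placement):
        vals = [speedup(a, c, placement) for a, c in placement.items()]
        # Geometric mean computed in log space for numerical stability.
        return math.exp(sum(math.log(v) for v in vals) / len(vals))

    return max(candidates, key=geomean)
```

With only 74 candidate placements for four applications, this brute-force scan is cheap; the geometric mean rewards balanced speedups rather than letting one fast application mask a badly placed one.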

(Figure omitted: bar chart of the geometric mean of speedups for workloads WL1-WL9 and their average, comparing Random, Greedy, IA, HA, ML-G, ML-I, and Best.)

Figure 10: Speedups of workloads, normalized to the best placement

Table 2.7: Best configuration for each workload

      App1        App2        App3        App4
WL1   WCS: C1     WCS: C3     TG: C3      Zeus: C5
WL2   GS: C3      WCS: C2     CG: C3      NAMD: C5
WL3   WCS: C3     TG: C3      LAMMPS: C5  Zeus: C2
WL4   GS: C3      TG: C3      NAMD: C5    Zeus: C1
WL5   WCS: C1     GS: C5      TG: C1      CG: C5
WL6   GS: C3      GS: C3      NAMD: C5    LAMMPS: C1
WL7   WCS: C1     GH: C1      NAMD: C6    LAMMPS: C6
WL8   GS: C1      GS: C3      WCS: C3     GH: C5
WL9   CG: C5      LAMMPS: C5  NAMD: C2    Zeus: C2

Experimental Results

Figure 10 shows the speedup of each of the nine workloads (as the geometric mean), normalized to that of the best placement. Our ML-based placement technique ML-G achieves 96.13% of the performance of Best on average. The performance improvements of ML-G over Random, Greedy, HA, and IA are 24.80%, 12.95%, 10.61%, and 21.11% on average, respectively.

The performance of ML-G is fairly comparable to that of ML-I, which achieves 97.72% of Best on average. Homogeneous workloads are less sensitive to the placement configuration, as they consist of applications that are similar in terms of preferred resources and interference effects; thus, the performance difference between good and poor configurations becomes smaller. Our performance models have modest error rates, but they can compute the relative performance gap between different placements and distinguish good placements from poor ones reasonably well.

For the various algorithms, we analyze the complexity of the profiling overhead for a target application with the corresponding target input in a cluster with T node types. With Greedy, HA, IA, and ML-I, the profiling complexity is O(log T), while with ML-G, no additional profiling is required, because we reuse the previously profiled information for the application (possibly with a different input size). In our setup of a cluster with two types, for Greedy, HA, and IA, two, six, and three solo runs are profiled, respectively. For ML-I, 46 and 86 runs are profiled for storage- and network-intensive applications, respectively.

We subsequently analyze the best placements of the workloads in our heterogeneous cluster.

Table 2.7 indicates that there is no single best placement that works for all workloads, although network-intensive applications tend to favor T2 nodes, which provide higher network I/O bandwidth, while storage-intensive applications prefer T1 nodes, where they have only one co-runner VM and may use more of the storage I/O resource.

Five different sets of VC configurations are used for the best placements. Even when the same set of VC configurations is used to place four applications, depending on the combination of applications, the same application can be assigned a different VC configuration.

Recall that we include two heterogeneous VC configurations among the candidate VC configurations for a parallel application. All of the best placements except those for WL5, WL7, and WL9 use the heterogeneous VC configuration C3, and all of the applications placed on C3, except for CG in WL2, are storage-intensive big data applications. For big data analytics applications, performance can be improved by allocating the favored nodes only partially, and it can be further enhanced when their VMs run alongside other applications across different nodes, with less contention over the same types of resources. This result shows that it is beneficial to exploit heterogeneous VC configurations with hardware asymmetry.