First, we analyze the effects of resource heterogeneity, different VM configurations, and interference between VMs on the performance of different distributed parallel applications. For a heterogeneous cluster, the total search space for deploying the VMs of distributed parallel applications is too large to explore exhaustively.
Effects of Different VC Configurations
When we compare the execution times in the three configurations C2, C4, and C5, which use a total of four (physical) nodes but differ in the number of nodes per row, the performance of 132.Zeusmp2 and CG, which require tight synchronization between VMs, does not improve at all unless all VMs run on the desired nodes. The effect of heterogeneity on the performance of parallel applications thus differs considerably depending on their synchronization and parallelism patterns.
Effects of Interference
However, for tightly coupled parallel applications, contention on network I/O and inter-VM communication jointly affect the performance results. In T, the effect of LAMMPS on performance is the same as that of GrepSpark, but in the other configurations, the performance degradation of GrepSpark is higher than that of LAMMPS.
Dominant Resource
To limit the number of candidate VC configurations, for each application we select k node types from the T available types, where k = 2 × log2(T). In a heterogeneous cluster, a set of candidate VC configurations with homogeneous and heterogeneous resources is then computed for each parallel application.
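As a rough illustration, the selection could look like the following Python sketch; the per-type scores, their derivation from dominant-resource profiling, and the function names are assumptions for illustration only.

```python
import math

def select_candidate_types(type_scores, T):
    """Select k = 2 * log2(T) candidate node types for one application.

    type_scores is a hypothetical dict mapping node type -> score of the
    type with respect to the application's dominant resource; how such a
    score is obtained is outside this sketch.
    """
    k = int(2 * math.log2(T))
    # Rank the T types by their dominant-resource score and keep the top k.
    ranked = sorted(type_scores, key=type_scores.get, reverse=True)
    return ranked[:min(k, T)]

# Example with T = 4 hypothetical node types: k = 2 * log2(4) = 4.
print(select_candidate_types({"T1": 0.9, "T2": 0.7, "T3": 0.4, "T4": 0.2}, T=4))
```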
Performance Model
For the models of MPI-based applications, their solo-run runtime (with the target input data) in all candidate VC configurations is profiled offline. We also profile each application's runtime when it co-runs with the other parallel applications in the set.
Placement based on Simulated Annealing
Finally, we take the average of the running times estimated by the models as the final score. If the values of the input attributes for the application (i.e., its performance metrics) differ greatly from those of the training set, the model is unlikely to estimate the execution time accurately.
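A minimal simulated-annealing sketch of this placement search is shown below; the neighbor move, the cooling schedule, and the score function (the averaged model estimates described above) are placeholder assumptions supplied by the caller, not the exact procedure.

```python
import math
import random

def anneal(initial, neighbor, score, T0=1.0, alpha=0.95, steps=1000):
    """Generic simulated annealing over VM placements.

    score(p) returns the estimated cost of placement p (e.g., the average
    of the model-estimated running times; lower is better). neighbor(p)
    returns a slightly perturbed placement, e.g., one VM moved to a
    different node.
    """
    current = best = initial
    cur_cost = best_cost = score(initial)
    T = T0
    for _ in range(steps):
        cand = neighbor(current)
        cand_cost = score(cand)
        # Always accept improvements; accept worse placements with a
        # probability that shrinks as the temperature cools.
        if cand_cost < cur_cost or random.random() < math.exp((cur_cost - cand_cost) / T):
            current, cur_cost = cand, cand_cost
            if cur_cost < best_cost:
                best, best_cost = current, cur_cost
        T *= alpha
    return best, best_cost
```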
Performance Estimation Accuracy
When ANN, SVM and Gaussian Process Regression are used, the average error rate is very high, especially for MPI-based applications. For these algorithms, the average error rates for the MPI-based and big data analytics applications exceed 100% and 60%, respectively.
Experimental Results on a Cluster with Two Types
The ratio of the number of storage-intensive applications (denoted as "S"), that of network-intensive applications (denoted as "N"), and that of applications that use both network and storage resources (denoted as "B") in each workload is also given in the table. For each workload, we explore all candidate deployments to find the best deployment of the four applications (Best), i.e., the one with the maximum geometric mean speedup.
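The scoring of candidate deployments can be sketched as follows; the speedup_of callback and the candidate list are illustrative stand-ins for the measured per-application speedups.

```python
from math import prod

def geomean(values):
    """Geometric mean of per-application speedups."""
    return prod(values) ** (1.0 / len(values))

def best_deployment(candidates, speedup_of):
    """Exhaustively score candidate deployments and return 'Best'.

    speedup_of(d) is assumed to return the list of per-application
    speedups under deployment d; the deployment maximizing their
    geometric mean is selected, as in the text.
    """
    return max(candidates, key=lambda d: geomean(speedup_of(d)))
```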
Large Scale Simulations
In online mode, we assume that 40 applications are submitted at the same time but in a given order. In this mode, the performance improvements with ML-G are 15.09% and 15.05% on average over Greedy and HA, respectively.
Experimental Results on a Cluster with Four Types
Therefore, WordCount is placed on a VC configuration with all four node types to improve performance. However, a VC configuration with T3 and T4 types is also used for WordCount, when one of the four node types is exhausted by other applications in the workload.
Discussion
As T increases, the number of candidate VC configurations for an application, and thus the number of profile runs of the application, increases. At the second level, both algorithms basically map tasks from the same application to the same node on each platform.
Target platform and applications
It distributes the resources of the different platforms among the applications based on the first-level platform affinity; then, for each platform, it assigns the tasks of the applications to which resources of that platform were allocated to computing nodes based on the second-level co-runner affinity. At the first level, the Fair-AllCore algorithm allocates the resources of each platform equally to every application, while the Worst-AllCore algorithm allocates the resources to the application with the worst platform affinity.
Experimental Analysis
In this section, we first discuss two metrics that represent the platform and co-runner affinities of the applications. Table 3.3 presents the average runtimes and standard deviations of the tasks of each application run with OneCore across the different platforms.
Platform and co-runner affinity metrics
The lower the value of the co-runner affinity of K to a combination C, the more suitable C is for executing K in P. Finally, the co-runner affinity of K for P is calculated as the average of the co-runner affinity values of K over all combinations.
Two-level scheduling algorithm
In second-level scheduling, for each platform, we first search for the application with the maximum co-runner affinity (i.e., whose performance may be most negatively affected by co-runner tasks) on that platform. Simple VM placement constraints on CPU compatibility [43], special hardware requirements, and license requirements [44] can be specified based on machine attributes.
Types of VM placement constraints
For multiple VMtoNode-Must constraints, we compute the intersection of the constraints, SVtoHMust. For multiple VMtoVM-Must constraints, we compute the union of the constraints, SVtoVMust.
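In set terms, this combination step can be sketched as below; the variable names mirror SVtoHMust and SVtoVMust from the text, while the input representation (lists of Python sets) is an assumption.

```python
def combine_must_constraints(vm_to_node_musts, vm_to_vm_musts):
    """Combine multiple Must constraints attached to one VM.

    vm_to_node_musts: list of node-group sets; the VM must land on a node
    in every group, so the feasible set SVtoHMust is their intersection.
    vm_to_vm_musts: list of VM sets that must be co-located with this VM;
    SVtoVMust is their union, since all of them end up on the same node.
    """
    s_vtoh_must = set.intersection(*vm_to_node_musts) if vm_to_node_musts else None
    s_vtov_must = set.union(*vm_to_vm_musts) if vm_to_vm_musts else set()
    return s_vtoh_must, s_vtov_must
```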
VM equivalence with respect to VM placement constraints
If multiple constraints of the same type are given to a VM, we can combine them into a single node group or VM set as follows. We also need to compute the set of nodes, Nres, that meet the total resource requirements of u.
Overview
Once we have a unit u for v, we compute a set of candidate nodes Ncand for u as discussed in the previous section. We then filter out of the set of all nodes in the cluster those that currently do not have enough resources to meet its resource requirements.
VM Placement for Energy Saving
Then, for each node h in the computed set (in ascending index order), it searches for VM units whose resource requirements are equal to or greater than the available resources of h. It also searches, for each node in the computed set, for VM units that cannot be placed there due to the VMtoVM-MustNot constraint.
VM Placement for Load Balancing
It first searches for a set of nodes in G that satisfy the VMtoNode-Must and VMtoVM-MustNot constraints of u (i.e., NVtoHMust ∩ NVtoVMustNot). The complexity of the in-place migration algorithm is O(V · Nnode), and that of the periodic migration algorithm is O(V^2 · Nnode) when it periodically migrates VMs (under the assumption that each VM is migrated to another node at most once).
Methodology and Metrics
VMs in the same VM group may require a subset of the VMs to be placed on the same node, forming a VM unit. If the VMtoNode-Must constraint ratio, CRVtoNMust, is x%, then x% of the groups have the VMtoNode-Must constraint.
Effects of VM placement constraints
For ES-basic, with a larger Ncore, the normalized Nactive increases because the size of a VM unit grows and the effect of the VMtoVM-MustNot constraint becomes stronger; Tmig also increases. For LB-basic, a large VM unit has a significant impact on the load, but Tmig increases only slightly as Ncore increases.
Performance comparison of the algorithms
For both the ES and LB algorithms, the difference in Tmig is not strongly affected by Nnode. Rfail of the LB algorithms is very small, as shown in Figure 4, and for Periodic-mig it becomes lower than that of LB-basic when Nnode = 512.
Discussion
The effect of interference may vary depending on the characteristics and resource requirements of the applications. The performance of the other Cifar10 models was also greatly affected when they were co-located with VGG16.
GPU usage pattern
Kernel Split
GPU usage pattern simulator
The new duration of the running memcpy functions is calculated based on their PCIe bandwidth and the maximum bandwidth. The duration of the running memcpy functions is also recalculated when one of them completes.
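A simplified sketch of this recalculation is given below, assuming the PCIe bandwidth is split equally among the in-flight memcpy calls and that the simulator tracks the remaining bytes of each transfer; both are modeling assumptions, not the exact simulator logic.

```python
def rescale_memcpy_durations(running, max_bw, now):
    """Re-time in-flight memcpy calls when the set of running calls changes.

    running: list of dicts with a 'bytes_left' field for each transfer
    (assumed bookkeeping). Each call gets an equal share of the PCIe
    bandwidth, so whenever a memcpy starts or completes, the finish time
    of every remaining transfer is recomputed from its remaining bytes.
    """
    if not running:
        return
    share = max_bw / len(running)  # equal split of the PCIe bandwidth
    for m in running:
        m["finish"] = now + m["bytes_left"] / share
```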
Off-line profiling
The memcpy calls can be executed simultaneously, but they must share the limited PCIe bandwidth. Thus, if a memcpy is submitted while one or more memcpy functions are currently running, their durations change after the new memcpy is scheduled.
GPU pattern manager
As explained in Section 3.2, we can predict the GPU usage pattern for any split factor. Finally, the manager replays the calls of the groups according to the given split factor.
Interference predictor
Based on the GPU usage pattern without kernel splitting, the manager tracks function calls from start to finish and aggregates the flagged function calls. Then the manager calculates the execution times of the kernel and memcpy calls in each group based on the given split factor and the regression result of each call.
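The per-call regression might look like the following sketch, with numpy's polyfit standing in for the regression actually used and a linear dependence of call duration on the split factor assumed for illustration.

```python
import numpy as np

def fit_call_model(split_factors, durations):
    """Fit a per-call regression: duration as a function of split factor.

    A stand-in for the per-call regression in the text; a linear model is
    assumed here, fit on offline profiles of the same kernel/memcpy call.
    """
    slope, intercept = np.polyfit(split_factors, durations, deg=1)
    return lambda f: slope * f + intercept

# Example: predict a call's duration at split factor 4 from two profiles.
model = fit_call_model([1, 2], [10.0, 11.5])
print(model(4))  # extrapolated duration under split factor 4
```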
Off-line scheduler
If, for example, the QoS threshold is 10%, the slowdown of the latency-sensitive CNN models should be less than 10%. Then, the interference predictor predicts the slowdown of the CNN models under different split factors for the batch CNN model.
Implementation
Therefore, the algorithm selects the pair and removes all elements in the same row and the same column from the matrix. If the number of remaining models is the same for several pairs, the algorithm selects one of those pairs and likewise removes all elements in its row and column from the matrix.
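This selection loop can be sketched as follows; the dictionary representation of the pairwise matrix and the minimum-slowdown selection criterion are illustrative assumptions rather than the exact implementation.

```python
def greedy_pairing(slowdown):
    """Greedily pair models from a pairwise slowdown matrix.

    slowdown: dict mapping (model_a, model_b) -> predicted mutual
    slowdown (assumed input). Repeatedly pick the pair with the smallest
    slowdown, then drop every entry sharing a row or column with it,
    mirroring the selection step described above.
    """
    pairs = []
    remaining = dict(slowdown)
    while remaining:
        a, b = min(remaining, key=remaining.get)
        pairs.append((a, b))
        # Remove all matrix entries in the selected pair's row and column.
        remaining = {k: v for k, v in remaining.items()
                     if a not in k and b not in k}
    return pairs
```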
Experimental Setup
Effect of Kernel Split
Because doubling the split factor halves the kernel length, the kernel length decreased exponentially with the split factor. The slowdown was almost directly proportional to the number of additional kernels (i.e., split factor − 1), and the slowdown from kernel splitting was 13–20% per kernel.
Accuracy of prediction model
Therefore, we predicted the performance of the CNN models when they were co-located, taking into account the effects of both device-side and host-side interference. We measured the slowdown of the CNN models under host-side interference alone by using different GPUs while sharing the same nodes and the same sockets.
Off-line Scheduling
It can apply kernel splitting to the batch CNN model to reduce the slowdown of the latency-sensitive CNN model. If the slowdown of the CNN model is lower than the QoS threshold, the paired CNN models are co-located.
Architectures for Communication
Overheads
Stoica, "Spark: Cluster computing with working sets," in Proceedings of the 2nd USENIX Conference on Hot Topics in Cloud Computing, 2010. Chakradhar, "Interference-driven resource management for gpu-based heterogeneous clusters," in Proceedings of the 21st International Symposium on High Performance parallel and distributed computing, ser.