As machine learning workloads grow, heterogeneous computing systems are increasingly used to improve the efficiency of machine learning. This paper presents CEML, a runtime system for improving the efficiency of machine learning on heterogeneous computing systems. CEML dynamically adapts the heterogeneous computing system to the system state estimated to be the most efficient, improving efficiency while satisfying user-specified constraints.
Heterogeneous computing systems are helpful in improving the performance of machine learning on mobile and embedded devices [12]. A heterogeneous computing system can simultaneously execute the operations of a machine learning application on multiple heterogeneous computing devices with different characteristics. However, coordinating these devices so that machine learning applications run efficiently remains challenging.
To address this problem, this paper proposes CEML [13], a coordinated runtime system for improving the efficiency of machine learning applications on heterogeneous computing systems. CEML's performance and power estimators predict the performance and power consumption of the target application when it runs on the target heterogeneous computing system at given device frequencies. The CEML runtime manager then finds efficient device frequencies for the target machine learning application on the target heterogeneous computing system according to the optimization metric (e.g., performance under a power budget, or energy optimization).
Background and Motivation
Heterogeneous Computing
TensorFlow
TensorFlow supports concurrent execution of independent operations across the available computing devices to improve their utilization in the target system. The target machine learning applications in this work are assumed to be implemented as TensorFlow applications. While CEML is evaluated using TensorFlow applications, it can also be applied to other machine learning frameworks such as Caffe or PyTorch [14].
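As a minimal illustration of this concurrency, the sketch below (using the standard TensorFlow 2 API; the tensor shapes and device strings are illustrative) places two independent operations on different devices, which TensorFlow's executor may run concurrently:

    # Minimal sketch: two independent operations placed on different devices.
    # TensorFlow's executor may run them concurrently because neither depends
    # on the other's output. Shapes and device strings are illustrative.
    import tensorflow as tf

    with tf.device('/CPU:0'):
        a = tf.random.uniform((1024, 1024))
        cpu_result = tf.matmul(a, a)

    with tf.device('/GPU:0'):
        b = tf.random.uniform((1024, 1024))
        gpu_result = tf.matmul(b, b)

    # This addition depends on both results, so it runs after both complete.
    total = cpu_result + gpu_result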
Motivating Example
We also note that the performance sensitivity of a benchmark varies across heterogeneous computing systems. As shown in Figures 2.1 and 2.2, the performance of MN is insensitive to the processor frequency on the Jetson TX2, whereas on the Jetson Xavier it is relatively sensitive to the processor frequency.
For example, the GPU consumes the most power among the CPU, GPU, and memory when running CF on the Jetson TX2, whereas the CPU consumes the most power among the three devices when running WD on the Jetson TX2. We also observe that the power characteristics of a benchmark differ across heterogeneous computing systems. For example, in the case of CF, the GPU consumes significantly more power on the Jetson Xavier than on the Jetson TX2.
In contrast, in the case of LR, the CPU accounts for a larger share of the total power consumption on the Jetson Xavier than on the Jetson TX2. These observations show that software support for efficient machine learning on heterogeneous computing systems must control the devices in a coordinated manner. Machine learning applications differ from one another in their performance and power characteristics, and the same machine learning application exhibits different characteristics on different heterogeneous computing systems.
Moreover, since heterogeneous computing systems have a large system state space and each benchmark exhibits different performance and energy characteristics, static approaches that require extensive offline profiling are impractical for real-world heterogeneous computing systems.
Methodology
Design and Implementation
Performance Estimator
Therefore, CEML must profile at least seven samples of performance data, collected at different system states, to derive the coefficients in Equation 4.1.
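Equation 4.1 is not reproduced here, but the sketch below illustrates how such coefficients could be derived by least squares, assuming a model with one first-order term per device frequency, one interaction term per device pair, and a constant offset, which yields exactly seven coefficients for the three devices (CPU, GPU, and memory). The function names and the feature construction are illustrative assumptions, not CEML's actual implementation.

    import numpy as np

    def features(f_cpu, f_gpu, f_mem):
        # Feature vector for one system state: three first-order terms,
        # three pairwise interaction terms, and a constant (7 coefficients).
        return np.array([f_cpu, f_gpu, f_mem,
                         f_cpu * f_gpu, f_gpu * f_mem, f_cpu * f_mem,
                         1.0])

    def fit_performance_model(samples):
        # samples: list of ((f_cpu, f_gpu, f_mem), measured_performance);
        # at least seven samples are needed to determine the coefficients.
        X = np.array([features(*state) for state, _ in samples])
        y = np.array([perf for _, perf in samples])
        coeffs, *_ = np.linalg.lstsq(X, y, rcond=None)
        return coeffs

    def estimate_performance(coeffs, state):
        return float(features(*state) @ coeffs)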
Power Estimator
$P_{D,f_D} = \alpha_{D,f_D} \cdot U_D + \zeta_{D,f_D}$ (4.2)
where $U_D$ denotes the utilization of device $D$, and $\alpha_{D,f_D}$ and $\zeta_{D,f_D}$ are coefficients that depend on the device $D$ and its frequency $f_D$. To estimate the power consumption of the target system state using Equation 4.2, CEML must estimate the utilization of each device. To build the device utilization estimation model, we investigate the sensitivity of each benchmark's device utilization to the frequency of the device itself and to the frequencies of the other devices. Figures 4.3 and 4.4 show the sensitivity of the machine learning applications' device utilization to the device frequencies on the Jetson TX2 and the Jetson Xavier, respectively.
We also observe that a device's utilization is influenced by the frequencies of the other devices. This is mainly because a device receives more data per unit of time when the other devices run faster. Although omitted for brevity, a similar utilization sensitivity interaction exists between each device pair, analogous to the interaction terms in the performance estimation model.
The utilization estimation model used in CEML consists of a first-order term for each device and a quadratic term for each pair of devices. The first-order terms capture the utilization sensitivity to each device's frequency, and the quadratic terms capture the utilization sensitivity between each pair of devices.
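The following sketch shows how the utilization model and Equation 4.2 could compose to estimate the total power of a system state. It reuses the features function from the earlier sketch; the coefficient tables alpha and zeta, keyed by (device, frequency), and all function names are assumed layouts for illustration, not CEML's actual data structures.

    DEVICES = ('cpu', 'gpu', 'mem')

    def estimate_utilization(util_coeffs, state):
        # Utilization model: first-order term per device frequency plus a
        # quadratic term per device pair (same features as the perf. model).
        return {d: float(features(*state) @ util_coeffs[d]) for d in DEVICES}

    def estimate_power(state, util_coeffs, alpha, zeta):
        # Equation 4.2: P_{D,f_D} = alpha_{D,f_D} * U_D + zeta_{D,f_D}.
        # alpha/zeta are dicts keyed by (device, frequency) -- an assumption.
        freqs = dict(zip(DEVICES, state))
        util = estimate_utilization(util_coeffs, state)
        return sum(alpha[(d, freqs[d])] * util[d] + zeta[(d, freqs[d])]
                   for d in DEVICES)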
Runtime Manager
During the exploration phase, the runtime manager first builds the performance and power estimation models from the data samples collected during the profiling phase. It then explores the system state space to determine an efficient system state, using either the exhaustive search algorithm or the local search algorithm; the two algorithms differ in the system states they converge to and in their search overheads.
The exhaustive search algorithm tracks the best system state, i.e., the explored state estimated to be the most efficient for the target machine learning application while satisfying the user-specified constraints. For each candidate, it estimates the efficiency of the target system state and, if the target state is more efficient than the current best state based on the estimated efficiencies, records the target state as the new best system state (Line 9). When all system states in the system state space have been explored, the search terminates.
However, it is still useful for heterogeneous computing systems that have a small system state space. The local search algorithm explores the system state space in a manner similar to the hill-climbing algorithm. It computes the efficiency of the neighboring system states and chooses the best neighbor, i.e., the one expected to maximize the user-defined metric while satisfying the user-specified constraints (Line 5).
It then compares the current best system state with the best neighboring system state and keeps the more efficient of the two (Lines 7-9). If the best neighboring system state is more efficient than the current best state, it repeats the process. After the search completes, the runtime manager configures the heterogeneous computing system to the best system state, i.e., the one estimated to maximize the efficiency of the target machine learning application (Line 12 of Algorithm 1 and Line 12 of Algorithm 2).
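A minimal sketch of the two search strategies as described above follows. The efficiency and satisfies_constraints callbacks stand in for the estimator-backed user-defined metric and the user-specified constraints, freq_levels is a per-device list of available frequencies, and the neighbors function (e.g., one frequency step up or down per device) is an assumption, since the neighborhood definition is not specified here.

    from itertools import product

    def exhaustive_search(freq_levels, efficiency, satisfies_constraints):
        # Evaluate every system state; practical only for small state spaces.
        best, best_eff = None, float('-inf')
        for state in product(*freq_levels):
            if satisfies_constraints(state):
                eff = efficiency(state)
                if eff > best_eff:
                    best, best_eff = state, eff
        return best

    def local_search(start, neighbors, efficiency, satisfies_constraints):
        # Hill climbing: move to the best feasible neighbor until no
        # neighbor improves on the current best system state.
        best, best_eff = start, efficiency(start)
        while True:
            feasible = [s for s in neighbors(best)
                        if satisfies_constraints(s)]
            if not feasible:
                return best
            cand = max(feasible, key=efficiency)
            cand_eff = efficiency(cand)
            if cand_eff <= best_eff:
                return best  # local optimum reached
            best, best_eff = cand, cand_eff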
When the runtime manager detects a program phase change, it restarts the adaptation process from the profiling phase to determine a new efficient system state. When it detects a change in the optimization metric or the constraints, it re-enters the exploration phase to determine a new efficient system state based on the new metric or constraints.
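Putting the phases together, the following sketch outlines the adaptation loop implied by this description. The phase_changed and constraints_changed predicates are placeholders for CEML's actual monitoring logic, and the loop structure and parameter names are assumptions for illustration.

    import time

    def runtime_manager_loop(profile, build_models, explore, apply_state,
                             phase_changed, constraints_changed,
                             interval=1.0):
        # Profiling -> exploration -> monitoring; restarts from the
        # appropriate phase when the workload or the constraints change.
        while True:
            samples = profile()                    # profiling phase
            models = build_models(samples)         # fit estimation models
            best_state = explore(models)           # exploration phase
            apply_state(best_state)
            while True:                            # monitoring phase
                time.sleep(interval)
                if phase_changed():
                    break                          # re-profile from scratch
                if constraints_changed():
                    best_state = explore(models)   # re-explore, same models
                    apply_state(best_state)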
Evaluation
Nevertheless, the energy-efficiency improvements achieved by the CEML versions, shown in Figures 5.2 and 5.4, demonstrate the effectiveness of CEML without extensive offline profiling. Because CEML-L can converge to a locally optimal system state, it sometimes chooses a less efficient system state than CEML-E (e.g., MN on the Jetson TX2, and IN and MN on the Jetson Xavier). Conversely, due to estimation errors, CEML-L can occasionally find a more efficient system state than CEML-E (e.g., RB and VP on the Jetson Xavier).
We also present an example scenario to demonstrate the effectiveness of CEML's readjustment functionality. In this scenario, the power constraint of the target heterogeneous computing system (i.e., the Jetson TX2) changes while the benchmark (i.e., LR) is executing with CEML. Because the initial power budget is large enough to run the benchmark at the fastest system state (i.e., the maximum device frequencies for the CPU, GPU, and memory), CEML configures the target heterogeneous computing system to that state.
CEML effectively detects the change in the constraint; this causes the runtime manager to restart the exploration phase, determine a new efficient system state under the new power constraint, and adapt the target heterogeneous computing system to that state. Finally, we evaluate the runtime overheads of CEML in terms of CPU usage and performance. The performance overheads of the CEML-L and CEML-E versions are insignificant across all evaluated benchmarks.
Notably, the average exploration times of CEML-L and CEML-E are 7.2 and 176.4 microseconds on the Jetson TX2, and 35.7 and 827.2 microseconds on the Jetson Xavier, respectively.
Related Work
Conclusions
Bibliography
Dally, "EIE: Efficient Inference Engine on Compressed Deep Neural Network," in Proceedings of the 43rd International Symposium on Computer Architecture (ISCA), 2016.
Tang, "DjiNN and Tonic: DNN as a Service and Its Implications for Future Warehouse Scale Computers," in Proceedings of the 42nd Annual International Symposium on Computer Architecture (ISCA), 2015.
Baek, "CEML: A Coordinated Runtime System for Efficient Machine Learning on Heterogeneous Computing Systems," in Euro-Par 2018: Parallel Processing, 2018.
"Caffe: Convolutional Architecture for Fast Feature Embedding," in Proceedings of the 22nd ACM International Conference on Multimedia, 2014.
Yoon, "In-Datacenter Performance Analysis of a Tensor Processing Unit," in Proceedings of the 44th Annual International Symposium on Computer Architecture (ISCA), 2017.
Baek, "IACM: Integrated Adaptive Cache Management for High-Performance and Energy-Efficient GPGPU Computing," in 2016 IEEE 34th International Conference on Computer Design (ICCD), October 2016.
Kozyrakis, "Towards Energy Proportionality for Large-Scale Latency-Critical Workloads," in Proceedings of the 41st Annual International Symposium on Computer Architecture (ISCA), 2014.
Kozyrakis, "Heracles: Improving Resource Efficiency at Scale," in Proceedings of the 42nd Annual International Symposium on Computer Architecture (ISCA), 2015.
Vishin, "Hierarchical Power Management for Asymmetric Multi-Cores in the Dark Silicon Era," in Proceedings of the 50th Annual Design Automation Conference (DAC), 2013.
Baek, "HAP: A Heterogeneity-Conscious Runtime System for Adaptive Pipeline Parallelism," in Proceedings of the 22nd International Conference on Euro-Par 2016: Parallel Processing - Volume 9833, 2016.
Baek, "RMC: An Integrated Runtime System for Adaptive Many-Core Computing," in 2016 International Conference on Embedded Software (EMSOFT), October 2016.
Mitra, "Power-Performance Modeling of Mobile Gaming Workloads on Heterogeneous MPSoCs," in Proceedings of the 52nd Annual Design Automation Conference (DAC), 2015.
Mahlke, "Equalizer: Dynamic Tuning of GPU Resources for Efficient Execution," in Proceedings of the 47th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), 2014.
Li, "C-Brain: A Deep Learning Accelerator that Tames the Diversity of CNNs through Adaptive Data-Level Parallelization," in Proceedings of the 53rd Annual Design Automation Conference (DAC), 2016.