First, we characterize the impact of LLC and memory bandwidth partitioning on the performance and fairness of consolidated workloads. RMC rapidly controls the concurrency level of target applications using the thread packing technique to improve their performance and efficiency.
Contributions
We present an in-depth characterization of the impact of LLC and memory bandwidth partitioning on the performance and fairness of the consolidated applications. CoPart dynamically analyzes the characteristics of the consolidated applications and divides the LLC and memory bandwidth in a coordinated manner to maximize the overall fairness of the applications.
Organization
Our experimental results show that PALM is effective in providing significantly higher performance and lower energy consumption than thread packing (e.g., 47.0% lower execution time and 39.3% lower energy consumption than thread packing at high concurrency) across various synchronization-intensive benchmarks and system configurations, provides performance and power consumption comparable to thread reduction, and significantly improves the efficiency of dynamic server consolidation and power capping performance. Next, the background of power capping is explained, including the emerging hardware power capping features.
LLC and Memory Bandwidth Partitioning
One of the main disadvantages of thread packing in the context of memory bandwidth allocation is that it cannot be applied to single-threaded applications. One of the main advantages of MBA is that it can effectively allocate memory bandwidth without impeding any per-core activity (e.g., arithmetic instructions) except for outgoing requests from the L2 cache to the LLC.
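For concreteness, the following is a minimal sketch (not code from HyPart) of how a process's memory bandwidth can be throttled with Intel MBA through the Linux resctrl filesystem; the group name "throttled" and the assumption that resctrl is already mounted at /sys/fs/resctrl are ours.

/* Sketch: cap one PID's memory bandwidth via Intel MBA / Linux resctrl.
 * Assumes resctrl is mounted at /sys/fs/resctrl and the CPU supports MBA. */
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>

static void write_file(const char *path, const char *text)
{
    FILE *f = fopen(path, "w");
    if (!f) { perror(path); exit(1); }
    fputs(text, f);
    fclose(f);
}

int main(int argc, char **argv)
{
    if (argc != 3) {
        fprintf(stderr, "usage: %s <pid> <mba-percent>\n", argv[0]);
        return 1;
    }
    /* Create a resource group for the target application. */
    mkdir("/sys/fs/resctrl/throttled", 0755);

    /* Cap memory bandwidth on socket 0 to the given percentage.
     * MBA only throttles requests leaving L2 toward the LLC, so
     * per-core compute is unaffected. */
    char line[64];
    snprintf(line, sizeof line, "MB:0=%s\n", argv[2]);
    write_file("/sys/fs/resctrl/throttled/schemata", line);

    /* Move the target PID into the group. */
    write_file("/sys/fs/resctrl/throttled/tasks", argv[1]);
    return 0;
}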
Power Capping
The main advantage of software-based power capping is that it is highly flexible, since various policies and algorithms can be employed in software. The main disadvantage of software-based power capping is its long settling time and weaker power capping guarantees.
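To make the flexibility/latency trade-off concrete, here is a minimal sketch of a software power-capping loop, assuming a Linux system that exposes the RAPL package energy counter through the powercap sysfs; the 70 W budget, the 1 s sampling period, and the reaction policy (left as a message) are illustrative assumptions, not any specific system from the text.

/* Sketch of a software power-capping feedback loop using the RAPL
 * package energy counter. Counter wraparound is ignored for brevity. */
#include <stdio.h>
#include <unistd.h>

static long long read_ll(const char *path) {
    long long v = 0;
    FILE *f = fopen(path, "r");
    if (f) { fscanf(f, "%lld", &v); fclose(f); }
    return v;
}

int main(void) {
    const long long budget_uw = 70LL * 1000000;   /* assumed 70 W cap */
    const char *energy = "/sys/class/powercap/intel-rapl:0/energy_uj";
    long long prev = read_ll(energy);
    for (;;) {
        sleep(1);                     /* settling time: software caps react
                                         far more slowly than hardware */
        long long cur = read_ll(energy);
        long long power_uw = cur - prev;   /* uJ over 1 s == uW */
        prev = cur;
        if (power_uw > budget_uw) {
            /* Flexible policy point: any knob (DVFS, concurrency, ...)
               could be actuated here; hardware capping lacks this choice. */
            puts("over budget: lower DVFS ceiling / reduce concurrency");
        }
    }
}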
Dynamic Concurrency Control
Dynamic concurrency control is an adaptive technique that aims to determine the optimal number of threads for multi-threaded applications [33, 34]. Based on the profiling results, it determines the number of threads that is expected to maximize performance and runs the target application with that thread count.
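The following toy OpenMP example illustrates the general profiling-then-selection idea; the memory-bound kernel and the power-of-two sweep are our assumptions, not the actual mechanism of [33, 34].

/* Sketch: time a parallel kernel at each candidate thread count,
 * then run with the count that profiled fastest. */
#include <omp.h>
#include <stdio.h>

#define N (1 << 24)            /* ~128 MiB of doubles */
static double a[N];

static double run_kernel(int nthr) {
    omp_set_num_threads(nthr);
    double t0 = omp_get_wtime();
    #pragma omp parallel for
    for (long i = 0; i < N; i++)
        a[i] = a[i] * 1.000001 + 0.5;
    return omp_get_wtime() - t0;
}

int main(void) {
    int best = 1;
    double best_t = run_kernel(1);
    for (int n = 2; n <= omp_get_max_threads(); n *= 2) {
        double t = run_kernel(n);          /* brief profiling run */
        printf("%2d threads: %.4f s\n", n, t);
        if (t < best_t) { best_t = t; best = n; }
    }
    printf("running with predicted-optimal %d threads\n", best);
    run_kernel(best);                      /* full run at chosen level */
    return 0;
}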
Experimental Methodology
Benchmarks: Table 2 summarizes the nine memory bandwidth-intensive benchmarks selected from the PARSEC, SPLASH, and NPB benchmark suites [38–41]. In addition, we use the STREAM benchmark [51] to empirically determine the maximum memory bandwidth of the target system.
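For reference, such a measurement is in the spirit of STREAM's triad kernel; a rough self-contained sketch (array size and the single repetition are arbitrary placeholders, not STREAM's actual configuration) looks like this:

/* Sketch: estimate peak memory bandwidth with a triad-style kernel. */
#include <omp.h>
#include <stdio.h>

#define N (1L << 25)                       /* ~256 MiB per array */
static double a[N], b[N], c[N];

int main(void) {
    #pragma omp parallel for               /* first-touch initialization */
    for (long i = 0; i < N; i++) { b[i] = 1.0; c[i] = 2.0; }

    double t0 = omp_get_wtime();
    #pragma omp parallel for
    for (long i = 0; i < N; i++)
        a[i] = b[i] + 3.0 * c[i];          /* triad: 2 reads + 1 write */
    double dt = omp_get_wtime() - t0;

    double bytes = 3.0 * N * sizeof(double);
    printf("approx. bandwidth: %.2f GB/s\n", bytes / dt / 1e9);
    return 0;
}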
Characterization of Memory BW Partitioning Techniques
We now investigate the impact of the three memory bandwidth partitioning techniques on benchmarks with high memory bandwidth demands. First, we examine the performance impact of each memory bandwidth partitioning technique.
Design and Implementation
If the efficiency of the current system state is lower than that of the previous system state, HyPart rolls back to the previous system state. System state generation subphase: if the rollback count is below the predetermined threshold or the efficiency of the current system state is higher than that of the previous one, HyPart moves to the system state generation subphase.
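A runnable skeleton of this roll-back rule follows; the threshold of 3 and the stub apply_state(), measure_efficiency(), and next_state() helpers are assumptions standing in for HyPart's actual actuation and monitoring machinery.

/* Skeleton of the explore/roll-back loop described above. */
#include <stdlib.h>

#define ROLLBACK_THRESHOLD 3            /* assumed value for illustration */

typedef struct { int knobs[3]; } state_t;

/* Stubs standing in for the real actuation and monitoring. */
static void apply_state(const state_t *s) { (void)s; }
static double measure_efficiency(void) { return (double)rand() / RAND_MAX; }
static state_t next_state(const state_t *s) {
    state_t n = *s;
    n.knobs[rand() % 3] += (rand() % 3) - 1;   /* random neighbor state */
    return n;
}

int main(void) {
    state_t cur = {{0}};
    apply_state(&cur);
    double cur_eff = measure_efficiency();
    int rollbacks = 0;

    while (rollbacks < ROLLBACK_THRESHOLD) {
        state_t cand = next_state(&cur);        /* system state generation */
        apply_state(&cand);
        double cand_eff = measure_efficiency();

        if (cand_eff < cur_eff) {               /* worse than before: */
            apply_state(&cur);                  /* roll back */
            rollbacks++;
        } else {                                /* better: keep exploring */
            cur = cand;
            cur_eff = cand_eff;
            rollbacks = 0;
        }
    }
    return 0;
}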
Evaluation
We run each workload mix by controlling the memory bandwidth of each benchmark in the workload mix using thread packing (TP), clock modulation (CM), and HyPart. Next, we investigate the sensitivity of HyPart performance to the memory bandwidth budget allocated to each application in the workload mix.
Conclusions
In contrast, HyPart effectively combines the three memory bandwidth partitioning techniques and finds efficient system states for each benchmark across different memory bandwidth budgets. Next, a characterization study is presented on the impact of LLC and memory bandwidth partitioning on the performance and fairness of consolidated workloads.
Terminology and Experimental Methodology
Memory bandwidth-sensitive benchmarks are benchmarks whose performance is sensitive to the allocated memory bandwidth. LLC- and memory bandwidth-sensitive benchmarks are benchmarks that exhibit high performance sensitivity to both the allocated LLC capacity and memory bandwidth.
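A toy classifier mirroring this terminology is shown below; the 10% threshold and the min-to-max-allocation speedup metrics are our assumptions, not the thesis's actual criteria.

/* Sketch: classify a benchmark by its sensitivity to LLC capacity
 * and memory bandwidth. */
#include <stdio.h>

typedef enum { NS, LLC_SENS, BW_SENS, LM } class_t;

/* speedup = performance with maximum vs. minimum allocation */
static class_t classify(double llc_speedup, double bw_speedup) {
    const double T = 1.10;          /* >=10% gain counts as sensitive */
    int llc = llc_speedup >= T, bw = bw_speedup >= T;
    if (llc && bw) return LM;       /* LLC- and memory BW-sensitive */
    if (llc)       return LLC_SENS;
    if (bw)        return BW_SENS;
    return NS;                      /* non-sensitive */
}

int main(void) {
    printf("%d\n", classify(1.35, 1.22));   /* -> LM (prints 3) */
    return 0;
}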
Characterization
The performance of the LLC- and memory bandwidth-sensitive (LM) benchmarks exhibits high sensitivity to both the allocated LLC capacity and the allocated memory bandwidth. The fairness of LM applications is highly dependent on the LLC and memory bandwidth allocation.
Design and Implementation
Specifically, the resource manager briefly executes each application with all the LLC ways and the full memory bandwidth allocated to it. This determines the performance sensitivity of the application to both the allocated LLC capacity and memory bandwidth.
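In resctrl terms, this profiling step amounts to temporarily granting the application every LLC way and the full bandwidth; a hedged sketch follows, where the 20-way capacity bitmask is platform-specific and assumed, and the group name "app0" is ours.

/* Sketch: grant a resctrl group the full LLC and bandwidth allocation
 * so the application's best-case performance can be sampled. */
#include <stdio.h>

static void set_schemata(const char *group, const char *line) {
    char path[128];
    snprintf(path, sizeof path, "/sys/fs/resctrl/%s/schemata", group);
    FILE *f = fopen(path, "w");
    if (f) { fputs(line, f); fclose(f); }
}

int main(void) {
    /* Grant group "app0" all 20 LLC ways and 100% memory bandwidth on
       socket 0; the application would then run briefly while its
       performance is recorded. */
    set_schemata("app0", "L3:0=fffff\n");
    set_schemata("app0", "MB:0=100\n");
    return 0;
}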
Evaluation
In contrast, CoPart achieves significantly higher fairness by dynamically analyzing the characteristics of consolidated applications and distributing LLC and memory bandwidth to applications in awareness of their characteristics. CoPart distributes LLC and memory bandwidth to consolidated cluster workloads in an application-specific and fairness-aware manner, achieving significantly higher fairness than the EQ version.
Conclusions
Although the two state-of-the-art power capping techniques (i.e., PUPiL and CPC) are insightful, they have a fundamental limitation in that each controls only one of the performance-critical knobs (i.e., the concurrency level and the cross-component power allocation) in an isolated manner. Consequently, the two state-of-the-art power capping techniques often fail to achieve optimal performance across various benchmarks, total power budgets, and system scales.
Experimental Methodology
Based on the performance model built through extensive offline profiling for jacobi, CPC determines the optimal power allocation across components and distributes power among CPUs and memory accordingly. For each benchmark, we evaluate the performance of the following four versions – (1) the statically best version running with the best-performing system state, which is obtained by extensive offline profiling of each benchmark, (2) the PUPiL version [27], (3) the CPC version [28], and (4) the RPPC version.
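For orientation, a simplified sketch of actuating such a cross-component power allocation through RAPL's powercap sysfs is shown below; the DRAM domain path (intel-rapl:0:0), the 150 W total budget, and the fixed 70/30 split are assumptions, whereas CPC derives the split from its offline performance model.

/* Sketch: split a total power budget between the CPU package and DRAM
 * RAPL domains via the Linux powercap interface. */
#include <stdio.h>

static void write_uw(const char *path, long long uw) {
    FILE *f = fopen(path, "w");
    if (f) { fprintf(f, "%lld", uw); fclose(f); }
}

int main(void) {
    long long total_uw = 150LL * 1000000;     /* assumed 150 W budget */
    long long cpu_uw = total_uw * 70 / 100;   /* placeholder split */
    long long mem_uw = total_uw - cpu_uw;
    write_uw("/sys/class/powercap/intel-rapl:0/constraint_0_power_limit_uw",
             cpu_uw);                         /* package (CPU) domain */
    write_uw("/sys/class/powercap/intel-rapl:0:0/constraint_0_power_limit_uw",
             mem_uw);                         /* DRAM subdomain */
    return 0;
}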
Performance Characterization
We now investigate the performance of multi-threaded applications under power capping on a 64-core server system. Finally, Figure 33 shows the performance of various benchmarks on the 64-core server system with a low power budget.
Design and Implementation
Based on the category of the target multithreaded application, RPPC determines the initial system state that is expected to provide high performance. RPPC then explores the system state space starting from the initial system state determined during the application classification phase.
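A compact sketch of this two-phase flow follows; all helpers are hypothetical stand-ins for RPPC's components, and the stub throughput function (which peaks at 24 threads so the search terminates) replaces real measurements under the power cap.

/* Sketch: seed the search from the application's category, then
 * hill-climb the concurrency knob toward a local optimum. */
#include <stdlib.h>

typedef struct { int nthreads; long long cpu_uw, mem_uw; } sys_state_t;

/* Stubs standing in for RPPC's real classification and measurement. */
static int classify_app(void) { return 0; }          /* profiling phase */
static sys_state_t initial_state_for(int cat) {
    (void)cat;
    sys_state_t s = { 16, 60000000LL, 20000000LL };  /* assumed seed */
    return s;
}
static double throughput(sys_state_t s) {            /* fake measurement */
    return 1.0 / (1 + abs(s.nthreads - 24));
}

int main(void) {
    /* Phase 1: pick the initial state from the category. */
    sys_state_t s = initial_state_for(classify_app());
    /* Phase 2: explore neighboring states within the power cap. */
    for (;;) {
        sys_state_t up = s, down = s;
        up.nthreads += 4;
        down.nthreads -= 4;
        sys_state_t best = throughput(up) > throughput(down) ? up : down;
        if (throughput(best) <= throughput(s))
            break;                                   /* local optimum */
        s = best;
    }
    return 0;
}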
Evaluation
First, RPPC continues to significantly outperform both state-of-the-art techniques, achieving performance comparable to the statically best version on a 16-core server system with a low power budget. We continue with the performance results on a 64-core server with a high power budget (i.e., 380 W).
Conclusions
RPPC robustly detects the change in the total power budget, immediately meets the new power constraint (i.e., 70 W), and effectively adapts to an efficient system state for the new total power budget. Here again, RPPC robustly detects the change in the total power budget, immediately meets the new power constraint (i.e., 110 W), and effectively adapts to an efficient system state for the higher power budget.
Motivation
First, we present the motivation for an integrated execution system for efficient multicore computing. To enable high-performance and energy-efficient multicore computing, we need an integrated execution system that combines state-of-the-art adaptive techniques to achieve the best possible performance and energy efficiency.
Design and Implementation
Another set of RMC API functions is implemented locally, without the need to communicate with RMC. When the RDT receives a GET_NUM_THR request, it first checks the application's current per-core usage collected by the MTP.
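Purely for illustration, a hypothetical client side of such a GET_NUM_THR call might forward the request to the RMC daemon over a Unix-domain socket; the socket path and wire format below are invented and are not RMC's actual protocol.

/* Hypothetical sketch of an application-side RMC API call. */
#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/un.h>
#include <unistd.h>

static int rmc_get_num_thr(void) {
    int fd = socket(AF_UNIX, SOCK_STREAM, 0);
    if (fd < 0) return -1;

    struct sockaddr_un addr = { .sun_family = AF_UNIX };
    strncpy(addr.sun_path, "/tmp/rmc.sock", sizeof addr.sun_path - 1);
    if (connect(fd, (struct sockaddr *)&addr, sizeof addr) < 0) {
        close(fd);
        return -1;
    }

    dprintf(fd, "GET_NUM_THR %d\n", getpid());   /* request */
    char buf[32] = { 0 };
    read(fd, buf, sizeof buf - 1);               /* reply: thread count */
    close(fd);
    return atoi(buf);
}

int main(void) {
    printf("RMC suggests %d threads\n", rmc_get_num_thr());
    return 0;
}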
Applying RMC to Parallel Applications
Therefore, RMC can be effectively used to improve the performance and energy efficiency of facesim by dynamically adjusting its concurrency level. Listing 3 shows how RMC can be used to transform pipeline-based parallel applications such as dedup.
Evaluation
As a result, the performance and energy efficiency of the DT version are significantly worse than those of the RMC version. The RMC version of the microbenchmark significantly outperforms all other versions, including the best static version (i.e., ST-8).
Conclusions
This section presents PALM [36], a progress- and locality-aware adaptive task migration technique for efficient thread packing. PALM consists of a thread-to-core mapper, which implements progress- and locality-aware task migration, and a scheduling period controller, which dynamically adjusts the scheduling period according to the synchronization granularity of the target parallel application. An in-depth performance characterization of standard thread packing is then presented to identify the root causes of its performance pathologies.
Experimental Methodology
The design and implementation of PALM are then discussed, and a quantitative evaluation is presented to demonstrate its effectiveness. The synchronization interval length of each benchmark in Table 11 is collected by running it with 16 cores and 16 threads.
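One plausible way to collect such interval lengths is to timestamp consecutive barrier crossings on one thread, as in the sketch below; the wrapper-based instrumentation scheme, thread count, and iteration count are our assumptions.

/* Sketch: measure the average time between consecutive barriers. */
#include <pthread.h>
#include <stdio.h>
#include <time.h>

#define NTHREADS 4
#define NITERS   1000

static pthread_barrier_t bar;
static double last_ms, sum_ms;
static long   intervals;

static double now_ms(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec * 1e3 + ts.tv_nsec / 1e6;
}

/* Wrap the application's barriers; exactly one thread per crossing
   (the "serial" one) records the elapsed interval. */
static void timed_barrier(void) {
    if (pthread_barrier_wait(&bar) == PTHREAD_BARRIER_SERIAL_THREAD) {
        double t = now_ms();
        if (last_ms > 0) { sum_ms += t - last_ms; intervals++; }
        last_ms = t;
    }
}

static void *worker(void *arg) {
    (void)arg;
    for (int i = 0; i < NITERS; i++) {
        /* ... application work between synchronizations ... */
        timed_barrier();
    }
    return NULL;
}

int main(void) {
    pthread_t t[NTHREADS];
    pthread_barrier_init(&bar, NULL, NTHREADS);
    for (int i = 0; i < NTHREADS; i++) pthread_create(&t[i], NULL, worker, NULL);
    for (int i = 0; i < NTHREADS; i++) pthread_join(t[i], NULL);
    printf("avg synchronization interval: %.4f ms (%ld intervals)\n",
           sum_ms / intervals, intervals);
    return 0;
}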
Drawbacks of Standard Thread Packing
This is mainly because the scheduling activities of the Linux kernel are performed at a coarse granularity and therefore do not mitigate the performance degradation of TP. Applications that use fine-grained synchronization are more sensitive to the performance pathologies of TP.
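For concreteness, standard TP is essentially an affinity restriction, as in the minimal Linux-specific sketch below (packing onto the first n_alloc cores is a simplification): the kernel alone then decides which packed thread runs where and when, which is exactly the coarse-grained behavior blamed above.

/* Sketch: standard thread packing via CPU affinity. */
#define _GNU_SOURCE
#include <sched.h>

/* Confine the calling thread to cores 0 .. n_alloc-1 and let the Linux
   scheduler time-share those cores among all packed threads. */
static int pack_onto(int n_alloc) {
    cpu_set_t set;
    CPU_ZERO(&set);
    for (int c = 0; c < n_alloc; c++)
        CPU_SET(c, &set);
    return sched_setaffinity(0, sizeof set, &set);   /* 0 = this thread */
}

int main(void) {
    return pack_onto(2) ? 1 : 0;   /* e.g., pack onto two cores */
}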
Design and Implementation
The thread-to-core mapper dynamically assigns each thread of the target application to the allocated cores in a progress- and locality-aware manner. The asymptotic time complexity of the thread-to-core mapper is O(N_T·log N_T + N_T·N_C), where N_T is the number of threads and N_C is the number of allocated cores (N_T ≥ N_C).
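A toy rendition with the stated complexity shape follows: sorting threads by progress contributes the N_T·log N_T term, and scanning the N_C cores per thread contributes the N_T·N_C term. The scoring rule is a placeholder of ours; PALM's actual heuristic is more involved.

/* Sketch: progress- and locality-aware thread-to-core mapping. */
#include <stdio.h>
#include <stdlib.h>

typedef struct { int id; double progress; int last_core; } thr_t;

static int by_progress(const void *a, const void *b) {
    double d = ((const thr_t *)a)->progress - ((const thr_t *)b)->progress;
    return (d > 0) - (d < 0);          /* least-progressed threads first */
}

/* O(NT log NT + NT*NC): sort, then scan all NC cores per thread. */
static void map_threads(thr_t *t, int NT, int NC, int *load /* size NC */) {
    qsort(t, NT, sizeof *t, by_progress);
    for (int i = 0; i < NT; i++) {
        int best = 0;
        double best_score = -1e30;
        for (int c = 0; c < NC; c++) {
            double score = -load[c];                 /* balance load  */
            if (c == t[i].last_core) score += 0.5;   /* keep locality */
            if (score > best_score) { best_score = score; best = c; }
        }
        load[best]++;
        printf("thread %d -> core %d\n", t[i].id, best);
        /* a real mapper would set the thread's affinity here */
    }
}

int main(void) {
    thr_t t[4] = { {101, 0.9, 0}, {102, 0.4, 1}, {103, 0.7, 1}, {104, 0.2, 0} };
    int load[2] = { 0, 0 };
    map_threads(t, 4, 2, load);
    return 0;
}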
Evaluation
We investigate the performance of TR, TP, and PALM when a time-varying load is applied to the LC application. We then investigate the performance sensitivity of PALM to the key system parameters – the system scale and the performance heterogeneity of the cores.
Conclusions
Furthermore, our experimental results show that PALM incurs only small performance and power consumption overheads in all cases. We recommend choosing PALM over TR if the target application requires dynamic concurrency control to improve its performance, energy efficiency, and overall resource utilization, and/or runs on HMP systems where the performance heterogeneity of the cores can cause imbalance across threads.
LLC and Memory Bandwidth Partitioning
While insightful, these works do not consider software- and hardware-based memory bandwidth partitioning techniques for preventing contention among consolidated workloads. Our work (i.e., HyPart [24]) differs significantly from previous works in that it performs an in-depth characterization of the widely used software and hardware techniques for memory bandwidth partitioning and proposes a hybrid memory bandwidth partitioning technique that effectively combines them.
Power Capping
Previous works mainly focus on task and/or page placement and migration techniques to improve the performance and efficiency of NUMA and/or heterogeneous memory systems.
Dynamic Concurrency Control
Kozyrakis, "Heracles: Improving Resource Efficiency at Scale," in Proceedings of the 42nd Annual International Symposium on Computer Architecture, ser. Characterization and methodological considerations,” in Proceedings of the 22nd Annual International Symposium on Computer Architecture, ser.