
Previous research has proposed runtime techniques to dynamically control the concurrency of parallel applications [31–34, 46, 97, 98]. In [32], Cochran et al. proposed the thread packing technique, which dynamically adjusts the number of cores allocated to a multithreaded application to improve its performance and energy efficiency. Sasaki et al. applied thread packing to large-scale parallel applications [31]. Other prior works [46, 97, 98] applied thread packing to multithreaded applications executed on heterogeneous multiprocessing systems whose cores implement the same instruction-set architecture (ISA) but exhibit different performance and energy-efficiency characteristics. While insightful, these works assume that the thread count can be set to the maximum core count of the underlying system without any significant negative performance impact. Our work (i.e., RMC [35]) differs in that we provide empirical evidence that dynamically adjusting the thread count has a significant impact on the performance of multithreaded applications. In addition, we propose an integrated runtime system that combines the thread packing and dynamic threading techniques in a cooperative manner.
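For concreteness, thread packing on Linux can be realized purely with affinity masks, leaving the thread count itself untouched. The following is a minimal sketch under that assumption; pack_threads is a hypothetical helper of ours, not code from any of the cited systems.

    import os

    def pack_threads(pid, core_count):
        """Restrict every thread of `pid` to the first `core_count` cores."""
        cores = set(range(core_count))
        # /proc/<pid>/task has one entry per thread of the process.
        for tid in os.listdir(f"/proc/{pid}/task"):
            os.sched_setaffinity(int(tid), cores)

    # Example: pack a 16-thread application onto 4 cores, later expand to 8.
    # pack_threads(target_pid, 4)
    # pack_threads(target_pid, 8)

Because only the affinity mask changes, the transition is cheap, but threads packed onto the same core now time-share it, which is exactly the source of the inefficiencies discussed below.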

Previous research has also proposed adaptive techniques that dynamically determine the optimal thread count (i.e., dynamic threading) for multithreaded applications [33, 34]. After running a portion of a parallel program or executing the entire program with test input data, these techniques determine a proper thread count and then perform the parallel execution without changing it for the rest of the run. Our work (i.e., RMC [35]) differs in that RMC provides a programming framework in which the runtime system continuously adapts the dynamic behavior of parallel applications throughout their entire life cycle in a fine-grained manner. In addition, RMC provides an integrated runtime system that simultaneously utilizes the two state-of-the-art adaptive techniques – thread packing and dynamic threading.
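The probing step common to these techniques can be pictured as a simple hill climb over thread counts. The version below is schematic, assuming a hypothetical run_portion(n) callback that executes a slice of the parallel region with n threads and reports its throughput; it is not the cited algorithms verbatim.

    def pick_thread_count(run_portion, max_threads):
        """Probe a slice of the work at increasing thread counts; keep the best."""
        best_n, best_tput = 1, run_portion(1)
        n = 2
        while n <= max_threads:
            tput = run_portion(n)
            if tput <= best_tput:   # adding threads stopped helping; stop probing
                break
            best_n, best_tput = n, tput
            n *= 2                  # geometric steps keep the probe phase short
        return best_n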

Prior works have extensively investigated techniques based on thread packing to improve the performance and energy efficiency of parallel applications in the context of embedded computing [45, 46, 97, 98], many-core computing [31], dynamic server consolidation [8, 47, 81], and power-constrained computing [27]. While these works proposed various techniques that employ thread packing for dynamic concurrency control, they do not investigate the inefficiencies of thread packing itself. For instance, one representative prior work on thread packing simply assumes that thread packing achieves performance comparable to thread reduction across all core counts [32], which our quantitative evaluation shows to be an invalid assumption. As another example, while one prior work employs thread packing to improve the performance of parallel applications under power capping [27], it fails to identify the inefficiencies of thread packing and therefore achieves suboptimal performance, as shown in our work (i.e., PALM [36]).

Our work (i.e., PALM [36]) significantly differs in that it investigates the inefficiencies of thread packing, identifies their root causes, proposes PALM to effectively address the performance pathologies of thread packing, and quantifies the effectiveness of PALM in terms of performance, energy consumption, and server consolidation efficiency using various applications, system configurations, and scenarios (i.e., dynamic server consolidation and power capping). Moreover, our work (i.e., PALM [36]) is orthogonal to the prior works and can be effectively used to significantly improve their efficiency.

IX Conclusions

In this dissertation, we investigate system software techniques that effectively leverage emerging hardware to maximize the performance, efficiency, and fairness of computer systems. Recent computer systems have adopted emerging hardware with advanced functionalities and/or improved performance and efficiency. However, emerging hardware poses several challenges. Its characteristics are not well understood, and deriving useful properties from characterization studies is challenging because emerging hardware affects applications with different characteristics in different ways. Furthermore, emerging hardware must be coordinated with other techniques to maximize efficiency, but such coordinated optimization is challenging because the coordination makes the system state space large and complex. To address these problems, we first perform in-depth characterization studies of emerging hardware with applications of various characteristics to derive findings that are useful for coordinated optimization. Guided by the characterization studies, we propose a set of system software techniques that effectively leverage the emerging hardware through coordinated optimization.

First, we investigate system software techniques to effectively leverage hardware-based last-level cache (LLC) and memory bandwidth partitioning functionalities. LLC and memory bandwidth are highly performance-critical system resources. In modern computer systems, where multiple workloads are often colocated on the same physical machine, LLC and memory bandwidth can be a source of performance interference because they are shared hardware resources. For effective and practical memory bandwidth partitioning on commodity server systems, we first conduct an in-depth characterization of the widely used software-based (i.e., thread packing and clock modulation) and hardware-based (i.e., Intel MBA) memory bandwidth partitioning techniques in terms of dynamic range, granularity, and efficiency. Guided by the characterization results, we design and implement HyPart, a hybrid technique for practical memory bandwidth partitioning on commodity servers. HyPart combines the three widely used memory bandwidth partitioning techniques in a coordinated manner in accordance with the application characteristics. We conduct a quantitative evaluation with various benchmarks and system configurations to demonstrate the effectiveness of HyPart. Our evaluation results show that HyPart significantly outperforms the existing widely used memory bandwidth partitioning techniques in terms of dynamic range, granularity, and efficiency.
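As a concrete point of reference, the hardware-based mechanism among the three, Intel MBA, is exposed on Linux through the resctrl filesystem. The sketch below shows the general shape of that interface, assuming resctrl is already mounted at /sys/fs/resctrl and MBA is supported; the group name and percentage are illustrative, and this is not HyPart's actual code.

    import os

    RESCTRL = "/sys/fs/resctrl"   # assumes resctrl is mounted here

    def mba_throttle(group, pids, percent):
        """Cap the memory bandwidth of `pids` to `percent` on socket 0 via MBA."""
        path = os.path.join(RESCTRL, group)
        os.makedirs(path, exist_ok=True)
        for pid in pids:
            # resctrl expects one task id per write to the tasks file
            with open(os.path.join(path, "tasks"), "w") as f:
                f.write(str(pid))
        with open(os.path.join(path, "schemata"), "w") as f:
            f.write(f"MB:0={percent}\n")

    # Example: limit a colocated batch job to roughly 30% of the bandwidth.
    # mba_throttle("batch", [batch_pid], 30)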

In addition, we investigate a system software technique for coordinated partitioning of LLC and memory bandwidth for fairness-aware workload consolidation. To this end, we characterize the impact of LLC and memory bandwidth partitioning on the performance and fairness of consolidated workloads. Guided by the characterization study, we design and implement CoPart, coordinated partitioning of LLC and memory bandwidth for fairness-aware workload consolidation on commodity servers. CoPart dynamically profiles the characteristics of the consolidated workloads with respect to the allocated LLC and memory bandwidth, and partitions LLC and memory bandwidth in a coordinated manner to maximize the fairness of the consolidated workloads. We formulate our resource allocation problem as the Hospitals/Residents problem, which is an extensively studied and widely applied problem in economics.

Our quantitative evaluation results demonstrate that CoPart significantly improves the fairness of the consolidated workloads, consistently achieves high fairness across various application and system configurations, and incurs small performance overhead.
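To illustrate the Hospitals/Residents formulation mentioned above, the following is a compact resident-proposing deferred-acceptance matcher in the style of Gale and Shapley [37]. Workloads play the residents and resource partitions the hospitals; all names are illustrative, and this is not CoPart's allocation code.

    def hospitals_residents(res_pref, hosp_pref, capacity):
        """Resident-proposing deferred acceptance with hospital capacities."""
        rank = {h: {r: i for i, r in enumerate(p)} for h, p in hosp_pref.items()}
        matched = {h: [] for h in hosp_pref}
        nxt = {r: 0 for r in res_pref}         # next hospital each resident tries
        free = list(res_pref)
        while free:
            r = free.pop()
            if nxt[r] == len(res_pref[r]):
                continue                        # r exhausted its list; unmatched
            h = res_pref[r][nxt[r]]
            nxt[r] += 1
            matched[h].append(r)
            matched[h].sort(key=lambda x: rank[h][x])
            if len(matched[h]) > capacity[h]:
                free.append(matched[h].pop())   # reject the least-preferred
        return matched

    # Example: two workloads competing for two partitions of capacity one each.
    # hospitals_residents({"w1": ["pA", "pB"], "w2": ["pA", "pB"]},
    #                     {"pA": ["w2", "w1"], "pB": ["w1", "w2"]},
    #                     {"pA": 1, "pB": 1})

The resulting matching is stable: no workload and partition would both prefer each other over their assignment, which is the property that makes the formulation attractive for fairness.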

Second, we investigate a system software technique to manage hardware-based power capping functionalities. Power is another important system resource, and maximizing performance under power capping is one of the most important problems in computing domains ranging from embedded to datacenter computing. To this end, we propose RPPC, a holistic runtime system for maximizing performance under power capping. We first perform a performance characterization of various multithreaded applications under power capping with different concurrency levels, cross-component power allocation ratios between CPUs and memory, total power budgets, and system scales. Guided by the characterization results, we design and implement RPPC. RPPC dynamically controls the key system knobs (i.e., concurrency level and cross-component power allocation) in a coordinated manner, considering the characteristics (e.g., scalability and memory intensity) of the target applications. Our quantitative evaluation shows the effectiveness of RPPC in that it significantly outperforms the state-of-the-art techniques, achieves performance comparable to the static best configuration, which requires extensive offline profiling, and incurs small performance overhead.
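On Intel systems, the cross-component power allocation knob is typically driven through RAPL, which Linux exposes via the powercap sysfs interface. A minimal sketch follows; the zone paths vary across machines and should be discovered under /sys/class/powercap, and the helper names are ours, not RPPC's.

    PKG  = "/sys/class/powercap/intel-rapl:0"    # CPU package zone (example path)
    DRAM = "/sys/class/powercap/intel-rapl:0:0"  # DRAM subzone (example path)

    def set_limit_watts(zone, watts):
        """Write a long-term RAPL power limit; the interface takes microwatts."""
        with open(f"{zone}/constraint_0_power_limit_uw", "w") as f:
            f.write(str(int(watts * 1_000_000)))

    def allocate(total_watts, cpu_ratio):
        """Split a total power budget between the CPUs and memory."""
        set_limit_watts(PKG, total_watts * cpu_ratio)
        set_limit_watts(DRAM, total_watts * (1.0 - cpu_ratio))

    # Example: a 100 W budget with 70% to the CPUs for a compute-bound phase.
    # allocate(100.0, 0.7)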

Third, we investigate system software techniques for effective dynamic concurrency control on many-core systems and heterogeneous multiprocessing systems. Many-core systems are emerging to satisfy the increasing demand for computation capacity. To maximize the performance and efficiency of parallel applications, however, the concurrency level must be carefully managed. To address this problem, we propose RMC, an integrated runtime system for adaptive many-core computing. RMC combines the two widely used dynamic concurrency control techniques (i.e., thread packing and dynamic threading) in a coordinated manner to exploit the advantages of both. RMC quickly controls the concurrency level of the target parallel applications through thread packing to improve performance and efficiency. RMC further improves performance and efficiency by determining the optimal thread count through dynamic threading. Our quantitative evaluation demonstrates the effectiveness of RMC in that it outperforms the existing dynamic concurrency control techniques in terms of performance and energy efficiency.
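The cooperation between the two knobs can be summarized as a two-phase control step: packing reacts first because changing an affinity mask is cheap, and dynamic threading follows once the phase is stable because respawning workers is expensive but removes oversubscription. The sketch below is purely schematic; every attribute of app is a hypothetical stand-in, not RMC's real interface.

    def control_step(app, target_cores):
        # Phase 1: react immediately by packing threads onto `target_cores`
        # (e.g., with the pack_threads sketch shown earlier).
        app.pack(target_cores)
        # Phase 2: once the application phase is stable, converge the thread
        # count itself so threads no longer time-share cores.
        if app.phase_stable():                    # hypothetical stability check
            best = app.search_thread_count(limit=target_cores)
            app.respawn_workers(best)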

In addition, we investigate a system software technique for efficient thread packing, which is a widely used dynamic concurrency control technique. We first conduct an in-depth performance analysis of thread packing with various synchronization-intensive benchmarks and system configurations, and identify that the imbalance in per-thread computation resources across the threads and the contention among the threads packed on the same physical core are the root causes of the performance pathologies of thread packing. Based on the characterization results, we propose PALM, progress- and locality-aware adaptive task migration for efficient thread packing. To improve the efficiency of thread packing, PALM solves three key problems: progress-aware task migration, locality-aware task migration, and scheduling period control. PALM supports not only symmetric multiprocessing systems but also heterogeneous multiprocessing systems, where the performance heterogeneity of cores causes the imbalance in per-thread computation resources across the threads of multithreaded applications. Our quantitative evaluation demonstrates the effectiveness of PALM in that it achieves significantly higher performance and lower energy consumption than conventional thread packing. We also present case studies in which PALM considerably improves the efficiency of dynamic server consolidation and the performance under power capping.
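The progress-aware part of this idea can be pictured as a small rebalancing step that runs every scheduling period: find the thread that has made the least progress toward the next barrier and migrate it to the core with the most headroom, so no thread stalls the others at synchronization points. The sketch below assumes hypothetical per-thread progress and per-core load counters and illustrates only the principle, not PALM's actual migration policy.

    import os

    def rebalance(tids, cores, progress, load):
        """Migrate the least-progressed thread to the least-loaded core."""
        laggard = min(tids,  key=lambda t: progress[t])   # furthest from barrier
        target  = min(cores, key=lambda c: load[c])       # most headroom
        os.sched_setaffinity(laggard, {target})           # Linux migration
        load[target] += 1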

References

[1] A. Herdrich, E. Verplanke, P. Autee, R. Illikkal, C. Gianos, R. Singhal, and R. Iyer, "Cache QoS: From concept to reality in the Intel Xeon processor E5-2600 v3 product family," in 2016 IEEE International Symposium on High Performance Computer Architecture (HPCA), March 2016, pp. 657–668.

[2] X. Wang, S. Chen, J. Setter, and J. F. Martínez, "SWAP: Effective fine-grain management of shared last-level caches with minimum hardware support," in 2017 IEEE International Symposium on High Performance Computer Architecture (HPCA), Feb 2017, pp. 121–132.

[3] "Intel 64 and IA-32 architectures software developer's manual."

[4] H. David, E. Gorbatov, U. R. Hanebutte, R. Khanna, and C. Le, "RAPL: Memory power estimation and capping," in Proceedings of the 16th ACM/IEEE International Symposium on Low Power Electronics and Design, ser. ISLPED '10. New York, NY, USA: Association for Computing Machinery, 2010, pp. 189–194. [Online]. Available: https://doi.org/10.1145/1840845.1840883

[5] C. Lefurgy, X. Wang, and M. Ware, "Power capping: A prelude to power shifting," Cluster Computing, vol. 11, no. 2, pp. 183–195, Jun 2008. [Online]. Available: https://doi.org/10.1007/s10586-007-0045-4

[6] J. Manferdelli, N. Govindaraju, and C. Crall, "Challenges and opportunities in many-core computing," Proceedings of the IEEE, vol. 96, no. 5, pp. 808–815, May 2008.

[7] R. Kumar, K. I. Farkas, N. P. Jouppi, P. Ranganathan, and D. M. Tullsen, "Single-ISA heterogeneous multi-core architectures: The potential for processor power reduction," in Proceedings of the 36th Annual IEEE/ACM International Symposium on Microarchitecture, ser. MICRO 36. USA: IEEE Computer Society, 2003, p. 81.

[8] D. Lo, L. Cheng, R. Govindaraju, P. Ranganathan, and C. Kozyrakis, "Heracles: Improving resource efficiency at scale," in Proceedings of the 42nd Annual International Symposium on Computer Architecture, ser. ISCA '15. New York, NY, USA: Association for Computing Machinery, 2015, pp. 450–462. [Online]. Available: https://doi.org/10.1145/2749469.2749475

[9] H. Zhu and M. Erez, "Dirigent: Enforcing QoS for latency-critical tasks on shared multicore systems," in Proceedings of the Twenty-First International Conference on Architectural Support for Programming Languages and Operating Systems, ser. ASPLOS '16. New York, NY, USA: Association for Computing Machinery, 2016, pp. 33–47. [Online]. Available: https://doi.org/10.1145/2872362.2872394

[10] D. Sanchez and C. Kozyrakis, "Vantage: Scalable and efficient fine-grain cache partitioning," in Proceedings of the 38th Annual International Symposium on Computer Architecture, ser. ISCA '11. New York, NY, USA: Association for Computing Machinery, 2011, pp. 57–68. [Online]. Available: https://doi.org/10.1145/2000064.2000073

[11] H. Cook, M. Moreto, S. Bird, K. Dao, D. A. Patterson, and K. Asanovic, "A hardware evaluation of cache partitioning to improve utilization and energy-efficiency while preserving responsiveness," in Proceedings of the 40th Annual International Symposium on Computer Architecture, ser. ISCA '13. New York, NY, USA: Association for Computing Machinery, 2013, pp. 308–319. [Online]. Available: https://doi.org/10.1145/2485922.2485949

[12] J. Lin, Q. Lu, X. Ding, Z. Zhang, X. Zhang, and P. Sadayappan, "Gaining insights into multicore cache partitioning: Bridging the gap between simulation and real systems," in 2008 IEEE 14th International Symposium on High Performance Computer Architecture, Feb 2008, pp. 367–378.

[13] M. K. Qureshi and Y. N. Patt, "Utility-based cache partitioning: A low-overhead, high-performance, runtime mechanism to partition shared caches," in Proceedings of the 39th Annual IEEE/ACM International Symposium on Microarchitecture, ser. MICRO 39. USA: IEEE Computer Society, 2006, pp. 423–432. [Online]. Available: https://doi.org/10.1109/MICRO.2006.49

[14] N. El-Sayed, A. Mukkara, P.-A. Tsai, H. Kasture, X. Ma, and D. Sanchez, "KPart: A hybrid cache partitioning-sharing technique for commodity multicores," in 2018 IEEE International Symposium on High Performance Computer Architecture (HPCA), Feb 2018, pp. 104–117.

[15] C. Xu, K. Rajamani, A. Ferreira, W. Felter, J. Rubio, and Y. Li, "dCat: Dynamic cache management for efficient, performance-sensitive infrastructure-as-a-service," in Proceedings of the Thirteenth EuroSys Conference, ser. EuroSys '18. New York, NY, USA: Association for Computing Machinery, 2018. [Online]. Available: https://doi.org/10.1145/3190508.3190555

[16] Y. Xiang, X. Wang, Z. Huang, Z. Wang, Y. Luo, and Z. Wang, "DCAPS: Dynamic cache allocation with partial sharing," in Proceedings of the Thirteenth EuroSys Conference, ser. EuroSys '18. New York, NY, USA: Association for Computing Machinery, 2018. [Online]. Available: https://doi.org/10.1145/3190508.3190511

[17] V. Selfa, J. Sahuquillo, L. Eeckhout, S. Petit, and M. E. Gómez, "Application clustering policies to address system fairness with Intel's cache allocation technology," in 2017 26th International Conference on Parallel Architectures and Compilation Techniques (PACT), Sep. 2017, pp. 194–205.

[18] A. Herdrich, R. Illikkal, R. Iyer, D. Newell, V. Chadha, and J. Moses, "Rate-based QoS techniques for cache/memory in CMP platforms," in Proceedings of the 23rd International Conference on Supercomputing, ser. ICS '09. New York, NY, USA: Association for Computing Machinery, 2009, pp. 479–488. [Online]. Available: https://doi.org/10.1145/1542275.1542342

[19] D. R. Hower, H. W. Cain, and C. A. Waldspurger, "PABST: Proportionally allocated bandwidth at the source and target," in 2017 IEEE International Symposium on High Performance Computer Architecture (HPCA), Feb 2017, pp. 505–516.

[20] F. Liu, X. Jiang, and Y. Solihin, "Understanding how off-chip memory bandwidth partitioning in chip multiprocessors affects system performance," in HPCA-16 2010 The Sixteenth International Symposium on High-Performance Computer Architecture, Jan 2010, pp. 1–12.

[21] F. Liu and Y. Solihin, "Studying the impact of hardware prefetching and bandwidth partitioning in chip-multiprocessors," in Proceedings of the ACM SIGMETRICS Joint International Conference on Measurement and Modeling of Computer Systems, ser. SIGMETRICS '11. New York, NY, USA: Association for Computing Machinery, 2011, pp. 37–48. [Online]. Available: https://doi.org/10.1145/1993744.1993749

[22] P. Cicotti, A. Tiwari, and L. Carrington, "Efficient speed (ES): Adaptive DVFS and clock modulation for energy efficiency," in 2014 IEEE International Conference on Cluster Computing (CLUSTER), Sep. 2014, pp. 158–166.

[23] R. Schöne, T. Ilsche, M. Bielert, D. Molka, and D. Hackenberg, "Software controlled clock modulation for energy efficiency optimization on Intel processors," in Proceedings of the 4th International Workshop on Energy Efficient Supercomputing, ser. E2SC '16. IEEE Press, 2016, pp. 69–76.

[24] J. Park, S. Park, M. Han, J. Hyun, and W. Baek, "HyPart: A hybrid technique for practical memory bandwidth partitioning on commodity servers," in Proceedings of the 27th International Conference on Parallel Architectures and Compilation Techniques, ser. PACT '18. New York, NY, USA: Association for Computing Machinery, 2018. [Online]. Available: https://doi.org/10.1145/3243176.3243211

[25] J. Park, S. Park, and W. Baek, "CoPart: Coordinated partitioning of last-level cache and memory bandwidth for fairness-aware workload consolidation on commodity servers," in Proceedings of the Fourteenth EuroSys Conference 2019, ser. EuroSys '19. New York, NY, USA: Association for Computing Machinery, 2019. [Online]. Available: https://doi.org/10.1145/3302424.3303963

[26] S. Govindan, D. Wang, A. Sivasubramaniam, and B. Urgaonkar, "Leveraging stored energy for handling power emergencies in aggressively provisioned datacenters," in Proceedings of the Seventeenth International Conference on Architectural Support for Programming Languages and Operating Systems, ser. ASPLOS XVII. New York, NY, USA: Association for Computing Machinery, 2012, pp. 75–86. [Online]. Available: https://doi.org/10.1145/2150976.2150985

[27] H. Zhang and H. Hoffmann, "Maximizing performance under a power cap: A comparison of hardware, software, and hybrid techniques," in Proceedings of the Twenty-First International Conference on Architectural Support for Programming Languages and Operating Systems, ser. ASPLOS '16. New York, NY, USA: Association for Computing Machinery, 2016, pp. 545–559. [Online]. Available: https://doi.org/10.1145/2872362.2872375

[28] R. Ge, X. Feng, Y. He, and P. Zou, "The case for cross-component power coordination on power bounded systems," in 2016 45th International Conference on Parallel Processing (ICPP), Aug 2016, pp. 516–525.

[29] © 2018 IEEE. Reprinted, with permission, from: J. Park, S. Park, and W. Baek, "RPPC: A holistic runtime system for maximizing performance under power capping," in Proceedings of the 18th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing, ser. CCGrid '18. IEEE Press, 2018, pp. 41–50. [Online]. Available: https://doi.org/10.1109/CCGRID.2018.00019

[30] T. David, R. Guerraoui, and V. Trigonakis, "Everything you always wanted to know about synchronization but were afraid to ask," in Proceedings of the Twenty-Fourth ACM Symposium on Operating Systems Principles, ser. SOSP '13. New York, NY, USA: Association for Computing Machinery, 2013, pp. 33–48. [Online]. Available: https://doi.org/10.1145/2517349.2522714

[31] H. Sasaki, S. Imamura, and K. Inoue, "Coordinated power-performance optimization in manycores," in Proceedings of the 22nd International Conference on Parallel Architectures and Compilation Techniques, ser. PACT '13. IEEE Press, 2013, pp. 51–62.

[32] R. Cochran, C. Hankendi, A. K. Coskun, and S. Reda, "Pack & Cap: Adaptive DVFS and thread packing under power caps," in Proceedings of the 44th Annual IEEE/ACM International Symposium on Microarchitecture, ser. MICRO-44. New York, NY, USA: Association for Computing Machinery, 2011, pp. 175–185. [Online]. Available: https://doi.org/10.1145/2155620.2155641

[33] K. K. Pusukuri, R. Gupta, and L. N. Bhuyan, "Thread Reinforcer: Dynamically determining number of threads via OS level monitoring," in 2011 IEEE International Symposium on Workload Characterization (IISWC), Nov 2011, pp. 116–125.

[34] M. A. Suleman, M. K. Qureshi, and Y. N. Patt, "Feedback-driven threading: Power-efficient and high-performance execution of multi-threaded workloads on CMPs," in Proceedings of the 13th International Conference on Architectural Support for Programming Languages and Operating Systems, ser. ASPLOS XIII. New York, NY, USA: Association for Computing Machinery, 2008, pp. 277–286. [Online]. Available: https://doi.org/10.1145/1346281.1346317

[35] © 2016 IEEE. Reprinted, with permission, from: J. Park, E. Cho, and W. Baek, "RMC: An integrated runtime system for adaptive many-core computing," in Proceedings of the 13th International Conference on Embedded Software, ser. EMSOFT '16. New York, NY, USA: Association for Computing Machinery, 2016. [Online]. Available: https://doi.org/10.1145/2968478.2968494

[36] © 2021 IEEE. Reprinted, with permission, from: J. Park, S. Park, M. Han, and W. Baek, "PALM: Progress- and locality-aware adaptive task migration for efficient thread packing," in 2021 IEEE International Parallel and Distributed Processing Symposium (IPDPS), May 2021, pp. 330–339.

[37] D. Gale and L. S. Shapley, "College admissions and the stability of marriage," The American Mathematical Monthly, vol. 69, no. 1, pp. 9–15, Jan. 1962. [Online]. Available: http://dx.doi.org/10.2307/2312726

[38] D. H. Bailey, E. Barszcz, J. T. Barton, D. S. Browning, R. L. Carter, L. Dagum, R. A. Fatoohi, P. O. Frederickson, T. A. Lasinski, R. S. Schreiber, H. D. Simon, V. Venkatakrishnan, and S. K. Weeratunga, "The NAS parallel benchmarks—summary and preliminary results," in Proceedings of the 1991 ACM/IEEE Conference on Supercomputing, ser. Supercomputing '91. New York, NY, USA: Association for Computing Machinery, 1991, pp. 158–165. [Online]. Available: https://doi.org/10.1145/125826.125925

[39] C. Bienia, S. Kumar, J. P. Singh, and K. Li, "The PARSEC benchmark suite: Characterization and architectural implications," in Proceedings of the 17th International Conference on Parallel Architectures and Compilation Techniques, ser. PACT '08. New York, NY, USA: Association for Computing Machinery, 2008, pp. 72–81. [Online]. Available: https://doi.org/10.1145/1454115.1454128

[40] S. C. Woo, M. Ohara, E. Torrie, J. P. Singh, and A. Gupta, "The SPLASH-2 programs: Characterization and methodological considerations," in Proceedings of the 22nd Annual International Symposium on Computer Architecture, ser. ISCA '95. New York, NY, USA: Association for Computing Machinery, 1995, pp. 24–36. [Online]. Available: https://doi.org/10.1145/223982.223990

[41] S. Seo, G. Jo, and J. Lee, "Performance characterization of the NAS parallel benchmarks in OpenCL," in Proceedings of the 2011 IEEE International Symposium on Workload Characterization, ser. IISWC '11. USA: IEEE Computer Society, 2011, pp. 137–148. [Online]. Available: https://doi.org/10.1109/IISWC.2011.6114174

[42] T. Sherwood, B. Calder, and J. Emer, "Reducing cache misses using hardware and software page placement," in Proceedings of the 13th International Conference on Supercomputing, ser. ICS '99. New York, NY, USA: Association for Computing Machinery, 1999, pp. 155–164. [Online]. Available: https://doi.org/10.1145/305138.305189

[43] Y. Ye, R. West, Z. Cheng, and Y. Li, "COLORIS: A dynamic cache partitioning system using page coloring," in Proceedings of the 23rd International Conference on Parallel Architectures and Compilation, ser. PACT '14. New York, NY, USA: Association for Computing Machinery, 2014, pp. 381–392. [Online]. Available: https://doi.org/10.1145/2628071.2628104