This section presents a performance characterization of multithreaded applications under power capping. Specifically, we aim to investigate the performance sensitivity of multithreaded applications to the active core count (i.e., the concurrency level) and the CPU power allocation ratio (i.e., the cross-component power allocation).
We also aim to investigate the performance sensitivity of multithreaded applications to their characteristics (i.e., memory (or core) intensity and scalability). Specifically, we classify an application as memory-intensive if its last-level cache (LLC) misses per kilo instructions (LLC MPKI)
is higher than a threshold.8 Otherwise, we classify the application into the core-intensive category. Finally, we aim to investigate the performance sensitivity of multithreaded applications to the total power budget and the system scale.

[Figure 30: Performance of various benchmarks on the 16-core server with the high power budget. Panels: (a) cfd (CFD), (b) jacobi (JA), (c) kmeans (KM), (d) kmeans-fuzzy (KZ), (e) scalparc (SC), (f) stream (ST). Each panel plots normalized performance (0.0–1.0) over core counts (4–16) and CPU power ratios (0.45–0.85).]
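The classification rule above can be sketched as follows. The threshold value of 10 comes from the offline experiments mentioned in the footnote; the function and variable names are our own illustration, not identifiers from the paper:

```python
# Classify an application as memory- or core-intensive from its
# LLC misses per kilo instructions (LLC MPKI).
LLC_MPKI_THRESHOLD = 10  # set via offline experiments (see footnote 8)

def classify_application(llc_misses: int, instructions: int) -> str:
    """Return 'memory-intensive' or 'core-intensive'."""
    mpki = llc_misses / (instructions / 1000)
    return "memory-intensive" if mpki > LLC_MPKI_THRESHOLD else "core-intensive"

# Example: 12 LLC misses per kilo instructions exceeds the threshold.
print(classify_application(llc_misses=12_000, instructions=1_000_000))
# -> memory-intensive
```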
Before discussing the detailed performance characterization results, we define the system state space as the two-dimensional space whose two primary dimensions are the concurrency level (i.e., the active core count, C) and the cross-component power allocation ratio (i.e., the CPU power ratio, rCPU). We then define a system state as a pair (C, rCPU) of the active core count and the CPU power ratio.
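Under these definitions, the system state space for the 16-core server can be enumerated directly. The value grids below mirror the axes of Figure 30 and are otherwise illustrative:

```python
from itertools import product

# Candidate concurrency levels (active core counts) and cross-component
# power allocation ratios (CPU power ratios), matching the grid
# evaluated on the 16-core server in Figure 30.
CORE_COUNTS = [4, 8, 12, 16]
CPU_POWER_RATIOS = [0.45, 0.50, 0.55, 0.60, 0.65, 0.70, 0.75, 0.80, 0.85]

# A system state is a pair (C, r_cpu); the state space is their product.
system_states = list(product(CORE_COUNTS, CPU_POWER_RATIOS))
print(len(system_states))  # 4 core counts x 9 ratios = 36 states
```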
Figure 30 shows the performance of various benchmarks on the 16-core server system with the high power budget. Each tile shows the performance (i.e., the reciprocal of the execution time) of the corresponding system state, normalized to that of the best-performing system state.
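The normalization used in Figure 30 (performance as the reciprocal of execution time, scaled so the best state maps to 1.0) can be written as follows; the example execution times are hypothetical:

```python
def normalized_performance(exec_times: dict) -> dict:
    """Map each system state to its performance (1 / execution time),
    normalized to the best-performing state (which gets 1.0)."""
    perf = {state: 1.0 / t for state, t in exec_times.items()}
    best = max(perf.values())
    return {state: p / best for state, p in perf.items()}

# Hypothetical execution times (seconds) per (C, r_cpu) system state.
times = {(8, 0.6): 10.0, (16, 0.6): 8.0, (16, 0.8): 12.5}
print(normalized_performance(times))
# (16, 0.6) is fastest, so it normalizes to 1.0
```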
First, Figure 30 shows that the global optimal system state of each application differs widely with its architectural characteristics. The scalable benchmarks (i.e., CFD, JA, and ST) achieve the highest performance when the core count is set to 16. In contrast, the non-scalable benchmarks (i.e., KM and KZ) achieve the highest performance when the core count is set below 16 (e.g., 8). Further, the memory-intensive benchmarks (i.e., JA and ST) achieve the highest performance when the CPU power ratio is relatively low, whereas the core-intensive benchmark (i.e., CFD) achieves the highest performance with a relatively high CPU power ratio.
Second, benchmarks in the same application category have similar global optimal system states, which are closely located in the system state space. For instance, the global optimal system states of JA and ST (i.e., the memory-intensive benchmarks) are closely located in the system state space.
Third, the performance of each application can be effectively approximated as a concave function of the core count and the CPU power ratio near the global optimal system state. For instance, the performance of ST exhibits concave behavior near its global optimal system state.
The aforementioned observations provide insight for the efficient design and implementation of a runtime system that aims to maximize the performance of multithreaded applications under power capping. Specifically, if the runtime system can correctly identify the characteristics of the target application, it can determine an initial system state whose quality is expected to be similar to that of the global optimal system state. In addition, the runtime system can further improve performance by dynamically exploring the system state space from the initial system state using a local search algorithm. This is effective primarily because the global optimal system state is expected to be located close to the initial system state, and the performance near the global optimal system state can be approximated as a concave function of the core count and the CPU power ratio.

8Through the offline experiments with various benchmarks, we set the LLC MPKI threshold to 10.

[Figure 31: Performance of various benchmarks on the 16-core server with the low power budget. Panels: (a) cfd (CFD), (b) jacobi (JA), (c) kmeans (KM), (d) kmeans-fuzzy (KZ), (e) scalparc (SC), (f) stream (ST). Each panel plots normalized performance (0.0–1.0) over core counts (4–16) and CPU power ratios (0.69–0.77).]
Figure 31 shows the performance of various benchmarks on the 16-core server system with the low power budget. We observe the following performance data trends.
First, compared with the trends on the 16-core server system with the high power budget, a larger number of benchmarks achieve the highest performance when the core count is less than the total core count (i.e., 16). For instance, JA achieves the highest performance at 8 cores with the low power budget, whereas it achieves the highest performance at 16 cores with the high power budget. This is primarily because memory-intensive benchmarks (e.g., JA and ST) achieve higher performance when the CPU power ratio is lower. Because the power allocated to the CPUs is extremely low under the low power budget, memory-intensive benchmarks achieve higher performance when only a subset of the cores is activated and each active core receives a sufficient amount of power to prevent clock throttling [28].
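The trade-off behind this effect (fewer active cores means more power per core under a fixed CPU budget) can be illustrated with a back-of-the-envelope calculation; the 40 W budget below is hypothetical, not a measured value from the experiments:

```python
# Under a fixed (low) CPU power budget split evenly among active cores,
# deactivating cores raises the per-core power allocation, which helps
# avoid clock throttling. The 40 W figure is a hypothetical budget.
CPU_BUDGET_W = 40.0

for active_cores in (16, 12, 8, 4):
    per_core_w = CPU_BUDGET_W / active_cores
    print(f"{active_cores:2d} active cores -> {per_core_w:.2f} W per core")
```

With all 16 cores active each core gets only 2.5 W in this toy setting, while 4 active cores get 10 W each, which is the regime where memory-intensive workloads fare better under a low budget.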
Second, the aforementioned observations made on the 16-core server system with the high power budget remain valid in the experimental results with the low power budget. This indicates that the observations can serve as an effective guideline for optimizing the performance of multithreaded applications under power capping across a wide range of power budgets.
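The exploration strategy described above (start from a category-derived initial state, then refine it with local search, relying on the concavity near the optimum) might look like the following sketch. The `measure_performance` callback, the step sizes, and the toy performance surface are all illustrative assumptions:

```python
# Minimal hill-climbing sketch of the local search described above:
# from an initial (core count, CPU power ratio) state, move to the
# best-performing neighbor until no neighbor improves.
def local_search(initial_state, measure_performance,
                 core_step=4, ratio_step=0.05,
                 core_range=(4, 16), ratio_range=(0.45, 0.85)):
    state = initial_state
    best = measure_performance(state)
    while True:
        c, r = state
        # Neighbors: one step along each dimension, clamped to the grid.
        neighbors = [
            (min(max(c + dc, core_range[0]), core_range[1]),
             round(min(max(r + dr, ratio_range[0]), ratio_range[1]), 2))
            for dc, dr in [(core_step, 0), (-core_step, 0),
                           (0, ratio_step), (0, -ratio_step)]
        ]
        improved = False
        for n in neighbors:
            p = measure_performance(n)
            if p > best:
                best, state, improved = p, n, True
        if not improved:
            return state, best

# Toy concave performance surface peaked at C=12, r_cpu=0.60; on such a
# surface, hill climbing from any grid point reaches the global optimum.
toy = lambda s: -((s[0] - 12) ** 2) - 100 * (s[1] - 0.60) ** 2
print(local_search((4, 0.45), toy))  # converges to the peak (12, 0.6)
```

On a concave surface a strictly improving walk over the finite grid cannot cycle, which is why the text's concavity observation makes this simple search effective near the optimum.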
We now investigate the performance of multithreaded applications under power capping on the 64-core server system. Figure 32 shows the performance of various benchmarks on the 64-core server system with the high power budget. Due to space limits, we show the performance of a subset of the benchmarks that represent their application categories. We observe that the evaluated benchmarks, on
the 64-core server system with the high power budget, exhibit performance data trends similar to those on the 16-core server system with the high power budget.

[Figure 32: Performance of various benchmarks on the 64-core server with the high power budget. Panels: (a) cfd (CFD), (b) kmeans (KM), (c) stream (ST). Each panel plots normalized performance (0.0–1.0) over core counts (16–64) and CPU power ratios (0.45–0.85).]

[Figure 33: Performance of various benchmarks on the 64-core server with the low power budget. Panels: (a) cfd (CFD), (b) kmeans (KM), (c) stream (ST). Each panel plots normalized performance (0.0–1.0) over core counts (16–64) and CPU power ratios (0.7–0.82).]
Finally, Figure 33 shows the performance of various benchmarks on the 64-core server system with the low power budget. First, we observe that all the benchmarks achieve the highest performance when the active core count is smaller than the total core count (i.e., 64). This is primarily because all the benchmarks exhibit diminishing scalability at higher core counts (e.g., 64). Therefore, when the total power budget is low, it is more efficient to reduce the active core count and allocate more power to each active core. Second, the aforementioned observations on the 16-core server system remain valid on the 64-core server system, indicating that they can effectively guide the design and implementation of an efficient runtime system across a wide range of system scales. Overall, we summarize our findings as follows:
• The performance of multithreaded applications under power capping varies widely across different active core counts and CPU power ratios. The global optimal system state significantly outperforms suboptimal system states.
• The search space covered by the two state-of-the-art power-capping techniques (i.e., PUPiL [27] and CPC [28]) often excludes the global optimal system states, causing them to achieve drastically suboptimal performance under power capping.
• The global optimal system states of the benchmarks in the same application category tend to be closely located in the system state space.
• The performance data trends show that the performance under power capping can be approximated as a concave function of the concurrency level and the CPU (memory) power ratio near the global optimal system state.
[Figure 34: Overall execution flow of RPPC. Phase 1 (Classification) produces an initial state for Phase 2 (Exploration), which transitions to and from Phase 3 (Idle); the transitions are guarded by the performance change |ΔT| relative to the thresholds ε and θ, and by changes in the total power budget (|ΔPtotal_budget| > 0).]
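One plausible reading of the three-phase flow in Figure 34 is the state machine sketched below. The exact transition semantics (which threshold guards which edge) are our assumption from the figure labels, and all identifiers are illustrative:

```python
# Assumed semantics of the RPPC flow in Figure 34: Phase 1 classifies
# the application and picks an initial state; Phase 2 explores the state
# space until the performance change |dT| settles within theta; Phase 3
# idles until |dT| exceeds epsilon; and any change in the total power
# budget restarts classification. All of this is an interpretation of
# the diagram, not a specification from the text.
def next_phase(phase, d_t, d_p_budget, epsilon=0.01, theta=0.05):
    if abs(d_p_budget) > 0:           # |dP_total_budget| > 0
        return "classification"
    if phase == "classification":
        return "exploration"          # initial state feeds exploration
    if phase == "exploration":
        return "idle" if abs(d_t) <= theta else "exploration"
    if phase == "idle":
        return "exploration" if abs(d_t) > epsilon else "idle"
    raise ValueError(f"unknown phase: {phase}")

print(next_phase("exploration", d_t=0.0, d_p_budget=0.0))  # -> idle
```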