Evaluation - System Software Techniques Leveraging Emerging Hardware for High-Performance, Effi

Figure 7: Execution flow of the coordinate function. [Flowchart nodes: Start coordinate(); victimApp ← sortedAppList.getNext(); decision: victimApp ≠ NULL; victimApp.currState ← victimApp.getMinimumCoreState(); decision: appList.requiredCoreCount() ≤ NC; victimApp.currState ← victimApp.getFeasibleBestState() and coordinationSucccess ← True; coordinationSucccess ← False; End coordinate(). Decision edges are labeled True/False.]
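One plausible reading of the flow in Figure 7 is sketched below in Python. The edge structure is an assumption: victims are taken in sorted order, each is demoted to its minimum-core state, and coordination succeeds as soon as the total required core count fits within NC. The `State` and `App` classes, the `throughput` score, and the `core_limit` argument to `get_feasible_best_state` are hypothetical and introduced only for illustration; the text does not define how the best feasible state is chosen.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class State:
    cores: int          # cores this state requires
    throughput: float   # hypothetical score used to rank states


@dataclass
class App:
    name: str
    feasible_states: List[State]  # states assumed to satisfy the app's budget
    curr_state: State

    def get_minimum_core_state(self) -> State:
        return min(self.feasible_states, key=lambda s: s.cores)

    def get_feasible_best_state(self, core_limit: int) -> State:
        # Assumption: "best" means highest score among states that fit.
        fitting = [s for s in self.feasible_states if s.cores <= core_limit]
        return max(fitting, key=lambda s: s.throughput)


def required_core_count(apps: List[App]) -> int:
    return sum(app.curr_state.cores for app in apps)


def coordinate(sorted_app_list: List[App], nc: int) -> bool:
    """Return True if the apps' core demands can be made to fit in nc cores."""
    for victim in sorted_app_list:          # getNext() until NULL
        victim.curr_state = victim.get_minimum_core_state()
        used = required_core_count(sorted_app_list)
        if used <= nc:
            # Give any spare cores back to the last victim.
            limit = victim.curr_state.cores + (nc - used)
            victim.curr_state = victim.get_feasible_best_state(limit)
            return True                     # coordination succeeded
    return False                            # no victim left: coordination failed
```

In this sketch, earlier victims stay in their minimum-core states when a later victim is examined; whether HyPart behaves this way is not recoverable from the figure alone.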

the newly generated system state is the same as the previously explored system state. If so, there is a possibility that HyPart would be indefinitely stuck at the same system state. To prevent this scenario, HyPart randomly selects a state among the states in the newly generated system state and removes it from the candidate state list of the corresponding application. This essentially eliminates the newly generated system state from the candidate system states. HyPart then transitions back to the system state generation sub-phase to generate a new system state to explore.
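The duplicate-state handling described above can be sketched as follows. This is a minimal illustration, not HyPart's implementation: a system state is modeled as a hypothetical mapping from application name to a state identifier, and `candidates` is a hypothetical per-application candidate state list.

```python
import random
from typing import Dict, List


def prune_if_repeated(
    new_state: Dict[str, str],
    previous_state: Dict[str, str],
    candidates: Dict[str, List[str]],
    rng: random.Random,
) -> bool:
    """Return True if new_state repeated previous_state and was pruned."""
    if new_state != previous_state:
        return False
    # Pick one application's state from the repeated system state at random
    # and drop it from that application's candidate list. The repeated
    # system state can then no longer be generated.
    app = rng.choice(sorted(new_state))
    candidates[app].remove(new_state[app])
    return True
```

When the function returns True, the caller would go back to the system state generation sub-phase and generate a new state from the reduced candidate lists.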

Exit sub-phase: Finally, if HyPart transitions to the exit sub-phase, it terminates the system state exploration phase because there are no more system states to explore or the rollback count has exceeded the threshold. It then transitions to the idle phase.

3.3.6 Idle Phase

During the idle phase, HyPart keeps monitoring the target system and applications without performing any adaptation activities. If a change (e.g., the termination of an application or a change in the memory bandwidth budgets) is detected, HyPart terminates the idle phase and re-triggers the aforementioned adaptation process.

Figure 8: Dynamic range and granularity of HyPart. [Plot: Memory Traffic (GB/s) versus Setting (%); series: Thread Packing, Clock Modulation, MBA, HyPart.]

3.4.1 Dynamic Range and Granularity

We investigate the dynamic range and granularity of HyPart. Figure 8 shows the memory traffic of the STREAM benchmark with various memory bandwidth budgets that HyPart can configure. We observe the following data trends.

First, HyPart provides a wider dynamic range (i.e., 0.31GB/s–56.16GB/s) of memory bandwidth than thread packing (i.e., 9.51GB/s–56.16GB/s), clock modulation (i.e., 9.75GB/s–56.16GB/s), and MBA (i.e., 34.20GB/s–56.16GB/s). This is primarily because HyPart can configure lower memory bandwidth than the other three techniques by combining multiple partitioning techniques. For instance, the lowest bandwidth (i.e., 0.31GB/s) can be configured with HyPart by setting the level of each partitioning technique to the lowest level.

Second, HyPart provides finer-grained control of memory bandwidth than the other three techniques. Specifically, the granularity of HyPart is 0.02GB/s, which is significantly finer-grained than thread packing (i.e., 3.11GB/s), clock modulation (i.e., 3.32GB/s), and MBA (i.e., 5.47GB/s). This is primarily because HyPart comprises a significantly larger number of levels (i.e., 16×15×10=2400) than thread packing (i.e., 16), clock modulation (i.e., 15), and MBA (i.e., 10).
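As a rough cross-check of these figures, the sketch below assumes that granularity is the dynamic range divided by the number of steps between adjacent levels (levels minus one). This definition is an inference, not one stated in the text. It reproduces the reported values for thread packing, clock modulation, and HyPart to within rounding; the reported MBA value (5.47GB/s) is not reproduced by this formula and is therefore left out.

```python
# (min GB/s, max GB/s, number of levels), taken from the text above
TECHNIQUES = {
    "thread packing": (9.51, 56.16, 16),
    "clock modulation": (9.75, 56.16, 15),
    "HyPart": (0.31, 56.16, 16 * 15 * 10),  # 2400 combined levels
}


def average_step(lo: float, hi: float, levels: int) -> float:
    """Average bandwidth change between adjacent levels (assumed definition)."""
    return (hi - lo) / (levels - 1)


for name, (lo, hi, levels) in TECHNIQUES.items():
    print(f"{name}: {average_step(lo, hi, levels):.2f} GB/s per step")
```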

3.4.2 Efficiency

We now investigate the efficiency (i.e., throughput) of HyPart. We generate 15 workload mixes, each formed by randomly selecting two of the nine memory bandwidth-intensive benchmarks without replacement. Each benchmark in a workload mix is allocated an equal bandwidth budget (i.e., 28GB/s), and its thread count is set equal to the core count (i.e., 16) of the evaluated server system.
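The mix generation can be sketched as below. The benchmark names are placeholders, since the nine benchmarks are not listed in this section, and the seed is arbitrary. The sketch samples without replacement within a mix only; whether the 15 mixes must also be distinct from one another is not stated, so duplicates across mixes are not filtered here.

```python
import random
from typing import List, Tuple

# Placeholder names standing in for the nine bandwidth-intensive benchmarks.
BENCHMARKS = [f"bench{i}" for i in range(9)]

BUDGET_GBPS = 28   # per-benchmark bandwidth budget from the text
THREADS = 16       # thread count equals the server's core count


def make_mixes(
    benchmarks: List[str], mix_count: int = 15, mix_size: int = 2, seed: int = 0
) -> List[Tuple[str, ...]]:
    rng = random.Random(seed)
    # random.sample draws without replacement, so each mix has distinct members.
    return [tuple(rng.sample(benchmarks, mix_size)) for _ in range(mix_count)]


mixes = make_mixes(BENCHMARKS)
```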

We run each workload mix by controlling the memory bandwidth of each benchmark in the workload mix using thread packing (TP), clock modulation (CM), and HyPart. The TP version allocates four cores to each benchmark in the workload mix while setting the clock modulation and MBA levels to 100%.

Figure 9: Overall throughput. [Bar chart: Normalized Throughput versus Workload Mix Index (0–14 and AVG); series: Thread Packing, Clock Modulation, HyPart.]

The CM version sets the clock modulation level to 43.75% while setting the allocated core count and MBA level to 16 and 100%, respectively. Note that we cannot report results for an MBA version because it has no feasible state: the minimum bandwidth that MBA can configure is 34.2GB/s (Table 3), which exceeds the 28GB/s budget.
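The feasibility argument can be made concrete with the bandwidth ranges reported in Section 3.4.1. The sketch below assumes that a technique can serve a budget whenever its lowest configurable bandwidth does not exceed that budget; the function name is illustrative.

```python
# Lowest and highest configurable bandwidth (GB/s) per technique,
# as reported for the STREAM benchmark in Section 3.4.1.
RANGES = {
    "thread packing": (9.51, 56.16),
    "clock modulation": (9.75, 56.16),
    "MBA": (34.20, 56.16),
    "HyPart": (0.31, 56.16),
}


def has_feasible_state(technique: str, budget_gbps: float) -> bool:
    """True if the technique can throttle down to the budget or below."""
    lowest, _highest = RANGES[technique]
    return lowest <= budget_gbps


feasible = {name: has_feasible_state(name, 28.0) for name in RANGES}
```

With a 28GB/s budget, only MBA comes out infeasible, which matches the omission of an MBA version from the comparison.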

We run each version of the workload mixes for 100 seconds and measure the number of instructions that each benchmark in the workload mix executes. We then summarize the overall throughput of the workload mix using Equation 1 discussed in Section 3.3.1.

Figure 9 shows the overall throughput of the three versions of each workload mix. Each bar is the throughput (i.e., higher is better) of each version of the workload mix, normalized to that of the TP version. The rightmost bars show the average (i.e., geometric mean) throughput of each version across all workload mixes.
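The normalization and averaging used for Figure 9 can be sketched as follows. The per-mix throughput itself comes from Equation 1, which is not reproduced in this section, so the sketch takes throughput values as given inputs. The function names and the `"TP"` key are illustrative.

```python
import math
from typing import Dict, Iterable


def normalize_to_baseline(
    throughput: Dict[str, float], baseline: str = "TP"
) -> Dict[str, float]:
    """Divide every version's throughput by the baseline version's throughput."""
    base = throughput[baseline]
    return {version: value / base for version, value in throughput.items()}


def geometric_mean(values: Iterable[float]) -> float:
    """Geometric mean of positive values, computed in log space."""
    values = list(values)
    return math.exp(sum(math.log(v) for v in values) / len(values))
```

The geometric mean is the usual choice for averaging normalized ratios because the result does not depend on which version is chosen as the baseline.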

First, HyPart significantly outperforms thread packing. The TP version of the workload mixes achieves low throughput because it employs fewer cores to satisfy the memory bandwidth target (i.e., 28GB/s) while the settings of the other two partitioning techniques remain at their maximum levels. In contrast, HyPart employs a sufficiently large number of cores for each application while still satisfying the memory bandwidth budget.

Second, HyPart significantly outperforms clock modulation. To satisfy the target memory bandwidth budget, clock modulation sets the duty cycle percentage to a low value, significantly reducing the computation capacity allocated to each benchmark in the workload mixes. In contrast, HyPart executes each benchmark in an efficient system state by controlling the settings of the three memory bandwidth partitioning techniques in a coordinated manner.

3.4.3 Sensitivity to the Parameters

We first investigate the performance sensitivity of HyPart to the application count. To quantify the performance sensitivity of HyPart, we sweep the application count in each workload mix from 2 to 4.

Figure 10: Performance sensitivity of HyPart to the application count. [Bar chart: Normalized Throughput versus Application Count (2, 3, 4); series: Thread Packing, Clock Modulation, HyPart.]

With thread packing, each benchmark is allocated 4, 2, and 1 cores when the application count is set to 2, 3, and 4, respectively. With clock modulation, the clock modulation level is set to 43.75%, 25%, and 18.75% when the application count is set to 2, 3, and 4, respectively. Again, we cannot report the results of the MBA version due to its limited dynamic range.

Figure 10 shows the throughput (normalized to the TP version) of each memory bandwidth partitioning technique, which is the average (i.e., geometric mean) across the 15 workload mixes. HyPart consistently outperforms the other memory bandwidth partitioning techniques across all application counts, demonstrating its performance robustness.

HyPart significantly outperforms clock modulation as the application count increases. This is primarily because the clock modulation level (i.e., the percentage of duty cycles) is set to a lower value as the bandwidth budget of each benchmark decreases with a larger application count, significantly reducing the computation capacity allocated to the benchmarks. In contrast, HyPart robustly finds efficient system states across the different application counts.

The performance gain of HyPart over thread packing decreases as the application count increases. Because each benchmark is allocated a lower memory bandwidth budget with a larger application count, the number of feasible states for each benchmark decreases, providing fewer optimization opportunities for HyPart. Nevertheless, HyPart consistently outperforms thread packing across all application counts.

We then investigate the performance sensitivity of HyPart to the memory bandwidth budget allocated to each application in a workload mix. To quantify this sensitivity, we use the aforementioned 15 workload mixes, each of which comprises two memory bandwidth-intensive benchmarks. We allocate the bandwidth budget to the applications in a workload mix in a uniform (i.e., 28GB/s and 28GB/s) and a skewed (i.e., 17.1GB/s and 39.2GB/s) manner. With thread packing and skewed memory bandwidth budgets, the two benchmarks are allocated 2 and 7 cores, respectively. With clock modulation and skewed memory bandwidth budgets, the clock modulation level is set to 25%, respectively.

Figure 11: Performance sensitivity of HyPart to the memory bandwidth budgets. [Bar chart: Normalized Throughput for Uniform and Skewed budgets; series: Thread Packing, Clock Modulation, HyPart.]

Figure 11 shows the throughput (normalized to the TP version) of each memory bandwidth partitioning technique, which is the average (i.e., geometric mean) across all 15 workload mixes. HyPart consistently achieves higher performance across the different memory bandwidth budgets. The conventional memory bandwidth partitioning techniques tend to use a suboptimal state (e.g., small percentage of duty cycles with clock modulation) for a benchmark when it is allocated a low bandwidth budget. In contrast, HyPart effectively combines the three memory bandwidth partitioning techniques and employs efficient states for benchmarks across the different memory bandwidth budgets.

Overall, our quantitative evaluation demonstrates the effectiveness of HyPart: it provides a wider dynamic range, finer-grained control of memory bandwidth, and significantly higher performance than the conventional memory bandwidth partitioning techniques, and it achieves robust performance across the different settings of the underlying system.⁵
