Energy and thermal management of CMPs by dynamic cache reconfiguration

However, large LLCs incur significant leakage energy consumption, which has a circular dependence on the effective temperature of the chip circuitry. At the beginning of the next week, while buying fuel for his car, an idea occurred to the driver: “the main fuel of Basket..must be curtailed now!!..no food supply to the idle.”

Modern Chip Multi-Processors (CMPs)

Components in CMPs

Accessing a data block from the home L2 bank (i.e., the bank located in the same tile as the requesting core) incurs lower latency than accessing distant banks (located in other tiles). Note that L2 in this figure is considered the on-chip LLC, as in the TCMP architecture.

Figure 1.2: A Tiled CMP and a CMP with Centralised LLC.

Power Consumption in CMPs

Dynamic Power

Static Power

$$P_s = K_1 V_{DD} T^2 e^{(\alpha V_{DD} + \beta)/T} + K_2 e^{(\gamma V_{DD} + \delta)} \qquad (1.3)$$

$P_s$ indicates the static power consumption due to subthreshold leakage of a CMOS circuit.
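To make the temperature dependence in Equation (1.3) concrete, the following minimal sketch evaluates the expression K1·VDD·T²·e^((α·VDD+β)/T) + K2·e^(γ·VDD+δ) at a few temperatures. The constants in `PARAMS` are hypothetical placeholders chosen only for illustration; they are not fitted values from this thesis, and the result is in arbitrary units.

```python
import math


def static_power(v_dd, temp_k, k1, k2, alpha, beta, gamma, delta):
    """Evaluate the form of Eq. (1.3) for supply voltage v_dd and temperature temp_k (kelvin)."""
    # First term: depends on temperature through T^2 and the exponent (alpha*VDD + beta)/T.
    temperature_dependent = k1 * v_dd * temp_k ** 2 * math.exp((alpha * v_dd + beta) / temp_k)
    # Second term: depends on the supply voltage only.
    temperature_independent = k2 * math.exp(gamma * v_dd + delta)
    return temperature_dependent + temperature_independent


# Hypothetical constants, for illustration only (not taken from the thesis).
PARAMS = dict(k1=1e-2, k2=1e-3, alpha=500.0, beta=-4000.0, gamma=1.0, delta=-2.0)

for temp_c in (50, 70, 90):
    p_s = static_power(1.0, temp_c + 273.15, **PARAMS)
    print(f"{temp_c} C -> Ps = {p_s:.4f} (arbitrary units)")
```

With these placeholder constants the static power grows as the temperature rises, which is the behaviour behind the circular dependence between leakage and temperature.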

Short-Circuit Power

Thermal Issues in CMPs

Thermal Characteristics

Although power consumption can change instantaneously, even in practical cases, temperature cannot. The temperature values shown along the Y-axis are in °C, and the X-axis represents the timestamps during the run.

Mitigations at a glance

The thermal capacitance associated with each chip component does not allow its temperature to change instantaneously. Because caches are typically cooler than the hottest core region, and dynamically resizing large caches does not abruptly impact performance, some cache portions can be turned off, with due attention to performance, to create thermal buffers that reduce chip temperature.
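The lag between a power change and the resulting temperature change can be illustrated with a generic first-order lumped RC thermal model, C·dT/dt = P − (T − T_amb)/R. This is a textbook approximation, not the thermal model used in this thesis, and the values of `r_th`, `c_th` and `t_amb` below are assumptions chosen only for illustration.

```python
def step_temperature(temp, power, r_th, c_th, t_amb, dt):
    """One explicit-Euler step of C * dT/dt = P - (T - T_amb) / R."""
    return temp + dt * (power - (temp - t_amb) / r_th) / c_th


def simulate(power_trace, r_th=2.0, c_th=5.0, t_amb=45.0, dt=0.01, t0=45.0):
    """Return the temperature after each step of a power trace (assumed parameters)."""
    temps = [t0]
    for power in power_trace:
        temps.append(step_temperature(temps[-1], power, r_th, c_th, t_amb, dt))
    return temps


# Power jumps from 5 W to 20 W in a single step; the temperature follows gradually.
trace = [5.0] * 100 + [20.0] * 2000
temps = simulate(trace)
print(f"before jump: {temps[100]:.2f} C, one step later: {temps[101]:.2f} C, "
      f"end of run: {temps[-1]:.2f} C")
```

The same lag works in the cooling direction, which is why a disabled cache region lowers the chip temperature gradually rather than at once.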

Figure 1.4: Change in Average Chip temperature during process execution.

Cache based Mitigations

Leakage Minimisation

Controlling Temperature

Objectives of this work

Motivations

Reduce cache hotspots by independently disabling heavily used banks and least used banks to create a thermal buffer with a reasonable reduction in leakage. Create a (dynamically adjustable) optimal amount of on-chip thermal buffer by disabling cache banks based on their location and proximity to cores to gradually lower chip temperature.

Principal Contributions

DiCeR at Cache Bank Level
DiCeR in combination with DAM technique
DiCeR for temperature control in TCMP
DiCeR for temperature control in CCMP

In particular, the following variations are proposed: (1) B ON ALL. The system may decide to turn some cache banks off or on periodically throughout the process execution. Cache banks closer to cores get the highest shutdown priority, with optimal management of future requests to the shut-down cache banks.

Summary

Resizing is done by turning several cache banks off or on, which can be implemented by power gating at the circuit level. In addition to bank-level granularity, we have also proposed way-level cache resizing, where performance degradation is addressed by including DAM (Dynamic Associativity Management).

Organisation of Thesis

The cache size change decision is triggered based on the dynamic change in system performance. However, from a thermal efficiency perspective, these disabled cache parts are used as on-chip thermal buffers to reduce the effective chip temperature, especially in CMPs with larger LLCs.

Access Patterns of Shared LLCs

Bank Level Granularity
Set Level Granularity
Dynamic Associativity Management (DAM)

CMP-SVR

Violation in Locality of Reference

It improves cache performance by reducing the number of conflict misses in the cache. The change in access patterns over long periods of time for a set of PARSEC applications is shown in Figure 2.5 for a TCMP architecture.

Figure 2.1: Variable bank usages across different benchmarks in a TCMP as shown in Figure 1.2(a).

Cache Energy Modeling

Dynamic and Leakage Energy

The dynamic energy of the sense amplifiers is $E_{dyn-senseamps}$, while the dynamic energy for reading the bitlines is $E_{dyn-read-bitlines}$. $E_{leak-req-net}$ and $E_{leak-rep-net}$ represent the leakage energy consumption of the request and the reply networks, respectively.

Channel Length, Temperature and Leakage

$P_d$ denotes the dynamic power and $P_s$ the leakage power consumption of the chip circuits. As seen earlier, LLCs consume a significant portion of the total on-chip power, and most of that comes from leakage. Figure 2.7 therefore shows the simulated contribution of an LLC to the total energy consumption of the chip at different temperatures.

Figure 2.7: Contribution of LLCs to the total Chip power consumption.

Reducing Cache Leakage Consumption

Off-Line Techniques

The authors of [47] generated cache miss equations to summarize the cache access behavior of a loop and its variables. Although this model predicts the cache misses for individual threads, it does not take cache contention into account.

On-Line Techniques

In another exploration [82], a large portion of the tag bits is moved from the cache to an external register (called the Tag Overflow Buffer), which acts as an identifier for the current locality of the cache references. This policy shows a leakage saving of more than 40% for most of the tasks used by the authors.

Figure 2.12: Cache leakage reduction method at circuit level.

Thermal Management in CMPs

Core Level Management

Task Migration and DVFS are the most promising and preferred options to reduce the core temperature. In most cases, task migration and DVFS can have a noticeable impact on system performance.

Cache Based Policies

This technique lowers core and cache (MRAM/SRAM) layer temperatures by more than 5°C while maintaining critical chip temperatures. Most of the cache techniques developed earlier try to reduce the temperature by controlling cache accesses, i.e.

Summary

Moreover, LLCs are regarded as relatively cooler on-chip components, but the previous discussion shows that LLCs can also generate on-chip hotspots. In the next part of the thesis, we therefore propose thermal-aware cache tuning, which reduces cache hotspots by disabling them.

Computer Architecture Simulators

Simics
GEMS: An Overview
CACTI
McPAT
HotSpot

Each of the L1 caches has a sequencer associated with it to manage the requests from the corresponding core. For our last two contributions we need to simulate the power and temperature of the entire chip.

Benchmarks

PARSEC Benchmark Suite

Freqmine is used for Frequent Itemset Mining (FIMI) [141] with an array-based version of the Frequent Pattern-growth method. Swaptions is used to price a portfolio of swaptions using the Heath-Jarrow-Morton (HJM) framework [142].

Simulation Methodologies

Used Benchmark Applications
Executing Benchmarks
Comparing Different CMP Architectures
Our Architectural Models

Reaching the ROI implies that the benchmark initialization is over and all the threads have been spawned. The rest of the process, such as the warm-up and run policies, is the same as for multi-threaded benchmarks.

Table 3.3: List of all the multi-threaded and multi-programmed benchmarks used for the experiments in this thesis.

Introduction

In this regard, our policy dynamically shuts down or enables cache banks based on system performance and the banks' usage statistics.

B ON OFF ALL: A performance-related dynamic cache tuning strategy adjusts the size of the L2 caches by disabling cache banks.

Proposed Energy Saving Policy

Book Keeping and Future Requests

Similarly, the target bank of the bank being flushed will stall and not handle any requests during the flushing process.

Storage overhead for source bank tracking: Target bank selection is decided at runtime based on cache bank usage statistics.

Figure 4.5: Comparison between Migration (BSP H M, BSP C M) and Write-Back (BSP H W, BSP C W) policies in terms of IPC, while turning off some

Constraints to maintain

Previous works, in which power saving was done by remapping techniques, used a remapping table at the L1 cache level and when any new bank is closed, the entries in all L1 caches must be updated. To see the effect of bank closure on IPC, we compare baseline IPC with the IPC of pure bank closure policy (where the number of closing banks is limited to 50% of the total number of banks), through a set of simulations.
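The bank closure experiment described above can be sketched as follows: least-used banks are picked as closure candidates, the number of closed banks is capped at 50% of the total, and the resulting IPC is compared against the baseline. This is a minimal sketch under stated assumptions; the function names, the `usage_threshold` parameter and the example numbers are hypothetical and are not taken from the thesis.

```python
def banks_to_close(usage, usage_threshold, max_fraction=0.5):
    """Pick the least-used banks, closing at most max_fraction of all banks.

    usage maps a bank id to its access count in the last interval (assumed statistic).
    """
    limit = int(len(usage) * max_fraction)
    candidates = sorted(
        (bank for bank, count in usage.items() if count <= usage_threshold),
        key=lambda bank: usage[bank],
    )
    return candidates[:limit]


def within_ipc_constraint(baseline_ipc, current_ipc, max_loss):
    """True if the relative IPC loss against the baseline stays within max_loss."""
    return (baseline_ipc - current_ipc) / baseline_ipc <= max_loss


# Hypothetical per-bank access counts for an 8-bank L2.
usage = {0: 900, 1: 10, 2: 5, 3: 700, 4: 0, 5: 20, 6: 800, 7: 15}
print(banks_to_close(usage, usage_threshold=50))
print(within_ipc_constraint(baseline_ipc=1.00, current_ipc=0.99, max_loss=0.02))
```

Five banks fall below the assumed threshold here, but only four are returned because of the 50% cap.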

Experimental Evaluation

Experimental Setup

Configuration details for this simulation setup are given in Table 4.3, which details the processor, memory, and network configurations. The PARSEC benchmark [6] was used as an application program to validate our proposed architecture, the details of which are also given in Chapter 3.

Results and Analysis

Comparison with baseline architecture
Comparison with BSP and Drowsy [1]
Analysis of power savings by varying the IPC constraint
Analysis of proposed policy on a larger cache
Summary

Relative to baseline, B ON OFF OPT delivers 19% and 29% more static energy savings than Drowsy C1 and Drowsy C2, respectively. Because B ON OFF OPT performs the reconfiguration intermittently, it has more control over the cache size.

Figure 4.6: Energy Delay Product obtained in proposed policies over the baseline, BSP and Drowsy, with 4MB L2 cache

Conclusion

Policy B ON OFF OPT gives the best savings compared to B ON OFF ALL and B OFF ONCE. The performance degradation is also less than 2% especially in the case of B ON OFF OPT.

Introduction

Leakage consumption can be reduced by combining cache bank shutdown with way shutdown: the banks with minimal usage are candidates for shutdown, as in BSP, while in some moderately used banks some ways are disabled to prevent leakage. This helps restore the performance loss and also creates new opportunities to turn off some ways in each set.

Memory Latency in DiCeR

Here $h_{ij}$ is the number of hits at bank j whose requests were generated at core i. During reconfiguration, the system also needs a few clock cycles to move data from the victim bank to the target bank or vice versa, which is negligible if done a limited number of times.
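As an illustration of how a hit matrix of this kind can feed a latency estimate, the sketch below computes a hit-weighted average L2 access latency from per-core, per-bank hit counts. It is not the thesis's latency expression: the hop-count matrix `hops`, the round-trip cost and the default `bank_latency` and `hop_latency` values are assumptions made for this example.

```python
def average_l2_hit_latency(hits, hops, bank_latency=6, hop_latency=2):
    """Hit-weighted average latency in cycles.

    hits[i][j]: number of hits at bank j for requests generated at core i.
    hops[i][j]: assumed network hop count between core i and bank j.
    """
    total_hits = sum(sum(row) for row in hits)
    if total_hits == 0:
        return 0.0
    total_cycles = 0
    for i, row in enumerate(hits):
        for j, count in enumerate(row):
            # Assumed cost: bank access plus a request/reply round trip over the network.
            total_cycles += count * (bank_latency + 2 * hops[i][j] * hop_latency)
    return total_cycles / total_hits


hops = [[0, 1], [1, 0]]            # two tiles, one hop apart
local_hits = [[10, 0], [0, 10]]    # every hit served by the home bank
remote_hits = [[0, 10], [10, 0]]   # every hit served by the distant bank
print(average_l2_hit_latency(local_hits, hops))
print(average_l2_hit_latency(remote_hits, hops))
```

Redirecting requests from a closed bank to a distant target bank shifts hits from the first pattern towards the second, which is why the choice of target bank matters for latency.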

Proposed Energy Saving Policy

BSP vs. BSP SVR: A Comparative Analysis

A higher EDP gain is achieved with BSP SVR than with BSP due to lower NoC energy consumption. The performance loss threshold is set at 3% for both BSP and BSP SVR while running the simulations.

Figure 5.5: Savings in Static Energy with BSP SVR.

DiCeR with DAM at Multiple Granularities

Managing associativity: After turning off a way, the effective associativity of the bank is reduced, increasing the number of conflict misses. This is done because CMP-SVR helps increase associativity by using RT string partitioning.

Experimental Evaluation

Results and Analysis

Static Energy Savings: The obtained savings in cache leakage energy are shown in Figure 5.11 for different benchmarks.

Effect on smaller cache size: Figure 5.14 shows the EDP savings and Figure 5.15 shows the static power savings for a 2MB 8-way associative L2 cache.

Figure 5.10: Savings in EDP in comparison with BSP and Drowsy Cache for 4MB 4-way L2.

Conclusion

Closing banks in this regard can be a promising option to achieve such goals, because closing (larger) banks creates (larger) thermal buffers on the chip with remarkably high leakage savings. Since reducing power consumption plays the crucial role in lowering the temperature, this work therefore uses DiCeR to create on-chip thermal buffers for lowering the average chip temperature without disrupting the computation.

Introduction

Requests to a switched-off bank are redirected to another, cooler cache bank, called the target bank. If the performance degrades by more than a threshold value, the disabled banks are enabled again.

Thermal issues: from LLC perspective

LLC: Thermal Characteristics
Thermal Management: Core vs Cache
Modeling Tile Temperature in TCMP
Target Bank Selection
Reconfiguring the LLC
Algorithmic Design

In both cases, powered-down banks create thermal buffers and are able to reduce the chip temperature. If the performance degradation is within a predefined limit (δ) and the hottest bank is not a target bank, the system shuts down the hottest bank after migrating its contents to the target bank and updating the remapping at the shut-down bank's controller (lines 11 to 15).
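The decision described above can be sketched as a single reconfiguration step. This is a simplified illustration, not the thesis's algorithm: choosing the coolest remaining bank as the target, re-enabling all disabled banks at once when the limit is exceeded, and every name used here (`dicer_step`, `remap`, the temperature values) are assumptions made for this sketch.

```python
def dicer_step(temps, active, remap, perf_loss, delta):
    """One thermal reconfiguration step.

    temps: bank id -> temperature; active: set of powered-on bank ids;
    remap: closed bank id -> target bank id. Mutates active and remap.
    """
    if perf_loss > delta:
        # Degradation beyond the limit: re-enable the disabled banks (assumed: all at once).
        active.update(remap.keys())
        remap.clear()
        return ("enable_all", None, None)
    hottest = max(active, key=lambda bank: temps[bank])
    if hottest in remap.values() or len(active) < 2:
        # A bank serving as a target for closed banks must stay on.
        return ("no_change", None, None)
    # Assumed target choice: the coolest of the remaining active banks.
    target = min((b for b in active if b != hottest), key=lambda bank: temps[bank])
    remap[hottest] = target      # contents migrate and future requests are remapped
    active.remove(hottest)
    return ("shutdown", hottest, target)


temps = {0: 70.0, 1: 65.0, 2: 60.0, 3: 58.0}   # hypothetical bank temperatures
active, remap = {0, 1, 2, 3}, {}
print(dicer_step(temps, active, remap, perf_loss=0.01, delta=0.03))
print(dicer_step(temps, active, remap, perf_loss=0.05, delta=0.03))
```

In a real system the temperatures and the performance loss would be re-measured before each step; the fixed values here only exercise the two branches.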

Figure 6.2: Peak Temperature of Caches in °C.

Experimental Analysis

Simulation Setup
Temporal effect of cache resizing
EDP gain
Effect on Tile and Chip Thermal Profile
Varying Reconfiguration Interval
Summary

We show the final change in NoC latency and NoC energy consumption for all our applications in Figures 6.7 and 6.8, respectively. Both policies have good leakage savings as shown in Figure 6.10 for all our applications.

Figure 6.4: Temporal effect of cache resizing: body16.

Conclusion

The creation of thermal buffer together with the leakage effect reduction reduces the effective temperature of the tiles in our TCMP model. In such cases, the cooling rate is very slow for these banks even when they are turned off.

Introduction

In the case of caches, heavily used blocks can create cache hotspots, while the least used parts of the cache unnecessarily increase leakage consumption, contributing to chip temperature. Additionally, for some modern applications, cache access patterns do not match the classic property of cache access (the Locality of Reference) over the long term.

Background

CCMP and its Leakage Hungry LLC

The basic architecture used in this work (shown in Figure 7.1) contains 16 homogeneous CPU cores, which are placed along the edge of the chip. For this architecture, the total power consumption of the chip can be divided into two main components (not counting the interconnection and I/O interface): (I) the power consumed by the individual cores, including the processing power and the power consumed by the L1 caches (data and instruction); and (II) the power consumed by the LLC (L2 in our case).

Thermal Potential of the Centralised LLCs

Independent of cache accesses, the leakage power, which has a circular dependence on temperature, can be reduced by closing cache banks. Therefore, reducing the leakage power of the LLC by dynamically resizing it can be a promising option to reduce the chip temperature without affecting the computing units.

Runtime Cache Behaviour

Preliminaries and Analytical Problem Formulation

Core Temperature Modeling
Problem Formulation
Performance Modeling with Cache Size
Thermal Model
Combined Analytics
Finding out Optimal b
Patterns for Cache Resizing

On the other hand, drastically reducing the cache size limits performance by incurring more cache misses. Performance depends on the number of cache misses, the available number of cache banks, the off-chip access delay, and the delay incurred for requests redirected to target banks.

Table 7.2: Descriptions of the notations used in our analytical modeling.

DiCeR for Thermal Efficiency

Algorithms and Discussions

The clusters along the periphery have one on and three closed banks in each cluster. The bank that is ON in a group becomes the target for OFF banks in that group.
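The cluster arrangement described above, with one ON bank serving as the target for the three OFF banks of its cluster, can be sketched as a simple mapping. The sketch assumes that banks with consecutive ids form a cluster and that the first bank of each cluster stays ON; both are assumptions made for illustration, as is the function name.

```python
def cluster_targets(num_banks, cluster_size=4, on_index=0):
    """Map every OFF bank to the ON bank of its cluster.

    Assumes consecutive bank ids form a cluster and the bank at position
    on_index within each cluster is the one kept ON.
    """
    if num_banks % cluster_size != 0:
        raise ValueError("num_banks must be a multiple of cluster_size")
    targets = {}
    for start in range(0, num_banks, cluster_size):
        cluster = list(range(start, start + cluster_size))
        on_bank = cluster[on_index]
        for bank in cluster:
            if bank != on_bank:
                targets[bank] = on_bank   # requests to an OFF bank go to the ON bank
    return targets


print(cluster_targets(8))
```

Each request addressed to a closed bank is then served by the single powered-on bank of the same cluster.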