LiNoVo: Longevity Enhancement of Non-Volatile Caches by

7.12 Iso area analysis between the proposed HCA and basic SRAM and STT
7.13 Comparative analysis for different capacities NVM LLC (M)
7.19 Benchmarking for Different Intervals (I) for HCA Main Cache
A.1 Inherent Key Features of PARSEC Benchmarks

Modern Chip Multi-Processors (CMPs)

Moore's Law [1] gives us more and more transistors on the chip with each process generation. Based on the nature of their instructions, applications executing on CMPs fall into two categories: memory bound and compute bound.

Figure 1.1: Modern CMPs: design and floor-plan outlines

Cache Memory

A set-associative cache is generally considered the best trade-off when compared with direct-mapped and fully associative caches.
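To make the comparison concrete, the following minimal sketch shows how an address is split into tag, set index and block offset for a given associativity. The class name `CacheGeometry` and the example sizes are illustrative assumptions, not parameters taken from the thesis.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CacheGeometry:
    """Hypothetical helper: describes a cache by capacity, line size and ways."""
    size_bytes: int   # total capacity
    block_bytes: int  # cache line size
    ways: int         # associativity (1 = direct mapped)

    @property
    def num_sets(self) -> int:
        return self.size_bytes // (self.block_bytes * self.ways)

    def split(self, address: int) -> tuple:
        """Return (tag, set_index, block_offset) for an address."""
        offset = address % self.block_bytes
        block_number = address // self.block_bytes
        set_index = block_number % self.num_sets
        tag = block_number // self.num_sets
        return tag, set_index, offset


# Assumed example: 32 KB cache with 64 B lines.
four_way = CacheGeometry(32 * 1024, 64, 4)             # 128 sets of 4 ways
direct_mapped = CacheGeometry(32 * 1024, 64, 1)        # 512 sets of 1 way
fully_associative = CacheGeometry(32 * 1024, 64, 512)  # 1 set of 512 ways
```

With one way per set every block has exactly one legal location, with a single set every location is legal, and the set-associative organization sits between the two.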

Cache Memory Architecture in CMPs

Power Consumption in CMPs

Static Power

Dynamic Power

Short-Circuit Power

Memory Technologies used for Caches

The reason for this is the costly write operations and the weak write endurance of these technologies. More details on the methodology of the work and the preliminary concepts and characteristics of the various memory technologies are discussed in Chapter 2.

Motivations

In other words, some of the blocks within a cache set are heavily written while others are lightly written. These write hotspots, together with the poor write endurance, limit the lifetime of the non-volatile cache as a whole.

Thesis Contributions

Private Block and Prediction based Block Placement ap-

Intra-Set Write Variation mitigation using Write Restriction

This work reduces uneven writing within the cache set with negligible performance degradation and outperforms the previous works it is compared against. The strategies reduce write non-uniformity between cache sets to an extent with negligible performance degradation.

Victim Caching to Improve the Performance of NVM Caches

The results of the proposed scheme are compared with the existing hybrid cache architecture, RWHCA [28], and with the basic SRAM and STT-RAM based caches for the quad-core system. In contrast, our dynamic region-based victim partitioning approach partitions the victim cache at runtime based on the number of victim evictions from the smaller region of the hybrid cache.

Organization of Thesis

Conventional Charge Based Memory Technology

Static Random Access Memory (SRAM)
Dynamic Random Access Memory (DRAM)

As can be seen from the figure, the core of the SRAM cell contains four transistors, T1 to T4. Read operation: To access the state of the DRAM cell, a voltage is applied to the access line AL.

Emerging Non-Volatile Memory Technology

Spin Transfer Torque Random Access Memory (STT-RAM)
Phase Change Random Access Memory (PCRAM)

The SET and RESET latencies of PCRAM are higher than those of STT-RAM and DRAM [20]. Write operation: To change the state of the cell, a large voltage is applied across the bit lines.

Figure 2.4: (a) Conceptual view of STT-RAM cell (b) Schematic STT view with (1) Write ‘1’ operation (2) Write ‘0’ and Read operation (c) Parallel low resistance, representing ‘0’ state (d) Anti-parallel high resistance, representing

Challenges to Employ Emerging NVMs in the Caches

Challenges related to Write Operation

The SRAM ways in the HCA are intended to serve most of the write operations in a cache bank. The increase in miss rate in the HCA is mainly due to two reasons: (1) unevenly sized partitions of the cache, and (2) the placement of a large number of write-intensive blocks in the limited-size SRAM region.

Challenges related to Weak Write Endurance

Lifetime: The lifetime of the cache can be defined either as raw lifetime or as fault-tolerant lifetime [29]. With respect to write counts, the lifetime is inversely proportional to the maximum write count on any block of the cache [31].
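As a rough illustration of the write-count view of lifetime, the sketch below compares two write distributions by the ratio of their maximum per-block write counts. The function name and the sample counts are assumptions made for this example; they are not results from the thesis.

```python
def relative_lifetime(baseline_writes, scheme_writes):
    """Lifetime of a scheme relative to a baseline.

    Assumes lifetime is inversely proportional to the maximum write
    count observed on any single block, so the ratio of the two maxima
    gives the relative improvement.
    """
    return max(baseline_writes) / max(scheme_writes)


# Hypothetical per-block write counts with the same total (120 writes).
baseline = [100, 5, 5, 10]   # one hot block dominates
leveled = [30, 30, 30, 30]   # same writes spread uniformly

improvement = relative_lifetime(baseline, leveled)
print(f"relative lifetime: {improvement:.2f}x")
```

The total number of writes is identical in both cases; only the hottest block changes, which is why wear leveling can extend lifetime without reducing write traffic.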

Figure 2.9: Write counts inside the cache set for the different workloads in the baseline STT/ReRAM caches

Reducing the Costly Write Operations

Migration Policies
Prediction Policies
Bypass Policies
Partitioning Policies
Reconfiguration Policies

The authors of [59] proposed static and dynamic approaches based on compiler hints for block placement in the different regions of the hybrid cache. The placement of the block is based on the access patterns and the type of write.

Improving the Lifetime and the Endurance of Non-Volatile Cache

Inter-Set Wear Leveling Policies

The authors of [29], in i2WAP, change the mapping of the cache sets after a fixed number of writes, determined by the Swap Threshold. In their work, the mapping of the set is adjusted by rotating the data within the cache set.

Cell Level Wear Leveling

Miscellaneous Wear Leveling

Summary

In this part, our proposed policy exploits private blocks and avoids storing their data part in the non-volatile region of the LLC. To keep dataless entries in the STT region of the hybrid cache, we modify the standard MESI protocol.

Figure 3.1: General overview of contribution in chapter 3

Motivation and Background

Private Blocks

Reuse Distance

MESI Protocol

A block is in state E when it has not been modified but is held with exclusive permission. State I represents a block that is invalid, that is, not cached or not present.
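For reference, the textbook MESI transitions can be sketched as a small state function. This is the generic protocol only; it does not model the additional states that the thesis introduces, and the event names used here are assumptions chosen for the example.

```python
from enum import Enum


class State(Enum):
    M = "Modified"   # dirty, exclusive copy
    E = "Exclusive"  # clean, exclusive copy
    S = "Shared"     # clean, possibly other copies
    I = "Invalid"    # not present in this cache


def next_state(state, event, other_sharers=False):
    """Textbook MESI next-state function for one cache's copy of a block."""
    if event == "local_read":
        if state is State.I:
            # A read miss gets E only if no other cache holds the block.
            return State.S if other_sharers else State.E
        return state
    if event == "local_write":
        # Any local write ends in M (other copies are invalidated).
        return State.M
    if event == "remote_read":
        # A valid copy is downgraded to S; an invalid one stays invalid.
        return State.I if state is State.I else State.S
    if event == "remote_write":
        return State.I
    raise ValueError(f"unknown event: {event}")
```

For example, a read miss with no other sharers yields E, and a later local write moves the block from E to M without any bus transaction for permission.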

Proposed Hybrid Cache Architecture

Basic Organization

States added in the MESI Protocol

Design and Operation of the Reuse Distance Aware Write Intensity Predictor (RDAWIP)

Initialization Phase
Usage or Update phase

Prediction of the First Write back from L1 to L2
Augmenting the Replacement Policy
Set Sampling

ST-C: An L2 cache entry in the STT region of the cache (without any owner or sharer(s)). When the block is replaced from the L1 cache (last PUTS or PUTX) and the state of the block in the L2 cache is ST-S: The writeback.

Figure 3.6: Overview of hybrid cache architecture along with the proposed block predictor

Experimental Methodology

Simulator Setup

In addition, each L2 cache block entry consists of a 10-bit trigger instruction and a 2-bit counter that counts the number of accesses. The remote block group corresponds to blocks whose reuse distance is greater than 26.
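The two mechanisms mentioned above can be sketched as follows. The sketch assumes that reuse distance means the number of distinct other blocks touched between two accesses to the same block, and that the 2-bit counter saturates at 3; both are assumptions of this example, and the thesis may define them differently. Only the threshold of 26 comes from the text.

```python
REMOTE_THRESHOLD = 26  # from the text: remote group has reuse distance > 26
COUNTER_MAX = 3        # largest value a 2-bit counter can hold


def bump_access_counter(counter):
    """Increment a 2-bit access counter, saturating at 3 (assumed behavior)."""
    return min(counter + 1, COUNTER_MAX)


def reuse_distances(trace):
    """Return (block, distance) for every re-access in an access trace.

    Distance is the number of distinct other blocks accessed since the
    previous access to the same block (an assumed definition).
    """
    last_pos = {}
    result = []
    for pos, block in enumerate(trace):
        if block in last_pos:
            between = trace[last_pos[block] + 1:pos]
            result.append((block, len(set(between))))
        last_pos[block] = pos
    return result


def is_remote(distance):
    """Classify a block as remote when its reuse distance exceeds 26."""
    return distance > REMOTE_THRESHOLD


distances = reuse_distances(["A", "B", "C", "A", "B", "B"])
```

In the short trace above, the second access to `A` has two distinct blocks in between, while the final access to `B` has none.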

Table 3.3: Timing and energy parameters of the L2 cache and RDAWIP

Workloads

We also calculate the energy consumption of the additional circuits (reported in Table 3.3) using NVSIM. This is mainly due to two reasons: First, the fields of the cache metadata and the reuse distance table are updated simultaneously with the cache access.

Results and Analysis

Write Accesses
Energy Consumption
Performance
Misses Per Kilo Instruction
Prediction Accuracy
Iso Area Analysis
Storage Overhead
Impact of Set Sampling

Write Accesses
Energy Consumption
Performance
Misses Per Kilo Instruction
Prediction Accuracy
Storage Overhead
Discussion

Analysis on Multi-threaded Workloads with Larger Cores

The marginal increase in energy in the STT region is due to the increase in the number of hits. The write cost is reduced by the dataless entries and the prediction, while the prediction accuracy is affected by the volatile entries stored in the RDT.

Figure 3.11: Normalized LLC writes of RWHCA (R), WI (W), P and T for quad core (lower is better)

Summary

The block lifetime was estimated by considering the predictor field in the replacement decision. In the third scheme, Dynamic Way Aware Write Restriction (DWAWR), heavily written cache ways are identified and write-restricted at runtime.

Proposed Wear Leveling Techniques

Static Window Write Restriction (SWWR)

Main idea
Algorithm
Working Example
Limitation of SWWR

After the application executes I cycles, one of the windows in the cache is treated as the write-restricted window for the next interval of I cycles (line 7), and for each subsequent interval a new window is selected by rotation (lines 9 to 11). Write hit (PUTX or write-back) with the requested block B in Wi: B belongs to the selected window Wi and there are invalid line(s) in the other windows.
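The window rotation and the write-hit case described above can be sketched as below. The function names, the equal-sized windows and the fallback of writing in place when no invalid line exists are assumptions of this sketch; the text here only describes the case in which an invalid line is available in another window.

```python
from typing import List, Optional


def restricted_window(cycle: int, interval: int, num_windows: int) -> Optional[int]:
    """Window that is write-restricted at a given cycle.

    No window is restricted during the first interval; afterwards the
    restricted window rotates once per interval.
    """
    k = cycle // interval
    if k == 0:
        return None
    return (k - 1) % num_windows


def window_of(way: int, ways_per_window: int) -> int:
    """Window index of a way, assuming equal-sized contiguous windows."""
    return way // ways_per_window


def redirect_write(valid: List[bool], hit_way: int,
                   restricted: Optional[int], ways_per_window: int) -> int:
    """Return the way that should absorb a write hit on hit_way."""
    if restricted is None or window_of(hit_way, ways_per_window) != restricted:
        return hit_way  # block is not in the restricted window
    for way, is_valid in enumerate(valid):
        if not is_valid and window_of(way, ways_per_window) != restricted:
            return way  # first invalid line in another window
    return hit_way      # assumed fallback: no invalid line, write in place


# Hypothetical 16-way set split into 4 windows, with way 9 invalid.
valid_bits = [True] * 16
valid_bits[9] = False
target = redirect_write(valid_bits, hit_way=2, restricted=0, ways_per_window=4)
```

Here the write hit on way 2 falls inside restricted window 0, so it is steered to way 9, the first invalid line outside that window.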

Figure 4.2: Working example of proposed SWWR wear leveling policy

Dynamic Window Write Restriction (DWWR)

Main Idea
Algorithm
Working Example
Limitation of DWWR

Write hit (write-back or PUTX) and block B in Wi: The write request for block B in Wi is forwarded to the first invalid way(s) of the same cache set in the other windows. In the first case, a read request from the ULC to the LLC (arrow 1) is served normally (arrow 2).

Figure 4.4: Working example of proposed DWWR wear leveling policy

Dynamic Way Aware Write Restriction (DWAWR)

Main Idea
Algorithm
Working Example

It contains the list of cache ways that are treated as write-restricted in the current interval (line 5). For example, the way counters Z0 to Z15 are associated with the ways W0 to W15 of the cache.

Figure 4.6 depicts the working example of the proposed DWAWR approach. In the example, the way counters Z0 to Z15 are associated with each way W0 to W15 of the cache.
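One way to picture the per-way counters is the sketch below, which picks the most heavily written ways as the write-restricted list for the next interval. The selection rule (top-n by counter value, ties broken by lower way index) and the sample counter values are assumptions of this example, not the thesis algorithm.

```python
def select_restricted_ways(way_counters, n):
    """Return the indices of the n most heavily written ways.

    way_counters[w] plays the role of counter Zw for way Ww. Ties are
    broken in favor of the lower way index (an assumed rule).
    """
    ranked = sorted(range(len(way_counters)),
                    key=lambda w: (-way_counters[w], w))
    return sorted(ranked[:n])


# Hypothetical values of Z0..Z15 at the end of an interval.
Z = [5, 40, 3, 38, 1, 0, 7, 22, 2, 9, 4, 31, 6, 8, 0, 1]
restricted_ways = select_restricted_ways(Z, n=4)
```

With these sample counters, ways 1, 3, 7 and 11 have absorbed the most writes and would be treated as write-restricted in the following interval.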

Experimental Methodology

Simulator Setup

In addition to the write counter, a 64B swap buffer is used for the swap operation of the block within the cache set. In DWWR, the size of the counter Ci associated with each window is set to 13 bits.

Table 4.4: Timing and energy parameters for STT-RAM/ReRAM LLC

Workloads

Results and Analysis

Two Level Cache Analysis: L2-STT-RAM

Coefficient of Intra-Set Write Variation
Coefficient of Inter-Set Write Variation
Relative Lifetime
Performance
Overheads

Three Level Cache Analysis: L2-SRAM, L3-ReRAM
I2WAP versus Write Restriction
FLASH based Adaptive Wear Leveling Technique (FAWLT)

Note that negative values in the table imply an increase in inter-set write variation. The respective improvements in the coefficient of intra-set write variation show the effectiveness of the proposed approaches: SWWR, DWWR and DWAWR.

Figure 4.8: Intra-Set write variation for quad core (lower is better)

Comparative Analysis for Parameters

Change in Interval (I)
Change in Write Restricted Window/Ways (m/n)
Change in Capacity
Change in Associativity
Storage Overhead
Lifetime Comparison Analysis
Recommended Values

This in turn increases the residency of the block and increases the write variation within the set beyond that of the baseline. The recommended values for intra-set wear leveling using vertical partitions are m = n = 4 and I = 2M cycles (for dual-core) or 1M cycles (for quad-core).

Table 4.14: Comparative analysis for different interval values (I). LFT = lifetime, BASE = baseline STT-RAM, WrRes = Write Restricted

Summary

On the other hand, for a cache with larger associativity, the value of I should be reduced to control the block residency and reduce the intra-set write variation.

Intra-Set Wear Leveling Using Horizontal Partitions

Figure 5.1 gives the general overview of the proposed contribution in this chapter.

Figure 5.1: General overview of contribution in chapter 5

Proposed Wear Leveling Technique: MWWR

Architecture

A counter is needed to track the write accesses for each sub-way of the modules inside a cache bank in the current interval or in previous intervals until it is reset. In the figure, these sub-way counters are represented by the variable Cij, where i represents the module ID and j indicates the sub-way ID.

Operation

In the first case, a read request (shown by arrow-1) to sub-way 2 of set-4 is served normally by the cache (arrow-2). In the second case, the write request (shown by dotted arrow-3) from the L1 cache to set-4 and sub-way 7 is served normally (arrow-4) by the L2 cache.

Figure 5.3: Working example of the proposed MWWR wear leveling technique. (a) Status of module-2 at time-stamp t1 (b) Status of module-2 at

Experimental Methodology

Results and Analysis

Coefficient of Intra-Set Write Variation
Relative Lifetime Improvement
Effect on Performance
Effect on Energy
FLASH based Adaptive Wear Leveling Technique (FAWLT)
Effect on Invalidation
Storage Overhead

We observe a 1% degradation in the CPI of PoLF with respect to STT, due to increased allocations and evictions of MRU blocks in the cache. The reduction compared to Wsmooth is mainly due to the difference between the write redirection policy of MWWR (transfer to any position in the cache set) and the block transfer policy of Wsmooth (transfer from a hot sub-way to one cold sub-way).

Figure 5.4: Intra-Set write variation (lower is better)

Comparative Analysis for Parameters

Change in Interval (I)
Change in Write-Restricted Sub-Ways (m)
Change in Number of Modules (M)
Change in Capacity
Change in Associativity (A)
Recommended Values

This, in turn, improves lifetime and reduces write variation within the set relative to the baseline. In both cases, with higher and lower associativity, MWWR significantly reduces intra-set write variation and improves lifetime.

Summary

The previous chapter reported the wear-leveling techniques that mitigate intra-set write variation. To measure their effectiveness, the proposed strategies are evaluated against the baseline and the existing method in the quad-core system.

Introduction

Our second technique, Fellow Sets with Dynamic Reserve Part (FSDRP), builds on FSSRP in terms of the fellow sets. At runtime, a different cache window is used as the reserve part in each interval to distribute writes uniformly.

Figure 6.1: General overview of contribution in chapter 6

Proposed Wear Leveling Techniques

Fellow Sets with Static Reserve Part (FSSRP)

Architecture
Operation
Limitation

Similarly, the write bit included in each block in the NP portion of the cache is represented by bmn (line 3). Then the write counter of the set (Wi) in which the write is performed is incremented (lines 22 to 24).

Figure 6.2: Working example of FSSRP wear leveling policy

Fellow Sets with Dynamic Reserve Part (FSDRP)

Architecture
Operation

The write bit (bij) and the shift bit (rij) of the cache location in which the block is located are reset (lines 36 to 38). In the second case, a request to write (arrow 3) to the block (way-6) belonging to the NP section is redirected (arrow 4) to the window of the RP section (Win2) of the lightly written set (set-4 in our case) of the other group.

Figure 6.4: Working example of FSDRP wear leveling policy

Experimental Setup

In addition to these cycles, one more cycle is required in FSDRP to search for the block in the appropriate window of the other sets in the fellow group. Note that we considered the fast access mode of NVSIM while modeling TGS to ensure parallel access.

Results and Analysis

Coefficient of Inter-Set Write Variation
Reduction in Coefficient of Intra-Set Write Variation
Lifetime Improvement
Invalidation/flushes
Energy Overhead
Performance
Storage and Area Overhead

The reduction in the coefficient of inter-set write variation is due to a uniform write distribution across the different sets. The improvements achieved by the proposed schemes are mainly due to this reduction in the coefficient of inter-set write variation.

Figure 6.5: Inter-Set write variation of the proposed schemes, FSSRP and FSDRP, and Swap Shift against baseline STT-RAM (lower is better)

Parameter Comparison Analysis

Change in Interval (I)

Change in Group Size (m)

Change in Window or RP Size (r)

Change in Capacity (C)

Change in Associativity (A)

Recommended Values

Summary

Background and Motivation

Victim Cache

Motivation

Integration of Victim Cache With Non-Volatile Cache

Architecture

Operation

Weighted Least Recently Used Replacement Policy (WLRU)

Integration of Victim Cache with Hybrid Cache

Architecture

Access based Victim Block Placement (AVBP)

Working Example

Experimental Methodology

Simulator Setup

Workloads

Results and Analysis

Write Accesses

CPI Improvement

Effect on NVM based Main Cache
Effect on HCA based Main Cache

Execution Time Improvement

Energy Consumption

Miss Rate Improvement

Iso Area Analysis for HCA based Main Cache

Storage and Area Overhead

Parameter Comparative Analysis

Change in LLC size

Change in Number of VC entries (Vn)

Change in Bias

Impact on NVM based Main Cache
Impact on HCA based Main Cache

Change in Interval (I)

Conclusion

Scope for Future Work

GEM-5

M5

Restrictions in M5

GEMS

CMP Architecture Supported by GEMS

Result Analysis

GEM-5 Limitations

Timing and Power Modeling Tools: CACTI and NVSIM

Benchmarks

PARSEC

Benchmark Descriptions

SPEC CPU 2006

Benchmark Descriptions

Simulation Procedure

Multi-threaded vs Multi-programmed Workloads

Benchmarks Used in Our Simulations

Benchmark Running Process

Comparing Different CMP Architecture

Modern CMPs: design and floor-plan outlines
Characteristics required at each level of cache
Schematic view of SRAM cell
Representational view of DRAM cell
Representational view of PCRAM cell
Representational view of ReRAM cell
Hybrid cache bank organization
Write counts across the cache set for the baseline STT/ReRAM
Write counts inside the cache set for the different workloads in the baseline STT/ReRAM caches
Classification of write reduction techniques based on the type of
Classification on wear leveling techniques
General overview of contribution in chapter 3
Percentage of private and shared blocks brought from the main
Percentage of blocks having exclusive permission which are clean or
Reuse count example
Collected reuse distance for different workloads
Overview of hybrid cache architecture along with the proposed block predictor
Organization of Reuse Distance Aware Write Intensity Predictor
State Diagram showing the modified transactions with respect to
Organization of set sampler