7.12 Iso area analysis between the proposed HCA and basic SRAM and STT
7.13 Comparative analysis for different capacities NVM LLC (M)
7.19 Benchmarking for Different Intervals (I) for HCA Main Cache
A.1 Inherent Key Features of PARSEC Benchmarks
Modern Chip Multi-Processors (CMPs)
Moore's Law [1] gives us more and more transistors on the chip with each generation of the process. Based on the nature of the instruction, application execution on CMPs is categorized into two categories: Memory Bound and Compute Bound.
Cache Memory
Set-associative cache is considered to be the best architecture compared to direct mapped and fully associative cache.
Cache Memory Architecture in CMPs
Power Consumption in CMPs
Static Power
Dynamic Power
Short-Circuit Power
Memory Technologies used for Caches
The reason for this is the costly write operations and the weak write endurance of these technologies. More details on the methodology of the work and the preliminary concepts and characteristics of the various memory technologies are discussed in Chapter 2.
Motivations
In other words, some of the blocks within the cache set are heavily written while others are lightly written. These write hotspots, together with the poor write endurance, affect the lifetime of the non-volatile cache as a whole.
Thesis Contributions
Private Block and Prediction based Block Placement ap-
Intra-Set Write Variation mitigation using Write Restriction
This work reduces uneven writing in the region with negligible performance degradation and outperforms all previous works. The strategies reduce write non-uniformity between cache sets to an extent with negligible performance degradation.
Victim Caching to Improve the Performance of NVM Caches
The result of the proposed scheme is compared with the existing hybrid cache architecture, RWHCA [28], and with the basic SRAM and STT-RAM based caches for the quad-core system. Our dynamic region-based victim partitioning approach, in contrast, dynamically partitions the victim cache based on the number of victim evictions from the smaller region of the hybrid cache.
Organization of Thesis
Conventional Charge Based Memory Technology
- Static Random Access Memory (SRAM)
- Dynamic Random Access Memory (DRAM)
As can be seen from the figure, the core of the SRAM cell contains four transistors T1 to T4. Read Operation: To access the state of the DRAM cell, voltage is applied to the access line AL.
Emerging Non-Volatile Memory Technology
- Spin Transfer Torque Random Access Memory (STT-RAM)
- Phase Change Random Access Memory (PCRAM)
The SET and RESET latency of the PCRAM is larger than that of STT-RAM and DRAM [20]. Write operation: To change the state of the cell, a large voltage is applied across the bit lines.
Challenges to Employ Emerging NVMs in the Caches
Challenges related to Write Operation
The SRAM ways in the HCA are used to serve most of the write operations in a cache bank. This increase in miss rate in the HCA is mainly due to two reasons: (1) irregularly sized partitions of the cache, and (2) placing a larger number of write-intensive blocks in the limited-size SRAM region.
Challenges related to Weak Write Endurance
Lifetime: The lifetime of the cache can be defined either as raw lifetime or fault tolerant lifetime [29]. With respect to write counts, the lifetime is the inverse of the maximum write counts on the block of the cache [31].
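The write-count view of lifetime can be illustrated with a minimal sketch. It assumes that the raw lifetime is limited by the most heavily written block, so the lifetime of a technique relative to a baseline is the ratio of the two maximum write counts. The function name `relative_lifetime` and the sample write counts are hypothetical and are not taken from the thesis.

```python
def relative_lifetime(baseline_writes, technique_writes):
    """Relative lifetime of a technique with respect to a baseline.

    Assumption: lifetime is proportional to 1 / (maximum per-block write count),
    so the ratio of the two lifetimes is max(baseline) / max(technique).
    """
    return max(baseline_writes) / max(technique_writes)


# Hypothetical per-block write counts for the same workload.
baseline = [100, 10, 10]   # one hot block absorbs most writes
leveled = [50, 40, 30]     # same total writes, spread more evenly
print(relative_lifetime(baseline, leveled))
```

Under this assumption, spreading the same number of writes more evenly lowers the maximum write count and therefore raises the estimated lifetime.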
Reducing the Costly Write Operations
- Migration Policies
- Prediction Policies
- Bypass Policies
- Partitioning Policies
- Reconfiguration Policies
[59] proposed a static and dynamic approach based on the compiler hints for the block placement in the different regions of the hybrid cache. The placement of the block is based on the access patterns and the type of write.
Improving the Lifetime and the Endurance of Non-Volatile Cache
Inter-Set Wear Leveling Policies
[29] in i2WAP changes the mapping of the cache sets after a fixed number of writes, determined by the Swap Threshold. In their work, the mapping of the set is adjusted by rotating the data within the cache set.
Cell Level Wear Leveling
Miscellaneous Wear Leveling
Summary
In this part, our proposed policy uses private blocks and avoids storing the data part in the non-volatile region of the LLC. To keep dataless entries in the STT region of the hybrid cache, we make changes to the regular MESI protocol.
Motivation and Background
Private Blocks
Reuse Distance
MESI Protocol
A block is in state E when it has not been modified but is held with exclusive permission. State I represents a block that is invalid or not present in the cache.
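The four stable states of the regular MESI protocol can be sketched as follows. This is a simplified illustration of textbook MESI at a single private cache, not the modified protocol proposed in the thesis; the function names `on_local_write` and `on_remote_read` are hypothetical.

```python
from enum import Enum


class State(Enum):
    M = "Modified"   # dirty, only copy
    E = "Exclusive"  # clean, only copy
    S = "Shared"     # clean, other copies may exist
    I = "Invalid"    # not present or not valid


def on_local_write(state):
    """Return (next_state, needs_coherence_request) for a write by the local core."""
    if state in (State.M, State.E):
        # Exclusive permission is already held, so E upgrades to M silently.
        return State.M, False
    # S and I must first obtain exclusive permission from the directory or bus.
    return State.M, True


def on_remote_read(state):
    """Return the next state when another core reads the block."""
    if state is State.I:
        return State.I
    # M (after supplying or writing back the data), E and S all end up shared.
    return State.S


print(on_local_write(State.E))
print(on_remote_read(State.M))
```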
Proposed Hybrid Cache Architecture
- Basic Organization
- States added in the MESI Protocol
- Design and Operation of the Reuse Distance Aware Write Intensity Predictor
- Initialization Phase
- Usage or Update phase
- Prediction of the First Write back from L1 to L2
- Augmenting the Replacement Policy
- Set Sampling
ST-C: An L2 cache entry in the STT region of the cache (without any owner or sharer(s)). When the block is replaced from the L1 cache (Last PUTS or PUTX) and the state of the block in the L2 cache is ST-S: The writeback.
Experimental Methodology
Simulator Setup
In addition, each L2 cache block entry consists of a 10-bit trigger instruction and a 2-bit counter that counts the number of accesses. The remote block group corresponds to blocks whose reuse distance is greater than 26.
Workloads
We also calculate the energy consumption of the additional circuits (reported in Table 3.3) using NVSIM. This is mainly due to two reasons: First, the fields of the cache metadata and reuse distance table are updated simultaneously with the cache access.
Results and Analysis
- Write Accesses
- Energy Consumption
- Performance
- Misses Per Kilo Instruction
- Prediction Accuracy
- Iso Area Analysis
- Storage Overhead
- Impact of Set Sampling
- Write Accesses
- Energy Consumption
- Performance
- Misses Per Kilo Instruction
- Prediction Accuracy
- Storage Overhead
- Discussion
- Analysis on Multi-threaded Workloads with Larger Cores
The marginal increase in energy in the STT region is due to the increase in the number of shots. Reduced write cost from dataless entries and prediction curve affects prediction accuracy due to volatile entries stored in RDT.
Summary
The block lifetime was estimated by considering the predictor field in the replacement decision. In the third scheme, Dynamic Way Aware Write Restriction (DWAWR), heavily written cache ways are considered to restrict writes at runtime.
Proposed Wear Leveling Techniques
Static Window Write Restriction (SWWR)
- Main idea
- Algorithm
- Working Example
- Limitation of SWWR
After the application executes I cycles, one of the windows in the cache is treated as a write-restricted window for the next interval of I cycles (line 7), and a new window is selected by rotation in each subsequent interval (lines 9 to 11). Write hit (PUTX or writeback) and requested block B in Wi: If B belongs to selected Wi window with invalid line(s) in other windows.
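The rotation of the write-restricted window described above can be sketched as follows. This is a minimal illustration under stated assumptions: no window is restricted during the first I cycles, rotation starts at window 0, and the function name `restricted_window` and its parameters are hypothetical rather than the thesis's algorithm listing.

```python
def restricted_window(cycle, interval, num_windows):
    """Return the index of the write-restricted window at a given cycle.

    Assumptions: no window is restricted before the first interval has
    elapsed, and the restricted window then advances by one (round-robin)
    at every interval boundary, starting from window 0.
    """
    if cycle < interval:
        return None
    return (cycle // interval - 1) % num_windows


# Four windows, interval of 100 cycles (hypothetical values).
for c in (0, 100, 250, 500):
    print(c, restricted_window(c, interval=100, num_windows=4))
```

Because the choice depends only on the elapsed cycle count, this static scheme needs no per-window write counters, which is the point of contrast with the dynamic variants.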
Dynamic Window Write Restriction (DWWR)
- Main Idea
- Algorithm
- Working Example
- Limitation of DWWR
Write Hit (Write-back or PUTX) and block B in Wi: The write request for block B in Wi is forwarded to the first invalid way(s) of the same cache set in the other windows. In the first case, a read request from the ULC to LLC (arrow 1) is answered normally (arrow 2).
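The write redirection on a write hit can be sketched for a single cache set. The sketch assumes that ways are grouped into equally sized, contiguous windows and that the caller knows which window is currently write-restricted. The behaviour when no invalid way exists outside the restricted window is not described in this passage, so the sketch returns `None` in that case. The function name and parameters are hypothetical.

```python
def redirect_write(valid, way, restricted_window, ways_per_window):
    """Choose the way that receives a write hit within one cache set.

    valid: list of booleans, one per way (True means the way holds a valid block).
    way: the way in which block B currently resides.
    Returns the target way, or None when the block is in the restricted
    window and no invalid way exists elsewhere (fallback not modelled here).
    """
    first = restricted_window * ways_per_window
    last = first + ways_per_window  # exclusive upper bound
    if not (first <= way < last):
        return way  # block is outside the restricted window: write in place
    for candidate, is_valid in enumerate(valid):
        if not is_valid and not (first <= candidate < last):
            return candidate  # first invalid way in another window
    return None


# Eight ways, two windows of four ways; window 0 is restricted (hypothetical).
valid_bits = [True, True, True, True, True, False, True, True]
print(redirect_write(valid_bits, way=1, restricted_window=0, ways_per_window=4))
```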
Dynamic Way Aware Write Restriction (DWAWR)
- Main Idea
- Algorithm
- Working Example
It contains the list of cache ways that are treated as write-restricted in the current interval (line 5). For example, the way counters Z0 to Z15 are associated with the ways W0 to W15 of the cache.
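The selection of write-restricted ways from the per-way counters can be sketched as follows. It assumes that, at the end of an interval, the n ways with the highest counter values are treated as write-restricted in the next interval. The tie-breaking rule (lower way index first) and the function name `select_restricted_ways` are assumptions, not details taken from the thesis.

```python
def select_restricted_ways(way_counters, n):
    """Return the indices of the n most heavily written ways.

    way_counters[k] plays the role of counter Zk for way Wk.
    Ties are broken in favour of the lower way index (an assumption).
    """
    ranked = sorted(range(len(way_counters)),
                    key=lambda w: way_counters[w],
                    reverse=True)
    return sorted(ranked[:n])


# Hypothetical counter values Z0..Z15 for a 16-way cache.
z = [5, 40, 3, 38, 7, 0, 1, 2, 9, 4, 6, 8, 10, 11, 12, 13]
print(select_restricted_ways(z, n=4))
```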
Experimental Methodology
Simulator Setup
In addition to the write counter, a 64B swap buffer is used for the swap operation of the block within the cache set. In DWWR, the size of the counter Ci associated with each window is set to 13 bits.
Workloads
Results and Analysis
- Two Level Cache Analysis: L2-STT-RAM
- Coefficient of Intra-Set Write Variation
- Coefficient of Inter-Set Write Variation
- Relative Lifetime
- Performance
- Overheads
- Three Level Cache Analysis: L2-SRAM, L3-ReRAM
- I2WAP versus Write Restriction
- FLASH based Adaptive Wear Leveling Technique (FAWLT)
Note that negative values in the table indicate an increase in inter-set write variation. The respective improvements in the coefficient of intra-set write variation represent the effectiveness of the proposed approaches: SWWR, DWWR and DWAWR.
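The two coefficients can be illustrated with a minimal sketch. It assumes the usual coefficient-of-variation form (standard deviation divided by the average write count), with the inter-set value computed over per-set averages and the intra-set value averaged over the per-set deviations. The exact formulas used in the thesis are not shown in this section, so this is an illustration only; the function name `write_variation` is hypothetical.

```python
from statistics import mean, stdev


def write_variation(w):
    """Return (inter_set, intra_set) coefficients of write variation.

    w[i][j] is the write count of way j in set i.
    Both values are normalised by the average write count per block.
    """
    w_aver = mean(count for row in w for count in row)
    set_means = [mean(row) for row in w]
    inter_set = stdev(set_means) / w_aver               # spread between sets
    intra_set = mean(stdev(row) for row in w) / w_aver  # spread inside sets
    return inter_set, intra_set


# Two sets, two ways each (hypothetical write counts).
print(write_variation([[10, 10], [30, 30]]))  # uneven across sets only
print(write_variation([[0, 20], [0, 20]]))    # uneven inside sets only
```

A wear-leveling scheme that moves writes between ways of the same set lowers the second value, while a scheme that moves writes between sets lowers the first.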
Comparative Analysis for Parameters
- Change in Interval (I)
- Change in Write Restricted Window/Ways (m/n)
- Change in Capacity
- Change in Associativity
- Storage Overhead
- Lifetime Comparison Analysis
- Recommended Values
This in turn increases the residency of the block and increases the write variation within the set more than in the reference case. The recommended values for intra-set wear leveling using vertical partitions are m = n = 4 and I = 2M cycles (for dual-core) or 1M cycles (for quad-core).
Summary
On the other hand, in the case of cache with larger associativity, the value of I should be reduced to control the block residency and reduce the intra-set variation.
Intra-Set Wear Leveling Using Horizontal Partitions
Figure 5.1 gives the general overview of the proposed contribution in this chapter.
Proposed Wear Leveling Technique: MWWR
Architecture
is needed to track the write accesses for each sub-way of the modules inside a cache bank in the current interval or in previous intervals until it is reset. In the figure, these sub-way counters are represented by variable Cij, where i represents the module ID and j indicates the sub-way ID.
Operation
In the first case, a read request (shown by arrow-1) to sub-way 2 of set-4 is normally served by the cache (arrow-2). In the second case, the write request (shown by dotted arrow-3) from the L1 cache to set-4 and sub-way 7 is normally served (arrow-4) by the L2 cache.
Experimental Methodology
Results and Analysis
- Coefficient of Intra-Set Write Variation
- Relative Lifetime Improvement
- Effect on Performance
- Effect on Energy
- FLASH based Adaptive Wear Leveling Technique (FAWLT)
- Effect on Invalidation
- Storage Overhead
We observe 1% degradation in CPI with respect to STT of Polf due to increased allocations and evictions of MRU blocks in the cache. The reduction compared to Wsmooth is mainly due to the difference between the write redirection policy of MWWR (transfer to any position of the cache set) and the block transfer policy of Wsmooth (transfer from a hot sub-way to one cold sub-way).
Comparative Analysis for Parameters
- Change in Interval (I)
- Change in Write-Restricted Sub-Ways (m)
- Change in Number of Modules (M)
- Change in Capacity
- Change in Associativity (A)
- Recommended Values
This, in turn, improves lifetime and reduces write variation within the set more than the reference case. In both cases, with higher and lower associativity, MWWR significantly reduces intra-set write variation and improves lifetime.
Summary
The previous chapter reported the wear-leveling techniques to mitigate intra-set write variation. To measure their effectiveness, the proposed strategies are evaluated against the baseline and the existing method in the quad-core system.
Introduction
Our second technique, Fellow Sets with Dynamic Reserve Part (FSDRP), builds on FSSRP in terms of the fellow sets. At runtime, a different cache window is used as the reserve part at a given interval to distribute writes uniformly.
Proposed Wear Leveling Techniques
Fellow Sets with Static Reserve Part (FSSRP)
- Architecture
- Operation
- Limitation
Similarly, the write bit included in each block in the NP portion of the cache is represented by bmn (line 3). Then the write counter of the set (Wi) in which the write is performed is incremented (lines 22 to 24).
Fellow Sets with Dynamic Reserve Part (FSDRP)
- Architecture
- Operation
The write bit (bij) and the shift bit (rij) of the cache location in which the block is located are reset (lines 36 to 38). In the second case, a request to write (arrow 3) to the block (way-6) belonging to the NP section is redirected (arrow 4) to the window of the RP section (W in2) of the lightly written set (set-4 in our case) of the other group.
Experimental Setup
In addition to these cycles, an additional cycle is required in FSDRP to search for the block in the appropriate window of the other sets in the fellow group. Note that we considered the fast access mode of NVSIM while modeling TGS to ensure parallel access.
Results and Analysis
- Coefficient of Inter-Set Write Variation
- Reduction in Coefficient of Intra-Set Write Variation
- Lifetime Improvement
- Invalidation/flushes
- Energy Overhead
- Performance
- Storage and Area Overhead
The reduction in the coefficient of write variation between sets is due to a uniform write distribution across the different sets. These improvements by the proposed schemes are mainly due to the reduction in the coefficient of write variation between sets.
Parameter Comparison Analysis
Change in Interval (I)
Change in Group Size (m)
Change in Window or RP Size (r)
Change in Capacity (C)
Change in Associativity (A)
Recommended Values
Summary
Background and Motivation
Victim Cache
Motivation
Integration of Victim Cache With Non-Volatile Cache
Architecture
Operation
Weighted Least Recently Used Replacement Policy (WLRU)
Integration of Victim Cache with Hybrid Cache
Architecture
Access based Victim Block Placement (AVBP)
Working Example
Experimental Methodology
Simulator Setup
Workloads
Results and Analysis
Write Accesses
CPI Improvement
- Effect on NVM based Main Cache
- Effect on HCA based Main Cache
Execution Time Improvement
- Effect on NVM based Main Cache
- Effect on HCA based Main Cache
Energy Consumption
- Effect on NVM based Main Cache
- Effect on HCA based Main Cache
Miss Rate Improvement
- Effect on NVM based Main Cache
- Effect on HCA based Main Cache
Iso Area Analysis for HCA based Main Cache
Storage and Area Overhead
Parameter Comparative Analysis
Change in LLC size
Change in Number of VC entries (Vn)
Change in Bias
- Impact on NVM based Main Cache
- Impact on HCA based Main Cache
Change in Interval (I)
Conclusion
Scope for Future Work
GEM-5
M5
Restrictions in M5
GEMS
CMP Architecture Supported by GEMS
Result Analysis
GEM-5 Limitations
Timing and Power Modeling Tools: CACTI and NVSIM
Benchmarks
PARSEC
- Benchmark Descriptions
SPEC CPU 2006
- Benchmark Descriptions
Simulation Procedure
Multi-threaded vs Multi-programmed Workloads
Benchmarks Used in Our Simulations
Benchmark Running Process
Comparing Different CMP Architecture
- Modern CMPs: design and floor-plan outlines
- Characteristics required at each level of cache
- Schematic view of SRAM cell
- Representational view of DRAM cell
- Representational view of PCRAM cell
- Representational view of ReRAM cell
- Hybrid cache bank organization
- Write counts across the cache set for the baseline STT/ReRAM
- Write counts inside the cache set for the different workloads in the
- Classification of write reduction techniques based on the type of
- Classification on wear leveling techniques
- General overview of contribution in chapter 3
- Percentage of private and shared blocks brought from the main
- Percentage of blocks having exclusive permission which are clean or
- Reuse count example
- Collected reuse distance for different workloads
- Overview of hybrid cache architecture along with the proposed block
- Organization of Reuse Distance Aware Write Intensity Predictor
- State Diagram showing the modified transactions with respect to
- Organization of set sampler