Design and Operation of the Reuse Distance Aware Write

3.3 Proposed Hybrid Cache Architecture


Chapter 3. Reducing Write Cost by Dataless Entries and Prediction 67

• When a block receives the write-back response (PUTX, WB DATA) and the state of the block in the L2 cache is ST-D: In this case, the block is migrated from STT to SRAM in two steps.

First, the SRAM region is searched for an invalid entry. If there is none, the LRU victim is selected from the same cache set in the SRAM region, and a write-back of the victim to the next level of memory is scheduled. Second, the write-back response is redirected to the available entry in the SRAM region by transferring the tag from the STT-RAM region to the SRAM region. Afterward, the block is invalidated in STT-RAM. Once the write-back operation is performed, the state of the block in the SRAM region changes to SR-C. The state SR-C can act as the S or I state (in case of WB DATA) or the M state (in case of PUTX) of the MESI protocol. For simplicity, we denote all of these by the single state SR-C in the state diagram.
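The two migration steps can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in the thesis: the `Entry` class, the `migrate_to_sram` function, the `write_back` callback, and the representation of one cache set as Python lists ordered from least to most recently used are all hypothetical names introduced here.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass(eq=False)  # identity comparison, so list.remove() targets one object
class Entry:
    tag: Optional[int] = None   # None marks an invalid entry
    state: str = "I"


def migrate_to_sram(tag: int, stt_set: List[Entry], sram_set: List[Entry],
                    write_back: Callable[[int], None]) -> Entry:
    """Move the ST-D block `tag` from the STT ways to the SRAM ways of one set.

    Assumption: sram_set is ordered from least to most recently used.
    """
    # Step 1: look for an invalid SRAM entry; otherwise evict the LRU victim.
    target = next((e for e in sram_set if e.tag is None), None)
    if target is None:
        target = sram_set[0]
        write_back(target.tag)  # schedule the victim's write-back to the next level
    # Step 2: transfer the tag to SRAM, then invalidate the STT copy.
    source = next(e for e in stt_set if e.tag == tag and e.state == "ST-D")
    target.tag, target.state = tag, "SR-C"
    source.tag, source.state = None, "I"
    # Treat the migrated entry as the most recently used one.
    sram_set.remove(target)
    sram_set.append(target)
    return target
```

Passing the write-back as a callback keeps the sketch independent of how the next memory level is modelled.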

The benefit of these additional states (shown in Fig. 3.7) is that they represent the dataless entries (P) in the STT region and show the states of the MESI protocol in an abstract manner (I, ST-S, ST-C, ST-D, and SR-C) according to the different regions of the HCA. The overhead of the state diagram is that the dataless and intermediate states must be incorporated into the directory structure of the protocol, which adds a few bits, 5 cycles to search the directory, and the extra changes required to maintain the dataless entries. We have considered all of these overheads in the simulations.

3.3.2 Design and Operation of the Reuse Distance Aware Write Intensity Predictor


Figure 3.8: Organization of Reuse Distance Aware Write Intensity Predictor

[Figure content: each Cache MetaData (CMD) entry holds the fields Train, Read Usage, Read Overflow, Write Usage, Write Overflow, Reuse counter, RDT Pointer, and Reuse Distance. Each Reuse Distance Table (RDT) entry holds the fields Hashed Address, Read Usage, Read Overflow, Write Usage, Write Overflow, Pointer, Reuse Category, and Confidence Counter.]

incur one write to the STT region. These writes can be avoided by predicting which blocks will eventually migrate to SRAM. To this end, we propose a predictor based on the reuse distance and write intensity of blocks.

Depending on the reuse category (short, medium, or large), the decision to redirect the write to the SRAM region is taken at the first write-back for such blocks. In other words, when a dataless block receives its first write-back from the L1 cache, the predictor redirects it to SRAM if the block is predicted to be heavily written. Otherwise, the block is written to the STT region and later migrates to SRAM if it incurs more writes. The performance of this policy depends on the prediction accuracy, which is analyzed in Section 3.5.5.
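The placement decision at the first write-back can be sketched as below. The text does not state the exact rule that marks a block as heavily written, so everything in this sketch is an assumption made for illustration: the per-category thresholds in `WRITE_THRESHOLD`, the use of a write usage count and a write overflow bit as inputs, and the function names.

```python
from typing import Dict

# Assumption: placeholder thresholds per reuse category, not values from the text.
WRITE_THRESHOLD: Dict[str, int] = {"short": 2, "medium": 4, "large": 6}


def predict_write_intensive(write_usage: int, write_overflow: bool,
                            reuse_category: str) -> bool:
    """Hypothetical rule: overflowed or above the threshold of its category."""
    return write_overflow or write_usage >= WRITE_THRESHOLD[reuse_category]


def first_writeback_target(write_usage: int, write_overflow: bool,
                           reuse_category: str) -> str:
    """Region that receives the first write-back of a dataless block."""
    if predict_write_intensive(write_usage, write_overflow, reuse_category):
        return "SRAM"   # redirect: skips the write to STT and the later migration
    return "STT"        # default: may still migrate later if more writes arrive
```

A block routed to "STT" here is not lost to the policy: it follows the migration path described earlier if it turns out to be written often.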

The Reuse Distance Aware Write Intensity Predictor (RDAWIP) mechanism uses two additional data structures: the Reuse Distance Table (RDT) and the Cache MetaData (CMD). The composition of these two data structures is shown in Fig. 3.8. Note that both structures are implemented in SRAM.

The RDT is a table with a limited number of entries, indexed by applying the H3 hash function [114] to a subset of bits from the program counter (PC) and the byte offset of the memory reference address. The PC-offset combination is used because of its high coverage and accuracy across different workloads [115, 116, 117].
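As an illustration of the indexing step, the sketch below implements a hash from the H3 family, in which the output is the XOR of randomly chosen rows selected by the set bits of the key. The bit widths (10 PC bits, a 6-bit byte offset, an 8-bit index), the seed, and the way the PC bits and offset are concatenated are assumptions chosen for the example; the text does not specify them.

```python
import random
from typing import List


def make_h3_matrix(in_bits: int, out_bits: int, seed: int = 0) -> List[int]:
    """One random out_bits-wide row per input bit (the H3 matrix)."""
    rng = random.Random(seed)
    return [rng.getrandbits(out_bits) for _ in range(in_bits)]


def h3_hash(key: int, matrix: List[int]) -> int:
    """XOR together the matrix rows selected by the set bits of key."""
    h = 0
    for i, row in enumerate(matrix):
        if (key >> i) & 1:
            h ^= row
    return h


def rdt_index(pc: int, addr: int, matrix: List[int],
              pc_bits: int = 10, offset_bits: int = 6) -> int:
    """Hash a subset of PC bits concatenated with the byte offset of addr."""
    pc_part = pc & ((1 << pc_bits) - 1)        # assumption: low-order PC bits
    offset = addr & ((1 << offset_bits) - 1)   # assumption: 64-byte blocks
    return h3_hash((pc_part << offset_bits) | offset, matrix)
```

For example, `rdt_index(pc, addr, make_h3_matrix(16, 8))` yields an index into a 256-entry table. In hardware the same function reduces to one XOR tree per output bit.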

Each entry in the RDT stores the reuse category and a confidence counter, in addition to the read/write counters: Read Usage (RU), Read Overflow (RO), Write Usage (WU), and Write Overflow (WO). In particular, each RDT entry stores the read and write behavior of the hashed address according to its reuse distance.

The other data structure used in our technique is the Cache MetaData (CMD). The CMD is a table that stores metadata for each block in the cache. Each CMD entry contains the following fields: read/write counters (Read Usage, Write Usage, Read Overflow, and Write Overflow), a train bit, an RDT pointer, and reuse information (Reuse counter and Reuse Distance). These fields are used as follows:

• The read/write counters capture the read/write behavior of the associated cache block during execution.

• The reuse information fields store the reuse distance of the block.

• The RDT pointer links the associated cache block to its RDT entry.

• The train field identifies the cache entry involved in initializing or updating the corresponding RDT entry.

A CMD entry is used to populate, update, and verify the data stored in the RDT according to the accesses made during the lifetime of the block. When the block is evicted, the fields of its CMD entry are reset to zero.
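The two record layouts can be written down as plain data classes. This is only a sketch of the fields named in the text and in Fig. 3.8; the field types, the Python names, and the `reset` helper are assumptions, and field widths are deliberately left out because the text does not give them.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class RDTEntry:
    hashed_address: int
    read_usage: int = 0
    read_overflow: bool = False
    write_usage: int = 0
    write_overflow: bool = False
    pointer: Optional[int] = None          # listed in Fig. 3.8; use not described here
    reuse_category: Optional[str] = None   # "short", "medium" or "large"
    confidence: int = 0


@dataclass
class CMDEntry:
    train: bool = False
    read_usage: int = 0
    read_overflow: bool = False
    write_usage: int = 0
    write_overflow: bool = False
    reuse_counter: int = 0
    reuse_distance: int = 0
    rdt_pointer: Optional[int] = None      # index of the linked RDT entry

    def reset(self) -> None:
        """Clear every field when the associated block is evicted."""
        for f in fields(self):
            setattr(self, f.name, f.default)
```

Keeping the reset in one place mirrors the eviction rule above: no stale counter or pointer survives into the next block that occupies the same cache entry.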

3.3.2.1 Initialization Phase

The initialization phase of the RDT starts when the entry selected by the H3 hash of the PC-offset combination is not found in the RDT. In this case, a new entry is created in the RDT, and the CMD of the block is mapped to this newly created entry. The address whose PC maps to the entry is then used to update the values of its counters. In other words, that address/block is responsible for initializing the RDT entry for prediction.

Whenever the block linked to the RDT entry incurs a read/write request, the corresponding counters in the RDT entry are updated. If a count reaches saturation, its corresponding overflow bit is set. Note that the reuse information of the block is maintained in its CMD. The initialization phase of the RDT entry stops when the block linked to it is evicted from the cache. At that time, the reuse information in the CMD is used to fill in the reuse category in the RDT.
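The initialization phase can be sketched as a replay of the accesses made by the linked block, followed by the categorization at eviction. The counter width (`COUNTER_MAX`), the reuse-distance bounds (`SHORT_MAX`, `MEDIUM_MAX`), and the dictionary layout are assumptions for illustration only; the text does not give these values.

```python
from typing import Dict, Iterable, Tuple

COUNTER_MAX = 7                  # assumption: 3-bit saturating usage counters
SHORT_MAX, MEDIUM_MAX = 16, 64   # assumption: reuse-distance category bounds


def bump(count: int, overflow: bool) -> Tuple[int, bool]:
    """Saturating increment; the overflow bit is set once the count saturates."""
    count = min(count + 1, COUNTER_MAX)
    return count, overflow or count == COUNTER_MAX


def categorize(reuse_distance: int) -> str:
    if reuse_distance <= SHORT_MAX:
        return "short"
    return "medium" if reuse_distance <= MEDIUM_MAX else "large"


def train(accesses: Iterable[str], reuse_distance: int) -> Dict[str, object]:
    """Replay 'R'/'W' accesses of the linked block, then close at eviction."""
    entry = {"RU": 0, "RO": False, "WU": 0, "WO": False, "category": None}
    for kind in accesses:
        if kind == "R":
            entry["RU"], entry["RO"] = bump(entry["RU"], entry["RO"])
        else:
            entry["WU"], entry["WO"] = bump(entry["WU"], entry["WO"])
    # At eviction, the reuse information kept in the CMD fills the category.
    entry["category"] = categorize(reuse_distance)
    return entry
```

For instance, `train("WWWWWWWWWRR", 40)` saturates the write counter, sets its overflow bit, records two reads, and labels the entry "medium" under the assumed bounds.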

3.3.2.2 Usage or Update Phase

During execution, several hashed-PC addresses may map to a given RDT entry. However, only one of them is used to update the RDT: if an RDT entry is in the initialization phase, other blocks that map to the same entry do not take part in its initialization. Once the initialization phase is over, new blocks mapping to that RDT entry copy its data into their CMD. With each read or write request to such a block, the respective counter in the CMD is decremented. When a CMD counter reaches saturation and its associated read/write overflow bit is set, the block is assumed to be read/write-intensive, and no further changes to the RDT are required.

However, the RDT may still be updated when the read/write overflow bit is not set. In this case, when the read/write usage counter of the CMD exceeds the predicted value (that is, the counter becomes zero), the update phase starts: the train bit is set, and the read/write counts of the RDT are updated (incremented, or the read/write overflow bit is set if the count reaches saturation).
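The countdown and the switch into the update phase can be sketched for the write side as follows (reads would be handled symmetrically). The `UsageTracker` class, the dictionary keys, and the 3-bit counter width are assumptions introduced for this example, as is the choice to start training on the first write after the copied count has reached zero.

```python
from typing import Dict

COUNTER_MAX = 7  # assumption: 3-bit saturating usage counters


class UsageTracker:
    """Write-side view of one CMD entry linked to an initialized RDT entry."""

    def __init__(self, rdt: Dict[str, object]) -> None:
        self.rdt = rdt
        self.remaining = rdt["WU"]   # predicted write count copied into the CMD
        self.overflow = rdt["WO"]    # copied overflow bit
        self.train = False

    def on_write(self) -> None:
        if self.overflow:
            return                   # write-intensive block: RDT left unchanged
        if self.remaining > 0:
            self.remaining -= 1      # still within the predicted count
            return
        # The block has exceeded the prediction: enter the update phase.
        self.train = True
        self.rdt["WU"] = min(self.rdt["WU"] + 1, COUNTER_MAX)
        if self.rdt["WU"] == COUNTER_MAX:
            self.rdt["WO"] = True
```

With a predicted count of 2, the first two writes only decrement the CMD copy; the third write sets the train bit and raises the RDT count to 3.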

When the corresponding block is evicted, the RDT holds the new read/write counts reflecting the behavior of the evicted entry. The removed CMD entry also maintains the reuse distance information, which is used to verify the reuse category stored in the RDT. If the reuse category matches the reuse information, the confidence counter is incremented; otherwise, it is decremented. If the confidence counter becomes zero, the reuse category of the RDT entry is recomputed from the reuse information stored in the CMD.
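The verification step at eviction can be sketched as a small saturating confidence update. The function name, the dictionary keys, and the upper bound `conf_max` (a 2-bit counter) are assumptions; the text also does not say whether the confidence counter is re-initialized after a recategorization, so this sketch leaves it at zero.

```python
from typing import Dict


def verify_on_eviction(entry: Dict[str, object], observed_category: str,
                       conf_max: int = 3) -> Dict[str, object]:
    """Compare the stored reuse category with the one observed via the CMD."""
    if entry["category"] == observed_category:
        entry["confidence"] = min(entry["confidence"] + 1, conf_max)
    else:
        entry["confidence"] = max(entry["confidence"] - 1, 0)
        if entry["confidence"] == 0:
            # Confidence exhausted: adopt the category observed for this block.
            entry["category"] = observed_category
    return entry
```

Starting from `{"category": "short", "confidence": 1}`, one matching eviction raises the confidence to 2, and two mismatching evictions with "large" bring it back to 0 and switch the category to "large".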
