2.3 Reducing Cache Leakage Consumption
2.3.1 Off-Line Techniques
Off-line cache tuning, also termed static cache tuning, requires designers to evaluate the applications’ structures and the system configuration before execution in order to determine the cache configuration, which then remains unchanged throughout execution. Mostly, cache sizes and other configuration parameters are determined off-line through analytical formulations [43, 44, 45, 46, 47, 48, 49, 50, 51, 52] or by evaluating the compilation process [53, 54, 55]. The analytical formulations for cache size determination use mathematical models that can quickly compute the cache misses [4]. Predicting the cache size at design time in this way reduces the run-time cache tuning overhead. Hence, this section discusses some off-line cache size prediction and tuning techniques.
2.3.1.1 Application Structure Based
The spatial and temporal behaviour of an application determines its run-time cache access patterns, which are profiled from the characteristics of the application’s branch and control instructions. In their work, Ghosh et al. [47] generated cache miss equations to summarise the cache access behaviour of a loop and its variables. However, the direct computation of cache misses from these equations is an NP-complete problem. Later, in 2004, Vera et al. proposed an approximation algorithm to estimate the cache misses during execution [44]. Both of these methods, however, are limited to applications whose source code contains perfectly nested loops. In another work [45], Presburger formulas are used to extend the model of [47] to include some non-linear array structures.
Building upon this, Kim et al. [56] proposed an energy-efficient memory estimation model (as an extension of [45]) for some video encoding applications, which shows 70−80% accuracy with respect to simulation results.
In the same time-frame, Harper et al. proposed an approximation model that estimates cache misses for any loop structure [43], whereas a near-optimal cache size determination was achieved by another algorithm proposed later [46] through loop statistics extraction. However, this model has very limited applicability in modern multi-tasking environments, where thread/task interleaving and unpredictable loop structures are common. In fact, the models proposed in [47, 44, 45, 4] are too complex to implement in practical environments; hence, these methods are not widely used nowadays [57].
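To make the analytical flavour of these formulations concrete, the short sketch below (our own illustration, not taken from any of the cited models) contrasts a closed-form estimate of compulsory misses for a single sequential sweep over an array with a trace-driven count on a direct-mapped cache; the array and cache parameters are arbitrary assumptions.

```python
# Minimal sketch (illustrative only): analytical vs. simulated miss count
# for one sequential sweep over an array that starts at a line boundary.
import math

def analytical_misses(n_elems, elem_size, line_size):
    """Compulsory misses for one sequential sweep over an aligned array."""
    return math.ceil(n_elems * elem_size / line_size)

def simulated_misses(n_elems, elem_size, line_size, n_lines):
    """Count misses of the same sweep on a direct-mapped cache."""
    cache = [None] * n_lines              # tag stored per cache line
    misses = 0
    for i in range(n_elems):
        addr = i * elem_size
        line_addr = addr // line_size
        idx, tag = line_addr % n_lines, line_addr // n_lines
        if cache[idx] != tag:
            cache[idx] = tag
            misses += 1
    return misses

if __name__ == "__main__":
    # 4096 x 4-byte elements, 64-byte lines, 256-line direct-mapped cache
    print(analytical_misses(4096, 4, 64))      # 256
    print(simulated_misses(4096, 4, 64, 256))  # 256 (single sweep: no conflicts)
```

Real formulations such as cache miss equations handle loop nests, reuse across iterations and conflict misses; the point here is only that, for simple access patterns, the miss count can be obtained in closed form without running the trace.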
2.3.1.2 Memory Access Trace Driven
Most of the analytical formulations that estimate or determine a (sub-)optimal cache size are similar to the ones discussed in Section 2.3.1.1. However, these methods are usually formulated upon the memory reuse distance between two successive accesses to the same memory location. Ding and Zhong predicted whole-program behaviour with 94% accuracy by analysing data reusability [58]. Their algorithm combines LRU stack distance analysis, pattern recognition and distance-based sampling to predict the memory footprint of a process; the idea of reuse distance itself was introduced earlier by Mattson et al. in 1970 [51].
Later, for CMP architectures, a different reuse distance prediction model was proposed in [59], which uses the accesses of individual threads and the way they interleave at the shared cache space. Although this model predicts the cache misses of individual threads, it does not explicitly model cache contention. Later, in 2009, fine-grained cache contention was predicted [49] for a multi-threaded environment with a prediction accuracy of 92%; this model also uses the system-wide CPI to predict cache accesses in the absence of cache contention. In another study [48], the authors predicted cache miss rates in a CMP with a high accuracy of 98% by exploiting hardware performance counters while running an application on real processor cores. A low-cost profiling technique was further developed to analyse the full execution with a guaranteed precision [52]. All of these techniques consider cache contention models, but data sharing among threads was given less importance. In a later work by Ding and Chilimbi [60], thread interleaving and cache contention are studied together, predicting cache usage with an accuracy of more than 90%. Shi et al. further studied data replication across the private and shared caches [61] while predicting cache access behaviour, with an error margin of less than 9% using only about 8% of the simulation time. However, all of these methods struggle with some modern applications whose cache access behaviour changes as execution proceeds.
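As a concrete illustration of the reuse-distance idea underlying these trace-driven models, the following minimal sketch (our own simplified example, not any of the cited algorithms, and deliberately using a naive quadratic computation rather than efficient tree- or sampling-based versions) derives per-access reuse distances from a memory access trace. For a fully associative LRU cache holding C blocks, an access hits exactly when its reuse distance is smaller than C; first uses (infinite distance) are compulsory misses.

```python
# Naive reuse-distance sketch (illustrative only): the reuse distance of an
# access is the number of *distinct* addresses touched since the previous
# access to the same address; "inf" marks a first use.
def reuse_distances(trace):
    last_pos = {}        # address -> index of its most recent access
    dists = []
    for i, addr in enumerate(trace):
        if addr in last_pos:
            # distinct addresses accessed strictly between the two uses
            window = set(trace[last_pos[addr] + 1 : i])
            dists.append(len(window))
        else:
            dists.append(float("inf"))
        last_pos[addr] = i
    return dists

trace = ["a", "b", "c", "a", "b", "d", "a"]
print(reuse_distances(trace))   # [inf, inf, inf, 2, 2, inf, 2]
# With a 2-entry fully associative LRU cache, every access above would miss;
# with a 3-entry cache, the accesses with distance 2 would hit.
```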
2.3.1.3 Compiler Profiling/OS Level Approaches
Apart from analytical prediction of cache behaviour for energy reduction, the compiler can also assist in reducing leakage consumption by anticipating the cache requirement through code analysis. Zhang et al. in 2002 designed a compiler-based leakage reduction technique for instruction caches [54] that reduces leakage by more than 40% on average. The technique has two parts: (a) a conservative approach, which detects dead instructions and puts the corresponding cache lines into a low-leakage mode; and (b) an optimistic approach, which puts cache lines into a low-leakage mode when the next instruction is detected as dead. The cache leakage power is reduced by putting the lines into either a state-preserving or a state-destroying mode, whichever is appropriate. In another work [55], code is restructured at compile time for array-based and pointer-intensive applications to lower data cache energy, which is beneficial mostly for heavy stand-alone applications. This policy saves 59% leakage with a performance penalty of close to 4%.
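The conservative, dead-line idea can be illustrated with the following idealised sketch (a hypothetical approximation of our own, not the published implementation): assuming the full trace of touched cache lines is known, as a compiler pass would approximate through code analysis, each cache line becomes a candidate for a low-leakage mode immediately after its last access.

```python
# Idealised off-line sketch (illustrative only): once the last access to a
# cache line in the trace has occurred, the line is considered dead and can
# be placed into a low-leakage (state-preserving or state-destroying) mode.
def dead_line_schedule(line_trace):
    """line_trace: sequence of cache-line ids touched by the instruction stream.
    Returns, for each trace position, the lines that become dead right after it."""
    last_use = {}
    for pos, line in enumerate(line_trace):
        last_use[line] = pos                      # keep only the final access
    dead_after = {pos: set() for pos in range(len(line_trace))}
    for line, pos in last_use.items():
        dead_after[pos].add(line)                 # sleep candidates after this access
    return dead_after

trace = [0, 1, 0, 2, 1, 3]
for pos, dead in dead_line_schedule(trace).items():
    if dead:
        print(f"after access {pos}: put lines {sorted(dead)} into low-leakage mode")
# after access 2: line 0; after 3: line 2; after 4: line 1; after 5: line 3
```

A real compiler pass cannot see the exact trace and must instead prove deadness (or guess it optimistically) from control-flow and loop analysis, which is where the conservative/optimistic distinction of [54] comes from.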
Reddy and Petrov designed a profile-based off-line algorithm [62] that identifies a beneficial cache partitioning. At every context switch, the OS selects the appropriate configuration, which is kept unchanged until the next context switch, and the cache partitions are activated or deactivated according to the selected configuration.
Leakage savings of 40−80% were reported in this article. Wang and Mishra also accumulate the execution profile from prior executions and generate heuristics accordingly [63]. These heuristics are used at run-time to select the active cache portions. The method targets a multi-tasking soft real-time environment and achieves a significant leakage reduction of 32−49% while maintaining performance. In 2011, Wang et al. proposed a better algorithm [53] that judiciously finds a cache configuration by analysing static profiles of previously executed applications. This policy works for both private and shared caches, and saves up to 30% of leakage. As leakage power has a quadratic relationship with temperature, Noori et al. developed a Temperature Aware Configurable Cache (TACC) [64] that shrinks the cache at high temperature and expands it again at low temperature, with a slight performance penalty. The reduced cache temperature noticeably reduces the cache leakage, by more than 60% for higher-level caches at its best.
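The following hypothetical fragment sketches the spirit of such a temperature-aware resizing policy; the temperature thresholds and way counts are invented for illustration and are not taken from the TACC paper.

```python
# Hypothetical sketch (illustrative only): at high temperature fewer cache
# ways stay powered to cut leakage; at low temperature the cache is expanded
# again to recover performance.
def active_ways(temp_celsius, total_ways=8):
    if temp_celsius >= 85:       # hot: keep only a quarter of the ways powered
        return max(1, total_ways // 4)
    elif temp_celsius >= 65:     # warm: keep half of the ways powered
        return total_ways // 2
    else:                        # cool: full cache for best performance
        return total_ways

for t in (45, 70, 90):
    print(t, "C ->", active_ways(t), "active ways")   # 8, 4, 2
```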
Most of the off-line techniques mentioned above reduce cache leakage consumption; however, they have some limitations [4]:
• Most of the time, off-line techniques need prior information about the applications that are to be executed.
• Analytical methods are complex and may not predict well for some modern applications whose cache accesses change from one execution to the next.
• Compiler-based methods need detailed analysis of the source code, which may slow down the compilation process.
• Additionally, violations of the locality of reference in modern long-running applications demand on-line algorithms that better anticipate future cache footprints in order to save leakage accordingly.
Hence, architects have moved to on-line solutions for better tuning of caches during execution.