Chapter 2
Background
As mentioned in Chapter 1, the LLCs in modern CMPs are usually shared and are the largest on-chip caches. Apart from contributing a substantial share of leakage consumption, the LLC largely determines the overall performance of a CMP [11, 36]. The main aim of this thesis is to design cache-based energy/thermal-efficient techniques while treating performance as a system-wide constraint. Hence, before discussing the state-of-the-art energy/thermal-efficient techniques developed over the decades, we first discuss some preliminary concepts regarding shared LLCs that are relevant to our work.
Figure 2.1: Variable bank usages across different benchmarks in a TCMP as shown in Figure 1.2(a).
2.1.1 Bank Level Granularity
The runtime cache accesses across the banks are unevenly distributed in a multi-banked SNUCA-based cache architecture. For four PARSEC [6] applications, we derived the bank accesses shown in Figure 2.1 for a TCMP architecture having 16 L2 banks. We have also plotted the same for three PARSEC applications in Figure 2.2 for a CCMP having 64 L2 banks. These figures show noticeable diversity in cache accesses across the applications at a given time-stamp during execution.
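A minimal sketch of how such per-bank access counts could be gathered is shown below. In an SNUCA-style organisation a fixed field of the block address selects the home bank; the bit positions, bank count and counter names used here are illustrative assumptions rather than the exact parameters of the evaluated TCMP.

```cpp
#include <array>
#include <cstdint>
#include <cstdio>

// Illustrative SNUCA-style parameters (assumptions, not the exact
// configuration of the evaluated TCMP).
constexpr unsigned kNumBanks        = 16;  // 16 L2 banks, as in Figure 2.1
constexpr unsigned kBlockOffsetBits = 6;   // 64-byte cache blocks

// In SNUCA the home bank of a block is a fixed function of its address:
// here, the log2(kNumBanks) bits just above the block offset.
unsigned bankOf(uint64_t paddr) {
    return (paddr >> kBlockOffsetBits) & (kNumBanks - 1);
}

// One access counter per bank, reset at the start of every remap-period.
std::array<uint64_t, kNumBanks> bankAccesses{};

// Called on every L1 miss that is forwarded to the shared L2.
void recordL2Access(uint64_t paddr) {
    ++bankAccesses[bankOf(paddr)];
}

int main() {
    // A skewed synthetic address stream: most misses fall into bank 3,
    // mimicking the non-uniform profiles of Figures 2.1 and 2.2.
    for (uint64_t i = 0; i < 1000; ++i)
        recordL2Access((3ULL << kBlockOffsetBits) + (i << 10));  // bank 3
    for (uint64_t i = 0; i < 50; ++i)
        recordL2Access((7ULL << kBlockOffsetBits) + (i << 10));  // bank 7

    for (unsigned b = 0; b < kNumBanks; ++b)
        std::printf("bank %2u: %llu accesses\n", b,
                    (unsigned long long)bankAccesses[b]);
}
```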
Figure 2.2: Non uniform distribution of cache bank accesses in a CCMP like Figure 1.2(b); panels: (a) black16, (b) body16, (c) fluid16.
According to Figures 2.1 and 2.2, each application has a different cache space requirement, and the cache banks that receive frequent data accesses are not fixed across the applications. Therefore, which bank(s) are accessed the most and which are utilised the least can only be determined at run-time. As the cache banks are neither fully nor evenly utilised, it would be helpful to shut down selected cache banks, specifically the least used ones, to save cache power consumption. Note that in the above discussion, the bank usages are calculated using the formula:
\[
L2BankUsage(i, j) = \frac{totalAccesses(i, j)}{totalL1Misses(j)} \times 100\% \tag{2.1}
\]
where $i$ denotes the L2 bank-id and $j$ the remap-period. The term $L2BankUsage(i, j)$ therefore represents the usage percentage of bank-id $i$ in the $j$th interval.
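As a minimal illustration, Equation (2.1) can be evaluated at the end of every remap-period from two per-interval counters, as sketched below; the structure and identifier names are assumptions made for exposition rather than the bookkeeping of our simulation infrastructure.

```cpp
#include <cstdint>
#include <cstdio>
#include <vector>

// Per remap-period statistics for one L2 cache (names are illustrative).
struct RemapPeriodStats {
    std::vector<uint64_t> accessesPerBank;  // totalAccesses(i, j) for a fixed j
    uint64_t totalL1Misses;                 // totalL1Misses(j)
};

// Equation (2.1): usage percentage of bank i during remap-period j.
double l2BankUsage(const RemapPeriodStats& period, unsigned bank) {
    if (period.totalL1Misses == 0) return 0.0;   // idle interval
    return 100.0 * static_cast<double>(period.accessesPerBank[bank]) /
           static_cast<double>(period.totalL1Misses);
}

int main() {
    // One interval with 4 banks; every L1 miss is served by some L2 bank.
    RemapPeriodStats j0{{600, 150, 200, 50}, 1000};
    for (unsigned i = 0; i < j0.accessesPerBank.size(); ++i)
        std::printf("L2BankUsage(%u, 0) = %.1f%%\n", i, l2BankUsage(j0, i));
}
```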
The non-uniform cache access patterns discussed above are based on the SNUCA architecture, where the mapping of a cache block is always fixed to a particular bank. The access and allocation strategies change in a DNUCA architecture, where a cache block is moved dynamically towards the core that requests it repeatedly. Tiled DNUCA [18] and its variants have recently been proposed to mitigate the performance bottleneck of SNUCA-based LLCs.
Although DNUCA enhances performance, its implementation in real hardware is challenging. Additionally, the significant data movement across the cache banks can increase the dynamic power consumption of the on-chip circuitry.
A detailed discussion of this topic is, however, beyond the scope of this thesis.
2.1.2 Set Level Granularity
Similar to the access patterns across cache banks, it has been observed that the accesses within a cache bank are also unevenly distributed among the sets [34, 37].
This uneven distribution increases conflict misses in the heavily accessed sets.
Figure 2.3 shows an example of how the conflict misses of each set vary within a bank of our TCMP architecture. As seen in the figure, some sets are used heavily while others remain underused. Such non-uniform load distribution reduces the overall utilisation of the cache by increasing conflict and capacity misses in the more heavily utilised sets.
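The root of this skew is that the set index is simply a fixed bit-field of the block address, so addresses that differ only in their tag bits compete for the same set; once a set's ways are exhausted, these addresses evict one another and cause conflict misses. The small fragment below, with an assumed bank geometry, illustrates two such conflicting addresses.

```cpp
#include <cstdint>
#include <cstdio>

// Illustrative bank-local geometry (assumptions): 64-byte blocks,
// 256 sets per bank.
constexpr unsigned kBlockOffsetBits = 6;
constexpr unsigned kSetBits         = 8;   // 256 sets

unsigned setIndexOf(uint64_t paddr) {
    return (paddr >> kBlockOffsetBits) & ((1u << kSetBits) - 1);
}

int main() {
    // Two addresses that differ only above the set-index field (i.e. in the
    // tag bits): they map to the same set, and every further address at this
    // stride does too, so the set fills up and starts taking conflict misses.
    uint64_t a = 0x12340;
    uint64_t b = a + (1ULL << (kBlockOffsetBits + kSetBits));
    std::printf("set(a) = %u, set(b) = %u\n", setIndexOf(a), setIndexOf(b));
}
```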
Figure 2.3: Set usage profile of a bank in a TCMP. The same non-uniform pattern also exists in a CCMP.
2.1.3 Dynamic Associativity Management (DAM)
DAM is a technique that can be used to handle the non-uniform load distribution among the cache sets [37, 34]. In DAM-based techniques, the associativity of each set can be managed dynamically without changing the actual size of the cache.
A heavily used set can use the idle ways of underused sets, thereby effectively increasing its associativity. This, in turn, reduces conflict as well as capacity misses inside the cache sets. Several DAM-based approaches already exist in the literature [37, 34, 38]. CMP-SVR [34] is one such DAM-based approach that incurs less hardware overhead than other similar techniques; it improves cache performance by reducing the number of conflict misses.
The non-uniform distribution of memory accesses is a major cause of the higher conflict misses in the LLC. The associativity of conventional set-associative caches cannot be adjusted dynamically because of the static one-to-one mapping between the tag entries and the data lines. Note that all of these techniques can be implemented in multi-banked SNUCA caches. In this thesis, we use CMP-SVR [34] to improve the performance of heavily loaded cache banks, during and after cache resizing.
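To make the limitation of the static mapping concrete, the conceptual sketch below contrasts a conventional set, where tag entry w implicitly owns data line w, with a DAM-style set whose tag entries carry explicit pointers into a shared data array so that idle lines of underused fellow-sets can be borrowed. The structures are only a schematic illustration and do not correspond to the layout of CMP-SVR or any other specific proposal.

```cpp
#include <cstdint>
#include <vector>

// Conventional set-associative cache: tag[w] implicitly owns data line w of
// the same set, so a set can never hold more blocks than its nominal ways.
struct ConventionalSet {
    static constexpr unsigned kWays = 8;
    uint64_t tag[kWays];
    bool     valid[kWays];
    // data line w is addressed as (setIndex * kWays + w); no indirection.
};

// DAM-style set: each tag entry carries an explicit pointer into a shared
// data array, so a heavily used set can borrow idle data lines that
// nominally belong to an underused fellow-set.
struct DamTagEntry {
    uint64_t tag;
    bool     valid;
    uint32_t dataLineIdx;   // index into the shared data array, not fixed to w
};

struct DamSet {
    std::vector<DamTagEntry> tags;  // may temporarily exceed the nominal ways
};

int main() {
    DamSet hotSet;
    for (uint32_t w = 0; w < 12; ++w)                  // 12 blocks in an 8-way set:
        hotSet.tags.push_back({0x1000u + w, true, w}); // 4 data lines are borrowed
}
```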
2.1.3.1 CMP-SVR
In CMP-SVR, the ways of each set are divided into two parts: Normal sTorage (NT) and Reserve sTorage (RT). NT behaves the same as a conventional cache, while RT comprises 25% to 50% of the ways of each set. The sets are divided into multiple groups; each group is called a fellow-group, and the sets within a fellow-group are called fellow-sets. A heavily used set can use the idle reserve ways of its fellow-sets and thereby distribute its load to other underused sets. Such way sharing improves the utilisation of the cache and hence improves performance.
Each fellow-group has its own RT section, formed by the reserve ways of all the fellow-sets within the fellow-group. A fellow-set can use any RT location of its fellow-group but cannot use any location outside its fellow-group. Searching the entire RT reserved for a particular fellow-group directly in the cache is an expensive operation, as it requires searching all the sets within the fellow-group.
To overcome this problem, an additional tag-array, SA-TGS, is used for RT. In SA-TGS, each RT location (L) has a corresponding entry, which contains the tag address of the block currently residing in L. An example of CMP-SVR is given in Figure 2.4.
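A minimal sketch of the information an SA-TGS entry could hold is given below; the exact fields used by CMP-SVR may differ, so the names here should be read as assumptions.

```cpp
#include <cstdint>

// One SA-TGS entry per RT location L in a fellow-group (illustrative fields).
struct SaTgsEntry {
    uint64_t tag;       // tag of the block currently residing in L
    uint16_t homeSet;   // fellow-set the block logically belongs to
    uint8_t  rtWay;     // which reserve way within that fellow-set L refers to
    bool     valid;
};
// Looking up a block then means probing the SA-TGS entries of its
// fellow-group instead of reading every fellow-set's reserve ways.

int main() {
    SaTgsEntry entry{0xBEEF, 12, 0, true};  // block of set 12 parked in RT
    (void)entry;
}
```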
Figure 2.4: An example of CMP-SVR
CMP-SVR does not search RT directly; instead, it searches the SA-TGS, which contains a dedicated entry for each RT location. If a block is not available in its dedicated cache set (i.e., in NT), then the block may be available in RT. In CMP-SVR, tag matching in NT and in the SA-TGS (not in RT) is performed simultaneously. If the tag is found in NT, it is a direct hit; if the tag matches in the SA-TGS, it is an indirect hit. In the case of an indirect hit, the block is moved from RT to NT. Moving a block from RT to NT is easy provided NT has a free (invalid) way; otherwise it is swapped with the LRU block of NT.
During replacement, instead of being removed from the cache completely, the victim (LRU) block is moved to RT. Moving a victim block to RT is easy if RT has a free (invalid) location; otherwise the victim replaces the LRU block of RT. An experimental evaluation with full-system simulation shows that CMP-SVR improves overall system performance by 8% while reducing the miss rate by 28%.
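The overall lookup and replacement flow described above can be summarised by the behavioural sketch below. The geometry, the fellow-group size and the use of software containers in place of hardware tag arrays are simplifying assumptions; the code only mimics the direct-hit, indirect-hit, promotion and victim-demotion behaviour of a CMP-SVR-like bank, not its actual hardware organisation.

```cpp
#include <cstdint>
#include <cstdio>
#include <utility>
#include <vector>

// Assumed geometry: 8 sets, 4 ways, 1 reserved way per set (RT = 25%),
// fellow-groups of 4 sets.
constexpr unsigned kSets = 8, kWays = 4, kReservedWays = 1;
constexpr unsigned kGroupSize = 4;
constexpr unsigned kNtWays = kWays - kReservedWays;

struct Line   { uint64_t tag = 0; bool valid = false; uint64_t lastUse = 0; };
struct RtSlot { Line line; unsigned homeSet = 0; };  // one SA-TGS entry per RT location

std::vector<std::vector<Line>>   nt(kSets, std::vector<Line>(kNtWays));  // Normal sTorage
std::vector<std::vector<RtSlot>> rt(kSets / kGroupSize,
                                    std::vector<RtSlot>(kGroupSize * kReservedWays));
uint64_t now = 0;  // timestamp used for LRU ordering

unsigned lruOf(const std::vector<Line>& ways) {
    unsigned victim = 0;
    for (unsigned w = 0; w < ways.size(); ++w) {
        if (!ways[w].valid) return w;                        // prefer a free (invalid) way
        if (ways[w].lastUse < ways[victim].lastUse) victim = w;
    }
    return victim;
}

// Returns "direct", "indirect" or "miss" for an access to (set, tag).
const char* access(unsigned set, uint64_t tag) {
    ++now;
    auto& ntSet = nt[set];
    for (auto& l : ntSet)                                    // 1) probe NT: direct hit
        if (l.valid && l.tag == tag) { l.lastUse = now; return "direct"; }

    auto& group = rt[set / kGroupSize];
    for (auto& slot : group)                                 // 2) probe SA-TGS: indirect hit
        if (slot.line.valid && slot.homeSet == set && slot.line.tag == tag) {
            unsigned w = lruOf(ntSet);                       // promote the RT block into NT and
            std::swap(ntSet[w], slot.line);                  // demote the NT victim (if any) to RT
            slot.homeSet = set;
            ntSet[w].lastUse = now;
            return "indirect";
        }

    unsigned w = lruOf(ntSet);                               // 3) miss: fill into NT; a valid
    if (ntSet[w].valid) {                                    //    victim moves to RT, replacing
        unsigned s = 0;                                      //    the RT LRU when no slot is free
        for (unsigned i = 1; i < group.size(); ++i)
            if (!group[i].line.valid ||
                group[i].line.lastUse < group[s].line.lastUse) s = i;
        group[s] = {ntSet[w], set};
    }
    ntSet[w] = {tag, true, now};
    return "miss";
}

int main() {
    // Four distinct tags hit set 0: with only 3 NT ways, the fourth fill demotes
    // the LRU victim to RT, and touching that victim again is an indirect hit.
    for (uint64_t t = 1; t <= 4; ++t)
        std::printf("tag %llu: %s\n", (unsigned long long)t, access(0, t));
    std::printf("tag 1 again: %s\n", access(0, 1));
}
```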
2.1.4 Violation in Locality of Reference
Although cache accesses exploit locality of reference, this property breaks down in the case of long-running applications. The changes in access patterns over a long time-span for a set of PARSEC applications are shown in Figure 2.5 for a TCMP architecture. For each application, 4 banks out of 16 have been randomly selected to keep the figures legible; the X-axis represents uniform time-intervals. Figure 2.6 depicts the changes in cache access patterns in the case of a CCMP. The access patterns (Y-axis) are shown for the cache banks over different epochs for black16, body16 and fluid16, respectively. In this case, we have taken 5 banks (along the X-axis) from each benchmark to show heavily used, moderately used and least used banks in each epoch. Each epoch covers a time-span of 10 million cycles: just after warm-up (Epoch-1), at the middle of execution (Epoch-2), and at the end (Epoch-3). The figures show that, for all the applications, the accesses to a bank change over the three epochs; a heavily used bank can later become a lightly used one, or vice versa.
Finally, from the above discussions, the following observations can be made:
1. Diversity exists in cache access behaviour across applications, with significant changes during execution.
2. Locality of reference with respect to bank-id may not be exploitable over a long run.
3. Access behaviour across the banks is diverse in nature and also changes during execution.
Figure 2.5: Temporal change in bank usages for different applications in a TCMP having 16 banks; panels: (a) freq1, (b) swap1, (c) body1, (d) ferret1, (e) vips1, (f) body16, (g) swap4, (h) ferret4, (i) freq16, (j) black4, (k) fluid16.
Figure 2.6: Change in cache bank access behaviour over time in a CCMP having 64 banks; panels: (a) black16, (b) body16, (c) fluid16.
From a performance perspective, these observations indicate that cache banks can be turned off during periods of low utility and turned back on when demand grows, thereby saving cache leakage consumption.
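Purely as an illustration of how such observations could be acted upon, and not as the technique developed later in this thesis, the following sketch applies simple usage thresholds (based on Equation (2.1)) at the end of every epoch to decide which banks to power-gate; the threshold values and the decision rule are assumptions.

```cpp
#include <cstddef>
#include <vector>

// Illustrative epoch-driven bank power-gating policy (thresholds and the
// decision rule are assumptions, not the technique proposed in this thesis).
constexpr double kOffThreshold = 2.0;   // turn a bank off below 2% usage
constexpr double kOnThreshold  = 10.0;  // turn it back on above 10% usage

struct BankState { bool poweredOn = true; };

// usage[i] is L2BankUsage(i, j) for the epoch j that has just ended.
void applyEpochDecision(std::vector<BankState>& banks,
                        const std::vector<double>& usage) {
    for (std::size_t i = 0; i < banks.size(); ++i) {
        if (banks[i].poweredOn && usage[i] < kOffThreshold) {
            // Before gating, resident dirty blocks would have to be flushed
            // or remapped so that correctness and performance are preserved.
            banks[i].poweredOn = false;
        } else if (!banks[i].poweredOn && usage[i] > kOnThreshold) {
            banks[i].poweredOn = true;   // demand has grown: wake the bank up
        }
    }
}

int main() {
    std::vector<BankState> banks(4);
    applyEpochDecision(banks, {35.0, 1.2, 0.4, 15.0});  // banks 1 and 2 get gated
}
```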