by double or triple [8]. V-Way [8] used this concept and allowed associativity to be increased up to twice the normal associativity (TDR = 2). Although V-Way consumes less energy than CMP-VR, its storage overhead is higher than that of CMP-VR (cf. Table 5.8). This motivated us to design a new DAM-based technique, called CMP-SVR, with minimal energy overhead. The performance of CMP-SVR is almost the same as that of CMP-VR.
CMP-SVR significantly reduces the hardware overhead (especially energy) of CMP-VR, but it cannot improve performance beyond that of CMP-VR. Hence our next goal is to enhance the performance of CMP-SVR. To this end, we propose another DAM-based technique, called FS-DAM, which provides better performance than both CMP-VR and CMP-SVR with almost the same hardware overhead as CMP-SVR.
Throughout this chapter, heavily used sets are termed H-Sets and lightly used sets are termed L-Sets. The procedure for classifying a set as an L-Set or an H-Set is discussed later in this chapter.
The rest of the chapter is organized as follows. The next section gives a detailed description of CMP-SVR. The experimental analysis of CMP-SVR is given in Section 5.3. Limitations of CMP-SVR and possible improvements are discussed in Section 5.4. FS-DAM is described in Section 5.5, and Section 5.6 gives its experimental analysis. Section 5.7 summarizes the chapter.
In the worst case, a set can use at most all the reserve ways of its fellow-sets. Since H-Sets require only a limited number of additional ways, this restriction of CMP-SVR does not degrade performance compared with CMP-VR. CMP-VR can be viewed as a CMP-SVR with only one fellow-group. The Set Balancing Cache [21] proposed a similar technique, but it allows only two sets to share ways. CMP-SVR, in contrast, can be configured for different fellow-group sizes.
The additional tag array in CMP-SVR, called SA-TGS, is set-associative instead of fully associative. Each SA-TGS entry has a one-to-one mapping with a corresponding location in RT (as in CMP-VR), so no forward/backward pointers are required, unlike in [8] [10]. Since both the cache and SA-TGS are set-associative, the rest of the chapter uses the term cacheSet for the sets of the cache and tgsSet for the sets of SA-TGS. Similarly, cacheWay and tgsWay are used for their ways. The terms set and way refer to cacheSet and cacheWay respectively, unless otherwise specified.
ALGORITHM 1: SVRFlwGrp divides the cacheSets into fellow-groups in CMP-SVR.
Input: S ← total number of cacheSets in the cache.
Input: F ← fellow-group size: total number of cacheSets in each fellow-group.
Output: list of all fellow-groups.
T ← S/F // Total number of fellow-groups in the cache.
List<fellow-group> all-groups ← T empty fellow-groups // Each fellow-group is a List<Index> of cacheSets.
fellow-group-index ← 0
for cacheSet-index ← 0 to S−1 do
    /* Add cacheSet (cacheSet-index) to fellow-group (fellow-group-index). */
    all-groups.get(fellow-group-index).add(cacheSet-index)
    fellow-group-index++
    if fellow-group-index == T then
        fellow-group-index ← 0
    end
end
return all-groups
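Algorithm 1 can be sketched in Python as follows. This is a minimal illustration, assuming S is a multiple of F; the function name `svr_flw_grp` and the divisibility check are hypothetical additions, not part of the original design. It preallocates the T empty fellow-groups and then assigns cacheSets round-robin.

```python
def svr_flw_grp(S: int, F: int) -> list[list[int]]:
    """Partition S cacheSets into S // F fellow-groups, round-robin (Algorithm 1)."""
    if F <= 0 or S % F != 0:
        raise ValueError("S must be a positive multiple of F")
    T = S // F  # total number of fellow-groups
    all_groups: list[list[int]] = [[] for _ in range(T)]  # T empty fellow-groups
    fellow_group_index = 0
    for cache_set_index in range(S):
        # Add cacheSet (cache_set_index) to fellow-group (fellow_group_index).
        all_groups[fellow_group_index].append(cache_set_index)
        fellow_group_index += 1
        if fellow_group_index == T:  # every group got one more set: wrap around
            fellow_group_index = 0
    return all_groups


if __name__ == "__main__":
    print(svr_flw_grp(8, 2))  # [[0, 4], [1, 5], [2, 6], [3, 7]]
```

Because the loop is round-robin, cacheSet s always lands in fellow-group s % T. This is what allows Equation 5.6 to compute the tgsSet index directly, without storing the group lists.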
The fellow-groups in CMP-SVR are decided statically. Starting from set 0, one set is assigned to each fellow-group in turn; once every fellow-group has received one set, the process is repeated for the remaining unassigned sets. The algorithm used to create the fellow-groups is given in Algorithm 1. Now consider an A-way set-associative cache with S cacheSets, where R cacheWays of each cacheSet are reserved for RT and F is the fellow-group size. F is always a power of 2. The structure of the SA-TGS is:
$$\text{Number of entries in SA-TGS} = R \times S \tag{5.1}$$
$$\text{Associativity of SA-TGS } (A_t) = R \times F \tag{5.2}$$
$$\text{Number of tgsSets } (S') = S / F \tag{5.3}$$
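Equations 5.1 to 5.3 can be checked with a small Python sketch. The names `sa_tgs_geometry` and `SATGSGeometry` are hypothetical, and the power-of-2 and divisibility checks simply encode the assumptions stated in the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SATGSGeometry:
    entries: int        # Eq. 5.1: R * S
    associativity: int  # Eq. 5.2: A_t = R * F
    num_tgs_sets: int   # Eq. 5.3: S' = S / F


def sa_tgs_geometry(S: int, R: int, F: int) -> SATGSGeometry:
    """Size the SA-TGS for S cacheSets, R reserve ways per set, fellow-group size F."""
    # Assumed constraints: F is a power of 2 and divides S.
    if F <= 0 or (F & (F - 1)) != 0:
        raise ValueError("F must be a power of 2")
    if S % F != 0:
        raise ValueError("S must be a multiple of F")
    return SATGSGeometry(entries=R * S, associativity=R * F, num_tgs_sets=S // F)


if __name__ == "__main__":
    print(sa_tgs_geometry(S=8, R=2, F=2))
    # SATGSGeometry(entries=16, associativity=4, num_tgs_sets=4)
```

Note that $A_t \times S' = R \times F \times S/F = R \times S$, so the SA-TGS holds exactly one entry per RT location, which is what makes the static one-to-one mapping possible.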
Each SA-TGS entry is statically mapped to a location in RT. An SA-TGS entry with tgsSet number $tgs_s$ and tgsWay number $tgs_w$ is mapped to the RT location given by the following two equations.
$$\text{cacheSet index} = (tgs_w / R) \times S' + tgs_s \tag{5.4}$$
$$\text{cacheWay index} = (A - R) + (tgs_w \,\%\, R) \tag{5.5}$$
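As a sketch of Equations 5.4 and 5.5, the following hypothetical helper `tgs_to_rt` maps an SA-TGS entry to its RT location, using Python's `//` for the integer division and `%` for modulo.

```python
def tgs_to_rt(tgs_s: int, tgs_w: int, A: int, R: int, S_prime: int) -> tuple[int, int]:
    """Map SA-TGS entry (tgsSet tgs_s, tgsWay tgs_w) to its RT location."""
    cache_set = (tgs_w // R) * S_prime + tgs_s  # Eq. 5.4 (integer division)
    cache_way = (A - R) + (tgs_w % R)           # Eq. 5.5: one of the last R ways
    return cache_set, cache_way


if __name__ == "__main__":
    # Parameters of the Figure 5.1 example: A = 8, R = 2, S' = 4.
    print(tgs_to_rt(1, 3, A=8, R=2, S_prime=4))  # (5, 7)
```

Each group of R consecutive tgsWays covers the reserve ways of one cacheSet, and successive groups step through the cacheSets of the fellow-group in strides of S'.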
Each fellow-group has a separate RT section and one dedicated tgsSet in SA-TGS to manage it. To search a fellow-group's RT section, only its dedicated tgsSet is searched.
Since the fellow-groups are created by Algorithm 1, the tgsSet index corresponding to any cacheSet index $s_i$ ($0 \le s_i < S$) is given by Equation 5.6.
$$\text{tgsSet index} = s_i \,\%\, S' \tag{5.6}$$
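The reason Equation 5.6 holds can be checked mechanically: the round-robin loop of Algorithm 1 puts cacheSet $s_i$ into fellow-group $s_i \,\%\, T$, and $T = S/F = S'$. The helper names below are hypothetical; `fellow_group_of` deliberately replays the loop rather than using the closed form, so the comparison is meaningful.

```python
def cache_set_to_tgs_set(s_i: int, S_prime: int) -> int:
    """Eq. 5.6: tgsSet managing the RT section of s_i's fellow-group."""
    return s_i % S_prime


def fellow_group_of(s_i: int, S: int, F: int) -> int:
    """Fellow-group assigned to s_i by replaying Algorithm 1's round-robin loop."""
    T = S // F
    group = 0
    for idx in range(S):
        if idx == s_i:
            return group
        group = (group + 1) % T  # wrap around after T groups
    raise ValueError("s_i out of range")


if __name__ == "__main__":
    S, F = 8, 2
    print(all(cache_set_to_tgs_set(s, S // F) == fellow_group_of(s, S, F)
              for s in range(S)))  # True
```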
The mapping between RT locations and SA-TGS entries is statically fixed. Given a cacheSet index $s_i$ ($0 \le s_i < S$) and a cacheWay index $w_j$ ($A - R \le w_j < A$), the corresponding tgsWay index is calculated by the following equation.
$$\text{tgsWay index} = (s_i / S') \times R + \bigl(w_j - (A - R)\bigr) \tag{5.7}$$
Note that all divisions in the above formulas are integer divisions (e.g. 1/2 = 0, 3/2 = 1) and '%' is the modulo operator, as used in high-level programming languages. Set and way numbering of both the cache and SA-TGS starts from 0.
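Equation 5.7 is the inverse of Equations 5.4 and 5.5 (together with Equation 5.6). The sketch below, with the hypothetical names `rt_to_tgs_way` and `check_round_trip`, computes the tgsWay for an RT location and verifies that mapping every RT location to SA-TGS and back returns the starting location.

```python
def rt_to_tgs_way(s_i: int, w_j: int, A: int, R: int, S_prime: int) -> int:
    """Eq. 5.7: tgsWay of the SA-TGS entry mapped to RT location (s_i, w_j)."""
    if not (A - R <= w_j < A):
        raise ValueError("w_j must be a reserve way: A - R <= w_j < A")
    return (s_i // S_prime) * R + (w_j - (A - R))


def check_round_trip(A: int, R: int, S: int, F: int) -> bool:
    """True if Eqs. 5.6/5.7 invert Eqs. 5.4/5.5 for every RT location."""
    S_prime = S // F
    for s_i in range(S):
        for w_j in range(A - R, A):
            tgs_s = s_i % S_prime                           # Eq. 5.6
            tgs_w = rt_to_tgs_way(s_i, w_j, A, R, S_prime)  # Eq. 5.7
            if not (0 <= tgs_w < R * F):                    # must fit in A_t ways
                return False
            if (tgs_w // R) * S_prime + tgs_s != s_i:       # Eq. 5.4
                return False
            if (A - R) + (tgs_w % R) != w_j:                # Eq. 5.5
                return False
    return True


if __name__ == "__main__":
    print(rt_to_tgs_way(5, 7, A=8, R=2, S_prime=4))  # 3
    print(check_round_trip(A=8, R=2, S=8, F=2))      # True
```

The round trip works because writing $s_i = q \times S' + r$ with $q < F$ gives $tgs_w = q \times R + (w_j - (A - R))$, whose quotient and remainder by R recover q and the reserve-way offset exactly.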
Figure 5.1: CMP-SVR: way distribution, fellow sets and associative mapping into SA-TGS.
Example: Figure 5.1 shows a 4KB 8-way set-associative cache (A = 8) and its corresponding SA-TGS. Assuming a 64B block size, the cache has 8 sets (S = 8). With R = 2 reserve ways per set and fellow-group size F = 2, the number of tgsSets is $S' = S/F = 4$ and the SA-TGS associativity (number of tgsWays per tgsSet) is $A_t = R \times F = 4$. There are four fellow-groups f1, f2, f3 and f4. Each cacheSet is labeled with the id of the fellow-group to which it belongs. The letters (a, b, ..., p) in RT and in SA-TGS are labels showing the one-to-one mapping between an RT location and its corresponding SA-TGS entry; they do not represent block contents.
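The example's arithmetic can be reproduced with a short script. It derives S from the cache size, block size and associativity, then lists, for each tgsSet, the RT locations (cacheSet, cacheWay) it manages via Equations 5.4 and 5.5. The function name `rt_locations_of_tgs_set` is hypothetical, and the letter labels of Figure 5.1 are not reproduced because their exact placement is not given in the text.

```python
CACHE_BYTES = 4 * 1024  # 4KB cache
BLOCK_BYTES = 64        # 64B blocks
A, R, F = 8, 2, 2       # 8-way, 2 reserve ways per set, fellow-group size 2

S = CACHE_BYTES // (BLOCK_BYTES * A)  # number of cacheSets
S_prime = S // F                      # number of tgsSets (Eq. 5.3)
A_t = R * F                           # SA-TGS associativity (Eq. 5.2)


def rt_locations_of_tgs_set(tgs_s: int) -> list[tuple[int, int]]:
    """RT locations (cacheSet, cacheWay) managed by tgsSet tgs_s (Eqs. 5.4, 5.5)."""
    return [((w // R) * S_prime + tgs_s, (A - R) + (w % R)) for w in range(A_t)]


if __name__ == "__main__":
    print(f"S = {S}, S' = {S_prime}, A_t = {A_t}")  # S = 8, S' = 4, A_t = 4
    for tgs_s in range(S_prime):
        print(f"tgsSet {tgs_s}:", rt_locations_of_tgs_set(tgs_s))
    # First line: tgsSet 0: [(0, 6), (0, 7), (4, 6), (4, 7)]
```

Together the four tgsSets cover all 16 RT locations (R × S = 2 × 8), matching the 16 labels a to p in the figure.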
CMP-VR and CMP-SVR share the same data management policy, discussed in Section 4.3.2. They differ in the structure of the additional tag array (TGS in CMP-VR, SA-TGS in CMP-SVR) and in its one-to-one mapping with RT. To search an RT section, CMP-SVR searches only the corresponding tgsSet in SA-TGS, whereas CMP-VR must search the entire fully associative TGS.
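The difference in search width can be quantified with a one-line formula: CMP-SVR compares the $A_t = R \times F$ entries of one tgsSet, while CMP-VR, viewed as a CMP-SVR with a single fellow-group (F = S), compares all $R \times S$ TGS entries. The function name `rt_search_width` and the bank parameters below (S = 1024, R = 4, F = 8) are hypothetical values for illustration, not configurations evaluated in this chapter.

```python
def rt_search_width(S: int, R: int, F: int) -> int:
    """Number of additional-tag entries compared on one RT search."""
    if F <= 0 or S % F != 0:
        raise ValueError("F must divide S")
    # One tgsSet of A_t = R * F entries is searched. F = S (a single
    # fellow-group) degenerates to CMP-VR's fully associative TGS of R * S entries.
    return R * F


if __name__ == "__main__":
    S, R = 1024, 4  # hypothetical bank
    print("CMP-VR :", rt_search_width(S, R, F=S))  # 4096
    print("CMP-SVR:", rt_search_width(S, R, F=8))  # 32
```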
5.2.1 Additional Search Time
Since an indirect hit requires two cache accesses, the first performed simultaneously with the SA-TGS search and the second after the tag is found in SA-TGS, an indirect hit takes twice as long as a direct hit. The time to detect a miss remains the same. This was discussed in detail in Section 4.3.3.2 in the description of CMP-VR.
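The lookup flow and its latency can be illustrated with a toy model of one bank. This is a sketch under loudly stated assumptions, not the CMP-SVR hardware: block addresses are integers, the home cacheSet is `addr % S`, the normal ways of each set are modelled as a Python set of block addresses, each SA-TGS entry is assumed to hold the full block address it tracks (or `None` when free), and latency is counted in units of one cache access. The class `CmpSvrBank` and its fields are hypothetical.

```python
from typing import Optional


class CmpSvrBank:
    """Toy model of CMP-SVR lookup in one bank (illustrative only)."""

    def __init__(self, S: int, R: int, F: int) -> None:
        self.S, self.S_prime = S, S // F
        # Normal (non-reserve) ways of each cacheSet, as sets of block addresses.
        self.normal: list[set[int]] = [set() for _ in range(S)]
        # SA-TGS: S' tgsSets of A_t = R * F entries; None marks a free entry.
        self.sa_tgs: list[list[Optional[int]]] = [
            [None] * (R * F) for _ in range(self.S_prime)
        ]

    def lookup(self, addr: int) -> tuple[str, int]:
        """Return (outcome, latency in units of one cache access)."""
        s_i = addr % self.S         # assumed home-set indexing
        tgs_s = s_i % self.S_prime  # Eq. 5.6: only this tgsSet is searched
        if addr in self.normal[s_i]:
            return "direct-hit", 1  # found in the first cache access
        # The tgsSet search ran in parallel with the first cache access.
        if addr in self.sa_tgs[tgs_s]:
            return "indirect-hit", 2  # second access reads the RT location
        return "miss", 1  # assumed: detected alongside the first access, as before


if __name__ == "__main__":
    bank = CmpSvrBank(S=8, R=2, F=2)
    bank.normal[3].add(11)  # 11 % 8 == 3
    bank.sa_tgs[1][0] = 13  # 13 lives in a reserve way of its fellow-group
    print(bank.lookup(11), bank.lookup(13), bank.lookup(21))
```

In this model, tgsWay 0 of tgsSet 1 corresponds (by Equations 5.4 and 5.5) to a reserve way of cacheSet 1, so block 13, whose home set is 5, is served from a fellow-set's reserve way.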
5.2.2 Implementation of CMP-SVR in TCMP
As mentioned in Section 2.3.1, in TCMP-based architectures DAM-based policies are implemented on each bank separately, and CMP-SVR follows the same approach: it is applied to each bank independently, and the process is transparent from outside the bank. In principle, CMP-SVR could also be applied to only some selected banks rather than all of them, but such designs are not considered in our experiments. Since CMP-SVR is a local utilisation enhancement technique, it can be combined with inter-bank utilisation enhancement policies such as T-DNUCA (cf. Chapter 6), Co-Operative Caching [12] and HK-NUCA [9].