be a bit more expensive, but a hit in the regular storage takes less time. Also, the maximum dynamic associativity increase for heavily used sets is larger in CMP-VR than in V-way [8].
From a TGS entry's location (array index), the corresponding RT location can be directly calculated.
The number of entries in TGS is S×R, where S is the number of sets and R is the number of reserve ways in each set.
Figure 4.3: Example of a bank in CMP-VR.
Figure 4.3 shows the structure of an LLC bank and its associated TGS. To facilitate block search and replacement in RT, the mapping between TGS and RT must be kept consistent. An LRU list is maintained over the TGS entries to identify the LRU entry to be replaced by newer ones. For example, if index i in TGS is the LRU entry, then the corresponding block in RT is replaced by the new block evicted from NT. In other words, a global LRU replacement policy is maintained for RT.
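A minimal software model of the TGS and its global LRU order is sketched below. The class name TGS and its methods are illustrative only; the point the sketch captures is that a tag match at TGS index i directly identifies RT location i, and that replacement in RT follows a single LRU order over all S×R entries.

    from collections import OrderedDict

    class TGS:
        """Tag storage for the reserve ways (RT) of one bank.
        TGS index i corresponds one-to-one with RT location i, so a tag
        match immediately gives the RT location of the data block.
        A single global LRU order is kept over all S*R entries."""

        def __init__(self, num_sets, reserve_ways):
            self.capacity = num_sets * reserve_ways      # S x R entries
            self.entries = OrderedDict()                 # RT index -> tag, LRU entry first

        def lookup(self, tag):
            # Associative search over all entries (done in parallel in hardware).
            for idx, stored in self.entries.items():
                if stored == tag:
                    self.entries.move_to_end(idx)        # refresh global LRU position
                    return idx                           # RT location holding the block
            return None

        def insert(self, tag):
            # Fill a free slot while warming up; afterwards evict the global LRU entry.
            if len(self.entries) < self.capacity:
                idx = len(self.entries)
            else:
                idx, _evicted_tag = self.entries.popitem(last=False)
            self.entries[idx] = tag
            return idx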
4.3.2 Operations of CMP-VR
All cache hits within NT are handled as in a regular bank, without any changes. Modifications are required only for misses in NT. The necessary changes are listed below, followed by a short code sketch of these operations:
1. On NT miss (block not in NT): Search the TGS. Upon a tag match in TGS, identify the corresponding block in RT and move it to NT. If the tag match fails in TGS, it is a cache miss.
2. Move a block B from RT to NT: If a free way is available in the desired set of NT, then move B into that free space. Otherwise, swap B with the LRU block of the corresponding set of NT.
3. On cache miss: Bring the block from main memory and store it in the desired set of NT.
4. Store a newly arrived block B into NT: If a free way is available in the desired set of NT, then place B into the free space. Otherwise, make space in NT by evicting its LRU block. This evicted block moves to RT.
5. Store a block V (evicted from NT) into RT: Search the TGS. If space is available in RT, then move V into the free space of RT. Otherwise, select a victim block (V0) from RT (based on the global LRU) and replace it with V. The victim V0 is evicted from the cache bank and written back to main memory (if required). Update the TGS entry by overwriting the tag of V0 with the tag of V.
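As a concrete illustration of this flow, the sketch below strings the five operations together for a single bank. The class and method names (CMPVRBank, access, and so on) are illustrative; data movement, dirty bits and write-backs are abstracted away, so this is a functional model rather than the actual implementation.

    from collections import OrderedDict

    class CMPVRBank:
        """Functional sketch of the NT/RT operations of one CMP-VR bank.
        NT is modelled as per-set tag lists with local LRU (LRU first);
        RT/TGS as a global-LRU OrderedDict from RT index to tag."""

        def __init__(self, num_sets, nt_ways, reserve_ways):
            self.nt_ways = nt_ways
            self.nt = [[] for _ in range(num_sets)]        # per-set tags, LRU first
            self.rt = OrderedDict()                        # RT index -> tag, global LRU first
            self.rt_free = list(range(num_sets * reserve_ways))

        def access(self, set_idx, tag):
            nt_set = self.nt[set_idx]
            if tag in nt_set:                              # hit in NT: regular behaviour
                nt_set.remove(tag); nt_set.append(tag)
                return "direct-hit"
            rt_idx = self._tgs_lookup(tag)                 # op 1: NT miss, search TGS
            if rt_idx is not None:
                self._move_rt_to_nt(set_idx, tag, rt_idx)  # op 2: move block from RT to NT
                return "indirect-hit"
            self._store_in_nt(set_idx, tag)                # op 3: fetch from memory into NT
            return "miss"

        def _tgs_lookup(self, tag):
            for idx, stored in self.rt.items():
                if stored == tag:
                    return idx
            return None

        def _move_rt_to_nt(self, set_idx, tag, rt_idx):    # op 2: RT -> NT, swap if NT is full
            nt_set = self.nt[set_idx]
            if len(nt_set) < self.nt_ways:
                del self.rt[rt_idx]                        # RT slot becomes free
                self.rt_free.append(rt_idx)
            else:
                victim = nt_set.pop(0)                     # NT's LRU block fills the vacated RT slot
                self.rt[rt_idx] = victim
                self.rt.move_to_end(rt_idx)
            nt_set.append(tag)

        def _store_in_nt(self, set_idx, tag):              # op 4: store new block, spill LRU to RT
            nt_set = self.nt[set_idx]
            if len(nt_set) >= self.nt_ways:
                self._store_in_rt(nt_set.pop(0))
            nt_set.append(tag)

        def _store_in_rt(self, tag):                       # op 5: place NT's victim into RT
            if self.rt_free:
                idx = self.rt_free.pop()
            else:
                idx, _victim = self.rt.popitem(last=False) # evict global LRU block (write back if dirty)
            self.rt[idx] = tag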
The coherence action of invalidating shared copies in the L1 caches is initiated only when a block is evicted from RT (i.e., from the chip). In other words, RT is transparent to the L1 caches sharing blocks that reside in RT. Also, the use of NT in one bank is completely transparent to the other banks, and hence other inter-bank data management policies can be implemented on top of it.
4.3.3 Additional TaG Storage (TGS)
The above description gives an idea of what CMP-VR is and how it works. As mentioned earlier, CMP-VR has an extra tag array, called TGS, which is searched on every miss in NT. The access time overhead of TGS should not offset the performance gain of CMP-VR. The access time of TGS, as well as that of CMP-VR as a whole, is discussed below.
4.3.3.1 Structure of the TGS
The structure of TGS is similar to that of a fully associative TLB (translation lookaside buffer). To search the TGS, all of its entries are compared simultaneously. Comparing all tag entries simultaneously requires a large number of comparators and may lead to an energy overhead. To reduce this overhead, the TGS is divided into multiple 16-way segments. All segments are searched simultaneously, so the total search time of the TGS equals the access time of a single segment. A detailed analysis of the energy overhead of CMP-VR is given in Section 4.4.4.2.
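The effect of the 16-way segmentation can be seen in a small functional model. The sequential loop below stands in for what the hardware does concurrently, and the function name is illustrative:

    def tgs_segmented_lookup(tags, target, segment_width=16):
        """Search the TGS tag list split into 16-way segments.
        Hardware probes every segment concurrently, so the overall search
        time equals the access time of one segment; the loop here is only
        a functional stand-in for that parallel search."""
        for base in range(0, len(tags), segment_width):
            segment = tags[base:base + segment_width]
            if target in segment:
                return base + segment.index(target)   # global TGS/RT index of the match
        return None                                    # tag absent: miss in RT as well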
4.3.3.2 Access time of CMP-VR
This section discusses the time to access a block within a bank, after the request reaches the bank. The request may come from a different tile, and in the TCMP architecture the routing time varies with the distance between the home tile and the requesting tile. However, the routing in CMP-VR is the same as in the baseline design; hence, the NoC routing time of CMP-VR is not discussed in this section. The actual cache access time is NoC routing time + bank access time.
In a standard cache architecture [2], for each cache bank, the address of the requested block is fed to a decoder, which activates a wordline in both the data array and the tag array (not to be confused with the TGS). The contents of the entire set are placed on the bitlines, which are then sensed. The tags read out of the tag array are compared with the input address to check for a match. The multiplexers are then set appropriately to forward the data to the requesting core. The logical organization of a cache bank is shown in Figure 4.4. The bank access time in such an architecture depends on both the tag array and the data array, and it is termed the normal bank access time (nbat).
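The conventional lookup described above can be summarized functionally: decode the set index from the block address, read out the tags of that set, and compare them with the address tag to steer the multiplexers. The sketch below is a minimal functional view, not circuit-level behaviour; the function name and address split are illustrative, with the block offset assumed to be already stripped from the address.

    def bank_lookup(block_address, tag_array, num_sets):
        """Conventional set-associative lookup in one bank: the decoder
        selects a set, all tags of that set are read and compared with the
        address tag, and a match selects the way whose data is forwarded."""
        set_idx = block_address % num_sets            # decoder input: set index bits
        tag = block_address // num_sets               # remaining bits form the tag
        for way, stored_tag in enumerate(tag_array[set_idx]):
            if stored_tag == tag:
                return set_idx, way                   # hit: multiplexer forwards this way
        return set_idx, None                          # miss in this bank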
The tag array and the TGS are not the same entity. The former is part of the normal cache architecture described in [2], while the latter is the proposed additional tag storage required for RT.
Figure 4.4: A logical organization of a cache bank [2].
Figure 4.5 explains the flow of a request for a block (say B) in CMP-VR. The request is issued by an L1 cache (say Lx1) when it suffers a cache miss. Under the static bank mapping policy of the baseline TCMP, the block maps to a fixed L2 bank (say Ly2). The details of address decoding, tag comparison and multiplexing are abstracted out. The routing details of how the request reaches Ly2 from the requesting Lx1 are also ignored. The flow only shows what happens when the request arrives at Ly2. CMP-VR is applied to each bank separately.
Figure 4.5: Request flow in CMP-VR after the request reaches the home tile.
Placing a block in the cache may require moving another block to RT. The block movement and replacement procedures are discussed in Section 4.3.2.
When a request arrives, all three components, i.e., the tag array, the data array and the TGS, are searched simultaneously. The tag array and the data array are part of the normal cache (NT in CMP-VR); the TGS search is shown in the figure as RT searching.
Components      Parameters
No. of tiles    16
Processor       UltraSPARC III+
L1 I/D cache    64 KB, 4-way
LLC size        2 MB and 4 MB
Bank size       128 KB, 4-way and 256 KB, 4-way
Memory          1 GB, 4 KB/page
Table 4.2: System Parameters.
The worst-case TGS access time is considered to be equal to nbat†. If the block is found in NT, it is a direct-hit and the hit time is NoC routing time + nbat.
Since the worst-case TGS access time is nbat, the TGS search finishes in parallel with the NT search. If the tag is not found in the TGS, it is a cache miss and the request has to go to main memory. However, if the tag is found in the TGS, it is an indirect-hit and the block needs to be read from the data array; in the worst case this takes another nbat amount of time. So in the case of an indirect-hit, the cache access time (covering both NT and RT) is NoC routing time + 2×nbat.
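Putting the two cases together, the bank-level latency after an L1 miss can be written as a small helper; nbat and the NoC routing time are left as parameters, since their actual values come from the timing model and the network distance rather than from this sketch, and the function name is illustrative.

    def cmp_vr_access_time(outcome, noc_routing_time, nbat):
        """Worst-case access time per the cases above: a direct-hit costs one
        bank access (the TGS search overlaps the NT search), an indirect-hit
        costs a second bank access to read the block out of RT, and a miss
        continues to main memory."""
        if outcome == "direct-hit":
            return noc_routing_time + nbat
        if outcome == "indirect-hit":
            return noc_routing_time + 2 * nbat
        raise ValueError("cache miss: latency is set by the main-memory access")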