                        L2-bank: 256KB-4way      L2-bank: 512KB-4way
                        F=2    F=4    F=8        F=2    F=4    F=8
Area overhead (in %)    3.380  3.397  3.486      3.128  3.518  3.551
Table 5.9: The area overhead of FS-DAM over the baseline design. Rx: x% ways per set reserved for RT; F: fellow-group size.
also shows improvement when compared with V-Way and SBC. The static energy overhead of FS-DAM over the baseline is just 1.73%, while the dynamic energy overhead is 2.19%.
Dynamic NUCA Framework for Tiled CMPs
In Chapter 2 it was mentioned that the banks of a TCMP are not used uniformly: some banks are heavily loaded while others are used very lightly. This utilisation issue is referred to as the global utilisation issue (cf. Section 2.3.2). Global utilisation can be improved by uniformly distributing the load among the banks.
SNUCA cannot move blocks from one bank to another, hence such a distribution is not possible with T-SNUCA. DNUCA can move blocks from one bank to another, hence load distribution among the banks is possible in DNUCA based designs.
This chapter proposes a DNUCA based TCMP architecture, called T-DNUCA, to enhance global utilisation by distributing the load among multiple banks.
The main contributions of T-DNUCA are: (a) distributing the load among multiple banks, (b) migrating heavily used blocks to closer banks, and (c) a better placement policy that reduces LLC access time. Simulation results show that T-DNUCA improves performance over T-SNUCA by 31.11% and 6.59% in terms of MPKI and CPI, respectively.
6.1 Introduction
SNUCA has a fixed mapping policy: a block always maps to a particular bank.
Such a mapping policy makes the block searching process easier and faster. Although the SNUCA design is very simple, it has performance issues due to the cache access latencies. When a core requests a block that is far away from the core, there is no mechanism to bring the block closer. DNUCA allows dynamic mapping of the cache blocks among the banks. The banks are divided into groups called banksets, and a block can be placed in any bank within a given bankset. This dynamic nature facilitates the migration of heavily used blocks closer to the requesting cores.
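To make the distinction concrete, here is a minimal Python sketch of the two mapping policies; the 16-bank layout, 64 B blocks and 4-bank banksets are illustrative assumptions, not the configuration used later in this chapter.

```python
# SNUCA vs. DNUCA mapping: a sketch with assumed parameters
# (16 banks, 64 B blocks, 4 banks per bankset).

NUM_BANKS = 16
BLOCK_BITS = 6            # 64-byte cache blocks
BANKS_PER_SET = 4         # bankset (peer-bank group) size

def snuca_bank(addr):
    """SNUCA: the address fixes exactly one legal bank."""
    return (addr >> BLOCK_BITS) % NUM_BANKS

def dnuca_bankset(addr):
    """DNUCA: the address fixes a bankset; the block may reside
    in any of its peer banks, enabling migration and spilling."""
    set_id = (addr >> BLOCK_BITS) % (NUM_BANKS // BANKS_PER_SET)
    return [set_id * BANKS_PER_SET + i for i in range(BANKS_PER_SET)]

addr = 0x1F40
print(snuca_bank(addr))      # the single SNUCA location
print(dnuca_bankset(addr))   # all legal DNUCA locations
```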
DNUCA allows a block to be placed in any bank within a bankset, thus necessitating the entire bankset to be searched before declaring a cache miss. The three basic mechanisms for bankset search are incremental search, multicast search and mixed search [17]. Additional smart searching techniques also exist to minimize the search time [1, 9], but optimizing the bankset searching time is still an open issue [5]. The details of DNUCA block search are discussed in Section 2.1.2.1.
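For illustration, the three schemes can be sketched in Python as follows; `lookup` is a hypothetical per-bank tag probe, and the near/far split in the mixed variant is just one possible choice.

```python
# Basic bankset-search schemes (cf. [17]); the latency/traffic
# trade-offs are noted in comments, the code only models probe order.

def incremental_search(banks_by_distance, lookup):
    """Incremental: probe peer banks one at a time, nearest first.
    Few network messages, but a miss pays the full serial latency."""
    for bank in banks_by_distance:
        if lookup(bank):
            return bank
    return None                    # miss in the entire bankset

def multicast_search(banks, lookup):
    """Multicast: probe all peer banks in parallel.
    Lowest hit latency, but every access floods the bankset."""
    hits = [bank for bank in banks if lookup(bank)]
    return hits[0] if hits else None

def mixed_search(banks_by_distance, lookup, near_count=2):
    """Mixed: multicast to the nearest banks first, then to the
    rest -- a compromise between the two extremes."""
    near = banks_by_distance[:near_count]
    far = banks_by_distance[near_count:]
    result = multicast_search(near, lookup)
    if result is None:
        result = multicast_search(far, lookup)
    return result
```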
DNUCA can distribute the load among the banks of the same bankset (peer banks).
Though most DNUCA designs have only a migration facility to move blocks from one bank to another, the flexibility of placing a block in any peer bank makes it possible to move blocks from a heavily loaded bank to another bank.
Both SNUCA and DNUCA described above are well explored for centralised shared CMPs. In Chapter 2 we referred to them as C-SNUCA and C-DNUCA respectively.
But as mentioned in Section 2.2.3.2, DNUCA is less explored for TCMPs. The TCMP architecture and some common terms related to TCMP are already discussed in Section 2.2.3. For completeness, Figure 6.1 shows the baseline TCMP (cf. Section 3.3.5). The shared L2 cache is divided into 16 banks and each tile has a bank. The terms “local bank” and “remote bank” w.r.t. a core denote the bank within the same tile and the banks in other tiles, respectively, for that particular core.
Figure 6.1: An example of TCMP. This is a SNUCA based TCMP, or T-SNUCA. This TCMP is considered as the baseline for our proposed architectures.
Figure 6.2: Distribution of hits among the local banks and remote banks for T-SNUCA.
Due to its distributed shared banks, a TCMP can be designed based on the concept of either SNUCA or DNUCA. In Chapter 2 we referred to these as T-SNUCA and T-DNUCA respectively. T-SNUCA is already implemented and well explored [38, 62]
compared to DNUCA based designs for TCMP. Due to its static mapping policy, T-SNUCA has some disadvantages. It has more accesses to remote banks than to the local bank, as the blocks are statically mapped and cannot be moved (migrated) into a closer bank. The requesting core (tile) has to forward each request to the bank where the block maps. Therefore, T-SNUCA has a higher average cache access latency. Figure 6.2 shows the number of remote bank accesses of T-SNUCA as compared to the local bank accesses. The non-uniform load distribution among the T-SNUCA banks is shown in Figure 6.3.
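This latency penalty can be approximated with a simple average-latency model; the symbols below are illustrative and not taken from the measurements reported here:

$$T_{avg} = f_{local}\, t_{bank} + (1 - f_{local})\,(t_{bank} + t_{net}),$$

where $f_{local}$ is the fraction of L2 hits served by the local bank, $t_{bank}$ is the bank access time and $t_{net}$ is the round-trip network latency to a remote bank. Since Figure 6.2 shows that $f_{local}$ is small for T-SNUCA, $T_{avg}$ is dominated by the remote term $t_{bank} + t_{net}$.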
From the above discussion it is clear that a TCMP with fewer remote bank accesses and with block migration capability among the banks can be a better option than T-SNUCA. DNUCA has some advantages over SNUCA, and hence implementing the DNUCA concept for TCMP can reduce the problems of T-SNUCA.
The dynamic nature of DNUCA allows a block to move among the banks. Therefore, heavily used blocks can be migrated closer to the requesting tile (core). It would
Figure 6.3: Comparison of bank usages in T-SNUCA.
also be feasible to spill the victim block (during replacement) of one bank into another bank, thus giving the victim block another chance to remain in the cache.
Such inter-bank movement of victim blocks can distribute the load of a heavily used bank among the other underused banks, as sketched below.
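A minimal Python sketch of such victim spilling follows; the FIFO replacement, the occupancy heuristic and the Bank abstraction are simplifying assumptions, not the exact mechanism proposed in this chapter.

```python
# Victim spilling between peer banks: a hedged sketch, not the
# exact T-DNUCA mechanism. Replacement is FIFO (standing in for
# LRU) and "least loaded" is a simple occupancy heuristic.

from collections import OrderedDict

class Bank:
    def __init__(self, capacity):
        self.capacity = capacity
        self.blocks = OrderedDict()        # insertion order = age

    def insert(self, tag):
        """Insert a block; return the evicted victim tag, if any."""
        victim = None
        if len(self.blocks) >= self.capacity:
            victim, _ = self.blocks.popitem(last=False)  # evict oldest
        self.blocks[tag] = True
        return victim

def insert_with_spill(home_bank, peer_banks, tag):
    """Give the victim of a loaded bank a second chance in the
    least-loaded peer bank instead of dropping it immediately."""
    victim = home_bank.insert(tag)
    if victim is not None:
        target = min(peer_banks, key=lambda b: len(b.blocks))
        if len(target.blocks) < target.capacity:
            target.blocks[victim] = True   # spill: victim stays cached
        # otherwise the victim is dropped, as in a plain design
```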
Issues that need to be handled differently in TCMP based DNUCA:
Implementing DNUCA for TCMP is not identical to the original DNUCA proposed for centralised CMPs (CCMPs). In a TCMP every bank is associated with a core, which is not the case in CCMPs. The core in each tile performs execution, and hence the blocks heavily used by a core have to be placed nearer to that core.
Within a bankset, each bank is associated with a core while at the same time being accessible by other cores. The nearby cores would benefit from loading data into such a bank rather than into the other banks of the bankset. Now consider the gradual migration of a block within a bankset: as the block moves through intermediate tiles, it will evict some blocks from these intermediate banks. Note that these evicted blocks may have been in active use by other nearby cores.
Placement of a newly fetched block is also an issue. Most of the original DNUCA policies place a newly fetched block into the farthest bank and gradually try to move it closer [18, 1]. The centralised LLC structure has no core associated with its central banks, so gradual migration settles the shared blocks into the central banks to enable faster access from all the cores. But a TCMP has a core associated with every bank, and hence settling shared blocks into a central bank may remove important blocks from that bank. The DNUCA problems like bankset search and migration control must also be handled properly such that
they do not nullify the advantages gained over T-SNUCA. Smart search policies to speed up the bankset searching are also expensive in terms of area, storage and energy consumption [5].
Our proposed T-DNUCA uses a hybrid search policy to search a block within the bankset. In the first attempt it searches only the bank of the bankset that is closest to the requesting core. If there is a miss, the search request is multicast to the remaining banks of the bankset. Migration is allowed to move heavily used blocks directly into the bank closest to the requesting core. Newly fetched blocks are also placed in the bank closest to the requesting core. An earlier experiment [1] shows that most of the blocks requested by a core are private to that core even when the running application is a shared-memory application.
Placing the incoming blocks in the nearest bank reduces the average bank access time. Also, since most of the cache hits can be served by the closest bank, multicast searches are rarely needed, and hence one can afford not to use any smart searching techniques. The need for migration is also reduced for the same reason. A minimal sketch of this hybrid lookup is given below.
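The following Python sketch summarises the hybrid lookup just described; the helpers `lookup`, `fetch_from_memory` and `closest_bank` are hypothetical stand-ins for the actual hardware mechanisms.

```python
# Hybrid bankset lookup, as described above: probe the closest
# peer bank first, multicast to the rest on a miss there, and
# place fetched blocks in the closest bank.

def hybrid_access(core, bankset, tag, lookup, fetch_from_memory,
                  closest_bank):
    near = closest_bank(core, bankset)        # first attempt: one probe
    if lookup(near, tag):
        return near                           # common case: nearest hit

    # Second attempt: multicast to the remaining peer banks.
    for bank in bankset:
        if bank != near and lookup(bank, tag):
            return bank                       # remote hit; migration may
                                              # later pull it into `near`

    # Bankset-wide miss: fetch the block and place it closest.
    fetch_from_memory(tag, near)
    return near
```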
The rest of the chapter is organized as follows. A detailed description of the proposed T-DNUCA is given in the next section. Experimental analysis is given in Section 6.3. Section 6.4 summarizes the chapter.