6.3 T-DNUCA: Experimental Analysis
6.3.3 Comparison Among the Different T-DNUCA Configurations
M1C3 gains little performance over M1C1. Table 6.3 shows the percentage of blocks permanently removed from the cache due to cascading replacement. On average, the percentage of blocks removed by M1C3 is only 8.65 percentage points lower than that of M1C1. However, M1C3 distributes these blocks among all the peer-banks, which increases the access time. The benefit of a higher cascading number is therefore offset by a higher communication cost. Such block distribution also increases network traffic and hence consumes more energy. We therefore suggest that M1C1 is the best configuration for running T-DNUCA.
Benchmark   M1C1   M1C3   Difference between M1C1 and M1C3
frrt        34%    11%    23.49
blck        93%    90%     3.33
mx1         80%    67%    12.77
mx2         94%    88%     5.62
Average     –      –       8.65
Table 6.3: The percentage of evicted blocks permanently removed from cache (L2) during the cascading replacement process.
6.4 Summary
In this chapter we proposed a dynamic NUCA design for tile-based CMPs (TCMPs). Static NUCA (SNUCA) has a fixed address mapping policy, whereas dynamic NUCA (DNUCA) allows blocks to relocate nearer to the processing core at runtime. In centralised CMPs, DNUCA with smarter block searching techniques performs better than SNUCA [5]. SNUCA is well understood and explored for tiled CMPs, whereas the same is not true of DNUCA. Because centralised CMPs and TCMPs differ architecturally, the T-DNUCA design differs from the original DNUCA design. In a TCMP each tile is associated with a core, which is not the case in centralised CMPs.
Different configurations of T-DNUCA, based on migration and cascading replacement, are compared with T-SNUCA and found to outperform it. With migration enabled and the maximum cascading number (M1C3), T-DNUCA improves on T-SNUCA by 31.11% in MPKI and 6.59% in CPI. The proposed migration and placement policies lead to considerable improvements in performance.
Mechanism for Block Distribution and Fast Searching in T-DNUCA
Our first DNUCA-based TCMP (T-DNUCA) was discussed in the previous chapter. T-DNUCA improves performance through better global utilisation and a smart initial placement policy. This chapter proposes a new DNUCA variant, called TLD-NUCA, for better utilisation and access time. It improves on the performance of T-DNUCA through better global utilisation and reduced on-chip communication.
7.1 Introduction
T-DNUCA discussed in the previous chapter has the following DNUCA properties:
• It divides the banks into multiple banksets; a block can be placed in any bank within a particular bankset.
• A heavily used block can be migrated from one bank to another within the same bankset.
In addition to these two basic DNUCA properties, T-DNUCA can distribute the load of a heavily used bank to other underused banks. A smart placement policy is also used to reduce the hit time. For a particular core, C, the closest bank in a bankset (bs) is called the home-bank for C in bs. Here, closest means minimum Manhattan distance. A core has a home-bank in each bankset. If C needs to access a block that belongs to bs, then C must first send the request to its home-bank in bs. If the block is found in the home-bank, it is sent to C; otherwise the request is multicast to all the remaining banks of bs. If any of these banks has the block, it sends the block directly to C. Otherwise a miss occurs, and the home-bank brings the block from main memory before sending it to C. Although the two-level access requires some additional time, the efficient placement and migration policies make the average hit time of T-DNUCA lower than that of T-SNUCA. Both policies try to reduce the need for the two-level search.
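The two-level search described above can be illustrated with a minimal Python sketch. It is not the simulator code used for the experiments. The names `Bank`, `home_bank` and `lookup`, the tie-break on bank id, and the choice to place a missed block in the home-bank are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Bank:
    bank_id: int
    x: int  # tile column of the bank (hypothetical coordinate scheme)
    y: int  # tile row of the bank
    blocks: set = field(default_factory=set)  # tags currently stored


def manhattan(ax, ay, bx, by):
    """Manhattan distance between two tiles."""
    return abs(ax - bx) + abs(ay - by)


def home_bank(core_xy, bankset):
    """Closest bank of the bankset to the core.

    Ties are broken by the lower bank id; this tie-break is an assumption.
    """
    cx, cy = core_xy
    return min(bankset, key=lambda b: (manhattan(cx, cy, b.x, b.y), b.bank_id))


def lookup(core_xy, bankset, tag):
    """Two-level search: home-bank first, then the remaining banks."""
    home = home_bank(core_xy, bankset)
    if tag in home.blocks:
        return "home-hit", home.bank_id
    # Second level: models the multicast to the remaining banks of the bankset.
    for bank in bankset:
        if bank is not home and tag in bank.blocks:
            return "remote-hit", bank.bank_id
    # Miss: the home-bank fetches the block from main memory.
    # Storing it in the home-bank is an assumption of this sketch.
    home.blocks.add(tag)
    return "miss", home.bank_id
```

For a core at tile (1, 0) and a bankset whose banks sit at the four corners of a 4x4 mesh, the bank at (0, 0) becomes the home-bank, and only requests that miss there trigger the second-level search.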
Although T-DNUCA improves performance compared to T-SNUCA, it can be further improved in the following aspects:
• The communication with the home-bank is mandatory in T-DNUCA. Since all the communications are done through the NoC, minimizing mandatory communication can reduce average hit time.
• The multicast search used is expensive in terms of energy consumption.
• The loads can be distributed among the banks within the bankset but there is no mechanism to balance the loads among multiple banksets.
This chapter proposes another technique, called TLD-NUCA, for better LLC utilisation and performance. TLD-NUCA treats the entire LLC as a single bankset for better utilisation of the banks. A single bankset means that a block can be placed in any bank within the LLC. The expensive multicast search of T-DNUCA is replaced with a centralised tag lookup mechanism. TLD-NUCA has an additional Tag Lookup Directory (TLD) placed in the central tiles. The TLD stores information about all the blocks in the LLC: each entry holds the tag address of a block and the bank-id where it currently resides. With a single bankset, the local bank of a core acts as its home-bank, so the mandatory communication between the requesting core and the home-bank required in T-DNUCA is no longer needed. Experimental analysis shows that TLD-NUCA performs better than both T-SNUCA and T-DNUCA.
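As an illustration of the centralised lookup, the following Python sketch models the TLD as a map from tag to bank-id. It is a simplified sketch rather than the proposed hardware design. The class and function names are hypothetical, and both the access order (local bank first, TLD on a local miss) and the placement of a missed block in the local bank are assumptions.

```python
class TagLookupDirectory:
    """Toy model of the TLD: one entry per LLC block, tag -> bank-id."""

    def __init__(self):
        self._bank_of = {}

    def lookup(self, tag):
        """Return the bank-id holding the block, or None if it is not in the LLC."""
        return self._bank_of.get(tag)

    def record(self, tag, bank_id):
        """Insert or update an entry, e.g. after a placement or a migration."""
        self._bank_of[tag] = bank_id

    def evict(self, tag):
        """Drop the entry when the block leaves the LLC."""
        self._bank_of.pop(tag, None)


def access(tld, local_bank_id, local_blocks, tag):
    """Serve a request from the core whose local bank is `local_bank_id`.

    `local_blocks` is the set of tags stored in that local bank.
    """
    if tag in local_blocks:
        return "local-hit", local_bank_id
    bank_id = tld.lookup(tag)  # one directory query replaces the multicast
    if bank_id is not None:
        return "remote-hit", bank_id
    # Miss in the whole LLC: assume the block is placed in the local bank.
    local_blocks.add(tag)
    tld.record(tag, local_bank_id)
    return "miss", local_bank_id
```

In this model a request that misses locally costs one directory query instead of a multicast to every other bank, which is the property the TLD is introduced for.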
Compared to T-DNUCA, the average improvements of TLD-NUCA are 6.5% in CPI and 16.8% in MPKI. The improvements over the baseline (T-SNUCA) are 13% (CPI) and 42.9% (MPKI). TLD-NUCA consumes 3.08% more energy and 10.5% more storage than the baseline. Although TLD-NUCA has additional energy requirements, its improved performance reduces the EDP by 11% compared to the baseline.
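The relation between the energy overhead and the EDP reduction can be checked with a back-of-the-envelope calculation. The sketch below assumes that execution time scales with CPI (same instruction count and clock frequency) and applies the reported averages directly. Both are simplifying assumptions, so the result only approximates the reported EDP figure, which presumably is averaged per benchmark.

```python
def relative_edp(energy_ratio, delay_ratio):
    """Energy-delay product relative to the baseline (baseline = 1.0)."""
    return energy_ratio * delay_ratio


# Averages reported for TLD-NUCA against the T-SNUCA baseline.
energy_ratio = 1 + 3.08 / 100  # 3.08% more energy
delay_ratio = 1 - 13 / 100     # 13% lower CPI, taken as 13% lower delay (assumption)

edp = relative_edp(energy_ratio, delay_ratio)
reduction_pct = (1 - edp) * 100

print(f"relative EDP: {edp:.3f}, reduction: {reduction_pct:.1f}%")
```

Under these assumptions the estimate is a reduction of roughly 10%, which is in line with the reported 11%.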
The purpose of global utilisation enhancement and the DNUCA-based TCMP have already been discussed in Chapter 2 and Chapter 6. Hence we start this chapter with a motivational discussion of why TLD-NUCA is a required design after T-DNUCA. The rest of the chapter is organised as follows. The motivation behind TLD-NUCA is discussed in the next section. Section 7.3 discusses the proposed TLD-NUCA. The experimental analysis of TLD-NUCA is given in Section 7.4. Finally, a summary of the chapter is given in Section 7.7.
Figure 7.1: Load distribution among the banksets of T-DNUCA. The T-DNUCA architecture considered for this experiment has four banksets: bs0, bs1, bs2 and bs3.