Experimental Analysis of CMP-SVR - Effective Utilisation of LLCs by Managing Associativity, Placement and Mapping

5.2.1 Additional Search Time

Since an indirect hit requires accessing the cache twice (first simultaneously with the SA-TGS search, and again after the tag is found in SA-TGS), the time required for an indirect hit is twice that of a direct hit. The time to detect a miss remains the same. A detailed discussion was given in Section 4.3.3.2 during the description of CMP-VR.
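The timing relation above can be illustrated with a minimal sketch. The function name, the hit counts and the 4-cycle direct-hit latency are hypothetical values chosen for illustration; only the rule that an indirect hit costs twice a direct hit comes from the text.

```python
def average_hit_time(direct_hits, indirect_hits, direct_hit_time):
    """Mean L2 hit latency when an indirect hit needs two cache accesses.

    All arguments are illustrative assumptions, not measured values.
    """
    total_hits = direct_hits + indirect_hits
    if total_hits == 0:
        raise ValueError("no hits recorded")
    # An indirect hit accesses the cache once in parallel with the SA-TGS
    # search and once more after the tag is found, so it costs twice as much.
    indirect_hit_time = 2 * direct_hit_time
    total_time = direct_hits * direct_hit_time + indirect_hits * indirect_hit_time
    return total_time / total_hits


# Hypothetical example: 900 direct hits, 100 indirect hits, 4-cycle direct hit.
print(average_hit_time(direct_hits=900, indirect_hits=100, direct_hit_time=4))
```

With these assumed numbers, the average hit time rises only modestly above the direct-hit latency, because indirect hits are a small fraction of all hits.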

5.2.2 Implementation of CMP-SVR in TCMP

As mentioned in Section 2.3.1, in TCMP-based architectures the DAM-based policies are implemented on each bank separately, and CMP-SVR does the same: it is applied to each bank separately, and the process is transparent from outside the bank. Logically, it is also possible to apply CMP-SVR to selected banks only rather than to all banks, but we have not considered such designs in our experiments. Since CMP-SVR is a local utilisation enhancement technique, inter-bank utilisation enhancement policies such as T-DNUCA (cf. Chapter 6), Co-Operative Caching [12] and HK-NUCA [9] can be combined with it.

Component                              Parameters
No. of tiles                           16
Processor                              UltraSPARC III+
L1 I/D cache                           64KB, 4-way
LLC size, bank size                    4MB, 256KB 4-way
Memory                                 1GB, 4KB/page
CMP-SVR: reserve ways per set (R)      2
CMP-SVR: fellow-group size (F)         4

Table 5.1: System Parameters

Figure 5.2: Normalized performance comparison of CMP-SVR with baseline design: (a) MPKI, (b) CPI.

benchmarks from the PARSEC benchmark suite have been used for the simulation. The benchmarks are: flud, face, frt16, bd16, vp4, frq, blck and mix2. A detailed description of the benchmarks is given in Section 3.3.2.

5.3.1 Simulation Results and Analysis

We first compare our proposed method with the baseline TCMP. The number of L2 misses in CMP-SVR and in the baseline can be calculated as follows:

• Total L2 misses in CMP-SVR = Total L1 miss - (Total direct-hit + Total indirect-hit).

• Total L2 misses in baseline = Total L1 miss - Total L2 hit.
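The two formulas above, together with the normalization used in Figure 5.2, can be sketched as follows. The function names and all counter values are hypothetical; the sketch also assumes, for simplicity, that both designs see the same number of L1 misses and execute the same number of instructions.

```python
def l2_misses_cmp_svr(l1_misses, direct_hits, indirect_hits):
    """Total L2 misses in CMP-SVR: L1 misses not served by either kind of hit."""
    return l1_misses - (direct_hits + indirect_hits)


def l2_misses_baseline(l1_misses, l2_hits):
    """Total L2 misses in the baseline: L1 misses not served by an L2 hit."""
    return l1_misses - l2_hits


def mpki(misses, instructions):
    """Misses per thousand instructions."""
    return misses * 1000 / instructions


# Hypothetical counters (not taken from the experiments).
instructions = 1_000_000
l1_misses = 10_000
baseline = l2_misses_baseline(l1_misses, l2_hits=6_000)
cmp_svr = l2_misses_cmp_svr(l1_misses, direct_hits=6_000, indirect_hits=800)

# Normalize CMP-SVR's MPKI to the baseline, as done for Figure 5.2(a).
normalized_mpki = mpki(cmp_svr, instructions) / mpki(baseline, instructions)
reduction_percent = (1 - normalized_mpki) * 100
print(baseline, cmp_svr, normalized_mpki, reduction_percent)
```

In this made-up case, the indirect hits remove 800 of the 4,000 baseline misses, which shows up as a normalized MPKI below 1.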

We consider the total size of the L2 cache to be 4MB; with 16 banks, the size of each bank is therefore 4MB/16 = 256KB. The detailed configurations of both the baseline and CMP-SVR are given in Table 5.1.

Figure 5.2 shows the performance comparison of CMP-SVR with the baseline design. Both graphs in the figure show the results normalized to the corresponding values of the baseline design.

Figure 5.2(a) shows that CMP-SVR reduces MPKI by 7.86% to 28.8%, with an average reduction of 20.61%. In other words, on average 20.61% of the requests that would be misses in the baseline design find their block in RT and become indirect hits. Figure 5.2(b) compares the designs in terms of CPI: CMP-SVR improves CPI by 1.04% to 9.45%, with an average improvement of 6.76%.

5.3.2 Hardware Overheads

The additional storage (compared to the baseline) required by CMP-SVR depends on the size of SA-TGS. Each SA-TGS entry stores a tag address and a validity bit. For instance, consider the cache shown in Figure 5.1: a 4KB, 8-way set-associative cache with a block size of 64 bytes. Both the reserve ways per set and the fellow-group size are 2. The SA-TGS shown in the figure has 4 ways and 4 sets, hence 16 entries in total (details are given in Section 5.2). The cache has 4KB/(64 bytes × 8 ways) = 8 sets, so with a 36-bit address space, 6 bits are used for the block offset and 3 bits for the set index, leaving a tag address of 27 bits per cache block. Each SA-TGS entry therefore stores 27 (tag) + 1 (validity) = 28 bits. The total storage requirement of SA-TGS is 16 × 28 = 448 bits = 56 bytes, which is only 1.36% of the total cache size. The additional storage requirement of CMP-SVR is thus 1.36% of the baseline.
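The arithmetic of this example can be reproduced with a short sketch. The function and parameter names are hypothetical; the input values are those of the Figure 5.1 example. The computed ratio is 56/4096 ≈ 0.0137, which the text reports as 1.36%.

```python
def sa_tgs_overhead(cache_bytes, ways, block_bytes, address_bits, sa_tgs_entries):
    """Storage needed by SA-TGS, relative to the cache's data capacity.

    Assumes power-of-two block size and set count, and one validity bit
    per SA-TGS entry (as in the worked example).
    """
    num_sets = cache_bytes // (block_bytes * ways)
    offset_bits = block_bytes.bit_length() - 1  # log2 of a power of two
    index_bits = num_sets.bit_length() - 1
    tag_bits = address_bits - index_bits - offset_bits
    entry_bits = tag_bits + 1                   # tag + validity bit
    total_bits = sa_tgs_entries * entry_bits
    fraction = (total_bits / 8) / cache_bytes   # SA-TGS bytes over cache bytes
    return tag_bits, total_bits, fraction


# Figure 5.1 example: 4KB, 8-way, 64-byte blocks, 36-bit addresses,
# SA-TGS with 4 ways x 4 sets = 16 entries.
tag_bits, total_bits, fraction = sa_tgs_overhead(
    cache_bytes=4 * 1024, ways=8, block_bytes=64,
    address_bits=36, sa_tgs_entries=4 * 4,
)
print(tag_bits, total_bits, total_bits // 8, fraction)
```

Passing the number of SA-TGS entries as a parameter keeps the sketch independent of how R and F determine the SA-TGS geometry, which is described in Section 5.2.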

We use CACTI 6.0 [2] to estimate the power/energy consumption of both the baseline and CMP-SVR. All the changes made by CMP-SVR in TCMP are internal to the tiles and require no changes in the NoC. The energy overhead of CMP-SVR is 1.8% compared to the baseline, while the corresponding overhead of CMP-VR is more than 20%.

5.3.3 Comparison with CMP-VR

The results of CMP-SVR have been compared with those of CMP-VR. The configuration used for CMP-VR is the same as in Table 5.1 (except for the fellow-group size, as CMP-VR has no fellow-group concept). Both CMP-SVR and CMP-VR have been compared with a baseline of the same size. Table 5.2 shows the improvements of CMP-VR as well as CMP-SVR over the baseline design. CMP-SVR has a much lower energy overhead than CMP-VR: CMP-VR has an energy overhead of more than 20% (23.63% in Table 5.2), whereas the overhead of CMP-SVR is 1.8%. The storage overheads of the two architectures are almost the same. The performance of CMP-SVR is slightly lower than that of CMP-VR.

Cache Design    Improvements (%)      Energy Overheads (%)
                CPI       MPKI
CMP-VR          7.75      21.55       23.63
CMP-SVR         6.76      20.61       1.8

Table 5.2: Improvements and energy overhead of CMP-SVR and CMP-VR over the baseline.

5.3.4 Analysis of the Results

In particular, CMP-SVR with the set-associative SA-TGS has a much lower energy overhead (1.8%) than CMP-VR (23.63%). CMP-SVR shows an average CPI improvement of 6.76% over the baseline (Table 5.2), which is slightly less than that of CMP-VR (7.75%).
