Impact of Set Sampling - Results and Analysis

3.5 Results and Analysis

3.5.8 Impact of Set Sampling

We compared the policy with set sampling (denoted by S) against the policy without set sampling (N). The results are reported on the metrics already explained in Section 3.5. With the sampler, we run each multi-programmed workload for 1 billion instructions after a warm-up of 1 billion instructions. The warm-up phase is lengthened because fewer sampled set entries are available to initialize the RDT: with so few sampled sets, the RDT takes longer to learn the behavior of the different patterns. Note that figures are presented only for quad-core.
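To make the sampling idea concrete, the following minimal sketch shows one common way of choosing sampled sets: every k-th set trains the RDT, and the remaining sets only use its predictions. The stride-based rule, the function names, and the set count of 4096 are illustrative assumptions; only the factor of 32 comes from Section 3.5.8.6, and the selection rule actually used by the policy may differ.

```python
# Hypothetical sketch of set sampling: only a small subset of LLC sets
# ("sampled sets") is allowed to train the predictor table (RDT).
# The stride-based selection rule is an assumption made for illustration.

SAMPLING_FACTOR = 32  # one sampled set per 32 sets (factor from Section 3.5.8.6)


def is_sampled_set(set_index: int, factor: int = SAMPLING_FACTOR) -> bool:
    """Return True if this set is one of the sets that train the RDT."""
    return set_index % factor == 0


def count_sampled_sets(num_sets: int, factor: int = SAMPLING_FACTOR) -> int:
    """Count the training sets in a cache with num_sets sets."""
    return sum(1 for s in range(num_sets) if is_sampled_set(s, factor))


if __name__ == "__main__":
    num_sets = 4096  # assumed LLC set count, not taken from the thesis
    print(count_sampled_sets(num_sets), "of", num_sets, "sets train the RDT")
```

Under this assumed rule, only about one set in 32 contributes training events, which is consistent with the longer warm-up needed before the RDT entries stabilize.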

Chapter 3. Reducing Write Cost by Dataless Entries and Prediction 86

Figure 3.19: Normalized write counts against N. (Plot: normalized LLC writes for policies N and S, split into STT-RAM and SRAM regions.)

3.5.8.1 Write Accesses

Fig. 3.19 presents the normalized LLC writes for policy N and S for the quad-core.

Our policy with set sampling (S) increases the writes in the STT region by 2.21% in dual-core and by 8% in quad-core. This marginal increase is mainly due to the slow learning of the different patterns in the RDT. Thus, the writes in the STT region for quad-core can be reduced by increasing the learning period. We conclude that the policy with set sampling keeps the write accesses close to those of the policy without set sampling.
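The percentages quoted in this section are changes of S relative to the baseline N. The sketch below shows that calculation; the function name and the write counts in the example are hypothetical and are not measurements from this work.

```python
def percent_change(value_s: float, value_n: float) -> float:
    """Percentage change of policy S relative to baseline N (positive = increase)."""
    if value_n == 0:
        raise ValueError("baseline value must be non-zero")
    return (value_s / value_n - 1.0) * 100.0


if __name__ == "__main__":
    # Made-up write counts chosen so that S performs 8% more writes than N.
    writes_n, writes_s = 1000, 1080
    print(round(percent_change(writes_s, writes_n), 2))
```

A normalized value of 1.08 in a plot normalized against N therefore corresponds to an 8% increase.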

3.5.8.2 Energy Consumption

Fig. 3.20 shows the normalized energy consumption of the policy with set sampling (S) against the policy without set sampling (N). In dual-core, S incurs a nominal increase in energy of 1.31% in the STT region and 0.37% in total, whereas the energy saving in the SRAM region is only 0.80%. The marginal increase in the energy of the STT region is due to the rise in the number of writes. The respective values in quad-core are an increase of 6.3% in the STT region, a saving of 5.5% in the SRAM region, and an overall rise of 1.1%. This marginal increase can be avoided by increasing either the number of sampled set entries or the initialization period.

Figure 3.20: Normalized energy consumption against N. (Plot: normalized LLC energy for policies N and S, split into STT-RAM, SRAM, and Predictor.)

Figure 3.21: Normalized CPI of S against N. (Plot: normalized CPI; series: Sampler.)

3.5.8.3 Performance

Normalized CPI for the policy with set sampling (S) against the policy without set sampling (N) is presented in Fig. 3.21. For both dual-core and quad-core, the policy with the sampler (S) maintains the same performance as the policy without the sampler (N). We observe a slight degradation in CPI due to the increase of writes in the STT region, but it is negligible.

3.5.8.4 Misses Per Kilo Instruction

Fig. 3.22 presents the increase in MPKI of the policy with set sampling (S) against the policy without set sampling (N) for quad-core. The policy with set sampling increases the MPKI by 4.25% (for dual-core) and 6.81% (for quad-core) compared to the policy without set sampling (N). This increase in MPKI is mainly due to the untrained patterns in the RDT, for which the block is replaced unconditionally. However, if we increase the number of sampled set entries (which leads to larger storage and area overheads) or increase the warm-up time, the MPKI values can be improved further.

Figure 3.22: Increase in MPKI for S against N. (Plot: % MPKI increase; series: Sampler.)

Figure 3.23: Accuracy of RDAWIP with set sampling for quad core. (Plot: paired bars for S and N; categories: Training, Over-estimation, Under-estimation, Correct estimation.)
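For reference, MPKI is the number of cache misses per thousand executed instructions. The sketch below computes it over a 1 billion instruction window, matching the measurement window used in this section; the miss count is a made-up value for illustration only.

```python
def mpki(misses: int, instructions: int) -> float:
    """Misses per kilo instruction over a measurement window."""
    if instructions <= 0:
        raise ValueError("instruction count must be positive")
    return misses * 1000.0 / instructions


if __name__ == "__main__":
    window = 1_000_000_000  # 1 billion instructions, as in the experiments
    misses = 5_000_000      # hypothetical miss count, not a measured value
    print(mpki(misses, window))
```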

3.5.8.5 Prediction Accuracy

Fig. 3.23 presents the accuracy of RDAWIP with set sampling for the quad-core.

As in the previous subsection, the graph shows the log (base 2) value of each entity, and the statistics shown in Fig. 3.23 are collected during 1 billion instructions after warming up the sampler for 1 billion instructions. From the graph, we can conclude that, compared to the policy without set sampling (N), a larger number of cache entries go through the initialization or update phase in the policy with set sampling (S). This is because of the slow learning behavior of the RDT entries due to the limited availability of sampled set entries. The slow learning, in turn, affects the prediction accuracy because of the unstable entries stored in the RDT.

Compared to the policy without set sampling (N), out of 2²⁴ evicted blocks, 2¹⁹ (dual-core) and 2²⁰ (quad-core) evicted blocks are underestimated, and 2²¹ (dual-core) and 2²² (quad-core) evicted blocks are overestimated, whereas 2²³ evicted blocks (for both dual-core and quad-core) are correctly estimated. However, if we increase the sampler size or the initialization period, we can get better prediction accuracy at the cost of additional area and storage overhead.
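Since the counts are given as powers of two, it can help to convert them into shares of the 2²⁴ evicted blocks. The sketch below does this for the exponents quoted above. The exponents are rounded log values, so the resulting shares are approximate, and the interpretation of the remainder ("other") as blocks whose entries are still in the training phase is an assumption, not a statement from the text.

```python
# Log2 block counts quoted in the text for the policy with set sampling (S).
TOTAL_LOG2 = 24
COUNTS_LOG2 = {
    "dual": {"under": 19, "over": 21, "correct": 23},
    "quad": {"under": 20, "over": 22, "correct": 23},
}


def shares(log2_counts: dict, total_log2: int = TOTAL_LOG2) -> dict:
    """Convert log2 block counts into percentages of all evicted blocks."""
    total = 2 ** total_log2
    result = {name: 100.0 * 2 ** exp / total for name, exp in log2_counts.items()}
    # Remainder not covered by the three categories (assumed: still training).
    result["other"] = 100.0 - sum(result.values())
    return result


if __name__ == "__main__":
    for cores, counts in COUNTS_LOG2.items():
        print(cores, shares(counts))
```

With these exponents, roughly half of the evicted blocks are correctly estimated in both configurations, while the overestimated share is about 12.5% for dual-core and 25% for quad-core.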

3.5.8.6 Storage Overhead

The set sampling used with the proposed policy (S) reduces the storage and area overhead compared to the policy without set sampling (N). The storage overhead with respect to the hybrid L2 cache is 0.28% (for dual-core) and 0.275% (for quad-core). Similarly, the area overhead is 1.02% (for dual-core) and 0.79% (for quad-core). These lower overheads are due to the limited number of sampled set CMD entries used to train the RDT, which are reduced by a factor of 32 compared to the policy without set sampling. The rest of the entries in the CMD contain only the Train bit (1 bit) and the RDT pointer (9 bits).
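The effect of sampling on CMD storage can be sketched with a simple bit count. The 1-bit Train field, the 9-bit RDT pointer (which can index up to 2⁹ = 512 RDT entries), and the factor of 32 are taken from the text. The width of the additional training fields in a sampled entry is not stated here, so `sampled_extra_bits` is a hypothetical parameter, as are the entry count in the example and the assumption that sampled entries also carry the Train bit and the RDT pointer.

```python
TRAIN_BITS = 1         # from the text
RDT_POINTER_BITS = 9   # from the text; indexes up to 2**9 = 512 RDT entries
SAMPLING_FACTOR = 32   # from the text


def cmd_bits_with_sampling(num_entries: int, sampled_extra_bits: int,
                           factor: int = SAMPLING_FACTOR) -> int:
    """CMD bits when only 1/factor of the entries hold the extra training fields."""
    base = TRAIN_BITS + RDT_POINTER_BITS
    sampled_entries = num_entries // factor
    return num_entries * base + sampled_entries * sampled_extra_bits


def cmd_bits_without_sampling(num_entries: int, sampled_extra_bits: int) -> int:
    """CMD bits when every entry holds the extra training fields."""
    return num_entries * (TRAIN_BITS + RDT_POINTER_BITS + sampled_extra_bits)


if __name__ == "__main__":
    entries, extra = 65536, 20  # hypothetical values for illustration only
    print(cmd_bits_with_sampling(entries, extra))
    print(cmd_bits_without_sampling(entries, extra))
```

The sketch only illustrates why restricting the training fields to the sampled sets shrinks the CMD; it does not reproduce the overhead percentages reported above.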

3.5.8.7 Discussion

As is evident from Sections 3.5.8.1, 3.5.8.2, 3.5.8.4 and 3.5.8.5, the set sampler degrades the metrics and the prediction accuracy compared to N. However, the significant improvement in storage and area overhead (nearly 14% in both dual-core and quad-core) obtained by sampling makes it worthwhile for embedded systems with limited area.

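| Core | Metric | Ref. Policy | P     | T      | M     | N     | S     |
|------|--------|-------------|-------|--------|-------|-------|-------|
| Quad | Writes | RWHCA       | 34.4% | 35.8%  | 34.9% | 36.2% | 35.9% |
| Quad | Writes | WI          | 31.2% | 32.6%  | 31.6% | 33.1% | 32.8% |
| Quad | Writes | P           | -     | 0.64%  | 0.42% | 2.74% | 2.05% |
| Quad | Energy | RWHCA       | 32.1% | 33.2%  | 32.4% | 35%   | 33.5% |
| Quad | Energy | WI          | 22.8% | 24.2%  | 23.2% | 26.1% | 24.3% |
| Quad | Energy | P           | -     | 1.26%  | 0.47% | 3.71% | 1.37% |
| Quad | MPKI   | RWHCA       | 13.8% | 12.1%  | 15.6% | 15.1% | 14.3% |
| Quad | MPKI   | WI          | 15.3% | 13.5%  | 17.0% | 16.4% | 15.7% |
| Quad | MPKI   | P           | -     | -1.41% | 3.57% | 3.02% | 2.93% |
| Octa | Writes | RWHCA       | 32.6% | 33.9%  | 33.5% | 34.8% | 33.6% |
| Octa | Writes | WI          | 26.6% | 28.1%  | 27.6% | 29%   | 27.7% |
| Octa | Writes | P           | -     | 2%     | 1.3%  | 3.2%  | 1.5%  |
| Octa | Energy | RWHCA       | 22.1% | 24.4%  | 23.2% | 25.2% | 23.3% |
| Octa | Energy | WI          | 18.1% | 20.5%  | 19.3% | 21.3% | 19.3% |
| Octa | Energy | P           | -     | 1.52%  | 0.15% | 2.6%  | 1.34% |
| Octa | MPKI   | RWHCA       | 1.95% | 0.40%  | 5.93% | 4.23% | 4.0%  |
| Octa | MPKI   | WI          | 18.8% | 17.4%  | 25.2% | 23.6% | 22.3% |
| Octa | MPKI   | P           | -     | -2.2%  | 10.4% | 8.65% | 3.8%  |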

Table 3.9: Percentage improvement values for multi-threaded workloads on quad-core and octa-core

3.5.9 Analysis on Multi-threaded Workloads with Larger

In document: LiNoVo: Longevity Enhancement of Non-Volatile Caches by (pages 115-120)