at the expense of significantly increased encoding and decoding time. Motion compensation is used to remove data redundancy. x264 is very flexible and serves requirements ranging from video conferencing to HD movie distribution. Moreover, H.264/AVC encoding is also required by next-generation HD DVD and Blu-ray video players.
and they will run on the same core until the execution completes. For example, using 4 copies of vips we can create a multi-programmed benchmark in which, on a 16-core CMP, each copy of vips is bound to 4 cores.
Note that every PARSEC application is multi-threaded; hence, even in the multi-programmed environment each application has multiple threads that share resources and code during execution. For multi-programmed benchmarks, the term benchmark refers to the combined workload.
3.3.2 Used Benchmark Applications
Based upon the multi-threaded benchmarks provided by PARSEC, we created different combinations for our experiments. Each combination is either a single PARSEC benchmark or a composition of more than one benchmark application. Table 3.3 gives the details of the benchmarks used in our experimental analysis. Also note that not all benchmarks mentioned in the table are used for verification of all of our proposed or prior architectures.
3.3.3 Executing Benchmarks
To execute a multi-threaded benchmark, the benchmark is run on the target machine up to the start of its ROI. Once the ROI is reached, execution stops automatically because of the inserted magic instruction. Reaching the ROI implies that the initialisation of the benchmark is over and all the threads have been spawned.
Once all threads are created, the benchmark is run for a further 50 million cycles to warm up. The warm-up period is necessary to avoid counting compulsory misses in the caches and to allow the NoC architecture to settle properly. After warming up, the Ruby profiler is cleared through a command, and from this point the actual (measured) execution begins. A few benchmarks are executed up to termination, i.e. until ROI completion, whereas the remaining benchmarks are executed for a fixed number of cycles. Note that the number of execution cycles varies across the benchmarks but is never less than 800 million (except when a benchmark terminates earlier).
Multi-threaded benchmarks
blackscholes (black), bodytrack (body), ferret (ferret), fluidanimate (fluid),
freqmine (freq), swaptions (swap), vips (vips) and x264 (x264)

Multi-programmed benchmarks
Benchmark   Details
black4      4 copies of black.
ferret4     4 copies of ferret.
fluid4      4 copies of fluid.
freq4       4 copies of freq.
swap4       4 copies of swap.
vips4       4 copies of vips.
black16     16 copies of black.
body16      16 copies of body.
ferret16    16 copies of ferret.
fluid16     16 copies of fluid.
freq16      16 copies of freq.
swap16      16 copies of swap.
vips16      16 copies of vips.
Table 3.3: List of all the multi-threaded and multi-programmed benchmarks used for the experiments in this thesis.
These execution cycles are Simics cycles, which are 4x the Ruby cycles of GEMS; hence, 800 million Simics cycles correspond to 200 million Ruby cycles. The execution procedure fixed for a benchmark is maintained across all the architectures being compared.
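As a minimal illustration of this cycle accounting (only the 4x ratio stated above is assumed; the helper name is hypothetical):

    SIMICS_PER_RUBY = 4   # one Ruby cycle of GEMS corresponds to 4 Simics cycles

    def to_ruby_cycles(simics_cycles):
        """Convert a Simics cycle count into the equivalent Ruby cycle count."""
        return simics_cycles // SIMICS_PER_RUBY

    # 800 million Simics cycles is the minimum run length used above.
    print(to_ruby_cycles(800_000_000))   # -> 200000000 Ruby cycles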
To execute a multi-programmed benchmark, the first step is to load all the applications belonging to the benchmark one by one. Each application is then executed until its ROI. Once its threads have been spawned, they are bound to a set of cores. Note that each program may have multiple threads, and these threads are bound to the cores assigned to that application. After thread binding, the execution of that application remains paused. This binding process is then repeated for the other applications. Once thread binding is done for all the applications, all of them are simultaneously resumed from their corresponding ROIs. The remaining steps, such as warm-up and the running policy, are the same as for multi-threaded benchmarks. Thread binding for multi-programmed benchmarks is also done through the Solaris commands mentioned earlier; a sketch of this binding step is given below.
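The following sketch illustrates the binding step only, assuming Solaris processor sets (psrset) are used to pin each paused application to its own group of cores; the PIDs, set ids and the even core partitioning are placeholders, and the actual commands in our scripts may differ.

    # Assumed sketch: pin each (paused) application of a multi-programmed
    # benchmark to its own group of cores using Solaris processor sets.
    # psrset is a standard Solaris tool, but the exact commands and core
    # partitioning used in our scripts may differ from this illustration.

    def core_groups(num_cores, num_apps):
        """Partition the CMP cores evenly among the applications."""
        per_app = num_cores // num_apps   # e.g. 16 cores / 4 apps = 4 cores each
        return [list(range(i * per_app, (i + 1) * per_app)) for i in range(num_apps)]

    def binding_commands(app_pids, num_cores):
        """Emit Solaris commands binding each application (by PID) to its core group.
        Set ids are assumed to be 1..N here; in practice psrset -c reports the id."""
        cmds = []
        groups = core_groups(num_cores, len(app_pids))
        for set_id, (pid, cores) in enumerate(zip(app_pids, groups), start=1):
            cmds.append("psrset -c " + " ".join(str(c) for c in cores))  # create set
            cmds.append("psrset -b %d %d" % (set_id, pid))               # bind the app
        return cmds

    # Example: 4 copies of vips (PIDs are illustrative) on a 16-core CMP.
    for cmd in binding_commands([1201, 1202, 1203, 1204], num_cores=16):
        print(cmd)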
3.3.4 Comparing Different CMP Architectures
To analyse the performance of our proposed TCMP and CCMP architectures, we compare them with other existing architectures in terms of IPC, energy consumption, EDP, running temperature, implementation overheads, etc. To do so, we model all of our TCMP and CCMP architectures on the GEMS+Simics full-system simulator and run the PARSEC benchmarks on top of them. We record different statistics during the execution of each benchmark, as discussed in Section 3.1.2.2, and based on these statistics we compare the performance of the architectures.
Usually, an architecture is evaluated with different configurations and design choices, for example with various cache sizes, associativities, block-sharing capabilities, etc. Details regarding these are provided in the relevant chapters and sections wherever required. For all the architectures and configurations, the process of running a particular benchmark is kept the same to maintain uniformity.
The individual results for each benchmark are reported, and their geometric mean (average) is derived in our result sections.
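For instance, a geometric mean over per-benchmark results (say, IPC normalised to the baseline) can be computed as in the small sketch below; the numbers used are purely illustrative.

    from math import prod

    def geometric_mean(values):
        """Geometric mean of per-benchmark results (e.g. IPC normalised to baseline)."""
        assert values and all(v > 0 for v in values)
        return prod(values) ** (1.0 / len(values))

    # Illustrative normalised-IPC values for four benchmarks.
    print(round(geometric_mean([1.08, 0.97, 1.15, 1.04]), 3))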
3.3.5 Our Architectural Models
Throughout this thesis, SNUCA-based TCMP and CCMP are considered as our baseline architectures; they are already implemented in GEMS for different coherence protocols, cache sizes and replacement policies. The architectures are shown in Figure 1.2.
The MESI-CMP protocol is used as the coherence protocol in our cache hierarchy. We use different cache sizes and associativities for experimental versatility. Our first three contributions use TCMP as the baseline, whereas the last one uses CCMP as its baseline architecture. The detailed configurations used in our simulations are provided in the subsequent chapters wherever required.
Mostly we perform our experiments considering the L2 as a shared LLC with a size of 2MB, 4MB, 8MB or 16MB. Using even larger caches may not be justified given the input sizes provided by the PARSEC benchmarks. Multi-banked LLCs are sliced into equal-sized banks; for example, in our baseline design an 8MB LLC (L2) with 16 banks has a bank size of 512KB. The associativity is the same across the LLC banks and is maintained uniformly.
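The equal-sized slicing and a static SNUCA-style address-to-bank mapping can be illustrated with the small sketch below; the bank-selection bits and block size used here are assumptions for illustration, and the actual interleaving in our GEMS configuration may differ.

    MB = 1024 * 1024

    def bank_size(llc_size_bytes, num_banks):
        """Equal-sized slicing of a multi-banked LLC."""
        assert llc_size_bytes % num_banks == 0
        return llc_size_bytes // num_banks

    def bank_index(physical_address, num_banks, block_size=64):
        """Static SNUCA-style mapping: interleave banks on the low-order
        address bits just above the block offset (illustrative choice)."""
        return (physical_address // block_size) % num_banks

    print(bank_size(8 * MB, 16) // 1024, "KB per bank")   # 512 KB per bank
    print(bank_index(0x1234_5678, num_banks=16))          # home bank of this address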
Chapter 4
Static Energy Reduction by Performance Linked Dynamic Cache Resizing (DiCeR)
In this chapter, we discuss dynamic tuning of the LLC size, a promising option for reducing cache leakage in modern CMPs.
Towards this, our policy dynamically shuts down or turns on cache banks based upon the system performance and the banks' usage statistics. Besides saving leakage energy, shutting down a cache bank remaps its future requests to another active bank, called the target bank. The proposed technique is evaluated with three different implementation policies.
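To make the idea concrete, the following minimal sketch shows how requests destined for a closed bank can be redirected to a target bank. The class, method names and the shutdown decision shown are purely illustrative; the actual usage statistics, thresholds and target-selection rules of DiCeR are defined by the policies described later in this chapter.

    class LLC:
        """Toy model of bank power states with request remapping."""
        def __init__(self, num_banks):
            self.active = [True] * num_banks      # power state of each bank
            self.remap_to = [None] * num_banks    # target bank for a closed bank

        def shut_down(self, bank, target):
            """Turn off 'bank' and redirect its future requests to 'target'."""
            assert self.active[target] and bank != target
            self.active[bank] = False
            self.remap_to[bank] = target

        def turn_on(self, bank):
            self.active[bank] = True
            self.remap_to[bank] = None

        def route(self, home_bank):
            """Bank that actually services a request whose home is 'home_bank'."""
            return home_bank if self.active[home_bank] else self.remap_to[home_bank]

    llc = LLC(num_banks=16)
    llc.shut_down(bank=5, target=4)   # e.g. a lightly used bank is closed
    print(llc.route(5))               # requests homed at bank 5 now go to bank 4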