As discussed in the previous section, a full-system simulator such as GEMS+Simics can execute real workloads on the simulated architecture. The results collected from these executions are used to analyze the performance of the architecture. As CMP architectures are new, hardware manufacturers and researchers find it difficult to obtain test workloads that accurately represent real-world behavior.
The Princeton Application Repository for Shared-Memory Computers (Parsec) [3] is a benchmark suite composed of multi-threaded applications. The applications can be used to evaluate and develop next-generation CMPs. It was created collaboratively by Intel and Princeton University to drive research efforts on future computer systems. Parsec is freely available and is used for both academic and industrial research. Some major objectives of Parsec are:
• Focus on multithreaded applications.
• Different input sizes for each workload.
• Programs drawn from different real-world problem domains.
The benchmarks that were publicly available before Parsec are application-specific and mostly available only in unparallelized versions [3]. Version 2.1 of the Parsec benchmark suite has 12 workloads, each of which is multithreaded and parallelized. The applications are chosen from different real-world areas such as finance, media processing, computer vision, enterprise services and animation physics. Table 3.1 gives a detailed description of the Parsec workloads. Multithreaded applications share and exchange data among their threads. The data usage details of the benchmarks are given in Table 3.2. Both tables are taken from [3], which also describes each property in detail. The workloads are also called programs, applications or benchmarks.
Program        Application Domain   Parallelization Model   Granularity   Working Set
blackscholes   Financial Analysis   data-parallel           coarse        small
bodytrack      Computer Vision      data-parallel           medium        medium
canneal        Engineering          unstructured            fine          unbounded
dedup          Enterprise Storage   pipeline                medium        unbounded
facesim        Animation            data-parallel           coarse        large
ferret         Similarity Search    pipeline                medium        unbounded
fluidanimate   Animation            data-parallel           fine          large
freqmine       Data Mining          data-parallel           medium        unbounded
streamcluster  Data Mining          data-parallel           medium        medium
swaptions      Financial Analysis   data-parallel           coarse        medium
vips           Media Processing     data-parallel           coarse        medium
x264           Media Processing     pipeline                coarse        medium
Table 3.1: The inherent key characteristics of the Parsec benchmarks. Detailed descriptions are given in [3].
Program                                      Data Sharing   Data Exchange
blackscholes, swaptions                      low            low
bodytrack, freqmine                          high           medium
canneal, dedup, ferret, x264                 high           high
facesim, fluidanimate, streamcluster, vips   low            medium
Table 3.2: The data usage behavior of the Parsec benchmarks. A detailed description is given in [3].
Each application comes with input sets of different sizes: small, medium, large, etc. Users can run the applications with any input size, depending on their requirements and the demands of the architecture.
Another popular benchmark suite used for CMPs is SPEC CPU2006 [98]. One of the main reasons for choosing Parsec is that it is freely available. SPLASH-2 [101] was a widely used benchmark suite for CMP architectures in the last decade, but because of its smaller input sizes it is not suitable for the current large LLCs.
3.2.1 Benchmark Description of Parsec
This section describes the properties of some of the Parsec benchmarks used in our work for the performance comparison of different CMP-based architectures. Detailed descriptions of the benchmarks are given in [3].
3.2.1.1 blackscholes
The blackscholes application performs financial analysis. It is an Intel Recognition, Mining and Synthesis (RMS) benchmark that analytically calculates the prices for a portfolio of European options with the Black-Scholes partial differential equation (PDE). The benchmark is representative of the variety of PDE solvers used in financial analysis. The program divides the portfolio into work units that are processed concurrently, with each thread handling one work unit.
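As a minimal sketch of the per-option computation, the closed-form Black-Scholes price of a European call can be written as follows. The function and parameter names are illustrative assumptions and are not taken from the Parsec source; in the benchmark, each thread would evaluate a formula of this kind for the options in its work unit.

```python
import math

def norm_cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def black_scholes_call(spot, strike, rate, volatility, maturity):
    """Closed-form Black-Scholes price of a European call option.

    All names here are illustrative; they do not mirror the benchmark code.
    """
    sigma_sqrt_t = volatility * math.sqrt(maturity)
    d1 = (math.log(spot / strike)
          + (rate + 0.5 * volatility ** 2) * maturity) / sigma_sqrt_t
    d2 = d1 - sigma_sqrt_t
    discounted_strike = strike * math.exp(-rate * maturity)
    return spot * norm_cdf(d1) - discounted_strike * norm_cdf(d2)
```

For an at-the-money option with spot 100, strike 100, rate 5%, volatility 20% and one year to maturity, this evaluates to approximately 10.45.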
3.2.1.2 bodytrack
This benchmark tracks the 3D pose of a human body using multiple cameras. It uses an annealed particle filter to track the pose, with edges and the foreground silhouette as image features. For an input video containing many frames, bodytrack selects the frame at time stamp t and computes its likelihood.
The likelihood measures how well the 3D body model aligns with the foreground and edges in the images. It is computed from two attributes of an image: the foreground map and the edge distance map.
Bodytrack has a persistent thread pool. The main thread sends tasks to the thread pool whenever it reaches a parallel kernel, and it waits for the worker threads to finish their execution before proceeding further.
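The thread-pool behavior described above can be illustrated with a small sketch. It is not bodytrack's implementation: particle_weight is a hypothetical placeholder for the likelihood kernel, and run_frames and the worker count are assumptions chosen for the example.

```python
from concurrent.futures import ThreadPoolExecutor

def particle_weight(particle):
    # Hypothetical placeholder for the likelihood computation of one particle.
    return particle * particle

def run_frames(frames, workers=4):
    """Process each frame with one persistent pool of worker threads."""
    results = []
    # The pool is created once and reused for every frame.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for particles in frames:
            # Parallel kernel: the main thread hands the work to the pool.
            # list() blocks until every worker has finished, so the main
            # thread proceeds only after the kernel is complete.
            weights = list(pool.map(particle_weight, particles))
            results.append(sum(weights))
    return results
```

Creating the pool once avoids paying the thread start-up cost at every parallel kernel, which is the usual motivation for a persistent pool.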
3.2.1.3 facesim
This benchmark is an Intel RMS application that was originally developed by Stanford University. It is an animation application that takes a model of a human face and a time sequence of muscle activations as input and computes a visually realistic animation of the modeled face by simulating the underlying physics. Human faces in particular receive more attention from users than other details of a virtual world, which makes their realistic presentation a key element of animations.
3.2.1.4 ferret
This application performs content-based similarity search of feature-rich data, as used in Internet search engines. Feature-rich data includes audio, images, video, 3D shapes, etc. It uses the Ferret toolkit [102] for searching. The application has six modules: the first and last modules are serial, while the remaining four are parallel. The first module handles input and the last module handles output.
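The six-module structure can be sketched as a thread-based pipeline with a serial input stage, a set of parallel middle stages and a serial output stage. This is an illustration under stated assumptions only: run_pipeline, worker, DONE, the number of workers per stage and the stage functions are hypothetical and are not taken from ferret.

```python
import queue
import threading

DONE = object()  # sentinel that tells a worker to stop

def worker(stage_fn, inbox, outbox):
    # Take an item, process it, and pass it on to the next stage.
    while True:
        item = inbox.get()
        if item is DONE:
            return
        index, payload = item
        outbox.put((index, stage_fn(payload)))

def run_pipeline(inputs, middle_stages, workers_per_stage=2):
    """Serial input stage, parallel middle stages, serial output stage."""
    queues = [queue.Queue() for _ in range(len(middle_stages) + 1)]
    pools = []
    for i, stage_fn in enumerate(middle_stages):
        pool = [threading.Thread(target=worker,
                                 args=(stage_fn, queues[i], queues[i + 1]))
                for _ in range(workers_per_stage)]
        for thread in pool:
            thread.start()
        pools.append(pool)

    # First module (serial): feed the inputs, tagged with their position.
    for index, payload in enumerate(inputs):
        queues[0].put((index, payload))

    # Shut the parallel stages down in order, one sentinel per worker.
    for i, pool in enumerate(pools):
        for _ in pool:
            queues[i].put(DONE)
        for thread in pool:
            thread.join()

    # Last module (serial): collect the results and restore input order.
    results = []
    while not queues[-1].empty():
        results.append(queues[-1].get())
    return [payload for _, payload in sorted(results, key=lambda r: r[0])]
```

Passing four functions as middle_stages reproduces the shape described above: items flow through the four parallel stages concurrently, and tagging each item with its index lets the serial output stage emit results in input order.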
3.2.1.5 fluidanimate
This workload is included in Parsec because of the increasing importance of real-time animation and physical simulation in computer games. It is an Intel RMS application based on the Smoothed Particle Hydrodynamics (SPH) method [103]. Fluidanimate uses five kernels to simulate an incompressible fluid for interactive animation purposes. It produces its output by detecting and interpreting the surface of the incompressible fluid.
3.2.1.6 freqmine
The freqmine application performs Frequent Itemset Mining (FIMI) [104] with an array-based version of the Frequent Pattern-growth method. It is an Intel RMS benchmark that was originally developed by Concordia University. FIMI is the basis of Association Rule Mining (ARM), a common data mining problem in areas such as protein sequences, market data and log analysis. It is included in Parsec because of the increasing demand for data mining techniques. It is parallelized with OpenMP and uses three kernels that execute in parallel.
3.2.1.7 swaptions
This benchmark is included because of the increasing importance of partial differential equations (PDEs) and Monte Carlo simulation. It is used to price a portfolio of swaptions using the Heath-Jarrow-Morton (HJM) framework [105]. It is an Intel RMS workload. The HJM model is non-Markovian, so the prices cannot be computed by solving a PDE. Therefore, swaptions employs a Monte Carlo simulation. The program stores the whole portfolio in the swaptions array, where each entry represents a derivative. Swaptions divides the array into a number of blocks equal to the number of threads and assigns each block to a particular thread. To compute the prices, each thread iterates through all the swaptions in its block and calls the function HJM_Swaption_Blocking.
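The block partitioning can be sketched as follows. The sketch makes several assumptions: block_bounds, monte_carlo_price and price_portfolio are hypothetical names, and the Monte Carlo estimator is a toy stand-in (the average payoff of a uniform random draw) rather than the HJM path simulation performed by HJM_Swaption_Blocking.

```python
import random
from concurrent.futures import ThreadPoolExecutor

def block_bounds(n_items, n_threads):
    """Split range(n_items) into n_threads contiguous, near-equal blocks."""
    base, extra = divmod(n_items, n_threads)
    bounds, start = [], 0
    for tid in range(n_threads):
        end = start + base + (1 if tid < extra else 0)
        bounds.append((start, end))
        start = end
    return bounds

def monte_carlo_price(strike, trials, seed):
    # Toy stand-in for the HJM simulation: the mean of max(X - strike, 0)
    # over draws X from Uniform[0, 1).
    rng = random.Random(seed)
    total = sum(max(rng.random() - strike, 0.0) for _ in range(trials))
    return total / trials

def price_portfolio(strikes, n_threads=4, trials=1000):
    """Price every entry; each thread handles one contiguous block."""
    prices = [0.0] * len(strikes)

    def price_block(bounds):
        start, end = bounds
        for i in range(start, end):
            # Seeding by index keeps the result independent of n_threads.
            prices[i] = monte_carlo_price(strikes[i], trials, seed=i)

    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        list(pool.map(price_block, block_bounds(len(strikes), n_threads)))
    return prices
```

Because the blocks are disjoint, each thread writes to its own slice of the result array and no locking is needed.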
3.2.1.8 vips
The application includes fundamental image operations such as transformation and convolution. It is based on the VASARI Image Processing System (VIPS) [106]. The VASARI system is able to construct multi-threaded image processing pipelines transparently on the fly. The image transformation pipeline of the vips benchmark has 18 stages and is implemented in the VIPS operation im_benchmark. The 18 stages are grouped into the following kernels:
• Crop: This kernel removes 100 pixels from all the edges.
• Shrink: This kernel shrinks the image by 10% by applying a matrix transformation.
• Adjust white point and shadows: This kernel brightens the white point and pulls down the shadows in order to improve the visual quality of the image.
• Sharpen: This kernel enhances the edges of the output image. It removes blurring and gives the output image a better overall appearance.
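The first two kernels can be illustrated on a row-major image stored as a list of rows. This is only a sketch and not the VIPS implementation: the function names are hypothetical, nearest-neighbor resampling is substituted for the matrix transformation, and "shrinks by 10%" is assumed to mean that each dimension is reduced to 90% of its original size.

```python
def crop(image, margin=100):
    """Remove `margin` pixels from every edge of a row-major image."""
    height = len(image)
    return [row[margin:len(row) - margin]
            for row in image[margin:height - margin]]

def shrink(image, factor=0.9):
    """Resize to `factor` times the original size (nearest neighbor).

    Assumption: nearest-neighbor sampling stands in for the matrix
    transformation used by the benchmark.
    """
    height, width = len(image), len(image[0])
    new_height, new_width = int(height * factor), int(width * factor)
    return [[image[int(y / factor)][int(x / factor)]
             for x in range(new_width)]
            for y in range(new_height)]
```

Both functions return a new image and leave the input untouched, so they can be chained in the order in which the kernels are listed.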
3.2.1.9 x264
x264 is an H.264/AVC (Advanced Video Coding) video encoder. H.264 includes new encoding features such as increased sample bit depth precision, higher-resolution color information, variable block-size motion compensation (VBSMC) and context-adaptive binary arithmetic coding (CABAC). These features allow H.264 encoders to generate higher output quality at a lower bit rate, at the expense of significantly increased encoding and decoding time. The encoder uses motion compensation to remove data redundancy. The application is very flexible and is used for requirements ranging from video conferencing to HD movie distribution. H.264/AVC encoding is also required for the next-generation HD DVD and Blu-ray video players.