International Journal of Electrical, Electronics and Computer Systems (IJEECS), ISSN (Online): 2347-2820, Volume 2, Issue 8-9, 2014
MapReduce Performance: An Experimental Analysis Using Hardware Scaling
S. Narayanswamy Iyer, Prateek Joshi, Gaurav Phadke, Pratyush Kulwal
Department of Computer Engineering, VIIT, Pune
Email: [email protected]

Abstract— In this age, data is everything. Organizations like Facebook generate close to 500 TB of data per day[1], which requires processing. The solution to this problem of handling Big Data, as it is called, is provided by the programming paradigm known as MapReduce. MapReduce was developed by Google for its own data processing needs, but has since been adapted and modified by numerous organizations such as Yahoo to perform their own data processing and number crunching. MapReduce is a flexible framework that supports development in many languages and on many platforms. Apache Hadoop is an open-source distributed computing architecture that provides libraries for running MapReduce-based jobs. The aim of our work is to find out how the performance of MapReduce algorithms scales up or down in a computing cluster of homogeneous commodity hardware, using both standard and customized Mapper and Reducer classes.
Keywords— Hadoop, MapReduce, Cluster, Performance, Distributed computing.
I. INTRODUCTION
MapReduce was developed by Google as an answer to its problem of creating an inverted index of the Web, through which search results could be fetched faster.
The results obtained were exemplary; in 2004 Google published a research paper highlighting the features of MapReduce[3], and it has since obtained a patent for the MapReduce technology. In this paper, we have attempted to analyze the results obtained from MapReduce programs by successively adding nodes to a cluster of commodity hardware. An introduction to MapReduce is presented in section II. Section III deals with Apache Hadoop and the specifications of our test cluster. In section IV, we present our findings about the performance of MapReduce using the standard PiEstimator example bundled with all Hadoop distributions. Section V consists of results from our own MapReduce functions, executed on an acquired data set, followed by the conclusions and future scope for research.
II. MAPREDUCE: AN INTRODUCTION
The greatest strength of MapReduce is its capacity to process parallelizable problems on large, distributed datasets by harnessing the processing capability of many nonspecialized machines in a cluster or grid.
MapReduce depends on two aspects of functional programming:
Map – Splitting the input into multiple blocks or splits and delegating them to the subsidiary or worker nodes to process.
Reduce – The nodes return the output of the Map phase to the master node, which aggregates or summarizes the results to display to the user.
MapReduce takes as input a set of key/value pairs.
These are specified by the programmer as per the data to be processed and the needed information. The output produced is also a set of key/value pairs, again according to the specifications mentioned in the Reduce function. To derive an analogy with relational databases, the Map function is a “group-by” clause, whereas Reduce is an aggregate function, like “average”. While this is a very rudimentary description of the working of MapReduce, more complex programs can be written using additional intermediate functions like Partitioners and Combiners, as well as chaining multiple MapReduce jobs together.
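To make this key/value flow concrete, the following is a minimal illustrative sketch using the Hadoop Java API; it is not the code used later in this paper, and the input format (lines of the hypothetical form "city,latency") is an assumption. The Mapper plays the role of the "group-by", emitting (city, latency) pairs, and the Reducer plays the role of the aggregate, computing the average latency per city.

import java.io.IOException;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class AverageLatency {

    // Map phase ~ "group-by": each input line "city,latency" becomes an
    // intermediate (city, latency) key/value pair.
    public static class AvgMapper
            extends Mapper<LongWritable, Text, Text, DoubleWritable> {
        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            String[] fields = line.toString().split(",");
            if (fields.length == 2) {
                context.write(new Text(fields[0]),
                              new DoubleWritable(Double.parseDouble(fields[1])));
            }
        }
    }

    // Reduce phase ~ aggregate ("average"): all values sharing a key are
    // folded into a single output pair (city, average latency).
    public static class AvgReducer
            extends Reducer<Text, DoubleWritable, Text, DoubleWritable> {
        @Override
        protected void reduce(Text city, Iterable<DoubleWritable> latencies,
                              Context context)
                throws IOException, InterruptedException {
            double sum = 0;
            long count = 0;
            for (DoubleWritable latency : latencies) {
                sum += latency.get();
                count++;
            }
            context.write(city, new DoubleWritable(sum / count));
        }
    }
}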
MapReduce has many advantages. It hides the details of advanced functionality such as parallelization, fault tolerance, locality optimization, and load balancing from the user, while allowing coding in a variety of languages including C++, Java and Python. It is applicable to a wide variety of problems such as web mining, machine learning, decision support systems and data mining. MapReduce also allows scaling across thousands of machines. This last factor is the one we aim to analyze and verify through experimental analysis.
III. THE HADOOP CLUSTER
Hadoop is a distributed computing architecture that specializes in processing Big Data, usually semi-structured or unstructured, using MapReduce. It is an open-source project of the Apache Software Foundation and supports development in languages such as Java and Python. Hadoop is designed to run primarily as a distributed architecture and file system on a cluster of commodity computers, though pseudo-distributed (single-node) implementations are possible. Organizations like Yahoo
have enormous clusters with more than 4000 nodes that handle around 400 petabytes[2] of data.
Here, we have used the open-source Apache Hadoop distribution to perform an experimental analysis on a large, semi-structured data set using a small cluster comprising low-end commodity machines. We used this cluster to execute MapReduce programs, and from the results we attempt to draw conclusions about how the performance of MapReduce scales as additional nodes are added to the cluster.
The hardware used for the work presented in this paper comprised 4 low-configuration commodity workstations, each with a dual-core processor and 2 GB of memory. All machines ran the Ubuntu 12.04 LTS operating system and were connected by standard LAN cables. The necessary software included the latest distribution of Java (Java-7-Oracle), the latest stable release of Hadoop at the time (Hadoop-1.0.3), and SSH for communication between cluster nodes. All commands and options were run from the Terminal utility.
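As a point of reference, a minimal Hadoop 1.x multi-node configuration for such a cluster typically looks like the sketch below; the hostname "master", the worker names "node1" to "node3", and the ports are illustrative defaults rather than the exact values of our setup.

conf/core-site.xml (on every node) - location of the HDFS NameNode:
  <property>
    <name>fs.default.name</name>
    <value>hdfs://master:9000</value>
  </property>

conf/mapred-site.xml (on every node) - location of the JobTracker:
  <property>
    <name>mapred.job.tracker</name>
    <value>master:9001</value>
  </property>

conf/slaves (on the master) - one worker hostname per line; each listed node runs a DataNode and a TaskTracker:
  node1
  node2
  node3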
IV. RUNNING THE PIESTIMATOR EXAMPLE
The first code example that was executed and studied was the PiEstimator example. The code and Java classes for this example come bundled with the Hadoop distribution and make an excellent starting point for studying and analyzing the cluster, since the Mapper and Reducer are written to accept the size (complexity) of the job from the user. The code itself is based on the Quasi-Monte Carlo method[4]. The Mapper and Reducer classes perform their functions as follows:
Mapper: Generate points in a unit square and count the points falling inside/outside the circle inscribed in the square.
Reducer: Accumulate the inside/outside counts produced by the Mappers.
Let numTotal = numInside + numOutside. The ratio numInside/numTotal approximates (Area of Circle)/(Area of Square), where the area of the inscribed circle is pi/4 and the area of the unit square is 1. The value of pi is therefore estimated as 4*(numInside/numTotal).
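To make the estimate concrete, the following stand-alone Java sketch mirrors what the Mappers and the Reducer compute together; it uses pseudo-random points for simplicity, whereas the bundled example draws its points from a quasi-random (Halton) sequence.

import java.util.Random;

// Stand-alone illustration of the Monte Carlo estimate of pi.
// In the Hadoop example this work is split across map tasks; here one
// loop plays the role of all Mappers and the final division plays the
// role of the Reducer.
public class PiSketch {
    public static void main(String[] args) {
        long numSamples = 10_000_000L;  // total points (maps x samples per map)
        long numInside = 0;
        Random rnd = new Random();
        for (long i = 0; i < numSamples; i++) {
            // Point in the unit square [0,1) x [0,1).
            double x = rnd.nextDouble();
            double y = rnd.nextDouble();
            // Inside the inscribed circle of radius 0.5 centred at (0.5, 0.5)?
            if ((x - 0.5) * (x - 0.5) + (y - 0.5) * (y - 0.5) <= 0.25) {
                numInside++;
            }
        }
        // numInside/numTotal ~ (pi/4)/1, so pi ~ 4 * numInside / numTotal.
        System.out.println("Estimated pi = " + (4.0 * numInside / numSamples));
    }
}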
The example is executed by supplying 2 arguments: the number of maps, and the number of samples per map task. In this case, the example was run with 100 maps and 100 samples per map task.
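For instance, with the Hadoop-1.0.3 release used here, such a run is launched from the terminal roughly as follows (the name of the bundled examples jar can differ between releases):

$ bin/hadoop jar hadoop-examples-1.0.3.jar pi 100 100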
Comparative Study:
The example was executed 4 times, each time with an increased cluster strength, i.e. a single node, followed by 2, 3 and then 4 nodes. The first time, the cluster comprised one node running all the necessary daemons in a pseudo-distributed environment. Subsequent additions were made in the form of worker nodes (each running DataNode and TaskTracker daemons) to which the master could delegate work, causing a noticeable decrease in execution times. The data pertaining to completion time was provided by the example itself. Our findings are as follows:
Sr. No.   No. of nodes in cluster   Time taken by PiEstimator (seconds)
1         4                         98.042
2         3                         126.12
3         2                         173.005
4         1                         332.374
[Supporting screenshots of the job runs]
[Graph: PiEstimator execution time vs. number of nodes in the cluster]
Thus, it is clear from the data obtained with the bundled example that the performance of MapReduce improves markedly as nodes are added to the cluster: the single-node run took 332.374 seconds against 98.042 seconds on four nodes, a speedup of roughly 3.4, i.e. close to linear in the number of nodes. We then considered a different data set, wrote our own Mapper and Reducer classes, and attempted to replicate the results observed above.
V. UTILIZING A NEW DATASET: OOKLA NETINDEX
We chose the public “Netindex” data set made available by Ookla[5]. This data set consists of a number of large files in “.csv” format, which can be processed using Hadoop and MapReduce. Specifically, the file chosen was “city_isp_daily_quality.csv”. While not “Big Data” in the usual sense, this data was sufficient and well suited for observing the performance difference and scaling over the small, low-configuration cluster.
The file contains the vital statistics of the signal quality of Internet Service Providers in various cities and countries over different regions. The metrics provided by the data set are:
1. R-factor
2. Jitter
3. Packet loss
4. Latency
5. No. of tests executed
6. Distance between test site and server
[Data set screenshot: sample records from city_isp_daily_quality.csv]
These parameters were averaged and combined by a set of formulae into a single value for each ISP, called the “quality factor”.
This list of ISPs was then sorted according to the quality factor. The processing required for this was split into 3 MapReduce jobs, chained together manually (a sketch of how such chaining works is shown after this list):
1. Deriving the average of each of the fields
2. Performing the requisite calculations on each record
3. Calculating the average of the quality factor for each ISP
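The following driver sketch illustrates the manual chaining: each job's HDFS output directory becomes the input directory of the next job. The class and path names are hypothetical, and the identity Mapper/Reducer classes are placeholders for the actual job-specific classes written for the three stages above; this is an illustration of the chaining mechanism, not our implementation.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Manually chains MapReduce jobs: the output of one stage feeds the next.
public class QualityFactorDriver {

    private static boolean runStage(Configuration conf, String name,
            Class<? extends Mapper> mapper, Class<? extends Reducer> reducer,
            Class<?> outKey, Class<?> outValue,
            Path in, Path out) throws Exception {
        Job job = new Job(conf, name);
        job.setJarByClass(QualityFactorDriver.class);
        job.setMapperClass(mapper);
        job.setReducerClass(reducer);
        job.setOutputKeyClass(outKey);
        job.setOutputValueClass(outValue);
        FileInputFormat.addInputPath(job, in);
        FileOutputFormat.setOutputPath(job, out);
        return job.waitForCompletion(true);   // block until the stage finishes
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path rawCsv   = new Path(args[0]);  // the Netindex CSV on HDFS
        Path averages = new Path(args[1]);  // output of stage 1
        Path scored   = new Path(args[2]);  // output of stage 2
        Path ranked   = new Path(args[3]);  // final per-ISP quality factors

        // Identity Mapper/Reducer (and matching key/value types) are
        // placeholders; the real classes for field averaging, per-record
        // scoring and per-ISP averaging would be supplied here.
        if (!runStage(conf, "field-averages", Mapper.class, Reducer.class,
                LongWritable.class, Text.class, rawCsv, averages)) System.exit(1);
        if (!runStage(conf, "per-record-quality", Mapper.class, Reducer.class,
                LongWritable.class, Text.class, averages, scored)) System.exit(1);
        if (!runStage(conf, "per-isp-quality", Mapper.class, Reducer.class,
                LongWritable.class, Text.class, scored, ranked)) System.exit(1);
    }
}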
Findings:
This MapReduce program was run 4 times, each time with increased cluster strength, by the same method as for the aforementioned PiEstimator example. It was first run in a pseudo-distributed system, i.e. with one node running all the Hadoop daemons. Nodes were then added to the working cluster to determine the increase in performance.
The total time taken for the execution of all 3 MapReduce jobs was computed by adding the values reported by the default Hadoop JobTracker web interface. The daemons and the ports used were as follows:
1. http://localhost:50030 – JobTracker
2. http://localhost:50070 – NameNode
3. http://localhost:50060 – TaskTracker
The results were as follows:
Sr. No.   No. of nodes in cluster   Time taken by job (seconds)
1         4                         192
2         3                         234
3         2                         337
4         1                         496
[Graph: job execution time vs. number of nodes in the cluster]
It is observed from these findings that the performance of the cluster for this job scales by a factor of roughly 1.2 to 1.5 with the addition of each node. These results are obtained from the analysis of the timings of the experiment. The difference between the scaling factor of the PiEstimator example and this one can be attributed to differently written code and different data sets. The general pattern of performance increasing with the addition of nodes remains the same.
Future Scope:
Future research on this topic could include expanding the cluster to 8 or 12 nodes to verify the scalability and to observe whether the ratio of the performance boost remains the same as further nodes are added. We would also like to mention that these jobs were run on a minimalist setup, with almost no focus on optimizing the cluster and MapReduce configuration parameters. If the same examples were run on an optimized cluster, the results, though following the same general trend, would show a marked increase in performance.
We also intend to work further on the Netindex data set from Ookla, as well as various other available data sets to observe the performance differences brought in by increasing the data size and complexity.
VI. CONCLUSION
We can thus conclude that the Hadoop architecture and the MapReduce programming paradigm are highly scalable, even for small data sets and on small-scale clusters of low-end commodity hardware. A secondary inference is that Hadoop works best on large-scale distributed computing clusters with numerous nodes, which allow for extremely fast computations and optimum use of the strengths of the MapReduce algorithm. The true power of such technology, when harnessed with the proper hardware, against an appropriately large data set and with optimized code, would be considerable. It is thus also clear why Hadoop, despite its relative novelty, is such a promising technology in the field of data mining and distributed computing.
REFERENCES
[1] http://gigaom.com/2012/08/22/facebook-is-collecting-your-data-500-terabytes-a-day/
[2] http://www.informationweek.com/development/database/yahoo-andhadoop-in-it-for-the-long-term/240002133
[3] J. Dean and S. Ghemawat, "MapReduce: Simplified Data Processing on Large Clusters," Proc. OSDI, 2004.
[4] http://code.google.com/p/haloop/source/browse/trunk/src/examples/org/apache/hadoop/examples/PiEstimator.java?r=387
[5] http://www.netindex.com/source-data/