International Journal of Electrical, Electronics and Computer Systems (IJEECS)
________________________________________________________________________________________________
________________________________________________________________________________________________
ISSN (Online): 2347-2820, Volume -4, Issue-8, 2016 1
A Big Data Efficient Accessing Approach with MapReduce
1Nilofar Begum, 2A Ananda Shankar
1,2School of Computing and Information Technology REVA University, Bengaluru
Abstract: Virtual shuffling models for data processing and analysis have gained importance in research. The system proposed in this paper addresses virtual shuffling and time-complexity analysis in terms of MapReduce properties. The system is demonstrated in a multi-keyword search environment that uses a live Twitter handler for data acquisition. The trial was conducted successfully for more than 250 independent and logically dependent keywords, and the results are reported below.
I. INTRODUCTION
Big data technologies are important for providing more accurate analysis, which can lead to more concrete decision making and, in turn, greater operational efficiency, cost reduction, and reduced risk for the business. Harnessing the power of big data requires infrastructure that can manage and process huge volumes of structured and unstructured data in real time while protecting data privacy and security. Several vendors, including Amazon, IBM, and Microsoft, offer technologies for handling big data.
Hadoop is Apache open-source software, written in Java, that allows distributed processing of large datasets across clusters of computers using simple programming models. Applications built on the Hadoop framework run in an environment that provides distributed storage and computation across those clusters. Hadoop is designed to scale from a single server to thousands of machines, each offering local computation and storage.
MapReduce is a parallel programming model for writing distributed applications, devised at Google for efficient processing of large amounts of data (multi-terabyte datasets) on large clusters of commodity hardware in a reliable, fault-tolerant manner. MapReduce programs run on Hadoop, an Apache open-source framework.
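To make the model concrete, the following is a minimal sketch of the MapReduce pattern in plain Python. The function names and the in-memory driver are illustrative assumptions; a real Hadoop job would distribute the map, shuffle, and reduce phases across a cluster rather than run them in one process.

```python
from collections import defaultdict

def map_phase(record):
    """Map: emit a (word, 1) pair for every word in one input record."""
    for word in record.split():
        yield (word.lower(), 1)

def reduce_phase(key, values):
    """Reduce: sum all the counts emitted for one key."""
    return (key, sum(values))

def run_mapreduce(records):
    """Tiny in-memory driver: map, shuffle (group by key), then reduce."""
    groups = defaultdict(list)
    for record in records:
        for key, value in map_phase(record):
            groups[key].append(value)  # the shuffle/sort stage
    return dict(reduce_phase(k, v) for k, v in groups.items())

counts = run_mapreduce(["big data big clusters", "data clusters"])
# e.g. counts["big"] == 2
```

The same mapper and reducer, unchanged, would scale to multi-terabyte inputs under Hadoop, which is the point of the programming model.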
The Hadoop Distributed File System (HDFS) is based on the Google File System (GFS) and provides a distributed file system designed to run on commodity hardware.
It has many similarities with existing distributed file systems; however, the differences from other distributed file systems are significant. HDFS is highly fault tolerant and is designed to be deployed on low-cost hardware. It provides high-throughput access to application data and is suitable for applications with large datasets. Apart from these two core components, the Hadoop framework also includes two further modules, Hadoop Common and Hadoop YARN.
The growing demand for data requests and downloads saw a 75% increase in dependency from 2010 to 2015, according to the American Statistical Society. This challenge grows with social data accumulation and heap formation for large, similar datasets, which motivates the system to adopt a virtual concept of memory segmentation and alignment.
Early data-request models acquired data from the internet through physical download, which reduces overall system performance and increases both the memory consumed and the time taken. To address this growing challenge, we propose a system for data virtualization and cloud-based dependency creation under a Hadoop cluster environment.
II. SYSTEM DESCRIPTION
The proposed system is designed and developed in a Hadoop clustering environment with the following architecture. It consists of two major modules: a Hadoop cluster in a single-node environment and a social-networking cloud environment.
The system starts by collecting the input keywords to search.
The keywords are aligned according to the type of the input arguments; the system selects the extraction pattern and packs the data for transfer over an interfacing and communication port. On receiving the input keywords, the social server (here, Twitter) unpacks and retrieves the keywords in the cloud interfacing environment. In the remote cloud/server environment, the search is processed and the overall search output is optimized and masked. The server packs the indexing values for each matched tweet, together with a threshold value for positive and negative tweet analysis.
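As a rough sketch of this pipeline (all names here are illustrative assumptions; the paper does not specify an API, and a real deployment would query the Twitter service over the network rather than a local list):

```python
def pack_keywords(keywords):
    """Client side: pack cleaned keywords into a request payload."""
    return {"keywords": [k.strip().lower() for k in keywords]}

def search_server(payload, tweets):
    """Server side: unpack keywords and index every tweet that matches one."""
    hits = []
    for index, tweet in enumerate(tweets):
        if any(k in tweet.lower() for k in payload["keywords"]):
            hits.append({"index": index, "tweet": tweet})
    return hits

tweets = ["Sachin scores again", "TV9 evening bulletin", "unrelated text"]
hits = search_server(pack_keywords(["sachin", "tv9"]), tweets)
# hits holds the indexed matches for the two matching tweets
```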
Fig 1: System Architecture Diagram.
III. MATHEMATICAL MODELING AND PROOF OF CONCEPT
The developed system is described mathematically as follows.
1. Dataset/Keyword Collection and Analysis
Data collection and keyword generation is the process of acquiring the data sample; the related datasets are collected in the set S. The minimum number of keywords is 1 and the maximum is 5, and each keyword is related to its equivalent datasets fetched from the input.
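A minimal sketch of this keyword constraint (the function name is an assumption, not part of the paper):

```python
def collect_keywords(raw):
    """Keep between 1 and 5 cleaned keywords, as the model requires."""
    keywords = [k.strip().lower() for k in raw if k.strip()]
    if not 1 <= len(keywords) <= 5:
        raise ValueError("the model allows between 1 and 5 keywords")
    return keywords

S = collect_keywords(["Sachin", "TV9 "])
# S == ["sachin", "tv9"]
```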
2. Connection Establishment
S = {s1, s2, s3, …, sn} = ⋃_{j=1}^{n} s_j
Packet = pack( ⋃_{j=1}^{n} s_j )
The parameters for connection establishment include the overall segmentation of the keywords and their analysis; this step establishes the connection to the Twitter data strings over an active internet domain.
3. Twitter Data Handler
analyse( ⋃_{j=1}^{n} s_j )
Fetch tweets T = {t1, t2, t3, …, tn}, where each t_i ∈ String(s_j)
D = ⋃_{j=1}^{n} { t_i : t_i ∈ String(s_j) } ⊆ T
The Twitter handler is initialized and acquires the datasets by virtually shuffling the parameters. The fetched data is virtually aligned; the overall parameters are reshuffled and forwarded with indexing parameters.
4. Twitter Indexed File Offline Processing
Dv = ( ⋃_{j=1}^{n} { t1, …, tn } ⊆ S )_I
where I is the index set, I = {i1, i2, i3, …, in}, with one index i_k ∈ I for each fetched tweet.
The selected datasets are aligned; hence, the overall system is virtually aligned and segregated. This is achieved through indexing, and the index is fetched via internet services in the clustered environment.
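A small sketch of the offline indexing step. The structure of the index is an assumption; the paper only states that each fetched tweet receives an index value:

```python
def build_index(tweets, keywords):
    """Map each keyword to the indices of the tweets that contain it."""
    index = {k: [] for k in keywords}
    for i, tweet in enumerate(tweets):
        for k in keywords:
            if k in tweet.lower():
                index[k].append(i)
    return index

idx = build_index(["Sachin century", "TV9 news", "Sachin on TV9"],
                  ["sachin", "tv9"])
# idx["sachin"] == [0, 2] and idx["tv9"] == [1, 2]
```

The resulting index can be processed offline, which matches the paper's point that tweets are referenced rather than physically downloaded.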
5. Decision Making
A = ∫ ( ⋃_{j=1}^{n} s_j − t_i · i_i ) d(keywords)
If d(keyword)/dt falls in the positive region, the tweet is classified as positive; otherwise, it is classified as negative.
The analysis and decision making are performed offline in the Hadoop environment; the basic MapReduce function is aligned and delivers the overall system benefits. Each decision-making keyword is analyzed and its positivity is computed.
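The paper does not give the scoring function, so the following is only an assumed sketch of threshold-based decision making, using a simple keyword-count score in place of the unspecified derivative term:

```python
def classify(tweet, keywords, threshold=1):
    """Score a tweet by keyword hits; at or above threshold counts as positive."""
    score = sum(tweet.lower().count(k) for k in keywords)
    return "positive" if score >= threshold else "negative"

labels = [classify(t, ["sachin"]) for t in ["Sachin wins", "weather today"]]
# labels == ["positive", "negative"]
```

In the actual system this step would run as a MapReduce job over the indexed tweets, with the threshold supplied by the server as described in Section II.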
IV. RESULTS AND OBSERVATIONS
The system is formulated in an active internet state; the system requirements are frozen, and the design proceeds through modelling and simulation. The following outcomes are generated from the simulation results.
The performance ratio is analyzed, and the results are as follows.
Fig 2: Single Module Analysis
The result analysis is performed over the overall segment, so the system behavior can be observed and discussed. The results shown in Fig 2 plot, for each keyword, the time consumed for the data response; the attributes include time and the negative and positive counts. The time consumption in the graph shows that the overall system behavior is moderate: the time to retrieve the maximum number of tweets is comparatively low relative to the unique keywords under search.
Time Complexity Analysis
The graph below demonstrates the overall time required to analyze the tweets. The analysis accounts for the total delay encountered in retrieving the tweets and segregating them into positive and negative tweets. The maximum delay for shuffling and result processing is 25 s for 500 tweets, of which 483 are analyzed as positive and the rest as negative. Compared with a typical system, the time complexity is drastically reduced, to a maximum of 38 s in our trials with a 250-tweet sample search.
Fig 3: Time complexity analysis per keyword (sachin, serena, larry, anil, tv9, prakash), showing tweet count, positive (p) and negative (n) counts, and time.
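These figures can be checked with a small throughput calculation; the rates below are simply derived from the 500-tweets-in-25-s and 250-tweets-in-38-s numbers reported above:

```python
def throughput(tweets, seconds):
    """Tweets processed per second."""
    return tweets / seconds

rate_fast = throughput(500, 25)   # 20.0 tweets/s at the 25 s maximum delay
rate_slow = throughput(250, 38)   # about 6.6 tweets/s for the 38 s trial
positive_share = 483 / 500        # 96.6% of the 500 tweets classed positive
```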
Detailed Time vs. Tweet Graph
The graph below projects the ratio of analysis time to tweet complexity for our proposed system.
The time ratio remains constant over a range of Twitter data, so we can also analyze the system complexity in the following scenarios:
1. Least Number of Tweets:
The least number of tweets retrieved is five (5), with a time delay of 2 s; this is the worst-case scenario.
The time complexity is improved because the system avoids waiting time for tweet searches that are untended and shared.
2. Maximum Number of Tweets:
Under the same simulation instance, the system produced a maximum tweet ratio of 594:15, i.e., it successfully retrieved 594 tweets within a time constraint of 15 s.
V. CONCLUSION
The proposed system is designed to extract data from an online big-data repository for analysis and to retrieve it virtually, i.e., the system retrieves and processes values online instead of downloading them to the local system. The time consumed in retrieving the tweets is drastically reduced, so the performance is enhanced.
The system is analyzed and verified on 200 samples in single, double, and multiple modes; the system performance is effective, with a faster analysis response time. The proposed system is programmed with the MapReduce load-balancing technique, so its efficiency ratio is improved.