
Peer Reviewed and Refereed Journal (International Journal), ISSN-2456-1037

Vol. 04, Special Issue 05 (ICIR-2019), September 2019, Available Online: www.ajeee.co.in/index.php/AJEEE


ANALYSIS OF STUDENTS' PERFORMANCE USING MODIFIED K-MEANS ALGORITHM THROUGH MLT

Dr. Rupesh Shukla, Dr. Arpana Bharani

Shri Cloth Market Kanya Vanijya Mahavidhayalaya, Indore

Abstract- Big data technologies are important in providing more accurate analysis, which may lead to more concrete decision-making, resulting in greater operational efficiency, cost reduction, and reduced risk for the business. Given the rapid development of cloud computing, it is essential to investigate the performance of different Hadoop MapReduce applications and to identify the performance bottlenecks in a cloud cluster that contribute to higher or lower performance. It is equally important to study the underlying hardware in cloud cluster servers to permit the optimization of software and hardware and achieve the highest performance feasible. Hadoop is founded on MapReduce, which is among the most popular programming models for big data analysis in a parallel computing environment. In this study, we present a detailed performance analysis, characterization, and evaluation of the Hadoop MapReduce WordCount application.

Keywords: Performance analysis, cloud computing, Hadoop Word-Count, Apriori, Map-Reduce.

1 INTRODUCTION

Cloud computing is based on five attributes: multi-tenancy (shared resources), massive scalability, elasticity, pay-as-you-go pricing, and self-provisioning of resources. Advances in processors, virtualization technology, disk storage, broadband Internet connections, and fast, inexpensive servers have combined to make the cloud a more compelling solution.

Cloud computing refers to a new Internet-based model that improves the usage and delivery of services, typically relying on the World Wide Web to provide dynamic, scalable, and often virtualized resources [1]. Fig. 1 indicates the structure of cloud computing. In the cloud, a large computing task is automatically divided into many smaller subroutines that are processed by multiple server systems; after large-scale search, computation, and analysis, the results are returned to the user. With this technology, a remote provider can, within seconds, reach tens of thousands of machines and process enormous volumes of data, delivering system services with the effectiveness of a "supercomputer" [2]. This is likely to have a major impact on large-scale storage in the future. Nowadays, the term "cloud computing" is a vital one in the world of IT.

2 HADOOP

Hadoop is an open-source framework from Apache used to store, process, and analyze data that are very huge in volume. Hadoop runs applications using the MapReduce algorithm, in which the data are processed in parallel. In short, Hadoop is used to develop applications that can perform complete statistical analysis on huge amounts of data.

Hadoop Architecture

At its core, Hadoop has two major layers, namely:

Processing/Computation layer (MapReduce),

Storage layer (Hadoop Distributed File System)


Figure 1: Hadoop Architecture (MapReduce for distributed computing, HDFS for distributed storage, the YARN framework, and common utilities)

3 MAP REDUCE

To take advantage of Hadoop's parallel processing, the query must be expressed in MapReduce form. MapReduce is a paradigm with two phases: the mapper phase and the reducer phase. In the mapper, the input is given in the form of key-value pairs. The output of the mapper is fed to the reducer as input, and the reducer runs only after the mapper has finished.

The reducer likewise takes input in key-value format, and the output of the reducer is the final output.

3.1 Steps in MapReduce

• Map takes data in the form of pairs and returns a list of <key, value> pairs. The keys are not necessarily unique at this stage.

• Using the output of Map, sort and shuffle are applied by the Hadoop framework. Sort and shuffle act on the list of <key, value> pairs and send out each unique key together with the list of values associated with it, as <key, list(values)>.

• The output of sort and shuffle is sent to the reducer phase. The reducer performs a defined function on the list of values for each unique key, and the final <key, value> output is stored or displayed, as sketched below.
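To make these three steps concrete, the following is a minimal in-memory simulation of map, sort-and-shuffle, and reduce in plain Java. This is an illustrative sketch only, not the Hadoop API; the sample sentence reuses the word-count example given later in this paper.

import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

public class MiniMapReduce {
    public static void main(String[] args) {
        List<String> input = List.of("tring tring the phone rings");

        // Map: emit a <word, 1> pair for every token (keys are not yet unique).
        List<Map.Entry<String, Integer>> mapped = input.stream()
                .flatMap(line -> Arrays.stream(line.split("\\s+")))
                .map(word -> Map.entry(word, 1))
                .collect(Collectors.toList());

        // Sort and shuffle: group the values of each unique key -> <key, list(values)>.
        Map<String, List<Integer>> shuffled = mapped.stream()
                .collect(Collectors.groupingBy(Map.Entry::getKey, TreeMap::new,
                        Collectors.mapping(Map.Entry::getValue, Collectors.toList())));

        // Reduce: fold each value list into one final <key, value> pair.
        shuffled.forEach((word, counts) -> System.out.println(
                word + "\t" + counts.stream().mapToInt(Integer::intValue).sum()));
    }
}

Running this prints each word with its count (for example, tring appears with 2), mirroring the <key, list(values)> grouping that the Hadoop framework performs between the map and reduce phases.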



Figure 2: MapReduce flow diagram (Input → Map → Shuffle → Reduce → Output)

Figure 3: MapReduce steps (job execution on YARN: the MapReduce program's JobClient gets a new application and submits it to the ResourceManager after copying the job resources to a shared filesystem such as HDFS; a NodeManager starts a container and launches the MRAppMaster, which initializes the job, retrieves the input splits and job resources, requests resources from the ResourceManager, and launches YarnChild JVMs on NodeManager nodes to run the map and reduce tasks)

3.2 Problem Domain

Cloud computing is the fastest-growing area, both horizontally and vertically, in terms of its users and the number of services it offers. These services and the users' data are always stored at some remote location, where the user loses control over them. Gaining control and making the data secure affects the performance and other applications of cloud providers. Thus, some mechanism needs to be defined that works as a balance between provider and client. The mechanism should be capable of providing the required security to the user by applying encryption schemes and generating the key at the user's end, which always keeps control of the data with the user.

In cloud computing there are problems associated with the whole life cycle of cloud data. For storage, the three important aspects of data are confidentiality, integrity, and availability. Data encryption is used for confidentiality: the data are encrypted before being sent to storage, and only after the user supplies the key are the data opened. Providing user-based security control for the cloud provider is thus a primary objective of this work, and it can be achieved with homomorphic encryption. Key management is another problem, because the user is not an expert at managing keys. The major challenges associated with big data are as follows:

Capturing data

Storage

Searching

Sharing

Transfer

Analysis

Presentation

To address the above challenges, organizations normally rely on enterprise servers.

3.3 Aim & Objectives

The main aim of this paper is to introduce the implementation of Hadoop MapReduce programming by giving readers hands-on experience in developing a Hadoop-based Word-Count application. The Hadoop MapReduce Word-Count example is a standard example. This paper emphasizes how to implement the Word-Count example code in MapReduce to count the number of occurrences of a given word in the input file.

4 LITERATURE SURVEY

Samneet Singh and Yan Liu [3] present a cloud service architecture that employs a search cluster for data indexing and query. They develop REST APIs through which the data can be accessed by different analysis modules. This architecture allows extensions that integrate with application frameworks for both batch processing (such as Hadoop) and stream processing (such as Spark) of big data. The analysis results are structured as Semantic MediaWiki pages in the context of the monitored data source and the analysis system. This cloud architecture is empirically assessed to evaluate its responsiveness when processing a large set of data files under node failures.

Joseph A. Issa [4] offered a detailed performance analysis and evaluation of the Hadoop WordCount workload using different processors, such as Intel's Atom D525, Xeon X5690, and AMD's Bobcat E350. The work suggests that Hadoop WordCount is a compute-bound workload in both the map phase and the reduce phase. The results show that enabling Hyper-Threading and growing the number of sockets have a high impact on Hadoop WordCount performance, while memory speed and capacity do not significantly affect performance.

Yaxiong Zhao, Jie Wu, and Cong Liu [5] recommend a data-aware cache framework for big-data applications, called Dache. In Dache, tasks submit their intermediate results to the cache manager, and a task queries the cache manager before executing its actual computing work. A novel cache description scheme and a cache request-and-reply protocol are designed, and Dache is implemented by extending Hadoop. Test-bed experiment results show that Dache significantly improves the completion time of MapReduce jobs.

4.1 Proposed Methodology

In this paper, a methodology is introduced for word count in Hadoop. The MapReduce algorithm uses the following three main steps:

Map Function

Shuffle Function

Reduce Function

4.2 Map Function

The Map Function is the first step in the MapReduce algorithm. It takes input tasks and divides them into smaller sub-tasks, then performs the required computation on each sub-task in parallel. This step performs the following two sub-steps:

Splitting takes the input dataset from the source and divides it into smaller sub-datasets.

Mapping takes those smaller sub-datasets and performs the required action or computation on each sub-dataset. The output of this Map Function is a set of key-value pairs of the form <Key, Value>, as sketched below.
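As a concrete sketch of the Mapping sub-step for word count, the mapper below follows the canonical Hadoop WordCount example built on the org.apache.hadoop.mapreduce API (the class name TokenizerMapper mirrors that canonical example; it is not code published with this paper):

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Emits a <word, 1> pair for every token found in its input split.
public class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer itr = new StringTokenizer(value.toString());
        while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, ONE); // key = the word itself, value = 1
        }
    }
}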

Shuffle Function

It is the second step in the MapReduce algorithm. The Shuffle Function is also known as the "Combine Function"; in Hadoop itself, sorting and shuffling are performed automatically by the framework between the map and reduce phases. It performs the following two sub-steps:

Merging takes the list of outputs coming from the Map Function and merges the key-value pairs that share the same key.

Sorting takes input from the Merging step and sorts all key-value pairs by their keys, returning <Key, List<Value>> output with sorted key-value pairs.

Finally, the Shuffle Function returns a list of <Key, List<Value>> sorted pairs to the next step.

Reduce Function

It is the final step in the MapReduce algorithm. It performs only one step, the Reduce step: it takes the list of <Key, List<Value>> sorted pairs from the Shuffle Function and performs a reduce operation on each, as sketched below.
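A matching sketch of the Reduce step, again patterned on the canonical Hadoop WordCount example (IntSumReducer is the conventional class name, not code from this paper):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Receives <word, list(counts)> after sort-and-shuffle and sums the counts.
public class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get(); // add up every 1 emitted for this word
        }
        result.set(sum);
        context.write(key, result); // final <word, total count> pair
    }
}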

Proposed Architecture

Word count is the typical example with which Hadoop MapReduce developers start their hands-on practice.

This sample MapReduce job is intended to count the number of occurrences of each word in the provided input files.


The word count operation takes place in two stages: a mapper phase and a reducer phase. In the mapper phase, the text is first tokenized into words; then we form a key-value pair from each word, with the key being the word itself and the value '1'. For example, consider the sentence

“tring tring the phone rings”

In the map phase the sentence would be split into words, forming the initial key-value pairs:

<tring,1>

<tring,1>

<the,1>

<phone,1>

<rings,1>

After sort and shuffle, the pairs are grouped by key as <tring, (1, 1)>, <the, (1)>, <phone, (1)>, <rings, (1)>. In the reduce phase the values for each key are added.

Here there is only one key with multiple pairs, 'tring'; the values for this key are added, so the output key-value pairs would be

<tring,2>

<the,1>

<phone,1>

<rings,1>

This gives the number of occurrences of each word in the input; thus, reduce forms an aggregation phase over keys. The point to be noted here is that the mapper class first executes completely on the entire data set, splitting the words and forming the initial key-value pairs; only after this entire process is completed does the reducer start. Say we have a total of 10 lines in our input files combined: first the 10 lines are tokenized and key-value pairs are formed in parallel, and only after this does the aggregation/reducer start its operation.
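Wiring the mapper and reducer sketched in Section 4.2 into a runnable job could look like the following driver (again patterned on the canonical Hadoop WordCount example; the class names and command-line paths are illustrative assumptions):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        // Optional: because summing is associative and commutative, the reducer
        // can also serve as a combiner that pre-aggregates map output locally.
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // e.g. /input
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // e.g. /output
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Packaged into a jar, the job would typically be submitted with a command such as hadoop jar wordcount.jar WordCount /input /output; as described above, the reducers begin aggregating only after the map phase has completed over all input splits.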

5 CONCLUSIONS

MapReduce has become an important platform for a variety of data processing applications.

Word count mechanisms in MapReduce frameworks such as Hadoop suffer from performance degradation in the presence of faults. The word count MapReduce approach discussed in this paper provides an online, on-demand, and closed-loop solution for managing these faults.

The control loop in word count mitigates performance penalties through early detection of anomalous conditions on slave nodes.

Anomaly detection is performed through a novel sparse-coding based method that achieves high true-positive and true-negative rates and can be trained using only normal-class (anomaly-free) data. The local, decentralized nature of the sparse-coding models ensures minimal computational overhead and enables usage in both homogeneous and heterogeneous MapReduce environments.

RELATED WORKS

1. Joseph A. Issa, "Performance Evaluation and Estimation Model Using Regression Method for Hadoop WordCount", received November 19, 2015; accepted December 12, 2015; published December 18, 2015.

2. Yaxiong Zhao, Jie Wu, and Cong Liu, "Dache: A Data Aware Caching for Big-Data Applications Using the MapReduce Framework", ISSN 1007-0214, Vol. 19, No. 1, pp. 39-50, February 2014.

3. Samneet Singh and Yan Liu, "A Cloud Service Architecture for Analyzing Big Monitoring Data", ISSN 1007-0214, Vol. 21, No. 1, pp. 55-70, February 2016.

4. Joseph A. Issa, "Performance Evaluation and Estimation Model Using Regression Method for Hadoop WordCount", December 18, 2015.

5. Yaxiong Zhao, Jie Wu, and Cong Liu, "Dache: A Data Aware Caching for Big-Data Applications Using the MapReduce Framework", ISSN 1007-0214, Vol. 19, No. 1, pp. 39-50, February 2014.
