Vol.04,Special Issue 07, (RAISMR-2019) November 2019, Available Online: www.ajeee.co.in/index.php/AJEEE

A COMPREHENSIVE STUDY ON BIG DATA ANALYSIS

Harsha Chauhan, Naziya Hussain, Purnima Chourasiya
School of Computers, IPS Academy, Indore, MP, India

Abstract:- The amount of data produced and communicated over the Internet is increasing significantly, creating challenges for organizations that would like to reap the benefits of analyzing this massive influx of big data. This is because big data can provide unique insights into, inter alia, market trends, customer buying patterns, and maintenance cycles, as well as into ways of lowering costs and enabling more targeted business decisions. Realizing the importance of big data business analytics (BDBA), we review and classify the literature on the application of BDBA to logistics and supply chain management (LSCM), which we define as supply chain analytics (SCA), based on the nature of the analytics (descriptive, predictive, prescriptive) and the focus of the LSCM (strategy and operations).

Keywords:- BDBA - Big Data Business Analytics; LSCM - Logistics and Supply Chain Management; SCA - Supply Chain Analytics; IDC - International Data Corporation; SSD - Solid State Drive; PCM - Phase Change Memory; HDFS - Hadoop Distributed File System; ETL - Extraction, Transformation and Loading.

1. INTRODUCTION

The amount of data produced and communicated over the Internet is increasing significantly, creating challenges for organizations that would like to reap the benefits of analyzing this massive influx of big data. This is because big data can provide unique insights into, inter alia, market trends, customer buying patterns, and maintenance cycles, as well as into ways of lowering costs and enabling more targeted business decisions.

Realizing the importance of big data business analytics (BDBA), we review and classify the literature on the application of BDBA to logistics and supply chain management (LSCM), which we define as supply chain analytics (SCA), based on the nature of the analytics (descriptive, predictive, prescriptive) and the focus of the LSCM (strategy and operations). Big data refers to datasets that cannot be recognized, obtained, managed, analyzed, or processed by present tools.

Different definitions of big data have been given by its different users and analysts, such as research scholars, data analysts, and technical practitioners. According to Apache Hadoop, "big data is a dataset which could not be captured, managed, and processed by general computers within an acceptable scope". Over the past 20 years, data has increased on a large scale in various fields. Big data was in fact first defined in 2001, when Doug Laney formulated the 3Vs model: Volume, Variety, and Velocity.

Although the 3Vs model was not originally proposed to define big data, Gartner and many other organizations, such as IBM and Microsoft, still use it for that purpose. In the 3Vs model, Volume means the dataset is so large that it is very difficult to analyze; Velocity means the data is collected and gathered rapidly so that it can be utilized to the maximum; Variety refers to the different types of data, i.e., structured, semi-structured, and unstructured data such as audio, video, web pages, and text. IDC (International Data Corporation), one of the dominant leaders in big data research, takes a different view of big data.

According to a 2011 IDC report, "big data technologies describe a new generation of technologies and architectures, designed to economically extract value from very large volumes of a wide variety of data, by enabling high-velocity capture, discovery, and/or analysis". The same report states that in 2011 the overall created and copied data volume in the world was 1.8 ZB (≈ 10²¹ bytes), which had increased by nearly nine times within five years [1].

This figure is expected to double at least every two years in the near future. Today, big data draws a lot of attention in the IT world. Big data involves several key technologies, such as Hadoop, MapReduce, Hive, Pig, MongoDB, HBase, and Cassandra, that work together to extract value from data that would previously have been considered dead.
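The growth rate quoted above can be made concrete with a little arithmetic. The sketch below is illustrative only: it assumes the IDC baseline of 1.8 ZB in 2011 and a clean doubling every two years, and the function name `projected_volume_zb` is our own.

```python
# Illustrative projection: assumes 1.8 ZB in 2011 and doubling every two years.
def projected_volume_zb(year, base_year=2011, base_zb=1.8):
    """Return the projected global data volume in zettabytes for a given year."""
    return base_zb * 2 ** ((year - base_year) / 2)

for year in (2011, 2015, 2020):
    print(f"{year}: {projected_volume_zb(year):.1f} ZB")
```

Under these assumptions, the 2015 figure is 7.2 ZB, four times the 2011 baseline.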


Figure 1: Big data analytics

2. BIG DATA ANALYSIS

Big data analysis mainly involves the analytical methods of big data, the systematic architecture of big data, and big data mining and analysis software. Data investigation is the most important step in big data, for exploring meaningful values and supporting suggestions and decisions. Possible values can be explored by data analysis. However, data analysis is a wide area, which is dynamic and very complex.

Big data analysis is of four types:

• Prescriptive analysis

• Predictive analysis

• Diagnostic analysis

• Descriptive analysis
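The difference between the first and last of these types can be sketched in a few lines. This is a toy illustration with made-up monthly sales figures, not a real analytics workload: descriptive analysis summarizes what happened, while a naive predictive analysis extrapolates one period ahead.

```python
# Illustrative only: the sales figures below are made-up sample data.
sales = [120, 135, 150, 162, 178]  # hypothetical monthly sales

# Descriptive: what happened?
mean_sales = sum(sales) / len(sales)

# Predictive: what is likely next? (extrapolate the average month-over-month growth)
growth = [b - a for a, b in zip(sales, sales[1:])]
forecast = sales[-1] + sum(growth) / len(growth)

print(f"average sales: {mean_sales:.1f}")
print(f"forecast for next month: {forecast:.1f}")
```

Prescriptive analysis would go one step further and recommend an action (e.g., stock levels) based on such a forecast.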

Figure 2: Types of big data analysis

3. BIG DATA – CHARACTERISTICS


3.1 Complexity

Complexity measures the degree of interconnectedness (potentially vast) and interdependence in big data structures, such that a small change (or a combination of small changes) in one or a few elements can yield large changes, or a small change can ripple across or cascade through the system and substantially affect its behavior, or produce no change at all.

3.2 Data Value

Data value measures the usefulness of data in decision making. Data science is exploratory and valuable for becoming acquainted with the data, while "analytic science" encompasses the predictive power of big data. Users can run specific queries against the stored data, deduce important outcomes from the filtered data obtained, and rank the results according to the metrics they require. These reports help people discover business trends, in light of which they can change their strategies.

3.3 Data Velocity

Velocity in big data refers to the speed of data arriving from different sources. This characteristic is not limited to the speed of incoming data but also covers the speed at which the data flows and is aggregated.

3.4 Data Volume

The "big" in big data itself characterizes the volume. At present the existing data is on the order of petabytes (10¹⁵ bytes) and is expected to increase to zettabytes (10²¹ bytes) in the near future. Data volume measures the amount of data available to an organization, which does not necessarily have to own all of it as long as it can access it.

3.5 Data Variety

Data variety is a measure of the richness of the data representation: text, images, video, audio, and so on. The data being produced is not of a single class, as it includes not only traditional data but also semi-structured data from various resources such as web pages, web log files, social media sites, email, and documents.

4. CHALLENGES IN BIG DATA ANALYTICS

In recent years big data has accumulated in several domains such as health care, public administration, retail, biochemistry, and other interdisciplinary scientific research. Web-based applications encounter big data frequently, for example in social computing, internet text and documents, and internet search indexing. Social computing includes social network analysis, online communities, recommender systems, reputation systems, and prediction markets, whereas internet search indexing includes ISI, IEEE Xplore, Scopus, Thomson Reuters, etc.

Considering these advantages, big data provides new opportunities in knowledge-processing tasks for upcoming researchers. However, opportunities always bring challenges; to handle them we need to understand the various computational complexities, information security issues, and computational methods involved in analyzing big data. For example, many statistical methods that perform well for small data sizes do not scale to voluminous data.

• Data storage and analysis

• Knowledge discovery and computational complexities

• Theoretical challenges facing big data

• Information security

• Scalability and visualization of data

• Diagnostic challenges

4.1 Data Storage and Analysis

In recent years the size of data has grown exponentially by various means such as mobile devices, aerial sensory technologies, remote sensing, and radio frequency identification (RFID) readers. These data are stored at great cost, yet they are eventually ignored or deleted because there is not enough space to store them. Therefore, the first challenge for big data analysis is storage media and higher input/output speed. In such cases, data accessibility must be the top priority for knowledge discovery and representation.

The prime reason is that the data must be accessible easily and promptly for further analysis. In past decades, analysts used hard disk drives to store data, but these have slower random input/output performance than sequential input/output. To overcome this limitation, the concepts of the Solid State Drive (SSD) and Phase Change Memory (PCM) were introduced.

However, the available storage technologies still cannot deliver the performance required for processing big data.

4.2 Knowledge Discovery and Computational Complexities

Knowledge discovery and representation is a prime issue in big data. It includes a number of subfields such as authentication, archiving, management, preservation, information retrieval, and representation. There are several tools for knowledge discovery and representation, such as fuzzy sets [14], rough sets [15], soft sets [16], near sets [17], formal concept analysis [18], and principal component analysis [19], to name a few. Additionally, many hybridized techniques have been developed to process real-life problems. All these techniques are problem dependent.

Further, some of these techniques may not be suitable for large datasets on a sequential computer, while some have good scalability characteristics on parallel computers. Since the size of big data keeps increasing exponentially, the available tools may not be efficient enough to process these data and obtain meaningful information. The most popular approach for large dataset management is data warehouses and data marts. A data warehouse is mainly responsible for storing data sourced from operational systems, whereas a data mart is based on a data warehouse and facilitates analysis.
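As a small illustration of one of the techniques named above, a fuzzy set assigns each element a membership degree in [0, 1] rather than a crisp yes/no. The ramp boundaries (160-190 cm) and the function name below are arbitrary choices for the sketch.

```python
# A minimal fuzzy-set sketch: membership is a degree in [0, 1], not a crisp yes/no.
def tall_membership(height_cm):
    """Degree to which a height belongs to the fuzzy set 'tall' (linear ramp 160-190 cm)."""
    if height_cm <= 160:
        return 0.0
    if height_cm >= 190:
        return 1.0
    return (height_cm - 160) / 30

for h in (155, 175, 195):
    print(h, tall_membership(h))
```

A crisp set would force 175 cm to be either tall or not; the fuzzy set instead assigns it a degree of 0.5.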

4.3 Theoretical Challenges Facing Big Data

One of the key challenges [46] faced in today's tight market is the need to find and analyze the required data in the least time possible. However, with an exponentially growing amount of data, speed becomes a major issue, as analyzing such sheer volumes of data in detail to find the required output becomes more and more tedious.

It is not only the quantity of data, but also discovering the data appropriate to the project, which is a Herculean task. Elimination of out-of-context data is an essential objective. Even if in-context data is retrieved at high speed, the quality of the data may be compromised if it is not accurate or timely. As a result, appropriate results of the project may not be published.

4.4 Information Security

In big data analysis, massive amounts of data are correlated, analyzed, and mined for meaningful patterns. All organizations have different policies to safeguard their sensitive information. Preserving sensitive information is a major issue in big data analysis, and there is a huge security risk associated with big data [24]. Therefore, information security is becoming a big data analytics problem in its own right.

The security of big data can be enhanced using techniques of authentication, authorization, and encryption. Security challenges that big data applications face include the scale of the network, the variety of devices, real-time security monitoring, and the lack of intrusion detection systems [25], [26]. The security challenges caused by big data have attracted the attention of the information security community, so attention must be given to developing a multi-level security policy model and prevention system.
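One of the listed techniques, authentication, can be sketched with Python's standard library: store only a salted, iterated hash of each password and compare it in constant time. This is a minimal sketch of the idea, not a complete authentication system.

```python
# Minimal authentication sketch: never store plaintext credentials; store a
# random salt plus a slow, iterated hash, and verify with a constant-time compare.
import hashlib
import secrets

def hash_password(password, salt=None):
    salt = salt or secrets.token_bytes(16)
    digest = hashlib.pbkdf2_hmac("sha256", password.encode(), salt, 100_000)
    return salt, digest

def verify_password(password, salt, digest):
    candidate = hashlib.pbkdf2_hmac("sha256", password.encode(), salt, 100_000)
    return secrets.compare_digest(candidate, digest)

salt, digest = hash_password("s3cret")
print(verify_password("s3cret", salt, digest))  # True
print(verify_password("wrong", salt, digest))   # False
```

Authorization and encryption would be layered on top of such a mechanism in a real system.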

4.5 Scalability and Visualization of Data

One of the most important challenges for big data analysis techniques is scalability and security. In past decades, researchers paid attention to accelerating data analysis by speeding up processors following Moore's law. For the former, it is necessary to develop sampling, online, and multi-resolution analysis techniques. Incremental techniques have good scalability properties for big data analysis.
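A concrete example of such an incremental technique is Welford's one-pass algorithm for mean and variance: each record is seen once and discarded, so memory use stays constant no matter how large the dataset grows.

```python
# One-pass (streaming) mean and population variance via Welford's algorithm.
# Memory stays constant because each record is processed once and dropped.
def welford(stream):
    count, mean, m2 = 0, 0.0, 0.0
    for x in stream:
        count += 1
        delta = x - mean
        mean += delta / count          # running mean
        m2 += delta * (x - mean)       # running sum of squared deviations
    variance = m2 / count if count else 0.0
    return mean, variance

mean, var = welford(iter([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]))
print(mean, var)  # population mean 5.0, variance 4.0
```

A batch method would need the whole dataset in memory to get the same answer; this one works on an arbitrarily long stream.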

As the data size is scaling much faster than CPU speeds, there is a natural dramatic shift in processor technology toward increasing numbers of cores. This shift in processors leads to the development of parallel computing. Real-time applications like navigation, social networks, finance, and internet search require parallel computing. The objective of visualizing data is to present it more adequately using techniques from graph theory. Graphical visualization provides the link between the data and its proper interpretation.
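The multi-core shift described above can be exploited even from a short script; the sketch below splits an artificial workload into chunks and farms them out to worker processes with the standard library's `concurrent.futures`.

```python
# Parallel sum of squares: split the data into 4 chunks, process them in
# separate worker processes, then combine the partial results.
from concurrent.futures import ProcessPoolExecutor

def chunk_sum(chunk):
    """Partial result for one chunk (runs in a worker process)."""
    return sum(x * x for x in chunk)

if __name__ == "__main__":
    data = list(range(100_000))
    chunks = [data[i::4] for i in range(4)]  # 4 interleaved chunks partition the data
    with ProcessPoolExecutor(max_workers=4) as pool:
        total = sum(pool.map(chunk_sum, chunks))
    print(total == sum(x * x for x in data))  # partial results combine to the same answer
```

The same split-then-combine shape is what MapReduce applies across machines rather than processes.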

4.6 Diagnostic Challenges

The primary diagnostic questions are as follows:

1) What if the data volume becomes so large and varied that it is not known how to deal with it?

2) Does all the data need to be stored?

3) Does all the data need to be analyzed?

4) How can one discover which data points are really important, and how can the data be used to best advantage?

Big data brings with it some immense analytical difficulties. The kind of analysis to be done on this huge amount of data, which can be unstructured, semi-structured, or structured, requires a large number of skills. Moreover, the kind of analysis needed depends heavily on the outcomes to be obtained, i.e., on decision making. This can be done using one of two techniques: either incorporate massive data volumes into the analysis, or determine up front which big data is relevant.

5. TOOLS FOR BIG DATA PROCESSING

A large number of tools are available for processing big data. In this section we discuss some current techniques for analyzing big data, with emphasis on three important emerging tools, namely MapReduce, Apache Spark, and Storm. Most of the available tools concentrate on batch processing, stream processing, or interactive analysis.

Most batch processing tools are based on the Apache Hadoop infrastructure, such as Mahout and Dryad. Stream data applications are mostly used for real-time analytics; examples of large-scale streaming platforms are Storm and Splunk. Interactive analysis allows users to interact directly in real time for their own analysis.

5.1 Apache Hadoop and Map Reduce

The most established software platform for big data analysis is Apache Hadoop with MapReduce. It consists of the Hadoop kernel, MapReduce, the Hadoop Distributed File System (HDFS), Apache Hive, and so on. MapReduce is a programming model for processing large datasets based on the divide-and-conquer method, which is implemented in two steps: a Map step and a Reduce step.

Hadoop works with two kinds of nodes: a master node and worker nodes. In the Map step, the master node divides the input into smaller subproblems and distributes them to the worker nodes. In the Reduce step, the master node then combines the outputs of all the subproblems. Hadoop and MapReduce thus work as a powerful software framework for solving big data problems, and are also helpful for fault-tolerant storage and high-throughput data processing.
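The Map and Reduce steps just described can be mimicked in a few lines of plain Python. This in-memory word-count sketch only imitates the data flow; real Hadoop distributes the map, shuffle, and reduce phases across cluster nodes.

```python
# In-memory sketch of the MapReduce data flow: map emits (word, 1) pairs,
# a shuffle groups them by key, and reduce sums each group.
from collections import defaultdict

def map_step(document):
    return [(word, 1) for word in document.split()]

def shuffle(pairs):
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_step(grouped):
    return {key: sum(values) for key, values in grouped.items()}

docs = ["big data big value", "big insight"]
pairs = [p for doc in docs for p in map_step(doc)]
counts = reduce_step(shuffle(pairs))
print(counts)  # {'big': 3, 'data': 1, 'value': 1, 'insight': 1}
```

In a real cluster, each document would be mapped on a different worker and the shuffle would move pairs over the network, but the logical steps are the same.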

5.2 Apache Mahout

Apache Mahout aims to provide scalable, commercial-grade machine learning techniques for large-scale, intelligent data analysis applications. Mahout's core algorithms, including clustering, classification, pattern mining, regression, dimensionality reduction, evolutionary algorithms, and batch-based collaborative filtering, run on top of the Hadoop platform through the MapReduce framework.

The goal of Mahout is to build a vibrant, responsive, diverse community to facilitate discussions on the project and potential use cases. The basic objective of Apache Mahout is to provide a tool for tackling big challenges. Companies that have implemented scalable machine learning algorithms include Google, IBM, Amazon, Yahoo, Twitter, and Facebook [36].

5.3 Apache Spark

Apache Spark is an open source big data processing framework built for fast processing and sophisticated analytics. It is easy to use and was originally developed in 2009 in UC Berkeley's AMPLab, and was open sourced in 2010 as an Apache project. Spark lets you quickly write applications in Java, Scala, or Python. In addition to MapReduce operations, it supports SQL queries, streaming data, machine learning, and graph data processing. Spark runs on top of the existing Hadoop Distributed File System (HDFS) infrastructure to provide enhanced and additional functionality.

Spark consists of several components, namely the driver program, the cluster manager, and worker nodes. The driver program serves as the starting point of execution of an application on the Spark cluster. The cluster manager allocates resources, and the worker nodes do the data processing in the form of tasks. Each application has a set of processes called executors that are responsible for executing the tasks. A major advantage is that Spark supports deploying Spark applications in an existing Hadoop cluster.

5.4 Dryad

Dryad is another popular programming model for implementing parallel and distributed programs handling large contexts, based on dataflow graphs. It uses a cluster of computing nodes, and a user employs the resources of the cluster to run their program in a distributed way. Indeed, a Dryad user can use thousands of machines, each with multiple processors or cores. The major advantage is that users do not need to know anything about concurrent programming.

A Dryad application runs a computational directed graph composed of computational vertices and communication channels. Dryad provides a large amount of functionality, including generating the job graph, scheduling machines for the available processes, handling transient failures in the cluster, collecting performance metrics, visualizing the job, invoking user-defined policies, and dynamically updating the job graph in response to these policy decisions, all without knowing the semantics of the vertices.

5.5 Storm

Storm is a distributed, fault-tolerant, real-time computation system for processing large streaming data. It is specially designed for real-time processing, in contrast with Hadoop, which is designed for batch processing. It is also easy to set up and operate, scalable, and fault-tolerant, with competitive performance. A Storm cluster is superficially similar to a Hadoop cluster: on a Storm cluster users run different topologies for different Storm tasks, whereas the Hadoop platform implements MapReduce jobs for corresponding applications.

There are a number of differences between MapReduce jobs and topologies. The basic difference is that a MapReduce job eventually finishes, whereas a topology processes messages all the time, until the user terminates it. A Storm cluster consists of two kinds of nodes, a master node and worker nodes, which implement two kinds of roles: nimbus and supervisor, respectively. The two roles have functions similar to the job tracker and task tracker of the MapReduce framework.

Nimbus is in charge of distributing code across the Storm cluster, scheduling and assigning tasks to worker nodes, and monitoring the whole system. The supervisors execute the tasks assigned to them by nimbus, starting and terminating processes as necessary based on nimbus's instructions. The whole computational topology is partitioned and distributed to a number of worker processes, each of which implements a part of the topology.

5.6 Apache Drill

Apache Drill is another distributed system for interactive analysis of big data. It has more flexibility to support many types of query languages, data formats, and data sources, and is specially designed to exploit nested data. It also has the objective of scaling to 10,000 servers or more, with the capability to process petabytes of data and trillions of records in seconds. Drill uses HDFS for storage and MapReduce to perform batch analysis.

5.7 Jaspersoft

The Jaspersoft package is open source software that produces reports from database columns. It is a scalable big data analytical platform with the capability of fast data visualization on popular storage platforms, including MongoDB, Cassandra, Redis, etc. One important property of Jaspersoft is that it can quickly explore big data without Extraction, Transformation, and Loading (ETL). In addition, it has the ability to build powerful hypertext markup language (HTML) reports and dashboards interactively and directly from the big data store without an ETL requirement. These generated reports can be shared with anyone inside or outside the user's organization.

5.8 Splunk

In recent years, a lot of machine-generated data has been produced by business industries. Splunk is a real-time, intelligent platform developed for exploiting machine-generated big data. It combines up-to-the-moment cloud technologies with big data, helping users to search, monitor, and analyze their machine-generated data through a web interface. The results are exhibited intuitively as graphs, reports, and alerts. In this respect, Splunk differs from other stream processing tools.

6. APPLICATIONS IN LEARNING

Big data techniques can be used in a variety of ways in learning analytics, as listed below [8]:

Performance Prediction - A student's performance can be predicted by analyzing the student's interaction with other students and teachers in a learning environment.

Attrition Risk Detection - By analyzing students' behavior, the risk of students dropping out of courses can be detected, and measures can be implemented at the beginning of the course to retain them.

Data Visualization - Reports on educational data become more and more complex as the data grows in size. The data can be visualized using data visualization techniques so that trends and relations can be identified easily just by looking at the visual reports.

Intelligent Feedback - Learning systems can provide intelligent, immediate feedback to students in response to their inputs, which improves student interaction and performance.

Course Recommendation - New courses can be recommended to students based on interests identified by analyzing their activities. This ensures that students are not misguided into choosing fields in which they have no interest.

Student Skill Estimation - Estimation of the skills acquired by the student.

Behavior Detection - Detection of student behaviors in community-based activities or games, which helps in developing a student model.

Grouping & collaboration of students

Social network analysis

Developing concept maps

Constructing courseware

Planning and scheduling

7. SUGGESTIONS FOR THE FUTURE

The amount of data collected from various applications all over the world, across a wide variety of fields, is expected to double every two years. It has no utility unless it is analyzed to extract useful information. This necessitates the development of techniques that facilitate big data analysis.

The development of powerful computers is a boon for implementing these techniques in automated systems. The transformation of data into knowledge is by no means an easy task for high-performance, large-scale data processing, and includes exploiting the parallelism of current and upcoming computer architectures for data mining.

8. CONCLUSIONS

Big data analysis mainly involves analytical methods for big data and the systematic architecture of big data. The development of big data will increase the use of the latest technology, and from a future point of view it is highly useful. Communication systems face continuously increasing challenges today, so big data is well suited to this domain. We also presented Hadoop, one of the most efficient platforms for big data. A large number of tools are available for processing big data, and in recent years big data has accumulated in several domains.


REFERENCES

1. Chahal, D., & Gulia, P. (2016). Big data analytics. Research Journal of Computer and Information Technology Sciences, 4(2), 1-4.

2. Zakir, J. (2015). Big data analytics. Issues in Information Systems, 16(2), 81-90.

3. Gandhi, R. V., Rathan Kumar, Ch., & Vamshi Krishna, P. (2017). Big data: issues and challenges. International Journal of Software & Hardware Research in Engineering, 5(7).

4. Acharjya, D. P., & Ahmed, K. (2016). A survey on big data analytics: challenges, open research issues and tools. International Journal of Advanced Computer Science and Applications, 7(2), 511-518.

5. Mukherjee, S., & Shaw, R. (2016). Big data: concepts, applications, challenges and future scope. International Journal of Advanced Research in Computer and Communication Engineering, 5(2), 66-74.

6. Ali, G. S. H., & Nithya, A. (2017). Challenges and open research issues and tools on big data analytics. International Journal of Advanced Research in Computer Engineering & Technology, 6(11).

7. Kakhani, M. K., Kakhani, S., & Biradar, S. R. (2015). Research issues in big data analytics. International Journal of Application or Innovation in Engineering & Management, 2(8), 228-232.

8. Gandomi, A., & Haider, M. (2015). Beyond the hype: big data concepts, methods, and analytics. International Journal of Information Management, 35(2), 137-144.

9. Lynch, C. (2008). Big data: how do your data grow? Nature, 455, 28-29.

10. Jin, X., Wah, B. W., Cheng, X., & Wang, Y. (2015). Significance and challenges of big data research. Big Data Research, 2(2), 59-64.

11. Kitchin, R. (2014). Big data, new epistemologies and paradigm shifts. Big Data & Society, 1(1), 1-12.

12. Philip, C. L., Chen, Q., & Zhang, C. Y. (2014). Data-intensive applications, challenges, techniques and technologies: a survey on big data. Information Sciences, 275, 314-347.

13. Kambatla, K., Kollias, G., Kumar, V., & Grama, A. (2014). Trends in big data analytics. Journal of Parallel and Distributed Computing, 74(7), 2561-2573.

14. Del Rio, S., Lopez, V., Benitez, J. M., & Herrera, F. (2014). On the use of MapReduce for imbalanced big data using random forest. Information Sciences, 285, 112-137.

15. Kuo, M. H., Sahama, T., Kushniruk, A. W., Borycki, E. M., & Grunwell, D. K. (2014). Health big data analytics: current perspectives, challenges and potential solutions. International Journal of Big Data Intelligence, 1, 114-126.

16. Huang, Z. (1997). A fast clustering algorithm to cluster very large categorical data sets in data mining. SIGMOD Workshop on Research Issues on Data Mining and Knowledge Discovery.

17. Nambiar, R., Sethi, A., Bhardwaj, R., & Vargheese, R. (2013). A look at challenges and opportunities of big data analytics in healthcare. IEEE International Conference on Big Data, 17-22.
