
A Survey Paper on Big Data Challenges

1Shalini G, 2Niveditha K B, 3Priyanka G N

1Assistant Professor, Department of Computer Science and Engineering, Dr.T.Thimmaiah Institute of Technology, Kolar Gold Fields-563210, Karnataka, India.

2,3Student, Department of Computer Science and Engineering, Dr.T.Thimmaiah Institute of Technology, Kolar Gold Fields-563210, Karnataka, India.

1[email protected], 2[email protected], 3[email protected]

ABSTRACT: With the advent of the Internet of Things (IoT) and Web 2.0 technologies, the data travelling across the internet today is not only large but complex as well. Companies, institutions, healthcare systems and others all use files of data that are further used for creating reports, in order to ensure the continuity of the services they offer. This poses a challenge for software developers and for companies that provide IT infrastructure.

The challenge is how to manipulate such an impressive volume of data, which has to be securely delivered through the internet and reach its destination intact. This paper surveys the challenges that Big Data creates.

KEYWORDS: Big Data, Data storage, Security, Scalability.

I. INTRODUCTION

As the current technology enables us to efficiently store and query large datasets, the focus is now on techniques that make use of the complete data set, instead of sampling. This has tremendous implications in areas like pattern recognition, machine learning and classification, to name a few. Therefore, there are a number of requirements for moving beyond standard data mining techniques:

• a robust exploratory foundation, in order to be able to select an adequate method or design;

• a new algorithm;

• a technology platform and adequate development skills to be able to implement it;

• a genuine ability to understand not only the data structure (and its usability for a given processing method), but also the business value.

As a result, building multi-disciplinary teams of “Data scientists” is often an essential means of gaining a competitive edge. More than ever, intellectual property and patent portfolios are becoming essential assets. One of the obstacles to widespread analytics adoption is a lack of understanding on how to use analytics to improve the business [1].

The term “Big Data” was first introduced to the computing world by Roger Magoulas from O’Reilly media in 2005 in order to define a great amount of data that traditional data management techniques cannot manage and process due to the complexity and size of this data.

A study on the evolution of Big Data as a research and scientific topic shows that the term "Big Data" has been present in research since the 1970s, but only became widespread in publications around 2008 [3].

"Big Data" is a term encompassing the use of techniques to capture, process, analyze and visualize potentially large datasets in a reasonable timeframe, which is not achievable with standard IT technologies. By extension, the platform, tools and software used for this purpose are collectively called "Big Data technologies" [1].

Fig.1: Sources of Big Data

Figure 1 shows the various sources from which data is collected. In general, Big Data refers to the collection of large and complex datasets that are difficult to process using traditional database management tools or data processing applications. These data are available in structured, semi-structured, and unstructured formats, in petabytes and beyond. Formally, Big Data is defined in terms of 3Vs, later extended to 4Vs. The 3Vs are volume, velocity, and variety.

Volume refers to the huge amount of data generated every day, whereas velocity is the rate of growth, i.e., how fast data are gathered for analysis. Variety describes the types of data: structured, unstructured, semi-structured, etc. The fourth V refers to veracity, which includes availability and accountability. The prime objective of big data analysis is to process data of high volume, velocity, variety, and veracity using various traditional and computationally intelligent techniques.

Figure 2 illustrates this definition of big data. However, there is no single exact definition of big data, and it is commonly believed to be problem specific. Big data analysis helps in obtaining enhanced decision making, insight discovery and optimization, while being innovative and cost-effective [2].

Fig.2: Characteristics of Big Data

II. LITERATURE SURVEY

A. Why Big Data?

Information increases rapidly, at a rate of roughly 10x every five years. From 1986 to 2007, the international capacities for technological data storage, computation, processing, and communication were tracked across 60 analogue and digital technologies. In 2007, the capacity for storage in general-purpose computers was 2.9 × 10²⁰ bytes (optimally compressed) and that for communication was 2.0 × 10²¹ bytes. These computers could also carry out 6.4 × 10¹⁸ instructions per second.

Moreover, the computing capacity of general-purpose computers increases annually at a rate of 58%. In the computational sciences, Big Data is a critical issue that requires serious attention. Thus far, the essential landscapes of Big Data have not been unified.
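The two growth figures quoted above are mutually consistent: 58% annual growth compounds to roughly a tenfold increase over five years. A quick check (an illustrative calculation, not from the cited study):

```python
# Does 58% annual growth compound to roughly 10x over five years,
# as the two growth figures quoted above suggest?
annual_rate = 0.58

five_year_factor = (1 + annual_rate) ** 5
print(f"{five_year_factor:.2f}x over five years")  # ~9.85x, i.e. about 10x
```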

Furthermore, Big Data cannot be processed using existing technologies and methods.

Therefore, the generation of incalculable data by the fields of science, business, and society is a global problem. With respect to data analytics, for instance, procedures and standard tools have not been designed to search and analyze large datasets. As a result, organizations encounter early challenges in creating, managing, and manipulating large datasets. Systems of data replication have also displayed some security weaknesses with respect to the generation of multiple copies, data governance, and policy. These policies define the data that are stored, analyzed, and accessed.

They also determine the relevance of these data. To process unstructured data sources in Big Data projects, concerns regarding the scalability, low latency, and performance of data infrastructures and their data centers must be addressed. In the IT industry as a whole, the rapid rise of Big Data has generated new issues and challenges with respect to data management and analysis. Five commonly cited issues are volume, variety, velocity, value, and complexity. Beyond these, there are additional issues related to data, such as the fast growth of volume, variety, value, management, and security. Each issue represents a serious technical research problem that requires discussion. Hence, this research proposes a data life cycle that uses the technologies and terminologies of Big Data. Future research directions in this field are determined based on opportunities and several open issues in the Big Data domain [4].

B. Importance of Big Data

The main importance of Big Data lies in its potential to improve efficiency through the use of a large volume of data of different types. If Big Data is defined properly and used accordingly, organizations can get a better view of their business, leading to efficiency in different areas such as sales, improving the manufactured product, and so forth.

Big Data can be used effectively in the following areas:

• In information technology, to improve security and troubleshooting by analyzing the patterns in existing logs;

• In customer service, by using information from call centers to derive customer patterns and thus enhance customer satisfaction through customized services;

• In improving services and products through the use of social media content: by knowing potential customers' preferences, a company can modify its product to address a larger group of people;

• In the detection of fraud in online transactions for any industry;

• In risk assessment, by analyzing information from transactions on the financial market.

In the future, we propose to analyze the potential of Big Data and the power that can be unlocked through Big Data analysis [3].

C. Applications

The application of big data in various sectors is discussed as follows.

Healthcare

Data analysts obtain and analyze information from multiple sources to gain insights. These sources include electronic patient records; clinical decision support systems, including medical imaging, physicians' written notes and prescriptions, and pharmacy and laboratory data; clinical data; and machine-generated sensor data. The integration of clinical, public-health and behavioral data helps to develop a robust treatment system, which can reduce cost and, at the same time, improve the quality of treatment. The Rizzoli Orthopedic Institute in Bologna, Italy, analyzed the symptoms of individual patients to understand clinical variations within a family. This helped to reduce the number of imaging procedures and hospitalizations by 60% and 30%, respectively.

The data from the sensors are monitored and analyzed for adverse event prediction and safety monitoring.

Artemis, a system developed by Blount et al., monitors and analyzes the physiological data from sensors in intensive care units to detect the onset of medical complications, especially in the case of neo-natal care.

The real-time analysis of a huge number of claims requests can minimize fraud.

Telecommunication

Low adoption of mobile services and churn management are among the most common problems faced by mobile service providers (MSPs). The cost of acquiring a new customer is higher than that of retaining an existing one.

Customer experience is correlated with customer loyalty and revenue. In order to improve the customer experience, MSPs analyze a number of factors such as demographic data (gender, age, marital status, and language preferences), customer preferences, household structure and usage details (CDR, internet usage, value-added services (VAS)) to model customer preferences and offer relevant personalized services. This is known as targeted marketing; it improves the adoption of mobile services and reduces churn, thus increasing the revenue of MSPs. A company analyzes CDR data to identify call patterns and offer different plans to customers. The services are marketed to customers through a call or text message, and their responses are recorded for further analysis.
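As a rough illustration of the modeling step described above, the sketch below scores churn risk with a logistic regression over a handful of made-up usage features; the feature names and figures are hypothetical, and a real MSP model would draw on far richer CDR and VAS data.

```python
# Minimal churn-scoring sketch for an MSP, on invented toy data.
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy feature matrix: [monthly_call_minutes, data_gb, vas_subscriptions]
X = np.array([
    [320, 1.2, 0],
    [110, 0.3, 0],
    [540, 4.8, 2],
    [90,  0.1, 0],
    [410, 3.5, 1],
    [150, 0.7, 0],
])
y = np.array([0, 1, 0, 1, 0, 1])  # 1 = customer churned

model = LogisticRegression().fit(X, y)

# High predicted probability -> candidate for a targeted retention offer.
new_customer = np.array([[130, 0.4, 0]])
print(f"churn risk: {model.predict_proba(new_customer)[0, 1]:.2f}")
```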

Financial Firms

Currently, capital firms are using advanced technology to store huge volumes of data, but increasing data sources like the Internet and social media require them to adopt big data storage systems. Capital markets are using big data in preparation for regulations like EMIR, Solvency II, Basel II, etc.; for anti-money laundering and fraud mitigation; and for pre-trade decision-support analytics, including sentiment analysis, predictive analytics and data tagging to identify trades. The timeliness of finding value plays an important role in both investment banking and capital markets; hence, there is a need for real-time processing of data.

Retail

The evolution of e-commerce, online purchasing, social-network conversations and, more recently, location-specific smartphone interactions contributes to the volume and quality of data for data-driven customization in retailing. Major retail stores might place CCTV not only to observe instances of theft but also to track the flow of customers. This helps to observe the age group, gender and purchasing patterns of customers during weekdays and weekends. Based on the purchasing patterns of customers, retailers group their items using a well-known data mining technique called Market Basket Analysis, so that a customer buying bread and milk might purchase jam as well. This helps to decide on the placement of objects and on prices. Retailers also collect clickstream data, observe behavior and recommend products in real time.
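The bread-milk-jam example corresponds to the standard support and confidence measures behind Market Basket Analysis. A self-contained sketch over a toy transaction log (the transactions are invented for illustration):

```python
# Support/confidence for the rule {bread, milk} -> {jam}, computed over
# a toy transaction log (standard Market Basket Analysis measures).
transactions = [
    {"bread", "milk", "jam"},
    {"bread", "milk"},
    {"bread", "milk", "jam", "eggs"},
    {"milk", "eggs"},
    {"bread", "jam"},
]

antecedent, consequent = {"bread", "milk"}, {"jam"}

n_antecedent = sum(antecedent <= t for t in transactions)
n_both = sum(antecedent | consequent <= t for t in transactions)

support = n_both / len(transactions)   # how often the full itemset occurs
confidence = n_both / n_antecedent     # P(jam | bread and milk)
print(f"support={support:.2f}, confidence={confidence:.2f}")
```

A retailer would compute these measures for every candidate rule and act on the ones whose support and confidence clear chosen thresholds.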

Law Enforcement

Law enforcement officials try to predict the next crime location using past data (type of crime, place and time), social media data, and drone and smartphone tracking.

Researchers at Rutgers University developed an app called RTM Dx to prevent crime; it is being used by police departments in Illinois, Texas, Arizona, New Jersey, Missouri and Colorado. With the help of the app, a police department can measure the spatial correlation between the locations of crimes and features of the environment.

A newer technology, called facial analytics, examines images of people without violating their privacy. Facial analytics is used to check for child pornography, saving the time of manual examination. Child pornography can be identified through the integration of various technologies, such as Artemis and PhotoDNA, by comparing files and image hashes with existing files to identify a subject as adult or child. It also identifies cartoon-based pornography.

New Product Development

There is a huge risk associated with new product development. Enterprises can integrate both external sources, i.e., Twitter and Facebook pages, and internal data sources, i.e., customer relationship management (CRM) systems, to understand customers' requirements for a new product, gather ideas for new products, and understand the added features included in a competitor's product. Proper analysis and planning during the development stage can minimize the risk associated with the product, increase customer lifetime value and promote brand engagement. The Ribbon UI in Microsoft Office 2007 was created by analyzing customer data from previous releases of the product to identify the most commonly used features and make intelligent decisions.

Banking

The investment worthiness of customers can be analyzed using demographic details, behavioral data, and financial and employment data. The concept of cross-selling can be used here: targeting specific customer segments based on past buying behavior, demographic details and sentiment analysis, along with CRM data.

Energy and Utilities

Consumption of water, gas and electricity can be measured using smart meters at regular intervals, e.g., every hour. A huge amount of data is generated at these intervals and analyzed to change patterns of power usage. Real-time analysis reveals energy consumption patterns, instances of electricity theft and price fluctuations.

Education

With the advent of computerized course modules, it is possible to assess academic performance in real time. This helps to monitor the performance of students after each module and give immediate feedback on their learning patterns. It also helps teachers to assess their teaching pedagogy and modify it based on students' performance and needs. Dropout patterns, students requiring special attention and students who can handle challenging assignments can all be predicted. Beck and Mostow studied student reading comprehension using intelligent tutor software and observed that reading mistakes reduced considerably when students re-read an old story instead of a new one.

Other sectors

With increasing analytics skills among various organizations, the advantages of big data analytics can also be realized in sectors like construction and materials science [5].

III. BIG DATA TOOLS

The great amount of data collected can be classified into useful trends and patterns; thus it must be preserved, studied and processed. The following are some of the most widely used tools for taming Big Data:

a) Hadoop

Hadoop is a popular open-source data analysis tool. It is an implementation of MapReduce for the analysis of large datasets. Hadoop uses a distributed user-level file system to manage storage resources across the cluster. This file system, called HDFS, is written in Java and designed for portability across heterogeneous hardware and software platforms.

Hadoop runs on the MapReduce model, in which computation is divided into a map function and a reduce function. The map function takes a key/value pair and produces one or more intermediate key/value pairs. The reduce function then takes these intermediate key/value pairs and merges all values corresponding to a single key [6].
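To make the map/reduce contract concrete, here is a minimal single-process word count in Python. Hadoop itself distributes the same three phases (map, shuffle, reduce) across a cluster, so this is an illustration of the model rather than of Hadoop's Java API.

```python
# Word count expressed as the map/reduce functions described above:
# map emits intermediate (key, value) pairs, reduce merges all values
# that share a key. This runs in a single process for illustration.
from collections import defaultdict

def map_fn(document):
    for word in document.split():
        yield word.lower(), 1          # intermediate key/value pair

def reduce_fn(word, counts):
    return word, sum(counts)           # merge all values for one key

documents = ["Big Data is big", "data is data"]

# Shuffle phase: group intermediate values by key.
grouped = defaultdict(list)
for doc in documents:
    for key, value in map_fn(doc):
        grouped[key].append(value)

results = [reduce_fn(word, counts) for word, counts in grouped.items()]
print(results)  # [('big', 2), ('data', 3), ('is', 2)]
```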

Figure 3 explains the architecture of a Hadoop cluster. The client transacts with the cluster, which consists of cluster machines. Each cluster machine comprises a MapReduce agent and an HDFS node. The cluster also has a name node.

Fig.3: Hadoop Cluster

b) Google Charts

Google Charts is basically an API tool, available free of charge. It lets people easily create a chart from data and embed it in a web page. Google creates a PNG image of the required chart from the data and format parameters passed in an HTTP request. It supports line, bar, pie, and radar charts; Venn diagrams, scatter plots, maps, Google-o-meters, and QR codes are supported as well.

For example, given data about the oceans, the Google Charts tool will convert the data into a simple diagram like Figure 4.

Fig.4: Pie chart of the oceans of the earth.
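A pie chart like Figure 4 could be requested simply by encoding the chart type, size, data and labels as URL parameters, in the HTTP-request style the text describes. The sketch below targets the now-deprecated Image Charts endpoint, and the ocean shares are approximate illustration values.

```python
# Build the kind of HTTP request described above: chart type, size,
# data and labels are passed as URL parameters and Google returns a
# PNG. Uses the now-deprecated Image Charts endpoint for illustration.
from urllib.parse import urlencode

params = {
    "cht": "p",                          # chart type: pie
    "chs": "400x200",                    # chart size in pixels
    "chd": "t:46,24,20,7,3",             # data series (text format)
    "chl": "Pacific|Atlantic|Indian|Southern|Arctic",  # slice labels
}
url = "https://chart.googleapis.com/chart?" + urlencode(params, safe=":,|")
print(url)  # fetching this URL returned a PNG pie chart
```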

c) SAP's HANA

SAP HANA Enterprise 1.0 is an in-memory computing appliance that combines SAP database software with pre-tuned server, storage, and networking hardware from one of several SAP hardware partners [6]. It supports real-time analytic and transactional processing.

The distinctive features of HANA include:

• SAP's in-memory computing studio;

• Sybase Replication Server 15;

• SAP Host Agent 7.2;

• the SUSE Linux Enterprise Server 11 SP1 operating system.

d) GridGain

GridGain is a leading provider of an open-source In-Memory Data Fabric. It offers a comprehensive in-memory computing solution, equipping the real-time enterprise with a new level of computing power. It enables high-performance transactions, real-time streaming and ultra-fast analytics in a single, highly scalable data access and processing layer, allowing customers to predict and innovate ahead of market changes. The GridGain architecture is shown in Figure 5. The GridGain In-Memory Data Fabric provides a unified API that spans all key types of applications (Java, .NET, C++) and connects them with multiple data stores.

Fig.5: GridGain architecture

e) Splunk

Splunk is another analytics tool. It creates an index of the data as if the data were a book or a block of text. Although databases also build indices, Splunk's approach more closely resembles a text search process, and the indexing is highly flexible. The Splunk tool is already tuned to particular applications, making it easier to make sense of log files; the index helps correlate the data in these and several other common server-side scenarios.

Splunk takes text strings and searches for them in the index. It finds the URLs one wishes to find and packages them into a timeline built around the time stamps it discovers in the data. The Splunk software architecture is shown in Figure 6: data is fetched from web servers and carried to the Splunk tool; the processed data is transferred to an analytics database; and the analyzed data is then passed to an OLAP engine.

Fig.6: Splunk Architecture
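A toy sketch of this indexing idea, assuming nothing about Splunk's actual internals: treat each log line as text, build an inverted index from terms to lines, and order search hits into a timeline by the timestamps found in the data.

```python
# Inverted index over log lines, with hits ordered into a timeline by
# the timestamps embedded in the data (an illustration of the approach,
# not of Splunk's actual implementation).
import re
from collections import defaultdict

logs = [
    "2024-01-05 10:12:01 GET /index.html 200",
    "2024-01-05 10:12:07 GET /checkout 500",
    "2024-01-05 10:13:44 POST /checkout 200",
]

index = defaultdict(set)
for line_no, line in enumerate(logs):
    for term in re.findall(r"\S+", line.lower()):
        index[term].add(line_no)

def search(term):
    hits = [logs[i] for i in index.get(term.lower(), ())]
    return sorted(hits, key=lambda l: l[:19])  # timeline order by timestamp

for hit in search("/checkout"):
    print(hit)
```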

f) Jaspersoft BI suite

The Jaspersoft package is open-source software used for producing reports from database columns, and one of the leading Business Intelligence products. The software is well polished and is used to turn SQL tables into PDFs for better examination of data. JasperReports Server offers software to take up data from the major storage platforms, namely Mongo, Cassandra, Redis, Riak, CouchDB, and Neo4j.

Jaspersoft offers not just new ways to look at data, but sophisticated ways to access data stored in new locations. Once data is retrieved from these sources, Jaspersoft converts it into lucid tables and graphs, making complex material easier to digest. The reports are quite sophisticated and interactive, helping one drill down into their various aspects [6].

Fig.7: Jaspersoft Architecture

IV. CHALLENGES IN BIG DATA

Big data, due to its various properties such as volume, velocity, variety, variability, value and complexity, puts forward many challenges. These challenges can be grouped into four categories:

• Data challenges

• Process challenges

• Integration challenges

• Management challenges

Data challenges:

• Volume:

The volume of data, especially machine-generated data, is exploding, and it grows faster every year as new sources of data emerge. For example, in the year 2000, 800,000 petabytes (PB) of data were stored in the world, and this is expected to reach 35 zettabytes (ZB) by 2020 (according to IBM). Social media plays a key role: Twitter generates 7+ terabytes (TB) of data every day, Facebook 10 TB. Mobile devices play a key role as well, as there were an estimated 6 billion mobile phones in 2011.

• Variety, combining multiple data sets:

More than 80% of today's information is unstructured and typically too big to manage effectively. What does this mean? It used to be the case that all the data an organization needed to run its operations effectively was structured data generated within the organization: customer transaction data, ERP data, and so on. Today, companies are looking to leverage a lot more data from a wider variety of sources, both inside and outside the organization: documents, contracts, machine data, sensor data, social media, health records, emails, etc. The list is endless.

A lot of this data is unstructured, or has a complex structure that’s hard to represent in rows and columns.

And organizations want to be able to combine all this data and analyze it together in new ways.

• Velocity:

Shilpa Lawande of Vertica defines this challenge nicely:

“as businesses get more value out of analytics, it creates a success problem— they want the data available faster, or in other words, want real-time analytics. And they want more people to have access to it, or in other words, high user volumes.” One of the key challenges is how to react to the flood of information in the time required by the application.

• Veracity, data quality, data availability:

Who told you that the data you analyzed is good or complete? Paul Miller mentions that "a good process will, typically, make bad decisions if based upon bad data. E.g. what are the implications in, for example, a Tsunami that affects several Pacific Rim countries? If data is of high quality in one country, and poorer in another, does the Aid response skew 'unfairly' toward the well-surveyed country or toward the educated guesses being made for the poorly surveyed one?" [7]

• Data discovery:

This is a huge challenge: how to find high-quality data from the vast collections of data that are out there on the Web.

• Quality and relevance:

The challenge is determining the quality of data sets and their relevance to particular issues (i.e., whether a data set makes some underlying assumption that renders it biased or not informative for a particular question).

• Data dogmatism:

Analysis of Big Data can offer quite remarkable insights, but we must be wary of becoming too beholden to the numbers. Domain experts, and common sense, must continue to play a role. For example, "it would be worrying if the healthcare sector only responded to flu outbreaks when Google Flu Trends told them to."

• Scalability:

Shilpa Lawande explains: "techniques like social graph analysis, for instance leveraging the influencers in a social network to create better user experience, are hard problems to solve at scale. All of these problems combined create a perfect storm of challenges and opportunities to create faster, cheaper and better solutions for Big Data analytics than traditional approaches can solve." [7]

Integration challenges:

Table 1 summarizes the main integration challenges.

Table 1: Integration challenges

1. The uncertainty of data management: One disruptive facet of big data management is the use of a wide range of innovative data management tools and frameworks whose designs are dedicated to supporting operational and analytical processing. NoSQL frameworks, which differentiate themselves from traditional relational database management systems, are largely designed to fulfill the performance demands of big data applications, such as managing large amounts of data with quick response times.

2. Talent gap in Big Data: The new tools in this sector range from traditional relational database tools with alternative data layouts designed to maximize access speed while reducing the storage footprint, to NoSQL data management frameworks, in-memory analytics, and the broad Hadoop ecosystem. The reality is that there is a lack of skills available in the market for big data technologies. The typical expert has gained experience through tool implementation and its use as a programming model, apart from the big data management aspects.

3. Getting data into a big data structure: It might be obvious that the intent of big data management involves analyzing and processing a large amount of data. Many people have raised expectations about analyzing huge data sets on a big data platform, yet they may not be aware of the complexity behind the transmission, access, and delivery of data and information from a wide range of sources and then loading these data into the big data platform.

4. Syncing across data sources: Once you import data into a big data platform, you may realize that data copies migrated from a wide range of sources on different rates and schedules can rapidly get out of synchronization with the originating systems. This implies that the data coming from one source may be out of date compared with the data coming from another source. In traditional data management and data warehousing, the sequences of data transformation, extraction and migration all give rise to situations in which there is a risk of data becoming unsynchronized (see the freshness-check sketch after this table).

5. Extracting information from the data in Big Data integration: The most practical use cases for big data involve the availability of data, augmenting existing data storage as well as allowing end-users access to business intelligence tools for the purpose of data discovery. It is a challenge in big data integration to ensure right-time data availability to the data consumers.

6. Miscellaneous challenges: The ability to merge data that is not similar in source or structure, and to do so at a reasonable cost and in reasonable time. It is also a challenge to process a large amount of data at a reasonable speed so that information is available to data consumers when they need it. The validation of data sets while transferring data from one source to another or to consumers must also be fulfilled [8].
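As a minimal illustration of the syncing problem in row 4 of the table, the sketch below flags sources whose latest load lags the freshest source by more than a tolerance; the source names and timestamps are invented.

```python
# Freshness check across sources feeding a big data platform: flag any
# source whose latest load lags the most recent one by more than a
# tolerance. Source names and timestamps are made up for illustration.
from datetime import datetime, timedelta

last_loaded = {
    "crm_export":   datetime(2024, 1, 5, 10, 0),
    "weblogs":      datetime(2024, 1, 5, 10, 5),
    "billing_feed": datetime(2024, 1, 4, 22, 30),
}

freshest = max(last_loaded.values())
tolerance = timedelta(hours=6)

for source, loaded in last_loaded.items():
    if freshest - loaded > tolerance:
        print(f"{source} is out of sync ({freshest - loaded} behind)")
```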

Process challenges:

"It can take significant exploration to find the right model for analysis, and the ability to iterate very quickly and 'fail fast' through many (possibly throw-away) models—at scale—is critical." (Shilpa Lawande) According to Laura Haas (IBM Research), process challenges with deriving insights include [7]:

• Capturing data;

• Aligning data from different sources (e.g., resolving when two objects are the same; see the sketch after this list);

• Transforming the data into a form suitable for analysis;

• Modeling it, whether mathematically or through some form of simulation;

• Understanding the output, visualizing and sharing the results; think for a second how to display complex analytics on an iPhone or another mobile device.
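The alignment step above, deciding when two records refer to the same object, can be illustrated with a simple string-similarity threshold; real entity resolution uses many more signals, so this only shows the shape of the problem.

```python
# Toy entity-resolution check: do two records from different sources
# refer to the same entity? Here a single string-similarity score is
# compared against a threshold; records and threshold are illustrative.
from difflib import SequenceMatcher

def same_entity(record_a, record_b, threshold=0.85):
    score = SequenceMatcher(None, record_a.lower(), record_b.lower()).ratio()
    return score >= threshold, score

matched, score = same_entity("ACME Corporation, 12 Main Street",
                             "Acme Corp, 12 Main St.")
print(matched, round(score, 2))
```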

Management challenges:

“Many data warehouses contain sensitive data such as personal data. There are legal and ethical concerns with accessing such data. So the data must be secured and access controlled as well as logged for audits.”

The main management challenges are

• Data privacy

• Security

• Governance

• Ethics

The challenges are: ensuring that data are used correctly (abiding by their intended uses and relevant laws); tracking how the data are used, transformed, and derived; and managing their lifecycle [7].

V. CONCLUSION

Given the rate at which data is being created in the digital world, big data analytics and analysis have become ever more relevant. This paper has presented big data concepts, tools, importance, applications, and the challenges involved. Big data technology appears to be reaching a mature stage and can serve as a base for the development of future technologies that will change the world as we see it, like the IoT or on-demand services; that is the reason why Big Data is, after all, the future.

References:

1. Sameera Siddiqui, Deepa Gupta, "Big Data Process Analytics: A Survey", International Journal of Emerging Research in Management & Technology, ISSN: 2278-9359, Volume 3, Issue 7, July 2014.

2. D. P. Acharjya, Kauser Ahmed P, "A Survey on Big Data Analytics: Challenges, Open Research Issues and Tools", (IJACSA) International Journal of Advanced Computer Science and Applications, Vol. 7, No. 2, 2016.

3. Elena Geanina Ularu, Florina Camelia Puican, Anca Apostu, Manole Velicanu, "Perspectives on Big Data and Big Data Analytics", Database Systems Journal, Vol. III, No. 4, 2012.

4. Nawsher Khan, Ibrar Yaqoob, Ibrahim Abaker Targio Hashem, Zakira Inayat, Waleed Kamaleldin Mahmoud Ali, Muhammad Alam, Muhammad Shiraz, Abdullah Gani, "Big Data: Survey, Technologies, Opportunities, and Challenges", The Scientific World Journal, Hindawi Publishing Corporation, Volume 2014, Article ID 712826, 18 pages. http://dx.doi.org/10.1155/2014/712826

5. Abhay Kumar Bhadani, Dhanya Jothimani, "Big Data: Challenges, Opportunities and Realities".

6. Sofiya Mujawar, Soha Kulkarni, "Big Data: Tools and Applications", International Journal of Computer Applications (0975-8887), Volume 115, No. 23, April 2015.

7. Roberto V. Zicari, "Big Data: Challenges and Opportunities".

8. "The 6 Challenges of Big Data Integration", http://www.flydata.com/the-6-challenegs-of-big-data-integration


