Big Data Paper

Big Data

Amanda Marquardt

March 26, 2017

ACG4401C

Executive Summary

Big Data is a growing topic that brings up many challenges and opportunities for

information technology advancement. Today, data is growing every second from

everything we do, and systems to support these masses of data are becoming more and

more capable of handling larger sets of data to return more useful information. Storage

of data is now relatively cheap, and parallel processing is one of several

technologies that use multiple processing servers to process Big Data

effectively, efficiently, and at an affordable cost to companies.

As data continues to grow and Big Data analytics gains popularity, legal questions

surrounding Big Data also become a concern. The legal system does not provide any

clear boundaries concerning ownership, contractual rights, or privacy standards, which

poses a challenge for Big Data analytics. Implementing laws and regulations could

help avoid future issues in Big Data development.

Big Data is not a clearly definable term, and what constitutes “Big Data” is not held to

a definable standard; however, it is generally agreed that Big Data consists of both structured and

unstructured data. A large majority of existing data is unstructured, with

social media the leading contributor to unstructured data today. A large share of people

around the world use some form of social media, making the data it produces of

extreme value to businesses and individuals with regard to business, finance, and many

other areas. Big Data is a big topic with many contributing factors bringing opportunity

and challenges. Information technology development to support the massive amounts of

data being created is essential to receive the most benefit and to give companies

Introduction

When hearing the term “Big Data,” many people assume it simply

means large amounts of data. Technically, there is still no standard

definition of Big Data; however, Big Data in short consists of data sets that are

too large to analyze using ordinary information system algorithms, creating a

demand for more complex systems to manage these overwhelming data sets.

Until recently, data was produced only from human input, but now data is growing

constantly from data manually entered by humans and data created by

computers.

The components that differentiate Big Data from ordinary data have been

referred to as the four V’s: volume, velocity, veracity, and variety. IBM has estimated

that, as of 2017, 2.5 quintillion bytes of data are created every day from

everything we do. This estimate illustrates the high volume and high

velocity of today’s data. Veracity, the reliability of the data, is also a concern today:

the 2.5 quintillion bytes of data produced every day come from a

multitude of sources that transmit data, such as email, social media, and online

shopping. The variety component refers to the presence of structured and

unstructured data. Structured data is data already organized into a logical view

that makes information easily accessible, such as Excel worksheets. Unstructured

data consists of text and multimedia content such as photographs and music files

As the four V’s continue to grow, the challenges and opportunities

of keeping up with this constant expansion also continue to grow. Considering that data

is constantly growing from everything we do, it seems as if the opportunities for

data management are limitless. We live in a data-booming generation, but having

massive amounts of data is useless if it cannot be transformed into information we

can make sense of to draw useful conclusions. It is essential that information

technology continue to improve so that we gain the most benefit from the massive

amounts of data available to us. The ability to organize and analyze the massive

loads of data existing today can provide useful information to every profession in

every field, from helping marketing specialists analyze and interpret sales trends to

helping medical professionals around the world share their knowledge to cure

diseases and save lives. There is no question that Big Data is here, but the only

way to benefit from it is to have access to affordable systems to manage it all.

Parallel Processing

Before the term Big Data came about, the only source of data was the data

entered by employees. As technology evolved and the internet became

accessible to everyone, not only employees but also users and computers

themselves started generating data, causing a huge rise in data accumulation

and bringing us the term “Big Data.” Users have been the biggest contributor to the

expansion of data. From Facebook posts, Tweets, online shopping, many people

When data volumes were much smaller and data was entered only by employees, relational

databases were used, and data was brought to a single processor. Now that there

is so much more data, it is overwhelming for a single server to process. It was

obvious that more storage would be necessary as we evolved into a world of “Big

Data,” which led to advanced systems consisting of more processing servers

capable of storing more data.

Companies used to pay database vendors such as Oracle and IBM to manage

their data. Eventually, Google’s data became too large for their vendor to manage,

leading them to the creation of MapReduce. MapReduce was a programming model used

to break the processing of their large data sets into smaller parts spread across multiple processing

servers. This allowed more data to be stored and processed at a faster rate. This

was the start of one method used today, known as parallel processing.
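
The split, process, and combine idea behind MapReduce can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not Google's or Hadoop's implementation: the function names (map_phase, shuffle, reduce_phase, word_count) and the sample text are hypothetical, and threads on one machine stand in for the separate processing servers of a real cluster.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def map_phase(chunk):
    """Map step: emit a (word, 1) pair for every word in one chunk of text."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(mapped_chunks):
    """Shuffle step: group all emitted values by key across every chunk."""
    grouped = defaultdict(list)
    for pairs in mapped_chunks:
        for key, value in pairs:
            grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Reduce step: collapse each key's list of values into a single total."""
    return {key: sum(values) for key, values in grouped.items()}


def word_count(chunks, workers=4):
    # Each chunk is mapped independently of the others, so the map step can
    # be handed to separate workers. Threads are an assumption standing in
    # for the separate servers of a real cluster.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        mapped = list(pool.map(map_phase, chunks))
    return reduce_phase(shuffle(mapped))


if __name__ == "__main__":
    # Hypothetical input: three small partitions of a larger data set.
    chunks = ["big data is big", "data is growing", "big data analytics"]
    print(word_count(chunks))
```

The point of the design is that no single worker ever needs to see the whole data set: each one handles its own partition, and only the small intermediate results are combined at the end.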

Parallel processing is a popular processing method used today. It processes data

by bringing multiple processors to the data, as opposed to before when data was

brought to the servers. Companies are providing services to help implement

parallel processing infrastructures for businesses.

Hadoop is a popular framework that makes parallel processing available to

companies. Typically, companies using over 10 terabytes of data receive the

greatest net benefit from Hadoop. Hadoop is an open-source platform,

meaning the software itself does not cost anything to use, but experts will more than likely be

needed to manage the system.

The variety component of Big Data has become the most challenging issue

in analyzing Big Data because unstructured data is difficult to break down

and organize, given the many different sources it comes from. Approximately

90% of a company’s data is unstructured. As mentioned before,

unstructured data includes files containing text and multimedia components,

such as emails, photographs, music files, and websites. Although unstructured

data makes up the majority of all existing data, it has presented the most difficulty

for the IT community in designing software to analyze it. The challenge lies in

filtering relevant data from multiple media and then organizing it into

useful information. The challenges associated with analyzing unstructured data put

a limit on the information available to companies, which in turn limits the availability of

decision-making tools.
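
To make the filter-then-organize challenge concrete, the sketch below turns a few free-text items into structured rows and keeps only the relevant ones. It is a deliberately simple illustration: the field names, the keyword list, the sample items, and the regular expressions are all assumptions made for this example, and real unstructured data (images, audio, messy text) is far harder to handle than this suggests.

```python
import re

# Simplified patterns, assumed for illustration only.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
HASHTAG_RE = re.compile(r"#(\w+)")


def to_record(source, text, keywords):
    """Turn one free-text item into a structured row.

    source:   where the text came from (for example "email" or "tweet")
    text:     the raw unstructured content
    keywords: terms considered relevant for this hypothetical analysis
    """
    lowered = text.lower()
    return {
        "source": source,
        "emails": EMAIL_RE.findall(text),
        "hashtags": [tag.lower() for tag in HASHTAG_RE.findall(text)],
        "matched_keywords": [k for k in keywords if k.lower() in lowered],
    }


def filter_relevant(items, keywords):
    """Keep only the items that mention at least one keyword."""
    records = (to_record(source, text, keywords) for source, text in items)
    return [r for r in records if r["matched_keywords"]]


if __name__ == "__main__":
    # Hypothetical items from two different media.
    items = [
        ("email", "Please send the refund to jane.doe@example.com today"),
        ("tweet", "Loving the weather #sunny"),
        ("tweet", "Refund still missing #Support"),
    ]
    for row in filter_relevant(items, ["refund"]):
        print(row)
```

Once the text is in rows with named fields, it can be stored and queried like structured data, which is the step the paragraph above identifies as the hard part.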

Social media is a popular and fast-growing source of unstructured data. Facebook

currently has over a billion users, active Twitter accounts are in the

multimillions, and there are over 400 million profiles on LinkedIn, just to name a

few. Social media is constantly being used for a multitude of reasons, such as

networking, marketing, or personal use. These social media websites are

equipped to manage and store this big data; moreover, the volume of data

produced by social media has the potential to yield a great deal of beneficial

information for companies.

In early 2000, one method for predicting stock performance was to evaluate

the volume of messages on financial blogs mentioning specific companies.

lead to a rise in stock price the next day. This idea led to a study started in

2012 (Sanger & Warin) of the relationship between stock prices and Tweets. The study used 71

companies of the S&P 500, gathering the number of times the name of each

company was mentioned on Twitter and the number of financial Tweets (“$”

before the ticker, e.g., $GOOG) posted about the companies, and compared these counts to

the intraday and overnight stock returns. After a year’s worth of data was

collected, it was concluded that the number of financial Tweets has a negative

correlation with overnight stock returns. Data from Twitter has been used to

study numerous areas, but the research involved in gathering and analyzing the data

is difficult and time-consuming without software to assist with filtering and

compiling the data needed to conduct the study.
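
The kind of filtering and compiling described above can be sketched as follows. This is not the procedure Sanger & Warin used, which is not detailed here; it is a minimal illustration that assumes a financial Tweet is any Tweet containing a “$” followed by a ticker of one to five letters, and that uses an ordinary Pearson correlation to relate daily counts to returns. The function names and sample Tweets are hypothetical.

```python
import re
from collections import Counter
from math import sqrt

# A cashtag is "$" followed by a ticker symbol, e.g. $GOOG.
# The 1-5 letter limit is an assumption made for this sketch.
CASHTAG_RE = re.compile(r"\$([A-Za-z]{1,5})\b")


def count_cashtags(tweets):
    """Count how many tweets mention each ticker at least once."""
    counts = Counter()
    for text in tweets:
        # A set ensures a ticker repeated inside one tweet counts once.
        tickers = {t.upper() for t in CASHTAG_RE.findall(text)}
        counts.update(tickers)
    return counts


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two series of equal length >= 2")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / sqrt(var_x * var_y)


if __name__ == "__main__":
    # Hypothetical tweets, for illustration only.
    tweets = [
        "Buying $GOOG today",
        "$goog and $AAPL look strong",
        "Lunch cost $12",
        "$GOOG $GOOG again",
    ]
    print(count_cashtags(tweets))
```

In use, one would build a series of daily cashtag counts per company and a matching series of overnight returns, then pass both to pearson; a value below zero would correspond to the negative relationship the study reports.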

The Tweet counts collected by Sanger & Warin were obtained from a

website called PeopleBrowsr. PeopleBrowsr is the largest Social Intelligence

Platform in the world. This website allows companies to instantly create large

networks via social media as well as compile data from their networks into queries

to filter useful information. Social media analytics tools such as PeopleBrowsr are

becoming more popular and have been a huge step in the right direction for

the challenges associated with unstructured data.

The Law & Big Data Analytics

In 2013, the website PeopleBrowsr mentioned above encountered legal issues

data. PeopleBrowsr was paying Twitter $1 million a year for access to its

database; however, Twitter wanted to control its data and offer it exclusively to

other companies. PeopleBrowsr filed a complaint against Twitter for violation of

state common law and the California Unfair Competition Law, claiming that data

obtained from Twitter was the main source of its business and that being refused

access would cause the business to cease. The case was moved to federal

court because the alleged violation was considered more closely aligned with federal law, namely the

Sherman Act. The parties settled, allowing PeopleBrowsr access to Twitter’s data

for the rest of that year, after which Twitter was granted full control of the data,

causing PeopleBrowsr to purchase the data from other companies at a much

higher cost than the $1 million per year it had been paying directly to Twitter.

Big Data is a valuable commodity that can make or break a company. As seen in

the case of PeopleBrowsr and Twitter, there is a grey area regarding the legal

right to access data. Twitter was accused of violating the Sherman

Act, which raises the issue of monopolizing Big Data. This case also raises

contractual issues with Big Data. Twitter originally gave PeopleBrowsr licensing

rights to its data, but the duration of the agreement became an issue. When

data holders provide data services for compensation, there are no clear legal

requirements regarding contracts and no regulation enforcing contracts associated

with Big Data.

In 2015, Radio Shack filed for bankruptcy, and the company’s assets were

liquidated to pay its debt. Included in Radio Shack’s assets was its customer

privacy policies, causing some of these customers to contest the sale of their

personal information. The bankruptcy court allowed General Wireless to

purchase the customer records, but with restrictions. The restrictions

included giving the customers notice of the sale with an option to opt out and requiring that the data

be used in the same line of business as Radio Shack, among others.

Privacy is another issue that comes into play, particularly regarding Big Data

analytics. Larger quantities of data bring greater value and more analysis

opportunities, but some people are reluctant to share data, fearing exposure of

their personal information. Although data analysts typically anonymize their

results, Big Data is still built from data contributed by sources

everywhere and by everyone. Big Data is of great value, which raises the

question of whether people can seek compensation for their valuable

contribution. The government has an ethical responsibility to protect citizens’

privacy but lacks a definitive line as to what constitutes an “invasion of privacy” in Big

Data analytics. Big Data analyses can provide great public good without causing

harm to citizens, but the law lacks a definitive line regarding the extent to which the

legal system will protect Big Data analytics relative to the protection of

citizens’ personal data.

These are only a few of the issues regarding the law and Big Data, particularly Big

Data analytics. As information systems become more advanced and more data

can be processed into useful information, the legal issues will also continue to

grow. We are becoming increasingly in control of data produced from

For Big Data analysis to continue to progress, it is essential that more legal

regulation be put into effect.

Conclusion

Big Data has given this generation new and exciting challenges to face. We have

been able to evolve from relational databases that brought data to a single

processor to parallel processing systems that connect multiple servers

to store and process more data at higher speeds. Social media has proven to be of

extreme value in today’s Big Data analyses. Social media data is so widely available

that it has been used in many areas of research, including stock returns;

however, unstructured data such as social media content has proven to be a

challenge in Big Data analytics. Because unstructured data comes from so many sources,

it is difficult to filter all relevant data effectively. Big Data has also raised

concerns regarding the legal system’s involvement with Big Data. We have seen

that there are many issues at hand with Big Data and the legal system that leave

grey areas. Ownership of data, contractual requirements, and privacy issues are

a few concerns that have required legal action. The cases mentioned in this

paper have been decided on a case-by-case basis; however, whether

the legal system should enact clear guidelines with regard to Big Data and its

technicalities to protect the progression of Big Data analytics is worth

consideration.

1. Arthur, L. (2013), “What is Big Data”, Forbes Magazine, https://www.forbes.com/sites/lisaarthur/2013/08/15/what-is-big-data/#122b6075c85b

2. Talluri, Sushma. "Big Data using Cloud Technologies." Global Journal of Computer Science and Technology 16.2 (2016).

3. Mishra, Devendra Kumar. "Challenges with Unstructured Big Data Analysis Using Machine Learning Approach: A Review." Futuristic Trends in Engineering, Science, Humanities, and Technology FTESHT-16 (2016): 130.

4. Šebalj, Dario, Ana Živković, and Kristina Hodak. "Big data: Changes in data management." Ekonomski vjesnik/Econviews-Review of Contemporary Business, Entrepreneurship and Economic Issues 29.2 (2016): 487-499.

5. Sanger, William, and Thierry Warin. "High Frequency and Unstructured Data in Finance: An Exploratory Study of Twitter." Journal of Global Research in Computer Science 7.4 (2016).

6. Allen, Anita L. "Protecting One's Own Privacy in a Big Data Economy." (2016).

7. Brooker, Phillip, Julie Barnett, and Timothy Cribbin. "Doing social media analytics." Big Data & Society 3.2 (2016): 2053951716658060.

8. Kitchin, Rob, and Gavin McArdle. "What makes Big Data, Big Data? Exploring the ontological characteristics of 26 datasets." Big Data & Society 3.1 (2016): 2053951716631130.

9. Zeno-Zencovich, Vincenzo, and Giorgio Giannone Codiglione. "Ten Legal Perspectives on the 'Big Data Revolution'." (2016).

10. Pradhananga, Yanish, Shridevi Karande, and Chandraprakash Karande. "High Performance Analytics of Big Data with Dynamic and Optimized Hadoop Cluster." Advanced Communication Control and Computing Technologies (ICACCCT), 2016 International Conference on. IEEE, 2016.

12. Carter, Edward L., and Laurie Thomas Lee. "Information Access and Control in an Age of Big Data." Journalism & Mass Communication Quarterly 93.2 (2016): 269-272.

13. McAfee, David. “Twitter, PeopleBrowsr Settle Dispute Over Data Access.” Law360.com

14. Che, Dunren, Mejdl Safran, and Zhiyong Peng. "From big data to big data mining: challenges, issues, and opportunities." International Conference on Database Systems for Advanced Applications. Springer Berlin Heidelberg, 2013.