Data mining represents one of the major applications for data warehousing, since the sole function of a data warehouse is to provide information to end users for decision support. Unlike other query tools and application systems, the data-mining process provides an end user with the capacity to extract hidden, nontrivial information. Such information, although more difficult to extract, can provide bigger business and scientific advantages and yield higher returns on "data-warehousing and data-mining" investments.
How is data mining different from other typical applications of a data warehouse, such as Structured Query Language (SQL) and online analytical processing (OLAP) tools, which are also applied to data warehouses? SQL is a standard relational database language that is good for queries that impose some kind of constraints on data in the database in order to extract an answer. In contrast, data-mining methods are good for queries that are exploratory in nature, trying to extract hidden, not so obvious information. SQL is useful when we know exactly what we are looking for and can describe it formally; we use data-mining methods when we know only vaguely what we are looking for. Therefore, these two classes of data-warehousing applications are complementary.
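As a minimal sketch of this difference (the customer data and column names are hypothetical, and Python with pandas and scikit-learn is assumed only for illustration, not prescribed by this text), compare a constrained, SQL-style query with an exploratory clustering step:

```python
# A minimal sketch contrasting the two query styles (hypothetical customer
# data; pandas emulates the SQL WHERE clause, scikit-learn does the mining).
import pandas as pd
from sklearn.cluster import KMeans

customers = pd.DataFrame({
    "income":  [25000, 48000, 52000, 91000, 30000, 75000],
    "balance": [1200, 5300, 4800, 15000, 900, 11000],
})

# SQL style: we know exactly what we want and state it as a formal constraint,
# e.g. SELECT * FROM customers WHERE income > 50000;
high_income = customers.query("income > 50000")
print(high_income)

# Data-mining style: we only vaguely suspect that natural customer groups
# exist, so we let a clustering algorithm discover the segments for us.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
customers["segment"] = kmeans.fit_predict(customers[["income", "balance"]])
print(customers)
```

The first step can only return rows matching a condition stated in advance; the second can surface groupings the analyst never thought to ask about.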
OLAP tools and methods have become very popular in recent years as they let users analyze data in a warehouse by providing multiple views of the data, supported by advanced graphical representations. In these views, different dimensions of data correspond to different business characteristics. OLAP tools make it very easy to look at dimensional data from any angle or to slice and dice it. OLAP is part of the spectrum of decision-support tools. Traditional query and report tools describe what is in a database. OLAP goes further; it is used to answer why certain things are true. The user forms a hypothesis about a relationship and verifies it with a series of queries against the data. For example, an analyst might want to determine the factors that lead to loan defaults. He or she might initially hypothesize that people with low incomes are bad credit risks and analyze the database with OLAP to verify (or disprove) this assumption. In other words, the OLAP analyst generates a series of hypothetical patterns and relationships and uses queries against the database to verify or disprove them.
OLAP analysis is essentially a deductive process.
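As a minimal illustration of this deductive, hypothesis-driven style (the loan records are hypothetical, and pandas stands in here for an OLAP slice-and-dice tool), the analyst's income-versus-default question reduces to slicing the data along one dimension and comparing aggregates:

```python
# OLAP-style verification of an analyst's hypothesis: "low-income borrowers
# are worse credit risks." The loan records are hypothetical, and pandas
# stands in for an OLAP slice-and-dice tool.
import pandas as pd

loans = pd.DataFrame({
    "income_band": ["low", "low", "low", "mid", "mid", "high", "high", "high"],
    "defaulted":   [1, 1, 0, 0, 1, 0, 0, 0],
})

# Slice along the income dimension and aggregate the default rate per slice.
default_rate = loans.groupby("income_band")["defaulted"].mean()
print(default_rate)
# A clearly higher rate in the "low" band supports the hypothesis; otherwise
# the analyst refines the hypothesis and issues further queries.
```

Note that the analyst supplies the hypothesis; the tool only confirms or refutes it, which is exactly the deductive character described above.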
Although OLAP tools, like data-mining tools, provide answers that are derived from data, the similarity between them ends there. The derivation of answers from data in OLAP is analogous to calculations in a spreadsheet; because they use simple, given-in-advance calculations, OLAP tools do not learn from data, nor do they create new knowledge. They are usually special-purpose visualization tools that can help end users draw their own conclusions and make decisions based on graphically condensed data.
OLAP tools are very useful for the data-mining process; they can be a part of it, but they are not a substitute for it.
1.6 FROM BIG DATA TO DATA SCIENCE

The enormous volume of data that is continuously generated today, by Internet users, a variety of sensors, and new types of mobile devices, is often referred to as big data.
Recent studies estimate an increase in annually created data from around 1.2 zettabytes in 2010 to 40 zettabytes in 2020. If this is a new concept for the reader: 1 zettabyte = 10³ exabytes = 10⁶ petabytes. Big data may be primarily generated through five main types of data sources:
• Operational data comes from traditional transactional systems; it is assumed also to include monitored streaming data, often coming from large numbers of sensors.
• Dark data is the large amount of data that you already own but do not use in current decision processes; it may include emails, contracts, and a variety of written reports.
• Commercial data is available on the market and may be purchased from some companies, specialized social media, or even governmental organizations.
• Social data comes from Twitter, Facebook, and other general social media; examples of the rapid growth of these data are given in Table 1.1.
• Public data such as economic, sociodemographic, or weather data (Fig. 1.7).
Big data could become a new infrastructure for advances in medical research, global security, logistics and transportation solutions, and the identification of terrorist activities, as well as for dealing with socioeconomic and environmental issues.
Fundamentally, big data means not only a large volume of data but also other features that differentiate it from the concepts of "massive data" and "very large data." The term big data has gained huge popularity in recent years, but it is still poorly defined. One of the most commonly cited definitions specifies big data through the following four dimensions: "volume," "variety," "velocity," and "veracity" (the so-called 4V model):
1. Volume refers to the magnitude of data. Real-world big data applications are reported in multiple terabytes and petabytes, and tomorrow they will be in exabytes. What is deemed and impresses as big data today may not meet the threshold in the future: storage capacities are increasing, and new tools are developing, allowing bigger data sets to be captured and analyzed.

TABLE 1.1. Big Data on the Web

Company      Big Data
YouTube      Users upload 100 hours of new videos per minute
Facebook     More than 1.4 billion users communicating in 70+ languages
Twitter      175 million tweets per day
Google       2 million search queries/minute, processing 35 petabytes daily
Apple        47,000 applications are downloaded per minute
Instagram    Users share 40 million photos per day
LinkedIn     2.1 million groups have been created
Foursquare   571 new Web sites are launched each minute
2. Variety refers to the structural heterogeneity in a data set, including the use and benefits of various types of structured, semi-structured, and unstructured data. Text, images, audio, and video are examples of unstructured data, which are the dominant data types, representing more than 90% of today's digital world. These different forms and qualities of data clearly indicate that heterogeneity is a natural property of big data and that it is a challenge to comprehend and successfully manage such data. For instance, during the Fukushima nuclear disaster, when the public started broadcasting about radioactive materials, a wide variety of inconsistent data, collected with diverse and uncalibrated devices, was reported for similar or neighboring locations; all of this adds to the problem of the increasing variety of data.
3. Velocity refers to the rate at which data are generated and the speed at which they should be analyzed and acted upon. Digital devices such as smartphones and a variety of available, relatively cheap sensors have led to an unprecedented rate of data creation in real time, requiring new IT infrastructures and new methodologies that support the growing need for real-time analytics. Floods of digital personalized data about customers, such as their geospatial location and buying behavior and patterns, can be used in real time by many companies to monitor and improve their business models (a minimal code sketch of such one-pass, streaming analytics is given after this list).
4. Veracity highlights the unreliability inherent in some sources of today's digital data. The need to deal with such imprecise and uncertain data is an important facet of big data, and it requires adjustments of tools and applied analytics methodologies. The fact that one in three business leaders does not trust the information that they use to make decisions is a strong indicator that a good big data application needs to address veracity. Customer sentiments, analyzed through the Internet, are an example of data that are uncertain in nature, since they entail human judgment. Yet they contain valuable information that could help businesses.

Figure 1.7. Exponential growth of global data in zettabytes, 2005-2019. From: http://s3.amazonaws.com/sdieee/1990-IEEE_meeting_Jun_2016_Final-2.pdf.
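To give a flavor of the velocity dimension in practice, the following minimal Python sketch (the sensor readings and the 3-sigma threshold are hypothetical; real deployments would use dedicated stream-processing infrastructure) analyzes data in a single pass, as it arrives, instead of storing it all first:

```python
# One-pass (streaming) analytics: maintain running statistics and flag
# anomalous sensor readings without ever storing the full data set.
# The reading stream and the 3-sigma threshold are hypothetical choices.
from typing import Iterable

def monitor(stream: Iterable[float], threshold: float = 3.0) -> None:
    count, mean, m2 = 0, 0.0, 0.0
    for x in stream:
        # Flag the reading if it deviates strongly from the statistics
        # accumulated so far (checked before the update, so an outlier
        # cannot mask itself by inflating the running variance).
        if count >= 2:
            std = (m2 / (count - 1)) ** 0.5
            if std > 0 and abs(x - mean) > threshold * std:
                print(f"alert: reading {x} far from running mean {mean:.2f}")
        # Welford's online update of the running mean and variance.
        count += 1
        delta = x - mean
        mean += delta / count
        m2 += delta * (x - mean)

monitor([20.1, 19.8, 20.3, 20.0, 35.7, 20.2])  # only 35.7 is flagged
```

Because the statistics are updated incrementally, memory use stays constant no matter how fast, or for how long, the stream runs, which is the essence of real-time analytics at high velocity.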
There are many business and scientific opportunities related to big data, but at the same time there are new threats too. The big data market was poised to grow to more than $50 billion in 2017, but at the same time more than 55% of big data projects failed!
The heterogeneity, ubiquity, and dynamic nature of the different resources and devices for data generation, together with the enormous scale of the data itself, make determining, retrieving, processing, integrating, and inferring from real-world data a challenging task. To begin with, we can briefly enumerate the main problems with implementations of, and threats to, these new big data solutions:
(a) Data breaches and reduced security,
(b) Intrusion of users' privacy,
(c) Unfair use of data,
(d) Escalating cost of data movement,
(e) Scalability of computations, and
(f) Data quality.
Because of these serious challenges, novel approaches and techniques are required to address big data problems.
Although it seems that big data makes it possible to find more useful, actionable information, the truth is that more data do not necessarily mean better analyses and more informative conclusions. Therefore, designing and deploying a big data mining system is not a trivial or straightforward task. The remaining chapters of this book will try to give some initial answers to these big data challenges.
In this introductory section we would like to introduce one more concept that is highly related to big data: the new field of data science. Decision-makers of all kinds, from company executives and government agencies to researchers and scientists, would like to base their decisions and actions on the available data. In response to these multidisciplinary requests, a new discipline of big data science is forming. Data scientists are professionals who try to gain, from data, knowledge or awareness of something not known before. They need business knowledge; they need to know how to deploy new technology; they have to understand statistical, machine learning, and visualization techniques; and they need to know how to interpret and present the results.
The name data science seems to connect most strongly with areas such as databases and computer science in general, and more specifically it builds on machine learning and statistics. But many different kinds of skills are necessary for the profile, and many other disciplines are involved: skill in communicating with data users; understanding the big picture of a complex system described by data; analyzing the business aspects of a big data application; knowing how to transform, visualize, interpret, and summarize big data; maintaining the quality of data; and taking care of the security, privacy, and legal aspects of data. Of course, there is a very small number of experts who are good at all of these skills, and therefore we always have to emphasize the importance of multidisciplinary teamwork in big data environments. Maybe the following definition of a data scientist, which highlights professional persistence, gives better insight: A data scientist is the adult version of a kid who can't stop asking "Why?" Data science is supporting discoveries in many human endeavors, including healthcare, manufacturing, education, cybersecurity, financial modeling, social science, policing, and marketing. It has been used to produce significant results in areas ranging from particle physics (such as the Higgs boson) and the identification and resolution of sleep disorders using Fitbit data to recommender systems for literature, theater, and shopping. As a result of these initial successes and this potential, data science is rapidly becoming an applied sub-discipline of many academic areas.
Very often there is confusion among the concepts of data science, big data analytics, and data mining. Based on the previous interpretations of the data science discipline, data mining highlights only a segment of a data scientist's tasks, but it represents very important core activities in gaining new knowledge from big data. Although major innovations in data-mining techniques for big data have not yet matured, we anticipate the emergence of such novel analytics in the near future. Recently, several additional terms, including advanced data analytics, have been introduced and are more often used, but with some level of approximation we can accept them as concepts equivalent to data mining.
The sudden rise of big data has left many unprepared, including corporate leaders, municipal planners, and academics. The fast evolution of big data technologies and the ready acceptance of the concept by the public and private sectors left little time for the discipline to mature, leaving open questions about the security, privacy, and legal aspects of big data. The security and privacy issues that accompany the work of big data mining are challenging research topics. They include important questions about how to safely store the data, how to make sure that data communication is protected, and how to prevent someone from finding out our private information. Because big data means that more sensitive data are put together, it is more attractive to potential hackers: in 2012 LinkedIn was accused of leaking 6.5 million user account passwords, while later Yahoo faced network attacks resulting in 450,000 user ID leaks. Privacy concerns typically make most people uncomfortable, especially if systems cannot guarantee that their personal information will not be accessed by other people and organizations. Anonymous and temporary identification and encryption are the representative technologies for privacy in big data mining, but the critical factor is how to use, what to use, when to use, and why to use the collected big data.
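As one small, concrete example of such privacy techniques, the following sketch is a simplified illustration only (hypothetical records, with salted hashing standing in for the broader family of anonymization and encryption methods): user identifiers are pseudonymized before the records enter a big data store.

```python
# A minimal pseudonymization sketch: replace direct user identifiers with
# salted hashes before records enter a big data store, so records of the
# same user can still be linked without exposing who the user is.
# The records and the in-memory salt handling are simplified illustrations.
import hashlib
import secrets

SALT = secrets.token_bytes(16)  # in practice, managed and stored securely

def pseudonymize(user_id: str) -> str:
    return hashlib.sha256(SALT + user_id.encode("utf-8")).hexdigest()[:16]

records = [
    {"user_id": "alice@example.com", "purchase": 29.99},
    {"user_id": "bob@example.com", "purchase": 110.00},
    {"user_id": "alice@example.com", "purchase": 15.50},
]

safe_records = [
    {"user_id": pseudonymize(r["user_id"]), "purchase": r["purchase"]}
    for r in records
]
print(safe_records)  # same user -> same pseudonym; real identity is hidden
```

Pseudonymization alone does not guarantee privacy, since combinations of remaining attributes can still re-identify users; this is why the critical factor remains how, what, when, and why the collected big data are used.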
1.7 BUSINESS ASPECTS OF DATA MINING: WHY