No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, electronic, mechanical, photocopying, recording, scanning, or otherwise, except as permitted under Section 107 or 108 of the 1976 United States Copyright Act, without either the prior written permission of the Publisher or authorization through payment of the appropriate per-copy fee to the Copyright Clearance Center, Inc., 222 Rosewood Drive, Danvers, MA, or online at www.copyright.com. Limitation of Liability/Disclaimer of Warranty: While the publisher and author have used their best efforts in the preparation of this book, they make no representation or warranty as to the accuracy or completeness of the contents of this book and specifically disclaim any implied warranties of merchantability or fitness for a particular purpose.
PREFACE
Lyu and Mehmet Akif Gulum helped me proofread the new edition and made numerous corrections and updates in the book's appendices. This new edition also reflects the experience of the many colleagues who have used previous editions as textbooks in their active teaching.
PREFACE TO THE SECOND EDITION
I believe that this book can serve as a valuable guide to the field for undergraduates, postgraduates, researchers and practitioners. I hope that the wide range covered will allow these readers to appreciate the scope of data mining's impact on modern business, science, and even society as a whole.
PREFACE TO THE FIRST EDITION
The premise of this book is that there are only a handful of important principles and issues in the field of data mining. One of our goals in writing this book was to reduce the hype associated with data mining.
DATA-MINING CONCEPTS
INTRODUCTION
Data mining is the search for new, valuable and non-trivial information in large amounts of data. Appendix B of the book provides a brief overview of typical commercial applications of data mining technology today.
DATA-MINING ROOTS
One of the greatest strengths of data mining is reflected in the wide range of methodologies and techniques that can be applied to a large number of problem sets. Therefore, new techniques for the identification of parameters have been developed and are now part of the spectrum of data-mining techniques.
DATA-MINING PROCESS
In practice, this usually means a close interaction between the data mining expert and the application expert. The model does not attempt to capture all possible routes through the data mining process.
FROM DATA COLLECTION TO DATA PREPROCESSING
The time component of data must be recognized explicitly from the data or implicitly from the manner of its organization. On the other hand, some data mining techniques are robust enough to support analyses of data sets with missing values.
DATA WAREHOUSES FOR DATA MINING
If the data warehouse is available, the preprocessing stage in data mining is significantly reduced, sometimes even eliminated. OLAP tools are very useful for the data mining process; they may be part of it, but they are not substitutes.
FROM BIG DATA TO DATA SCIENCE
These different forms and quality of data clearly indicate that heterogeneity is a natural characteristic of big data and it is a challenge to understand and successfully manage such data. Very often there is confusion between concepts of data science, big data analytics and data mining.
BUSINESS ASPECTS OF DATA MINING: WHY DO DATA-MINING PROJECTS FAIL?
A number of data mining projects have failed in recent years because one or more of these criteria were not met. An important characteristic of a data-mining process is the relative time spent completing each of the steps in the process, and these proportions, presented in Figure 1.8, are counterintuitive.
ORGANIZATION OF THIS BOOK
Important legal constraints and guidelines and security and privacy aspects of data mining applications are also introduced in this chapter. Finally, the book has two appendices with useful background information for practical applications of data mining technology.
REVIEW QUESTIONS AND PROBLEMS
Determine whether or not each of the following activities is a data-mining task: (a) Dividing the company's customers according to their age and gender. Do a Google search for "data mining" and "text data mining". (a) Do you get the same top 10 search results? (b) What does this tell you about the content component of the ranking heuristics used by search engines?
Paul Zikopoulos and Chris Eaton, Understanding Big Data: Analytics for Enterprise-Class Hadoop and Streaming Data, McGraw-Hill Professional, 2011. With the proliferation of online services and mobile technologies, the world has entered a multimedia era of big data.
PREPARING THE DATA
REPRESENTATION OF RAW DATA
A larger radius is needed to enclose a portion of the data points in a high-dimensional space. These "curse of dimensionality" rules most often have serious consequences when dealing with a limited number of samples in a high-dimensional space.
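As a minimal illustration (not from the book's text), assuming points uniformly distributed in a unit hypercube, the edge length of a sub-cube that encloses a fraction p of the points is p^(1/d), which approaches 1 as the dimensionality d grows:

# Sketch: the edge length needed to enclose 10% of uniformly
# distributed points grows toward 1 as dimensionality d increases.
p = 0.10
for d in (1, 2, 3, 10, 100):
    edge = p ** (1.0 / d)
    print(f"d={d:>3}: edge length needed = {edge:.3f}")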
CHARACTERISTICS OF RAW DATA
Raw data is not always (very rarely in our opinion!) the best data set prepared for data mining. Data preparation is sometimes dismissed as a minor topic in the data-mining literature and used only formally as a stage in a data mining process.
TRANSFORMATION OF RAW DATA
- Normalizations
- Data Smoothing
- Differences and Ratios
The simple answer is that normalizations are useful for several different methods of data mining. The effects of relatively small transformations of input or output features are particularly important in the specification of the data-mining objectives.
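A short sketch of three common normalizations (min-max, standard-deviation, and decimal scaling), assuming NumPy is available; the sample values are illustrative only:

import numpy as np

v = np.array([200.0, 300.0, 400.0, 600.0, 1000.0])

# Min-max normalization: scale values to the range [0, 1].
minmax = (v - v.min()) / (v.max() - v.min())

# Standard-deviation (z-score) normalization: zero mean, unit variance.
zscore = (v - v.mean()) / v.std()

# Decimal scaling: divide by the smallest power of 10 that makes |v'| < 1.
k = np.floor(np.log10(np.abs(v).max())) + 1
decimal = v / (10 ** k)

print(np.round(minmax, 3), np.round(zscore, 3), np.round(decimal, 3))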
MISSING DATA
In general, replacing missing values using a simple, artificial scheme of data preparation is speculative and often misleading. It is best to generate multiple solutions from data mining with and without features that have missing values and then analyze and interpret them.
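A minimal sketch of the kind of simple, artificial scheme discussed here, mean-value imputation, set up so that results with and without the affected feature can be compared; the data are hypothetical:

import numpy as np

# Hypothetical feature with missing values encoded as np.nan.
x = np.array([2.0, np.nan, 5.0, 7.0, np.nan, 4.0])

# Mean-value imputation: the simple, artificial scheme the text warns about.
x_imputed = np.where(np.isnan(x), np.nanmean(x), x)

# Alternative: keep only complete samples, and compare downstream
# data-mining results for both versions, as the text suggests.
x_complete = x[~np.isnan(x)]
print(x_imputed, x_complete)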
TIME-DEPENDENT DATA
The resulting problem of high dimensionality is the price paid for precision in the standard representation of the time series data. This leads to a greater emphasis on recent data, potentially discarding the oldest portions of the time series.
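One common way to place greater weight on recent values is exponential smoothing, sketched below for a plain numeric series; the weighting constant alpha is an illustrative choice:

def exponential_smoothing(series, alpha=0.3):
    # Exponentially weighted average: the weight on a value k steps in
    # the past decays as (1 - alpha)**k, so recent data dominate and
    # the oldest portions of the series are effectively discarded.
    smoothed = [series[0]]
    for value in series[1:]:
        smoothed.append(alpha * value + (1 - alpha) * smoothed[-1])
    return smoothed

print(exponential_smoothing([10, 12, 11, 15, 18, 17]))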
OUTLIER ANALYSIS
The main limitations of the approach are time-consuming process and subjective nature of outlier detection. Statistical methods for multivariate outlier detection often indicate those samples that are located relatively far from the center of the data distribution.
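A sketch of one such statistical method, flagging samples located far from the center of the distribution by Mahalanobis distance; the threshold of 3.0 and the planted outlier are illustrative assumptions, and NumPy is assumed:

import numpy as np

def mahalanobis_outliers(X, threshold=3.0):
    # Flag samples located relatively far from the center of the data
    # distribution, measured by Mahalanobis distance.
    mean = X.mean(axis=0)
    cov_inv = np.linalg.inv(np.cov(X, rowvar=False))
    diffs = X - mean
    d = np.sqrt(np.einsum("ij,jk,ik->i", diffs, cov_inv, diffs))
    return d > threshold

X = np.vstack([np.random.default_rng(0).normal(size=(50, 2)), [[8.0, 8.0]]])
print(np.nonzero(mahalanobis_outliers(X))[0])  # indices of flagged samples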
REVIEW QUESTIONS AND PROBLEMS
Develop a software tool for detecting outliers if the data for preprocessing are given in the form of a flat file with n-dimensional samples. If samples exist for all possible combinations of attribute values: (a) What will be the number of samples in the data set?
Most of the prerequisite material, especially linear algebra, probability, and statistics, is covered in the text. The outlier-detection problem finds application in numerous domains where it is desirable to determine interesting and unusual events in the underlying generating process.
DATA REDUCTION
DIMENSIONS OF LARGE DATA SETS
Measurable quality – The quality of estimated results using a limited data set can be accurately determined. Recognizable quality – The quality of estimated results can be easily determined during the execution of the data-reduction algorithm, before any data-mining procedure is applied.
FEATURES REDUCTION
- Feature Selection
- Feature Extraction
Monotonicity – Algorithms are usually iterative, and the quality of the results is a non-decreasing function of time and the quality of the input data. Diminishing returns – The solution improvement is high in the early stages (iterations) of the calculation and diminishes over time.
RELIEF ALGORITHM
In this example, the number of samples is low, so we use all samples (m=n) to estimate the feature scores. The algorithm uses few heuristics and is therefore efficient: the computational complexity is O(mpn).
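A minimal sketch of the basic Relief loop for a two-class problem with numeric features scaled to [0, 1]; the Manhattan distance and the toy data set are illustrative assumptions:

import numpy as np

def relief(X, y, m=None, seed=0):
    # Basic Relief: for each of m randomly selected samples, decrease
    # every feature weight by its difference to the nearest hit (same
    # class) and increase it by the difference to the nearest miss.
    rng = np.random.default_rng(seed)
    n, p = X.shape
    m = n if m is None else m            # m = n uses all samples, as in the text
    w = np.zeros(p)
    for i in rng.choice(n, size=m, replace=False):
        dist = np.abs(X - X[i]).sum(axis=1)
        dist[i] = np.inf                 # exclude the sample itself
        hit = np.argmin(np.where(y == y[i], dist, np.inf))
        miss = np.argmin(np.where(y != y[i], dist, np.inf))
        w += (np.abs(X[i] - X[miss]) - np.abs(X[i] - X[hit])) / m
    return w

X = np.array([[0.0, 0.1], [0.1, 0.0], [0.9, 1.0], [1.0, 0.9]])
y = np.array([0, 0, 1, 1])
print(relief(X, y))   # higher weight = more relevant feature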
ENTROPY MEASURE FOR RANKING FEATURES
The distribution of all similarities (distances) for a given set of data is a characteristic of the organization and order of the data in an n-dimensional space. The proposed technique compares the entropy measure for a given data set before and after feature removal.
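A sketch of this before-and-after comparison, assuming NumPy and SciPy; the mapping from distance to similarity (S = exp(-alpha * D), with alpha set so that an average distance yields similarity 0.5) follows the technique's usual form, and the random data are illustrative:

import numpy as np
from scipy.spatial.distance import pdist

def dataset_entropy(X):
    # Entropy over all pairwise similarities; a small entropy change
    # after removing a feature suggests the feature is redundant.
    D = pdist(X)
    alpha = -np.log(0.5) / D.mean()
    S = np.clip(np.exp(-alpha * D), 1e-12, 1 - 1e-12)   # guard the logs
    return -np.sum(S * np.log(S) + (1 - S) * np.log(1 - S))

X = np.random.default_rng(0).random((20, 4))
for f in range(X.shape[1]):
    X_reduced = np.delete(X, f, axis=1)
    print(f"without feature {f}: entropy change = "
          f"{dataset_entropy(X_reduced) - dataset_entropy(X):+.3f}")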
PRINCIPAL COMPONENT ANALYSIS
Because λi's are sorted, most of the information about the data set is concentrated in a few initial principal components. For the Iris data, the first two principal components should adequately describe the characteristics of the data set.
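A minimal sketch of this variance-concentration check via eigen-decomposition of the covariance matrix, assuming NumPy; the random data below stand in for a real data set such as Iris:

import numpy as np

def explained_variance(X, k=2):
    # Sorted eigenvalues lambda_i of the covariance matrix show how
    # much of the data set's information is concentrated in the
    # first k principal components.
    cov = np.cov(X - X.mean(axis=0), rowvar=False)
    eigvals = np.sort(np.linalg.eigvalsh(cov))[::-1]
    return eigvals[:k].sum() / eigvals.sum()

X = np.random.default_rng(0).normal(size=(150, 4))
print(f"share of total variance in first 2 components: {explained_variance(X):.2%}")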
VALUE REDUCTION
When the number of bins is small, the nearest boundaries of each bin can be candidates for representatives in a given bin. First, the object's values are sorted so that the number of individual values can be counted after rounding.
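A sketch of this value-reduction idea, assuming NumPy: the values are sorted, split into bins, and each value is replaced by a representative of its bin, either the bin mean or the nearest bin boundary; the bin count and data are illustrative:

import numpy as np

def reduce_values(v, bins=3, representative="mean"):
    # Sort values, split into bins of (nearly) equal size, and replace
    # each value with its bin's representative.
    order = np.argsort(v)
    reduced = np.empty_like(v, dtype=float)
    for chunk in np.array_split(order, bins):
        values = v[chunk]
        if representative == "mean":
            reduced[chunk] = values.mean()
        else:                              # nearest bin boundary
            bounds = np.array([values.min(), values.max()])
            reduced[chunk] = bounds[np.argmin(
                np.abs(values[:, None] - bounds), axis=1)]
    return reduced

v = np.array([5.0, 1.0, 8.0, 2.0, 2.0, 9.0, 2.0, 4.0])
print(reduce_values(v, bins=3))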
FEATURE DISCRETIZATION: ChiMERGE TECHNIQUE
Starting at k = 1, rounding is performed for all values and the number of distinct values is counted. The degree-of-freedom parameter of the χ2 test is one less than the number of classes.
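A sketch of the χ2 computation ChiMerge applies to a pair of adjacent intervals, given per-class counts in each; ChiMerge repeatedly merges the adjacent pair with the lowest χ2 below a threshold. The counts below are illustrative, and NumPy is assumed:

import numpy as np

def chi2_adjacent(counts_a, counts_b):
    # Chi-square statistic for two adjacent intervals, computed from
    # observed vs. expected class counts under independence.
    observed = np.array([counts_a, counts_b], dtype=float)
    row = observed.sum(axis=1, keepdims=True)
    col = observed.sum(axis=0, keepdims=True)
    expected = row * col / observed.sum()
    expected[expected == 0] = 1e-9        # avoid division by zero
    return ((observed - expected) ** 2 / expected).sum()

# Two adjacent intervals of a two-class problem (df = classes - 1 = 1):
print(chi2_adjacent([2, 1], [1, 3]))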
CASE REDUCTION
In this case, a practical way to determine the required size of the data subset can be done as follows: In the first step, we select a small preliminary subset of samples of size m. These percentages are reasonable, but can be adjusted based on knowledge of the application and the number of samples in the data set.
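A minimal sketch of that first step, drawing a preliminary random subset of size m; the sizes are illustrative:

import numpy as np

def preliminary_subset(X, m, seed=0):
    # First step of incremental case reduction: draw a small random
    # subset of size m, estimate model quality on it, and enlarge the
    # subset only while the estimate keeps improving noticeably.
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(X), size=min(m, len(X)), replace=False)
    return X[idx]

X = np.arange(1000).reshape(500, 2)
print(preliminary_subset(X, m=5))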
REVIEW QUESTIONS AND PROBLEMS
If you use equal-height discretization with 10 bins, what is the largest number of records that can appear in a bin? What is the largest number of records that could appear in a bin with uniform-width discretization (10 bins)? What about equal-height discretization (10 bins)?
The paper explains these results by identifying weaknesses of current nonlinear techniques and suggests how the performance of nonlinear dimensionality reduction techniques can be improved. This book is an easy-to-read, gentle introduction to the world of data science for people from a wide range of backgrounds.
LEARNING FROM DATA
LEARNING MACHINE
The corresponding risk function measures the accuracy of the learning machine's predictions about the system output. The user very often selects the most appropriate set of functions for the learning machine based on his/her knowledge of the system.
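In the standard formulation of statistical learning theory, this risk functional can be written as

R(w) = \int L(y, f(X, w)) \, dP(X, y)

where L measures the loss between the system output y and the learning machine's prediction f(X, w), and P(X, y) is the unknown joint distribution of inputs and outputs.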
STATISTICAL LEARNING THEORY
The model is evaluated based on a set of approximating functions defined in the selected element of the structure. Initially, only the first term of the approximating functions is used, and the corresponding parameters are optimized.
TYPES OF LEARNING METHODS
The parameters of the learning system are adjusted under the combined influence of the training samples and the error signal. This function can be visualized as a multidimensional error surface, with the free parameters of the learning system as coordinates.
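A minimal sketch of descending such an error surface by gradient steps on a squared-error criterion; the linear model, learning rate, and synthetic data are illustrative assumptions:

import numpy as np

def gradient_descent_step(w, X, y, lr=0.1):
    # One error-correction step: move the free parameters w downhill
    # on the error surface E(w) = mean((X @ w - y)**2).
    error = X @ w - y                  # error signal
    grad = 2 * X.T @ error / len(y)    # gradient of the error surface
    return w - lr * grad

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
y = X @ np.array([1.5, -0.5]) + rng.normal(scale=0.1, size=50)
w = np.zeros(2)
for _ in range(200):
    w = gradient_descent_step(w, X, y)
print(np.round(w, 2))   # approaches the true parameters [1.5, -0.5]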
COMMON LEARNING TASKS
The regression function in Figure 4.11b was created based on some predefined criteria built into the data-mining technique. Fuzzy modeling and fuzzy decision-making are steps that are very often included in the data mining process.
SUPPORT VECTOR MACHINES
This "sparse" representation can be seen as data compression in the construction of the classifier. There is one feature of the SVM optimization process that helps determine the steps in the methodology.
SEMI-SUPERVISED SUPPORT VECTOR MACHINES (S3VM)
In recent decades, the ways of collecting data have become more diverse, and the amount of data available has grown rapidly.
This could be seen as performing clustering of unlabeled samples and then labeling the clusters by synchronizing these labels with the given labeled data. Continuity assumption – Unlabeled samples in n-dimensional space that are close to each other are more likely to share the same label.
The time complexity of the algorithm is linear in the size of the training set, since we have to calculate the distance of each training sample from the new test sample. The majority category among the k nearest neighbors is used as the classification prediction for the test sample.
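A minimal k-nearest-neighbor sketch showing both points, one pass of distance computations over the training set and a simple majority vote; the data and k are illustrative:

import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_test, k=3):
    # The distance from the new test sample to every training sample
    # is computed once, so the cost is linear in the training-set size.
    distances = np.linalg.norm(X_train - x_test, axis=1)
    nearest = np.argsort(distances)[:k]
    # Simple majority vote among the k nearest neighbors.
    return Counter(y_train[nearest]).most_common(1)[0][0]

X_train = np.array([[0, 0], [0, 1], [5, 5], [5, 6]])
y_train = np.array(["a", "a", "b", "b"])
print(knn_predict(X_train, y_train, np.array([4.5, 5.2])))  # -> "b"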
MODEL SELECTION VS. GENERALIZATION
If the number of samples is small, the designer of the data-mining experiments must be very careful in splitting the data, because with fewer samples the specific method of dividing the data begins to affect the accuracy of the model.
MODEL ESTIMATION
Previous measures have primarily been developed for classification problems where the output of the model is expected to be a categorical variable. More complex and global measures are needed to describe the quality of the model.
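A sketch of the standard confusion-matrix-based measures for a two-class model; the counts are illustrative:

def classification_measures(tp, fp, tn, fn):
    # Standard measures derived from the confusion matrix of a
    # two-class classification model.
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)            # also called sensitivity
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

print([round(m, 3) for m in classification_measures(tp=40, fp=10, tn=45, fn=5)])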
IMBALANCED DATA CLASSIFICATION
- Insurance Fraud Detection
- Improving Cardiac Care
This approach effectively forces the decision region of the minority class to become more general. Domain knowledge in the REMIND system is quite simple as stated by the author of the system.
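A sketch of a SMOTE-style variant of this idea, generating synthetic minority samples by interpolating between minority neighbors (plain replication, by contrast, tends to make the region more specific); the data and parameters are illustrative, and NumPy is assumed:

import numpy as np

def smote_like(X_min, n_new, k=3, seed=0):
    # Generate synthetic minority samples between a minority sample and
    # one of its k nearest minority neighbors, broadening the effective
    # decision region of the minority class.
    rng = np.random.default_rng(seed)
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))
        dist = np.linalg.norm(X_min - X_min[i], axis=1)
        neighbors = np.argsort(dist)[1:k + 1]      # skip the sample itself
        j = rng.choice(neighbors)
        synthetic.append(X_min[i] + rng.random() * (X_min[j] - X_min[i]))
    return np.array(synthetic)

X_min = np.array([[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.2]])
print(smote_like(X_min, n_new=4))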
REVIEW QUESTIONS AND PROBLEMS
In addition, all patients found will be reviewed by a physician prior to implantation. Given the desired class C and population P, lift is defined as:
(a) The probability of class C given population P, divided by the probability of a sample taken from the population.
(b) The probability of population P given a sample taken from P.
(c) The probability of class C given a sample taken from population P.
(d) The probability of class C given a sample taken from population P, divided by the probability of C in the entire population P.
The second part of the book covers theoretical explanations of data mining techniques that have their roots in disciplines other than statistics.
STATISTICAL METHODS
STATISTICAL INFERENCE
ASSESSING DIFFERENCES IN DATA SETS
BAYESIAN INFERENCE
PREDICTIVE REGRESSION
The most common and effective numerical measure of the center of a data set is the mean value (also called the arithmetic mean).
[Figure: A boxplot representation of the data set T based on the mean, variance, and min and max values.]
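As a reminder, for n values x_1, ..., x_n the mean is

\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i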