An Overview

(1)

Data Mining: Concepts and Techniques (3rd ed.)

— Chapter 3 —

Jiawei Han, Micheline Kamber, and Jian Pei
University of Illinois at Urbana-Champaign & Simon Fraser University

(2)

Chapter 3: Data Preprocessing

- Data Preprocessing: An Overview
- Data Quality
- Major Tasks in Data Preprocessing
- Data Cleaning
- Data Integration
- Data Reduction
- Data Transformation and Data Discretization
- Summary

(3)

Data Transformation

- A function that maps the entire set of values of a given attribute to a new set of replacement values such that each old value can be identified with one of the new values
- Methods
  - Smoothing: remove noise from data
  - Attribute/feature construction
    - New attributes constructed from the given ones
  - Aggregation: summarization, data cube construction
  - Normalization: values scaled to fall within a smaller, specified range
    - min-max normalization
    - z-score normalization
    - normalization by decimal scaling
  - Discretization: concept hierarchy climbing

(4)

Normalization

- Min-max normalization: to [new_min_A, new_max_A]

\[
v' = \frac{v - \mathit{min}_A}{\mathit{max}_A - \mathit{min}_A}\,(\mathit{new\_max}_A - \mathit{new\_min}_A) + \mathit{new\_min}_A
\]

  - Ex. Let income range from $12,000 to $98,000, normalized to [0.0, 1.0]. Then $73,600 is mapped to

\[
\frac{73{,}600 - 12{,}000}{98{,}000 - 12{,}000}\,(1.0 - 0) + 0 = 0.716
\]

- Z-score normalization (μ_A: mean, σ_A: standard deviation of attribute A):

\[
v' = \frac{v - \mu_A}{\sigma_A}
\]

  - Ex. Let μ = 54,000, σ = 16,000. Then $73,600 is mapped to

\[
\frac{73{,}600 - 54{,}000}{16{,}000} = 1.225
\]

- Normalization by decimal scaling

\[
v' = \frac{v}{10^{j}}
\]

  where j is the smallest integer such that Max(|v'|) < 1
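The three normalization methods can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function names are made up for this example, the decimal-scaling search is assumed to start at j = 0 (so values already below 1 in absolute value are left unchanged), and the input values -986 and 917 are illustrative rather than taken from the slides.

```python
def min_max(v, min_a, max_a, new_min=0.0, new_max=1.0):
    """Min-max normalization of v from [min_a, max_a] to [new_min, new_max]."""
    return (v - min_a) / (max_a - min_a) * (new_max - new_min) + new_min


def z_score(v, mean, std):
    """Z-score normalization: distance from the mean in standard deviations."""
    return (v - mean) / std


def decimal_scaling(values):
    """Divide every value by 10^j, with j the smallest integer >= 0
    (an assumption of this sketch) such that max(|v'|) < 1."""
    max_abs = max(abs(v) for v in values)
    j = 0
    while max_abs / 10 ** j >= 1:
        j += 1
    return [v / 10 ** j for v in values], j


# The income examples from the slide.
income_min_max = min_max(73_600, 12_000, 98_000)
income_z = z_score(73_600, 54_000, 16_000)

# Illustrative values for decimal scaling (not from the slides).
scaled, j = decimal_scaling([-986, 917])

print(round(income_min_max, 3))
print(round(income_z, 3))
print(scaled, j)
```

Min-max normalization needs max_A > min_A, and z-score normalization needs a nonzero standard deviation; the sketch does not guard against either case.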

(5)

Discretization

- Three types of attributes
  - Nominal: values from an unordered set, e.g., color, profession
  - Ordinal: values from an ordered set, e.g., military or academic rank
  - Numeric: quantitative values, e.g., integer or real numbers
- Discretization: divide the range of a continuous attribute into intervals
  - Interval labels can then be used to replace actual data values
  - Reduce data size by discretization
  - Supervised vs. unsupervised
  - Split (top-down) vs. merge (bottom-up)
  - Discretization can be performed recursively on an attribute
  - Prepare for further analysis, e.g., classification

(6)

Data Discretization Methods

- Typical methods: all the methods can be applied recursively
  - Binning
    - Top-down split, unsupervised
  - Histogram analysis
    - Top-down split, unsupervised
  - Clustering analysis (unsupervised, top-down split or bottom-up merge)
  - Decision-tree analysis (supervised, top-down split)
  - Correlation (e.g., χ²) analysis (supervised, bottom-up merge)

(7)

Simple Discretization: Binning

- Equal-width (distance) partitioning
  - Divides the range into N intervals of equal size: uniform grid
  - If A and B are the lowest and highest values of the attribute, the width of the intervals will be W = (B - A)/N
  - The most straightforward, but outliers may dominate presentation
  - Skewed data is not handled well
- Equal-depth (frequency) partitioning
  - Divides the range into N intervals, each containing approximately the same number of samples
  - Good data scaling
  - Managing categorical attributes can be tricky
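Both partitioning schemes can be sketched as follows. The function names are hypothetical, the sample is the sorted price list used in the smoothing example of this chapter, and two choices are assumptions of this sketch: the maximum value is clamped into the last equal-width bin, and leftover samples in equal-depth binning go to the first bins.

```python
def equal_width_bins(values, n):
    """Return, for each value, the index of its equal-width interval (0..n-1)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n  # W = (B - A) / N; assumes hi > lo
    # The maximum value would fall at index n, so clamp it into the last bin.
    return [min(int((v - lo) / width), n - 1) for v in values]


def equal_depth_bins(values, n):
    """Split the sorted values into n bins of (approximately) equal size."""
    ordered = sorted(values)
    size, extra = divmod(len(ordered), n)
    bins, start = [], 0
    for i in range(n):
        # The first `extra` bins take one additional sample each.
        end = start + size + (1 if i < extra else 0)
        bins.append(ordered[start:end])
        start = end
    return bins


prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
width_idx = equal_width_bins(prices, 3)
depth_bins = equal_depth_bins(prices, 3)

print(width_idx)
print(depth_bins)
```

With three bins the equal-width intervals are 10 dollars wide and hold 3, 3, and 6 prices, while the equal-depth bins hold 4 prices each.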

(8)

Binning Methods for Data Smoothing

- Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34

* Partition into equal-frequency (equi-depth) bins:

- Bin 1: 4, 8, 9, 15
- Bin 2: 21, 21, 24, 25
- Bin 3: 26, 28, 29, 34

* Smoothing by bin means:

- Bin 1: 9, 9, 9, 9
- Bin 2: 23, 23, 23, 23
- Bin 3: 29, 29, 29, 29

* Smoothing by bin boundaries:

- Bin 1: 4, 4, 4, 15
- Bin 2: 21, 21, 25, 25
- Bin 3: 26, 26, 26, 34
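The two smoothing rules can be checked with a short sketch. The function names are hypothetical; rounding the bin mean to the nearest integer matches the figures on the slide, and sending ties to the lower boundary is an assumption (no tie occurs in this data).

```python
def smooth_by_means(bins):
    """Replace every value in a bin with the bin mean, rounded to an integer."""
    return [[round(sum(b) / len(b))] * len(b) for b in bins]


def smooth_by_boundaries(bins):
    """Replace every value with the closer of the bin's minimum and maximum.
    Ties go to the minimum (an assumption of this sketch)."""
    result = []
    for b in bins:
        lo, hi = min(b), max(b)
        result.append([lo if v - lo <= hi - v else hi for v in b])
    return result


bins = [[4, 8, 9, 15], [21, 21, 24, 25], [26, 28, 29, 34]]
by_means = smooth_by_means(bins)
by_boundaries = smooth_by_boundaries(bins)

print(by_means)
print(by_boundaries)
```

The exact bin means are 9, 22.75, and 29.25, which round to the 9, 23, and 29 shown in the example.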

(9)

Discretization Without Using Class Labels (Binning vs. Clustering)

[Figure: the same data discretized three ways. Panels: Data; Equal interval width (binning); Equal frequency (binning); K-means clustering. K-means clustering leads to better results.]
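As a rough illustration of clustering-based discretization, the sketch below runs Lloyd's k-means algorithm on a single numeric attribute. The function name, the data values, and the initialization (k centers spread evenly over the sorted data) are all assumptions made for this example and are not taken from the figure.

```python
def kmeans_1d(values, k, iterations=100):
    """Lloyd's algorithm on a 1-D attribute; returns (centers, clusters)."""
    ordered = sorted(values)
    # Assumed initialization: k centers spread evenly over the sorted data
    # (requires k >= 2 and at least k values).
    centers = [ordered[i * (len(ordered) - 1) // (k - 1)] for i in range(k)]
    clusters = []
    for _ in range(iterations):
        # Assignment step: each value joins its nearest center.
        clusters = [[] for _ in range(k)]
        for v in ordered:
            nearest = min(range(k), key=lambda c: abs(v - centers[c]))
            clusters[nearest].append(v)
        # Update step: each center moves to the mean of its cluster.
        new_centers = [
            sum(c) / len(c) if c else centers[i] for i, c in enumerate(clusters)
        ]
        if new_centers == centers:
            break  # converged
        centers = new_centers
    return centers, clusters


# Hypothetical data with three natural groups.
data = [1, 2, 3, 20, 21, 22, 60, 61, 62]
centers, clusters = kmeans_1d(data, 3)

print(centers)
print(clusters)
```

Each resulting cluster can serve as one interval. On this data, equal-width binning into three intervals would put the boundary between 21 and 22, splitting a natural group that k-means keeps together.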

(10)

Discretization by Classification & Correlation Analysis

- Classification (e.g., decision tree analysis)
  - Supervised: given class labels, e.g., cancerous vs. benign
  - Using entropy to determine split point (discretization point)
  - Top-down, recursive split
  - Details to be covered in Chapter 7
- Correlation analysis (e.g., ChiMerge: χ²-based discretization)
  - Supervised: use class information
  - Bottom-up merge: find the best neighboring intervals (those having similar distributions of classes, i.e., low χ² values) to merge
  - Merge performed recursively, until a predefined stopping condition is met
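The two supervised criteria can be sketched as small helper functions. This is an illustration under stated assumptions rather than the full algorithms: `best_split` finds a single entropy-minimizing cut point (a decision tree would apply it recursively), and `chi_square` scores one pair of adjacent intervals (ChiMerge would repeatedly merge the lowest-scoring pair). All names and data values are hypothetical, and cells with zero expected count are simply skipped.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum(n / total * log2(n / total) for n in Counter(labels).values())


def best_split(values, labels):
    """Cut point that minimizes the weighted entropy of the two intervals."""
    pairs = sorted(zip(values, labels))
    best_cut, best_score = None, float("inf")
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no boundary between equal values
        left = [lab for _, lab in pairs[:i]]
        right = [lab for _, lab in pairs[i:]]
        score = (len(left) * entropy(left) + len(right) * entropy(right)) / len(pairs)
        if score < best_score:
            best_cut = (pairs[i - 1][0] + pairs[i][0]) / 2  # midpoint
            best_score = score
    return best_cut, best_score


def chi_square(row_a, row_b):
    """Pearson chi-square for two adjacent intervals, given their per-class
    counts in the same class order. Low values mean similar distributions."""
    total = sum(row_a) + sum(row_b)
    chi2 = 0.0
    for row in (row_a, row_b):
        for j, observed in enumerate(row):
            expected = sum(row) * (row_a[j] + row_b[j]) / total
            if expected > 0:  # skip empty classes (assumption of this sketch)
                chi2 += (observed - expected) ** 2 / expected
    return chi2


# Hypothetical tumor sizes with class labels.
sizes = [1, 2, 3, 4, 10, 11, 12, 13]
classes = ["benign"] * 4 + ["cancerous"] * 4
cut, score = best_split(sizes, classes)

similar = chi_square([2, 2], [4, 4])    # same class proportions
different = chi_square([4, 0], [0, 4])  # opposite class proportions

print(cut, similar, different)
```

Here the cut at 7.0 separates the two classes perfectly, and the pair of intervals with identical class proportions scores 0, making it the better candidate to merge.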

(11)

Concept Hierarchy Generation

- A concept hierarchy organizes concepts (i.e., attribute values) hierarchically and is usually associated with each dimension in a data warehouse
- Concept hierarchies facilitate drilling and rolling in data warehouses to view data at multiple granularities
- Concept hierarchy formation: recursively reduce the data by collecting and replacing low-level concepts (such as numeric values for age) with higher-level concepts (such as youth, adult, or senior)
- Concept hierarchies can be explicitly specified by domain experts and/or data warehouse designers
- Concept hierarchies can be automatically formed for both numeric and nominal data. For numeric data, use the discretization methods shown.
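Replacing numeric ages with higher-level concepts might look like the sketch below. The cut points 25 and 65 are purely illustrative assumptions; the slides do not define where youth, adult, and senior begin.

```python
def age_to_concept(age):
    """Map a numeric age to a higher-level concept.
    The cut points 25 and 65 are illustrative, not from the slides."""
    if age < 25:
        return "youth"
    if age < 65:
        return "adult"
    return "senior"


ages = [13, 22, 30, 47, 65, 81]
concepts = [age_to_concept(a) for a in ages]
print(concepts)
```

Six distinct numeric values collapse into three concept labels, which is the data reduction the formation step describes.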

(12)

Concept Hierarchy Generation for Nominal Data

- Specification of a partial/total ordering of attributes explicitly at the schema level by users or experts
  - street < city < state < country
- Specification of a hierarchy for a set of values by explicit data grouping
  - {Urbana, Champaign, Chicago} < Illinois
- Specification of only a partial set of attributes
  - E.g., only street < city, not others
- Automatic generation of hierarchies (or attribute levels) by the analysis of the number of distinct values
  - E.g., for a set of attributes: {street, city, state, country}

(13)

Automatic Concept Hierarchy Generation

- Some hierarchies can be automatically generated based on the analysis of the number of distinct values per attribute in the data set
  - The attribute with the most distinct values is placed at the lowest level of the hierarchy
  - Exceptions, e.g., weekday, month, quarter, year

Example hierarchy, from top level to bottom level:

country (15 distinct values)
province_or_state (365 distinct values)
city (3,567 distinct values)
street (674,339 distinct values)
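The distinct-value heuristic amounts to a sort. The sketch below uses a hypothetical function name and the counts from the example above.

```python
def hierarchy_by_distinct_counts(distinct_counts):
    """Order attributes from the top level (fewest distinct values)
    to the bottom level (most distinct values)."""
    return sorted(distinct_counts, key=distinct_counts.get)


counts = {
    "street": 674_339,
    "country": 15,
    "city": 3_567,
    "province_or_state": 365,
}
levels = hierarchy_by_distinct_counts(counts)

# Print in the "lower < higher" notation used for schema-level orderings.
print(" < ".join(reversed(levels)))
```

The same sort would misorder the time attributes named as exceptions: with, say, 7 weekdays, 12 months, 4 quarters, and 20 years (hypothetical counts), it would place year at the lowest level, so such hierarchies still need to be specified explicitly.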

(14)

Summary

- Data quality: accuracy, completeness, consistency, timeliness, believability, interpretability
- Data cleaning: e.g., missing/noisy values, outliers
- Data integration from multiple sources:
  - Entity identification problem
  - Remove redundancies
  - Detect inconsistencies
- Data reduction
  - Dimensionality reduction
  - Numerosity reduction
  - Data compression
- Data transformation and data discretization
  - Normalization
  - Concept hierarchy generation
