Data Mining:
Concepts and Techniques
(3
rded.)
— Chapter 3 —
Jiawei Han, Micheline Kamber, and Jian Pei University of Illinois at Urbana-Champaign &
Simon Fraser University
©2011 Han, Kamber & Pei. All rights reserved.
Chapter 3: Data Preprocessing
Data Preprocessing: An Overview
Data Quality
Major Tasks in Data Preprocessing
Data Cleaning
Data Integration
Data Reduction
Data Transformation and Data Discretization
Summary
Data Transformation
A function that maps the entire set of values of a given attribute to a new set of replacement values s.t. each old value can be identified with one of the new values
Methods
Smoothing: Remove noise from data
Attribute/feature construction
New attributes constructed from the given ones
Aggregation: Summarization, data cube construction
Normalization: Scaled to fall within a smaller, specified range
min-max normalization
z-score normalization
normalization by decimal scaling
Discretization: Concept hierarchy climbing
Normalization
Min-max normalization: to [new_minA, new_maxA]
Ex. Let income range $12,000 to $98,000 normalized to [0.0, 1.0]. Then $73,000 is mapped to
Z-score normalization (μ: mean, σ: standard deviation):
Ex. Let μ = 54,000, σ = 16,000. Then
Normalization by decimal scaling
716 . 0 0 ) 0 0 . 1 000( , 12 000 , 98
000 , 12 600 ,
73
A A
A A
A
A
new max new min new min
min max
min
v ' v ( _ _ ) _
A
v A
v
'
j
v v
' 10
Wherej
is the smallest integer such that Max(|ν’|) < 1225 . 000 1
, 16
000 , 54 600 ,
73
Discretization
Three types of attributes
Nominal—values from an unordered set, e.g., color, profession
Ordinal—values from an ordered set, e.g., military or academic rank
Numeric—real numbers, e.g., integer or real numbers
Discretization: Divide the range of a continuous attribute into intervals
Interval labels can then be used to replace actual data values
Reduce data size by discretization
Supervised vs. unsupervised
Split (top-down) vs. merge (bottom-up)
Discretization can be performed recursively on an attribute
Prepare for further analysis, e.g., classification
Data Discretization Methods
Typical methods: All the methods can be applied recursively
Binning
Top-down split, unsupervised
Histogram analysis
Top-down split, unsupervised
Clustering analysis (unsupervised, top-down split or bottom-up merge)
Decision-tree analysis (supervised, top-down split)
Correlation (e.g.,
2) analysis (unsupervised, bottom-up
merge)
Simple Discretization: Binning
Equal-width (distance) partitioning
Divides the range into N intervals of equal size: uniform grid
if A and B are the lowest and highest values of the attribute, the width of intervals will be: W = (B –A)/N.
The most straightforward, but outliers may dominate presentation
Skewed data is not handled well
Equal-depth (frequency) partitioning
Divides the range into N intervals, each containing approximately same number of samples
Good data scaling
Managing categorical attributes can be tricky
Binning Methods for Data Smoothing
Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
* Partition into equal-frequency (equi-depth) bins:
- Bin 1: 4, 8, 9, 15 - Bin 2: 21, 21, 24, 25 - Bin 3: 26, 28, 29, 34
* Smoothing by bin means:
- Bin 1: 9, 9, 9, 9
- Bin 2: 23, 23, 23, 23 - Bin 3: 29, 29, 29, 29
* Smoothing by bin boundaries:
- Bin 1: 4, 4, 4, 15 - Bin 2: 21, 21, 25, 25 - Bin 3: 26, 26, 26, 34
Discretization Without Using Class Labels (Binning vs. Clustering)
Data Equal interval width (binning)
Equal frequency (binning) K-means clustering leads to better results
Discretization by Classification &
Correlation Analysis
Classification (e.g., decision tree analysis)
Supervised: Given class labels, e.g., cancerous vs. benign
Using entropy to determine split point (discretization point)
Top-down, recursive split
Details to be covered in Chapter 7
Correlation analysis (e.g., Chi-merge: χ2-based discretization)
Supervised: use class information
Bottom-up merge: find the best neighboring intervals (those having similar distributions of classes, i.e., low χ2 values) to merge
Merge performed recursively, until a predefined stopping condition
Concept Hierarchy Generation
Concept hierarchy organizes concepts (i.e., attribute values)
hierarchically and is usually associated with each dimension in a data warehouse
Concept hierarchies facilitate drilling and rolling in data warehouses to view data in multiple granularity
Concept hierarchy formation: Recursively reduce the data by collecting and replacing low level concepts (such as numeric values for age) by higher level concepts (such as youth, adult, or senior)
Concept hierarchies can be explicitly specified by domain experts and/or data warehouse designers
Concept hierarchy can be automatically formed for both numeric and nominal data. For numeric data, use discretization methods shown.
Concept Hierarchy Generation for Nominal Data
Specification of a partial/total ordering of attributes explicitly at the schema level by users or experts
street < city < state < country
Specification of a hierarchy for a set of values by explicit data grouping
{Urbana, Champaign, Chicago} < Illinois
Specification of only a partial set of attributes
E.g., only street < city , not others
Automatic generation of hierarchies (or attribute levels) by the analysis of the number of distinct values
E.g., for a set of attributes: { street, city, state, country }
Automatic Concept Hierarchy Generation
Some hierarchies can be automatically generated based on the analysis of the number of distinct values per attribute in the data set
The attribute with the most distinct values is placed at the lowest level of the hierarchy
Exceptions, e.g., weekday, month, quarter, year
country
province_or_ state city
street
15 distinct values
365 distinct values
3567 distinct values
674,339 distinct values
Summary
Data quality: accuracy, completeness, consistency, timeliness, believability, interpretability
Data cleaning: e.g. missing/noisy values, outliers
Data integration from multiple sources:
Entity identification problem
Remove redundancies
Detect inconsistencies
Data reduction
Dimensionality reduction
Numerosity reduction
Data compression
Data transformation and data discretization
Normalization
Concept hierarchy generation
References
D. P. Ballou and G. K. Tayi. Enhancing data quality in data warehouse environments. Comm. of ACM, 42:73-78, 1999
A. Bruce, D. Donoho, and H.-Y. Gao. Wavelet analysis. IEEE Spectrum, Oct 1996
T. Dasu and T. Johnson. Exploratory Data Mining and Data Cleaning. John Wiley, 2003
J. Devore and R. Peck. Statistics: The Exploration and Analysis of Data. Duxbury Press, 1997.
H. Galhardas, D. Florescu, D. Shasha, E. Simon, and C.-A. Saita. Declarative data cleaning:
Language, model, and algorithms. VLDB'01
M. Hua and J. Pei. Cleaning disguised missing data: A heuristic approach. KDD'07
H. V. Jagadish, et al., Special Issue on Data Reduction Techniques. Bulletin of the Technical Committee on Data Engineering, 20(4), Dec. 1997
H. Liu and H. Motoda (eds.). Feature Extraction, Construction, and Selection: A Data Mining Perspective. Kluwer Academic, 1998
J. E. Olson. Data Quality: The Accuracy Dimension. Morgan Kaufmann, 2003
D. Pyle. Data Preparation for Data Mining. Morgan Kaufmann, 1999
V. Raman and J. Hellerstein. Potters Wheel: An Interactive Framework for Data Cleaning and Transformation, VLDB’2001
T. Redman. Data Quality: The Field Guide. Digital Press (Elsevier), 2001
R. Wang, V. Storey, and C. Firth. A framework for analysis of data quality research. IEEE Trans.
Knowledge and Data Engineering, 7:623-640, 1995