Data Preprocessing
Lecture 3/DMBI/IKI83403T/MTI/UI
Yudho Giri Sucahyo, Ph.D, CISA (yudho@cs.ui.ac.id)
Faculty of Computer Science, University of Indonesia
Objectives

- Motivation: Why preprocess the Data?
- Data Preprocessing Techniques
- Data Cleaning
- Data Integration and Transformation
- Data Reduction
Why Preprocess the Data?

- Quality decisions must be based on quality data.
- Data could be incomplete, noisy, and inconsistent.
- A data warehouse needs consistent integration of quality data.
- Incomplete:
  - Lacking attribute values, or certain attributes of interest
  - Containing only aggregate data
  - Causes:
    - Not considered important at the time of entry
    - Equipment malfunctions
    - Data not entered due to misunderstanding
    - Inconsistent with other recorded data and thus deleted
Why Preprocess the Data? (2)

- Noisy (having incorrect attribute values):
  - Containing errors, or outlier values that deviate from the expected
  - Causes:
    - Data collection instruments used may be faulty
    - Human or computer errors occurring at data entry
    - Errors in data transmission
- Inconsistent:
  - Containing discrepancies in the department codes used to categorize items
Why Preprocess the Data? (3)

- "Clean" the data by filling in missing values, smoothing noisy data, identifying or removing outliers, and resolving inconsistencies.
- Some examples of inconsistencies:
  - customer_id vs cust_id
  - Bill vs William vs B.
- Some attributes may be inferred from others. Data cleaning includes detection and removal of redundancies that may have resulted.
Data Preprocessing Techniques

- Data Cleaning
  - To remove noise and correct inconsistencies in the data
- Data Integration
  - Merges data from multiple sources into a coherent data store, such as a data warehouse or a data cube
- Data Transformation
  - Normalization (to improve the accuracy and efficiency of mining algorithms involving distance measurements, e.g., neural networks, nearest-neighbor)
- Data Discretization
- Data Reduction
Data Preprocessing Techniques (2)

- Data Reduction
  - A warehouse may store terabytes of data
  - Complex data analysis/mining may take a very long time to run on the complete data set
  - Obtains a reduced representation of the data set that is much smaller in volume, yet produces the same (or almost the same) analytical results
- Strategies for Data Reduction
  - Data aggregation (e.g., building a data cube)
  - Dimension reduction (e.g., removing irrelevant attributes through correlation analysis)
  - Data compression (e.g., using encoding schemes such as minimum length encoding or wavelets)
  - Numerosity reduction
  - Generalization
Data Cleaning – Missing Values

1. Ignore the tuple
   - Usually done when the class label is missing → classification
   - Not effective when the missing values in attributes spread in different tuples
2. Fill in the missing value manually: tedious + infeasible?
3. Use a global constant to fill in the missing value
   - 'unknown', a new class?
   - The mining program may mistakenly think that they form an interesting concept, since they all have a value in common → not recommended
4. Use the attribute mean to fill in the missing value → e.g., avg income
Data Cleaning – Missing Values (2)

5. Use the attribute mean for all samples belonging to the same class as the given tuple → e.g., the same credit risk category
6. Use the most probable value to fill in the missing value
   - Determined with regression, inference-based tools such as Bayesian formalism, or decision tree induction

Methods 3 to 6 bias the data: the filled-in value may not be correct. However, method 6 is a popular strategy, since:
- It uses the most information from the present data to predict missing values
- There is a greater chance that the relationships between income and the other attributes are preserved.
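A minimal pandas sketch of methods 4 and 5, assuming a hypothetical table with income and credit_risk columns (both names are illustrative, not from the slides):

```python
import pandas as pd

# Hypothetical customer table with missing income values.
df = pd.DataFrame({
    "credit_risk": ["low", "low", "high", "high", "low"],
    "income": [48000.0, None, 23000.0, None, 52000.0],
})

# Method 4: fill with the overall attribute mean (avg income).
df["income_mean"] = df["income"].fillna(df["income"].mean())

# Method 5: fill with the attribute mean of samples in the same
# class (here, the same credit_risk category).
df["income_class"] = df["income"].fillna(
    df.groupby("credit_risk")["income"].transform("mean")
)

print(df)
```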
Data Cleaning – Noise and Incorrect (Inconsistent) Data

- Noise is a random error or variance in a measured variable.
- How can we smooth out the data to remove the noise?
- Binning Method
  - Smooth a sorted data value by consulting its "neighborhood", that is, the values around it.
  - The sorted values are distributed into a number of buckets, or bins.
  - Because binning methods consult the neighborhood of values, they perform local smoothing.
  - Binning is also used as a discretization technique (will be discussed later).
Data Cleaning – Noisy Data: Binning Methods

* Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 26, 25, 28, 29, 34
* Partition into (equidepth) bins of depth 4, each bin containing four values:
  - Bin 1: 4, 8, 9, 15
  - Bin 2: 21, 21, 24, 26
  - Bin 3: 25, 28, 29, 34
* Smoothing by bin means:
  - Bin 1: 9, 9, 9, 9
  - Bin 2: 23, 23, 23, 23
  - Bin 3: 29, 29, 29, 29
* Smoothing by bin boundaries (→ the larger the width, the greater the effect):
  - Bin 1: 4, 4, 4, 15
  - Bin 2: 21, 21, 26, 26
  - Bin 3: 25, 25, 25, 34
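A short Python sketch that reproduces the smoothing results above; the bins are hardcoded exactly as listed on the slide:

```python
# Equidepth bins of depth 4, taken as listed above.
bins = [[4, 8, 9, 15], [21, 21, 24, 26], [25, 28, 29, 34]]

# Smoothing by bin means: each value is replaced by its bin's mean
# (integer division is exact here, since every bin mean is whole).
by_means = [[sum(b) // len(b)] * len(b) for b in bins]

# Smoothing by bin boundaries: each value snaps to whichever of the
# bin's min and max is closer.
by_bounds = [[min(b) if v - min(b) <= max(b) - v else max(b) for v in b]
             for b in bins]

print(by_means)   # [[9, 9, 9, 9], [23, 23, 23, 23], [29, 29, 29, 29]]
print(by_bounds)  # [[4, 4, 4, 15], [21, 21, 26, 26], [25, 25, 25, 34]]
```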
Data Cleaning – Noisy Data: Clustering

- Similar values are organized into groups, or clusters.
- Values that fall outside of the set of clusters may be considered outliers.
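One possible sketch of this idea using scikit-learn's KMeans; the two-cluster setup and the two-standard-deviation threshold are illustrative assumptions, not from the slides:

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical 1-D measurements; 80 sits far from both clusters.
values = np.array([[4.0], [5.0], [6.0], [24.0], [25.0], [26.0], [80.0]])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(values)
dist = km.transform(values).min(axis=1)  # distance to nearest centroid

# Treat values unusually far from every cluster as outlier candidates.
outliers = values[dist > dist.mean() + 2 * dist.std()].ravel()
print(outliers)  # expected to flag 80.0
```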
Data Cleaning – Noisy Data: Regression

- Data can be smoothed by fitting the data to a function, such as with regression.
- Linear regression involves finding the best line to fit two variables, so that one variable can be used to predict the other.
- Multiple linear regression → more than 2 variables, data fit to a multidimensional surface.

[Figure: data points smoothed by fitting the line y = x + 1, predicting Y1' from X1]
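A minimal numpy sketch of smoothing by linear regression, using hypothetical noisy observations of a line close to y = x + 1:

```python
import numpy as np

# Hypothetical noisy observations of y ≈ x + 1.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([1.3, 1.8, 3.4, 3.7, 5.2, 5.9])

# Fit the best line y = a*x + b by least squares.
a, b = np.polyfit(x, y, deg=1)

# Smooth: replace each observed y with the line's prediction.
y_smoothed = a * x + b
print(round(a, 2), round(b, 2))  # slope and intercept both near 1
```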
Data Smoothing vs Data Reduction

- Many methods for data smoothing are also methods for data reduction involving discretization.
- Examples:
  - Binning techniques reduce the number of distinct values per attribute. Useful for decision tree induction, which repeatedly makes value comparisons on sorted data.
  - Concept hierarchies are also a form of data discretization that can also be used for data smoothing.
    - Mapping real price into inexpensive, moderately_priced, expensive
    - Reduces the number of data values to be handled by the mining process.
Data Cleaning – Inconsistent Data

- May be corrected manually.
- Errors made at data entry may be corrected by performing a paper trace, coupled with routines designed to help correct the inconsistent use of codes.
Data Integration and Transformation

- Data Integration: combines data from multiple data stores
- Schema integration
  - Integrate metadata from different sources
  - Entity identification problem: identify real world entities from multiple data sources, e.g., A.cust-id ≡ B.cust-#
- Detecting and resolving data value conflicts
  - For the same real world entity, attribute values from different sources are different
  - Possible reasons: different representations, different scales (feet vs metre)
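A small pandas sketch of both issues: mapping one store's key name onto the other's, and reconciling a feet-vs-metre scale conflict (all table and column names are hypothetical):

```python
import pandas as pd

# Store A records heights in metres; store B in feet.
a = pd.DataFrame({"cust_id": [1, 2], "height_m": [1.80, 1.65]})
b = pd.DataFrame({"cust-#": [1, 2], "height_ft": [5.91, 5.41]})

# Schema integration: resolve cust_id vs cust-# to one key name.
b = b.rename(columns={"cust-#": "cust_id"})

# Data value conflict: convert B's feet to metres before merging.
b["height_m_b"] = b["height_ft"] * 0.3048

merged = a.merge(b[["cust_id", "height_m_b"]], on="cust_id")
print(merged)
```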
Data Transformation

- Data are transformed into forms appropriate for mining.
- Methods:
  - Smoothing: binning, clustering, and regression
  - Aggregation: summarization, data cube construction
  - Generalization: low-level or raw data are replaced by higher-level concepts through the use of concept hierarchies
    - Street → city or country
    - Numeric attributes of age → young, middle-aged, senior
  - Normalization: attribute data are scaled so as to fall within a small specified range, such as 0.0 to 1.0
    - Useful for classification involving neural networks, or distance measurements such as nearest neighbor classification and clustering
Data Transformation (2)

- Normalization: values are scaled to fall within a small, specified range
  - Min-max normalization:
    v' = (v - min_A) / (max_A - min_A) * (new_max_A - new_min_A) + new_min_A
  - Z-score normalization:
    v' = (v - mean_A) / stddev_A
  - Normalization by decimal scaling:
    v' = v / 10^j, where j is the smallest integer such that max(|v'|) < 1
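The three normalizations above, as a numpy sketch over a hypothetical income-like attribute:

```python
import numpy as np

v = np.array([200.0, 300.0, 400.0, 600.0, 1000.0])

# Min-max normalization to the new range [0.0, 1.0].
new_min, new_max = 0.0, 1.0
minmax = (v - v.min()) / (v.max() - v.min()) * (new_max - new_min) + new_min

# Z-score normalization: center on the mean, scale by the std deviation.
zscore = (v - v.mean()) / v.std()

# Decimal scaling: divide by 10^j, the smallest j with max(|v'|) < 1.
j = int(np.floor(np.log10(np.abs(v).max()))) + 1
decimal = v / 10 ** j

print(minmax, zscore, decimal, sep="\n")
```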
Data Reduction – Data Cube Aggregation

- Data consist of sales per quarter, for several years. A user interested in the annual sales (total per year) → the data can be aggregated so that the resulting data summarize the total sales per year instead of per quarter.
- The resulting data set is smaller in volume, without loss of information necessary for the analysis task.
- See Figure 3.4 [JH].
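As a small pandas sketch (hypothetical figures), the same roll-up from quarterly to annual totals:

```python
import pandas as pd

# Hypothetical quarterly sales for one branch.
sales = pd.DataFrame({
    "year":    [2008] * 4 + [2009] * 4,
    "quarter": ["Q1", "Q2", "Q3", "Q4"] * 2,
    "amount":  [224, 408, 350, 586, 330, 416, 380, 594],
})

# Aggregate to annual totals: fewer tuples, and no information is
# lost for an analysis that only needs sales per year.
annual = sales.groupby("year", as_index=False)["amount"].sum()
print(annual)
```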
Dimensionality Reduction

- Datasets for analysis may contain hundreds of attributes, many of which may be irrelevant to the mining task, or redundant.
- Leaving out relevant attributes or keeping irrelevant attributes can confuse the mining algorithm and lead to poor quality of discovered patterns.
- The added volume of irrelevant or redundant attributes can slow down the mining process.
- Dimensionality reduction reduces the data set size by removing such attributes from it.
Dimensionality Reduction (2)

- The goal of attribute subset selection (also known as feature selection) is to find a minimum set of attributes such that the resulting probability distribution of the data classes is as close as possible to the original distribution obtained using all attributes.
- For d attributes, there are 2^d possible subsets.
- The best (and worst) attributes are typically determined using tests of statistical significance. Attribute evaluation measures such as information gain can be used.
- Heuristic methods (a sketch of forward selection follows below):
  - Stepwise forward selection
  - Stepwise backward selection (or a combination of both)
  - Decision tree induction
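A sketch of stepwise forward selection using cross-validated accuracy as the evaluation measure, on scikit-learn's bundled iris data; the measure is an illustrative choice, and the slides' information gain could be substituted:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Greedily add, at each step, the attribute that most improves
# cross-validated accuracy; stop when no attribute helps.
selected, remaining, best_score = [], list(range(X.shape[1])), 0.0
while remaining:
    scores = {
        f: cross_val_score(DecisionTreeClassifier(random_state=0),
                           X[:, selected + [f]], y, cv=5).mean()
        for f in remaining
    }
    f, score = max(scores.items(), key=lambda kv: kv[1])
    if score <= best_score:
        break
    selected.append(f)
    remaining.remove(f)
    best_score = score

print(selected, round(best_score, 3))
```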
Dimensionality Reduction (3): Example of Decision Tree Induction

Initial attribute set: {A1, A2, A3, A4, A5, A6}

  A4?
  ├─ A1? → Class 1 | Class 2
  └─ A6? → Class 1 | Class 2

→ Reduced attribute set: {A1, A4, A6}
Data Compression

- Data encoding or transformations are applied so as to obtain a reduced or compressed representation of the original data.
- Lossless data compression technique: the original data can be reconstructed from the compressed data without any loss of information.
- Lossy data compression technique: we can reconstruct only an approximation of the original data.
- Two popular and effective methods of lossy data compression: wavelet transforms and principal components analysis.
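A brief scikit-learn sketch of lossy compression with principal components analysis on synthetic data; the dimensions and component count are arbitrary choices for illustration:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
X[:, 3] = X[:, 0] + 0.01 * rng.normal(size=100)  # near-redundant attribute

pca = PCA(n_components=2).fit(X)
compressed = pca.transform(X)               # 100 x 2 instead of 100 x 5
approx = pca.inverse_transform(compressed)  # lossy reconstruction

print(np.abs(X - approx).mean())  # small but nonzero: an approximation
```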
Data Compression (2)

[Figure: lossless compression maps the original data to compressed data and back exactly; lossy compression recovers only an approximation of the original data]
Numerosity Reduction

- Parametric methods:
  - Assume the data fits some model, estimate model parameters, store only the parameters, and discard the data (except possible outliers).
  - Log-linear models: obtain the value at a point in m-D space as the product on appropriate marginal subspaces (see Slide 14).
- Non-parametric methods:
  - Do not assume models
  - Three major families:
    - Clustering (see Slide 13)
    - Histograms
    - Sampling
Numerosity Reduction – Histograms

- A popular data reduction technique
- Divide data into buckets and store the average (sum) for each bucket
- Partitioning rules:
  - Equiwidth
  - Equidepth

[Figure: equiwidth histogram of prices, with buckets spanning 10,000 to 90,000]
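A numpy sketch contrasting the two partitioning rules on hypothetical prices:

```python
import numpy as np

prices = np.array([12000, 18000, 25000, 31000, 33000, 47000,
                   52000, 58000, 64000, 71000, 85000, 88000])

# Equiwidth: four buckets of equal value range.
counts, edges = np.histogram(prices, bins=4, range=(10000, 90000))

# Equidepth: bucket boundaries at quantiles, so each bucket holds
# roughly the same number of values.
eq_edges = np.quantile(prices, [0.0, 0.25, 0.5, 0.75, 1.0])

print(counts, edges)
print(eq_edges)
```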
Numerosity Reduction – Sampling

- Allows a large data set to be represented by a much smaller random sample (or subset) of the data.
- Choose a representative subset of the data
  - Simple random sampling may have very poor performance in the presence of skew
  - Develop adaptive sampling methods
- Stratified sampling:
  - Approximate the percentage of each class (or subpopulation of interest) in the overall database
  - Used in conjunction with skewed data
- Simple random sample without replacement (SRSWOR)
- Simple random sample with replacement (SRSWR)
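A pandas sketch of SRSWOR, SRSWR, and stratified sampling on a skewed, hypothetical two-class table:

```python
import pandas as pd

df = pd.DataFrame({"cls": ["A"] * 90 + ["B"] * 10,
                   "value": range(100)})

# SRSWOR: simple random sample without replacement.
srswor = df.sample(n=20, replace=False, random_state=0)

# SRSWR: simple random sample with replacement.
srswr = df.sample(n=20, replace=True, random_state=0)

# Stratified: sample within each class so the skewed 90/10 class
# proportions are preserved in the sample.
stratified = df.groupby("cls", group_keys=False).apply(
    lambda g: g.sample(frac=0.2, random_state=0))

print(stratified["cls"].value_counts())  # A: 18, B: 2
```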
Numerosity Reduction – Sampling (2)

[Figure: raw data vs a cluster/stratified sample]
Numerosity Reduction – Sampling (3)

[Figure: simple random samples drawn from the raw data]
Discretization and Concept Hierarchy

- Discretization can be used to reduce the number of values for a given continuous attribute, by dividing the range of the attribute into intervals. Interval labels can then be used to replace actual data values.
- Concept hierarchies can be used to reduce the data by collecting and replacing low-level concepts (such as numeric values for the attribute age) with higher-level concepts (such as young, middle-aged, or senior).
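A one-step pandas sketch of both ideas for age; the interval boundaries are illustrative assumptions:

```python
import pandas as pd

ages = pd.Series([13, 22, 25, 33, 41, 47, 52, 61, 70])

# Discretize the continuous attribute into intervals and replace
# the values with interval labels, a simple concept hierarchy.
labels = pd.cut(ages, bins=[0, 30, 55, 120],
                labels=["young", "middle_aged", "senior"])
print(labels.tolist())
```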
Discretization and concept hierarchy generation for numeric data

- Binning
- Histogram analysis
- Clustering analysis
- Entropy-based discretization (a sketch follows below)
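A minimal sketch of entropy-based discretization: try every candidate cut point and keep the one that minimizes the weighted class entropy (data and labels are hypothetical):

```python
import numpy as np

def entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def best_split(values, labels):
    """Return the cut point with minimum weighted class entropy."""
    order = np.argsort(values)
    v, y = np.asarray(values)[order], np.asarray(labels)[order]
    best_cut, best_info = None, np.inf
    for i in range(1, len(v)):
        if v[i] == v[i - 1]:
            continue  # identical values cannot be separated
        cut = (v[i] + v[i - 1]) / 2
        w = i / len(v)
        info = w * entropy(y[:i]) + (1 - w) * entropy(y[i:])
        if info < best_info:
            best_cut, best_info = cut, info
    return best_cut, best_info

values = [4, 8, 15, 21, 24, 25, 28, 34]
labels = ["lo", "lo", "lo", "hi", "hi", "hi", "hi", "hi"]
print(best_split(values, labels))  # cut at 18.0 with entropy 0.0
```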
Example of 3-4-5 rule

- Step 1: for the attribute profit, Min = -$351, Low (i.e., the 5th percentile) = -$159, High (i.e., the 95th percentile) = $1,838, Max = $4,700.
- Step 2: msd = 1,000, so rounding gives Low' = -$1,000 and High' = $2,000.

[Figure: the resulting interval hierarchy for profit]
Concept hierarchy generation for categorical data

- Categorical data are discrete data: they have a finite number of distinct values, with no ordering among the values. Ex: location, job category.
- Specification of a set of attributes: a concept hierarchy can be automatically generated based on the number of distinct values per attribute in the given attribute set. The attribute with the most distinct values is placed at the lowest level of the hierarchy.

  country                15 distinct values
  province_or_state      65 distinct values
  city                3,567 distinct values
  street            674,339 distinct values
Conclusion

- Data preparation is a big issue for both warehousing and mining.
- Data preparation includes:
  - Data cleaning
  - Data integration and data transformation
  - Data reduction and feature selection
  - Discretization
- A lot of methods have been developed, but this is still an active area of research.
References

[JH] Jiawei Han and Micheline Kamber, Data Mining: Concepts and Techniques, Morgan Kaufmann, 2001.