Ericks
WHY DO WE NEED TO PREPROCESS THE DATA?
Fields that are obsolete or redundant
Missing values
Outliers
Data in a form not suitable for data mining models
Values not consistent with policy or
DATA CLEANING
Customer ID  Zip     Gender  Income    Age  Marital Status  Transaction Amount
1001         10048   M       75000     C    M               5000
1002         J2S7K7  F       −40000    40   W               4000
1003         90210           10000000  45   S               7000
1004         6269    M       50000     0    S               1000
1005         55101   M       99999     30   D               3000
DATA CLEANING - ZIP
Hyacinthe, Quebec, Canada). Customer 1004? (The zip code is probably 06269, which refers to Storrs, Connecticut, home of the University of Connecticut. The leading zero was most likely dropped because the field was stored as a number.)
DATA CLEANING - GENDER
Customer ID Gender
1001 M
1002 F
1003
1004 M
1005 M
DATA CLEANING - INCOME
entirely possible, especially when considering the customer’s zip code (90210, Beverly Hills), this value of income is nevertheless an outlier, an extreme data value.
Customer 1002’s reported income of −$40,000 lies beyond the field bounds for income and therefore must be an error.
Customer 1005’s income of $99,999? Perhaps nothing; it may in fact be valid. But if all the other incomes are rounded to the nearest $5000, why the precision with customer 1005? Often, in legacy databases, certain specified values are meant to be codes for anomalous entries, such as missing values. Perhaps
DATA CLEANING - AGE
Customer 1001’s “age” of C probably reflects an earlier categorization of this man’s age into a bin labeled C. Data mining software will not accept this categorical value in an otherwise numerical field, so the problem has to be resolved somehow.
How about customer 1004’s age of 0? Perhaps there is a newborn male living in Storrs, Connecticut, who has made a transaction of $1000. More likely, the age is missing and was coded as 0.
HANDLING MISSING DATA
Replace the missing value with some constant, specified by the analyst.
Replace the missing value with the field mean (for numerical variables) or the mode (for categorical variables).
Replace the missing values with a value
MEAN (AVERAGE)
$$\text{Mean} = \frac{\sum X_i}{N}$$
X_i = the data values
N = the number of data values
REPLACE MISSING VALUE
Customer 1001: the value C is replaced with the mean of Age.
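As a rough illustration of this replacement, the sketch below imputes the Age field of the customer table with the field mean. The function name `impute_with_mean` is hypothetical, and the sketch assumes that customer 1004's age of 0 is kept as a valid value. If the analyst treats that 0 as missing too, the mean changes.

```python
from statistics import mean

# Age column from the customer table; "C" is the non-numeric entry for customer 1001.
ages = {1001: "C", 1002: 40, 1003: 45, 1004: 0, 1005: 30}

def impute_with_mean(column):
    """Replace every non-numeric entry with the mean of the numeric entries."""
    numeric = [v for v in column.values() if isinstance(v, (int, float))]
    field_mean = mean(numeric)  # (40 + 45 + 0 + 30) / 4
    return {
        key: (value if isinstance(value, (int, float)) else field_mean)
        for key, value in column.items()
    }

imputed_ages = impute_with_mean(ages)
print(imputed_ages[1001])  # 28.75
```

Only the non-numeric entry is touched. The four numeric ages pass through unchanged.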
MODE
The value that occurs most frequently in a data set.
Data = 1, 2, 1, 4, 3, 1, 5, 3, 1, 2
Mode = 1 (it occurs four times)
REPLACE MISSING VALUE WITH MODE
Customer ID Gender
1001 M
1002 F
1003
1004 M
1005 M
Customer 1003: the Gender field is filled with the mode, M.
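The same idea can be sketched for a categorical field. The example below fills the empty Gender entry with the most frequent observed value. The function name `impute_with_mode` is hypothetical, and the missing entry is assumed to be represented as `None`.

```python
from collections import Counter

# Gender column from the table above; customer 1003 has no value.
genders = {1001: "M", 1002: "F", 1003: None, 1004: "M", 1005: "M"}

def impute_with_mode(column):
    """Replace every None entry with the most frequent observed value."""
    observed = [v for v in column.values() if v is not None]
    field_mode, _count = Counter(observed).most_common(1)[0]
    return {key: (field_mode if value is None else value)
            for key, value in column.items()}

imputed_genders = impute_with_mode(genders)
print(imputed_genders[1003])  # M

# The numeric example from the MODE slide works the same way.
data = [1, 2, 1, 4, 3, 1, 5, 3, 1, 2]
data_mode = Counter(data).most_common(1)[0][0]
print(data_mode)  # 1
```

When two values tie for the highest count, `most_common` returns the one encountered first, so the analyst may want to handle ties explicitly.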
IDENTIFYING MISCLASSIFICATIONS
Notice Anything Strange about This Frequency Distribution?
USA and France, have a count of only one automobile each. What is clearly happening here is that two of the records have been classified inconsistently with respect to the origin of manufacture.
To maintain consistency with the remainder of the data set, the record with origin USA should have been labeled US, and the record with origin France
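One way to spot this kind of inconsistency is to count how often each label occurs and flag the labels that appear only once or twice. The sketch below is only an illustration: the label counts are invented, and the names `rare_labels` and `relabel` are hypothetical. Only the USA to US correction stated above is applied.

```python
from collections import Counter

# Hypothetical origin labels; the counts are invented for illustration.
origins = ["US"] * 6 + ["Europe"] * 3 + ["Japan"] * 3 + ["USA", "France"]

def rare_labels(values, max_count=1):
    """Return the labels that occur at most max_count times, sorted by name."""
    counts = Counter(values)
    return sorted(label for label, n in counts.items() if n <= max_count)

def relabel(values, mapping):
    """Replace labels according to mapping, leaving all other labels as-is."""
    return [mapping.get(value, value) for value in values]

suspects = rare_labels(origins)
print(suspects)  # ['France', 'USA']

# Apply the correction named in the text: USA should have been labeled US.
cleaned = relabel(origins, {"USA": "US"})
print(Counter(cleaned)["US"])  # 7
```

A rare label is only a candidate for misclassification. Whether it is an error or a legitimately rare category still has to be decided by the analyst.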
DATA TRANSFORMATION
Data miners should normalize their numerical variables, to standardize the scale of effect each variable has on the results.
For example, if we are interested in major league baseball, players’ batting averages will range from zero to less than 0.400, while the number of home runs hit in a season will range from zero to around 70.
[Figure: histogram of time-to-60, with summary statistics. Min = 8, Max = 25]
Min–Max Normalization
$$X^* = \frac{X - \min(X)}{\text{range}(X)} = \frac{X - \min(X)}{\max(X) - \min(X)}$$
For a “drag-racing-ready” vehicle, which takes only 8 seconds (the field minimum) to reach 60 mph, the min–max normalization is
$$X^* = \frac{8 - 8}{25 - 8} = 0$$
From this we can learn that data values which represent the minimum for the variable will have a min–max normalization value of zero.
For an “average” vehicle (if any), which takes exactly 15.548 seconds (the variable average) to reach 60 mph, the min–max normalization is
$$X^* = \frac{15.548 - 8}{25 - 8} = 0.444$$
This tells us that we may expect variable values near the center of the distribution to have a min–max normalization value near 0.5.
For an “I’ll get there when I’m ready” vehicle, which takes 25 seconds (the variable maximum) to reach 60 mph, the min–max normalization is
$$X^* = \frac{25 - 8}{25 - 8} = 1.0$$
That is, data values which represent the field maximum will have a min–max normalization value of one.
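The three worked values can be reproduced with a few lines of Python. This is a minimal sketch: the function name `min_max_normalize` is hypothetical, and the field minimum (8) and maximum (25) are passed in explicitly rather than computed from the full data set, which is not reproduced here.

```python
def min_max_normalize(x, x_min, x_max):
    """Return (x - min) / (max - min), which lies in [0, 1] for x within the field range."""
    if x_max == x_min:
        raise ValueError("min and max are equal; the range is zero")
    return (x - x_min) / (x_max - x_min)

# Time-to-60 examples from the text: field minimum 8, field maximum 25.
normalized = {seconds: round(min_max_normalize(seconds, 8, 25), 3)
              for seconds in (8, 15.548, 25)}
print(normalized)  # {8: 0.0, 15.548: 0.444, 25: 1.0}
```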
Z-Score Standardization
$$X^* = \frac{X - \text{mean}(X)}{\text{SD}(X)}$$
For the vehicle that takes only 8 seconds to reach 60 mph, the Z-score standardization is
$$X^* = \frac{8 - 15.548}{2.911} = -2.593$$
Thus, data values that lie below the mean will have a negative Z-score standardization.
For an “average” vehicle (if any), which takes exactly 15.548 seconds (the variable average) to reach 60 mph, the Z-score standardization is
$$X^* = \frac{15.548 - 15.548}{2.911} = 0$$
This tells us that variable values falling exactly on the mean will have a Z-score standardization of zero.
For the car that takes 25 seconds to reach 60 mph, the Z-score standardization is
$$X^* = \frac{25 - 15.548}{2.911} = 3.247$$
That is, data values that lie above the mean will have a positive Z-score standardization.
Summary statistics after Z-score standardization: Min = −2.593, Max = 3.247, Mean = 0.00, SD = 1.0, N = 261
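The Z-score examples can be checked the same way. In this sketch the function name `z_score` is hypothetical, and the mean (15.548) and standard deviation (2.911) quoted in the text are passed in directly instead of being computed from the 261 records.

```python
def z_score(x, field_mean, field_sd):
    """Return (x - mean) / SD: negative below the mean, zero at the mean, positive above."""
    if field_sd <= 0:
        raise ValueError("standard deviation must be positive")
    return (x - field_mean) / field_sd

# Time-to-60 examples from the text: mean 15.548, SD 2.911.
standardized = {seconds: round(z_score(seconds, 15.548, 2.911), 3)
                for seconds in (8, 15.548, 25)}
print(standardized)  # {8: -2.593, 15.548: 0.0, 25: 3.247}
```

The smallest and largest results match the Min and Max reported for the standardized variable.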
VARIANCE & STANDARD DEVIATION
NUMERICAL METHODS FOR IDENTIFYING OUTLIERS
Interquartile Range.
The quartiles of a data set divide the data set into four parts, each containing 25% of the data.
The first quartile (Q1) is the 25th percentile.
The second quartile (Q2) is the 50th percentile, that is, the median.
The third quartile (Q3) is the 75th percentile.
The interquartile range (IQR) is a measure of variability that is much more robust than the standard deviation. The IQR is calculated as IQR = Q3 − Q1 and may be interpreted to represent the spread of the middle 50% of the data.
A robust measure of outlier detection is therefore defined as follows. A data value is an outlier if:
a. It is lower than Q1 − 1.5(IQR), or
b. It is higher than Q3 + 1.5(IQR).
INTERQUARTILE RANGE
For example, suppose that for a set of test scores, the 25th percentile was Q1 = 70 and the 75th percentile was Q3 = 80, so that half of all the test scores fell between 70 and 80. Then the interquartile range, the difference between these quartiles, was IQR = 80 − 70 = 10.
A test score would be robustly identified as an outlier if:
a. It is lower than Q1 − 1.5(IQR) = 70 − 1.5(10) = 55, or
b. It is higher than Q3 + 1.5(IQR) = 80 + 1.5(10) = 95.
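The test-score rule can be sketched as a pair of small functions. The names `iqr_fences` and `is_outlier` are hypothetical, and the quartiles are passed in directly (Q1 = 70, Q3 = 80) because the underlying scores are not given.

```python
def iqr_fences(q1, q3, k=1.5):
    """Return the lower and upper outlier cutoffs Q1 - k*IQR and Q3 + k*IQR."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def is_outlier(x, q1, q3, k=1.5):
    """Flag x when it lies strictly below the lower cutoff or strictly above the upper one."""
    lower, upper = iqr_fences(q1, q3, k)
    return x < lower or x > upper

fences = iqr_fences(70, 80)
print(fences)                  # (55.0, 95.0)
print(is_outlier(50, 70, 80))  # True
print(is_outlier(90, 70, 80))  # False
```

Because the cutoffs depend only on the quartiles, a single extreme score does not move them, which is what makes the rule robust.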
MEDIAN
The middle value in a sorted data set.
$$\text{Median position} = \frac{N + 1}{2}$$
N = the number of data values
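MEDIAN EXAMPLE
Data = 2, 3, 4, 4, 5, 7, 8, 8, 9, 10, 12
Median position = (11 + 1) / 2 = 6
6th value = 7
Data = 2, 3, 4, 4, 5, 7, 8, 8, 9, 10
Median position = (10 + 1) / 2 = 5.5
Value at position 5.5 = ? Position 5.5 lies halfway between the 5th value (5) and the 6th value (7), so the median is (5 + 7) / 2 = 6.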
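The position rule, including the case where the position falls between two values, can be sketched as follows. The function name `median_by_position` is hypothetical. It averages the values on either side of position (N + 1) / 2, which for an odd N are the same value.

```python
import math

def median_by_position(values):
    """Locate position (N + 1) / 2 in the sorted data and average its two neighbors."""
    ordered = sorted(values)
    if not ordered:
        raise ValueError("median of an empty data set is undefined")
    position = (len(ordered) + 1) / 2         # 1-based, may end in .5
    low = ordered[math.floor(position) - 1]   # value at or just below the position
    high = ordered[math.ceil(position) - 1]   # value at or just above the position
    return (low + high) / 2

odd_median = median_by_position([2, 3, 4, 4, 5, 7, 8, 8, 9, 10, 12])
even_median = median_by_position([2, 3, 4, 4, 5, 7, 8, 8, 9, 10])
print(odd_median, even_median)  # 7.0 6.0
```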
QUARTILES
Divide the sorted data into a low group, the median, and a high group.
Q1 = median of the low group
Q2 = median of the entire data set
Q3 = median of the high group
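QUARTILE EXAMPLE
Data = 2, 3, 4, 4, 5, 7, 8, 8, 9, 10, 12
Low group:   2, 3, 4, 4, 5       Q1 = 4
Median:      7                   Q2 = 7
High group:  8, 8, 9, 10, 12     Q3 = 9
Data = 2, 3, 4, 4, 5, 7, 8, 8, 9, 10
Low group:   2, 3, 4, 4, 5       Q1 = 4
Median:      (5 + 7) / 2 = 6     Q2 = 6
High group:  7, 8, 8, 9, 10      Q3 = 8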
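The split-into-groups procedure can be sketched in Python as below. The function name `quartiles_by_halves` is hypothetical. The sketch assumes that for an odd number of values the median itself belongs to neither group, as in the first example. Library percentile functions often interpolate differently and may return slightly different quartiles.

```python
from statistics import median

def quartiles_by_halves(values):
    """Return (Q1, Q2, Q3) as the medians of the low group, all data, and the high group."""
    ordered = sorted(values)
    n = len(ordered)
    if n < 2:
        raise ValueError("need at least two values to form the two groups")
    half = n // 2
    low = ordered[:half]
    # With an odd count, skip the middle value so it belongs to neither group.
    high = ordered[half + 1:] if n % 2 else ordered[half:]
    return median(low), median(ordered), median(high)

odd_quartiles = quartiles_by_halves([2, 3, 4, 4, 5, 7, 8, 8, 9, 10, 12])
even_quartiles = quartiles_by_halves([2, 3, 4, 4, 5, 7, 8, 8, 9, 10])
print(odd_quartiles)   # Q1 = 4, Q2 = 7, Q3 = 9
print(even_quartiles)  # Q1 = 4, Q2 = 6, Q3 = 8
```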
Handling Noisy Data (#1)
“What is meant by noise?”
Noise is error that occurs randomly, or variation that arises in the measurement of a variable.
Solution:
Handling Noisy Data (#2)
Several smoothing approaches:
– Binning
– Clustering
Binning (#1)
Binning methods smooth the values in sorted data by "consulting" their "neighbors", that is, the values around them.
The sorted values are distributed into a number of "buckets", or bins.
The data are smoothed locally.
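As an illustration of local smoothing, the sketch below uses one common variant, smoothing by bin means with equal-size bins. Both the choice of variant and the sample values are assumptions made for this example, not taken from the slides, and the function name `smooth_by_bin_means` is hypothetical.

```python
from statistics import mean

def smooth_by_bin_means(values, bin_size):
    """Sort the values, split them into consecutive bins, and replace each value by its bin mean."""
    if bin_size < 1:
        raise ValueError("bin_size must be at least 1")
    ordered = sorted(values)
    smoothed = []
    for start in range(0, len(ordered), bin_size):
        bucket = ordered[start:start + bin_size]       # neighboring values share a bin
        smoothed.extend([mean(bucket)] * len(bucket))  # each value takes the bin mean
    return smoothed

# Hypothetical sample values, chosen only for illustration.
prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
smoothed_prices = smooth_by_bin_means(prices, 3)
print(smoothed_prices)
```

Each value is replaced using only the values in its own bin, so the effect is local: the three bins here end up with means 9, 22, and 29.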