
(1)

Ericks

(2)

WHY DO WE NEED TO PREPROCESS THE DATA?



Fields that are obsolete or redundant



Missing values



Outliers



Data in a form not suitable for data mining models



Values not consistent with policy or common sense

(3)

DATA CLEANING

Customer ID  Zip     Gender  Income    Age  Marital Status  Transaction Amount
1001         10048   M       75000     C    M               5000
1002         J2S7K7  F       −40000    40   W               4000
1003         90210           10000000  45   S               7000
1004         6269    M       50000     0    S               1000
1005         55101   M       99999     30   D               3000

(4)

DATA CLEANING - ZIP

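Customer 1002? (The zip code J2S7K7 is not in the usual five-digit U.S. format, but it is a valid Canadian postal code, for St. Hyacinthe, Quebec, Canada.)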

Customer 1004? (The zip code is probably 06269, which refers to Storrs, Connecticut, home of the University of Connecticut)
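A leading zero usually disappears when a zip code is stored as a number instead of as text. The sketch below shows one possible repair, assuming that U.S. zip codes have five digits and that any all-digit value shorter than five characters has lost leading zeros. The helper name `restore_us_zip` is hypothetical, not part of the original material.

```python
def restore_us_zip(raw):
    """Left-pad an all-digit zip code to five characters; leave other values untouched."""
    text = str(raw).strip()
    if text.isdigit() and len(text) < 5:
        return text.zfill(5)
    return text


# Zip values from the data cleaning table, stored the way a careless import might store them.
zips = {1001: 10048, 1002: "J2S7K7", 1003: 90210, 1004: 6269, 1005: 55101}
cleaned = {customer_id: restore_us_zip(z) for customer_id, z in zips.items()}

print(cleaned[1004])  # 06269
print(cleaned[1002])  # J2S7K7
```

Non-numeric codes such as J2S7K7 pass through unchanged, so valid foreign postal codes are not damaged by the repair.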

(5)

DATA CLEANING - GENDER

Customer ID  Gender
1001         M
1002         F
1003
1004         M
1005         M

(6)

DATA CLEANING - INCOME

Customer 1003’s income of $10,000,000? Although entirely possible, especially when considering the customer’s zip code (90210, Beverly Hills), this value of income is nevertheless an outlier, an extreme data value.

Customer 1002’s reported income of −$40,000 lies beyond the field bounds for income and therefore must be an error.

Customer 1005’s income of $99,999? Perhaps nothing; it may in fact be valid. But if all the other incomes are rounded to the nearest $5000, why the precision with customer 1005? Often, in legacy databases, certain specified values are meant to be codes for anomalous entries, such as missing values. Perhaps 99999 is such a code.

(7)

(8)

DATA CLEANING - AGE

Customer 1001’s “age” of C probably reflects an earlier categorization of this man’s age into a bin labeled C. Data mining software will not accept this categorical value in an otherwise numerical field, so the problem has to be resolved somehow.

How about customer 1004’s age of 0? Perhaps there is a newborn male living in Storrs, Connecticut, who has made a transaction of $1000.

(9)

HANDLING MISSING DATA



Replace the missing value with some constant, specified by the analyst.

Replace the missing value with the field mean (for numerical variables) or the mode (for categorical variables).



Replace the missing values with a value

(10)

MEAN

$$\text{Mean} = \frac{\sum_{i=1}^{N} X_i}{N}$$

where $X_i$ are the data values and $N$ is the number of data values.

(11)

REPLACE MISSING VALUE

Customer 1001: the value C is replaced with the mean of Age.
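The replacement can be sketched in a few lines. This is only an illustration under stated assumptions: the mean is taken over the Age entries that are numeric, and customer 1004's suspicious age of 0 is kept in the calculation, which pulls the mean down. The helper `is_number` is hypothetical.

```python
# Age column from the data cleaning table; "C" is the non-numeric entry to replace.
ages = {1001: "C", 1002: 40, 1003: 45, 1004: 0, 1005: 30}


def is_number(value):
    """True for int or float values (bool is excluded on purpose)."""
    return isinstance(value, (int, float)) and not isinstance(value, bool)


numeric_ages = [a for a in ages.values() if is_number(a)]
mean_age = sum(numeric_ages) / len(numeric_ages)

# Keep numeric ages as they are; replace anything else with the mean.
imputed = {cid: (a if is_number(a) else mean_age) for cid, a in ages.items()}

print(mean_age)       # 28.75
print(imputed[1001])  # 28.75
```

If the age of 0 were first treated as missing as well, the mean of the remaining three ages would be used instead, so the order of the cleaning steps matters.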

(12)

MODE

Describes the value that occurs most often in the data set.

Data = 1, 2, 1, 4, 3, 1, 5, 3, 1, 2
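As a quick check of the definition, the mode of this data set can be found by counting how often each value occurs. The sketch uses `collections.Counter` from the Python standard library; the variable names are illustrative.

```python
from collections import Counter

data = [1, 2, 1, 4, 3, 1, 5, 3, 1, 2]

# Count the occurrences of each value, then take the most frequent one.
counts = Counter(data)
mode_value, frequency = counts.most_common(1)[0]

print(mode_value, frequency)  # 1 4
```

The value 1 occurs four times, more often than any other value, so the mode is 1.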

(13)

REPLACE MISSING VALUE WITH MODE

Customer ID  Gender
1001         M
1002         F
1003
1004         M
1005         M

Customer 1003: fill the Gender field with the mode.
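For a categorical field such as Gender, the same idea might look like the sketch below. It assumes the missing entry is represented as `None`; the dictionary layout is chosen only for illustration.

```python
from collections import Counter

# Gender column from the table; None marks the missing value for customer 1003.
gender = {1001: "M", 1002: "F", 1003: None, 1004: "M", 1005: "M"}

# The mode is computed from the observed (non-missing) values only.
observed = [g for g in gender.values() if g is not None]
mode_gender = Counter(observed).most_common(1)[0][0]

filled = {cid: (g if g is not None else mode_gender) for cid, g in gender.items()}

print(filled[1003])  # M
```

With only four observed values the mode is a weak basis for a guess, so in practice the filled records are often flagged so that they can be reviewed later.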

(14)

IDENTIFYING MISCLASSIFICATIONS

Notice Anything Strange about This Frequency Distribution?

Two of the classes, USA and France, have a count of only one automobile each. What is clearly happening here is that two of the records have been classified inconsistently with respect to the origin of manufacture.

To maintain consistency with the remainder of the data set, the record with origin USA should have been labeled US, and the record with origin France

(15)

(16)

(17)

DATA TRANSFORMATION



Data miners should normalize their numerical variables, to standardize the scale of effect each variable has on the results.

For example, if we are interested in major league baseball, players’ batting averages will range from zero to less than 0.400, while the number of home runs hit in a season will range from zero to around 70.

(18)

Histogram of time-to-60, with summary statistics: Min = 8, Max = 25

(19)

Min–Max Normalization

$$X^* = \frac{X - \min(X)}{\operatorname{range}(X)} = \frac{X - \min(X)}{\max(X) - \min(X)}$$

For a “drag-racing-ready” vehicle, which takes only 8 seconds (the field minimum) to reach 60 mph, the min–max normalization is

$$X^* = \frac{8 - 8}{25 - 8} = 0$$

From this we can learn that data values which represent the minimum for the variable will have a min–max normalization value of zero.

For an “average” vehicle (if any), which takes exactly 15.548 seconds (the variable average) to reach 60 mph, the min–max normalization is

$$X^* = \frac{15.548 - 8}{25 - 8} = 0.444$$

This tells us that we may expect variable values near the center of the distribution to have a min–max normalization value near 0.5.

For an “I’ll get there when I’m ready” vehicle, which takes 25 seconds (the variable maximum) to reach 60 mph, the min–max normalization is

$$X^* = \frac{25 - 8}{25 - 8} = 1.0$$
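The three worked values can be reproduced with a small function. This is a minimal sketch that takes the field minimum (8) and maximum (25) from the summary statistics as given; the function name `min_max_normalize` is illustrative.

```python
def min_max_normalize(x, x_min, x_max):
    """Rescale x to the interval [0, 1] using the field minimum and maximum."""
    return (x - x_min) / (x_max - x_min)


TIME_MIN, TIME_MAX = 8, 25  # time-to-60 summary statistics, in seconds

for seconds in (8, 15.548, 25):
    print(seconds, round(min_max_normalize(seconds, TIME_MIN, TIME_MAX), 3))
# 8 0.0
# 15.548 0.444
# 25 1.0
```

A value outside the observed range, for example a later record with a time of 30 seconds, would normalize to a number above 1, so the minimum and maximum should come from the same data the model is built on.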

(20)

Z-Score Standardization

$$X^* = \frac{X - \operatorname{mean}(X)}{\operatorname{SD}(X)}$$

For the vehicle that takes only 8 seconds to reach 60 mph, the Z-score standardization is:

$$X^* = \frac{8 - 15.548}{2.911} = -2.593$$

Thus, data values that lie below the mean will have a negative Z-score standardization.

For an “average” vehicle (if any), which takes exactly 15.548 seconds (the variable average) to reach 60 mph, the Z-score standardization is

$$X^* = \frac{15.548 - 15.548}{2.911} = 0$$

This tells us that variable values falling exactly on the mean will have a Z-score standardization of zero.

For the car that takes 25 seconds to reach 60 mph, the Z-score standardization is

$$X^* = \frac{25 - 15.548}{2.911} = 3.247$$

That is, data values that lie above the mean will have a positive Z-score standardization.
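The same three vehicles can be checked numerically. The sketch assumes the mean (15.548) and standard deviation (2.911) quoted in the worked examples; the function name `z_score` is illustrative.

```python
def z_score(x, mean, sd):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    return (x - mean) / sd


TIME_MEAN, TIME_SD = 15.548, 2.911  # time-to-60 mean and standard deviation, in seconds

for seconds in (8, 15.548, 25):
    print(seconds, round(z_score(seconds, TIME_MEAN, TIME_SD), 3))
# 8 -2.593
# 15.548 0.0
# 25 3.247
```

Unlike min–max normalization, the result is not confined to a fixed interval; its sign shows on which side of the mean the value lies.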

(21)

Summary statistics of the Z-score standardized time-to-60: Min = −2.593, Max = 3.247, Mean = 0.00, SD = 1.0, N = 261

(22)

VARIANCE & STANDARD DEVIATION

(23)

NUMERICAL METHODS FOR IDENTIFYING OUTLIERS

Interquartile Range.

The quartiles of a data set divide the data set into four parts, each containing 25% of the data.

The first quartile (Q1) is the 25th percentile.

The second quartile (Q2) is the 50th percentile, that is, the median.

The third quartile (Q3) is the 75th percentile.

The interquartile range (IQR) is a measure of variability that is much more robust than the standard deviation. The IQR is calculated as IQR = Q3 − Q1 and may be interpreted to represent the spread of the middle 50% of the data.

A robust measure of outlier detection is therefore defined as follows. A data value is an outlier if:

a. It lies more than 1.5(IQR) below Q1, or

b. It lies more than 1.5(IQR) above Q3.

(24)

INTERQUARTILE RANGE

For example, suppose that for a set of test scores, the 25th percentile was Q1 = 70 and the 75th percentile was Q3 = 80, so that half of all the test scores fell between 70 and 80. Then the interquartile range, the difference between these quartiles, was

IQR = 80 − 70 = 10.

A test score would be robustly identified as an outlier if:

a. It is lower than Q1 − 1.5(IQR) = 70 − 1.5(10) = 55, or

b. It is higher than Q3 + 1.5(IQR) = 80 + 1.5(10) = 95.
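The rule translates directly into a pair of cutoffs, often called fences. The sketch below uses the quartiles from the test-score example; the list `scores` is made up purely for illustration, and the upper cutoff Q3 + 1.5(IQR) is the standard counterpart of the lower one.

```python
def iqr_fences(q1, q3, k=1.5):
    """Return the (lower, upper) cutoffs Q1 - k*IQR and Q3 + k*IQR."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


lower, upper = iqr_fences(70, 80)
print(lower, upper)  # 55.0 95.0

# Hypothetical test scores, used only to show the rule in action.
scores = [52, 60, 75, 80, 96]
outliers = [s for s in scores if s < lower or s > upper]
print(outliers)  # [52, 96]
```

Because the cutoffs depend only on the quartiles, a single extreme score does not move them much, which is what makes the rule robust.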

(25)

MEDIAN

Describes the middle value of the data set.

$$\text{Median position} = \frac{N + 1}{2}$$

where $N$ is the number of data values.

(26)

MEDIAN EXAMPLE

Data = 2, 3, 4, 4, 5, 7, 8, 8, 9, 10, 12

Median position = (11 + 1) / 2 = 6

6th data value = 7

Data = 2, 3, 4, 4, 5, 7, 8, 8, 9, 10

Median position = (10 + 1) / 2 = 5.5

Data value at position 5.5 = ?
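One common way to answer the question for a fractional position is to average the two neighboring values, here the 5th and 6th. The sketch below follows that convention, which is an assumption consistent with the (5 + 7) / 2 shown in the quartile example; the function name `median_by_position` is illustrative.

```python
import math


def median_by_position(values):
    """Return (position, median), where position = (N + 1) / 2 counts from 1."""
    ordered = sorted(values)
    position = (len(ordered) + 1) / 2
    below = ordered[math.floor(position) - 1]  # value at or just below the position
    above = ordered[math.ceil(position) - 1]   # value at or just above the position
    return position, (below + above) / 2


print(median_by_position([2, 3, 4, 4, 5, 7, 8, 8, 9, 10, 12]))  # (6.0, 7.0)
print(median_by_position([2, 3, 4, 4, 5, 7, 8, 8, 9, 10]))      # (5.5, 6.0)
```

For an odd number of values both neighbors are the same element, so the function returns that element unchanged.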

(27)

QUARTILES

Divide the data into a low group, the median, and a high group.

Q1 = median of the low group

Q2 = median of the whole data set

Q3 = median of the high group

(28)

QUARTILE EXAMPLE

Low group (Q1)    Median (Q2)    High group (Q3)
2, 3, 4, 4, 5     7              8, 8, 9, 10, 12
2, 3, 4, 4, 5     (5 + 7) / 2    7, 8, 8, 9, 10
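The grouping can be sketched as follows, assuming the median-of-halves convention shown in the example: with an odd number of values the middle value belongs to neither group. The function name `quartiles_by_halves` is illustrative, and the sketch expects at least two values.

```python
from statistics import median


def quartiles_by_halves(values):
    """Q1 and Q3 are the medians of the low and high groups; Q2 is the overall median."""
    ordered = sorted(values)
    half = len(ordered) // 2
    low_group = ordered[:half]    # values below the median position
    high_group = ordered[-half:]  # values above the median position
    return median(low_group), median(ordered), median(high_group)


q1, q2, q3 = quartiles_by_halves([2, 3, 4, 4, 5, 7, 8, 8, 9, 10, 12])
print(q1, q2, q3)  # 4 7 9

q1_even, q2_even, q3_even = quartiles_by_halves([2, 3, 4, 4, 5, 7, 8, 8, 9, 10])
print(q1_even, q3_even)  # 4 8
```

For the second data set Q2 equals (5 + 7) / 2 = 6. Other quartile conventions, such as the interpolation methods used by many statistics packages, can give slightly different values on small data sets.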

(29)

Handling Noisy Data (#1)

“What is meant by noise?”

Noise is error that occurs at random, or variation that arises in the measurement of a variable.

Solution:

(30)

Handling Noisy Data (#2)

Some smoothing approaches:

– Binning

– Clustering

(31)

Binning (#1)

Binning methods smooth the values of sorted data by “consulting” their “neighbors”, that is, the values around them.

The sorted values are distributed into a number of “buckets”, or bins.

The data are smoothed locally.

In this example, the data are first sorted and then partitioned into bins of equal depth, for example 3 (each bin holds three values).

(32)

Binning (#2)

Example 1:

Sorted data for the variable price (in dollars):

4, 8, 15, 21, 21, 24, 25, 28, 34

The data are first partitioned into equal-depth bins of depth 3:

Bin 1 : 4, 8, 15

Bin 2 : 21, 21, 24

Bin 3 : 25, 28, 34

(33)

Binning (#3)

Smoothing by bin means:

Bin 1 : 9, 9, 9

Bin 2 : 22, 22, 22

Bin 3 : 29, 29, 29

Smoothing by bin medians:

Bin 1 : 8, 8, 8

Bin 2 : 21, 21, 21

Bin 3 : 28, 28, 28
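Both kinds of smoothing can be reproduced from the sorted price data. This is a minimal sketch, assuming equal-depth bins of three values formed by slicing the sorted list; the variable names are illustrative.

```python
from statistics import mean, median

prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]  # already sorted
depth = 3

# Equal-depth partitioning: every bin receives `depth` consecutive values.
bins = [prices[i:i + depth] for i in range(0, len(prices), depth)]

# Every value in a bin is replaced by the bin mean or by the bin median.
by_means = [[mean(b)] * len(b) for b in bins]
by_medians = [[median(b)] * len(b) for b in bins]

for number, (m, d) in enumerate(zip(by_means, by_medians), start=1):
    print(f"Bin {number}: means {m}, medians {d}")
```

For this data the bin means are 9, 22 and 29, and the bin medians are 8, 21 and 28. The median is less affected by a single extreme value inside a bin, which is why the two results differ for Bin 1 and Bin 3.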

(34)

Binning (#4)

Smoothing by bin boundaries:

Bin 1 : 4, 4, 15 {8 becomes 4 because it is closer to 4 than to 15}

Bin 2 : 21, 21, 24 {21 is unchanged because it already equals a boundary value}

Bin 3 : 25, 25, 34 {28 becomes 25 because it is closer to 25 than to 34}
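Smoothing by bin boundaries replaces each value with whichever bin boundary (the minimum or the maximum of the bin) is closer. The sketch below assumes that a value exactly halfway between the two boundaries goes to the lower one; that tie rule is a choice made here, not something the example specifies.

```python
def smooth_by_boundaries(bin_values):
    """Replace each value with the nearer of the bin's minimum and maximum."""
    low, high = min(bin_values), max(bin_values)
    # Ties (equal distance to both boundaries) go to the lower boundary by assumption.
    return [low if (v - low) <= (high - v) else high for v in bin_values]


bins = [[4, 8, 15], [21, 21, 24], [25, 28, 34]]
smoothed = [smooth_by_boundaries(b) for b in bins]

print(smoothed)  # [[4, 4, 15], [21, 21, 24], [25, 25, 34]]
```

The boundary values themselves never change, so this method keeps the original range of each bin while still pulling the interior values together.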

(35)