Data Preprocessing Lesson Agenda:
a. Definition of data preprocessing
b. Stages of data preprocessing
c. Data cleaning
d. Data transformation
Data preprocessing:
Data preprocessing means transforming raw data into a useful and efficient format.
Data preprocessing is divided into four stages:
1. Data cleaning, 2. Data integration, 3. Data reduction, 4. Data transformation.
Data cleaning
Data cleaning refers to techniques to 'clean' data by removing outliers, replacing missing values, smoothing noisy data, and correcting inconsistent data. Many techniques are used to perform each of these tasks, and each technique is specific to the user's preference or problem set. Below, each task is explained in terms of the techniques used to perform it.
Missing values
In order to deal with missing data, multiple approaches can be used. Let’s look at each of them.
1. Removing the training example: A training example can be ignored if its output label is missing (in a classification problem). This is usually discouraged, as it leads to loss of data: the removed example's attribute values could also have added value to the data set.
2. Filling in the missing value manually: This approach is time consuming and not recommended for huge data sets.
3. Using a standard value to replace the missing value: The missing value can be replaced by a global constant such as 'N/A' or 'Unknown'. This is a simple approach, but not foolproof.
4. Using the most probable value to fill in the missing value: Algorithms such as regression and decision trees can predict the missing values and replace them (a minimal sketch follows below).
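As an illustration, here is a minimal sketch of approaches 1, 3, and 4 in Python with pandas; the data frame and its columns are invented for this example, and the attribute mean is used as a simple stand-in for a model-predicted value:

import pandas as pd

# Hypothetical data set with missing values
df = pd.DataFrame({"age": [25, None, 31, 47],
                   "city": ["Dhaka", None, "Khulna", "Sylhet"]})

# Approach 1: remove examples that contain missing values
dropped = df.dropna()

# Approach 3: replace missing values with a global constant
filled_const = df.fillna({"city": "Unknown"})

# Approach 4 (simplified): replace a missing numeric value with the
# attribute mean; a regression or decision-tree model could supply
# a more probable value instead
filled_mean = df.fillna({"age": df["age"].mean()})
print(filled_mean)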
Noisy data
Noise is defined as random variance in a measured variable. For numeric values, boxplots and scatter plots can be used to identify outliers. To deal with these anomalous values, data smoothing techniques are applied; they are described below.
1. Binning: Binning methods smooth a sorted data value by consulting its 'neighborhood', i.e. the values around it. The sorted values are distributed into 'bins'. There are various approaches to binning. Two of them are smoothing by bin means, where each value in a bin is replaced by the mean of the bin's values, and smoothing by bin medians, where each value in a bin is replaced by the median of the bin's values.
2. Regression: Linear regression and multiple linear regression can be used to smooth the data, where the values are fitted to a function.
3. Outlier analysis: Approaches such as clustering can be used to detect outliers and deal with them.
Binning Methods for Data Smoothing
* Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
* Partition into (equi-depth) bins:
- Bin 1: 4, 8, 9, 15 - Bin 2: 21, 21, 24, 25 - Bin 3: 26, 28, 29, 34
* Smoothing by bin means (rounded to the nearest integer):
- Bin 1: 9, 9, 9, 9 - Bin 2: 23, 23, 23, 23 - Bin 3: 29, 29, 29, 29
* Smoothing by bin boundaries:
- Bin 1: 4, 4, 4, 15 - Bin 2: 21, 21, 25, 25 - Bin 3: 26, 26, 26, 34
* Smoothing by bin medians:
- Bin 1: 8.5, 8.5, 8.5, 8.5 - Bin 2: 22.5, 22.5, 22.5, 22.5 - Bin 3: 28.5, 28.5, 28.5, 28.5
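The three smoothing rules can be verified with a short Python sketch that reproduces the bins above (means rounded to the nearest integer):

# Reproduce the binning example: equi-depth bins of size 4
prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
bins = [prices[i:i + 4] for i in range(0, len(prices), 4)]

for b in bins:
    mean = round(sum(b) / len(b))       # smoothing by bin means
    median = (b[1] + b[2]) / 2          # median of four sorted values
    lo, hi = b[0], b[-1]                # bin boundaries
    boundary = [lo if v - lo <= hi - v else hi for v in b]
    print([mean] * 4, [median] * 4, boundary)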
Data Transformation:
# It is a data preprocessing process.
# Transform or consolidate the data into alternative forms appropriate for mining.
# Processes involved:
a. Smoothing: removing noise from the data, e.g. with binning, regression, or clustering.
b. Aggregation: summary and aggregation operations are applied to the given set of attributes to come up with new attributes.
c. Generalization: replacing low-level data with higher-level concepts (e.g. exact ages generalized to age ranges).
d. Normalization: used to scale the data of an attribute so that it falls in a smaller range, such as -1.0 to 1.0 or 0.0 to 1.0. It is generally useful for classification algorithms.
Methods of Data Normalization:
Decimal Scaling
Min-Max Normalization
z-Score Normalization (zero-mean Normalization)
# Decimal Scaling: find the normalized value of 400.
Normalized value of an attribute: v' = v / 10^j, where j is the smallest integer such that max(|v'|) < 1.

Salary bonus   Formula      Normalized after decimal scaling
400            400 / 1000   0.4
310            310 / 1000   0.31
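A quick check of the table in Python; j = 3 here because the largest absolute value, 400, needs three decimal shifts to fall below 1:

values = [400, 310]
# j is the smallest integer such that max(|v|) / 10**j < 1; max = 400, so j = 3
j = 3
print([v / 10 ** j for v in values])  # -> [0.4, 0.31]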
Min-Max Normalization:
v' = (v − min) / (max − min) × (new_max − new_min) + new_min
Min-max range: 12000 to 98000; new range: 0.0 to 1.0
Value to be normalized: 73600
v' = (73600 − 12000) / (98000 − 12000) = 0.716
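The same computation written as a small Python function; new_min and new_max default to the 0.0 to 1.0 range used above:

def min_max(v, old_min, old_max, new_min=0.0, new_max=1.0):
    # Scale v from [old_min, old_max] into [new_min, new_max]
    return (v - old_min) / (old_max - old_min) * (new_max - new_min) + new_min

print(round(min_max(73600, 12000, 98000), 3))  # -> 0.716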
z-Score Normalization (zero-mean Normalization)
v' = (v − mean) / standard deviation
How do we calculate the z-score of the following data?
8, 10, 15, 20
Mean = (8 + 10 + 15 + 20) / 4 = 13.25; population standard deviation ≈ 4.657.
z-scores: −1.13, −0.70, 0.38, 1.45 (e.g. (8 − 13.25) / 4.657 ≈ −1.13).
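The same calculation in Python, using the population standard deviation as in the worked values above:

data = [8, 10, 15, 20]
mean = sum(data) / len(data)  # 13.25
std = (sum((v - mean) ** 2 for v in data) / len(data)) ** 0.5  # about 4.657
print([round((v - mean) / std, 2) for v in data])  # -> [-1.13, -0.7, 0.38, 1.45]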
Discretization
Discretization is the process through which we can transform continuous variables, models or functions into a discrete form.
Approaches to Discretization
Unsupervised:
— Equal-Width
— Equal-Frequency
— K-Means
Supervised:
— Decision Trees
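A minimal sketch of the two simplest unsupervised approaches using pandas; the ages below are invented for illustration:

import pandas as pd

ages = pd.Series([22, 25, 31, 38, 45, 52, 61, 70])

# Equal-width: split the value range into 4 intervals of equal length
equal_width = pd.cut(ages, bins=4)

# Equal-frequency: split so each interval holds roughly the same count
equal_freq = pd.qcut(ages, q=4)

print(equal_width.value_counts().sort_index())
print(equal_freq.value_counts().sort_index())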
Measures of Distance in Data Mining
1. Euclidean Distance
2. Manhattan Distance
3. Minkowski Distance

Let d denote the distance between two points (x, y) and (a, b).

Sample No   X     Y
1           185   72
2           170   56

We have to find the distance of (195, 45) from Samples 1 and 2.
1. Euclidean Distance:
d = √((x − a)² + (y − b)²)
D1 = √((195 − 185)² + (45 − 72)²) = 28.79
D2 = √((195 − 170)² + (45 − 56)²) = 27.31
2. Manhattan Distance:
d = |x − a| + |y − b|
D1 = |195 − 185| + |45 − 72| = 37
D2 = |195 − 170| + |45 − 56| = 36
3. Minkowski Distance:
d = (|x − a|^p + |y − b|^p)^(1/p)
Let p = 3:
D1 = (|195 − 185|³ + |45 − 72|³)^(1/3) ≈ 27.45
D2 = (|195 − 170|³ + |45 − 56|³)^(1/3) ≈ 25.69
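All three measures as plain Python functions; the printed values match the worked results above (Minkowski with p = 3):

def euclidean(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5

def manhattan(p, q):
    return sum(abs(a - b) for a, b in zip(p, q))

def minkowski(p, q, r=3):
    return sum(abs(a - b) ** r for a, b in zip(p, q)) ** (1 / r)

point, s1, s2 = (195, 45), (185, 72), (170, 56)
print(round(euclidean(point, s1), 2), round(euclidean(point, s2), 2))  # 28.79 27.31
print(manhattan(point, s1), manhattan(point, s2))                      # 37 36
print(round(minkowski(point, s1), 2), round(minkowski(point, s2), 2))  # 27.45 25.69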
— K-Means
# What is K in K-means clustering?
K-means is also called a flat clustering algorithm. 'K' represents the number of clusters the algorithm identifies from the data.
# When to use K-means clustering?
The K-means clustering algorithm is used to find groups that have not been explicitly labeled in the data.
# Apply K-means clustering to the following data set for 2 (two) clusters.

Sample No   X     Y
1           185   72
2           170   56
3           168   60
4           179   68
5           182   72
6           188   77
Given K = 2.
Initial Centroids
Cluster   X     Y
k1        185   72
k2        170   56
Calculate the Euclidean distance using the equation:
Distance[(x, y), (a, b)] = √((x − a)² + (y − b)²)

Sample 1 (185, 72):
Distance from Cluster 1 (185, 72) = √((185 − 185)² + (72 − 72)²) = 0
Distance from Cluster 2 (170, 56) = √((185 − 170)² + (72 − 56)²) = 21.93

Sample 2 (170, 56):
Distance from Cluster 1 (185, 72) = √((170 − 185)² + (56 − 72)²) = 21.93
Distance from Cluster 2 (170, 56) = √((170 − 170)² + (56 − 56)²) = 0
Dataset     Distance to Cluster 1   Distance to Cluster 2   Assignment
(185, 72)   0                       21.93                   1
(170, 56)   21.93                   0                       2
Calculate the Euclidean distance for the next sample, (168, 60):
Distance from Cluster 1 (185, 72) = √((168 − 185)² + (60 − 72)²) = 20.808
Distance from Cluster 2 (170, 56) = √((168 − 170)² + (60 − 56)²) = 4.472

Dataset     Distance to Cluster 1   Distance to Cluster 2   Assignment
(168, 60)   20.808                  4.472                   2
Update the cluster centroid:
Cluster   X                     Y
K1        185                   72
K2        (170 + 168)/2 = 169   (56 + 60)/2 = 58
Calculate the Euclidean distance for the next sample, (179, 68):
Distance from Cluster 1 (185, 72) = √((179 − 185)² + (68 − 72)²) = 7.211
Distance from Cluster 2 (169, 58) = √((179 − 169)² + (68 − 58)²) = 14.142

Dataset     Distance to Cluster 1   Distance to Cluster 2   Assignment
(179, 68)   7.211                   14.142                  1
Update the cluster centroid:
Cluster   X                     Y
K1        (185 + 179)/2 = 182   (72 + 68)/2 = 70
K2        169                   58
Calculate the Euclidean distance for the next sample, (182, 72):
Distance from Cluster 1 (182, 70) = √((182 − 182)² + (72 − 70)²) = 2
Distance from Cluster 2 (169, 58) = √((182 − 169)² + (72 − 58)²) = 19.10

Dataset     Distance to Cluster 1   Distance to Cluster 2   Assignment
(182, 72)   2                       19.10                   1
Update the cluster centroid:
Cluster   X                     Y
K1        (182 + 182)/2 = 182   (70 + 72)/2 = 71
K2        169                   58
Calculate the Euclidean distance for the next sample, (188, 77):
Distance from Cluster 1 (182, 71) = √((188 − 182)² + (77 − 71)²) = 8.485
Distance from Cluster 2 (169, 58) = √((188 − 169)² + (77 − 58)²) = 26.87

Dataset     Distance to Cluster 1   Distance to Cluster 2   Assignment
(188, 77)   8.485                   26.87                   1
Update the cluster centroid:
Cluster   X                     Y
K1        (182 + 188)/2 = 185   (71 + 77)/2 = 74
K2        169                   58
Final Assignment
Sample No   X     Y    Assignment
1           185   72   1
2           170   56   2
3           168   60   2
4           179   68   1
5           182   72   1
6           188   77   1
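The walkthrough above can be reproduced with a short Python sketch. Note that these notes use a simplified incremental update (the winning centroid moves to the midpoint of its old position and the new point); standard k-means instead recomputes each centroid as the mean of all its assigned points and repeats until the assignments stop changing:

def euclidean(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5

samples = [(185, 72), (170, 56), (168, 60), (179, 68), (182, 72), (188, 77)]
centroids = [samples[0], samples[1]]  # initial centroids: samples 1 and 2
assignment = [1, 2]                   # each initial centroid is its own cluster

for point in samples[2:]:
    dists = [euclidean(point, c) for c in centroids]
    k = dists.index(min(dists))       # index of the nearest centroid
    assignment.append(k + 1)
    # midpoint update, as in the walkthrough above
    centroids[k] = tuple((c + p) / 2 for c, p in zip(centroids[k], point))

print(assignment)  # -> [1, 2, 2, 1, 1, 1]
print(centroids)   # -> [(185.0, 74.0), (169.0, 58.0)]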
To build a decision tree we can apply the ID3 algorithm.
The following steps are used when applying the ID3 algorithm to build a decision tree.
# KNN