Data Preprocessing Lesson Agenda:
a. Definition of data preprocessing
b. Stages of data preprocessing
c. Data cleaning
d. Data transformation
Data preprocessing:
Data preprocessing means transforming raw data into a useful and efficient format.
Data preprocessing is divided into four stages:
1. Data cleaning, 2. Data integration, 3. Data reduction, 4. Data transformation.
Data cleaning
Data cleaning refers to techniques to 'clean' data by removing outliers, replacing missing values, smoothing noisy data, and correcting inconsistent data. Many techniques are used to perform each of these tasks, and each technique is specific to the user's preference or problem set. Below, each task is explained in terms of the techniques used to perform it.
Missing values
In order to deal with missing data, multiple approaches can be used. Let’s look at each of them.
1. Removing the training example: A training example can be ignored if its output label is missing (in a classification problem). This is usually discouraged, as it leads to loss of data: the removed example's attribute values could also have added value to the data set.
2. Filling in the missing value manually: This approach is time consuming and not recommended for huge data sets.
3. Using a standard value to replace the missing value: The missing value can be replaced by a global constant such as 'N/A' or 'Unknown'. This is a simple approach, but not foolproof.
4. Using the most probable value to fill in the missing value: Algorithms such as regression and decision trees can predict the missing values and replace them (a minimal sketch follows below).
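As an illustration, here is a minimal sketch of approaches 1, 3, and 4 in Python with pandas; the data frame and its columns are invented for this example, and the attribute mean is used as a simple stand-in for a model-predicted value:

import pandas as pd

# Hypothetical data set with missing values
df = pd.DataFrame({"age": [25, None, 31, 47],
                   "city": ["Dhaka", None, "Khulna", "Sylhet"]})

# Approach 1: remove examples that contain missing values
dropped = df.dropna()

# Approach 3: replace missing values with a global constant
filled_const = df.fillna({"city": "Unknown"})

# Approach 4 (simplified): replace a missing numeric value with the
# attribute mean; a regression or decision-tree model could supply
# a more probable value instead
filled_mean = df.fillna({"age": df["age"].mean()})
print(filled_mean)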
Noisy data
Noise is defined as random variance in a measured variable. For numeric values, boxplots and scatter plots can be used to identify outliers. To deal with these anomalous values, data smoothing techniques are applied; they are described below.
1. Binning: Binning methods smooth a sorted data value by consulting its 'neighborhood', i.e. the values around it. The sorted values are distributed into 'bins'. There are various approaches to binning. Two of them are smoothing by bin means, where each value in a bin is replaced by the mean of the bin's values, and smoothing by bin medians, where each value in a bin is replaced by the median of the bin's values.
2. Regression: Linear regression and multiple linear regression can be used to smooth the data, where the values are fitted to a function.
3. Outlier analysis: Approaches such as clustering can be used to detect outliers and deal with them.
Binning Methods for Data Smoothing
* Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
* Partition into (equi-depth) bins:
- Bin 1: 4, 8, 9, 15 - Bin 2: 21, 21, 24, 25 - Bin 3: 26, 28, 29, 34
* Smoothing by bin means (rounded to the nearest integer):
- Bin 1: 9, 9, 9, 9 - Bin 2: 23, 23, 23, 23 - Bin 3: 29, 29, 29, 29
* Smoothing by bin boundaries:
- Bin 1: 4, 4, 4, 15 - Bin 2: 21, 21, 25, 25 - Bin 3: 26, 26, 26, 34
* Smoothing by bin medians:
- Bin 1: 8.5, 8.5, 8.5, 8.5 - Bin 2: 22.5, 22.5, 22.5, 22.5 - Bin 3: 28.5, 28.5, 28.5, 28.5
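The three smoothing rules can be verified with a short Python sketch that reproduces the bins above (means rounded to the nearest integer):

# Reproduce the binning example: equi-depth bins of size 4
prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
bins = [prices[i:i + 4] for i in range(0, len(prices), 4)]

for b in bins:
    mean = round(sum(b) / len(b))       # smoothing by bin means
    median = (b[1] + b[2]) / 2          # median of four sorted values
    lo, hi = b[0], b[-1]                # bin boundaries
    boundary = [lo if v - lo <= hi - v else hi for v in b]
    print([mean] * 4, [median] * 4, boundary)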
Data Transformation:
# It is a data preprocessing process.
# Transform or consolidate the data into alternative forms appropriate for mining.
# Processes involved:
a. Smoothing: removing noise from the data, e.g. with binning, regression, or clustering.
b. Aggregation: summary and aggregation operations are applied to the given set of attributes to come up with new attributes.
c. Generalization: replacing low-level data with higher-level concepts (e.g. exact ages generalized to age ranges).
d. Normalization: used to scale the data of an attribute so that it falls in a smaller range, such as -1.0 to 1.0 or 0.0 to 1.0. It is generally useful for classification algorithms.
Methods of Data Normalization:
Decimal Scaling
Min-Max Normalization
z-Score Normalization (zero-mean Normalization)
# Decimal Scaling: find the normalized value of 400.
Normalized value of an attribute: v' = v / 10^j, where j is the smallest integer such that max(|v'|) < 1.

Salary bonus   Formula      Normalized after decimal scaling
400            400 / 1000   0.4
310            310 / 1000   0.31
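A quick check of the table in Python; j = 3 here because the largest absolute value, 400, needs three decimal shifts to fall below 1:

values = [400, 310]
# j is the smallest integer such that max(|v|) / 10**j < 1; max = 400, so j = 3
j = 3
print([v / 10 ** j for v in values])  # -> [0.4, 0.31]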
Min-Max Normalization:
v' = (v − min) / (max − min) × (new_max − new_min) + new_min
Min-max range: 12000 to 98000; new range: 0.0 to 1.0
Value to be normalized: 73600
v' = (73600 − 12000) / (98000 − 12000) = 0.716
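The same computation written as a small Python function; new_min and new_max default to the 0.0 to 1.0 range used above:

def min_max(v, old_min, old_max, new_min=0.0, new_max=1.0):
    # Scale v from [old_min, old_max] into [new_min, new_max]
    return (v - old_min) / (old_max - old_min) * (new_max - new_min) + new_min

print(round(min_max(73600, 12000, 98000), 3))  # -> 0.716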
z-Score Normalization (zero-mean Normalization)
v' = (v − mean) / standard deviation
How do we calculate the z-score of the following data?
8, 10, 15, 20
Mean = (8 + 10 + 15 + 20) / 4 = 13.25; population standard deviation ≈ 4.657.
z-scores: −1.13, −0.70, 0.38, 1.45 (e.g. (8 − 13.25) / 4.657 ≈ −1.13).
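The same calculation in Python, using the population standard deviation as in the worked values above:

data = [8, 10, 15, 20]
mean = sum(data) / len(data)  # 13.25
std = (sum((v - mean) ** 2 for v in data) / len(data)) ** 0.5  # about 4.657
print([round((v - mean) / std, 2) for v in data])  # -> [-1.13, -0.7, 0.38, 1.45]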
Discretization
Discretization is the process through which we can transform continuous variables, models or functions into a discrete form.
Approaches to Discretization
Unsupervised:
— Equal-Width
— Equal-Frequency
— K-Means
Supervised:
— Decision Trees
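A minimal sketch of the two simplest unsupervised approaches using pandas; the ages below are invented for illustration:

import pandas as pd

ages = pd.Series([22, 25, 31, 38, 45, 52, 61, 70])

# Equal-width: split the value range into 4 intervals of equal length
equal_width = pd.cut(ages, bins=4)

# Equal-frequency: split so each interval holds roughly the same count
equal_freq = pd.qcut(ages, q=4)

print(equal_width.value_counts().sort_index())
print(equal_freq.value_counts().sort_index())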
Measures of Distance in Data Mining
1. Euclidean Distance
2. Manhattan Distance
3. Minkowski Distance

Let d denote the distance between two points (x, y) and (a, b).

Sample No   X     Y
1           185   72
2           170   56

We have to find the distance of (195, 45) from Samples 1 and 2.
1. Euclidean Distance:
d = √((x − a)² + (y − b)²)
D1 = √((195 − 185)² + (45 − 72)²) = 28.79
D2 = √((195 − 170)² + (45 − 56)²) = 27.31
2. Manhattan Distance:
d = |x − a| + |y − b|
D1 = |195 − 185| + |45 − 72| = 37
D2 = |195 − 170| + |45 − 56| = 36
3. Minkowski Distance:
d = (|x − a|^p + |y − b|^p)^(1/p)
Let p = 3:
D1 = (|195 − 185|³ + |45 − 72|³)^(1/3) ≈ 27.45
D2 = (|195 − 170|³ + |45 − 56|³)^(1/3) ≈ 25.69
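All three measures as plain Python functions; the printed values match the worked results above (Minkowski with p = 3):

def euclidean(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5

def manhattan(p, q):
    return sum(abs(a - b) for a, b in zip(p, q))

def minkowski(p, q, r=3):
    return sum(abs(a - b) ** r for a, b in zip(p, q)) ** (1 / r)

point, s1, s2 = (195, 45), (185, 72), (170, 56)
print(round(euclidean(point, s1), 2), round(euclidean(point, s2), 2))  # 28.79 27.31
print(manhattan(point, s1), manhattan(point, s2))                      # 37 36
print(round(minkowski(point, s1), 2), round(minkowski(point, s2), 2))  # 27.45 25.69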
— K-Means
# What is K in K-means clustering?
K-means is also called a flat clustering algorithm. 'K' represents the number of clusters the algorithm identifies from the data.
# When to use K-means clustering?
The K-means clustering algorithm is used to find groups that have not been explicitly labeled in the data.
# Apply K-means clustering to the following data set for 2 (two) clusters.

Sample No   X     Y
1           185   72
2           170   56
3           168   60
4           179   68
5           182   72
6           188   77
Given K = 2.
Initial Centroids
Cluster   X     Y
k1        185   72
k2        170   56
Calculate the Euclidean distance using the equation:
Distance[(x, y), (a, b)] = √((x − a)² + (y − b)²)

Sample 1 (185, 72):
Distance from Cluster 1 (185, 72) = √((185 − 185)² + (72 − 72)²) = 0
Distance from Cluster 2 (170, 56) = √((185 − 170)² + (72 − 56)²) = 21.93

Sample 2 (170, 56):
Distance from Cluster 1 (185, 72) = √((170 − 185)² + (56 − 72)²) = 21.93
Distance from Cluster 2 (170, 56) = √((170 − 170)² + (56 − 56)²) = 0
Dataset     Distance to Cluster 1   Distance to Cluster 2   Assignment
(185, 72)   0                       21.93                   1
(170, 56)   21.93                   0                       2
Calculate the Euclidean distance for the next sample, (168, 60):
Distance from Cluster 1 (185, 72) = √((168 − 185)² + (60 − 72)²) = 20.808
Distance from Cluster 2 (170, 56) = √((168 − 170)² + (60 − 56)²) = 4.472

Dataset     Distance to Cluster 1   Distance to Cluster 2   Assignment
(168, 60)   20.808                  4.472                   2
Update the cluster centroid:
Cluster   X                     Y
K1        185                   72
K2        (170 + 168)/2 = 169   (56 + 60)/2 = 58
Calculate the Euclidean distance for the next sample, (179, 68):
Distance from Cluster 1 (185, 72) = √((179 − 185)² + (68 − 72)²) = 7.211
Distance from Cluster 2 (169, 58) = √((179 − 169)² + (68 − 58)²) = 14.142

Dataset     Distance to Cluster 1   Distance to Cluster 2   Assignment
(179, 68)   7.211                   14.142                  1
Update the cluster centroid:
Cluster   X                     Y
K1        (185 + 179)/2 = 182   (72 + 68)/2 = 70
K2        169                   58
Calculate the Euclidean distance for the next sample, (182, 72):
Distance from Cluster 1 (182, 70) = √((182 − 182)² + (72 − 70)²) = 2
Distance from Cluster 2 (169, 58) = √((182 − 169)² + (72 − 58)²) = 19.10

Dataset     Distance to Cluster 1   Distance to Cluster 2   Assignment
(182, 72)   2                       19.10                   1
Update the cluster centroid:
Cluster   X                     Y
K1        (182 + 182)/2 = 182   (70 + 72)/2 = 71
K2        169                   58
Calculate the Euclidean distance for the next sample, (188, 77):
Distance from Cluster 1 (182, 71) = √((188 − 182)² + (77 − 71)²) = 8.485
Distance from Cluster 2 (169, 58) = √((188 − 169)² + (77 − 58)²) = 26.87

Dataset     Distance to Cluster 1   Distance to Cluster 2   Assignment
(188, 77)   8.485                   26.87                   1
Update the cluster centroid:
Cluster   X                     Y
K1        (182 + 188)/2 = 185   (71 + 77)/2 = 74
K2        169                   58
Final Assignment
Sample No   X     Y    Assignment
1           185   72   1
2           170   56   2
3           168   60   2
4           179   68   1
5           182   72   1
6           188   77   1
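The walkthrough above can be reproduced with a short Python sketch. Note that these notes use a simplified incremental update (the winning centroid moves to the midpoint of its old position and the new point); standard k-means instead recomputes each centroid as the mean of all its assigned points and repeats until the assignments stop changing:

def euclidean(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5

samples = [(185, 72), (170, 56), (168, 60), (179, 68), (182, 72), (188, 77)]
centroids = [samples[0], samples[1]]  # initial centroids: samples 1 and 2
assignment = [1, 2]                   # each initial centroid is its own cluster

for point in samples[2:]:
    dists = [euclidean(point, c) for c in centroids]
    k = dists.index(min(dists))       # index of the nearest centroid
    assignment.append(k + 1)
    # midpoint update, as in the walkthrough above
    centroids[k] = tuple((c + p) / 2 for c, p in zip(centroids[k], point))

print(assignment)  # -> [1, 2, 2, 1, 1, 1]
print(centroids)   # -> [(185.0, 74.0), (169.0, 58.0)]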
To build a decision tree we can apply the ID3 algorithm.
The following steps are used when applying the ID3 algorithm to build a decision tree.
# KNN