
(1)

Data Mining: Concepts and Techniques (3rd ed.)

— Chapter 3 —

Jiawei Han, Micheline Kamber, and Jian Pei

University of Illinois at Urbana-Champaign & Simon Fraser University

©2011 Han, Kamber & Pei. All rights reserved.

(2)

Chapter 3: Data Preprocessing

Data Preprocessing: An Overview

Data Quality

Major Tasks in Data Preprocessing

Data Cleaning

Data Integration

Data Reduction

Data Transformation and Data Discretization

Summary

(3)

Data Transformation

A function that maps the entire set of values of a given attribute to a new set of replacement values, such that each old value can be identified with one of the new values

Methods

Smoothing: Remove noise from data

Attribute/feature construction

New attributes constructed from the given ones

Aggregation: Summarization, data cube construction

Normalization: Scaled to fall within a smaller, specified range

min-max normalization

z-score normalization

normalization by decimal scaling

Discretization: Concept hierarchy climbing

(4)

Normalization

Min-max normalization: to [new_min_A, new_max_A]

v' = (v − min_A) / (max_A − min_A) × (new_max_A − new_min_A) + new_min_A

Ex. Let income range $12,000 to $98,000 be normalized to [0.0, 1.0]. Then $73,600 is mapped to (73,600 − 12,000) / (98,000 − 12,000) × (1.0 − 0) + 0 = 0.716

Z-score normalization (μ: mean, σ: standard deviation):

v' = (v − μ_A) / σ_A

Ex. Let μ = 54,000, σ = 16,000. Then (73,600 − 54,000) / 16,000 = 1.225

Normalization by decimal scaling:

v' = v / 10^j, where j is the smallest integer such that Max(|v'|) < 1
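The three normalization formulas above can be sketched in a few lines of Python (a minimal illustration using the slide's income figures; the function names are ours, not from the text):

```python
def min_max(v, mn, mx, new_min=0.0, new_max=1.0):
    """Min-max normalization: map v from [mn, mx] onto [new_min, new_max]."""
    return (v - mn) / (mx - mn) * (new_max - new_min) + new_min

def z_score(v, mu, sigma):
    """Z-score normalization: center on the mean, scale by the std. dev."""
    return (v - mu) / sigma

def decimal_scaling(values):
    """Divide by 10^j, j being the smallest integer making max(|v'|) < 1."""
    j = 0
    while max(abs(v) for v in values) / 10 ** j >= 1:
        j += 1
    return [v / 10 ** j for v in values]

print(round(min_max(73_600, 12_000, 98_000), 3))   # 0.716
print(round(z_score(73_600, 54_000, 16_000), 3))   # 1.225
print(decimal_scaling([12_000, 73_600, 98_000]))   # [0.12, 0.736, 0.98]
```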

(5)

Discretization

Three types of attributes

Nominal—values from an unordered set, e.g., color, profession

Ordinal—values from an ordered set, e.g., military or academic rank

Numeric—quantitative values, e.g., integer or real numbers

Discretization: Divide the range of a continuous attribute into intervals

Interval labels can then be used to replace actual data values

Reduce data size by discretization

Supervised vs. unsupervised

Split (top-down) vs. merge (bottom-up)

Discretization can be performed recursively on an attribute

Prepare for further analysis, e.g., classification

(6)

Data Discretization Methods

Typical methods: All the methods can be applied recursively

Binning

Top-down split, unsupervised

Histogram analysis

Top-down split, unsupervised

Clustering analysis (unsupervised, top-down split or bottom-up merge)

Decision-tree analysis (supervised, top-down split)

Correlation (e.g., χ²) analysis (supervised, bottom-up merge)

(7)

Simple Discretization: Binning

Equal-width (distance) partitioning

Divides the range into N intervals of equal size: uniform grid

If A and B are the lowest and highest values of the attribute, the width of intervals will be: W = (B − A)/N

The most straightforward, but outliers may dominate presentation

Skewed data is not handled well

Equal-depth (frequency) partitioning

Divides the range into N intervals, each containing approximately the same number of samples

Good data scaling

Managing categorical attributes can be tricky
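Both partitioning schemes can be sketched in Python (a minimal illustration using the sorted price data from the binning example that follows; function names are ours):

```python
prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]  # already sorted

def equal_width_bins(data, n):
    """Split the value range into n intervals of width W = (B - A) / n."""
    a, b = min(data), max(data)
    w = (b - a) / n
    bins = [[] for _ in range(n)]
    for v in data:
        i = min(int((v - a) / w), n - 1)  # clamp the maximum into the last bin
        bins[i].append(v)
    return bins

def equal_depth_bins(data, n):
    """Split sorted data into n bins holding (approximately) equal counts."""
    data = sorted(data)
    size = len(data) // n
    return [data[i * size:(i + 1) * size] for i in range(n)]

print(equal_width_bins(prices, 3))  # [[4, 8, 9], [15, 21, 21], [24, 25, 26, 28, 29, 34]]
print(equal_depth_bins(prices, 3))  # [[4, 8, 9, 15], [21, 21, 24, 25], [26, 28, 29, 34]]
```

Note how the outlier-sensitive equal-width split puts half the values into the last bin, while the equal-depth split balances the counts.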

(8)

Binning Methods for Data Smoothing

Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34

* Partition into equal-frequency (equi-depth) bins:

- Bin 1: 4, 8, 9, 15
- Bin 2: 21, 21, 24, 25
- Bin 3: 26, 28, 29, 34

* Smoothing by bin means:

- Bin 1: 9, 9, 9, 9
- Bin 2: 23, 23, 23, 23
- Bin 3: 29, 29, 29, 29

* Smoothing by bin boundaries:

- Bin 1: 4, 4, 4, 15
- Bin 2: 21, 21, 25, 25
- Bin 3: 26, 26, 26, 34
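As a check, both smoothing schemes above can be reproduced with a short Python sketch (function names are ours; the bins are the equi-depth bins from this example):

```python
bins = [[4, 8, 9, 15], [21, 21, 24, 25], [26, 28, 29, 34]]

def smooth_by_means(bins):
    """Replace every value with its bin's (rounded) mean."""
    return [[round(sum(b) / len(b))] * len(b) for b in bins]

def smooth_by_boundaries(bins):
    """Replace every value with the nearer of its bin's min or max."""
    return [[b[0] if v - b[0] <= b[-1] - v else b[-1] for v in b] for b in bins]

print(smooth_by_means(bins))       # [[9, 9, 9, 9], [23, 23, 23, 23], [29, 29, 29, 29]]
print(smooth_by_boundaries(bins))  # [[4, 4, 4, 15], [21, 21, 25, 25], [26, 26, 26, 34]]
```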

(9)

Discretization Without Using Class Labels (Binning vs. Clustering)

[Figure: the same data discretized by equal interval width (binning), equal frequency (binning), and K-means clustering; K-means clustering leads to better results]

(10)

Discretization by Classification &

Correlation Analysis

Classification (e.g., decision tree analysis)

Supervised: Given class labels, e.g., cancerous vs. benign

Using entropy to determine split point (discretization point)

Top-down, recursive split

Details to be covered in Chapter 7

Correlation analysis (e.g., Chi-merge: χ2-based discretization)

Supervised: use class information

Bottom-up merge: find the best neighboring intervals (those having similar distributions of classes, i.e., low χ2 values) to merge

Merge performed recursively, until a predefined stopping condition
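One ChiMerge step can be sketched as follows. The per-interval class counts below are hypothetical illustrative values, not from the slides; the χ² computation is the standard one for a 2 × k contingency table built from two adjacent intervals:

```python
def chi_square(row_a, row_b):
    """Chi-square statistic for a 2 x k contingency table of class counts."""
    total = sum(row_a) + sum(row_b)
    chi2 = 0.0
    for j in range(len(row_a)):
        col = row_a[j] + row_b[j]
        for row in (row_a, row_b):
            expected = sum(row) * col / total
            if expected > 0:
                chi2 += (row[j] - expected) ** 2 / expected
    return chi2

# (class1, class2) counts per interval -- hypothetical example.
intervals = [(10, 2), (9, 3), (1, 8)]
scores = [chi_square(intervals[i], intervals[i + 1])
          for i in range(len(intervals) - 1)]
best = scores.index(min(scores))  # merge intervals `best` and `best + 1`
print("merge intervals", best, "and", best + 1)
```

The first two intervals have similar class distributions, so their χ² is low and they are merged first; a full ChiMerge repeats this until a stopping condition (e.g., a χ² threshold) is reached.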

(11)

Concept Hierarchy Generation

Concept hierarchy organizes concepts (i.e., attribute values)

hierarchically and is usually associated with each dimension in a data warehouse

Concept hierarchies facilitate drilling and rolling in data warehouses to view data at multiple granularities

Concept hierarchy formation: Recursively reduce the data by collecting and replacing low level concepts (such as numeric values for age) by higher level concepts (such as youth, adult, or senior)

Concept hierarchies can be explicitly specified by domain experts and/or data warehouse designers

Concept hierarchies can be automatically formed for both numeric and nominal data. For numeric data, use the discretization methods shown earlier.

(12)

Concept Hierarchy Generation for Nominal Data

Specification of a partial/total ordering of attributes explicitly at the schema level by users or experts

street < city < state < country

Specification of a hierarchy for a set of values by explicit data grouping

{Urbana, Champaign, Chicago} < Illinois

Specification of only a partial set of attributes

E.g., only street < city , not others

Automatic generation of hierarchies (or attribute levels) by the analysis of the number of distinct values

E.g., for a set of attributes: { street, city, state, country }

(13)

Automatic Concept Hierarchy Generation

Some hierarchies can be automatically generated based on the analysis of the number of distinct values per attribute in the data set

The attribute with the most distinct values is placed at the lowest level of the hierarchy

Exceptions, e.g., weekday, month, quarter, year

country (15 distinct values)

province_or_state (365 distinct values)

city (3,567 distinct values)

street (674,339 distinct values)
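The distinct-value heuristic on this slide can be sketched directly: sort attributes by distinct-value count, fewest on top, to propose a hierarchy (counts taken from the slide):

```python
distinct_counts = {
    "country": 15,
    "province_or_state": 365,
    "city": 3_567,
    "street": 674_339,
}

# Fewest distinct values -> highest (most general) hierarchy level.
hierarchy = sorted(distinct_counts, key=distinct_counts.get)
print(" < ".join(reversed(hierarchy)))
# street < city < province_or_state < country
```

As the slide notes, the heuristic has exceptions (e.g., weekday has 7 distinct values but sits below year's 365 days), so the proposed ordering should be reviewed by a domain expert.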

(14)

Summary

Data quality: accuracy, completeness, consistency, timeliness, believability, interpretability

Data cleaning: e.g. missing/noisy values, outliers

Data integration from multiple sources:

Entity identification problem

Remove redundancies

Detect inconsistencies

Data reduction

Dimensionality reduction

Numerosity reduction

Data compression

Data transformation and data discretization

Normalization

Concept hierarchy generation

(15)

References

D. P. Ballou and G. K. Tayi. Enhancing data quality in data warehouse environments. Comm. of ACM, 42:73-78, 1999

A. Bruce, D. Donoho, and H.-Y. Gao. Wavelet analysis. IEEE Spectrum, Oct 1996

T. Dasu and T. Johnson. Exploratory Data Mining and Data Cleaning. John Wiley, 2003

J. Devore and R. Peck. Statistics: The Exploration and Analysis of Data. Duxbury Press, 1997.

H. Galhardas, D. Florescu, D. Shasha, E. Simon, and C.-A. Saita. Declarative data cleaning: Language, model, and algorithms. VLDB'01

M. Hua and J. Pei. Cleaning disguised missing data: A heuristic approach. KDD'07

H. V. Jagadish, et al., Special Issue on Data Reduction Techniques. Bulletin of the Technical Committee on Data Engineering, 20(4), Dec. 1997

H. Liu and H. Motoda (eds.). Feature Extraction, Construction, and Selection: A Data Mining Perspective. Kluwer Academic, 1998

J. E. Olson. Data Quality: The Accuracy Dimension. Morgan Kaufmann, 2003

D. Pyle. Data Preparation for Data Mining. Morgan Kaufmann, 1999

V. Raman and J. Hellerstein. Potter's Wheel: An interactive framework for data cleaning and transformation. VLDB'01

T. Redman. Data Quality: The Field Guide. Digital Press (Elsevier), 2001

R. Wang, V. Storey, and C. Firth. A framework for analysis of data quality research. IEEE Trans. Knowledge and Data Engineering, 7:623-640, 1995
