Classification and Prediction
Lecture 5 / DMBI / IKI83403T / MTI / UI
Yudho Giri Sucahyo, Ph.D, CISA (yudho@cs.ui.ac.id)
Faculty of Computer Science, University of Indonesia
Objectives

- Introduction
- What is Classification?
- Classification vs Prediction
- Supervised and Unsupervised Learning
- Data Preparation
- Classification Accuracy
- ID3 Algorithm
- Information Gain
- Bayesian Classification
- Predictive Modelling
Introduction

- Databases are rich with hidden information that can be used for making intelligent business decisions.
- Classification and prediction can be used to extract models describing important data classes or to predict future data trends.
- Classification predicts categorical labels. Ex: categorize bank loan applications as safe or risky.
- Prediction models continuous-valued functions. Ex: predict the expenditures of potential customers on computer equipment given their income and occupation.
- Typical applications:
  - Credit approval, target marketing
  - Medical diagnosis, treatment effectiveness analysis
What is Classification? – A two-step process

- Model construction:
  - Each tuple is assumed to belong to a predefined class, as determined by one of the attributes, called the class label.
  - Data tuples are also referred to as samples, examples, or objects.
  - All tuples used for model construction are called the training set.
  - Since the class label of each training sample is provided → supervised learning. In clustering (unsupervised learning), the class label of each training sample is not known, and the number or set of classes to be learned may not be known in advance.
  - The model is represented in forms such as classification rules, decision trees, or mathematical formulae.
What is Classification? – A two-step process (2)

- The model is used for classifying future or unknown objects.
  - First, the predictive accuracy of the model is estimated:
    - The known label of each test sample is compared with the classified result from the model.
    - The accuracy rate is the percentage of test set samples that are correctly classified by the model.
    - The test set is independent of the training set; otherwise over-fitting will occur (the model may have incorporated particular anomalies of the training data that are not present in the overall sample population).
  - If the accuracy of the model is considered acceptable, the model can be used to classify future objects for which the class label is not known (unknown, previously unseen data).
Classification Process (1)

Training Data → Classification Algorithm → Classifier (Model)

NAME  RANK            YEARS  TENURED
Mike  Assistant Prof  3      no
Mary  Assistant Prof  7      yes
Bill  Professor       2      yes
Jim   Associate Prof  7      yes
Dave  Assistant Prof  6      no
Anne  Associate Prof  3      no

Learned model:

IF rank = 'professor' OR years > 6
THEN tenured = 'yes'
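The learned rule above can be applied directly as a tiny classifier. A minimal plain-Python sketch (the function name `tenured` is my own, not from the slides), checked against the training tuples:

```python
# The rule induced from the training data:
# IF rank = 'professor' OR years > 6 THEN tenured = 'yes'
def tenured(rank, years):
    return "yes" if rank == "Professor" or years > 6 else "no"

training = [
    ("Mike", "Assistant Prof", 3, "no"),
    ("Mary", "Assistant Prof", 7, "yes"),
    ("Bill", "Professor", 2, "yes"),
    ("Jim", "Associate Prof", 7, "yes"),
    ("Dave", "Assistant Prof", 6, "no"),
    ("Anne", "Associate Prof", 3, "no"),
]
correct = sum(tenured(rank, yrs) == label for _, rank, yrs, label in training)
print(f"{correct}/{len(training)}")  # 6/6 -- the rule fits all training tuples
```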
Classification Process (2)

Testing Data (fed to the classifier to estimate accuracy):

NAME     RANK            YEARS  TENURED
Tom      Assistant Prof  2      no
Merlisa  Associate Prof  7      no
George   Professor       5      yes
Joseph   Assistant Prof  7      yes

Unseen Data: (Jeff, Professor, 4) → Tenured?
What is Prediction?

- Prediction is similar to classification:
  - First, construct a model.
  - Second, use the model to predict future or unknown objects.
- The major method for prediction is regression:
  - Linear and multiple regression
  - Non-linear regression
- Prediction is different from classification:
  - Classification refers to predicting a categorical class label.
  - Prediction refers to predicting a continuous value.
Classification vs Prediction

- Sending out promotional literature to every new customer in the database can be quite costly. A more cost-efficient method would be to target only those new customers who are likely to purchase a new computer → classification.
- Predict the number of major purchases that a customer will make during a fiscal year → prediction.
Supervised vs Unsupervised Learning

- Supervised learning (classification)
  - Supervision: the training data (observations, measurements, etc.) are accompanied by labels indicating the class of the observations.
  - New data are classified based on the training set.
- Unsupervised learning (clustering)
  - We are given a set of measurements, observations, etc., with the aim of establishing the existence of classes or clusters in the data.
  - No training data, or the "training data" are not accompanied by class labels.
Issues – Data Preparation

- Data preprocessing can be used to help improve the accuracy, efficiency, and scalability of the classification or prediction process.
- Data cleaning
  - Remove/reduce noise and treat missing values.
- Relevance analysis
  - Many of the attributes in the data may be irrelevant to the classification or prediction task. Ex: data recording the day of the week on which a bank loan application was filed is unlikely to be relevant to the success of the application.
  - Other attributes may be redundant.
  - This step is known as feature selection.
Issues – Data Preparation (2)

- Data transformation
  - Data can be generalized to higher-level concepts.
  - Useful for continuous-valued attributes:
    - Income can be generalized → low, medium, high.
    - Street → city.
  - Generalization compresses the original training data, so fewer input/output operations may be involved during learning.
Comparing Classification Methods

- Predictive accuracy
- Speed and scalability
  - time to construct the model
  - time to use the model
- Robustness
  - handling noise and missing values
- Scalability
  - efficiency in large databases (data not memory-resident)
- Interpretability
  - the level of understanding and insight provided by the model
- Goodness of rules
  - decision tree size
  - the compactness of classification rules
Classification Accuracy: Estimating Error Rates

- Partition: training-and-testing
  - Use two independent data sets, e.g., training set (2/3) and test set (1/3).
  - Used for data sets with a large number of samples.
- Cross-validation
  - Divide the data set into k subsamples.
  - Use k-1 subsamples as training data and one subsample as test data → k-fold cross-validation.
  - For data sets of moderate size.
- Bootstrapping (leave-one-out)
  - For small data sets.
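The k-fold scheme above can be sketched in a few lines of plain Python. This is a minimal illustration, not a library API: the "model" here is a stand-in that always predicts the majority class of the training folds, and all names are my own.

```python
def kfold_indices(n, k):
    """Split indices 0..n-1 into k (nearly) equal contiguous folds."""
    folds, start = [], 0
    for i in range(k):
        size = n // k + (1 if i < n % k else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds

def majority_class(labels):
    return max(set(labels), key=labels.count)

def cross_validate(labels, k=3):
    """Average accuracy of the majority-class model over k splits:
    each fold is held out once as the test set."""
    folds = kfold_indices(len(labels), k)
    accs = []
    for test in folds:
        train = [l for i, l in enumerate(labels) if i not in test]
        pred = majority_class(train)              # "train" the model
        correct = sum(1 for i in test if labels[i] == pred)
        accs.append(correct / len(test))
    return sum(accs) / k

labels = ["P", "P", "N", "P", "P", "N", "P", "P", "N"]
print(round(cross_validate(labels, k=3), 2))      # 0.67
```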
What is a decision tree?

- A decision tree is a flow-chart-like tree structure.
  - An internal node denotes a test on an attribute.
  - A branch represents an outcome of the test; all tuples in a branch have the same value for the tested attribute.
  - A leaf node represents a class label or class label distribution.
- To classify an unknown sample, the attribute values of the sample are tested against the decision tree. A path is traced from the root to a leaf node that holds the class prediction for that sample.
- Decision trees can easily be converted to classification rules.
Training Dataset

An example from Quinlan's ID3:

Outlook   Temperature  Humidity  Windy  Class
sunny     hot          high      false  N
sunny     hot          high      true   N
overcast  hot          high      false  P
rain      mild         high      false  P
rain      cool         normal    false  P
rain      cool         normal    true   N
overcast  cool         normal    true   P
sunny     mild         high      false  N
sunny     cool         normal    false  P
rain      mild         normal    false  P
sunny     mild         normal    true   P
overcast  mild         high      true   P
overcast  hot          normal    false  P
rain      mild         high      true   N
A Sample Decision Tree

Outlook
├─ sunny → humidity
│   ├─ high → N
│   └─ normal → P
├─ overcast → P
└─ rain → windy
    ├─ true → N
    └─ false → P
Decision-Tree Classification Methods

- The basic top-down decision tree generation approach usually consists of two phases:
- Tree construction
  - At the start, all the training examples are at the root.
  - Partition the examples recursively based on selected attributes.
- Tree pruning
  - Aims at removing tree branches that may lead to errors when classifying test data (the training data may contain noise, outliers, ...).
ID3 Algorithm

- All attributes are categorical.
- Create a node N;
  - if the samples are all of the same class C, then
    - return N as a leaf node labeled with C
  - if attribute-list is empty, then
    - return N as a leaf node labeled with the most common class
  - select the split-attribute with the highest information gain and label N with it
  - for each value Ai of the split-attribute, grow a branch from node N
    - let Si be the set of tuples that have the value Ai for the split-attribute
    - if Si is empty, then attach a leaf labeled with the most common class
    - else recursively run the algorithm on Si
- until all branches reach leaf nodes
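The pseudocode above can be sketched compactly in plain Python on Quinlan's weather data. This is an illustrative implementation, not library code; function names (`build_tree`, `classify`, ...) are my own, and the empty-Si case never arises here because branches are only grown for attribute values that occur in the data.

```python
from math import log2
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def gain(rows, labels, attr):
    """Information gain of splitting on column `attr`."""
    rem = 0.0
    for value in set(r[attr] for r in rows):
        subset = [l for r, l in zip(rows, labels) if r[attr] == value]
        rem += len(subset) / len(labels) * entropy(subset)
    return entropy(labels) - rem

def build_tree(rows, labels, attrs):
    if len(set(labels)) == 1:                 # all of the same class C
        return labels[0]
    if not attrs:                             # attribute list is empty
        return Counter(labels).most_common(1)[0][0]
    best = max(attrs, key=lambda a: gain(rows, labels, a))
    branches = {}
    for value in set(r[best] for r in rows):  # grow one branch per value
        idx = [i for i, r in enumerate(rows) if r[best] == value]
        branches[value] = build_tree([rows[i] for i in idx],
                                     [labels[i] for i in idx],
                                     [a for a in attrs if a != best])
    return (best, branches)

def classify(tree, row):
    while not isinstance(tree, str):          # descend until a leaf label
        attr, branches = tree
        tree = branches[row[attr]]
    return tree

# Quinlan's weather data: (outlook, temperature, humidity, windy) -> class
rows = [
    ("sunny", "hot", "high", "false"), ("sunny", "hot", "high", "true"),
    ("overcast", "hot", "high", "false"), ("rain", "mild", "high", "false"),
    ("rain", "cool", "normal", "false"), ("rain", "cool", "normal", "true"),
    ("overcast", "cool", "normal", "true"), ("sunny", "mild", "high", "false"),
    ("sunny", "cool", "normal", "false"), ("rain", "mild", "normal", "false"),
    ("sunny", "mild", "normal", "true"), ("overcast", "mild", "high", "true"),
    ("overcast", "hot", "normal", "false"), ("rain", "mild", "high", "true"),
]
labels = ["N", "N", "P", "P", "P", "N", "P", "N", "P", "P", "P", "P", "P", "N"]

tree = build_tree(rows, labels, [0, 1, 2, 3])
print(tree[0])                                       # 0 -> outlook at the root
print(classify(tree, ("sunny", "mild", "high", "false")))  # N
```

The resulting tree matches the sample decision tree slide: outlook at the root, humidity under sunny, windy under rain.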
Choosing the Split Attribute – Information Gain (ID3/C4.5) (1)

- Assume all attributes are categorical (discrete-valued). Continuous-valued attributes must be discretized.
- Used to select the test attribute at each node in the tree.
- Also called a measure of the goodness of split.
Information Gain (ID3/C4.5) (2)

- Assume that there are two classes, P and N.
- Let the set of examples S contain p elements of class P and n elements of class N.
- The amount of information needed to decide if an arbitrary example in S belongs to P or N is defined as

  I(p, n) = -(p/(p+n)) log2(p/(p+n)) - (n/(p+n)) log2(n/(p+n))

- Assume that using attribute A as the root in the tree will partition S into {S1, S2, ..., Sv}, where Si contains pi examples of P and ni examples of N. The expected information needed to classify objects in all subtrees Si is

  E(A) = Σ(i=1..v) ((pi + ni)/(p + n)) I(pi, ni)
Information Gain (ID3/C4.5) (3)

- The attribute A is selected such that the information gain

  gain(A) = I(p, n) - E(A)

  is maximal, that is, E(A) is minimal, since I(p, n) is the same for all attributes at a node.
- In the given sample data, attribute outlook is chosen to split at the root:

  gain(outlook) = 0.246
  gain(temperature) = 0.029
  gain(humidity) = 0.151
  gain(windy) = 0.048
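The four gains quoted above can be recomputed directly from the (pi, ni) class counts per attribute value, read off the 14-sample weather data (a plain-Python sketch; function names are mine). Note that exact arithmetic gives 0.247 and 0.152 where the slide, which rounds intermediates to three decimals (0.940 - 0.694, 0.940 - 0.789), shows 0.246 and 0.151.

```python
from math import log2

def I(p, n):
    """Information needed for a set with p P-examples and n N-examples."""
    t = p + n
    return sum(-c / t * log2(c / t) for c in (p, n) if c)

def gain(partition, p, n):
    """partition: (p_i, n_i) class counts for each value of the attribute."""
    E = sum((pi + ni) / (p + n) * I(pi, ni) for pi, ni in partition)
    return I(p, n) - E

# class counts (P, N) per attribute value; 9 P and 5 N overall
print(round(gain([(2, 3), (4, 0), (3, 2)], 9, 5), 3))  # outlook: 0.247
print(round(gain([(2, 2), (4, 2), (3, 1)], 9, 5), 3))  # temperature: 0.029
print(round(gain([(3, 4), (6, 1)], 9, 5), 3))          # humidity: 0.152
print(round(gain([(6, 2), (3, 3)], 9, 5), 3))          # windy: 0.048
```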
Information Gain (ID3/C4.5) (4)

- Example:
  - See Table 7.1.
  - Class label: buys_computer, with two values: yes, no.
  - m = 2. C1 corresponds to yes, C2 corresponds to no.
  - 9 samples of class yes and 5 samples of class no.
  - Compute the expected information needed to classify a given sample:

    I(s1, s2) = I(9, 5) = 0.940

Information Gain (ID3/C4.5) (5)

- Next, compute the entropy of each attribute. Let's start with the attribute age.
- Using equation (7.2), the expected information needed to classify a given sample if the samples are partitioned by age is

  E(age) = (5/14) I(2, 3) + (4/14) I(4, 0) + (5/14) I(3, 2) = 0.694

- Hence, the gain in information from such a partitioning:

  Gain(age) = I(s1, s2) - E(age) = 0.246

- Similarly, we can compute Gain(income) = 0.029, Gain(student) = 0.151, Gain(credit_rating) = 0.048.
How to use a tree?

- Directly
  - Test the attribute values of an unknown sample against the tree.
  - A path is traced from the root to a leaf, which holds the label.
- Indirectly
  - The decision tree is converted to classification rules.
  - One rule is created for each path from the root to a leaf.
  - IF-THEN rules are easier for humans to understand.
  - Example:
    IF age = "<=30" AND student = "no" THEN buys_computer = "no"
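The indirect route above (one rule per root-to-leaf path) can be sketched as a short traversal. The nested-tuple/dict tree encoding and the function name `rules` are my own illustrative choices; the tree is the weather tree from the earlier slide.

```python
# Hand-written weather tree: (attribute, {value: subtree-or-leaf-label})
tree = ("outlook", {
    "sunny": ("humidity", {"high": "N", "normal": "P"}),
    "overcast": "P",
    "rain": ("windy", {"true": "N", "false": "P"}),
})

def rules(node, conditions=()):
    """Yield one IF-THEN rule per root-to-leaf path."""
    if isinstance(node, str):                 # leaf -> emit one rule
        body = " AND ".join(f'{a} = "{v}"' for a, v in conditions)
        yield f'IF {body} THEN class = "{node}"'
    else:
        attr, branches = node
        for value, child in branches.items():
            yield from rules(child, conditions + ((attr, value),))

for r in rules(tree):
    print(r)   # 5 rules, one per leaf, e.g. IF outlook = "overcast" THEN class = "P"
```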
Tree Pruning

- A decision tree constructed from the training data may have too many branches/leaf nodes.
  - Caused by noise or overfitting.
  - May result in poor accuracy for unseen samples.
- Prune the tree: merge a subtree into a leaf node.
  - Use a set of data different from the training data.
  - At a tree node, if the accuracy without splitting is higher than the accuracy with splitting, replace the subtree with a leaf node and label it with the majority class.
- Pruning criteria:
  - Pessimistic pruning: C4.5
  - MDL: SLIQ and SPRINT
  - Cost-complexity pruning: CART
Classification and Databases

- Classification is a classical problem extensively studied by
  - statisticians
  - AI, especially machine learning, researchers
- Database researchers re-examined the problem in the context of large databases:
  - Most previous studies used small data sets, and most algorithms are memory-resident.
- Recent data mining research contributes to
  - scalability
  - generalization-based classification
  - parallel and distributed processing
Classifying Large Datasets

- Decision trees seem to be a good choice:
  - relatively faster learning speed than other classification methods
  - can be converted into simple and easy-to-understand classification rules
  - can be used to generate SQL queries for accessing databases
  - comparable classification accuracy with other methods
- Goal: classifying data sets with millions of examples and a few hundred, even thousands of, attributes with reasonable speed.
Scalable Decision Tree Methods

- Most algorithms assume the data can fit in memory.
- Data mining research contributes to the scalability issue, especially for decision trees.
- Successful examples:
  - SLIQ (EDBT'96 -- Mehta et al.'96)
  - SPRINT (VLDB'96 -- J. Shafer et al.'96)
  - PUBLIC (VLDB'98 -- Rastogi & Shim'98)
  - RainForest (VLDB'98 -- Gehrke et al.'98)
Previous Efforts on Scalability

- Incremental tree construction (Quinlan'86)
  - Uses partial data to build a tree.
  - Tests the remaining examples; the misclassified ones are used to rebuild the tree interactively.
- Data reduction (Cattlet'91)
  - Reduces data size by sampling and discretization.
  - Still a main-memory algorithm.
- Data partition and merge (Chan and Stolfo'91)
  - Partitions the data and builds a tree for each partition.
  - Merges the multiple trees into a combined tree.
  - Experimental results indicated reduced classification accuracy.
Presentation of Classification Rules

[Figure: example presentations of classification rules]
Other Classification Methods

- Bayesian classification
- Neural networks
- Genetic algorithms
- Rough set approach
- k-Nearest Neighbor classifier
- Case-Based Reasoning (CBR)
- Fuzzy logic
- Support Vector Machines (SVM)
Bayesian Classification

- Bayesian classifiers are statistical classifiers.
- They can predict class membership probabilities, such as the probability that a given sample belongs to a particular class.
- Bayesian classification is based on Bayes' theorem.
- The naive Bayesian classifier is comparable in performance with decision tree and neural network classifiers.
- Bayesian classifiers also exhibit high accuracy and speed when applied to large databases.
Bayes' Theorem (1)

- Let X be a data sample whose class label is unknown.
- Let H be some hypothesis, such as that the data sample X belongs to a specified class C.
- We want to determine P(H|X), the probability that the hypothesis H holds given the observed data sample X.
- P(H|X) is the posterior probability, or a posteriori probability, of H conditioned on X.
- Suppose the world of data samples consists of fruits, described by their color and shape. Suppose that X is red and round, and that H is the hypothesis that X is an apple. Then P(H|X) reflects our confidence that X is an apple given that we have seen that X is red and round.
Bayes' Theorem (2)

- P(H) is the prior probability, or a priori probability, of H.
  - The probability that any given data sample is an apple, regardless of how the data sample looks.
- The posterior probability is based on more information (such as background knowledge) than the prior probability, which is independent of X.
- Bayes' theorem is

  P(H|X) = P(X|H) P(H) / P(X)

- See Example 7.4 for an example of naive Bayesian classification.
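In naive Bayesian classification, the shared denominator P(X) can be dropped, so a class is chosen by maximizing P(C) · Π P(xi|C) under the attribute-independence assumption. A self-contained plain-Python sketch on the weather training data from the decision-tree slides (my own illustrative code; no smoothing, for brevity):

```python
from collections import Counter, defaultdict

data = [  # (outlook, temperature, humidity, windy) -> class
    (("sunny", "hot", "high", "false"), "N"),
    (("sunny", "hot", "high", "true"), "N"),
    (("overcast", "hot", "high", "false"), "P"),
    (("rain", "mild", "high", "false"), "P"),
    (("rain", "cool", "normal", "false"), "P"),
    (("rain", "cool", "normal", "true"), "N"),
    (("overcast", "cool", "normal", "true"), "P"),
    (("sunny", "mild", "high", "false"), "N"),
    (("sunny", "cool", "normal", "false"), "P"),
    (("rain", "mild", "normal", "false"), "P"),
    (("sunny", "mild", "normal", "true"), "P"),
    (("overcast", "mild", "high", "true"), "P"),
    (("overcast", "hot", "normal", "false"), "P"),
    (("rain", "mild", "high", "true"), "N"),
]

prior = Counter(c for _, c in data)          # class frequencies
cond = defaultdict(Counter)                  # (attr index, class) -> value counts
for x, c in data:
    for i, v in enumerate(x):
        cond[(i, c)][v] += 1

def score(x, c):
    """P(c) * prod_i P(x_i | c), i.e. P(c|x) up to the constant P(x)."""
    p = prior[c] / len(data)
    for i, v in enumerate(x):
        p *= cond[(i, c)][v] / prior[c]
    return p

x = ("sunny", "cool", "high", "true")
print(max(prior, key=lambda c: score(x, c)))   # N
```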
Predictive Modeling in Databases

- What if we would like to predict a continuous value, rather than a categorical label?
  - Prediction of continuous values can be modeled by the statistical technique of regression.
  - Examples:
    - A model to predict the salary of college graduates with 10 years of work experience.
    - Potential sales of a new product given its price.
- Many problems can be solved by linear regression.
- Software packages for solving regression problems are widely available.
Linear Regression

- Data are modeled using a straight line.
- The simplest form of regression.
- Bivariate linear regression models a random variable Y (called a response variable) as a linear function of another random variable X (called a predictor variable):

  Y = α + βX

- See Example 7.6 for an example of linear regression.
- Other regression models:
  - Multiple regression
  - Log-linear models
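The coefficients of Y = α + βX are fitted by least squares with the standard closed form β = Σ(x-x̄)(y-ȳ) / Σ(x-x̄)² and α = ȳ - βx̄. A minimal sketch in plain Python; the data points are made-up illustrative numbers (years of experience vs salary), not from the lecture.

```python
def fit_line(xs, ys):
    """Least-squares estimates (alpha, beta) for y = alpha + beta * x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    beta = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
            / sum((x - mx) ** 2 for x in xs))
    alpha = my - beta * mx
    return alpha, beta

xs = [1, 2, 3, 4, 5]        # years of experience (illustrative)
ys = [30, 35, 40, 45, 50]   # salary in $1000s (illustrative)
alpha, beta = fit_line(xs, ys)
print(alpha, beta)          # 25.0 5.0 -- the points lie exactly on a line
print(alpha + beta * 10)    # predicted salary at 10 years: 75.0
```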
Prediction: Numerical Data

[Figure: scatter plot with fitted regression line]

Prediction: Categorical Data

[Figure: prediction on categorical data]
Conclusion

- Classification is an extensively studied problem (mainly in statistics, machine learning, and neural networks).
- Classification is probably one of the most widely used data mining techniques, with a lot of applications.
- Scalability is still an important issue for database applications.
- Combining classification with database techniques should be a promising research topic.
- Research direction: classification of non-relational data, e.g., text, spatial, multimedia, etc.
References

- C. Apte and S. Weiss. Data mining with decision trees and decision rules. Future Generation Computer Systems, 13, 1997.
- L. Breiman, J. Friedman, R. Olshen, and C. Stone. Classification and Regression Trees. Wadsworth International Group, 1984.
- P. K. Chan and S. J. Stolfo. Learning arbiter and combiner trees from partitioned data for scaling machine learning. In Proc. 1st Int. Conf. Knowledge Discovery and Data Mining (KDD'95), pages 39-44, Montreal, Canada, August 1995.
- U. M. Fayyad. Branching on attribute values in decision tree generation. In Proc. 1994 AAAI Conf., pages 601-606, AAAI Press, 1994.
- J. Gehrke, R. Ramakrishnan, and V. Ganti. RainForest: A framework for fast decision tree construction of large datasets. In Proc. 1998 Int. Conf. Very Large Data Bases, pages 416-427, New York, NY, August 1998.
- M. Kamber, L. Winstone, W. Gong, S. Cheng, and J. Han. Generalization and decision tree induction: Efficient classification in data mining. In Proc. 1997 Int. Workshop Research Issues on Data Engineering (RIDE'97), pages 111-120, Birmingham, England, April 1997.

References (2)

- J. Magidson. The CHAID approach to segmentation modeling: Chi-squared automatic interaction detection. In R. P. Bagozzi, editor, Advanced Methods of Marketing Research, pages 118-159. Blackwell Business, Cambridge, Massachusetts, 1994.
- M. Mehta, R. Agrawal, and J. Rissanen. SLIQ: A fast scalable classifier for data mining. In Proc. 1996 Int. Conf. Extending Database Technology (EDBT'96), Avignon, France, March 1996.
- S. K. Murthy. Automatic construction of decision trees from data: A multi-disciplinary survey. Data Mining and Knowledge Discovery, 2(4): 345-389, 1998.
- J. R. Quinlan. Bagging, boosting, and C4.5. In Proc. 13th Natl. Conf. on Artificial Intelligence (AAAI'96), pages 725-730, Portland, OR, August 1996.
- R. Rastogi and K. Shim. PUBLIC: A decision tree classifier that integrates building and pruning. In Proc. 1998 Int. Conf. Very Large Data Bases, pages 404-415, New York, NY, August 1998.
- J. Shafer, R. Agrawal, and M. Mehta. SPRINT: A scalable parallel classifier for data mining. In Proc. 1996 Int. Conf. Very Large Data Bases, pages 544-555, Bombay, India, September 1996.
- S. M. Weiss and C. A. Kulikowski. Computer Systems that Learn: Classification and Prediction Methods from Statistics, Neural Nets, Machine Learning, and Expert Systems. Morgan Kaufmann, 1991.