1
Data Mining: Practical Machine Learning Tools and Techniques (Chapter 2) 03/04/06
Data Mining
Practical Machine Learning Tools and Techniques
Slides for Chapter 2 of Data Mining by I. H. Witten and E. Frank
2
Data Mining: Practical Machine Learning Tools and Techniques (Chapter 2) 03/04/06
Input:
Concepts, instances, attributesTerminology What’s a concept?
♦Classification, association, clustering, numeric prediction
What’s in an example?
♦Relations, flat files, recursion
What’s in an attribute?
♦Nominal, ordinal, interval, ratio
Preparing the input
♦ARFF, attributes, missing values, getting to know data
3
Data Mining: Practical Machine Learning Tools and Techniques (Chapter 2) 03/04/06
Terminology
✁ Components of the input:
♦Concepts: kinds of things that can be learned
✂
Aim: intelligible and operational concept description
♦Instances: the individual, independent examples of a concept
✂
Note: more complicated forms of input are possible
♦Attributes: measuring aspects of an instance
✂
We will focus on nominal and numeric ones
4
Data Mining: Practical Machine Learning Tools and Techniques (Chapter 2) 03/04/06
What’s a concept?
✁ Styles of learning:
♦Classification learning: predicting a discrete class
♦Association learning:
detecting associations between features
♦Clustering:
grouping similar instances into clusters
♦Numeric prediction: predicting a numeric quantity
✁ Concept: thing to be learned ✁
Concept description: output of learning scheme
5
Data Mining: Practical Machine Learning Tools and Techniques (Chapter 2) 03/04/06
Classification learning
✁
Example problems: weather data, contact lenses, irises, labor negotiations
✁
Classification learning is supervised ♦Scheme is provided with actual outcome
✁
Outcome is called the class of the example ✁
Measure success on fresh data for which class labels are known (test data) ✁
In practice success is often measured subjectively
6
Data Mining: Practical Machine Learning Tools and Techniques (Chapter 2) 03/04/06
Association learning
✁
Can be applied if no class is specified and any kind of structure is considered “interesting” ✁
Difference to classification learning:
♦Can predict any attribute’s value, not just the class, and more than one attribute’s value at a time
♦Hence: far more association rules than classification rules
♦Thus: constraints are necessary
✂
7
Data Mining: Practical Machine Learning Tools and Techniques (Chapter 2) 03/04/06
Clustering
✁
Finding groups of items that are similar ✁
Clustering is unsupervised ♦The class of an example is not known
✁
Success often measured subjectively
… … …
Iris virginica 1 .9
Iris virginica 2 .5
6 .0 3 .3 6 .3
Iris versicolor 1 .5
4.5 3 .2 6 .4
Iris versicolor 1 .4
4.7 3 .2 7 .0
Iris setosa 0 .2
1 .4 3 .0 4 .9
Iris setosa 0 .2
1 .4 3 .5 5 .1
Type Petal width Petal length Sepal width Sepal length
8
Data Mining: Practical Machine Learning Tools and Techniques (Chapter 2) 03/04/06
Numeric prediction
✁
Variant of classification learning where “class” is numeric (also called “regression”) ✁
Learning is supervised
♦Scheme is being provided with target value
✁ Measure success on test data
… Overcast
0
Play- time
Windy Humidity Temperature Outlook
9
Data Mining: Practical Machine Learning Tools and Techniques (Chapter 2) 03/04/06
What’s in an example?
Instance: specific type of example ✁
Thing to be classified, associated, or clustered
✁
Individual, independent example of target concept
✁
Characterized by a predetermined set of attributes
Input to learning scheme: set of instances/dataset
✁
Represented as a single relation/flat file
Rather restricted form of input ✁
No relationships between objects
Most common form in practical data mining
1 0
Data Mining: Practical Machine Learning Tools and Techniques (Chapter 2) 03/04/06
A family tree
=
Data Mining: Practical Machine Learning Tools and Techniques (Chapter 2) 03/04/06
Family tree represented as a table
Ian
Data Mining: Practical Machine Learning Tools and Techniques (Chapter 2) 03/04/06
The “sister-of” relation
yes
Sister of? Second person First person
No
All the rest
Yes
Sister of? Second person First person
1 3
Data Mining: Practical Machine Learning Tools and Techniques (Chapter 2) 03/04/06
A full representation in one table
Ian Parent 2
Fem ale Fem ale Fem ale Fem ale Fem ale Fem ale Gender Parent 1 Name
Parent 2 Parent 1 Gender
Fem ale Fem ale Male
St even
Sister of ? Second person
First person
If second person’s gender = female
and first person’s parent = second person’s parent then sister-of = yes
1 4
Data Mining: Practical Machine Learning Tools and Techniques (Chapter 2) 03/04/06
Generating a flat file
✁
Process of flattening called “denormalization” ♦Several relations are joined together to make one
✁
Possible with any finite set of finite relations ✁
Problematic: relationships without pre-specified number of objects
♦Example: concept of nuclear-family ✁
Denormalization may produce spurious regularities that reflect structure of database
♦Example: “supplier” predicts “supplier address”
1 5
Data Mining: Practical Machine Learning Tools and Techniques (Chapter 2) 03/04/06
The “ancestor-of” relation
Yes Other positive ex am ples here
Yes Ian Pam Femal e Nikk i ? ? Femal e Grace Parent 2
Male Femal e Femal e Femal e Femal e Male
Femal e Femal e Male Male Male Male
No All the rest
Yes
Ancestor of? Second person
First person
1 6
Data Mining: Practical Machine Learning Tools and Techniques (Chapter 2) 03/04/06
Recursion
✁
Appropriate techniques are known as “inductive logic programming”
♦(e.g. Quinlan’s FOIL)
♦Problems: (a) noise and (b) computational complexity
If person1 is a parent of person2 then person1 is an ancestor of person2
If person1 is a parent of person2 and person2 is an ancestor of person3 then person1 is an ancestor of person3
✁
Infinite relations require recursion
1 7
Data Mining: Practical Machine Learning Tools and Techniques (Chapter 2) 03/04/06
What’s in an attribute?
✁
Each instance is described by a fixed predefined set of features, its “attributes” ✁
But: number of attributes may vary in practice ♦Possible solution: “irrelevant value” flag
✁
Related problem: existence of an attribute may depend of value of another one ✁
Possible attribute types (“levels of measurement”):
♦Nominal, ordinal, interval and ratio
1 8
Data Mining: Practical Machine Learning Tools and Techniques (Chapter 2) 03/04/06
Nominal quantities
✁ Values are distinct symbols
♦Values themselves serve only as labels or names
♦Nominal comes from the Latin word for name
✁
Example: attribute “outlook” from weather data ♦Values: “sunny”,”overcast”, and “rainy”
✁
No relation is implied among nominal values (no ordering or distance measure)
✁
1 9
Data Mining: Practical Machine Learning Tools and Techniques (Chapter 2) 03/04/06
Ordinal quantities
✁ Impose order on values
✁
But: no distance between values defined ✁
Example:
attribute “temperature” in weather data ♦Values: “hot” > “mild” > “cool”
✁
Note: addition and subtraction don’t make sense ✁
Example rule:
temperature < hot ⇒ play = yes ✁
Distinction between nominal and ordinal not always clear (e.g. attribute “outlook”)
20
Data Mining: Practical Machine Learning Tools and Techniques (Chapter 2) 03/04/06
Interval quantities
✁
Interval quantities are not only ordered but measured in fixed and equal units ✁
Example 1: attribute “temperature” expressed in degrees Fahrenheit
✁ Example 2: attribute “year” ✁
Difference of two values makes sense ✁
Sum or product doesn’t make sense ♦Zero point is not defined!
2 1
Data Mining: Practical Machine Learning Tools and Techniques (Chapter 2) 03/04/06
Ratio quantities
✁
Ratio quantities are ones for which the measurement scheme defines a zero point ✁
Example: attribute “distance”
♦Distance between an object and itself is zero
✁
Ratio quantities are treated as real numbers ♦All mathematical operations are allowed
✁
But: is there an “inherently” defined zero point? ♦Answer depends on scientific knowledge (e.g.
Fahrenheit knew no lower limit to temperature)
22
Data Mining: Practical Machine Learning Tools and Techniques (Chapter 2) 03/04/06
Attribute types used in practice
✁
Most schemes accommodate just two levels of measurement: nominal and ordinal
✁
Nominal attributes are also called “categorical”, ”enumerated”, or “discrete”
♦But: “enumerated” and “discrete” imply order
✁
Special case: dichotomy (“boolean” attribute) ✁
Ordinal attributes are called “numeric”, or “continuous”
♦But: “continuous” implies mathematical continuity
2 3
Data Mining: Practical Machine Learning Tools and Techniques (Chapter 2) 03/04/06
Metadata
✁
Information about the data that encodes background knowledge
✁
Can be used to restrict search space ✁
Examples:
♦Dimensional considerations
(i.e. expressions must be dimensionally correct)
♦Circular orderings (e.g. degrees in compass)
♦Partial orderings
(e.g. generalization/specialization relations)
2 4
Data Mining: Practical Machine Learning Tools and Techniques (Chapter 2) 03/04/06
Preparing the input
✁
Denormalization is not the only issue ✁
Problem: different data sources (e.g. sales department, customer billing department, …)
♦Differences: styles of record keeping, conventions, time periods, data aggregation, primary keys, errors
♦Data must be assembled, integrated, cleaned up
♦“Data warehouse”: consistent point of access
✁
External data may be required (“overlay data”)
✁
2 5
Data Mining: Practical Machine Learning Tools and Techniques (Chapter 2) 03/04/06
The ARFF format
%
% ARFF file for weather data with some numeric features %
@relation weather
@attribute outlook {sunny, overcast, rainy} @attribute temperature numeric
@attribute humidity numeric @attribute windy {true, false} @attribute play? {yes, no}
@data
sunny, 85, 85, false, no sunny, 80, 90, true, no overcast, 83, 86, false, yes ...
26
Data Mining: Practical Machine Learning Tools and Techniques (Chapter 2) 03/04/06
Additional attribute types
ARFF supports
string
attributes:
♦Similar to nominal attributes but list of values
is not pre-specified
It also supports
date
attributes:
♦Uses the ISO-8601 combined date and time format yyyy-MM-dd-THH:mm:ss
@attribute description string
@attribute today date
2 7
Data Mining: Practical Machine Learning Tools and Techniques (Chapter 2) 03/04/06
Sparse data
In some applications most attribute values in
a dataset are zero
♦E.g.: word counts in a text categorization problem
ARFF supports sparse data
This also works for nominal attributes (where
the first value corresponds to “zero”)
0, 26, 0, 0, 0 ,0, 63, 0, 0, 0, “class A” 0, 0, 0, 42, 0, 0, 0, 0, 0, 0, “class B”
{1 26, 6 63, 10 “class A”} {3 42, 10 “class B”}
2 8
Data Mining: Practical Machine Learning Tools and Techniques (Chapter 2) 03/04/06
Attribute types
✁
Interpretation of attribute types in ARFF depends on learning scheme
♦Numeric attributes are interpreted as
✂
ordinal scales if less-than and greater-than are used
✂
ratio scales if distance calculations are performed (normalization/standardization may be required)
♦Instance-based schemes define distance between nominal values (0 if values are equal, 1 otherwise)
✁
Integers in some given data file: nominal, ordinal, or ratio scale?
2 9
Data Mining: Practical Machine Learning Tools and Techniques (Chapter 2) 03/04/06
Nominal vs. ordinal
Attribute “age” nominal
Attribute “age” ordinal
(e.g. “young” < “pre-presbyopic” < “presbyopic”)
If age = young and astigmatic = no and tear production rate = normal then recommendation = soft
If age = pre-presbyopic and astigmatic = no and tear production rate = normal then recommendation = soft
If age ≤ pre-presbyopic and astigmatic = no and tear production rate = normal then recommendation = soft
3 0
Data Mining: Practical Machine Learning Tools and Techniques (Chapter 2) 03/04/06
Missing values
✁
Frequently indicated by out-of-range entries ♦Types: unknown, unrecorded, irrelevant
♦Reasons:
✂
malfunctioning equipment
✂
changes in experimental design
✂
collation of different datasets
✂
measurement not possible
✁
Missing value may have significance in itself (e.g. missing test in a medical examination)
3 1
Data Mining: Practical Machine Learning Tools and Techniques (Chapter 2) 03/04/06
Inaccurate values
✁
Reason: data has not been collected for mining it
✁
Result: errors and omissions that don’t affect original purpose of data (e.g. age of customer)
✁
Typographical errors in nominal attributes ⇒
values need to be checked for consistency
✁
Typographical and measurement errors in numeric attributes ⇒ outliers need to be identified
✁
Errors may be deliberate (e.g. wrong zip codes)
✁
Other problems: duplicates, stale data
3 2
Data Mining: Practical Machine Learning Tools and Techniques (Chapter 2) 03/04/06
Getting to know the data
✁
Simple visualization tools are very useful ♦Nominal attributes: histograms (Distribution
consistent with background knowledge?)
♦Numeric attributes: graphs (Any obvious outliers?)
✁
2-D and 3-D plots show dependencies ✁
Need to consult domain experts ✁