Instance: specific type of example Thing to be classified, associated, or clustered Individual, independent example of target concept Characterized by a predetermined set of attributes Input to learning scheme: set of instancesdataset Represented as a sin

(1)

1

Data Mining: Practical Machine Learning Tools and Techniques (Chapter 2) 03/04/06

Data Mining

Practical Machine Learning Tools and Techniques

Slides for Chapter 2 of Data Mining by I. H. Witten and E. Frank

2

Input:

Concepts, instances, attributes

Terminology What’s a concept?

♦Classification, association, clustering, numeric prediction

What’s in an example?

♦Relations, flat files, recursion

What’s in an attribute?

♦Nominal, ordinal, interval, ratio

Preparing the input

♦ARFF, attributes, missing values, getting to know data

3

Terminology

✁ Components of the input:

♦Concepts: kinds of things that can be learned

✂

Aim: intelligible and operational concept description

♦Instances: the individual, independent examples of a concept

✂

Note: more complicated forms of input are possible

♦Attributes: measuring aspects of an instance

✂

We will focus on nominal and numeric ones

4

What’s a concept?

✁ Styles of learning:

♦Classification learning: predicting a discrete class

♦Association learning:

detecting associations between features

♦Clustering:

grouping similar instances into clusters

♦Numeric prediction: predicting a numeric quantity

✁ Concept: thing to be learned ✁

Concept description: output of learning scheme

5

Classification learning

✁

Example problems: weather data, contact lenses, irises, labor negotiations

✁

Classification learning is supervised ♦Scheme is provided with actual outcome

✁

Outcome is called the class of the example ✁

Measure success on fresh data for which class labels are known (test data) ✁

In practice success is often measured subjectively

6

Association learning

✁

Can be applied if no class is specified and any kind of structure is considered “interesting” ✁

Difference to classification learning:

♦Can predict any attribute’s value, not just the class, and more than one attribute’s value at a time

♦Hence: far more association rules than classification rules

♦Thus: constraints are necessary

✂

(2)

7

Clustering

✁

Finding groups of items that are similar ✁

Clustering is unsupervised ♦The class of an example is not known

✁

Success often measured subjectively

… … …

Iris virginica 1 .9

Iris virginica 2 .5

6 .0 3 .3 6 .3

Iris versicolor 1 .5

4.5 3 .2 6 .4

Iris versicolor 1 .4

4.7 3 .2 7 .0

Iris setosa 0 .2

1 .4 3 .0 4 .9

Iris setosa 0 .2

1 .4 3 .5 5 .1

Type Petal width Petal length Sepal width Sepal length

8

Numeric prediction

✁

Variant of classification learning where “class” is numeric (also called “regression”) ✁

Learning is supervised

♦Scheme is being provided with target value

✁ Measure success on test data

… Overcast

0

Play- time

Windy Humidity Temperature Outlook

9

What’s in an example?

Instance: specific type of example ✁

Thing to be classified, associated, or clustered

✁

Individual, independent example of target concept

✁

Characterized by a predetermined set of attributes

Input to learning scheme: set of instances/dataset

✁

Represented as a single relation/flat file

Rather restricted form of input ✁

No relationships between objects

Most common form in practical data mining

1 0

A family tree

=

Family tree represented as a table

Ian

The “sister-of” relation

yes

Sister of? Second person First person

No

All the rest

Yes

Sister of? Second person First person

(3)

1 3

A full representation in one table

Ian Parent 2

Fem ale Fem ale Fem ale Fem ale Fem ale Fem ale Gender Parent 1 Name

Parent 2 Parent 1 Gender

Fem ale Fem ale Male

St even

Sister of ? Second person

First person

If second person’s gender = female

and first person’s parent = second person’s parent then sister-of = yes

1 4

Generating a flat file

✁

Process of flattening called “denormalization” ♦Several relations are joined together to make one

✁

Possible with any finite set of finite relations ✁

Problematic: relationships without pre-specified number of objects

♦Example: concept of nuclear-family ✁

Denormalization may produce spurious regularities that reflect structure of database

♦Example: “supplier” predicts “supplier address”

1 5

The “ancestor-of” relation

Yes Other positive ex am ples here

Yes Ian Pam Femal e Nikk i ? ? Femal e Grace Parent 2

Male Femal e Femal e Femal e Femal e Male

Femal e Femal e Male Male Male Male

No All the rest

Yes

Ancestor of? Second person

First person

1 6

Recursion

✁

Appropriate techniques are known as “inductive logic programming”

♦(e.g. Quinlan’s FOIL)

♦Problems: (a) noise and (b) computational complexity

If person1 is a parent of person2 then person1 is an ancestor of person2

If person1 is a parent of person2 and person2 is an ancestor of person3 then person1 is an ancestor of person3

✁

Infinite relations require recursion

1 7

What’s in an attribute?

✁

Each instance is described by a fixed predefined set of features, its “attributes” ✁

But: number of attributes may vary in practice ♦Possible solution: “irrelevant value” flag

✁

Related problem: existence of an attribute may depend of value of another one ✁

Possible attribute types (“levels of measurement”):

♦Nominal, ordinal, interval and ratio

1 8

Nominal quantities

✁ Values are distinct symbols

♦Values themselves serve only as labels or names

♦Nominal comes from the Latin word for name

✁

Example: attribute “outlook” from weather data ♦Values: “sunny”,”overcast”, and “rainy”

✁

No relation is implied among nominal values (no ordering or distance measure)

✁

(4)

1 9

Ordinal quantities

✁ Impose order on values

✁

But: no distance between values defined ✁

Example:

attribute “temperature” in weather data ♦Values: “hot” > “mild” > “cool”

✁

Note: addition and subtraction don’t make sense ✁

Example rule:

temperature < hot ⇒ play = yes ✁

Distinction between nominal and ordinal not always clear (e.g. attribute “outlook”)

20

Interval quantities

✁

Interval quantities are not only ordered but measured in fixed and equal units ✁

Example 1: attribute “temperature” expressed in degrees Fahrenheit

✁ Example 2: attribute “year” ✁

Difference of two values makes sense ✁

Sum or product doesn’t make sense ♦Zero point is not defined!

2 1

Ratio quantities

✁

Ratio quantities are ones for which the measurement scheme defines a zero point ✁

Example: attribute “distance”

♦Distance between an object and itself is zero

✁

Ratio quantities are treated as real numbers ♦All mathematical operations are allowed

✁

But: is there an “inherently” defined zero point? ♦Answer depends on scientific knowledge (e.g.

Fahrenheit knew no lower limit to temperature)

22

Attribute types used in practice

✁

Most schemes accommodate just two levels of measurement: nominal and ordinal

✁

Nominal attributes are also called “categorical”, ”enumerated”, or “discrete”

♦But: “enumerated” and “discrete” imply order

✁

Special case: dichotomy (“boolean” attribute) ✁

Ordinal attributes are called “numeric”, or “continuous”

♦But: “continuous” implies mathematical continuity

2 3

Metadata

✁

Information about the data that encodes background knowledge

✁

Can be used to restrict search space ✁

Examples:

♦Dimensional considerations

(i.e. expressions must be dimensionally correct)

♦Circular orderings (e.g. degrees in compass)

♦Partial orderings

(e.g. generalization/specialization relations)

2 4

Preparing the input

✁

Denormalization is not the only issue ✁

Problem: different data sources (e.g. sales department, customer billing department, …)

♦Differences: styles of record keeping, conventions, time periods, data aggregation, primary keys, errors

♦Data must be assembled, integrated, cleaned up

♦“Data warehouse”: consistent point of access

✁

External data may be required (“overlay data”)

✁

(5)

2 5

The ARFF format

%

% ARFF file for weather data with some numeric features %

@relation weather

@attribute outlook {sunny, overcast, rainy} @attribute temperature numeric

@attribute humidity numeric @attribute windy {true, false} @attribute play? {yes, no}

@data

sunny, 85, 85, false, no sunny, 80, 90, true, no overcast, 83, 86, false, yes ...

26

Additional attribute types

ARFF supports

string

attributes:

♦_{Similar to nominal attributes but list of values}

is not pre-specified

It also supports

date

attributes:

♦Uses the ISO-8601 combined date and time format yyyy-MM-dd-THH:mm:ss

@attribute description string

@attribute today date

2 7

Sparse data

In some applications most attribute values in

a dataset are zero

♦E.g.: word counts in a text categorization problem

ARFF supports sparse data

This also works for nominal attributes (where

the first value corresponds to “zero”)

0, 26, 0, 0, 0 ,0, 63, 0, 0, 0, “class A” 0, 0, 0, 42, 0, 0, 0, 0, 0, 0, “class B”

{1 26, 6 63, 10 “class A”} {3 42, 10 “class B”}

2 8

Attribute types

✁

Interpretation of attribute types in ARFF depends on learning scheme

♦Numeric attributes are interpreted as

✂

ordinal scales if less-than and greater-than are used

✂

ratio scales if distance calculations are performed (normalization/standardization may be required)

♦Instance-based schemes define distance between nominal values (0 if values are equal, 1 otherwise)

✁

Integers in some given data file: nominal, ordinal, or ratio scale?

2 9

Nominal vs. ordinal

Attribute “age” nominal

Attribute “age” ordinal

(e.g. “young” < “pre-presbyopic” < “presbyopic”)

If age = young and astigmatic = no and tear production rate = normal then recommendation = soft

If age = pre-presbyopic and astigmatic = no and tear production rate = normal then recommendation = soft

If age ≤ pre-presbyopic and astigmatic = no and tear production rate = normal then recommendation = soft

3 0

Missing values

✁

Frequently indicated by out-of-range entries ♦Types: unknown, unrecorded, irrelevant

♦Reasons:

✂

malfunctioning equipment

✂

changes in experimental design

✂

collation of different datasets

✂

measurement not possible

✁

Missing value may have significance in itself (e.g. missing test in a medical examination)

(6)

3 1

Inaccurate values

✁

Reason: data has not been collected for mining it

✁

Result: errors and omissions that don’t affect original purpose of data (e.g. age of customer)

✁

Typographical errors in nominal attributes ⇒

values need to be checked for consistency

✁

Typographical and measurement errors in numeric attributes ⇒ outliers need to be identified

✁

Errors may be deliberate (e.g. wrong zip codes)

✁

Getting to know the data

✁

Simple visualization tools are very useful ♦Nominal attributes: histograms (Distribution

consistent with background knowledge?)

♦Numeric attributes: graphs (Any obvious outliers?)

✁

2-D and 3-D plots show dependencies ✁

Need to consult domain experts ✁