Academic year: 2019


Data Mining: Practical Machine Learning Tools and Techniques (Chapter 2) 03/04/06

Data Mining

Practical Machine Learning Tools and Techniques

Slides for Chapter 2 of Data Mining by I. H. Witten and E. Frank


Input: Concepts, instances, attributes

• Terminology
• What’s a concept?
  ♦ Classification, association, clustering, numeric prediction
• What’s in an example?
  ♦ Relations, flat files, recursion
• What’s in an attribute?
  ♦ Nominal, ordinal, interval, ratio
• Preparing the input
  ♦ ARFF, attributes, missing values, getting to know data


Terminology

• Components of the input:
  ♦ Concepts: kinds of things that can be learned
      Aim: intelligible and operational concept description
  ♦ Instances: the individual, independent examples of a concept
      Note: more complicated forms of input are possible
  ♦ Attributes: measuring aspects of an instance
      We will focus on nominal and numeric ones


What’s a concept?

• Styles of learning:
  ♦ Classification learning: predicting a discrete class
  ♦ Association learning: detecting associations between features
  ♦ Clustering: grouping similar instances into clusters
  ♦ Numeric prediction: predicting a numeric quantity
• Concept: thing to be learned
• Concept description: output of learning scheme


Classification learning

• Example problems: weather data, contact lenses, irises, labor negotiations
• Classification learning is supervised
  ♦ Scheme is provided with actual outcome
• Outcome is called the class of the example
• Measure success on fresh data for which class labels are known (test data)
• In practice success is often measured subjectively
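The idea of measuring success on fresh, labelled test data can be sketched in a few lines of Python. This is a minimal illustration, not the method of any particular tool: the instances are made up, and the learner (predict the most frequent class for each value of one attribute) is chosen here only because it is short.

```python
from collections import Counter

# Hypothetical labelled instances: ((outlook, windy), play). Values are made up.
train = [
    (("sunny", False), "no"),
    (("sunny", True), "no"),
    (("overcast", False), "yes"),
    (("rainy", False), "yes"),
    (("rainy", False), "yes"),
    (("rainy", True), "no"),
]
# Fresh instances whose class labels are known but were not used for learning.
test = [
    (("overcast", True), "yes"),
    (("sunny", False), "no"),
    (("rainy", False), "yes"),
    (("rainy", True), "no"),
]

def fit_majority_per_outlook(instances):
    """Learn the most frequent class for each value of the outlook attribute."""
    counts = {}
    for (outlook, _windy), label in instances:
        counts.setdefault(outlook, Counter())[label] += 1
    return {value: c.most_common(1)[0][0] for value, c in counts.items()}

def accuracy(model, instances, default="yes"):
    """Fraction of instances whose predicted class equals the actual class."""
    correct = sum(model.get(attrs[0], default) == label for attrs, label in instances)
    return correct / len(instances)

model = fit_majority_per_outlook(train)
test_accuracy = accuracy(model, test)
print(model, test_accuracy)
```

The model gets three of the four test instances right; the windy, rainy day is misclassified because the rule looks only at outlook.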


Association learning

• Can be applied if no class is specified and any kind of structure is considered “interesting”
• Difference to classification learning:
  ♦ Can predict any attribute’s value, not just the class, and more than one attribute’s value at a time
  ♦ Hence: far more association rules than classification rules
  ♦ Thus: constraints are necessary
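The slide does not say which constraints are meant. A common choice, assumed here, is to keep only rules with a minimum coverage (support) and a minimum accuracy (confidence). The sketch below computes both for a candidate rule over made-up instances; all names are illustrative.

```python
# Hypothetical instances; values are made up for illustration.
instances = [
    {"outlook": "sunny", "windy": "false", "play": "no"},
    {"outlook": "sunny", "windy": "true", "play": "no"},
    {"outlook": "overcast", "windy": "false", "play": "yes"},
    {"outlook": "rainy", "windy": "false", "play": "yes"},
    {"outlook": "rainy", "windy": "true", "play": "no"},
]

def matches(instance, conditions):
    """True if the instance has every attribute value listed in conditions."""
    return all(instance.get(attr) == value for attr, value in conditions.items())

def coverage_and_accuracy(instances, antecedent, consequent):
    """Coverage: instances the rule predicts correctly.
    Accuracy: that number as a proportion of the instances the rule applies to."""
    applies = [i for i in instances if matches(i, antecedent)]
    correct = [i for i in applies if matches(i, consequent)]
    accuracy = len(correct) / len(applies) if applies else 0.0
    return len(correct), accuracy

# Any attribute may appear on the right-hand side, not only the class.
print(coverage_and_accuracy(instances, {"outlook": "sunny"}, {"play": "no"}))
print(coverage_and_accuracy(instances, {"windy": "false"}, {"play": "yes"}))
```

A rule would then be kept only if both numbers exceed thresholds chosen by the user.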


Clustering

• Finding groups of items that are similar
• Clustering is unsupervised
  ♦ The class of an example is not known
• Success often measured subjectively

Sepal length  Sepal width  Petal length  Petal width  Type
5.1           3.5          1.4           0.2          Iris setosa
4.9           3.0          1.4           0.2          Iris setosa
7.0           3.2          4.7           1.4          Iris versicolor
6.4           3.2          4.5           1.5          Iris versicolor
6.3           3.3          6.0           2.5          Iris virginica
…             …            …             1.9          Iris virginica


Numeric prediction

• Variant of classification learning where “class” is numeric (also called “regression”)
• Learning is supervised
  ♦ Scheme is being provided with target value
• Measure success on test data

[Table: weather data with columns Outlook, Temperature, Humidity, Windy, Play-time; rows not recoverable]


What’s in an example?

• Instance: specific type of example
  ♦ Thing to be classified, associated, or clustered
  ♦ Individual, independent example of target concept
  ♦ Characterized by a predetermined set of attributes
• Input to learning scheme: set of instances/dataset
  ♦ Represented as a single relation/flat file
• Rather restricted form of input
  ♦ No relationships between objects
• Most common form in practical data mining


A family tree

[Figure: family tree diagram]

Family tree represented as a table

[Table: one row per person, with columns Name, Gender, Parent 1, Parent 2; rows not recoverable]

The “sister-of” relation

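[Table: pairs of people with columns First person, Second person, Sister of?; cell values not recoverable]

[Table: the same relation with columns First person, Second person, Sister of?, listing the positive examples as Yes and “All the rest” as No]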


A full representation in one table

[Table: one row per pair, with columns Name, Gender, Parent 1, Parent 2 for the first person, the same four columns for the second person, and Sister of?; cell values not recoverable]

If second person’s gender = female and first person’s parent = second person’s parent then sister-of = yes


Generating a flat file

• Process of flattening called “denormalization”
  ♦ Several relations are joined together to make one
• Possible with any finite set of finite relations
• Problematic: relationships without pre-specified number of objects
  ♦ Example: concept of nuclear-family
• Denormalization may produce spurious regularities that reflect structure of database
  ♦ Example: “supplier” predicts “supplier address”
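As a rough sketch of denormalization, the Python below joins a relation of people with a relation of person pairs into one flat table, then applies the sister-of rule from the earlier slide to each flattened row. All names, values, and helper functions are hypothetical and exist only for illustration.

```python
# Hypothetical relations; all names and values are made up.
people = {
    "Ann":   {"gender": "female", "parent1": "Pat", "parent2": "Sam"},
    "Bob":   {"gender": "male",   "parent1": "Pat", "parent2": "Sam"},
    "Cathy": {"gender": "female", "parent1": "Lee", "parent2": "Kim"},
}
sister_of = [("Bob", "Ann", "yes"), ("Ann", "Cathy", "no")]

def denormalize(pairs, people):
    """Join each (first, second, label) pair with both persons' attributes."""
    rows = []
    for first, second, label in pairs:
        row = {"first": first, "second": second, "sister_of": label}
        for role, name in (("first", first), ("second", second)):
            for attr, value in people[name].items():
                row[f"{role}_{attr}"] = value
        rows.append(row)
    return rows

def sister_rule(row):
    """Second person is female and both persons have the same parents."""
    same_parents = (row["first_parent1"], row["first_parent2"]) == (
        row["second_parent1"], row["second_parent2"])
    return "yes" if row["second_gender"] == "female" and same_parents else "no"

flat = denormalize(sister_of, people)
predictions = [sister_rule(row) for row in flat]
print(flat[0])
print(predictions)
```

Each flat row repeats a person's attributes wherever that person appears, which is how database structure can show up as a spurious regularity in the flattened data.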


The “ancestor-of” relation

[Table: pairs of people with columns First person, Second person, Ancestor of?, each person described by name, gender, and two parents; a few positive examples are listed as Yes, followed by “Other positive examples here” (Yes) and “All the rest” (No); cell values not recoverable]


Recursion

• Infinite relations require recursion

    If person1 is a parent of person2 then person1 is an ancestor of person2
    If person1 is a parent of person2 and person2 is an ancestor of person3 then person1 is an ancestor of person3

• Appropriate techniques are known as “inductive logic programming”
  ♦ (e.g. Quinlan’s FOIL)
  ♦ Problems: (a) noise and (b) computational complexity
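The two ancestor rules translate directly into a recursive function. The sketch below only evaluates the rules over a given set of parent-of facts; it does not learn them, which is the job of an inductive logic programming system. The facts are made up, and the function assumes the parent-of relation contains no cycles.

```python
# Hypothetical parent-of facts as (parent, child) pairs; names are made up.
parent_of = {
    ("Pat", "Ann"),
    ("Ann", "Joe"),
    ("Joe", "Mia"),
}

def is_ancestor(person1, person3, facts):
    """Apply the two rules: the base case first, then the recursive case."""
    # Rule 1: a parent is an ancestor.
    if (person1, person3) in facts:
        return True
    # Rule 2: person1 is a parent of some person2 who is an ancestor of person3.
    return any(
        is_ancestor(person2, person3, facts)
        for parent, person2 in facts
        if parent == person1
    )

print(is_ancestor("Pat", "Mia", parent_of))
print(is_ancestor("Mia", "Pat", parent_of))
```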


What’s in an attribute?

• Each instance is described by a fixed predefined set of features, its “attributes”
• But: number of attributes may vary in practice
  ♦ Possible solution: “irrelevant value” flag
• Related problem: existence of an attribute may depend on the value of another one
• Possible attribute types (“levels of measurement”):
  ♦ Nominal, ordinal, interval and ratio


Nominal quantities

• Values are distinct symbols
  ♦ Values themselves serve only as labels or names
  ♦ Nominal comes from the Latin word for name
• Example: attribute “outlook” from weather data
  ♦ Values: “sunny”, “overcast”, and “rainy”
• No relation is implied among nominal values (no ordering or distance measure)


Ordinal quantities

• Impose order on values
• But: no distance between values defined
• Example: attribute “temperature” in weather data
  ♦ Values: “hot” > “mild” > “cool”
• Note: addition and subtraction don’t make sense
• Example rule: temperature < hot ⇒ play = yes
• Distinction between nominal and ordinal not always clear (e.g. attribute “outlook”)
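One way to represent an ordinal attribute in code is to store the order explicitly and compare positions, without ever doing arithmetic on them. The sketch below encodes the slide's order for “temperature” and its example rule; the function names are hypothetical.

```python
# Order taken from the slide: "hot" > "mild" > "cool".
TEMPERATURE_ORDER = ["cool", "mild", "hot"]

def rank(value, order=TEMPERATURE_ORDER):
    """Position of a value in the declared order; used for comparison only."""
    return order.index(value)

def less_than(a, b, order=TEMPERATURE_ORDER):
    return rank(a, order) < rank(b, order)

def example_rule(temperature):
    """temperature < hot => play = yes; returns None when the rule does not fire."""
    return "yes" if less_than(temperature, "hot") else None

print(example_rule("mild"), example_rule("hot"))
```

The ranks are deliberately hidden behind `less_than`: subtracting them would suggest a distance between values that the ordinal scale does not define.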


Interval quantities

• Interval quantities are not only ordered but measured in fixed and equal units
• Example 1: attribute “temperature” expressed in degrees Fahrenheit
• Example 2: attribute “year”
• Difference of two values makes sense
• Sum or product doesn’t make sense
  ♦ Zero point is not defined!
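A small numeric check, using arbitrarily chosen temperatures, shows why differences are meaningful on an interval scale while ratios are not. Converting Fahrenheit to Celsius rescales every difference by the same factor, but the ratio of two readings changes because the zero point is arbitrary.

```python
def fahrenheit_to_celsius(f):
    return (f - 32) * 5 / 9

# Two illustrative readings in degrees Fahrenheit.
a_f, b_f = 80.0, 40.0
a_c, b_c = fahrenheit_to_celsius(a_f), fahrenheit_to_celsius(b_f)

diff_f = a_f - b_f    # gap in Fahrenheit degrees
diff_c = a_c - b_c    # same gap in Celsius degrees: diff_f * 5 / 9
ratio_f = a_f / b_f   # 2.0 in Fahrenheit
ratio_c = a_c / b_c   # about 6.0 in Celsius, so "twice as hot" is not meaningful

print(diff_f, diff_c, ratio_f, ratio_c)
```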


Ratio quantities

• Ratio quantities are ones for which the measurement scheme defines a zero point
• Example: attribute “distance”
  ♦ Distance between an object and itself is zero
• Ratio quantities are treated as real numbers
  ♦ All mathematical operations are allowed
• But: is there an “inherently” defined zero point?
  ♦ Answer depends on scientific knowledge (e.g. Fahrenheit knew no lower limit to temperature)


Attribute types used in practice

• Most schemes accommodate just two levels of measurement: nominal and ordinal
• Nominal attributes are also called “categorical”, “enumerated”, or “discrete”
  ♦ But: “enumerated” and “discrete” imply order
• Special case: dichotomy (“boolean” attribute)
• Ordinal attributes are called “numeric”, or “continuous”
  ♦ But: “continuous” implies mathematical continuity


Metadata

• Information about the data that encodes background knowledge
• Can be used to restrict search space
• Examples:
  ♦ Dimensional considerations (i.e. expressions must be dimensionally correct)
  ♦ Circular orderings (e.g. degrees in compass)
  ♦ Partial orderings (e.g. generalization/specialization relations)
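To illustrate how a circular ordering changes the way values are compared, the sketch below measures the distance between two compass bearings around the circle instead of along the number line. The function name and the default period of 360 degrees are choices made for this example.

```python
def circular_distance(a_deg, b_deg, period=360.0):
    """Shortest angular distance between two bearings on a circle."""
    diff = abs(a_deg - b_deg) % period
    return min(diff, period - diff)

# 350 and 10 degrees are 20 degrees apart, not 340.
print(circular_distance(350, 10))
```

A learning scheme that is told an attribute is circular could use such a distance instead of a plain numeric difference.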


Preparing the input

• Denormalization is not the only issue
• Problem: different data sources (e.g. sales department, customer billing department, …)
  ♦ Differences: styles of record keeping, conventions, time periods, data aggregation, primary keys, errors
  ♦ Data must be assembled, integrated, cleaned up
  ♦ “Data warehouse”: consistent point of access
• External data may be required (“overlay data”)


The ARFF format

%
% ARFF file for weather data with some numeric features
%
@relation weather

@attribute outlook {sunny, overcast, rainy}
@attribute temperature numeric
@attribute humidity numeric
@attribute windy {true, false}
@attribute play? {yes, no}

@data
sunny, 85, 85, false, no
sunny, 80, 90, true, no
overcast, 83, 86, false, yes
...
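To make the structure of the ARFF listing above concrete, here is a deliberately small Python reader. It is a sketch under strong assumptions: it handles only comments, `@relation`, nominal and numeric `@attribute` lines, and dense comma-separated data rows, and it ignores quoting, missing values, sparse rows, and string or date attributes. `parse_arff` is a hypothetical helper, not part of any library.

```python
ARFF_TEXT = """\
% ARFF file for weather data with some numeric features
@relation weather
@attribute outlook {sunny, overcast, rainy}
@attribute temperature numeric
@attribute humidity numeric
@attribute windy {true, false}
@attribute play? {yes, no}
@data
sunny, 85, 85, false, no
sunny, 80, 90, true, no
overcast, 83, 86, false, yes
"""

def parse_arff(text):
    """Return (relation name, [(attribute name, type)], [row dicts])."""
    relation, attributes, rows = None, [], []
    in_data = False
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("%"):
            continue  # skip blank lines and comments
        lower = line.lower()
        if in_data:
            values = [v.strip() for v in line.split(",")]
            row = {}
            for (name, kind), value in zip(attributes, values):
                row[name] = float(value) if kind == "numeric" else value
            rows.append(row)
        elif lower.startswith("@relation"):
            relation = line.split(None, 1)[1]
        elif lower.startswith("@attribute"):
            _, name, spec = line.split(None, 2)
            if spec.startswith("{"):
                # Nominal attribute: keep the list of permitted values.
                kind = [v.strip() for v in spec.strip("{}").split(",")]
            else:
                kind = spec.lower()
            attributes.append((name, kind))
        elif lower.startswith("@data"):
            in_data = True
    return relation, attributes, rows

relation, attributes, rows = parse_arff(ARFF_TEXT)
print(relation, len(attributes), rows[0])
```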


Additional attribute types

• ARFF supports string attributes:

    @attribute description string

  ♦ Similar to nominal attributes but list of values is not pre-specified
• It also supports date attributes:

    @attribute today date

  ♦ Uses the ISO-8601 combined date and time format yyyy-MM-dd'T'HH:mm:ss
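For readers who process ARFF files outside Weka, a date value in the ISO-8601 combined form (for example 2006-03-04T09:30:00) can be parsed with Python's standard library. The `strptime` pattern below is an assumed Python counterpart of the ARFF date pattern, and `parse_arff_date` is a hypothetical helper.

```python
from datetime import datetime

# Assumed Python equivalent of the ISO-8601 combined date and time pattern.
ARFF_DEFAULT_DATE = "%Y-%m-%dT%H:%M:%S"

def parse_arff_date(value, fmt=ARFF_DEFAULT_DATE):
    """Parse one date value, tolerating surrounding whitespace and double quotes."""
    return datetime.strptime(value.strip().strip('"'), fmt)

stamp = parse_arff_date('"2006-03-04T09:30:00"')
gap = parse_arff_date("2006-03-05T00:00:00") - parse_arff_date("2006-03-04T00:00:00")
print(stamp, gap.days)
```

Once parsed, dates behave like interval quantities: differences between two dates are meaningful, as the `gap` example shows.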


Sparse data

• In some applications most attribute values in a dataset are zero
  ♦ E.g.: word counts in a text categorization problem
• ARFF supports sparse data

    0, 26, 0, 0, 0, 0, 63, 0, 0, 0, "class A"
    0, 0, 0, 42, 0, 0, 0, 0, 0, 0, "class B"

    {1 26, 6 63, 10 "class A"}
    {3 42, 10 "class B"}

• This also works for nominal attributes (where the first value corresponds to “zero”)
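The dense-to-sparse conversion shown above can be sketched as follows. Attribute indices start at 0, and only non-zero entries are written. The helper `to_sparse` is hypothetical and handles numeric zeros only; it does not implement the nominal case, where the first declared value plays the role of zero.

```python
def to_sparse(values):
    """Format one dense row as a sparse ARFF row of 'index value' pairs."""
    pairs = []
    for index, value in enumerate(values):
        if value == 0:
            continue  # zeros are implicit in the sparse form
        text = f'"{value}"' if isinstance(value, str) else str(value)
        pairs.append(f"{index} {text}")
    return "{" + ", ".join(pairs) + "}"

dense_a = [0, 26, 0, 0, 0, 0, 63, 0, 0, 0, "class A"]
dense_b = [0, 0, 0, 42, 0, 0, 0, 0, 0, 0, "class B"]
print(to_sparse(dense_a))
print(to_sparse(dense_b))
```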


Attribute types

• Interpretation of attribute types in ARFF depends on learning scheme
  ♦ Numeric attributes are interpreted as
      ordinal scales if less-than and greater-than are used
      ratio scales if distance calculations are performed (normalization/standardization may be required)
  ♦ Instance-based schemes define distance between nominal values (0 if values are equal, 1 otherwise)
• Integers in some given data file: nominal, ordinal, or ratio scale?
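A distance function in the spirit of instance-based schemes might combine both ideas: numeric attributes are min-max normalized before differences are taken, and nominal attributes contribute 0 when equal and 1 otherwise. The sketch below is one possible arrangement using Euclidean distance, which is an assumption of this example; the three instances are the rows from the ARFF listing, restricted to three attributes.

```python
import math

def min_max_ranges(instances, numeric_attrs):
    """Observed (min, max) per numeric attribute, used for normalization."""
    return {
        attr: (min(i[attr] for i in instances), max(i[attr] for i in instances))
        for attr in numeric_attrs
    }

def distance(x, y, ranges):
    """Euclidean distance over normalized numeric and 0/1 nominal differences."""
    total = 0.0
    for attr in x:
        if attr in ranges:
            low, high = ranges[attr]
            span = (high - low) or 1.0  # avoid division by zero for constant attributes
            d = (x[attr] - y[attr]) / span
        else:
            d = 0.0 if x[attr] == y[attr] else 1.0
        total += d * d
    return math.sqrt(total)

instances = [
    {"outlook": "sunny", "temperature": 85.0, "humidity": 85.0},
    {"outlook": "sunny", "temperature": 80.0, "humidity": 90.0},
    {"outlook": "overcast", "temperature": 83.0, "humidity": 86.0},
]
ranges = min_max_ranges(instances, ["temperature", "humidity"])
print(distance(instances[0], instances[1], ranges))
print(distance(instances[0], instances[2], ranges))
```

Without normalization, an attribute measured on a large numeric scale would dominate the nominal 0/1 contributions.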


Nominal vs. ordinal

• Attribute “age” nominal

    If age = young and astigmatic = no and tear production rate = normal then recommendation = soft
    If age = pre-presbyopic and astigmatic = no and tear production rate = normal then recommendation = soft

• Attribute “age” ordinal (e.g. “young” < “pre-presbyopic” < “presbyopic”)

    If age ≤ pre-presbyopic and astigmatic = no and tear production rate = normal then recommendation = soft


Missing values

• Frequently indicated by out-of-range entries
  ♦ Types: unknown, unrecorded, irrelevant
  ♦ Reasons:
      malfunctioning equipment
      changes in experimental design
      collation of different datasets
      measurement not possible
• Missing value may have significance in itself (e.g. missing test in a medical examination)
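One possible way to handle out-of-range entries is to replace them with an explicit missing marker and, because missingness may carry information, to record it in a separate flag attribute. The records, the valid range, and the helper name below are hypothetical.

```python
def mark_missing(instances, attr, low, high):
    """Replace absent or out-of-range values with None and add a flag attribute."""
    cleaned = []
    for inst in instances:
        value = inst.get(attr)
        is_missing = value is None or not (low <= value <= high)
        new = dict(inst)  # leave the original record untouched
        new[attr] = None if is_missing else value
        new[attr + "_missing"] = is_missing
        cleaned.append(new)
    return cleaned

# Hypothetical records where -1 and 999 were used as "no value" codes.
records = [{"age": 34}, {"age": -1}, {"age": 999}, {}]
cleaned = mark_missing(records, "age", 0, 120)
print(cleaned)
```

The flag lets a learning scheme treat "value not recorded" as a signal in its own right, as in the medical examination example.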


Inaccurate values

• Reason: data has not been collected for mining it
• Result: errors and omissions that don’t affect original purpose of data (e.g. age of customer)
• Typographical errors in nominal attributes ⇒ values need to be checked for consistency
• Typographical and measurement errors in numeric attributes ⇒ outliers need to be identified
• Errors may be deliberate (e.g. wrong zip codes)
• Other problems: duplicates, stale data
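Both checks can be sketched with the standard library. Nominal values are compared against the declared value set, and numeric outliers are flagged with the interquartile-range heuristic. That heuristic is one common choice assumed here, not something the slides prescribe, and the data values are made up.

```python
from collections import Counter
from statistics import quantiles

def inconsistent_values(observed, declared):
    """Count observed nominal values that are not in the declared set."""
    return dict(Counter(v for v in observed if v not in declared))

def iqr_outliers(values, k=1.5):
    """Values lying more than k interquartile ranges outside the quartiles."""
    q1, _median, q3 = quantiles(values, n=4)
    spread = q3 - q1
    low, high = q1 - k * spread, q3 + k * spread
    return [v for v in values if v < low or v > high]

# Hypothetical data containing typing errors.
outlook = ["sunny", "suny", "rainy", "Overcast", "overcast"]
temperature = [85, 80, 83, 70, 68, 65, 64, 72, 69, 75, 850]

bad_labels = inconsistent_values(outlook, {"sunny", "overcast", "rainy"})
outliers = iqr_outliers(temperature)
print(bad_labels, outliers)
```

Flagged values still need a human decision: a misspelling can be corrected, whereas an extreme reading may be a real measurement.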


Getting to know the data

• Simple visualization tools are very useful
  ♦ Nominal attributes: histograms (Distribution consistent with background knowledge?)
  ♦ Numeric attributes: graphs (Any obvious outliers?)
• 2-D and 3-D plots show dependencies
• Need to consult domain experts
