• Tidak ada hasil yang ditemukan

Instance: specific type of example Thing to be classified, associated, or clustered Individual, independent example of target concept Characterized by a predetermined set of attributes Input to learning scheme: set of instancesdataset Represented as a sin

N/A
N/A
Protected

Academic year: 2019

Membagikan "Instance: specific type of example Thing to be classified, associated, or clustered Individual, independent example of target concept Characterized by a predetermined set of attributes Input to learning scheme: set of instancesdataset Represented as a sin"

Copied!
6
0
0

Teks penuh

(1)

1

Data Mining: Practical Machine Learning Tools and Techniques (Chapter 2) 03/04/06

Data Mining

Practical Machine Learning Tools and Techniques

Slides for Chapter 2 of Data Mining by I. H. Witten and E. Frank

2

Data Mining: Practical Machine Learning Tools and Techniques (Chapter 2) 03/04/06

Input:

Concepts, instances, attributes

Terminology What’s a concept?

♦Classification, association, clustering, numeric prediction

What’s in an example?

♦Relations, flat files, recursion

What’s in an attribute?

♦Nominal, ordinal, interval, ratio

Preparing the input

♦ARFF, attributes, missing values, getting to know data

3

Data Mining: Practical Machine Learning Tools and Techniques (Chapter 2) 03/04/06

Terminology

✁ Components of the input:

♦Concepts: kinds of things that can be learned

Aim: intelligible and operational concept description

♦Instances: the individual, independent examples of a concept

Note: more complicated forms of input are possible

♦Attributes: measuring aspects of an instance

We will focus on nominal and numeric ones

4

Data Mining: Practical Machine Learning Tools and Techniques (Chapter 2) 03/04/06

What’s a concept?

✁ Styles of learning:

♦Classification learning: predicting a discrete class

♦Association learning:

detecting associations between features

♦Clustering:

grouping similar instances into clusters

♦Numeric prediction: predicting a numeric quantity

✁ Concept: thing to be learned ✁

Concept description: output of learning scheme

5

Data Mining: Practical Machine Learning Tools and Techniques (Chapter 2) 03/04/06

Classification learning

Example problems: weather data, contact lenses, irises, labor negotiations

Classification learning is supervised ♦Scheme is provided with actual outcome

Outcome is called the class of the example ✁

Measure success on fresh data for which class labels are known (test data) ✁

In practice success is often measured subjectively

6

Data Mining: Practical Machine Learning Tools and Techniques (Chapter 2) 03/04/06

Association learning

Can be applied if no class is specified and any kind of structure is considered “interesting” ✁

Difference to classification learning:

♦Can predict any attribute’s value, not just the class, and more than one attribute’s value at a time

♦Hence: far more association rules than classification rules

♦Thus: constraints are necessary

(2)

7

Data Mining: Practical Machine Learning Tools and Techniques (Chapter 2) 03/04/06

Clustering

Finding groups of items that are similar ✁

Clustering is unsupervised ♦The class of an example is not known

Success often measured subjectively

Iris virginica 1 .9

Iris virginica 2 .5

6 .0 3 .3 6 .3

Iris versicolor 1 .5

4.5 3 .2 6 .4

Iris versicolor 1 .4

4.7 3 .2 7 .0

Iris setosa 0 .2

1 .4 3 .0 4 .9

Iris setosa 0 .2

1 .4 3 .5 5 .1

Type Petal width Petal length Sepal width Sepal length

8

Data Mining: Practical Machine Learning Tools and Techniques (Chapter 2) 03/04/06

Numeric prediction

Variant of classification learning where “class” is numeric (also called “regression”) ✁

Learning is supervised

♦Scheme is being provided with target value

✁ Measure success on test data

… Overcast

0

Play- time

Windy Humidity Temperature Outlook

9

Data Mining: Practical Machine Learning Tools and Techniques (Chapter 2) 03/04/06

What’s in an example?

Instance: specific type of example ✁

Thing to be classified, associated, or clustered

Individual, independent example of target concept

Characterized by a predetermined set of attributes

Input to learning scheme: set of instances/dataset

Represented as a single relation/flat file

Rather restricted form of input ✁

No relationships between objects

Most common form in practical data mining

1 0

Data Mining: Practical Machine Learning Tools and Techniques (Chapter 2) 03/04/06

A family tree

=

Data Mining: Practical Machine Learning Tools and Techniques (Chapter 2) 03/04/06

Family tree represented as a table

Ian

Data Mining: Practical Machine Learning Tools and Techniques (Chapter 2) 03/04/06

The “sister-of” relation

yes

Sister of? Second person First person

No

All the rest

Yes

Sister of? Second person First person

(3)

1 3

Data Mining: Practical Machine Learning Tools and Techniques (Chapter 2) 03/04/06

A full representation in one table

Ian Parent 2

Fem ale Fem ale Fem ale Fem ale Fem ale Fem ale Gender Parent 1 Name

Parent 2 Parent 1 Gender

Fem ale Fem ale Male

St even

Sister of ? Second person

First person

If second person’s gender = female

and first person’s parent = second person’s parent then sister-of = yes

1 4

Data Mining: Practical Machine Learning Tools and Techniques (Chapter 2) 03/04/06

Generating a flat file

Process of flattening called “denormalization” ♦Several relations are joined together to make one

Possible with any finite set of finite relations ✁

Problematic: relationships without pre-specified number of objects

♦Example: concept of nuclear-family

Denormalization may produce spurious regularities that reflect structure of database

♦Example: “supplier” predicts “supplier address”

1 5

Data Mining: Practical Machine Learning Tools and Techniques (Chapter 2) 03/04/06

The “ancestor-of” relation

Yes Other positive ex am ples here

Yes Ian Pam Femal e Nikk i ? ? Femal e Grace Parent 2

Male Femal e Femal e Femal e Femal e Male

Femal e Femal e Male Male Male Male

No All the rest

Yes

Ancestor of? Second person

First person

1 6

Data Mining: Practical Machine Learning Tools and Techniques (Chapter 2) 03/04/06

Recursion

Appropriate techniques are known as “inductive logic programming”

♦(e.g. Quinlan’s FOIL)

♦Problems: (a) noise and (b) computational complexity

If person1 is a parent of person2 then person1 is an ancestor of person2

If person1 is a parent of person2 and person2 is an ancestor of person3 then person1 is an ancestor of person3

Infinite relations require recursion

1 7

Data Mining: Practical Machine Learning Tools and Techniques (Chapter 2) 03/04/06

What’s in an attribute?

Each instance is described by a fixed predefined set of features, its “attributes” ✁

But: number of attributes may vary in practice ♦Possible solution: “irrelevant value” flag

Related problem: existence of an attribute may depend of value of another one ✁

Possible attribute types (“levels of measurement”):

Nominal, ordinal, interval and ratio

1 8

Data Mining: Practical Machine Learning Tools and Techniques (Chapter 2) 03/04/06

Nominal quantities

✁ Values are distinct symbols

♦Values themselves serve only as labels or names

Nominal comes from the Latin word for name

Example: attribute “outlook” from weather data ♦Values: “sunny”,”overcast”, and “rainy”

No relation is implied among nominal values (no ordering or distance measure)

(4)

1 9

Data Mining: Practical Machine Learning Tools and Techniques (Chapter 2) 03/04/06

Ordinal quantities

✁ Impose order on values

But: no distance between values defined ✁

Example:

attribute “temperature” in weather data ♦Values: “hot” > “mild” > “cool”

Note: addition and subtraction don’t make sense ✁

Example rule:

temperature < hot ⇒ play = yes ✁

Distinction between nominal and ordinal not always clear (e.g. attribute “outlook”)

20

Data Mining: Practical Machine Learning Tools and Techniques (Chapter 2) 03/04/06

Interval quantities

Interval quantities are not only ordered but measured in fixed and equal units ✁

Example 1: attribute “temperature” expressed in degrees Fahrenheit

✁ Example 2: attribute “year” ✁

Difference of two values makes sense ✁

Sum or product doesn’t make sense ♦Zero point is not defined!

2 1

Data Mining: Practical Machine Learning Tools and Techniques (Chapter 2) 03/04/06

Ratio quantities

Ratio quantities are ones for which the measurement scheme defines a zero point ✁

Example: attribute “distance”

♦Distance between an object and itself is zero

Ratio quantities are treated as real numbers ♦All mathematical operations are allowed

But: is there an “inherently” defined zero point? ♦Answer depends on scientific knowledge (e.g.

Fahrenheit knew no lower limit to temperature)

22

Data Mining: Practical Machine Learning Tools and Techniques (Chapter 2) 03/04/06

Attribute types used in practice

Most schemes accommodate just two levels of measurement: nominal and ordinal

Nominal attributes are also called “categorical”, ”enumerated”, or “discrete”

♦But: “enumerated” and “discrete” imply order

Special case: dichotomy (“boolean” attribute) ✁

Ordinal attributes are called “numeric”, or “continuous”

♦But: “continuous” implies mathematical continuity

2 3

Data Mining: Practical Machine Learning Tools and Techniques (Chapter 2) 03/04/06

Metadata

Information about the data that encodes background knowledge

Can be used to restrict search space ✁

Examples:

♦Dimensional considerations

(i.e. expressions must be dimensionally correct)

♦Circular orderings (e.g. degrees in compass)

♦Partial orderings

(e.g. generalization/specialization relations)

2 4

Data Mining: Practical Machine Learning Tools and Techniques (Chapter 2) 03/04/06

Preparing the input

Denormalization is not the only issue ✁

Problem: different data sources (e.g. sales department, customer billing department, …)

♦Differences: styles of record keeping, conventions, time periods, data aggregation, primary keys, errors

♦Data must be assembled, integrated, cleaned up

♦“Data warehouse”: consistent point of access

External data may be required (“overlay data”)

(5)

2 5

Data Mining: Practical Machine Learning Tools and Techniques (Chapter 2) 03/04/06

The ARFF format

%

% ARFF file for weather data with some numeric features %

@relation weather

@attribute outlook {sunny, overcast, rainy} @attribute temperature numeric

@attribute humidity numeric @attribute windy {true, false} @attribute play? {yes, no}

@data

sunny, 85, 85, false, no sunny, 80, 90, true, no overcast, 83, 86, false, yes ...

26

Data Mining: Practical Machine Learning Tools and Techniques (Chapter 2) 03/04/06

Additional attribute types

ARFF supports

string

attributes:

Similar to nominal attributes but list of values

is not pre-specified

It also supports

date

attributes:

♦Uses the ISO-8601 combined date and time format yyyy-MM-dd-THH:mm:ss

@attribute description string

@attribute today date

2 7

Data Mining: Practical Machine Learning Tools and Techniques (Chapter 2) 03/04/06

Sparse data

In some applications most attribute values in

a dataset are zero

♦E.g.: word counts in a text categorization problem

ARFF supports sparse data

This also works for nominal attributes (where

the first value corresponds to “zero”)

0, 26, 0, 0, 0 ,0, 63, 0, 0, 0, “class A” 0, 0, 0, 42, 0, 0, 0, 0, 0, 0, “class B”

{1 26, 6 63, 10 “class A”} {3 42, 10 “class B”}

2 8

Data Mining: Practical Machine Learning Tools and Techniques (Chapter 2) 03/04/06

Attribute types

Interpretation of attribute types in ARFF depends on learning scheme

♦Numeric attributes are interpreted as

ordinal scales if less-than and greater-than are used

ratio scales if distance calculations are performed (normalization/standardization may be required)

♦Instance-based schemes define distance between nominal values (0 if values are equal, 1 otherwise)

Integers in some given data file: nominal, ordinal, or ratio scale?

2 9

Data Mining: Practical Machine Learning Tools and Techniques (Chapter 2) 03/04/06

Nominal vs. ordinal

Attribute “age” nominal

Attribute “age” ordinal

(e.g. “young” < “pre-presbyopic” < “presbyopic”)

If age = young and astigmatic = no and tear production rate = normal then recommendation = soft

If age = pre-presbyopic and astigmatic = no and tear production rate = normal then recommendation = soft

If age pre-presbyopic and astigmatic = no and tear production rate = normal then recommendation = soft

3 0

Data Mining: Practical Machine Learning Tools and Techniques (Chapter 2) 03/04/06

Missing values

Frequently indicated by out-of-range entries ♦Types: unknown, unrecorded, irrelevant

♦Reasons:

malfunctioning equipment

changes in experimental design

collation of different datasets

measurement not possible

Missing value may have significance in itself (e.g. missing test in a medical examination)

(6)

3 1

Data Mining: Practical Machine Learning Tools and Techniques (Chapter 2) 03/04/06

Inaccurate values

Reason: data has not been collected for mining it

Result: errors and omissions that don’t affect original purpose of data (e.g. age of customer)

Typographical errors in nominal attributes ⇒

values need to be checked for consistency

Typographical and measurement errors in numeric attributes ⇒ outliers need to be identified

Errors may be deliberate (e.g. wrong zip codes)

Other problems: duplicates, stale data

3 2

Data Mining: Practical Machine Learning Tools and Techniques (Chapter 2) 03/04/06

Getting to know the data

Simple visualization tools are very useful ♦Nominal attributes: histograms (Distribution

consistent with background knowledge?)

♦Numeric attributes: graphs (Any obvious outliers?)

2-D and 3-D plots show dependencies ✁

Need to consult domain experts ✁

Referensi

Dokumen terkait

Di samping itu, jurnal ilmiah di bidang Keperawatan ini akan menjadi wahana temuan ilmiah dan teknologi yang akan bermanfaat bagi ilmuwan, Pendidik dan Praktisi di

Menurut Gelles kekerasan terhadap anak (child abuse) dapat didefinisikan sebagai peristiwa pelukaan fisik, mental, atau seksual yang umumnya dilakukan oleh orang-orang yang

Jika tumbuhan C-3 menutup stomatanya karena temperatur sangat tinggi maka akan berakibat pada proses fotosintesis akan terhenti (stagnan). Hal ini disebabkan karena pada keadaan

Pada hari ke-4 dan 5 keadaan buah tidak jauh beda yaitu buahnyan layu, warnanya hitam, teksturnya menjadi lembek, jamur yang tumbuh semakin banyak, mulai keluar air, ini pada

Belanja adalah semua pengeluaran dari rekening kas umum negara atau daerah yang mengurangi saldo anggaran lebih dalam periode tahun anggaran bersangkutan yang

Untuk tingkat perkembangan parasitoid yang dipelihara pada suhu yang berbeda, dihitung unit suhu harian atau derajat hari ( degree day = DD ) yang dibutuhkan untuk perkembangan

Dalam setiap usaha koperasi pembagian sisa hasil usaha yang dilakukan kepada para anggota memiliki sistem dan prosedur yang berbeda sesuai dengan kebijakan pihak

Dari hasil penelitian keberhasilan alat dalam melakukan pencampuran cat berdasarkan volume yang diinginkan dengan komposisi cat warna dasar, rancang bangun alat pencampur cat