© Eibe Frank 2003 (all figures © Stuart Russell and Peter Norvig, 2003)

Knowledge in Learning

Chapter 19


Outline

Learning and logic

Version space learning

Explanation-based learning

Learning using relevance information

Inductive logic programming

Learning and logic

Can restrict learning to hypotheses that can be expressed as logical sentences

Examples and classifications can also be expressed as logical sentences

The classification sentence can be inferred from the hypothesis and the example description

Straightforward to incorporate prior knowledge

Can be done with propositional as well as first-order logic

Incrementally updating a hypothesis

[Figure: panels (a) to (e) showing a hypothesis region around the positive (+) examples being incrementally generalized and specialized as new examples arrive]


Specialization vs. generalization

False positive: the hypothesis classifies an example as positive when it really is negative

We need to specialize the hypothesis to exclude the negative example

False negative: the hypothesis classifies an example as negative when it really is positive

We need to generalize the hypothesis to include the positive example
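The two cases above can be captured in a few lines. A minimal sketch, assuming a hypothesis is modeled as a predicate from an example to a boolean; the interval hypothesis `h` is an illustrative choice, not something fixed by the slides:

```python
# Hypothesis = predicate example -> bool (an assumed representation).

def error_type(hypothesis, example, label):
    """Return 'false positive', 'false negative', or None if the call is correct."""
    predicted = hypothesis(example)
    if predicted and not label:
        return "false positive"   # hypothesis must be specialized to exclude it
    if not predicted and label:
        return "false negative"   # hypothesis must be generalized to include it
    return None

# Toy hypothesis: "positive iff x lies in [0, 5]"
h = lambda x: 0.0 <= x <= 5.0
```

For example, `error_type(h, 6.0, True)` reports a false negative, since the hypothesis wrongly rejects a genuinely positive example.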


Current-best learning

function CURRENT-BEST-LEARNING(examples) returns a hypothesis
  H ← any hypothesis consistent with the first example in examples
  for each remaining example e in examples do
    if e is a false positive for H then
      H ← choose a specialization of H consistent with examples
    else if e is a false negative for H then
      H ← choose a generalization of H consistent with examples
    if no consistent specialization/generalization can be found then fail
  end
  return H
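The search above can be sketched concretely. A minimal illustration over 1-D interval hypotheses (an assumed hypothesis space; the slides leave the space abstract): generalizing widens the interval to include a positive, and specializing shrinks it to the tightest interval around the positives seen so far.

```python
def current_best_learning(examples):
    """Sketch of CURRENT-BEST-LEARNING over 1-D interval hypotheses.

    examples is a list of (x, label) pairs; a hypothesis (lo, hi)
    classifies x as positive iff lo <= x <= hi.  The interval space is
    an illustrative assumption, and the sketch assumes the first
    example is positive.
    """
    (x0, label0), rest = examples[0], examples[1:]
    if not label0:
        raise ValueError("sketch assumes the first example is positive")
    positives, negatives = [x0], []
    lo = hi = x0                                   # most specific consistent start
    for x, label in rest:
        (positives if label else negatives).append(x)
        covered = lo <= x <= hi
        if label and not covered:                  # false negative: generalize
            lo, hi = min(lo, x), max(hi, x)
        elif not label and covered:                # false positive: specialize
            # the tightest interval around the positives is the only candidate
            lo, hi = min(positives), max(positives)
        if any(lo <= n <= hi for n in negatives):  # consistency check
            raise ValueError("no consistent specialization/generalization")
    return lo, hi
```

When a negative example falls strictly between two positives, no interval is consistent and the function fails, mirroring the `fail` branch of the pseudocode.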


The version space

[Figure: the version space, ordered from more general (top) to more specific (bottom); boundary sets G = {G1, G2, G3, ..., Gm} and S = {S1, S2, ..., Sn}, with the regions above G and below S all inconsistent]


The boundaries of the version space

[Figure: boundary-set hypotheses S1, G1, and G2 drawn as regions around the positive (+) examples]


General-to-specific search

Start with the most general hypothesis

Until all examples have been processed:

If the current example is negative:
  Replace all hypotheses that cover the example by their most general specializations that do not cover the example
  Delete hypotheses that are more specific than other ones
  Delete hypotheses not consistent with all positive examples

If the current example is positive:
  Delete all hypotheses not consistent with the example


Specific-to-general search

Start with the most specific hypothesis

Until all examples have been processed:

If the current example is positive:
  Replace all hypotheses that do not cover the example by their most specific generalizations that cover the example
  Delete hypotheses that are more general than other ones
  Delete hypotheses covering a negative example

If the current example is negative:
  Delete all hypotheses covering the example

Comments on the two algorithms

General-to-specific search needs to maintain:
  The set of most general hypotheses G
  The set of positive examples seen so far

Specific-to-general search needs to maintain:
  The set of most specific hypotheses S
  The set of negative examples seen so far

Storing the examples can be avoided by maintaining the set of all consistent hypotheses!

Version space learning

function VERSION-SPACE-LEARNING(examples) returns a version space
  local variables: V, the version space: the set of all hypotheses
  V ← the set of all hypotheses
  for each example e in examples do
    if V is not empty then V ← VERSION-SPACE-UPDATE(V, e)
  end
  return V

function VERSION-SPACE-UPDATE(V, e) returns an updated version space
  V ← {h ∈ V : h is consistent with e}


Maintaining the version space

Naïve approach: explicitly storing all the consistent hypotheses

Observation: version space fully specified by boundary sets S and G

S: set of most specific hypotheses consistent with the examples seen so far

G: set of most general hypotheses consistent with the examples seen so far

We only need to maintain S and G !!


Incorporating a positive example

Delete all Gi that do not cover the example

Replace all Si that do not cover the example by their most specific generalizations that cover it

Delete all Si that are more general than some other member of S

Delete all Si that are not more specific than all members of G


Incorporating a negative example

Delete all Si that cover the example

Replace all Gi that cover the example by their most general specializations that do not cover it

Delete all Gi that are less general than some other member of G

Delete all Gi that are not more general than all members of S
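The positive- and negative-example updates of S and G can be sketched for the classic conjunctive hypothesis space, where a hypothesis is a tuple of required attribute values with '?' as a wildcard. This space is an illustrative assumption (in it, S is always a single hypothesis), and the pruning here is a simplified version of the full boundary maintenance:

```python
def matches(h, x):
    """A conjunctive hypothesis covers x if every attribute is '?' or agrees."""
    return all(hv in ("?", xv) for hv, xv in zip(h, x))

def min_generalize(s, x):
    """Most specific generalization of s covering x ('bottom' s is None)."""
    if s is None:
        return tuple(x)
    return tuple(sv if sv == xv else "?" for sv, xv in zip(s, x))

def min_specializations(g, x, domains):
    """Most general specializations of g that exclude negative example x."""
    out = []
    for i, gv in enumerate(g):
        if gv == "?":
            for v in domains[i]:
                if v != x[i]:
                    out.append(g[:i] + (v,) + g[i + 1:])
    return out

def more_general(h1, h2):
    """True iff h1 covers at least everything h2 covers."""
    return all(a == "?" or a == b for a, b in zip(h1, h2))

def candidate_elimination(examples, domains):
    """Maintain the S and G boundary sets over conjunctive hypotheses."""
    S = None                       # bottom: covers nothing yet
    G = [("?",) * len(domains)]    # top: covers everything
    for x, label in examples:
        if label:                                  # positive example
            G = [g for g in G if matches(g, x)]    # drop inconsistent members
            S = min_generalize(S, x)               # minimally generalize S
        else:                                      # negative example
            if S is not None and matches(S, x):
                raise ValueError("version space collapsed")
            new_G = []
            for g in G:
                if not matches(g, x):
                    new_G.append(g)
                    continue
                for spec in min_specializations(g, x, domains):
                    if S is None or more_general(spec, S):
                        new_G.append(spec)
            # drop members less general than some other member of G
            G = [g for g in new_G
                 if not any(h != g and more_general(h, g) for h in new_G)]
    return S, G
```

On the toy sequence (Sunny, Warm)+, (Rainy, Cold)-, (Sunny, Cold)+ over domains Sky in {Sunny, Rainy} and Temp in {Warm, Cold}, S and G both converge to ("Sunny", "?"): the second possibility of the termination slide.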


Termination of the algorithm

First possibility: we run out of data
  Several hypotheses survive

Second possibility: S and G converge
  We have found the only consistent hypothesis

Third possibility: the version space collapses
  No hypothesis survives
  This may be due to noise or an inappropriate hypothesis space


Learning with prior knowledge

[Figure: knowledge-based inductive learning: prior knowledge and observations feed into hypotheses, which in turn yield predictions]


Explanation-based learning

Entailment constraint for inductive learning:

Hypothesis ∧ Descriptions |= Classifications

Explanation-based learning: generalizing an example using background knowledge:

Hypothesis ∧ Descriptions |= Classifications
Background |= Hypothesis

We don't really learn anything new: we generate a rule that explains an example using our background knowledge

Why is this learning?

EBL learns a general rule that can be applied in situations similar to a given example
  A generalization of the memoization technique

Humans often learn the same way
  Learning to play a board game when you already know the rules

EBL is also called speedup learning
  Using learned rules we can solve problems faster
  New rules can be applied "without thinking"
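The memoization analogy can be made concrete. A hedged illustration using Fibonacci (not an example from the slides): the cached function does each piece of work exactly once, and afterwards answers repeated subproblems "without thinking".

```python
from functools import lru_cache

calls = {"plain": 0, "memo": 0}

def fib_plain(n):
    """Re-derives every subproblem from scratch each time."""
    calls["plain"] += 1
    return n if n < 2 else fib_plain(n - 1) + fib_plain(n - 2)

@lru_cache(maxsize=None)
def fib_memo(n):
    """Remembers solved subproblems; the cache entries play the role of
    stored results that later calls reuse without re-derivation."""
    calls["memo"] += 1
    return n if n < 2 else fib_memo(n - 1) + fib_memo(n - 2)

fib_plain(20)   # 21891 recursive calls
fib_memo(20)    # only 21 evaluations; repeats come from the cache
```

EBL goes one step further than this cache: instead of storing the answer to one specific subproblem, it stores a rule that covers a whole class of similar cases.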

Example: differentiation

Derivative of X²

Using the product rule: (X²)' = (1 × X) + (X × 1) = 2 × X

Many people can compute the derivative by inspection because they have learned the rule by looking at examples

We can compute the derivative using standard rules, but that takes more time

ArithmeticUnknown(u) ⇒ Derivative(u², u) = 2u


How can we learn a rule?

First: construct an explanation of the observation using background knowledge
  A proof that the goal predicate applies to the example
  The proof explains how the background knowledge applies to the example
  Can be constructed using a theorem prover

Then: establish a definition of the class of cases for which the same explanation structure can be used
  This is done by looking at the leaves of the generalized proof tree


An example

Let's learn a rule for simplifying 1 × (0 + x)

This is our background knowledge:

Rewrite(u, v) ∧ Simplify(v, w) ⇒ Simplify(u, w)
Primitive(u) ⇒ Simplify(u, u)
ArithmeticUnknown(u) ⇒ Primitive(u)
Number(u) ⇒ Primitive(u)
Rewrite(1 × u, u)
Rewrite(0 + u, u)


Proof tree (backward-chaining prover)

[Figure: proof tree produced by a backward-chaining prover for Simplify(1 × (0 + X), w), shown together with the generalized proof tree for Simplify(x × (y + z), w); the original proof succeeds with w/z, and the generalized tree applies the same inference steps with variables in place of the constants 1 and 0]


The result

Use the leaves of the generalized proof tree to form a general rule for the goal predicate:

Rewrite(1 × (0 + z), 0 + z) ∧ Rewrite(0 + z, z) ∧ ArithmeticUnknown(z) ⇒ Simplify(1 × (0 + z), z)

The first two conditions are true regardless of the value of z, so we can discard them:

ArithmeticUnknown(z) ⇒ Simplify(1 × (0 + z), z)


The four steps of the EBL process

1) Construct a proof that the goal predicate applies to the example, using background knowledge
2) Construct a generalized proof tree for the variabilized goal using the same inference steps
3) Get the rule (using bindings from the generalized proof):
   Left-hand side: leaves of the proof tree
   Right-hand side: variabilized goal
4) Drop conditions that are always true


Improving efficiency

Can obtain more general rules by pruning, e.g.:

Primitive(z) ⇒ Simplify(1 × (0 + z), z)

This also covers cases where z is a number.

Even more general rule:

Simplify(y + z, w) ⇒ Simplify(1 × (y + z), w)

A rule can be extracted from any partial subtree.

Which rule should we pick?

Three points to consider:

1) Adding a large number of rules to the knowledge base slows down the reasoning process (branching factor)
2) Derived rules should lead to a significant speedup for the cases that they cover
3) Rules should be as general as possible

We want rules that are general and easy to solve (as defined by an operationality criterion):

Primitive(z) vs. Simplify(y + z, w)

Operationality vs. generality

More specific subgoals are usually easier to solve

How do you define operationality? 2, 10, 100 proof steps?

The cost of solving a particular rule depends on the other rules in the knowledge base

It is rarely possible to devise a mathematical model of the effect on overall efficiency

Alternative approach: empirical evaluation


Attribute selection

If we know which attributes are relevant for a classification, we can delete the rest

Our background knowledge might tell us that some attributes are irrelevant

This can significantly improve the learning process:
  The reduced amount of data results in a speedup
  Overfitting becomes less likely
  The resulting classifier can be simpler


Consistent determinations

A determination P ≻ Q says that if any examples match on P, then they must also match on Q

Example: Material ∧ Temperature ≻ Conductance

Sample  Mass  Temperature  Material  Size  Conductance
S1      12    26           Copper    3     0.59
S1      12    100          Copper    3     0.57
S2      24    26           Copper    6     0.59
S3      12    26           Lead      2     0.05
S3      12    100          Lead      2     0.04
S4      24    26           Lead      4     0.05


Finding minimal consistent determinations

function MINIMAL-CONSISTENT-DET(E, A) returns a determination
  inputs: E, a set of examples
          A, a set of attributes, of size n
  for i ← 0, . . . , n do
    for each subset Ai of A of size i do
      if CONSISTENT-DET?(Ai, E) then return Ai
  end

function CONSISTENT-DET?(A, E) returns a truth value
  inputs: A, a set of attributes
          E, a set of examples
  local variables: H, a hash table
  for each example e in E do
    if some example in H has the same values as e for the attributes A
       but a different classification then return False
    store the class of e in H, indexed by the values for attributes A of the example e
  end
  return True
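A direct Python transcription of the two functions, tested on the conductance table from the earlier slide (the dictionary-based row format is an implementation choice):

```python
from itertools import combinations

def consistent_det(attrs, examples):
    """CONSISTENT-DET?: do the given attribute values determine the class?"""
    seen = {}                                     # plays the role of the hash table H
    for row, cls in examples:
        key = tuple(row[a] for a in attrs)
        if seen.setdefault(key, cls) != cls:      # same values, different class
            return False
    return True

def minimal_consistent_det(examples, attributes):
    """MINIMAL-CONSISTENT-DET: smallest attribute subset that determines
    the classification, found by checking subsets smallest-first."""
    for i in range(len(attributes) + 1):
        for subset in combinations(attributes, i):
            if consistent_det(subset, examples):
                return set(subset)

# The conductance measurements from the determination slide
data = [
    ({"Mass": 12, "Temperature": 26,  "Material": "Copper", "Size": 3}, 0.59),
    ({"Mass": 12, "Temperature": 100, "Material": "Copper", "Size": 3}, 0.57),
    ({"Mass": 24, "Temperature": 26,  "Material": "Copper", "Size": 6}, 0.59),
    ({"Mass": 12, "Temperature": 26,  "Material": "Lead",   "Size": 2}, 0.05),
    ({"Mass": 12, "Temperature": 100, "Material": "Lead",   "Size": 2}, 0.04),
    ({"Mass": 24, "Temperature": 26,  "Material": "Lead",   "Size": 4}, 0.05),
]
```

Running `minimal_consistent_det(data, ["Mass", "Temperature", "Material", "Size"])` returns `{"Material", "Temperature"}`: exactly the Material ∧ Temperature ≻ Conductance determination, since no single attribute determines the conductance.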


Time complexity

Assume the size of the smallest determination is p

There are (n choose p) subsets of that size

The algorithm won't terminate before searching them all

Hence the algorithm's time complexity is O(n^p): exponential in the size of the minimal determination

It turns out that the problem is NP-complete

In practice p is often small


RBDTL: decision tree learning with minimal consistent determinations

Learning curve for dataset with 16 attributes where only 5 are relevant:

[Figure: learning curve showing % correct on the test set (0.4 to 1.0) vs. training set size (0 to 140); RBDTL reaches high accuracy with far fewer training examples than DTL]


Inductive logic programming

Learning logic programs from data

Background knowledge is given in the form of a logic program; the data is used to extend it

Theoretically well-founded method for knowledge-based inductive learning

Background knowledge and new hypothesis combine to explain the examples:

Background ∧ Hypothesis ∧ Descriptions |= Classifications

Learning the grandparent predicate

[Figure: family tree used as training data, covering George, Mum, Philip, Elizabeth, Margaret, Kydd, Spencer, Diana, Charles, Anne, Mark, Andrew, Sarah, Edward, William, Harry, Peter, Zara, Beatrice, and Eugenie]

Possible input for learning

Descriptions:
  Father(Philip, Charles)  Father(Philip, Anne)  ...
  Mother(Mum, Margaret)  Mother(Mum, Elizabeth)  ...
  Married(Diana, Charles)  Married(Elizabeth, Philip)  ...
  Male(Philip)  Male(Charles)  ...
  Female(Beatrice)  Female(Margaret)  ...

Classifications:
  Grandparent(Mum, Charles)  Grandparent(Elizabeth, Beatrice)  ...
  ¬Grandparent(Mum, Harry)  ¬Grandparent(Spencer, Peter)  ...

Background: empty


Possible output

A possible hypothesis:

Grandparent(x,y) ⇔ [∃z Mother(x,z) ∧ Mother(z,y)]
               ∨ [∃z Mother(x,z) ∧ Father(z,y)]
               ∨ [∃z Father(x,z) ∧ Mother(z,y)]
               ∨ [∃z Father(x,z) ∧ Father(z,y)]

With background knowledge

Parent(x,y) ⇔ Mother(x,y) ∨ Father(x,y)

the hypothesis could be

Grandparent(x,y) ⇔ [∃z Parent(x,z) ∧ Parent(z,y)]


Why ILP?

Can't (directly) learn the grandfather predicate using propositional/attribute-based learning algorithms

Attribute-based learning algorithms can't (directly) learn relational predicates

However, in practice the problem can often be transformed into attribute-based form; this process is called propositionalization

ILP can (in principle) learn recursive definitions (e.g. ancestor)

© Eibe Frank 2003 (all figures © Stuart Russell and Peter Norvig, 2003) 39

Two main approaches for ILP

Inverse resolution: finds a hypothesis so that the classifications can be proven
  The proof needs to be run backwards

FOIL: generates a set of Horn clauses by learning one clause at a time
  Each clause is formed greedily by adding literals that maximize the clause's accuracy on the training data
  Successfully solved programming problems from a Prolog textbook
