© Eibe Frank 2003 (all figures © Stuart Russell and Peter Norvig, 2003)
Knowledge in Learning
Chapter 19
Outline
Learning and logic
Version space learning
Explanation-based learning
Learning using relevance information
Inductive logic programming
Learning and logic
Can restrict learning to hypotheses that can be expressed as logical sentences
Examples and classifications can also be expressed as logical sentences
Classification sentence can be inferred from hypothesis and example description
Straightforward to incorporate prior knowledge
Can be done with propositional as well as first-order logic
Incrementally updating a hypothesis
[Figure: panels (a)–(e) show a hypothesis region over positive (+) and negative (−) examples being incrementally generalized to include false negatives and specialized to exclude false positives as new examples arrive.]
Specialization vs. generalization
False positive: the hypothesis classifies an example as positive when it really is negative
We need to specialize the hypothesis to exclude the negative example
False negative: the hypothesis classifies an example as negative when it really is positive
We need to generalize the hypothesis to include the positive example
Current-best learning
function CURRENT-BEST-LEARNING(examples) returns a hypothesis
  H ← any hypothesis consistent with the first example in examples
  for each remaining example e in examples do
    if e is a false positive for H then
      H ← choose a specialization of H consistent with examples
    else if e is a false negative for H then
      H ← choose a generalization of H consistent with examples
    if no consistent specialization/generalization can be found then fail
  end
  return H
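The loop above can be sketched in code. A minimal sketch, assuming a toy hypothesis space of closed integer intervals [lo, hi] on a single numeric feature (an example x is classified positive iff lo ≤ x ≤ hi); the interval representation and the generous starting width are illustrative assumptions, not part of the slides:

```python
def consistent(h, examples):
    """True iff hypothesis h = (lo, hi) classifies every example correctly."""
    lo, hi = h
    return all((lo <= x <= hi) == label for x, label in examples)

def current_best_learning(examples):
    x0, label0 = examples[0]
    assert label0, "sketch assumes the first example is positive"
    # "Any hypothesis consistent with the first example": start generous,
    # so that both specialization and generalization can be exercised.
    h = (x0 - 100, x0 + 100)
    seen = [examples[0]]
    for x, label in examples[1:]:
        seen.append((x, label))
        lo, hi = h
        if (lo <= x <= hi) and not label:        # false positive: specialize
            for cand in [(x + 1, hi), (lo, x - 1)]:   # shrink just past x
                if consistent(cand, seen):
                    h = cand
                    break
            else:
                raise ValueError("no consistent specialization found")
        elif not (lo <= x <= hi) and label:      # false negative: generalize
            cand = (min(lo, x), max(hi, x))      # minimal enclosing interval
            if not consistent(cand, seen):
                raise ValueError("no consistent generalization found")
            h = cand
    return h

print(current_best_learning([(5, True), (2, False), (9, False), (7, True)]))  # -> (3, 8)
```

Note how every repair is checked against all examples seen so far, which is exactly why current-best learning must keep the data around.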
The version space
[Figure: the version space lies between the most general boundary G1, G2, G3, ..., Gm and the most specific boundary S1, ..., Sn; everything more general than G or more specific than S is inconsistent.]
The boundaries of the version space
[Figure: positive (+) and negative (−) examples with a most specific boundary hypothesis S1 and most general boundary hypotheses G1, G2 drawn around them.]
General-to-specific search
Start with most general hypothesis
Until all examples have been processed:
  If current example is negative
    Replace all hypotheses that cover the example by their most general specializations that do not cover the example
    Delete hypotheses that are more specific than other ones
    Delete hypotheses not consistent with all positive examples
  If current example is positive
    Delete all hypotheses not consistent with the example
Specific-to-general search
Start with most specific hypothesis
Until all examples have been processed:
  If current example is positive
    Replace all hypotheses that do not cover the example by their most specific generalizations that cover the example
    Delete hypotheses that are more general than other ones
    Delete hypotheses covering a negative example
  If current example is negative
    Delete all hypotheses covering the example
Comments on the two algorithms
General-to-specific search needs to maintain
  The set of most general hypotheses G
  The set of positive examples seen so far
Specific-to-general search needs to maintain
  The set of most specific hypotheses S
  The set of negative examples seen so far
Storing the examples can be avoided by maintaining the set of all consistent hypotheses!
Version space learning
function VERSION-SPACE-LEARNING(examples) returns a version space
  local variables: V, the version space: the set of all hypotheses
  V ← the set of all hypotheses
  for each example e in examples do
    if V is not empty then V ← VERSION-SPACE-UPDATE(V, e)
  end
  return V

function VERSION-SPACE-UPDATE(V, e) returns an updated version space
  V ← {h ∈ V : h is consistent with e}
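For a finite, explicitly enumerable hypothesis space, the algorithm above is just repeated filtering. A minimal sketch; the threshold hypothesis space (classify x as positive iff x ≥ t) is an illustrative assumption, not part of the slides:

```python
def version_space_update(v, x, label):
    # Keep exactly the hypotheses consistent with example (x, label).
    return [t for t in v if (x >= t) == label]

def version_space_learning(hypotheses, examples):
    v = list(hypotheses)
    for x, label in examples:
        if v:  # only update while the space is non-empty
            v = version_space_update(v, x, label)
    return v

thresholds = range(11)   # candidate hypotheses t = 0..10
examples = [(7, True), (3, False), (5, True)]
print(version_space_learning(thresholds, examples))  # -> [4, 5]
```

Two hypotheses survive, so more data would be needed to converge; an inconsistent example set would collapse the list to [].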
Maintaining the version space
Naïve approach: explicitly storing all the consistent hypotheses
Observation: version space fully specified by boundary sets S and G
S: set of most specific hypotheses consistent with the examples seen so far
G: set of most general hypotheses consistent with the examples seen so far
We only need to maintain S and G !!
Incorporating a positive example
Delete Gi that do not cover the example
Replace all Si that do not cover the example by their most specific generalizations that cover it
Delete Si that are more general than some other member of S
Delete Si that are not more specific than all members of G
Incorporating a negative example
Delete Si that cover the example
Replace all Gi that cover the example by their most general specializations that do not cover it
Delete Gi that are less general than some other member of G
Delete Gi that are not more general than all members of S
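The positive- and negative-example updates can be sketched for the classic conjunctive hypothesis space: one slot per attribute, where '?' matches any value. The toy attributes and their values are illustrative assumptions, and the sketch assumes the first example is positive:

```python
def matches(h, x):
    return all(hv == '?' or hv == xv for hv, xv in zip(h, x))

def generalize(s, x):
    # Most specific generalization of s that covers x.
    return tuple(sv if sv == xv else '?' for sv, xv in zip(s, x))

def specializations(g, x, values):
    # Most general specializations of g that exclude x.
    out = []
    for i, gv in enumerate(g):
        if gv == '?':
            for v in values[i]:
                if v != x[i]:
                    out.append(g[:i] + (v,) + g[i + 1:])
    return out

def more_general_or_equal(h1, h2):
    return all(a == '?' or a == b for a, b in zip(h1, h2))

def candidate_elimination(examples, values):
    (x0, label0) = examples[0]
    assert label0, "sketch assumes the first example is positive"
    S = [x0]                          # most specific boundary
    G = [('?',) * len(values)]        # most general boundary
    for x, label in examples[1:]:
        if label:   # positive example
            G = [g for g in G if matches(g, x)]
            S = [s if matches(s, x) else generalize(s, x) for s in S]
            S = [s for s in S if any(more_general_or_equal(g, s) for g in G)]
        else:       # negative example
            S = [s for s in S if not matches(s, x)]
            new_g = []
            for g in G:
                if not matches(g, x):
                    new_g.append(g)
                else:
                    new_g.extend(h for h in specializations(g, x, values)
                                 if any(more_general_or_equal(h, s) for s in S))
            # keep only maximally general members
            G = [g for g in new_g
                 if not any(h != g and more_general_or_equal(h, g) for h in new_g)]
    return S, G

values = [('Sunny', 'Rainy'), ('Warm', 'Cold')]
examples = [(('Sunny', 'Warm'), True),
            (('Rainy', 'Cold'), False),
            (('Sunny', 'Cold'), True)]
print(candidate_elimination(examples, values))  # -> ([('Sunny', '?')], [('Sunny', '?')])
```

Here S and G converge to the single hypothesis ('Sunny', '?'), the second termination case discussed below.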
Termination of the algorithm
First possibility: we run out of data
  Several hypotheses survive
Second possibility: S and G converge
  We have found the only consistent hypothesis
Third possibility: version space collapses
  No hypothesis survives
  This may be due to noise or an inappropriate hypothesis space
Learning with prior knowledge
[Figure: knowledge-based inductive learning — observations and prior knowledge are combined to form hypotheses, which yield predictions.]
Explanation-based learning
Entailment constraint for inductive learning:
  Hypothesis ∧ Descriptions |= Classifications
Explanation-based learning: generalizing an example using background knowledge:
  Hypothesis ∧ Descriptions |= Classifications
  Background |= Hypothesis
We don't really learn anything new
We generate a rule that explains an example using our background knowledge
Why is this learning?
EBL learns a general rule that can be applied in situations similar to a given example
Generalization of the memoization technique
Humans often learn the same way
Learning to play a board game when you already know the rules
EBL is also called speedup learning
Using learned rules we can solve problems faster
New rules can be applied "without thinking"
Example: differentiation
Derivative of X²
Using the product rule: (X²)' = (1×X) + (X×1) = 2×X
Many people can compute the derivative by inspection because they have learned the rule by looking at examples
We can compute the derivative using standard rules, but that takes more time
  ArithmeticUnknown(u) ⇒ Derivative(u², u) = 2u
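The speedup can be illustrated with a toy contrast between deriving the answer step by step with general rules and applying the learned shortcut rule in one step. The nested-tuple expression representation is an assumption of this sketch:

```python
def derivative(e, x):
    # Derivative via basic rules (the "slow", general method).
    if e == x:
        return 1
    if not isinstance(e, tuple):     # a constant
        return 0
    op, a, b = e
    if op == '+':                    # sum rule: (a + b)' = a' + b'
        return ('+', derivative(a, x), derivative(b, x))
    if op == '*':                    # product rule: (a*b)' = a'*b + a*b'
        return ('+', ('*', derivative(a, x), b), ('*', a, derivative(b, x)))

def derivative_with_learned_rule(e, x):
    # Learned EBL shortcut: ArithmeticUnknown(u) => Derivative(u*u, u) = 2u,
    # applied directly when it matches, bypassing the proof steps above.
    if e == ('*', x, x):
        return ('*', 2, x)
    return derivative(e, x)

x2 = ('*', 'X', 'X')   # X * X, i.e. X squared
print(derivative(x2, 'X'))                    # -> ('+', ('*', 1, 'X'), ('*', 'X', 1))
print(derivative_with_learned_rule(x2, 'X'))  # -> ('*', 2, 'X')
```

The general method produces the unsimplified (1×X) + (X×1) and leaves further simplification work; the learned rule returns 2×X immediately.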
How can we learn a rule?
First: construct explanation of the observation using background knowledge
Proof that the goal predicate applies to the example
The proof explains how the background knowledge applies to the example
Can be constructed using theorem prover
Then: establish definition of class of cases for which the same explanation structure can be used
This is done by looking at the leaves of the generalized proof tree
An example
Let's learn a rule for simplifying 1 × (0 + x)
This is our background knowledge:
  Rewrite(u, v) ∧ Simplify(v, w) ⇒ Simplify(u, w)
  Primitive(u) ⇒ Simplify(u, u)
  ArithmeticUnknown(u) ⇒ Primitive(u)
  Number(u) ⇒ Primitive(u)
  Rewrite(1 × u, u)
  Rewrite(0 + u, u)
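A minimal sketch of this background knowledge as executable code, assuming expressions are nested tuples such as ('*', 1, ('+', 0, 'x')), with numbers and variable names as the primitives:

```python
def rewrite(e):
    # Rewrite(1 * u, u) and Rewrite(0 + u, u)
    if isinstance(e, tuple):
        op, a, b = e
        if op == '*' and a == 1:
            return b
        if op == '+' and a == 0:
            return b
    return None          # no rewrite applies

def simplify(e):
    if not isinstance(e, tuple):   # Primitive(u) => Simplify(u, u)
        return e
    r = rewrite(e)
    if r is not None:              # Rewrite(u,v) & Simplify(v,w) => Simplify(u,w)
        return simplify(r)
    return e

print(simplify(('*', 1, ('+', 0, 'x'))))   # -> 'x'
```

The recursion mirrors the backward-chaining proof: each Rewrite step is followed by a Simplify subgoal until a primitive is reached.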
Proof tree (backward-chaining prover)
[Figure: backward-chaining proof tree for Simplify(1 × (0 + X), w), shown alongside the generalized proof tree for Simplify(x × (y + z), w); the steps go through Rewrite and Simplify subgoals down to Primitive and ArithmeticUnknown leaves, labelled with bindings such as {x/1, v/y+z}, {y/0, v′/z}, {w/X}, and {w/z}.]
The result
Use leaves of generalized proof tree to form general rule for goal predicate:
  Rewrite(1 × (0 + z), 0 + z) ∧ Rewrite(0 + z, z) ∧ ArithmeticUnknown(z) ⇒ Simplify(1 × (0 + z), z)
The first two conditions are true regardless of the value of z, so we can discard them:
  ArithmeticUnknown(z) ⇒ Simplify(1 × (0 + z), z)
The four steps of the EBL process
1) Construct proof that goal predicate applies to example using background knowledge
2) Construct a generalized proof tree for the variabilized goal using the same inference steps
3) Get rule (using bindings from generalized proof):
   Left-hand side: leaves of proof tree
   Right-hand side: variabilized goal
4) Drop conditions that are always true
Improving efficiency
Can obtain more general rules by pruning the proof tree, e.g.:
  Primitive(z) ⇒ Simplify(1 × (0 + z), z)
covers cases where z is a number. An even more general rule:
  Simplify(y + z, w) ⇒ Simplify(1 × (y + z), w)
A rule can be extracted from any partial subtree.
Which rule should we pick?
Three points to consider:
1) Adding a large number of rules to the knowledge base slows down the reasoning process (branching factor)
2) Derived rules should lead to significant speedup for cases that they cover
3) Rules should be as general as possible
We want rules that are general and easy to solve (as defined by an operationality criterion)
Primitive(z) vs. Simplify(y + z, w)
Operationality vs. generality
More specific subgoals are usually easier to solve
How do you define operationality?
2, 10, 100 proof steps?
Cost of solving a particular rule depends on other rules in the knowledge base
Rarely possible to devise mathematical model of effect on overall efficiency
Alternative approach: empirical evaluation
Attribute selection
If we know which attributes are relevant for a classification we can delete the rest
Our background knowledge might tell us that some attributes are irrelevant
Can significantly improve the learning process:
  Reduced amount of data results in a speedup
  Makes overfitting less likely
Resulting classifier can be simpler
Consistent determinations
A determination P ≻ Q says that if any examples match on P, then they must also match on Q
Example: Material ∧ Temperature ≻ Conductance
Sample Mass Temperature Material Size Conductance
S1 12 26 Copper 3 0.59
S1 12 100 Copper 3 0.57
S2 24 26 Copper 6 0.59
S3 12 26 Lead 2 0.05
S3 12 100 Lead 2 0.04
S4 24 26 Lead 4 0.05
Finding minimal consistent determinations
function MINIMAL-CONSISTENT-DET(E, A) returns a determination
  inputs: E, a set of examples
          A, a set of attributes, of size n
  for i ← 0, . . . , n do
    for each subset Ai of A of size i do
      if CONSISTENT-DET?(Ai, E) then return Ai
    end
  end

function CONSISTENT-DET?(A, E) returns a truth-value
  inputs: A, a set of attributes
          E, a set of examples
  local variables: H, a hash table
  for each example e in E do
    if some example in H has the same values as e for the attributes A
       but a different classification then return False
    store the class of e in H, indexed by the values for attributes A of the example e
  end
  return True
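A direct Python rendering of the pseudocode above, tried on the conductance table from the earlier slide (the dictionary-based row representation is an assumption of this sketch):

```python
from itertools import combinations

def consistent_det(attrs, examples, target):
    # The hash table H from the pseudocode: attribute values -> class.
    h = {}
    for e in examples:
        key = tuple(e[a] for a in attrs)
        if key in h and h[key] != e[target]:
            return False
        h[key] = e[target]
    return True

def minimal_consistent_det(examples, attributes, target):
    # Try subsets in order of increasing size; the first consistent
    # subset found is a minimal determination.
    for i in range(len(attributes) + 1):
        for subset in combinations(attributes, i):
            if consistent_det(subset, examples, target):
                return subset

rows = [
    {'Mass': 12, 'Temperature': 26,  'Material': 'Copper', 'Size': 3, 'Conductance': 0.59},
    {'Mass': 12, 'Temperature': 100, 'Material': 'Copper', 'Size': 3, 'Conductance': 0.57},
    {'Mass': 24, 'Temperature': 26,  'Material': 'Copper', 'Size': 6, 'Conductance': 0.59},
    {'Mass': 12, 'Temperature': 26,  'Material': 'Lead',   'Size': 2, 'Conductance': 0.05},
    {'Mass': 12, 'Temperature': 100, 'Material': 'Lead',   'Size': 2, 'Conductance': 0.04},
    {'Mass': 24, 'Temperature': 26,  'Material': 'Lead',   'Size': 4, 'Conductance': 0.05},
]
attrs = ['Mass', 'Temperature', 'Material', 'Size']
print(minimal_consistent_det(rows, attrs, 'Conductance'))  # -> ('Temperature', 'Material')
```

As the table suggests, Material ∧ Temperature ≻ Conductance holds, and no single attribute suffices.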
Time complexity
Assume the size of the smallest determination is p
There are C(n, p) subsets of that size, and the algorithm won't terminate before searching them
Hence the algorithm's time complexity is O(n^p): exponential in the size of the minimal determination
It turns out that the problem is NP-complete
In practice p is often small
RBDTL: decision tree learning with minimal consistent determinations
Learning curve for dataset with 16 attributes where only 5 are relevant:
[Figure: learning curves (% correct on test set vs. training set size) comparing RBDTL and DTL.]
Inductive logic programming
Learning logic programs from data
Background knowledge is given in form of a logic program, data is used to extend it
Theoretically well-founded method for knowledge-based inductive learning
Background knowledge and new hypothesis combine to explain the examples
Background∧Hypothesis∧Descriptions |= Classifications
Learning the grandparent predicate
[Figure: family tree — George and Mum are the parents of Elizabeth and Margaret; Elizabeth married Philip, and their children are Charles, Anne, Andrew, and Edward; Spencer and Kydd are the parents of Diana; Charles and Diana have William and Harry; Anne and Mark have Peter and Zara; Andrew and Sarah have Beatrice and Eugenie.]
Possible input for learning
Descriptions:
Father(Philip,Charles) Father(Philip,Anne) ...
Mother(Mum, Margaret) Mother(Mum, Elizabeth) ...
Married(Diana,Charles) Married(Elizabeth,Philip) ...
Male(Philip) Male(Charles) ...
Female(Beatrice) Female(Margaret) ...
Classifications:
Grandparent(Mum,Charles) Grandparent(Elizabeth,Beatrice) ...
¬Grandparent(Mum,Harry) ¬Grandparent(Spencer,Peter) ...
Background: empty
Possible output
A possible hypothesis:
Grandparent(x,y) ⇔ [∃z Mother(x,z) ∧ Mother(z,y)]
              ∨ [∃z Mother(x,z) ∧ Father(z,y)]
              ∨ [∃z Father(x,z) ∧ Mother(z,y)]
              ∨ [∃z Father(x,z) ∧ Father(z,y)]
With background knowledge
Parent(x,y) ⇔ Mother(x,y) ∨ Father(x,y)
hypothesis could be
Grandparent(x,y) ⇔ [∃z Parent(x,z) ∧ Parent(z,y)]
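A minimal sketch checking this hypothesis against a small fact base; the facts below are an illustrative subset of the family tree, not the complete set:

```python
# Ground facts as sets of (parent, child) pairs.
mother = {('Mum', 'Elizabeth'), ('Elizabeth', 'Charles'), ('Diana', 'William')}
father = {('Philip', 'Charles'), ('Charles', 'William'), ('Charles', 'Harry')}
parent = mother | father   # Parent(x,y) <=> Mother(x,y) v Father(x,y)

def grandparent(x, y):
    # Grandparent(x,y) <=> exists z. Parent(x,z) and Parent(z,y)
    return any(p == x and (c, y) in parent for p, c in parent)

print(grandparent('Elizabeth', 'William'))   # -> True
print(grandparent('Mum', 'Charles'))         # -> True
print(grandparent('Philip', 'Harry'))        # -> True
print(grandparent('Diana', 'Harry'))         # -> False (in this fact subset)
```

With the Parent background relation available, one line of logic suffices; without it, the learner is stuck with the four-way disjunction above.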
Why ILP?
Can't (directly) learn the Grandparent predicate using propositional/attribute-based learning algorithms
Attribute-based learning algorithms can't (directly) learn relational predicates
However, in practice the problem can often be transformed into attribute-based form
This process is called propositionalization
ILP can (in principle) learn recursive definitions (e.g. ancestor)
Two main approaches for ILP
Inverse resolution: finds hypothesis so that classifications can be proven
Proof needs to be run backwards
FOIL: generates set of Horn clauses by learning one clause at a time
Each clause is formed greedily by adding literals that maximize the clause's accuracy on the training data
Successfully solved programming problems from Prolog textbook
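A toy FOIL-flavoured sketch of greedy clause construction: one clause, a single body variable z, only the Parent relation as background, and score = positives covered minus negatives covered (real FOIL uses an information-based gain and learns multiple clauses). The fact base and training pairs are illustrative assumptions:

```python
parent = {('Mum', 'Elizabeth'), ('Elizabeth', 'Charles'),
          ('Philip', 'Charles'), ('Charles', 'William'),
          ('Charles', 'Harry'), ('Diana', 'William')}
consts = {c for pair in parent for c in pair}

pos = [('Elizabeth', 'William'), ('Mum', 'Charles'), ('Philip', 'Harry')]
neg = [('Diana', 'Harry'), ('Charles', 'Charles'), ('William', 'Harry')]

def covers(body, x, y):
    # (x, y) is covered if some binding for Z satisfies every body literal;
    # a literal (a, b) stands for Parent(value-of-a, value-of-b).
    for z in consts:
        env = {'X': x, 'Y': y, 'Z': z}
        if all((env[a], env[b]) in parent for a, b in body):
            return True
    return False

def learn_clause():
    candidates = [(a, b) for a in 'XYZ' for b in 'XYZ' if a != b]
    body = []
    # Greedily add literals until no negative example is covered.
    # (Sketch assumes some literal always makes progress on this data.)
    while any(covers(body, x, y) for x, y in neg):
        def score(lit):
            b = body + [lit]
            return (sum(covers(b, x, y) for x, y in pos)
                    - sum(covers(b, x, y) for x, y in neg))
        body.append(max(candidates, key=score))
    return body

print(learn_clause())  # -> [('X', 'Z'), ('Z', 'Y')]
# i.e. Grandparent(X,Y) :- Parent(X,Z), Parent(Z,Y)
```

On this data the greedy search first picks Parent(X,Z) (it covers all positives and fewer negatives than any alternative) and then Parent(Z,Y), which excludes all remaining negatives.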