© Eibe Frank 2003 (all figures © Stuart Russell and Peter Norvig, 2003)
Knowledge in Learning
Chapter 19
Outline
Learning and logic
Version space learning
Explanation-based learning
Learning using relevance information
Inductive logic programming
Learning and logic
Can restrict learning to hypotheses that can be expressed as logical sentences
Examples and classifications can also be expressed as logical sentences
Classification sentence can be inferred from hypothesis and example description
Straightforward to incorporate prior knowledge
Can be done with propositional as well as first-order logic
Incrementally updating a hypothesis
[Figure: panels (a)–(e) show a hypothesis region over positive (+) and negative (−) examples being incrementally generalized to include false negatives and specialized to exclude false positives as new examples arrive.]
Specialization vs. generalization
False positive: the hypothesis classifies an example as positive when it really is negative
We need to specialize the hypothesis to exclude the negative example
False negative: the hypothesis classifies an example as negative when it really is positive
We need to generalize the hypothesis to include the positive example
Current-best learning
function CURRENT-BEST-LEARNING(examples) returns a hypothesis
  H ← any hypothesis consistent with the first example in examples
  for each remaining example e in examples do
    if e is a false positive for H then
      H ← choose a specialization of H consistent with examples
    else if e is a false negative for H then
      H ← choose a generalization of H consistent with examples
    if no consistent specialization/generalization can be found then fail
  end
  return H
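The loop above can be sketched in code. A minimal sketch, assuming a toy hypothesis space of closed integer intervals [lo, hi] on a single numeric feature (an example x is classified positive iff lo ≤ x ≤ hi); the interval representation and the generous starting width are illustrative assumptions, not part of the slides:

```python
def consistent(h, examples):
    """True iff hypothesis h = (lo, hi) classifies every example correctly."""
    lo, hi = h
    return all((lo <= x <= hi) == label for x, label in examples)

def current_best_learning(examples):
    x0, label0 = examples[0]
    assert label0, "sketch assumes the first example is positive"
    # "Any hypothesis consistent with the first example": start generous,
    # so that both specialization and generalization can be exercised.
    h = (x0 - 100, x0 + 100)
    seen = [examples[0]]
    for x, label in examples[1:]:
        seen.append((x, label))
        lo, hi = h
        if (lo <= x <= hi) and not label:        # false positive: specialize
            for cand in [(x + 1, hi), (lo, x - 1)]:   # shrink just past x
                if consistent(cand, seen):
                    h = cand
                    break
            else:
                raise ValueError("no consistent specialization found")
        elif not (lo <= x <= hi) and label:      # false negative: generalize
            cand = (min(lo, x), max(hi, x))      # minimal enclosing interval
            if not consistent(cand, seen):
                raise ValueError("no consistent generalization found")
            h = cand
    return h

print(current_best_learning([(5, True), (2, False), (9, False), (7, True)]))  # -> (3, 8)
```

Note how every repair is checked against all examples seen so far, which is exactly why current-best learning must keep the data around.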
The version space
[Figure: the version space lies between the most general boundary G1, G2, G3, ..., Gm and the most specific boundary S1, ..., Sn; everything more general than G or more specific than S is inconsistent.]
The boundaries of the version space
[Figure: positive (+) and negative (−) examples with a most specific boundary hypothesis S1 and most general boundary hypotheses G1, G2 drawn around them.]
General-to-specific search
Start with most general hypothesis
Until all examples have been processed:
  If current example is negative
    Replace all hypotheses that cover the example by their most general specializations that do not cover the example
    Delete hypotheses that are more specific than other ones
    Delete hypotheses not consistent with all positive examples
  If current example is positive
    Delete all hypotheses not consistent with the example
Specific-to-general search
Start with most specific hypothesis
Until all examples have been processed:
  If current example is positive
    Replace all hypotheses that do not cover the example by their most specific generalizations that cover the example
    Delete hypotheses that are more general than other ones
    Delete hypotheses covering a negative example
  If current example is negative
    Delete all hypotheses covering the example
Comments on the two algorithms
General-to-specific search needs to maintain
  The set of most general hypotheses G
  The set of positive examples seen so far
Specific-to-general search needs to maintain
  The set of most specific hypotheses S
  The set of negative examples seen so far
Storing the examples can be avoided by maintaining the set of all consistent hypotheses!
Version space learning
function VERSION-SPACE-LEARNING(examples) returns a version space
  local variables: V, the version space: the set of all hypotheses
  V ← the set of all hypotheses
  for each example e in examples do
    if V is not empty then V ← VERSION-SPACE-UPDATE(V, e)
  end
  return V

function VERSION-SPACE-UPDATE(V, e) returns an updated version space
  V ← {h ∈ V : h is consistent with e}
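For a finite, explicitly enumerable hypothesis space, the algorithm above is just repeated filtering. A minimal sketch; the threshold hypothesis space (classify x as positive iff x ≥ t) is an illustrative assumption, not part of the slides:

```python
def version_space_update(v, x, label):
    # Keep exactly the hypotheses consistent with example (x, label).
    return [t for t in v if (x >= t) == label]

def version_space_learning(hypotheses, examples):
    v = list(hypotheses)
    for x, label in examples:
        if v:  # only update while the space is non-empty
            v = version_space_update(v, x, label)
    return v

thresholds = range(11)   # candidate hypotheses t = 0..10
examples = [(7, True), (3, False), (5, True)]
print(version_space_learning(thresholds, examples))  # -> [4, 5]
```

Two hypotheses survive, so more data would be needed to converge; an inconsistent example set would collapse the list to [].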
Maintaining the version space
Naïve approach: explicitly storing all the consistent hypotheses
Observation: version space fully specified by boundary sets S and G
S: set of most specific hypotheses consistent with the examples seen so far
G: set of most general hypotheses consistent with the examples seen so far
We only need to maintain S and G !!
Incorporating a positive example
Delete Gi that do not cover the example
Replace all Si that do not cover the example by their most specific generalizations that cover it
Delete Si that are more general than some other member of S
Delete Si that are not more specific than all members of G
Incorporating a negative example
Delete Si that cover the example
Replace all Gi that cover the example by their most general specializations that do not cover it
Delete Gi that are less general than some other member of G
Delete Gi that are not more general than all members of S
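The positive- and negative-example updates can be sketched for the classic conjunctive hypothesis space: one slot per attribute, where '?' matches any value. The toy attributes and their values are illustrative assumptions, and the sketch assumes the first example is positive:

```python
def matches(h, x):
    return all(hv == '?' or hv == xv for hv, xv in zip(h, x))

def generalize(s, x):
    # Most specific generalization of s that covers x.
    return tuple(sv if sv == xv else '?' for sv, xv in zip(s, x))

def specializations(g, x, values):
    # Most general specializations of g that exclude x.
    out = []
    for i, gv in enumerate(g):
        if gv == '?':
            for v in values[i]:
                if v != x[i]:
                    out.append(g[:i] + (v,) + g[i + 1:])
    return out

def more_general_or_equal(h1, h2):
    return all(a == '?' or a == b for a, b in zip(h1, h2))

def candidate_elimination(examples, values):
    (x0, label0) = examples[0]
    assert label0, "sketch assumes the first example is positive"
    S = [x0]                          # most specific boundary
    G = [('?',) * len(values)]        # most general boundary
    for x, label in examples[1:]:
        if label:   # positive example
            G = [g for g in G if matches(g, x)]
            S = [s if matches(s, x) else generalize(s, x) for s in S]
            S = [s for s in S if any(more_general_or_equal(g, s) for g in G)]
        else:       # negative example
            S = [s for s in S if not matches(s, x)]
            new_g = []
            for g in G:
                if not matches(g, x):
                    new_g.append(g)
                else:
                    new_g.extend(h for h in specializations(g, x, values)
                                 if any(more_general_or_equal(h, s) for s in S))
            # keep only maximally general members
            G = [g for g in new_g
                 if not any(h != g and more_general_or_equal(h, g) for h in new_g)]
    return S, G

values = [('Sunny', 'Rainy'), ('Warm', 'Cold')]
examples = [(('Sunny', 'Warm'), True),
            (('Rainy', 'Cold'), False),
            (('Sunny', 'Cold'), True)]
print(candidate_elimination(examples, values))  # -> ([('Sunny', '?')], [('Sunny', '?')])
```

Here S and G converge to the single hypothesis ('Sunny', '?'), the second termination case discussed below.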
Termination of the algorithm
First possibility: we run out of data
  Several hypotheses survive
Second possibility: S and G converge
  We have found the only consistent hypothesis
Third possibility: version space collapses
  No hypothesis survives
  This may be due to noise or an inappropriate hypothesis space
Learning with prior knowledge
[Figure: knowledge-based inductive learning — observations and prior knowledge are combined to form hypotheses, which yield predictions.]
Explanation-based learning
Entailment constraint for inductive learning:
  Hypothesis ∧ Descriptions |= Classifications
Explanation-based learning: generalizing an example using background knowledge:
  Hypothesis ∧ Descriptions |= Classifications
  Background |= Hypothesis
We don't really learn anything new
We generate a rule that explains an example using our background knowledge
Why is this learning?
EBL learns a general rule that can be applied in situations similar to a given example
Generalization of the memoization technique
Humans often learn the same way
Learning to play a board game when you already know the rules
EBL is also called speedup learning
Using learned rules we can solve problems faster
New rules can be applied "without thinking"
Example: differentiation
Derivative of X²
Using the product rule: (X²)' = (1×X) + (X×1) = 2×X
Many people can compute the derivative by inspection because they have learned the rule by looking at examples
We can compute the derivative using standard rules, but that takes more time
  ArithmeticUnknown(u) ⇒ Derivative(u², u) = 2u
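The speedup can be illustrated with a toy contrast between deriving the answer step by step with general rules and applying the learned shortcut rule in one step. The nested-tuple expression representation is an assumption of this sketch:

```python
def derivative(e, x):
    # Derivative via basic rules (the "slow", general method).
    if e == x:
        return 1
    if not isinstance(e, tuple):     # a constant
        return 0
    op, a, b = e
    if op == '+':                    # sum rule: (a + b)' = a' + b'
        return ('+', derivative(a, x), derivative(b, x))
    if op == '*':                    # product rule: (a*b)' = a'*b + a*b'
        return ('+', ('*', derivative(a, x), b), ('*', a, derivative(b, x)))

def derivative_with_learned_rule(e, x):
    # Learned EBL shortcut: ArithmeticUnknown(u) => Derivative(u*u, u) = 2u,
    # applied directly when it matches, bypassing the proof steps above.
    if e == ('*', x, x):
        return ('*', 2, x)
    return derivative(e, x)

x2 = ('*', 'X', 'X')   # X * X, i.e. X squared
print(derivative(x2, 'X'))                    # -> ('+', ('*', 1, 'X'), ('*', 'X', 1))
print(derivative_with_learned_rule(x2, 'X'))  # -> ('*', 2, 'X')
```

The general method produces the unsimplified (1×X) + (X×1) and leaves further simplification work; the learned rule returns 2×X immediately.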
How can we learn a rule?
First: construct explanation of the observation using background knowledge
Proof that the goal predicate applies to the example
The proof explains how the background knowledge applies to the example
Can be constructed using theorem prover
Then: establish definition of class of cases for which the same explanation structure can be used
This is done by looking at the leaves of the generalized proof tree
An example
Let's learn a rule for simplifying 1 × (0 + x)
This is our background knowledge:
  Rewrite(u, v) ∧ Simplify(v, w) ⇒ Simplify(u, w)
  Primitive(u) ⇒ Simplify(u, u)
  ArithmeticUnknown(u) ⇒ Primitive(u)
  Number(u) ⇒ Primitive(u)
  Rewrite(1 × u, u)
  Rewrite(0 + u, u)
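A minimal sketch of this background knowledge as executable code, assuming expressions are nested tuples such as ('*', 1, ('+', 0, 'x')), with numbers and variable names as the primitives:

```python
def rewrite(e):
    # Rewrite(1 * u, u) and Rewrite(0 + u, u)
    if isinstance(e, tuple):
        op, a, b = e
        if op == '*' and a == 1:
            return b
        if op == '+' and a == 0:
            return b
    return None          # no rewrite applies

def simplify(e):
    if not isinstance(e, tuple):   # Primitive(u) => Simplify(u, u)
        return e
    r = rewrite(e)
    if r is not None:              # Rewrite(u,v) & Simplify(v,w) => Simplify(u,w)
        return simplify(r)
    return e

print(simplify(('*', 1, ('+', 0, 'x'))))   # -> 'x'
```

The recursion mirrors the backward-chaining proof: each Rewrite step is followed by a Simplify subgoal until a primitive is reached.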
Proof tree (backward-chaining prover)
[Figure: backward-chaining proof tree for Simplify(1 × (0 + X), w), shown alongside the generalized proof tree for Simplify(x × (y + z), w); the steps go through Rewrite and Simplify subgoals down to Primitive and ArithmeticUnknown leaves, labelled with bindings such as {x/1, v/y+z}, {y/0, v′/z}, {w/X}, and {w/z}.]
The result
Use leaves of generalized proof tree to form general rule for goal predicate:
  Rewrite(1 × (0 + z), 0 + z) ∧ Rewrite(0 + z, z) ∧ ArithmeticUnknown(z) ⇒ Simplify(1 × (0 + z), z)
The first two conditions are true regardless of the value of z, so we can discard them:
  ArithmeticUnknown(z) ⇒ Simplify(1 × (0 + z), z)
The four steps of the EBL process
1) Construct proof that goal predicate applies to example using background knowledge
2) Construct a generalized proof tree for the variabilized goal using the same inference steps
3) Get rule (using bindings from generalized proof):
   Left-hand side: leaves of proof tree
   Right-hand side: variabilized goal
4) Drop conditions that are always true
Improving efficiency
Can obtain more general rules by pruning the proof tree, e.g.:
  Primitive(z) ⇒ Simplify(1 × (0 + z), z)
covers cases where z is a number. An even more general rule:
  Simplify(y + z, w) ⇒ Simplify(1 × (y + z), w)
A rule can be extracted from any partial subtree.
Which rule should we pick?
Three points to consider:
1) Adding a large number of rules to the knowledge base slows down the reasoning process (branching factor)
2) Derived rules should lead to significant speedup for cases that they cover
3) Rules should be as general as possible
We want rules that are general and easy to solve (as defined by an operationality criterion)
Primitive(z) vs. Simplify(y + z, w)
Operationality vs. generality
More specific subgoals are usually easier to solve
How do you define operationality?
2, 10, 100 proof steps?
Cost of solving a particular rule depends on other rules in the knowledge base
Rarely possible to devise mathematical model of effect on overall efficiency
Alternative approach: empirical evaluation
Attribute selection
If we know which attributes are relevant for a classification we can delete the rest
Our background knowledge might tell us that some attributes are irrelevant
Can significantly improve the learning process:
  Reduced amount of data results in a speedup
  Makes overfitting less likely
Resulting classifier can be simpler
Consistent determinations
A determination P ≻ Q says that if any examples match on P, then they must also match on Q
Example: Material ∧ Temperature ≻ Conductance
Sample Mass Temperature Material Size Conductance
S1 12 26 Copper 3 0.59
S1 12 100 Copper 3 0.57
S2 24 26 Copper 6 0.59
S3 12 26 Lead 2 0.05
S3 12 100 Lead 2 0.04
S4 24 26 Lead 4 0.05
Finding minimal consistent determinations
function MINIMAL-CONSISTENT-DET(E, A) returns a determination
  inputs: E, a set of examples
          A, a set of attributes, of size n
  for i ← 0, . . . , n do
    for each subset Ai of A of size i do
      if CONSISTENT-DET?(Ai, E) then return Ai
    end
  end

function CONSISTENT-DET?(A, E) returns a truth-value
  inputs: A, a set of attributes
          E, a set of examples
  local variables: H, a hash table
  for each example e in E do
    if some example in H has the same values as e for the attributes A
       but a different classification then return False
    store the class of e in H, indexed by the values for attributes A of the example e
  end
  return True
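A direct Python rendering of the pseudocode above, tried on the conductance table from the earlier slide (the dictionary-based row representation is an assumption of this sketch):

```python
from itertools import combinations

def consistent_det(attrs, examples, target):
    # The hash table H from the pseudocode: attribute values -> class.
    h = {}
    for e in examples:
        key = tuple(e[a] for a in attrs)
        if key in h and h[key] != e[target]:
            return False
        h[key] = e[target]
    return True

def minimal_consistent_det(examples, attributes, target):
    # Try subsets in order of increasing size; the first consistent
    # subset found is a minimal determination.
    for i in range(len(attributes) + 1):
        for subset in combinations(attributes, i):
            if consistent_det(subset, examples, target):
                return subset

rows = [
    {'Mass': 12, 'Temperature': 26,  'Material': 'Copper', 'Size': 3, 'Conductance': 0.59},
    {'Mass': 12, 'Temperature': 100, 'Material': 'Copper', 'Size': 3, 'Conductance': 0.57},
    {'Mass': 24, 'Temperature': 26,  'Material': 'Copper', 'Size': 6, 'Conductance': 0.59},
    {'Mass': 12, 'Temperature': 26,  'Material': 'Lead',   'Size': 2, 'Conductance': 0.05},
    {'Mass': 12, 'Temperature': 100, 'Material': 'Lead',   'Size': 2, 'Conductance': 0.04},
    {'Mass': 24, 'Temperature': 26,  'Material': 'Lead',   'Size': 4, 'Conductance': 0.05},
]
attrs = ['Mass', 'Temperature', 'Material', 'Size']
print(minimal_consistent_det(rows, attrs, 'Conductance'))  # -> ('Temperature', 'Material')
```

As the table suggests, Material ∧ Temperature ≻ Conductance holds, and no single attribute suffices.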
Time complexity
Assume the size of the smallest determination is p
There are C(n, p) subsets of that size, and the algorithm won't terminate before searching them
Hence the algorithm's time complexity is O(n^p): exponential in the size of the minimal determination
It turns out that the problem is NP-complete
In practice p is often small
RBDTL: decision tree learning with minimal consistent determinations
Learning curve for dataset with 16 attributes where only 5 are relevant:
[Figure: learning curves (% correct on test set vs. training set size) comparing RBDTL and DTL.]
Inductive logic programming
Learning logic programs from data
Background knowledge is given in form of a logic program, data is used to extend it
Theoretically well-founded method for knowledge-based inductive learning
Background knowledge and new hypothesis combine to explain the examples
Background∧Hypothesis∧Descriptions |= Classifications
Learning the grandparent predicate
[Figure: family tree — George and Mum are the parents of Elizabeth and Margaret; Elizabeth married Philip, and their children are Charles, Anne, Andrew, and Edward; Spencer and Kydd are the parents of Diana; Charles and Diana have William and Harry; Anne and Mark have Peter and Zara; Andrew and Sarah have Beatrice and Eugenie.]
Possible input for learning
Descriptions:
Father(Philip,Charles) Father(Philip,Anne) ...
Mother(Mum, Margaret) Mother(Mum, Elizabeth) ...
Married(Diana,Charles) Married(Elizabeth,Philip) ...
Male(Philip) Male(Charles) ...
Female(Beatrice) Female(Margaret) ...
Classifications:
Grandparent(Mum,Charles) Grandparent(Elizabeth,Beatrice) ...
¬Grandparent(Mum,Harry) ¬Grandparent(Spencer,Peter) ...
Background: empty
Possible output
A possible hypothesis:
Grandparent(x,y) ⇔ [∃z Mother(x,z) ∧ Mother(z,y)]
              ∨ [∃z Mother(x,z) ∧ Father(z,y)]
              ∨ [∃z Father(x,z) ∧ Mother(z,y)]
              ∨ [∃z Father(x,z) ∧ Father(z,y)]
With background knowledge
Parent(x,y) ⇔ Mother(x,y) ∨ Father(x,y)
hypothesis could be
Grandparent(x,y) ⇔ [∃z Parent(x,z) ∧ Parent(z,y)]
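A minimal sketch checking this hypothesis against a small fact base; the facts below are an illustrative subset of the family tree, not the complete set:

```python
# Ground facts as sets of (parent, child) pairs.
mother = {('Mum', 'Elizabeth'), ('Elizabeth', 'Charles'), ('Diana', 'William')}
father = {('Philip', 'Charles'), ('Charles', 'William'), ('Charles', 'Harry')}
parent = mother | father   # Parent(x,y) <=> Mother(x,y) v Father(x,y)

def grandparent(x, y):
    # Grandparent(x,y) <=> exists z. Parent(x,z) and Parent(z,y)
    return any(p == x and (c, y) in parent for p, c in parent)

print(grandparent('Elizabeth', 'William'))   # -> True
print(grandparent('Mum', 'Charles'))         # -> True
print(grandparent('Philip', 'Harry'))        # -> True
print(grandparent('Diana', 'Harry'))         # -> False (in this fact subset)
```

With the Parent background relation available, one line of logic suffices; without it, the learner is stuck with the four-way disjunction above.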
Why ILP?
Can't (directly) learn the Grandparent predicate using propositional/attribute-based learning algorithms
Attribute-based learning algorithms can't (directly) learn relational predicates
However, in practice the problem can often be transformed into attribute-based form
This process is called propositionalization
ILP can (in principle) learn recursive definitions (e.g. ancestor)
Two main approaches for ILP
Inverse resolution: finds hypothesis so that classifications can be proven
Proof needs to be run backwards
FOIL: generates set of Horn clauses by learning one clause at a time
Each clause is formed greedily by adding literals that maximize the clause's accuracy on the training data
Successfully solved programming problems from Prolog textbook
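A toy FOIL-flavoured sketch of greedy clause construction: one clause, a single body variable z, only the Parent relation as background, and score = positives covered minus negatives covered (real FOIL uses an information-based gain and learns multiple clauses). The fact base and training pairs are illustrative assumptions:

```python
parent = {('Mum', 'Elizabeth'), ('Elizabeth', 'Charles'),
          ('Philip', 'Charles'), ('Charles', 'William'),
          ('Charles', 'Harry'), ('Diana', 'William')}
consts = {c for pair in parent for c in pair}

pos = [('Elizabeth', 'William'), ('Mum', 'Charles'), ('Philip', 'Harry')]
neg = [('Diana', 'Harry'), ('Charles', 'Charles'), ('William', 'Harry')]

def covers(body, x, y):
    # (x, y) is covered if some binding for Z satisfies every body literal;
    # a literal (a, b) stands for Parent(value-of-a, value-of-b).
    for z in consts:
        env = {'X': x, 'Y': y, 'Z': z}
        if all((env[a], env[b]) in parent for a, b in body):
            return True
    return False

def learn_clause():
    candidates = [(a, b) for a in 'XYZ' for b in 'XYZ' if a != b]
    body = []
    # Greedily add literals until no negative example is covered.
    # (Sketch assumes some literal always makes progress on this data.)
    while any(covers(body, x, y) for x, y in neg):
        def score(lit):
            b = body + [lit]
            return (sum(covers(b, x, y) for x, y in pos)
                    - sum(covers(b, x, y) for x, y in neg))
        body.append(max(candidates, key=score))
    return body

print(learn_clause())  # -> [('X', 'Z'), ('Z', 'Y')]
# i.e. Grandparent(X,Y) :- Parent(X,Z), Parent(Z,Y)
```

On this data the greedy search first picks Parent(X,Z) (it covers all positives and fewer negatives than any alternative) and then Parent(Z,Y), which excludes all remaining negatives.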