Rule-Based and Hybrid Financial Data Mining
3.1. Decision tree and DNF learning
3.1.2. Limitation: size of the tree
In this section, we discuss the limited expressive power of decision trees. There are many consistent sets of rules that cannot be represented by a relatively small single decision tree [Parsaye, 1997]. For example, having the data from table 3.1 we may wish to discover rules that describe the behavior of the target T as a function of the attributes x1, x2, x3 and x4.
Target T in table 3.1 is governed by the simple rule R1:

R1: IF (x1 = 1 AND x2 = 1) OR (x3 = 1 AND x4 = 1) THEN T = 1; otherwise T = 0.
The decision tree equivalent to this rule is shown in figure 3.1. This kind of rule is called a propositional rule [Mitchell, 1997]. Propositional rules operate only with logical constants, in contrast with first-order rules, which operate with logical variables. We discuss first-order rules in chapter 4; here we focus on decision trees. Rule R1 and its equivalent decision tree in figure 3.1 belong to the relatively restricted language of propositional logic.
Let us take x = (x1, x2, x3, x4) = (0, 0, 0, 0). This x is shown in the first line of table 3.1. We feed the decision tree in figure 3.1 with this x. The node x1, called the root, is the beginning point. Here x1 = 0; therefore, we should traverse to the left from the root. Then we test the value of x2, which is also equal to 0. Next x3 and x4 are tested sequentially. Finally, we come to the leftmost terminal node with a 0 value for the target attribute.
Figure 3.1. Decision tree
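To make the traversal concrete, the following is a minimal Python sketch (not from the original text; the nested-tuple encoding and the names r1, build_full_tree and classify are illustrative) that builds the complete 31-node tree for rule R1 and feeds it x = (0, 0, 0, 0):

    # Sketch: the full decision tree of figure 3.1 encoded as nested tuples.
    # An internal node is (attribute_index, subtree_if_0, subtree_if_1);
    # a leaf is just the 0/1 value of the target T.

    def r1(x):
        """Rule R1: T = 1 iff (x1 AND x2) OR (x3 AND x4)."""
        x1, x2, x3, x4 = x
        return int((x1 and x2) or (x3 and x4))

    def build_full_tree(depth=0, prefix=()):
        """Build the complete 31-node tree that tests x1, x2, x3, x4 in order."""
        if depth == 4:                      # all four attributes fixed -> leaf
            return r1(prefix)
        return (depth,                      # 0-based index of the attribute tested here
                build_full_tree(depth + 1, prefix + (0,)),
                build_full_tree(depth + 1, prefix + (1,)))

    def classify(tree, x):
        """Traverse from the root to a terminal node, as described in the text."""
        while isinstance(tree, tuple):      # internal node
            attr, left, right = tree
            tree = right if x[attr] else left   # go right on 1, left on 0
        return tree

    tree = build_full_tree()
    print(classify(tree, (0, 0, 0, 0)))     # -> 0, the leftmost terminal node

The call returns 0, matching the leftmost terminal node reached in the description above.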
The tree in figure 3.1 has 31 nodes and can be simplified. For instance, the four rightmost terminal nodes on the bottom line produce the same value, 1. We merge them by assigning the rightmost node (in the third line) as a new terminal node. This node is marked in bold and all edges descending from it are eliminated (represented by dotted lines in figure 3.1). Similarly, other redundant nodes are merged. Finally, the decision tree has 19 nodes, shown in bold in figure 3.1. This is significantly less than the total number of initial nodes, 31. Moreover, the new tree is also redundant. The two subtrees beginning from the leftmost x2 node on line 2 are identical, i.e., independent of the value of x2 the target values are the same. Therefore, this x2 node can be eliminated. The new simplified tree has 13 nodes. It is shown in bold in figure 3.2.
Figure 3.2. Simplified decision tree
The tree in figure 3.2 can be written as a set of rules. Each rule represents a path from the root to a terminal node. The IF-part ANDs all nodes of the branch except the leaf, and the THEN-part consists of the leaf (terminal node). For example, the rightmost branch of the tree in figure 3.2 can be written as follows:

IF x1 = 1 AND x2 = 1 THEN T = 1.

This tree returns a solution (a 0 or 1 value of the target T in a terminal node) for all possible combinations of the four binary attributes. We say that a decision tree produces a complete solution if it delivers an output value for all possible input alternatives. There are many other trees that do not deliver complete solutions. For example, we can remove the rightmost branch of the tree in figure 3.2. The resulting shortened tree will not deliver solutions for combinations of attributes with x1 = 1 and x2 = 1.
Often decision tree methods implicitly assume a closed world: all positive instances (i.e., instances with the target value T = 1) are presented in the training data. Obviously, this is a questionable assumption for many applications.
There is no guarantee that we are able to collect all positive instances. The decision tree in figure 3.2 delivers a complete solution; therefore, the closed world assumption is satisfied for this tree. Hence it is enough to write only three rules with positive solutions to represent all solutions given by this tree completely:

IF x1 = 1 AND x2 = 1 THEN T = 1;
IF x1 = 1 AND x2 = 0 AND x3 = 1 AND x4 = 1 THEN T = 1;
IF x1 = 0 AND x3 = 1 AND x4 = 1 THEN T = 1.
We also can combine these rules into one rule, rule RT:

RT: IF (x1 = 1 AND x2 = 1) OR (x1 = 1 AND x2 = 0 AND x3 = 1 AND x4 = 1) OR (x1 = 0 AND x3 = 1 AND x4 = 1) THEN T = 1.
This single rule represents the tree from figure 3.2. It is called the disjunctive normal form (DNF) of the rule. It is a special property of decision tree methods that all rules have a specific DNF form, i.e., all AND clauses (conjunctions) include a value of the attribute assigned to the root. In rule RT all AND clauses include the root attribute x1. More generally, in this type of rule many AND clauses overlap significantly. For example, in rule RT
the second and third clauses differ only in their tests on x1 and x2; both contain x3 = 1 AND x4 = 1.
We call this specific DNF form of rule presentation the tree form of a rule. The tree form of a rule tends to be long even for simple non-redundant DNF rules like rule RS:

RS: IF (x1 = 1 AND x2 = 1) OR (x3 = 1 AND x4 = 1) THEN T = 1.
The last form of the rule is called a minimal DNF. It has only 4 components (tree edges), x1 = 1, x2 = 1, x3 = 1 and x4 = 1, in contrast with 9 components in the tree form of the same rule, RT. We show the comparative growth of rule size in more detail below. Decision tree algorithms need a special criterion to stop the search for the best tree, because the search time grows exponentially as a function of the number of nodes and edges in a tree.
The standard stopping criterion is the number of components (nodes, edges) of the tree [Craven, Shavlik, 1997]. Therefore, if the search for rule RT is restricted to six components, rule RT will not be discovered by the decision tree method, since rule RT has nine components in its IF-part.
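Because rule RS and the tree-form rule RT describe the same target, their equivalence can be checked by brute force. The sketch below (illustrative code, not from the original text) enumerates all 16 combinations of the four binary attributes:

    from itertools import product

    # Sketch: compare the tree-form rule RT with the minimal DNF rule RS
    # on all 2**4 combinations of the binary attributes.

    def rt(x1, x2, x3, x4):
        """Tree form RT: every conjunction contains the root attribute x1."""
        return ((x1 and x2) or
                (x1 and not x2 and x3 and x4) or
                (not x1 and x3 and x4))

    def rs(x1, x2, x3, x4):
        """Minimal DNF RS: (x1 AND x2) OR (x3 AND x4)."""
        return (x1 and x2) or (x3 and x4)

    assert all(bool(rt(*x)) == bool(rs(*x)) for x in product((0, 1), repeat=4))
    print("RT and RS agree on all 16 input combinations")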
Discovering rules without requiring the tree form of rules is called discovering the disjunctive normal form (DNF) of rules. The MMDR method (chapter 4) and the R-MINI method (section 3.2.3 in this chapter) belong to this class.
Rule RS is represented using only four components in the IF-part (comparisons of the values of attributes x1, x2, x3 and x4 with 1). It can be drawn as two small trees with five nodes each and 10 nodes total, representing the rules:

IF x1 = 1 AND x2 = 1 THEN T = 1;
IF x3 = 1 AND x4 = 1 THEN T = 1.
These rules are shown in bold in figure 3.3.
Figure 3.3. Two shorter decision trees
At first glance, the difference between the trees in figures 3.2 and 3.3 is not large. However, this difference grows very quickly with the number of pairs of attributes involved (see table 3.2).
For example, to express rule RG:

RG: IF (x1 = 1 AND x2 = 1) OR (x3 = 1 AND x4 = 1) OR (x5 = 1 AND x6 = 1) OR (x7 = 1 AND x8 = 1) THEN T = 1,
a tree having 8 levels and 61 nodes is needed in contrast to 20 nodes for the set of separate small trees (see Table 3.2).
The larger rule RL:

RL: IF (x1 = 1 AND x2 = 1) OR (x3 = 1 AND x4 = 1) OR (x5 = 1 AND x6 = 1) OR (x7 = 1 AND x8 = 1) OR (x9 = 1 AND x10 = 1) THEN T = 1

requires a still larger tree. The general formula for the number of nodes N(k) for one tree is:

N(k) = 2^(k+2) - 3,

and for the set of trees it is:

5k,

where k is the number of pairs such as (x1 = 1 AND x2 = 1). These formulas are tabulated in table 3.2, giving the number of nodes for a single tree and for a set of decision trees equivalent to the single tree.
Table 3.2 shows that to represent a rule like RL with 5 pairs (10 nodes) we need 125 nodes in a single tree and only 25 nodes in a set of trees. For 10 pairs (20 nodes) this difference is much more significant: 4093 nodes vs. 50 nodes. For 30 pairs (60 nodes) we have about 4.3 × 10^9 nodes (2^32 - 3) vs. 150 nodes.
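The figures in table 3.2 can be reproduced directly from the two formulas above; the following sketch (illustrative, assuming the reconstructed formulas N(k) = 2^(k+2) - 3 for a single tree and 5k for the set of trees) tabulates both counts:

    # Sketch: size of a single equivalent decision tree versus a set of small
    # trees for a rule with k conjunction pairs such as (x1 = 1 AND x2 = 1).

    def single_tree_nodes(k):
        return 2 ** (k + 2) - 3      # N(k) = 2^(k+2) - 3

    def tree_set_nodes(k):
        return 5 * k                 # five nodes per pair: two tests, three leaves

    for k in (2, 4, 5, 10, 30):
        print(k, single_tree_nodes(k), tree_set_nodes(k))
    # k=5 -> 125 vs 25,  k=10 -> 4093 vs 50,  k=30 -> 4294967293 vs 150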
There are propositional rules that are more complex, e.g., rules in which each attribute takes more than two values. For such more complex tasks, the numbers in table 3.2 become even larger for a single decision tree and grow exponentially. Therefore, although a decision tree can represent any propositional rule, this representation can be too large for practical use.
Other more general propositional and relational methods can produce smaller rules in these cases (see Section 3.2.3 in this chapter and Chapter 4).
Let us consider an illustrative example of discovering three simple trading rules using data from Table 3.3:
1. IF the stock price today (date t) is greater than the forecast price for tomorrow (date t+1) THEN sell today;
2. IF the stock price today is less than the forecast price for tomorrow THEN buy today;
3. IF the stock price today is equal to the forecast price for tomorrow THEN hold.
In Table 3.3, a value of 1 indicates sell, while -1 indicates buy and 0 indicates hold. This example allows us to illustrate the difference between the methods. First, we discover rules 1-3 using Table 3.3 and the propositional logic approach. Table 3.3 allows the extraction of a propositional rule PR.
Propositional rule (PR):
The bold line in Figure 3.4 illustrates this rule. The area above the line corresponds to buy, the area below corresponds to sell, and the line itself represents hold.

Figure 3.4. Diagram for decision tree boundary
We denote stock prices on date t (today) as a and stock prices on date t+1 (tomorrow) as b. Let us apply this rule for (a, b) = (17.8, 17.5). All conjunctions of the rule PR are false for (17.8, 17.5). Therefore, the rule does not deliver a sell signal, which is wrong. Propositional rule PR can also be discovered as a decision tree (Figure 3.5).
This tree is mathematically equivalent to propositional rule PR and gives the same incorrect buy/sell/hold signal as rule PR for a = 17.8 and b = 17.5.
Obviously, the correct answer should be “sell shares on date t” for these data.
This example shows that decision tree learning cannot discover the simple relational rule:

IF a > b THEN sell shares on date t.
Figure 3.5. Example of decision tree
The reason is that decision tree learning methods work with only one attribute value at a time, comparing this attribute value with a constant (e.g., S(t) > 17.8). Decision trees do not compare two attribute values of actual cases, for example the values a and b above. Relational methods described in chapter 4 specifically address this issue.
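The contrast can be made explicit in code. The sketch below (illustrative; the threshold 17.8 is taken from the example above, but the propositional rule shown is a hypothetical single-threshold test, not the book's rule PR) compares the relational rule with a decision-tree-style test on the point (a, b) = (17.8, 17.5):

    # Sketch: relational rule vs. a fixed-threshold (propositional) test.
    # Encoding as in Table 3.3: 1 = sell, -1 = buy, 0 = hold.

    def relational_signal(price_today, forecast_tomorrow):
        """Rules 1-3: compare the two attribute values directly."""
        if price_today > forecast_tomorrow:
            return 1      # sell
        if price_today < forecast_tomorrow:
            return -1     # buy
        return 0          # hold

    def propositional_signal(price_today, forecast_tomorrow, threshold=17.8):
        """Hypothetical decision-tree-style rule: each attribute is compared
        only with a constant, so pairs on the same side of the threshold,
        such as (17.8, 17.5), produce no signal."""
        if price_today > threshold and forecast_tomorrow <= threshold:
            return 1
        if price_today <= threshold and forecast_tomorrow > threshold:
            return -1
        return 0

    print(relational_signal(17.8, 17.5))     # 1  -> correct "sell" signal
    print(propositional_signal(17.8, 17.5))  # 0  -> misses the sell signal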
Dietterich (1997) discusses similar constraints of the decision tree method. Different training samples will shift the locations of the staircase approximation shown in figure 3.4. By generating different approximations, a better classifier for the diagonal decision boundary can be constructed through combinations of these approximations. There are many ways of combining approximations. One of them is majority voting M(x):

M(x) = 1 if n1(x) > n0(x), and M(x) = 0 otherwise,

where n1(x) is the number of approximations that classify x as a member of class #1 and n0(x) is the number of approximations that classify x as a member of class #0.
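A minimal sketch of such majority voting (the three threshold classifiers below are hypothetical placeholders, not the staircase approximations of figure 3.4):

    # Sketch: majority vote M(x) over an ensemble of binary classifiers.

    def majority_vote(classifiers, x):
        """Return 1 if more ensemble members assign x to class #1 than to
        class #0, otherwise return 0."""
        n1 = sum(1 for clf in classifiers if clf(x) == 1)   # votes for class #1
        n0 = len(classifiers) - n1                          # votes for class #0
        return 1 if n1 > n0 else 0

    # Hypothetical ensemble of three threshold classifiers on a point x = (a, b)
    ensemble = [
        lambda x: 1 if x[0] > x[1] + 0.1 else 0,
        lambda x: 1 if x[0] > x[1] else 0,
        lambda x: 1 if x[0] > x[1] - 0.1 else 0,
    ]
    print(majority_vote(ensemble, (17.8, 17.5)))   # -> 1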
Dietterich [1997] noted that these improved staircase approximations are equivalent to very complex decision trees. Those trees are too large to be included in a practical hypothesis space H; the space would be far too large for the available training data.
The method of voting ensembles of decision trees is one of the approaches to resolving the size and accuracy problems of traditional decision trees in financial applications. On the other hand, there are two current limitations of this approach, pointed out by Dietterich [1997]:
1. Voting ensembles can require large amounts of memory to store and large amounts of computation to apply. For example, 200 decision trees require 59 megabytes of storage, which makes them less practical than other methods.
2. A second difficulty with voting ensemble classifiers is that a voting ensemble provides little insight into how it makes its decisions. A single decision tree can often be interpreted by human users, but an ensemble of 200 voted decision trees is much more difficult to understand.
Another alternative in finance is discovering sets of rules, i.e., rules in propositional disjunctive normal form (section 3.2.3 and [Apte, Hong, 1996]).
The third approach is discovering sets of first-order probabilistic rules in financial time series (chapter 4 and [Kovalerchuk, Vityaev, 1998, 1999]).