3.4. Extracting decision trees from neural networks in finance
3.4.1. Predicting the Dollar-Mark exchange rate
In this section, we discuss predicting the daily Dollar-Mark exchange rate using a decision tree extracted from an independently trained neural network. According to Craven and Shavlik [1997], the decision tree produced from the training set without requests to the "oracle" (the learned neural network) is much less accurate and comprehensible than the tree extracted by the Trepan algorithm. Formally, three parameters have been used to evaluate the decision tree: complexity, predictive accuracy, and fidelity to the network. The fidelity parameter measures how close the forecast of the decision tree is to the forecast of the underlying neural network.
Data. The data for the task consist of daily values of the foreign exchange rate between the U.S. Dollar and the German Mark from January 15, 1985 through January 27, 1994 [Weigend et al., 1996], split as follows (a sketch of this split is given after the list):
1. the test set consisted of the last 216 days;
2. the validation set consisted of 535 days (every fourth day from the remaining data);
3. the training set consisted of the remaining 1607 days.
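As a check on these numbers, here is a minimal sketch of the chronological split, assuming the daily rates are loaded into a one-dimensional array (the file name and the exact starting offset of the "every fourth day" rule are our assumptions; with these choices the counts 216, 535, and 1607 agree):

    # Sketch of the chronological split described above.
    import numpy as np

    rates = np.loadtxt("usd_dem_daily.txt")   # 2358 daily values, hypothetical file

    test = rates[-216:]                       # the last 216 days
    rest = rates[:-216]                       # 2142 remaining days
    val_idx = np.arange(3, len(rest), 4)      # every fourth remaining day -> 535 days
    validation = rest[val_idx]
    training = np.delete(rest, val_idx)       # the remaining 1607 days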
The neural network used was trained by Weigend et al. [1996] independently of the decision tree constructed by Craven and Shavlik [1997]. The network is composed of 69 input units, a single hidden layer of 15 units, and 3 output units. The input units represent two types of information:
– Information derived from the time series itself, such as a relative strength index, skewness, point and figure chart indicators, etc. (12 inputs).
– Fundamental information beyond the series itself, such as indicators dependent on exchange rates between different countries, interest rates, stock indices, currency futures, etc. (57 inputs).
Three output units reflect a forecast of the exchange rate and two characteristics representing turning points in the exchange rate, specifically (a sketch of the first target follows the list):
– a normalized version of the return (the logarithm of the ratio of tomorrow's price to today's price, divided by the standard deviation computed over the last 10 trading days),
– the number of days to the next turning point where the exchange rate will reverse direction,
– the return between today and the next turning point.
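A minimal sketch of the first target, assuming that the standard deviation is taken over the daily log returns of the last 10 trading days (the source does not spell this detail out; names are ours):

    import numpy as np

    def normalized_return(prices, t, window=10):
        log_returns = np.log(prices[1:] / prices[:-1])   # daily log returns
        sigma = np.std(log_returns[t - window:t])        # std over the last 10 days
        return np.log(prices[t + 1] / prices[t]) / sigma

For the classification task described below, the label is simply the sign of this quantity: "up" if it is positive, "down" otherwise.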
This task was converted into the classification task of predicting whether the current day's price is going up or down. The network was trained using the technique of "clearning" [Weigend et al., 1996]. The resulting cleaned network retained connections to 15 of the real-valued attributes and five of the discrete-valued attributes from the original 69 attributes. In the computational experiments, two sets of attributes were involved:
– the 20 attributes taken from the cleaned network and
– the entire set of 69 attributes.
The clearning method involves simultaneously cleaning the data and learning the underlying structure of the data. Specifically, clearning uses a cost function that consists of two terms:

E = \eta\,(y - y_d)^2 + \kappa\,(x - x_d)^2

The first term, \eta(y - y_d)^2, measures the cost of learning by weighting the squared deviation between the network's output y and the target value y_d. The second term, \kappa(x - x_d)^2, measures the cost of cleaning by squaring the deviation between the cleaned input x and the actual input x_d. The parameters \eta and \kappa are the learning rate and the cleaning rate, respectively.
The clearning technique assumes that both inputs and outputs are corrupted by noise. Correcting the inputs and outputs can suppress the noise level and help to find a simpler regularity when training the network. A simpler regularity suffers less from overfitting; therefore, the regularity found can be more reliable for out-of-sample forecasting.
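To make the two-term cost concrete, here is a minimal sketch for a single example (network internals are abstracted away; the function signature and names are ours):

    def clearning_cost(y, y_target, x_cleaned, x_actual, eta, kappa):
        learning_term = eta * (y - y_target) ** 2          # cost of learning
        cleaning_term = kappa * sum((xc - xa) ** 2         # cost of cleaning
                                    for xc, xa in zip(x_cleaned, x_actual))
        return learning_term + cleaning_term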
Algorithm specifics. Trepan uses a given validation set to measure the closeness of the decision tree output to the neural network output; this closeness is called fidelity. Trepan generates the tree with the highest fidelity to the target network. This is a fundamental property of the approach: if the network does not capture the essence of the underlying financial process, the same will happen with an extracted decision tree.
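As a sketch, fidelity can be computed as the fraction of validation cases on which the extracted tree reproduces the network's output (tree and net stand for any objects exposing a predict method; this is our illustration, not Trepan's internal code):

    def fidelity(tree, net, validation_inputs):
        agree = sum(tree.predict(x) == net.predict(x) for x in validation_inputs)
        return agree / len(validation_inputs)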
For the min_sample threshold (see Section 3.3.2), values of 1000, 5000, and 10000 were tried. Recall that the training set consists of only 1607 days. There is a big difference between training the decision tree directly on 1607 actual samples and training it with 10000 artificial examples generated by the neural network.
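The extra examples come from treating the network as an oracle: Trepan labels artificially generated inputs by querying the network. A rough sketch under simplifying assumptions follows (the real algorithm models the distribution of attribute values among the examples reaching each node; here attribute values are simply resampled independently from their empirical distributions):

    import random

    def oracle_examples(train_inputs, net, n_needed):
        columns = list(zip(*train_inputs))               # per-attribute value pools
        examples = []
        for _ in range(n_needed):
            x = [random.choice(col) for col in columns]  # draw each attribute value
            examples.append((x, net.predict(x)))         # label comes from the network
        return examples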
3.4.2. Comparison of performance
Two methods for discovering decision trees directly from data were used for comparison with Trepan, namely C4.5 [Quinlan, 1993] and an enhanced version of ID2-of-3 [Murphy & Pazzani, 1991; Craven, Shavlik, 1997]. In addition to these methods, a naïve prediction algorithm was run for comparison with Trepan. This algorithm simply reproduces the current up/down direction as the forecast. Below we analyze the experimental results reported by Craven and Shavlik, summarized in Table 3.10.
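For illustration, the naïve baseline amounts to a one-line shift of the observed series (directions encoded as +1 for up and -1 for down is our assumption):

    def naive_forecast(directions):
        return directions[:-1]   # prediction for day t+1 is the observed move on day t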
The cleaned network provides the most accurate predictions, and the m-of-n tree extracted from the neural network made similar predictions. The naïve rule and conventional decision trees produce the worst predictions.
In addition, Craven and Shavlik matched Trepan's output against the output of the neural network. Trepan was able to reproduce the network's up/down output in about 80% of the cases, but only 60.6% of Trepan's outputs matched the real up/down behavior of the exchange rate. On the other hand, methods extracting rules directly from data (C4.5, ID2-of-3+) did not reach this level of accuracy.
Table 3.10 shows that the accuracy of the naïve prediction, C4.5, and ID2-of-3+ ranges from 52.8% to 59.3%. This performance is similar to the average of 55.6% obtained by Sipina_W for the SP500 in Section 3.2.2, where decision trees were extracted from data without the use of a neural network.
It is important to note here that one cannot expect an extracted decision tree to significantly outperform the underlying neural network, because Trepan is designed to mimic the behavior of the network. Table 3.10 confirms this statement.
Figure 3.9 shows the tree extracted by Trepan from the cleaned neural network. This tree has one m-of-n test and four ordinary splitting tests. Table 3.11 [Craven, Shavlik, 1997] provides brief descriptions of the attributes that appear in the depicted tree; for example, some of these attributes represent different functions of the Mark-Dollar exchange rate. Each leaf in Figure 3.9 predicts the up/down direction of the exchange rate.
Figure 3.9. The Trepan tree [Craven, Shavlik, 1997]
The beginning node of the tree (the root) is an m-of-n node with m = 5 and n = 10. This means that any five of the 10 properties (tests) listed in this node must be satisfied to get the T ("true") output; otherwise this node leads to the F output.
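A minimal sketch of how such a node is evaluated (tests are assumed to be boolean predicates; names are ours):

    def m_of_n_node(tests, x, m=5):
        return sum(test(x) for test in tests) >= m   # True -> T branch, else F branch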
The trees extracted from the neural networks are considerably simpler than the trees extracted directly from data, although the latter trees were simplified. Based on these results, Craven and Shavlik concluded that the tree extracted from the neural network is more comprehensible than the trees learned directly from data. Recall that these authors measure the comprehensibility of a decision tree by its number of nodes.
However, the single root node in Figure 3.9, which is a compact 5-of-10 rule extracted from the neural network, is equivalent to the set of all combinations of 5 out of its 10 tests, i.e., 252 conventional rules. A conventional decision tree equivalent to this set of rules would not be smaller than the decision trees discovered directly from data by the C4.5 and ID2-of-3+ algorithms; for instance, C4.5 produced 103 internal nodes [Craven, Shavlik, 1997].
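The count follows from the binomial coefficient for choosing which 5 of the 10 tests must be satisfied:

\binom{10}{5} = \frac{10!}{5!\,5!} = 252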
Craven and Shavlik report that the first node (the 5-of-10 rule) in the tree is able to mimic nearly 80% of the neural network's behavior, and that successive nodes added to the tree do not significantly improve this fidelity value. They express concern that Trepan possibly fails to explain the underlying regularity in the financial process discovered by the neural network. These observations raise the following questions:
– Is this 5-of-10 rule really more comprehensible than the set of rules extracted directly from data?
– Does this 5-of-10 rule provide enough explanation of the underlying regularity in the financial process discovered by the neural network?
Conventional decision trees and sets of rules have an obvious advantage in comprehensibility. We think that this non-conventional 5-of-10 rule and other similar m-of-n rules need to be analyzed further to extract more understandable, conventional IF-THEN rules. Alternatively, appropriate rules can be selected from the 252 conventional rules already identified.
It was noted in [Craven, Shavlik, 1997] that Trepan discovered the overall behavior of the target in the regions that correspond to up or down trends. However, it was not able to catch all target fluctuations in the regions where the up and down movements are small. Apte and Hong (see Section 3.2.3) avoided this problem by putting a 6% threshold on up/down moves in their trading strategy. Therefore, the choice of trading strategy can eliminate some problems in matching the neural network and the decision tree.
From our viewpoint, these experiments show that so far the m-of-n rule language has not helped uncover really comprehensible rules behind the regularities discovered by neural networks in finance. The language of m-of-n rules is more understandable than the language of neural networks if m and n are small. However, in the described experiments, relatively large values (m = 5, n = 10) were found. Therefore, the search for an appropriate language should continue. Meanwhile, these experiments show that a decision tree can be extracted from a neural network in finance with a close approximation to the network. This can be valuable if the underlying neural network performs well.