
3.1.4. Ensembles and Hybrid methods for decision trees

Many studies have shown advantages of ensembles of different decision trees, and of hybrids of decision trees with other methods, over individual decision trees. In this section we compare an individual decision tree constructed with the C4.5 algorithm [Quinlan, 1993] against different ensembles of decision trees. C4.5 is one of the most widely used decision tree algorithms. It consists of the following steps:

1. Choosing the attribute to be a starting node (root) in the decision tree.

2. Generating various possible attribute-splitting tests at the node (e.g., creating branches x < 7.7 and x >= 7.7).

3. Ranking the generated attribute tests using the Information Gain Ratio Criterion (IGRC).

4. Choosing the attribute-value splitting test top-ranked according to IGRC.

5. Extending the decision tree with new nodes corresponding to the best test.

6. Repeating steps 2-4 for each internal node of the decision tree.

The algorithm splits discrete-valued and real-valued attributes differently. If attribute S has V discrete values, then the data can be split into V subsets, each containing the objects with one of the V values. If any real number can be a value of attribute S, then the data are split using some threshold T into two subsets (S < T and S >= T).
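To make the split-selection steps concrete, the sketch below ranks candidate thresholds for a single real-valued attribute by the gain ratio. It is a minimal illustration in Python, not C4.5 itself: the toy data, the function names, and the use of midpoints between consecutive sorted values as candidate thresholds are our assumptions.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gain_ratio(values, labels, threshold):
    """Information gain ratio of splitting a real-valued attribute at `threshold`."""
    left = [y for x, y in zip(values, labels) if x < threshold]
    right = [y for x, y in zip(values, labels) if x >= threshold]
    if not left or not right:
        return 0.0
    n = len(labels)
    # Information gain: entropy before the split minus weighted entropy after it.
    gain = entropy(labels) - (len(left) / n) * entropy(left) - (len(right) / n) * entropy(right)
    # Split information penalizes splits that merely fragment the data.
    split_info = -sum((len(s) / n) * math.log2(len(s) / n) for s in (left, right))
    return gain / split_info if split_info > 0 else 0.0

# Toy data: one real-valued attribute and two classes.
x = [5.1, 6.3, 7.7, 8.2, 9.0, 10.4, 11.8, 12.5]
y = ["sell", "sell", "sell", "buy", "sell", "buy", "buy", "buy"]

# Candidate thresholds: midpoints between consecutive sorted attribute values.
xs = sorted(x)
candidates = [(a + b) / 2 for a, b in zip(xs, xs[1:])]
best = max(candidates, key=lambda t: gain_ratio(x, y, t))
print(f"best threshold T = {best:.2f}, gain ratio = {gain_ratio(x, y, best):.3f}")
```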

Ensembles. The first approach to creating ensembles of decision trees is to run a decision tree learning algorithm several times, each time with a different subset of the training examples (subsamples). We already briefly discussed three subsampling methods in Section 2.8.1:

1. Random selection of subsets (bootstrap aggregation, or bagging),
2. Selection of disjoint subsets (cross-validated committees),
3. Selection of subsets according to a probability distribution.

Sometimes, methods which use random selections of subsets are called bagged methods, e.g., bagged C4.5.
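As a concrete illustration of the bagging idea, the sketch below trains each tree on a bootstrap sample of the training data and combines the trees by a plurality vote. It is a minimal sketch, not the book's implementation: scikit-learn's DecisionTreeClassifier (a CART-style learner) stands in for C4.5, and the helper names and synthetic data are ours.

```python
import numpy as np
from collections import Counter
from sklearn.tree import DecisionTreeClassifier

def bagged_trees(X, y, n_trees=25, seed=0):
    """Train an ensemble of trees, each on a bootstrap sample of the training data."""
    rng = np.random.default_rng(seed)
    n = len(X)
    trees = []
    for _ in range(n_trees):
        idx = rng.integers(0, n, size=n)          # sample n rows with replacement
        trees.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    return trees

def vote(trees, X):
    """Plurality vote of the individual trees."""
    all_preds = np.array([t.predict(X) for t in trees])   # shape: (n_trees, n_samples)
    return np.array([Counter(col).most_common(1)[0][0] for col in all_preds.T])

# Example usage with synthetic data.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 4))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
ensemble = bagged_trees(X, y)
print(vote(ensemble, X[:5]), y[:5])
```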

There are several versions of these methods, which are summarized below in Table 3.4, following [Dietterich, 1997].

For example, the AdaBoost boosting method uses different probability distributions to better match an ensemble of classifiers to the most difficult training data. The steps of this algorithm are presented below [Freund, Schapire, 1995, 1996; Dietterich, 1997]:

Step 1. Compute a probability distribution p over the training examples Tr.

Step 2. Draw a training set of size s according to the probability distribution p.

Step 3. Produce a decision tree (classifier h) using the decision tree learning algorithm.

Step 4. Compute the weighted error rate of classifier h on the training examples, $Er = \sum_i p(x_i)\, e(x_i)$, where $e(x_i)$ is the error of classifier h for example $x_i$.

Step 5. Adjust the probability distribution p on the training examples (examples with higher error obtain higher probability values).

Step 6. Generate a new training subsample of size k with replacement according to the adjusted probability distribution and repeat, beginning from step 3.

The final classifier is constructed by a weighted vote of the individual classifiers. Each classifier is weighted according to its accuracy on the distribution that it was trained on.
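The following sketch implements the boosting-by-resampling scheme outlined above for the binary case, using an AdaBoost.M1-style weight update. It is an illustrative approximation, not the authors' implementation: scikit-learn's decision tree stands in for C4.5, labels are assumed to be in {-1, +1}, and the specific update (multiplying the weights of correctly classified examples by Er/(1-Er)) follows Freund and Schapire's published algorithm.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def boost_by_resampling(X, y, n_rounds=10, seed=0):
    """AdaBoost.M1-style boosting with subsamples drawn from the current distribution."""
    rng = np.random.default_rng(seed)
    n = len(X)
    p = np.full(n, 1.0 / n)                 # Step 1: uniform distribution over training examples
    classifiers, alphas = [], []
    for _ in range(n_rounds):
        idx = rng.choice(n, size=n, replace=True, p=p)               # Steps 2/6: draw a subsample
        h = DecisionTreeClassifier(max_depth=3).fit(X[idx], y[idx])  # Step 3: fit a tree
        miss = (h.predict(X) != y).astype(float)
        err = np.dot(p, miss)                                        # Step 4: weighted error rate
        if err <= 0 or err >= 0.5:          # stop if the tree is perfect or no better than chance
            break
        beta = err / (1.0 - err)
        p *= np.where(miss == 1, 1.0, beta)                          # Step 5: down-weight correct examples
        p /= p.sum()
        classifiers.append(h)
        alphas.append(np.log(1.0 / beta))   # vote weight grows with accuracy on its distribution
    return classifiers, alphas

def weighted_vote(classifiers, alphas, X):
    """Final classifier: weighted vote of the individual trees (labels in {-1, +1})."""
    score = sum(a * np.where(h.predict(X) == 1, 1.0, -1.0) for h, a in zip(classifiers, alphas))
    return np.where(score >= 0, 1, -1)

# Example usage with synthetic binary data labelled {-1, +1}.
rng = np.random.default_rng(1)
X = rng.normal(size=(300, 5))
y = np.where(X[:, 0] * X[:, 1] > 0, 1, -1)
clfs, alphas = boost_by_resampling(X, y)
print("training accuracy of the ensemble:", (weighted_vote(clfs, alphas, X) == y).mean())
```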

Table 3.5, condensed from descriptions in [Dietterich, 1997], summarizes the performance of different decision tree methods in comparison with their ensembles. Most of these comparisons were done using the C4.5 algorithm as a benchmark method [Quinlan, 1993], with one exception for the option method [Buntine, 1990]. The option method produces decision trees in which an internal node may contain several alternative splits (each producing its own sub-decision tree). In effect, an option tree is a set of voted conventional sub-decision trees.

An alternative approach to creating ensembles of decision trees is to change a tree already discovered, using transition probabilities to go from one tree to another. For instance, the Markov Chain Monte Carlo (MCMC) method [Dietterich, 1997] interchanges a parent and a child node in the tree or replaces one node with another. Each tree is associated with some probability, and the trees are combined by a weighted vote to produce a forecast. In the Bayesian approach, these probabilities can be assigned using prior probabilities and the training data. The process of generating decision trees and assigning probabilities as transition probabilities from one tree to another is modeled as a Markov process. More about general Markov processes can be found in [Hillier, Lieberman, 1995].
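As a rough illustration of the MCMC idea, the toy sketch below performs a Metropolis-style walk over decision stumps (one-node trees standing in for full decision trees): each proposal replaces the split attribute or perturbs its threshold, and the accepted trees are later combined by a vote. The acceptance rule based on training accuracy, the temperature parameter, and all names are our simplifications, not the algorithm from the cited work.

```python
import numpy as np

def stump_predict(feature, threshold, X):
    """A decision 'stump': a one-node tree splitting a single feature at a threshold."""
    return np.where(X[:, feature] < threshold, 0, 1)

def mcmc_stump_ensemble(X, y, n_steps=2000, temperature=5.0, seed=0):
    """Metropolis-style walk over stumps; the visited stumps are kept for a later vote."""
    rng = np.random.default_rng(seed)
    n_features = X.shape[1]
    feature, threshold = 0, float(np.median(X[:, 0]))       # start from an arbitrary stump
    acc = (stump_predict(feature, threshold, X) == y).mean()
    samples = []
    for _ in range(n_steps):
        # Proposal: replace the split attribute or perturb its threshold.
        if rng.random() < 0.5:
            f_new = int(rng.integers(n_features))
            t_new = float(rng.choice(X[:, f_new]))
        else:
            f_new, t_new = feature, threshold + float(rng.normal(scale=0.5))
        acc_new = (stump_predict(f_new, t_new, X) == y).mean()
        # Accept with probability exp(temperature * (acc_new - acc)) (Metropolis rule).
        if rng.random() < np.exp(temperature * (acc_new - acc)):
            feature, threshold, acc = f_new, t_new, acc_new
        samples.append((feature, threshold))
    return samples

def vote(samples, X):
    """Unweighted vote of the sampled stumps."""
    preds = np.array([stump_predict(f, t, X) for f, t in samples])
    return (preds.mean(axis=0) >= 0.5).astype(int)

# Example usage: the walk concentrates on stumps close to the true rule x2 > 0.3.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
y = (X[:, 1] > 0.3).astype(int)
samples = mcmc_stump_ensemble(X, y)
print("ensemble accuracy:", (vote(samples[500:], X) == y).mean())   # discard early samples as burn-in
```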

3.1.5. Discussion

A learning algorithm is unstable if its forecast is altered significantly by a small change in the training data. Many examples of such instability are known for decision tree, neural network, and rule learning methods. Linear regression, k-nearest neighbor, and linear discriminant methods suffer less from these instabilities. Dietterich [1997] argues that voting ensembles of unstable decision tree learning methods are much more stable than the underlying decision tree methods. In addition to a more stable forecast, Table 3.5 shows that such a forecast fits the real target values better, producing fewer forecasting errors.
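The instability claim can be checked with a small experiment such as the one below: fit the same learner before and after flipping a handful of training labels and count how many test predictions change. The data, the size of the perturbation, and the use of scikit-learn's tree and bagging classifiers are our assumptions, chosen only to illustrate the point.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)

# A slightly perturbed copy of the training set: flip the labels of 5 examples.
y_perturbed = y.copy()
flip = rng.choice(len(y), size=5, replace=False)
y_perturbed[flip] = 1 - y_perturbed[flip]

X_test = rng.normal(size=(1000, 5))

def disagreement(make_model):
    """Fraction of test predictions that change after the small training-set change."""
    a = make_model().fit(X, y).predict(X_test)
    b = make_model().fit(X, y_perturbed).predict(X_test)
    return (a != b).mean()

print("single tree :", disagreement(lambda: DecisionTreeClassifier(random_state=0)))
print("bagged trees:", disagreement(lambda: BaggingClassifier(n_estimators=50, random_state=0)))
```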

Why do individual decision trees often perform worse than the voting ensembles built on them? There are at least three reasons: insufficient training data, difficult search problems, and an inadequate hypothesis space [Dietterich, 1997]:

1. Insufficient training data. Usually several hypotheses are confirmed by the training data. It would be unwise to prefer one of them and reject others with the same performance, knowing that the data are insufficient to justify this preference.

2. Search problems. It is a computational challenge to find the smallest possible decision trees or neural networks consistent with a training set.

Both problems are NP-hard [Hyafil, Rivest, 1976; Blum, Rivest, 1988].

Therefore, search heuristics became common for finding small decision trees. Similarly, practical neural network algorithms search only for locally optimal weights of the network. These heuristics have a better chance of producing good solutions (decision trees or neural networks) if they use slightly different data, as is done with ensembles of classifiers.

3. Inadequate hypothesis space. It is possible that a hypothesis space H does not contain the actual function f, but does contain several acceptable approximations to f. Weighted combinations of them can lead to an acceptable representation of f.

There are two potential problems with an inadequate hypothesis space. H may not contain any weighted combination of decision trees close to f. Alternatively, such a combination may be too large to be practical. In both cases ensembles do not help to find f. Therefore, a wider hypothesis space and a more expressive hypothesis language are needed. The DNF and first-order languages provide this option.

Dealing with insufficient training data is also not as simple as adding as much additional data as possible. It is common that the real shortage is in so-called "border" data: data representing the border between classes. For instance, in Figure 3.4, the diagonal is the actual border between two classes, but the available training data are not sufficient to identify the diagonal unambiguously. These data permit several significantly different borderlines. Indeed, having thousands of training examples located far away from the border does not permit the border to be identified with the desired accuracy. In Figure 3.4, a simple linear discriminant function has enough expressive power to solve this problem completely. Just a few points on the border (the "hold" state), or a few points really close to this border between the "buy" and "sell" classes, are needed. Nevertheless, thousands of additional examples representing "buy" and "sell" situations far away from the actual border are useless for identifying this border.
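A toy version of the Figure 3.4 argument can be sketched as follows: two classes are separated by the diagonal, and trees trained on different subsamples drawn far from the border produce visibly different borderlines near the diagonal, while trees trained on near-border data largely agree. The synthetic data and the disagreement measure are ours, used only to illustrate the ambiguity the text describes.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

def sample(n, min_gap, max_gap):
    """Points in [0,10]^2 whose distance from the diagonal border x2 = x1 lies in a band."""
    pts = rng.uniform(0, 10, size=(20 * n, 2))
    dist = np.abs(pts[:, 1] - pts[:, 0])
    pts = pts[(dist > min_gap) & (dist < max_gap)][:n]
    return pts, (pts[:, 1] > pts[:, 0]).astype(int)   # class 1 ("buy") above the diagonal

def border_disagreement(min_gap, max_gap, n_train=500, n_trees=20):
    """How much do trees trained on different subsamples disagree near the true border?"""
    test, _ = sample(2000, 0.0, 0.5)                   # test points close to the diagonal
    preds = []
    for _ in range(n_trees):
        X, y = sample(n_train, min_gap, max_gap)
        preds.append(DecisionTreeClassifier().fit(X, y).predict(test))
    preds = np.array(preds)
    return (preds != preds[0]).mean()                  # fraction of votes differing from the first tree

print("trained far from the border :", border_disagreement(3.0, 7.0))
print("trained close to the border :", border_disagreement(0.0, 0.5))
```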
