A survey of cost-sensitive decision tree induction algorithms

Although ID3 adopts some of the ideas of CLS, a significant development was ID3's use of an information-theoretic measure for attribute selection [Quinlan 1979]. Several algorithms continue this adaptation of information-theoretic measures but also take into account the cost of misclassification as well as the cost of tests. If an expert has no knowledge of the importance of an attribute, this bias is set to the default value of 1.

Each weight is calculated as the average of the off-diagonal cells in the corresponding column.
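As a rough illustration of this weighting rule, the sketch below averages the off-diagonal cells of each column of a square cost matrix. The function name `column_weights`, the example costs, and the convention of which axis holds the actual versus the predicted class are assumptions made for illustration only; the text does not fix them.

```python
def column_weights(cost_matrix):
    """Return one weight per column: the mean of that column's off-diagonal cells.

    cost_matrix is assumed to be a square list of lists with at least two rows.
    """
    n = len(cost_matrix)
    weights = []
    for j in range(n):
        # Skip the diagonal cell (i == j), which holds the cost of a correct decision.
        off_diagonal = [cost_matrix[i][j] for i in range(n) if i != j]
        weights.append(sum(off_diagonal) / len(off_diagonal))
    return weights


# Hypothetical three-class cost matrix.
costs = [
    [0.0, 2.0, 4.0],
    [1.0, 0.0, 6.0],
    [3.0, 8.0, 0.0],
]
weights = column_weights(costs)
print(weights)
```

Each weight depends only on the misclassification costs in its column, so correct-classification costs on the diagonal never influence it.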

Fig. 1. Decision tree after ID3 has been applied to the dataset in Table I.

Postconstruction

The approach also uses discriminant analysis to find the partition that minimizes the cost, provided that the class distributions are multivariate normal. One strategy, explored in Vadera [2005a], is to try all possible combinations and select the subset that minimizes the cost. An alternative strategy, explored in Vadera [2010], selects two of the most informative features, as measured by information gain, and uses Eq.

A convex hull created from the points (0,0), four classifiers and (1,1) represents the optimal front. This means that for every classifier under this convex hull, there is a classifier on the front that is cheaper. The idea behind the approach of Ferri et al. [2002] is to create alternative classifiers by considering all possible labels for the leaf nodes of the tree.

For a tree with m leaf nodes and a two-class problem, there are 2^m alternative labelings, which could be computationally expensive. Ferri et al. [2002] show that for a two-class problem, if the leaves are ordered by the accuracy of one of the classes, only m+1 alternative labelings are needed to define the convex hull, where L_{i,j} denotes the label of the jth node in the ith labeling. The convex hull formed by these labelings can then be used to determine the optimal classifier once the cost of misclassification is known.
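The reduction from exhaustive relabeling to m+1 labelings can be sketched as follows. This is a minimal illustration, not the authors' formulation: it assumes each labeling takes a threshold form over the ordered leaves (the first i leaves labeled positive, the rest negative), and the leaf counts, cost parameters, and function names are hypothetical.

```python
def threshold_labelings(leaves):
    """leaves: list of (n_pos, n_neg) training counts, one pair per leaf.

    Returns the leaves sorted by positive-class accuracy (highest first)
    together with the m+1 threshold labelings over that ordering.
    """
    ordered = sorted(leaves, key=lambda c: c[0] / (c[0] + c[1]), reverse=True)
    m = len(ordered)
    # Labeling i marks the i most positive leaves '+' and the remaining leaves '-'.
    labelings = [['+'] * i + ['-'] * (m - i) for i in range(m + 1)]
    return ordered, labelings


def cheapest_labeling(leaves, cost_fp, cost_fn):
    """Pick the labeling with the lowest total misclassification cost."""
    ordered, labelings = threshold_labelings(leaves)
    best = None
    for labels in labelings:
        cost = 0.0
        for (n_pos, n_neg), label in zip(ordered, labels):
            # A '+' leaf misclassifies its negatives; a '-' leaf misclassifies its positives.
            cost += n_neg * cost_fp if label == '+' else n_pos * cost_fn
        if best is None or cost < best[0]:
            best = (cost, labels)
    return best


leaves = [(8, 2), (1, 9), (5, 5)]          # hypothetical leaf counts
ordered, labelings = threshold_labelings(leaves)
best_cost, best_labels = cheapest_labeling(leaves, cost_fp=1, cost_fn=5)
print(len(labelings), best_cost, best_labels)
```

Only the labelings need to be enumerated once; a different cost ratio simply selects a different one of the m+1 candidates.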

MULTIPLE-TREE, NONGREEDY METHODS FOR COST-SENSITIVE DECISION TREE INDUCTION

Use of Genetic Evolution for Cost-Sensitive Tree Induction

One of the first systems to use GAs was Turney's [1995] ICET (Inexpensive Classification with Expensive Tests) system. The fitness function used in GDT-MC [Kretowski and Grzes 2007] aims to take into account the expected misclassification cost as well as the size of the trees. Such a maximal tree is then interpreted by mapping the nodes to attributes, assuming that the branches are ordered relative to the features.

A version of the minimum error pruning algorithm that minimizes cost instead of error is used for pruning. The fitness measure used is the expected cost of classification, taking into account both the cost of misclassification and the cost of tests. After the genes are mapped to decision trees, the trees are pruned, and their fitness is obtained, the standard mutation and crossover operators are applied, a new generation of the fittest is produced, and the process is repeated for a fixed number of cycles.

2005] leverages the capabilities of Genetic Programming (GP), which allows trees to be represented as programs instead of bit strings, to develop a cost-sensitive decision tree induction algorithm. The elitist strategy ensures that a few of the strongest are copied to the new generation, and the linear ranking strategy ensures some diversity and prevents the strongest from dominating the evolution too early. The strongest of the four is copied to the pool for the next generation, and this tournament process is repeated to produce the complete mating pool for the next generation.
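The tournament step described above can be sketched in a few lines. This is an illustrative sketch only: it assumes that a higher fitness value means a stronger individual and that the four contestants are drawn without replacement, neither of which is stated in the text. The name `tournament_pool` and the toy integer population are hypothetical.

```python
import random


def tournament_pool(population, fitness, pool_size, tournament_size=4, rng=None):
    """Build a mating pool by repeated tournaments.

    Each round draws tournament_size distinct individuals at random and
    copies the strongest of them into the pool.
    """
    rng = rng or random.Random()
    pool = []
    while len(pool) < pool_size:
        contestants = rng.sample(population, tournament_size)
        pool.append(max(contestants, key=fitness))
    return pool


# Toy population: integers whose fitness is their own value.
population = list(range(10))
pool = tournament_pool(population, fitness=lambda x: x, pool_size=6,
                       rng=random.Random(0))
print(pool)
```

Because the same individual can win several tournaments, strong individuals may appear in the pool more than once, while weaker ones still have a chance whenever they avoid stronger contestants.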

Wrapper Methods for Cost-Sensitive Tree Induction

Thus, AdaBoost consists of three key steps: initialization, the weight update equations, and the final weighted combination of hypotheses. (The presentation here assumes that the weights are normalized by a factor Z_t at the end of each trial, thus simplifying the equations.) However, they note that this advantage diminishes for multi-class problems and suggest that this is due to the mapping of different misclassification costs into a single misclassification cost by Eq. More specifically, given the costs of misclassification and the numbers of examples N1 and N2 of class 1 and class 2, respectively, the data distribution is changed so that the resulting numbers of examples of the two classes reflect those costs.
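The three AdaBoost steps named above can be illustrated with a small sketch of the standard, cost-insensitive algorithm. It is not one of the cost-sensitive variants discussed in this section, and it substitutes one-dimensional threshold stumps for a decision tree learner so that the example stays self-contained. All function names and the toy data are hypothetical.

```python
import math


def stump_predict(x, threshold, polarity):
    """Predict +1/-1 from a single threshold on a 1-D input."""
    return polarity if x >= threshold else -polarity


def train_stump(xs, ys, weights):
    """Return (weighted_error, threshold, polarity) of the best stump."""
    best = None
    for threshold in sorted(set(xs)):
        for polarity in (1, -1):
            err = sum(w for w, x, y in zip(weights, xs, ys)
                      if stump_predict(x, threshold, polarity) != y)
            if best is None or err < best[0]:
                best = (err, threshold, polarity)
    return best


def adaboost(xs, ys, n_rounds):
    n = len(xs)
    weights = [1.0 / n] * n                      # Step 1: uniform initialization
    ensemble = []
    for _ in range(n_rounds):
        err, threshold, polarity = train_stump(xs, ys, weights)
        err = min(max(err, 1e-10), 1 - 1e-10)    # guard against log(0)
        alpha = 0.5 * math.log((1 - err) / err)
        # Step 2: raise the weights of misclassified examples, lower the rest,
        # then normalize so the weights sum to one.
        weights = [w * math.exp(-alpha * y * stump_predict(x, threshold, polarity))
                   for w, x, y in zip(weights, xs, ys)]
        z = sum(weights)
        weights = [w / z for w in weights]
        ensemble.append((alpha, threshold, polarity))
    return ensemble


def predict(ensemble, x):
    # Step 3: weighted vote of all hypotheses.
    score = sum(alpha * stump_predict(x, t, p) for alpha, t, p in ensemble)
    return 1 if score >= 0 else -1


xs = [1, 2, 3, 4, 5, 6]
ys = [1, 1, -1, -1, 1, 1]        # not separable by any single stump
ensemble = adaboost(xs, ys, n_rounds=3)
predictions = [predict(ensemble, x) for x in xs]
print(predictions)
```

No single stump fits this toy data, but the weighted vote of three stumps does, which is the effect the weight updates are designed to produce.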

This makes it possible to define the expected cost of misclassification compared to cases and using gradient descent, Abe et al. This principle is used in MetaCost [Domingos 1999], which is one of the first systems to use cost-sensitive bagging. 2003a, 2003b] describe a method called Costing that, like MetaCost, applies a base learner to samples of the data to generate alternative classifiers.

The goal of each resampling is to change the distribution of the data so that the error reduction in the modified distribution is equivalent to the cost reduction in the original distribution (i.e., as described for JOUS-Boost). Note that unlike MetaCost, there is no relabeling of the data to create a single decision tree. The results of empirical evaluations show that there is not much to be gained by embedding AdaBoost or CSB0 in MetaCost.
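One way to change the data distribution in proportion to cost is rejection sampling, in which each example is kept with probability equal to its cost divided by the largest cost. The sketch below assumes that scheme and a per-example cost list; the function name and data are hypothetical, and the step of training a base learner on each sample and aggregating the resulting classifiers is omitted.

```python
import random


def cost_proportionate_sample(examples, costs, rng):
    """Keep each example with probability cost / max(costs)."""
    z = max(costs)
    return [ex for ex, c in zip(examples, costs) if rng.random() < c / z]


rng = random.Random(42)
examples = ['a', 'b', 'c', 'd']
costs = [10.0, 0.0, 10.0, 2.5]       # hypothetical per-example misclassification costs
# Draw several resamples; each would be passed to the base learner in turn.
samples = [cost_proportionate_sample(examples, costs, rng) for _ in range(5)]
print(samples)
```

Examples carrying the maximum cost are always kept and zero-cost examples never are, so an error-minimizing learner trained on a sample concentrates on the costly cases.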

2002] experiment with different ways of producing this weighted classification, taking advantage of the fact that different decision trees can share the same parts of a tree. In contrast to MetaCost, where a single tree is obtained by applying a base learner to the relabeled examples, a single tree is inferred by traversing the multitree bottom-up and selecting those suspended nodes that agree most with the results of the multitree. -tree using a randomly generated dataset.

Stochastic Approach

They develop a framework for algorithms for such situations, called TATA (Tree classification AT Anycost), which is able to reduce misclassification costs as the budget for using tests increases. They develop this framework by first noting that existing top-down tree induction algorithms can be modified so that the total testing cost for any example will not exceed a predetermined cost. This can be achieved during the tree induction process by considering only those attributes whose cost is below the current available budget, where the current budget is the initial budget minus the cost of the attributes used from the root to the current node.
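The budget bookkeeping described above can be sketched as a small helper. The attribute names and costs are hypothetical, and two details are assumptions: an attribute is treated as affordable when its cost does not exceed the remaining budget, and attributes already tested on the path are not offered again.

```python
def affordable_attributes(attribute_costs, initial_budget, path):
    """Return the remaining budget at a node and the attributes still affordable there.

    path lists the attributes tested from the root down to the current node.
    """
    used = set(path)
    remaining = initial_budget - sum(attribute_costs[a] for a in used)
    candidates = [a for a, cost in attribute_costs.items()
                  if a not in used and cost <= remaining]
    return remaining, candidates


# Hypothetical test costs.
attribute_costs = {'temperature': 1.0, 'blood_test': 5.0, 'xray': 20.0, 'mri': 50.0}
remaining, candidates = affordable_attributes(attribute_costs, initial_budget=30.0,
                                              path=['blood_test'])
print(remaining, candidates)
```

Applying this filter at every node before attribute selection guarantees that no root-to-leaf path costs more than the initial budget.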

Next, they adopt an approach similar to ACT, except that the r samples are taken using an adapted version of C4.5 in which attributes costing more than the available budget are excluded and attributes are chosen stochastically with a probability proportional to their information gain. The samples for each available attribute are used to estimate the cost of misclassification, and the attribute with the minimum estimated cost is chosen. For a contract algorithm, the budget for testing costs is not available until the classification phase.
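The stochastic selection step can be illustrated as follows. This is a sketch under stated assumptions rather than the adapted C4.5 itself: information gains are taken as precomputed inputs, the function name and the example values are hypothetical, and the fallback to a uniform choice when all gains are zero is an assumption.

```python
import random


def choose_attribute(gains, costs, budget, rng):
    """Pick an affordable attribute with probability proportional to its gain.

    Returns None when no attribute fits within the budget.
    """
    affordable = [a for a in gains if costs[a] <= budget]
    if not affordable:
        return None
    weights = [gains[a] for a in affordable]
    if sum(weights) == 0:
        # No attribute is informative: fall back to a uniform choice.
        return rng.choice(affordable)
    return rng.choices(affordable, weights=weights, k=1)[0]


gains = {'a': 0.5, 'b': 0.3, 'c': 0.2}      # hypothetical information gains
costs = {'a': 10.0, 'b': 1.0, 'c': 1.0}     # hypothetical test costs
rng = random.Random(7)
picks = [choose_attribute(gains, costs, budget=5.0, rng=rng) for _ in range(20)]
print(picks)
```

Repeating the draw yields different trees on different runs, which is what allows several samples to be compared for each candidate attribute.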

The number of trees used depends on the amount of time and memory available, but it also affects the number of stochastic samples, r, that are possible in the time available. To achieve the goals of an interruptible algorithm, where neither learning budgets nor total test costs are available, they propose developing a repertoire of trees and then starting classification using the tree with the minimum possible budget, and then iteratively moving to the tree with the next highest budget until it is interrupted or reaches the final tree. The cost of misclassification also decreases as the number of stochastic samples, r, increases, with the most significant improvement occurring when one, two, or three samples are used, but minimal improvement after three samples.
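The repertoire strategy can be sketched as a loop over trees ordered by budget. Here each tree is represented by a plain callable and the stopping signal by a hypothetical `should_stop` callback; both are stand-ins chosen for illustration, and a real system would also track the test costs actually spent.

```python
def classify_with_repertoire(repertoire, example, should_stop):
    """Classify with increasingly expensive trees until stopped.

    repertoire: list of (budget, tree) pairs, where tree(example) returns a label.
    should_stop: zero-argument callable checked after each tree is consulted.
    """
    prediction = None
    # Start with the cheapest tree and move to the next highest budget each time.
    for budget, tree in sorted(repertoire, key=lambda pair: pair[0]):
        prediction = tree(example)
        if should_stop():
            break
    return prediction


# Hypothetical repertoire: three "trees" built for budgets 1, 10 and 100.
repertoire = [
    (10, lambda ex: 'B'),
    (1, lambda ex: 'A'),
    (100, lambda ex: 'C'),
]

calls = {'n': 0}

def stop_after_two():
    calls['n'] += 1
    return calls['n'] >= 2

full = classify_with_repertoire(repertoire, example={}, should_stop=lambda: False)
early = classify_with_repertoire(repertoire, example={}, should_stop=stop_after_two)
print(full, early)
```

The most recent prediction is always available, so stopping at any point still returns an answer from the best tree consulted so far.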

CONCLUSIONS

The misclassification cost is also reduced as the number of stochastic samples, r, increases, with the most significant improvement occurring when one, two, or three samples are used, but minimal improvement after three samples.

non-backtracking and non-greedy decisions that use multiple trees and multiple choices to induce trees.

a) Application of costs during construction, whereby attribute selection measures are adapted to include costs.

The algorithms differ in the way trees are generated or represented and in how fitness is measured. Other differences between algorithms include how the sampling is performed and how error rates or confidence rates are calculated to give more importance to the trees with the fewest errors in composite voting methods.

e) Bagging, which generates a number of independent decision trees using resamples of the training set; these trees thus differ from those generated by boosting in being independent of each other, in the same way as those in GA methods.

In general, these algorithms are wrapper methods, which use the decision tree learner as a subroutine and build the incorporation of costs around it. The algorithms differ in how sampling takes place and in the composite voting method used.

f) Multiple structures, which extend the idea of generating alternative trees and combining their outcomes by holding the alternative trees in one structure. Such a structure represents all possible alternative choices of feature selection in one decision tree, so that alternative choices are not discarded, as in the usual decision tree process, but are stored and can be expanded later.

g) Stochastic approach, which induces decision trees by generating stochastic samples of trees rooted at each potential attribute and selecting the attribute that results in the best tree.

Changing the number of samples results in anytime behavior, where quality can be improved with more time. Although the particular experimental methods, the datasets used (see Table A.1 in the Appendix), and the respective systems compared differ, it is possible to form a general picture from the empirical evaluations presented in the studies. Producing results similar to those of multiple-tree methods using single-tree methods represents a major research challenge, but as work on nonlinear decision trees [Vadera 2010] shows, it is possible to produce results comparable to MetaCost and ICET for minimizing the cost of misclassification in a fraction of the computing time.

APPENDIX

Given the relative success of non-greedy cost-sensitive tree induction algorithms, a fair question is, "Is it worth using or even continuing future research on greedy cost-sensitive decision tree induction algorithms?"