GP and Logical Trees - Data classification using genetic programming.

This section reviews studies which have used GP to evolve logical trees. Logical trees are composed of boolean operators, and they output a value of true or false.

Kuoet al.[122] evolved logical trees using GP. The function set that was used was {AN D, OR, N OT, >,≥, <,≤, If−T hen, If−T hen−T hen−Else}. The terminal set was made up of the values that each attribute could take, for example, an attribute having values {A, B, C} would result in those three values being added to the terminal set. Furthermore, the classes were added to the terminal set, and could only appear in the second or third branch of the if statements. Two genetic operators were developed to resolve the issues of redundancy and subsumption. The eliminator operator was proposed in order to identify duplicate rules within any given tree (redundancy), if duplicates were found, then the deepest duplicated rule within the tree was removed. The merge operator attempted to remove rules that subsume another by examining if two rules had the same classes, and if the attributes of one was a subset of the other (subsumption). The fitness function computed the

accuracy and complexity of each tree. For a particular treet, the complexity oft was defined as the ratio between the number of nodes withint, and the average number of nodes within all the trees in the the initial population. The proposed GP algorithm was evolved over 500 generations and applied to a binary classification problem. The data set was split with 70% of the data being allocated to the training set, and 30%

to the test set. Crossover, mutation, elimination, and merge had probabilities of 0.9, 0.02, 0.8, and 0.8 respectively. The proposed GP approach outperformed C5.0 and a standard GP algorithm. Standard GP did not make use of the eliminator and merge operators, and thus indicating the usefulness of those two operators.

De Falcoet al.[9] evolved 2000 trees over 30 generations. A binary decomposition approach was used. The function set utilised was{AN D, OR, <,≤, >,≥, IN, OU T}.

TheIN andOUT functions take 3 arguments each and are used to represent internal and external intervals respectively; figure 4.9 illustrates the IN function. The IN function outputs true if the value of the attribute is within the range of the lower and upper bound, and outputs false otherwise. The attributes formed the terminal set.

Since a binary decomposition approach was used, a method was proposed in order to resolve conflicts between the classifiers. Each tree corresponds to a logical rule which is represented by a boundary in Euclidean space. To illustrate this, consider a two dimensional space, thus two trees represent two rectangular boundaries within the space. The surroundings of each boundary are referred to as a frontier. Thus in order to resolve conflicts, the distance between the points in Euclidean space is computed, however there are no further details regarding this proposed method. The fitness function took the accuracy, tree depth, and number of nodes into consideration.

The initial population was created using the ramped half and half method and a maximum initial depth of 5. The maximum tree depth allowed during evolution was set to 7. Crossover, mutation, and reproduction had application rates of 80%, 10% and 10% respectively. Tournament selection with a size of 50 was used. The proposed GP algorithm was evaluated on six data sets from the UCI repository, namely, WBC, Diabetes, Heart Disease,Horse Colic, Glass, and Australian Credit Approval. For each data set, 75% of the data was allocated to the training set, and the remaining 25% allocated to the test set. Thirty GP runs were performed for each data set using different seeds.

CHAPTER 4. GP AND DATA CLASSIFICATION 79

Figure 4.9: TheIN function proposed by De Falco et al.[9].

In the two previous studies, the terminal set was made up of attributes. In the study of Eggermontet al.[123], the terminal set consisted of atoms. Each atom had an attribute, one inequality{>, <}, and a constant between 0 and 1. For example, an atom cam be represented as X >0.4, where X is an attribute in some data set.

The authors proposed subatomic mutation, this operator selects a node within a tree, and if that node is an atom, then either the variable or the constant is modified.

For this study the function set was{AN D, OR, N AN D, N OR}. A population size of 2000 was initialised using ramped half and half with a maximum tree depth of 5.

Crossover and subatomic mutation had probabilities of 0.9 and 0.1 respectively. The proposed GP approach was tested on 4 binary data sets, namely,Australian Credit, German Credit, Heart Disease, and Pima. Thek-fold cross-validation method was used with values ofk varying from 9 to 12. No rationale behind the variation in the values ofk was provided.

A binary decomposition approach is also investigated by Tanet al.[124] whereby n GP runs are performed for an n class problem. The logical trees form rules in the following format: if antecedents then class. In this study, the antecedents are made up of functions and terminals. The function set was{AN D, N OT}, and the terminal set represented all the values for each attribute. Unlike the other studies, Tan et al. enforced a constraint which ensured that for a given tree, once a value has been selected for a particular attribute, another value for that same attribute cannot be selected. For example, assume some attribute i has values {1,2,3}, and the value of“1” is used within a particular tree, the constraint ensures that another value for attribute i cannot be selected. This was enforced to reduce redundancy.

The population size varied from 10 to 100, and the number of generations from 10 to 50. The initial population was generated using the ramped half and half method.

Crossover, mutation, reproduction, and elitism were applied with probabilities 0.9, 0.1, 0.1, and 0.05 respectively. The GP approach was tested on 9 binary and mul-

ticlass data sets from the UCI repository, namely, Weather, Contact-Lenses, Zoo, Breast Cancer,Monk1,2,3,Mushroom, and Nursery. 66% of the instances was allocated to training, and 33% to testing. Thirty GP runs were performed on each data set.

GP was applied to a medical data set consisting of 165 attributes and 12 classes for chest pain diagnosis in the study of Bojarczuket al.[125]. Binary decomposition was also used, thus GP was run 12 times in order to create a classifier for each class.

The rules were encoded in the form, if antecedent then class, and the antecedent is represented by a GP tree. The function set was composed of {AN D, OR, N OT}, and the attributes represent the terminal set. Five hundred GP individuals were evolved over 50 generations. The initial population was generated using the ramped half and half method with an initial maximum depth of 10. The maximum tree depth during evolution was set to 17. The fitness function made use of the sen- sitivity and specificity (refer to chapter 3, section 3.4.2), and also the number of nodes within a tree. Parents were selected using the fitness proportionate selection method. Crossover and reproduction were applied with probabilities of 0.9 and 0.1 respectively. The data set was split into a training set containing 65% of the data, and 35% was allocated to the test set. Ten GP runs were performed for each class;

the proposed GP algorithm outperformed the C5.0 algorithm.

4.4.1 Advantages and disadvantages of GP logical trees

Evolving logical trees using GP has the benefit of being able to handle both numerical and categorical data fairly simply in comparison to decision trees or arithmetic trees.

With decision trees, discretisation is required in order to use numerical attributes, and in order to use categorical attributes when evolving arithmetic trees, one has to convert those categorical attributes into numerical ones. When using the right function set it is possible to cater for either categorical or numerical attributes. A typical function set to handle numerical attributes is{AN D, OR, N OT, <, >}, whereas for categorical data a suitable function set is {AN D, OR, N OT,=}. Through the use of the equal operator, it is possible to make use of categorical attributes.

An additional significant advantage of evolving logical trees is the fact that the trees are highly interpretable to a researcher. This is further accentuated when sim- ple logical functions are used, such as{AN D, OR, N OT}. Including functions such as {N AN D, N OR, XOR} will increase the complexity in terms of interpretability since those functions are not as trivially understood compared to the simpler ones.

Interpreting logical trees is easier than interpreting arithmetic trees.

Logical trees and arithmetic trees share a common disadvantage. Every single node within the tree for both representations has to be traversed in order to output the classification result. This can affect the GP run time especially if large trees

CHAPTER 4. GP AND DATA CLASSIFICATION 81 are used, and additionally, if a large data set is used. Similarly to arithmetic trees, additional complexity is encountered when evolving logical trees for multiclass classification problems. Typically, binary decomposition can be used to solve this issue, however this means running the GP algorithm numerous times depending on the number of classes.

4.4.2 Summary of the findings

GP has been used to evolve logical trees in several studies. The logical functions {AN D, OR, N OT} have been used as the function set, and the terminal set was composed of the attributes. Similarly to evolving arithmetic trees, it is simpler to evolve logical trees for binary classification problems due to the fact that no conflict resolution method has to be put in place. When dealing with multiclass classification problems, the binary decomposition approach has been used [124, 125] successfully.

It is however possible to make use of the conflict resolution approaches which were proposed for evolving arithmetic trees as discussed in the previous section. The GP parameters varied from one study to another. The population size varied from 500 to 2000, and the number of generations from 30 to 500. In four studies, the ramped half and half method was used to create the initial population with a maximum depth varying from 5 to 10; this method has been used in most of the reviewed studies.

This study shall thus make use of it to generate the initial population. The maximum allowed depth during the evolutionary process varied from 7 to 17. Since the studies reviewed in this section made use of generational GP, this study will also make use of it. Similar to studies using GP and the two previously discussed representations, the accuracy and complexity of the trees was used as the fitness function. This study will take into account the accuracy and complexity of the trees when evolving logical trees. Both the tournament and fitness proportionate selection methods have been used for evolving logical trees. However, the tournament selection method has generally been used more often when solving data classification problems and thus will be used in this study. Crossover and mutation were the most commonly used GOs with crossover having the highest application rate; this study will also make use of a similar setting.

Dalam dokumen Data classification using genetic programming. (Halaman 98-102)