8.2.1 Decision trees and encapsulation
The encapsulation operator typically selects any element of the function set within a parse tree and encapsulates the subtree located at that position. In the context of decision trees, this is achieved by randomly selecting an attribute node. Encapsulating a node representing a class would be of no particular benefit, since a class is already a terminal. The pseudocode for the encapsulation operator in the context of data classification is presented in algorithm 8.1.
Algorithm 8.1: Pseudocode for encapsulation in the context of data classification.
1 begin
2 Select a random tree T within the current GP population.
3 Select a random attribute node within the tree.
4 Remove the subtree located at the random attribute node and store it in memory.
5 Allocate an encapsulated terminal name to the removed subtree. Initially named E0, then E1, and so on.
6 Add the encapsulated terminal to the terminal set.
7 Within T, replace the subtree with the encapsulated terminal.
8 end
Incorporating the encapsulation operator for classification results in changes to steps 4 and 6 within the standard GP algorithm (refer to algorithm 2.1 in chapter 2). In this study the encapsulation operator is applied after every second generation.
Applying the operator at each generation would result in a large number of encapsulated nodes, which would in turn degrade the performance of the algorithm.
Additional trial runs were performed and revealed that applying the operator at every second generation led to improved results. The following changes are made to step 6 in the standard GP algorithm. For odd generations, only the crossover and mutation genetic operators are applied. For even generations, the encapsulation operator is applied in addition to crossover and mutation. The encapsulation operator is performed before crossover and mutation so that the latter two genetic operators can make use of the newly created encapsulated terminals.
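The steps of algorithm 8.1 can be sketched in Python as follows. This is only an illustrative sketch under stated assumptions, not the implementation used in this study: a decision tree is assumed to be a nested dict of the form {"attr": name, "children": {value: subtree}}, leaves (classes and encapsulated terminals) are plain strings, and the names attribute_paths, encapsulate, memory and terminal_set are hypothetical.

```python
import random

def attribute_paths(tree, path=()):
    """Yield the path (tuple of branch values) to every attribute node."""
    if isinstance(tree, dict):              # dict = attribute node, str = leaf
        yield path
        for value, child in tree["children"].items():
            yield from attribute_paths(child, path + (value,))

def encapsulate(tree, memory, terminal_set, rng=random):
    """Replace one random attribute subtree of `tree` by a new terminal.

    Returns the modified tree. `memory` maps terminal names to subtrees and
    `terminal_set` is the list of terminals available to the other operators.
    """
    paths = list(attribute_paths(tree))
    if not paths:                           # tree is a single leaf
        return tree
    path = rng.choice(paths)                # random attribute node
    name = f"E{len(memory)}"                # E0, E1, ... in creation order
    terminal_set.append(name)
    if not path:                            # the root itself was selected
        memory[name] = tree
        return name
    parent = tree
    for value in path[:-1]:                 # walk to the parent of the node
        parent = parent["children"][value]
    memory[name] = parent["children"][path[-1]]   # store the subtree
    parent["children"][path[-1]] = name           # replace it by the terminal
    return tree

# Usage with a hypothetical two-attribute tree.
tree = {"attr": "outlook",
        "children": {"sunny": {"attr": "humidity",
                               "children": {"high": "no", "normal": "yes"}},
                     "rain": "yes"}}
memory, terminals = {}, ["yes", "no"]
result = encapsulate(tree, memory, terminals, random.Random(0))
```

Naming the terminal after the current size of memory relies on the assumption that entries in memory are never deleted, so the names stay unique.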
The mutation operator selects a random tree and a random mutation point within that tree, and then creates an entirely new subtree at that point. Because the encapsulation operator is performed prior to mutation, the mutation operator can include the new encapsulated terminals in the subtrees it creates. Similarly, the crossover operator exchanges subtrees between two randomly selected parents, and if the resulting offspring exceed a specified maximum depth the trees are pruned. Pruning replaces any function node at the maximum specified depth with a terminal node. Since the encapsulated terminals are added to the terminal set, a terminal inserted during pruning may be one of the new terminals created by the encapsulation operator.
Figure 8.1 illustrates a tree which has been pruned. The tree on the left, which was produced by crossover or mutation, has a depth of 4. If the maximum depth permitted is 3, all the leaves below the horizontal line have to be pruned. Simply removing all the nodes below the horizontal line results in an invalid tree because attributes cannot be leaves. Thus, the two attributes at depth 3 have to be replaced with terminals. The tree on the right is the result of the pruning process. From the original tree, the left attribute at depth 3 was replaced with an encapsulated terminal, and the right attribute at depth 3 was replaced with a class.
Figure 8.1: Pruning trees and adding encapsulated terminals at the leaves.
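The pruning step can be illustrated with a short Python sketch. It assumes the same hypothetical representation as elsewhere in these sketches (attribute nodes are dicts of the form {"attr": name, "children": {value: subtree}}, leaves are strings), assumes that the root is at depth 0, and draws the replacement uniformly from the terminal set; none of these choices is stated in the text, and the names prune and depth_of are illustrative only.

```python
import random

def depth_of(tree):
    """Depth of a tree, counting a single leaf as depth 0 (assumption)."""
    if not isinstance(tree, dict):
        return 0
    return 1 + max(depth_of(child) for child in tree["children"].values())

def prune(tree, max_depth, terminal_set, rng=random, depth=0):
    """Return a copy of `tree` in which every attribute node found at
    `max_depth` is replaced by a terminal drawn from `terminal_set`."""
    if not isinstance(tree, dict):          # leaves are kept unchanged
        return tree
    if depth >= max_depth:                  # attribute node at the depth limit
        return rng.choice(terminal_set)     # may be a class or an E-terminal
    return {"attr": tree["attr"],
            "children": {value: prune(child, max_depth, terminal_set,
                                      rng, depth + 1)
                         for value, child in tree["children"].items()}}

# Usage: a hypothetical tree of depth 3 pruned to a maximum depth of 2.
deep = {"attr": "a",
        "children": {"x": {"attr": "b",
                           "children": {"p": {"attr": "c",
                                              "children": {"m": "class1",
                                                           "n": "class2"}},
                                        "q": "class1"}},
                     "y": "class2"}}
terminal_set = ["class1", "class2", "E0"]
pruned = prune(deep, 2, terminal_set, random.Random(3))
```

Because the encapsulated terminal E0 is part of terminal_set, the leaf that replaces attribute c can be either a class or E0, which is the situation shown on the right of figure 8.1.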
Step 4 (the evaluation phase in algorithm 2.1) in the standard GP algorithm is modified to cater for the encapsulated terminals. When a tree is evaluated and an encapsulated terminal is reached, the subtree corresponding to that terminal is evaluated. For example, if the tree temperature E0 class2 is evaluated and the left branch is visited, the algorithm proceeds by evaluating the subtree corresponding to E0.
It is possible that the algorithm encapsulates a subtree which already contains an encapsulated terminal. The evaluation process will evaluate all the encapsulated terminals recursively until a class is reached. For instance, some tree may have a reference to subtree E5, and E5 may in turn have a reference to subtree E6.
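The recursive evaluation described above can be sketched as follows. The sketch assumes a hypothetical representation in which attribute nodes are dicts of the form {"attr": name, "children": {value: subtree}}, leaves are strings, and memory maps each encapsulated terminal name to its stored subtree; the attribute names and values in the example are invented for illustration.

```python
def classify(node, instance, memory):
    """Return the class predicted for `instance`, expanding encapsulated
    terminals recursively until a class label is reached."""
    if isinstance(node, dict):                       # attribute node
        branch = node["children"][instance[node["attr"]]]
        return classify(branch, instance, memory)
    if node in memory:                               # encapsulated terminal
        return classify(memory[node], instance, memory)
    return node                                      # class label

# E5 refers to E6, mirroring the nested case described in the text.
memory = {
    "E6": {"attr": "windy", "children": {"yes": "class1", "no": "class2"}},
    "E5": {"attr": "humidity", "children": {"high": "E6", "normal": "class1"}},
}
tree = {"attr": "temperature", "children": {"hot": "E5", "cold": "class2"}}
instance = {"temperature": "hot", "humidity": "high", "windy": "no"}
prediction = classify(tree, instance, memory)        # "class2"
```

Here the left branch leads to E5, whose subtree leads to E6, whose subtree finally yields a class.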
Figure 8.2: Evaluating a tree with an encapsulated terminal.
Assume the left tree in figure 8.2 is evaluated. If the left branch is visited, the algorithm proceeds by evaluating the corresponding subtree represented by the terminal E0. If the subtree E0 evaluates to “class 2”, then the classification output for the left tree is “class 2”. An outline of the modified GP algorithm with encapsulation is presented in algorithm 8.2.
Algorithm 8.2: Pseudocode of proposed GP algorithm with encapsulation.
1 begin
2 Create initial population.
3 Evaluate the initial population.
4 generation ← 0.
5 while generation ≤ generation_max do
6 generation ← generation + 1.
7 Apply selection methods.
8 If odd generation, apply crossover and mutation using current terminal set.
9 If even generation, apply encapsulation and add newly created terminals to the terminal set, then apply crossover and mutation using the updated terminal set.
10 Evaluate the current population. If an encapsulated terminal is encountered, then call the corresponding subtree.
11 end
12 return Best solution found.
13 end
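The control flow of algorithm 8.2 can be sketched as a small Python skeleton. The operators are passed in as callables, and all of the names (run_gp, select, encapsulate, vary, fitness) are hypothetical placeholders rather than the routines used in this study; the sketch also assumes that a higher fitness is better and that the population is a list.

```python
def run_gp(population, generation_max, select, encapsulate, vary, fitness):
    """Skeleton of the generational loop with encapsulation on even
    generations. Returns the best individual seen during the run."""
    best = max(population, key=fitness)          # evaluate initial population
    generation = 0
    while generation <= generation_max:          # same loop test as step 5
        generation += 1
        parents = select(population, fitness)
        if generation % 2 == 0:
            encapsulate(parents)                 # runs before crossover/mutation
        population = vary(parents)               # crossover and mutation
        best = max(population + [best], key=fitness)
    return best

# Usage with trivial stand-ins: individuals are integers, fitness is identity.
calls = []
best = run_gp(
    population=[1, 2, 3],
    generation_max=3,
    select=lambda pop, fit: pop,
    encapsulate=lambda parents: calls.append(len(parents)),
    vary=lambda parents: [x + 1 for x in parents],
    fitness=lambda x: x,
)
```

With generation_max set to 3 the loop body runs for generations 1 to 4, so the encapsulation stand-in is invoked twice, at generations 2 and 4.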
8.2.2 Maintaining the most called subtrees
The algorithm presented in section 8.2.1, which incorporates encapsulation into the GP algorithm, allows GP to select from a large list of encapsulated terminals. When the number of encapsulated terminals is very large, this can hinder the performance of the GP algorithm, as a large terminal set increases the GP program space. In order to overcome this problem, another method is proposed which maintains a list of encapsulated terminals. This method is named selective encapsulation and is discussed below.
The list of encapsulated terminals is initialised the first time the encapsulation genetic operator is called. The maintained list has one user-defined parameter, max, which allows the researcher to control the maximum size of the list. The process of initialising the list is illustrated in algorithm 8.3. In section 8.2.1 all the encapsulated nodes are kept in memory and added to the terminal set; however, in this proposed method only certain elements are kept in the maintained list. In this chapter the term memory represents all the encapsulated terminals created throughout the GP run. Encapsulated terminals in the memory are never deleted. The maintained list contains certain encapsulated terminals from the memory. However, unlike the memory, terminals in the maintained list can be removed. When the encapsulation operator is applied, the newly created encapsulated terminals are automatically added to the memory, but not necessarily to the maintained list. The maintained list is updated by removing encapsulated terminals from the list and adding new ones.
Algorithm 8.3: Pseudocode for initialising the maintained list.
1 begin
2 for each encapsulated terminal E in the current population do
3 if there are fewer than max elements in the maintained list then
4 Insert the encapsulated terminal E into the maintained list.
5 end
6 else
7 Insert the encapsulated terminal E into memory but not into the maintained list.
8 end
9 end
10 end
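A minimal Python sketch of this initialisation is given below. It assumes that the new encapsulated terminals arrive as a dict from terminal name to subtree and, following the description of memory in the text, stores every terminal in memory whether or not it enters the maintained list. The names initialise_maintained_list, new_terminals and max_size are hypothetical.

```python
def initialise_maintained_list(new_terminals, memory, max_size):
    """Fill the maintained list with at most `max_size` terminal names.

    Every terminal is recorded in `memory`; only the first `max_size`
    terminals encountered also enter the maintained list.
    """
    maintained = []
    for name, subtree in new_terminals.items():
        memory[name] = subtree               # memory keeps every terminal
        if len(maintained) < max_size:
            maintained.append(name)
    return maintained

# Usage: three hypothetical terminals and a list limited to two entries.
memory = {}
new_terminals = {"E0": "class1", "E1": "class2", "E2": "class1"}
maintained = initialise_maintained_list(new_terminals, memory, max_size=2)
```

In this example E2 is kept in memory but is left out of the maintained list because the list is already full.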
Algorithm 8.4 presents the pseudocode for updating the maintained list. The list is updated after the encapsulation genetic operator is executed; in this chapter this is performed after every second GP generation.
For a given encapsulated terminal, the number of calls is determined by computing the number of times that encapsulated terminal is called within the entire current population. For example, if E0 is a terminal which is found in five trees in the current population, then E0 has a total of five calls. When the list is updated, an encapsulated terminal which is not in the list and has the highest number of calls is selected. This process only deals with encapsulated terminals which are currently present in the population. For instance, if the terminal E44 had a large number of calls in the previous population but is not found in the current population, then E44 is not considered as a potential terminal to be selected. Thus, the process finds all the encapsulated terminals within the current population which are not in the maintained list, and determines which one has the highest number of calls. The next step (step 4 in algorithm 8.4) is to determine which encapsulated terminal within the list is called the least within the current population. Finally, in step 5, the terminal which is called the least within the list is replaced with the terminal outside the list which is called the most. Both of the encapsulated terminals remain within memory; however, the terminal which was previously in the maintained list is no longer in the list. Thus, after every second GP generation, one terminal leaves the list and another one enters the list.
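The call counting and the swap described above (and listed in algorithm 8.4) can be sketched in Python as follows. Several points are assumptions of the sketch rather than statements from the text: trees are nested dicts of the form {"attr": name, "children": {value: subtree}} with string leaves, a call is counted once per occurrence of a terminal as a leaf in the population, ties are broken by order of first appearance, and the list is left unchanged when the fallback candidate is already in the list. All function names are hypothetical.

```python
from collections import Counter

def leaves(tree):
    """Yield every leaf (class or encapsulated terminal) of a tree."""
    if isinstance(tree, dict):
        for child in tree["children"].values():
            yield from leaves(child)
    else:
        yield tree

def count_calls(population, memory):
    """Count occurrences of each encapsulated terminal in the population."""
    calls = Counter()
    for tree in population:
        calls.update(leaf for leaf in leaves(tree) if leaf in memory)
    return calls

def update_maintained_list(maintained, population, memory):
    """Swap the least-called list entry for the most-called outside terminal."""
    calls = count_calls(population, memory)
    outside = [t for t in calls if t not in maintained]
    candidates = outside or list(calls)      # fall back to any called terminal
    if not candidates or not maintained:
        return list(maintained)
    mem_term = max(candidates, key=lambda t: calls[t])
    if mem_term in maintained:               # assumption: nothing to swap
        return list(maintained)
    list_term = min(maintained, key=lambda t: calls[t])
    return [mem_term if t == list_term else t for t in maintained]

# Usage: E2 is called three times, E1 twice and E0 once.
memory = {"E0": "class1", "E1": "class2", "E2": "class1"}
population = [
    {"attr": "a", "children": {"x": "E0", "y": "E2"}},
    {"attr": "a", "children": {"x": "E2", "y": "E1"}},
    {"attr": "a", "children": {"x": "E1", "y": "E2"}},
]
updated = update_maintained_list(["E0", "E1"], population, memory)
```

The removed terminal E0 is untouched in memory; only its membership of the maintained list changes.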
Algorithm 8.4: Pseudocode for updating the maintained list.
1 begin
2 Determine the encapsulated terminal in memory which is not in the maintained list and that has the highest number of calls in the current population. Call this memTerm.
3 If no element in step 2 is found, then determine the encapsulated terminal in memory that has the highest number of calls within the current population. Call this memTerm.
4 Determine the encapsulated terminal within the maintained list which has the least number of calls. Call this listTerm.
5 Swap memTerm with listTerm, i.e. memTerm is now part of the maintained list, and listTerm is no longer part of the maintained list; however, listTerm is still in memory.
6 end
After the initial population generation, terminals are only added to trees when the mutation operator is executed, or when trees are pruned. Algorithm 8.5 illustrates how terminals are selected and added to trees when the maintained list approach is incorporated into the GP algorithm.
If there are no encapsulated terminals in memory, then classes are used as terminals. When the selective encapsulation approach is used, there is a 60% probability that one of the encapsulated terminals from the list is selected. The value of 60% was chosen to bias the algorithm slightly towards selecting terminals from the list without restricting the choice to the list alone. Thus, there is a higher probability that GP will select from a smaller range of encapsulated terminals which have been called frequently in the recent population. This contrasts with the initial approach described in section 8.2.1, whereby the GP algorithm can select any encapsulated terminal in memory. This proposed method introduces two additional parameters; however, these parameters help further control the method by enabling the GP algorithm to select certain useful encapsulated nodes.
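The terminal selection described above can be sketched in Python as follows. The text does not state what is selected in the remaining 40% of cases, so the sketch assumes that a class is drawn uniformly at random; this choice, together with the names choose_terminal and p_list, is an assumption made purely for illustration.

```python
import random

def choose_terminal(classes, maintained, memory, p_list=0.6, rng=random):
    """Pick a terminal for mutation or pruning.

    With probability `p_list` a terminal from the maintained list is
    returned; otherwise a class is returned (assumption of this sketch).
    """
    if not memory or not maintained:         # no encapsulated terminals yet
        return rng.choice(classes)
    if rng.random() < p_list:
        return rng.choice(maintained)
    return rng.choice(classes)

# Usage: estimate how often the maintained list is used over many draws.
rng = random.Random(1)
classes, maintained = ["class1", "class2"], ["E0", "E1"]
memory = {"E0": "class1", "E1": "class2", "E2": "class1"}
draws = [choose_terminal(classes, maintained, memory, rng=rng)
         for _ in range(10000)]
share_from_list = sum(d in maintained for d in draws) / len(draws)
```

Over many draws the share of terminals taken from the maintained list should settle close to 0.6, which reflects the slight bias towards the list described in the text.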