10.2 EP-Based Knowledge Ensemble Methodology

Min φ(w, ξ) = (1/2)(w · w) + C Σ_{i=1}^{m} ξ_i

s.t.  y_i[(w · ϕ(x_i)) + b] ≥ 1 − ξ_i,   ξ_i ≥ 0,   i = 1, 2, ..., m        (10.6)

where w is the weight (normal) vector of the separating hyperplane, C is a margin parameter and ξ_i is a positive slack variable, which is necessary to allow misclassification. By solving Equation (10.6), the optimal separating hyperplane is obtained in the following form:

y = sgn( Σ_{i∈SV} α_i y_i (ϕ(x_i) · ϕ(x_j)) + b )        (10.7)

where SV represents the set of support vectors. If there exists a kernel function such that K(x_i, x_j) = (ϕ(x_i), ϕ(x_j)), it is usually unnecessary to know ϕ(x) explicitly; one only needs to work with the kernel function in the training algorithm, i.e., the optimal classifier can be represented by

y = sgn( Σ_{i∈SV} α_i y_i K(x_i, x_j) + b )        (10.8)

Any function satisfying Mercer's condition (Vapnik, 1995, 1998) can be used as the kernel function. Common examples are the polynomial kernel K(x_i, x_j) = (x_i x_j^T + 1)^d and the Gaussian radial basis function K(x_i, x_j) = exp(−‖x_i − x_j‖² / 2σ²). The construction and selection of the kernel function is important to SVM, but in practice the kernel function is often given directly.
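To make the two kernels and the decision rule in Equation (10.8) concrete, the following minimal sketch (in Python, using NumPy) evaluates them for a new sample. The support vectors, coefficients α_i, labels y_i and bias b are assumed to come from an already trained SVM; all names are illustrative, not part of the original text.

import numpy as np

def polynomial_kernel(xi, xj, d=2):
    # K(xi, xj) = (xi · xj^T + 1)^d
    return (np.dot(xi, xj) + 1.0) ** d

def rbf_kernel(xi, xj, sigma=1.0):
    # K(xi, xj) = exp(-||xi - xj||^2 / (2 * sigma^2))
    return np.exp(-np.sum((xi - xj) ** 2) / (2.0 * sigma ** 2))

def svm_decision(x_new, support_vectors, alphas, labels, b, kernel=rbf_kernel):
    # Equation (10.8): y = sgn( sum_{i in SV} alpha_i * y_i * K(x_i, x_new) + b )
    s = sum(a * y * kernel(sv, x_new)
            for sv, a, y in zip(support_vectors, alphas, labels))
    return np.sign(s + b)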

10.2.2 Knowledge Ensemble based on Individual Mining Results

Majority voting is the most widely used ensemble strategy for classification problems due to its easy implementation. Ensemble members' voting determines the final decision. Usually, it takes more than half of the ensemble to agree on a result for it to be accepted as the final output of the ensemble, regardless of the diversity and accuracy of each model's generalization. However, majority voting has several important shortcomings.

First of all, it ignores the fact that classifiers in the minority sometimes do produce the correct results. Second, if too many inefficient and uncorrelated classifiers are considered, the vote of the majority can lead to worse predictions than those obtained by using a single classifier.

Third, it does not account for the classifiers' different expected performance when they are employed in particular circumstances, such as the plausibility of outliers. At the integration stage, it ignores the existence of diversity, which is the motivation for ensembles. Finally, this method cannot be used when the classes are continuous (Olmeda and Fernandez, 1997; Yang and Browne, 2004). For these reasons, an additive method that permits a continuous aggregation of predictions should be preferred. In this chapter, we propose an evolutionary programming (EP) based approach to maximize the classification/prediction accuracy.

Suppose that we create p classifiers and let c_ij be the classification result that classifier j, j = 1, 2, ..., p, makes for sample i, i = 1, 2, ..., N. Without loss of generality, we assume there are only two classes (failed and non-failed firms) in the data samples, i.e., c_ij ∈ {0, 1} for all i, j. Let

C_i^w = sign(Σ_{j=1}^{p} w_j c_ij − θ) be the ensemble prediction for data sample i, where w_j is the weight assigned to classifier j, θ is a confidence threshold and sign(·) is the sign function. For the corporate failure prediction problem, an analyst can adjust the confidence threshold θ to change the final classification results: only when the ensemble output exceeds the cutoff is the firm classified as a good or healthy firm. Let A_i(w) be the associated accuracy of classification:

A_i(w) = { a_1, if C_i^w = 0 and C_i^s = 0;
           a_2, if C_i^w = 1 and C_i^s = 1;
           0,   otherwise.                                        (10.9)

where C_i^w is the classification result of the ensemble classifier, C_i^s is the actual observed class of the data sample, and a_1 and a_2 are the Type I and Type II accuracy, respectively, whose definitions can be found in Lai et al. (2006a, 2006b, 2006c).
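As a minimal illustration of the ensemble vote C_i^w and the per-sample accuracy A_i(w) defined above, a short Python sketch is given below; the variable names and the mapping of the sign output onto the {0, 1} class labels are assumptions for illustration, not prescribed by the text.

import numpy as np

def ensemble_vote(c_i, w, theta):
    # C_i^w = sign(sum_j w_j * c_ij - theta), mapped onto the class labels {0, 1}
    return 1 if np.dot(w, c_i) - theta > 0 else 0

def sample_accuracy(c_ensemble, c_actual, a1, a2):
    # Equation (10.9): reward a1 for a correctly classified failed firm (class 0),
    # a2 for a correctly classified non-failed firm (class 1), and 0 otherwise
    if c_ensemble == 0 and c_actual == 0:
        return a1
    if c_ensemble == 1 and c_actual == 1:
        return a2
    return 0.0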

The current problem is how to formulate an optimal combination of classifiers for ensemble prediction. A natural idea is to find the optimal

combination of weights w* = (w_1*, w_2*, ..., w_p*) by maximizing the total classification accuracy, including both Type I and Type II accuracy. Usually, the classification accuracy can be estimated through the k-fold cross-validation (CV) technique.

With the principle of total classification accuracy maximization, the above problem can be summarized as an optimization problem:

max_w  A(w) = Σ_{i=1}^{M} A_i(w)

s.t.   C_i^w = sign(Σ_{j=1}^{p} w_j c_ij − θ),   i = 1, 2, ..., M,

       A_i(w) = { a_1, if C_i^w = 0 and C_i^s = 0;
                  a_2, if C_i^w = 1 and C_i^s = 1;
                  0,   otherwise.                                 (10.10)

where M is the size of the cross-validation set and the other symbols follow the notation above.
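To make the objective in (10.10) concrete, a minimal sketch of the fitness A(w) computed over a cross-validation set follows; it also serves as the objective for the EP search described next. The array names and the treatment of a_1 and a_2 as fixed rewards are assumptions for illustration only.

import numpy as np

def total_accuracy(w, C_individual, C_actual, theta, a1=1.0, a2=1.0):
    # C_individual: (M, p) array with c_ij, the output of classifier j on sample i (0 or 1)
    # C_actual:     (M,) array with the observed classes C_i^s (0 or 1)
    # Ensemble vote C_i^w = sign(sum_j w_j * c_ij - theta), mapped onto {0, 1}
    ensemble = (C_individual @ np.asarray(w) - theta > 0).astype(int)
    # Equation (10.9) applied to every sample, then summed as in (10.10)
    per_sample = np.where((ensemble == 0) & (C_actual == 0), a1,
                  np.where((ensemble == 1) & (C_actual == 1), a2, 0.0))
    return per_sample.sum()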

Since the constraint C_i^w is a nonlinear threshold function and A_i(w) is a step function, optimization methods that assume differentiability of the objective function run into difficulties; the above problem therefore cannot be solved with classical optimization methods. For this reason, an evolutionary programming (EP) algorithm (Fogel, 1991) is proposed to solve the optimization problem in (10.10), because EP is a useful optimization method when other techniques, such as gradient descent or direct analytical solution, are not applicable. For the above problem, the EP algorithm is described as follows:

(1) Create an initial population of L solution vectors w_r = (w_r1, w_r2, ..., w_rp), r = 1, 2, ..., L, for the above optimization problem by randomly sampling the interval [x, y], x, y ∈ R. Each individual w_r can be seen as a trial solution.

(2) Evaluate the objective function A(w_r) for each of the vectors. Here A(w_r) is called the fitness of w_r.

(3) Add a multivariate Gaussian vector Δ_r = N(0, G(A(w_r))) to the vector w_r to obtain w_r′ = w_r + Δ_r, where G is an appropriate monotone function, and re-evaluate A(w_r′). Here G(A(w_r)) is called the mutation rate and w_r′ is called an offspring of individual w_r.

(4) Define ŵ_i = w_i and ŵ_{i+L} = w_i′ for i = 1, 2, ..., L, and let C = {ŵ_i, i = 1, 2, ..., 2L} be the combined population. For every ŵ_j, j = 1, 2, ..., 2L, choose q vectors w* from C at random; whenever A(ŵ_j) > A(w*), count ŵ_j as a "winner" of that comparison.

(5) Choose the L individuals with the greatest number of wins, w_i*, i = 1, 2, ..., L. If the stopping criteria are not fulfilled, let w_r = w_r*, r = 1, 2, ..., L, set generation = generation + 1 and go to step (2).

Using this EP algorithm, an optimal combination of classifiers, w*, that maximizes the total classification accuracy is obtained. To verify the effectiveness of the proposed knowledge ensemble methodology, a real-world business credit risk dataset is used.
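A minimal sketch of the EP loop in steps (1)–(5) is given below, assuming a fitness callable such as the one sketched after Equation (10.10). The population size L, tournament size q, sampling interval and the particular monotone function G used for the mutation scale are illustrative choices, not values prescribed by the text.

import numpy as np

def ep_optimize(fitness, p, L=50, q=10, generations=200, interval=(-1.0, 1.0), seed=0):
    # fitness: callable mapping a weight vector of length p to the scalar A(w)
    rng = np.random.default_rng(seed)
    # Step (1): initial population of L vectors sampled uniformly from [x, y]
    pop = rng.uniform(interval[0], interval[1], size=(L, p))
    for _ in range(generations):
        # Steps (2)-(3): evaluate parents, then apply Gaussian mutation;
        # G is taken here as a simple decreasing function of fitness (an assumption)
        fit = np.array([fitness(w) for w in pop])
        scale = 1.0 / (1.0 + np.maximum(fit, 0.0))
        offspring = pop + rng.normal(0.0, scale[:, None], size=pop.shape)
        # Step (4): stochastic tournament over the combined 2L individuals
        combined = np.vstack([pop, offspring])
        comb_fit = np.array([fitness(w) for w in combined])
        wins = np.zeros(2 * L, dtype=int)
        for j in range(2 * L):
            opponents = rng.choice(2 * L, size=q, replace=False)
            wins[j] = int(np.sum(comb_fit[j] > comb_fit[opponents]))
        # Step (5): keep the L individuals with the most wins
        pop = combined[np.argsort(-wins)[:L]]
    return max(pop, key=fitness)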

10.3 Research Data and Experiment Design

The research data used here concern UK corporations and are taken from the Financial Analysis Made Easy (FAME) CD-ROM database; they can be found in the Appendix of Beynon and Peel (2001). The dataset contains 30 failed and 30 non-failed firms. Twelve variables are used to describe the firms' characteristics; they are described in Section 6.4.1 of Chapter 6.

The above dataset is used to identify the two classes of the business insolvency risk problem: failed and non-failed. They are coded as "0" or "1" in the research data, where "0" denotes a failed firm and "1" a non-failed firm. In this empirical test, 40 firms are randomly drawn as the training sample. Because the data are scarce, we make the number of good firms equal to the number of bad firms in both the training and testing samples, so as to avoid the awkward situation of having only two or three good (or, equally likely, bad) cases in the testing sample. Thus the training sample includes 20 firms of each class. This way of composing the sample of firms was also used by several researchers in the past, e.g., Altman (1968), Zavgren (1985) and Dimitras et al. (1999), among others. Its aim is to minimize the effect of factors such as industry or size, which in some cases can be very important. Apart from the above learning sample, the testing sample was collected using a similar approach; it consists of 10 failed and 10 non-failed firms. The testing data are used to evaluate the model on data that were not utilized to develop it.

For the BPNN, this chapter varies the number of nodes in the hidden layer and the stopping criterion for training. In particular, 6, 12, 18, 24 and 32 hidden nodes are used for each stopping criterion, because BPNN has no general rule for determining the optimal number of hidden nodes (Kim, 2003). For the stopping criterion, this chapter allows 100, 500, 1000 and 2000 learning epochs per training example, since there is little general guidance for selecting the number of epochs. The learning rate is set to 0.15 and the momentum term to 0.30. The hidden nodes use the