9.2 SVM-based Metamodeling Process
9.2.3 SVM-based Metamodeling Process
As is well known, the SVM has proven to be a very suitable tool for classification problems, but later studies also found some shortcomings, such as overfitting or underfitting (Tay and Cao, 2001). To overcome these limitations, an SVM-based metamodel is proposed for complex classification problems in this chapter. Generally, the main aim of the SVM-based metamodeling approach is to improve classification accuracy. Using the SVM and the above extended metalearning process, we can construct an SVM-based nonlinear metamodeling process, as illustrated in Fig. 9.3.
Similar to Fig. 9.2, the SVM-based metamodeling process consists of four stages: data partition and sampling, SVM base learning, base model selection and pruning, and SVM metalearning. Compared with Fig. 9.2, Fig. 9.3 describes a concrete SVM-based metalearning model. That is, the standard SVM learning algorithm proposed by Vapnik (1995) is used as both base learner and meta-learner.
From the SVM-based metamodeling process shown in Fig. 9.3, there are four main problems to be further addressed, i.e.,
(1) how to create n different training subsets from the original training set TR;
(2) how to create different SVM base models fi (i = 1, 2,…, n) using different training subsets;
(3) how to select some diverse base models from n SVM base models in the previous stage; and
(4) how to create a metamodel with different metadata produced by the selected SVM base models.
For the above problems, this chapter attempts to give some feasible solutions along with the descriptions of the four-stage SVM metamodeling process, which is described below.
166 9 Credit Risk Analysis with a SVM-based Metamodeling Ensemble
Fig. 9.3. SVM-based metamodeling process
9.2.3.1 Stage I: Data Partition and Sampling
When applying SVM to credit risk analysis, data partition is a necessary step. Furthermore, data partition can have a significant impact on the final results (Lai et al., 2006a). In many applications, the initial data set is split into two sets: a training set and a testing set. The former is used for model training and parameter estimation, and the latter for model testing and verification. However, further research indicates that a third set drawn from the original data, called the validation set, can effectively increase the robustness of machine learning algorithms such as SVM. For this reason, this chapter divides the original data set into three different parts, a training set, a validation set and a testing set, according to a predefined partition rate. Until now there is no consensus on this partition rate in the machine learning field; it is often determined on an arbitrary basis. However, a general rule for making machine learning algorithms possess good generalization capability is to guarantee enough training data, because both too little and too much training data will degrade the classification performance. Therefore, a training data set with an appropriate size is useful for improving classification performance. Yao and Tao (2000) used the 7:2:1 ratio to predict some foreign exchange rates and obtained good performance. Note that the three divided data sets do not overlap. For convenience, other partition ratios may also be used for data division, depending on the problem.
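The three-way partition described above can be sketched as follows. This is a minimal illustration assuming NumPy is available; the function name and the fixed random seed are our own choices, not the chapter's.

```python
import numpy as np

def partition(data, ratios=(0.7, 0.2, 0.1), seed=0):
    """Split a data set into non-overlapping training, validation and
    testing sets according to a partition ratio such as 7:2:1."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(data))          # shuffle, then cut
    n_tr = int(ratios[0] * len(data))
    n_vs = int(ratios[1] * len(data))
    tr = data[idx[:n_tr]]
    vs = data[idx[n_tr:n_tr + n_vs]]
    ts = data[idx[n_tr + n_vs:]]
    return tr, vs, ts

data = np.arange(100).reshape(100, 1)
tr, vs, ts = partition(data)
# with the 7:2:1 ratio: 70, 20 and 10 examples, and the sets do not overlap
```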
After data partition, how to create n training subsets from the original training set TR becomes a key problem. There are several different methods for creating training subsets. A direct replication approach is to replicate the training set TR n times to produce n training subsets {TR1, TR2, …, TRn}. But this approach requires that metalearning use different learning algorithms.
The noise injection method proposed by Raviv and Intrator (1996) adds noise to the training data. Since the injection of noise increases the independence between the different training sets derived from the original data, this method can effectively reduce the model variance. It also produces many different training sets. A potential problem of this method, however, is that the noise may change the characteristics of the original data set.
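Noise injection might be sketched as below, assuming NumPy; the noise level sigma is a tuning assumption of ours, and, as the caveat above notes, too large a value may distort the original data.

```python
import numpy as np

def noise_injected_sets(X, n_sets, sigma=0.05, seed=0):
    """Create n training sets by adding independent zero-mean Gaussian
    noise to the original training data X (Raviv-Intrator style)."""
    rng = np.random.default_rng(seed)
    return [X + rng.normal(0.0, sigma, X.shape) for _ in range(n_sets)]

X = np.ones((5, 3))
subsets = noise_injected_sets(X, n_sets=4)
# four training sets, each a slightly perturbed copy of X
```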
Bootstrap aggregating (bagging), proposed by Breiman (1996), is a widely used data sampling method in machine learning. The bagging algorithm generates different training data sets by random sampling with replacement. Therefore, bagging is a useful data sampling method for machine learning, especially in the case of data shortage. The bagging algorithm is somewhat similar to the first method; indeed, it is a more general form of the direct replication approach. The detailed bagging algorithm is described in Chapter 8. In this chapter, we also adopt the bagging algorithm to create different training subsets.
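The bootstrap sampling step of bagging can be sketched as follows, assuming NumPy; each subset has the same size as TR, drawn with replacement.

```python
import numpy as np

def bagging_subsets(X, y, n_subsets, seed=0):
    """Generate n training subsets by random sampling with replacement
    (bootstrap), each the same size as the original training set TR."""
    rng = np.random.default_rng(seed)
    subsets = []
    for _ in range(n_subsets):
        idx = rng.integers(0, len(X), size=len(X))   # sample with replacement
        subsets.append((X[idx], y[idx]))
    return subsets

X = np.arange(20).reshape(10, 2)
y = np.array([0, 1] * 5)
subs = bagging_subsets(X, y, n_subsets=5)
# five bootstrap subsets {TR1, ..., TR5}, each of 10 examples
```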
9.2.3.2 Stage II: SVM Base Learning
With the different training subsets, SVMs can be trained to formulate different base classifiers (i.e., SVM base models). As previously mentioned, a metamodel consisting of diverse base models with much disagreement is more likely to have good generalization performance. Therefore, how to create diverse base models is the key to the construction of an effective metamodel. For the SVM model, there are several methods for generating diverse models.
(1) Adopting different kernel functions, such as polynomial function, sigmoid function and Gaussian function;
(2) Varying the SVM model parameters, such as the upper bound parameter C and kernel parameters;
(3) Utilizing different training data sets, as done in the first stage.
In this chapter, the individual SVM models with different parameters based on different training subsets are therefore used as base learners L1, L2, …, Ln with a hybrid strategy, as illustrated in Fig. 9.3. That is, we utilize both ways (different parameters and different training sets) to create diverse SVM base predictors. Through training and validation, different SVM base models f1, f2, …, fn can be formulated in a parallel way. Such a parallel computing environment can effectively increase the learning efficiency and computational scalability.
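The hybrid strategy might be sketched as below, assuming scikit-learn's SVC; the parameter grid is purely hypothetical, and in practice each model would be fitted on a different bagging subset rather than the same data.

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical parameter sets: different kernels and upper-bound C values
param_sets = [
    {"kernel": "rbf", "C": 1.0, "gamma": "scale"},
    {"kernel": "rbf", "C": 10.0, "gamma": "scale"},
    {"kernel": "poly", "C": 1.0, "degree": 3},
    {"kernel": "sigmoid", "C": 1.0},
]

def train_base_models(subsets, param_sets):
    """Train one SVM base model per (training subset, parameter set) pair."""
    models = []
    for (X_i, y_i), params in zip(subsets, param_sets):
        models.append(SVC(**params).fit(X_i, y_i))
    return models

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
subsets = [(X, y)] * len(param_sets)   # in practice, distinct bagging subsets
models = train_base_models(subsets, param_sets)
```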
9.2.3.3 Stage III: Base Model Selection and Pruning
When a large number of SVM base models are generated, it is necessary to select an appropriate number of base models, or prune some redundant ones, to improve the performance of the SVM-based metalearning process. It is well known that not all circumstances satisfy the rule of "the more, the better" (Yu et al., 2005). That is, some individual SVM base models produced by the previous stage may be redundant. These redundant base models may waste computational resources, reduce computational efficiency and lower system performance. Thus, it is necessary to prune some inappropriate individual base models before metamodel construction. For this purpose, the principal component analysis (PCA) technique is used to perform the model selection or model pruning task.
The PCA technique (Jolliffe, 1986), an effective feature extraction method, is widely used in signal processing, statistics and neural computing. The basic idea of PCA is to find the components (s1, s2, …, sp) that can explain the maximum amount of variance possible by p linearly transformed components from a data vector with q dimensions. The mathematical technique used in PCA is called eigen analysis. In addition, the basic goal of PCA is to reduce the dimension of the data (here PCA is used to reduce the number of individual base models). Thus, one usually chooses p ≤ q. Indeed, it can be proven that the representation given by PCA is an optimal linear dimension reduction technique in the mean-square sense (Jolliffe, 1986). Such a reduction in dimension has important benefits. First, the computation of the subsequent processing is reduced. Second, noise may be reduced and the meaningful underlying information identified. The following presents the PCA process for individual model selection (Yu et al., 2005).
Assuming that there are n individual base models, through model train- ing and validation, every base model can generate m classification results, which can be represented by a result matrix (Y):
Y = \begin{bmatrix}
y_{11} & y_{12} & \cdots & y_{1m} \\
y_{21} & y_{22} & \cdots & y_{2m} \\
\vdots & \vdots & \ddots & \vdots \\
y_{n1} & y_{n2} & \cdots & y_{nm}
\end{bmatrix}    (9.1)
where y_{ij} is the jth classification/prediction result of the ith base model.
Next, we deal with the result matrix using the PCA technique. First, the eigenvalues (λ1, λ2, …, λn) and the corresponding eigenvectors A = (a1, a2, …, an) can be solved from the above matrix. Then the new principal components are calculated as
Z_i = a_i^T Y  (i = 1, 2, …, n)    (9.2)
Subsequently, we choose m (m ≤ n) principal components from existing n components. If this is the case, the saved information content is judged by
θ = (λ_1 + λ_2 + … + λ_m) / (λ_1 + λ_2 + … + λ_n)    (9.3)
If θ is sufficiently large (e.g., θ > 0.8), or θ is larger than a specified threshold, enough information has been retained after the PCA feature extraction process. Thus, some redundant base models can be pruned, and recombining the new information can increase the efficiency of the SVM classification system without a substantial reduction in performance. Through applying the PCA technique, we can obtain an appropriate number of base models for metamodel generation. Suppose the PCA technique selects m SVM base models from the n initial base models; then the (n − m) remaining base models are pruned. The m selected SVM base models can be represented as f1, f2, …, fm.
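The selection rule of Eqs. (9.1)-(9.3) can be sketched as follows, assuming NumPy: eigenvalues are computed from the covariance of the n-by-m result matrix Y, and the smallest p with θ > 0.8 is retained. The toy data, where the base models are deliberately made almost identical, is our own illustration.

```python
import numpy as np

def select_base_models(Y, threshold=0.8):
    """Choose the number p of components to keep from the result matrix Y
    (n base models x m results) so that the saved-information ratio
    theta = (lam_1 + ... + lam_p) / (lam_1 + ... + lam_n) exceeds threshold."""
    cov = np.cov(Y)                              # n x n covariance across models
    eigvals = np.linalg.eigvalsh(cov)[::-1]      # eigenvalues, descending
    ratios = np.cumsum(eigvals) / np.sum(eigvals)
    p = int(np.searchsorted(ratios, threshold) + 1)
    return p, ratios

rng = np.random.default_rng(0)
base = rng.normal(size=(1, 50))
# six nearly identical base models: highly redundant outputs
Y = np.vstack([base + 0.01 * rng.normal(size=(1, 50)) for _ in range(6)])
p, ratios = select_base_models(Y)
# when models are redundant, very few components carry almost all variance
```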
Once the appropriate number of SVM base models is selected, these selected SVM base models can produce a set of SVM prediction/classification results (ŷ_1, ŷ_2, …, ŷ_k, …, ŷ_m), where ŷ_k is the prediction result of the kth SVM base model. These prediction results from different SVM base models contain the different information that each SVM base model captured. Being distinct from the original data set, these results can be defined as "metadata". These metadata form a new training set called the "meta-training set (MT)". In order to obtain good performance, recombining these results is of importance for the final prediction results.
The following subsection gives some solutions to generate a metamodel.
9.2.3.4 Stage IV: SVM Metalearning and Metamodel Generation
Through the previous stage, some diverse SVM base models are selected. The subsequent task is to construct a metamodel based on a metalearning strategy using the metadata produced by these selected SVM base models. Actually, the metamodel formulation process is an information integration process over the selected base models. Suppose there are m selected SVM base models. Through training, validation and generalization, m SVM base model outputs, i.e., ŷ_1, ŷ_2, …, ŷ_k, …, ŷ_m, are generated. The main question of the SVM metamodel is how to integrate these outputs into an aggregate output, which is assumed to be more accurate, by adopting a suitable metalearning strategy. That is, how to integrate the information produced by the selected base models using an appropriate metalearning strategy. There are many different metalearning strategies for classification and regression problems, respectively.
Typically, there are five general metalearning approaches in the existing literature (Vilalta and Drissi, 2002). It is worth noting that there exist many variations on these general approaches in the existing studies.
(1) Stacked generalization;
(2) Boosting;
(3) Dynamic bias selection;
(4) Discovering meta-knowledge; and
(5) Inductive transfer.
Stacked generalization works by combining a number of different learning algorithms. The metadata is formed by the classifications of those different learning algorithms. Then another learning algorithm learns from this metadata to predict which combinations of algorithms give generally good results. Given a new learning problem, the classifications of the selected set of algorithms are combined (e.g., by weighted voting) to provide the final classification results. Because each algorithm is deemed to work on a subset of problems, a combination is hoped to be more flexible and still able to make good classifications.
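The weighted-voting combination mentioned above can be sketched as follows, assuming NumPy; the weights and predictions are invented for illustration, and ties are broken in favor of the smaller class label.

```python
import numpy as np

def weighted_vote(predictions, weights):
    """Combine the classifications of several algorithms by weighted voting.
    predictions: (n_models, n_examples) array of class labels."""
    predictions = np.asarray(predictions)
    weights = np.asarray(weights)
    classes = np.unique(predictions)
    # score of each class for each example = total weight of models voting for it
    scores = np.array([
        [np.sum(weights * (predictions[:, j] == c)) for c in classes]
        for j in range(predictions.shape[1])
    ])
    return classes[np.argmax(scores, axis=1)]

preds = [[1, 0, 1],    # classifier 1, weight 0.5
         [1, 1, 0],    # classifier 2, weight 0.3
         [0, 1, 1]]    # classifier 3, weight 0.2
w = [0.5, 0.3, 0.2]
combined = weighted_vote(preds, w)
# -> [1, 0, 1]: example 2 is a 0.5/0.5 tie, resolved toward class 0
```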
Boosting is related to stacked generalization, but uses the same learning algorithm multiple times, where the examples in the training data get different weights over each run. This produces different classifications, each focused on rightly predicting a subset of the data, and combining those classifications leads to better (but more expensive) results.
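The re-weighting step at the heart of boosting can be sketched as below; the specific update formula is the standard AdaBoost one, used here purely as an illustration of how misclassified examples gain weight between runs.

```python
import numpy as np

def adaboost_weights(y_true, y_pred, w):
    """One boosting round: re-weight the training examples so that the next
    run of the same learning algorithm focuses on the examples the current
    classifier got wrong (AdaBoost-style update)."""
    err = np.sum(w * (y_true != y_pred)) / np.sum(w)      # weighted error rate
    alpha = 0.5 * np.log((1 - err) / err)                  # classifier weight
    w = w * np.exp(alpha * np.where(y_true != y_pred, 1.0, -1.0))
    return w / np.sum(w), alpha

y_true = np.array([1, 1, 0, 0, 1])
y_pred = np.array([1, 0, 0, 0, 1])    # one mistake, at index 1
w0 = np.ones(5) / 5
w1, alpha = adaboost_weights(y_true, y_pred, w0)
# the misclassified example now carries more weight than the others
```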
Dynamic bias selection works by altering the inductive bias of a learning algorithm to match the given problem. This is done by altering some key aspects of the learning algorithm, such as the hypothesis representation, heuristic formulae or parameters.
Discovering meta-knowledge works by inducing knowledge (e.g., rules) that expresses how each learning method will perform on different learning problems. The metadata is formed by characteristics of the data (e.g., general, statistical, etc.) in the learning problem, and characteristics of the learning algorithm (e.g., algorithm type, parameter settings, etc.). Another learning algorithm then learns how the data characteristics relate to the algorithm characteristics. Given a new learning problem, the data characteristics are measured, and the performance of different learning algorithms can be predicted. Hence one can select the algorithms best suited for the new problem, at least if the induced relationship holds.
Inductive transfer is also called "learning to learn". It studies how the learning process can be improved over time. Metadata consists of knowledge about previous learning episodes, and is used to efficiently develop an effective hypothesis for a new task.
From the above descriptions, we find that the existing metalearning techniques are built on a linear assumption, for example, majority voting and weighted voting in stacked generalization and boosting. However, linear metalearning techniques are still insufficient for some complex and difficult problems such as credit risk evaluation. This is another key problem presented in Section 9.1. As a remedy, this chapter proposes a nonlinear metalearning technique to construct a metamodel for credit risk evaluation.
This nonlinear metamodeling approach uses another SVM, named the "meta" SVM, which is different from the base SVM classifiers, to construct an SVM metamodel for classification problems. Concretely speaking, in this nonlinear metalearning approach, the outputs of the SVM base models are taken as the inputs of the "meta" SVM. That is, the SVM-based nonlinear metalearning technique can also be viewed as a nonlinear metamodeling system that can be represented by
ŷ = φ(ŷ_1, ŷ_2, …, ŷ_m)    (9.4)

where ŷ is the output of the metamodel, (ŷ_1, ŷ_2, …, ŷ_m) is the output vector of the base models, and φ(·) is a nonlinear function determined by the "meta" SVM model. (ŷ_1, ŷ_2, …, ŷ_m) can be seen as a metadata set, which can formulate a meta-training set (MT), as illustrated in Fig. 9.3. Because this metamodel is generated by training on a meta-training set, or learning from a meta-training set, this process is usually called "SVM metamodeling" or "SVM metalearning". The SVM metalearning process is illustrated in Fig. 9.4.
Fig. 9.4. Graphical illustration for SVM-based metalearning process
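The metalearning step of Eq. (9.4) can be sketched as follows, assuming scikit-learn; the synthetic data, the base-model parameter values and the held-out split sizes are our own illustrative assumptions. The base models' predictions become the meta-training set MT, and a second "meta" SVM realizes the nonlinear combination function φ(·).

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = (X[:, 0] * X[:, 1] + X[:, 2] > 0).astype(int)
X_tr, y_tr = X[:100], y[:100]          # training set TR for the base SVMs
X_meta, y_meta = X[100:150], y[100:150]  # held-out data for the meta SVM
X_ts, y_ts = X[150:], y[150:]          # testing set TS

# Diverse SVM base models (hypothetical parameter values)
base_models = [SVC(kernel="rbf", C=c).fit(X_tr, y_tr) for c in (0.1, 1.0, 10.0)]

def metadata(models, X):
    """Outputs of the base models become the inputs of the 'meta' SVM."""
    return np.column_stack([m.predict(X) for m in models])

# Fit the nonlinear combination function phi(.) as a second SVM
meta_svm = SVC(kernel="rbf", C=1.0).fit(metadata(base_models, X_meta), y_meta)
y_hat = meta_svm.predict(metadata(base_models, X_ts))
```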
To summarize, the proposed SVM-based nonlinear metamodeling process consists of the above four stages. Generally speaking, suppose that there is an original data set DS which is divided into three parts: a training set (TR), a validation set (VS) and a testing set (TS). The training set is usually preprocessed by various sampling methods in order to generate diverse training subsets {TR1, TR2, …, TRn} before they are applied to the SVM base learners L1, L2, …, Ln. After training, the diverse SVM base models f1, f2, …, fn are generated. Through the validation set and testing set, these base models are verified and some independent base models are chosen using the PCA technique. Afterwards the whole training set TR is applied and the corresponding results (ŷ_1, ŷ_2, …, ŷ_m) of each base model are used