

7.5 Experimental Results

7.5.1 Empirical Analysis of GA-based Input Features Evolution

7 Evolving Least Squares SVM for Credit Risk Analysis

In this subsection, we use i% (i = 20, 30, …, 90) of the data as training data to perform the feature selection. Here we assume that LSSVM training with less than 20% of the data is inadequate.

This testing process includes three steps. First, we randomly select some input variables using GA and pass them to LSSVM. Second, the dataset is randomly separated into two parts, training samples and validation samples, according to the partition i%. Third, we use the i% training data to train the LSSVM model and then test the trained LSSVM on the remaining (100−i)% validation data, following a five-fold cross-validation procedure. The features selected by the GA-based feature selection procedure are shown in Table 7.1.
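The three-step protocol above can be sketched as follows. This is a minimal, self-contained illustration only: the LSSVM trainer and the five-fold cross-validation scoring are replaced by a stand-in fitness function (`cv_accuracy`, a hypothetical name), since the actual solver and its settings are described elsewhere in the chapter.

```python
import random

random.seed(42)

N_FEATURES = 12          # Dataset 1 has 12 candidate features
POP_SIZE, N_GEN = 20, 30

def cv_accuracy(mask, i_percent=70):
    """Stand-in for: split the data i% / (100-i)%, train LSSVM on the
    selected features, and return the five-fold CV hit ratio.
    Here we simply reward subsets containing features 2 and 7 and
    penalize subset size, purely to make the sketch executable."""
    chosen = {j + 1 for j, bit in enumerate(mask) if bit}
    if not chosen:
        return 0.0
    score = 0.5 + 0.2 * (2 in chosen) + 0.2 * (7 in chosen)
    return score - 0.01 * len(chosen)        # parsimony pressure

def evolve():
    # Step 1: random binary chromosomes encode candidate feature subsets
    pop = [[random.randint(0, 1) for _ in range(N_FEATURES)]
           for _ in range(POP_SIZE)]
    for _ in range(N_GEN):
        # Steps 2-3 happen inside the fitness evaluation (cv_accuracy)
        pop.sort(key=cv_accuracy, reverse=True)
        parents = pop[:POP_SIZE // 2]                 # truncation selection
        children = []
        while len(children) < POP_SIZE - len(parents):
            a, b = random.sample(parents, 2)
            cut = random.randrange(1, N_FEATURES)     # one-point crossover
            child = a[:cut] + b[cut:]
            if random.random() < 0.1:                 # bit-flip mutation
                k = random.randrange(N_FEATURES)
                child[k] = 1 - child[k]
            children.append(child)
        pop = parents + children                      # elitist replacement
    best = max(pop, key=cv_accuracy)
    return sorted(j + 1 for j, bit in enumerate(best) if bit)

print(evolve())   # a small feature subset, 1-based IDs as in Table 7.1
```

Under this toy fitness the GA converges to a compact subset containing features 2 and 7; with a real LSSVM fitness, the surviving chromosome plays the role of the "Selected Feature ID" rows in Table 7.1.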

Table 7.1. Selected key features by GA-based feature selection procedure

Data            Selected Feature ID
Partition (i%)  Dataset 1          Dataset 2             Dataset 3
20              1, 2, 5, 7, 8, 10  1, 2, 4, 5, 6, 9, 11  2, 3, 5, 7, 10, 15, 17
30              2, 3, 5, 7, 11     2, 4, 5, 8, 12        3, 7, 11, 12, 17, 19
40              2, 3, 7, 11        1, 2, 5, 6, 9         1, 3, 7, 8, 11, 17
50              1, 2, 3, 11        1, 5, 6, 10, 14       2, 3, 7, 8, 13, 17
60              1, 2, 7, 10        2, 5, 7, 10, 13       3, 7, 12, 13, 17
70              2, 7, 8, 11        1, 2, 5, 6, 12        1, 3, 7, 14, 15, 17
80              1, 2, 7, 11        1, 2, 5, 6, 13        3, 7, 10, 12, 15, 17
90              2, 7, 10, 11       5, 6, 7, 9, 11        3, 7, 12, 17, 19

As can be seen from Table 7.1, the selected key features vary with the data partition: each partition highlights a somewhat different subset. We partially attribute this finding to the strong correlation among credit assessment related features. Moreover, the key determinants differ across the three credit datasets.

For Dataset 1 (England corporation credit), ROCE (feature 2), QACL (feature 7), and CHAUD (feature 11) can be considered key drivers of corporation credit because they appear repeatedly across the eight experiments. For Dataset 2, applicant's income (feature 5) and applicant's employment status (feature 6) can be seen as two key drivers of England consumers' credit; year of birth (feature 1) and number of children (feature 2) are also important factors. For the third dataset, credit history (feature 3), present employment (feature 7), and job (feature 17) are the key drivers for German consumers' credit. The reason for these differences is that the datasets cover different credit types: Dataset 1 is a corporation credit dataset, while Datasets 2 and 3 are consumers' credit datasets. Furthermore, in the two consumers' credit datasets, both applicant's income (the fifth feature in Dataset 2 vs. the seventeenth feature in Dataset 3) and employment status (the sixth feature in Dataset 2 vs. the seventh feature in Dataset 3) are key drivers of the credit evaluation. Identifying these key determinants yields two benefits. On the one hand, the analysis shows that these features are significantly related to credit evaluation, so the decision-makers of credit-granting institutions can give them more weight in future evaluations. On the other hand, we can reduce data collection and storage requirements, and thus labor and data transmission costs, as illustrated below.
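The notion of a "key driver" used here (a feature that recurs across the eight data partitions) can be made concrete by tallying the Dataset 1 selections from Table 7.1; the frequency threshold of six is our own illustrative choice, not one stated in the text:

```python
from collections import Counter

# Dataset 1 selections from Table 7.1, keyed by data partition i%.
selections = {
    20: [1, 2, 5, 7, 8, 10],
    30: [2, 3, 5, 7, 11],
    40: [2, 3, 7, 11],
    50: [1, 2, 3, 11],
    60: [1, 2, 7, 10],
    70: [2, 7, 8, 11],
    80: [1, 2, 7, 11],
    90: [2, 7, 10, 11],
}

# Count how often each feature ID appears across the eight experiments.
counts = Counter(f for feats in selections.values() for f in feats)

# Features appearing in at least six of the eight runs (illustrative cutoff).
key_drivers = [f for f, c in counts.most_common() if c >= 6]
print(counts.most_common(3))   # prints [(2, 8), (7, 7), (11, 6)]
print(key_drivers)             # prints [2, 7, 11]
```

The tally recovers exactly the features named in the text: ROCE (2) appears in all eight runs, QACL (7) in seven, and CHAUD (11) in six.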

As Table 7.1 also shows, the GA-based feature selection procedure reduces the data dimensions of each credit dataset. For Dataset 1, most data partitions select four features; the exceptions are the partition i = 20%, which selects six features, and i = 30%, which selects five (probably because too little data is used). On average, the GA-based feature selection method selects four features, reducing the data dimensionality by (12−4)/12 ≈ 66.67% for Dataset 1. For Dataset 2 and Dataset 3, the procedure selects five and six features on average, respectively. The data reduction for Dataset 2 is (14−5)/14 ≈ 64.29%, while for Dataset 3 it reaches (20−6)/20 = 70%. This indicates that GA-based feature selection can considerably reduce data collection, computation, and storage costs.
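The reduction figures follow from simple arithmetic on the feature counts, which can be checked directly:

```python
# Data reduction = (original feature count - selected count) / original count.
def reduction(n_features, n_selected):
    return (n_features - n_selected) / n_features

# Original and average selected feature counts from Table 7.1.
datasets = {"Dataset 1": (12, 4), "Dataset 2": (14, 5), "Dataset 3": (20, 6)}

for name, (n, k) in datasets.items():
    print(f"{name}: {reduction(n, k):.2%} reduction")
# Dataset 1: 66.67% reduction
# Dataset 2: 64.29% reduction
# Dataset 3: 70.00% reduction
```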

Although we select only a few typical features to classify the status of credit applicants, the classification performance of the model is still promising, as illustrated in Fig. 7.3. Note that each value in Fig. 7.3 is the average classification performance over the five-fold cross-validation experiments.

Fig. 7.3. Performance of feature evolved LSSVM on three credit datasets


Several findings can be observed from Fig. 7.3. First of all, up to the 70%–80% partition rates, the classification performance generally improves as the data partition increases. The main reason is that too little training data (e.g., the 20%–30% partition rates) is often insufficient for LSSVM learning. Second, at the 90% partition rate, the predictions are slightly worse than at the 80% partition rate.

The reason is unknown and worth exploring with further experiments. Third, across the three testing cases, the performance on Dataset 1 is slightly better than on the other two. A possible reason is that Dataset 1 has fewer features than the other two datasets; the real cause deserves further study. In summary, the evolving LSSVM learning paradigm with GA-based feature selection is rather robust: the hit ratios for almost all data partitions are above 77.7%, the only exception being the 20% partition for Dataset 1 (75.67%). Furthermore, the variance across the five-fold cross-validation experiments is small, at 1%–4%. These findings also imply that the feature-evolved LSSVM model can effectively improve credit evaluation performance.
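For clarity, the aggregation behind these figures (the mean hit ratio over five folds, plus a spread measure across folds) can be sketched as follows; the fold scores below are hypothetical values chosen only for illustration:

```python
from statistics import mean, stdev

# Hypothetical hit ratios from one five-fold cross-validation run.
fold_hit_ratios = [0.79, 0.81, 0.78, 0.80, 0.82]

# The value plotted in Fig. 7.3 is the average over the five folds;
# the spread across folds indicates how robust the model is.
avg = mean(fold_hit_ratios)
spread = stdev(fold_hit_ratios)
print(f"average hit ratio: {avg:.2%}, spread: {spread:.2%}")
```

A small spread across folds, as reported above, means the selected feature subsets generalize consistently rather than fitting one lucky split.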

7.5.2 Empirical Analysis of GA-based Parameters Optimization