
Back-Propagation Approach for Handling the Two-Class Imbalance Problem

6. Experimental Results and Discussion

In addition, when the null hypothesis was rejected, a post-hoc test was applied in order to find the particular pairwise method comparisons producing statistically significant differences. The Holm–Shaffer post-hoc tests are applied in order to report any significant difference between individual methods. The Holm procedure rejects the hypotheses (Hi) one at a time until no further rejections can be made [53]. To accomplish this, the Holm method orders the p-values from the smallest to the largest, i.e., p1 ≤ p2 ≤ ... ≤ pk−1, corresponding to the hypothesis sequence H1, H2, ..., Hk−1. Then, the Holm procedure rejects H1 to Hi−1 if i is the smallest integer such that pi > α/(k − i). This procedure starts with the most significant p-value. As soon as a certain null hypothesis cannot be rejected, all of the remaining hypotheses are retained as well [54]. The Shaffer method follows a procedure very similar to Holm's, but instead of rejecting Hi if pi ≤ α/(k − i), it rejects Hi if pi ≤ α/ti, where ti is the maximum number of hypotheses that can be true given that any (i − 1) hypotheses are false [55].
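To make the step-down logic concrete, the following is a minimal sketch of the Holm procedure as just described; the p-values and significance level are illustrative placeholders rather than values from this study, and Shaffer's variant would differ only in the divisor.

```python
# A sketch of the Holm step-down procedure described above; Shaffer's variant
# only replaces the divisor (k - i) with t_i. The p-values and alpha are
# illustrative placeholders, not values from this study.

def holm(p_values, alpha=0.05):
    """Decide reject/retain for each of m hypotheses (m = k - 1 above).

    Hypotheses are visited by ascending p-value; the i-th smallest is
    rejected while p_i <= alpha / (m - i + 1), and from the first retained
    hypothesis onward, all remaining hypotheses are retained as well.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for step, idx in enumerate(order, start=1):
        if p_values[idx] <= alpha / (m - step + 1):
            reject[idx] = True
        else:
            break                 # retain this and every remaining hypothesis
    return reject

print(holm([0.001, 0.012, 0.030, 0.400]))   # -> [True, True, False, False]
```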

Figure 1. Results of the non-parametric statistical Holm and Shaffer post-hoc tests. The filled circles mean that for these particular pairs of classifiers, the null hypothesis was rejected by both tests. The color of the circles is darkest at p-values close to zero, i.e., when the statistical difference is most significant.

[Figure 2: scatter plot of AUC average ranks (x axis, 6 to 18) versus number of samples used in the training (y axis, 0 to 10,000) for Standard BP, ADASYN, SL-SMOTE, B-SMOTE, ROS, SMOTE, SMOTE-ENN, SMOTE-TL, SPIDER-1, SPIDER-2, ADOMS, SDSAS, SDSAO, DyS, DOS, CNN, CNNTL, NCL, OSS, RUS and TL, with markers distinguishing under-sampling, over-sampling and dynamic sampling methods.]

Figure 2. Number of samples used in the training process by the studied methods, in contrast to the Area Under the receiver operating characteristic Curve (AUC) average ranks obtained in the classification test. The x axis represents the average ranks (the best-performing method has a rank of one or close to this value). Ten-fold cross-validation was used; the number shown on the y axis corresponds to the average training fold size.


In addition, Table 2 indicates that the class Imbalance Ratio (IR) is not the determining factor for obtaining high AUC values; for example, the CAY7, SAT2, SEG1, SEG5 and 92A5 datasets present high AUC values regardless of their IR. Moreover, on these datasets, most over-sampling methods and dynamic sampling approaches are very competitive.

Other datasets support this fact, i.e., that IR is not critical to the classification performance: the SEG4 and SEG5 datasets have the same IR, but their classification performance (using the standard BP) is very different (AUC values of 0.630 and 0.999, respectively). This confirms what was presented in other works, namely that other characteristics of the data can aggravate the class imbalance problem [2].

For example: (i) class overlapping or noisy data [39,42,56,57]; (ii) small disjuncts; (iii) the lack of density and information in the training data [58]; (iv) the significance of the borderline instances [13,59] and their relationship with noisy samples; and (v) possible differences between the training and testing data distributions, also known as dataset shift [7].

In order to strengthen the result analysis, non-parametric statistical and post-hoc tests are applied (see Section 5.3). Considering reduction performance distributed according to the chi-square distribution with 20 degrees of freedom, the Friedman statistic is 329.474, and the p-value computed by the Friedman test is 1.690 × 10⁻¹⁰; considering reduction performance distributed according to the F-distribution with 20 and 680 degrees of freedom, the Iman–Davenport statistic is 30.233, with a computed p-value of 2.588 × 10⁻⁸⁰. The null hypothesis is therefore rejected, i.e., the Friedman and Iman–Davenport tests indicate the existence of significant differences in the results.

Due to these results, a post-hoc statistical analysis is required.
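For reference, below is a minimal sketch of how these two statistics are computed from an N × k matrix of per-dataset ranks (N datasets, k methods, rank 1 = best), assuming the standard Friedman and Iman–Davenport formulas; the sanity check at the end reproduces the reported Iman–Davenport value from the reported Friedman statistic with k = 21 and N = 35.

```python
# A minimal sketch of the Friedman test with the Iman-Davenport correction,
# assuming the standard formulas; `ranks` is an (N, k) matrix of per-dataset
# ranks (1 = best), such as those underlying the Ranks row of Table 2.

import numpy as np
from scipy import stats

def friedman_iman_davenport(ranks):
    n, k = ranks.shape
    rj = ranks.mean(axis=0)                               # average rank per method
    chi2 = 12.0 * n / (k * (k + 1)) * (np.sum(rj**2) - k * (k + 1)**2 / 4.0)
    ff = (n - 1) * chi2 / (n * (k - 1) - chi2)            # Iman-Davenport statistic
    p_chi2 = stats.chi2.sf(chi2, df=k - 1)
    p_ff = stats.f.sf(ff, dfn=k - 1, dfd=(k - 1) * (n - 1))
    return chi2, p_chi2, ff, p_ff

# Sanity check against the values reported above (k = 21 methods, N = 35 datasets):
n, k, chi2 = 35, 21, 329.474
print((n - 1) * chi2 / (n * (k - 1) - chi2))              # ~30.233
```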

Figure 1 shows the results of the non-parametric statistical Holm and Shaffer post-hoc tests. The rows and columns constitute the studied methods; consequently, the figure represents all C × C pairwise classifier comparisons. The filled circles mean that for a particular pair of methods (Ci × Cj; i = 1, 2, ..., C and i ≠ j), the null hypothesis was rejected by the Holm–Shaffer post-hoc tests. The color of a circle is darkest when the p-value is close to zero, i.e., when the statistical difference is most significant.


Table 2. Back-propagation classification performance using the Area Under the receiver operating characteristic Curve (AUC). The results represent the values averaged over ten folds and ten different weight initializations of the neural network. The best values are underlined in order to highlight them. ROS, Random Over-Sampling; SDSAO, Selective Dynamic Sampling Approach using ROS; DyS, Dynamic Sampling; SDSAS, Selective Dynamic Sampling Approach applying SMOTE; SMOTE-TL, SMOTE and TL; SMOTE, Synthetic Minority Over-sampling Technique; SL-SMOTE, Safe-Level SMOTE; SMOTE-ENN, SMOTE and Editing Nearest Neighbor (ENN) rule; ADOMS, Adjusting the Direction Of the synthetic Minority clasS method; B-SMOTE, Borderline-SMOTE; DOS, Dynamic Over-Sampling; RUS, Random Under-Sampling; ADASYN, Adaptive Synthetic Sampling; SPIDER-1 and -2, frameworks that integrate a selective data pre-processing with an ensemble method; NCL, Neighborhood Cleaning Rule; TL, Tomek links method; STANDARD, back-propagation without any pre-processing; OSS, One-Sided Selection method; CNN, Condensed Nearest Neighbor; CNNTL, Condensed Nearest Neighbor with TL (for more details, see Sections 3 and 4).

DATA | ROS | SDSAO | DyS | SDSAS | SMOTE-TL | SMOTE | SL-SMOTE | SMOTE-ENN | ADOMS | B-SMOTE | DOS | RUS | ADASYN | SPIDER-1 | SPIDER-2 | NCL | TL | STANDARD | OSS | CNN | CNNTL
CAY0 | 0.976 | 0.986 | 0.985 | 0.978 | 0.976 | 0.976 | 0.976 | 0.976 | 0.977 | 0.968 | 0.984 | 0.970 | 0.967 | 0.937 | 0.938 | 0.937 | 0.936 | 0.933 | 0.803 | 0.833 | 0.795
CAY1 | 0.969 | 0.979 | 0.975 | 0.968 | 0.969 | 0.969 | 0.970 | 0.970 | 0.970 | 0.966 | 0.985 | 0.955 | 0.965 | 0.735 | 0.789 | 0.745 | 0.717 | 0.752 | 0.931 | 0.916 | 0.918
CAY2 | 0.968 | 0.959 | 0.958 | 0.967 | 0.969 | 0.968 | 0.968 | 0.969 | 0.968 | 0.968 | 0.952 | 0.959 | 0.964 | 0.952 | 0.957 | 0.952 | 0.952 | 0.949 | 0.961 | 0.960 | 0.958
CAY3 | 0.985 | 0.985 | 0.983 | 0.981 | 0.986 | 0.985 | 0.984 | 0.985 | 0.984 | 0.973 | 0.991 | 0.950 | 0.975 | 0.948 | 0.949 | 0.940 | 0.929 | 0.941 | 0.809 | 0.826 | 0.839
CAY4 | 0.974 | 0.994 | 0.991 | 0.971 | 0.971 | 0.968 | 0.971 | 0.968 | 0.968 | 0.964 | 0.962 | 0.922 | 0.967 | 0.914 | 0.936 | 0.888 | 0.865 | 0.846 | 0.946 | 0.934 | 0.956
CAY5 | 0.952 | 0.922 | 0.956 | 0.943 | 0.952 | 0.950 | 0.949 | 0.951 | 0.951 | 0.950 | 0.908 | 0.933 | 0.951 | 0.781 | 0.834 | 0.773 | 0.769 | 0.772 | 0.666 | 0.618 | 0.617
CAY6 | 0.980 | 0.956 | 0.956 | 0.980 | 0.981 | 0.981 | 0.980 | 0.982 | 0.980 | 0.969 | 0.956 | 0.960 | 0.973 | 0.946 | 0.946 | 0.952 | 0.949 | 0.830 | 0.850 | 0.787 | 0.875
CAY7 | 0.991 | 0.983 | 0.983 | 0.990 | 0.990 | 0.990 | 0.991 | 0.991 | 0.991 | 0.983 | 0.986 | 0.988 | 0.967 | 0.986 | 0.985 | 0.984 | 0.984 | 0.985 | 0.776 | 0.824 | 0.788
CAY8 | 0.975 | 0.937 | 0.935 | 0.966 | 0.976 | 0.972 | 0.971 | 0.972 | 0.970 | 0.964 | 0.923 | 0.925 | 0.958 | 0.933 | 0.933 | 0.935 | 0.934 | 0.935 | 0.817 | 0.826 | 0.825
CAY9 | 0.915 | 0.923 | 0.920 | 0.910 | 0.917 | 0.916 | 0.915 | 0.915 | 0.916 | 0.911 | 0.898 | 0.896 | 0.909 | 0.875 | 0.868 | 0.860 | 0.848 | 0.834 | 0.879 | 0.849 | 0.872
CAY10 | 0.967 | 0.979 | 0.965 | 0.968 | 0.963 | 0.968 | 0.968 | 0.969 | 0.968 | 0.965 | 0.973 | 0.885 | 0.970 | 0.883 | 0.902 | 0.860 | 0.877 | 0.922 | 0.833 | 0.786 | 0.808
FELT0 | 0.979 | 0.982 | 0.980 | 0.979 | 0.978 | 0.978 | 0.977 | 0.977 | 0.978 | 0.971 | 0.977 | 0.977 | 0.951 | 0.976 | 0.976 | 0.977 | 0.976 | 0.977 | 0.952 | 0.955 | 0.937
FELT1 | 0.976 | 0.966 | 0.980 | 0.975 | 0.976 | 0.973 | 0.975 | 0.976 | 0.976 | 0.968 | 0.971 | 0.970 | 0.958 | 0.964 | 0.964 | 0.965 | 0.964 | 0.965 | 0.947 | 0.946 | 0.945
FELT2 | 0.976 | 0.947 | 0.960 | 0.976 | 0.974 | 0.975 | 0.976 | 0.974 | 0.975 | 0.959 | 0.969 | 0.959 | 0.963 | 0.914 | 0.921 | 0.918 | 0.901 | 0.890 | 0.952 | 0.948 | 0.948
FELT3 | 0.977 | 0.984 | 0.987 | 0.978 | 0.977 | 0.978 | 0.977 | 0.977 | 0.978 | 0.968 | 0.987 | 0.971 | 0.956 | 0.974 | 0.975 | 0.969 | 0.971 | 0.970 | 0.964 | 0.966 | 0.962
FELT4 | 0.983 | 0.992 | 0.976 | 0.985 | 0.983 | 0.983 | 0.984 | 0.983 | 0.983 | 0.977 | 0.988 | 0.981 | 0.964 | 0.968 | 0.972 | 0.972 | 0.968 | 0.968 | 0.968 | 0.969 | 0.961
SAT0 | 0.920 | 0.910 | 0.916 | 0.917 | 0.915 | 0.915 | 0.916 | 0.914 | 0.916 | 0.918 | 1.000 | 0.913 | 0.907 | 0.909 | 0.912 | 0.909 | 0.894 | 0.881 | 0.899 | 0.881 | 0.865
SAT1 | 0.985 | 0.988 | 0.988 | 0.984 | 0.986 | 0.983 | 0.986 | 0.984 | 0.985 | 0.983 | 0.996 | 0.983 | 0.976 | 0.983 | 0.983 | 0.982 | 0.982 | 0.981 | 0.982 | 0.981 | 0.976
SAT2 | 0.981 | 0.989 | 0.983 | 0.980 | 0.982 | 0.980 | 0.981 | 0.983 | 0.980 | 0.977 | 0.961 | 0.980 | 0.969 | 0.977 | 0.980 | 0.977 | 0.976 | 0.976 | 0.971 | 0.966 | 0.964
SAT3 | 0.961 | 0.965 | 0.958 | 0.960 | 0.958 | 0.962 | 0.962 | 0.961 | 0.963 | 0.960 | 0.911 | 0.957 | 0.955 | 0.957 | 0.958 | 0.950 | 0.955 | 0.943 | 0.954 | 0.956 | 0.945
SAT4 | 0.857 | 0.867 | 0.803 | 0.863 | 0.866 | 0.849 | 0.858 | 0.858 | 0.858 | 0.844 | 1.000 | 0.844 | 0.854 | 0.746 | 0.779 | 0.792 | 0.757 | 0.581 | 0.769 | 0.711 | 0.776
SAT5 | 0.944 | 0.945 | 0.928 | 0.925 | 0.944 | 0.944 | 0.944 | 0.942 | 0.941 | 0.920 | 1.000 | 0.917 | 0.913 | 0.847 | 0.823 | 0.855 | 0.853 | 0.842 | 0.927 | 0.921 | 0.927
SEG0 | 0.998 | 0.965 | 0.970 | 0.995 | 0.993 | 0.994 | 0.996 | 0.992 | 0.988 | 0.997 | 0.895 | 0.995 | 0.993 | 0.993 | 0.992 | 0.994 | 0.993 | 0.994 | 0.994 | 0.993 | 0.994
SEG1 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 0.988 | 1.000 | 0.991 | 0.999 | 1.000 | 1.000 | 1.000 | 0.999 | 0.999 | 1.000 | 0.925 | 0.909 | 0.914
SEG2 | 0.978 | 0.979 | 0.996 | 0.979 | 0.975 | 0.977 | 0.979 | 0.974 | 0.977 | 0.981 | 0.983 | 0.978 | 0.965 | 0.982 | 0.983 | 0.981 | 0.980 | 0.977 | 0.965 | 0.964 | 0.963
SEG3 | 0.973 | 0.967 | 0.976 | 0.963 | 0.971 | 0.973 | 0.972 | 0.970 | 0.911 | 0.971 | 0.952 | 0.961 | 0.961 | 0.969 | 0.966 | 0.970 | 0.974 | 0.957 | 0.961 | 0.960 | 0.957
SEG4 | 0.872 | 0.836 | 0.907 | 0.926 | 0.927 | 0.906 | 0.852 | 0.926 | 0.514 | 0.863 | 0.850 | 0.886 | 0.903 | 0.787 | 0.811 | 0.650 | 0.656 | 0.630 | 0.793 | 0.741 | 0.840
SEG5 | 0.999 | 1.000 | 1.000 | 0.999 | 0.994 | 0.993 | 0.981 | 0.992 | 0.980 | 0.998 | 0.957 | 0.975 | 0.990 | 0.995 | 0.998 | 0.994 | 0.990 | 0.999 | 0.918 | 0.996 | 0.994
SEG6 | 0.995 | 1.000 | 1.000 | 0.995 | 0.995 | 0.995 | 0.995 | 0.995 | 0.994 | 0.995 | 0.970 | 0.995 | 0.987 | 0.995 | 0.995 | 0.995 | 0.995 | 0.995 | 0.912 | 0.922 | 0.887
92A0 | 0.937 | 0.963 | 0.940 | 0.945 | 0.947 | 0.942 | 0.921 | 0.937 | 0.943 | 0.927 | 0.862 | 0.928 | 0.939 | 0.926 | 0.918 | 0.921 | 0.926 | 0.845 | 0.924 | 0.894 | 0.922
92A1 | 0.881 | 0.902 | 0.942 | 0.910 | 0.910 | 0.896 | 0.825 | 0.910 | 0.918 | 0.867 | 0.948 | 0.908 | 0.899 | 0.854 | 0.865 | 0.682 | 0.658 | 0.704 | 0.777 | 0.787 | 0.868
92A2 | 0.853 | 0.861 | 0.858 | 0.861 | 0.834 | 0.848 | 0.845 | 0.844 | 0.856 | 0.839 | 0.833 | 0.850 | 0.842 | 0.851 | 0.828 | 0.843 | 0.843 | 0.838 | 0.829 | 0.846 | 0.786
92A3 | 0.880 | 0.869 | 0.858 | 0.874 | 0.840 | 0.879 | 0.880 | 0.852 | 0.882 | 0.881 | 0.867 | 0.879 | 0.877 | 0.860 | 0.802 | 0.817 | 0.849 | 0.876 | 0.774 | 0.829 | 0.683
92A4 | 0.987 | 0.997 | 0.997 | 0.981 | 0.975 | 0.980 | 0.977 | 0.974 | 0.982 | 0.986 | 0.983 | 0.974 | 0.973 | 0.977 | 0.974 | 0.975 | 0.976 | 0.968 | 0.977 | 0.975 | 0.975
92A5 | 0.995 | 1.000 | 1.000 | 0.993 | 0.987 | 0.978 | 0.965 | 0.985 | 0.989 | 0.990 | 1.000 | 0.974 | 0.977 | 0.988 | 0.987 | 0.968 | 0.955 | 0.971 | 0.946 | 0.902 | 0.912
Ranks | 5.500 | 6.129 | 6.171 | 6.471 | 7.000 | 7.143 | 7.443 | 7.643 | 7.857 | 9.600 | 9.871 | 11.700 | 12.971 | 13.586 | 13.771 | 14.800 | 15.586 | 16.029 | 16.871 | 17.229 | 17.629
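The Ranks row at the bottom of Table 2 (which also drives the x axis of Figure 2) can be reproduced by ranking the methods on each dataset by AUC and averaging over datasets. Below is a minimal sketch assuming the usual convention of average ranks for ties; the two rows and three method columns are excerpts from Table 2 used purely for illustration.

```python
# A sketch of how the Ranks row of Table 2 can be derived: for each dataset,
# methods are ranked by AUC (rank 1 = highest AUC, ties receive the average
# rank), and ranks are then averaged over all datasets.

import numpy as np
from scipy.stats import rankdata

methods = ["ROS", "SDSAO", "DyS"]            # first three columns only
auc = np.array([
    [0.976, 0.986, 0.985],                   # CAY0
    [0.969, 0.979, 0.975],                   # CAY1
])

ranks = np.vstack([rankdata(-row) for row in auc])   # negate so high AUC -> rank 1
print(dict(zip(methods, ranks.mean(axis=0))))        # average rank per method
```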


In accordance with Table 2 and Figure 1, most over-sampling methods present a better classification performance than the standard BP, with statistical significance. The under-sampling methods do not present a statistically significant difference with respect to the standard BP performance, whereas all dynamic sampling approaches improve the standard BP performance with statistical significance.

ADASYN, SPIDER-1 and SPIDER-2 (over-sampling methods) and RUS, NCL and TL (under-sampling methods) show a trend of improving the classification results, but they do not significantly improve on the standard BP performance. OSS, CNN and CNNTL, in turn, classify worse than the standard BP; nevertheless, these approaches do not show a statistically significant difference from it.

SDSAO, SDSAS and DyS are statistically better than ADASYN, SPIDER-1 and SPIDER-2 (over-sampling methods), as well as than all under-sampling approaches studied in this work. With a statistical difference, the DOS performance is better than that of CNN, CNNTL, OSS and TL.

Table 2 shows a trend for ROS to present a better performance than the proposed methods (SDSAO and SDSAS), and for DyS to show a slight advantage over SDSAS; however, in accordance with the Holm–Shaffer post-hoc tests, no statistically significant difference in classification performance exists among these methods (see Figure 1).

In general terms, most over-sampling methods and dynamic sampling approaches are successful at dealing with the class imbalance problem, but with respect to the training dataset size, SDSAS, SDSAO and DyS use significantly fewer samples than the over-sampling approaches: about 78% fewer than most over-sampling methods. In addition, SDSAS, SDSAO and DyS still use fewer samples (about 60% fewer) than the standard BP trained with the original training dataset; these facts stand out in Figure 2. However, the DyS method applies ROS in each epoch or iteration (see Section 4), whereas SDSA applies ROS or SMOTE only once, before ANN training (see Section 2).

Figure 2 shows that the under-sampling methods employ significantly fewer samples than the rest of the techniques (except for the dynamic sampling approaches with respect to RUS, NCL and TL); however, their classification performance is in most cases either worse than that of the standard BP (without statistical significance) or not significantly better than it.

On the other hand, the worst methods studied in this paper (in agreement with Table 2) are those based on the CNN technique (OSS, CNN and CNNTL), i.e., those that use the k-NN rule as their basis and achieve a substantial size reduction of the training dataset. In contrast, NCL, which also belongs to the k-NN family, improves the classification performance of back-propagation; however, the dataset size reduction reached by this method is not of CNN's magnitude and, in addition, it only eliminates majority samples. The use of TL (TL and SMOTE-TL) seems to increase the classification performance, but it does not eliminate many samples (see Figure 2); the exception is CNNTL, where we consider that the large reduction of the training dataset cancels the positive effect of TL. SMOTE-ENN does not seem to improve the classification performance of SMOTE, in spite of including a cleaning step that removes both majority and minority samples. The methods that succeed in enhancing classifier performance are those that eliminate samples only from the majority class.

Furthermore, analyzing only the selective sampling methods (SL-SMOTE, B-SMOTE, ADASYN, SPIDER-1 and SPIDER-2), i.e., those in which the most appropriate samples are selected to be over-sampled: in the results presented in Figure 2, SL-SMOTE and B-SMOTE obtain the best results, whereas the advantages of ADASYN, SPIDER-1 and SPIDER-2 are not clear (RUS often outperforms these approaches, although without statistical significance; Figure 1).

SL-SMOTE, B-SMOTE and the proposed method do not show statistically significant differences in their classification results, but the number of samples used by SDSA in the training stage is smaller than that employed by SL-SMOTE and B-SMOTE (see Figure 2).

Focusing on the analysis of the dynamic sampling approaches, SDSAO presents a slight performance advantage over DyS and SDSAS, whereas DOS does not seem to be an attractive method. However, the aim of DOS is to identify a suitable over-sampling rate while reducing the processing time and storage requirements and keeping or increasing the ANN performance, i.e., to obtain a trade-off between classification performance and computational cost.

SDSA and DyS improve the classification performance by including a selective process, but while DyS tries to reduce the over-sampling ratio during the training (i.e., it applies the ROS method in each epoch with a different class imbalance ratio; see Section 4), SDSA only tries to use the "best samples" to train the ANN (see Section 2).
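The contrast between the two regimes can be sketched as follows. This is a schematic, not the authors' implementation: apply_ros is a straightforward duplicate-the-minority ROS, while the per-epoch ratio schedule and the "best samples" selection rule are placeholder stubs standing in for the procedures detailed in Sections 4 and 2, respectively.

```python
# Schematic contrast of DyS (re-sample every epoch) vs. SDSA (re-sample once,
# then select samples). Labels are assumed to be 0 = majority, 1 = minority.

import numpy as np

rng = np.random.default_rng(0)

def apply_ros(x, y, ratio=1.0):
    """Random over-sampling: duplicate minority samples until the
    minority/majority ratio reaches `ratio` (1.0 = fully balanced)."""
    minority = np.flatnonzero(y == 1)
    majority = np.flatnonzero(y == 0)
    target = int(ratio * len(majority))
    extra = rng.choice(minority, size=max(target - len(minority), 0), replace=True)
    return np.vstack([x, x[extra]]), np.concatenate([y, y[extra]])

def select_best_samples(net, x, y):
    # Placeholder: SDSA's actual selection rule (Section 2) decides which
    # samples are most useful for training; here we simply keep everything.
    return x, y

def train_one_epoch(net, x, y):
    pass  # stand-in for one back-propagation pass over (x, y)

def train_dys(net, x, y, epochs):
    # DyS: ROS is re-applied in every epoch, with a varying imbalance ratio.
    for epoch in range(epochs):
        ratio = (epoch + 1) / epochs               # placeholder schedule
        xs, ys = apply_ros(x, y, ratio)
        train_one_epoch(net, xs, ys)

def train_sdsa(net, x, y, epochs):
    # SDSA: over-sample once (ROS or SMOTE) before training, then train on
    # the selectively-chosen "best samples".
    xs, ys = apply_ros(x, y)
    for epoch in range(epochs):
        xb, yb = select_best_samples(net, xs, ys)
        train_one_epoch(net, xb, yb)
```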

Dynamic sampling approaches are a very attractive way to deal with the class imbalance problem. They address two important goals: (i) improving the classification performance; and (ii) reducing the classifier's computational cost.
