From the literature, we observe that none of the existing studies used evolutionary-based features for this task, even though such features have produced excellent results for other PTM prediction problems. In particular, evolutionary information extracted from the PSSM and transformed into a profile bigram has proven to be a powerful feature extraction technique for PTM prediction. Therefore, we first extracted evolutionary-based profile bigrams as features and applied different machine learning algorithms to assess the influence of this feature set on methylation site prediction. The results achieved using evolutionary-based profile bigrams as features under 10-fold cross-validation are given in Table 4.1.
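To illustrate the transformation, the profile bigram can be sketched as follows. This is a minimal NumPy sketch, not our production code: given a PSSM window P of length L, the bigram feature B(i, j) accumulates P(k, i) · P(k+1, j) over consecutive positions k, yielding a fixed 400-dimensional vector regardless of window size. The function name and toy data are illustrative.

```python
import numpy as np

def pssm_bigram(pssm):
    """Profile-bigram transform: for a PSSM window of shape (L, 20),
    accumulate the outer product of every pair of consecutive rows,
    giving a fixed 20x20 (= 400-dimensional) feature regardless of L."""
    bigram = np.zeros((20, 20))
    for k in range(pssm.shape[0] - 1):
        bigram += np.outer(pssm[k], pssm[k + 1])
    return bigram.ravel()

# toy window of 9 residues with random substitution scores
rng = np.random.default_rng(0)
window = rng.standard_normal((9, 20))
features = pssm_bigram(window)
print(features.shape)  # (400,)
```

Because the output size depends only on the 20-letter alphabet and not on L, windows of any length map to the same feature dimensionality.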
From Table 4.1 we can see that most of the classifiers (except Gaussian NB and Bernoulli NB) perform well across the different evaluation criteria. We can also observe that the Naive Bayes-based algorithms, which rely on the principle of conditional probability, do not suit this prediction method, where we use evolutionary-based information converted to a
4.2 Experimental Results
Classifier                ACC (%)  MCC   Precision (%)  ROC AUC (%)  F1 score  Sensitivity (%)  Specificity (%)
Logistic Regression (LR)  97.4     0.94  97.8           97.4         0.96      96.9             97.6
Decision Tree (DT)        92.1     0.83  92.7           91.1         0.90      91.5             93.1
Gradient Boosting (GB)    95.1     0.90  95.7           94.3         0.93      93.2             95.8
Gaussian NB               73.5     0.49  84.5           73.3         0.67      56.6             90.0
AdaBoost                  98.1     0.96  97.6           98.3         0.97      97.4             98.6
Bernoulli NB              87.3     0.75  82.9           87.4         0.87      94.1             81.8
MLP                       97.2     0.94  97.8           97.0         0.97      96.6             97.9
SVM                       98.7     0.96  98.2           97.9         0.98      98.4             98.8

Table 4.1: Results of different classifiers using evolutionary-based profile bigram features under 10-fold cross-validation
profile bigram as the feature set. We can also observe that the neural network-based MLP and the ensemble methods AdaBoost and Gradient Boosting achieve results very close to the single classifiers Logistic Regression and SVM. This demonstrates the generality of our extracted features and the impact of evolutionary-based features on methylation site prediction.
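The evaluation protocol above can be sketched with scikit-learn. This is a minimal illustration on synthetic data standing in for the 400-dimensional bigram features; the classifier settings and dataset shape are assumptions, not our tuned configuration:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.svm import SVC

# synthetic stand-in for the 400-dimensional profile-bigram features
X, y = make_classification(n_samples=300, n_features=400,
                           n_informative=40, random_state=1)
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=1)
results = {}
for name, clf in [("LR", LogisticRegression(max_iter=2000)),
                  ("SVM", SVC()),
                  ("AdaBoost", AdaBoostClassifier(random_state=1))]:
    scores = cross_val_score(clf, X, y, cv=cv, scoring="accuracy")
    results[name] = scores.mean()
    print(f"{name}: mean accuracy {scores.mean():.3f}")
```

StratifiedKFold preserves the class ratio in every fold, which matters when comparing classifiers on balanced or near-balanced data.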
In order to investigate further, we also generated independent test set results for the evolutionary-based profile bigram features with the different machine learning algorithms. Our observations of their performance are given in Table 4.2. From the table we can see that, here too, all classifiers except the Naive Bayes-based ones provide good results across the different performance evaluation criteria. The Decision Tree classifier gives slightly lower results than the other ensemble and single classifiers; however, its results cannot be ignored, as its accuracy, precision, ROC AUC, sensitivity, and specificity are all above 90%. The other ensemble and single classifiers achieve accuracy, precision, ROC AUC, sensitivity, and specificity over 95%. From both Table 4.1 and Table 4.2, where the 10-fold cross-validation and independent test set results for the evolutionary-based profile bigram features are given, we can state that evolutionary-based information has a strong influence on methylation site prediction, yielding very promising results on the different performance measures across different kinds of machine learning algorithms.
Classifier                ACC (%)  MCC   Precision (%)  ROC AUC (%)  F1 score  Sensitivity (%)  Specificity (%)
Logistic Regression (LR)  97.1     0.93  96.8           97.2         0.96      97.3             96.9
Decision Tree (DT)        91.2     0.82  90.7           91.3         0.91      91.5             90.9
Gradient Boosting (GB)    96.2     0.92  95.5           96.3         0.96      96.8             95.8
Gaussian NB               74.2     0.52  88.4           73.9         0.67      54.9             93.0
AdaBoost                  97.1     0.94  96.8           97.1         0.97      97.3             96.9
Bernoulli NB              85.9     0.72  82.5           86.1         0.86      90.6             81.3
MLP                       97.8     0.94  96.5           97.8         0.97      98.8             96.5
SVM                       97.3     0.94  98.3           97.6         0.97      97.9             98.2

Table 4.2: Results of different classifiers using evolutionary-based profile bigram features on the independent test set

To investigate further, and in the hope of improving the results, we incorporate the predicted structural features together with the evolutionary-based bigram profile and apply the same algorithms to identify how they perform with the combined features. From the literature we have already identified that structural information has a great influence on methylation site prediction.
Therefore, we want to incorporate these two important feature types together in this prediction technique to achieve better results. In this regard, we build our model LyMethSE using the combined features of evolutionary-based bigram profile information and predicted structural information. Since Tables 4.1 and 4.2 show that evolutionary-based information works very well with SVM, we use SVM as the base classifier to predict methylation sites in LyMethSE. The results achieved using the combined feature extraction method for our model under 10-fold cross-validation are given in Table 4.3.
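The combination step itself is feature concatenation ahead of the SVM. The sketch below uses toy data; the width of the structural block (12) and the RBF kernel are illustrative assumptions, not the exact LyMethSE configuration:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(2)
n = 200
evo = rng.standard_normal((n, 400))    # evolutionary profile-bigram block
struct = rng.standard_normal((n, 12))  # predicted structural block (width assumed)
X = np.hstack([evo, struct])           # simple concatenation of the two views
y = rng.integers(0, 2, size=n)

# SVM base classifier; scaling keeps the two feature blocks comparable
model = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
model.fit(X, y)
print(X.shape)  # (200, 412)
```

Standardizing before the SVM prevents the larger bigram block from dominating the kernel distances purely by scale.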
From the table we can see that LyMethSE achieves satisfying results in comparison with the other classifiers. In Table 4.3 we find that LyMethSE attains 98.5% accuracy, 0.97 MCC, 98.6% precision, 98.5% ROC AUC, 0.99 F1 score, 98.5% sensitivity, and 98.5% specificity. In comparison with the other classifiers, LyMethSE outperforms on all the evaluation criteria. However, the results of the other classifiers cannot be ignored: except for the Naive Bayes-based classifiers, almost all of them perform very well with the combined features and also provide very promising results on the different performance measures.
Classifier                ACC (%)  MCC   Precision (%)  ROC AUC (%)  F1 score  Sensitivity (%)  Specificity (%)
Logistic Regression (LR)  97.3     0.95  97.6           97.3         0.97      97.1             97.6
Decision Tree (DT)        91.4     0.83  92.1           91.4         0.91      90.9             91.9
Gradient Boosting (GB)    96.1     0.92  97.2           96.1         0.96      95.0             97.3
Gaussian NB               74.6     0.51  84.3           74.7         0.71      60.9             88.4
AdaBoost                  97.7     0.96  97.8           97.7         0.98      97.8             97.7
Bernoulli NB              86.6     0.74  82.1           86.6         0.88      94.4             78.9
MLP                       97.0     0.94  97.5           97.0         0.98      96.5             97.5
Rotation Forest           93.0     0.86  93.6           93.0         0.93      92.5             93.5
LyMethSE                  98.5     0.97  98.6           98.5         0.99      98.5             98.5

Table 4.3: Results of LyMethSE along with other classifiers using 10-fold cross-validation

To investigate the combined features further, we also generate independent test set results for the same classifiers used in 10-fold cross-validation in order to compare the results. The results achieved on the independent test set are given in Table 4.4.
Classifier                ACC (%)  MCC   Precision (%)  ROC AUC (%)  F1 score  Sensitivity (%)  Specificity (%)
Logistic Regression (LR)  98.0     0.96  99.1           98.0         0.98      96.9             99.1
Decision Tree (DT)        93.5     0.87  95.3           93.5         0.93      91.5             95.5
Gradient Boosting (GB)    96.2     0.92  96.8           96.2         0.96      95.5             96.8
Gaussian NB               73.5     0.50  84.9           73.6         0.69      57.6             89.5
AdaBoost                  98.4     0.97  100            98.4         0.98      96.9             100
Bernoulli NB              89.2     0.79  85.4           89.2         0.89      94.6             83.7
MLP                       97.5     0.95  99.5           97.5         0.97      95.5             99.5
Rotation Forest           91.9     0.84  93.1           91.9         0.92      90.6             93.2
LyMethSE                  98.9     0.98  100            98.9         0.99      97.8             100

Table 4.4: Results of LyMethSE along with other classifiers using the independent test set
As shown in Table 4.4, LyMethSE again outperforms the other eight classifiers and produces significant results on the independent test set as well. LyMethSE achieves 98.9%, 100%, 97.8%, 100%, 0.98, and 0.99 in terms of accuracy, precision, sensitivity, specificity, MCC, and F1 score, respectively, which is the best result on all the evaluation criteria. Achieving consistently high results for both 10-fold cross-validation and the independent test set demonstrates the generality of LyMethSE and its preference over the other classifiers with the combined features. The other machine learning algorithms also perform well here, producing results very close to those of LyMethSE. This confirms the success of our feature extraction method and of combining all the important information to predict methylation sites.
The results in Table 4.3 and Table 4.4 reflect the performance of the different classifiers with the combined features of evolutionary bigram profile and structural information. Analyzing the results further, we find that the classification results do not vary much, and almost all classifiers (except GNB) achieve prediction accuracies above 90% on average. This high prediction performance demonstrates the effectiveness of using the bigram profile to extract evolutionary information from the PSSM and combining it with predicted structural information to tackle this problem. This feature extraction process also keeps the feature size constant across different window sizes, which helps our model obtain discriminatory information while keeping the computational cost low for large amounts of data. Therefore, LyMethSE not only provides high performance on the different evaluation scores but can also process large numbers of protein sequences with fixed-size features, which is beneficial in terms of time consumption.
From the discussion above, we can say that the evolutionary-based bigram profile features alone yield very good results across different classifiers, and that combining them with predicted structural information improves the results a little further. To compare the two settings, the evolutionary-based bigram profile alone versus its combination with predicted structural information, we generate bar charts of the MCC and F1 scores for the results achieved with both feature extraction methods. The bar charts of MCC and F1 score for 10-fold cross-validation are shown in Figure 4.1, and the corresponding charts for the independent test set are given in Figure 4.2.
Since our dataset was imbalanced and we balanced it using an undersampling method, we focus the comparison on the MCC and F1 scores. From the bar charts we can see that, for most of the classifiers, the combined features work slightly better in terms of MCC and F1 score for both cross-validation and the independent test set. This shows that combining the predicted structural information with the evolutionary-based bigram profile helps the model improve its results across the different evaluation criteria.

Figure 4.1: (a) MCC score and (b) F1 score of the classifiers for 10-fold cross-validation, using the evolutionary-based bigram features alone and combined with predicted structural information
To provide more insight into our results, we generate the Receiver Operating Characteristic (ROC) curves for the different classifiers and for our model LyMethSE. The ROC curves for 10-fold cross-validation and the independent test set are given in Figure 4.3.

In Figure 4.3, the x-axis shows the False Positive Rate (FPR) and the y-axis shows the True Positive Rate (TPR). The TPR corresponds to the sensitivity, while the FPR corresponds to one minus the specificity. We want to reach the optimal point of the TPR vs. FPR graph, where the TPR reaches its maximum while the FPR remains at its minimum. The left ROC curve reflects the results for 10-fold cross-validation, whereas the right one is for the independent test results.
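A ROC curve is obtained by sweeping a threshold over the classifier's decision scores. A minimal scikit-learn sketch with hypothetical scores (the values are illustrative, not from our experiments):

```python
import numpy as np
from sklearn.metrics import auc, roc_curve

# hypothetical decision scores for six test samples
y_true = np.array([0, 0, 1, 1, 1, 0])
scores = np.array([0.1, 0.4, 0.35, 0.8, 0.9, 0.2])

# each threshold yields one (FPR, TPR) point on the curve
fpr, tpr, thresholds = roc_curve(y_true, scores)
roc_auc = auc(fpr, tpr)
print(round(roc_auc, 3))  # 0.889
```

The AUC equals the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one.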
Figure 4.2: (a) MCC score and (b) F1 score of the classifiers on the independent test set, using the evolutionary-based bigram features alone and combined with predicted structural information

Figure 4.3: ROC curves of TPR vs. FPR for (a) 10-fold cross-validation and (b) the independent test set

As shown in Figure 4.3, LyMethSE achieves better results than the other classifiers, which demonstrates the generality and effectiveness of the model. However, the ROC curves also show promising results for some of the other classifiers, such as AdaBoost (AB), Logistic Regression (LR), and Gradient Boosting (GB). Note that these classifiers use the same features extracted with our approach, and the promising, competitive results they achieve demonstrate the effectiveness of this feature extraction technique for the methylation site prediction task. From Figure 4.3 (a) we can see that the AUC of LyMethSE is 0.994 for cross-validation, and from Figure 4.3 (b) that it is 0.999 for the independent test set. Although AdaBoost (AB) and Gradient Boosting (GB) report the same area under the ROC curve for cross-validation, as do AdaBoost (AB) and Logistic Regression (LR) for the independent test set, a closer look at the curves shows that LyMethSE holds the best area in both cases.

Figure 4.4: Precision-Recall curves for the LyMethSE model and the other classifiers
As discussed in the literature, achieving even the best AUC for the ROC curve does not guarantee the best area under the Precision-Recall (PR) curve [67]. Therefore, Figure 4.4 shows the precision-recall curves of LyMethSE compared to the other employed classifiers. In these curves, the x-axis shows the recall and the y-axis shows the precision.

As shown in Figure 4.4, LyMethSE attains an AUC of 0.999 for the PR curve, which is in line with its ROC results and better than those of the other classifiers. From the curves we can also see that Logistic Regression (LR) and AdaBoost (AB) reach nearly the same area under the PR curve. However, closer analysis of the curves shows that LyMethSE holds the best result for precision and recall. At the same time, the strong results close to those of LyMethSE reflect the benefit of using combined evolutionary and predicted structural features in the field of methylation site prediction.
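The PR curve is computed from the same decision scores as the ROC curve; a minimal sketch with the same hypothetical scores (illustrative, not our experimental data):

```python
import numpy as np
from sklearn.metrics import auc, precision_recall_curve

y_true = np.array([0, 0, 1, 1, 1, 0])
scores = np.array([0.1, 0.4, 0.35, 0.8, 0.9, 0.2])

# each threshold yields one (recall, precision) point
precision, recall, _ = precision_recall_curve(y_true, scores)
pr_auc = auc(recall, precision)
print(0.5 < pr_auc <= 1.0)  # True
```

Unlike the ROC curve, the PR curve ignores true negatives, which is why a model with an excellent ROC AUC can still show a weaker PR AUC on skewed data.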
While the investigations above demonstrate the generality of our model and feature extraction technique, all of these experiments used a dataset balanced with a KNN-based undersampling method. To ensure that no important data was discarded during undersampling, we decided to implement an oversampling method as well and compare the outcomes. For this, we applied the most common and widely used minority-class oversampling method, the Synthetic Minority Over-sampling Technique (SMOTE). Our dataset contained 1116 positive instances, representing methylated sites, and 40857 negative instances, representing non-methylated sites. We applied SMOTE to oversample the positive class using Euclidean distance, after which the dataset contained 40857 instances of each class. On this balanced dataset, with the combined features of evolutionary bigram and predicted structural information, we applied the algorithms that previously performed best on the independent test set. The results are given in Table 4.5. From them we can see that all the machine learning algorithms that performed best with KNN-based undersampling give much lower results with SMOTE oversampling. Therefore, we conclude that this oversampling method adds unnecessary information, so the important information is not considered properly. It also confirms that the KNN-based undersampling of the majority class performed well for our dataset, extracting and keeping all the important information.
Classifier                ACC (%)  MCC   Precision (%)  ROC AUC (%)  F1 score  Sensitivity (%)  Specificity (%)
Logistic Regression (LR)  73.2     0.46  72.4           73.1         0.74      75.6             70.6
Decision Tree (DT)        93.3     0.86  91.5           93.3         0.93      95.7             90.9
Gradient Boosting (GB)    86.9     0.73  85.1           86.8         0.87      89.7             84.0
AdaBoost                  82.4     0.64  80.9           82.4         0.83      85.2             79.5
MLP                       97.8     0.95  97.6           97.8         0.97      98.0             97.6
SVM                       92.0     0.84  89.2           92.0         0.92      95.8             88.2

Table 4.5: Results of different classifiers using the combined features with SMOTE minority-class oversampling
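The interpolation idea behind SMOTE can be sketched in a few lines. This is a simplified stand-in, not the implementation used in our experiments: each synthetic point is a random interpolation between a minority sample and one of its k nearest minority-class neighbours under Euclidean distance. Function name and toy data are illustrative.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote_like(X_min, n_new, k=5, seed=0):
    """Minimal SMOTE-style oversampler for the minority class."""
    rng = np.random.default_rng(seed)
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
    _, idx = nn.kneighbors(X_min)           # idx[:, 0] is the point itself
    synth = np.empty((n_new, X_min.shape[1]))
    for t in range(n_new):
        i = rng.integers(len(X_min))
        j = idx[i, rng.integers(1, k + 1)]  # pick one of the k neighbours
        gap = rng.random()                  # interpolation factor in [0, 1)
        synth[t] = X_min[i] + gap * (X_min[j] - X_min[i])
    return synth

rng = np.random.default_rng(1)
minority = rng.standard_normal((30, 8))     # toy minority-class samples
synthetic = smote_like(minority, n_new=70)
print(synthetic.shape)  # (70, 8)
```

Because every synthetic point lies on a segment between two real minority samples, SMOTE can only densify the region the minority class already occupies; it cannot recover majority-class structure lost elsewhere, which is consistent with the weaker results in Table 4.5.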
The comparison of the results achieved by LyMethSE with those of the other classifiers indicates the preference for our proposed model over the other classifiers used for this task. In