5.2 Scalar Evaluation Measures
5.2.3 Comparison
Ensemble Methods
The scalar evaluation measures are given for the RAkEL and ECC ensemble methods for dataset 2 in table 5.12. The highest value for each measure is shown in bold. As for dataset 1, the RAkEL method produces no blanket positive predictions, which means that it can provide predictive information for every label. The highest values for TNR, MCC and DP produced by the RAkEL technique are all for tenofovir. This means that RAkEL performs best at predicting this label, which is confirmed by the fact that the AUC value is also high (0.70).
The ECC technique trained and validated using dataset 2 shows the best results out of all the experiments performed. It has the highest mean values for the TPR, MCC, DP and AUC evaluation measures across all experiments on both datasets. ECC has a particularly high value (and low standard deviation) for the AUC measure for the efavirens label (0.83) and has a TPR value of 0.97 for the same label. This means that ECC correctly detects resistance to efavirens in 97% of cases for dataset 2. The ECC technique also has a TPR value above 0.95 for six different labels (efavirens, emtricitabine, delavirdine, nevirapine, etravirine and lamivudine). For the same six labels, ECC has an AUC of over 0.71 in each case.
Table 5.13. Problem transformation methods evaluation measure differences. The values are generated by subtracting the values generated for dataset 1 from the corresponding values for dataset 2. The absolute values are shown and bold values indicate improved performance on dataset 2, while grey values imply that performance was better on dataset 1.
BR-SVM BR-NB HOMER
TPR TNR MCC DP TPR TNR MCC DP AUC TPR TNR MCC DP AUC
efavirens 0.00 0.00 - - 0.39 0.59 0.11 0.45 0.21 0.28 0.60 0.21 0.72 0.16
didanosine 0.16 0.04 0.22 0.56 0.16 0.10 0.07 0.17 0.03 0.14 0.16 0.03 0.10 0.04
emtricitabine 0.00 0.00 - - 0.01 0.08 0.07 0.16 0.11 0.24 0.50 0.14 0.19 0.16
delavirdine 0.00 0.00 - - 0.40 0.57 0.11 0.39 0.16 0.33 0.52 0.12 0.39 0.06
stavudine 0.08 0.06 0.00 0.05 0.07 0.07 0.01 0.04 0.01 0.01 0.09 0.10 0.26 0.04
nevirapine 0.00 0.00 - - 0.39 0.59 0.11 0.45 0.21 0.28 0.60 0.21 0.72 0.16
tenofovir 0.16 0.00 0.17 0.43 0.20 0.10 0.05 0.07 0.08 0.02 0.01 0.04 0.12 0.06
etravirine 0.01 0.00 0.05 0.72 0.24 0.47 0.18 0.43 0.10 0.09 0.27 0.16 0.49 0.07
abacavir 0.08 0.10 0.18 0.43 0.03 0.05 0.02 0.06 0.01 0.13 0.11 0.01 0.02 0.06
lamivudine 0.00 0.00 - - 0.01 0.08 0.07 0.16 0.11 0.24 0.50 0.14 0.19 0.16
zidovudine 0.02 0.02 0.01 0.01 0.06 0.04 0.02 0.01 0.02 0.01 0.05 0.05 0.14 0.01
mean 0.03 0.02 0.10 0.37 0.12 0.21 0.07 0.22 0.09 0.15 0.28 0.07 0.20 0.06
std dev 0.02 0.02 0.08 0.17 0.02 0.06 0.00 0.01 0.01 0.11 0.25 0.04 0.06 0.00
BR-SVM - Binary relevance with support vector machine base classifier, BR-NB - Binary relevance with naive Bayes base classifier, HOMER - Hierarchy of multi-label classifiers, TPR - True positive rate, TNR - True negative rate, MCC - Matthews correlation coefficient, DP - Discriminative power, AUC - Area under the receiver operating characteristic curve
and subtracting the corresponding value produced for dataset 1 (0.26 shown in table 5.7). Only the absolute values of the differences are shown in the tables and a positive difference (i.e. the technique performs better on dataset 2) is shown in bold, while negative differences (the technique performs better on dataset 1) are shown in grey. Note that a reduced standard deviation is considered an improvement, so the standard deviation values shown in bold indicate that the technique had a lower standard deviation on dataset 2.
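The convention used in the difference tables can be sketched in a few lines of Python. This is a minimal illustration, not the code used to produce the tables: the function name `measure_difference` and the `lower_is_better` flag are assumptions introduced here, and the standard deviation values in the usage example are hypothetical.

```python
def measure_difference(value_d1, value_d2, lower_is_better=False):
    """Return (absolute difference, status) for one table cell.

    value_d1 / value_d2 are the values of one evaluation measure on
    dataset 1 and dataset 2. The status is "improved" (shown in bold),
    "worse" (shown in grey) or "unchanged".
    Set lower_is_better=True for standard deviations, where a
    reduction counts as an improvement.
    """
    signed = round(value_d2 - value_d1, 2)
    if signed == 0:
        status = "unchanged"
    elif (signed < 0) == lower_is_better:
        status = "improved"
    else:
        status = "worse"
    return abs(signed), status


# DP for etravirine with BR-SVM: 0.31 on dataset 1, 1.04 on dataset 2.
print(measure_difference(0.31, 1.04))

# Hypothetical standard deviations: a drop from 0.10 to 0.08 is an improvement.
print(measure_difference(0.10, 0.08, lower_is_better=True))
```

Because the tables were presumably computed from unrounded values, a cell recomputed from rounded inputs can differ from the printed one in the last digit.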
Problem Transformation Methods
Table 5.13 shows the evaluation measure differences for the problem transformation techniques. There is no difference for the six labels that BR-SVM produces blanket positive predictions for. Most of the differences indicate that BR-SVM performs better on dataset 2. This is confirmed by the fact that the mean value for every evaluation measure has improved for dataset 2. The most notable increase for the BR-SVM technique is the DP value for etravirine, which increased from 0.31 to 1.04, more than tripling (an increase of about 235%). The value for all evaluation measures increased for both didanosine and abacavir, which implies that BR-SVM performs better on dataset 2 in all cases for these drugs.
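The dashes in the MCC and DP columns of table 5.13 are consistent with how these measures behave when a classifier predicts every instance as positive. The sketch below assumes the standard confusion-matrix definitions of TPR, TNR and MCC, and the common definition of discriminative power, DP = (√3/π)(ln(TPR/(1−TPR)) + ln(TNR/(1−TNR))), with natural logarithms. The function name `scalar_measures` and the example counts are hypothetical and do not come from either dataset.

```python
import math


def scalar_measures(tp, fp, tn, fn):
    """Compute TPR, TNR, MCC and DP from confusion-matrix counts.

    A measure that is undefined for the given counts is returned as None.
    """
    tpr = tp / (tp + fn) if (tp + fn) else None
    tnr = tn / (tn + fp) if (tn + fp) else None

    # MCC is undefined when any marginal total of the confusion matrix is zero.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else None

    # DP is undefined when either rate is exactly 0 or 1 (the log-odds diverge).
    if tpr in (None, 0.0, 1.0) or tnr in (None, 0.0, 1.0):
        dp = None
    else:
        dp = (math.sqrt(3) / math.pi) * (
            math.log(tpr / (1 - tpr)) + math.log(tnr / (1 - tnr))
        )
    return {"TPR": tpr, "TNR": tnr, "MCC": mcc, "DP": dp}


# Blanket positive prediction: nothing is ever predicted negative.
print(scalar_measures(tp=80, fp=20, tn=0, fn=0))

# An ordinary classifier with hypothetical counts.
print(scalar_measures(tp=70, fp=5, tn=15, fn=10))
```

For the blanket positive case TPR is 1 and TNR is 0, so neither MCC nor DP can be computed, which would explain why no difference can be reported for those cells.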
The evaluation measure differences for the BR-NB method shown in table 5.13 indicate improved performance overall on dataset 2. This is evident from the fact that the mean values for TNR, MCC, DP and AUC have all increased. The only mean value that did not increase was TPR, which decreased by 0.12. This implies that the BR-NB method detects, on average, 12 percentage points fewer cases of resistance on dataset 2. The standard deviations improved or stayed the same for TPR, TNR and MCC. The standard deviation worsened (i.e. increased) by 0.01 for both DP and AUC, meaning that the values obtained for these measures were distributed more widely for dataset 2 than for dataset 1.
As is the case with BR-NB, TPR for the HOMER technique has decreased for dataset 2. The mean value for the TPR evaluation measure has decreased by 0.15. The only two instances where the TPR value increased were for stavudine and zidovudine (both drugs in dataset 2 to which patients are overwhelmingly more susceptible than resistant). The evaluation measure that has increased most significantly for the HOMER method is TNR. The TNR value has increased by 0.60 for efavirens and nevirapine and by 0.50 or more for emtricitabine, delavirdine and lamivudine. In the case of efavirens, this means that HOMER is 5 times better at predicting susceptibility to efavirens for dataset 2 than for dataset 1, which is a considerable improvement. The improvement is similar for nevirapine, emtricitabine, delavirdine and lamivudine.
Algorithm Adaptation Methods
The scalar evaluation measure differences for the algorithm adaptation techniques are shown in table 5.14.
It can be seen that the TNR values for MLkNN have all either increased or stayed the same, meaning that MLkNN is better at detecting susceptibility to drugs on dataset 2 than on dataset 1. While the mean value increased for four out of the five measures, the standard deviation worsened (i.e. increased) for four out of the five. This implies that overall predictive power increased, but the values for the evaluation measures were spread more widely. Although not shown in table 5.14, it is important to notice that the MLkNN method only has two blanket positive predictions on dataset 2 (emtricitabine and lamivudine) compared to six on dataset 1 (efavirens, emtricitabine, delavirdine, nevirapine, etravirine and lamivudine). This is a large improvement and implies that MLkNN can provide meaningful predictive information for four more labels on dataset 2 than on dataset 1.
The values for all four evaluation measures decreased for the PCT technique for efavirens, emtricitabine, delavirdine, nevirapine and lamivudine. These are all drugs to which many more patients are resistant than susceptible (in both datasets). For all the other labels, the evaluation measure values have increased, with the exception of the TNR value for didanosine, which has decreased by a small amount (0.03). The most significant evaluation measure improvement for dataset 2 for the PCT technique is the DP value for tenofovir, which has increased by 0.54. This represents a more than 100% improvement for that measure. Overall, considering the sum of all the improvements (2.49) and comparing this to the sum of all the decreases in performance (3.16), it appears that the PCT method performs slightly better on dataset 1 than on dataset 2.
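The totals quoted above can be approximately re-derived from the PCT columns of table 5.14. The sketch below is an illustration only: the direction of each change is reconstructed from the prose (all four measures decreased for the five resistance-dominated labels, plus TNR for didanosine), since the bold and grey formatting is not available here, and the names `PCT_DIFFS` and `pct_totals` are introduced for this example.

```python
# Absolute differences (TPR, TNR, MCC, DP) for PCT, copied from table 5.14.
PCT_DIFFS = {
    "efavirens":     (0.04, 0.06, 0.14, 0.61),
    "didanosine":    (0.08, 0.03, 0.04, 0.07),
    "emtricitabine": (0.02, 0.01, 0.05, 0.22),
    "delavirdine":   (0.04, 0.05, 0.13, 0.59),
    "stavudine":     (0.07, 0.01, 0.08, 0.21),
    "nevirapine":    (0.04, 0.06, 0.14, 0.61),
    "tenofovir":     (0.17, 0.02, 0.20, 0.54),
    "etravirine":    (0.00, 0.01, 0.01, 0.03),
    "abacavir":      (0.06, 0.02, 0.07, 0.17),
    "lamivudine":    (0.02, 0.01, 0.05, 0.22),
    "zidovudine":    (0.14, 0.02, 0.13, 0.34),
}
MEASURES = ("TPR", "TNR", "MCC", "DP")

# Assumption: directions taken from the text, not from the table itself.
ALL_DECREASED = {"efavirens", "emtricitabine", "delavirdine",
                 "nevirapine", "lamivudine"}
EXTRA_DECREASES = {("didanosine", "TNR")}


def pct_totals():
    """Return (sum of improvements, sum of decreases), rounded to 2 places."""
    improved = decreased = 0.0
    for label, values in PCT_DIFFS.items():
        for measure, value in zip(MEASURES, values):
            if label in ALL_DECREASED or (label, measure) in EXTRA_DECREASES:
                decreased += value
            else:
                improved += value
    return round(improved, 2), round(decreased, 2)


print(pct_totals())  # (2.49, 3.14)
```

Under these assumptions the improvements sum to the quoted 2.49, while the decreases sum to 3.14 rather than 3.16. The small gap is presumably due to rounding in the printed table or to sign information that is not recoverable from the prose. The conclusion is unaffected, since the decreases outweigh the improvements either way.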
Table 5.14. Algorithm adaptation methods evaluation measure differences. The values are generated by subtracting the values generated for dataset 1 from the corresponding values for dataset 2. The absolute values are shown and bold values indicate improved performance on dataset 2, while grey values imply that performance was better on dataset 1.
MLkNN PCT
TPR TNR MCC DP AUC TPR TNR MCC DP
efavirens 0.00 0.02 - - 0.14 0.04 0.06 0.14 0.61
didanosine 0.01 0.11 0.13 0.48 0.04 0.08 0.03 0.04 0.07
emtricitabine 0.00 0.00 - - 0.13 0.02 0.01 0.05 0.22
delavirdine 0.00 0.02 0.09 - 0.06 0.04 0.05 0.13 0.59
stavudine 0.03 0.03 0.00 0.16 0.02 0.07 0.01 0.08 0.21
nevirapine 0.00 0.02 - - 0.14 0.04 0.06 0.14 0.61
tenofovir 0.01 0.01 0.10 - 0.02 0.17 0.02 0.20 0.54
etravirine 0.01 0.02 - - 0.11 0.00 0.01 0.01 0.03
abacavir 0.12 0.04 0.08 0.19 0.02 0.06 0.02 0.07 0.17
lamivudine 0.00 0.00 - - 0.13 0.02 0.01 0.05 0.22
zidovudine 0.01 0.00 0.04 0.53 0.07 0.14 0.02 0.13 0.34
mean 0.01 0.02 0.05 0.42 0.06 0.04 0.02 0.00 0.08
std dev 0.01 0.01 0.03 0.12 0.02 0.07 0.01 0.05 0.31
MLkNN - Multi-label k-nearest neighbours, PCT - Predictive clustering trees, TPR - True positive rate, TNR - True negative rate, MCC - Matthews correlation coefficient, DP - Discriminative power, AUC - Area under the receiver operating characteristic curve
Table 5.15. Ensemble methods evaluation measure differences. The values are generated by subtracting the values generated for dataset 1 from the corresponding values for dataset 2. The absolute values are shown and bold values indicate improved performance on dataset 2, while grey values imply that performance was better on dataset 1.
RAkEL ECC
TPR TNR MCC DP AUC TPR TNR MCC DP AUC
efavirens 0.24 0.45 0.10 0.18 0.13 0.03 0.28 0.36 - 0.15
didanosine 0.12 0.26 0.18 0.49 0.07 0.11 0.05 0.17 0.40 0.07
emtricitabine 0.13 0.26 0.07 0.03 0.05 0.02 0.21 0.24 0.38 0.15
delavirdine 0.22 0.59 0.27 0.83 0.10 0.04 0.28 0.18 - 0.11
stavudine 0.12 0.03 0.08 0.20 0.08 0.09 0.01 0.11 0.27 0.12
nevirapine 0.24 0.47 0.12 0.25 0.12 0.03 0.27 0.33 - 0.21
tenofovir 0.04 0.04 0.11 0.33 0.04 0.11 0.02 0.06 0.10 0.09
etravirine 0.20 0.38 0.18 0.54 0.09 0.03 0.23 0.29 1.15 0.15
abacavir 0.09 0.10 0.03 0.12 0.00 0.00 0.04 0.05 0.13 0.03
lamivudine 0.08 0.19 0.03 0.30 0.06 0.03 0.25 0.25 0.30 0.16
zidovudine 0.01 0.01 0.01 0.01 0.02 0.02 0.02 0.00 0.01 0.05
mean 0.10 0.25 0.11 0.24 0.07 0.02 0.14 0.18 0.45 0.12
std dev 0.07 0.13 0.01 0.10 0.01 0.05 0.13 0.06 0.11 0.00
RAkEL - Random k-labelsets, ECC - Ensemble of classifier chains, TPR - True positive rate, TNR - True negative rate, MCC - Matthews correlation coefficient, DP - Discriminative power, AUC - Area under the receiver operating characteristic curve
Ensemble Methods
The evaluation measure differences for the ensemble methods are given in table 5.15. As with BR-NB and HOMER, most of the TPR values have decreased on dataset 2 for RAkEL. The only labels for which the TPR values have increased are stavudine and tenofovir. Both of these labels correspond to drugs to which many more patients are susceptible than resistant. The only other drug for which this is the case is zidovudine, and the TPR value for zidovudine has only decreased by a very small amount (0.01). This means that overall, for the TPR measure, RAkEL has improved for the drugs to which patients are more susceptible than resistant, and worsened for drugs to which patients are more resistant than susceptible. The values for TNR, MCC, DP and AUC have all increased on average.
The AUC value has increased for all labels for dataset 2 for the ECC technique, with the value for nevirapine increasing by the largest amount (0.21). The MCC values have also all either increased or remained the same, and all but one of the DP values (zidovudine, which has only decreased by 0.01) have increased. The mean value for every evaluation measure has increased and the standard deviations have all improved (i.e. decreased) or remained the same. This means that ECC has improved by all measures on dataset 2, with an average increase of 0.14 for the TNR measure and 0.12 for AUC. Overall, the ECC technique has shown the most improvement on dataset 2, with 42 out of the total of 55 evaluation measures increasing and only 9 measures decreasing.