6.2 Experimental Results
patients were susceptible to drugs when very few such examples exist in the training datasets. The performance of the BR-SVM and MLkNN techniques on dataset 2 is similarly poor, although slightly better than on dataset 1. It follows that BR-SVM and MLkNN are not useful in a real-world clinical environment if the proportions of patients resistant and susceptible to the various drugs are not well balanced.
Dataset Effects
The poor performance of BR-SVM is surprising, since SVMs are a popular technique and have been applied to many problems. However, the results obtained are consistent with previous investigations on the application of SVMs to imbalanced data, which found that SVMs performed poorly in this case [103, 104].
Furthermore, it has been shown that SVMs will favour the majority class [80], which is exactly what can be seen in tables 5.7 and 5.10. These tables illustrate the blanket positive predictions generated by the SVMs.
The results obtained for MLkNN also correspond with previous studies that show that the nearest neighbours technique does not perform well on imbalanced datasets [105].
Techniques that perform well on the highly imbalanced labels (such as efavirenz and emtricitabine) perform less well on the relatively balanced labels (such as didanosine and abacavir). The techniques that exhibit this behaviour are BR-NB, HOMER, RAkEL and PCT on dataset 1 and all of the techniques except BR-SVM and MLkNN on dataset 2. In general, this effect is less prominent on dataset 2 since the models produce more consistent predictions (as shown by the reduced standard deviations). The fact that the predictive models seem unable to simultaneously produce accurate predictions for highly imbalanced as well as relatively balanced labels supports the idea that no single model should be used to assess resistance to all drugs. Instead, the results of multiple models should be combined into an ensemble to produce the best results. Using ensembles or committees in this context has been studied before and was shown to increase the performance of classification models trained for predicting response to HIV therapy [75].
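This section does not prescribe a particular combination scheme, but as a minimal sketch (with made-up probabilities, and model names used only for illustration), a committee could average the per-drug resistance probabilities produced by several trained models:

    import numpy as np

    # Hypothetical resistance probabilities from three trained models for
    # two patients and two drugs; all values are illustrative only.
    model_probs = np.array([
        [[0.9, 0.2], [0.4, 0.7]],   # model 1, e.g. ECC
        [[0.8, 0.1], [0.5, 0.6]],   # model 2, e.g. RAkEL
        [[0.7, 0.3], [0.3, 0.8]],   # model 3, e.g. PCT
    ])

    # Committee prediction: average the probabilities across models, then
    # threshold per drug to obtain resistant/susceptible calls.
    committee_probs = model_probs.mean(axis=0)
    print(committee_probs >= 0.5)

More sophisticated schemes (weighted voting, per-drug model selection) are possible, but simple averaging already illustrates how a committee can smooth out the weaknesses of individual models.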
It is also important to note that models that perform well on imbalanced labels do so regardless of whether patients are predominantly resistant or predominantly susceptible to the corresponding drug.
An example of this can be seen in figure 5.10 on page 93, where the ECC technique performs well at predicting resistance to nevirapine, to which patients are overwhelmingly more resistant than susceptible. The same method also performs well at predicting susceptibility to tenofovir, to which patients are much more susceptible than resistant.
Classification Technique Effects
The remarkably different results obtained by the BR-SVM and BR-NB techniques demonstrate the effect that the choice of base classifier can have (see tables 5.9 and 5.12). Therefore, comparing the performance of different multi-label classification techniques that make use of base classifiers is only meaningful when each uses the same base classifier. In this study the same base classifier is used for all of the BR-NB, HOMER, RAkEL and ECC techniques. PCT does not require a base classifier.
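To make the role of the base classifier concrete, the following sketch uses scikit-learn (an assumption; the tooling used in this study is not named here) to build two binary relevance models that differ only in the base classifier they wrap, mirroring BR-NB and BR-SVM:

    import numpy as np
    from sklearn.multioutput import MultiOutputClassifier
    from sklearn.naive_bayes import GaussianNB
    from sklearn.svm import SVC

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 6))          # illustrative patient features
    Y = rng.integers(0, 2, size=(100, 3))  # illustrative per-drug labels

    # Binary relevance trains one independent binary classifier per drug
    # label; the two models below differ only in the base classifier.
    br_nb = MultiOutputClassifier(GaussianNB()).fit(X, Y)   # cf. BR-NB
    br_svm = MultiOutputClassifier(SVC()).fit(X, Y)         # cf. BR-SVM
    print(br_nb.predict(X[:2]), br_svm.predict(X[:2]))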
The BR-NB and HOMER techniques show very similar performance, with the main difference being that the HOMER method has a much lower mean standard deviation on dataset 2. This is consistent with the literature [30]. The RAkEL and ECC techniques both also produce lower mean standard deviations on dataset 2. It can thus be suggested that the HOMER, RAkEL and ECC methods produce more consistent predictions as the training dataset grows, while the consistency of the predictions produced by BR-NB remains unchanged. Although the performance of the PCT method on dataset 2 is not improved, the mean standard deviation is still less than on dataset 1. The mean standard deviations of the BR-SVM and MLkNN methods are higher on dataset 2 than on dataset 1.
The performance of the RAkEL technique corresponds with what has been reported by previous studies [27]. In this study it can be seen that RAkEL performs better than both the binary relevance methods (BR-SVM and BR-NB) and MLkNN. Furthermore, it has also been demonstrated that RAkEL does not perform as well as ECC [106], which is supported by our results.
The mean AUC values for the ECC method (the best performing method in terms of the AUC measure) are 0.63 on dataset 1 and 0.75 on dataset 2. These results are comparable to the results obtained by two
recent studies [8, 36], both of which investigate the application of classification techniques to the problem of HIV treatment failure. One study [8] made use of random forest classifiers trained using 14891 TCEs and did not include genotype resistance test information. The TCEs were extracted from a research database containing patient records from North America, Europe, Australia and Japan. The random forest models obtained AUC values in the range 0.74–0.77. The other study [36] used the ECC method, trained on a dataset of 600 training examples that included GRT data. The training examples were extracted from a North American research database and the ECC models achieved AUC values in the range 0.77–0.91. The models examined in this study were trained using less data, different machine learning techniques and data from a rural South African context that does not include GRT information. The fact that the results achieved are still comparable to those found in the literature is a promising outcome.
The Accuracy Paradox
Many of the techniques (BR-NB, HOMER, MLkNN, RAkEL) show a reduced mean TPR on dataset 2, which could be interpreted to mean that the techniques performed worse in general on dataset 2 than on dataset 1. However, this is not the case and the reduced TPR values are observed due to the accuracy paradox [95]. The accuracy paradox occurs when a predictive model demonstrates inflated performance on imbalanced datasets. An example of the accuracy paradox is given in section 2.2 using equation 2.1 (simple accuracy). Other examples of the accuracy paradox are the blanket positive predictions produced by BR-SVM and MLkNN in some cases. These blanket positives result in a TPR value of 1.0, which implies that the techniques identify positive examples in 100% of cases. However, these techniques provide no useful predictive information at all because the TNR value is 0.0.
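A small worked example, with made-up counts, shows how blanket positive predictions on an imbalanced label inflate both simple accuracy and TPR while TNR collapses to zero:

    # Blanket positive predictions on an imbalanced label: 90 resistant
    # (positive) and 10 susceptible (negative) patients, all predicted
    # resistant. The counts are invented for illustration.
    tp, fn = 90, 0    # every positive is correctly flagged
    fp, tn = 10, 0    # every negative is also flagged positive

    tpr = tp / (tp + fn)                        # 1.0 -- looks perfect
    tnr = tn / (tn + fp)                        # 0.0 -- no negatives found
    accuracy = (tp + tn) / (tp + tn + fp + fn)  # 0.9 -- inflated by imbalance
    print(tpr, tnr, accuracy)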
In order to avoid misinterpreting the accuracy paradox as good performance (or interpreting reduced values for a single metric as overall worsened performance), it is important to consider multiple evaluation measures. This is the fundamental reason why ROC analysis (including AUC) is used in addition to four different scalar evaluation measures in this study. If TPR and TNR are considered together, it can be seen
that all techniques performed better on dataset 2 than on dataset 1. If all evaluation measures are considered, it can be seen that all but the PCT method performed better on dataset 2 than on dataset 1 (although PCT did show improved standard deviation values on dataset 2).
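As an illustration of evaluating with several measures at once, the sketch below (using scikit-learn, with invented labels and scores) computes TPR and TNR from a confusion matrix alongside the threshold-independent AUC:

    import numpy as np
    from sklearn.metrics import confusion_matrix, roc_auc_score

    y_true = np.array([1, 1, 1, 0, 0, 1, 0, 1])  # illustrative labels
    scores = np.array([0.9, 0.8, 0.4, 0.3, 0.6, 0.7, 0.2, 0.55])
    y_pred = (scores >= 0.5).astype(int)

    # confusion_matrix returns [[tn, fp], [fn, tp]] for binary labels.
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    tpr = tp / (tp + fn)
    tnr = tn / (tn + fp)
    auc = roc_auc_score(y_true, scores)  # threshold-independent summary
    print(tpr, tnr, auc)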
Sensitivity and Specificity
The scalar evaluation measure values obtained indicate that many of the techniques (BR-NB, HOMER, PCT, RAkEL and ECC) are able to detect resistance to one or more drugs in over 95% of cases (as measured by TPR). The PCT technique is able to do so for three different drugs (efavirenz, delavirdine and nevirapine) on dataset 1, while the ECC method achieves a TPR of 95% or higher for six drugs on dataset 2 (efavirenz, emtricitabine, delavirdine, nevirapine, etravirine, lamivudine). However, although the TPR values are high, they are often accompanied by a low TNR value, indicating that the rate of false positives is high. This means that if the predictive models were used to prioritise patients for genotype resistance testing, a high number of patients could be tested unnecessarily. Fortunately this effect is reduced when the models are trained using dataset 2 and it is expected to diminish further as the size of the training dataset grows.
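The scale of the unnecessary testing can be estimated with a back-of-the-envelope calculation; the prevalence, TPR and TNR values below are assumptions chosen purely to illustrate the arithmetic, not figures from the experiments:

    # Rough estimate of unnecessary genotype resistance tests when
    # prioritising patients by model predictions (all inputs assumed).
    n_patients = 1000
    prevalence = 0.3        # fraction of patients truly resistant
    tpr, tnr = 0.95, 0.40   # high sensitivity, low specificity

    correctly_prioritised = n_patients * prevalence * tpr          # 285.0
    tested_unnecessarily = n_patients * (1 - prevalence) * (1 - tnr)  # 420.0
    print(correctly_prioritised, tested_unnecessarily)

Under these assumed numbers, more patients would be tested unnecessarily than would be correctly prioritised, which makes the value of an improved TNR on dataset 2 concrete.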
Despite this potential for suboptimal GRT prioritisation, there are still use cases where the models (even when trained on relatively small datasets) could be used effectively in resource-limited settings. For example, if GRT is not available and/or if the care provider decides that a regimen change is necessary, the false positives can be interpreted as the models making conservative predictions. Clinically, it is important to retain patients on first line treatment regimens for as long as possible, since second line therapy costs as much as 2.4 times more than first line therapy and compromises treatment outcomes [7]. If the decision is made to switch regimen, however, conservative prediction results could be used to help design a treatment regimen to which the virus is susceptible by avoiding drugs for which the models produce TPR values close to 1.0.
Techniques like PCT, BR-NB and ECC have high values for all measures for those drugs on which they perform well. This should give us confidence that the decision support information provided is reliable and not influenced by a single outlier evaluation measure value. Conversely, we should have less confidence in
the performance of a technique for a drug if only one of the measures shows good performance. For example, the MLkNN technique has a high TNR value (0.97) for stavudine, but all other metrics indicate very poor performance. This also applies to techniques that produce blanket positive predictions and therefore have a TPR value of 1.0. Despite this apparently high TPR value, these techniques provide no information that can be used to support decisions.
Data Dependencies
It has been shown that feature dependencies can sometimes have an effect on the performance of classification techniques [107, 108]. The simplifying assumption has been made (discussed in section 4.1.4) that all features are independent of each other. However, this is not the case in reality, since viral load has an effect on CD4 count: the virus attacks the human immune system, whose state the CD4 count measures.
This dependency (and possibly others) could have a negative impact on the performance of the classification techniques that make use of the naive Bayes classifier, which makes explicit use of the independence assumption. It has been shown that taking feature dependencies into account has the most positive effect on domains where decision tree learners (such as the PCT technique) perform significantly better than naive Bayes [107]. This is not the situation in the experiments that have been conducted, so there is unlikely to be a large gain in performance from explicitly representing feature dependencies.
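For reference, the independence assumption that naive Bayes exploits factorises the class-conditional likelihood into per-feature terms (written here in standard form; the notation is ours, not taken from earlier chapters):

    P(y \mid x_1, \dots, x_n) \;\propto\; P(y) \prod_{i=1}^{n} P(x_i \mid y)

A dependency such as the one between viral load and CD4 count violates the per-feature factorisation on the right-hand side, which is why naive Bayes based techniques could be penalised by it.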
In this study the simplifying assumption has also been made that all labels are independent of each other. We know from the literature that this is not true, and this effect can even be observed in our own results [4]. For example, it can be seen in figure 5.7 on page 87 that both the BR-NB and HOMER methods perform well for the emtricitabine and tenofovir labels. The drugs corresponding to these labels are both members of the same drug group (nucleotide analogue reverse transcriptase inhibitors or NRTIs). This effect can also be seen in figure 5.8 on page 89, where the PCT method performs best for the efavirenz and nevirapine labels. The drugs corresponding to these two labels are both members of the non-nucleoside reverse transcriptase inhibitor (NNRTI) drug group. The fact that performance across related labels is consistent could mean that the classifiers somehow implicitly encode label dependencies.
It is possible that the label independence assumption could have a negative impact on the performance of the classification techniques. Some studies have investigated the effect of explicitly modelling label dependencies and have obtained promising results [109, 110]. A very recent study conducted with the objective of predicting HIV drug resistance found that the ECC multi-label classification technique modelled label dependencies implicitly [36]. This implicit modelling can be seen in figure 3.3 on page 43, where each classifier in the classifier chain is trained using the labels from the previous classifiers as input. Although label dependencies have not been explicitly modelled in this study, we are still able to provide some evidence that such dependency modelling would improve performance, since the one technique used that implicitly models label dependencies is also the technique that performs best overall.
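As a minimal sketch of this mechanism (using scikit-learn's ClassifierChain, which is an assumption; the implementation used in this study is not described here), an ensemble of randomly ordered chains feeds each classifier the labels predicted earlier in its chain:

    import numpy as np
    from sklearn.multioutput import ClassifierChain
    from sklearn.naive_bayes import GaussianNB

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 6))          # illustrative patient features
    Y = rng.integers(0, 2, size=(200, 4))  # illustrative drug labels

    # Each chain orders the labels randomly and appends earlier label
    # predictions to the feature vector of later classifiers, so label
    # dependencies are modelled implicitly. Averaging over several chains
    # gives the ensemble of classifier chains (ECC).
    chains = [ClassifierChain(GaussianNB(), order="random",
                              random_state=i).fit(X, Y) for i in range(10)]
    ensemble_probs = np.mean([c.predict_proba(X) for c in chains], axis=0)
    print(ensemble_probs.shape)  # (200, 4): one probability per patient per drug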
Temporal Nature of the Data
The patient data used to construct the training datasets consist of longitudinal records. This means that the information stored in the clinical information system for each patient has been captured over a period of time, and each piece of information associated with a patient is also associated with a point in time. This temporal nature of the data has not been explicitly modelled in the features used for training, but some features do have an implicit temporal aspect. For example, the baseline viral load feature represents a viral load test performed on a patient within a certain time frame around the date on which the patient first started on antiretroviral therapy. Another example is recent CD4 gradient, which captures the rate of change of the CD4 count measurements taken for a patient in the last year.
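A minimal sketch of how such a gradient feature might be computed is given below; the one-year window, the units, and the fallback value for patients with fewer than two recent measurements are all assumptions made for illustration:

    import numpy as np

    def recent_cd4_gradient(days, cd4_counts, window_days=365):
        """Rate of change of CD4 counts (cells/uL per day) over the last year.

        `days` holds measurement times in days relative to a common origin;
        the window length and fallback value are illustrative assumptions.
        """
        days = np.asarray(days, dtype=float)
        cd4_counts = np.asarray(cd4_counts, dtype=float)
        recent = days >= days.max() - window_days
        if recent.sum() < 2:
            return 0.0  # not enough recent measurements to fit a slope
        slope, _ = np.polyfit(days[recent], cd4_counts[recent], 1)
        return slope

    print(recent_cd4_gradient([0, 120, 250, 400], [310, 340, 360, 395]))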
There is another temporal aspect to the dataset: each patient captured in the system has been added to the database at a specific point in time. In other words, patients who were infected a long time ago were added to the system before recently infected patients. This temporal element has the potential to affect our predictive models because of the way that HIV evolves. Recently infected patients may be infected with a strain of the virus that has already evolved resistance to some drugs (referred