To aid in disease diagnostics and predictions in general (Aim 3)

Two pieces of research work were carried out under this aim. The first proposes a generalized predictive model for medical datasets of different diseases using a feature ranking based approach. The second presents a predictive model that predicts the survivability of ICU patients from lab test data by exploiting feature ranking and feature grouping techniques.

7.3.1 A general methodology for disease diagnosis using feature ranking

The classification of medical data is one of the complex and challenging tasks of medical informatics, and the literature has suggested many different approaches to it. We have revisited this problem in this work and have made an effort to present a generalized approach for disease diagnosis. In particular, we have proposed a methodology based on feature ranking that picks the highly ranked features and then trains the predictive model on them.

To elaborate, we first examine whether all features are important for the classification task. This is achieved by applying multiple feature ranking algorithms under 10-fold cross validation. Then, based on the feature ranking, a subset of the top-ranked features is formed using the 10-fold cross validation performance. Finally, the Random Forest algorithm is applied to the selected features to train the final model. In practice, we form several subsets of top-ranked features and train one model per subset, which yields a repertoire of models to compare and choose from.
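The procedure described above can be sketched in a few lines. The following is a minimal, illustrative sketch rather than the implementation used in this work. It assumes a single ranking criterion (absolute Pearson correlation with the class label) in place of the multiple ranking algorithms, a nearest-centroid classifier as a dependency-free stand-in for Random Forest, and synthetic data. All function names are hypothetical.

```python
import random
import statistics

def rank_features(X, y):
    """Order feature indices by |Pearson correlation| with the label (assumed criterion)."""
    def score(j):
        col = [row[j] for row in X]
        mx, my = statistics.fmean(col), statistics.fmean(y)
        cov = sum((a - mx) * (b - my) for a, b in zip(col, y))
        var = sum((a - mx) ** 2 for a in col) * sum((b - my) ** 2 for b in y)
        return abs(cov) / var ** 0.5 if var else 0.0
    return sorted(range(len(X[0])), key=score, reverse=True)

def fit_centroids(X, y, feats):
    """Per-class mean vector over the chosen features (stand-in for Random Forest)."""
    cents = {}
    for label in set(y):
        rows = [r for r, t in zip(X, y) if t == label]
        cents[label] = [statistics.fmean([r[j] for r in rows]) for j in feats]
    return cents

def predict(cents, row, feats):
    """Assign the class whose centroid is closest in squared Euclidean distance."""
    return min(cents, key=lambda c: sum((row[j] - m) ** 2 for j, m in zip(feats, cents[c])))

def cv_accuracy(X, y, k_top, folds=10):
    """k-fold CV accuracy using the k_top best-ranked features.

    The ranking is recomputed on each training fold so that test rows
    never influence which features are selected.
    """
    hits = 0
    for f in range(folds):
        train = [i for i in range(len(X)) if i % folds != f]
        test = [i for i in range(len(X)) if i % folds == f]
        X_tr, y_tr = [X[i] for i in train], [y[i] for i in train]
        feats = rank_features(X_tr, y_tr)[:k_top]
        cents = fit_centroids(X_tr, y_tr, feats)
        hits += sum(predict(cents, X[i], feats) == y[i] for i in test)
    return hits / len(X)

# Synthetic data (assumption): features 0 and 1 carry signal, features 2-5 are noise.
rng = random.Random(0)
y = [rng.randint(0, 1) for _ in range(200)]
X = [[3 * t + rng.gauss(0, 1), 3 * t + rng.gauss(0, 1)]
     + [rng.gauss(0, 1) for _ in range(4)] for t in y]

ranking = rank_features(X, y)
# One candidate model per subset of top-ranked features; the candidates are then compared.
scores = {k: cv_accuracy(X, y, k) for k in range(1, len(ranking) + 1)}
best_k = max(scores, key=scores.get)
print("ranking:", ranking)
print("CV accuracy per subset size:", scores)
print("chosen subset size:", best_k)
```

Recomputing the ranking inside each training fold is a design choice of this sketch. It keeps the cross validation estimate free of information from the held-out rows.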

Extensive experiments have been performed on 10 benchmark datasets, and the results are promising. The models based on feature ranking and selection performed consistently better than the baseline model (without feature ranking). Additionally, the best models were found to be competitive with the state of the art. Furthermore, an ablation study was performed to verify the significance of the features selected for the model, and it showed the positive contribution of the important features to the classification tasks. In summary, this research has established highly accurate predictors for 10 different diseases and has also presented a general methodology that is expected to work well for other diseases whose data exhibit similar characteristics.
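An ablation study of this kind can be illustrated as follows. This is a hedged sketch, not the protocol used in the experiments. It assumes a single hold-out split, synthetic data, and a nearest-centroid classifier standing in for Random Forest, and the function names are hypothetical. Each selected feature is removed in turn and the resulting drop in accuracy is recorded.

```python
import random
import statistics

def accuracy(X_tr, y_tr, X_te, y_te, feats):
    """Hold-out accuracy of a nearest-centroid classifier (stand-in for Random Forest)."""
    cents = {}
    for label in set(y_tr):
        rows = [r for r, t in zip(X_tr, y_tr) if t == label]
        cents[label] = [statistics.fmean([r[j] for r in rows]) for j in feats]
    hits = 0
    for row, target in zip(X_te, y_te):
        pred = min(cents, key=lambda c: sum((row[j] - m) ** 2
                                            for j, m in zip(feats, cents[c])))
        hits += pred == target
    return hits / len(y_te)

def ablation(X_tr, y_tr, X_te, y_te, selected):
    """Return full-model accuracy and the accuracy drop caused by removing each feature."""
    full = accuracy(X_tr, y_tr, X_te, y_te, selected)
    drops = {}
    for j in selected:
        reduced = [f for f in selected if f != j]
        drops[j] = full - accuracy(X_tr, y_tr, X_te, y_te, reduced)
    return full, drops

# Synthetic data (assumption): feature 0 is strong, feature 1 is weak, feature 2 is noise.
rng = random.Random(1)
y = [rng.randint(0, 1) for _ in range(400)]
X = [[4 * t + rng.gauss(0, 1), 0.5 * t + rng.gauss(0, 1), rng.gauss(0, 1)] for t in y]
X_tr, y_tr, X_te, y_te = X[:200], y[:200], X[200:], y[200:]

full, drops = ablation(X_tr, y_tr, X_te, y_te, selected=[0, 1, 2])
print("accuracy with all selected features:", full)
for j, d in drops.items():
    print(f"feature {j}: accuracy drop when removed = {d:+.3f}")
```

A large drop indicates that the removed feature contributes positively to the classification, whereas a drop near zero suggests that the feature adds little.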

The use of feature ranking and selection strategies is not new in the applied machine learning literature. To the best of our knowledge, however, these strategies had not been comprehensively explored in the context of medical data classification before the current research work. On this basis, this can be seen as the first attempt to apply these strategies with Random Forest as the final classifier in the context of medical data classification. Additionally, the ablation study conducted here is also new in this application domain. Therefore, it is believed that the rigorous experiments, along with the appropriate ablation studies, have advanced the state of the art, although no new theoretical advancement has been produced in the context of machine learning. Indeed, through extensive experiments on 10 different benchmark datasets, this research work has been able to show that the approach is useful and sufficiently general.

In addition, this research work has interpreted the top-ranked features of each dataset, from a medical point of view, in terms of their contributions to the prediction tasks. Thus, while one might argue that, in a real (clinical) setting, ranking and selection of features should always lead to a better predictor/classifier, this research work adds the value of explaining the phenomenon. It is therefore assumed that a healthcare professional would be more confident and comfortable (so to speak) in using the proposed predictors.

In this research work, extensive experiments with other classifiers, such as SVM, Bayes Network and MLP, have also been conducted, and Random Forest was found to perform better than the other classifiers across all datasets. The use of 10 different datasets is another important attribute of this research work: to the best of our knowledge, only one study in the literature [147] used all of these datasets for its experiments, albeit with a different goal. That study developed evolutionary ELM models focusing on compressing the neural networks by decreasing the number of neurons in the hidden layers, whereas this research work seeks a general approach for classifying medical data. Moreover, that study was inconclusive, since the testing accuracy was sometimes greater with higher numbers of hidden neurons and sometimes greater with lower numbers. In contrast, this research work presents a general methodology that is expected to perform consistently well across medical datasets whose data exhibit similar characteristics in their patterns.

7.3.2 Survivability prediction of ICU patients

In recent years, Clinical Decision Support Systems (CDSS) have gained significant attention for their role in, and impact on, the advancement of healthcare quality, safety, performance and effectiveness. CDSS-enabled advanced data analytics has been found to work more accurately and in a more timely manner than conventional systems. Timely and efficient CDS or CP recommended actions can assist caregivers in taking the appropriate steps in time. In this domain, survival prediction or deterioration prediction for critical care patients, such as patients in the intensive care unit (ICU), is an active research area. Predicting early death can help us take effective and prompt steps for these patients.

Most existing research in this regard focuses on vital signs. Prior to the current research work, only a few studies on mortality prediction using lab samples were available in the literature. We have proposed feature ranking based approaches hybridized with feature vector compaction (FVC) techniques, using common feature counting (CFC) and vacuum count (VC) ensemble techniques, to predict mortality more accurately from laboratory test data. The contributions that advance the current state of the art are the feature ranking based approaches, the modification of the FVC techniques, and the use of CFC as an ensemble strategy. The findings of detailed studies show that the suggested methodology outperforms other models that use lab test data for predicting mortality. The research also illustrates how CFC can improve the model's performance. In addition, it demonstrates the effect of the Vertical (V)-Horizontal (H) FVC techniques, along with Vacuum Count (VC), for better efficiency.

This study also compares the performance of several standard classifier algorithms in this context, such as the J48, NaiveBayes (NB), RandomForest (RF) and Support Vector Machine (SVM) classifiers. Finally, the proposed models can be applied to other fields of clinical decision support systems where the data exhibit similar characteristics.
