Department of Computer Science and Engineering in partial fulfilment of the requirements for the degree of

Broadly, it has three components which are discussed below in the context of the applications of information technology. Respiratory diseases are diseases of the airways and other structures of the lung (please refer to Chapter 2 - Section 2.1 for the details).

An association identification problem of maternal HIV linking with lung function

Accordingly, the rest of this chapter describes the problems dealt with in this thesis, noting their context. Second, it briefly describes the lung function prediction problem from the voice sound files.

Lung function prediction from the voice sound files

A general methodology for diagnosis and health state prediction

Survival prediction of Intensive Care Unit (ICU) patients

Feature interpretation in disease informatics

The aims/objectives of this thesis

To extract and select the most important/relevant features (e.g. for objective 2, the features most informative with respect to lung function, measured as forced expiratory volume in one second (FEV1); see Chapter 2 - Section 2.3.1 for details on FEV1) from the data set (Aims 1 – 3). Develop a computational model to investigate the associations of important maternal HIV genes with lung function (Aim 1).

The contributions of this thesis

In addition, REACTOME pathways [7] analysis using Enrichr [8] was performed to interpret the association of the hub genes with lung function. A computational approach to identify speech and breathing segments from a voice-recorded sound file has been developed.

Organization of this thesis

Subsequently, the experimental set-up and the analyses of the performance of different models are presented together with a comparison with the state of the art. Within respiratory diseases, this study is primarily concerned with asthma, chronic obstructive pulmonary disease (COPD) and lung diseases.

Respiratory diseases

It is caused by inflammation of the tubes that carry air in and out of the lungs. COPD includes two conditions: 1) emphysema - damage to the air sacs in the lungs and 2) chronic bronchitis - long-term inflammation of the airways.

Figure 2.1: Human Respiratory System. This figure displays the different parts of the human respiratory system

Early life exposure and maternal HIV exposure

Lung function and measurement

The amount of air exhaled can be measured during the first (FEV1), second (FEV2) and/or third seconds (FEV3) of the forced breath. Health care professionals measure tidal volume (i.e., the lung's ability to hold and process air) to study or diagnose respiratory and other health problems.

Gene expression and profile

The person's normal lung function will be measured first (also known as his/her baseline breathing). Gene expression profiling measures which genes are expressed in a cell at a given time.

Figure 2.2: An overview of gene expression. This figure shows the transcription phases

Bioinformatics

Machine learning techniques

Reinforcement - A reinforcement learning algorithm interacts with its environment by producing actions and discovering errors or rewards. Support Vector Machine (SVM) is a supervised machine learning algorithm which can be used for both classification and regression tasks.

Performance metrics for machine learning techniques

The area under the ROC curve (AUC) ranges from 0.0 to 1.0, with a higher value indicating a better-performing model. The mean absolute error (MAE) is the average over the cases of the absolute values of the differences between predictions and the corresponding actual values.
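
As an illustration of the two metrics described above, the following minimal sketch computes the mean absolute error and the area under the ROC curve in plain Python. The function names and the toy inputs are assumptions made for this example; they are not taken from the thesis experiments. The AUC is computed through its rank interpretation, i.e. the probability that a randomly chosen positive case is scored higher than a randomly chosen negative case.

```python
def mean_absolute_error(actual, predicted):
    """Average of |prediction - actual| over all cases."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(p - a) for a, p in zip(actual, predicted)) / len(actual)


def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of scores.

    labels: 1 for a positive case, 0 for a negative case.
    scores: classifier scores, higher meaning "more likely positive".
    Ties between a positive and a negative score count as 0.5.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative case")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Toy values (hypothetical, for illustration only).
mae = mean_absolute_error([1.0, 2.0, 3.0], [1.5, 2.0, 2.0])
auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
```

The pairwise form is quadratic in the number of cases, which is acceptable for small data sets; a sort-based computation would be preferable for large ones.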

Figure 2.4 shows an example of the ROC spaces of a classifier. The ROC curve of a good classifier is closer to the top left of the graph

Main Problems Handled in this Thesis

A subset of the authors in the DCHS reported that HIV exposure was associated with altered lung function in HIV-exposed, uninfected (HEU) infants at both 6 weeks and 2 years of age, with impairments associated with uncontrolled maternal HIV [39, 48]. Thus, the pitch and quality of breathing sounds can potentially be used to predict lung function.

Table 2.1: Summary of the related studies in this domain

Conclusion

We now analyze how our study is fundamentally different from the research efforts mentioned above. Further, we find that most of the above studies focus on a single machine learning/data mining technique.

Methods

An overrepresentation analysis of the gene network modules, identified from the WGCNA analysis, on the DEGs (based on p-value < 0.01) for maternal HIV was performed using a hypergeometric test. To explore the association of HEU-associated co-expressed hub genes and child lung function, linear regression analysis was performed on tidal volumes (TV) at 6 weeks and 2 years of age for each hub gene (kME > 0.7) of the HEU highly associated modules.
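
The hypergeometric overrepresentation test mentioned above can be sketched as follows. This is a minimal illustration under stated assumptions, not the pipeline used in this study: the function name, its parameters and the gene counts are hypothetical. It returns the probability of observing at least the given overlap between a module and the DEG list if the module's genes were drawn at random from the gene universe.

```python
from math import comb  # available from Python 3.8


def hypergeom_enrichment_p(universe, n_degs, module_size, overlap):
    """Upper-tail hypergeometric p-value P(X >= overlap).

    universe:    total number of genes considered (assumed background)
    n_degs:      number of differentially expressed genes in the universe
    module_size: number of genes in the co-expression module
    overlap:     number of module genes that are also DEGs
    """
    if not 0 <= n_degs <= universe or not 0 <= module_size <= universe:
        raise ValueError("n_degs and module_size must lie within the universe")
    upper = min(n_degs, module_size)
    if not 0 <= overlap <= upper:
        raise ValueError("overlap cannot exceed n_degs or module_size")
    total = comb(universe, module_size)
    # Sum the probabilities of every overlap at least as large as observed.
    tail = sum(
        comb(n_degs, k) * comb(universe - n_degs, module_size - k)
        for k in range(overlap, upper + 1)
    )
    return tail / total


# Hypothetical counts: 10 genes, 4 DEGs, a module of 3 genes, 2 of them DEGs.
p_value = hypergeom_enrichment_p(universe=10, n_degs=4, module_size=3, overlap=2)
```

A small p-value suggests that the module contains more DEGs than expected by chance; when many modules are tested, the resulting p-values would normally be adjusted for multiple testing.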

Figure 3.1: Drakenstein Child Health Study (DCHS) Schema. This figure displays the study plan of the DCHS

Results

Enrichment table of top 20 GO (Gene Ontology) biological processes enriched by differentially expressed genes for maternal HIV exposure with their normalized enrichment scores, p-value and p-adjustment value. Enrichment table of top 11 KEGG (Kyoto Encyclopedia of Genes and Genomes) pathways enriched by differentially expressed genes for maternal HIV exposure with their normalized enrichment scores, p-value and p-adjustment value.

Discussion

In the gene phenotype color band, red represents a strong positive association and blue represents a strong negative association with maternal HIV exposure. This figure shows modules significant for maternal HIV using p-values of differentially expressed genes and modules identified through WGCNA.

Figure 3.6: Association of WGCNA modules with phenotypes. Each square represents the association of a phenotype with a module, with red representing a strong positive association and blue a strong negative association

Conclusion

Finally, assessing gene expression in cord blood does not necessarily provide insight into changes in gene expression in other tissues. It is therefore possible that the association of gene expression in umbilical cord blood with lung function does not represent a causal relationship, but that gene expression in umbilical cord blood may simply be a biomarker for adverse development in utero.

Methods

Breath cycle standard deviation: the standard deviation of the breath cycle duration in the audio file. For example, MRF(r)(bl)v(s1) refers to a regression model based on a random forest that only uses features from the breaths of a voice audio file.
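
A minimal sketch of the breath cycle standard deviation feature is given below. It assumes that the onset time of each breath (in seconds) has already been detected, and that a breath cycle lasts from one onset to the next; both the input format and the use of the population standard deviation are assumptions of this example rather than details confirmed by the text.

```python
from statistics import pstdev


def breath_cycle_std(onset_times):
    """Standard deviation of breath cycle durations.

    onset_times: ascending breath onset times in seconds (hypothetical input,
    assumed to come from an earlier segmentation step).
    """
    durations = [later - earlier
                 for earlier, later in zip(onset_times, onset_times[1:])]
    if not durations:
        raise ValueError("need at least two breath onsets")
    # Population standard deviation over all cycles in the recording.
    return pstdev(durations)


# Toy onsets giving cycle durations of 3, 4, 3 and 4 seconds.
feature = breath_cycle_std([0.0, 3.0, 7.0, 10.0, 14.0])
```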

Results

These results represent the performance of the Mi(r)(nrm)v(s1), Mi(r)(nrm)v(s2) and Mi(r)(nrm)v(all) models in terms of root mean square error (RMSE). The effects of a balanced allocation of training and testing groups are investigated through experimentation with our regression models (e.g., MRF(r)(bl)v(all) and MRF(r)(nrm)v(all)).
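
For reference, the root mean square error used to compare the regression models can be computed as in the sketch below. The function name and the FEV1% values are illustrative assumptions, not results from this study.

```python
from math import sqrt


def rmse(actual, predicted):
    """Square root of the mean of the squared prediction errors."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(p - a) ** 2 for a, p in zip(actual, predicted)]
    return sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical FEV1% values: the errors are 3, -4, 0 and 0.
error = rmse([70.0, 85.0, 90.0, 60.0], [73.0, 81.0, 90.0, 60.0])
```

Because the errors are squared before averaging, RMSE penalizes large individual errors more heavily than the mean absolute error does.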

Figure 4.6: Impact of speech and breathing features on prediction. Effects of features either from speech only or from breathing only or from both on predicting FEV1%

Discussion

In contrast to cough sounds, this research work has used the recorded sound file to predict lung function with the aim of self-monitoring disease symptoms. Second, only a limited number of samples were found for the category of lung function abnormalities as mentioned in [113].

Conclusion

Therefore, this research work proposed a technique to predict lung function from recorded voice audio files by extracting the necessary features. Finally, the proposed technique outperforms other techniques in predicting lung function from voice audio.

Methods

In this dataset, the value of the 6th property (i.e., beverage) is greater than 5 for 88 records and less than or equal to 5 for the remaining 257 records. This data set describes a retrospective sample of men in the Western Cape region of South Africa at high risk of heart disease.

Figure 5.1: Model construction overview

5.1.3 Feature Ranking and Selection

Results

ModelSHtIII: Based on the ranking results of ranger and SVM, features are excluded and the rest are used to build this model. ModelPkSII: Based on the ranking results of the InfoGainAttributeEval evaluator, features are excluded to build this model.

Discussions

We have also performed an ablation study on each data set to verify the importance of the features selected for the model and to confirm their positive contribution to the classification task. On the other hand, from our experiments it is clear that the feature ranking and selection strategy always gives better results than the baseline, and therefore we present a general method that is expected to perform well across all medical datasets (which have similar characteristics in the data pattern) consistently.

Conclusion

Although the use of a feature ranking and selection strategy is not new to the applied machine learning literature, to the best of our knowledge it has not been comprehensively explored in the context of medical data classification prior to the current research work. From this perspective, this can be seen as the first attempt where a feature ranking and selection scheme has been applied in the context of medical data classification using a random forest as the final classifier.

Methods

Finally, the ICUSTAYS table, which contains the patient's length of stay, was used in this research. We also use orthogonal clustering (ORCU), which was introduced in [189] by a subset of the authors.

Table 6.1: Age information of the groups (columns: Group, Age)

Results

As an example, when features are ranked by InfoGain, 1/4 of the total features are selected. As an example, when the features are sorted by SVM ranking, 3/4 of the total features are selected.
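
The selection step described above, keeping a fixed fraction of the ranked features, can be sketched as follows. This is an illustrative sketch only: the function name, the feature names and their scores are hypothetical, and rounding the number of retained features up is an assumption of this example.

```python
from math import ceil


def select_top_fraction(scores, fraction):
    """Keep the highest-ranked fraction of the features.

    scores:   dict mapping feature name to ranking score, where a higher
              score means a more important feature (e.g. information gain)
    fraction: share of the features to keep, in the interval (0, 1]
    """
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    ranked = sorted(scores, key=scores.get, reverse=True)
    # Assumption: round up, and always keep at least one feature.
    keep = max(1, ceil(len(ranked) * fraction))
    return ranked[:keep]


# Hypothetical scores for four features.
scores = {"age": 0.40, "heart_rate": 0.25, "gcs": 0.30, "urea": 0.05}
quarter = select_top_fraction(scores, 1 / 4)
three_quarters = select_top_fraction(scores, 3 / 4)
```

The same function can be reused with the scores produced by any ranker, so the fraction kept can differ between ranking methods as in the examples above.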

Discussions

Finally, the lower right panel of Figure 6.12 focuses on the MIMIC-III Senior data set and the superiority of the combination. Feature set A contains 17 features (preprocessed) used in the calculation of SAPS-II scores.

Figure 6.11: ROC Curve of some ensemble models for each dataset

Conclusion

Needless to say, ML and BI techniques are used intensively in each of the steps above. One of the key focuses of this research is to analyze respiratory diseases (e.g. COPD, asthma, lung diseases etc.) and to carry out various types of analyses of them.

Association of pre-natal HIV exposure with UCB gene expression and lung function (Aim 1)

Therefore, this study applied a number of ML and BI techniques to find out the association of prenatal maternal HIV with UCB gene expression linked to lung function. To interpret the biological relevance of the traits (i.e. DEGs) for maternal HIV, fGSEA [6] was performed.

Prediction of lung functions using speech and breath sound samples (Aim 2)

These models can predict normal versus abnormal lung function from the sound files with reasonable accuracy. Interestingly, these models can also predict the severity of lung function abnormalities from the sound files.

To aid in disease diagnostics and predictions in general (Aim 3)

To our knowledge, in the context of medical data classification, the strategies were not fully explored before this current research work. Therefore, in this situation, in a real (clinical) sense, while it can be argued that the ranking and selection of features.

Feature interpretation in disease informatics

To help CDS or CP, this thesis presented a feature rank-based ensemble classifier for survival prediction of ICU patients. Although this study has certain limitations in this problem context, the discussion of feature identification and feature importance from a biological and/or medical/clinical point of view provides useful information for future research.

Future Research Directions and Ongoing Works

Several advanced ML algorithms (e.g. Random Forest (RF), Support Vector Machine (SVM), Logistic Regression (LR), MultilayerPerceptron (MLP), Naive Bayes (NB), AdaBoostM1 etc.) have been trained (see Appendix F - Section F.3.3 for more information). In the case of the LSTM pipeline, it has been observed that after running 20 epochs, the trained models overfit the training datasets.

Concluding remarks

This graph shows the accuracy of the LSTM-based pipeline against the number of epochs. The X-axis represents the number of epochs and the Y-axis represents the accuracy of the LSTM technique.

Figure 7.1: Performances of the state of the art algorithms to predict early exacerbation

Linear Regression

The additional constant b0, the so-called intercept, is the prediction the model would make if all X's were zero (if that is possible). And the model's prediction errors are generally assumed to be independent and identically normally distributed.
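
To make the role of the intercept concrete, the following minimal sketch fits a linear regression with a single predictor by ordinary least squares. The function names and the toy data are assumptions made for this example; with several predictors the same idea applies, but the coefficients are usually obtained with matrix methods.

```python
def fit_simple_linear_regression(xs, ys):
    """Ordinary least squares fit of y = b0 + b1 * x for one predictor."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need at least two paired observations")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("x values must not all be identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b1 = sxy / sxx             # slope
    b0 = mean_y - b1 * mean_x  # intercept
    return b0, b1


def predict(b0, b1, x):
    """Model prediction for a single value of the predictor."""
    return b0 + b1 * x


# Toy data lying exactly on the line y = 1 + 2x.
b0, b1 = fit_simple_linear_regression([0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0])
at_zero = predict(b0, b1, 0.0)  # equals the intercept b0
```

Evaluating the fitted model at x = 0 returns b0, which matches the interpretation of the intercept given above.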

Random Forest

In this way, a random forest builds trees that are less dependent on each other, at a small cost in the accuracy of each individual tree. There is a rule of thumb for selecting subsamples from the observations when using a random forest.

Support Vector Machine (SVM)

This can be calculated as the perpendicular distance from the line to the support vectors. The main purpose of SVM is to divide the data sets into classes to find a maximum marginal hyperplane (MMH) and this can be done in the following two steps.