Broadly, it has three components which are discussed below in the context of the applications of information technology. Respiratory diseases are diseases of the airways and other structures of the lung (please refer to Chapter 2 - Section 2.1 for the details).
An association identification problem of maternal HIV linking with lung function
Accordingly, the rest of this chapter describes the problems dealt with in this thesis, noting their context. Second, it briefly describes the lung function prediction problem from the voice sound files.
Lung function prediction from the voice sound files
A general methodology for diagnosis and health state prediction
Survival prediction of Intensive Care Unit (ICU) patients
Feature interpretation in disease informatics
The aims/objectives of this thesis
To extract and select the most important/relevant features from the data set, e.g. for objective 2, the features most informative with respect to lung function (forced expiratory volume in one second (FEV1); see Chapter 2 - Section 2.3.1 for details on FEV1) (Aims 1 – 3). To develop a computational model to investigate the associations of important maternal HIV genes with lung function (Aim 1).
The contributions of this thesis
In addition, REACTOME pathways [ 7 ] analysis using Enrichr [ 8 ] was performed to interpret the association of the hub genes with lung function. A computational approach to identify speech and breathing segments from a voice-recorded sound file has been developed.
Organization of this thesis
Subsequently, the experimental set-up and the analyses of the performance of different models are presented together with a comparison with the state of the art. Within respiratory diseases, this study is primarily concerned with asthma, chronic obstructive pulmonary disease (COPD) and lung diseases.
Respiratory diseases
It is caused by inflammation of the tubes that carry air in and out of the lungs. COPD includes two conditions: 1) emphysema - damage to the air sacs in the lungs and 2) chronic bronchitis - long-term inflammation of the airways.
Early life exposure and maternal HIV exposure
Lung function and measurement
The amount of air exhaled can be measured during the first (FEV1), second (FEV2) and/or third (FEV3) second of the forced breath. Health care professionals measure tidal volume (i.e., the lung's ability to hold and process air) to study or diagnose respiratory and other health problems.
Gene expression and profile
The person's normal lung function will be measured first (also known as his/her baseline breathing). Gene expression profiling measures which genes are expressed in a cell at a given time.
Bioinformatics
Machine learning techniques
Reinforcement - A reinforcement learning algorithm interacts with its environment by producing actions and discovering the resulting errors or rewards. Support Vector Machine (SVM) is a supervised machine learning algorithm that can be used for both classification and regression tasks.
Performance metrics for machine learning techniques
The area under the ROC curve (AUC) ranges from 0.0 to 1.0, with a higher value indicating a better-performing model. The mean absolute error (MAE) is the average over the cases of the absolute values of the differences between predictions and the corresponding actual values.
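The metrics described above can be illustrated with a minimal sketch. The function names are hypothetical and chosen for illustration only, and the AUC is computed here through its pairwise-ranking interpretation (the probability that a randomly chosen positive case is scored above a randomly chosen negative one), which is an assumption about how one might compute it rather than the procedure used in this thesis.

```python
import math


def mean_absolute_error(actual, predicted):
    """Average of |prediction - actual| over all cases."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(p - a) for a, p in zip(actual, predicted)) / len(actual)


def root_mean_square_error(actual, predicted):
    """Square root of the mean squared difference between predictions and actual values."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = sum((p - a) ** 2 for a, p in zip(actual, predicted))
    return math.sqrt(squared / len(actual))


def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Labels are assumed to be 1 (positive) or 0 (negative); ties count as half.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("both classes must be present")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))
```

Lower MAE and RMSE indicate predictions closer to the actual values, whereas a higher AUC indicates better separation of the two classes.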
Main Problems Handled in this Thesis
A subset of the authors in the DCHS reported that HIV exposure was associated with altered lung function in HIV-exposed, uninfected (HEU) infants at both 6 weeks and 2 years of age, with impairments associated with uncontrolled maternal HIV [39, 48]. Thus, the pitch and quality of breathing sounds can potentially be used to predict lung function.
Conclusion
We now analyze how our study is fundamentally different from the research efforts mentioned above. Further, we find that most of the above studies focus on a single machine learning/data mining technique.
Methods
An overrepresentation analysis of the gene network modules, identified from the WGCNA analysis, on the DEGs (based on p-value <0.01) for maternal HIV was performed using a hypergeometric test. To explore the association of HEU-associated co-expressed hub genes and child lung function, linear regression analysis was performed on tidal volumes (TV) at 6 weeks and 2 years of age for each hub gene (kME>0.7) of the HEU highly associated modules.
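As an illustration of the overrepresentation analysis, the following sketch computes a one-sided hypergeometric p-value for the overlap between a gene module and a set of DEGs. The function and parameter names are hypothetical, and the choice of the upper tail (probability of an overlap at least as large as the one observed) is an assumption about the test configuration, not a description of the exact pipeline used here.

```python
from math import comb


def hypergeom_enrichment_pvalue(population, successes, draws, observed):
    """Upper-tail hypergeometric p-value, P(X >= observed).

    population: number of genes in the background set
    successes:  number of DEGs in the background set
    draws:      number of genes in the module being tested
    observed:   number of DEGs found in the module
    """
    if not (0 <= successes <= population and 0 <= draws <= population):
        raise ValueError("successes and draws must lie within the population")
    if observed < 0:
        raise ValueError("observed overlap cannot be negative")
    upper = min(successes, draws)
    total = comb(population, draws)
    # Sum the probabilities of every overlap size from `observed` upwards.
    tail = sum(
        comb(successes, k) * comb(population - successes, draws - k)
        for k in range(observed, upper + 1)
    )
    return tail / total
```

A small p-value suggests that the module contains more DEGs than would be expected by chance for a module of that size.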
Results
Enrichment table of the top 20 GO (Gene Ontology) biological processes enriched by differentially expressed genes for maternal HIV exposure, with their normalized enrichment scores, p-values and adjusted p-values. Enrichment table of the top 11 KEGG (Kyoto Encyclopedia of Genes and Genomes) pathways enriched by differentially expressed genes for maternal HIV exposure, with their normalized enrichment scores, p-values and adjusted p-values.
Discussion
In the gene phenotype color band, red represents a strong positive association and blue represents a strong negative association with maternal HIV exposure. This figure shows modules significant for maternal HIV using p-values of differentially expressed genes and modules identified through WGCNA.
Conclusion
Finally, assessing gene expression in cord blood does not necessarily provide insight into changes in gene expression in other tissues. It is therefore possible that the association of gene expression in umbilical cord blood with lung function does not represent a causal relationship, but that gene expression in umbilical cord blood may simply be a biomarker for adverse development in utero.
Methods
Breath cycle standard deviation: the standard deviation of the breath cycle duration in the audio file. For example, MRF(r)(bl)v(s1) refers to a random forest-based regression model that uses only features from the breaths of a voice audio file.
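A minimal sketch of the breath cycle standard deviation feature is given below. It assumes that breath cycle onsets have already been detected and are available as timestamps in seconds, and it uses the sample standard deviation; both the input format and the function name are assumptions made for illustration.

```python
import statistics


def breath_cycle_std(cycle_onsets_s):
    """Sample standard deviation of breath cycle durations.

    cycle_onsets_s: onset times (in seconds) of consecutive breath cycles,
    in increasing order. A duration is the gap between two successive onsets.
    """
    durations = [later - earlier
                 for earlier, later in zip(cycle_onsets_s, cycle_onsets_s[1:])]
    if len(durations) < 2:
        raise ValueError("at least three onsets are needed to compute a deviation")
    return statistics.stdev(durations)
```

A value near zero indicates very regular breathing, while larger values indicate more variable breath cycle lengths.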
Results
These results represent the performance of the Mi(r)(nrm)v(s1), Mi(r)(nrm)v(s2) and Mi(r)(nrm)v(all) models in terms of root mean square error (RMSE). The effects of a balanced allocation of training and testing groups are investigated through experimentation with our regression models (e.g., MRF(r)(bl)v(all) and MRF(r)(nrm)v(all)).
Discussion
In contrast to work based on cough sounds, this research uses recorded voice sound files to predict lung function with the aim of self-monitoring disease symptoms. Second, only a limited number of samples were found for the category of lung function abnormalities as mentioned in [113].
Conclusion
Therefore, this research work proposed a technique to predict lung function from recorded voice audio files by extracting the necessary features. Finally, the proposed technique outperforms other techniques in predicting lung function from voice audio.
Methods
In this data set, the value of the 6th property (i.e., beverage) is greater than 5 for 88 records and less than or equal to 5 for the remaining 257 records. This data set describes a retrospective sample of men in the Western Cape region of South Africa at high risk of heart disease.
Results
ModelSHtIII: Based on the results of ranger and SVM, the features are excluded and the rest are used to build this model. ModelPkSII: According to the results of the evaluators, the InfoGainAttributeEval functions are excluded to build this model.
Discussions
We also performed an ablation study on each data set to verify the importance of the features selected for the model and to confirm their positive contribution to the classification task. Moreover, our experiments show that the feature ranking and selection strategy always gives better results than the baseline. We therefore present a general method that is expected to perform consistently well across medical data sets with similar characteristics in the data pattern.
Conclusion
Although the use of a feature ranking and selection strategy is not new to the applied machine learning literature, to the best of our knowledge it has not been comprehensively explored in the context of medical data classification prior to the current research work. From this perspective, this can be seen as the first attempt where a feature ranking and selection scheme has been applied in the context of medical data classification using a random forest as the final classifier.
Methods
Finally, the ICUSTAYS table, which contains the patient's length of stay, was used in this research. We also use orthogonal clustering (ORCU), which was introduced in [189] by a subset of the authors.
Results
As an example, when features are ranked by InfoGain, 1/4 of the total features are selected. As an example, when the features are sorted by SVM ranking, 3/4 of the total features are selected.
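The rank-then-select step described above can be sketched as follows. This is a simplified illustration that assumes discrete (already discretized) feature values and computes information gain directly; the function names and the dictionary-based input format are hypothetical, and the sketch is not the Weka evaluator used in the experiments.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(feature_values, labels):
    """Reduction in label entropy obtained by splitting on a discrete feature."""
    n = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    conditional = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - conditional


def select_top_fraction(features, labels, fraction):
    """Rank features by information gain and keep the top `fraction` of them.

    features: dict mapping a feature name to its list of discrete values
    fraction: e.g. 0.25 keeps one quarter of the features (at least one)
    """
    ranked = sorted(
        features,
        key=lambda name: information_gain(features[name], labels),
        reverse=True,
    )
    keep = max(1, int(len(ranked) * fraction))
    return ranked[:keep]
```

Calling `select_top_fraction(features, labels, 0.25)` keeps the top quarter of the ranked features, and passing `0.75` keeps the top three quarters. A different ranker (for example, one based on SVM weights) could be substituted for `information_gain` without changing the selection logic.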
Discussions
Finally, the lower right panel of Figure 6.12 focuses on the MIMIC-III Senior data set and the superiority of the combination. Feature set A contains 17 features (preprocessed) used in the calculation of SAPS-II scores.
Conclusion
ML and BI techniques are used intensively in each of the steps above. One of the key focuses of this research is to analyze respiratory diseases (e.g. COPD, asthma, lung diseases etc.) and to carry out various types of analyses on them.
Association of pre-natal HIV exposure with UCB gene expression and lung function (Aim 1)
Therefore, this study applied a number of ML and BI techniques to find out the association of prenatal maternal HIV with UCB gene expression linked to lung function. To interpret the biological relevance of the traits (i.e. DEGs) for maternal HIV, fGSEA [6] was performed.
Prediction of lung functions using speech and breath sound samples (Aim 2)
These models can predict normal versus abnormal lung function from the sound files with reasonable accuracy. Interestingly, these models can also predict the severity of lung function abnormalities from the sound files.
To aid in disease diagnostics and predictions in general (Aim 3)
To our knowledge, in the context of medical data classification, the strategies were not fully explored before this current research work. Therefore, in this situation, in a real (clinical) sense, while it can be argued that the ranking and selection of features.
Feature interpretation in disease informatics
To help CDS or CP, this thesis presented a feature rank-based ensemble classifier for survival prediction of ICU patients. Although this study has certain limitations in this problem context, the discussion of feature identification and feature importance from a biological and/or medical/clinical point of view provides useful information for future research.
Future Research Directions and Ongoing Works
Several advanced ML algorithms (e.g. Random Forest (RF), Support Vector Machine (SVM), Logistic Regression (LR), MultilayerPerceptron (MLP), Naive Bayes (NB), AdaBoostM1 etc.) have been trained (see Appendix F - Section F.3.3 for more information). In the case of the LSTM pipeline, it has been observed that after running 20 epochs, the trained models overfit the training datasets.
Concluding remarks
This graph shows the accuracy of the LSTM-based pipeline against the number of epochs. The X-axis represents the number of epochs and the Y-axis represents the accuracy of the LSTM technique.
Linear Regression
The additional constant b0, the so-called intercept, is the prediction the model would make if all X's were zero (if that is possible). And the model's prediction errors are generally assumed to be independent and identically normally distributed.
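The role of the intercept can be illustrated with a minimal least-squares sketch for a single predictor. The function name and the restriction to one predictor are assumptions made for brevity; the models in this thesis may use several predictors.

```python
def fit_simple_linear_regression(x, y):
    """Ordinary least squares fit of y = b0 + b1 * x for one predictor.

    Returns (b0, b1), where b0 is the intercept (the prediction at x = 0)
    and b1 is the slope.
    """
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have equal length of at least 2")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    if sxx == 0:
        raise ValueError("x must contain at least two distinct values")
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = mean_y - b1 * mean_x
    return b0, b1
```

For data generated exactly by y = 1 + 2x, the fit recovers an intercept of 1 and a slope of 2, so the prediction at x = 0 equals the intercept.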
Random Forest
In this way, a random forest builds trees that are largely independent of each other, at some cost to the accuracy of each individual tree. There is a rule of thumb that can be implemented for selecting subsamples from observations using random forest.
Support Vector Machine (SVM)
This can be calculated as the perpendicular distance from the line to the support vectors. The main purpose of SVM is to divide the data sets into classes by finding a maximum marginal hyperplane (MMH), and this can be done in the following two steps.
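The perpendicular distance mentioned above can be computed as in the following sketch. It assumes a linear decision boundary written as w·x + b = 0 and, for the margin width, the usual canonical scaling in which the support vectors satisfy w·x + b = ±1; the function names are hypothetical.

```python
import math


def distance_to_hyperplane(w, b, x):
    """Perpendicular distance from point x to the hyperplane w.x + b = 0."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    if norm == 0:
        raise ValueError("w must be a non-zero vector")
    return abs(sum(wi * xi for wi, xi in zip(w, x)) + b) / norm


def margin_width(w):
    """Total margin width 2 / ||w||, assuming support vectors lie on w.x + b = +/-1."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    if norm == 0:
        raise ValueError("w must be a non-zero vector")
    return 2.0 / norm
```

Maximizing the margin is therefore equivalent to minimizing the norm of w subject to every training point being on the correct side of its class boundary.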