Analysis and Prediction of Cholera Disease using Machine Learning Algorithms

This thesis, entitled "Analysis and Prediction of Cholera Disease Using Machine Learning Algorithms", submitted by Roisujaman Shabab, ID in the Department of Software Engineering, Daffodil International University, has been accepted as satisfactory in partial fulfillment of the requirements for the degree of Bachelor of Science in Software Engineering, and approved as to its style and content. I declare that the thesis "Analysis and Prediction of Cholera Disease Using Machine Learning Algorithms" was completed by me under the supervision of Ms. Bachelor of Science from Daffodil International University.

In this study, we examined cholera and its mortality rate over previous years in different countries. We carried out an exploratory data analysis of cholera cases in Bangladesh from 1996 to 2000. In addition to disease prediction, we analyzed data from different countries and computed correlations so that the interrelationships between the indicators can be identified quickly.

Background
Motivation of the Research
Problem Statement
Research Questions
Research Objective
Research Scope
Thesis Organization

The motivation of this research is to use an ML approach that predicts the mortality rate of cholera cases from the existing dataset and extracts meaningful insights. Most previous research relied on statistical analysis, and very little work has been done on predicting cholera with machine learning algorithms. Many evaluation metrics show how good a model is, but previous research did not report them.

The objective is to find the algorithm that works best for predicting cholera on demographic population data of cholera cases. Since we have studied the clinical aspect, our model makes it possible to predict cholera deaths precisely and to find out which factors are associated with the disease. We also show the correlations between the features so that the results can serve as a benchmark for the research community.

Previous literature
Previous research on Analysis Effectiveness in Determining the Epidemic
Previous research on forecasting Cholera disease
Research Gap
Summary

A benchmark was developed by simulating the state of the system and the predictive capabilities of the new tools in the early stages of the 2010 Haiti cholera outbreak, using only the knowledge available at the time. The study of patterns and the creation of computational systems that can learn and make predictions is one of the applications of machine learning techniques. The authors (Daisy et al., 2020) present a new exploration of the potential of a machine learning approach to predict environmental cholera risk in coastal India, which has a population of more than 200 million people, using critical climate variables derived from atmospheric, terrestrial and oceanic satellites.

The authors (Badkundri, Valbuena, Pinnamareddy, Cantrell, & Standeven, 2019) proposed the Cholera Artificial Learning Model (CALM), a series of four extreme gradient boosting (XGBoost) machine learning models that predict the number of new cholera cases a governorate in Yemen will encounter over a period of two weeks to two months. CALM uses rainfall data, historical cholera case and death data, civil war casualties, and interactions among governorates across different time frames to create a new machine learning approach. A study (Leo, Luhanga, & Michael, 2019) proposed machine learning techniques to model a cholera epidemic with seasonal weather changes while addressing the data imbalance problem.

The performance of the seven models was also assessed using sensitivity, specificity and balanced accuracy measures. Overall, the findings helped us better understand the critical functions of machine learning techniques in healthcare data. This paper explains how to use data from multiple sources and machine learning techniques to predict the probability of Cholera outbreaks in different areas over time.

The results of the experiments show that combining solar terms with ROSE resampling and the random forest method results in (AUC) with balanced sensitivity and specificity. Based on the review of the above research, most studies were carried out with analysis alone rather than with machine learning technology for disease prediction in the context of Bangladesh. The literature review makes clear that the majority of research concentrated on statistical analysis or on time-series data for forecasting outbreaks, but as far as we know, no sufficient studies have been carried out on cholera case fatality prediction using a machine learning approach.

We applied a variety of machine learning algorithms in our research and selected the best-performing one to predict cholera.

Research Dataset

Data Preprocessing

Algorithm Selection

The gradient boosting algorithm works relatively well for regression-type problems (González-Recio, Jiménez-Montero, & Alenda, 2013). The difference between gradient boosting and AdaBoost (adaptive boosting) is that adaptive boosting gradually reduces the error by updating the weights of wrongly predicted samples, whereas gradient boosting optimizes a loss function. To optimize this loss function, each weak learner is fitted so that it improves on the previous one.

On the other hand, gradient boosting consists of three components: a weak learner, loss function optimization and an additive model.
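The three components can be illustrated with a minimal sketch in pure Python. It is not the implementation used in this thesis: the one-split stump as weak learner, the squared-error loss, the learning rate, the function names and the toy data are all assumptions made for illustration.

```python
from statistics import mean


def fit_stump(x, residuals):
    """Weak learner: a one-split regression stump fitted to the residuals.

    Assumes x contains at least two distinct values.
    """
    best = None
    for threshold in sorted(set(x))[:-1]:
        left = [r for xi, r in zip(x, residuals) if xi <= threshold]
        right = [r for xi, r in zip(x, residuals) if xi > threshold]
        left_value, right_value = mean(left), mean(right)
        sse = (sum((r - left_value) ** 2 for r in left)
               + sum((r - right_value) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_value, right_value)
    _, threshold, left_value, right_value = best
    return lambda xi: left_value if xi <= threshold else right_value


def gradient_boost(x, y, n_rounds=50, learning_rate=0.1):
    """Additive model: a constant base prediction plus scaled weak learners."""
    base = mean(y)
    stumps = []
    predictions = [base] * len(y)
    for _ in range(n_rounds):
        # For squared-error loss the negative gradient is the residual.
        residuals = [yi - pi for yi, pi in zip(y, predictions)]
        stump = fit_stump(x, residuals)
        stumps.append(stump)
        predictions = [pi + learning_rate * stump(xi)
                       for xi, pi in zip(x, predictions)]
    return lambda xi: base + learning_rate * sum(s(xi) for s in stumps)


def mse(y_true, y_pred):
    """Mean squared error, the loss being optimized in this sketch."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


# Synthetic toy data, not taken from the cholera dataset.
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [2.0, 2.5, 3.1, 4.8, 5.2, 7.9, 8.1, 9.5]

model = gradient_boost(x, y)
training_error = mse(y, [model(xi) for xi in x])
```

Each round fits a new stump to what the current model still gets wrong, so the training error shrinks as more weak learners are added.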

Data Analysis & Experimental Result

Experimental Result & Evaluation Metrics
Exploratory Data Analysis

The model was evaluated with four metrics: Mean Absolute Error (MAE), Mean Squared Error (MSE), Root Mean Squared Error (RMSE) and R-Square. MAE represents the difference between the original and predicted values, calculated by averaging the absolute differences over the entire data set. MSE represents the difference between the original and predicted values, calculated by averaging the squared differences over the data set. RMSE is the square root of MSE.
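As a sketch of how these four metrics can be computed, the following pure-Python function uses the standard textbook definitions. The function name and the sample values are hypothetical and are not results from this study.

```python
from math import sqrt


def regression_metrics(y_true, y_pred):
    """Return MAE, MSE, RMSE and R-Square for paired observations.

    Assumes y_true is non-empty and not constant (otherwise R-Square is undefined).
    """
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    mse = sum(e * e for e in errors) / n
    rmse = sqrt(mse)
    mean_true = sum(y_true) / n
    ss_res = sum(e * e for e in errors)                   # unexplained variation
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)    # total variation
    r2 = 1 - ss_res / ss_tot
    return {"mae": mae, "mse": mse, "rmse": rmse, "r2": r2}


# Hypothetical values chosen for easy checking by hand.
metrics = regression_metrics([3, 5, 7, 9], [2, 5, 8, 9])
```

For these sample values the errors are 1, 0, -1 and 0, so MAE and MSE are both 0.5 and R-Square is 1 - 2/20 = 0.9.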

R-Square is the coefficient that indicates how well the predicted values fit the original values. Without analyzing a stool sample, it is almost impossible to distinguish a single patient with cholera from a patient infected with another pathogen that causes acute watery diarrhea. Because the disease spreads rapidly, studying the clinical characteristics of several patients who are part of a suspected outbreak of acute watery diarrhea can help detect cholera.

While the treatment of patients with acute watery diarrhea is the same regardless of the cause, identifying cholera is critical because of the risk of a widespread outbreak. Despite a slight increase in the number of cholera cases, we can see that Bangladesh has made progress in its fight against cholera. This dataset has some data quality issues, such as missing values, data shape mismatches and invalid numeric input.
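One way such issues could be handled is sketched below. The field names, the sample rows and the choice to drop incomplete records are assumptions made for illustration; the source does not state which cleaning strategy was actually applied.

```python
def parse_count(raw):
    """Return a non-negative int, or None when the value is missing or invalid."""
    if raw is None:
        return None
    text = str(raw).strip().replace(",", "").replace(" ", "")
    if not (text.isascii() and text.isdigit()):
        return None
    return int(text)


def clean_rows(rows, fields=("cases", "deaths")):
    """Keep only rows whose numeric fields all parse; convert them to int."""
    cleaned = []
    for row in rows:
        parsed = {field: parse_count(row.get(field)) for field in fields}
        if any(value is None for value in parsed.values()):
            continue  # drop rows with missing or invalid counts
        cleaned.append({**row, **parsed})
    return cleaned


# Hypothetical raw records showing each kind of problem.
raw_rows = [
    {"country": "A", "year": "1996", "cases": "1 234", "deaths": "12"},
    {"country": "B", "year": "1997", "cases": "Unknown", "deaths": "3"},  # invalid number
    {"country": "C", "year": "1998", "cases": "56"},                      # shape mismatch
    {"country": "D", "year": "1999", "cases": "7,890", "deaths": ""},     # missing value
]

clean = clean_rows(raw_rows)
```

Dropping rows is the simplest option. Imputing missing counts would keep more data but adds assumptions about the underlying values.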

The largest number of confirmed deaths in 2016 were in the Democratic Republic of the Congo, Somalia, Haiti, the United Republic of Tanzania, Yemen, South Sudan, Kenya, Malawi, Nigeria and the Dominican Republic. The countries with the most Cholera cases in 2016 were Haiti, the Democratic Republic of the Congo, Yemen, Somalia, the United Republic of Tanzania, Kenya, South Sudan, Malawi, the Dominican Republic and Mozambique. The countries with the highest death rates in 2016 were Niger, Congo, Zimbabwe, Nigeria, Angola, Somalia, the Democratic Republic of Congo, Malawi, Dominican Republic and Uganda.

The total number of outbreaks, the average death rate, and the mortality rate have all decreased over time, but cholera is still ravaging a few countries.

Figure 5: Exploratory data analysis for the top 10 countries with the most cholera cases.

Figure 6: Finding the correlation between the independent variables in terms of cholera disease.

Figure 4: Accuracy Graphs of the algorithms that have been applied

Findings and Contributions

Testing hypotheses about a model was not found in previous research. Previous research has also not used biostatistics, which is very important when analyzing the disease we are observing.

List of Abbreviations
ML: Machine Learning
DL: Deep Learning

Keywords: Machine Learning (ML), RMSE, MSE, Gradient Boosting and Cholera Disease

Introduction

Background

Cholera is not only a global problem but also a historical one. Increased levels of ocean chlorophyll are linked to an increase in the severity of cholera in Bangladesh.

CALM can also learn complex non-linear relationships found in epidemiological phenomena thanks to machine learning and comprehensive feature engineering.

Research Gap

Based on the review of the above research, most studies were carried out with analytics rather than with machine learning technology for disease prediction in the context of Bangladesh.

Research Methodology

Figure 1 illustrates the architecture diagram of the proposed research.

The third weak learner is better than the second, so as the number of weak learners increases, the amount of error in the model decreases and the model becomes a stronger learner.

Results and Discussion

The results analysis section is divided into three segments: Experimental Results and Model Evaluation, Exploratory Data Analysis and Comparative Analysis.

Table 1: Accuracy Ratio of Gradient Boosting Algorithm (columns: MAE %, MSE %, RMSE %, Accuracy %, R)

Figure 7: Investigation of the average number of cholera deaths from 1950 to 2010

Figure 8: Investigation of cholera status in Bangladesh from 1973 to 2000

It is essential to test the hypothesis of a model because it is never possible to choose a suitable model without hypothesis testing.

Limitations

Recommendations for Future Works