Study and Analysis of Covid-19 Patients Using Machine Learning Methods


Yasser M. Alginahi¹ and Mohammad Zubair Khan²

¹Adrian College, MI, USA & University of Windsor, ON, Canada, and ²Department of Computer Science, College of Computer Science and Engineering, Taibah University, Saudi Arabia

[email protected]

Abstract. Machine Learning (ML) systems have been used in healthcare to recognize and diagnose diseases using patient data. The use of ML in technology has reformed and improved healthcare by automatically detecting and diagnosing diseases, which in turn improves patients' health and saves lives. Therefore, in this study, several ML algorithms are used to predict the death or recovery of COVID-19 patients. In terms of AUC, the Naïve Bayes and Bagged Trees algorithms provided the best performance, with values of 79% and 77%, respectively. In terms of accuracy, however, the Medium Tree and the ensemble method Boosted Tree classification algorithms showed 89% accuracy. This study showed that using ML technology could alert healthcare providers to provide faster treatment for high-risk COVID-19 patients, which in turn saves lives and improves the quality of healthcare service.

1. Introduction

Today, the novel coronavirus, COVID-19, is a term known to everyone around the globe. COVID-19 stands for COrona VIrus Disease 2019. According to the World Health Organization (WHO), the COVID-19 virus is a new virus linked to the same family of viruses as Severe Acute Respiratory Syndrome (SARS) and some types of common cold^[1]. On the 11th of March 2020, the WHO declared the COVID-19 outbreak a pandemic^[2]. This virus has put high pressure on healthcare services worldwide, and many countries are struggling to cope with the crisis. It has spread like wildfire and could not be contained because of its characteristics, incubation period and behavior; to this day scientists are still discovering new things about the way it behaves. COVID-19 started in Wuhan, China in December 2019 and by July 2020 had spread to 213 countries, and the number of infection cases is rising every day ^[4]. In addition, the behavior of people as well as limited healthcare resources caused the virus to spread more in some countries than in others. From New York City to the isolated villages of the Amazon, and anywhere on earth where humans live, this unprecedented spread of COVID-19 has brought both good and bad with it. The closeness of families, the collaboration between scientists, the goodness of people and humanity, and the lower pollution of the environment are a few examples of the good things this pandemic has brought. On the other hand, the crash of the stock market, the loss of jobs, and the closing of businesses and schools are some examples of its drawbacks. The reality is that we are now living in a new era, which we may refer to as the coronavirus era or COVID-19 era, and many of us are wondering when we will go back to our normal routines.

Recently, the research community has seen an unprecedented surge in rapid publications, and most of these publications are related to the novel coronavirus. A quick search on Google on September 13, 2020 with the term “coronavirus” returned 2,910,000,000 results, and the term “COVID19” returned 6,600,000,000 results. A search on Google Scholar with the term “Covid-19” returned 1,300,000 results, and the term “coronavirus” returned 720,000 results of scholarly publications in scientific journals and conferences. This record number of publications in a short period of time shows the need to come up with a remedy for this virus, which has disrupted the lives of billions of people. Some of the scientific publications that appeared early in the pandemic, in March and April of 2020, have had high impact: Google Scholar shows that some research papers were cited by over 1500 papers in a matter of 2–3 months^[3]. Researchers are therefore putting a lot of effort into research ideas that link COVID-19 to their own specializations. COVID-19 research is not limited to the healthcare sector or to science fields such as Biology and Chemistry. It is a topic that has affected our lives in many dimensions, and as a result many research publications are being generated from different fields. Research from science, technology, engineering, health, social sciences, law enforcement and many more areas has recently been published. In addition, the manufacturing sector has seen a surge in products related to COVID-19. Factories have been forced to manufacture products that help reduce the spread of this virus: some auto manufacturers are producing ventilators; distilleries are producing sanitizing products, such as hand sanitizers and alcohol; pharmaceutical companies are manufacturing masks and trial medicines; and the technology sector is developing apps that track the number of cases as well as apps that trace the contacts of infected cases. Pharmaceutical companies and universities are also racing to come up with an effective vaccine for this virus.

Health organizations in many countries predict that this virus is not going to leave us soon, and as a result businesses and people are struggling to find alternatives that help them cope with their daily activities. Many people are planning to work and study online in the coming year; for example, many universities, colleges and schools around the globe have planned for at least the first semester of the coming academic year to be online. With the surge in the number of COVID-19 cases, it is reported that during the month of July one million new cases were recorded in a matter of 4–5 days, and more and more countries are reporting new cases. In September 2020, the pace increased to a million cases in a matter of 3–4 days. In some countries, the number of cases increased after the partial opening of some businesses and after people were allowed to go out in limited group sizes, as stage 1 or stage 2 of a strategy for returning to normal activities. With the expected second wave of this pandemic many things could change, and millions of people could be infected with this virus, which does not seem to be fading even though the warmer weather was expected to slow its spread. Unfortunately, the opposite is seen, with many cases being reported this summer: the USA, India, Brazil, Russia, Peru, Colombia, Mexico, South Africa, Spain and Argentina account for 60–70% of the daily reported cases^[4]. Many countries are not testing on a large scale, and therefore “the total number of cases of COVID-19 is not known. That’s partly because not everyone with COVID-19 is tested ^[5]”.

According to newscientist.com, the reporting rate of COVID-19 cases varies drastically from one country to another, for example, USA (35%), Ghana (95%), Oman (95%), Sweden (19%) and Yemen (3%). However, symptomless cases are not reflected in these statistics, and they could account for about half of all infected cases ^[6]. At the time this paper was prepared for publication, on September 13, 2020, the total numbers of reported cases, recoveries and deaths according to worldometers.info ^[4] were as given in Table 1.

Table 1. The statistics of COVID-19 metrics on September 13, 2020.

Metric                 Count
Total Cases            29,182,627
Total Deaths           928,281
Total Recovered        21,027,197
Active Cases           7,227,149
Serious, Critical      60,467
Total Cases/1M pop     3,722
Total Deaths/1M pop    119.1

*1M pop = per 1 million population

Healthcare centers in many countries report daily statistics on COVID-19. These statistics are gathered by the central healthcare centers in each country and reported to the WHO. They are also reproduced by other organizations, such as the European Centre for Disease Prevention and Control, the US National Institutes of Health, and many others, which make them available as datasets for research purposes. These datasets can be studied and analyzed to provide predictions and analytical data that could identify trends useful for controlling the spread, reducing the mortality rate, identifying symptoms, identifying the most vulnerable population, etc. In this study, the novel-corona-virus-2019-dataset from Kaggle.com was used with ML algorithms to predict the death and recovery of patients, and the accuracy of these predictions is reported ^[14].

This paper is organized as follows. After this introduction, the related work is presented in Section 2. Next, the methodology used in this study is discussed in Section 3, followed by the results and discussion in Section 4 and, finally, the conclusion in Section 5.

2. Related Work

According to expertsystem.com ^[7],

“Machine Learning is an application of Artificial Intelligence (AI) that provides systems the ability to automatically learn and improve from experience without being explicitly programmed. ML focuses on the development of computer programs that can access data and use it [to] learn for themselves”.

ML processes data, which can be in different forms, such as text, images, raw data, etc., to provide automatic predictions and make decisions. Incorporating ML in technology has reformed and improved healthcare by automatically detecting and diagnosing diseases, which in turn saves lives.

Ardabili et al. ^[8] carried out a comparative analysis of ML and soft computing models for predicting the COVID-19 outbreak. The study concluded that, because of the complex nature of the COVID-19 outbreak and its non-uniform behavior in different countries around the globe, generalized models are not feasible; other studies likewise note that such outbreaks are not likely to be replicated elsewhere ^[9]. The comparative analysis found that the Multi-Layered Perceptron (MLP) and the Adaptive Network-based Fuzzy Inference System (ANFIS) produced promising results, and the authors recommend ML as an effective tool to model the outbreak ^[8].

The online platform CoronaTracker ^[10] evolved as a result of the COVID-19 pandemic; it is an ongoing source of the latest reliable news, statistics and analysis on COVID-19. It used the Susceptible, Exposed, Infectious, Removed (SEIR) model, a standard model for the spread of a virus, to forecast the trajectory of the COVID-19 outbreak from January 20 until March 3, 2020, and it is an ongoing project that carries out continuing investigations on COVID-19. The objective of this platform is to provide science-based data analysis, prediction, up-to-date statistics and verified authentic news.
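To make the SEIR idea concrete, the following is a minimal sketch of a discrete-time (forward Euler) SEIR simulation. It is only an illustration of the compartment model named above: the function name `simulate_seir`, the population size and the rate parameters `beta`, `sigma` and `gamma` are hypothetical values chosen for the example and are not taken from CoronaTracker.

```python
from typing import List, Tuple

State = Tuple[float, float, float, float]  # (S, E, I, R)


def simulate_seir(s: float, e: float, i: float, r: float,
                  beta: float, sigma: float, gamma: float,
                  days: int, dt: float = 0.1) -> List[State]:
    """Forward-Euler SEIR simulation; returns one (S, E, I, R) tuple per day.

    beta  : transmission rate (assumed value, per day)
    sigma : rate of leaving the exposed state, 1 / incubation period
    gamma : removal rate, 1 / infectious period
    """
    n = s + e + i + r                      # total population stays constant
    steps_per_day = round(1 / dt)
    history: List[State] = [(s, e, i, r)]
    for _ in range(days):
        for _ in range(steps_per_day):
            new_exposed = beta * s * i / n * dt   # S -> E
            new_infectious = sigma * e * dt       # E -> I
            new_removed = gamma * i * dt          # I -> R
            s -= new_exposed
            e += new_exposed - new_infectious
            i += new_infectious - new_removed
            r += new_removed
        history.append((s, e, i, r))
    return history


# Hypothetical scenario: one million people, ten initially infectious.
history = simulate_seir(s=999_990.0, e=0.0, i=10.0, r=0.0,
                        beta=0.5, sigma=1 / 5, gamma=1 / 7, days=60)
final_s, final_e, final_i, final_r = history[-1]
```

Each step only moves people between compartments, so the total population is conserved; real forecasting tools fit the rates to reported case data rather than fixing them by hand.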

Yan et al. ^[11] presented a study of 404 infected patients in Wuhan, China. A database of blood sample results with three biomarkers, namely lactic dehydrogenase (LDH), high-sensitivity C-reactive protein (hs-CRP) and lymphocyte, was used with the aid of ML tools to predict the survival of patients. The study found that “relatively high levels of LDH alone seem to play a crucial role in distinguishing the vast majority of cases that require immediate medical attention.” With a 90% accuracy rate, the discovery of this crucial predictive biomarker of disease severity provided decision-makers with a simple method to identify high-risk cases and prioritize patients' hospitalization, which leads to a reduction in the death rate.

Pinter et al. ^[12] proposed an ML approach using an adaptive network-based fuzzy inference system (ANFIS) and an MLP-imperialist competitive algorithm (MLP-ICA). This hybrid model is used to predict the time series of infected individuals and the mortality rate in Hungary. The model provided predictions of when the COVID-19 outbreak and the mortality rate will drop noticeably. It was verified over a period of time, and it is expected to retain its accuracy if no significant interruption occurs.

Jiang et al. ^[13] proposed an AI framework with predictive analysis abilities to provide automatic clinical decision support. In addition, it identified the combination of characteristics that predict different outcomes in COVID-19; for example, developing acute respiratory distress syndrome is a severe outcome of the virus. This study was carried out in Wenzhou, China, on data from 53 real patients from two hospitals. Even though the dataset used is small and incomplete, the framework achieved a 70–80% accuracy rate in predicting severe cases and shows the potential of using AI to build prediction models in COVID-19 research.

The works above show that the potential of ML in COVID-19 related research is very promising, especially for developing classification and prediction models. Therefore, the objective of this research work is to predict the death and recovery of patients using different ML algorithms and to compare their accuracy rates.

3. Methodology

The flowchart for the methodology used in this study is shown in Fig. 1. First, the dataset was preprocessed and divided into training and testing datasets. Next, features were selected for processing. Then, the ML system was designed using different algorithms, and the processing was carried out on the training data. Following this, the testing dataset was used to test the ML system. Finally, the results were evaluated.

[Fig. 1 flowchart: Data Preprocessing → Features Selection → Machine Learning Algorithms → Results Evaluation]

Fig. 1. Research Methodology followed in this study.

The dataset used in this study contains the features shown in Table 2. The data was preprocessed using MATLAB 2020a on a Windows 10 Core i7 platform. The dataset was first preprocessed to get a clean dataset file.

3.1 Preprocessing

The preprocessing step is considered the most important step in any ML system. In this study, the Kaggle dataset was first preprocessed by handling missing data and choosing the features most relevant to the objectives of this research study. Table 2 shows the features collected from patients. The dataset has many columns, and only specific columns/features were selected for use in this study. In the proposed ML system the data must be all numeric; therefore, the death and recovery data were converted to the binary numbers 1 and 0: the death of the patient was represented by 0 and recovery by 1. This was done using the grp2idx command in MATLAB, which was applied to every unique value in the column. The dataset has many missing values (blank cells), which may generate errors during processing; hence these cells were set to “NA”. For some patients, either the date of admission to hospital or the date when the patient started showing symptoms was missing. In this case, whichever date was recorded was copied to the missing cell. The total number of days spent in hospital was recorded together with the outcome, which was either death or recovery.
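The two cleaning rules described above (binary outcome encoding and copying the recorded date into the missing one) can be sketched as follows. This is an illustrative Python sketch, not the MATLAB code used in the study: the helper names `encode_outcome` and `fill_missing_dates` and the toy records are hypothetical, the column names follow Table 2, and dropping patients without a recorded outcome is an assumption made only for this example.

```python
from datetime import date
from typing import Optional, Tuple


def encode_outcome(death: int, recov: int) -> Optional[int]:
    """Encode the outcome as in the text: death -> 0, recovery -> 1.

    Returns None when neither outcome is recorded.
    """
    if death == 1:
        return 0
    if recov == 1:
        return 1
    return None


def fill_missing_dates(sym_on: Optional[date],
                       hospital_vis: Optional[date]
                       ) -> Tuple[Optional[date], Optional[date]]:
    """If exactly one of the two dates is missing, copy the recorded one."""
    if sym_on is None:
        sym_on = hospital_vis
    elif hospital_vis is None:
        hospital_vis = sym_on
    return sym_on, hospital_vis


# Hypothetical toy records using the column names from Table 2.
records = [
    {"id": 1, "sym_on": date(2020, 1, 10), "hospital_vis": None,
     "death": 0, "Recov": 1},
    {"id": 2, "sym_on": None, "hospital_vis": date(2020, 1, 15),
     "death": 1, "Recov": 0},
    {"id": 3, "sym_on": date(2020, 1, 20), "hospital_vis": date(2020, 1, 22),
     "death": 0, "Recov": 0},
]

cleaned = []
for rec in records:
    outcome = encode_outcome(rec["death"], rec["Recov"])
    if outcome is None:
        continue  # assumption: patients without an outcome are left out
    sym_on, hospital_vis = fill_missing_dates(rec["sym_on"], rec["hospital_vis"])
    cleaned.append({"id": rec["id"], "sym_on": sym_on,
                    "hospital_vis": hospital_vis, "outcome": outcome})
```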

3.2 Feature Selection

In the feature selection stage, Principal Component Analysis (PCA) was used to reduce the dimensionality of the predictor space. Reducing the dimensionality can yield classification models in MATLAB's Classification Learner that help prevent overfitting. PCA linearly transforms the predictors in order to remove redundant dimensions and produces a new set of variables known as principal components.
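As an illustration of how PCA removes a redundant dimension, the sketch below centers a toy dataset, builds its covariance matrix and finds the first principal component by power iteration. Power iteration is a simple stand-in chosen for this example and is not necessarily how MATLAB computes PCA; the toy data and all function names are hypothetical.

```python
import math
from typing import List

Matrix = List[List[float]]


def center(data: Matrix) -> Matrix:
    """Subtract each column's mean so every predictor has zero mean."""
    n, d = len(data), len(data[0])
    means = [sum(row[j] for row in data) / n for j in range(d)]
    return [[row[j] - means[j] for j in range(d)] for row in data]


def covariance(centered: Matrix) -> Matrix:
    """Sample covariance matrix of already-centered data."""
    n, d = len(centered), len(centered[0])
    return [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
            for i in range(d)]


def first_component(cov: Matrix, iters: int = 100) -> List[float]:
    """Leading eigenvector of cov via power iteration (unit length).

    Toy implementation: assumes cov is not the zero matrix.
    """
    d = len(cov)
    v = [1.0 / math.sqrt(d)] * d
    for _ in range(iters):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v


# Toy predictors: the second column is exactly twice the first (redundant).
data = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0], [4.0, 8.0]]
centered = center(data)
cov = covariance(centered)
pc1 = first_component(cov)

# Project every sample onto the first principal component.
scores = [sum(r[j] * pc1[j] for j in range(len(pc1))) for r in centered]

# Share of the total variance captured by that single component.
total_variance = sum(cov[i][i] for i in range(len(cov)))
explained = (sum(s * s for s in scores) / (len(scores) - 1)) / total_variance
```

Because the two toy predictors are perfectly correlated, a single principal component carries essentially all of the variance, so the second dimension can be dropped without losing information.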

3.3 Machine Learning System

In this work, ML algorithms available in the MATLAB tool were used. The following algorithms were used: SVM, Naïve Bayes, Medium Tree, Coarse Tree, and ensemble methods such as RUS Boosted Trees, Boosted Trees, and Bagged Trees. The results and discussion section presents the outcomes obtained with these algorithms.

Using assessment metrics, the performance and accuracy of the algorithms are presented based on the parameters of the confusion matrix (Table 3). These parameters were used to calculate the Accuracy and the Area Under the Curve (AUC).

In the confusion matrix (Table 3):

TP = “A positive instance which was detected positive by the algorithm”.

FP = “A negative instance which was detected positive by the algorithm”.

FN = “A positive instance which was detected negative by the algorithm”.

TN = “A negative instance which was detected negative by the algorithm”.

Based on the above parameters, a brief description of the metrics is as follows:

The Accuracy is the “ratio of correctly classified instances to the total number of instances”.

$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$

The AUC evaluation shows how well a classifier can differentiate between death and recovery classes.

$$\mathrm{AUC} = \frac{1 + TP_{\gamma} - FP_{\gamma}}{2}$$
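The two metrics can be computed directly from the four confusion-matrix counts, as in the short sketch below. It assumes that the symbols TP_γ and FP_γ in the AUC formula denote the true positive rate TP/(TP + FN) and the false positive rate FP/(FP + TN); the paper does not define them explicitly, and the counts used here are hypothetical.

```python
def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    """Ratio of correctly classified instances to all instances."""
    return (tp + tn) / (tp + tn + fp + fn)


def single_point_auc(tp: int, tn: int, fp: int, fn: int) -> float:
    """AUC = (1 + TPR - FPR) / 2, assuming TP_gamma and FP_gamma are rates."""
    tpr = tp / (tp + fn)   # true positive rate
    fpr = fp / (fp + tn)   # false positive rate
    return (1 + tpr - fpr) / 2


# Hypothetical counts, with "died" as the positive class (see Table 3).
tp, fp, fn, tn = 30, 10, 20, 140

acc = accuracy(tp, tn, fp, fn)
auc = single_point_auc(tp, tn, fp, fn)
```

With these counts the accuracy is 170/200 = 0.85, while the AUC is lower because only 30 of the 50 positive cases are detected. This is why the two metrics can rank classifiers differently on imbalanced data.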


4. Results and Discussion

The accuracy of the ML algorithms used in the study is given in Table 4 and Fig. 2.

In terms of accuracy, the Medium Tree and the ensemble method Boosted Trees classification algorithms perform best, each with 89% accuracy. The worst performance is given by RUS-Boosted Trees, with an accuracy of 53.1%.

The performance of the various ML algorithms in terms of AUC is given in Table 5, and Figures 3 and 4 provide the AUC results for the recovery and death cases, respectively.

In terms of AUC values, the best performance is given by Naïve Bayes with 0.79 (79%), and the second best is Bagged Trees with 0.77 (77%).
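The comparison above can be reproduced from the reported numbers. The sketch below simply transcribes the values of Tables 4 and 5 into dictionaries and ranks the algorithms by each metric; the variable names are illustrative.

```python
# Accuracy (%) per algorithm, transcribed from Table 4.
accuracy_pct = {
    "RUS-Boosted Trees": 53.1,
    "Boosted Trees": 89.0,
    "Bagged Trees": 88.8,
    "Optimizable SVM": 88.6,
    "Naïve Bayes": 86.5,
    "Medium Tree": 89.0,
    "Coarse Tree": 88.8,
}

# AUC per algorithm, transcribed from Table 5.
auc = {
    "RUS-Boosted Trees": 0.76,
    "Boosted Trees": 0.71,
    "Bagged Trees": 0.77,
    "Optimizable SVM": 0.74,
    "Naïve Bayes": 0.79,
    "Medium Tree": 0.75,
    "Coarse Tree": 0.67,
}

# Algorithms tied for the highest accuracy.
best_accuracy = max(accuracy_pct.values())
top_by_accuracy = sorted(name for name, value in accuracy_pct.items()
                         if value == best_accuracy)

# Algorithm with the lowest accuracy.
worst_by_accuracy = min(accuracy_pct, key=lambda name: accuracy_pct[name])

# Algorithms ordered from highest to lowest AUC.
ranked_by_auc = sorted(auc, key=lambda name: auc[name], reverse=True)
```

The two rankings differ: the algorithms with the highest accuracy are not the ones with the highest AUC, and RUS-Boosted Trees has the lowest accuracy but the third-highest AUC.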

Table 2. Features provided in the novel-corona-virus-2019-dataset file.

Feature                 Description                                           Type
id                      Patient ID                                            Numeric
location                The location the patient belongs to                   String, Categorical
country                 Patient’s native country                              String, Categorical
gender                  Patient’s gender                                      String, Categorical
age                     Patient’s age                                         Numeric
sym_on                  The date the patient started noticing the symptoms    Date
hospital_vis            Date when the patient visited the hospital            Date
vis_wuhan               Whether the patient visited Wuhan, China              Numeric, Categorical
from_wuhan              Whether the patient belongs to Wuhan, China           Numeric, Categorical
death                   Whether the patient passed away due to COVID-19       Numeric, Categorical
Recov                   Whether the patient recovered                         Numeric, Categorical
symptom1 to symptom5    Symptoms noticed by the patient                       String, Categorical

Table 3. Confusion Matrix.

                      Actual: Died           Actual: Recover
Predicted: Died       True Positive (TP)     False Positive (FP)
Predicted: Recover    False Negative (FN)    True Negative (TN)

Table 4. The performance analysis (accuracy) of ML algorithms.

Algorithm            Accuracy
RUS-Boosted Trees    53.1%
Boosted Trees        89%
Bagged Trees         88.8%
Optimizable SVM      88.6%
Naïve Bayes          86.5%
Medium Tree          89%
Coarse Tree          88.8%


Fig. 2. Accuracy for various ML Algorithms.

Table 5. The performance analysis (AUC) of ML algorithms.

Algorithm            AUC
RUS-Boosted Trees    0.76
Boosted Trees        0.71
Bagged Trees         0.77
Optimizable SVM      0.74
Naïve Bayes          0.79
Medium Tree          0.75
Coarse Tree          0.67


Fig. 3. AUC for various ML algorithms for recovery cases.


Fig. 4. AUC using various ML algorithms for death class.

5. Conclusions

The use of ML in technology has reformed and improved healthcare by automatically detecting and diagnosing diseases, which in turn saves lives. In this study, ML algorithms were used to predict the death and recovery of patients. The ML algorithms used were: SVM, Naïve Bayes, Medium Tree, Coarse Tree, RUS-Boosted Trees, Boosted Trees, and Bagged Trees. In terms of AUC values, the best performance is given by Naïve Bayes and Bagged Trees, with 79% and 77%, respectively. In terms of accuracy, the Medium Tree and the ensemble method Boosted Trees classification algorithms perform best, with 89% accuracy. The worst performance was given by RUS-Boosted Trees, with an accuracy of 53.1%. This study shows that using ML to predict the death or recovery of COVID-19 patients could help high-risk patients by enabling early hospitalization and/or treatment.


References

[1] https://www.who.int/docs/default-source/coronaviruse/key-messages-and-actions-for-covid-19-prevention-and-control-in-schools-march-2020.pdf?sfvrsn=baf81d52_4

[2] https://www.euro.who.int/en/health-topics/health-emergencies/coronavirus-covid-19#:~:text=WHO%20announced%20COVID%2D19,on%2011%20March%202020.

[3] https://scholar.google.com/scholar?hl=en&as_sdt=0%2C5&q=covid-19&btnG=&oq=covid

[4] https://www.worldometers.info/coronavirus/

[5] Read, J.M., Bridgen, J.R., Cummings, D.A., Ho, A. and Jewell, C.P., "Novel coronavirus 2019-nCoV: early estimation of epidemiological parameters and epidemic predictions." medRxiv. 2020;2020.01.23.20018549.

[6] https://www.newscientist.com/article/mg24632873-000-how-many-of-us-are-likely-to-have-caught-the-coronavirus-so-far/#ixzz6SfusDpNg

[7] https://expertsystem.com/machine-learning-definition/

[8] Ardabili, S. F., Amir, M., Pedram, G., Filip, F., Annamaria, R., Varkonyi-Koczy, U. R., Timon, R. and Peter, M. A., "COVID-19 outbreak prediction with machine learning." Available at SSRN 3580188 (2020).

[9] Remuzzi, A. and Remuzzi, G., "COVID-19 and Italy: what next?" The Lancet, 395(10231): 1225-1228, April 11, 2020.

[10] Hamzah, F. Binti, A., Lau, C., Nazri, H., Ligot, D. V., Lee, G. and Tan, C. L., "CoronaTracker: worldwide COVID-19 outbreak data analysis and prediction." Bull World Health Organ, 1 (2020): 32.

[11] Yan, Li, Hai-Tao, Zhang, Jorge, Goncalves, Yang, Xiao, Maolin, Wang, Yuqi, Guo, Chuan, Sun, et al. "A machine learning-based model for survival prediction in patients with severe COVID-19 infection." MedRxiv (2020).

[12] Pinter, G., Imre, F., Amir, M., Pedram, G. and Richard, G., "COVID-19 Pandemic Prediction for Hungary; a Hybrid Machine Learning Approach." Mathematics, 8 (6) (2020): 890.

[13] Jiang, X., Megan, C., Anasse, B., Junzhang, W., Xinyue, J., Jianping, H., Jichan, S. et al. "Towards an artificial intelligence framework for data-driven prediction of coronavirus clinical severity." CMC: Computers, Materials & Continua, 63 (2020): 537-51.

[14] https://www.kaggle.com/sudalairajkumar/novel-coronavirus-2019-dataset


Arabic abstract (translated)

Study and Analysis of Covid-19 Patients Using Machine Learning Methods

Yasser Alginahi¹ and Mohammad Zubair Khan²

¹Adrian College, Michigan, USA & University of Windsor, Ontario, Canada, and ²Department of Computer Science, College of Computer Science and Engineering, Taibah University, Saudi Arabia

[email protected]

Abstract. Machine Learning systems are used in healthcare to recognize and diagnose diseases using patient data. The use of machine learning systems in technology has reformed and improved healthcare through the automatic detection and diagnosis of diseases, which in turn improves patients' health and saves lives. Therefore, in this study, machine learning algorithms were used to predict the death and recovery of patients. Using several algorithms, the death or recovery of patients is predicted. The Naïve Bayes and Bagged Trees algorithms gave the best performance rates, 79% and 77% respectively. However, in terms of accuracy, the Medium Tree and the ensemble method Boosted Tree classification algorithms showed 89% accuracy. Finally, this study showed that using machine learning technology could alert healthcare providers to provide faster treatment for high-risk coronavirus (COVID-19) patients, which helps save lives and improve the quality of healthcare service.
