
Syed Washfi Ahmad (ID: 211-25-948)

Academic year: 2023


This thesis, entitled "Stroke Prediction Using Machine Learning Techniques" and submitted by Syed Washfi Ahmad, ID No. 211-25-948, to the Department of Computer Science and Engineering, Daffodil International University, has been accepted as satisfactory in partial fulfillment of the requirements for the degree of M.Sc. in Computer Science and Engineering, Faculty of Natural Sciences and Information Technology, Daffodil International University. I hereby declare that this research was conducted by me under the guidance of Md.

I would like to express my heartfelt gratitude to the honorable Head of the Department of CSE, Professor Dr. I would also like to thank all the students at Daffodil International University who took part in discussions during the completion of this course work.

The Logistic Regression, Decision Tree, AdaBoost, Gaussian Naive Bayes, K-Nearest Neighbors, Random Forest, and XGBoost classifiers were used in the study. The proposed study achieved an accuracy of 94 percent, with the Random Forest classifier outperforming the others. Random Forest also has the lowest false positive and false negative rates compared to the other methods.

As a result, Random Forest is close to an ideal classifier for stroke prediction, one that doctors and patients could use to diagnose a probable stroke early.
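To illustrate this kind of classifier comparison, a minimal scikit-learn sketch might look like the following. This is not the thesis code: the dataset here is synthetic, and only three of the seven classifiers are shown.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the stroke dataset (real features differ).
X, y = make_classification(n_samples=500, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "Random Forest": RandomForestClassifier(random_state=42),
}

# Train each model and record its test accuracy.
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = accuracy_score(y_test, model.predict(X_test))

best = max(scores, key=scores.get)
print(best, round(scores[best], 3))
```

On the real stroke dataset, the same loop would be extended with the remaining four classifiers before the accuracies are compared.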

Introduction

  • Motivation
  • Objectives
  • Expected Outcome
  • Report Layout

According to the American Heart Association (AHA), ischemic stroke accounts for 87 percent of all strokes. According to research, Bangladeshis had the highest risk of stroke among three South Asian countries (Bangladesh, Sri Lanka, and Pakistan). It is long past time for our country's health system to conduct the extensive research needed to determine the factors behind the rising number of stroke patients in Bangladesh.

As a result, an intelligent decision support system based on Machine Learning (ML) techniques will be beneficial in the early diagnosis and prediction of stroke, reducing the severity of the condition. The goal of this research is to build a model that can predict stroke early using machine learning techniques, and to create a general model and algorithm to aid in the early diagnosis of a stroke in the brain.

The objectives are: use the spanning tree technique and the nature of the characteristics to select the optimal feature subset; calculate the prediction accuracy by analyzing the stroke data using the provided approaches and other machine learning algorithms; and use different performance evaluation metrics to evaluate the prediction performance of the proposed method and compare it with that of other methods.

This system produces the expected result for stroke prediction based on the provided dataset. After completing all the necessary procedures of the proposed system, the system was ready to make predictions on a real-world database. The expected outcome is to predict stroke from real-world datasets with the achieved accuracy of 94%.

Literature Review

Scope of the problem

Because stroke is closely related to the heart, it is considered a disease of particular research interest. The research topic was chosen because of the high number of people who die as a result of stroke. The aim of this study was therefore to develop a better technique that helps reduce the number of deaths among older age groups.

Challenges

Methodology

Data Preprocessing

  • Handling Missing Data
  • Data Encoding
  • Feature Selection
  • Handling Imbalanced Dataset
  • Splitting the Data

The mean imputation approach is used in this study because it is a frequently used imputation technique that is fast, simple, and straightforward to apply. For example, the JobType attribute has values such as Private and Self-employed, labeled as 1 and 2. Feature importance is one of the most important steps of the machine learning model development process.
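The two preprocessing steps just described, mean imputation and integer encoding of JobType, can be sketched with pandas. The column values here are hypothetical toy data, not rows from the stroke dataset.

```python
import pandas as pd

# Toy frame standing in for the stroke dataset: "bmi" has a missing value,
# and "JobType" is categorical, as described in the text.
df = pd.DataFrame({
    "bmi": [22.5, None, 30.1, 27.4],
    "JobType": ["Private", "Self-employed", "Private", "Self-employed"],
})

# Mean imputation: replace each missing value with the column average.
df["bmi"] = df["bmi"].fillna(df["bmi"].mean())

# Label encoding: map each category to an integer (Private -> 1, Self-employed -> 2).
df["JobType"] = df["JobType"].map({"Private": 1, "Self-employed": 2})

print(df)
```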

The result of the feature importance computation is a set of features together with their importance statistics. Using a random forest, feature importance can be calculated as the average impurity reduction across all decision trees in the forest. This result is also independent of whether the data are linear or non-linear.
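A minimal sketch of this impurity-based importance computation with scikit-learn, on a synthetic dataset rather than the stroke data: `feature_importances_` holds exactly the averaged impurity reduction described above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data; the stroke dataset's features would replace X and y.
X, y = make_classification(n_samples=300, n_features=5,
                           n_informative=3, random_state=0)
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# feature_importances_ is the mean impurity decrease per feature,
# averaged over all trees; the values sum to 1.
for i, imp in enumerate(rf.feature_importances_):
    print(f"feature_{i}: {imp:.3f}")
```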

The problem when dealing with imbalanced datasets is that most machine learning approaches will miss the minority class, resulting in poor performance, even though the performance on the minority class is often the most important. As a result, this imbalance can cause the model to perform poorly in the future. The Synthetic Minority Oversampling Technique (SMOTE) is a kind of data augmentation for the minority class.

To put it another way, SMOTE looks at minority class instances, uses the k nearest neighbors to pick a random nearest neighbor, and then constructs a synthetic instance at a random point between them in feature space. Tuning an algorithm's parameters to perfectly fit the training data usually leads to overfitting, where the algorithm performs poorly on actual test data. For this reason, we divide the dataset into distinct, discrete subsets on which we train different parameters.

The partitioned approach is used to separate the training and test data in this study.
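The SMOTE step described above can be sketched in pure NumPy. This is an illustration of the core interpolation idea only, not the imbalanced-learn implementation used in practice, and the minority-class points are randomly generated toy data.

```python
import numpy as np

rng = np.random.default_rng(0)

def smote_sample(minority, k=3, n_new=5):
    """Generate synthetic minority samples by interpolating between a
    minority point and a random one of its k nearest neighbours."""
    new_points = []
    for _ in range(n_new):
        i = rng.integers(len(minority))
        x = minority[i]
        # Distances from x to every minority point; skip x itself at index 0.
        d = np.linalg.norm(minority - x, axis=1)
        neighbours = np.argsort(d)[1:k + 1]
        j = rng.choice(neighbours)
        gap = rng.random()  # random position on the segment from x to its neighbour
        new_points.append(x + gap * (minority[j] - x))
    return np.array(new_points)

minority = rng.normal(size=(10, 2))  # toy minority-class points
synthetic = smote_sample(minority)
print(synthetic.shape)
```

Each synthetic point lies on the line segment between two real minority points, which is what keeps the oversampled data plausible.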

Figure 3.2.3: Random Forest Feature Importance Score

Research Subject & Instrumentation

  • Classification Algorithms
    • Decision Tree
    • Random Forest
    • Naïve Bayes
    • K-Nearest Neighbor

By evaluating large amounts of data and constructing predictive models, machine learning classification algorithms can provide trustworthy findings and learn from previous computations. Unlabeled data is used for unsupervised learning, while labeled data is used for supervised learning. Machine learning algorithms use a variety of supervised and unsupervised learning approaches, as shown in Table 3.3.1.

Among supervised learning approaches, the Decision Tree can be used to solve both classification and regression problems, but it is most often used for classification. In this tree-structured classifier, internal nodes test dataset attributes, branches represent decision rules, and each leaf node provides a conclusion. Leaf nodes are the outcomes of these decisions and have no further branches.

Decision nodes, on the other hand, are used to make a decision and have multiple branches. Random Forest is another supervised machine learning algorithm generally used to solve classification and regression problems. It builds decision trees from multiple samples, using the majority vote for classification and the mean for regression.
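The two aggregation rules just described, majority vote for classification and the mean for regression, can be shown in a tiny NumPy sketch. The per-tree predictions here are made-up toy values, not outputs of the thesis models.

```python
import numpy as np

# Toy class predictions from five trees (rows) for three samples (columns).
tree_preds = np.array([
    [0, 1, 1],
    [0, 1, 0],
    [1, 1, 1],
    [0, 0, 1],
    [0, 1, 1],
])

# Classification: the forest predicts the majority vote for each sample.
votes = (tree_preds.sum(axis=0) > len(tree_preds) / 2).astype(int)
print(votes)  # [0 1 1]

# Regression: the forest output is the mean of the trees' predictions.
tree_values = np.array([[2.0, 3.0], [2.4, 2.6], [1.6, 3.4]])
print(tree_values.mean(axis=0))  # [2. 3.]
```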

One of the most important features of the Random Forest algorithm is that it can handle datasets with both continuous and categorical variables, for both regression and classification. To categorize data, Naive Bayes uses Bayes' theorem, assuming that the probability of one attribute A is completely independent of the probability of another attribute B. Bayes' theorem explains how to determine the probability of a hypothesis based on prior information.

In Bayes' theorem, Posterior = (Likelihood × Prior) / Evidence: the posterior is the probability of the class given the predictor, the likelihood is the probability of the predictor given the class, the prior is the prior probability of the class, and the evidence is the prior probability of the predictor. In K-Nearest Neighbors, the Manhattan, Minkowski, Euclidean, and Hamming distance formulas are usually used to estimate the distance to the nearest neighbors.
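The four distance formulas named above can be computed directly in NumPy. The two vectors are arbitrary toy examples; Hamming distance is shown on categorical (binary) values, where it counts mismatched positions.

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 6.0, 3.0])

euclidean = np.sqrt(((a - b) ** 2).sum())          # L2 norm of the difference
manhattan = np.abs(a - b).sum()                    # L1 norm of the difference
minkowski = (np.abs(a - b) ** 3).sum() ** (1 / 3)  # general form, here p = 3

# Hamming distance: number of positions where two categorical vectors differ.
hamming = int((np.array([1, 0, 1]) != np.array([1, 1, 1])).sum())

print(euclidean, manhattan, minkowski, hamming)
```

Euclidean distance is the Minkowski distance with p = 2, and Manhattan is the case p = 1.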

Figure 3.3: Basic Steps of Machine Learning

Proposed Model

Results

Experimental Results

From the dataset, stroke patients and non-stroke patients have been labeled as the predicted-stroke and predicted-no-stroke classes, respectively. Seven machine learning classifiers were trained: Naive Bayes, Logistic Regression, AdaBoost, XGBoost, K-Nearest Neighbors, Decision Tree, and Random Forest. 80% of the data was used for training and the rest for validation and testing.

In the accuracy analysis of these classification algorithms, the Random Forest classifier achieved the highest accuracy, 94.704%, performing better than any other classifier. The confusion matrices of the XGBoost, AdaBoost, Logistic Regression (LR), Naive Bayes, and Random Forest classifiers are shown in Figures 4.2 to 4.6, respectively.
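How a binary confusion matrix and the accuracy derived from it are computed can be sketched as follows. The label vectors here are small toy examples, not the thesis's predictions.

```python
import numpy as np

# Toy ground-truth labels and model predictions (1 = stroke, 0 = no stroke).
y_true = np.array([0, 0, 1, 1, 1, 0, 1, 0])
y_pred = np.array([0, 1, 1, 1, 0, 0, 1, 0])

# Count the four cells of the binary confusion matrix.
tp = int(((y_true == 1) & (y_pred == 1)).sum())  # true positives
tn = int(((y_true == 0) & (y_pred == 0)).sum())  # true negatives
fp = int(((y_true == 0) & (y_pred == 1)).sum())  # false positives
fn = int(((y_true == 1) & (y_pred == 0)).sum())  # false negatives

# Accuracy is the fraction of correct predictions.
accuracy = (tp + tn) / len(y_true)
print([[tn, fp], [fn, tp]], accuracy)
```

Low fp and fn counts are what the text means by Random Forest having the lowest false positive and false negative rates.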

Figure 4.2: Confusion matrix for XGBClassifier


Conclusion

Future Work


Plagiarism Report
