International Journal of Social Science Research (IJSSR) eISSN: 2710-6276 | Vol. 4 No. 4 [December 2022]
Journal website: http://myjms.mohe.gov.my/index.php/ijssr
HANDLING IMBALANCE DATA IN MODELLING CORONARY HEART DISEASE
Annurun Najwa Mohd Zamhari1, Nur Fatihah Mohd Fazli2, Zafirah Abd Rahim3, Balkish Mohd Osman4*, Shamsiah Sapri5 and Zuraidah Derasit6
1,2,3,4,5,6 Faculty of Computer and Mathematical Sciences, Universiti Teknologi MARA, Shah Alam, MALAYSIA
*Corresponding author: [email protected]
Article Information:
Article history:
Received date: 23 September 2022
Revised date: 8 November 2022
Accepted date: 28 November 2022
Published date: 7 December 2022

To cite this document:
Mohd Zamhari, A. N., Mohd Fazli, N. F., Abd Rahim, Z., Mohd Osman, B., Sapri, S., & Derasit, Z. (2022). HANDLING IMBALANCE DATA IN MODELLING CORONARY HEART DISEASE. International Journal of Social Science Research, 4(4), 39-51.
Abstract: Coronary heart disease (CHD) is a leading cause of death worldwide. Early CHD risk prediction could minimise mortality by providing early treatment to patients. Studies in the healthcare industry have used patient data to predict the risk of CHD with data mining techniques. In this study, two data mining techniques, decision tree and logistic regression, were used to model CHD using Framingham Heart Study (FHS) data. The random undersampling technique was applied to the dataset to address the imbalanced classification, as the percentage of participants with CHD is only 15.19%. The findings of this study indicate that the decision tree with the entropy splitting algorithm is the best predictive model for identifying participants who are at risk of developing CHD, with an accuracy of 65.57%. The model's ability to predict positive outcomes is much higher after handling the imbalanced classification than without (sensitivity of 82.24% vs 8.88%, respectively). There are various factors that may raise the chance of getting CHD, some of which can be managed whereas others cannot. The results indicate that age, gender, systolic blood pressure, BMI, glucose level and total cholesterol are the significant risk factors for predicting CHD.
Keywords: Coronary Heart Disease, Decision Tree, Imbalance Classification, Logistic Regression, Random Undersampling, Sensitivity.
1. Introduction
Coronary heart disease (CHD), also known as coronary artery disease (CAD), is one of the diseases categorised as cardiovascular disease. CHD has the highest mortality rate of all the non-communicable diseases throughout the world. The World Health Organization (WHO) and the Centers for Disease Control and Prevention (CDC) reported that heart disease is the leading cause of death in the United Kingdom, Canada and Australia. In 2016, CHD continued to be the leading cause of death among Malaysians, accounting for 13.5% of the total. WHO predicts that by 2030, approximately 23.6 million people will suffer from heart disease (Ouf & ElSeddawy, 2021). Heart disease affects a significant proportion of both older and younger people around the world.
Examination and treatment of patients with CHD are required to deliver quality medical services, but this procedure can be complicated due to the various unknown factors that influence the level of risk assessment. Several factors may increase the risk of developing CHD; some of these, usually associated with lifestyle, can be controlled, while others cannot. For instance, smoking, high blood pressure, diabetes and obesity are considered modifiable, whereas age, sex and family history cannot be controlled.
In addition to the rising trend of heart disease, the emergence of the Coronavirus (Covid-19) has worsened the situation. All countries, including Malaysia, have been affected by Covid-19. To deal with the pandemic, several movement control orders (MCO) were introduced. During the MCO, people adopted a sedentary lifestyle and spent most of their time at home. People were unable to go out for physical activities due to the closure of sports facilities and recreational areas. This led to isolation, where people preferred to stay at home, tended to eat an unbalanced diet and did not engage in physical activity, thereby harming their health.
There are numerous health repercussions that affect people's well-being in the short and long term. Obesity has become more prevalent as a result of this sedentary lifestyle. Being sedentary has an adverse effect on heart health, as people who are physically inactive have a higher relative risk of developing CHD and overall mortality (Azahar et al., 2021). Moreover, confinement at home can also lead to social isolation and mental breakdown, which will undoubtedly increase the tendency to smoke. The pandemic exacerbates the severity of CHD, which is primarily driven by smoking, sedentary lifestyles and poor eating habits.
2. Literature Review
Many theories have been proposed to determine the risk factors of CHD. Although the literature covers a wide range of theories, this study concentrates on input variables that include the patients' demographic, lifestyle and clinical information. This research examines the statistical analysis methods utilised in earlier research, such as decision tree and logistic regression, as well as the handling of imbalanced classification in a binary outcome.
The chance of getting CHD may differ between men and women. Women have a lower risk of getting CHD than men. In addition, women usually develop CHD at an older age, while men develop CHD at a younger age (Gao et al., 2019). CHD can also develop from people's lifestyle, such as smoking. In 2012, the Global Adult Tobacco Survey (GATS) estimated that Malaysia had 44.7 million smokers, of whom three out of four are men. This is consistent with statistics showing that more males than females in Malaysia suffer from CHD (Abdullah et al., 2017).
Overweight and obesity play a role in the development of CHD (Katta et al., 2021). According to a survey conducted in 2019 by the National Health and Morbidity Survey (NHMS), 30.4% of the Malaysian population are overweight, while 19.7% are obese. In addition, the survey revealed that 15% of children aged 5 to 17 years old were overweight while 14.8% were obese (Jailani et al., 2021). According to Mahmood et al. (2014), obesity is typically associated with hypertension and diabetes, and these factors are generally blamed for the increase in the risk of CHD among obese people.
The binary response variable involved in this study is prone to class imbalance. An imbalanced dataset occurs when one of the two classes of the binary target variable is poorly represented. This results in inadequate information regarding the rare event, and imbalanced data will affect the performance of data mining techniques. The imbalanced classification, also called a rare event problem, must be resolved first in order to construct a good and useful predictive model (Yap et al., 2014). This study addresses the imbalanced data by using the random undersampling method to model CHD.
According to Arbain & Balakrishnan (2019), imbalanced data need to be handled as they may affect the performance of the model and result in overfitting. Overfitting usually occurs when the model is trained in too much detail, which degrades the model's performance. This problem can be handled by using the random undersampling method.
Decision tree is a type of supervised learning algorithm that is very useful for the classification of binary targets. According to Kumar & Sharma (2016), decision trees are commonly used in real life and are particularly beneficial in illness and heart diagnosis. Decision tree and other data mining classification techniques have been used in various research papers to predict heart disease. According to Methaila et al. (2014), different data mining techniques such as decision tree, Naïve Bayes and neural network are beneficial for predicting heart disease, and their results show that the decision tree has the highest accuracy of 99.62%. Another study on heart disease prediction that used Naïve Bayes, decision tree, random forest and support vector machine revealed that the decision tree had the highest accuracy at 89.6% (Saraswathi et al., 2022).
The logistic regression technique is used to model a binary outcome. It is also one of the techniques used for predictive modelling. A study by Latifah et al. (2020) that utilised the Framingham Heart Study (FHS) dataset revealed that logistic regression has a higher accuracy (85.04%) than random forest when predicting the 10-year risk of having CHD. Another study by Chauhan (2018) that used five types of data mining techniques, namely k-nearest neighbours (KNN), support vector machine (SVM), decision tree, random forest and logistic regression, showed that logistic regression outperformed the other models with 88.86% accuracy.
3. Problem Statement
The World Health Organization (WHO) stated that, for the past fifteen years, coronary heart disease (CHD) has been among the top ten causes of death worldwide, and Malaysia is not excluded. In 2016, CHD was a leading contributing factor to mortality among Malaysians. This mortality rate can be minimised with early detection of the disease so that patients can be referred to a specialist for appropriate treatment earlier. CHD has been a major concern in Malaysia as the number of CHD cases has kept rising over the last 40 years. Formerly, CHD used to affect the elderly population, but now it is affecting younger adults as well. Therefore, there is an urgent need to address this problem.
The study of CHD has caught the attention of many researchers, covering areas such as its risk factors, the modelling of CHD and the prediction of getting CHD. Various studies have been conducted to explore data mining models for predicting CHD, such as decision tree and logistic regression; however, most of these studies did not address the imbalance issue, which might produce a biased classification model, as CHD datasets are typically imbalanced. Therefore, this study aspires to close the gap by including demographic and lifestyle profiles as well as clinical information and by applying an imbalance handling technique before modelling CHD. The results obtained from this research can offer an alternative method for predicting CHD.
4. Method
This section describes the dataset involved in this study and the method of handling the imbalanced dataset using the random undersampling technique. For modelling CHD, decision tree and logistic regression models were used.
4.1 The Dataset
This study used secondary data that are publicly available on the Kaggle website. The data involve residents of the town of Framingham, Massachusetts. The dataset provides information on 4,238 residents with 15 attributes. Each attribute is a potential risk factor, consisting of demographic, lifestyle and clinical information. The description of the attributes is shown in Table 1.
Table 1: Description of Risk Factors
No. | Variable Name | Role | Variable Type | Description
1 | TenYearCHD | Target | Binary | 10-year risk of coronary heart disease (CHD); 1: Yes, 0: No
2 | gender | Input | Binary | Sex; 1: male, 0: female
3 | age | Input | Interval | The age of the patient (from 32 to 70 years old)
4 | education | Input | Ordinal | Education level; 1: High School, 2: High School Diploma/GED, 3: College, 4: Degree
5 | currentSmoker | Input | Binary | Whether or not the patient is a current smoker; 1: current smoker, 0: non-smoker
6 | cigsPerDay | Input | Nominal | Average number of cigarettes smoked per day; Category 1: 0, Category 2: 1-5, Category 3: 6-19, Category 4: 20 and more
7 | BPMeds | Input | Binary | Whether or not the patient was on blood pressure medication; 1: on medication, 0: not on medication
8 | prevalentStroke | Input | Binary | Whether or not the patient has previously had a stroke; 1: has had a stroke, 0: no prevalence of stroke
9 | prevalentHyp | Input | Binary | Whether or not the patient was hypertensive; 1: prevalence of hypertension, 0: no prevalence of hypertension
10 | diabetes | Input | Binary | Whether or not the patient had diabetes; 1: has diabetes, 0: no diabetes
11 | totChol | Input | Interval | Total cholesterol level (mg/dL) (from 0.0 to 696.0)
12 | sysBP | Input | Interval | Systolic blood pressure (mmHg) (from 83.5 to 295.0)
13 | diaBP | Input | Interval | Diastolic blood pressure (mmHg) (from 48.0 to 142.5)
14 | BMI | Input | Interval | Body Mass Index (kg/m²) (from 15.5 to 56.8)
15 | heartRate | Input | Interval | Heart rate (bpm) (from 44.3 to 143.0)
16 | glucose | Input | Interval | Glucose level (mg/dL) (from 40.0 to 394.0)
4.2 Imbalanced Dataset
The dataset comprises 4,238 observations, of which 84.8% (3,594) are residents with no CHD while only 15.2% (644) have CHD, suggesting that imbalanced classification exists in the dataset. Building a model on a dataset with classification imbalance would produce a model that is biased towards the majority class. Furthermore, balanced data produce better prediction results. Hence, the imbalanced classification problem was treated prior to model building by employing the random undersampling technique. Random undersampling works by randomly selecting observations from the majority class so that its size equals the number of observations in the minority class; here, 644 observations were randomly selected from the majority class to match the 644 observations in the minority class.
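The random undersampling step can be illustrated with a short Python sketch using pandas. This is only an illustration of the technique, not the authors' SAS Enterprise Miner workflow; the file name framingham.csv and the random seed are assumptions.

```python
# Illustrative random undersampling with pandas (a sketch, not the authors'
# SAS Enterprise Miner workflow); the file name "framingham.csv" is assumed.
import pandas as pd

df = pd.read_csv("framingham.csv")            # 4,238 rows; target is TenYearCHD

minority = df[df["TenYearCHD"] == 1]          # 644 participants with CHD
majority = df[df["TenYearCHD"] == 0]          # 3,594 participants without CHD

# Randomly keep as many majority cases as there are minority cases (644),
# producing a balanced dataset of 1,288 observations.
majority_down = majority.sample(n=len(minority), random_state=42)
balanced = pd.concat([minority, majority_down]).sample(frac=1, random_state=42)

print(balanced["TenYearCHD"].value_counts())  # 644 of each class
```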
4.3 Model Building
SAS Enterprise Miner was used to run all the analyses in this study, including model building. The models employed were Decision Tree and Logistic Regression. 70% of the data was used to train the models, while the remaining 30% was used to validate them.
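A hedged sketch of the 70/30 partition in Python, mirroring (but not reproducing) the SAS Enterprise Miner data partition step; `balanced` is the undersampled data frame from the sketch in Section 4.2, and the stratified split keeps the 50/50 class balance in both partitions.

```python
# Hypothetical 70/30 train/validation split of the balanced data.
from sklearn.model_selection import train_test_split

X = balanced.drop(columns=["TenYearCHD"])
y = balanced["TenYearCHD"]

X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, train_size=0.70, stratify=y, random_state=42
)
```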
4.3.1 Decision Tree Model
A decision tree is a support tool with a tree-like structure that models probable outcomes. Generally, the trees are built using a dataset randomly selected from the training data. The decision tree uses the target variable to determine how each attribute should be partitioned. At each split, the attributes are considered and one attribute is chosen for the split. In the end, the decision tree breaks the data into nodes defined by the splitting rules at each step (Arora et al., 2017). For this study, three splitting algorithms, Gini, Entropy and Chi-Square, were considered.
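As a rough analogue of the decision tree node in SAS Enterprise Miner, the sketch below fits scikit-learn trees with the Gini and Entropy criteria on the partitions from the previous sketch; scikit-learn offers no Chi-Square (CHAID-style) criterion, and the median imputation and depth limit are assumptions rather than the authors' settings.

```python
# Illustrative decision trees with Gini and Entropy splitting criteria.
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier

for criterion in ("gini", "entropy"):
    tree = make_pipeline(
        SimpleImputer(strategy="median"),    # simple stand-in for surrogate imputation
        DecisionTreeClassifier(criterion=criterion, max_depth=6, random_state=42),
    )
    tree.fit(X_train, y_train)
    print(criterion, "validation accuracy:", round(tree.score(X_valid, y_valid), 4))
```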
4.3.2 Gini
Gini is one of the splitting algorithms used in the construction of the decision tree to decide the optimal split from the root node and subsequent splits. The Gini measure of a node is the sum of the squares of the proportions of the classes:

\[
\text{Gini} = \sum_{i=1}^{c} p_i^{2} \tag{1}
\]

where \(p_i\) is the probability of an object being classified to a specific class.
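As a purely hypothetical illustration of equation (1), consider a node in which 70% of participants have no CHD and 30% have CHD:

\[
\text{Gini} = 0.7^{2} + 0.3^{2} = 0.49 + 0.09 = 0.58,
\]

so a pure node attains the maximum value of 1, while a perfectly mixed binary node attains the minimum value of 0.5; splits that increase this measure in the child nodes are preferred.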
4.3.3 Entropy
Entropy is a measure of the randomness in the information being processed. Entropy measures the impurity, disorderliness or lack of information in a decision tree and determines how the tree chooses to split the data:

\[
\text{Entropy} = -\sum_{i=1}^{c} p_i \log_2 p_i \tag{2}
\]

The higher the entropy, the harder it is to draw any conclusions from that information. In contrast, as a decision tree becomes purer, more orderly and more informative, its entropy approaches zero. The best attribute gives the greatest reduction in entropy.
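Using the same hypothetical node as in the Gini illustration (class proportions 0.7 and 0.3), equation (2) gives

\[
\text{Entropy} = -\left(0.7 \log_2 0.7 + 0.3 \log_2 0.3\right) \approx 0.881,
\]

compared with 0 for a pure node and 1 for a perfectly mixed binary node, so a split whose children are purer than the parent yields a reduction in entropy.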
4.3.4 Chi-square
Chi-square measures the statistical significance of the differences between the child nodes and their parent nodes. It is measured as the sum of squared standardized differences between observed and expected frequencies of target variable for each node.
\[
\chi^2 = \sum \frac{(\text{observed} - \text{expected})^2}{\text{expected}} \tag{3}
\]
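As a worked illustration of equation (3) with invented counts, suppose a parent node contains 50 participants with CHD and 50 without, and a candidate split sends 60 observations (40 CHD, 20 non-CHD) to the left child and 40 observations (10 CHD, 30 non-CHD) to the right child. Under the parent proportions, the expected frequencies are 30/30 and 20/20, so

\[
\chi^{2} = \frac{(40-30)^{2}}{30} + \frac{(20-30)^{2}}{30} + \frac{(10-20)^{2}}{20} + \frac{(30-20)^{2}}{20} \approx 16.67,
\]

and the candidate split with the most significant chi-square statistic is chosen.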
4.3.5 Logistic Regression Model
A logistic regression predicts the probability that an observation falls into one of two categories of a dichotomous dependent variable based on one or more independent variables that can be either continuous or categorical. Logistic regression is used to estimate the probability that Y = 1 for a given observation. If the probability of Y = 1 is greater than 0.5, the observation is predicted as \(\hat{Y} = 1\), and if it is smaller than 0.5, it is predicted as \(\hat{Y} = 0\). The logistic regression equation expresses the log odds as:

\[
\ln\!\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_p X_p \tag{4}
\]
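Equation (4) can equivalently be written in terms of the predicted probability itself, which makes the 0.5 cut-off transparent:

\[
p = \frac{1}{1 + e^{-(\beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p)}},
\]

so predicting \(\hat{Y} = 1\) whenever \(p > 0.5\) is the same as predicting \(\hat{Y} = 1\) whenever the log odds in equation (4) are positive.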
Three variable selection methods for the logistic regression technique were used in this study: forward selection, backward elimination and stepwise selection. Variable selection eliminates irrelevant variables, thus reducing the number of inputs when developing a predictive model. Hence, the complexity of the model is reduced while its performance is improved.
The forward selection method starts with no variables in the model and considers a candidate set of input variables, adding them one at a time. If there are n input variables, the first step considers n different models, each with one input variable. The variable whose model scores best on some test becomes the first variable included in the model. At each subsequent step, each variable not yet in the model is tested for inclusion, and the most significant of these variables is added. The process stops when there are no more significant variables to add.
The backward elimination method starts by fitting a model using all n input variables. Using a statistical test, the least significant variable is dropped from the model and the model is refit without it. The process continues until all remaining variables in the model are statistically significant or some stopping criterion, such as a minimum number of variables desired, is reached.
The stepwise selection method combines the algorithms of both forward selection and backward elimination. It allows variables added earlier to be dropped and variables dropped at one point to be added back into the model.
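The sketch below shows a score-based analogue of forward and backward selection for logistic regression using scikit-learn's SequentialFeatureSelector. SAS Enterprise Miner's selection methods add or drop variables based on significance tests, so this is only an approximate illustration; the number of features to keep is fixed arbitrarily here, and X_train, y_train come from the earlier partition sketch.

```python
# Approximate forward/backward variable selection for logistic regression.
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

logit = LogisticRegression(max_iter=1000)

for direction in ("forward", "backward"):
    pipe = make_pipeline(
        SimpleImputer(strategy="median"),
        SequentialFeatureSelector(logit, direction=direction,
                                  n_features_to_select=6, cv=5),
        logit,
    )
    pipe.fit(X_train, y_train)
    mask = pipe.named_steps["sequentialfeatureselector"].get_support()
    print(direction, "selected:", list(X_train.columns[mask]))
```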
5. Results and Discussion
This section presents the research findings from the Framingham Heart Study (FHS) dataset.
The description of the dataset can be classified into demographic, lifestyle and clinical information.
Table 2 shows that the proportion of female participants (52.6%) is slightly higher than that of male participants (47.4%). The majority are high school graduates (47%), followed by Diploma/GED graduates (27.3%); the remaining are college (14.4%) and degree (11.3%) graduates. The numbers of smokers and non-smokers are about the same, with 653 (50.7%) and 635 (49.3%) participants respectively. Although the difference is small, the second largest smoking group is heavy smokers, with 394 participants (30.6%) smoking 20 cigarettes or more every day. The clinical information reveals that 56 participants (4.3%) had taken blood pressure medication, which means that they are already at stage 2 hypertension, a condition that can increase the risk of heart disease. Fourteen participants (1.1%) who previously had a stroke are more prone to suffer another stroke. Furthermore, 524 participants (40.7%) have a history of hypertension, which can lead to other heart complications. In addition, 49 participants (3.8%) already have diabetes, which can also increase the risk of heart disease.
Table 2: Categorical Risk Factors

Variable | Category | Frequency | Percentage (%)
Gender | Male | 611 | 47.4
Gender | Female | 677 | 52.6
Education | High School | 605 | 47.0
Education | Diploma/GED | 351 | 27.3
Education | College | 186 | 14.4
Education | Degree | 146 | 11.3
currentSmoker | Smoker | 653 | 50.7
currentSmoker | Non-smoker | 635 | 49.3
cigsPerDay | 0 | 635 | 49.3
cigsPerDay | 1-5 | 183 | 14.2
cigsPerDay | 6-19 | 76 | 5.9
cigsPerDay | 20 and more | 394 | 30.6
BPMeds | Medication | 56 | 4.3
BPMeds | No medication | 1232 | 95.7
prevalentStroke | Stroke | 14 | 1.1
prevalentStroke | No stroke | 1274 | 98.9
prevalentHyp | Hypertension | 524 | 40.7
prevalentHyp | No hypertension | 764 | 59.3
diabetes | Diabetes | 49 | 3.8
diabetes | No diabetes | 1239 | 96.2
Based on the information in Table 3, the average age of the participants is 51 years old, while the youngest and oldest participants are 33 and 70 years old respectively, indicating that most of the participants are middle-aged and older adults. The average BMI is 26.044 kg/m², which indicates that, on average, the participants are overweight. The minimum and maximum BMI of the participants are 15.5 kg/m² (underweight) and 38.049 kg/m² (obese) respectively. The average total cholesterol level is 238.562 mg/dL, which is considered borderline high. In addition, the average systolic blood pressure is 136.919 mmHg, which can contribute to hypertension. However, the averages for diastolic blood pressure, heart rate and glucose level are all within the normal ranges.
Table 3: Continuous Risk Factors

Variable | Minimum | Maximum | Mean | Standard Deviation
age | 33 | 70 | 51.437 | 8.7010
BMI | 15.5 | 38.049 | 26.044 | 4.1312
totChol | 80.4203 | 387.437 | 238.562 | 49.0857
sysBP | 83.5000 | 198.467 | 136.919 | 23.3971
diaBP | 48.0000 | 118.626 | 84.525 | 12.5786
heartRate | 48.0000 | 111.959 | 76.604 | 12.1457
glucose | 40.0000 | 153.847 | 82.404 | 17.6692
Before proceeding to model building, a check for missing values was carried out, and imputation using the tree surrogate method was used to handle them. In addition, the dataset summarised in Table 4 indicates that there is an imbalance issue, as the percentage of participants with CHD is only 15.19%. The undersampling method was applied to balance the dataset before building the predictive model. The original dataset has 4,238 observations; after applying the undersampling method to address the imbalance issue, only 1,288 observations remain.
Table 4: Summary Statistics (Original Dataset and Undersampling)

Variable | Numeric Value | Original: Frequency | Original: Percent (%) | Undersampling: Frequency | Undersampling: Percent (%)
TenYearCHD | 0 | 3594 | 84.8042 | 644 | 50
TenYearCHD | 1 | 644 | 15.1958 | 644 | 50
Decision tree models based on three splitting algorithms, Entropy, Gini and Chi-Square, were built. The main goal is to evaluate the performance of the decision tree in modelling coronary heart disease (CHD) and to determine the risk factors that lead to CHD.
Table 5 shows the performance measures for Entropy, Gini and Chi-Square. According to these measures, each splitting algorithm performs best on a different measurement: the Entropy splitting algorithm is best at predicting the positive outcome (sensitivity of 0.8224), the Gini splitting algorithm is best at predicting the negative outcome (specificity of 0.6938), and the Chi-Square splitting algorithm has the highest accuracy (66.73%).
As the focus is to find the model best able to predict the possibility of having CHD, a higher sensitivity value indicates a better model. Sensitivity is the proportion of patients at risk of developing CHD who are correctly identified. From the results, a sensitivity value of 0.8224 indicates that the model correctly identifies 82.24% of patients at risk of developing CHD. Hence, the decision tree with the Entropy splitting algorithm is the best decision tree model.
Table 5: Performance Measure for Decision Tree Models

Model | Accuracy | Sensitivity | Specificity
Entropy | 0.6557 | 0.8224 | 0.4884
Gini | 0.6576 | 0.6216 | 0.6938
Chi-Square | 0.6673 | 0.6757 | 0.6589
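The accuracy, sensitivity and specificity reported in Tables 5 to 8 can be recovered from a confusion matrix. The short sketch below shows the computation on the validation partition, using the entropy tree from the earlier sketch as an example; the variable names follow those sketches and are assumptions, not the authors' SAS output.

```python
# Accuracy, sensitivity (true positive rate) and specificity (true negative
# rate) from the validation confusion matrix; `tree` is the last fitted model
# from the decision tree sketch (the entropy tree).
from sklearn.metrics import confusion_matrix

y_pred = tree.predict(X_valid)
tn, fp, fn, tp = confusion_matrix(y_valid, y_pred).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)
sensitivity = tp / (tp + fn)
specificity = tn / (tn + fp)
print(f"accuracy={accuracy:.4f} sensitivity={sensitivity:.4f} specificity={specificity:.4f}")
```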
For logistic regression, three feature selection methods, forward, backward and stepwise, were modelled. As shown in Table 6, logistic regression with the backward selection method exhibits higher sensitivity (0.6795) and accuracy (67.5%) than the other selection methods. A sensitivity value of 0.6795 indicates that the model correctly identifies 67.95% of patients at risk of developing CHD, and it classifies both positive and negative outcomes with 67.5% accuracy. Hence, logistic regression using the backward selection method is the best logistic regression model.
Table 6: Performance Measure for Models of Logistic Regression

Model | Accuracy | Sensitivity | Specificity
Forward | 0.6693 | 0.6602 | 0.6783
Stepwise | 0.6693 | 0.6602 | 0.6783
Backward | 0.6750 | 0.6795 | 0.6705
The comparison between the decision tree and logistic regression models in Table 7 shows higher accuracy (67.5%) and specificity (0.6705) for the logistic regression model with the backward selection method. Nevertheless, the decision tree with the Entropy splitting algorithm exhibits higher sensitivity (0.8224). As our interest is in predicting a patient's risk of developing CHD, the decision tree with the Entropy splitting algorithm is chosen as the best model due to its greater ability to identify patients at risk of developing CHD. The marginal difference (1.93%) in accuracy between the two models was traded off for a model that better predicts a patient's risk of developing CHD.
Table 7: Performance Measure for Two Different Models

Model | Accuracy | Sensitivity | Specificity
Decision Tree (Entropy) | 0.6557 | 0.8224 | 0.4884
Logistic Regression (Backward) | 0.6750 | 0.6795 | 0.6705
A comparison between the dataset treated for the imbalance issue and the untreated dataset is shown to support the selection of the best model. Based on Table 8, the decision tree with the Entropy splitting algorithm on the untreated dataset gives higher accuracy (85.04%) and higher specificity (0.9875) but very low sensitivity (0.088), reflecting the imbalance issue in the dataset. Hence, the decision tree with the Entropy splitting algorithm treated with the undersampling technique is the best model. This finding is in line with prior studies by Methaila et al. (2014), Princy et al. (2020) and Saraswathi et al. (2022), which suggested that the decision tree is a suitable data mining technique for modelling a patient's risk of getting CHD.
Table 8: Comparison Between Treated vs Untreated Dataset

Imbalance Technique | Model | Feature Selection | Accuracy | Sensitivity | Specificity
Undersampling | Decision Tree | Entropy | 0.6557 | 0.8224 | 0.4884
None | Decision Tree | Entropy | 0.8504 | 0.088 | 0.9875
None | Logistic Regression | Backward | 0.8498 | 0.0579 | 0.9924

Based on the best predictive model obtained, the significant attributes were identified. The most significant attribute was determined by the importance value obtained from the decision tree with the Entropy splitting algorithm. Table 9 shows the variable importance for the Entropy decision tree. The most important attribute is age, which was split first and has an importance value of one, followed by gender, systolic blood pressure (sysBP), Body Mass Index (BMI), glucose and total cholesterol level (totChol).

Table 9: Variable Importance for Entropy

Variable | Importance
age | 1.0000
gender | 0.3992
sysBP | 0.3793
BMI | 0.2698
glucose | 0.2531
totChol | 0.1846
The finding is in line with the prior study by Abdullah et al. (2017), which reported that the highest mortality rates due to CHD were among men aged 80 to 84 years (0.079%) and women aged 85 years and above (0.061%). A study by Gao et al. (2019) stated that women have a lower risk of getting CHD than men. A study by Hajar (2017) revealed that systolic blood pressure contributes more to high blood pressure than diastolic blood pressure, that long-term hypertension leads to CHD, and that higher total cholesterol levels are associated with a significant increase in the risk of CHD. A study by Katta et al. (2021) showed that overweight and obesity play a role in the development of CHD, while a study by Tsao & Vasan (2015) revealed that high glucose levels can damage the blood vessels and nerves that control the heart, which can lead to CHD over time.
6. Conclusion and Recommendation
This paper aims to predict a patient's risk of getting coronary heart disease (CHD) using two data mining classification techniques. The performance of each model was assessed by considering the accuracy, sensitivity and specificity values. The highest sensitivity value (0.8224) was used to select the best data mining model, as the interest is in predicting a patient's risk of getting CHD. The marginal difference (1.93%) in accuracy between the two models was traded off for a better model. The dataset with an imbalance issue was treated before building the predictive model to produce a better result. Hence, the decision tree with the Entropy splitting algorithm, treated with the undersampling technique, is the best model. Age, gender, systolic blood pressure, Body Mass Index, glucose and total cholesterol level were identified as the risk factors for detecting CHD.
In this study, both the decision tree and logistic regression models were useful in predicting a patient's risk of getting CHD; however, other data mining techniques may also be appropriate. Therefore, we suggest that an ensemble model such as Random Forest be considered for future studies. In addition, other imbalance handling techniques, such as the Synthetic Minority Oversampling Technique (SMOTE) and hybrid approaches, can be applied to handle imbalanced data. Other risk factors, such as psychosocial issues, may also be considered in future studies.
7. Acknowledgement
Praise to God the Almighty for showering us with blessings and giving us strength and time to complete this paper. We would like to express our sincerest gratitude and deepest appreciation to our parents and friends who have assisted us in this endeavour. We are highly indebted to our supervisor for her invaluable guidance and encouragement during the completion of this paper. The authors fully acknowledge that this study used secondary data that are publicly available on the Kaggle website and come from the ongoing Framingham heart study.
References
Abdullah, W. M. S. W., Yusoff, Y. S., Basir, N., & Yusuf, M. M. (2017). Mortality Rates Due to Coronary Heart Disease by Specific Sex and Age Groups among Malaysians. Proceedings of the World Congress on Engineering and Computer Science 2017, 2. https://oarep.usim.edu.my/jspui/bitstream/123456789/1872/1/Mortality Rates Due to Coronary Heart Disease by Specific Sex and Age Groups among Malaysians.pdf
Arbain, A. N., & Balakrishnan, B. Y. P. (2019). A Comparison of Data Mining Algorithms for Liver Disease Prediction on Imbalanced Data. International Journal of Data Science and Advanced Analytics, 1(1), 1–11. http://ijdsaa.com/index.php/welcome/article/view/2
Arora, A., Gupta, B., Uttarakhand, P., & Rawat, I. A. (2017). Analysis of Various Decision Tree Algorithms for Classification in Data Mining. International Journal of Computer Applications, 163(8).
Azahar, N. M. Z. M., Zaki, M. A. A., & Devaraj, N. K. (2021). Health Consequences During Pandemic: A Review. Malaysian Journal of Medicine and Health Sciences, 17(3).
Chauhan, Y. J. (2018). Cardiovascular Disease Prediction using Classification Algorithms of Machine Learning. International Journal of Science and Research, 9(5), 194–200. https://doi.org/10.21275/SR20501193934
Gao, Z., Chen, Z., Sun, A., & Deng, X. (2019). Gender differences in cardiovascular disease. Medicine in Novel Technology and Devices, 4, 1–6. https://doi.org/10.1016/j.medntd.2019.100025
Hajar, R. (2017). Risk Factors for Coronary Artery Disease: Historical Perspectives. Heart Views: The Official Journal of the Gulf Heart Association, 18(3), 109. https://doi.org/10.4103/HEARTVIEWS.HEARTVIEWS_106_17
Jailani, M., Elias, S. M., & Rajikan, R. (2021). The new standardized Malaysian healthy eating index. Nutrients, 13(10). https://doi.org/10.3390/nu13103474
Katta, N., Loethen, T., Lavie, C. J., & Alpert, M. A. (2021). Obesity and Coronary Heart Disease: Epidemiology, Pathology, and Coronary Artery Imaging. Current Problems in Cardiology, 46(3), 100655. https://doi.org/10.1016/j.cpcardiol.2020.100655
Kumar, S., & Sharma, H. (2016). A Survey on Decision Tree Algorithms of Classification in Data Mining. International Journal of Science and Research, 5.
Latifah, F. A., Slamet, I., & Sugiyanto. (2020). Comparison of Heart Disease Classification with Logistic Regression Algorithm and Random Forest Algorithm. AIP Conference Proceedings, 2296, 1–9. https://doi.org/10.1063/5.0030579
Mahmood, S. S., Levy, D., Vasan, R. S., & Wang, T. J. (2014). The Framingham Heart Study and the epidemiology of cardiovascular disease: A historical perspective. The Lancet, 383(9921), 999–1008. https://doi.org/10.1016/S0140-6736(13)61752-3
Methaila, A., Kansal, P., Arya, H., & Kumar, P. (2014). Early Heart Disease Prediction Using Data Mining Techniques. https://doi.org/10.5121/csit.2014.4807
Ouf, S., & ElSeddawy, A. I. B. (2021). A Proposed Paradigm for Intelligent Heart Disease Prediction System Using Data Mining Techniques. Journal of Southwest Jiaotong University, 56(4), 220–240. https://doi.org/10.35741/ISSN.0258-2724.56.4.19
Princy, R. J. P., Parthasarathy, S., Jose, P. S. H., Lakshminarayanan, A. R., & Jeganathan, S. (2020). Prediction of Cardiac Disease using Supervised Machine Learning Algorithms. Proceedings of the International Conference on Intelligent Computing and Control Systems, ICICCS 2020, 570–575. https://doi.org/10.1109/ICICCS48265.2020.9121169
Tsao, C. W., & Vasan, R. S. (2015). Cohort Profile: The Framingham Heart Study (FHS): Overview of milestones in cardiovascular epidemiology. International Journal of Epidemiology, 44(6), 1800–1813. https://doi.org/10.1093/IJE/DYV337
Vijaya Saraswathi, R., Gajavelly, K., Kousar Nikath, A., Vasavi, R., & Reddy Anumasula, R. (2022). Heart Disease Prediction Using Decision Tree and SVM. 69–78. https://doi.org/10.1007/978-981-16-7389-4_7
Yap, B. W., Rani, K. A., Abd Rahman, H. A., Fong, S., Khairudin, Z., & Abdullah, N. N. (2014). An application of oversampling, undersampling, bagging and boosting in handling imbalanced datasets. Lecture Notes in Electrical Engineering, 285, 13–22. https://doi.org/10.1007/978-981-4585-18-7_2