Enhancing Machine Learning Accuracy in Detecting Preventable Diseases using Backward Elimination Method

Muhammad Dliyauddin, Guruh Fajar Shidik, Affandy, M Arief Soeleman*

Faculty of Computer Science, Informatics Engineering Program, Universitas Dian Nuswantoro, Semarang, Indonesia
Email: 1[email protected], 2[email protected], 3[email protected], 4,*[email protected]

Corresponding Author Email: [email protected]

Abstract−In the current landscape of abundant high-dimensional datasets, addressing classification challenges is pivotal. While prior studies have effectively utilized Backward Elimination (BE) for disease detection, there is a notable absence of research demonstrating the method's significance through comprehensive comparisons across diverse datasets. This study extends that work by applying BE with multiple machine learning algorithms (MLAs)—Naïve Bayes (NB), k-Nearest Neighbors (KNN), and Support Vector Machine (SVM)—on datasets associated with preventable diseases, i.e., heart failure (HF), breast cancer (BC), and diabetes. The aim is to elucidate and recommend the significant differences observed when BE is applied across diverse datasets and machine learning (ML) methods. Testing was conducted on four distinct datasets—the raisin, HF, BC, and early-stage diabetes risk prediction datasets—and each dataset was evaluated with three MLAs: NB, KNN, and SVM. The application of BE successfully eliminated non-significant attributes, retaining only influential ones in the model. In addition, t-test results revealed a significant impact on accuracy across all datasets (p-value < 0.05). In algorithm-specific evaluations, SVM exhibited the highest accuracy for the raisin dataset at 87.22%, KNN attained the highest accuracy on the heart failure dataset at 86.31%, KNN again excelled on the breast cancer dataset with 83.56%, and KNN proved the most accurate for the diabetes dataset, reaching 96.15%. These results underscore the efficacy of BE in enhancing the performance of MLAs for disease detection.

Keywords: Feature Selection; Backward Elimination; Machine Learning Algorithms; Disease Detection; KNN

1. INTRODUCTION

Addressing preventable diseases such as heart failure (HF), breast cancer (BC), and diabetes is of paramount importance, underscoring the urgency to create and execute effective detection strategies. In the realm of informatics and machine learning (ML), pioneering advancements in data analysis, predictive modeling, and pattern recognition hold the key to early identification and intervention. Gjoreski et al. [1] introduced a system utilizing an ML approach to forecast HF based on heart sound reports. Furthermore, Alotaibi [2] achieved notable success in improving accuracy by employing various methods, including decision tree (DT), random forest (RF), logistic regression (LR), Naïve Bayes (NB), and Support Vector Machine (SVM), all with accuracies surpassing 85%.

Another disease important to address is breast cancer (BC), as demonstrated by Mohammed et al. [3] using decision tree (J48), NB, and Sequential Minimal Optimization (SMO), resulting in a notable accuracy of 99.56% for SMO, which was the highest. Meanwhile, Omondiagbe et al. [4] performed SVM on breast cancer detection, yielding an accuracy of 98.82%. Moreover, certain studies have employed ML for the early detection of diabetes. For instance, Kopitar et al. [5] demonstrated that XGBoost achieved high accuracy using a simple regression model. Another study utilized a random forest classifier for the timely identification of diabetes, yielding an accuracy of 82%.

ML is a part of artificial intelligence, consisting of a set of algorithms used to develop forecasting models. Its approach involves uncovering concealed connections within data in intricate and ever-changing settings and managing data with numerous dimensions and variables, including in the realm of health [6]. ML is designed to extract knowledge from datasets and is formed by algorithms that generate predictions. ML encompasses classification, regression, clustering, and time series prediction [7]. Classification plays a crucial role in data mining. The classifier model is built by classification algorithms during the training process, and the classification accuracy is assessed during the testing phase [8]. In the training phase, the goal is to find a function that maps the input data to their corresponding classes for known instances, while in the testing phase, the classifier is evaluated for its capability to accurately categorize unfamiliar instances (examples different from those used in the training phase).

Some classification processes utilize all attributes or features, but some of these attributes do not contribute significantly to the classification task, leading to poor accuracy [9]. Moreover, the features or variables in high-dimensional data have a substantial influence on the complexity and efficiency of the algorithms employed for classification. Fortunately, the difficulties of high-dimensional data can be addressed through feature selection (FS) [10]. FS serves as a preprocessing step aimed at tackling the issues of high-dimensional data by choosing a subset of characteristics from a large dataset, improving the accuracy of the classification or clustering model and eliminating irrelevant, noisy, outlier, and irregular data [11]. One way to eliminate redundant and noisy features from a dataset is through FS [7]. FS is the process of discerning and isolating pertinent features (excluding redundant and noisy ones) from a dataset with the aim of obtaining the optimal solution for classification or regression tasks [12].


FS can also be classified into three approaches: filter method, wrapper method, and embedded method [13].

Filter-based methods assess feature subsets independently of the learning algorithm, relying on intrinsic properties of the data evaluated with specified criteria (such as the F-score, information gain (IG), etc.) [14]. In addition, wrapper-based methods combine a classifier (such as SVM, k-Nearest Neighbor (KNN), etc.), an assessment criterion for feature subsets, and a search algorithm to find subsets that contain the optimal features [15]. Additionally, embedded methods are applied during the classifier learning process to discard features based on prediction errors on the training data [16].
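As an illustrative sketch (not part of the original study), the contrast between a filter and a wrapper selector can be shown with scikit-learn; the dataset, classifier, and parameter values below are arbitrary assumptions.

```python
# Hedged sketch: filter-based vs. wrapper-based feature selection with scikit-learn.
# Dataset, classifier, and parameter values are illustrative, not the paper's setup.
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, SequentialFeatureSelector, f_classif
from sklearn.neighbors import KNeighborsClassifier

X, y = load_breast_cancer(return_X_y=True)

# Filter method: rank features independently of any classifier (ANOVA F-score here).
filter_selector = SelectKBest(score_func=f_classif, k=10).fit(X, y)

# Wrapper method: evaluate feature subsets with a classifier (KNN here), removing features backward.
wrapper_selector = SequentialFeatureSelector(
    KNeighborsClassifier(n_neighbors=5),
    n_features_to_select=10,
    direction="backward",
).fit(X, y)

print("Filter keeps features:", filter_selector.get_support().nonzero()[0])
print("Wrapper keeps features:", wrapper_selector.get_support().nonzero()[0])
```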

One of the FS methods in the wrapper-based category is backward elimination (BE), which removes variables that do not contribute to the model, leaving only essential variables in the model. However, one drawback of this method is that after a variable has been removed from the model, it cannot be reintroduced, even though the eliminated variable might still be significant in the final model [17]. Several previous studies have employed BE in ML as a tool for disease detection. For example, studies have applied BE with LR, KNN, NB, DT, RF, gradient boosting classifier (GBC), and SVM to detect diseases such as heart disease, diabetes, cervical cancer, and breast cancer [13], [18]–[21]. There remains a gap in this field, notably the absence of studies demonstrating the significance of the method through comparisons across multiple diverse datasets.

Therefore, this study applied BE across multiple machine learning algorithms (MLAs) such as NB, KNN, and SVM on various datasets related to preventable diseases, namely HF, BC, and diabetes. The aim is to highlight and suggest the significant differences observed in the application of BE across these diverse datasets and ML methods.

2. RESEARCH METHODOLOGY

2.1 Research Stages

Figure 1 illustrates the stages of this study, comprising three main phases: data preprocessing, FS implementation, and the application of learning algorithms (NB, SVM, and KNN). In the data preprocessing phase, the dataset is processed to identify missing values, which are handled by replacing them with the attribute mean. Subsequently, the dataset is split, and the BE process is applied. After BE, 10-fold cross-validation is carried out, followed by the application of the predetermined learning algorithms. The final step is evaluation using the Area Under the Curve (AUC) until the results are obtained.
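For concreteness, the flow in Figure 1 can be sketched as follows. This is a hypothetical illustration only: the file name, target column, and parameter choices are assumptions, and the actual study used RapidMiner rather than Python.

```python
# Hypothetical sketch of the Figure 1 pipeline (the study itself used RapidMiner).
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

df = pd.read_csv("heart_failure.csv")                 # assumed file name
X = df.drop(columns=["DEATH_EVENT"])                  # assumed target column
y = df["DEATH_EVENT"]

# 1) Preprocessing: replace missing values with the attribute mean
X = pd.DataFrame(SimpleImputer(strategy="mean").fit_transform(X), columns=X.columns)

# 2) Backward elimination would be applied here (see Sections 2.6 and 3.1)

# 3) 10-fold cross-validation for each learning algorithm, evaluated by accuracy and AUC
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
for name, clf in [("NB", GaussianNB()),
                  ("KNN", KNeighborsClassifier(n_neighbors=5)),
                  ("SVM", SVC(kernel="rbf"))]:
    acc = cross_val_score(clf, X, y, cv=cv, scoring="accuracy").mean()
    auc = cross_val_score(clf, X, y, cv=cv, scoring="roc_auc").mean()
    print(f"{name}: accuracy={acc:.4f}  AUC={auc:.4f}")
```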

Figure 1. Research Flowchart

2.2 Data Collection and Testing

This research utilizes secondary data acquired from the UCI (University of California, Irvine) Machine Learning Repository, including datasets related to raisin, HF, BC, and diabetes. Table 1 presents the types of datasets used, along with information about the algorithms, attributes, and the results of data preprocessing.


Table 1. Types of databases

UCI Dataset                            Algorithms                               Attribute Types  Instances  Number of Attributes  Missing Values
Raisin                                 Classification                           Integer, Real    900        8                     No
HF                                     Classification, Regression, Clustering   Integer, Real    299        13                    No
BC                                     Classification                           Integer          116        10                    No
Early-stage diabetes risk prediction   Classification                           -                520        17                    Yes

Based on Table 1, all datasets are suited to classification tasks; the HF dataset additionally supports regression and clustering. The attribute types are integer for the raisin, HF, and BC datasets, and the raisin and HF datasets also contain real-valued attributes; no attribute-type information is available for the diabetes dataset. The dataset with the largest number of instances is raisin (900), while the dataset with the most attributes is diabetes (17). Missing values are found only in the diabetes dataset; the other datasets have none.

After completing the stages outlined in the research stages subsection, a significance test is conducted using a t-test on the accuracy data of the algorithm model to determine the difference before and after the implementation of BE. The hypotheses proposed to address this are as follows:

H0: The BE algorithm does not exert a substantial impact on MLAs.

H1: The BE algorithm significantly affects MLAs.

The criteria used in this t-test are as follows: if the calculated t-value > the critical t-value, or the p-value < the significance level of 0.05, then H0 is rejected and H1 is accepted, indicating a significant difference between accuracy before and after the implementation of BE. The prerequisite for this t-test is a normality test using the Lilliefors test, with the criterion that if Lo(calculated) < L(table), the sample is normally distributed.
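A minimal sketch of this testing procedure, assuming SciPy and statsmodels and using illustrative accuracy values rather than the exact figures reported later, is shown below.

```python
# Hedged sketch of the prerequisite Lilliefors normality test and the paired t-test.
# The accuracy values below are illustrative placeholders, not the paper's exact results.
from scipy import stats
from statsmodels.stats.diagnostic import lilliefors

acc_without_be = [0.76, 0.62, 0.84, 0.89]   # accuracies of one MLA on the four datasets
acc_with_be    = [0.84, 0.66, 0.86, 0.91]   # accuracies of the same MLA after BE

# Lilliefors test: Lo below the critical table value suggests a normally distributed sample
# (with only four observations, the test sits at the edge of its validity).
lo_without, _ = lilliefors(acc_without_be, dist="norm")
lo_with, _ = lilliefors(acc_with_be, dist="norm")

# Paired two-sample t-test on the same datasets before and after BE; reject H0 if p < 0.05
t_stat, p_value = stats.ttest_rel(acc_without_be, acc_with_be)
print(f"Lo without BE={lo_without:.3f}, with BE={lo_with:.3f}, t={t_stat:.3f}, p={p_value:.5f}")
```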

2.3 Naïve Bayes (NB)

The NB algorithm is known for its simplicity and efficiency [22]. NB calculates the probability of each candidate class C1, C2, C3, ..., Cn given the observed attributes (features or dimensions) of an instance, and assigns the instance to the most probable class. Equation (1) illustrates the application of NB.

P(C|X₁, …, Xₙ) = P(C) × ∏ᵢ₌₁ⁿ P(Xᵢ|C) (1)

Where P(C|X₁, …, Xₙ) is the posterior probability to be calculated, P(C) is the prior probability of class C, and P(Xᵢ|C) is the conditional probability of attribute Xᵢ given class C.
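A toy illustration of Equation (1), with made-up prior and conditional probabilities (not values from the study):

```python
# Toy sketch of Equation (1): posterior score = prior × product of per-attribute likelihoods.
priors = {"disease": 0.3, "healthy": 0.7}          # hypothetical class priors P(C)
likelihoods = {                                    # hypothetical P(X_i | C) for three attributes
    "disease": [0.8, 0.6, 0.9],
    "healthy": [0.2, 0.5, 0.4],
}

scores = {}
for label, prior in priors.items():
    score = prior
    for p in likelihoods[label]:
        score *= p                                 # multiply the conditional probabilities
    scores[label] = score

prediction = max(scores, key=scores.get)           # class with the largest posterior score
print(scores, "->", prediction)
```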

2.4 k-Nearest Neighbor (KNN)

KNN is commonly used in the domain of data mining to classify objects by determining the nearest distance between the object (query point) and all objects in the training data. Classification is performed based on the k nearest neighbors of the item, with k being a predefined positive integer specified prior to executing the algorithm.

The Euclidean distance is often employed to calculate the distance between objects, and it can be computed using Equation (2).

dEuclidean(x, y) = √(∑ᵢ₌₁ᵏ (xᵢ − yᵢ)²) (2)
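A small self-contained sketch of Equation (2) and the majority vote over the k nearest neighbors; the training points are made up for illustration:

```python
# Sketch of Equation (2) and a k-NN majority vote on made-up 2-D data.
import math
from collections import Counter

def euclidean(x, y):
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

def knn_predict(query, train, k=3):
    # train is a list of (feature_vector, label) pairs
    nearest = sorted(train, key=lambda item: euclidean(query, item[0]))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

train = [([1.0, 2.0], "A"), ([1.5, 1.8], "A"), ([5.0, 8.0], "B"), ([6.0, 9.0], "B")]
print(knn_predict([1.2, 1.9], train, k=3))   # -> "A"
```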

2.5 Support Vector Machine (SVM)

The SVM algorithm is an ML approach developed from statistical learning theory. It is a supervised classification method based on structural risk minimization. The SVM algorithm can handle nonlinear, high-dimensional, imbalanced, and small-sample data and provides good generalization. SVM chiefly deals with binary classification problems, where the data samples are expressed as Equation (3).

(x₁, y₁), (x₂, y₂), …, (xₙ, yₙ) ∈ Rᴺ × Y, Y = {−1, 1} (3)

Where xᵢ is a data sample to be categorized and yᵢ is the class label of xᵢ. For linearly separable data, the SVM algorithm satisfies yᵢ(wᵀxᵢ + b) ≥ 1, 1 ≤ i ≤ n, where wᵀxᵢ + b defines the hyperplane and the parameters w and b denote the coefficients and the bias, respectively. The maximum-margin classification problem can be characterized as Equation (4).

min (1/2)‖w‖², subject to yᵢ(wᵀxᵢ + b) ≥ 1, 1 ≤ i ≤ n (4)


To attain effective classification for nonlinear, inseparable data, the optimal soft-margin problem is characterized as Equation (5) by incorporating slack variables.

min (1/2)‖w‖² + C ∑ᵢ₌₁ⁿ ξᵢ, subject to ξᵢ ≥ 0, yᵢ(wᵀϕ(xᵢ) + b) ≥ 1 − ξᵢ, 1 ≤ i ≤ n (5)

Where ξᵢ is the slack variable for sample i, C is the penalty parameter, and ϕ is the high-dimensional feature projection function associated with the kernel function k(xᵢ, xⱼ).
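A hedged sketch of a soft-margin SVM as in Equation (5), using scikit-learn's SVC on synthetic data; the C value and kernel choice are arbitrary, not the paper's settings:

```python
# Hedged sketch: soft-margin SVM (Equation (5)) via scikit-learn on synthetic data.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

scaler = StandardScaler().fit(X_train)
clf = SVC(C=1.0, kernel="rbf")   # C is the penalty weight on the slack variables ξ_i
clf.fit(scaler.transform(X_train), y_train)
print("test accuracy:", clf.score(scaler.transform(X_test), y_test))
```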

2.6 Backward Elimination (BE)

BE, a method within the wrapper category, is essentially the reverse of forward selection [23]. BE is a frequently employed search strategy in feature subset selection. This method eliminates uninformative features and builds a model with the remaining features, reducing data dimensionality while retaining informative features. For strategic decision-making, the relative importance of input variables can be obtained by removing input variables and assessing the impact on the model, which is retrained without them. Alternatively, the influence of each input variable on the output can be examined using sensitivity analysis methods. As an illustration, if a dataset has 50 features, each candidate model is constructed using 49 features, omitting one feature at a time. Initially, the eliminated feature is the one that contributes the least to cross-validation performance. This process continues until only one feature remains.
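A minimal sketch of this wrapper-style loop, assuming a pandas feature matrix and a scikit-learn estimator; it removes, one at a time, the feature whose omission hurts cross-validated accuracy the least and stops once every removal degrades performance:

```python
# Illustrative wrapper backward elimination (assumed pandas/scikit-learn setup).
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

def backward_eliminate(X: pd.DataFrame, y, estimator, min_features=1, cv=10):
    features = list(X.columns)
    while len(features) > min_features:
        baseline = cross_val_score(estimator, X[features], y, cv=cv).mean()
        # Score each candidate subset that omits exactly one remaining feature
        scores = {f: cross_val_score(estimator, X[[c for c in features if c != f]],
                                     y, cv=cv).mean()
                  for f in features}
        candidate = max(scores, key=scores.get)   # feature whose removal costs the least
        if scores[candidate] < baseline:          # stop once every removal hurts accuracy
            break
        features.remove(candidate)
    return features

# Example (names hypothetical): selected = backward_eliminate(X, y, KNeighborsClassifier(5))
```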

3. RESULT AND DISCUSSION

This section will present and discuss the results. The discussion will begin with the BE process and then move on to the accuracy testing results for all designed methods. Following that, the discussion will cover the prerequisite test, specifically the normality test conducted on all datasets before and after the implementation of BE for each MLA. Subsequently, the outcomes of the t-test will be discussed to address the hypothesis regarding the difference in accuracy before and after the application of BE for all MLAs. Towards the end of the discussion, a comparison of the accuracy in this study with several other studies will be explored.

3.1 Backward Elimination (BE) Process

The BE method performs selection by systematically eliminating one attribute at a time from the dataset. In the BE process, multiple elimination rounds are conducted until only attributes with a p-value < 0.05 remain. The BE process is applied to all datasets, including raisin, HF, BC, and diabetes. Table 2 shows the BE results for the raisin dataset, conducted in three phases. Tables 3, 4, and 5 present the BE results for the HF, BC, and diabetes datasets, conducted in nine, five, and nine phases, respectively.
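A hedged sketch of these elimination rounds, assuming a statsmodels regression fit as the source of the attribute p-values (the paper does not state the exact regression routine behind Tables 2–5):

```python
# Sketch of p-value-based backward elimination with statsmodels (assumed OLS fit).
import statsmodels.api as sm

def backward_eliminate_pvalues(X, y, alpha=0.05):
    cols = list(X.columns)
    while cols:
        model = sm.OLS(y, sm.add_constant(X[cols])).fit()
        pvalues = model.pvalues.drop("const", errors="ignore")
        worst = pvalues.idxmax()                  # attribute with the largest p-value
        if pvalues[worst] < alpha:                # every remaining attribute is significant
            return cols, model
        cols.remove(worst)                        # eliminate it and run the next phase
    return cols, None

# Example (names hypothetical): kept, final_model = backward_eliminate_pvalues(X, y)
```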

Table 2. Results of elimination using BE on the raisin dataset in three phases

Attributes         Phase 1  Phase 2  Phase 3
Intercept          0.0001   0.0000   0.0000
Area               0.0000   0.0000   0.0000
MajorAxisLength    0.0564   0.0505   0.0114
MinorAxisLength²   0.8846   0.8394   -
Eccentricity       0.0020   0.0020   0.0003
ConvexArea         0.0000   0.0000   0.0000
Extent¹            0.7634   -        -
Perimeter          0.0000   0.0000   0.0000

Note: Attributes with superscript numbers indicate elimination in the phase corresponding to the superscript number.

Table 3. Results of elimination using BE on the HF dataset in nine phases

Attributes                 Phase 1  Phase 2  Phase 3  Phase 4  Phase 5  Phase 6  Phase 7  Phase 8  Phase 9
intercept                  0.0174   0.0172   0.0221   0.0003   0.0002   0.0002   0.0001   0.0000   0.0000
age                        0.0022   0.0022   0.0030   0.0027   0.0025   0.0026   0.0031   0.0042   0.0043
anaemia⁸                   0.9504   0.9574   0.9383   0.9740   0.9837   0.9848   0.9814   0.7786   -
creatinine_phosphokinase⁷  0.1284   0.1254   0.1468   0.1823   0.1838   0.1763   0.1809   -        -
diabetes⁶                  0.6624   0.6524   0.5049   0.4055   0.4192   0.4179   -        -        -
ejection_fraction          0.0000   0.0000   0.0000   0.0000   0.0000   0.0000   0.0000   0.0000   0.0000
high_blood_pressure⁵       0.7544   0.7560   0.8701   0.8132   0.7989   -        -        -        -
platelets⁴                 0.7049   0.6948   0.8041   0.7456   -        -        -        -        -
serum_creatinine           0.0001   0.0001   0.0001   0.0000   0.0000   0.0000   0.0000   0.0000   0.0000
serum_sodium³              0.1315   0.1305   0.1326   -        -        -        -        -        -
sex²                       0.2135   0.1523   -        -        -        -        -        -        -
smoking¹                   0.9109   -        -        -        -        -        -        -        -
time                       0.0000   0.0000   0.0000   0.0000   0.0000   0.0000   0.0000   0.0000   0.0000

Note: Attributes with superscript numbers indicate elimination in the phase corresponding to the superscript number.

Table 4. Results of elimination using BE on the BC dataset in five phases

Attributes     Phase 1  Phase 2  Phase 3  Phase 4  Phase 5
Intercept      0.0338   0.0612   0.0600   0.0381   0.0214
BMI            0.0195   0.0235   0.0197   0.0144   0.0014
Glucose        0.0000   0.0000   0.0000   0.0000   0.0000
Insulin        0.0022   0.0011   0.0009   0.0009   0.0008
HOMA           0.0054   0.0032   0.0027   0.0026   0.0021
Leptin⁴        0.7881   0.7159   0.6632   0.6568   -
Adiponectin³   0.7963   0.9942   0.9891   -        -
Resistin       0.0458   0.0322   0.0214   0.0182   0.0193
MCP.1²         0.9014   0.8662   -        -        -
Age¹           0.3067   -        -        -        -

Note: Attributes with superscript numbers indicate elimination in the phase corresponding to the superscript number.

Table 5. Results of elimination using BE on the diabetes dataset in nine phases

Attributes            Phase 1  Phase 2  Phase 3  Phase 4  Phase 5  Phase 6  Phase 7  Phase 8  Phase 9
Intercept             0.1376   0.2826   0.2386   0.2935   0.0785   0.0275   0.0236   0.0105   0.0007
Age⁸                  0.6473   0.5589   0.5497   0.4962   0.8248   0.9165   0.8672   0.8750   -
Gender                0.0000   0.0000   0.0000   0.0000   0.0000   0.0000   0.0000   0.0000   0.0000
Polyuria              0.0000   0.0000   0.0000   0.0000   0.0000   0.0000   0.0000   0.0000   0.0000
Polydipsia            0.0000   0.0000   0.0000   0.0000   0.0000   0.0000   0.0000   0.0000   0.0000
Sudden Weight Loss⁷   0.1435   0.2061   0.2051   0.2196   0.2902   0.2422   0.1756   -        -
Weakness⁶             0.4209   0.3975   0.3924   0.4352   0.3237   0.3715   0.0000   -        -
Polyphagia⁵           0.1570   0.1309   0.1306   0.1634   0.1305   -        -        -        -
Genital Thrush        0.0000   0.0000   0.0000   0.0000   0.0000   0.0000   0.0005   0.0000   0.0000
Visual Blurring⁴      0.0775   0.0922   0.0916   0.1337   -        -        -        -        -
Itching               0.0002   0.0002   0.0002   0.0002   0.0004   0.0004   0.0000   0.0003   0.0003
Irritability          0.0000   0.0000   0.0000   0.0000   0.0000   0.0000   0.0154   0.0000   0.0000
Delayed Healing       0.0076   0.0120   0.0101   0.0072   0.0057   0.0108   0.0100   0.0172   0.0173
Partial Paresis       0.0523   0.0374   0.0354   0.0352   0.0214   0.0122   0.0088   0.0072
Muscle Stiffness³     0.4648   0.3529   0.3513   -        -        -        -        -        -
Alopecia²             0.9169   0.9695   -        -        -        -        -        -        -
Obesity¹              0.1415   -        -        -        -        -        -        -        -

Note: Attributes with superscript numbers indicate elimination in the phase corresponding to the superscript number.

The remaining attributes from the elimination process using BE in Tables 2, 3, 4, and 5 are then processed using the predetermined MLAs, and the accuracy of each applied ML combination is tested.

Additionally, the MLAs also process the datasets that did not undergo the BE process, allowing for a comparison of the respective results.

3.2 Accuracy Testing Results

Accuracy testing was conducted on all datasets, namely raisin, HF, BC, and diabetes. The testing was performed using RapidMiner tools to obtain accuracy values for each method. The accuracy results include values both before and after the implementation of BE (see Table 6).


Table 6. Accuracy testing results on several datasets before and after the implementation of BE

Dataset   NB (Without BE)  NB (With BE)  KNN (Without BE)  KNN (With BE)  SVM (Without BE)  SVM (With BE)
Raisin    83.56%           86.00%        82.89%            83.89%         86.78%            87.22%
HF        75.92%           83.94%        62.20%            86.31%         80.28%            82.97%
BC        61.89%           66.36%        50.76%            83.56%         74.24%            77.50%
Diabetes  89.04%           91.15%        89.62%            96.15%         92.88%            93.46%

Table 6 shows that the accuracy values after the implementation of BE are higher than before, and this holds true for all methods and across all datasets. The highest accuracy is observed in the diabetes dataset using KNN, at 96.15%, followed by the same dataset using SVM at 93.46%. The largest increase in accuracy occurs in the BC dataset with the KNN method, which rises from 50.76% to 83.56% with the implementation of BE.

3.3 Normality Test Results

The normality testing is conducted separately based on the MLA. The data used for this testing is derived from the accuracy values obtained in the previous stages. Tables 7, 8, and 9 present the results of the normality testing using the NB, KNN, and SVM methods, respectively.

Table 7. Normality test results for datasets processed using NB

Without BE
No  X       Z          F(Zi)     S(Zi)  F(Zi) - S(Zi)
1   74.24%  -0.936470  0.174515  0.25   0.075485
2   75.92%  -0.692550  0.244295  0.50   0.255705
3   83.56%   0.416694  0.661549  0.75   0.088451
4   89.04%   1.212333  0.887308  1.00   0.112692
Mean = 80.69%; Std. Dev. = 0.068875; Lo = 0.255705

With BE
No  X       Z          F(Zi)     S(Zi)  F(Zi) - S(Zi)
1   77.50%  -1.265500  0.102847  0.25   0.147153
2   83.94%  -0.125270  0.450156  0.50   0.049844
3   86.00%   0.239466  0.594628  0.75   0.155372
4   91.15%   1.151296  0.875195  1.00   0.124805
Mean = 84.65%; Std. Dev. = 0.056480; Lo = 0.155372

Table 8. Normality test results for datasets processed using KNN

Without BE
No  X       Z          F(Zi)     S(Zi)  F(Zi) - S(Zi)
1   50.76%  -1.143310  0.126454  0.25   0.123546
2   62.20%  -0.508620  0.305510  0.50   0.194490
3   82.89%   0.639273  0.738677  0.75   0.011323
4   89.62%   1.012657  0.844388  1.00   0.155612
Mean = 71.37%; Std. Dev. = 0.180244; Lo = 0.194490

With BE
No  X       Z          F(Zi)     S(Zi)  F(Zi) - S(Zi)
1   83.56%  -0.662830  0.253718  0.25   0.003718
2   83.89%  -0.607000  0.271926  0.50   0.228074
3   86.31%  -0.197540  0.421703  0.75   0.328297
4   96.15%   1.467372  0.928863  1.00   0.071137
Mean = 87.48%; Std. Dev. = 0.059102; Lo = 0.328297

Table 9. Normality test results for datasets processed using SVM

Without BE
No  X       Z          F(Zi)     S(Zi)  F(Zi) - S(Zi)
1   61.89%  -1.385130  0.083007  0.25   0.166993
2   80.28%  -0.013240  0.494718  0.50   0.005282
3   86.78%   0.471655  0.681413  0.75   0.068587
4   92.88%   0.926712  0.822962  1.00   0.177038
Mean = 80.46%; Std. Dev. = 0.134049; Lo = 0.177038

With BE
No  X       Z          F(Zi)     S(Zi)  F(Zi) - S(Zi)
1   66.36%  -1.392560  0.081876  0.25   0.168124
2   82.97%   0.040330  0.516085  0.50   0.016085
3   87.22%   0.406963  0.657982  0.75   0.092018
4   93.46%   0.945267  0.827739  1.00   0.172261
Mean = 82.50%; Std. Dev. = 0.115920; Lo = 0.172261

The reference L(table) value is 0.381, considering four samples and a significance level of 0.05. The Lo values in Tables 7, 8, and 9 are all smaller than this critical value of 0.381. Therefore, it can be stated that all data are normally distributed, allowing for the continuation to the t-test.

3.4 t-Test Results

A t-test was carried out to assess if there exists a notable difference in accuracy values before and after the addition of BE. Tables 10, 11, and 12 show the outcomes of the t-test for the NB, KNN, and SVM methods, respectively.

Table 10. Results of the t-test for the entire dataset processed using NB

t-Test: Paired Two Sample for Means
                               Without BE    With BE
Mean                           0.8069        0.846475
Variance                       0.004743827   0.003189969
Observations                   4             4
Pearson Correlation            0.92246534
Hypothesized Mean Difference   1
df                             3
t Stat                         -75.57385754
P(T<=t) one-tail               0.000002553
t Critical one-tail            2.353363435
P(T<=t) two-tail               5.10602E-06
t Critical two-tail            3.182446305

Table 11. Results of the t-test for the entire dataset processed using KNN

t-Test: Paired Two Sample for Means
                               Without BE    With BE
Mean                           0.713675      0.874775
Variance                       0.032487796   0.00349308
Observations                   4             4
Pearson Correlation            0.652068325
Hypothesized Mean Difference   1
df                             3
t Stat                         -15.62499606
P(T<=t) one-tail               0.000284848
t Critical one-tail            2.353363435
P(T<=t) two-tail               0.000569697
t Critical two-tail            3.182446305

Table 12. Results of the t-test for the entire dataset processed using SVM

t-Test: Paired Two Sample for Means
                               Without BE    With BE
Mean                           0.804575      0.825025
Variance                       0.017969202   0.01343735
Observations                   4             4
Pearson Correlation            0.998757585
Hypothesized Mean Difference   1
df                             3
t Stat                         -106.4910802
P(T<=t) one-tail               0.000000913
t Critical one-tail            2.353363435
P(T<=t) two-tail               1.82555E-06
t Critical two-tail            3.182446305

Significance in this t-test is confirmed when the p-value < the significance level of 0.05. Based on Tables 10, 11, and 12, the one-tail p-value is 0.000002553 for the NB method, 0.000284848 for KNN, and 0.000000913 for SVM. In every case the p-value < the 0.05 significance level, indicating a substantial difference in accuracy before and after the implementation of BE. In other words, H1 is accepted for all MLAs.

3.5 BE Performance in Disease Detection

Table 13 shows a comparison of accuracy values from this study compared to data from several other studies.

Table 13. Comparison of accuracy values in this study with previous studies

Learning Algorithm and Feature Selection  Dataset   Accuracy  Reference
NB + BE                                   HF        87.61%    [13]
                                          HF        83.94%    This work
                                          BC        66.36%    This work
                                          Diabetes  91.15%    This work
KNN + BE                                  HF        87.61%    [13]
                                          HF        87.95%    [24]
                                          HF        86.31%    This work
                                          BC        83.56%    This work
                                          Diabetes  96.15%    This work
SVM + BE                                  HF        82.97%    This work
                                          BC        97.02%    [19]
                                          BC        77.50%    This work
                                          Diabetes  85.71%    [18]
                                          Diabetes  93.46%    This work

The results of this study indicate an improvement in the performance of MLAs, as evidenced by significance testing. This improvement is attributed to the addition of the BE feature selection algorithm. Therefore, it can be concluded that BE can be implemented in the detection of diseases such as HF, BC, and diabetes. Subsequent research could test additional MLAs and different types of datasets with larger data sizes.

4. CONCLUSION

One of the challenges in classification is high-dimensional data, which can be addressed by using feature selection, particularly BE. In this study, testing was conducted on four datasets: the raisin, HF, BC, and early-stage diabetes risk prediction datasets. Each dataset was tested using three machine learning algorithms: NB, KNN, and SVM. By employing BE, attributes that do not have a significant impact were eliminated, leaving only influential attributes in the model. The t-test results indicate a significant impact on accuracy across all datasets. For the raisin dataset, the SVM algorithm achieved the highest accuracy among the three machine learning algorithms, reaching 87.22%. For the HF dataset, the KNN algorithm achieved the highest accuracy at 86.31%. In the BC dataset, the KNN algorithm also yielded the highest accuracy, reaching 83.56%. For the diabetes dataset, the KNN algorithm achieved the highest accuracy at 96.15%. These findings highlight the effectiveness of BE in enhancing the performance of machine learning algorithms for disease detection.

Future research endeavors could extend this study by testing multiple machine learning algorithms on diverse datasets with larger data sizes to further validate the robustness and generalizability of the approach.

REFERENCES

[1] M. Gjoreski, M. Simjanoska, A. Gradišek, A. Peterlin, M. Gams, and G. Poglajen, "Chronic heart failure detection from heart sounds using a stack of machine-learning classifiers," in 2017 International Conference on Intelligent Environments (IE), 2017, pp. 14–19, doi: 10.1109/IE.2017.19.

[2] F. S. Alotaibi, "Implementation of Machine Learning Model to Predict Heart Failure Disease," Int. J. Adv. Comput. Sci. Appl., vol. 10, no. 6, 2019, doi: 10.14569/IJACSA.2019.0100637.

[3] S. A. Mohammed, S. Darrab, S. A. Noaman, and G. Saake, "Analysis of Breast Cancer Detection Using Different Machine Learning Techniques," 2020, pp. 108–117, doi: 10.1007/978-981-15-7205-0_10.

[4] D. A. Omondiagbe, S. Veeramani, and A. S. Sidhu, "Machine Learning Classification Techniques for Breast Cancer Diagnosis," IOP Conf. Ser. Mater. Sci. Eng., vol. 495, p. 012033, Jun. 2019, doi: 10.1088/1757-899X/495/1/012033.

[5] L. Kopitar, P. Kocbek, L. Cilar, A. Sheikh, and G. Stiglic, "Early detection of type 2 diabetes mellitus using machine learning-based prediction models," Sci. Rep., vol. 10, no. 1, p. 11981, Jul. 2020, doi: 10.1038/s41598-020-68771-z.

[6] M. Kubat, An Introduction to Machine Learning. Cham: Springer International Publishing, 2017, doi: 10.1007/978-3-319-63913-0.

[7] R. C. Thom de Souza, C. A. de Macedo, L. dos Santos Coelho, J. Pierezan, and V. C. Mariani, "Binary coyote optimization algorithm for feature selection," Pattern Recognit., vol. 107, p. 107470, Nov. 2020, doi: 10.1016/j.patcog.2020.107470.

[8] T. Nyathi and N. Pillay, "Comparison of a genetic algorithm to grammatical evolution for automated design of genetic programming classification algorithms," Expert Syst. Appl., vol. 104, pp. 213–234, Aug. 2018, doi: 10.1016/j.eswa.2018.03.030.

[9] E. Odhiambo Omuya, G. Onyango Okeyo, and M. Waema Kimwele, "Feature Selection for Classification using Principal Component Analysis and Information Gain," Expert Syst. Appl., vol. 174, p. 114765, Jul. 2021, doi: 10.1016/j.eswa.2021.114765.

[10] T. H. Nguyen, K. Shirai, and J. Velcin, "Sentiment analysis on social media for stock movement prediction," Expert Syst. Appl., vol. 42, no. 24, pp. 9603–9611, Dec. 2015, doi: 10.1016/j.eswa.2015.07.052.

[11] S. Arora, H. Singh, M. Sharma, S. Sharma, and P. Anand, "A New Hybrid Algorithm Based on Grey Wolf Optimization and Crow Search Algorithm for Unconstrained Function Optimization and Feature Selection," IEEE Access, vol. 7, pp. 26343–26361, 2019, doi: 10.1109/ACCESS.2019.2897325.

[12] R. C. T. De Souza, L. dos S. Coelho, C. A. De Macedo, and J. Pierezan, "A V-Shaped Binary Crow Search Algorithm for Feature Selection," in 2018 IEEE Congress on Evolutionary Computation (CEC), Jul. 2018, pp. 1–8, doi: 10.1109/CEC.2018.8477975.

[13] C. C. Aggarwal, X. Kong, Q. Gu, J. Han, and P. S. Yu, "Active learning: A survey," Data Classification: Algorithms and Applications, pp. 571–605, 2014, doi: 10.1201/b17320.

[14] A. Katrutsa and V. Strijov, "Comprehensive study of feature selection methods to solve multicollinearity problem according to evaluation criteria," Expert Syst. Appl., vol. 76, pp. 1–11, Jun. 2017, doi: 10.1016/j.eswa.2017.01.048.

[15] A. Zarshenas and K. Suzuki, "Binary coordinate ascent: An efficient optimization technique for feature subset selection for machine learning," Knowledge-Based Syst., vol. 110, pp. 191–201, Oct. 2016, doi: 10.1016/j.knosys.2016.07.026.

[16] X. Zhu, S. Zhang, R. Hu, Y. Zhu, and J. Song, "Local and Global Structure Preservation for Robust Unsupervised Spectral Feature Selection," IEEE Trans. Knowl. Data Eng., vol. 30, no. 3, pp. 517–529, Mar. 2018, doi: 10.1109/TKDE.2017.2763618.

[17] M. Z. I. Chowdhury and T. C. Turin, "Variable selection strategies and its importance in clinical prediction modelling," Fam. Med. Community Heal., vol. 8, no. 1, p. e000262, Feb. 2020, doi: 10.1136/fmch-2019-000262.

[18] F. Maulidina, Z. Rustam, S. Hartini, V. V. P. Wibowo, I. Wirasati, and W. Sadewo, "Feature optimization using Backward Elimination and Support Vector Machines (SVM) algorithm for diabetes classification," J. Phys. Conf. Ser., vol. 1821, no. 1, p. 012006, Mar. 2021, doi: 10.1088/1742-6596/1821/1/012006.

[19] S. Farahdiba, D. Kartini, R. A. Nugroho, R. Herteno, and T. H. Saragih, "Backward Elimination for Feature Selection on Breast Cancer Classification Using Logistic Regression and Support Vector Machine Algorithms," IJCCS (Indonesian J. Comput. Cybern. Syst.), vol. 17, no. 4, p. 429, Oct. 2023, doi: 10.22146/ijccs.88926.

[20] M. Arifin, "Naïve Bayes Algorithm Based On Backward Elimination For Predicting Cervical Cancer," Int. J. Innov. Sci. Res. Technol., vol. 7, no. 7, pp. 1–3, 2022.

[21] N. Bodasingi, N. Balaji, and B. R. Jammu, "Automatic diagnosis of pneumonia using backward elimination method based SVM and its hardware implementation," Int. J. Imaging Syst. Technol., vol. 32, no. 3, pp. 1000–1014, May 2022, doi: 10.1002/ima.22694.

[22] S. Karthika and N. Sairam, "A Naïve Bayesian Classifier for Educational Qualification," Indian J. Sci. Technol., vol. 8, no. 16, Jul. 2015, doi: 10.17485/ijst/2015/v8i16/62055.

[23] V. Kumar, "Feature Selection: A literature Review," Smart Comput. Rev., vol. 4, no. 3, Jun. 2014, doi: 10.6029/smartcr.2014.03.007.

[24] Y. Isler, U. Ozturk, and E. Sayilgan, "A new sample reduction method for decreasing the running time of the k-nearest neighbors algorithm to diagnose patients with congestive heart failure: backward iterative elimination," Sādhanā, vol. 48, no. 2, p. 35, Mar. 2023, doi: 10.1007/s12046-023-02105-3.
