
INTRODUCTION

In AIP Conference Proceedings (pages 38-43)

The use of Artificial Intelligence (AI) is receiving growing attention across many sectors, one of which is the use of machine learning in health care. Globally, health systems face many challenges, including an increasing burden of disease and greater demand for health services [17]. A fundamental transformation of the health system is critical to achieving Universal Health Coverage (UHC) by 2030. Machine learning is the most tangible manifestation of AI able to address these challenges, in the form of digital technology that can make the health system more effective and efficient [6]. The ability of machine learning to solve core information-processing problems in the health system can improve decision-making, for example in classifying types of diseases with reported accuracy of up to 97% [10].

Breast cancer is one of the cases in the health sector with a high diagnostic error rate of 10-30% [8]. Based on this, an alternative is needed as part of the transformation of the health system, one that can support the prediction process and reduce diagnostic errors [23]. This follows the WHO statement that the initial diagnosis of cancer must be accurate because it determines the patient's subsequent treatment path. In addition, according to breast cancer statistics for 2019, invasive breast cancer had a very high number of cases, reaching 268,600 in the United States. Invasive cancer is cancer that has spread to tissues other than the breast tissue itself.

Invasive cancer occurs when delays and inaccuracies at the diagnosis stage allow the cancer to spread to other tissues, making it more difficult to treat [20].

Machine learning can be an alternative solution thanks to its ability to learn from the past by identifying patterns in a given data set. Machine learning works on probabilistic and statistical principles [14], which allow the system to learn from past or repeated experience to detect and identify patterns in data sets. Classification of breast cancer types is one application of machine learning with specific characteristics and a high level of machine involvement, referred to as automated intelligence [18].

Several researchers have contributed to the same field by applying different machine learning algorithms to the WDBC dataset with strong results. Breast cancer prediction has previously been carried out using KNN by [21] with an accuracy of 95.90%, and using linear regression and decision tree algorithms by Murugan in 2018 with an accuracy of 88.14%. In this study, the type of breast cancer is predicted using a different algorithm, with different evaluation methods and error calculations from previous studies, reaching an accuracy of 96.50%. The algorithm used is logistic regression, evaluated with cross-validation, the confusion matrix, and ROC-AUC. The performance indicator used is RMSE.

METHODOLOGY

In this study, the dataset used is the Wisconsin Breast Cancer Diagnostics (WBCD) dataset published by the UCI Machine Learning Repository; here it was obtained through the website https://www.kaggle.com/uciml/breast-cancer-wisconsin-data. The dataset was created by Dr. William H. Wolberg, a physician at the University of Wisconsin Hospital in Madison, Wisconsin, USA, using Fine Needle Aspiration (FNA) as the data collection method. FNA is a method of sampling cells from the breast tumor area; in this dataset there are 10 characteristics computed for each cell nucleus, namely: Radius, Texture, Perimeter, Area, Smoothness, Compactness, Concavity, Concave points, Symmetry, and Fractal dimension.

These attributes or features show significant variation that distinguishes malignant from benign cancer (Johra & Shuvo, 2017). The dataset contains 569 rows and 32 attributes, with two classification classes, malignant (M) and benign (B), comprising 357 benign and 212 malignant samples. Table 1 below describes the WBCD dataset, and the number of benign and malignant diagnoses is also visualized in Figure 1 below.
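The dataset described above can be inspected directly; the following sketch uses scikit-learn, which bundles the same UCI Wisconsin data, so downloading the Kaggle CSV is not required for a quick check (assuming scikit-learn is installed).

```python
# Load the WBCD dataset and verify the counts reported in Table 1.
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()
X, y = data.data, data.target

print(X.shape)                   # (569, 30): 569 samples, 30 numeric features
print(list(data.target_names))   # ['malignant', 'benign']
# Class balance matches Table 1: 212 malignant (label 0), 357 benign (label 1)
print((y == 0).sum(), (y == 1).sum())
```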

TABLE 1. WBCD Dataset Description

Attribute           Value
Sample total        569
Dimensionality      30
Classes             2
Samples per class   Benign: 357 (62.74%), Malignant: 212 (37.26%)

FIGURE 1. Distribution of Benign and Malignant Diagnoses

The correlation between features is important because it shows their interdependence within the dataset. Figure 2 below shows the correlation between features, visualized as a heat-map plot in the programming interface.

FIGURE 2. Correlation Between Features

Based on Figure 2 above, the correlation shows how changes in two attributes of the dataset are related. If two attributes change in the same direction, they are positively correlated; if they change in opposite directions, the correlation is negative; otherwise there is no correlation. This matters because some machine learning algorithms may perform sub-optimally on data dominated by strongly positively correlated features. The amount of data used is described in Table 2 below, with a proportion of 80% for training and 20% for testing [9].
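The correlation analysis behind Figure 2 can be sketched with pandas; the seaborn heat-map call (commented out) is what would render the plot itself. The specific feature pair printed is an assumption chosen for illustration, since radius and perimeter are geometrically linked.

```python
# Compute the 30x30 Pearson correlation matrix of the WBCD features.
import pandas as pd
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()
df = pd.DataFrame(data.data, columns=data.feature_names)

corr = df.corr()
# Strongly correlated pairs (e.g. radius vs. perimeter) stand out in the heat map:
print(corr.loc["mean radius", "mean perimeter"])   # close to 1.0

# To render the heat map as in Figure 2:
# import seaborn as sns; import matplotlib.pyplot as plt
# sns.heatmap(corr); plt.show()
```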

TABLE 2. Splitting Data

Dataset        Malignant   Benign   Total Data
Training set   165         290      455
Testing set    47          67       114
Total data     212         357      569

This research was carried out based on the flow chart shown in Figure 3 below.

FIGURE 3. Flowchart of the System

The process starts by inputting the data, a .csv file containing the WBCD (Wisconsin Breast Cancer Diagnostics) breast cancer dataset. The dataset input into the programming system then enters the pre-processing stage. High-quality data is data that contains no noise and no incorrect entries; good classification accuracy depends on the quality of the processed data and is achieved through the pre-processing stage [19]. The training data is used to train the algorithm to produce a trained classification model; in this study, the trained classification model refers to the code that has carried out the data training, or fitting, process. The fitted transformation is then applied to the test data during testing. Standardization is used in this paper to normalize the data so that the numerical features share the same scale, which simplifies computation for machine learning by keeping the data within the range 0 to 1 [21].

The standardization formula used is the Min-Max normalization calculation:

X' = (X − X_min) / (X_max − X_min)    (1)

where:

X' = the normalized value of attribute X
X_min = the smallest value of attribute X
X_max = the largest value of attribute X
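Equation (1) can be implemented directly with NumPy; the sample array is an invented illustration, not data from the paper.

```python
# Min-Max normalization, equation (1): rescale each column to the range [0, 1].
import numpy as np

def min_max_normalize(x):
    """Apply X' = (X - X_min) / (X_max - X_min) column-wise."""
    x = np.asarray(x, dtype=float)
    x_min = x.min(axis=0)
    x_max = x.max(axis=0)
    return (x - x_min) / (x_max - x_min)

sample = np.array([[2.0, 10.0],
                   [4.0, 20.0],
                   [6.0, 30.0]])
print(min_max_normalize(sample))
# Each column now spans [0, 1]; the middle row maps to 0.5 in both columns.
```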

The training data that has gone through the standardization stage is processed by the algorithm in the training process, using the cross-validation evaluation method. The next step is the testing process on the test data, using the confusion matrix and ROC-AUC as evaluation methods. The process ends by finding the error value through the RMSE calculation for the algorithm. The outputs of this research are accuracy, precision, recall, F1-score, ROC-AUC, and RMSE values.

In this study, the type of logistic regression used is binary, which predicts the data into two target variables, 0 and 1 (Harlan, 2013). The logistic regression prediction uses the following equation:

p = e^(b0 + b1·x) / (1 + e^(b0 + b1·x))    (2)

To map the prediction results onto a graph with target variables 1 and 0, the sigmoid function plays a central role; its equation is shown in equation 3.

σ(z) = 1 / (1 + e^(−z))    (3)

The graph of the sigmoid function is shown in Figure 4.

FIGURE 4. Sigmoid Function Curve [15]
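The sigmoid mapping of equation (3) is a one-liner; this sketch shows how it squashes any real-valued score into the (0, 1) interval that yields the curve in Figure 4.

```python
# Sigmoid function, equation (3): maps any real z into the open interval (0, 1).
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

print(sigmoid(0.0))    # 0.5: the decision boundary between the two classes
print(sigmoid(10.0))   # approaches 1.0 -> predicted class 1
print(sigmoid(-10.0))  # approaches 0.0 -> predicted class 0
```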

The confusion matrix counts the predictions of a classification model, showing which results are correct and what types of errors are made. The numbers of true and false predictions are summarized by count and broken down by class. The form of the confusion matrix is shown in Figure 5.

                              Predicted Class
                       Positive (P)           Negative (N)
Actual   Positive (P)  TP (True Positive)     FN (False Negative)
Class    Negative (N)  FP (False Positive)    TN (True Negative)

FIGURE 5. Confusion Matrix (Fenner, 2019)

Based on Figure 5 above, the accuracy, recall, precision, and F1-score values can be calculated sequentially as follows:

Accuracy = (TP + TN) / (TP + TN + FP + FN)    (4)

Recall = TP / (TP + FN)    (5)

Precision = TP / (TP + FP)    (6)

F1-score = 2 × (Precision × Recall) / (Precision + Recall)    (7)
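Equations (4)-(7) can be computed directly from the four confusion-matrix counts; the counts in the example are invented for illustration and are not the paper's results.

```python
# Metrics from confusion-matrix counts, equations (4)-(7).
def classification_metrics(tp, tn, fp, fn):
    accuracy = (tp + tn) / (tp + tn + fp + fn)   # eq. (4)
    recall = tp / (tp + fn)                      # eq. (5)
    precision = tp / (tp + fp)                   # eq. (6)
    f1 = 2 * (precision * recall) / (precision + recall)  # eq. (7)
    return accuracy, recall, precision, f1

# Illustrative counts only (hypothetical, not taken from this study):
acc, rec, prec, f1 = classification_metrics(tp=40, tn=60, fp=5, fn=9)
print(round(acc, 3), round(rec, 3), round(prec, 3), round(f1, 3))
```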

In this study, k-fold = 10 was used; cross-validation with 10 or 20 folds is the most commonly recommended procedure for checking the generalizability of a model [9]. Cross-validation thus creates more varied training data: with k-fold = 10 the data is divided into 10 segments over 10 iterations, where in each iteration 9 segments serve as training data and the remaining segment as test data. Figure 6 below illustrates k-fold = 10.

FIGURE 6. Cross Validation [4]

In k-fold cross-validation, the total data is divided into k parts. In iteration (fold) 1, the 1st part becomes the testing set, the remaining parts become the training set, and the system calculates the accuracy. In fold 2, the 2nd part becomes the testing set, the rest becomes the training set, and the accuracy is recalculated. The process is repeated until the k-th fold [3]. The average accuracy over all k folds is the final accuracy result.
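The 10-fold procedure described above can be sketched with scikit-learn; placing the scaler inside a pipeline ensures each fold is normalized using only its own training part. This is a minimal sketch of the study's setup, not its exact configuration.

```python
# 10-fold cross-validation of logistic regression on the WBCD dataset.
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

X, y = load_breast_cancer(return_X_y=True)
model = make_pipeline(MinMaxScaler(), LogisticRegression(max_iter=1000))

scores = cross_val_score(model, X, y, cv=10)   # one accuracy per fold
print(scores.mean())                           # final accuracy: the fold average
```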

ROC-AUC complements the confusion matrix by providing information about the ability of an algorithm to separate the predictions 1 and 0 of the tested dataset. ROC is a two-dimensional graph in which the false positive rate lies on the horizontal axis and the true positive rate on the vertical axis. FPR and TPR are found through equations 8 and 9 [16].

FPR = FP / (FP + TN)    (8)

TPR = TP / (TP + FN)    (9)

The ROC curve is a technique to visualize and compare classifiers based on their performance, while the Area Under the Curve (AUC) value is a number expressing the area under the ROC curve. AUC values fall into the categories shown in Table 3.

TABLE 3. AUC Category
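Equations (8) and (9), evaluated at every probability threshold, produce the ROC curve, and integrating it gives the AUC. The sketch below reuses the 80/20 split from Table 2; the `random_state` is an arbitrary choice for reproducibility.

```python
# ROC curve points (eqs. 8 and 9) and AUC for logistic regression on WBCD.
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

model = make_pipeline(MinMaxScaler(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)

# The predicted probability of the positive class drives both the curve and the AUC.
proba = model.predict_proba(X_test)[:, 1]
fpr, tpr, _ = roc_curve(y_test, proba)   # FPR and TPR at each threshold
auc = roc_auc_score(y_test, proba)
print(auc)                               # 1.0 would be a perfect separator
```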

In this study, the performance indicator used is Root Mean Squared Error (RMSE). RMSE is an indicator used to evaluate the average error of an algorithm, also called the algorithm's standard error.

RMSE is calculated using equation 10:

RMSE = sqrt( (1/N) × Σ (Y' − Y)² )    (10)

where:

Y' = predicted value
Y = actual value
N = number of data points

The smaller the RMSE value of the algorithm, the better the algorithm's performance in predicting diagnosis [22].
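Equation (10) is straightforward to implement; for 0/1 class labels, the RMSE reduces to the square root of the misclassification rate, as the worked example shows.

```python
# RMSE, equation (10): root mean squared error between predictions and labels.
import numpy as np

def rmse(y_pred, y_true):
    y_pred = np.asarray(y_pred, dtype=float)
    y_true = np.asarray(y_true, dtype=float)
    return float(np.sqrt(np.mean((y_pred - y_true) ** 2)))

# One wrong prediction out of four: RMSE = sqrt(1/4) = 0.5
print(rmse([1, 0, 1, 1], [1, 0, 0, 1]))
```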
