Comparative Analysis Performance of Naïve Bayes and K-NN Using Confusion Matrix and AUC To Predict Insurance Fraud

Gandung Triyono, Dermawan Ginting*

Faculty of Information Technology, Master of Computer Science, Budi Luhur University, Jakarta, Indonesia
Email: 1[email protected], 2,*[email protected]

Correspondence Author Email: [email protected]

Abstract− Claim submission data from 2019 to 2021 shows that the percentage of claims in one province is much higher than in the others: during that period it reached 22%, while the highest percentage in any other province was only 6%. Claim fraud is therefore suspected in that province. The suspected fraud begins when a customer submits a policy application for an elderly insured with a low sum insured, so that the premium is also low. The insured's health condition at that time may be poor, but this is not disclosed in the insurance application letter. To increase the sum insured, additional coverage is usually added to the policy. Fraudulent claims create large losses for the insurance company, since it has to pay claims it should not pay, so the company needs a mechanism to avoid them. This research aims to find the best method for predicting potential insurance claim fraud early, when customers apply for policy issuance, so that additional checks can be carried out on suspicious submissions. The initial data set consists of 14,778 claim records with the attributes claim submission date, policy effective date, sum assured, type of claim, cause of claim, province, and fraud. To find the method with the best accuracy and performance, two methods (Naïve Bayes and K-NN) are compared, each with a ratio of training data to testing data of 80:20 and several attribute combinations. Using the Confusion Matrix and AUC to measure accuracy and performance, the best method is Naïve Bayes with an accuracy of 90% and an AUC of 0.761, using the attributes province, sum assured, additional coverage, and whether the insured is the policy holder.

Keywords: Insurance Claim Fraud; Naïve Bayes; K-NN; Confusion Matrix; AUC

1. INTRODUCTION

An insurance company is a financial institution supervised by OJK (the Financial Services Authority of Indonesia). In accordance with [1], Chapter IV on Business Licensing, Article 8 paragraph 1, every party conducting an insurance business must obtain a business license from OJK. According to Article 1 paragraph 1 of the same law, insurance is an agreement between two parties, the insurance company and the policy holder, which forms the basis for the company to provide compensation to the policy holder for a loss or damage caused by an event, or to make a payment based on the death or the life of the insured, with benefits whose amount has been determined. Figure 1 shows the percentage of claim amount per province from 2019 to 2021, and Figure 2 shows the percentage of the number of claims per province over the same period. Based on these two graphs, both the claim amount and the number of claims in one province are much higher than in the other provinces, so claim fraud is suspected to have occurred there.

Figure 1. Percentage of Claim Amount

Figure 2. Percentage of Number of Claim


Fraudulent claims create large losses for the insurance company, since it has to pay claims it should not pay, so the company needs a mechanism to avoid them. Two options are available: verify/validate the claim submission from the customer, or verify/validate the policy submission when the customer buys a policy. This research focuses on the second option. Policy submission data from the customer is profiled against existing claim experience data, and additional checks can be carried out on suspicious submissions as needed. This research is expected to find the best algorithm for this profiling. To find the method with the best performance, two methods (Naïve Bayes and K-NN) are compared. Naïve Bayes is selected because [2] showed that it has better accuracy and precision than several other methods; K-NN is selected because [3] describes it as one of the best and simplest text classification algorithms. The Confusion Matrix and AUC are used to measure the performance and accuracy of both methods so that the results can be compared.

Several studies on insurance fraud have been conducted. Research [4], titled 'Analyses and Detection of Health Insurance Fraud Using Data Mining and Predictive Modeling Techniques', used three algorithms: logistic regression, neural network, and decision tree. The data set contained no fraud indicator, so one had to be determined beforehand by scoring the data set against existing business rules. The decision tree algorithm was found to be slightly better than the other two. Research [5], titled 'Detection of Automobile Insurance Fraud Using Feature Selection and Data Mining Techniques', reported that losses from fraudulent vehicle insurance claims have reached 7.7 billion dollars. The researchers used three selection methods (GO, PSO, and ACO) to select the attributes relevant to detecting fraud, then applied the PFCM clustering technique and the WELM classification technique to classify whether a claim is fraudulent; of the three optimization techniques, PSO showed the best performance. Research [6], titled 'Fraud detection and frequent pattern matching in insurance claims using data mining techniques', classified fraudulent behavior into two categories: period-based claim anomalies and disease-based anomalies.

Period-based claim anomalies are investigated by analyzing statistical decision rules that detect short-term outliers, which helps in detecting fraud. K-means clustering is then used to group similar patterns into one cluster, and the elbow test is applied to improve the performance of k-means clustering by finding the optimal value of k for the data.

The results show that the proposed approach is efficient at identifying fraudulent claims.

Research [7], titled 'Fraud Detection in Automobile Insurance using a Data Mining Based Approach', also examined fraudulent insurance claims. It used the K-means clustering technique, and the experimental results showed high accuracy compared with statistical information taken from the data set. Research [8], titled 'An Improved Approach For Fraud Detection In Health Insurance Using Data Mining Techniques', proposed an approach using the Random Forest algorithm, which was found to be more efficient. Research [9], titled 'Prediction of Insurance Fraud Detection using Machine Learning Algorithms', used an auto insurance fraud detection data set containing 110 customers. It performed a comparative analysis of various classification algorithms, namely Support Vector Machine (SVM), Random Forest (RF), Decision Tree (DT), AdaBoost, K-Nearest Neighbor (K-NN), Linear Regression (LR), Naïve Bayes (NB), and Multi-Layer Perceptron (MLP), and concluded that the Decision Tree gives the highest accuracy, 79%, compared with the other techniques. Research [10] mentioned that about 10 percent of the losses incurred by the insurance industry are estimated to come from fraudulent claims. It processed 3.3 million health care bill records using logistic regression, random forest, and gradient boosting, and concluded from empirical experiments that the model can be improved by optimizing the neural network architecture, increasing the volume of training data, and combining techniques to handle the problem of unbalanced classes. Research [11] discussed fraud cases such as insurance fraud, where a customer tries to take the opportunity to gain a financial benefit. It described two types of machine learning that can be used: supervised learning (e.g., SVM, Naïve Bayes, LR, K-NN, and Random Forest) and unsupervised learning (e.g., K-Means and Fuzzy C-Means), and concluded that machine learning has limitations, which big data analytics techniques can be used to anticipate. Research [12], titled 'Using machine learning models to compare various resampling methods in predicting insurance fraud', mentioned that insurance fraud costs insurers millions of dollars every year, driving premiums up. Using 37,082 records from an insurance company and the algorithms Artificial Neural Network (ANN), Multi-Layer Perceptron (MLP), Random Forest (RF), K-Nearest Neighbor (K-NN), XGBoost, AdaBoost, Support Vector Machine (SVM), Decision Tree, and Naïve Bayes, it concluded that classifiers cannot make appropriate predictions from imbalanced data and that no single resampling method outperforms overall: the best model after Random Over Sampling is SVM, after Random Under Sampling is C5.0, after SMOTE is C5.0, and after the hybrid method is Stochastic Gradient Boosting. Research [13], titled 'Blockchain and AI-Empowered Healthcare Insurance Fraud Detection: An Analysis, Architecture, and Future Prospects', mentioned that over the past few years fraud has become a sensitive issue in the health insurance field, causing high losses for individuals, private companies,


and governments. Several challenges must be overcome in this area: validation of data and models, lack of talent, fraud detection systems, outdated computer systems, and data privacy and security. That research proposed several layers to address the issue: a user layer, a data-generating layer, a data analytics layer, and a blockchain layer.

Research [14] mentioned that, with the development of the insurance industry, insurance fraud is also increasing rapidly and has seriously hampered the industry's development, so research on insurance fraud is very important. That research combined an improved adaptive genetic algorithm (NAGA) with a BP neural network and concluded that predictions from the improved NAGA-BP model are closer to the original data.

This research compares the accuracy and AUC of Naïve Bayes and K-NN. Several scenarios are tested with each method, the Confusion Matrix and ROC are used to measure the results, and the best result from each method is compared to determine the better one. The Naïve Bayes algorithm was chosen because [2] showed that Naïve Bayes has better accuracy and precision than several other algorithms: in that research, seven data sets from the UCI repository were used with the algorithms Naïve Bayes, KStar, OneR, and Random Forest, and Naïve Bayes showed the best accuracy on all of them. According to [15], Bayes' theorem describes the probability of an event based on prior knowledge of conditions related to that event; the theorem was first introduced by the Reverend Thomas Bayes. K-NN was chosen because, according to [3], it is one of the best and simplest text classification algorithms.

The purpose of this research is to find the better algorithm between Naïve Bayes and K-NN for predicting potential insurance claim fraud when a customer submits a policy application, so that additional checks can be carried out on suspicious submissions.

2. RESEARCH METHODOLOGY

This research uses quantitative research methods. The quantitative method was chosen because the existing data population is examined using research instruments with the aim of testing a hypothesis. According to [16], quantitative methods examine populations or samples using measuring or research instruments, with quantitative or statistical data analysis aimed at testing hypotheses that have been made.

Generally, quantitative methods consist of survey methods and experimental methods. According to [17], there are two standard processes in data mining: CRISP-DM (Cross-Industry Standard Process for Data Mining) and SEMMA (Sample, Explore, Modify, Model, and Assess). Figure 3 shows the step-by-step CRISP-DM process: business understanding, data understanding, data preparation, modeling, evaluation, and deployment. This research uses the CRISP-DM methodology; Naïve Bayes and K-NN are applied in the modeling step.

Figure 3. CRISP-DM Methodology

2.1 Business Understanding

Business understanding is the activity of understanding the business process in detail, the problem to be solved, and the desired outcome of the research. It can be done through discussion with the business team or by reading documentation.

2.2 Data Understanding

Data understanding is the process of collecting and analyzing the data needed for the research. Through this stage it is possible to identify potential problems in the data, giving an initial picture of its shortcomings and limitations, its availability, and how well it fits the problem to be solved.

2.3 Data Preparation

The purpose of data preparation is to clean the data so that it can be used in the modeling process. Some of the selected data may have different formats because it comes from different data sources; such data must be


converted to the required format. In general, data cleaning means filtering, aggregating, and adding values as needed. By filtering the data, the selected data is checked for outliers and redundancy. An outlier is a value that differs greatly from most of the data or lies outside the range of the selected data group.
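As an illustration of the filtering step described above, the sketch below removes exact duplicates and drops outliers using the common 1.5 × IQR rule; the rule, the helper names, and the sample values are assumptions for illustration, not the procedure used in this research.

```python
def iqr_outlier_bounds(values):
    """Bounds beyond which a value is treated as an outlier (1.5 * IQR rule)."""
    xs = sorted(values)
    n = len(xs)
    q1, q3 = xs[n // 4], xs[(3 * n) // 4]   # crude quartiles, fine for a sketch
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

def clean(values):
    """Drop exact duplicates (keeping order), then drop outliers."""
    deduped = list(dict.fromkeys(values))
    lo, hi = iqr_outlier_bounds(deduped)
    return [v for v in deduped if lo <= v <= hi]

claims = [100, 120, 110, 110, 115, 9_000_000]   # one duplicate, one extreme value
print(clean(claims))                            # [100, 120, 110, 115]
```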

2.4. Modeling

At this stage, modeling is carried out using the specified algorithm.

2.4.1. Naïve Bayes

Equation (1) is Bayes’s theorem [18].

P(H|D) = P(H) P(D|H) / P(D) (1)

P(H) : the probability of the hypothesis before we see the data, called the prior probability, or just prior.

P(H|D) : the probability of the hypothesis after we see the data, called the posterior.

P(D|H) : the probability of the data under the hypothesis, called the likelihood.

P(D) : the total probability of the data, under any hypothesis.

2.4.2. K-NN

Equation (2) is the formula for the Euclidean distance, the distance between the two examples being compared in the K-NN method [19].

dist(p, q) = √((p1 − q1)² + (p2 − q2)² + … + (pn − qn)²) (2)

dist(p, q) : distance between examples p and q
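Equation (2) translates directly into code; a minimal sketch (the sample points are illustrative):

```python
import math

def euclidean_distance(p, q):
    """Distance between two numeric feature vectors (Equation 2)."""
    if len(p) != len(q):
        raise ValueError("vectors must have the same length")
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))

print(euclidean_distance([1.0, 2.0], [4.0, 6.0]))  # 5.0 (a 3-4-5 triangle)
```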

Equation (3) is a method of rescaling features for K-NN called min-max normalization. This process transforms a feature so that all of its values fall in the range between 0 and 1 [19].

X_new = (X − min(X)) / (max(X) − min(X)) (3)
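Equation (3) can be sketched as follows, assuming the feature arrives as a plain list of numbers:

```python
def min_max_normalize(values):
    """Rescale a feature to the [0, 1] range (Equation 3)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("feature has zero range; cannot normalize")
    return [(x - lo) / (hi - lo) for x in values]

ages = [25, 40, 55, 70]
print(min_max_normalize(ages))  # [0.0, 0.333..., 0.666..., 1.0]
```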

2.5. Evaluation

At this stage, an evaluation is carried out to measure the accuracy of the model using the Confusion Matrix and AUC, as mentioned by [20]. Prediction results and actual data are the basis for the Confusion Matrix calculation. According to [21], the confusion matrix is a tool for analyzing the predictions of a machine learning model and testing the accuracy of a classification: it is a table summarizing the true and false predictions produced by a classifier. Table 1 shows the confusion matrix, an N × N matrix used to evaluate the accuracy of a classification model, where N is the number of target classes.

Table 1. Confusion Matrix

Confusion matrix | Fact N | Fact Y
Predict N | Number of Fact N, Predict N | Number of Fact Y, Predict N
Predict Y | Number of Fact N, Predict Y | Number of Fact Y, Predict Y

Equation (4) is the formula for calculating the accuracy rate from the confusion matrix.

Accuracy = (TP + TN) / (TP + FN + FP + TN) (4)

TP : True positive
TN : True negative
FP : False positive
FN : False negative
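Equation (4) can be computed directly from the four counts; the example below uses the counts reported later for Naïve Bayes experiment 3 (Table 4):

```python
def accuracy(tp, tn, fp, fn):
    """Accuracy from confusion-matrix counts (Equation 4)."""
    return (tp + tn) / (tp + tn + fp + fn)

# Counts from Naïve Bayes experiment 3 (Table 4): TP=24, TN=2634, FP=26, FN=271.
print(round(accuracy(tp=24, tn=2634, fp=26, fn=271), 2))  # 0.9
```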

Receiver Operating Characteristic (ROC) describes the performance of the classifier in graphical form. Unlike the confusion matrix, the ROC graph is not sensitive to data with unequal class proportions; it depicts the performance of the classifier across all discrimination threshold values. The false positive rate plotted by ROC is calculated using Equation (5) [22].

F = FP / (TN + FP) (5)


FP : False positive
TN : True negative

AUC (Area Under Curve) is the area under the ROC (Receiver Operating Characteristic) curve. AUC is a value that represents the expected performance of the classifier.
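Equation (5) gives the x-coordinate (false positive rate) of one point on the ROC curve; the true positive rate supplies the y-coordinate. A minimal sketch, again using the experiment-3 counts from Table 4 for illustration:

```python
def false_positive_rate(fp, tn):
    """Equation (5): share of actual negatives wrongly predicted positive."""
    return fp / (tn + fp)

def true_positive_rate(tp, fn):
    """Share of actual positives correctly predicted positive."""
    return tp / (tp + fn)

# One (FPR, TPR) pair per discrimination threshold; AUC is the area
# under the curve these points trace out as the threshold varies.
print(false_positive_rate(fp=26, tn=2634))  # ~0.0098
print(true_positive_rate(tp=24, fn=271))    # ~0.0814
```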

3. RESULT AND DISCUSSION

3.1 Data Preparation

The main data set is claim data from 2019 to 2021, with 14,778 records. The data consists of the attributes claim submission date, policy effective date, sum assured, claim type, cause of claim, province, and fraud.

The claim submission date is the date on which the claim document is received by the insurance company. The policy effective date is the date on which the customer's insurance coverage begins. The sum insured is the amount of money the insurance company will pay if the customer dies. The claim type is the type of claim, such as death or illness. The cause of the claim is what the customer experienced that led to the claim, such as cancer or stroke. The province is the customer's province according to the customer's address. Fraud is an attribute stating whether the claim is fraudulent.

Some additional attributes are derived: the age of the insured when purchasing the policy, whether the policy has an additional rider, and whether the policy holder is the insured person. Some data needs to be converted into other forms. The age attribute is changed into a range: according to the Ministry of Health website [23], ages are divided into nine groups, namely toddlers (0-5 years), children (6-11), early teens (12-16), late teens (17-25), early adults (26-35), late adults (36-45), early elderly (46-55), late elderly (56-65), and seniors (above 65). The sum assured attribute is changed into two groups: less than one hundred million and greater than one hundred million. Before the data is used for modeling, it is cleaned: incomplete data and attributes not relevant to the research objectives are removed. Attributes other than age, province, sum assured, rider, and policy-holder-is-insured are not related to fraud, so they are deleted.
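The conversions described above can be sketched as follows; the group labels and function names are illustrative choices, but the age boundaries follow the nine Ministry of Health ranges and the threshold follows the two sum-assured groups:

```python
# Upper bound (inclusive) and label for each Ministry of Health age group;
# anything above the last bound is "senior".
AGE_GROUPS = [
    (5, "toddler"), (11, "child"), (16, "early teen"), (25, "late teen"),
    (35, "early adult"), (45, "late adult"), (55, "early elderly"),
    (65, "late elderly"),
]

def age_group(age):
    for upper, label in AGE_GROUPS:
        if age <= upper:
            return label
    return "senior"

def sum_assured_group(amount):
    return "< 100 million" if amount < 100_000_000 else ">= 100 million"

print(age_group(60))                   # late elderly
print(sum_assured_group(50_000_000))   # < 100 million
```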

3.2 Evaluation

3.2.1 Naïve Bayes Algorithm

The data is divided into two parts, training data and testing data, with a ratio of 80:20. The first experiment used all attributes. Table 2 shows the result: the AUC is 0.761 and the accuracy is 87%. The number of records with fact not-fraud and prediction not-fraud is 2,496; fact fraud, prediction not-fraud is 227; fact not-fraud, prediction fraud is 164; and fact fraud, prediction fraud is 68.

Table 2. Naïve Bayes Experiment 1

Confusion matrix | Fact N | Fact Y
Predict N | 2,496 | 227
Predict Y | 164 | 68
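A run like the one above can be sketched with a tiny categorical Naïve Bayes written from scratch (Laplace smoothing over the attribute values seen per class). The class, the toy records, and the attribute values below are illustrative, not the study's data; in practice the 80:20 split would be applied to the 14,778 records first.

```python
import math
from collections import Counter, defaultdict

class CategoricalNaiveBayes:
    """Minimal Naïve Bayes for categorical attributes with Laplace smoothing."""

    def fit(self, X, y):
        self.classes = sorted(set(y))
        self.priors = {c: math.log(y.count(c) / len(y)) for c in self.classes}
        self.counts = defaultdict(Counter)  # (feature index, class) -> value counts
        self.totals = Counter()
        for row, label in zip(X, y):
            for i, v in enumerate(row):
                self.counts[(i, label)][v] += 1
                self.totals[(i, label)] += 1
        return self

    def predict(self, row):
        best, best_score = None, -math.inf
        for c in self.classes:
            score = self.priors[c]
            for i, v in enumerate(row):
                seen = len(self.counts[(i, c)]) or 1
                score += math.log(
                    (self.counts[(i, c)][v] + 1) / (self.totals[(i, c)] + seen)
                )
            if score > best_score:
                best, best_score = c, score
        return best

# Toy records: (province, sum-assured band) -> fraud flag.
X = [("A", "low"), ("A", "low"), ("B", "high"), ("B", "high"), ("A", "high")]
y = ["Y", "Y", "N", "N", "N"]
model = CategoricalNaiveBayes().fit(X, y)
print(model.predict(("A", "low")))   # Y
print(model.predict(("B", "high")))  # N
```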

The second experiment used the sum assured, province, rider, and age attributes. Table 3 shows the result: the AUC is 0.758 and the accuracy is 88%. The number of records with fact not-fraud and prediction not-fraud is 2,551; fact fraud, prediction not-fraud is 242; fact not-fraud, prediction fraud is 109; and fact fraud, prediction fraud is 53.

Table 3. Naïve Bayes Experiment 2

Confusion matrix | Fact N | Fact Y
Predict N | 2,551 | 242
Predict Y | 109 | 53

The third experiment used the sum assured, province, rider, and policy-holder-is-insured attributes. Table 4 shows the result: the AUC is 0.761 and the accuracy is 90%. The number of records with fact not-fraud and prediction not-fraud is 2,634; fact fraud, prediction not-fraud is 271; fact not-fraud, prediction fraud is 26; and fact fraud, prediction fraud is 24.

Table 4. Naïve Bayes Experiment 3

Confusion matrix | Fact N | Fact Y
Predict N | 2,634 | 271
Predict Y | 26 | 24


The fourth experiment used the sum assured, province, rider, and age attributes. Table 5 shows the result: the AUC is 0.701 and the accuracy is 90%. The number of records with fact not-fraud and prediction not-fraud is 2,660; fact fraud, prediction not-fraud is 294; fact not-fraud, prediction fraud is 0; and fact fraud, prediction fraud is 1.

Table 5. Naïve Bayes Experiment 4

Confusion matrix | Fact N | Fact Y
Predict N | 2,660 | 294
Predict Y | 0 | 1

The fifth experiment used the province, rider, age, and policy-holder-is-insured attributes. Table 6 shows the result: the AUC is 0.754 and the accuracy is 87%. The number of records with fact not-fraud and prediction not-fraud is 2,502; fact fraud, prediction not-fraud is 234; fact not-fraud, prediction fraud is 158; and fact fraud, prediction fraud is 61.

Table 6. Naïve Bayes Experiment 5

Confusion matrix | Fact N | Fact Y
Predict N | 2,502 | 234
Predict Y | 158 | 61

Table 7 summarizes the five experiments above. It shows that the third experiment is the best, because its accuracy and AUC are the highest.

Table 7. Summary of Naïve Bayes Experiments

Experiment | Accuracy | AUC
1 | 87 | 0.761
2 | 88 | 0.758
3 | 90 | 0.761
4 | 90 | 0.701
5 | 87 | 0.754

3.2.2 K-NN Algorithm

The first experiment used K = 5. Table 8 shows the result: the AUC is 0.424 and the accuracy is 91%. The number of records with fact not-fraud and prediction not-fraud is 26; fact fraud, prediction not-fraud is 87; fact not-fraud, prediction fraud is 171; and fact fraud, prediction fraud is 2,671.

Table 8. K-NN Experiment 1

Confusion matrix | Fact N | Fact Y
Predict N | 26 | 87
Predict Y | 171 | 2,671

The second experiment used K = 10. Table 9 shows the result: the AUC is 0.653 and the accuracy is 92%. The number of records with fact not-fraud and prediction not-fraud is 24; fact fraud, prediction not-fraud is 65; fact not-fraud, prediction fraud is 173; and fact fraud, prediction fraud is 2,693.

Table 9. K-NN Experiment 2

Confusion matrix | Fact N | Fact Y
Predict N | 24 | 65
Predict Y | 173 | 2,693

The third experiment used K = 15. Table 10 shows the result: the AUC is 0.683 and the accuracy is 92%. The number of records with fact not-fraud and prediction not-fraud is 24; fact fraud, prediction not-fraud is 65; fact not-fraud, prediction fraud is 173; and fact fraud, prediction fraud is 2,693.

Table 10. K-NN Experiment 3

Confusion matrix | Fact N | Fact Y
Predict N | 24 | 65
Predict Y | 173 | 2,693


The fourth experiment used K = 20. Table 11 shows the result: the AUC is 0.698 and the accuracy is 93%. The number of records with fact not-fraud and prediction not-fraud is 2; fact fraud, prediction not-fraud is 6; fact not-fraud, prediction fraud is 195; and fact fraud, prediction fraud is 2,752.

Table 11. K-NN Experiment 4

Confusion matrix | Fact N | Fact Y
Predict N | 2 | 6
Predict Y | 195 | 2,752

The fifth experiment used K = 25. Table 12 shows the result: the AUC is 0.709 and the accuracy is 93%. The number of records with fact not-fraud and prediction not-fraud is 2; fact fraud, prediction not-fraud is 6; fact not-fraud, prediction fraud is 195; and fact fraud, prediction fraud is 2,752.

Table 12. K-NN Experiment 5

Confusion matrix | Fact N | Fact Y
Predict N | 2 | 6
Predict Y | 195 | 2,752

Table 13 summarizes the five K-NN experiments above. It shows that the fifth experiment is the best, because its accuracy and AUC are the highest.

Table 13. Summary of K-NN Experiments

Experiment | K | Accuracy | AUC
1 | 5 | 91 | 0.424
2 | 10 | 92 | 0.653
3 | 15 | 92 | 0.683
4 | 20 | 93 | 0.698
5 | 25 | 93 | 0.709
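The sensitivity to K visible in Table 13 can be illustrated with a minimal pure-Python K-NN; the points, labels, and query below are synthetic (in the study the features would first be normalized with Equation (3)):

```python
from collections import Counter

def knn_predict(train, labels, query, k):
    """Majority vote among the k nearest neighbors. Squared Euclidean
    distance is used; the ranking is the same as with Equation (2)."""
    ranked = sorted(
        (sum((a - b) ** 2 for a, b in zip(row, query)), lbl)
        for row, lbl in zip(train, labels)
    )
    votes = Counter(lbl for _, lbl in ranked[:k])
    return votes.most_common(1)[0][0]

train = [(0.1, 0.2), (0.2, 0.1), (0.9, 0.8), (0.8, 0.9), (0.85, 0.85)]
labels = ["N", "N", "Y", "Y", "Y"]
for k in (1, 3, 5):
    print(k, knn_predict(train, labels, (0.15, 0.15), k))  # the vote flips at k=5
```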

4. CONCLUSION

The best accuracy and AUC for Naïve Bayes are obtained using the attributes sum assured, province, rider, and policy-holder-is-insured, with an accuracy of 90% and an AUC of 0.761. The best accuracy and AUC for K-NN are obtained with K = 25, with an accuracy of 93% and an AUC of 0.709. Comparing the best experiment from each method, the AUC of Naïve Bayes is higher than that of K-NN, while its accuracy is slightly lower. Based on this result, it is concluded that Naïve Bayes is the better method for predicting potential insurance fraud early, when a customer applies for policy issuance.

REFERENCES

[1] Law of the Republic of Indonesia No. 40 of 2014 (Undang-Undang Republik Indonesia No. 40 Tahun 2014), p. 634, 2014.
[2] N. S. Devi and M. Jeyanthi, "Comparative Analysis Of Classification Algorithm Using Machine Learning Technique," Feb. 2019.
[3] J. Sun, W. Du, and N. Shi, "A Survey of kNN Algorithm," Inf. Eng. Appl. Comput., vol. 1, no. 1, pp. 1–10, 2018, doi: 10.18063/ieac.v1i1.770.
[4] P. Pandey, A. Saroliya, and R. Kumar, "Analyses and detection of health insurance fraud using data mining and predictive modeling techniques," Adv. Intell. Syst. Comput., vol. 584, pp. 41–49, 2018, doi: 10.1007/978-981-10-5699-4_5.
[5] S. Subudhi and S. Panigrahi, "Detection of Automobile Insurance Fraud Using Feature Selection and Data Mining Techniques," Int. J. Rough Sets Data Anal., vol. 5, no. 3, pp. 1–20, 2018, doi: 10.4018/ijrsda.2018070101.
[6] A. Verma, A. Taneja, and A. Arora, "Fraud detection and frequent pattern matching in insurance claims using data mining techniques," in 2017 10th Int. Conf. Contemp. Comput. (IC3), pp. 1–7, 2018, doi: 10.1109/IC3.2017.8284299.
[7] A. Ghorbani and S. Farzai, "Fraud Detection in Automobile Insurance using a Data Mining Based Approach," Int. J. Mechatronics, Electr. Comput. Technol., vol. 8, no. 27, pp. 3764–3771, 2018. [Online]. Available: www.aeuso.org.
[8] N. Ghuse, P. Pawar, and A. Potgantwar, "An Improved Approach For Fraud Detection In Health Insurance Using Data Mining Techniques," no. 5, pp. 27–32, 2017. [Online]. Available: www.ijsrnsc.org.
[9] L. Rukhsar, W. Haider Bangyal, K. Nisar, and S. Nisar, "Prediction of Insurance Fraud Detection using Machine Learning Algorithms," Mehran Univ. Res. J. Eng. Technol., vol. 41, no. 1, pp. 33–40, 2022, doi: 10.22581/muet1982.2201.04.
[10] I. Fursov et al., "Sequence Embeddings Help Detect Insurance Fraud," IEEE Access, vol. 10, pp. 32060–32074, 2022, doi: 10.1109/ACCESS.2022.3149480.
[11] H. Abbassi, I. El Alaoui, and Y. Gahi, "Fraud Detection Techniques in the Big Data Era," pp. 161–170, Jun. 2022, doi: 10.5220/0010730300003101.
[12] M. Hanafy and R. Ming, "Using machine learning models to compare various resampling methods in predicting insurance fraud," J. Theor. Appl. Inf. Technol., vol. 99, no. 12, pp. 2819–2833, 2021.
[13] K. Kapadiya and U. Patel, "Blockchain and AI-Empowered Healthcare Insurance Fraud Detection: An Analysis, Architecture, and Future Prospects," IEEE Access, vol. 10, pp. 79606–79627, 2022, doi: 10.1109/ACCESS.2022.3194569.
[14] C. Yan, M. Li, W. Liu, and M. Qi, "Improved adaptive genetic algorithm for the vehicle Insurance Fraud Identification Model based on a BP Neural Network," Theor. Comput. Sci., vol. 817, pp. 12–23, 2020, doi: 10.1016/j.tcs.2019.06.025.
[15] R. Karim and S. Alla, Scala and Spark for Big Data Analytics, 2017.
[16] ITEBA, "Ini Dia Perbedaan Metode Penelitian Kualitatif, Kuantitatif, dan Penelitian Gabungan," https://iteba.ac.id/blog/perbedaan-metode-penelitian-kualitatif-kuantitatif-gabungan/, 2021.
[17] D. L. Olson, Data Mining Models, Second Edition, 2018.
[18] A. B. Downey, Think Bayes, 2nd Edition, 2021.
[19] B. Lantz, Machine Learning with R, Third Edition, 2019.
[20] R. Bhowmik, "Detecting Auto Insurance Fraud by Data Mining Techniques," vol. 2, no. 4, pp. 156–162, 2011.
[21] Z. Karimi, "Confusion Matrix," Encycl. Mach. Learn. Data Min., p. 260, 2021, doi: 10.1007/978-1-4899-7687-1_50.
[22] G. Hackeling, Mastering Machine Learning with scikit-learn, Second Edition, 2017.
[23] M. Al Amin and D. Juniati, "Klasifikasi Kelompok Umur Manusia Berdasarkan Analisis Dimensi Fraktal Box Counting Dari Citra Wajah Dengan Deteksi Tepi Canny," J. Ilm. Mat., vol. 2, no. 6, pp. 1–10, 2017.
