Cluster Analysis using K-Means and K-Medoids Methods for Data Clustering of Amil Zakat Institutions Donor
Hotmaida Lestari Siregar1, Muhammad Zarlis2,*, Syahril Efendi1
1Faculty of Computer Science and Information Technology, Master of Informatics Engineering Study Program, University of North Sumatra, Medan, Indonesia
2Information Systems Management Department, BINUS Graduate Program-Master of Information Systems Management, Bina Nusantara University, Jakarta, Indonesia
Email: 1[email protected], 2,*[email protected], 3[email protected]
Corresponding Author Email: [email protected]
Abstract−Cluster analysis is a multivariate analysis method whose purpose is to assign objects to groups based on shared characteristics. In cluster analysis, determining the number of initial clusters is very important so that the resulting clusters are optimal. In this study, the most optimal number of clusters for classifying the data is analyzed using the K-Means and K-Medoids methods. The data were analyzed using the RFM model, and a comparative analysis was carried out based on the Davies-Bouldin Index (DBI) value and the cluster compactness assessed by the average silhouette score. The K-Means method produces its smallest DBI value of 0.485 and its highest average silhouette score of 0.781 at k=6, while the K-Medoids method produces its smallest DBI value of 1.096 and its highest average silhouette score of 0.517 at k=3. The results show that the best method for clustering the donor data of the Amil Zakat Institution is K-Means with an optimal number of 6 clusters.
Keywords: K-Means; K-Medoids; RFM Model; DBI; Average Silhouette Score
1. INTRODUCTION
Clustering is a grouping process whose goal is to group objects so that objects in different groups have different properties, while objects within the same group are homogeneous [1]. Clustering has two families of methods: hierarchical clustering, and non-hierarchical clustering, which starts by determining the number of clusters and then carries out the grouping without a hierarchical process; the latter is commonly represented by the K-Means clustering method [2].
The K-Means algorithm is one of the simplest and most efficient algorithms, first proposed by MacQueen in 1967 [3]; it is also computationally fast and performs well on large data sets with categorical and numeric elements [4][5]. K-Means is an algorithm used to separate data into different groups while minimizing the distance between objects within each cluster [6]. In the K-Means method, k initial cluster centers are chosen randomly to represent the clusters [7].
Cluster analysis, commonly called "clustering", aims to group objects based on the similarity of their characteristics [8][9]; it is a form of unsupervised learning applied to data sets [10]. When grouping objects into clusters, the most optimal clustering has a high level of similarity between objects within one cluster and a high level of dissimilarity to objects in other clusters [2]. In cluster analysis, determining the initial number of clusters is very important so that the clustering process produces the optimal number of groups; an analysis of the optimal number k is therefore needed. The clusters that are formed depend on the initial center initialization [11]. Cluster iteration aims to refine the clusters through an iterative process [12]. Another important parameter in clustering is the optimal number of clusters [13].
One of the most famous clustering algorithms is K-Medoids [14], also a simple yet effective algorithm. The K-Medoids algorithm uses the actual point in the cluster as the center of the medoid of a cluster which is located in the center of the cluster and at the smallest distance to other point objects [15]. The K-Medoids method is used to find medoids in a cluster, where the location of the medoids always changes during the iteration process [16].
The basic concept of the K-Medoids algorithm is to determine k clusters which first find the most similar medoids in the clusters [17].
In this cluster analysis research, determining the initial number of clusters is very important because it affects the final result of the clustering process. An ill-chosen number of clusters risks producing clusters that are not optimal, which degrades the information carried by the resulting clusters. Thus, the solution offered in this study to overcome errors in determining the number of clusters is to analyze the most optimal k for the two clustering methods by varying the k value from 1 to 6 and comparing the results based on the DBI value and the average silhouette score. The optimal k value obtained then becomes the reference for clustering the donor data of the related amil zakat institution [18]. The methods used are K-Means and K-Medoids together with the RFM (Recency, Frequency, Monetary) model. The RFM model is used because it is easy to apply in decision making [19]. Recency is the date of the last transaction, frequency is the number of transactions made, and monetary is the total value of those transactions [20]. The RFM model is widely used to profile customer data based on purchasing behavior [21].
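As a concrete sketch of the RFM transformation (a hypothetical transaction log in pandas; the column names `donor_id`, `date`, and `amount` are assumptions, not the institution's actual schema), the three variables can be derived as follows:

```python
import pandas as pd

# Hypothetical 2021 transaction log (toy values, assumed column names)
tx = pd.DataFrame({
    "donor_id": [1, 1, 2, 2, 2],
    "date": pd.to_datetime(["2021-03-01", "2021-11-20",
                            "2021-01-05", "2021-06-10", "2021-12-28"]),
    "amount": [50_000, 75_000, 20_000, 30_000, 100_000],
})

snapshot = pd.Timestamp("2021-12-31")  # reference date for recency

rfm = tx.groupby("donor_id").agg(
    recency=("date", lambda d: (snapshot - d.max()).days),  # days since last gift
    frequency=("date", "count"),                            # number of gifts in 2021
    monetary=("amount", "sum"),                             # total amount donated
)
```

One row per donor then holds the R, F, and M values that the rest of the pipeline normalizes and clusters.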
Cluster analysis has been carried out in several earlier studies. Research conducted by [22] yielded 3 clusters; the results were tested using a system that showed the same results as manual calculations, and are expected to serve as input for the government to pay more attention to the number of facilities in poor condition. In research [23], a cluster evaluation of the K-Means method produced a much lower DBI value than K-Medoids; the total percentage of cluster suitability was 97%, so the web-based clustering application in that study was judged feasible and relevant for clustering truck vehicles by productivity level. In other previous work, a variant of K-Medoids was applied to heterogeneous data sets, which improved algorithm performance and formed good clusters across various data types [24]. The study [25] showed that the K-Means algorithm is very sensitive to different initial values because they can produce different clusters, so the researchers initialized the cluster centers; the accuracy was higher and the stability much better with initialized cluster centers than with the conventional K-Means algorithm.
In the study [18], cluster analysis was carried out using the K-Means, K-Medoids, and X-Means methods. The results showed that K-Medoids had a better validity value than X-Means and K-Means: the average DBI value of the K-Medoids method was 0.540778 with the best number of clusters being 5. It was concluded that K-Medoids can be used to segment customers and to develop appropriate marketing strategies for those retail stores.
Based on this background and the previous research described above, this study carries out cluster analysis with the K-Means and K-Medoids methods. The data used is the donor dataset of the amil zakat institution, obtained as a Microsoft Excel file from the institution's official records. The donor dataset is first converted into RFM (Recency, Frequency, Monetary) variables so that it becomes easier to analyze, and then transformed using z-score normalization. Finally, the Davies-Bouldin Index and the cluster cohesiveness based on the average silhouette score are compared and conclusions are drawn. The results of this research are expected to be useful for the related institution in implementing strategies to maintain donor loyalty.
2. RESEARCH METHODOLOGY
This section describes, in sequence, the stages of the research carried out to achieve the expected goals, as shown in Figure 1 below.
Figure 1. Research Flow

2.1 Data Used
The data used in this study were taken primarily from the NU Care LAZISNU Padangsidempuan-SUMUT zakat institution. Transaction data for 2021 with a total of 1,261 donors were used.
2.2 Attribute Selection
There are several attributes contained in the data of donors at the Padangsidempuan-North Sumatra amil zakat infaq shodaqoh nahdlatul ulama institution, such as box number attributes, restaurant name, address, contact, box
status, date and total. The researchers use the RFM variables because the raw attribute values are inconsistent. In this study, the transaction attributes that make up RFM are selected: recency, frequency, and monetary. Recency (R) is the month of the donor's last transaction, Frequency (F) is the number of transactions made by the donor during 2021, and Monetary (M) is the amount of money donated by the donor during 2021.
2.3 Pre-Process Data
Donor data was retrieved from the amil zakat institution as a Microsoft Excel file containing the attributes box number, restaurant name, address, contact, contact status, date, and total. After collection, redundant, unclear, and incomplete data were cleaned; this pre-processed data is the raw material for the next stage. The goal is to get the data into a format that is easier and more effective to work with. Data transformation is then carried out by computing the z-score, in which attribute values are normalized based on the mean and standard deviation of each attribute. The results of normalizing the donor data are shown in the following figure:
Figure 2. RFM attribute transformation results
Figure 2 above shows the result of transforming the data by computing the z-score of each attribute. This is done to facilitate the subsequent clustering process.
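The z-score transformation described here can be sketched in a few lines of NumPy (toy R, F, M values standing in for the real donor matrix; this uses the population standard deviation, which some tools replace with the sample one):

```python
import numpy as np

# Toy RFM matrix for four donors: columns are R (days), F (count), M (amount)
rfm = np.array([[30.0, 2.0, 125_000.0],
                [5.0, 12.0, 480_000.0],
                [200.0, 1.0, 20_000.0],
                [60.0, 6.0, 250_000.0]])

# z-score: subtract each column's mean, divide by its standard deviation
z = (rfm - rfm.mean(axis=0)) / rfm.std(axis=0)
```

After this step every attribute has mean 0 and standard deviation 1, so no single attribute (such as the large monetary amounts) dominates the Euclidean distances used later.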
2.4 K-Means Algorithm
The process of doing clustering using the K-Means formula is as follows:
1) Determine k as the number of clusters to be formed.
2) Determine the random value for the initial Cluster center (Centroid)
3) Calculate the distance of each input to the Centroid using the Euclidean distance formula until the closest distance is found for each Centroid data.
d(xi, yj) = √(Σ (xi − yj)²)  (1)
4) Classify the data based on proximity to the Centroid.
5) Update the Centroid value
6) Repeat steps 3 to 5 until the members of each cluster no longer change. When step 6 is complete, the cluster center values from the last iteration are used as parameters for classifying the data.
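The six steps above can be sketched in NumPy as follows (an illustration of the textbook algorithm, not the authors' actual program):

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # step 2: random initial centroids chosen from the data points
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    labels = None
    for _ in range(max_iter):
        # step 3: Euclidean distance of every point to every centroid
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = d.argmin(axis=1)                       # step 4: nearest centroid
        if labels is not None and np.array_equal(new_labels, labels):
            break                                           # step 6: members stable
        labels = new_labels
        for j in range(k):                                  # step 5: update centers
            if np.any(labels == j):
                centroids[j] = X[labels == j].mean(axis=0)
    return labels, centroids

# Two obvious groups; K-Means should separate them
pts = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
                [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]])
labels, centers = kmeans(pts, k=2)
```

The loop stops as soon as an assignment pass leaves every cluster membership unchanged, which is exactly the convergence condition of step 6.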
2.5 K-Medoids Algorithm
The steps for K-Medoids Clustering are as follows [26]:
1) Define k as the number of clusters.
2) Determine k cluster centers (medoids) randomly
3) Calculate the distance between the non-medoid objects and the medoid in each cluster, then place each non-medoid object with the nearest medoid. Next, calculate the total distance.
4) Next, randomly select a non-medoid object as the new medoid in each cluster.
5) Calculate the distance between the non-medoid and the new medoid in each cluster, then place the object in the nearest medoid. Then calculate the total distance.
6) Calculate the difference S between the total distances, where S = total distance with the new medoids − total distance with the old medoids:
S = new total cost − old total cost (2)
7) If S < 0, the candidate is accepted as the new medoid and the process returns to steps 4 to 6. If S > 0, the iteration stops and the current medoids are final.
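The swap-based procedure above can be sketched as follows (a simplified random-swap variant of the K-Medoids idea for illustration, not the authors' exact program):

```python
import numpy as np

def total_cost(X, medoid_idx):
    """Sum over all objects of the Euclidean distance to the nearest medoid."""
    d = np.linalg.norm(X[:, None, :] - X[medoid_idx][None, :, :], axis=2)
    return d.min(axis=1).sum()

def k_medoids(X, k, n_trials=200, seed=0):
    rng = np.random.default_rng(seed)
    medoids = list(rng.choice(len(X), size=k, replace=False))  # step 2
    cost = total_cost(X, medoids)                              # step 3
    for _ in range(n_trials):                                  # steps 4-7
        i = int(rng.integers(k))              # which medoid to try replacing
        cand = int(rng.integers(len(X)))      # random non-medoid candidate
        if cand in medoids:
            continue
        trial = medoids[:i] + [cand] + medoids[i + 1:]
        new_cost = total_cost(X, trial)       # step 5: recompute total distance
        if new_cost - cost < 0:               # steps 6-7: S < 0 -> accept the swap
            medoids, cost = trial, new_cost
    return medoids, cost

# Two tight groups; the algorithm should place one medoid in each
pts = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
                [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]])
medoids, cost = k_medoids(pts, k=2)
```

Unlike K-Means, the cluster centers here are always actual data objects, which is what makes the method robust to outliers.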
3. RESULT AND DISCUSSION
This chapter discusses each research result and describes the stages of the research process. The data used is donor data for the 2021 period, which forms the basis of this research. Cluster analysis is then carried out using the K-Means and K-Medoids algorithms based on the RFM variables. The comparative analysis is based on the DBI value and the cluster cohesiveness determined by the average silhouette score. In this discussion, a sample of 10 donor records with attributes based on the RFM variables is used as an example of manual calculation of the clustering process. The details of the dataset are shown in Table 1.
Table 1. Donor Dataset Details
No Donor Id R F M
1. 1 0,063 -1,422 -0,365
2. 2 0,063 -1,422 -0,103
3. 3 0,063 -1,422 1,514
4. 74 -0,615 0,185 -0,410
5. 75 -0,615 0,185 -0,035
6. 76 -0,615 0,185 -0,415
7. 123 -0,553 0,185 0,846
8. 124 -0,553 0,185 -0,205
9. 125 -0,553 0,185 -0,415
10. 642 -0,245 0,185 0,116
3.1 Application of the K-Means Clustering Algorithm
a. In the K-Means calculation, 2 clusters will be formed, and the initial cluster centers are chosen randomly. The values of each cluster center are shown as follows:
Table 2. Random Cluster Center Points
Cluster Center Donor id R F M
1 2 0,063 -1,422 -0,103
2 124 -0,553 0,185 -0,205
b. Calculate the distance to all data points using the Euclidean Distance formula. The calculations are as follows:
ED = √((x2 − x1)² + (y2 − y1)² + (z2 − z1)²) (3)
Iteration 1 Cluster 1
= √((0,063 − 0,063)2+ (−1,422 + 1,422)2+ (−0,365 + 0,103)²) = 0,262
= √((0,063 − 0,063)2+ (−1,422 + 1,422)2+ ( 1,514 + 0,103)²) = 1,617
= √((−0,615 − 0,063)2+ (0,185 + 1,422)2+ (−0,410 + 0,103)² ) = 1,7709
Continue the next calculation in the same way up to the 10th data point.
Cluster 2
= √((0,063 + 0,553)2+ (−1,422 − 0,185)2+ (−0,365 + 0,205)²) = 1,7284
= √((0,063 + 0,553)2+ (−1,422 − 0,185)2+ (1,514 + 0,205)²) = 2,4324
= √((−0,615 + 0,553)2+ (0,185 − 0,185)2+ (−0,410 + 0,205)²) = 0,2141
The results of the calculation of the eculidean distance will be shown in the following table. Continue the next calculation in the same way until the 10th data.
Table 3. Update Donor Data Cluster Point Distance
No Donor Id R F M 1st iteration
C1 C2
1 1 0,063 -1,422 -0,365 0,262 1,7284
2 2 0,063 -1,422 -0,103 - -
3 3 0,063 -1,422 1,514 1,617 2,4324
4 74 -0,615 0,185 -0,410 1,7709 0,2141
5 75 -0,615 0,185 -0,035 1,7454 0,1809
6 76 -0,615 0,185 -0,415 1,7718 0,2189
7 123 -0,553 0,185 0,846 1,9653 1,051
8 124 -0,553 0,185 -0,205 - -
9 125 -0,553 0,185 -0,415 1,7490 0,21
10 642 -0,245 0,185 0,116 1,6508 0,4448
c. Define Clusters of data groups that are close to the Centroid shown in the following table.
Table 4. Cluster Division
Cluster 1: 1, 2, 3
Cluster 2: 124, 74, 75, 76, 123, 125, 642
d. Calculate the mean of each attribute for the clusters formed.
x̄ = (x1 + x2 + x3) / nx (4)
Mean R = (0,063 + 0,063 + 0,063) / 3 = 0,063
Mean F = (−1,422 − 1,422 − 1,422) / 3 = −1,422
Mean M = (−0,365 − 0,103 + 1,514) / 3 = 0,3486
The results of calculating the mean x and y values based on the Centroid Cluster 1 point data will be shown in the following table.
Table 4. Mean Cluster Calculation Results-1
Cluster 1 | Mean x1 | Mean x2 | Mean x3
1, 2, 3 | 0,063 | -1,422 | 0,3486
ȳ = (y4 + y5 + y6 + y7 + y8 + y9 + y10) / ny (5)
Mean R = (−0,615 − 0,615 − 0,615 − 0,553 − 0,553 − 0,553 − 0,245) / 7 = −0,5356
Mean F = (0,185 + 0,185 + 0,185 + 0,185 + 0,185 + 0,185 + 0,185) / 7 = 0,185
Mean M = (−0,410 − 0,035 − 0,415 + 0,846 − 0,205 − 0,415 + 0,116) / 7 = −0,074
Table 5. Mean Cluster Calculation Results-2
Cluster 2 | Mean y1 | Mean y2 | Mean y3
124, 74, 75, 76, 123, 125, 642 | -0,535571 | 0,185 | -0,074
Table 6. Cluster Results for Each Data
Donor Id | Cluster Group
1 | Cluster 1
2 | Cluster 1
3 | Cluster 1
74 | Cluster 2
75 | Cluster 2
76 | Cluster 2
123 | Cluster 2
124 | Cluster 2
125 | Cluster 2
642 | Cluster 2
The cluster results for each data point in Table 6 above are based on the randomly chosen cluster center points. This is the result of the 1st iteration; subsequent iterations are carried out with the same process until the resulting values converge.
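The iteration-1 assignment above can be checked numerically (a sketch using the ten records of Table 1):

```python
import numpy as np

# The ten sample records of Table 1 (z-scored R, F, M)
X = np.array([
    [ 0.063, -1.422, -0.365],  # donor 1
    [ 0.063, -1.422, -0.103],  # donor 2   (initial centroid 1)
    [ 0.063, -1.422,  1.514],  # donor 3
    [-0.615,  0.185, -0.410],  # donor 74
    [-0.615,  0.185, -0.035],  # donor 75
    [-0.615,  0.185, -0.415],  # donor 76
    [-0.553,  0.185,  0.846],  # donor 123
    [-0.553,  0.185, -0.205],  # donor 124 (initial centroid 2)
    [-0.553,  0.185, -0.415],  # donor 125
    [-0.245,  0.185,  0.116],  # donor 642
])

centroids = X[[1, 7]]                 # donors 2 and 124, as in Table 2
d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
labels = d.argmin(axis=1)             # donors 1-3 -> cluster 1, the rest -> cluster 2

new_c1 = X[labels == 0].mean(axis=0)  # approx (0.063, -1.422, 0.3487)
new_c2 = X[labels == 1].mean(axis=0)  # note: the mean of M is -0.518/7, approx -0.074
```

Running this reproduces the cluster division of Table 4 and the updated centroids used in the next iteration.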
3.2 Application of the K-Medoids Clustering Algorithm

3.2.1 Initial iteration
Select the medoids, for example the 1st and 2nd data, then calculate the distance using the Euclidean Distance formula.
ED = √((x2 − x1)² + (y2 − y1)² + (z2 − z1)²) (6)
= √((0,063 − 0,063)2+ (−1,422 + 1,422)2+ (−0,365 + 0,365)²) = 0,000
= √((0,063 − 0,063)2+ (−1,422 + 1,422)2+ (−0,365 + 0,103)²) = 0,262
= √((0,063 − 0,063)2+ (−1,422 + 1,422)2+ (−0,365 + 1,514)²) = 1,879
Thus calculating the distance in the same way up to the 10th data, the results of which can be shown in the following table.
Table 7. Results of C1 Distance on K-Medoids
No C1 Object Data (Xi) D1
1 0,063 -1,422 -0,365 0,063 -1,422 -0,365 0,000
2 0,063 -1,422 -0,365 0,063 -1,422 -0,103 0,262
3 0,063 -1,422 -0,365 0,063 -1,422 1,514 1,879
4 0,063 -1,422 -0,365 -0,615 0,185 -0,410 2,033
5 0,063 -1,422 -0,365 -0,615 0,185 -0,035 2,059
6 0,063 -1,422 -0,365 -0,615 0,185 -0,415 2,033
7 0,063 -1,422 -0,365 -0,553 0,185 0,846 2,322
8 0,063 -1,422 -0,365 -0,553 0,185 -0,205 1,988
9 0,063 -1,422 -0,365 -0,553 0,185 -0,415 1,982
10 0,063 -1,422 -0,365 -0,245 0,185 0,116 1,891
The selected medoid for cluster 2 is the 2nd data; the Euclidean distance calculation is as follows:
= √((0,063 − 0,063)2+ (−1,422 + 1,422)2+ (−0,103 + 0,365)²) = 0,262
= √((0,063 − 0,063)2+ (−1,422 + 1,422)2+ (−0,103 + 0,103)²) = 0,000
= √((0,063 − 0,063)2+ (−1,422 + 1,422)2+ (−0,103 − 1,514)²) = 2,049
The calculation of the distances up to the 10th data point is shown in the following table.
Table 8. Results of C2 Distance on K-Medoids
No C2 Object Data (Xi) D2
1 0,063 -1,422 -0,103 0,063 -1,422 -0,365 0,262
2 0,063 -1,422 -0,103 0,063 -1,422 -0,103 0,000
3 0,063 -1,422 -0,103 0,063 -1,422 1,514 2,049
4 0,063 -1,422 -0,103 -0,615 0,185 -0,410 2,055
5 0,063 -1,422 -0,103 -0,615 0,185 -0,035 2,033
6 0,063 -1,422 -0,103 -0,615 0,185 -0,415 2,056
7 0,063 -1,422 -0,103 -0,553 0,185 0,846 2,197
8 0,063 -1,422 -0,103 -0,553 0,185 -0,205 1,984
9 0,063 -1,422 -0,103 -0,553 0,185 -0,415 2,006
10 0,063 -1,422 -0,103 -0,245 0,185 0,116 1,842
Next, calculate the cost and total cost of the distances for the two clusters, as shown in the following table.
Table 9. Total Cost of K-Medoids
No D1 D2
1 0,000 0,262
2 0,262 0,000
3 1,879 2,049
4 2,033 2,055
5 2,059 2,033
6 2,033 2,056
7 2,322 2,197
8 1,988 1,984
9 1,982 2,006
10 1,891 1,842
Cost 9,818 8,056
Total Cost 17,874
Because this is the initial iteration, there is not yet a previous total cost to compare against, so the process continues with the next iteration.
3.2.2 Second iteration
In this 2nd iteration, objects that are not medoids are selected as candidate medoids, for example C1 (3rd data) and C2 (7th data).
Furthermore, the distance calculation using Euclidean Distance will be described as follows.
= √((0,063 − 0,063)2+ (−1,422 + 1,422)2+ (1,514 + 0,365)²) = 1,879
= √((0,063 − 0,063)2+ (−1,422 + 1,422)2+ (1,514 + 0,103)²) = 1,617
= √((0,063 − 0,063)2+ (−1,422 + 1,422)2+ (1,514 − 1,514)²) = 0,000
Continue the next calculation in the same way until the 10th data. The distance calculation will be shown in the table below.
Table 10. Result of Distance C1 in the 2nd iteration
No C1(C2) Object Data (Xi) D1
1 0,063 -1,422 1,514 0,063 -1,422 -0,365 1,879
2 0,063 -1,422 1,514 0,063 -1,422 -0,103 1,617
3 0,063 -1,422 1,514 0,063 -1,422 1,514 0,000
4 0,063 -1,422 1,514 -0,615 0,185 -0,410 2,798
5 0,063 -1,422 1,514 -0,615 0,185 -0,035 2,555
6 0,063 -1,422 1,514 -0,615 0,185 -0,415 2,802
7 0,063 -1,422 1,514 -0,553 0,185 0,846 2,046
8 0,063 -1,422 1,514 -0,553 0,185 -0,205 2,623
9 0,063 -1,422 1,514 -0,553 0,185 -0,415 2,765
10 0,063 -1,422 1,514 -0,245 0,185 0,116 2,302
Calculate the distance for C2 using the following ED formula.
= √((−0,53 − 0,63)2+ (0,185 + 1,422)2+ (0,846 + 0,365)²) = 2,322
= √((−0,53 − 0,63)2+ (0,185 + 1,422)2+ (0,846 + 0,103)²) = 2,196
= √((−0,53 − 0,63)2+ (0,185 + 1,422)2+ (0,846 − 1,514)²) = 2,091
Continue the calculation in the same way; the results up to the 10th data point are shown in the table below.
Table 11. Distance C2 results in the 2nd iteration
No C2(O1) Object Data (Xi) D2
1 -0,53 0,185 0,846 0,063 -1,422 -0,365 2,322
2 -0,53 0,185 0,846 0,063 -1,422 -0,103 2,196
3 -0,53 0,185 0,846 0,063 -1,422 1,514 2,091
4 -0,53 0,185 0,846 -0,615 0,185 -0,410 1,258
5 -0,53 0,185 0,846 -0,615 0,185 -0,035 0,885
6 -0,53 0,185 0,846 -0,615 0,185 -0,415 1,263
7 -0,53 0,185 0,846 -0,553 0,185 0,846 0,000
8 -0,53 0,185 0,846 -0,553 0,185 -0,205 1,051
9 -0,53 0,185 0,846 -0,553 0,185 -0,415 1,261
10 -0,53 0,185 0,846 -0,245 0,185 0,116 0,783
Calculation of cost and total cost from the distance C1 and C2 will be shown in the following table.
Table 12. Total Cost in the 2nd iteration
No D1 D2
1 1,879 2,322
2 1,617 2,196
3 0,000 2,091
4 2,798 1,258
5 2,555 0,885
6 2,802 1,263
7 2,046 0,000
8 2,623 1,051
9 2,765 1,261
10 2,302 0,783
Cost 3,496 6,501
Total Cost 9,997
Furthermore, the difference in cost is calculated as follows.
cost difference = 9,997 – 17,874 = -7,877
Since the cost difference < 0, the new medoids are accepted and the next iteration proceeds in the same way until the cost difference > 0; the cluster centers of that final iteration are kept and the iteration stops.
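The total-distance bookkeeping of the two iterations can be reproduced with NumPy (a sketch; full-precision arithmetic gives slightly different totals than the hand-rounded tables, but the swap decision, a negative cost difference, is the same):

```python
import numpy as np

# The ten sample records of Table 1 (z-scored R, F, M), row i = i-th data
X = np.array([
    [ 0.063, -1.422, -0.365], [ 0.063, -1.422, -0.103],
    [ 0.063, -1.422,  1.514], [-0.615,  0.185, -0.410],
    [-0.615,  0.185, -0.035], [-0.615,  0.185, -0.415],
    [-0.553,  0.185,  0.846], [-0.553,  0.185, -0.205],
    [-0.553,  0.185, -0.415], [-0.245,  0.185,  0.116],
])

def total_cost(medoid_rows):
    """Total distance: each object to its nearest medoid (steps 3 and 5)."""
    d = np.linalg.norm(X[:, None, :] - X[medoid_rows][None, :, :], axis=2)
    return d.min(axis=1).sum()

c_old = total_cost([0, 1])  # initial medoids: the 1st and 2nd data
c_new = total_cost([2, 6])  # trial medoids: the 3rd and 7th data
swap_accepted = (c_new - c_old) < 0  # negative difference -> keep the new medoids
```

With these records the trial medoids reduce the total distance, so the swap is accepted, matching the decision reached by the manual calculation.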
3.3 Research Result
In this sub-chapter, each research result will be presented. The data used is donor data for the 2021 period. Then, using the K-Means and K-Medoids algorithms based on the RFM variable, cluster analysis will be carried out.
The comparative analysis is based on the DBI value and the cluster cohesiveness determined by the average silhouette score, computed with Python in Google Colaboratory.
3.3.1 K-Means Cluster Testing
Table 13. K-Means Cluster Performance Comparison
K DBI Average silhouette score
2 1,014 0,666
3 0,792 0,708
4 0,504 0,740
5 0,551 0,762
6 0,485 0,781
Based on Table 13 above, the lowest DBI value (positive and closest to 0) is 0.485 and the highest average silhouette score is 0.781. Thus, the most optimal number of clusters for the K-Means method on the donor data is k = 6.
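A sweep over k like the one behind Tables 13 and 14 can be sketched with scikit-learn (here a small synthetic dataset with three well-separated groups stands in for the 1,261-donor RFM matrix):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import davies_bouldin_score, silhouette_score

# Synthetic stand-in for the z-scored RFM matrix: three well-separated groups
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(50, 3))
               for c in ([0, 0, 0], [5, 5, 0], [0, 5, 5])])

scores = {}
for k in range(2, 7):  # k = 2 .. 6, the same sweep as in the paper
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = (davies_bouldin_score(X, labels), silhouette_score(X, labels))

best_by_dbi = min(scores, key=lambda k: scores[k][0])  # lowest DBI wins
best_by_sil = max(scores, key=lambda k: scores[k][1])  # highest silhouette wins
```

On this toy data both criteria agree on the true number of groups; on the real donor data the same loop produced the values reported in Tables 13 and 14.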
3.3.2 K-Medoids Cluster Testing
Table 14. K-Medoids Cluster Performance Comparison
K DBI Average Silhouette Score
2 1,925 0,350
3 1,096 0,517
4 1,400 0,301
5 1,250 0,237
6 1,272 0,151
Based on Table 14 above, the lowest DBI value (positive and closest to 0) is 1.096 and the highest average silhouette score is 0.517. Thus, the most optimal number of clusters for the K-Medoids method on the donor data is k = 3.
3.4 Optimal Cluster Results
From the results of data processing in this study, which was carried out based on the donor transaction dataset using K-Means and K-Medoids, the DBI value and Average Silhouette score value of each algorithm were obtained as shown in table 15 below.
Table 15. Comparison of K-Means and K-Medoids Final Results
k | K-Means Davies Bouldin Index | K-Means Average Silhouette Score | K-Medoids Davies Bouldin Index | K-Medoids Average Silhouette Score
3 0,792 0,708 1,096 0,517
6 0,485 0,781 1,272 0,151
Based on Table 15 above, the lowest DBI value for the K-Means method is 0.485 and its highest Average Silhouette Score is 0.781. For the K-Medoids method, the lowest DBI value is 1.096 and the highest Average Silhouette Score is 0.517. Comparing the two methods, the lowest DBI value overall is 0.485 and the highest Average Silhouette Score is 0.781, both from K-Means. The K-Means method therefore groups the data better than the K-Medoids method, meaning that in this study K-Means is the better choice for clustering the donor data of the amil zakat institution into 6 groups.
Figure 3. DBI value for k=6
Figure 4. Average silhouette score for k = 6

3.5 Determination of Donor Characteristics
After selecting the best algorithm, K-Means, the donor data of the amil zakat institution is grouped into 6 clusters. Donor characteristics are determined from the average RFM value in each cluster, so that clusters of potential and less potential donors are identified.
Table 16. Clustering Process Results
Cluster Number of Members R F M Characteristics
0 171 2,009 1,828 -0,058 Everyday Shopper
1 818 -0,396 1,828 0,123 Golden Customer
2 24 0,192 0,115 5,585 Superstar
3 94 -1,538 1,795 -0,061 Occasional Customer
4 82 1,419 -3,042 -0,185 Dormant Customer
5 72 0,063 -1,429 -0,027 Typical Customer
The table above shows the number of members of each cluster and the average of each RFM attribute, which is used to determine the characteristics of each resulting cluster: cluster 0 is the Everyday Shopper, cluster 1 the Golden Customer, cluster 2 the Superstar, cluster 3 the Occasional Customer, cluster 4 the Dormant Customer, and cluster 5 the Typical Customer.
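The characterization step, averaging each RFM attribute per cluster and labeling the extremes, can be sketched as follows (toy values and three clusters instead of six; the category names follow the paper):

```python
import numpy as np

# z-scored RFM rows and their K-Means cluster labels (toy data standing in
# for the 1,261 donors)
rfm = np.array([[0.2, 0.1, 5.6],    # big one-off gifts
                [0.1, 0.2, 5.5],
                [-0.4, 1.8, 0.1],   # frequent, recent donors
                [-0.5, 1.9, 0.2],
                [1.4, -3.0, -0.2],  # long-inactive donors
                [1.5, -3.1, -0.1]])
labels = np.array([0, 0, 1, 1, 2, 2])

# Mean R, F, M per cluster: the basis of the characteristics column in Table 16
profile = {c: rfm[labels == c].mean(axis=0) for c in np.unique(labels)}

# e.g. the cluster with the highest mean monetary value -> "Superstar"
superstar = max(profile, key=lambda c: profile[c][2])
```

Reading the per-cluster means side by side is what turns the numeric clusters into interpretable donor categories such as Superstar or Dormant Customer.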
4. CONCLUSION
The conclusion of this study is that, for the DBI test at k = 6, the values obtained from the K-Means and K-Medoids methods are 0.485 and 1.272, respectively; K-Means is superior to K-Medoids because its DBI value is smaller. Cluster compactness was assessed with the average silhouette score: 0.781 for the K-Means method versus 0.151 for the K-Medoids method, which again shows that K-Means is better because its average silhouette score is higher and closer to 1. The clusters formed are: cluster 0 with 171 donors in the Everyday Shopper category, cluster 1 with 818 donors in the Golden Customer category, cluster 2 with 24 donors in the Superstar category, cluster 3 with 94 donors in the Occasional Customer category, cluster 4 with 82 donors in the Dormant Customer category, and cluster 5 with 72 donors in the Typical Customer category.
ACKNOWLEDGMENT
Alhamdulillah, thanks be to Allah swt. Thanks to both parents who always give their blessing at every step of the researcher's journey. Thank you very much to the Supervisor of Thesis 1 and 2, as well as those who have supported the implementation of this research.
REFERENCES
[1] M. W. Talakua, Z. A. Leleury, and A. W. Taluta, “Analisis Cluster Dengan Menggunakan Metode K-Means Untuk Pengelompokkan Kabupaten/Kota Di Provinsi Maluku Berdasarkan Indikator Indeks Pembangunan Manusia Tahun
2014,” BAREKENG J. Ilmu Mat. dan Terap., vol. 11, no. 2, pp. 119–128, 2017, doi: 10.30598/barekengvol11iss2pp119- 128.
[2] A. Ali, “Klasterisasi Data Rekam Medis Pasien Menggunakan Metode K-Means Clustering di Rumah Sakit Anwar Medika Balong Bendo Sidoarjo,” MATRIK J. Manajemen, Tek. Inform. dan Rekayasa Komput., vol. 19, no. 1, pp. 186–
195, 2019, doi: 10.30812/matrik.v19i1.529.
[3] S. Zhang, C. Bi, and M. Zhang, “Logistics service supply chain order allocation mixed K-Means and Qos matching,” Procedia Comput. Sci., vol. 188, pp. 121–129, 2021, doi: 10.1016/j.procs.2021.05.060.
[4] W. Qadadeh and S. Abdallah, “Customers Segmentation in the Insurance Company (TIC) Dataset,” Procedia Comput.
Sci., vol. 144, pp. 277–290, 2018, doi: 10.1016/j.procs.2018.10.529.
[5] J. Karthik, V. Tamizhazhagan, and S. Narayana, “Data leak identification using scattering search K Means in social networks,” Mater. Today Proc., no. xxxx, 2021, doi: 10.1016/j.matpr.2021.01.200.
[6] A. K. Wardhani, “K-Means Algorithm Implementation for Clustering of Patients Disease in Kajen Clinic of Pekalongan,”
J. Transform., vol. 14, no. 1, p. 30, 2016, doi: 10.26623/transformatika.v14i1.387.
[7] G. Niu, Y. Ji, Z. Zhang, W. Wang, J. Chen, and P. Yu, “Clustering analysis of typical scenarios of island power supply system by using cohesive hierarchical clustering based K-Means clustering method,” Energy Reports, vol. 7, pp. 250–256, 2021, doi: 10.1016/j.egyr.2021.08.049.
[8] W. Johnson and R. Dean, “Clustering, Distance Methods, and Ordination,” Applied Multivariate Statistical Analysis. pp.
671–757, 2007.
[9] P. Govender and V. Sivakumar, Application of k-means and hierarchical clustering techniques for analysis of air pollution: A review (1980–2019), vol. 11, no. 1. Turkish National Committee for Air Pollution Research and Control, 2020. doi: 10.1016/j.apr.2019.09.009.
[10] C. Yuan and H. Yang, “Research on K-Value Selection Method of K-Means Clustering Algorithm,” J, vol. 2, no. 2, pp.
226–235, 2019, doi: 10.3390/j2020016.
[11] F. M. Nasution, Penerapan Metode K-Means Clustering Untuk Mengelompokkan Ketahanan Tanaman Pangan Kabupaten/Kota Diprovinsi Sumatera Utara. 2019.
[12] A. Naghizadeh and D. N. Metaxas, “Condensed silhouette: An optimized filtering process for cluster selection in K- means,” in Procedia Computer Science, 2020, vol. 176, pp. 205–214. doi: 10.1016/j.procs.2020.08.022.
[13] H. Xu, P. Croot, and C. Zhang, “Discovering hidden spatial patterns and their associations with controlling factors for potentially toxic elements in topsoil using hot spot analysis and K-means clustering analysis,” Environ. Int., vol. 151, no.
February, p. 106456, 2021, doi: 10.1016/j.envint.2021.106456.
[14] H. Song, J. G. Lee, and W. S. Han, “PAMAE: Parallel k-Medoids clustering with high accuracy and efficiency,” Proc.
ACM SIGKDD Int. Conf. Knowl. Discov. Data Min., vol. Part F1296, pp. 1087–1096, 2017, doi:
10.1145/3097983.3098098.
[15] N. Sureja, B. Chawda, and A. Vasant, “An improved K-medoids clustering approach based on the crow search algorithm,” J. Comput. Math. Data Sci., vol. 3, no. March, p. 100034, 2022, doi: 10.1016/j.jcmds.2022.100034.
[16] P. Arora, Deepali, and S. Varshney, “Analysis of K-Means and K-Medoids Algorithm for Big Data,” Phys. Procedia, vol. 78, no. December 2015, pp. 507–512, 2016, doi: 10.1016/j.procs.2016.02.095.
[17] B. Bernábe-Loranca, R. Gonzalez-Velázquez, E. Olivares-Benítez, J. Ruiz-Vanoye, and J. Martínez-Flores, “Extensions to K-medoids with balance restrictions over the cardinality of the partitions,” J. Appl. Res. Technol., vol. 12, no. 3, pp.
396–408, 2014, doi: 10.1016/S1665-6423(14)71621-9.
[18] S. I. Murpratiwi, I. G. Agung Indrawan, and A. Aranta, “Analisis Pemilihan Cluster Optimal Dalam Segmentasi Pelanggan Toko Retail,” J. Pendidik. Teknol. dan Kejuru., vol. 18, no. 2, p. 152, 2021, doi: 10.23887/jptk- undiksha.v18i2.37426.
[19] R. D. Astuti, “Analisis Perbandingan Algoritma K-Means Dan K-Medoids Untuk Menerapkan Segmentasi Pelanggan,”
2019.
[20] T. Hardiani, S. Sulistyo, and R. Hartanto, “Segmentasi Nasabah Tabungan Menggunakan Model RFM (Recency, Frequency,Monetary) dan K-Means Pada Lembaga Keuangan Mikro,” Semin. Nas. Teknol. Inf. dan Komun. Terap., no.
November, p. 2015, 2015.
[21] R. Heldt, C. S. Silveira, and F. B. Luce, “Predicting customer value per product: From RFM to RFM/P,” J. Bus. Res., vol. 127, no. March, pp. 444–453, 2021, doi: 10.1016/j.jbusres.2019.05.001.
[22] I. I. P. Damanik, S. Solikhun, I. S. Saragih, I. Parlina, D. Suhendro, and A. Wanto, “Algoritma K-Medoids untuk Mengelompokkan Desa yang Memiliki Fasilitas Sekolah di Indonesia,” Pros. Semin. Nas. Ris. Inf. Sci., vol. 1, no.
September, p. 520, 2019, doi: 10.30645/senaris.v1i0.58.
[23] A. Supriyadi, A. Triayudi, and I. D. Sholihati, “Perbandingan Algoritma K-Means Dengan K-Medoids Pada Pengelompokan Armada Kendaraan Truk Berdasarkan Produktivitas,” JIPI (Jurnal Ilm. Penelit. dan Pembelajaran Inform., vol. 6, no. 2, pp. 229–240, 2021, doi: 10.29100/jipi.v6i2.2008.
[24] S. Harikumar and P. V. Surya, “K-Medoid Clustering for Heterogeneous DataSets,” Procedia Comput. Sci., vol. 70, pp.
226–237, 2015, doi: 10.1016/j.procs.2015.10.077.
[25] Z. Min and D. Kai-Fei, “Improved Research to K-means Initial Cluster Centers,” Proc. - 2015 9th Int. Conf. Front.
Comput. Sci. Technol. FCST 2015, pp. 349–353, 2015, doi: 10.1109/FCST.2015.61.
[26] M. A. Nahdliyah, T. Widiharih, and A. Prahutama, “METODE k-MEDOIDS CLUSTERING DENGAN VALIDASI SILHOUETTE INDEX DAN C-INDEX (Studi Kasus Jumlah Kriminalitas Kabupaten/Kota di Jawa Tengah Tahun 2018),” J. Gaussian, vol. 8, no. 2, pp. 161–170, 2019, doi: 10.14710/j.gauss.v8i2.26640.