X Spotify Cares Clustering Analysis using K-Means and K-Medoids


Citra Pangestu^*, Shaufiah, Rifki Wijaya

School of Computing, Telkom University, Bandung, Indonesia

Email: ^1,*[email protected], ²[email protected],

Correspondence Author Email: [email protected]

Abstract−The rise of social media platforms, particularly Twitter, has transformed how individuals express opinions and concerns. Companies such as Spotify use Twitter for customer support and feedback gathering. This research examines Spotify Cares tweets using the K-Means and K-Medoids clustering methods, aiming to enhance customer support analysis. The study employs the silhouette coefficient and the Davies-Bouldin Index (DBI) to evaluate clustering quality. Using an extensive dataset of more than 3 million Twitter customer service interactions, including 29,479 records specific to Spotify Cares, this investigation uncovers latent patterns and themes. K-Means and K-Medoids have proven effective in a wide range of applications and were therefore implemented in this research. The results show that K-Means with 10 clusters (K = 10) and a DBI value of 1.76 exhibits moderate dispersion, indicating room for improvement in segmentation precision. In contrast, K-Medoids with 2 clusters (K = 2) and a lower DBI of 1.48 presents a clearer and more compact clustering structure. This implies simplified customer categories, which is beneficial for targeted support. In conclusion, although both methods have strengths and weaknesses, K-Medoids with two clusters emerges as a promising method for Spotify Cares, offering cohesive customer groupings for efficient intervention. Future research could focus on refining parameters and exploring the relationships between response time, sentiment analysis, and customer satisfaction to achieve a more nuanced analysis.

Keywords: Customer Support; K-Means Clustering; K-Medoids Clustering; Spotify Cares; Twitter (X)

1. INTRODUCTION

The advent of social media platforms has revolutionized the way individuals express their opinions, preferences, and concerns about various products and services [1]. One such platform, Twitter, has become a valuable source for understanding user sentiments and feedback [2]. Spotify, a leading music streaming service, utilizes its dedicated support handle, Spotify Cares, on Twitter to engage with users, resolve concerns, and gather valuable insights [3]. In the context of customer support, companies often employ social media channels to address customer queries and issues promptly. However, as time passes, the volume of customer inquiries on Spotify Cares continues to increase every minute, becoming a challenge for the company [4].

In the era of big data, analyzing and extracting meaningful patterns from vast amounts of user-generated content is crucial for enhancing customer experience and service efficiency [5]. Clustering algorithms have proven instrumental in organizing unstructured data, such as tweets, into distinct groups based on shared characteristics. Two widely used clustering methods, K-Means and K-Medoids, offer different approaches to partitioning data points into clusters, providing researchers with versatile tools for exploration and analysis [6]. While the K-Means clustering algorithm stands out as a powerful and frequently applied tool in academic research for data mining, it is not without its limitations. One notable challenge arises from the random initialization of centroids, which can lead to unpredictable convergence [7]. In contrast, the K-Medoids method, another classic clustering technique, aims to address some of these challenges by partitioning items into K clusters built around representative data points (medoids). Both methods contribute to the broader field of clustering, each with its strengths and weaknesses, prompting researchers to carefully consider their suitability based on specific data characteristics and analysis goals [8].

This research aims to conduct a comparative analysis of the effectiveness of the K-Means and K-Medoids clustering methods in the context of Spotify Cares tweets on Twitter. The primary focus is on determining the optimal number of clusters (K) for each method using the silhouette coefficient, a measure of clustering quality. Additionally, the Davies-Bouldin Index (DBI) will be employed as a metric for evaluating the performance of the clustering algorithms.

Previously, both K-Means and K-Medoids have demonstrated their effectiveness in clustering fresh milk production data, with Davies-Bouldin Index (DBI) values approaching 0 [9]. These methods have also been applied to the classification of student theses, consistently yielding excellent DBI values [5]. The robust performance of K-Means and K-Medoids across different applications suggests their versatility and reliability in clustering tasks and their potential for broader applicability in various domains.

By delving into the world of Spotify Cares tweets, this study seeks to unveil hidden patterns, common themes, and potential improvements in customer support interactions. The insights gained from this research can contribute to refining the strategies employed by Spotify and other companies in managing their online customer support presence, ultimately leading to enhanced user satisfaction and a more streamlined customer service experience.


The significance of analyzing Spotify Cares tweets lies in its reflection of the pivotal role of social media, particularly Twitter, in the customer support ecosystem and overall brand management [25]. These tweets not only express user opinions and sentiments but also provide constructive feedback about Spotify's services.

Understanding these interactions is vital for improving customer experience and service efficiency, making the analysis of Spotify Cares tweets a valuable endeavor. The forthcoming comparative analysis of K-Means and K-Medoids methods aims to shed light on the optimal clustering approach for these tweets, ultimately contributing to a deeper understanding of customer support dynamics in the digital age.

2. RESEARCH METHODOLOGY

2.1 Research Stages

After data collection and preprocessing, TF-IDF was used to weight each word in a document, after which the best number of clusters was identified using the silhouette approach, K-Means and K-Medoids clustering were put into practice, and the clustering was assessed using DBI. Figures 1 and 2, which present the system in block diagram style, describe the overall workflow.

Figure 1. Block diagram for data collection and preprocessing

Figure 2. Block diagram for implementation method and evaluation phase

2.2 Data Collection

Customer service information from Twitter was used in this study. The Twitter customer service dataset is large: it includes over 3 million tweets and replies from well-known companies. Spotify Cares is one of the top 20 well-known brands included in this dataset. The data cover various aspects, such as user questions, complaints, responses from companies, and user feedback. Out of the 3 million available records, the Spotify Cares data were chosen because, as of July 2019, Spotify had 232 million monthly active users, making it one of the most popular music platforms of this era [10][11]. The Spotify Cares portion of the Twitter customer service data contains around 29,479 tweet reply records. These data are then subjected to preprocessing.

2.3 Preprocessing

Data preparation is essential after gathering the required data. Preprocessing is the process of cleaning, organizing, and preparing data so that it can be used in modeling or analysis [12]. The goal of this procedure is to enhance data quality for best use in the next phases. Careful preprocessing improves data analysis and model performance while reducing the possibility of bias or distortion caused by incomplete or inconsistent data. A machine learning model's performance depends largely on how well the input is preprocessed, as this affects the quality of the output [13][14]. Text preprocessing, which includes tokenization, lowercasing, stop word removal, stemming, and lemmatization, is the first step in Natural Language Processing (NLP) [15]. These methods are frequently used to lower dimensionality and improve NLP models' overall performance [16].

2.4 TF-IDF

In this instance, text is converted into numerical values using Term Frequency-Inverse Document Frequency (TF-IDF) [24] so that K-Means clustering can process it. The TF-IDF equation is as follows:

$$W_{dt} = tf_{dt} \times idf_t \qquad (1)$$

Here, $W_{dt}$ is the weight of word t in document d, $tf_{dt}$ is the number of times word t occurs in document d, and $idf_t$ is the Inverse Document Frequency, computed as $\log(N / df_t)$, where N is the total number of documents and $df_t$ is the number of documents containing the searched word.
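As a minimal sketch of equation (1), the weights can be computed with the Python standard library alone. The function name `tf_idf` and the three sample sentences are illustrative assumptions, not the code or data used in this study. Library implementations commonly add smoothing and normalization, so their values may differ from this plain formula.

```python
import math
from collections import Counter

def tf_idf(documents):
    """Return one {term: weight} dict per document, with W_dt = tf_dt * log(N / df_t)."""
    n_docs = len(documents)
    tokenized = [doc.split() for doc in documents]
    # df_t: number of documents that contain term t at least once
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        tf = Counter(tokens)  # tf_dt: raw count of term t in document d
        weights.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weights

# Hypothetical, already preprocessed sample documents
docs = ["hey let us know", "dm us your email", "hey send us a dm"]
weights = tf_idf(docs)
```

A word that occurs in every document, such as "us" here, receives a weight of zero, while a word confined to a single document receives the largest weight.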

2.5 Silhouette Method

This method evaluates how similar an entity is to other entities within its assigned cluster—a process known as cohesion—compared to other clusters—a process known as separation. The evaluation's metric is the Silhouette value, which is in the [-1, 1] range. A score that is close to 1 indicates a strong correlation with the entities in its cluster, while a value that is close to -1 indicates the opposite [17].

Figure 3. Step for Implementation Silhouette

Based on Figure 3, the first step is to choose a number for K that falls between 1 and N clusters. Step 2 involves computing the Silhouette value for every K. In doing so, the cluster of the ith data point, C(i), is identified; a(i), the average dissimilarity between the ith point and the other members of its own cluster, is computed; and b(i), the average dissimilarity to the closest cluster that is not its own, is found.

The difference between b(i) and a(i), normalized by the greater of the two, yields the Silhouette Coefficient S(i), which is calculated in Step 2. Step 3 involves plotting a curve of silhouette values against the number of clusters, K. Finally, in Step 4, the peak of the curve indicates the ideal number of clusters. The silhouette formula is given in (2):

$$S(i) = \frac{b(i) - a(i)}{\max(a(i), b(i))} \qquad (2)$$

Here, a(i) denotes the average distance between a sample and every other point in its own cluster, and b(i) denotes the average distance between the sample and the points of the nearest neighboring cluster.
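The silhouette computation can be illustrated with a short sketch. The function names and the four sample points are assumptions made for illustration, and Euclidean distance is assumed as the dissimilarity. The sketch expects at least two clusters and points that are not all identical.

```python
import math

def silhouette(points, labels, i):
    """Silhouette coefficient S(i) of equation (2) for the sample at index i."""
    own = labels[i]

    def mean_dist(cluster):
        # Average distance from sample i to the members of `cluster`, excluding i itself
        members = [p for j, p in enumerate(points) if labels[j] == cluster and j != i]
        return sum(math.dist(points[i], p) for p in members) / len(members)

    if labels.count(own) == 1:
        return 0.0  # common convention for a cluster with a single member
    a = mean_dist(own)                                      # cohesion
    b = min(mean_dist(c) for c in set(labels) if c != own)  # separation
    return (b - a) / max(a, b)

def mean_silhouette(points, labels):
    """Average S(i) over all samples; the K with the highest average is preferred."""
    return sum(silhouette(points, labels, i) for i in range(len(points))) / len(points)

# Hypothetical 2-D points forming two tight groups
points = [(0, 0), (0, 1), (5, 0), (5, 1)]
good = mean_silhouette(points, [0, 0, 1, 1])  # groups match the labels
bad = mean_silhouette(points, [0, 1, 0, 1])   # labels cut across the groups
```

The labeling that matches the visible groups scores close to 1, whereas the labeling that cuts across them scores below 0.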

2.6 K-Means Clustering

Data clustering is defined as the process of organizing data into distinct groups. In addition, clustering is a technique for grouping records in a database according to specific criteria [18]. In an effort to improve the precision of suggested answers, K-Means clustering can be used to group client inquiries based on related problems or subjects. Data mining is a technique for examining trends and characteristics, as well as for extracting previously unknown information from large databases. Clustering is one of the methods for data analysis in data mining. The K-Means algorithm, a clustering method based on distance partitioning, is one example of a data mining algorithm [19][20].
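To make the idea of distance-based partitioning concrete, the following is a minimal sketch of the usual K-Means iteration (assign each point to its nearest centroid, then move each centroid to the mean of its members). It is a plain-Python illustration under assumed names and sample data, not the scikit-learn implementation used later in this study.

```python
import math
import random

def k_means(points, k, n_iter=100, seed=0):
    """Cluster tuples of numbers into k groups; returns (labels, centroids)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # random initialization from the data
    labels = []
    for _ in range(n_iter):
        # Assignment step: each point joins its nearest centroid
        labels = [min(range(k), key=lambda c: math.dist(p, centroids[c])) for p in points]
        # Update step: each centroid moves to the mean of its members
        new_centroids = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centroids.append(tuple(sum(x) / len(members) for x in zip(*members)))
            else:
                new_centroids.append(centroids[c])  # keep the centroid of an empty cluster
        if new_centroids == centroids:
            break  # converged: no centroid moved
        centroids = new_centroids
    return labels, centroids

# Hypothetical 2-D points forming two well-separated groups
points = [(0, 0), (1, 0), (0, 1), (10, 10), (11, 10), (10, 11)]
labels, centroids = k_means(points, k=2)
```

Because the starting centroids are drawn at random, different seeds can give different cluster numbering, and on harder data they can give different partitions.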

2.7 K-Medoids Clustering

K-Medoids is a non-hierarchical clustering algorithm developed from the K-Means algorithm. The method divides x objects into k clusters using partition clustering. A medoid is an object located at the center of its cluster, which makes it resistant to outliers. Like K-Means, K-Medoids calculates the distance from each object to the cluster centers [21]. The formula for Euclidean Distance is shown in (3).


$$D(x_j, c_j) = \sqrt{\sum_{j=1}^{n} (x_j - c_j)^2} \qquad (3)$$

D stands for distance, n for the number of parameters (features), j for the parameter index from 1 to n, x_j for the jth parameter of object x, and c_j for the jth parameter of the cluster center. The fundamental principle of K-Medoids is to build clusters around medoids, which are data points within a cluster that have the minimum total distance to the other members of the cluster. The iterative process optimizes the placement of medoids and the assignment of data points to clusters, thereby improving the clustering. K-Medoids is often employed when data contains noise or outliers, as medoids are more robust to disturbances than the mean used in the K-Means algorithm [22].
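The Euclidean distance of equation (3) and the medoid definition above can be sketched as follows. This is a simple alternating variant (assign points, then re-pick each medoid) written for illustration; it is not necessarily the procedure implemented by the library used in this study, and the initialization with the first k points is an assumption.

```python
import math

def euclidean(x, c):
    """Equation (3): square root of the summed squared differences over all n parameters."""
    return math.sqrt(sum((xj - cj) ** 2 for xj, cj in zip(x, c)))

def k_medoids(points, k, n_iter=100):
    """Cluster tuples of numbers into k groups; returns (labels, medoids)."""
    medoids = list(points[:k])  # assumption: the first k points serve as initial medoids
    labels = []
    for _ in range(n_iter):
        # Assign every point to its nearest medoid
        labels = [min(range(k), key=lambda m: euclidean(p, medoids[m])) for p in points]
        new_medoids = []
        for m in range(k):
            members = [p for p, lab in zip(points, labels) if lab == m]
            if not members:
                new_medoids.append(medoids[m])
                continue
            # The medoid is the member with the minimum total distance to the other members
            new_medoids.append(
                min(members, key=lambda cand: sum(euclidean(cand, q) for q in members))
            )
        if new_medoids == medoids:
            break  # converged: no medoid changed
        medoids = new_medoids
    return labels, medoids

# Hypothetical data: two groups plus one outlier at (100, 100)
points = [(0, 0), (1, 0), (0, 1), (10, 10), (11, 10), (10, 11), (100, 100)]
labels, medoids = k_medoids(points, k=2)
```

In this example the outlier joins the second cluster, yet that cluster's medoid remains one of the three nearby points, whereas a mean would be pulled far toward (100, 100).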

3. RESULT AND DISCUSSION

3.1 Preprocessing Implementation

Several preprocessing steps are performed at this stage: lowercasing, punctuation removal, and stop word removal. We also convert emoticons and emojis to words, expand chat words, correct spelling, and, lastly, eliminate non-English terms. A more thorough explanation is given in Sections 3.1.1 through 3.1.5.

3.1.1 Lowercasing, Removal Punctuation, and Removal Stop Words

Lowercasing is the process of changing capital words to lowercase letters to facilitate word processing. Next, we eliminate any symbols and punctuation that can obstruct the analysis process [23]. Among these characters are ~!@#$%^&*()-=[]<>. Furthermore, we eliminate white spaces, numerals, and URLs. The next phase is stop-word removal, which entails getting rid of words from the text that don't have any real meaning. In this step, a list of stop words that need to be eliminated is downloaded using the Natural Language Toolkit (NLTK). "I," "me," "my," "myself," "we," "our," "ours," "ourselves," "you," "you're," "you've," "you'll," "you'd," "your," "yours," "yourself," "yourselves," "he," "him," "his," "himself," "she," "she's," and so forth are a few examples of stop words that have been eliminated.

Table 1. Result table of lower-casing text

| No | Before | After |
|---|---|---|
| 1 | @624384 Hi there! Can you let us know what eve... | 624384 hi let us know event youre referring we... |
| 2 | @271225 Hey! Our friends in Artist Support are... | 271225 hey friends artist support right folks ... |
| 3 | @306419 Don't throw in the towel just yet! Can... | 306419 dont throw towel yet dm us accounts... |
| 4 | @395005 Hey, that doesn't sound good! Did the ... | 395005 hey doesnt sound good app crash point ... |
| 5 | @271218 Hey Morgan! We'd love to have them on ... | 271218 hey morgan wed love spotify hopefully ... |
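The lowercasing, punctuation removal, and stop-word removal described in this section can be sketched with the standard library. The stop-word set below is a small hand-picked sample taken from the examples listed above, standing in for the full NLTK list used in the study; the function name, the sample sentence, and its URL are hypothetical. The sketch drops numerals as the text describes, although Table 1 retains the numeric user IDs.

```python
import re

# Assumption: a small sample of stop words; the study downloads the full list with NLTK
STOP_WORDS = {"i", "me", "my", "myself", "we", "our", "ours", "ourselves",
              "you", "your", "yours", "he", "him", "his", "she"}

def clean(text):
    """Lowercase, strip URLs, punctuation, symbols and numerals, then drop stop words."""
    text = text.lower()
    text = re.sub(r"https?://\S+", " ", text)  # remove URLs
    text = re.sub(r"[^a-z\s]", "", text)       # remove punctuation, symbols, numerals
    tokens = text.split()                      # also collapses extra white space
    return " ".join(t for t in tokens if t not in STOP_WORDS)

result = clean("Hey! We'd love to see YOUR playlist at https://example.com/p/1 :)")
# result == "hey wed love to see playlist at"
```

Because punctuation is removed before stop words are matched, contractions such as "we'd" become "wed" and are no longer recognized as stop words, which is consistent with the "wed" and "youre" tokens visible in Table 1.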

3.1.2 Convert Emoticon to Word

In this phase, we remove the emoticons and replace them with words or text. For instance, as Table 2 below illustrates, words are used in place of emoticons.

Table 2. Illustration table for converting emoticon to word

| No | Emoticon | Word |
|---|---|---|
| 1 | "8D" | "Laughing, big grin or laugh with glasses" |
| 2 | ":'‑(" | "Crying" |
| 3 | ";‑)" | "Wink or smirk" |
| 4 | ":P" | "Tongue sticking out, cheeky, playful or blowing a raspberry" |
| 5 | "O:‑)" | "Angel, saint or innocent" |

3.1.3 Convert Emoji to Word

Emojis are eliminated in this phase and replaced with text or words. For instance, as seen in Table 3 below, words are used instead of emojis.

Table 3. Illustration table for converts emoji to word

| No | Before | After |
|---|---|---|
| 1 | Sorry to hear that😢 | Sorry to hear that sad |
| 2 | Hope we meet soon! 😉 | Hope we meet soon! Wink or smirk |
| 3 | I Like 🍐 | I Like pear |
| 4 | Thank you 🌹 | Thank you rose |
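The emoticon and emoji conversions of Sections 3.1.2 and 3.1.3 amount to dictionary lookups, which the sketch below illustrates. The mapping contains only a few entries copied from Tables 2 and 3 (written with plain ASCII hyphens), and the function name is hypothetical; the study's actual lookup tables are presumably much larger.

```python
# Assumption: a tiny sample of the mappings illustrated in Tables 2 and 3
SYMBOL_TO_WORD = {
    ":'-(": "Crying",
    ";-)": "Wink or smirk",
    "😢": "sad",
    "🍐": "pear",
    "🌹": "rose",
}

def symbols_to_words(text):
    """Replace every known emoticon or emoji with its word description."""
    # Longer symbols are replaced first so a short one never clips a longer one
    for symbol in sorted(SYMBOL_TO_WORD, key=len, reverse=True):
        text = text.replace(symbol, " " + SYMBOL_TO_WORD[symbol] + " ")
    return " ".join(text.split())  # tidy the spaces introduced by the replacement

converted = symbols_to_words("Sorry to hear that😢")
# converted == "Sorry to hear that sad"
```

Padding each replacement with spaces keeps the inserted word separate from the neighboring text even when the symbol was attached directly to a word.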

3.1.4 Chat Word Conversion and Spelling Corrections

This phase converts abbreviations into their full equivalents. To prevent word ambiguity, spelling errors are then corrected. For instance, Table 4 below illustrates how chat words are substituted.

Table 4. Illustration table for chat word conversion

| No | Before | After |
|---|---|---|
| 1 | This meat from the USA | This meat from the United States of America |
| 2 | FYI, the team meeting is scheduled for tomorrow morning at 10 | For your information, the team meeting is scheduled for tomorrow morning at 10 |
| 3 | She found an amazing DIY tutorial | She found an amazing do-it-yourself tutorial |
| 4 | Please submit the report ASAP | Please submit the report As soon as possible |

3.1.5 Removal of non-English Word

At this point, non-English words are removed from the text to prevent word ambiguity. For instance, the removal of non-English words is illustrated in Table 5 below.

Table 5. Illustration table removal of non-English word

| Before | After |
|---|---|
| Don't forget untuk bawa your umbrella | Don't forget your umbrella |
| Happy to see you! Arigatou | Happy to see you! |

3.2 TF-IDF Result

The technique used to determine each word's weight is called Term Frequency-Inverse Document Frequency (TF-IDF) [24]. According to the analysis's findings, there are 28,865 distinct words or features and 29,478 documents or texts (counted from 0). In other words, every document is a vector in a space with 28,865 dimensions, where each dimension corresponds to a distinct word in the dataset. The outcomes are shown in Table 6 below.

Table 6. Result value of TF-IDF

| Word | Index of document | Index of word | TF-IDF Value |
|---|---|---|---|
| removed | 0 | 27226 | 0.2135014300585068 |
| content | 0 | 22499 | 0.1776834207794209 |
| Sorry | 0 | 27850 | 0.1787313947309077 |
| send | 1 | 27625 | 0.1967877005989905 |
| Spotify | 1 | 27894 | 0.15580300170015757 |
| click | 1 | 22359 | 0.3512423354129363 |

3.3 Silhouette Result

Currently, training uses 80% of the data, while testing uses the remaining 20%. The best K is then found by applying the data to the silhouette approach. The results of calculating the ideal value of K using the silhouette method are displayed in Figure 4 below.

Figure 4. Optimal K = 10 for K-Means and K = 2 for K-Medoids obtained using the silhouette method


Based on the information presented in Figure 4, the optimal number of clusters (K) for K-Means is 10, with a silhouette score of 0.0375, while for K-Medoids the optimal K is 2, with a silhouette score of approximately 0.010. This analysis provides insights into patterns within Spotify's customer data. With K-Means showing 10 clusters, the substantial variation among these groups may indicate a diversity of user queries or issues, suggesting the need for highly specific customer support strategies. On the other hand, K-Medoids with only 2 clusters reveals a polarization or separation, implying the existence of two dominant types of complaints or queries. Identifying these patterns helps tailor customer support strategies, enabling companies to address issues and provide more targeted and effective solutions to their users.

3.4 K-Means Result

The Python code uses the scikit-learn module to implement the K-Means clustering technique. The number of clusters, random seed, centroid initialization technique, and number of initialization tries are among the parameters that are initialized for the K-Means model. After fitting it to the training set, the model predicts cluster labels using the `predict` method. The code then reduces the dimensionality of the data to two for scatter plot presentation using Principal Component Analysis (PCA). A training data instance is represented by each point, which is colored based on the predicted cluster labels. Red "x" marks represent each cluster's centroid. Figures 5, 6, and 7 display the resulting graphs.

Figure 5. Scatter plot of K-Means clustering

Figure 6. K-Means clustering scatter plot (separately) for clusters 0 to 4.

Figure 7. K-Means clustering scatter plot (separately) for clusters 5 to 9.

This plot visually represents the distribution of data in two dimensions using K-Means clustering. Each point represents a training data instance, colored based on the predicted cluster labels. Red "x" symbols mark the centroids of each cluster. The visualization helps assess the algorithm's effectiveness in creating distinct clusters.

In this case, there are 10 clusters, with cluster 3 having the most items (over 17,000) and cluster 9 having the fewest items (less than 2,500), as shown in Figure 8.

Figure 8. Diagram for cluster size

Table 7. Table list of members (items) in each cluster

| Cluster | Item | Frequency of word |
|---|---|---|
| 0 | 15, 18, 32, 36, 53, … | us: 7.90%, anything: 5.19%, else: 5.03%, need: 4.45%, shout: 3.29% |
| 1 | 4, 7, 10, 13, 25, 48, … | right: 5.44%, feedback: 4.44%, well: 4.31%, team: 3.74%, pass: 2.99% |
| 2 | 38, 46, 47, 61, … | us: 6.84%, happening: 6.35%, whats: 5.37%, hey: 5.19%, exactly: 4.93% |
| 3 | 3, 5, 6, 11, 22, 26, … | hey: 3.28%, us: 2.55%, well: 1.46%, know: 1.19%, hi: 1.08% |
| 4 | 17, 138, 192, 229, … | logging: 12.60%, gt: 6.96%, back: 6.70%, restarting: 6.52%, device: 4.92% |
| 5 | 12, 60, 78, 79, 106, … | spotify: 5.41%, well: 5.12%, us: 5.05%, version: 4.88%, see: 4.39% |
| 6 | 0, 14, 20, 23, 24, 27, … | email: 6.45%, well: 6.42%, dm: 6.41%, us: 6.37%, address: 6.15% |
| 7 | 16, 65, 82, 92, 94, … | dm: 9.67%, weve: 9.55%, carry: 6.84%, sent: 6.24%, chatting: 5.01% |
| 8 | 8, 9, 19, 21, 30, 35, … | available: 6.63%, content: 5.25%, soon: 5.17%, info: 4.64%, hey: 4.61% |
| 9 | 1, 2, 139, 227, 373, … | via: 12.99%, help: 7.13%, email: 7.13%, support: 6.98%, twitter: 6.82% |

Figure 9. Example of 10 complete tweet sentences to analyze the context in each cluster

Broad patterns arise in the conversational context based on observations of the top 5 terms that appear most frequently in Table 7 and random phrases for each cluster in Figure 9. Cluster 6 appears to be focused on providing support and help with account-related problems, suggesting that users send direct messages (DMs) with account details. Cluster 9 indicates multilingual support via email or Twitter in the meantime. Cluster 3 talks about technological issues and offers fixes, such as restarting devices or apps. Cluster 1 is concerned with user feedback and promotes opinion sharing. Cluster 8 addresses the accessibility of content, with some mentioning momentary unavailability because of license modifications. Operating system and device support is the main focus of Cluster 5. Positive comments and offers of support are reflected in Cluster 0. Consistent assistance and direct communication are part of Cluster 7. Clusters 4 and 2 concentrate on restarting, logging out, and frequent troubleshooting recommendations. All things considered, these patterns shed light on the ways that users engage with the service and the replies that service providers offer in return.

3.5 K-Medoids Result

The provided code utilizes the K-Medoids algorithm from the `sklearn_extra.cluster` module for clustering with a specified number of clusters (in this case, 2). It fits the model to training data, predicts cluster labels, and employs PCA for dimensionality reduction to visualize the results in a 2D scatter plot. Each point represents a data instance, color-coded by its predicted cluster. Additionally, red "x" symbols indicate the medoids, the most centrally located points in each cluster. This visualization aids in assessing the effectiveness of the K-Medoids algorithm in forming distinct clusters and provides insights into the central representatives of each cluster. The results of clustering using K-Medoids can be seen in Figures 10 and 11.

Figure 10. K-Medoids clustering scatter plot (separately)


Figure 11. Scatter plot of K-Medoids clustering

This plot visually represents the distribution of data in two dimensions using K-Medoids clustering. Each point represents an example of the training data, colored based on the predicted cluster label. In this case, there are 2 clusters, with cluster 1 having almost 14,000 items and cluster 2 having around 10,000 items, as shown in Figure 12.

Figure 11. Scatter plot of K-Medoids clustering

Table 8. Table list of members (items) in each cluster in K-Medoids

| Cluster | Item | Frequency of word |
|---|---|---|
| 0 | 0, 4, 7, 8, 10, 11, 14, … | us: 4.0%, well: 4.01%, dm: 3.73%, hey: 3.71%, email: 3.02% |
| 1 | 1, 2, 3, 5, 6, 9, 12, … | us: 4.00%, hey: 3.06%, know: 2.99%, let: 2.70%, well: 2.51% |

Figure 13. Example of 10 complete tweet sentences to analyze the context in each cluster

Cluster 1 appears to contain sentences with a focus on addressing specific issues or inquiries related to user accounts, such as requesting assistance with usernames, email addresses, and account details. The content suggests engagement with users facing problems or seeking support, involving activities like checking accounts, providing feedback, and troubleshooting technical issues. On the other hand, Cluster 2 comprises sentences that involve expressing gratitude, appreciating feedback, and encouraging users to share their thoughts or concerns. The language in this cluster seems more positive and supportive, with an emphasis on community engagement, reporting issues, and offering assistance. Overall, Cluster 1 leans towards support and problem-solving, while Cluster 2 reflects positive interactions and encouragement within a community or user base.

3.6 DBI Result

Table 9. Table of DBI Result for K-Means and K-Medoids

| Method | K (Cluster) | DBI Value |
|---|---|---|
| K-Means | 10 | 1.76 |
| K-Medoids | 2 | 1.48 |

Based on the information provided in Table 9, it can be observed that the DBI result for K-Means Clustering is 1.76 with a total of 10 clusters, whereas K-Medoids yielded a DBI result of 1.48 with 2 clusters.
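For readers unfamiliar with the metric, the sketch below shows how a Davies-Bouldin Index is commonly computed: for each cluster, the worst ratio of combined within-cluster scatter to the distance between cluster centers is taken, and these ratios are averaged, so lower values indicate more compact and better separated clusters. The function name and sample points are illustrative assumptions; the values in Table 9 were not produced by this code. The sketch expects at least two clusters with distinct centers.

```python
import math

def davies_bouldin(points, labels):
    """Average over clusters of max_j (s_i + s_j) / d(c_i, c_j); lower is better."""
    clusters = sorted(set(labels))
    centroids, scatter = {}, {}
    for c in clusters:
        members = [p for p, lab in zip(points, labels) if lab == c]
        centroids[c] = tuple(sum(x) / len(members) for x in zip(*members))
        # s_c: average distance from the cluster's members to its centroid
        scatter[c] = sum(math.dist(p, centroids[c]) for p in members) / len(members)
    total = 0.0
    for i in clusters:
        total += max(
            (scatter[i] + scatter[j]) / math.dist(centroids[i], centroids[j])
            for j in clusters if j != i
        )
    return total / len(clusters)

# Hypothetical 2-D points forming two separated groups
points = [(0, 0), (0, 2), (10, 0), (10, 2)]
compact = davies_bouldin(points, [0, 0, 1, 1])      # labels follow the groups
overlapping = davies_bouldin(points, [0, 1, 0, 1])  # labels cut across the groups
```

In this toy example the labeling that follows the groups yields 0.2, while the labeling that cuts across them yields 5.0.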

4. CONCLUSION

The evaluation results provide insightful observations on the efficacy of K-Means Clustering and K-Medoids in optimizing customer support analysis for Spotify Cares on Twitter. The Davies-Bouldin Index (DBI), a crucial metric assessing clustering quality, indicates interesting trends considering intra-cluster cohesion and inter-cluster separation. The DBI of 1.76 for K-Means with 10 clusters suggests a moderate level of dispersion and overlap among clusters, hinting at the potential for refining the number of clusters to enhance customer segmentation precision. In contrast, K-Medoids demonstrate a more well-defined and compact clustering structure with a lower DBI of 1.48 and only 2 clusters. The reduced number of clusters in K-Medoids implies simplified customer categories, offering potential benefits for more streamlined customer support strategies. In conclusion, while both methods exhibit strengths and weaknesses, the K-Medoids approach with two clusters emerges as promising for Spotify Cares on Twitter, presenting a more cohesive and distinct grouping of customer interactions conducive to targeted and efficient customer support interventions. Future research could build upon these findings by exploring the relationship between response time and customer satisfaction, delving into sentiment analysis for insights into user feelings toward customer service responses, and investigating more advanced clustering strategies or machine learning models to enhance analytical precision.

ACKNOWLEDGMENT

This research was conducted at the School of Computing, Telkom University, Bandung, Indonesia.

REFERENCES

[1] Appel, G., Grewal, L., Hadi, R., & Stephen, A. T. (2020). The future of social media in marketing. Journal of the Academy of Marketing Science, 48(1). https://doi.org/10.1007/s11747-019-00695-1.

[2] Antonakaki, D., Fragopoulou, P., & Ioannidis, S. (2021). A survey of Twitter research: Data model, graph structure, sentiment analysis and attacks. Expert Systems with Applications, 164. https://doi.org/10.1016/j.eswa.2020.114006.

[3] Hodgson, T. (2021). Spotify and the democratisation of music. In Popular Music (Vol. 40, Issue 1). https://doi.org/10.1017/S0261143021000064.

[4] Agnihotri, R. (2020). Social media, customer engagement, and sales organizations: A research agenda. Industrial Marketing Management, 90. https://doi.org/10.1016/j.indmarman.2020.07.017.

[5] Sarin, P., Kar, A. K., & Ilavarasan, V. P. (2021). Exploring engagement among mobile app developers – Insights from mining big data in user generated content. Journal of Advances in Management Research, 18(4). https://doi.org/10.1108/JAMR-06-2020-0128.

[6] Tao, D., Yang, P., & Feng, H. (2020). Utilization of text mining as a big data analysis tool for food science and nutrition. In Comprehensive Reviews in Food Science and Food Safety (Vol. 19, Issue 2). https://doi.org/10.1111/1541-4337.12540.

[7] Sinaga, K. P., & Yang, M. S. (2020). Unsupervised K-means clustering algorithm. IEEE Access, 8. https://doi.org/10.1109/ACCESS.2020.2988796.

[8] Ramadhani, S., Azzahra, D., & Z, T. (2022). Comparison of K-Means and K-Medoids Algorithms in Text Mining based on Davies Bouldin Index Testing for Classification of Student’s Thesis. Digital Zone: Jurnal Teknologi Informasi Dan Komunikasi, 13(1). https://doi.org/10.31849/digitalzone.v13i1.9292.

[9] Wahyudi, M., & Pujiastuti, L. (2022). Komparasi K-Means Clustering dan K-Medoids dalam Mengelompokkan Produksi Susu Segar di Indonesia Comparison of K-Means Clustering and K-Medoids in Clustering Fresh Milk Production in Indonesia. Jurnal Bumigora Information Technology (BITe), 4(2).

[10] Antonakaki, D., Fragopoulou, P., & Ioannidis, S. (2021). A survey of Twitter research: Data model, graph structure, sentiment analysis and attacks. Expert Systems with Applications, 164. https://doi.org/10.1016/j.eswa.2020.114006.


[11] Karami, A., Lundy, M., Webb, F., & Dwivedi, Y. K. (2020). Twitter and Research: A Systematic Literature Review through Text Mining. IEEE Access, 8. https://doi.org/10.1109/ACCESS.2020.2983656.

[12] Shehab, N., Badawy, M., & Arafat, H. (2021). Big Data Analytics and Preprocessing. In Studies in Big Data (Vol. 77). https://doi.org/10.1007/978-3-030-59338-4_2.

[13] Zhang, Y., Safdar, M., Xie, J., Li, J., Sage, M., & Zhao, Y. F. (2023). A systematic review on data of additive manufacturing for machine learning applications: the data quality, type, preprocessing, and management. In Journal of Intelligent Manufacturing (Vol. 34, Issue 8). https://doi.org/10.1007/s10845-022-02017-9.

[14] Naseem, U., Razzak, I., & Eklund, P. W. (2021). A survey of pre-processing techniques to improve short-text quality: a case study on hate speech detection on twitter. Multimedia Tools and Applications, 80(28–29). https://doi.org/10.1007/s11042-020-10082-6.

[15] Tabassum, A., & Patil, R. R. (2020). A Survey on Text Pre-Processing & Feature Extraction Techniques in Natural Language Processing. International Research Journal of Engineering and Technology, June.

[16] Santos, A., & Paula, H. (2021). Microservice decomposition and evaluation using dependency graph and silhouette coefficient. ACM International Conference Proceeding Series. https://doi.org/10.1145/3483899.3483908.

[17] Kwak, S., Lee, Y., Ko, T., Yang, S., Hwang, I. C., Park, J. B., Yoon, Y. E., Kim, H. L., Kim, H. K., Kim, Y. J., Cho, G. Y., Sohn, D. W., Won, S., & Lee, S. P. (2020). Unsupervised Cluster Analysis of Patients with Aortic Stenosis Reveals Distinct Population with Different Phenotypes and Outcomes. Circulation: Cardiovascular Imaging, 13(5). https://doi.org/10.1161/CIRCIMAGING.119.009707.

[18] Ghosal, A., Nandy, A., Das, A. K., Goswami, S., & Panday, M. (2020). A Short Review on Different Clustering Techniques and Their Applications. Advances in Intelligent Systems and Computing, 937. https://doi.org/10.1007/978-981-13-7403-6_9.

[19] Yan, H., Yang, N., Peng, Y., & Ren, Y. (2020). Data mining in the construction industry: Present status, opportunities, and future trends. In Automation in Construction (Vol. 119). https://doi.org/10.1016/j.autcon.2020.103331.

[20] Gupta, M. K., & Chandra, P. (2020). A comprehensive survey of data mining. International Journal of Information Technology (Singapore), 12(4). https://doi.org/10.1007/s41870-020-00427-7.

[21] Dinata, R. K., Retno, S., & Hasdyna, N. (2021). Minimization of the Number of Iterations in K-Medoids Clustering with Purity Algorithm. Revue d’Intelligence Artificielle, 35(3). https://doi.org/10.18280/ria.350302.

[22] Sharma, K. K., & Seal, A. (2021). Outlier-robust multi-view clustering for uncertain data. Knowledge-Based Systems, 211. https://doi.org/10.1016/j.knosys.2020.106567.

[23] Chai, C. P. (2023). Comparison of text preprocessing methods. Natural Language Engineering, 29(3). https://doi.org/10.1017/S1351324922000213.

[24] Alfarizi, M. I., Syafaah, L., & Lestandy, M. (2022). Emotional Text Classification Using TF-IDF (Term Frequency-Inverse Document Frequency) And LSTM (Long Short-Term Memory). JUITA : Jurnal Informatika, 10(2). https://doi.org/10.30595/juita.v10i2.13262.

[25] Prey, R., Esteve Del Valle, M., & Zwerwer, L. (2022). Platform pop: disentangling Spotify’s intermediary role in the music industry. Information Communication and Society, 25(1). https://doi.org/10.1080/1369118X.2020.1761859.