
Telkom University Opinion Topic Modeling on Twitter Using Latent Dirichlet Allocation During Covid-19 Pandemic

Tandya Rizky Pratama*, Donni Richasdy, Mahendra Dwifebri Purbolakson

School of Computing, Informatics Study Program, Telkom University, Bandung, Indonesia

Email: 1[email protected], 2[email protected],

3[email protected]

Correspondence Author Email: [email protected]


Abstract−In the current digital era, information technology is developing rapidly. Its growth has been accompanied by the growth of social media, and one platform currently on the rise is Twitter. Because Twitter has many users around the world, it stores a large amount of data that can be put to use, for example to determine the categories of public opinion about a company or university; this study focuses on the categories of public opinion about Telkom University. Public opinion can be grouped or categorized to make it easier to determine the topics being discussed. Determining opinions manually would take a long time because of the large number of tweets. Therefore, another method is needed to determine the categories of public opinion on Twitter. One such method is Latent Dirichlet Allocation (LDA), applied here to a dataset of Indonesian-language tweets. With this method, grouping tweets at a large scale becomes more efficient. The most optimum model obtained a coherence score of -15.33029 using the c_umass method, with a combination of 9 topics, an alpha value of 0.31, and a beta value of 0.01.

Keywords: Topic Modelling; Latent Dirichlet Allocation (LDA); Topic Coherence; Public Opinion; Term Frequency-Inverse Document Frequency (TF-IDF)

1. INTRODUCTION

The development of information technology is always accompanied by the development of social media, one of the media most often used by people to exchange information. One example of social media currently on the rise is Twitter, a social networking site used by people around the world to interact with each other. Because of its large number of users worldwide [1], the tweets posted by Twitter users contain a great deal of data and public opinion about companies and universities.

Telkom University is a private university located in Bandung Regency. According to www.timeshighereducation.com, the Times Higher Education (THE) Asia University Rankings 2022 uses the same 13 performance indicators as the THE World University Rankings, and THE AUR 2022 covers 616 universities from 31 regions in Asia. Of those 616 universities, Telkom University ranked 401-500 in Asia, 9th nationally, and first among private universities in Indonesia [2]. As a result, many students are interested in enrolling at Telkom University to continue their studies. To facilitate the dissemination of information, Telkom University maintains several social media accounts, one of which is on Twitter, and many people have expressed opinions about Telkom University on Twitter during the current pandemic. These opinions can be grouped or categorized to make it easier to determine what is being discussed; once the most widely discussed opinion categories are known, Telkom University can use them as material for future performance evaluation.

Several methods can be used to categorize topics in a dataset of Indonesian-language tweets; one of them is Latent Dirichlet Allocation (LDA). LDA is a topic modeling method that classifies the text in a document into specific topics, and the purpose of using it here is to make topic determination at a large scale more efficient.

Research on topic modeling with a Twitter dataset was previously conducted by Ahmad Fathan and M. Rifqi [3], who used the LDA method to model topics about traffic congestion; the optimum result was 60 topic segments with a perplexity value of -3754.89. Another study on topic clustering of BPJS Health sentiment data [4] using the LDA method obtained 2 optimum topics with an alpha value of 0.001 and a beta value of 0.1, resulting in a perplexity value of 6.0907.

Latent Dirichlet Allocation (LDA) is a probabilistic generative model of discrete data sets, or writings, called a corpus. The method is useful for modeling documents that arise from several topics, where a topic is defined as a distribution over a fixed vocabulary of terms [5]. In a study by M. Choirul Rahmadan, Achmad Nizar Hidayanto, et al. on sentiment analysis and topic modeling of the flooding in Jakarta using the LDA method [6], 9 optimum topics were obtained.

A study by Bagus Wicaksono and Gangga Anugra [7] used the Latent Dirichlet Allocation algorithm with TF-IDF weighting and obtained a coherence score of 0.501014 with an optimum of 28 topics. Other research on topic modeling of football news by Ahmad Fathan Hidayatullah, Elang Cergas Pembrani, et al. [8] used the LDA method and obtained 10 optimum topics, although the accuracy obtained was not reported. A study by Ella Anggraini [9] on modeling the topics of thesis abstracts using the LDA method obtained a model accuracy of 0.5528 with an optimum of 3 topics.

This study categorizes public opinion tweets about Telkom University posted on Twitter during the pandemic. Its purpose is to determine which opinion categories Twitter users discuss about Telkom University during the pandemic. The results of the topic modeling analysis conducted with the LDA method are expected to help Telkom University evaluate and improve its performance and services in the future.

Given this problem of categorizing public opinion about Telkom University on Twitter, this final project focuses on implementing the LDA method to categorize public opinion about Telkom University during the COVID-19 pandemic. The dataset consists of 2526 text records obtained through the Twitter API, searched with the keywords "Tel-U", "Universitas Telkom", and "Telkom University". The topic modeling program is implemented in the Python programming language using the LDA modeling algorithm and TF-IDF weighting.

2. RESEARCH METHODOLOGY

2.1 System Design

The model built in this study determines the topics of public opinion about Telkom University on Twitter during the COVID-19 pandemic using the LDA method, with Indonesian-language tweets as input. An overview of the topic modeling process is shown in Figure 1.

Figure 1. System Design Flowchart

2.2 Dataset

Data collection (crawling) was done by fetching data from Twitter; a total of 2526 tweets were collected in CSV format. The crawling process uses the application programming interface (API) provided by Twitter through the Python library tweepy. The keywords used to search for tweets were "Universitas Telkom", "Telkom University", and "Tel U", and the dataset consists of Indonesian-language tweets. The Python library pandas is used to load the dataset.
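Assuming the crawled tweets were saved as a CSV file with a text column (the column name "tweet" is an assumption; the paper does not give the schema), loading the dataset with pandas might look like the following sketch, which uses an in-memory stand-in for the file:

```python
from io import StringIO

import pandas as pd

# Stand-in for the crawled CSV file; the real file and its schema are not
# specified in the paper, so the "tweet" column name is an assumption.
csv_data = StringIO(
    "tweet\n"
    "Pendaftaran beasiswa Telkom University sudah dibuka\n"
    "Suasana kampus Tel U di Bandung\n"
)

df = pd.read_csv(csv_data)
tweets = df["tweet"].tolist()  # list of documents fed into pre-processing
print(len(tweets))
```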

2.3 Pre-Processing

Text mining is the process of finding patterns or information in large amounts of irregular and semi-structured text data [10]. It is a variant of data mining, often used for text classification, text clustering, and sentiment analysis [11]. Pre-processing in this study has several stages: cleansing, case folding, tokenizing, stopword removal, stemming, and lemmatizing.


2.3.1 Cleansing and case folding

The first step is to clean the data: removing repeated characters, usernames, mentions, hashtags, blank lines, punctuation, special characters, excess spaces, and URLs. Next is case folding, which converts all characters to lowercase. The cleansing and case folding process can be seen in Table 1.

Table 1. Cleansing and Case Folding

Before: @ABSetyono @Spy_Zone85 @kemkominfo @dennysirregar7 @psi_id @Prabu_dasamuka @babegalak1 @myozhyme @PlateJohnny @putrivio3 @setengahmalass Menkominfo, Johnny G. Plate, menyatakan upaya untuk mengembangkan ekosistem digital tersebut diwujudkan melalui jalinan sinergi antara Kementerian Kominfo dan Telkom University dengan Telecom Infra Project (TIP) dan Meta Connectivity

After: menkominfo johnny plate menyatakan upaya untuk mengembangkan ekosistem digital tersebut diwujudkan melalui jalinan sinergi antara kementerian kominfo dan telkom university dengan telecom infra project tip dan meta connectivity
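A minimal sketch of these two steps follows; the exact regular expressions are assumptions, since the paper does not list its cleaning rules:

```python
import re

def cleanse_and_fold(text: str) -> str:
    """Remove URLs, mentions, hashtags, punctuation and excess spaces, then lowercase."""
    text = re.sub(r"https?://\S+", " ", text)   # URLs
    text = re.sub(r"[@#]\w+", " ", text)        # mentions and hashtags
    text = re.sub(r"[^a-zA-Z\s]", " ", text)    # punctuation and special characters
    text = re.sub(r"\s+", " ", text).strip()    # collapse excess whitespace
    return text.lower()                         # case folding

example = "@PlateJohnny Menkominfo, Johnny G. Plate, menyatakan upaya #digital https://t.co/x"
print(cleanse_and_fold(example))
```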

2.3.2 Tokenizing

Tokenizing is the process of splitting text into words at the spaces that separate them, which facilitates the next steps. The tokenizing process can be seen in Table 2.

Table 2. Tokenizing

Before: menkominfo johnny plate menyatakan upaya untuk mengembangkan ekosistem digital tersebut diwujudkan melalui jalinan sinergi antara kementerian kominfo dan telkom university dengan telecom infra project tip dan meta connectivity

After: 'menkominfo', 'johnny', 'plate', 'menyatakan', 'upaya', 'untuk', 'mengembangkan', 'ekosistem', 'digital', 'tersebut', 'diwujudkan', 'melalui', 'jalinan', 'sinergi', 'antara', 'kementerian', 'kominfo', 'dan', 'telkom', 'university', 'dengan', 'telecom', 'infra', 'project', 'tip', 'dan', 'meta', 'connectivity'
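After cleansing, tokens are separated by single spaces, so tokenizing reduces to splitting on whitespace:

```python
def tokenize(text: str) -> list[str]:
    # Words are separated by single spaces after cleansing, so str.split suffices.
    return text.split()

tokens = tokenize("menkominfo johnny plate menyatakan upaya")
print(tokens)
```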

2.3.3 Removing Stopword

Stopword removal is the process of removing connecting words and other words of little importance in a document. The stopword removal process can be seen in Table 3.

Table 3. Removing Stopword

Before: 'menkominfo', 'johnny', 'plate', 'menyatakan', 'upaya', 'untuk', 'mengembangkan', 'ekosistem', 'digital', 'tersebut', 'diwujudkan', 'melalui', 'jalinan', 'sinergi', 'antara', 'kementerian', 'kominfo', 'dan', 'telkom', 'university', 'dengan', 'telecom', 'infra', 'project', 'tip', 'dan', 'meta', 'connectivity'

After: 'menkominfo', 'johnny', 'plate', 'upaya', 'mengembangkan', 'ekosistem', 'digital', 'diwujudkan', 'jalinan', 'sinergi', 'kementerian', 'kominfo', 'telkom', 'university', 'telecom', 'infra', 'project', 'tip', 'meta', 'connectivity'
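A sketch with a small hand-picked stopword list; a real run would use a full Indonesian stopword list (the paper does not say which list was used):

```python
# Tiny illustrative stopword list; an actual run would use a full Indonesian list.
STOPWORDS = {"dan", "untuk", "tersebut", "melalui", "antara", "dengan", "yang", "di"}

def remove_stopwords(tokens: list[str]) -> list[str]:
    # Keep only tokens that are not in the stopword set.
    return [t for t in tokens if t not in STOPWORDS]

filtered = remove_stopwords(["menkominfo", "dan", "telkom", "university", "dengan", "meta"])
print(filtered)
```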

2.3.4 Stemming

Stemming changes words into their base words by removing prefixes and suffixes. The stemming process can be seen in Table 4.

Table 4. Stemming

Before: 'menkominfo', 'johnny', 'plate', 'upaya', 'mengembangkan', 'ekosistem', 'digital', 'diwujudkan', 'jalinan', 'sinergi', 'kementerian', 'kominfo', 'telkom', 'university', 'telecom', 'infra', 'project', 'tip', 'meta', 'connectivity'

After: 'menkominfo', 'johnny', 'plate', 'upaya', 'kembang', 'ekosistem', 'digital', 'wujud', 'jalin', 'sinergi', 'menteri', 'kominfo', 'telkom', 'university', 'telecom', 'infra', 'project', 'tip', 'meta', 'connectivity'

2.3.5 Lemmatizing

Lemmatizing is the process of reducing a word to its root word. The difference is that stemming reduces words to base forms without knowing their context, while lemmatizing takes the context of the word into account. The lemmatizing process can be seen in Table 5.


Table 5. Lemmatizing

Before: 'menkominfo', 'johnny', 'plate', 'upaya', 'mengembangkan', 'ekosistem', 'digital', 'diwujudkan', 'jalinan', 'sinergi', 'kementerian', 'kominfo', 'telkom', 'university', 'telecom', 'infra', 'project', 'tip', 'meta', 'connectivity'

After: 'menkominfo', 'johnny', 'plate', 'upaya', 'kembang', 'ekosistem', 'digital', 'wujud', 'jalin', 'sinergi', 'menteri', 'kominfo', 'telkom', 'university', 'telecom', 'infra', 'project', 'tip', 'meta', 'connectivity'

2.4 Term Frequency — Inverse Document Frequency (TF-IDF)

The TF-IDF method calculates the weight, or value relationship, of a word (term) to a document; it is an efficient, easy, and accurate method [12]. The next step is to convert the pre-processed tweets into word-occurrence frequencies using TF-IDF, which weights the important words used for topic modeling in the next stage. The TF-IDF formulas used to calculate the weight of each word are as follows:

TF(i, j) = freq(i, j) / maxOthers(i, j)   (1)

IDF(i) = log(N / n(i))   (2)

TF-IDF(i, j) = TF(i, j) × IDF(i)   (3)

Here TF(i, j) is computed for each i-th word in each j-th document: freq(i, j) is the count of the i-th word in the j-th document, and maxOthers(i, j) is the total number of words in the j-th document. IDF(i) dampens the influence of words that appear in many documents, where N is the number of documents and n(i) is the number of documents containing the i-th word. TF-IDF then represents how important a word is within a document.
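Equations (1)-(3) can be implemented directly. The sketch below uses toy documents and a base-10 logarithm, which is an assumption since the paper does not specify the log base:

```python
import math
from collections import Counter

def tf_idf(docs: list[list[str]]) -> dict[tuple[str, int], float]:
    """Weight of each (term, document index) pair, following equations (1)-(3)."""
    n_docs = len(docs)
    # n(i): number of documents containing term i
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc))

    weights = {}
    for j, doc in enumerate(docs):
        counts = Counter(doc)
        total = len(doc)  # maxOthers(i, j): total words in document j, per the paper
        for term, freq in counts.items():
            tf = freq / total                          # equation (1)
            idf = math.log10(n_docs / doc_freq[term])  # equation (2)
            weights[(term, j)] = tf * idf              # equation (3)
    return weights

docs = [["telkom", "university", "beasiswa"],
        ["telkom", "kuliah", "online"]]
w = tf_idf(docs)
# "telkom" appears in every document, so its IDF (and hence its weight) is zero.
print(w[("telkom", 0)])
```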

2.5 Topic Modelling LDA

Topic modeling is a clustering method included in unsupervised learning, where objects have no labels. There are three types of clustering: hard clustering, hierarchical clustering, and soft/fuzzy clustering. Topic modeling belongs to soft/fuzzy clustering, in which each object can belong to several groups to different degrees [13]. At this stage, the topic modeling analysis is carried out with the LDA method using the Python package Gensim; this study uses the gensim.corpora, gensim.models, and gensim.utils modules.

2.6 Evaluation

Topic coherence assesses a set of words output by topic modeling based on how coherently humans can interpret them [14]. The methods used in this study are c_v and c_umass. The c_v metric is based on a sliding window, a one-set segmentation of the top words, and an indirect confirmation measure using normalized pointwise mutual information (NPMI) and cosine similarity [15]. Meanwhile, c_umass is based on document co-occurrence counts, a one-preceding segmentation, and a logarithmic conditional probability as the confirmation measure [15]. The UMass score measures the extent to which the words in a topic tend to co-occur [16]. The coherence calculation for the c_v method uses the NPMI in formula (4), and for c_umass formula (5).

NPMI(w_i, w_j)^γ = ( log((P(w_i, w_j) + ε) / (P(w_i) · P(w_j))) / (−log(P(w_i, w_j) + ε)) )^γ   (4)

C_UMass(w_i, w_j) = log((D(w_i, w_j) + 1) / D(w_i))   (5)
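Formula (5) can be implemented directly. The sketch below scores one topic on a toy corpus; D(w) and D(w_i, w_j) are document-frequency counts, and the word order follows the topic's term ranking:

```python
import math
from itertools import combinations

def umass_coherence(top_words: list[str], docs: list[set[str]]) -> float:
    """C_UMass score for one topic: sum of formula (5) over ranked word pairs."""
    def d(*words):
        # Number of documents containing all the given words.
        return sum(1 for doc in docs if all(w in doc for w in words))

    score = 0.0
    for wi, wj in combinations(top_words, 2):  # wi precedes wj in the ranking
        score += math.log((d(wi, wj) + 1) / d(wi))
    return score

docs = [{"telkom", "beasiswa", "daftar"},
        {"telkom", "kuliah"},
        {"beasiswa", "daftar"}]
print(umass_coherence(["telkom", "beasiswa", "daftar"], docs))
```

On a real tweet corpus the pairwise counts are much sparser, which is why the reported scores are large negative numbers.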

2.7 Visualization

The topic modeling results are visualized using pyLDAvis, a Python library for visualizing topic models built with the LDA method, originally based on a combination of R and D3 [17]. pyLDAvis helps users interpret the topics produced by topic modeling.

3. RESULT AND DISCUSSION

Tests in this study were conducted on a dataset of 2526 tweets from Twitter users. The test scenarios for this final project focus on the pre-processing and modeling stages and are divided into three. The first scenario compares coherence scores with and without normalization, i.e. the process of expanding abbreviations into normal words, to find out whether normalization affects the coherence score. The second scenario performs hyperparameter tuning at the modeling stage to find the most optimum combination of the number of topics and the alpha and beta values; with optimal values, the best coherence score is expected. This tuning is applied in every scenario, so the number of topics and the alpha and beta values differ depending on the best tuning results. The third scenario compares the coherence score methods c_v and c_umass to determine which method is better for assessing the accuracy of the models.

3.1 Result and Discussion of the Effect of no Normalizing Data with Hyperparameter Tuning (c_v)

In the first test, normalization is not applied at the pre-processing stage. Hyperparameter tuning is used to find the most optimum combination of the number of topics, the alpha value, and the beta value; the test was run 540 times to determine the combination that produces the maximum coherence value. In this first test, the c_v method is used as the coherence score method.

Table 6. First Test Result

Pre-processing Num of Topic Alpha Beta Coherence score (c_v)

No Normalizing 2 0.61 0.31 0.46779

3 0.01 0.61 0.51066

4 0.31 0.61 0.45559

5 0.31 0.61 0.55049

6 0.01 0.31 0.53041

7 0.01 0.61 0.56258

8 0.61 0.01 0.50878

9 0.31 0.01 0.51501

10 0.61 0.01 0.50673

As the first test results in Table 6 show, each number of topics was run 30 times to determine the best combination of alpha and beta values, and the best coherence score for each number of topics is reported. In this first test, the best number of topics was 7 with an alpha value of 0.01 and a beta value of 0.61, yielding a c_v coherence score of 0.56258.
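The tuning procedure can be sketched as a grid search. The exact alpha/beta grid is not given in the paper, so three values each are shown here, and the scoring function is a stand-in for training an LDA model and computing its coherence:

```python
from itertools import product

def evaluate(num_topics: int, alpha: float, beta: float) -> float:
    # Stand-in for: train LDA with these hyperparameters, return its coherence.
    # This dummy function peaks at 7 topics, alpha=0.01, beta=0.61 for illustration.
    return -abs(num_topics - 7) - abs(alpha - 0.01) - abs(beta - 0.61)

grid = {
    "num_topics": range(2, 11),       # 2..10 topics, as in the result tables
    "alpha": [0.01, 0.31, 0.61],      # assumed grid values
    "beta": [0.01, 0.31, 0.61],       # assumed grid values
}

# Exhaustively score every combination and keep the best one.
best = max(
    product(grid["num_topics"], grid["alpha"], grid["beta"]),
    key=lambda combo: evaluate(*combo),
)
print(best)
```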

3.1.1 First Test Topic Modeling Result

Table 7 shows the modeling result after hyperparameter tuning for the first test. There are 7 topics with the terms that build them; each topic ID corresponds to one topic of discussion.

Table 7. First Test Topic Modeling Result

Topic 0
Terms: tel, beasiswa, daftar, kuliah, jurus, terima, ptn, lolos, masuk, cek, mahal, jpa, web, bayar
Topic: Acceptance of new students at Telkom University

Topic 1
Terms: telkom, university, tel, indonesia, digital, baik, universitas, peringkat, bandung, inovasi, hasil, mahasiswa, pts, prestasi, versi, swasta
Topic: Telkom University rank

Topic 2
Terms: tel, karir, info, virtual, media, facebook, social, website, twitter, pantau, linkedin, webinar, gratis, ajak, buka, instagram, acara
Topic: Career webinars held by Telkom University

Topic 3
Terms: tel, hubung, info, email, kunjung, hadir, id, follow, media, nomor, social, kontak, karir, twitter, facebook, daftar, seminar
Topic: Information on registration for seminars held by Telkom University

Topic 4
Terms: tel, asrama, informasi, jalan, gedung, cerita, dingin, kamar, pandemi, vaksin
Topic: Telkom University dormitory environment

Topic 5
Terms: congratulation, ukm, akuntansi, ekonomi, festival, kompetisi
Topic: Achievements of UKM participating in competitions

Topic 6
Terms: teknik, dkv, masjid, bem, dayeuhkolot, area, suasana, gurun, pasir, bisnis
Topic: Departments and faculties and the atmosphere around Telkom University


3.1.2 Visualization

Figure 2. Visualization of First Topic Modeling Result

3.2 Result and Discussion of the Effect of Normalizing Data with Hyperparameter Tuning (c_v)

In the second test, normalization is applied at the pre-processing stage. Hyperparameter tuning is used to find the most optimum combination of the number of topics, the alpha value, and the beta value; the test was run 540 times to determine the combination that produces the maximum coherence value. In this second test, the c_v method is used as the coherence score method.

Table 8. Second Test Result

Pre-processing Num of Topic Alpha Beta Coherence score (c_v)

Normalizing 2 0.01 0.61 0.39237

3 0.01 0.31 0.40244

4 0.61 0.61 0.40972

5 0.01 0.01 0.41033

6 0.61 0.31 0.45523

7 0.01 0.61 0.41621

8 0.31 0.61 0.47746

9 0.61 0.31 0.49322

10 0.31 0.31 0.44489

As the second test results in Table 8 show, each number of topics was run 30 times to determine the best combination of alpha and beta values, and the best coherence score for each number of topics is reported. In this second test, the best number of topics was 9 with an alpha value of 0.61 and a beta value of 0.31, yielding a c_v coherence score of 0.49322.

3.2.1 Second Test Topic Modeling Result

Table 9 shows the modeling result after hyperparameter tuning for the second test. There are 9 topics with the terms that build them; each topic ID corresponds to one topic of discussion.

Table 9. Second Test Topic Modeling Result

Topic 0
Terms: tel, beasiswa, daftar, terima, jalur, ptn, masuk, lolos, tes, universitas, jpa, lulus, web, nilai, smb
Topic: Acceptance of new students at Telkom University

Topic 1
Terms: universitas, telkom, inovasi, indonesia, swasta, pts, baik, kampus, versi, indonesia, mahasiswa, langsung
Topic: Telkom University rank

Topic 2
Terms: rumah, online, dekat, semester, orang, bagus, telyu, terima, daerah
Topic: Online semester

Topic 3
Terms: tel, karir, facebook, info, sosial, twitter, media, virtual, website, baru, pantau, linkedin, gratis, instagram, teman, ajak, buka, tunggu, alumni
Topic: Career webinars held by Telkom University

Topic 4
Terms: kuliah, jurus, isi, teman, mahal, anak, ajar, tolong, biaya, dapat, tel, kenal, program, teknik, studi, informatika, ilmu, komunikasi, ambil, dkv, telekomunikasi
Topic: Departments and faculties; expensive Telkom University costs

Topic 5
Terms: bandung, dimsum, usaha, kecil, tengah, cari, makan, jual, kabupaten, bojongsoang, lokasi
Topic: Places to eat around Telkom University

Topic 6
Terms: tel, info, hubung, email, hadir, id, tunggu, kunjung, follow, admin, media, nomor, kontak, sosial, karir, talk
Topic: Information on registration for seminars held by Telkom University

Topic 7
Terms: tel, universitas, yuk, ikut, mahasiswa, youtube, daftar, zoom, buku, cari, sertifikat, meeting, tanggal, ekonomi, maret, februari, regsistrasi, live, waktu
Topic: Invitation to join seminars held by Telkom University

Topic 8
Terms: tel, digital, marketing, informasi, lihat, ketemu, radio, gelar, latih, asal
Topic: Digital marketing training held by Telkom University

3.2.2 Visualization

Figure 3. Visualization of Second Topic Modeling Result

3.3 Result and Discussion of the Effect of no Normalizing Data with Hyperparameter Tuning (c_umass)

In the third test, normalization is not applied at the pre-processing stage. Hyperparameter tuning is used to find the most optimum combination of the number of topics, the alpha value, and the beta value; the test was run 540 times to determine the combination that produces the maximum coherence value. In this third test, the c_umass method is used as the coherence score method.

Table 10. Third Test Result

Pre-processing Num of Topic Alpha Beta Coherence score (c_umass)

No Normalizing 2 0.61 0.31 -9.04189

3 0.61 0.01 -10.43743

4 0.01 0.01 -11.30667

5 0.61 0.01 -13.05253

6 0.61 0.01 -12.90667

7 0.31 0.01 -13.97137

8 0.61 0.01 -14.53806

9 0.31 0.01 -15.40716

10 0.31 0.01 -15.24572

As the third test results in Table 10 show, each number of topics was run 30 times to determine the best combination of alpha and beta values, and the best coherence score for each number of topics is reported. In this third test, the best number of topics was 9 with an alpha value of 0.31 and a beta value of 0.01, yielding a c_umass coherence score of -15.40716.

3.3.1 Third Test Topic Modeling Result

Table 11 shows the modeling result after hyperparameter tuning for the third test. There are 9 topics with the terms that build them; each topic ID corresponds to one topic of discussion.

Table 11. Third Test Topic Modeling Result

Topic 0
Terms: telkom, university, media, mahasiswa, linkedin, ajar, daftar, web, hubung, indonesia, universitas, digital, alumni, acara, marketing
Topic: Digital marketing training held by Telkom University

Topic 1
Terms: tel, ptn, anak, masuk, kunjung, jpa, email, utbk, rapor, bagus, november
Topic: Acceptance of new students at Telkom University

Topic 2
Terms: tel, bandung, ambil, ngisi, telyu, dapet, akuntansi, kuliah, rektor, sma, sekolah, oktober
Topic: High school students accepted at Telkom University

Topic 3
Terms: beasiswa, terima, lolos, gratis, teknik, prodi, pts, seleksi, raih
Topic: Acceptance of new students via the scholarship path

Topic 4
Terms: jurus, kuliah, swasta, pts, baik, peringkat, itb, unpar, versi, uph, ui, indonesia, ptn, unpad
Topic: Telkom University ranking among other universities

Topic 5
Terms: daftar, isi, teman, ikut, kelas, talk, informasi, maret, ig, jam, wib, cipta
Topic: Invitation to join seminars held by Telkom University

Topic 6
Terms: info, hadir, hasil, univ, dunia, sertifikat, competition, acara, twitter, uin, trisakti, surabaya, tema, selenggara, kuat, tema, bismillah
Topic: Competitions attended by various universities

Topic 7
Terms: buka, jalur, kampus, lengkap, kontak, giat, butuh, usaha
Topic: Information on new student admissions

Topic 8
Terms: telkom, university, beasiswa, program, selamat, pintar, semestero
Topic: Acceptance of new students via the scholarship path

3.3.2 Visualization

Figure 4. Visualization of Third Topic Modeling Result

3.4 Result and Discussion of the Effect of Normalizing Data with Hyperparameter Tuning (c_umass)

In the fourth test, normalization is applied at the pre-processing stage. Hyperparameter tuning is used to find the most optimum combination of the number of topics, the alpha value, and the beta value; the test was run 540 times to determine the combination that produces the maximum coherence value. In this fourth test, the c_umass method is used as the coherence score method.

Table 12. Fourth Test Result

Pre-processing Num of Topic Alpha Beta Coherence score (c_umass)

Normalizing 2 0.01 0.31 -9.83596

3 0.01 0.01 -10.05173

4 0.01 0.01 -10.73613

5 0.31 0.01 -11.88108

6 0.01 0.01 -11.85203

7 0.61 0.01 -12.67053

8 0.31 0.01 -13.75255

9 0.61 0.01 -13.57067

10 0.31 0.01 -13.68406

As the fourth test results in Table 12 show, each number of topics was run 30 times to determine the best combination of alpha and beta values, and the best coherence score for each number of topics is reported. In this fourth test, the best number of topics was 10 with an alpha value of 0.31 and a beta value of 0.01, yielding a c_umass coherence score of -13.68406.

3.4.1 Fourth Test Topic Modeling Result

Table 13 shows the modeling result after hyperparameter tuning for the fourth test. There are 10 topics with the terms that build them; each topic ID corresponds to one topic of discussion.

Table 13. Fourth Test Topic Modeling Result

Topic 0
Terms: universitas, telkom, ptn, swasta, linkedin, pts, baik, indonesia, peringkat, versi, mahasiswa, teknologi
Topic: Telkom University rank

Topic 1
Terms: tel, daftar, terima, jalur, beasiswa, institute, bandung, pilih
Topic: Acceptance of new students via the scholarship path

Topic 2
Terms: tel, teman, universitas, masuk, media, ajak, teknik, uii, uph, umum, publik
Topic: Invitation to register at a university

Topic 3
Terms: media, kunjung, lulus, hubung, dapat, hasil, oktober, unpar, oktober, kampus, negeri, masuk, seleksi, giat, mandiri
Topic: Announcement of college entrance results

Topic 4
Terms: kuliah, jurus, cek, universitas, gratis, biaya, negeri, upi, uns, atma, ugm, padjajaran
Topic: Entrance fees per university

Topic 5
Terms: yuk, ikut, digital, webinar, program, studi, acara, komunikasi, indonesia, sertifikat, bojongsoang, program
Topic: Invitation to join seminars held by Telkom University

Topic 6
Terms: tel, isi, bandung, lolos, instagram, putar, cari, cuma, seleksi, rektor, purwokerto, dinas
Topic: Telkom University passing a competition selection

Topic 7
Terms: email, hadir, nilai, smb, dkv, lengkap, kontak, telyu, free, online, semester
Topic: Information on new student admissions

Topic 8
Terms: beasiswa, info, web, internasional, khusus, mohon, kelas, lain
Topic: Information about international classes

Topic 9
Terms: buka, mahasiswa, tes, salah, daftar, teknologi, bandung, utbk, maret, februari, nasional
Topic: Information about the student entrance test

3.4.2 Visualization

Figure 5. Visualization of Fourth Topic Modeling Result

4. CONCLUSION

Based on the tests carried out with the scenarios made for topic modeling on a dataset of Indonesian-language tweets using the LDA method, this study concludes that the most optimum system performance is produced without normalization at the pre-processing stage, with a combination of 9 topics, an alpha value of 0.31, and a beta value of 0.01; with this combination, the coherence score using the c_umass method is -15.33029. The alpha and beta values in LDA modeling affect the coherence score because alpha represents document-topic density and beta represents topic-word density. The topics generated by the model can be seen in Table 11, and most of the topics discussed concern new student admissions. Normalization reduces the coherence score because it changes much of the original text in the dataset.

This is because the tweet dataset contains many non-standard words, and checking non-standard words is difficult on a dataset of this size, which can degrade the performance of the system. Some examples of informal sentences and foreign-language words found in the dataset are: "iyess tp skrg baru itb sama universitas telkom doang yang bisa" and "webinar tech hiring day sampurasun akang teteh sadayana rangkaian webinar sirclo tech hiring day hadir pertama kali di bandung webinar di bandung kali ini sirclo akan berkolaborasi dengan universitas telkom yang akan dilaksanakan pada". In addition, the c_v method proved problematic for determining the coherence score: the score could change suddenly without any change to the model's hyperparameters. The c_umass method is therefore better for determining the coherence score because it is more consistent.

