News Recommender System Based on User Log History Using Rapid Automatic Keyword Extraction


Inggrid Resmi Benita, Z K A Baizal^*

Informatics, School of Computing, Telkom University, Bandung, Indonesia Email: ¹[email protected], ²,*[email protected]

Corresponding Author Email: [email protected]

Abstract−There are many ways to find information; one of them is reading online news. However, finding news online is becoming more difficult because readers must visit multiple platforms to find information. In many prior works, news recommendations are based on trending topics, so the recommended news does not necessarily match the user's interests. To overcome this, we built a web-based news recommender system that makes it easier for users to find news. We use the Rapid Automatic Keyword Extraction (RAKE) method in the recommendation process because it allows news to be recommended according to user preferences by utilizing user history logs. The keywords that RAKE extracts from the title and content of each news item are converted into a vector representation using a count vectorizer, and the Cosine Similarity function is applied to compare similarities between news. The test results show that the average performance of our proposed system is 90.8%, which outperforms earlier systems with respect to the goals of a recommender system, i.e., diversity, novelty, and relevance.

Keywords: Online News; News Recommender System; Rapid Automatic Keyword Extraction

1. INTRODUCTION

Over time, the internet has come to play an essential role in accessing information, as more and more people worldwide look for information online, including from online news. Consequently, the number of online media outlets producing news is increasing. As a result, people have difficulty finding the news they want within a limited time [1], [2]. They also have to visit several news websites to scan for what they want [3], [4]. A news recommender system can address these problems, for example by producing personalized news content. Li, M. et al. [5] explain that personalized news content can help users find the news most likely to match their interests. Producing personalized news content requires the user's interest preferences, which can be obtained from user history log data.

There are several studies related to news recommender systems. Wang, Z. et al. [1] developed a news recommender system based on keyword extraction. That study concluded that keyword extraction is better able to recommend the latest news topics than other recommendation techniques.

However, this study has a weakness: people's names and nouns are still defined as candidate keywords. In addition, the recommended news is based on trends only, so it does not match users' interests. Wang, Y. et al. [6] developed a news recommender system based on user behavior.

Their news recommender system uses a collaborative filtering method that utilizes user behavior attributes, i.e., browsing time. However, that study ignores the level of accuracy because it focuses only on the algorithm's efficiency. Based on these prior works, we created a news recommender system based on user log history using keyword extraction. The method we use for keyword extraction is RAKE, an unsupervised method that can extract keywords from larger amounts of data and from more types of individual documents than other keyword extraction algorithms [7], [8]. Several studies also discuss RAKE. Thushara, M. G. et al. [8] compared the performance of three keyphrase extraction algorithms: Textrank, RAKE, and Position Rank. Based on the comparison results, Position Rank gives better results than Textrank and RAKE.

Textrank does not consider the positions at which keywords occur in a document, while RAKE yields irrelevant results if stopwords are not taken into account.

Huang, H. et al. [7] developed a keyword extraction algorithm called NER-RAKE, which combines Named Entity Recognition (NER) and RAKE and is intended for keyword extraction from scientific literature. The NER step in this method optimizes RAKE's candidate keyword selection while retaining its fast and practical features. Meanwhile, Hu, J. et al. [9] developed a keyword extraction algorithm based on a distributed skip-gram model. Their algorithm gives effective keyword extraction results compared with frequency-based methods, Term Frequency-Inverse Document Frequency (TF-IDF), Textrank, and RAKE.

Based on these problems and previous work, this study builds a web-based news recommender system based on user log history using RAKE, so that the recommended news matches the user's interests. In addition, to improve on previous research, RAKE in this study identifies people's names and nouns so that they are not entered as keyword candidates. The rest of this paper is organized as follows: Section 2 describes the research methodology, including the system architecture and the recommendation process; Section 3 presents the results and the evaluation of the system; and Section 4 provides conclusions.


2. RESEARCH METHODOLOGY

This section describes the recommender system that we built. The application flow starts when the user logs in. If the user is new or has never read any news, the system displays all the news ordered from the most recent. However, if the user has read one or more news items, the system recommends news based on the user history log using RAKE.
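The branching described above can be sketched as follows. This is a minimal illustration, not the system's actual code: the `News` record, the `news_for_user` function, and the `recommend` callable are hypothetical names introduced here, and the history-based recommender is passed in as a parameter.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class News:
    # Hypothetical record; the real system stores more fields per news item.
    news_id: int
    title: str
    published: date


def news_for_user(all_news, history, recommend):
    """Return the news list to display for one user.

    `history` is the list of news the user has already read, and `recommend`
    is any callable that ranks news against that history (assumed interface).
    """
    if not history:
        # New user or empty log: show everything, newest first.
        return sorted(all_news, key=lambda n: n.published, reverse=True)
    # Otherwise defer to the history-based recommender.
    return recommend(all_news, history)
```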

In more detail, Figure 1 describes how the system recommends news based on the user's history log, starting when the user reads a news item. First, the PHP server, which is responsible for accessing PostgreSQL, adds the read news to the PostgreSQL database as log history data. PostgreSQL stores the news log history data and user data. Then, the PostgreSQL database sends all news history data back to the PHP server, which passes it to the User Interface. From the User Interface, the news history data is sent to the Python server. The Python server is responsible for accessing the dataset, because the dataset in this study was collected offline before the system was run. After that, the Python server sends the recommended news to the User Interface. Thus, the user sees recommendations based on the user history log.

Figure 1. System Architecture Recommender News Based on History

Figure 2 describes the process flow of a recommendation using RAKE. First, the data is pre-processed to eliminate unnecessary elements such as stop words, punctuation, and white space, and to handle synonyms in the dataset. Then, the system generates a word representation by combining the 'Title' and 'Content' columns of the news into a 'Bag of words' and extracts keywords from the complete sentences in the 'Bag of words' using RAKE. Next, the keywords are converted to lowercase to eliminate duplications. Following that, the 'Bag of words' is converted to a vector representation, which is a simple frequency count of each word in the 'Bag of words', and a cosine similarity matrix is used to compare similarities between news. The final step is to match the news titles in the user history log to the corresponding news index in the Similarity Matrix. The recommendation results are sorted from the highest Cosine Similarity value.

Figure 2. Flow of Recommendation Process


2.1 Candidate Keywords Extraction

The candidate keywords are extracted using RAKE, an automatic, domain-independent method for extracting keywords from a single document [10], [11]. We use RAKE to extract sets of relevant words from the combined title and content of each news item as candidate keywords. People's names and nouns are excluded from the candidate keywords.
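To illustrate the general idea of RAKE, the following is a minimal sketch of its usual formulation: text is split into candidate phrases at stopwords and punctuation, each word is scored by its degree divided by its frequency, and a phrase is scored by the sum of its word scores. The function names and the tiny English stopword list are assumptions made for illustration, and the exclusion of people's names and nouns described above is not implemented here.

```python
import re
from collections import defaultdict

# Tiny illustrative stopword list (assumption); a real system would use a
# full list for the language of the news.
STOPWORDS = {"a", "an", "and", "for", "in", "is", "of", "on", "the", "to", "with"}


def candidate_phrases(text, stopwords=STOPWORDS):
    """Split text into candidate phrases at punctuation and stopwords."""
    phrases = []
    # Punctuation marks act as hard phrase boundaries.
    for fragment in re.split(r'[.,;:!?()"]+', text.lower()):
        current = []
        for word in re.findall(r"[a-z0-9]+", fragment):
            if word in stopwords:
                # A stopword closes the phrase collected so far.
                if current:
                    phrases.append(tuple(current))
                current = []
            else:
                current.append(word)
        if current:
            phrases.append(tuple(current))
    return phrases


def rake_keywords(text, stopwords=STOPWORDS):
    """Return (phrase, score) pairs sorted by descending RAKE score."""
    phrases = candidate_phrases(text, stopwords)
    freq = defaultdict(int)
    degree = defaultdict(int)
    for phrase in phrases:
        for word in phrase:
            freq[word] += 1
            # Degree counts co-occurrences within the phrase, including itself.
            degree[word] += len(phrase)
    scores = {}
    for phrase in set(phrases):
        scores[" ".join(phrase)] = sum(degree[w] / freq[w] for w in phrase)
    # Sort by score (highest first), breaking ties alphabetically.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
```

For example, calling `rake_keywords("Metallica donates funds for Ukrainian refugees. Metallica is a band.")` ranks the phrase "metallica donates funds" first, because its words co-occur in the longest candidate phrase.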

2.2 Count vectorizer

The recommendation model can only compare vectors (matrices) with one another [12]. In this study, the title and content of the news, after being processed into candidate keywords, are therefore represented as vectors using a count vectorizer.
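What a count vectorizer produces can be shown with a small pure-Python stand-in. This is a sketch under simplifying assumptions (whitespace tokenization, lowercase only), and the function names are introduced here for illustration; a library implementation such as scikit-learn's `CountVectorizer` applies its own tokenization rules.

```python
from collections import Counter


def build_vocabulary(documents):
    """Map each distinct lowercase token to a column index (sorted order)."""
    tokens = sorted({tok for doc in documents for tok in doc.lower().split()})
    return {tok: i for i, tok in enumerate(tokens)}


def count_vectorize(documents):
    """Return (vocabulary, matrix) where matrix[d][j] counts token j in doc d."""
    vocab = build_vocabulary(documents)
    matrix = []
    for doc in documents:
        counts = Counter(doc.lower().split())
        row = [0] * len(vocab)
        for tok, count in counts.items():
            row[vocab[tok]] = count
        matrix.append(row)
    return vocab, matrix
```

Each row of the resulting matrix is the vector for one news item, and every row has the same length, which is what makes the rows comparable with one another.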

2.3 Cosine Similarity

After obtaining a matrix of vectors containing the counts of all candidate keywords, we apply the Cosine Similarity function to compare the similarity between news. The Cosine Similarity between two vectors [13] is calculated according to (1), where vector a represents one object (a news item), vector b represents another object, and n is the number of components in each vector.

$$\mathrm{similarity}(a, b) = \frac{a \cdot b}{\|a\|\,\|b\|} = \frac{\sum_{i=1}^{n} a_i b_i}{\sqrt{\sum_{i=1}^{n} a_i^2}\,\sqrt{\sum_{i=1}^{n} b_i^2}} \qquad (1)$$
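Equation (1) translates directly into code. The sketch below computes the similarity of two count vectors and ranks the other items against one item that has been read; the function names, the `top_n` parameter, and the zero-vector convention are assumptions made for illustration rather than details taken from the system.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length count vectors, per Equation (1)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        # Convention (assumption): an all-zero vector is similar to nothing.
        return 0.0
    return dot / (norm_a * norm_b)


def rank_similar(vectors, read_index, top_n=5):
    """Rank all other items by similarity to the item at read_index."""
    scored = [
        (i, cosine_similarity(vectors[read_index], v))
        for i, v in enumerate(vectors)
        if i != read_index
    ]
    # Highest similarity first, as in the recommendation list.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]
```

Because count vectors have no negative components, the similarity always falls between 0 (no shared keywords) and 1 (identical direction).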

3. RESULT AND DISCUSSION

3.1 The Dataset

The news used as the dataset was taken from the news website CNN Indonesia in the periods 1 April 2022 - 7 April 2022 and 5 May 2022 - 15 May 2022. The news categories taken are lifestyle and entertainment. News data retrieval was done manually before the system ran. Each news record contains a news title, publication date, news content, news URL, and poster URL. In addition, a unique news ID is assigned to each record to avoid duplication.

3.2 The Result of Cosine Similarity Calculation

Table 1. Cosine Similarity Calculation Results

id | News Title | Cosine Similarity
21 | Waktu yang Tepat untuk Ikhtiar Jalani Program Bayi Tabung (English: The Right Time to Fight for the IVF Program) | 0.07
132 | Metallica Sumbang Rp7,1 M untuk Pengungsi Ukraina (English: Metallica Donates IDR 7.1 Billion for Ukrainian Refugees) | 0.06
77 | Cerita Desainer Soal Jas JK di Jepang: Dibuat 1x24 Jam (English: Designer's Story About JK Suits in Japan: Made 1x24 Hours) | 0.06
81 | Tiba di AS, Jokowi Gandeng Mesra Iriana saat Turun dari Tangga Pesawat (English: Arriving in the US, Jokowi Collaborates with Iriana when Descending from the Airplane Stairs) | 0.06
75 | Balenciaga Jual Sepatu Rusak Seharga Rp9 Juta (English: Balenciaga Sells Damaged Shoes for IDR 9 Million) | 0.05
45 | Waktu Terbaik untuk Berolahraga saat Puasa (English: The Best Time to Exercise While Fasting) | 0.04
185 | Akhir Pekan Panjang, Jaringan Cinema XXI Tambah Jam Tayang (English: Long Weekend, Cinema XXI Network Adds Showtimes) | 0.04

Table 1 shows the calculation results for the news stored in the history log titled "Temui Jokowi, Elon Musk Pakai Kaos Rp349 Ribu (English: Meet Jokowi, Elon Musk Wears Rp 349 Thousand T-shirt)". The system performs the RAKE calculation based on the user's history log, focusing on the title and content of the news. The result is then converted into a vector representation using the Count Vectorizer, after which the system retrieves the Cosine Similarity values for the news title in the history log.

3.3 The Result of Website News Recommender System

The news website that we created to recommend news is called Verofinnews. As shown in Figure 3, the left side of the Verofinnews home page contains the home, history, and profile menus, while the right side shows several news recommendations, based on the user history log, that the user can read. In addition, on the home page, users can filter news by category, i.e., entertainment or lifestyle.


Figure 3. Homepage Website

News is recommended based on the Cosine Similarity calculation, ordered from the highest similarity score. The data used to calculate Cosine Similarity are the title and content of the news. Users can also search for news by keyword on the home page. After the user enters a keyword, the system displays the news whose titles contain that keyword, as shown in Figure 4.

Figure 4. Search News by Keywords

3.4 Evaluation

The system's performance in this study was evaluated using one of the techniques in the inquiry method, i.e., a survey [14], [15] of 65 respondents, who assessed the tested system by filling in an evaluation form. The respondents consisted of 33 students, 16 employees, and 16 others. The data obtained from the performance evaluation are the respondents' subjective statements assessing the Relevance, Novelty, and Diversity of, and their satisfaction with, the Verofinnews website [16], [17].

Diversity means that the recommended news is varied; in other words, the recommended news on the home page is not tied to a single story. For example, if the user already has a history log, the system recommends other news similar to the news in the history log. Relevance means that the recommendations are related to what the user has read: when the user has read news, the recommended news on the homepage relates to the news in the history log. Novelty refers to the newness offered by the study that is useful for people's lives.

In evaluating the performance of the system, respondents chose scores on a scale of 1 to 5. After all the assessment scores were collected, the scores for each parameter were added up and divided by the maximum score to obtain the percentage result of the performance test [16], as shown in Table 2.
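The percentage calculation can be sketched as below. The sketch assumes that "the maximum score" means the highest total attainable, i.e., the top rating of 5 multiplied by the number of ratings; the function name and the sample ratings are hypothetical and are not the survey data.

```python
def performance_percentage(scores, max_score=5):
    """Sum the ratings and divide by the maximum attainable total, in percent."""
    if not scores:
        raise ValueError("at least one rating is required")
    return 100.0 * sum(scores) / (max_score * len(scores))


# Hypothetical ratings from four respondents (not the study's data):
# (5 + 4 + 5 + 3) / (5 * 4) = 17 / 20, i.e. 85.0 percent.
example = performance_percentage([5, 4, 5, 3])
```

The overall figure is then the mean of the per-parameter percentages; for the values reported in Table 2, (92.3 + 91.4 + 88.6) / 3 ≈ 90.8.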

Table 2. Percentage of System Performance Test Results

Rating Parameters | Score
Diversity | 92.3%
Relevance | 91.4%
Novelty | 88.6%
Average | 90.8%


Based on Table 2, our proposed system has a higher accuracy value than the previous system in terms of performance. However, respondents' satisfaction ratings for the User Interface of the Verofinnews website indicate that the system design does not yet meet usability criteria. Because the interface has not satisfied the respondents, the page layout needs to be improved and other features added.

4. CONCLUSION

This paper proposes a novel news recommender system using RAKE based on user history logs. The test results show that the news recommendation process based on the history log is carried out correctly: the recommended news is similar to the news in the history log and matches user interests. In addition, the RAKE method used successfully excluded people's names and nouns from the candidate keywords. The news recommender system achieved a Diversity value of 92.3%, Relevance of 91.4%, and Novelty of 88.6%, for an average rating of 90.8%. However, the system design has not yet met usability criteria, and the news dataset used in this study is relatively limited in terms of categories. Therefore, future work should measure the user experience aspect so that the system design meets the usability criteria, and should expand the news dataset to reach a wider audience.

REFERENCES

[1] Z. Wang, K. Hahn, Y. Kim, S. Song, and J. M. Seo, “A news-topic recommender system based on keywords extraction,” Multimedia Tools and Applications, vol. 77, no. 4, 2018, doi: 10.1007/s11042-017-5513-0.

[2] A. A. Fakhri, Z. K. A. Baizal, and E. B. Setiawan, “Restaurant Recommender System Using User-Based Collaborative Filtering Approach: A Case Study at Bandung Raya Region,” in Journal of Physics: Conference Series, 2019, vol. 1192, no. 1. doi: 10.1088/1742-6596/1192/1/012023.

[3] W. Hariri, K. I. Ghauth, and C. Eswaran, “A Multimedia Content Recommender System Using Table of Contents and Content-Based Filtering,” Advanced Science Letters, vol. 24, no. 2, 2018, doi: 10.1166/asl.2018.10699.

[4] Z. K. A. Baizal, D. H. Widyantoro, and N. U. Maulidevi, “Computational model for generating interactions in conversational recommender system based on product functional requirements,” Data and Knowledge Engineering, vol. 128, 2020, doi: 10.1016/j.datak.2020.101813.

[5] M. Li and L. Wang, “A Survey on Personalized News Recommendation Technology,” IEEE Access, vol. 7, 2019, doi: 10.1109/ACCESS.2019.2944927.

[6] Y. Wang and W. Shang, “Personalized news recommendation based on consumers’ click behavior,” 2016. doi: 10.1109/FSKD.2015.7382016.

[7] H. Huang, X. Wang, and H. Wang, “NER-RAKE: An improved rapid automatic keyword extraction method for scientific literatures based on named entity recognition,” Proceedings of the Association for Information Science and Technology, vol. 57, no. 1, 2020, doi: 10.1002/pra2.374.

[8] M. G. Thushara, T. Mownika, and R. Mangamuru, “A comparative study on different keyword extraction algorithms,” 2019. doi: 10.1109/ICCMC.2019.8819630.

[9] J. Hu, S. Li, Y. Yao, L. Yu, G. Yang, and J. Hu, “Patent keyword extraction algorithm based on distributed representation for patent classification,” Entropy, vol. 20, no. 2, 2018, doi: 10.3390/e20020104.

[10] J. S. Baruni and J. G. R. Sathiaseelan, “Keyphrase Extraction from Document Using RAKE and TextRank Algorithms,” International Journal of Computer Science and Mobile Computing, vol. 9, no. 9, 2020, doi: 10.47760/ijcsmc.2020.v09i09.009.

[11] S. Anjali, M. Meera Nair, and M. G. Thushara, “A graph based approach for keyword extraction from documents,” 2019. doi: 10.1109/ICACCP.2019.8882946.

[12] J. Ng, “Content-based Recommender Using Natural Language Processing (NLP),” 2020. https://www.kdnuggets.com/2019/11/content-based-recommender-using-natural-language-processing-nlp.html (accessed Jul. 10, 2022).

[13] A. R. Lahitani, A. E. Permanasari, and N. A. Setiawan, “Cosine similarity to determine similarity measure: Study case in online essay assessment,” 2016. doi: 10.1109/CITSM.2016.7577578.

[14] S. P. Dewi, G. R. Dantes, and G. Indrawan, “EVALUASI USABILITY PADA ASPEK SATISFACTION MENGGUNAKAN TEKNIK KUESIONER PADA SISTEM LMS PROGRAM KEAHLIAN GANDA,” Jurnal Pendidikan Teknologi dan Kejuruan, vol. 15, no. 1, 2018, doi: 10.23887/jptk-undiksha.v15i1.13028.

[15] B. M. Maake, S. O. Ojo, and T. Zuva, “A Survey on Data Mining Techniques in Research Paper Recommender Systems,” 2019, pp. 119–143. doi: 10.4018/978-1-5225-8437-7.ch006.

[16] F. Ramadhan and A. Musdholifah, “Online Learning Video Recommendation System Based on Course and Sylabus Using Content-Based Filtering,” IJCCS (Indonesian Journal of Computing and Cybernetics Systems), vol. 15, no. 3, 2021, doi: 10.22146/ijccs.65623.

[17] M. Kunaver and T. Požrl, “Diversity in recommender systems – A survey,” Knowledge-Based Systems, vol. 123, 2017, doi: 10.1016/j.knosys.2017.02.009.