Proceeding of Endinamosis 2019 No ISSN: 2502-1478 Vol. 1 No. 3


ENDINAMOSIS 2019

3rd International Conference on

Rural Development and Community Empowerment

PROCEEDINGS OF

THE THIRD INTERNATIONAL CONFERENCE ON RURAL DEVELOPMENT AND COMMUNITY EMPOWERMENT

Bandung, Indonesia November 2-3, 2019

Center for Rural Areas Empowerment – ITB


3rd International Conference on Rural Development and Community Empowerment

Proceedings of the Third International Conference on Rural Development and Community Empowerment, Center for Rural Areas Empowerment – ITB

Bandung, November 2-3, 2019 No ISSN: 2502-1478 Vol. 1 No. 3


Organizing Committee

General Chairman

Prof. Dr. Delik Hudalah School of Architecture, Planning, and Policy Development ITB, Indonesia

Steering committee

Prof. Khairurrijal, Ph.D Head of Institute for Research and Community Services, Faculty of Mathematics and Natural Science ITB, Indonesia

Endra Susila, Ph.D Head of Center for Rural Areas Development, Faculty of Civil and Environment Engineering ITB, Indonesia

Taufikurahman, Ph.D School of Life Sciences and Technology ITB, Indonesia

Irwan Rusli, PhD Princeton Institute of Technology, USA

Brian Yuliarto, Ph.D Faculty of Industrial Engineering ITB, Indonesia

Dr. Irwan Meilano Faculty of Earth Sciences and Technology ITB, Indonesia

Ibnu Syabri, Ph.D School of Architecture, Planning, and Policy Development, Indonesia

Scientific Committee

M. Pauzi Zakaria, PhD Institute of Ocean and Earth Sciences, University of Malaya, Malaysia

Prof. Dr. Sukree Langputeh Fathoni University, Thailand

Dr. Li Zhang Tongji University, Shanghai, China

Prof. Tomoya Kaji Faculty of Law, Meiji Gakuin University, Japan

Prof. Lee Inhee Department of Architecture, Pusan National University, Korea

Prof. Budi Sulistianto Professor in Mining Engineering, the Faculty of Mining and Petroleum Engineering, Bandung Institute of Technology, Indonesia

dr. Astrid Kartika, MPP Australian Embassy, Indonesia

Dr. Eng. Bonivasius Prasetya Ichtiarto Director of Public Relations and International Cooperation, Ministry of Village, Development of Disadvantaged Regions, and Transmigration of Republic of Indonesia, Indonesia

Technician Committee

Maryam Al Lubbu Aulia Arfida Paramita Asih Suryati Marsha Elfiandri

Adhi Setyanto Dinar Bayuningtyas Nur Malatipah Ibnu Sina Mochammad Ihsanuddin Karimullah Isnu Putra Pratama


Prediction Popularity of Movies in Indonesia Using Data Mining Algorithm Classification

Daning Nur Sulistyowati¹*, Windu Gata², Fachri Amsury¹, Norma Yunita³, Agus Junaidi⁴, and Ahmad Setiadi⁵

1 Sistem Informasi, Sekolah Tinggi Manajemen Informatika dan Komputer Nusa Mandiri

2 Ilmu Komputer, Sekolah Tinggi Manajemen Informatika dan Komputer Nusa Mandiri

3 Sistem Informasi, Universitas Bina Sarana Informatika

4 Sistem Informasi Akuntansi, Universitas Bina Sarana Informatika

5 Rekayasa Perangkat Lunak, Universitas Bina Sarana Informatika

E-mail: [email protected]

Abstract. The rapid growth of the movie business has affected audiences, who now expect movies with good image quality, sound, story line and positive values, so that they remain enthusiastic about following the latest releases. Assessment is needed for the film industry to keep growing. Assessment by the public, and especially by movie audiences themselves, is used to determine the most desirable type of movie.

This study aims to predict the level of popularity of movies in Indonesia by analyzing film ratings. The ROC curve results showed AUC values of 0.836 for the Naive Bayes model, 0.818 for the K-Nearest Neighbors model and 0.767 for the decision tree model. Two of the three models, Naive Bayes and K-Nearest Neighbors, fall into the Good classification category, with AUC values between 0.80 and 0.90. The decision tree model falls into the Fair classification category, with an AUC value between 0.70 and 0.80.

Keywords: Classification, Prediction, Movie Popularity, Naïve Bayes, Data Mining

1. Introduction

The world of cinema has developed very rapidly, and this has affected film audiences, who expect films with quality images, sound, story lines and positive values. These qualities keep audiences enthusiastic about following the latest films, although not everyone likes films and some people like only certain types (genres).

For the film world to keep developing, it needs assessment. The assessments come from the public and especially from film enthusiasts themselves, and they are used to find out what types of film audiences want. An analysis of film ratings is therefore needed to determine the interests of film lovers.

Research by Ahmed, Jahangir, Afzal, Majeed, & Siddiqi (2015), titled Using crowd-source based features from social media and conventional features to predict the movies popularity, predicted the success of films using the Decision Tree algorithm with an accuracy rate of 61%. Building on that study, this study carries out similar work with a different case and different research methods. Whereas the previous study dealt with foreign films, this study takes the case of Indonesian cinema. The data are taken from the IMDb website (www.imdb.com) and filmindonesia (filmindonesia.co.id) and cover films from the past 10 years.

The method is the same as in the previous research, with the addition of the Naïve Bayes and K-Nearest Neighbors (KNN) algorithms, in order to find which algorithm gives the highest accuracy in predicting film popularity; both algorithms have advantages in classifying data.

Although many studies have taken films as their case, none of them uses data on Indonesian cinema; most researchers use foreign film data. In addition, the concept of film popularity prediction has not been raised much in research; more studies address film reviews or the popularity of a film.

Therefore, the authors conduct a study on the prediction of film popularity in Indonesia.

This study aims to predict the popularity of films in Indonesia over the past 10 years, using the Decision Tree, Naïve Bayes and K-Nearest Neighbors algorithms for each category with a value of k = 10.

2. Literature study

McLeod and Schell (2008: 258) suggest that “Data mining is the process of finding relationships in data that are unknown to the user. This process is the same as a miner who seeks gold in a mountain stream”.

Data mining is a new technology that is very useful in helping companies find important information in their data warehouses. Some data mining applications focus on prediction: they predict what will happen in new situations from data that describe what happened in the past (Witten, I. H., Frank, E., & Hall, 2011).

Data mining tools predict trends and patterns of business behavior, which is very useful in supporting important decisions. The automated analysis performed by data mining goes beyond that of the traditional decision support systems already in wide use. In particular, the collection of methods known as "data mining" offers methodologies and technical solutions for medical data analysis and the construction of predictive models (Bellazzi, R., & Zupanb, 2008).

Based on the tasks and objectives of the analysis, data mining processes can be divided into two main categories, depending on whether a target variable exists: supervised and unsupervised learning (Vercillis, 2009).

Classification is a technique that examines the behavior and attributes of groups that have already been defined. It classifies new data by using existing, already classified data to derive a number of rules, which are then applied to the new data. The technique uses supervised induction, which relies on a collection of examples from classified data sets (Iskandar, D., & Suprapto, 2015).

The C4.5 algorithm was introduced by J. Ross Quinlan, who also developed the ID3 algorithm; it is used to build a decision tree. Decision trees are built by dividing attribute values into branches for each possibility. A decision tree works by searching from the root along the branches until the class of an object is found. An instance is classified by directing it from the root of the tree to a leaf according to the results of the tests at the internal nodes (Alfisahrin, 2014).

Classification is a form of data analysis that extracts models describing important data classes. Classification has a variety of applications, including fraud detection, target marketing, performance prediction, manufacturing and medical diagnosis, owing to its high accuracy, predictive power and automated methods for finding hypotheses. In this phase, the Naive Bayes (NB) classifier is used to classify ECG signals as normal or abnormal, implemented in the RapidMiner tool. Naive Bayes classifiers are statistical classifiers that can predict class membership probabilities, such as the probability that a given tuple belongs to a particular class (Padmavathi & Ramanujam, 2015). The Naive Bayes algorithm classifies data using probability and statistical methods. The method was first introduced by the British scientist Thomas Bayes; it predicts future probabilities based on past experience and is known as Bayes' theorem (N. Nuraeni, 2017).
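As a minimal sketch of how such class membership probabilities can be computed, the Python code below estimates the posterior probability of each class for categorical attributes, using Laplace smoothing. The toy records, the attribute order and the function names are illustrative assumptions and do not reproduce the RapidMiner operator used in this study.

```python
from collections import Counter, defaultdict

def train_naive_bayes(rows, labels):
    """Count class frequencies and per-attribute value frequencies."""
    class_counts = Counter(labels)
    value_counts = defaultdict(Counter)  # (attribute index, class) -> value counts
    values_seen = defaultdict(set)       # attribute index -> distinct values
    for row, label in zip(rows, labels):
        for i, value in enumerate(row):
            value_counts[(i, label)][value] += 1
            values_seen[i].add(value)
    return class_counts, value_counts, values_seen

def predict_proba(model, row):
    """Return normalized posterior probabilities for each class."""
    class_counts, value_counts, values_seen = model
    total = sum(class_counts.values())
    scores = {}
    for label, n_label in class_counts.items():
        score = n_label / total  # prior P(class)
        for i, value in enumerate(row):
            # Laplace-smoothed likelihood P(value | class)
            count = value_counts[(i, label)][value]
            score *= (count + 1) / (n_label + len(values_seen[i]))
        scores[label] = score
    norm = sum(scores.values())
    return {label: s / norm for label, s in scores.items()}

# Hypothetical toy records: (genre, age rating)
rows = [("Drama", "R"), ("Drama", "R"), ("Horror", "R"),
        ("Horror", "13"), ("Comedy", "R"), ("Drama", "13")]
labels = ["POPULER", "POPULER", "POPULER",
          "TIDAK POPULER", "TIDAK POPULER", "POPULER"]

model = train_naive_bayes(rows, labels)
probs = predict_proba(model, ("Drama", "R"))
print(max(probs, key=probs.get), probs)
```

The "naive" part is the product over attributes, which treats the attributes as independent given the class.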

In pattern recognition, K-Nearest Neighbor (KNN) is a nonparametric algorithm used for classification and regression. Because it is a typical instance-based or memory-based learning scheme, all KNN computation is deferred until classification, and no explicit training step is needed to build the KNN classifier.

KNN is therefore a very simple but efficient algorithm, with time complexity O(1) for training and O(mn + m log2 m) for classifying a new example over a training set with m examples and n attributes. Here O(mn) is the time complexity of calculating the distance between the new instance and each training instance, and O(m log2 m) is the time complexity of sorting the distances to find the k closest neighbors of the new example (Wang, Ning, et al., 2015).
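The two cost terms can be seen in a brute-force implementation. The sketch below is illustrative only: the toy points, the Euclidean distance and the function name `knn_predict` are assumptions, not the configuration used in this study.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest neighbors."""
    # O(mn): one Euclidean distance per training example
    distances = [
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    ]
    # O(m log m): sort the distances to find the closest neighbors
    distances.sort(key=lambda pair: pair[0])
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]

# Hypothetical toy data with n = 2 numeric attributes
train_points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (5.0, 5.0), (5.2, 4.9)]
train_labels = ["A", "A", "A", "B", "B"]

print(knn_predict(train_points, train_labels, (1.1, 1.0), k=3))
print(knn_predict(train_points, train_labels, (5.1, 5.0), k=3))
```

There is no training function here: "training" amounts to keeping the labeled examples in memory, which is why all the work happens at classification time.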

A confusion matrix is a very useful tool for analyzing how well a classifier recognizes tuples from different classes (Utami, 2017). The ROC curve shows accuracy and allows classifiers to be compared visually. ROC is derived from the confusion matrix: it is a two-dimensional graph with the false positive rate on the horizontal axis and the true positive rate on the vertical axis (N. Nuraeni, 2017).
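A small worked sketch may help connect these quantities. The code below counts the four confusion matrix cells, derives accuracy, precision and recall from them, traces the ROC points and computes the area under the curve with the trapezoidal rule. The labels, scores and the 0.5 threshold are hypothetical, and the ROC helper assumes that all scores are distinct.

```python
def confusion_counts(actual, predicted, positive):
    """Return (TP, FP, FN, TN) for a binary problem."""
    pairs = list(zip(actual, predicted))
    tp = sum(a == positive and p == positive for a, p in pairs)
    fp = sum(a != positive and p == positive for a, p in pairs)
    fn = sum(a == positive and p != positive for a, p in pairs)
    tn = sum(a != positive and p != positive for a, p in pairs)
    return tp, fp, fn, tn

def roc_points(actual, scores, positive):
    """ROC points (false positive rate, true positive rate), one per threshold."""
    n_pos = sum(a == positive for a in actual)
    n_neg = len(actual) - n_pos
    # Lower the threshold step by step, from the highest score to the lowest
    ranked = sorted(zip(scores, actual), key=lambda pair: -pair[0])
    points, tp, fp = [(0.0, 0.0)], 0, 0
    for _, a in ranked:
        if a == positive:
            tp += 1
        else:
            fp += 1
        points.append((fp / n_neg, tp / n_pos))
    return points

def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(points, points[1:]))

# Hypothetical labels and classifier confidence scores for class "P"
actual = ["P", "P", "N", "P", "N", "N"]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
predicted = ["P" if s >= 0.5 else "N" for s in scores]

tp, fp, fn, tn = confusion_counts(actual, predicted, "P")
accuracy = (tp + tn) / len(actual)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
print(tp, fp, fn, tn, accuracy, precision, recall)
print(auc(roc_points(actual, scores, "P")))
```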

Using Crowd-Source Based Features From Social Media And Conventional Features To Predict The Movies Popularity (Ahmed et al., 2015). Predicting the success of films has attracted economists and investors (media and production houses) as well as prediction analysts. A number of attributes, such as cast, genre, budget, production house and PG rating, affect the popularity of a film. Social media such as Twitter and YouTube are the main platforms where people share their views about films.

That paper describes experiments in predictive analysis using machine learning algorithms on both conventional features, collected from film databases on the Web, and social media features (text comments on YouTube and tweets).

Role of Different Factors in Predicting Movie Success (Bhave, Kulkarni, Biramane, & Kosamkar, 2015). Owing to rapid digitalization and the emergence of social media, the film industry is growing rapidly. The average number of films produced per year is greater than 1000, so making a film profitable and successful is a challenge. With a low success rate, models and mechanisms that reliably predict box office rankings or movie collections can significantly reduce business risk and increase average returns.


Prediction of Movies Box Office Performance Using Social Media (Apala et al., 2013). That study implemented a data mining tool to produce interesting patterns for predicting box office performance, using data collected from various social media and web sources including Twitter, YouTube and the IMDb film database. The prediction is based on decision factors derived from a historical film database, the number of Twitter followers, and sentiment analysis of YouTube viewer comments.

Movies Popularity Prediction Using Social Media and Conventional Features (Jangid, Jadhav, Dhokate, Jadhav, & Bhandari, 2017). Nowadays the use of social media has increased widely. On social media such as Twitter and YouTube, users can post their reviews of a film. Economists, investors and prediction analysts are very interested in predicting the success of films. A number of factors affect the success and popularity of a film, such as actors, actresses, invested budget, production house, genre and PG rating. In that project, machine learning algorithms are used for predictive analysis.

Sentiment classification of movie reviews using the support vector machine algorithm (Yulietha, Faraby, & Adiwijaya, 2017). With advances in technology, information about all films is available on the Internet. If the information is processed well, quality information can be obtained.

Film Rating Prediction Using the Naïve Bayes Method (Nugroho & Pratiwi, 2016). The world of cinema is currently developing very rapidly, as shown by the number of films released in succession. Film enthusiasts also need films with good picture, sound and story line quality and positive values, so that they remain enthusiastic about following the latest films.

Sentiment Analysis Summarizing Film Review Using the Information Gain and K-Nearest Neighbor Method (Pristiyanti, Fauzi, & Muflikhah, 2017). Sentiment analysis of film reviews divides reviews into two groups, positive and negative. Grouping the results of sentiment analysis can be simplified with the k-nearest neighbor classification method, which searches for documents that are close to one another. That research uses the information gain method to reduce the number of features used during classification.

Text Mining For Analysis of Film Review Sentiments Using K-Means Algorithm (Budi, 2017). The ease with which people use websites has increased the number of text documents in the form of opinions and information, and over time the volume of text documents keeps growing. Text mining is one of the techniques used to explore a collection of text documents so that their essence can be extracted. Several algorithms are used to extract documents for sentiment analysis, one of which is K-Means, the algorithm used in that study.

Klasifikasi Sentimen pada Movie Review dengan Metode Multinominal Naïve Bayes (Wisudawati, Adiwijaya, & Faraby, 2017). Other people's opinions in movie reviews are very important in making a decision. A system is needed that makes it easier to find out someone's opinion from a movie review, and sentiment classification can help in building such a system. The dataset used in that sentiment classification process is the Internet Movie Database (IMDb). The problems in determining the polarity of an opinion in the movie review dataset are unstructured data, a large number of attributes, and negation, which causes the polarity of a word to differ between text contexts. Given these problems, the data in the dataset are classified into two polarity classes, positive and negative.

Comparison of machine learning classification algorithms and feature selection in sentiment analysis of movie reviews (Chandani, Wahono, & Purwanto, 2015). Sentiment analysis is a process that aims to determine whether the content of a text dataset is positive, negative or neutral. Today, public opinion is an important source in a person's decision about a product. Classification algorithms such as Naïve Bayes (NB), Support Vector Machine (SVM) and Artificial Neural Network (ANN) have been proposed by many researchers for film review sentiment analysis.

Prediction of song popularity based on Billboard pop charts using decision tree and k-means (Gumilar, Pudjiantoro, & Yuniarti, 2017). Billboard has been a trusted source of song popularity ratings for the past 60 years, and most record labels refer to the ratings given by Billboard. Hits are usually influenced not only by the lyrics and the artists who sing them but also by factors such as genre, record label and so forth. It would be very helpful if a record label could predict for itself whether a song can enter the Billboard rankings. The charts contain attributes such as artist, title, genre and others, and the combination of these attributes forms a pattern for classifying a song in the charts.

Prediction of Movies popularity Using Machine Learning Techniques (Latif & Afzal, 2016). A number of films are released every week. A large amount of film-related data is available through the internet, which makes this an interesting data mining topic. Film prediction is a complex problem. Audiences, producers, production houses and directors are all curious about films that will be screened in theaters. Much work has been done on films using social networks and blog articles, but the data and attributes associated with films, continuous and in different dimensions, have been explored less.

Implementation of Sentiment Analysis to Determine the Level of Popularity of Tourism Destinations (Murnawan & Sinaga, 2017). Sentiment analysis has been widely applied in various fields, and some companies have started to gain new benefits from applying it. Most algorithms for sentiment analysis are based on classification of training data built from a collection of text data; before training, features are extracted from the text collection. In that study, the researchers chose a classification strategy based on the naïve Bayes (NB) algorithm because it is a simple and intuitive method whose performance is similar to other approaches.

Determination of Credit Worthiness Using the Naïve Bayes Classifier Algorithm: A Case Study of Bank Mayapada PGC Branch Business Partners (Nia Nuraeni, 2017). When analyzing credit, loan officers are sometimes inaccurate, which can increase bad loans. Data mining classification algorithms are widely used to determine credit worthiness; one of them is the Naive Bayes classifier, which is superior in achieving high accuracy but weak in attribute selection.


3. Methodology

Research is an organized investigation carried out to present information and solve problems. The research method used is CRISP-DM, shown in Figure 1.

Figure 1. Research Stages. Source: (Sulistyowati & Gata, 2018)

3.1 Business Understanding

This stage is the research understanding stage: it determines the objectives of the research project and formulates the data mining problem. The data mining classification will produce classification results for each algorithm used, together with accuracy values that can be used in predicting the popularity of films in Indonesia.

3.2 Data Understanding

The data understanding phase involves collecting data, analyzing the data and evaluating the data. The data source used is Indonesian film data with 8 attributes. The data are then analyzed to estimate the amount of data to be retrieved and to count the records labeled popular and unpopular.

3.3 Data Preparation

This stage prepares the data that will be used in the modeling process; data are selected and distributed as needed. The data used are 556 film records, consisting of popular and unpopular movie titles, with 8 attributes and 1 label attribute obtained from conventional and social media features.

3.4 Modeling

Data processing is done at this stage using data mining techniques. The models used are the Decision Tree, Naive Bayes and K-Nearest Neighbors algorithms, which produce predictions; the Decision Tree algorithm also forms a decision tree.

[Figure 1 stages: Business Understanding (film popularity prediction); Data Understanding (local dataset, 8 attributes, 1 label); Data Preparation (data filtering); Modeling (prediction with the DT, NB and KNN algorithms); Evaluation (accuracy, AUC); Deployment (model simulator)]

3.5 Evaluation

The evaluation stage tests the accuracy of the classification models. Testing is carried out to see the results of applying the proposed algorithms, and evaluation uses the confusion matrix and ROC curves. The purpose of the evaluation is to determine the quality of the models created in the previous stage.

3.6 Deployment

The deployment stage is the final stage in the CRISP-DM model, where the results of all the previous stages are put into practice. At this stage a model simulator is implemented to predict whether films fall into the popular or unpopular category.

The population of this research is Indonesian films released in the last 10 years, with data drawn from the IMDb website, filmindonesia and social media such as official YouTube channels. The sample is data on popular and unpopular Indonesian films. The data are local and published by outside parties on the official IMDb and YouTube websites. The number of samples taken is shown in Table 1.

Table 1. Sample Dataset

Category         Count
Popular          437
Unpopular        119
Total samples    556
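The class proportions implied by Table 1 can be worked out directly. The sketch below also computes the accuracy of always predicting the larger class, a reference point commonly used with imbalanced classes; this baseline is an illustrative addition and is not reported in the study.

```python
# Class counts from Table 1
counts = {"popular": 437, "unpopular": 119}
total = sum(counts.values())

for label, n in counts.items():
    print(f"{label}: {n} ({n / total:.2%})")

# Accuracy of a trivial rule that always predicts the larger class
majority_baseline = max(counts.values()) / total
print(f"majority-class baseline accuracy: {majority_baseline:.2%}")
```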

The data collection method used in this study is primary data collection. The data used are from the IMDb (Internet Movie Database) dataset for the last 10 years, consisting of 9 variables and 556 movie titles.

The data were analyzed using movie titles with high and low ratings; the data were then processed and tested with the Decision Tree, Naive Bayes and K-Nearest Neighbors algorithms. The results were evaluated with the confusion matrix and the AUC to determine the accuracy of these methods.

4. Findings and Discussion

This study aims to predict whether films are popular or unpopular from Indonesian film data using the Decision Tree, Naive Bayes and K-Nearest Neighbors algorithms, as a reference for assessing predictions of future film popularity.

The data were collected from conventional features, by visiting the IMDb website, and from social media features, by visiting the official YouTube channel of each film's production house. The collected data, 556 movie titles screened in Indonesia from 2009 to 2018, are presented in Table 2.


Table 2. Indonesian Film Dataset (Sulistyowati & Gata, 2018)

Movie                                 Year  Rating  Genre      Age  Duration (min)  Views    Likes  Dislikes  Comments
308                                   2013  6.7     Horror     R    120             149053   332    23        40
#66                                   2016  5.6     Action     D    116             60895    120    20        34
12 Menit: Kemenangan untuk Selamanya  2014  8.6     Drama      R    112             100599   619    6         52
12:06 Rumah Kucing                    2017  2.7     Horror     13   78              91492    127    26        19
13 cara memanggil setan               2011  5.6     Horror     R    81              97418    20     28        17
13: The Haunted                       2018  5.1     Horror     13   107             842614   8800   601       884
22 Menit                              2018  7.2     Drama      13   71              234178   1000   88        103
3 hati dua dunia, satu cinta          2010  7.1     Romance    R    108             3142     14     1         1
3 Nafas Likas                         2014  7.6     Drama      13   105             78199    246    3         18
3 pejantan tanggung                   2010  4.6     Drama      R    80              23823    9      1         0
3 Pocong idiot                        2012  4.8     Comedy     R    81              57146    75     12        8
3 Srikandi                            2016  6.6     Biography  SU   121             101503   300    11        23
3600 Detik                            2014  6.8     Drama      R    90              975340   4400   142       222
5 cm                                  2012  7.2     Drama      R    126             986172   2800   160       294
9 Summers 10 Autumns                  2013  8.4     Drama      R    114             119453   354    7         62
99 Cahaya di Langit Eropa             2013  7.3     Drama      R    105             436321   1200   72        125
A Copy of My Mind                     2015  7.3     Drama      R    118             413596   392    66        20
A: Aku, Benci & cinta                 2017  7.5     Drama      13   94              2899045  22000  309       1216
About a Woman                         2015  9       Drama      D    76              22857    70     3         25

Table 2 shows that 8 attributes are used: 4 attributes obtained from the IMDb website (release year, genre, audience age limit and film duration) and 4 attributes obtained from YouTube (the number of views, likes, dislikes and comments), with the rating used as the popular or unpopular label. The collected film data are then filtered, producing the data shown in Table 3.


Table 3. Filtered Film Dataset (Sulistyowati & Gata, 2018)

No  Movie                                 Year  Genre      Age  Duration  Views       Likes  Dislikes     Comments     Rating
1   308                                   2013  Horror     R    PANJANG   <1JT        <30K   <500         <500         POPULER
2   #66                                   2016  Action     D    PANJANG   <1JT        <30K   <500         <500         POPULER
3   12 Menit: Kemenangan untuk Selamanya  2014  Drama      R    PANJANG   <1JT        <30K   <500         <500         POPULER
4   12:06 Rumah Kucing                    2017  Horror     R    PANJANG   <1JT        <30K   <500         <500         TIDAK POPULER
5   13 cara memanggil setan               2011  Horror     R    PANJANG   <1JT        <30K   <500         <500         POPULER
6   13: The Haunted                       2018  Horror     R    PANJANG   <1JT        <30K   500 sd 1000  500 sd 1000  POPULER
7   22 Menit                              2018  Drama      R    PANJANG   <1JT        <30K   <500         <500         POPULER
8   3 hati dua dunia, satu cinta          2010  Romance    R    PANJANG   <1JT        <30K   <500         <500         POPULER
9   3 Nafas Likas                         2014  Drama      R    PANJANG   <1JT        <30K   <500         <500         POPULER
10  3 pejantan tanggung                   2010  Drama      R    PANJANG   <1JT        <30K   <500         <500         TIDAK POPULER
11  3 Pocong idiot                        2012  Comedy     R    PANJANG   <1JT        <30K   <500         <500         TIDAK POPULER
12  3 Srikandi                            2016  Biography  SU   PANJANG   <1JT        <30K   <500         <500         POPULER
13  3600 Detik                            2014  Drama      R    PANJANG   <1JT        <30K   <500         <500         POPULER
14  5 cm                                  2012  Drama      R    PANJANG   <1JT        <30K   <500         <500         POPULER
15  9 Summers 10 Autumns                  2013  Drama      R    PANJANG   <1JT        <30K   <500         <500         POPULER
16  99 Cahaya di Langit Eropa             2013  Drama      R    PANJANG   <1JT        <30K   <500         <500         POPULER
17  A Copy of My Mind                     2015  Drama      R    PANJANG   <1JT        <30K   <500         <500         POPULER
18  A: Aku, Benci & cinta                 2017  Drama      R    PANJANG   1JT sd 5JT  <30K   <500         >1000        POPULER
19  About a Woman                         2015  Drama      D    PANJANG   <1JT        <30K   <500         <500         POPULER
20  Ada apa dengan pocong?                2011  Horror     R    PANJANG   <1JT        <30K   <500         <500         POPULER

Category labels are kept as in the dataset: PANJANG = long, JT = million, sd = to, POPULER = popular, TIDAK POPULER = unpopular.
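The filtering step turns the numeric columns of Table 2 into the categories of Table 3. The sketch below illustrates one way such binning could be written. The bin labels are those visible in Table 3; the `>5JT` label, the handling of values exactly on a boundary, and the rating threshold of 5.0 are assumptions that are merely consistent with the sample rows, since the source does not state the rules.

```python
def bin_views(views):
    """Bin the YouTube view count (JT = million, sd = to)."""
    if views < 1_000_000:
        return "<1JT"
    if views <= 5_000_000:
        return "1JT sd 5JT"
    return ">5JT"  # assumed label; not shown in Table 3

def bin_count(n):
    """Bins that appear for dislikes and comments in Table 3."""
    if n < 500:
        return "<500"
    if n <= 1000:
        return "500 sd 1000"
    return ">1000"

def label_rating(rating, threshold=5.0):
    # The threshold is an assumption consistent with the sample rows only
    return "POPULER" if rating >= threshold else "TIDAK POPULER"

# Three rows from Table 2: (title, rating, views, dislikes, comments)
sample = [
    ("13: The Haunted", 5.1, 842614, 601, 884),
    ("A: Aku, Benci & cinta", 7.5, 2899045, 309, 1216),
    ("3 Pocong idiot", 4.8, 57146, 12, 8),
]
for title, rating, views, dislikes, comments in sample:
    print(title, bin_views(views), bin_count(dislikes),
          bin_count(comments), label_rating(rating))
```

For these three films the sketch reproduces the corresponding rows of Table 3, which supports the assumed boundaries without confirming them.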

4.1 Modeling

This stage contains experiments and model testing with the algorithm models used. The models used in this study are data mining classifiers: the Decision Tree, Naive Bayes and K-Nearest Neighbors algorithms.

Figure 2 shows the prediction model, with one Read Excel operator as data input, a Multiply operator to run the branches in parallel, and three Cross Validation operators, one for each data mining algorithm tested.
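What a cross validation operator does can be sketched in a few lines of Python. The code below assumes 10 folds (the study mentions k = 10, and reading this as the number of folds is an assumption here) and uses a placeholder learner that always predicts the most common training label. The function names and the index-based `fit`/`predict` interface are hypothetical and are not RapidMiner's API.

```python
import random
from collections import Counter

def k_fold_indices(n_samples, k=10, seed=0):
    """Shuffle the indices and split them into k nearly equal folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]

def cross_validate(labels, fit, predict, k=10):
    """Average accuracy over k train/test splits."""
    accuracies = []
    for test_idx in k_fold_indices(len(labels), k):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(labels)) if i not in held_out]
        model = fit(train_idx)  # train on the other k - 1 folds
        correct = sum(predict(model, i) == labels[i] for i in test_idx)
        accuracies.append(correct / len(test_idx))
    return sum(accuracies) / k

# Labels with the class counts reported for the dataset
labels = ["POPULER"] * 437 + ["TIDAK POPULER"] * 119

def fit(train_idx):
    # Placeholder learner: remember the most common training label
    return Counter(labels[i] for i in train_idx).most_common(1)[0][0]

def predict(model, i):
    return model

mean_accuracy = cross_validate(labels, fit, predict)
print(f"{mean_accuracy:.4f}")
```

Replacing `fit` and `predict` with a real learner gives one accuracy estimate per algorithm, which is the role of the three parallel branches in the process.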

Figure 2. Data Mining Testing with RapidMiner

4.2 Decision Tree Algorithm

The decision tree is built by first counting the number of popular and unpopular films for each value of the attributes used. The data used were 556 film records, with 437 popular films and 119 unpopular films. Figure 3 shows the decision tree formed, whose root is the age variable. The goal is to obtain rules that will be used for decisions in new cases.

Figure 3. Display Decision Tree Formed

The model in Figure 4 is the algorithm operator inside the validation operator, here the decision tree algorithm model. Based on the test results, the accuracy of the decision tree algorithm is 76.45%.

Figure 4. Decision Tree Algorithm Model Validation

4.3 Naive Bayes Algorithm

A dataset of 556 films, consisting of 437 popular and 119 unpopular films, was used as input for Naïve Bayes classification in RapidMiner. The following shows the results of testing the Naive Bayes algorithm model using RapidMiner software.

The model in Figure 5 is the algorithm operator inside the validation operator, here the Naive Bayes algorithm model. Based on the test results, the accuracy of the Naive Bayes algorithm is 83.29%.

Figure 5. Naive Bayes Algorithm Validation Model

4.4 K-Nearest Neighbors Algorithm

A dataset of 556 films, consisting of 437 popular and 119 unpopular films, was used as input for K-Nearest Neighbors classification in RapidMiner. The following shows the results of testing the K-Nearest Neighbors algorithm model using RapidMiner software.

The model in Figure 6 is the algorithm operator inside the validation operator, here the K-Nearest Neighbors algorithm model. Based on the test results, the accuracy of the K-Nearest Neighbors algorithm is 80.95%.

Figure 6. K-Nearest Neighbors Algorithm Validation Model

4.5 Evaluation and Validation

The methods used in this research are the Naive Bayes, Decision Tree and K-Nearest Neighbors classification algorithms, which are then evaluated by AUC (Area Under Curve).


As discussed in the previous sections, the aim of this study is to measure how accurately film popularity can be predicted from predetermined variables.

The accuracy of the model is 76.45% with the decision tree algorithm, 83.29% with Naive Bayes and 80.95% with K-Nearest Neighbors. The models were tested to measure accuracy and AUC (Area Under Curve).

The test results of the three data mining classification algorithms are shown in Table 4. The ROC evaluation compares the AUC values calculated for each classifier; these results are shown in Table 5.

Table 4. Test Result Values

             Decision Tree   Naive Bayes   K-Nearest Neighbors
Accuracy     76.45%          83.29%        80.95%
Precision    82.78%          87.83%        86.92%
Recall       88.68%          91.24%        88.90%

Table 5. AUC Evaluation Results

             Decision Tree   Naive Bayes   K-Nearest Neighbors
AUC          0.767           0.836         0.818

The AUC evaluation gave the highest value for the Naive Bayes algorithm, 0.836, followed by the K-Nearest Neighbors algorithm at 0.818; the decision tree algorithm had the lowest value, 0.767.

Two of the three models, Naive Bayes and K-Nearest Neighbors, fall into the Good classification category because their AUC values are between 0.80 and 0.90. The decision tree model falls into the Fair classification category, with an AUC value between 0.70 and 0.80.
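The mapping from AUC value to category can be written as a small lookup. In the sketch below, the Good (0.80 to 0.90) and Fair (0.70 to 0.80) bands are those stated in the text; the remaining bands and the treatment of values exactly on a boundary follow a commonly used scale and are assumptions here.

```python
def auc_category(auc):
    """Map an AUC value to a qualitative category."""
    if auc >= 0.90:
        return "Excellent"  # assumed band, not stated in the text
    if auc >= 0.80:
        return "Good"       # 0.80 to 0.90, as stated in the text
    if auc >= 0.70:
        return "Fair"       # 0.70 to 0.80, as stated in the text
    if auc >= 0.60:
        return "Poor"       # assumed band
    return "Failure"        # assumed band

# AUC values reported in Table 5
results = {"Decision Tree": 0.767, "Naive Bayes": 0.836,
           "K-Nearest Neighbors": 0.818}
for name, value in sorted(results.items(), key=lambda kv: -kv[1]):
    print(f"{name}: AUC {value:.3f} -> {auc_category(value)}")
```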

5. Conclusion

Based on research that has been done by using three data mining classifications namely the Decision Tree algorithm, Naive Bayer and K-Nearest Neighbors which are then evaluated by AUC (Area Under Curve) using Indonesian film data to predict film popularity. The prediction value obtained after calculations using three data mining classifications produced 454 popular movie titles and 102 unpopular movie titles using the Decision Tree algorithm, Naive Bayer algorithm produced 446 popular movie titles and 110 unpopular movie titles and the K-Nearest algorithm Neighbors produced 443 popular films and 113 unpopular films.

The confusion matrix for the decision tree algorithm gives an accuracy of 76.45% with a precision of 82.78% and a recall of 88.68%. Naive Bayes gives an accuracy of 83.29% with a precision of 87.83% and a recall of 91.24%, and K-Nearest Neighbors gives an accuracy of 80.95% with a precision of 86.92% and a recall of 88.90%.

The evaluation with the ROC curve produced the highest AUC value for the Naive Bayes model (0.836), followed by the K-Nearest Neighbors model (0.818) and the decision tree model (0.767). Two of the three models, Naive Bayes and k-nearest neighbors, fall into the Good classification category because their AUC values lie between 0.80 and 0.90. The decision tree model falls into the Fair classification category because its AUC value lies between 0.70 and 0.80.

To produce better measurements, future work should add new variables that have more influence on film popularity, compare other algorithms such as random forest and support vector machines, or optimize the algorithms used here.

References

Ahmed, M., Jahangir, M., Afzal, H., Majeed, A., & Siddiqi, I. (2015). Using crowd-source based features from social media and conventional features to predict the movies popularity. Proceedings - 2015 IEEE International Conference on Smart City, SmartCity 2015, Stainable Computing and Communic, (March 2016), 273–278. https://doi.org/10.1109/SmartCity.2015.83

Alfisahrin, S. N. (2014). Komparasi Algoritma C4.5, Naive Bayes dan Neural Network Untuk Memprediksi Penyakit Jantung. Jakarta : Pascasarjana Magister Ilmu Komputer STMIK Nusa Mandiri.

Apala, K. R., Jose, M., Motnam, S., Chan, C., Liszka, K. J., & Gregorio, F. De. (2013). Prediction of Movies Box Office Performance Using Social Media. IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining, (July), 1209–1214. https://doi.org/10.1145/2492517.2500232

Bellazzi, R., & Zupan, B. (2008). Predictive Data Mining In Clinical Medicine: Current Issue And Guidelines. International Journal Of Medical Informatics, 81–97.

Bhave, A., Kulkarni, H., Biramane, V., & Kosamkar, P. (2015). Role of different factors in predicting movie success. 2015 International Conference on Pervasive Computing (ICPC), 00(c), 1–4. https://doi.org/10.1109/PERVASIVE.2015.7087152

Budi, S. (2017). Text Mining Untuk Analisis Sentimen Review Film Menggunakan Algoritma K-Means. Techno.COM, 16(1), 1–8.

Chandani, V., Wahono, R. S., & Purwanto. (2015). Komparasi Algoritma Klasifikasi Machine Learning Dan Feature Selection pada Analisis Sentimen Review Film. Journal of Intelligent Systems, 1(1), 55–59.

Gumilar, D., Pudjiantoro, T. H., & Yuniarti, R. (2017). Prediksi Kepopuleran Lagu Berdasarkan Tangga Lagu Billboard Menggunakan Decision Tree Dan K-Means. Prosiding SNATIF Ke-4 Tahun 2017, 187–192. https://doi.org/10.1007/s12664-012-0191-3

Iskandar, D., & Suprapto, Y. (2015). Perbandingan Akurasi Klasifikasi Tingkat Kemiskinan Antara Algoritma C45 dan Naive Bayes. Ilmiah NERO, 2(1), 37–43.

Jangid, B. M., Jadhav, C. K., Dhokate, S. M., Jadhav, G. M., & Bhandari, G. M. (2017). Movies Popularity Prediction Using Social Media and Conventional Features, (i), 5794–5798. https://doi.org/10.15680/IJIRSET.2017.0604192

Latif, M. H., & Afzal, H. (2016). Prediction of Movies popularity Using Machine Learning Techniques. IJCSNS International Journal of Computer Science and Network Security, 16(8), 127–131.

McLeon, R. G. P. S. (2008). Sistem Informasi Manajemen. Salemba Empat.

Murnawan, & Sinaga, A. (2017). Implementasi Sentiment Analysis untuk Menentukan Tingkat Popularitas Tujuan Wisata. Prosiding Seminar Nasional Teknologi Dan Rekayasa Informasi Tahun 2017, (November), 12–18.

Nugroho, Y. S., & Pratiwi, R. W. (2016). Prediksi Rating Film Menggunakan Metode Naïve Bayes. Jurnal Teknik Elektro (ISSN 1411-0059), 8(2), 60–63.

Nuraeni, N. (2017). Penentuan Calon Pegawai di PTPN 12 Kota Blater Tempurejo Jember dengan Metode Naive Bayes. Seminar Nasional Teknologi Informasi Dan Multimedia, 61–66.

Nuraeni, N. (2017). Penentuan Kelayakan Kredit Dengan Algoritma Naïve Bayes Classifier : Studi Kasus Bank Mayapada Mitra Usaha Cabang PGC. Jurnal Teknik Komputer AMIK BSI, III(1), 9–15. https://doi.org/ISSN. 2442 - 2436

Pristiyanti, R. I., Fauzi, M. A., & Muflikhah, L. (2017). Sentiment Analysis Peringkasan Review Film Menggunakan Metode Information Gain dan K-Nearest Neighbor, 2(3), 1179–1186.

Utami, L. A. (2017). Melalui Komparasi Algoritma Support Vector Machine Dan K-Nearest Neighbor Berbasis Particle Swarm Optimization, 13(1), 103–112.

Vercellis, C. (2009). Business Intelligence: Data Mining and Optimization for Decision Making.

Wisudawati, J., Adiwijaya, & Faraby, S. Al. (2017). Klasifikasi Sentimen pada Movie Review dengan Metode Multinomial Naïve Bayes. E-Proceeding of Engineering, 4(2), 2978–2988.

Witten, I. H., Frank, E., & Hall, M. A. (2011). Data Mining: Practical Machine Learning Tools and Techniques. Burlington, USA: Morgan Kaufmann Publishers.

Yulietha, I. M., Faraby, S. Al, & Adiwijaya. (2017). Klasifikasi Sentimen Review Film Menggunakan Algoritma Support Vector Machine. E-Proceeding of Engineering, 4(3), 4740–4750.