Application of Data Mining using Naive Bayes for Student Success Rates in Learning


Bayu Angga Wijaya^*, Vijay Kumar, Berlian Fransisco Jhon Wau, Juliansyah Putra Tanjung, N P Dharshinni

Faculty of Technology and Computer Science, Informatics Engineering, Universitas Prima Indonesia, Medan, Indonesia

Email: ^1,*[email protected], ²[email protected], ³[email protected], ⁴[email protected], ⁵[email protected]

Corresponding author email: [email protected]

Abstract− Education is a very important part of human life because quality human resources are formed through education. Quality education can be measured by the achievement of various indicators. However, achieving these indicators is not easy, because learning success is influenced by several factors. One of the factors that can affect the success of learning is the learning system. To understand the level of student success in learning, a data mining technique is needed. The algorithm used in this research is the naive Bayes algorithm. This study uses 601 student records per year from Academic Year 2019/2020 to Academic Year 2021/2022. The data used are attendance scores, assignment scores, mid-exam scores, semester exam scores, and averages. The test is divided into 3, namely testing for the Academic Year 2019/2020 dataset, testing for the Academic Year 2020/2021 dataset, and testing for the Academic Year 2021/2022 dataset, using the split validation operator. The test using the Academic Year 2019/2020 – Academic Year 2020/2021 student score dataset gives an accuracy of 95.01%, while the Academic Year 2021/2022 student score dataset gives an accuracy of 97.79%.

Keywords: Success; Learning; Data Mining; Prediction; Naïve Bayes.

1. INTRODUCTION

Education is a very important part of human life because quality human resources are formed through education. Quality education can be measured by the achievement of various indicators. However, achieving these indicators is not easy, because learning success is influenced by several factors [1]. These include internal factors and external factors. Internal factors come from the students themselves, namely health, interest, talent, and motivation. External factors come from outside the students, such as family, school, and community factors [2]. One of the factors that can affect the success of learning is the learning system.

The learning system is a way to achieve maximum student learning outcomes in learning activities. Learning outcomes can be seen from the ability of students to understand the material, which can determine student performance [3]. In the teaching and learning process, many students are dissatisfied with the lesson, do not pay attention to what has been learned, treat homework as a burden, and aim only at advancing to the next grade or graduating, which affects student success in learning [4]. The learning system can also affect learning outcomes. In recent years, because of the Covid-19 pandemic, the conventional learning system changed to a distance learning system, and this is also a cause of the decline in learning outcomes.

The research in [5] examined the factors causing the decline in learning achievement during the pandemic. It found that understanding of the material was the main factor behind the decline in learning outcomes.

The learning process consists of a number of components or elements that are related to each other. The interaction between teachers and students during the learning process plays an important role in achieving the desired goals.

It is possible that during the learning process the teacher does not provide interesting material, the teacher does not attract attention, and students do not participate actively in learning which affects learning outcomes [6].

Therefore, it is necessary to carry out data processing techniques to determine the success rate of students in learning. This can be an evaluation for the school to determine future steps.

To understand the success rate of students in learning, a data mining technique is needed.

A previous study [7] determined the success rate of students using the C4.5 algorithm. Its tests produced 23 rules describing the success of online learning, with the understanding attribute as the root node because it had the highest gain value, 0.179745446. The algorithm used in this research is the naive Bayes algorithm. The Naïve Bayes algorithm is a classification method that calculates probability values and combines them [8]. Naive Bayes uses Bayes' theorem and assumes that the values of the variables are independent of one another given the output value. In this case, it is assumed that the presence or absence of a certain variable is not related to the presence or absence of other variables [9]. The basic concept of the naïve Bayes method is to predict future probabilities based on past data, a theory known as Bayes' theorem. This concept is combined with the "naive" assumption that attribute values are independent [10]. The Naïve Bayes algorithm has been used in previous research. One study predicted the level of satisfaction in online learning: 27 cases were predicted satisfied and were actually satisfied, 0 were predicted dissatisfied but were actually satisfied, 0 were predicted satisfied but were actually dissatisfied, and 3 were predicted dissatisfied and were actually dissatisfied, giving an accuracy of 100% [11]. Another study analyzed the success of online learning and found that the important factors for improving student learning achievement were adaptive subjects such as English, Mathematics, Science, Chemistry, Physics, and Social Sciences, with an accuracy rate of 99% [12]. Therefore, this study uses the naive Bayes algorithm to measure student success in learning.

DOI: 10.30865/mib.v6i4.4639

2. RESEARCH METHODOLOGY

2.1 Stages of Research

The steps of this research work procedure are as follows:

[Figure: flowchart with the stages Start, Learning Success Rate Dataset, Data Preprocessing, Implementation of the Naïve Bayes Algorithm, Knowledge (Learning Success Rate), End]

Figure 1. Work Procedure

a. Needs Analysis

Collect references to support the research, such as national and international journals.

b. Data Preprocessing

The data used are data on attendance scores, assignment scores, mid test scores and semester test scores.

1. Data Cleaning

In this process, the deletion of duplicate data is carried out.

2. Data Normalization

This process separates the data into two tables: the Academic Year 2019/2020 and Academic Year 2020/2021 data are combined into one table, and the Academic Year 2021/2022 data are placed in a second table.

3. Data Transform

This process selects the attributes to be used. The data are processed using Microsoft Office Excel 2019.

c. Implementation of Naïve Bayes

The naïve Bayes algorithm is applied to calculate student success rates in learning using RapidMiner Studio 9.6.
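The preprocessing steps above (removing duplicates, computing the average, and separating the two-year and one-year tables) can be sketched in plain Python. This is only an illustration under stated assumptions: the study itself used Microsoft Office Excel, and the field names, the helper functions, and the academic year assigned to each sample student are hypothetical. The scores are the first three rows of Table 1.

```python
from statistics import mean

SCORE_FIELDS = ("attendance", "task", "mid", "semester")

# Scores from Table 1; the "year" values and field names are hypothetical.
records = [
    {"name": "Student 1", "year": "2019/2020", "attendance": 83, "task": 82, "mid": 87, "semester": 85},
    {"name": "Student 1", "year": "2019/2020", "attendance": 83, "task": 82, "mid": 87, "semester": 85},  # duplicate
    {"name": "Student 2", "year": "2020/2021", "attendance": 79, "task": 82, "mid": 80, "semester": 81},
    {"name": "Student 3", "year": "2021/2022", "attendance": 82, "task": 83, "mid": 87, "semester": 84},
]

def remove_duplicates(rows):
    """Data cleaning: keep the first occurrence of each identical record."""
    seen, unique = set(), []
    for row in rows:
        key = tuple(sorted(row.items()))
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique

def add_average(row):
    """Add the average of the four score attributes to a record."""
    return {**row, "average": mean(row[f] for f in SCORE_FIELDS)}

def split_by_period(rows):
    """Separate the two-year table from the one-year table."""
    two_year = [r for r in rows if r["year"] in ("2019/2020", "2020/2021")]
    one_year = [r for r in rows if r["year"] == "2021/2022"]
    return two_year, one_year

cleaned = [add_average(r) for r in remove_duplicates(records)]
two_year, one_year = split_by_period(cleaned)
print(cleaned[0]["average"], len(two_year), len(one_year))
```

The averages computed this way match the Average column of Table 1 for these rows.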

2.2 Model Building of Naïve Bayes

The datasets are tested using the split validation operator and the split data operator. The operator views for these tests are shown below:

Figure 2. Split Validation Process Operator

Figure 3. Operator Main Process Split Validation

In the split validation operator, there are several parameters used, namely:

a. Split Ratio

Specifies the proportion of the data that is used as training data.

b. Sampling Type

The sampling type determines how the subsets are built. The available sampling types are:

1. Linear sampling

Splits the ExampleSet into partitions without changing the order of the instances, i.e. each subset consists of consecutive instances.

2. Shuffled sampling

Builds random subsets of the ExampleSet.

3. Stratified sampling

Builds random subsets and ensures that the class distribution in the subsets is the same as in the whole ExampleSet.

In the Performance operator, the parameters used are:

a) Accuracy

The degree of closeness between the predicted value and the actual value.

b) RMSE (Root Mean Squared Error)

A measure of error that shows the average magnitude of the difference between the predicted and actual values. The best possible value is 0.
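The three sampling types can be sketched in plain Python as follows. This is a minimal illustration, not RapidMiner's implementation: the function names, the `seed` parameter, and the toy rows are assumptions made for the example.

```python
import random
from collections import defaultdict

def linear_split(rows, ratio):
    """Linear sampling: keep the original order, cut at the split ratio."""
    cut = round(len(rows) * ratio)
    return rows[:cut], rows[cut:]

def shuffled_split(rows, ratio, seed=0):
    """Shuffled sampling: shuffle a copy of the rows, then cut."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    return linear_split(shuffled, ratio)

def stratified_split(rows, ratio, label_of, seed=0):
    """Stratified sampling: shuffle and cut each class separately so that
    both subsets keep the class distribution of the full data."""
    by_class = defaultdict(list)
    for row in rows:
        by_class[label_of(row)].append(row)
    rng = random.Random(seed)
    train, test = [], []
    for members in by_class.values():
        members = members[:]
        rng.shuffle(members)
        cut = round(len(members) * ratio)
        train.extend(members[:cut])
        test.extend(members[cut:])
    return train, test

# Hypothetical labelled rows: 10 of grade A followed by 10 of grade B.
rows = [("A", i) for i in range(10)] + [("B", i) for i in range(10)]

lin_train, lin_test = linear_split(rows, 0.7)
strat_train, strat_test = stratified_split(rows, 0.7, label_of=lambda r: r[0])

# With ordered data, the linear test subset contains only grade B,
# while the stratified test subset contains both grades.
print(sorted({label for label, _ in lin_test}))
print(sorted({label for label, _ in strat_test}))
```

On data sorted by class, linear sampling can leave a class out of one subset entirely, which is one reason the sampling types can give different accuracy values.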

Figure 4. Split Data Process Operator

In the split data operator, several parameters are also used, namely:

a. Partitions

Partitions are used to divide the data into training data and testing data.

b. Sampling type

The same sampling types are used as in the split validation operator.

In the performance operator, the parameters used are also the same as those used with the split validation operator.
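Both operators report accuracy and RMSE. As a rough sketch, the two measures could be computed as below. The source does not say how RMSE is obtained for grade labels, so this example assumes one common convention for classifiers: the error for each example is 1 minus the confidence the model assigns to the true class. The function names and sample values are hypothetical.

```python
import math

def accuracy(true_labels, predicted_labels):
    """Fraction of examples whose predicted label equals the true label."""
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return correct / len(true_labels)

def rmse_from_confidences(true_labels, confidences):
    """RMSE under the assumed convention: error = 1 - confidence of the true class."""
    squared = [(1.0 - conf[t]) ** 2 for t, conf in zip(true_labels, confidences)]
    return math.sqrt(sum(squared) / len(squared))

# Hypothetical predictions for four students.
true_labels = ["A", "B", "B", "C"]
predicted = ["A", "B", "A", "C"]
confidences = [
    {"A": 1.0, "B": 0.0, "C": 0.0},
    {"A": 0.0, "B": 1.0, "C": 0.0},
    {"A": 1.0, "B": 0.0, "C": 0.0},  # wrong: the true class B gets confidence 0
    {"A": 0.0, "B": 0.0, "C": 1.0},
]

print(accuracy(true_labels, predicted))                  # 0.75
print(rmse_from_confidences(true_labels, confidences))   # 0.5
```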

2.3 Naïve Bayes Algorithm

Naïve Bayes is among the simplest classifiers to compute because it reduces the calculation to a simple multiplication of probabilities. Naïve Bayes is also capable of handling data sets that have many attributes [22]. The general form of Bayes' theorem used in the naïve Bayes algorithm is as follows [21]:

$$P(H \mid X) = \frac{P(X \mid H)\, P(H)}{P(X)} \tag{1}$$

In equation (1), X is the data of an unknown class and H is the hypothesis that the data belong to a specific class. P(H|X) is the probability of hypothesis H given the condition X (the posterior probability), P(H) is the prior probability of hypothesis H, P(X|H) is the probability of X given hypothesis H, and P(X) is the probability of X. To apply the naïve Bayes formula, the classification process needs a number of clues (attributes) to determine which class suits the analyzed sample. Therefore, Bayes' theorem in equation (1) is adapted into equation (2).

$$P(C \mid X_1, \ldots, X_n) = \frac{P(C)\, P(X_1, \ldots, X_n \mid C)}{P(X_1, \ldots, X_n)} \tag{2}$$
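Equation (2) becomes a multiplication of simple probabilities once the naive independence assumption is applied. The step below is the standard derivation, written with C for the class and X_1, ..., X_n for the attributes; it is added here as a clarifying sketch and is not an equation from the original paper.

```latex
% Naive assumption: the attributes are independent of one another given the class C.
P(X_1, \ldots, X_n \mid C) = \prod_{i=1}^{n} P(X_i \mid C)

% The denominator of equation (2) is the same for every class,
% so the classes can be compared using the numerator alone.
P(C \mid X_1, \ldots, X_n) \propto P(C) \prod_{i=1}^{n} P(X_i \mid C)
```

The predicted class is the one for which this product is largest.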

3. RESULTS AND DISCUSSION

3.1 Dataset

The data used in this study are attendance scores, assignment scores, mid-test scores, and semester test scores. In total, 1803 records were collected from Academic Year 2019/2020 to Academic Year 2021/2022.

Table 1. Student Datasets

| No | Name | Attendance Scores | Task Scores | Mid Exam Scores | Semester Test Scores | Average |
|---|---|---|---|---|---|---|
| 1 | Student 1 | 83 | 82 | 87 | 85 | 84,25 |
| 2 | Student 2 | 79 | 82 | 80 | 81 | 80,5 |
| 3 | Student 3 | 82 | 83 | 87 | 84 | 84 |
| 4 | Student 4 | 83 | 82 | 88 | 83 | 84 |
| 5 | Student 5 | 84 | 83 | 87 | 83 | 84,25 |
| 6 | Student 6 | 82 | 82 | 89 | 85 | 84,5 |
| 7 | Student 7 | 83 | 82 | 87 | 84 | 84 |
| 8 | Student 8 | 82 | 84 | 86 | 83 | 83,75 |
| 9 | Student 9 | 79 | 81 | 82 | 82 | 81 |
| 10 | Student 10 | 79 | 82 | 83 | 84 | 82 |
| … | … | … | … | … | … | … |
| 1803 | Student 1803 | 83 | 89 | 98 | 75 | 86,25 |

This study uses 1803 student records, 601 for each academic year from Academic Year 2019/2020 to Academic Year 2021/2022.

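Table 2. The Result of Data Transform

| No | Name | Attendance Scores | Task Scores | Mid Exam Scores | Semester Test Scores | Average | Grade |
|---|---|---|---|---|---|---|---|
| 1 | Student 1 | 83 | 82 | 87 | 85 | 84,25 | B |
| 2 | Student 2 | 79 | 82 | 80 | 81 | 80,5 | B |
| 3 | Student 3 | 82 | 83 | 87 | 84 | 84 | B |
| 4 | Student 4 | 83 | 82 | 88 | 83 | 84 | B |
| 5 | Student 5 | 84 | 83 | 87 | 83 | 84,25 | B |
| 6 | Student 6 | 82 | 82 | 89 | 85 | 84,5 | B |
| 7 | Student 7 | 83 | 82 | 87 | 84 | 84 | B |
| 8 | Student 8 | 82 | 84 | 86 | 83 | 83,75 | B |
| 9 | Student 9 | 79 | 81 | 82 | 82 | 81 | B |
| 10 | Student 10 | 79 | 82 | 83 | 84 | 82 | B |
| … | … | … | … | … | … | … | … |
| 1803 | Student 1803 | 83 | 89 | 98 | 75 | 86,25 | A |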

After the data transform process, the data are divided into two datasets: the two-year dataset (Academic Year 2019/2020 – 2020/2021) and the one-year dataset (Academic Year 2021/2022).

3.2 Calculation of Naïve Bayes

Calculations are done manually before performing the test, using a sample of the data.

a. Enumerate the classes/labels

P(Grade A) = 548/1202 = 0,455

P(Grade B) = 622/1202 = 0,517

P(Grade C) = 32/1202 = 0,026

b. Count the amount of data per class/label

Attendance Scores:

P(75|Grade A) = 6/1202 = 0,004

P(75|Grade B) = 18/1202 = 0,014

P(75|Grade C) = 4/1202 = 0,003

Task Scores:

P(75|Grade A) = 12/1202 = 0,009

P(75|Grade B) = 4/1202 = 0,003

P(75|Grade C) = 4/1202 = 0,003

Mid Test Scores:

P(75|Grade A) = 16/1202 = 0,013

P(75|Grade B) = 18/1202 = 0,014

P(75|Grade C) = 2/1202 = 0,001

Semester Test Scores:

P(75|Grade A) = 10/1202 = 0,008

P(75|Grade B) = 16/1202 = 0,013

P(75|Grade C) = 2/1202 = 0,001

Average:

P(85|Grade A) = 160/1202 = 0,133

P(80|Grade B) = 70/1202 = 0,058

P(77|Grade C) = 6/1202 = 0,004

c. Multiply all class variables

From the calculation above, the highest probability value is obtained for Grade A.
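The multiplication in step c can be reproduced with a short Python sketch. It follows the manual calculation as printed, including the division of every count by the dataset total of 1202 and the attribute values used for each grade; the variable names are assumptions. Exact fractions are used instead of the rounded three-decimal values because the products for Grade A and Grade B are close.

```python
from fractions import Fraction
from math import prod

TOTAL = 1202  # number of records in the two-year dataset, as printed in the calculation

# Counts copied from the manual calculation, in this order:
# class count, attendance, task, mid test, semester test, average.
counts = {
    "A": [548, 6, 12, 16, 10, 160],
    "B": [622, 18, 4, 18, 16, 70],
    "C": [32, 4, 4, 2, 2, 6],
}

# Multiply the class probability by each attribute probability.
scores = {
    grade: prod(Fraction(count, TOTAL) for count in grade_counts)
    for grade, grade_counts in counts.items()
}

best = max(scores, key=scores.get)
for grade, score in scores.items():
    print(grade, float(score))
print("Highest:", best)  # Highest: A
```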

3.3 Testing

In the tests, each dataset was split into 70% training data and 30% testing data. The test was repeated with each sampling type to find a good level of accuracy. The following tables show the test results on the datasets using two operators, namely the split data operator and the split validation operator.

a. Test results using the split validation operator

Table 3. Test Results Academic Year 2019/2020 – Academic Year 2020/2021

| Sampling | Data Testing Accuracy | Data Testing RMSE |
|---|---|---|
| Linear | 89,20% | 0,295% |
| Shuffle | 92,52% | 0,227% |
| Stratified | 95,01% | 0,189% |

Table 3 shows the results of testing with the split validation operator on the Academic Year 2019/2020 – 2020/2021 data. The best accuracy, 95.01%, was obtained with stratified sampling using 70% training data and 30% testing data.

Table 4. Test Results Academic Year 2021/2022

| Sampling | Data Testing Accuracy | Data Testing RMSE |
|---|---|---|
| Linear | 96,11% | 0,177% |
| Shuffle | 93,89% | 0,204% |

Table 4 shows the results of testing with the split validation operator on the Academic Year 2021/2022 data. The best accuracy, 97.79%, was obtained with stratified sampling using 70% training data and 30% testing data.

b. Test results using split data operators

Table 5. Test Results Academic Year 2019/2020 – Academic Year 2020/2021

| Sampling | Data Testing Accuracy | Data Testing RMSE |
|---|---|---|
| Linear | 89,18% | 0,272% |
| Shuffle | 90,84% | 0,248% |

Table 5 shows the results of testing with the split data operator on the Academic Year 2019/2020 – 2020/2021 data. The best accuracy, 90.84%, was obtained with shuffled sampling using 70% training data and 30% testing data.

Table 6. Test Results Academic Year 2021/2022

| Sampling | Data Testing Accuracy | Data Testing RMSE |
|---|---|---|
| Linear | 97,39% | 0,155% |
| Shuffle | 97,86% | 0,142% |


Table 6 shows the results of testing with the split data operator on the Academic Year 2021/2022 data. The best accuracy, 97.86%, was obtained with shuffled sampling using 70% training data and 30% testing data.

3.4 Evaluation of Results

The datasets were tested using two operators, the split data operator and the split validation operator, and the results showed a difference in accuracy. The evaluation therefore uses the split validation operator, because it gives good accuracy and low RMSE values with 70% training data, 30% testing data, and stratified sampling.

Figure 5. Average Value Graph Academic Year 2019/2020 – Academic Year 2020/2021

Figure 5 shows the average values for Academic Year 2019/2020 – Academic Year 2020/2021. The value for grade C is higher than the values for grade A and grade B.

Figure 6. Confusion Matrix Result Academic Year 2019/2020 – Academic Year 2020/2021

Figure 6 shows the confusion matrix for Academic Year 2019/2020 – Academic Year 2020/2021. The prediction of grade B has a class precision of 95.68%, the prediction of grade A has a class precision of 96.89%, and the prediction of grade C has a class precision of 66.67%, with an overall accuracy of 95.01%.

Figure 7. Average Value Graph Academic Year 2021/2022

Figure 7 shows the average value graph for Academic Year 2021/2022. The value for grade C is high, the value for grade B is medium, and the value for grade A is low.

Figure 8. Confusion Matrix Results Academic Year 2021/2022

Figure 8 shows the confusion matrix for Academic Year 2021/2022. The prediction of grade A has a class precision of 100%, the prediction of grade B has a class precision of 95.00%, and the prediction of grade C has a class precision of 83.33%, with an overall accuracy of 97.79%.
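Class precision and accuracy are both read from the confusion matrix: class precision is the number of correct predictions for a grade divided by all predictions of that grade, and accuracy is the number of correct predictions divided by all predictions. The sketch below shows the arithmetic. The counts are hypothetical and are not the values from Figure 6 or Figure 8, which are not reproduced in the text.

```python
# Hypothetical confusion matrix (NOT the study's figures).
# Outer key: predicted grade. Inner key: true grade.
matrix = {
    "A": {"A": 45, "B": 5, "C": 0},
    "B": {"A": 3, "B": 40, "C": 2},
    "C": {"A": 0, "B": 1, "C": 4},
}

def class_precision(matrix, label):
    """Correct predictions of `label` divided by all predictions of `label`."""
    row = matrix[label]
    return row[label] / sum(row.values())

def accuracy(matrix):
    """Correct predictions (the diagonal) divided by all predictions."""
    correct = sum(matrix[label][label] for label in matrix)
    total = sum(sum(row.values()) for row in matrix.values())
    return correct / total

for label in matrix:
    print(label, class_precision(matrix, label))
print("accuracy", accuracy(matrix))
```

A class with few predictions, such as grade C here, can have a low class precision without lowering the overall accuracy much, which is consistent with the pattern reported for grade C.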

3.5 Discussion

Based on testing with the two datasets, Academic Year 2019/2020 – Academic Year 2020/2021 and Academic Year 2021/2022, using the attributes attendance scores, mid-semester test scores, semester scores, assignment scores, and average scores, grade C has higher results than grade A and grade B. Research by Anggraini et al. (2020) on determining the success rate of schools facing the national examination (UN) obtained an accuracy of 95.50%, with the average English score having the potential to determine the overall score for student graduation [14]. Meanwhile, research by Irma et al. (2021) on determining the success of online learning obtained an accuracy of 99%, with adaptive subjects, namely English, Mathematics, Science, Chemistry, Physics, and Social Studies, as the dominant factor [12].

4. CONCLUSIONS

Based on the research that has been carried out, several conclusions can be drawn. Testing with the student score dataset for Academic Year 2019/2020 – Academic Year 2020/2021 gives an accuracy of 95.01%, while the student score dataset for Academic Year 2021/2022 gives an accuracy of 97.79%. For attendance scores, assignment scores, mid test scores, semester test scores, and grade points, the tests on both datasets resulted in higher grade C values, so there was no significant difference across the three school years.

REFERENCES

[1] A. Aprijal, A. Alfian, and S. Syarifudin, “Pengaruh Minat Belajar Siswa Terhadap Hasil Belajar Siswa di Madrasah Ibtidaiyah Darussalam Sungai Salak Kecamatan Tempuling,” MITRA PGMI J. Kependidikan MI, vol. 6, no. 1, pp. 76–91, 2020, doi: 10.46963/mpgmi.v6i1.125.

[2] T. Nabillah and A. P. Abadi, “Faktor Penyebab Rendahnya Hasil Belajar Siswa,” Pros. Sesiomadika, vol. 2, no. 1, pp. 659–663, 2020.

[3] S. A. Wijaya, R. A. Novi W, and S. D. Saputri, “Pengaruh Kebiasaan Belajar Terhadap Prestasi Belajar Siswa,” Ekuitas J. Pendidik. Ekon., vol. 7, no. 2, pp. 117–121, 2019, doi: 10.23887/ekuitas.v7i2.17917.

[4] S. Marpaung, S. -, and I. -, “Penerapan Metode Naïve Bayes Dalam Memprediksi Prestasi Siswa Di SMA Negeri 1 Panombeian Panei,” J. Sist. Inf. dan Ilmu Komput. Prima (JUSIKOM PRIMA), vol. 4, no. 2, pp. 8–13, 2021, doi: 10.34012/jurnalsisteminformasidanilmukomputer.v4i2.1522.

[5] K. F. Irnanda, D. Hartama, and A. P. Windarto, “Analisa Klasifikasi C4.5 Terhadap Faktor Penyebab Menurunnya Prestasi Belajar Mahasiswa Pada Masa Pandemi,” J. Media Inform. Budidarma, vol. 5, no. 1, pp. 327–331, 2021, doi: 10.30865/mib.v5i1.2763.

[6] Y. Niak, W. Mataheru, and D. A. Ngilawayan, “Perbedaan Hasil Belajar Siswa Pada Model Pembelajaran Kooperatif Tipe Circ Dan Model Pembelajaran Konvensional,” J. Honai Math, vol. 1, no. 2, p. 67, 2018, doi: 10.30862/jhm.v1i2.1040.

[7] E. Ahadi, I. Gunawan, I. O. Kirana, D. Hartama, and M. R. Lubis, “Penentuan Keberhasilan Pembelajaran Daring Pada Masa Pandemi Covid-19 dengan Menggunakan Algoritma C4.5 di Stikom Tunas Bangsa,” J. Komput. dan Inform., vol. 10, no. 1, pp. 78–85, 2022, doi: 10.35508/jicon.v10i1.6446.

[8] M. F. Rifai, H. Jatnika, and B. Valentino, “Penerapan Algoritma Naïve Bayes Pada Sistem Prediksi Tingkat Kelulusan Peserta Sertifikasi Microsoft Office Specialist (MOS),” Petir, vol. 12, no. 2, pp. 131–144, 2019, doi: 10.33322/petir.v12i2.471.

[8] I. W. Saputro and B. W. Sari, “Uji Performa Algoritma Naïve Bayes untuk Prediksi Masa Studi Mahasiswa,” Creat. Inf. Technol. J., vol. 6, no. 1, p. 1, 2020, doi: 10.24076/citec.2019v6i1.178.

[9] D. Yunita, P. Rosyani, and R. Amalia, “Analisa Prestasi Siswa Berdasarkan Kedisiplinan, Nilai Hasil Belajar, Sosial Ekonomi dan Aktivitas Organisasi Menggunakan Algoritma Naïve Bayes,” J. Inform. Univ. Pamulang, vol. 3, no. 4, p. 209, 2018, doi: 10.32493/informatika.v3i4.2032.

[10] A. R. Damanik, S. Sumijan, and G. W. Nurcahyo, “Prediksi Tingkat Kepuasan dalam Pembelajaran Daring Menggunakan Algoritma Naïve Bayes,” J. Sistim Inf. dan Teknol., vol. 3, no. 3, pp. 88–94, 2021, doi: 10.37034/jsisfotek.v3i3.137.

[11] N. Nurajijah, D. A. Ningtyas, and M. Wahyudi, “Klasifikasi Siswa Smk Berpotensi Putus Sekolah Menggunakan Algoritma Decision Tree, Support Vector Machine Dan Naive Bayes,” J. Khatulistiwa Inform., vol. 7, no. 2, pp. 85–90, 2019, doi: 10.31294/jki.v7i2.6839.

[12] I. A. Sihombing, D. Hartama, I. Parlina, I. Gunawan, and I. O. Kirana, “Analisis Keberhasilan Pembelajaran Daring pada Masa Pandemi Covid-19 menggunakan Algoritma C4.5 dan Naive Bayes,” JUKI J. Komput. dan Inform., vol. 3, no. 2, pp. 89–96, 2021, doi: 10.53842/juki.v3i2.68.

[13] W. Yustanti and N. Rochmawati, “Analisis Algoritma Klasifikasi untuk Memprediksi Karakteristik Mahasiswa pada Pembelajaran Daring,” J. Edukasi dan Penelit. Inform., vol. 8, no. 1, pp. 57–61, 2022.

[14] Y. Angraini, S. Fauziah, and J. L. Putra, “Analisis Kinerja Algoritma C4.5 Dan Naïve Bayes Dalam Memprediksi Keberhasilan Sekolah Menghadapi Un,” JITK (Jurnal Ilmu Pengetah. dan Teknol. Komputer), vol. 5, no. 2, pp. 285–290, 2020, doi: 10.33480/jitk.v5i2.1233.

[15] Rumini and Norhikmah, “Prediksi Kegagalan Siswa Dalam Data Mining Dengan,” J. Mantik Penusa Vol. 3, No. 1.1, Agustus 2019, vol. 3, no. September, pp. 42–46, 2019.

[16] M. S. Mustafa, M. R. Ramadhan, and A. P. Thenata, “Implementasi Data Mining untuk Evaluasi Kinerja Akademik Mahasiswa Menggunakan Algoritma Naive Bayes Classifier,” Creat. Inf. Technol. J., vol. 4, no. 2, p. 151, 2018, doi: 10.24076/citec.2017v4i2.106.

[17] P. A. Lizsara, S. Oyama, and S. Wardani, “Implementasi Data Mining Menggunakan Metode Naïve Bayes Untuk Memprediksi Ketepatan Waktu Tingkat Kelulusan Mahasiswa (Study Kasus: Program Studi Informatika Universitas PGRI Yogyakarta),” Seri Pros. Semin. Nas. Din. Inform., vol. 4, no. 1, pp. 34–37, 2020, [Online]. Available: http://prosiding.senadi.upy.ac.id/index.php/senadi/article/view/121

[18] N. Dharshinni, A. Sitepu, R. Syuhada, D. Barasa, and A. Wijaya, “Moodle Web-Based Learning Constraints toward Student Learning Interest Using C4.5 Algorithm during Covid-19 Pandemic,” Journal of Informatics and Telecommunication Engineering, vol. 5, no. 1, pp. 132–141, 2022, doi: 10.31289/jite.v5i1.5301.

[19] A. Gupta, L. Kumar, R. Jain, and P. Nagrath, “Heart Disease Prediction Using Classification (Naive Bayes),” in Lecture Notes in Networks and Systems, vol. 121, 2020. doi: 10.1007/978-981-15-3369-3_42.

[20] D. Berrar, “Bayes’ theorem and naive bayes classifier,” in Encyclopedia of Bioinformatics and Computational Biology: ABC of Bioinformatics, vol. 1–3, 2018. doi: 10.1016/B978-0-12-809633-8.20473-1.

[21] N. Ye, “Naïve Bayes Classifier,” in Data Mining, 2020. doi: 10.1201/b15288-5.