Analysis of Student Performance Based on LMS Activities with Learning Analytics Approach
Dawam Dwi Jatmiko Suwawi, Hafizh Jihaad Husni, Kusuma Ayu Laksitowening School of Computing, Informatics Study Program, Telkom University, Bandung, Indonesia
Email: 1,*[email protected], 2[email protected], 3[email protected]
Corresponding author's email: [email protected]
Submitted 23-11-2021; Accepted 28-12-2021; Published 30-12-2021
Abstract
Good performance during a course is important because it can affect a student's final grade index. However, in general, student performance can only be seen at the end of the semester, so a student who performs poorly before the course ends does not get enough opportunity to improve. Therefore, an early analysis of students with poor performance in a course is necessary.
Since most student learning activities in this pandemic era are currently on the LMS, the LMS activity log can reflect student performance.
The main objective of this study is to classify the LMS activity log in a course using a probabilistic classifier algorithm. This study chose Naïve Bayes to classify student performance into three categories: good, satisfactory, and poor. The dataset is separated into two scenarios, the half-semester and the full-semester, in the Modeling and Database Implementation course at Telkom University. The results show that the Naïve Bayes algorithm successfully predicts student performance early and provides information about students experiencing changes in performance, with the highest accuracy of 93%. The practical implication of this study is that teachers can use the LMS activity log for early prediction of student success in passing a course. The learning analytics developed in this study show that Naïve Bayes performs fairly well on small datasets, based on recall and accuracy, in classifying student performance. However, as the study focuses solely on a specific course and a small dataset, it lacks generalizability and needs to be tested on other courses and larger datasets.
Keywords: LMS; Learning Analytics; Student Performance; Naïve Bayes; Classification
1. INTRODUCTION
E-learning, or electronic learning, is growing rapidly in this pandemic era, both because strict health protocols are applied in all sectors and because almost all people now have the Internet access needed to use e-learning. E-learning changes teaching methods and learning strategies. Using e-learning, students can practice more material than in traditional classes. E-learning is generally described as electronic instruments, such as computers and televisions, that convey information for learning needs [1], [2]. The Learning Management System (LMS) is a fairly popular e-learning tool. Teachers can use an LMS to collect assignments, host forums, provide material, administer quizzes, and generate reports, and they can see every activity that students perform in the LMS.
Although the teacher can see students' activities, many factors cause a decrease in student performance, including inappropriate pedagogy, an unsuitable curriculum, and subject matter considered difficult to understand [3], [4]. It is therefore necessary to analyze data on students who experience decreased performance.
Therefore, learning analytics is carried out to find students who need help. Predicting student performance is one thing that needs to be considered an important task in learning analytics [5]. Attributes that determine student performance include GPA or average grades, absences, tests, backlogs, year drops, and subjects [6].
The decision tree produced the best results in examining student performance in C++ subjects compared to the Naive Bayes algorithm and the JRip rule learner [5]. However, the Naïve Bayes algorithm performed fairly well in classifying student learning techniques based on submitted feedback [1]. Therefore, this study uses Naïve Bayes as the learning analytics technique.
Naïve Bayes is used because it is one of the most popular classification methods. As an algorithm, Naïve Bayes is excellent in terms of accuracy and recall, and it can be used on small datasets. Without feature selection, this algorithm achieves higher accuracy than with feature selection [3].
This study aimed to address the following research questions: how can the Naive Bayes algorithm predict student performance based on their activities in the LMS, and how accurate is the Naïve Bayes algorithm in classifying student performance in the LMS?
The reader should bear in mind that: 1) the dataset used combines three Modeling and Database Implementation classes of the Information Technology Bachelor Program at Telkom University; and 2) the data is sourced from the Moodle LMS and the Academic Information System implemented at Telkom University.
The specific objective of this study was to implement the Naive Bayes algorithm to predict student performance before the semester ends and to determine the Naïve Bayes performance on this research dataset.
The paper is divided into five distinct sections. The first section contextualizes the research by providing background information on learning analytics to analyze student performance. The second section will consider both the sources and methods of study which will include previous studies, learning analytics, Naïve Bayes, and Learning Management System.
The third section is concerned with the methodology employed for this study. Section 4 analyses the data gathered and discusses the significant findings. The final section summarizes the main findings of this project and identifies areas for further research.
In previous research, learning analytics with Naïve Bayes was developed to find suitable learning techniques for students [1]. The dataset used is in the form of a form containing compliance, categories, attitude and tests [1]. The results of this study can help teachers to adjust what students need [1].
Other research used two classification techniques, namely C5.0 (Decision Tree) and regression, to predict student GPA scores and student dropouts [5]. The attributes used were id, nationality, age, educational history, passed courses, failed courses, and deleted courses [5]. The dataset used in this EDM research came from the LMS of the Iran University of Science and Technology [5]. For these two techniques, the prediction results of regression and C5.0 are very dependent on the distribution of the data; because the data used had small variation, the results of data processing were quite good [5]. The authors argue that if the dataset used had a large data distribution or variation, the results obtained could be poor [5].
The next study used decision trees and multiple regression to predict student performance and clustered students based on their academic strengths or weaknesses using K-Means Clustering [6]. Decision trees were used to predict students' pass or failure on their final results, and multiple regression to predict students' final GPA [6]. The clustering process was carried out to find similarities in ability based on each student's best scores, and the attributes used were GPA, lab grade, class test score, assignment score, presentation score, attendance, backlogs, year drops, 10th grade marks, 12th grade marks, and Board of Study [6]. All three methods were effective in predicting students' academic performance [6].
Another study compared the effectiveness (accuracy, sensitivity, and specificity) of several traditional supervised learning methods in predicting student performance by looking at the differences between the use of single-instance representation and multiple-instance representation [7]. Several supervised learning methods based on trees, rules, Naive Bayes, Neural Networks, and SVMs were compared using single-instance representation. Meanwhile, for data using multiple-instance representation, Zafra, et al. compared supervised learning methods based on rules, SVMs, Naive Bayes, Decision Tree, logistic regression, and diverse density [7]. This study took the dataset from the Moodle LMS at the University of Cordoba. This study proved that the use of multiple-instance representation increases the efficiency and effectiveness of the classical single-instance representation in predicting student performance [7].
Fernandes, et al. in [8] conducted a learning analytics technique using the Gradient Boosting Machine (GBM) method.
This research aims to look for attributes that affect student performance and look for attributes that cause student failure [8].
This study used two datasets, which differ in some, though not all, of their attributes [8]. The data used are students' academic performance before the school year starts and data from two months after the school year started [8].
2. RESEARCH METHODOLOGY
2.1 Learning Analytics
Learning analytics (LA) is the measurement, collection, analysis, and presentation of data about learning. The goal of LA is to understand and optimize learning and the learning environment. LA can also be described as the analysis of student activity data obtained during academic activities; this activity data is used to predict or detect potential problems experienced by students [2].
Figure 1 shows an example of a learning analytics process. The system processes data obtained through the LMS in a data mining/analytics step. The teacher analyzes the results of the analytics process through the provided dashboard and selects the necessary indicators.
Figure 1. Example of a Learning Analytics Process
2.2 Educational Data Mining

There is a method similar to learning analytics, namely Educational Data Mining (EDM). Educational Data Mining is the part of data mining that deals with academic databases. EDM results can improve student academic performance, reduce student failure rates, and improve teaching techniques [6]. EDM focuses on computational matters related to educational questions or needs [5]. The purpose of EDM is to create models that improve students' learning and success, thereby increasing the effectiveness of the institution [3].
2.3 Naïve Bayes
Bayes' theorem is a simple mathematical formula used to calculate the conditional probability of each label, predicting future probabilities based on previous experience. In Bayes' theorem, the class with the highest probability is considered the best predictor, or most likely class [1]. Therefore, this method belongs to a type of classification named Naive Bayes. The word "Naive" refers to how Bayes' theorem calculates probabilities: the order of the attributes does not affect the result. For example, Naive Bayes calculates the same probability for the attribute combination A*B*C as for A*C*B and all other orderings. Naïve Bayes assumes that each attribute is independent of the others [1]. The Naive Bayes algorithm works well on small datasets and is not strongly affected by irrelevant data [3].
The following is the formula for the Naive Bayes theorem:
P(A|B) = (P(B|A) · P(A)) / P(B) (1)
P(A|B) is the probability of A occurring when B occurs, and P(B|A) is the probability of B occurring when A occurs. P(A) and P(B) are the independent probabilities of A and B [1]. In addition to the previous equation, if the data used is continuous, the equation used is:
P(Xi) = 1/(σx √(2π)) · e^(-(Xi - μx)^2 / (2σx^2)) (2)

2.4 Learning Management System (LMS)
Learning Management System is a learning system that can integrate material, content, and other things that support the learning process. The purpose of LMS is to support collaborative activities between teachers and students and traditional learning [9].
There are various types of LMS, such as Proprietary LMS, Open-Source LMS, Cloud-Based LMS, and LMS Trends [9]. A Proprietary LMS has copyright ownership; using it requires infrastructure such as a building with a lab, a network, and computers [9]. An Open-Source LMS can be created and accessed by anyone, because it has no copyright restrictions [9]. A Cloud-Based LMS is hosted online and allows students to study online [9].
Predicting student performance based on activity logs in a learning management system is an interesting and growing issue, especially in the era of the pandemic where all community interactions, including students and teachers, must be limited to prevent the spread of COVID-19 cases. Everything "must" be carried out online, both synchronously and asynchronously. Learning management systems (LMS) are becoming very popular and used as a center for interaction and learning activities.
It is necessary to analyze data on students who experience decreased performance. Learning analytics is carried out to find students who need help. A more detailed process for building learning analytics using the activity log of an LMS is given in the following section.
2.5 System Schema
To build learning analytics using Naïve Bayes, we need input data in a format that the algorithm can process. Before classifying the data, we prepare it through preprocessing; this study preprocesses the data using applications such as Pentaho and MS Excel. Figure 2 shows the flow of the system design in this study. The dataset is divided into validation data and test data to determine the accuracy of the classification results. Lastly, after the classification results are obtained from the Naive Bayes classification process, they are analyzed and evaluated for accuracy.
Figure 2. Flow of the system
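As a rough illustration of this flow, the following stdlib-only Python sketch splits a preprocessed dataset 70/30 and evaluates a classifier on the held-out test data. The feature values, the `majority_class` stand-in classifier, and the 30% test ratio are illustrative assumptions; the actual study uses Naïve Bayes, with preprocessing done in Pentaho and MS Excel.

```python
import random

def train_test_split(rows, test_ratio=0.3, seed=42):
    """Shuffle and split preprocessed records into training and test data."""
    rows = rows[:]
    random.Random(seed).shuffle(rows)
    cut = int(len(rows) * (1 - test_ratio))
    return rows[:cut], rows[cut:]

def majority_class(train_rows):
    """Placeholder classifier: always predicts the most frequent training label."""
    labels = [label for _, label in train_rows]
    return max(set(labels), key=labels.count)

def accuracy(test_rows, predict):
    """Fraction of test records whose predicted label matches the true label."""
    return sum(predict(x) == y for x, y in test_rows) / len(test_rows)

# Hypothetical preprocessed dataset of 102 records: (feature vector, label).
data = [([i % 3, i % 5], ["Poor", "Satisfactory", "Good"][i % 3]) for i in range(102)]
train_rows, test_rows = train_test_split(data)
label = majority_class(train_rows)
acc = accuracy(test_rows, lambda x: label)
print(len(test_rows), round(acc, 2))
```

Replacing `majority_class` with a trained Naïve Bayes model yields the evaluation loop sketched in Figure 2.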
3. RESULTS AND DISCUSSION
3.1 Preprocessing Data

a. Attribute Selection
The data used in this study is the LMS activity log from the Modeling and Database Implementation class of the 2019 cohort of the Information Technology bachelor (S1) program at Telkom University. This course carries three credits and has 14 meetings, equivalent to 14 weeks. The course is delivered online using several learning platforms, one of which is the Moodle LMS, where students obtain materials, assignments, quizzes, external resource links, and forums. Quizzes and assignments are mandatory for every student, while forum activities are optional.
Students can take each quiz in more than one attempt. Students receive assignments at the half-semester point and at the end of the semester. Because this course is about databases, the assignment material covers relational database design and database implementation (SQL). A forum is available for each material for students who want to discuss it.
In addition to quizzes and assignments, each student is measured based on the achievement of course learning outcomes (CLOs), an Outcome-Based Education measure based on the IABEE International Accreditation Standards [18]. There are three CLOs in this course. CLO 1 measures students' ability to explain basic database concepts correctly; it consists of the multiple-choice section of the mid-test and the quizzes for weeks 1, 2, 3, and 7. CLO 2 measures students' ability to model databases using Entity-Relationship Diagrams and Relational Models with correct normalization of database tables; it consists of the quizzes for weeks 4, 5, and 6, Project 1, and the essay section of the mid-test. CLO 3 measures students' ability to properly implement database modeling using SQL statements according to the needs of the case study organization; it consists of Project 2, the quizzes for weeks 8 to 14, and the practicum.
Figure 3. Preprocessing data flow
The activity logs generated from student activities in the LMS include when a student started taking a quiz, how many times the student took the quiz, whether the student visited the forum, and what they did there. From this log, new attributes are derived, namely quiz completion time and discussion forum activity. Quiz time was selected as an attribute because it can indicate the level of student focus and possible cheating during quizzes [10].
Based on [11], students who use the forum have a better final score than students who do not use the forum.
Of all these attributes, the selected attributes consist of student ID, quiz scores for weeks 1-14, quiz times for weeks 1-14, project scores, and discussion forum activity. The performance attribute is generated from the data weighting, data normalization, and data binning processes and is used as the label in classification. Figure 3 illustrates the preprocessing data flow used in this study. Table 1, Table 2, and Figure 4 display samples of the dataset before pre-processing from iGracias, the LMS, and the backlogs, respectively.
Table 1. Sample of datasets before the pre-processing (iGracias)
| Student ID | CLO1_MidTest | CLO2_Quiz | CLO2_MidTest | CLO3_Project | CLO3_Quiz | CLO3_Practicum | CLO1_Quiz | CLO2_Project | Final | Status |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | Fail |
| 2 | 60 | 53.56666667 | 33 | 60 | 77.38095238 | 51.44444444 | 58.325 | 75 | 56.95252 | Pass |
| 3 | 30 | 41.76666667 | 67 | 34 | 22.32857143 | 37.27777778 | 35 | 80 | 45.76032 | Fail |
| 4 | 30 | 50 | 59 | 60 | 47.61428571 | 59.77777778 | 47.5 | 75 | 55.81127 | Pass |
| 5 | 30 | 14.43333333 | 38 | 34 | 35.82380952 | 37.88888889 | 22.5 | 80 | 40.21563 | Fail |
| 6 | 30 | 66.66666667 | 27 | 64 | 86.43333333 | 62.44444444 | 51.675 | 75 | 55.32764 | Pass |
| 7 | 20 | 36.66666667 | 48 | 60 | 69.51904762 | 61.66666667 | 55 | 75 | 53.84262 | Pass |
| 8 | 70 | 53.76666667 | 38 | 60 | 72.13809524 | 64.77777778 | 55 | 75 | 61.95079 | Pass |
| 9 | 30 | 52.33333333 | 50 | 64 | 71.1952381 | 26.11111111 | 45 | 80 | 50.44865 | Pass |
Table 2. Sample of dataset before the pre-processing (LMS)
| Student ID | Quiz1 | Quiz2 | Quiz3 | Quiz4 | Quiz5 | Quiz6 | Quiz7 | Quiz8 | Quiz9 | Quiz10 | Quiz11 | Quiz12 | Quiz13 | Quiz14 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | - | - | - | - | - | - | - | - | - | - | - | - | - | - |
| 2 | 5 | 8 | 3.33 | 54 | 4.67 | 6 | 7 | 8.67 | 7.5 | 12 | 13 | 4 | 9.33 | 8 |
| 3 | 6 | 8 | 72 | 3.33 | 2 | - | 5.63 | 12 | - | 2 | - | | | |
| 4 | 3 | 9 | - | 6 | 9 | 7 | 8.67 | 6 | 4 | 9.33 | 7.33 | | | |
| 5 | 6 | 1 | - | 3.33 | 1 | 2 | 3.75 | 6 | 4 | 4 | 5.33 | 5.33 | | |
| 6 | 5 | 9 | 6.67 | 60 | 6 | 8 | - | 8.67 | 7.5 | 13 | 13 | 9 | 10 | 8 |
| 7 | 6 | 10 | 30 | 2 | 6 | 6 | 8 | 10 | 13 | 8 | 9 | 8 | | |
| 8 | 5 | 10 | 48 | 7.33 | 7 | 7 | 8 | 7.5 | 11 | 11 | 5 | 8 | 7.33 | |
| 9 | 4 | 7 | 57 | 4 | 6 | 7 | 4 | 7.5 | 14 | 14 | 5 | 8 | 6 | |
Figure 4. Sample of dataset before the pre-processing (backlogs)
b. Dataset Analysis
Based on the dataset obtained, the features have different missing values, data types, and data ranges. For example, the quiz score feature has an integer data type with a range of 1-15, while the discussion forum feature has a string data type. Given this condition, pre-processing is necessary, including data encoding, checking for missing values, checking for duplicate data, data weighting, data normalization, and data binning. A total of 102 records were used in this study, with 24 identified attributes for the half-semester data and 47 attributes for the full-semester data. The results of pre-processing are shown in Table 3.
Table 3. Sample of dataset after the pre-processing
| Student ID | Quiz1 | WK1 | DF1 | Quiz2 | WK2 | DF2 | Quiz3 | WK3 | DF3 | Quiz4 | WK4 | DF4 | Quiz5 | WK5 | DF5 | Quiz6 | WK6 | DF6 | Quiz7 | WK7 | DF7 | Project1 | DF Project1 | Final Score | Performance |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 10.0 | 6.7 | 1 | 10.0 | 3.5 | 1 | 0.00 | 5.0 | 1 | 0.01 | 30 | 0 | 0.00 | 0.0 | 1 | 6.0 | 4.0 | 0 | 7.0 | 7.0 | 0 | 6.00 | 0 | 0.572549 | Satisfactory |
| 2 | 10.0 | 7.5 | 1 | 10.0 | 1.5 | 1 | 7.50 | 13.5 | 1 | 4.03 | 30 | 1 | 4.67 | 8.0 | 0 | 5.0 | 2.6 | 1 | 6.0 | 11.0 | 1 | 6.00 | 1 | 0.718606 | Satisfactory |
| 3 | 3.0 | 2.0 | 1 | 7.0 | 10.0 | 0 | 5.83 | 10.0 | 0 | 0.00 | | 1 | 1.33 | 1.0 | 0 | 0.0 | 0.0 | 1 | 0.0 | 0.0 | 0 | 6.00 | 1 | 0.409406 | Poor |
| 4 | 10.0 | 5.0 | 1 | 10.0 | 4.7 | 1 | 8.33 | 5.0 | 1 | 7.05 | 34 | 1 | 6.67 | 6.5 | 1 | 10.0 | 7.0 | 0 | 10.0 | 3.4 | 0 | 7.50 | 1 | 0.929652 | Good |
| 5 | 10.0 | 6.5 | 1 | 10.0 | 4.6 | 1 | 10.00 | 12.3 | 1 | 8.06 | 34 | 1 | 10.00 | 6.5 | 1 | 10.0 | 5.0 | 1 | 8.0 | 9.2 | 1 | 8.00 | 1 | 0.990325 | Good |
The data encoding changes the value of the NIM (student ID) feature to an integer in the range 1-102. After that, missing values are checked; if a data value is missing, it is replaced with 0. This is followed by checking for duplicate data; if duplicates are found, the system deletes them. Data weighting is then carried out to adjust the weight of each value, especially the quiz and project score features: according to the dataset table obtained, quiz scores are weighted at 65% and project scores at 35%.
Furthermore, data normalization was carried out to improve data integrity and reduce redundancy [17]. The last stage is data binning, following the scheme in [16]: on the normalized 0-1 scale, 0-0.5 falls into the Poor category, 0.51-0.75 into the Satisfactory category, and 0.76-1 into the Good category. Data binning is only conducted on the performance attribute.
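The weighting and binning steps can be sketched as follows. The 65%/35% weights and the Poor/Satisfactory/Good cut-offs follow the scheme described above, while the quiz and project score maxima used for normalization (`quiz_max`, `project_max`) are illustrative assumptions, not values taken from the paper.

```python
def preprocess(record, quiz_max=15, project_max=10):
    """Weight quiz (65%) and project (35%) scores, normalize to the 0-1 range,
    and bin the result into the performance label used as the classification target.
    quiz_max and project_max are illustrative assumptions."""
    quiz_part = sum(record["quizzes"]) / len(record["quizzes"]) / quiz_max
    project_part = record["project"] / project_max
    final = 0.65 * quiz_part + 0.35 * project_part   # weighted score in [0, 1]
    if final <= 0.5:
        label = "Poor"                               # 0 - 0.5
    elif final <= 0.75:
        label = "Satisfactory"                       # 0.51 - 0.75
    else:
        label = "Good"                               # 0.76 - 1
    return final, label

score, label = preprocess({"quizzes": [10, 12, 14, 9], "project": 8})
print(round(score, 4), label)
```

The returned label is what the Naïve Bayes classifier is later trained to predict.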
3.3 Naïve Bayes Classification
Table 4 shows the calculation data for the combined categorical and continuous Naive Bayes formulas. Because the continuous Naive Bayes formula uses the standard deviation and mean of each attribute to be predicted, each attribute's mean and standard deviation must be obtained first, after which the Naive Bayes classification formula is applied. As sample data for the classification test, the records in Table 4 are taken, and the classification of student 102 is worked through as an example of the naive Bayes calculation:
Table 4. Classification Test Sample Data
| Student ID | Quiz1 | WK1 | DF1 | Quiz2 | WK2 | DF2 | Performance |
|---|---|---|---|---|---|---|---|
| 102 | 1 | 13.0 | 1 | 0 | 18.0 | 1 | Satisfactory |
| 2 | 2 | 7.5 | 1 | 2 | 1.5 | 1 | Satisfactory |
| 3 | 0 | 2.0 | 1 | | 10.0 | 0 | Poor |
| 4 | 2 | 5.0 | 1 | 2 | 4.7 | 1 | Good |
| 98 | 1 | 8.0 | 1 | 2 | 18.0 | 1 | Good |
| 100 | 0 | 12.0 | 1 | 0 | 9.0 | 1 | Poor |
Stage 1, looking for the probability of each possibility from each label, the following is the calculation:
P(Poor) = 2/6= 0.33 P(Satisfactory) = 2/6 = 0.33 P(Good) = 2/6 = 0.33
Stage 2, looking for the probability of each label on the attribute quiz 1 according to the classified data, the following is the calculation:
P(Quiz 1 = 1 | Poor) = 0/2 = 0 P(Quiz 1 = 1 | Satisfactory) = 1/2=0.5 P(Quiz 1 = 1 | Good) = 1/2 = 0.5
Stage 3, Look for the probability of each label on the WK 1 attribute (Quiz Time 1) according to the classified data.
The formula used is different from the previous stage because the data used is continuous. The following is the calculation:
P(Quiz Time 1 = 13 | Poor) = 1/(7.071 √(2π)) · e^(-(13 - 7)^2 / (2 · 7.071^2)) = 0.039
P(Quiz Time 1 = 13 | Satisfactory) = 1/(3.889 √(2π)) · e^(-(13 - 10.25)^2 / (2 · 3.889^2)) = 0.079
P(Quiz Time 1 = 13 | Good) = 1/(2.121 √(2π)) · e^(-(13 - 6.5)^2 / (2 · 2.121^2)) = 0.0017

Stage 4, looking for the probability of each label on the DF 1 attribute (Discussion Forum 1) according to the classified data. Here is the calculation:
P(Discussion Forum 1 = 1 | Poor) = 2/2 = 1 P(Discussion Forum 1 = 1 | Satisfactory) = 2/2= 1 P(Discussion Forum 1 = 1 | Good) = 2/2 = 1
Stage 5, looking for the probability of each label on the Quiz 2 attribute according to the classified data. Here is the calculation:
P(Quiz 2 = 0 | Poor) = 1/2 = 0.5 P(Quiz 2 = 0 | Satisfactory) = 1/2= 0.5 P(Quiz 2 = 0 | Good) = 0/2 = 0
Stage 6, Look for the probability of each label on the WK 2 attribute (Quiz Time 2) according to the classified data.
Here's the calculation:
P(Quiz Time 2 = 18 | Poor) = 1/(0.707 √(2π)) · e^(-(18 - 9.5)^2 / (2 · 0.707^2)) = 0.023 × 10^-30
P(Quiz Time 2 = 18 | Satisfactory) = 1/(11.667 √(2π)) · e^(-(18 - 9.75)^2 / (2 · 11.667^2)) = 0.026
P(Quiz Time 2 = 18 | Good) = 1/(9.404 √(2π)) · e^(-(18 - 11.35)^2 / (2 · 9.404^2)) = 0.033
The last stage, looking for the probability of each label on the Discussion Forum 2 attribute according to the classified data. Here's the calculation:
P(Discussion Forum 2 = 1 | Poor) = 1/2 = 0.5 P(Discussion Forum 2 = 1 | Satisfactory) = 2/2= 1 P(Discussion Forum 2 = 1 | Good) = 2/2 = 1
Because all probability calculations for each attribute on the label have been carried out, the next process is to calculate all attributes with the same label according to the Naive Bayes formula, here are the calculations:
P(Poor) · P(Quiz 1 = 1 | Poor) · P(Quiz Time 1 = 13 | Poor) · P(Discussion Forum 1 = 1 | Poor) · P(Quiz 2 = 0 | Poor) · P(Quiz Time 2 = 18 | Poor) · P(Discussion Forum 2 = 1 | Poor)
= 0.33 × 0 × 0.039 × 1 × 0.5 × (0.023 × 10^-30) × 0.5 = 0
P(Satisfactory) · P(Quiz 1 = 1 | Satisfactory) · P(Quiz Time 1 = 13 | Satisfactory) · P(Discussion Forum 1 = 1 | Satisfactory) · P(Quiz 2 = 0 | Satisfactory) · P(Quiz Time 2 = 18 | Satisfactory) · P(Discussion Forum 2 = 1 | Satisfactory)
= 0.33 × 0.5 × 0.079 × 1 × 0.5 × 0.026 × 1 = 0.000169455
P(Good) · P(Quiz 1 = 1 | Good) · P(Quiz Time 1 = 13 | Good) · P(Discussion Forum 1 = 1 | Good) · P(Quiz 2 = 0 | Good) · P(Quiz Time 2 = 18 | Good) · P(Discussion Forum 2 = 1 | Good)
= 0.33 × 0.5 × 0.0017 × 1 × 0 × 0.033 × 1 = 0
So, the performance of this student (ID 102) is predicted to be in the Satisfactory category.
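The seven stages above can be reproduced in a few lines of Python. The sketch below implements the mixed categorical/continuous Naïve Bayes over the Table 4 rows; note that student 3's missing Quiz 2 value is filled with a nonzero placeholder (an assumption, since the blank cell is not stated in the paper), and that, following the paper's stage computations, the classified row (student 102) is itself part of the training data.

```python
import math
from statistics import mean, stdev

def gaussian_pdf(x, mu, sigma):
    """Continuous Naive Bayes likelihood, Eq. (2)."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

# Table 4 rows: (Quiz1, WK1, DF1, Quiz2, WK2, DF2, label).
train = [
    (1, 13.0, 1, 0, 18.0, 1, "Satisfactory"),  # student 102
    (2,  7.5, 1, 2,  1.5, 1, "Satisfactory"),  # student 2
    (0,  2.0, 1, 2, 10.0, 0, "Poor"),          # student 3 (Quiz2 is an assumed placeholder)
    (2,  5.0, 1, 2,  4.7, 1, "Good"),          # student 4
    (1,  8.0, 1, 2, 18.0, 1, "Good"),          # student 98
    (0, 12.0, 1, 0,  9.0, 1, "Poor"),          # student 100
]
CONTINUOUS = {1, 4}  # WK1 and WK2 (quiz times) are the continuous attributes

def nb_score(sample, label):
    rows = [r for r in train if r[-1] == label]
    score = len(rows) / len(train)                      # prior, Stage 1
    for i, value in enumerate(sample):
        column = [r[i] for r in rows]
        if i in CONTINUOUS:                             # Stages 3 and 6
            score *= gaussian_pdf(value, mean(column), stdev(column))
        else:                                           # Stages 2, 4, 5, and 7
            score *= sum(v == value for v in column) / len(column)
    return score

student_102 = (1, 13.0, 1, 0, 18.0, 1)
scores = {c: nb_score(student_102, c) for c in ("Poor", "Satisfactory", "Good")}
print(max(scores, key=scores.get))  # -> Satisfactory
```

The Poor and Good scores collapse to zero through the categorical factors, matching the hand calculation, and the Satisfactory score comes out near 0.000169.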
3.4 Testing
The purpose of this test is to determine whether the Naive Bayes algorithm can classify student performance well, and whether Gaussian Naive Bayes is suitable for classifying student performance from LMS data. Testing is done by examining the accuracy of the implemented Gaussian Naive Bayes, along with the F1-score, recall, and confusion matrix, and by analyzing the classification results in detail. The next sections analyze the data resulting from these research attributes.
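The metrics reported below (accuracy, precision, recall, F1-score) can be computed from first principles as a sanity check; this sketch uses made-up labels, not the paper's actual test split.

```python
def per_class_metrics(y_true, y_pred, labels):
    """Precision, recall, and F1 per class, plus macro averages, without libraries."""
    report = {}
    for c in labels:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        report[c] = (precision, recall, f1)
    macro = tuple(sum(m[i] for m in report.values()) / len(labels) for i in range(3))
    return report, macro

# Illustrative labels only (not the study's test data).
y_true = ["Poor", "Poor", "Satisfactory", "Satisfactory", "Good", "Good"]
y_pred = ["Poor", "Satisfactory", "Satisfactory", "Satisfactory", "Good", "Satisfactory"]
report, macro = per_class_metrics(y_true, y_pred, ["Poor", "Satisfactory", "Good"])
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
```

The macro averages correspond to the "average precision", "average recall", and "average F1-score" reported in Tables 5 and 7.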
3.5 Half Semester Test Results
The results of testing the half-semester data using the Naïve Bayes classification algorithm show an accuracy of 93%, with details as in Table 5. The average precision of the three categories with Gaussian Naïve Bayes is 95.3%, with an average recall of 81.3% and an average F1-score of 85.3%. Figure 5 presents the test data of the half-semester classification results.
Table 5. Classification Results of Half-Semester Test Data

| No | Gaussian Naïve Bayes | Score |
|---|---|---|
| 1 | Accuracy | 93% |
| 2 | Average Precision | 95.3% |
| 3 | Average Recall | 81.3% |
| 4 | Average F1-score | 85.3% |
Figure 5. Details of half-semester test data classification results
Furthermore, Table 6 displays the half-semester features with the largest standard deviation values in the Gaussian Naïve Bayes classification. The largest standard deviation belongs to the Quiz 3 feature, with a score of 0.167742, and the lowest in the table belongs to the Quiz 5 feature, with 0.070968.
Table 6. The largest features based on standard deviation value

| No | Feature | Standard Deviation |
|---|---|---|
| 1 | Quiz 3 | 0.167742 |
| 2 | Quiz 4 | 0.116129 |
| 3 | Quiz 1 | 0.083871 |
| 4 | Quiz 6 | 0.077419 |
| 5 | Quiz 5 | 0.070968 |
Figure 6 presents the details of the incorrectly classified data: 50% of the incorrectly predicted data should be in the Poor category but was predicted as Satisfactory, and 0.59% should be in the Good category but was predicted as Satisfactory.
Figure 6. Gaussian Confusion Matrix of the half-semester data

3.6 Full Semester Test Results
Figure 7 shows the results of testing the full-semester data using the Naïve Bayes algorithm. The accuracy of the Naive Bayes algorithm in classifying this data is 70.9%, with a total of 102 records, of which 30% were used as test data. The average precision for Gaussian Naïve Bayes is 72.6%, with an average recall of 75% and an average F1-score of 70% across the three classification categories.
Table 7. Details of the full-semester test data classification results

| No | Gaussian Naïve Bayes | Score |
|---|---|---|
| 1 | Accuracy | 70.9% |
| 2 | Average Precision | 72.6% |
| 3 | Average Recall | 75% |
| 4 | Average F1-Score | 70% |
Figure 7. Details of full-semester test data classification results
Table 8 shows the standard deviation values in the full-semester student performance classification. The Discussion Forum 2 feature obtained the highest standard deviation, 0.080645, and the lowest score in the table belongs to the Quiz Time 8 feature, at 0.009677.
Table 8. The largest features based on standard deviation value

| No | Feature | Standard Deviation |
|---|---|---|
| 1 | Discussion Forum 2 | 0.080645 |
| 2 | Project 1 | 0.035484 |
| 3 | Discussion Forum 13 | 0.029032 |
| 4 | Quiz 11 | 0.019355 |
| 5 | Quiz Time 8 | 0.009677 |
Figure 8 shows the confusion matrix for predicting student performance using Gaussian Naive Bayes on the full-semester data. It shows that 17% of the data that should be labeled Poor was predicted as Satisfactory, 33% of the data that should be labeled Satisfactory was predicted as Good, and 25% of the data that should be labeled Good was predicted as Satisfactory.
Figure 8. Gaussian Confusion Matrix of the full-semester data

3.7 Analysis of Student Performance Comparison Results
Based on the naive Bayes classification results for the half-semester and full-semester student data, more than 30% of students experienced changes in performance, and most of them experienced a decrease rather than an increase. Based on the attributes used in this study, the decrease in performance could be due to students losing focus while working on quizzes, a lack of understanding of the material given after the half-semester, and reduced contribution and activeness in the discussion forums for quizzes and projects.
Table 9. Sample Results Comparison of student performance
| No. | Student ID | Half-semester performance prediction | Full-semester performance prediction |
|---|---|---|---|
| 1 | b1 | Satisfactory | Poor |
| 2 | b2 | Good | Poor |
| 3 | b3 | Satisfactory | Poor |
| 4 | b4 | Good | Poor |
| 5 | b6 | Good | Poor |
| 6 | b7 | Good | Satisfactory |
| 7 | b9 | Good | Satisfactory |
| 8 | b11 | Good | Satisfactory |
| 9 | b45 | Good | Satisfactory |
| 10 | b13 | Good | Satisfactory |
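The comparison shown in Table 9 can be automated by ordering the three labels. The sketch below flags students whose predicted category moved between the half- and full-semester models; student "b99" is a hypothetical unchanged case added for illustration.

```python
ORDER = {"Poor": 0, "Satisfactory": 1, "Good": 2}

def performance_changes(half, full):
    """Compare half- and full-semester predictions and flag each student
    whose performance category moved up or down."""
    changes = {}
    for sid in half:
        delta = ORDER[full[sid]] - ORDER[half[sid]]
        if delta:
            changes[sid] = (half[sid], full[sid], "decrease" if delta < 0 else "increase")
    return changes

# Sample predictions taken from Table 9 (b99 is hypothetical).
half = {"b1": "Satisfactory", "b2": "Good", "b7": "Good", "b99": "Poor"}
full = {"b1": "Poor", "b2": "Poor", "b7": "Satisfactory", "b99": "Poor"}
changes = performance_changes(half, full)
print(changes)
```

A teacher could run such a comparison at the half-semester point to identify students whose predicted trajectory is declining.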
3.8 Analysis of Test Results
Based on testing the data of the Modeling and Database Implementation class students, the highest classification accuracy is 93%, on the half-semester data, and the lowest is 70.9%, on the full-semester data. For example, if we randomly take ten samples from the classification results and compare the original performance with the predicted performance, we find that the Naive Bayes algorithm produces fairly good predictions, as shown in Table 10: only one of the ten sample data points is predicted incorrectly.
Thus, the Naive Bayes algorithm produced good performance on this research dataset in both the half-semester and full-semester scenarios. Consistent with the hypothesis from previous research, the Naive Bayes algorithm proved to produce good accuracy for small datasets.
Table 10. Sample comparison results of student performance classification

| No. | Student ID | Half-semester performance | Half-semester performance prediction | Full-semester performance | Full-semester performance prediction |
|---|---|---|---|---|---|
| 1 | b1 | Satisfactory | Satisfactory | Poor | Poor |
| 2 | b2 | Satisfactory | Good | Satisfactory | Poor |
| 3 | b3 | Poor | Satisfactory | Poor | Poor |
| 4 | b4 | Good | Good | Poor | Poor |
| 5 | b6 | Good | Good | Poor | Poor |
| 6 | b7 | Good | Good | Satisfactory | Satisfactory |
| 7 | b9 | Good | Good | Satisfactory | Satisfactory |
| 8 | b11 | Good | Good | Satisfactory | Satisfactory |
| 9 | b45 | Good | Good | Satisfactory | Satisfactory |
| 10 | b13 | Good | Good | Satisfactory | Satisfactory |
4. CONCLUSION
The present study conducted learning analytics to classify student performance in an LMS with Naïve Bayes. Naive Bayes succeeded in predicting early which students experienced changes in performance. The dataset, initially divided into several parts, was filtered back into one; the attributes kept are those related to activities in the LMS. The study used the student ID as the identifier of each record, quiz and project scores as indicators of student understanding, quiz time as an indicator of student focus during the half-semester, and the discussion forum as an indicator of student activity in studying the material provided. The performance attribute is the final value of all activities carried out by students across the attributes used. After the dataset was selected and binned into the performance labels Poor, Satisfactory, and Good, the data was classified according to the attributes used.

The classification results for this study's dataset showed that more than 30% of students experienced a decline in performance based on the data after half a semester. Based on these data, the decline is attributed to a lack of student focus on the material provided, insufficient understanding of the material, and a lack of active discussion of the material among students when it is delivered. Furthermore, Gaussian Naïve Bayes managed to classify student performance quite well, with the highest accuracy of 93% obtained on the half-semester data. The highest average precision obtained in this study was 95.3%, the highest average recall was 81.3%, and the highest average F1-score was 85.3%. The feature with the highest standard deviation was the Quiz 3 feature.
Further research could explore important attributes that were not processed here and could improve the quality of the output, as well as datasets with more varied features and data sources other than the LMS. Moreover, combining Naïve Bayes with other techniques such as PCA, feature scaling, and Laplace correction would be a fruitful area for future work.