Predicting Student Withdrawal from UAE CHEDS Repository using Data Mining

The author has also granted permission to the University to retain or make a digital copy for similar use and for the purpose of digitally preserving the work. I would also like to thank the one who guided me and advised me on the right approach in my research, Professor Dr.

RESEARCH MOTIVATION

The first motivation is to perform educational data mining on a unique and standardized database: the Central Higher Education Data Store (CHEDS) of the UAE Ministry of Education. Beyond that, a wide range of implementations would lead to an improvement of the UAE education sector and of the UAE's index relative to other countries.

RESEARCH OBJECTIVE

The resulting EDM model will be applicable to any higher education institute recognized and accredited by the MoE. The identification will apply to all higher education institutions that report their data statistics in the CHEDS template format.

RESEARCH METHODOLOGY

RESEARCH STRUCTURE

A non-homogeneous student population makes it more difficult for the university to predict which students have a higher probability of withdrawing from a course of study, and the nature of the reason for withdrawing (Beer and Lawson 2017). Previous literature is reviewed based on the work done to predict the students most likely to withdraw due to lack of achievement using EDM methods and techniques.

THEORETICAL MODEL OF STUDENT ATTRITION

Tinto's theory has been examined by other researchers who confirmed its prediction of student attrition (Pascarella and Chapman 1983). Bean and Matzner developed their theory of non-traditional student attrition based on the work of Bean (Bean 1980).

EDM APPROACHES TO STUDY STUDENT ATTRITION

Introduction to data mining and EDM

Pedagogical data mining aims to improve the students' learning environment together with the institute's efficiency (Bucos and Drăgulescu 2018). The increased use of educational data mining improved educational systems by stabilizing and integrating students and faculty.

EDM studies’ approaches

Numerous studies have presented the importance of educational data mining and its unique contribution in providing analysis and management support tools to assist institutes in successful decision making. Miguéis et al. (2018) found that the best algorithm was Random Forest, which achieved high prediction accuracy based on first-year academic performance results.

FACTORS THAT CONTRIBUTE TO STUDENT ATTRITION

Student’s demographics, social and psychological factors
Student’s prior performance and academic factors
Student’s engagement and institutional factors
Financial factors

Miguéis et al. (2018) demonstrated the influence of the high school category attended on student attrition. Asif et al. (2017) showed that the second year was associated with student withdrawal from the course.

BUSINESS UNDERSTANDING

The targeted institute in this research is based in the UAE and follows the CAA (Commission for Academic Accreditation) certification guidelines. The first level is the Bachelor's Degree, which has two programs: the Sharia Bachelor's Degree and the Law Bachelor's Degree. The second level is the Master's Degree, which is not in the scope of this study.

In each semester, new student enrollments number around 70 to 100 out of a total of almost 500 student enrollments (a mix of new, continuing, and readmitted students). A large number of students are employees, and their studies take place in the evening hours. In this study, a student is considered 'dropped out' when the student's withdrawal is official or unofficial, or when the student has not registered for two consecutive semesters.
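The dropout definition above can be read as a simple labeling rule. The following is a minimal sketch only: the function name, the status strings, and the representation of semesters as an ordered list are assumptions made for illustration and are not part of the CHEDS format.

```python
def is_dropped_out(status, semesters, registered):
    """Return True if a student counts as 'dropped out' under the study's definition.

    status     -- hypothetical status string, e.g. "official_withdrawal"
    semesters  -- ordered list of regular semesters in the observation window,
                  assumed to end at graduation if the student graduated
    registered -- set of semesters in which the student registered
    """
    # Rule 1: an official or unofficial withdrawal
    if status in {"official_withdrawal", "unofficial_withdrawal"}:
        return True
    # Rule 2: no registration in two consecutive semesters
    missed = 0
    for sem in semesters:
        missed = missed + 1 if sem not in registered else 0
        if missed >= 2:
            return True
    return False
```

A student who skips a single semester and then returns is not labeled as dropped out under this rule; only two consecutive missed semesters trigger the label.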

DATA UNDERSTANDING

Data Acquisition

Collecting student data and performance data can involve dealing with multiple databases built on different systems. In addition, with multiple databases, table versions can change over time, allowing new attributes to be added to the database, or attributes to be removed or changed. For example, if the type of language test was changed (from TOEFL to EmSAT), it would be difficult to map the old and new attributes to a common attribute because of the difference in value range.

CHEDS Database

Due to the small number of students in the participating institute, it was decided to extend the range of selected academic semesters for data collection to cover one complete cycle of the bachelor's course of study. Students with a high graduate GPA (between 3.7 and 4) completed their university studies (bachelor's degree) in seven semesters plus summer semesters (which are not included in the data set). The database files were retrieved for 12 semesters, from Fall 2015-2016 to Spring 2020-2021, to cover one cycle of undergraduate study for all five types of reports.

Two types of reports were available for all semesters, namely the Student Enrollment Report and the Graduate Report. The other reports were not available for all semesters because they were introduced in the fall of the 2020-2021 academic year.

Table 2: Comparison of Enrollment and Graduate based on GPA

DATA PREPARATION

Data Cleaning

Enroll_Health_Fitness_Certificate: From 16-17 Spring and above
Enroll_Marital_Status: From 16-17 Spring and above
Enroll_Student_Degree: From 17-18 Fall and older
Enroll_Mode_of_Study: From 16-17 Spring and older
Enroll_Employment_Status
Enroll_Required_Academic_Period: From 19-20 Spring and Senior
Enroll_Required_Credits_Graduation: From 18-19 Fall and Senior
Enroll_Current_Registered_Credits
Enroll_Transfer_Institution
Enroll_Language_Test_Name
Enroll_Language_Test_Score
Enroll_High_School_System
Enroll_High_School_Score
Enroll_High_School_Country

Data from summer semesters were removed from the databases, as they negatively affect attribute transformation and lead to incorrect predictions. Because the attributes were not consistent across the academic years in the long observation period, it was decided to select 13 attributes as the final data frame.

Enroll_High_School_Country: The country in which the candidate obtained their last high school diploma/certificate.

Table 8: List of final selected properties.

Table 4: Attribute Changes in Progressive Years

Data Integration

Enroll_Student_DOB: Date of birth in YYYY-MM-DD format.
Enroll_Student Current State of Citizenship as.
Enroll_Home_Emirate: The emirate where the student resides as stated on the passport or visa.
The first semester the student is registered for his/her current PROGRAM in.

Cumulative grade point average (CGPA) from the beginning of the student's record to the last enrolled academic period.

The column filter option was used to remove 60 non-required attributes, leaving 12 selected attributes. Third, the repeated merge option was used on the 12 graduate files to include the graduate GPA value for all graduate students.
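The column filter and repeated merge steps described above can be sketched in plain Python. This is an illustration only, not the tool workflow used in the study: the key and column names `Student_ID` and `Graduate_GPA` are hypothetical placeholders, and records are assumed to be lists of dictionaries.

```python
def filter_columns(rows, keep):
    """Keep only the selected attributes in each record (list of dicts)."""
    return [{k: row[k] for k in keep if k in row} for row in rows]


def merge_graduate_gpa(enrollment, graduate_files,
                       key="Student_ID", gpa="Graduate_GPA"):
    """Left-join the graduate GPA from several graduate files onto enrollment rows.

    `key` and `gpa` are hypothetical column names used for illustration.
    """
    gpa_by_student = {}
    for rows in graduate_files:          # one list of records per semester file
        for row in rows:
            gpa_by_student[row[key]] = row[gpa]
    # Students with no graduate record keep a missing value (None)
    return [{**row, gpa: gpa_by_student.get(row[key])} for row in enrollment]
```

A left join is assumed here so that enrolled students who have not graduated remain in the data set with a missing graduate GPA.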

Data Transformation

Since the dataset was obtained from the Ministry of Education (CHEDS) portal, the percentage of missing information was low, as shown in Table 9. After the dataset was encoded using one-hot encoding, the total number of attributes reached 68. The following methods (information gain, information gain ratio, correlation, chi-square, and Gini index) were used to generate attribute weights based on the class label.
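To illustrate why one-hot encoding increases the number of attributes, the sketch below expands each categorical attribute into one 0/1 column per observed value. It is a minimal pure-Python illustration; the function name and the example category values are assumptions, not the encoding operator used in the study.

```python
def one_hot_encode(rows, categorical):
    """Replace each categorical attribute with one 0/1 column per observed value."""
    # Collect the distinct values of every categorical attribute
    values = {col: sorted({row[col] for row in rows}) for col in categorical}
    encoded = []
    for row in rows:
        # Non-categorical attributes are copied unchanged
        new_row = {k: v for k, v in row.items() if k not in categorical}
        for col in categorical:
            for val in values[col]:
                new_row[f"{col}={val}"] = 1 if row[col] == val else 0
        encoded.append(new_row)
    return encoded
```

For example, an attribute such as Enroll_Mode_of_Study with two observed values becomes two binary columns, which is how a small set of selected attributes can grow to several dozen after encoding.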

Accuracy: a splitting criterion that selects the attribute split that increases the accuracy of the resulting tree. It is an automated algorithm that assigns a label to an object based on an example (Noble 2006). The performance of the k-NN model was low, with a ROC curve close to the 0.5 diagonal line (AUC = 0.6), leading to an unsatisfactory model.

DATA MODELING

Modeling Technique

Information Gain: a splitting criterion that tends to select attributes with a large number of values; it splits on the attribute that yields the least entropy. The second stage is to classify unknown data by a majority vote of the nearest neighbors. It is a model built on the bagging method, where several bootstrap samples of the training data are used to learn the decision trees.
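The information gain criterion mentioned above can be made concrete with a short sketch. It computes the Shannon entropy of the class label and the reduction in entropy obtained by splitting on one nominal attribute. The function names are illustrative, and this is not the implementation used by the modeling tool.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def information_gain(values, labels):
    """Entropy reduction obtained by splitting `labels` on a nominal attribute."""
    n = len(labels)
    groups = {}
    for v, y in zip(values, labels):
        groups.setdefault(v, []).append(y)
    # Weighted entropy of the partitions created by the split
    remainder = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - remainder
```

An attribute that separates the classes perfectly has a gain equal to the entropy of the label, while an attribute unrelated to the label has a gain of zero.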

The difference between Random Forest and Bagging is that RF selects the split at each tree node from a small random subset of the attributes. This technique repeatedly samples with replacement from the data set according to a uniform probability distribution. The boosting technique assigns a weight to each training instance, and this weight may change at the end of each boosting round based on performance.
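The two sampling ideas described above, bootstrap sampling of instances and random selection of attributes at a split, can be sketched as follows. This is a minimal illustration with hypothetical function names; it does not build trees and is not the tool's implementation.

```python
import random


def bootstrap_sample(data, rng):
    """Draw len(data) items uniformly at random with replacement (bagging)."""
    return [rng.choice(data) for _ in data]


def random_feature_subset(features, k, rng):
    """Pick k distinct attributes to consider at one split (Random Forest style)."""
    return rng.sample(features, k)


rng = random.Random(0)                      # fixed seed for reproducibility
sample = bootstrap_sample(list(range(10)), rng)
subset = random_feature_subset(["f1", "f2", "f3", "f4"], 2, rng)
```

Because sampling is done with replacement, some instances typically appear more than once in a bootstrap sample and others not at all.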

Testing Design

In our study, four classifiers were used: Decision Tree, Naive Bayes, k-NN, and Neural Network. In our model, a number of decision trees (default value = 50, as recommended by RapidMiner) were used with a weak classification algorithm that gradually changed the data.

Building Model

The third sub-stage in data modeling was to build a data model algorithm for classification prediction based on the test design. Each classification model considered in section 3.4.1 has setup parameters that are illustrated in the following tables.

Table 12: DT Set Parameters

k-Nearest Neighbor

MODEL ASSESSMENT AND EVALUATION

Figure 11 illustrates the four main receiver operating characteristic (ROC) shapes based on the area under the curve (AUC) value.

Figure 10: Accuracy and Precision Difference

DATA DEPLOYMENT

After successfully going through the CRISP-DM workflow and completing the most complex phases of data preparation and data modeling, it was time to evaluate the model classification output with respect to the study's research questions. The goals of this study were to identify students who are likely to withdraw from the institute, to identify the most efficient variable in predicting student withdrawal, and to estimate the accuracy of the prediction.

RESEARCH QUESTION 1 AND 2

Best Performance Model

In finding the positive class (Withdrawal) based on true positives, the model achieved good results, with a sensitivity of 62.96%. The area under the receiver operating characteristic curve (AUC) was 0.805, a good sign of the model's ability to separate the positive class from the negative class. This is judged by the area between the red curve and the dashed line.
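As a reminder of what an AUC value measures, the sketch below computes it directly from its rank interpretation: the probability that a randomly chosen positive instance receives a higher score than a randomly chosen negative one, with ties counted as one half. The function name and the label coding (1 = positive, 0 = negative) are assumptions made for illustration.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes are required to compute AUC")
    # Count a win for each correctly ordered pair and half a win for a tie
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

Under this reading, a value of 0.5 corresponds to the diagonal line of a random classifier, and values closer to 1 indicate better separation of the two classes.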

Among the splitting criteria, Accuracy achieved the best scores in accuracy, precision, F-measure, and TNR. Table 23 presents the performance of the model with each splitting criterion (gain ratio, information gain, Gini index, and accuracy). On the other hand, the best prediction classification among the ensemble classifiers came from the Random Forest model, which is considered the best model.
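The evaluation measures named above (accuracy, precision, sensitivity, TNR, and F-measure) all derive from the four cells of a confusion matrix. The sketch below shows the standard formulas; the function name is illustrative and the counts in the usage line are made-up values, not results from this study.

```python
def classification_metrics(tp, fp, tn, fn):
    """Standard measures derived from confusion-matrix counts.

    Assumes every denominator is non-zero (each class is present and
    at least one instance is predicted positive).
    """
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp)
    sensitivity = tp / (tp + fn)          # recall, true positive rate
    tnr = tn / (tn + fp)                  # specificity, true negative rate
    f_measure = 2 * precision * sensitivity / (precision + sensitivity)
    return {
        "accuracy": accuracy,
        "precision": precision,
        "sensitivity": sensitivity,
        "tnr": tnr,
        "f_measure": f_measure,
    }


# Hypothetical counts, for illustration only
metrics = classification_metrics(tp=30, fp=10, tn=50, fn=10)
```

Reporting sensitivity and TNR alongside accuracy matters when the withdrawal class is much smaller than the non-withdrawal class, since accuracy alone can look high while few withdrawals are detected.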

Least Performance Model

Summary of Planned Models

RESEARCH QUESTION 3

This research adds a valuable contribution to the EDM field in the UAE, as this classification prediction model is applicable to other higher education institutes in the UAE (HEIs accredited by the CAA). This research clearly illustrates that withdrawal prediction is primarily related to academic performance, which is consistent with the findings of previous research (Aguiar et al. 2014; Almarabeh 2017; Adekitan and Noma-Osaghae 2019). This research also agrees with other studies (Bucos and Drăgulescu 2018; Miguéis et al. 2018; Alomari, AlHamad and Salloum 2019) in finding that the Random Forest classifier is particularly good at predicting a student's withdrawal.

A data mining approach to predicting the performance of first-year university students using admission criteria.
Examining student attrition in higher education using big data analytics and data mining techniques.
Educational data mining: Predictive analysis of academic performance of public school students in the capital city of Brazil.