
Work Readiness Prediction of Telkom University Students Using Multinomial Logistic Regression and Random Forest Method

Haura Athaya Salka*, Kemas Muslim Lhaksmana

Fakultas Informatika, Program Studi Informatika, Telkom University, Bandung, Indonesia Email: 1hauraathaya@student.telkomuniversity.ac.id, 2kemasmuslim@telkomuniversity.ac.id

Corresponding Author Email: haura.athaya@gmail.com


Abstract−Work readiness is essential for college graduates to get a job immediately after graduation. In practice, however, many graduates are unemployed after graduation or do not get jobs that match the majors they studied for more than four years. Therefore, using a people analytics approach, this study aims to predict the work readiness of Telkom University students and to find out which factors affect students' work readiness after graduation. The model built is a multi-class classification model. It uses the Chi-square Test for feature selection, Multinomial Logistic Regression and Random Forest as classification methods, and the confusion matrix for evaluation. Multinomial Logistic Regression is used because several studies apply this algorithm to categorical data, while Random Forest is used to compare which model produces better accuracy. This study conducted several test scenarios and obtained the best model by performing hyperparameter tuning and handling imbalanced data with SMOTE-ENN. The imbalanced data treatment with SMOTE-ENN is used to improve accuracy scores and to predict classes well, especially the minority class. The best accuracy of the Multinomial Logistic Regression method is 53.9%, and of Random Forest 48.5%.

Keywords: People Analytics; Work Readiness; Students Performance; Multinomial Logistic Regression; Random Forest

1. INTRODUCTION

People analytics is now one of the essential instruments for talent management [1]. This approach combines data mining and business analytics, which are then applied to human resource data [2]. Data science and data analytics help companies provide descriptive, predictive, and prescriptive analysis [1]. People analytics is also used as a new way to make decisions based on actual facts, which can improve an organization's performance [3].

Companies can now detect and identify what kind of people they want to employ and whether their soon-to-be employees can improve their work performance, reach goals, and innovate. Therefore, more in-depth research on people analytics is much needed. It is profitable not only for companies that actively recruit new talent but also for the fresh graduates who will soon enter the world of professional work. With this research, fresh graduates know which factors are helpful for their work readiness and can build a strategy to fulfill their dream company's criteria.

People are all competing to get a job, and it is not easy, especially with unavoidable external factors like a pandemic. Mental readiness and soft and hard skills are very much needed for fresh graduates to adapt to their work environment, so work readiness is a must for students who are preparing to enter the world of professional work. Work readiness is a long-standing concept adapted from the term "consumer readiness" coined by Bowen in 1986 under the title "Managing customers as human resources in service organizations". The term was later used as the work readiness concept, which is useful for further research on the effectiveness of internship programs on the employability of fresh graduates [4]. Through an internship, fresh graduates hone their soft and hard skills and learn to identify which factors help them complete the tasks they are given. Another study stated that work readiness means not only developing the capability to finish given tasks but also being able to work independently and contribute beyond expectations [5]. It can be concluded that work readiness is a situation where graduates have been prepared to achieve optimal readiness in the professional world of work.

Many relevant studies have applied the people analytics concept to analyze performance data with appropriate classification methods. A study conducted by Nasril et al. in 2021 [6] predicted talent performance at PT Angkasa Pura II with the Multinomial Logistic Regression method, feature selection with ANOVA (Analysis of Variance), and the Random Forest method to provide predictions with better accuracy. The result of this study is a predictive error rate of as much as 29.2%, with the average error reaching 65.4% using Multinomial Logistic Regression, which leads to the conclusion that this type of model is not entirely accurate for small data representatives. The Random Forest therefore produced better prediction accuracy, with a predictive error rate of as much as -0.1% and an average error of 0%.

Another study conducted by Supriadi et al. in 2020 [7] used the Multinomial Logistic Regression method to determine the impact of soft skills and hard skills on students' work readiness. That study used the F-test and T-test to show that soft skills and hard skills simultaneously have a significant impact on work readiness for Information System Major students at Universitas Nusa Putra, by as much as 37.3%.

Another study conducted by Necula et al. in 2019 [1] stated that analyzing individuals' soft skills and hard skills from their résumés with several classification approaches has a significant impact on talent acquisition. The classification approaches they used were regression, k-NN, Random Forest, Naïve Bayes, Support Vector Machine, and Decision Tree. The results show that the highest accuracy was obtained with the Random Forest method, with an accuracy score of 98%. Another study conducted by Saling et al. in 2020 [8] also stated that the people analytics approach significantly impacts the development of a new structure of a complex human resource system, which is important for optimizing decisions.

Another study conducted by Anthony et al. in 2020 [9] used the Soft-System Methodology to describe students' work readiness problems in the Industrial Revolution 4.0. The result of this study provides recommendations for students to be better prepared to enter the world of work of the Industrial Revolution 4.0. It also includes information, obtained from statistical analysis, about which students are ready and less ready to enter the world of work.

Based on several previous works related to the people analytics approach, it is necessary to build a classification model that predicts valuable factors that might be helpful to talent management and human resources using machine learning. This study aims to predict the work readiness of Telkom University students by analyzing their performance throughout their active study period using Multinomial Logistic Regression. This study also uses Random Forest for comparison, to know which model obtains a better accuracy score. Multinomial Logistic Regression has been used in research with categorical data yet produced a relatively poor accuracy score for a small dataset, as shown in [6], where Random Forest proved able to produce a well-performing model with an average error rate of 0%. However, the feature selection used in [6] is ANOVA, which is mainly used in research with categorical independent variables and numerical dependent variables. This study uses the Chi-square Test for feature selection because both the independent and dependent variables here are categorical. This study then experiments with SMOTE-ENN to balance the data and to test whether imbalanced data have a significant effect on the accuracy score; in the study conducted by Lin et al. in 2021 [10], SMOTE-ENN increased the accuracy score by approximately 3%. Furthermore, this study also aims to determine which factors and competencies support students' work readiness and help them get permanent work right after they graduate.

2. RESEARCH METHODOLOGY

This study builds a classification model to analyze the work readiness of Telkom University students using the Multinomial Logistic Regression method and Random Forest method for accuracy score comparison. Building this classification model consists of several stages, including dataset preparation, data preprocessing, splitting the dataset into training and testing sets, feature selection using Chi-square, hyperparameter tuning, and evaluating the model using the confusion matrix. The flowchart of the whole system can be seen in Figure 1.

Figure 1. System Flowchart

The raw data go through preprocessing before being split into train and test data. Features are selected from the train data, and the selected features go through a model tuning process. Train and test data are then fitted to the best model, and the confusion matrix evaluates the model performance.

2.1 Dataset

This study uses the Data Tracer Study from 2015 to 2020 released by Telkom University, including students' records and survey questions that reflect their performance after graduating. This study also uses students' GPA records, merged with the base dataset (the Data Tracer Study) using the Student ID as the unique key to identify which GPA record belongs to which student. The raw dataset contains more than 9000 student records and more than 28000 GPA records. A further explanation of the data attributes is shown in Table 1.
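As a rough illustration, the merge might look like the pandas sketch below; the file names, the 'nim' join key, and the presence of a 'semester' column are assumptions, not the paper's exact schema.

```python
import pandas as pd

# File names and columns are hypothetical stand-ins for the real data.
tracer = pd.read_csv("tracer_study_2015_2020.csv")  # ~9000 student records
gpa = pd.read_csv("gpa_records.csv")                # ~28000 GPA records

# Keep one final GPA per student (last semester), then attach it to the
# tracer data using the Student ID as the unique join key.
final_gpa = (gpa.sort_values("semester")
                .groupby("nim", as_index=False)
                .last())
dataset = tracer.merge(final_gpa[["nim", "gpa"]], on="nim", how="left")
```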

Table 1. Data Attribute

Attribute                   Description
id                          Students' ID
level                       Students' education level
major                       Students' major
faculty                     Students' faculty
batch                       Students' batch
gpa                         Students' final GPA
entry_year                  Year of entry
graduate_year               Year of graduation
job_year                    Year of getting a job
job_position                Job position
job_location                Job location
job_field                   Job field
cat_company                 Company category
type_company                Company type
org_experience              Organization experience
knowledge_study_field       Level of study field knowledge
knowledge_non_study_field   Level of non-field knowledge
general_knowledge           General knowledge
ablt_internet               Internet ability
ablt_computer               Computer ability
ablt_adapt                  Adaptability
ablt_analysis               Analysis ability
ablt_research               Research ability
ablt_study                  Study ability
ablt_communication          Communication ability
ablt_present                Presentation ability
ablt_writing                Report writing ability
ablt_responsible            Responsibility
critical_thinking           Critical thinking
pressure_study              Under-pressure learning
manage_time                 Time management
manage_project              Project management
independent_work            Independent work
team_work                   Teamwork
problem_solving             Problem solving
negotiation                 Negotiation
tolerance                   Tolerance
diff_culture                Different culture work
leadership                  Leadership
initiative                  Initiative
loyal_integrate             Loyalty and integrity
learn_desire                Desire to learn
rate_field                  Field suitability with job position
rate_education              Level suitability with job position

2.2 Data Preprocessing

Preprocessing prepares the raw dataset, which has not yet been validated and contains missing values, outliers, duplicated data, and more. It is a crucial first step in building a classification model, as it cleans and converts the data into a proper form that can be used in the model. The preprocessing step is carried out in several stages:

2.2.1 Data Cleaning

This stage includes handling missing values by looking for Null or NaN values in each column and filling them with the mode, with 0, or by dropping the rows that contain missing values. It then searches for duplicate data on the attribute 'nim' (the Student Identification Number), which is unique for each student; duplicates are handled by dropping the duplicated rows and keeping the first occurrence. Next, outliers are handled by dropping the rows that contain them. Outliers are identified by calculating the interquartile range (IQR) and visualizing each feature column with a boxplot to ease observation; values below Q1 - 1.5 × IQR and above Q3 + 1.5 × IQR are considered outliers.
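A minimal pandas sketch of these cleaning steps is shown below; the column names and the use of the conventional 1.5 × IQR fences are assumptions about the exact implementation.

```python
import pandas as pd

def clean(df: pd.DataFrame, numeric_cols: list) -> pd.DataFrame:
    # Fill missing categorical values with the column mode (filling with 0
    # or dropping the row are the alternatives mentioned above).
    for col in df.columns:
        if df[col].dtype == object and df[col].isna().any():
            df[col] = df[col].fillna(df[col].mode()[0])

    # Drop duplicated students, keeping the first occurrence of each 'nim'.
    df = df.drop_duplicates(subset="nim", keep="first")

    # Drop rows outside the IQR fences (Q1 - 1.5*IQR, Q3 + 1.5*IQR).
    for col in numeric_cols:
        q1, q3 = df[col].quantile([0.25, 0.75])
        iqr = q3 - q1
        df = df[df[col].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)]
    return df
```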

2.2.2 Feature Engineering

This stage includes adding new features based on other features that can affect the final result of the system, binning numerical data into categorical data to ease observation, and removing features that are not needed. This stage also encodes categorical variables using LabelEncoder and One-Hot Encoding. LabelEncoder is part of scikit-learn in Python and is used on ordinal or sequential data to convert categorical data into numbers [11]. Meanwhile, One-Hot Encoding is a technique used on non-ordinal data to convert categorical data into numbers by transforming each category into a new column [11]. Each new column is taken from a unique value of the one-hot encoded column, and its values consist of either 1 or 0.
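For illustration, a small sketch of both encoders (with made-up column values) might look like this:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({
    "level": ["S1", "S2", "S1"],       # ordinal: LabelEncoder
    "faculty": ["FIF", "FEB", "FIF"],  # non-ordinal: one-hot encoding
})

# LabelEncoder maps each category to an integer code.
df["level"] = LabelEncoder().fit_transform(df["level"])

# One-hot encoding creates a new 0/1 column per unique value.
df = pd.get_dummies(df, columns=["faculty"], dtype=int)
print(df)
```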

Besides engineering features, this stage also adds a target column by classifying the job waiting period into three classes: students who get a job before they graduate, students who get a job below the average alumni waiting period, and students who get a job above the average alumni waiting period. The average alumni waiting period is 2.49 months, which was taken from the conclusion of the Telkom University Data Tracer Study 2020.

The purpose of this stage is to transform the features into a form suitable for modelling.
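To make the target labelling concrete, a minimal sketch is shown below; the wait_months column and how it is derived (months between graduation and the first job) are assumptions, not the paper's exact implementation.

```python
import pandas as pd

AVG_WAIT = 2.49  # average alumni waiting period in months (Tracer Study 2020)

# Hypothetical column: months between graduation and the first job,
# negative when the job started before graduation.
df = pd.DataFrame({"wait_months": [-2.0, 1.5, 6.0]})

def target_class(wait_months: float) -> int:
    # 0: job before graduating, 1: below-average wait, 2: above-average wait
    if wait_months < 0:
        return 0
    return 1 if wait_months <= AVG_WAIT else 2

df["target"] = df["wait_months"].apply(target_class)
print(df)
```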

2.2.3 Treat Imbalance Data

Many classification models have difficulty predicting accurately because of the small amount of data and lack of information in the minority class [10]. An excellent accuracy score does not guarantee a good model with balanced predictions for every class. Therefore, imbalanced data treatment is needed to achieve good predictions and to detect the minority class effectively [10].

This stage uses SMOTE-ENN, a method that combines oversampling with the Synthetic Minority Oversampling Technique (SMOTE) and undersampling with Edited Nearest Neighbors (ENN) to treat imbalanced data. SMOTE can generate noisy samples when interpolating new points between marginal outliers and inliers, so ENN is applied afterwards to clean up the space resulting from oversampling [12].
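A minimal sketch of the resampling with imblearn's SMOTEENN is shown below, using a synthetic imbalanced dataset as a stand-in for the real one; resampling is applied to the training split only, so the test set keeps its original distribution.

```python
from collections import Counter
from imblearn.combine import SMOTEENN
from sklearn.datasets import make_classification

# Synthetic imbalanced stand-in for the real training split.
X_train, y_train = make_classification(n_samples=500, n_classes=3,
                                       n_informative=5,
                                       weights=[0.6, 0.1, 0.3],
                                       random_state=42)

# Combined over- and undersampling: SMOTE first, then ENN cleaning.
X_res, y_res = SMOTEENN(random_state=42).fit_resample(X_train, y_train)
print(Counter(y_train), "->", Counter(y_res))
```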

2.3 Feature Selection

The feature selection used in this study is based on the Chi-square Test score, chosen because most features are categorical data. The selection is done by calculating the dependence between each feature column and the target column with the formula:

$$\chi^2 = \sum_{i=1}^{n} \frac{(O_i - E_i)^2}{E_i} \qquad (1)$$

with O as the observed value and E as the expected value [13]. The calculation of the Chi-square Test score was conducted by testing two hypotheses: the Null Hypothesis and the Alternative Hypothesis [13]. A feature is eliminated if the Null Hypothesis is accepted, that is, when its significance value is greater than the alpha value of 0.05. Meanwhile, a feature passes the selection and is used in the model if its significance value is below 0.05.
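A small sketch of this selection rule with scikit-learn's chi2 scorer follows; the toy feature values and target are placeholders, and note that chi2 requires non-negative (encoded) features.

```python
import pandas as pd
from sklearn.feature_selection import chi2

# Toy stand-in: encoded, non-negative categorical features, 3-class target.
X = pd.DataFrame({"team_work": [4, 5, 3, 4, 2, 5],
                  "org_HIMA":  [1, 0, 0, 1, 0, 1]})
y = [0, 1, 2, 0, 2, 1]

chi_scores, p_values = chi2(X, y)
# Keep only the features whose p-value rejects H0 at alpha = 0.05.
selected = X.columns[p_values < 0.05]
print(dict(zip(X.columns, p_values)), list(selected))
```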

The hypotheses for this feature selection are as follows:

H0: The feature does not affect the target class significantly.
H1: The feature affects the target class significantly.

2.4 Classification Model

The classification models used are Multinomial Logistic Regression and Random Forest as the base models of this system. Hyperparameter tuning is then performed to find the parameters of each model that produce the best performance.

2.4.1 Multinomial Logistic Regression

The Multinomial Logistic Regression method is an extended version of Logistic Regression. Logistic Regression is a method that can analyze and predict data with binary classes; the extended version can predict more than two classes.

Logistic Regression originally belongs to the larger class of Generalized Linear Models (GLM) [14], in which the algorithm models the dependent variable through a linear combination of the independent variables. The outcome variable is binary or dichotomous, such as yes or no, alive or dead, good or bad [15]. The formula for the Generalized Linear Model (GLM) is as follows:

$$g(E(y)) = \alpha + \beta x_1 + \gamma x_2 \qquad (2)$$

where $g()$ is the link function, $E(y)$ is the expected value of the outcome variable, and $\alpha + \beta x_1 + \gamma x_2$ is the linear predictor. Meanwhile, the formula for Logistic Regression is as follows:

$$\mathrm{logit}(Y) = \ln\left(\frac{\pi}{1-\pi}\right) = \alpha + \beta x \qquad (3)$$

where $\pi$ is the probability of the outcome variable, $x$ is the predictor or explanatory variable, $\alpha$ is the intercept of $Y$, and $\beta$ is the regression coefficient. This study has more than two classes for the outcome variable; therefore, it uses Multinomial Logistic Regression with the following formula:

$$\mathrm{logit}(Y) = \ln\left(\frac{\pi}{1-\pi}\right) = \alpha + \beta_1 x_1 + \beta_2 x_2 \qquad (4)$$

The parameters used for Multinomial Logistic Regression can be seen in Table 2. The multi_class argument is set to 'multinomial' because this study has more than two outcome classes. Logistic Regression does not have any critical parameter to tune, but experimenting with the solver and penalty (regularization) might show a significant performance difference. The 'multinomial' option is only supported by the 'lbfgs', 'sag', 'saga', and 'newton-cg' solvers [16]; meanwhile, 'newton-cg', 'lbfgs', and 'sag' only support L2 regularization [16]. The C parameter also helps find optimal performance, as it controls the penalty strength.

Table 2. Hyperparameter Tuning Multinomial Logistic Regression

Parameter     Value
multi_class   multinomial
solver        ['lbfgs', 'sag', 'saga', 'newton-cg']
penalty       ['l2', 'none']
C             [0, 0.01, 0.1, 1.0]
max_iter      [50000]
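As a rough sketch, the grid search over Table 2 could look like the code below on synthetic stand-in data. Note this is an assumption-laden illustration: recent scikit-learn releases reject C = 0 and the string penalty 'none' (and the multi_class argument is deprecated, with the default already fitting a multinomial model for these solvers), so only a safe subset of the grid is searched here.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = make_classification(n_samples=600, n_classes=3, n_informative=6,
                           random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=42)

# Safe subset of the Table 2 grid: C must be > 0, and 'l2' is accepted by
# all four solvers in current scikit-learn releases.
param_grid = {
    "solver": ["lbfgs", "sag", "saga", "newton-cg"],
    "penalty": ["l2"],
    "C": [0.01, 0.1, 1.0],
    "max_iter": [50000],
}
grid = GridSearchCV(LogisticRegression(), param_grid, cv=5,
                    scoring="accuracy", n_jobs=-1)
grid.fit(X_train, y_train)
print(grid.best_params_, grid.score(X_test, y_test))
```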

2.4.2 Random Forest

Random Forest is originally part of the ensemble method family, whose goal is to combine the predictions of several base estimators built with a given learning algorithm to improve the robustness of a single estimator [17]. Random Forest is classified among the averaging methods, whose principle is to build several estimators and then average their predictions [17].

Random Forest is a multipurpose and intelligent machine learning method [18]. It is called multipurpose because Random Forest can perform both classification and regression tasks. Random Forest gives a lower classification error than other classification methods [19]. The algorithm is a combination of hundreds to thousands of decision trees, and the final result of a random forest is the average over the total number of decision trees [20]. The estimate of the j-th tree is given by:

$$m_n(x;\, \Theta_j, D_n) = \sum_{i \in D_n(\Theta_j)} \frac{\mathbf{1}_{X_i \in A_n(x;\, \Theta_j, D_n)}\, Y_i}{N_n(x;\, \Theta_j, D_n)} \qquad (5)$$

where $D_n(\Theta_j)$ is the set of data points selected during the tree construction, $A_n(x;\, \Theta_j, D_n)$ is the cell containing $x$, and $N_n(x;\, \Theta_j, D_n)$ is the number of selected points that fall into $A_n(x;\, \Theta_j, D_n)$. After calculating each tree estimate, the next step is to combine the trees into a finite forest with the following equation:

$$m_{M,n}(x;\, \Theta_1, \ldots, \Theta_M, D_n) = \frac{1}{M} \sum_{j=1}^{M} m_n(x;\, \Theta_j, D_n) \qquad (6)$$

The parameters used for Random Forest can be seen in Table 3. The most critical parameter in the Random Forest model is 'max_features', the number of features to consider when looking for the best split [17]. The input value can be an integer from 1 to half of the number of input features, but one can also simply use the default value from the sklearn library. The 'n_estimators' parameter sets the number of trees in the forest.

Table 3. Hyperparameter Tuning Random Forest

Parameter      Value
n_estimators   [100, 200, 300, 400, 500]
max_features   ['sqrt', 'log2']
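A matching sketch for the Random Forest grid of Table 3, again on synthetic stand-in data rather than the study's dataset:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=600, n_classes=3, n_informative=6,
                           random_state=42)

# The grid from Table 3: number of trees and max_features per split.
param_grid = {
    "n_estimators": [100, 200, 300, 400, 500],
    "max_features": ["sqrt", "log2"],
}
grid = GridSearchCV(RandomForestClassifier(random_state=42), param_grid,
                    cv=5, n_jobs=-1)
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)
```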

2.5 Validation Model

The model is validated by evaluating the performance of the built system with a confusion matrix. Since there are three classes to predict, a 3x3 confusion matrix is used, as shown in Table 4.

Table 4. Confusion Matrix 3x3

                                          Predicted
Actual                 Before Graduate (0)   Below Average (1)   Above Average (2)
Before Graduate (0)    T00                   T01                 T02
Below Average (1)      T10                   T11                 T12
Above Average (2)      T20                   T21                 T22

From the confusion matrix above, the performance of the built classification model can be measured by calculating Accuracy, Precision, Recall (Sensitivity), and F1-Score for each class $j \in \{0, 1, 2\}$:

$$\mathrm{Accuracy} = \frac{T_{00} + T_{11} + T_{22}}{T_{00} + T_{01} + T_{02} + T_{10} + T_{11} + T_{12} + T_{20} + T_{21} + T_{22}} \qquad (7)$$

$$\mathrm{Precision}_j = \frac{T_{jj}}{T_{0j} + T_{1j} + T_{2j}} \qquad (8)$$

$$\mathrm{Recall}_j = \frac{T_{jj}}{T_{j0} + T_{j1} + T_{j2}} \qquad (9)$$

$$\mathrm{F1}_j = \frac{2 \cdot \mathrm{Precision}_j \cdot \mathrm{Recall}_j}{\mathrm{Precision}_j + \mathrm{Recall}_j} \qquad (10)$$
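In practice, the matrix of Table 4 and the scores of Eq. (7)-(10) can be obtained with scikit-learn's built-in metrics; the labels below are toy values, not the paper's results.

```python
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# y_test and y_pred would come from the fitted model; toy labels shown here.
y_test = [0, 1, 2, 2, 1, 0, 2, 1, 0]
y_pred = [0, 1, 2, 1, 1, 0, 2, 2, 0]

print(confusion_matrix(y_test, y_pred))       # the 3x3 matrix of Table 4
print(accuracy_score(y_test, y_pred))         # Eq. (7)
print(classification_report(y_test, y_pred))  # per-class Eq. (8)-(10)
```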

3. RESULTS AND DISCUSSION

This classification model is tested on two datasets: one-faculty and all-faculty. Each dataset is tested and classified into three classes: Before Graduate (0), Below Average (1), and Above Average (2). The one-faculty dataset contains 1495 rows of raw data and is categorized as a small dataset; the all-faculty dataset contains 9479 rows of raw data and is categorized as a big dataset.

Two testing scenarios are carried out. The first test scenario shows the model performance after hyperparameter tuning, and the second shows the effect of imbalanced data treatment using SMOTE-ENN. Each testing scenario also compares Multinomial Logistic Regression with Random Forest to see which model produces the best accuracy score.

3.1 Feature Selection

After going through the preprocessing steps, the next step is to find the features that significantly affect the target class. As mentioned in the Research Methodology, this study uses Chi-square Test score calculations to eliminate features that have no significant impact on the target class. Features with a p-value of less than 0.05 pass the feature selection and can be seen in Table 5.

Table 5. Features Selected

Attribute                   P-Value (×10^-2)   Chi Score
diff_culture                2.50 × 10^-9       66.44
independent_work            1.17 × 10^-9       68.11
team_work                   7.41 × 10^-9       84.06
pressure_study              7.36 × 10^-8       58.99
critical_thinking           4.01 × 10^-7       55.22
job_field_Finance           9.12 × 10^-1       9.4
job_field_Service           1.40               8.54
job_field_CompElectro       1.38               8.57
job_field_Consultant        2.55 × 10^-2       16.55
job_field_Programming       6.81 × 10^-10      51.42
job_field_Property          2.44               7.43
job_field_SocialPolitic     2.94               7.05
job_field_Telecom           4.31               6.29
initiative                  1.73 × 10^-10      72.26
type_Local                  9.29 × 10^-3       18.57
type_Multinational          3.76 × 10^-4       24.98
type_National               1.37 × 10^-8       45.43
level                       1.59               12.2
cat_Government              5.16 × 10^-4       24.35
cat_Private                 3.28 × 10^-2       16.04
learn_desire                2.77 × 10^-9       66.22
ablt_adapt                  1.11 × 10^-9       68.22
ablt_analysis               5.98 × 10^-4       38.54
ablt_study                  2.79 × 10^-7       56.02
ablt_responsible            8.36 × 10^-5       43.11
ablt_communication          1.81 × 10^-8       62.09
ablt_writing                3.64 × 10^-8       60.54
problem_solving             2.02 × 10^-7       56.74
ablt_present                1.23 × 10^-4       42.23
leadership                  9.73 × 10^-7       53.23
ablt_internet               3.05 × 10^-3       30.58
ablt_computer               2.00 × 10^-6       46.85
ablt_research               1.92 × 10^-8       61.96
loyal_integrate             1.60 × 10^-6       52.11
manage_project              2.11 × 10^-5       46.27
negotiation                 2.09 × 10^-6       51.51
org_HIMA                    1.40 × 10^-3       22.355
org_Spirituality            3.91               6.48
org_Sport                   3.52               6.69
org_Reasoning               1.09               9.03
org_ArtCulture              4.28 × 10^-1       10.9
org_NoParticipate           4.63 × 10^-11      56.8
knowledge_study_field       1.54 × 10^-5       46.98
knowledge_non_study_field   2.80 × 10^-8       61.12
general_knowledge           2.55 × 10^-4       40.53
rate_field                  2.09 × 10^-3       35.58
tolerance                   7.93 × 10^-11      73.97

There are 47 features selected by calculating their Chi-square Test scores, and these features proceed to the first testing scenario.

3.2 Results and Discussion: Performance of Hyperparameter Tuning

After implementing the data preprocessing above, the first test scenario is carried out by optimizing the hyperparameters of both the Multinomial Logistic Regression and Random Forest methods. The optimization uses Grid Search, the simplest algorithm for hyperparameter tuning. The first scenario was tested on the small dataset, namely the one-faculty dataset.

Table 6. Result of First Scenario for One-Faculty Dataset
(all MLR rows also use multi_class='multinomial' and max_iter=50000)

Split (train:test)   Best MLR Parameters                        MLR Accuracy   Best RF Parameters                       RF Accuracy
60:40                C=0, penalty='none', solver='newton-cg'    1.000          max_features='sqrt', n_estimators=400    0.990
70:30                C=0, penalty='none', solver='newton-cg'    1.000          max_features='log2', n_estimators=500    0.987
80:20                C=0, penalty='none', solver='newton-cg'    1.000          max_features='log2', n_estimators=400    0.993
90:10                C=0, penalty='none', solver='newton-cg'    1.000          max_features='log2', n_estimators=500    0.974

As we can see from Table 6, the best parameters for Multinomial Logistic Regression are the same for each split, and the model achieved a perfect 100% accuracy score for all split ratios. The best C parameter equals 0 and the penalty parameter equals 'none', meaning that no regularization is applied, which suggests that the model is probably overfitting. Regularization is needed, especially for Logistic Regression, to calibrate the model and minimize the probability of overfitting or underfitting. Meanwhile, Random Forest achieved its best accuracy score of 99.3% by splitting the data into 80% train data and 20% test data, with the best parameter 'log2' for max features and 400 trees.

Table 7. Result of First Scenario for All-Faculty Dataset
(all MLR rows also use multi_class='multinomial' and max_iter=50000)

Split (train:test)   Best MLR Parameters                         MLR Accuracy   Best RF Parameters                       RF Accuracy
60:40                C=0.01, penalty='l2', solver='saga'         0.542          max_features='sqrt', n_estimators=500    0.514
70:30                C=0.01, penalty='l2', solver='saga'         0.542          max_features='sqrt', n_estimators=500    0.522
80:20                C=0.01, penalty='l2', solver='saga'         0.549          max_features='sqrt', n_estimators=200    0.499
90:10                C=0.01, penalty='l2', solver='newton-cg'    0.544          max_features='sqrt', n_estimators=400    0.502

Next, the first scenario is tested on the all-faculty dataset. From Table 7, the best accuracy score for Multinomial Logistic Regression is 54.9% with a split ratio of 80:20 and best parameters C = 0.01 with an 'l2' penalty. The regularization applied is Ridge Regression, which suggests the model probably has no overfitting or underfitting condition because the 'l2' penalty already handles it. Meanwhile, the best accuracy score for Random Forest is 52.2%, with a split ratio of 70:30. This implies that the model can guess correctly for more than half of the dataset, but the other indicator scores suggest otherwise.

Table 8. Precision, Recall, F1-Score for First Scenario on All-Faculty Dataset

Model                                      Class   Precision   Recall   F1-Score
Multinomial Logistic Regression (80:20)    0       0.51        0.48     0.48
                                           1       0.00        0.00     0.00
                                           2       0.55        0.66     0.66
Random Forest (70:30)                      0       0.51        0.48     0.50
                                           1       0.23        0.10     0.14
                                           2       0.56        0.68     0.60

As shown in Table 8, both models return very low Precision, Recall, and F1-Scores for Class 1. Low precision means that the model returns many false positives and inaccurate predictions; precision can be seen as a measure of quality, so both models performed poorly on this dataset. Low recall means that the model returns many false negatives, in other words, few or even no correct results for that class; recall can be seen as a measure of quantity, so both models retrieve only a small share of the class on this dataset. Low precision and low recall certainly yield a low F1-Score as well, because the F1-Score is the harmonic mean of Precision and Recall.

The low precision, recall, and F1-Score are likely the outcome of imbalanced data, since the scores are lowest only on Class 1, which has the fewest support and differs significantly from the other two classes. It can be concluded that the first test scenario does not produce a good result for either the one-faculty or the all-faculty dataset. Tuning the model hyperparameters does not guarantee a great model; the data must be prepared beforehand so there is no significant difference between the target classes, preventing the imbalanced class problem. This leads us to the next test scenario.

3.3 Results and Discussion: Effect of Imbalanced Data Treatment

After the first test scenario, imbalanced data treatment appears necessary to prevent the imbalanced class problem. Before moving to the next step, the distribution of target classes for the one-faculty and all-faculty datasets can be seen in Figure 2.

Figure 2. Target Classes Distribution. (a) One-Faculty (b) All-Faculty

From the figure above, Class 1 differs significantly from Class 0 and Class 2, which explains why the first scenario produces low precision and recall for both datasets. To treat this imbalanced class problem, the SMOTE-ENN method is used; as mentioned in the Research Methodology, SMOTE-ENN is a hybrid method for oversampling and undersampling data. The second scenario is tested on the one-faculty dataset, with results as shown in Table 9.

Table 9. Result of Second Scenario for One-Faculty Dataset
(all MLR rows also use multi_class='multinomial' and max_iter=50000)

Split (train:test)   Best MLR Parameters                    MLR Accuracy   Best RF Parameters                       RF Accuracy
60:40                C=0, penalty='none', solver='lbfgs'    0.989          max_features='sqrt', n_estimators=400    0.989
70:30                C=1.0, penalty='l2', solver='lbfgs'    0.996          max_features='sqrt', n_estimators=100    0.966
80:20                C=1.0, penalty='l2', solver='lbfgs'    1.000          max_features='sqrt', n_estimators=500    1.000
90:10                C=1.0, penalty='l2', solver='sag'      1.000          max_features='sqrt', n_estimators=100    1.000

As we can see from Table 9, the accuracy score of Multinomial Logistic Regression decreased by 0.011 for the 60:40 split and by 0.004 for the 70:30 split. However, regularization is now applied in some split ratios, and the model produces its best accuracy score of 100% for the 80:20 and 90:10 ratios. Meanwhile, for Random Forest, the accuracy score increased by 0.007 for the 80:20 split compared with the first test scenario.

Table 10. Result of Second Scenario for All-Faculty Dataset
(all MLR rows also use multi_class='multinomial' and max_iter=50000)

Split (train:test)   Best MLR Parameters                        MLR Accuracy   Best RF Parameters                       RF Accuracy
60:40                C=1.0, penalty='l2', solver='newton-cg'    0.522          max_features='log2', n_estimators=500    0.483
70:30                C=0, penalty='none', solver='lbfgs'        0.539          max_features='sqrt', n_estimators=200    0.485
80:20                C=0, penalty='none', solver='newton-cg'    0.520          max_features='sqrt', n_estimators=500    0.467
90:10                C=0, penalty='none', solver='saga'         0.332          max_features='sqrt', n_estimators=200    0.326

The second scenario is tested on the all-faculty dataset, with results as shown in Table 10. Both models produce their best accuracy score with a 70:30 split. The accuracy score for Multinomial Logistic Regression is 53.9%, a decrease of 0.01 from the first test scenario; Random Forest produces an accuracy score of 48.5%, a reduction of 0.037 from the first test scenario.

Table 11. Precision, Recall, F1-Score for Second Scenario on All-Faculty Dataset

Model                                      Class   Precision   Recall   F1-Score
Multinomial Logistic Regression (70:30)    0       0.49        0.49     0.49
                                           1       0.57        0.60     0.59
                                           2       0.55        0.52     0.53
Random Forest (70:30)                      0       0.44        0.44     0.44
                                           1       0.52        0.50     0.51
                                           2       0.49        0.51     0.50

As shown in Table 11, the Precision, Recall, and F1-Score are more stable and balanced for both models than in the first test scenario, with no significant gap between the three scores for each class. It can be concluded that the imbalanced data treatment using SMOTE-ENN helps increase the Precision, Recall, and F1-Score of the minority class, Class 1. Even though the accuracy scores drop from the first test scenario for both models, each target class is distributed more equally and gives a better result on the other measured scores. Therefore, the second test scenario provides the best model by splitting the all-faculty dataset into 70% training data and 30% testing data.

3.4 Prediction Results

The prediction results give a picture of the values obtained from the research process carried out; they can be seen in the following figure:


Figure 3. Prediction Results. (a) Before prediction. (b) After Multinomial Logistic Regression. (c) After Random Forest.

The predictions of the best model for the all-faculty dataset can be seen in Figure 3. In the distribution of target classes before applying the best model, Class 1 or Below Average has the highest count, 569 or 35.15%; Class 2 or Above Average has 548 or 33.85%; and the smallest class is Before Graduate, with a total of 502 or 31.01%. After applying Multinomial Logistic Regression, the Below Average class increased by 1.95% to a total of 601, the Above Average class decreased by 1.67% to a total of 521, and the Before Graduate class decreased by 0.31% to a total of 497. Meanwhile, after applying Random Forest, the Below Average class decreased by 1.24% to a total of 549, the Above Average class increased by 1.24% to a total of 568, and the Before Graduate class kept the same count as before.

4. CONCLUSION

Based on the results and discussion that have been carried out, it can be concluded that there is a probability that students' performance throughout their study period can improve their work readiness. The suitability of students' field and level of education for their first job can also be included among the features that significantly affect work readiness. The features eliminated by the feature selection can still serve as supporting data for additional information and might improve the prediction of work readiness even further. It can also be concluded that the best model is produced by tuning the hyperparameters and treating the imbalanced data with SMOTE-ENN. The imbalanced class problem can be a burden, especially for big datasets, as it affects the precision and recall scores. By handling the imbalanced data with SMOTE-ENN, the precision score of the minority class rose to approximately 0.57 and the recall score to 0.60 with Multinomial Logistic Regression; meanwhile, the precision score rose by about 0.30 and the recall score by about 0.40 with Random Forest. Therefore, balancing the data with SMOTE-ENN makes a major difference for the minority class prediction. The best Multinomial Logistic Regression model produces an accuracy score of 53.9%, and the best Random Forest model produces an accuracy score of 48.5%. The accuracy score itself is not very high, considering that Multinomial Logistic Regression and Random Forest are powerful machine learning algorithms. This is due to the large number of missing values and outliers in the dataset, which led to many rows being dropped and might have affected the data distribution. The models work better on the one-faculty dataset, with an accuracy of 100% for both Multinomial Logistic Regression and Random Forest, because there are fewer missing values and outliers to drop, so the data keep their original distribution. For further research, gathering other, more varied students' performance data is one way to increase model performance; the use of more complete data may also improve it. Furthermore, a more diverse set of classification models could be tried to find which gives the best performance for this type of research.

REFERENCES

[1] S. C. Necula and C. Strîmbei, "People analytics of semantic web human resource résumés for sustainable talent acquisition," Sustainability (Switzerland), vol. 11, no. 13, Jul. 2019, doi: 10.3390/SU11133520.
[2] M. J. D. Kavanagh and R. D. D. Johnson, Human Resource Information Systems: Basics, Applications, and Future Directions. SAGE Publications, Incorporated, 2020.
[3] P. Leonardi and N. Contractor, "Better people analytics," Harvard Business Review, vol. 2018, no. November-December, pp. 1–22, 2018.
[4] I. Kapareliotis, K. Voutsina, and A. Patsiotis, "Internship and employability prospects: assessing student's work readiness," Higher Education, Skills and Work-based Learning, vol. 9, no. 4, pp. 538–549, 2019, doi: 10.1108/HESWBL-08-2018-0086.
[5] I. P. Herbert, A. T. Rothwell, J. L. Glover, and S. A. Lambert, "Graduate employability, employment prospects and work-readiness in the changing field of professional work," International Journal of Management Education, vol. 18, no. 2, p. 100378, 2020, doi: 10.1016/j.ijme.2020.100378.
[6] F. Nasril, D. Indiyati, and G. Ramantoko, "Talent Performance Analysis Using People Analytics Approach," Budapest International Research and Critics Institute (BIRCI-Journal): Humanities and Social Sciences, vol. 4, no. 1, pp. 216–230, 2021, doi: 10.33258/birci.v4i1.1585.
[7] I. Supriadi, A. Hariyanti, M. Z. Abidin, Rinrin, and D. Gustian, "Penerapan regresi linier berganda dalam kesiapan kerja mahasiswa," Seminar Nasional Informatika 2020, vol. 1, no. 1, pp. 204–211, 2020.
[8] K. C. Saling and M. D. Do, "Leveraging people analytics for an adaptive complex talent management system," Procedia Computer Science, vol. 168, pp. 105–111, 2020, doi: 10.1016/J.PROCS.2020.02.269.
[9] A. Anthony, E. Sediyono, and A. Iriani, "Analisis Kesiapan Kerja Mahasiwa di Era Revolusi Industri 4.0 Menggunakan Soft-System Methodology," Jurnal Teknologi Informasi dan Ilmu Komputer, vol. 7, no. 5, p. 1041, 2020, doi: 10.25126/jtiik.2020752380.
[10] M. Lin et al., "Detection of Ionospheric Scintillation Based on XGBoost Model Improved by SMOTE-ENN Technique," 2021, doi: 10.3390/rs13132577.
[11] T. Al-Shehari and R. A. Alsowail, "An Insider Data Leakage Detection Using One-Hot Encoding, Synthetic Minority Oversampling and Machine Learning Techniques," 2021, doi: 10.3390/e23101258.
[12] "SMOTEENN Version 0.9.1." https://imbalanced-learn.org/stable/references/generated/imblearn.combine.SMOTEENN.html (accessed Jul. 08, 2022).
[13] K. F. Weaver, V. C. Morales, S. L. Dunn, K. Godde, and P. F. Weaver, An Introduction to Statistical Analysis in Research: With Applications in the Biological and Life Sciences. John Wiley & Sons, 2017.
[14] A. J. Dobson and A. G. Barnett, An Introduction to Generalized Linear Models, 4th ed. Taylor & Francis Group, 2018, doi: 10.1201/9781315182780.
[15] H. A. Park, "An introduction to logistic regression: From basic concepts to interpretation with particular attention to nursing domain," J Korean Acad Nurs, vol. 43, no. 2, pp. 154–164, 2013, doi: 10.4040/jkan.2013.43.2.154.
[16] "sklearn.linear_model.LogisticRegression — scikit-learn 1.1.1 documentation." https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html?highlight=logistic%20regression#sklearn.linear_model.LogisticRegression (accessed Jul. 08, 2022).
[17] "sklearn.ensemble.RandomForestClassifier — scikit-learn 1.1.1 documentation." https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html (accessed Jul. 08, 2022).
[18] W. Sullivan, Machine Learning For Beginners Guide Algorithms: Supervised & Unsupervised Learning. Decision Tree & Random Forest Introduction. Healthy Pragmatic Solutions Inc., 2017.
[19] N. Farnaaz and M. A. Jabbar, "Random Forest Modeling for Network Intrusion Detection System," Procedia Computer Science, vol. 89, pp. 213–217, 2016, doi: 10.1016/j.procs.2016.06.047.
[20] Y. Li et al., "Random forest regression for online capacity estimation of lithium-ion batteries," Applied Energy, vol. 232, pp. 197–210, 2018, doi: 10.1016/j.apenergy.2018.09.182.
