This study examines various employee attrition factors to see how they can be used to predict employee attrition for an organization, based on a human resources dataset openly available online. The thesis submitted by Sharafat Hossain in the Department of Software Engineering, Daffodil International University, has been accepted in fulfillment of the requirements for the degree of M.Sc. Imtiaz-Ud-Din, Assistant Professor, Department of Software Engineering, Daffodil International University, in fulfillment of my original work.
The cost of replacing a well-trained or high-performing employee can be very high, and the loss sometimes goes beyond what can be measured in money. Understanding which factors contribute most to an employee's departure can help management plan actions to improve retention and plan new hires in advance. Our research aims to discover how data science can help identify which characteristics in the dataset contribute most to an employee leaving the organization.
We also build a prediction model that predicts whether an employee will leave the organization, so that management can take appropriate steps to reduce employee turnover. We then apply a clustering algorithm to the data set to identify how the selected features affect employee attrition.
INTRODUCTION
What if the management of an organization could scientifically predict which employees will leave and the reasons behind their turnover? Feature selection is the process of selecting the features that contribute most to the target variable, that is, the output the model is concerned with. Including irrelevant features in the data can greatly reduce the accuracy of the model.
Correlation between the factors can also play a significant role in analyzing how the factors interact with each other within the model. The importance of each feature in the data set can be obtained from the feature importance property of the model. Correlation is a well-known statistical tool that provides information about how the variables within the model are related to each other.
In other words, it is a measure of the extent to which changes in the value of one variable predict changes in the value of another variable. The correlation coefficient is calculated by dividing the covariance of the two variables by the product of their standard deviations. A predictive model is used to predict an outcome in a future state or time depending on changes in the model's input variables.
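The definition of the correlation coefficient given above can be written out directly. The following is a minimal sketch, not the code used in the study; the function name `pearson_r` and the toy values are assumptions made for illustration only.

```python
import numpy as np

def pearson_r(x, y):
    """Correlation coefficient: covariance divided by the product of the standard deviations."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # Population covariance and population standard deviations (ddof=0).
    # Using the sample versions (ddof=1) in both places gives the same ratio.
    cov_xy = np.mean((x - x.mean()) * (y - y.mean()))
    return cov_xy / (x.std() * y.std())

# Hypothetical toy values, not taken from the HR dataset.
satisfaction = [0.9, 0.8, 0.4, 0.3, 0.7, 0.2]
left = [0, 0, 1, 1, 0, 1]

r = pearson_r(satisfaction, left)
print(round(r, 3))  # negative: lower satisfaction goes with left == 1 in this toy data
```

The result can be cross-checked against `np.corrcoef(satisfaction, left)[0, 1]`, which computes the same quantity.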
Gradient boosting is recognized as one of the most widely accepted and powerful techniques for building predictive models. The general idea is to train decision trees sequentially so that each new tree fits a modified version of the original data set. The predictions of the final ensemble model are therefore the weighted sum of the predictions made by the individual tree models.
Gradient boosting identifies the shortcomings of the current model using the gradients of the loss function (in y = ax + b + e, the term e deserves special mention, as it is the error term). The loss function is a measure of how well the model's coefficients fit the original data set. The scope of our research was limited to our HR analytics dataset and the employee attributes available in it.
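The boosting idea described above can be sketched in a few lines. This is an illustrative simplification under stated assumptions, not the model used in the study: it uses squared-error loss on a synthetic regression problem, where the negative gradient of the loss is simply the residual, and the learning rate, tree depth, and number of trees are arbitrary choices.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic one-dimensional regression data (assumption, for illustration only).
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(200, 1))
y = np.sin(2 * np.pi * X[:, 0]) + rng.normal(0, 0.1, size=200)

learning_rate = 0.1  # weight applied to each tree's prediction
n_trees = 50

# Start from a constant prediction; the mean minimizes squared error.
prediction = np.full_like(y, y.mean())
trees = []
mse_history = [np.mean((y - prediction) ** 2)]

for _ in range(n_trees):
    # For squared-error loss, the negative gradient is the residual,
    # so each new tree is fitted to what the ensemble still gets wrong.
    residuals = y - prediction
    tree = DecisionTreeRegressor(max_depth=2, random_state=0)
    tree.fit(X, residuals)
    # The ensemble prediction is the weighted sum of all trees so far.
    prediction += learning_rate * tree.predict(X)
    trees.append(tree)
    mse_history.append(np.mean((y - prediction) ** 2))

print(f"training MSE: {mse_history[0]:.3f} -> {mse_history[-1]:.3f}")
```

Each tree fits the "modified version" of the data (the residuals), and the training error shrinks as trees are added.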
LITERATURE REVIEW
RESEARCH
For our research, we set out to see which employee characteristics contribute the most to attrition from the organization. These are the features that have the most influence on an employee leaving the organization in our chosen data set. After selecting the most important features from the data set, we calculated the correlation coefficients between those features.
This allowed us to examine how the features are interrelated and how they affect the value of the target variable (positively or negatively). We used scikit-learn's train_test_split() function to obtain a training dataset and a testing dataset for our model. The training dataset (60% of the total data) is used to train the model, and the testing dataset (40% of the total data) is used to test it.
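A 60/40 split of this kind might look as follows. This is a sketch on synthetic arrays; the array sizes, the `random_state` value, and the use of `stratify` (which keeps the share of leavers equal in both parts) are assumptions and are not stated in the study.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 1000 employees, 5 features, roughly 23% leavers.
rng = np.random.default_rng(42)
X = rng.random((1000, 5))
y = (rng.random(1000) < 0.23).astype(int)

# test_size=0.4 keeps 60% of the rows for training and 40% for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=42, stratify=y
)

print(X_train.shape, X_test.shape)
```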
Model performance is measured using scikit-learn's evaluation metrics: accuracy, precision, and recall. The data we considered consists of various attributes of employees recorded while they worked in the company. Univariate selection and feature importance were chosen as the feature selection methods for employee attrition on the HR data set.
The correlation coefficients were calculated for the features common to both feature selection techniques. Finally, a prediction model is built on the dataset, and its performance is measured by accuracy, precision, and recall.
RESULTS AND DISCUSSIONS
For univariate selection, we used the chi-squared (chi^2) statistical test for non-negative features to select the 5 best features from our dataset. For feature importance, we fitted an ExtraTreesClassifier on our dataset to identify the columns that are most influential in determining the values of the 'left' column. The correlation coefficient also assumes that the association is linear, that is, one variable increases or decreases by a fixed amount for a unit increase or decrease in the other.
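The two feature selection steps described above can be sketched with scikit-learn as follows. The data here are synthetic: the `noise_*` columns, the value ranges, and the rule that generates 'left' are hypothetical and only serve to make the example runnable; only the column names 'left', 'satisfaction_level', 'time_spend_company', and 'average_monthly_hours' come from the text.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_selection import SelectKBest, chi2

# Synthetic stand-in for the HR dataset (all values are made up).
rng = np.random.default_rng(0)
n = 2000
df = pd.DataFrame({
    "satisfaction_level": rng.uniform(0.1, 1.0, n),
    "time_spend_company": rng.integers(2, 11, n),
    "average_monthly_hours": rng.integers(96, 311, n),
    "noise_a": rng.uniform(0, 1, n),   # hypothetical irrelevant columns
    "noise_b": rng.uniform(0, 1, n),
    "noise_c": rng.uniform(0, 1, n),
    "noise_d": rng.uniform(0, 1, n),
})
# Hypothetical rule linking the features to leaving.
logit = (-4.0 * (df["satisfaction_level"] - 0.6)
         + 0.3 * (df["time_spend_company"] - 4)
         + 0.01 * (df["average_monthly_hours"] - 200) - 1.2)
df["left"] = (rng.uniform(0, 1, n) < 1 / (1 + np.exp(-logit))).astype(int)

X = df.drop(columns="left")
y = df["left"]

# Univariate selection: chi-squared test, keep the 5 best features.
# chi2 requires non-negative feature values, which holds here.
selector = SelectKBest(score_func=chi2, k=5).fit(X, y)
chi2_top = set(X.columns[selector.get_support()])

# Feature importance: fit an extra-trees ensemble and rank its importances.
forest = ExtraTreesClassifier(n_estimators=100, random_state=0).fit(X, y)
importances = pd.Series(forest.feature_importances_, index=X.columns)
tree_top = set(importances.nlargest(5).index)

# Features selected by both methods.
common = sorted(chi2_top & tree_top)
print(common)
```

Note that chi-squared scores depend on the scale of each feature, so the two methods need not agree; taking the intersection, as the study does, keeps only the features both methods rank highly.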
We then calculated the correlation coefficients between the target variable of our data set, the 'left' column (employees who left the organization), and the common set of features from the two feature selection methods, to see which have the greater influence on the value of 'left'. To see how many employees in the data set left the organization, we calculated the proportion and found that employees who left make up 23% of all employees. We can see that the level of satisfaction is lower for employees who have left the organization.
This is also confirmed by our correlation analysis, where the correlation coefficient between 'left' and 'satisfaction_level' was negative. By the definition of correlation, this means that as the value of 'satisfaction_level' decreases, the value of 'left' tends to increase. Further visualization gives us more clarity about the effects of the selected features on the employees who left the organization.
We can see that the people who left were less satisfied with the organization, and that employees with five years of experience leave more often. This is in line with the positive correlation coefficients between 'left' and 'time_spend_company' and between 'left' and 'average_monthly_hours'. The main focus of the research was whether we can identify the important features behind employee attrition and use them to build a predictive model that can help organizational management predict attrition.
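The checks described above (share of leavers, satisfaction by group, and correlation with 'left') can be sketched with pandas. The data frame below is synthetic, and the rule that makes less satisfied employees more likely to leave is an assumption built in for illustration, so the numbers it prints are not the study's results.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data (values are made up).
rng = np.random.default_rng(1)
n = 1000
satisfaction = rng.uniform(0.1, 1.0, n)
# Hypothetical rule: lower satisfaction means a higher chance of leaving.
left = (rng.uniform(0, 1, n) < 0.6 * (1 - satisfaction)).astype(int)
df = pd.DataFrame({"satisfaction_level": satisfaction, "left": left})

# Share of employees who left (the mean of a 0/1 column).
attrition_rate = df["left"].mean()

# Average satisfaction for stayers (0) and leavers (1).
mean_satisfaction = df.groupby("left")["satisfaction_level"].mean()

# Correlation of every feature with the 'left' column.
corr_with_left = df.corr()["left"].drop("left")

print(f"attrition rate: {attrition_rate:.2%}")
print(mean_satisfaction)
print(corr_with_left)
```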
We split the dataset into a training set and a test set using the train_test_split() function. Precision: when the Gradient Boosting model predicted that an employee would leave, that employee actually left 93% of the time. Recall: when an employee who had left was present in the test set, our model identified that employee 91% of the time.
From our analysis, we found that, in our data set, employees who are less satisfied tend to leave the organization. Our prediction model performed well enough to predict whether an employee will leave the organization, given the values of the most important features.
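The evaluation described above can be sketched as follows. This is a minimal example on synthetic data with default model settings, both of which are assumptions; it will not reproduce the 93% and 91% figures reported for the actual dataset. The manual counts show what precision and recall measure.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data with three features (values are made up).
rng = np.random.default_rng(7)
n = 2000
satisfaction = rng.uniform(0.1, 1.0, n)
years = rng.integers(2, 11, n)
hours = rng.integers(96, 311, n)
X = np.column_stack([satisfaction, years, hours])
# Hypothetical rule linking the features to leaving.
logit = -5.0 * (satisfaction - 0.5) + 0.4 * (years - 4) + 0.01 * (hours - 200) - 1.0
y = (rng.uniform(0, 1, n) < 1 / (1 + np.exp(-logit))).astype(int)

# 60% training data, 40% test data.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=7
)

model = GradientBoostingClassifier(random_state=7).fit(X_train, y_train)
y_pred = model.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)  # of predicted leavers, how many left
recall = recall_score(y_test, y_pred)        # of actual leavers, how many were found

# The same two metrics from raw counts.
tp = int(np.sum((y_pred == 1) & (y_test == 1)))  # predicted to leave and did leave
fp = int(np.sum((y_pred == 1) & (y_test == 0)))  # predicted to leave but stayed
fn = int(np.sum((y_pred == 0) & (y_test == 1)))  # left but was not flagged

print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f}")
```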
CONCLUSION AND RECOMMENDATIONS
Organizations must pay more attention to increasing employees' job-related satisfaction if they want to retain their valuable assets. The degree of involvement in the job and the time spent with the company are also very important in an employee's decision to leave the organization. It is therefore very important that the organization's management also focuses on these two factors. The dataset could have been larger, with more employee attributes that might be even more important in identifying key contributing factors.
With more data in hand, the research would have been more effective: the training set would have been larger and, as a result, the predictions would likely have been more accurate. Future research on this topic may include collecting a larger data set and including more contributing factors and features.
This would make the results of the research more accurate and a better contribution to employee turnover forecasting for any organization.