STUDY ON CREDIT RISK MODELING SYSTEM USING MACHINE LEARNING TECHNIQUES

This project, entitled "A STUDY ON A CREDIT RISK MODELING SYSTEM USING MACHINE LEARNING TECHNIQUES" and submitted by Sima Akter to the Department of Computer Science and Engineering at Daffodil International University, has been accepted as satisfactory in partial fulfillment of the requirements for the degree of B.Sc., Department of Computer Science and Engineering, Faculty of Science and Information Technology, Daffodil International University. I declare that this project was carried out by me under the supervision of Ahmed Al Marouf (AAM), Lecturer, Department of CSE, Daffodil International University.

I also declare that neither this project nor any part of it has been submitted elsewhere for the award of any degree or diploma. I am truly grateful and wish to express my deep indebtedness to Ahmed Al Marouf (AAM), Lecturer, Department of CSE, Daffodil International University, Dhaka. The deep knowledge and keen interest of my supervisor in the field of "Machine Learning" made it possible to realize this project.

His endless patience, scientific guidance, constant encouragement, energetic supervision, constructive criticism, and valuable advice, together with his reading of many inferior drafts and correcting them at all stages, have made it possible to complete this project. I would like to express my sincere gratitude to the Head of the CSE Department for his kind assistance in completing my project, and also to the other faculty members and staff of the CSE Department of Daffodil International University. I would like to thank all my coursemates at Daffodil International University who took part in discussions while completing the coursework.

The analysis shows that the problem can be optimized using machine learning techniques to predict the behavior of the customer.

Introduction

Motivation

Rationale of the Study

Research Questions

Expected Output

Report Layout

There is a large literature on credit and risk scoring models, but few use machine learning methods or credit card data.

Related Works

One explanation could be the lack of credit datasets, as such data cannot be published given their sensitive nature.

Research Summary

Scope of the Problem

Summarizing all these different dimensions into one score is challenging, but machine learning techniques help achieve this goal. The common goal behind machine learning and traditional statistical learning tools is to learn from data. Typically, statistical learning methods rely on formal relationships between variables in the form of mathematical equations, while machine learning methods can learn from data without the need for rule-based programming.

As a result of this flexibility, machine learning methods can better adapt to patterns in the data.

Challenges

Brief overview of machine learning and theory on models to be put into practice.

Research Subject and Instrumentation

Python Language and Libraries: In this study, we used Python as the programming language and NumPy, Pandas, Matplotlib, Seaborn, and Scikit-Learn as libraries for analyzing the data and building machine learning models. Machine Learning: Machine learning is an area of computer science that draws on pattern recognition and computational learning theory in AI. It generally refers to changes in systems that perform tasks related to artificial intelligence (AI).

Machine learning is used to build programs whose tuning parameters are adjusted automatically to improve their performance by adapting to previously seen data. To predict the class of the loan, we use four classification algorithms (logistic regression, k-nearest neighbors, decision trees and random forest). Our goal is to predict a class label, which is a choice from a predefined list of possibilities, and to make accurate predictions for new, never-before-seen data.
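
The workflow described above can be sketched with Scikit-Learn as follows. This is a minimal illustration and not the study's actual code: the data is a synthetic stand-in generated with make_classification, and the hyperparameters (n_neighbors=5, n_estimators=100, max_iter=1000) are assumptions chosen only to make the example run.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the loan data (assumption: the real data set is not reproduced here).
X, y = make_classification(n_samples=500, n_features=8, random_state=0)

# Hold out part of the data to play the role of new, never-before-seen loans.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# The four classifier families named in the text; settings are illustrative assumptions.
models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "k-Nearest Neighbors": KNeighborsClassifier(n_neighbors=5),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)                 # learn from the training loans
    scores[name] = model.score(X_test, y_test)  # accuracy on the held-out loans
    print(f"{name}: held-out accuracy = {scores[name]:.3f}")
```

Every model exposes the same fit and score methods, so the four algorithms can be compared in a single loop.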

Logistic regression is used in binary classification problems, where the outcome falls into one of two classes, as in our task of predicting the defaulter. For k-nearest neighbors, on the other hand, building the model is fast, but prediction can be slow for larger training sets. It also does not perform well when the data set has many features.
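
For k-nearest neighbors, one of the four classifiers used in this study, the main tuning choice is the number of neighbors K. The sketch below shows one way an error-rate-versus-K curve could be computed. It assumes synthetic data and a K range of 1 to 20; neither is taken from the study.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (assumption, not the study's loan records).
X, y = make_classification(n_samples=400, n_features=6, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=1
)

k_values = list(range(1, 21))  # assumed search range for K
error_rates = []
for k in k_values:
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    # Error rate = fraction of held-out samples that are misclassified.
    error_rates.append(float(np.mean(knn.predict(X_test) != y_test)))

best_k = k_values[int(np.argmin(error_rates))]
print(f"lowest held-out error {min(error_rates):.3f} at K = {best_k}")
```

Choosing K on the same held-out set that is later used for reporting gives an optimistic estimate, so a separate validation set or cross-validation would be a safer choice in practice.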

Strengths, Weaknesses and Parameters: As discussed earlier, the parameters that control model complexity in decision trees are the pre-pruning parameters that stop the construction of the tree before it is fully developed. Usually, picking one of the pre-pruning strategies that set either maximum depth, maximum leaf nodes, or minimum leaf samples is sufficient to prevent overfitting. Advantages include: the resulting model can be easily visualized and understood by non-experts (at least for smaller trees), and the algorithms are completely invariant to the scale of the data.
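
The effect of pre-pruning can be sketched by comparing an unrestricted tree with a depth-limited one. The example assumes synthetic, noisy data and an arbitrary limit of max_depth=4; in Scikit-Learn the other two strategies mentioned above correspond to the max_leaf_nodes and min_samples_leaf parameters.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with 10% label noise (assumption), so an unpruned tree can overfit.
X, y = make_classification(
    n_samples=600, n_features=10, n_informative=4, flip_y=0.1, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Fully grown tree: keeps splitting until every leaf is pure.
full_tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# Pre-pruned tree: construction stops once depth 4 is reached.
pruned_tree = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X_train, y_train)

for label, tree in [("unpruned", full_tree), ("max_depth=4", pruned_tree)]:
    print(
        f"{label}: depth={tree.get_depth()}, leaves={tree.get_n_leaves()}, "
        f"train acc={tree.score(X_train, y_train):.3f}, "
        f"test acc={tree.score(X_test, y_test):.3f}"
    )
```

The unpruned tree memorizes the training set, while the depth limit trades some training accuracy for a much smaller tree that is easier to inspect.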

Since each feature is processed separately and possible data partitions are independent of scaling, decision tree algorithms do not require pre-processing such as normalization or standardization of features. Random forest is a tree-based algorithm that builds multiple trees and then combines their outputs to improve the performance of the model. It can process thousands of input variables and identify the most important ones, so it is also considered one of the dimensionality reduction methods.
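
How a random forest ranks input variables can be sketched with the feature_importances_ attribute of Scikit-Learn's RandomForestClassifier. The data and the feature names below are synthetic placeholders, not the study's loan attributes.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data (assumption): with shuffle=False the first 3 columns carry the signal
# and the remaining 3 are pure noise.
X, y = make_classification(
    n_samples=500, n_features=6, n_informative=3, n_redundant=0,
    shuffle=False, random_state=0,
)
feature_names = [f"feature_{i}" for i in range(X.shape[1])]  # hypothetical names

# Build 100 trees; the forest combines their votes into one prediction.
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Impurity-based importances, normalized so that they sum to 1.
importances = forest.feature_importances_
ranking = np.argsort(importances)[::-1]
for idx in ranking:
    print(f"{feature_names[idx]}: {importances[idx]:.3f}")
```

Features with very low importance are candidates for removal, which is the sense in which the method can support dimensionality reduction.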

Data Collection Procedure

Since we build the model at the loan level and not at the customer level, we consider the characteristics of loans, but not the customer.

Statistical Analysis

The following graph shows that long-term loans are more vulnerable: about 25 percent of people received a long-term loan, and 48 percent of them are defaulters, which is 12 percent of the given data set. Short-term loans appear safer: their defaulters make up about 20 percent of the data set, which is 26 percent of the short-term loans given. Purpose of loans: There are sixteen types of purpose; about 80 percent of loans are used for debt consolidation, and 25 percent of those are defaulters.
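
The figures above mix two kinds of percentage: the default rate within a loan term and the share of the whole data set. The sketch below shows how both can be computed with pandas.crosstab. The sixteen records are invented for illustration and do not reproduce the study's numbers; the column names "Term" and "Default" are assumptions.

```python
import pandas as pd

# Invented toy records (NOT the study's data): 4 long-term and 12 short-term loans.
loans = pd.DataFrame({
    "Term": ["Long Term"] * 4 + ["Short Term"] * 12,
    "Default": [1, 1, 0, 0] + [1, 1, 1] + [0] * 9,  # 1 = defaulter
})

# Default rate within each term: every row sums to 1.
within_term = pd.crosstab(loans["Term"], loans["Default"], normalize="index")

# Share of the whole data set: all cells together sum to 1.
of_all_loans = pd.crosstab(loans["Term"], loans["Default"], normalize="all")

print(within_term)
print(of_all_loans)
```

In this toy example the long-term loans are a quarter of the records and half of them default, so long-term defaulters make up 0.25 x 0.50 = 12.5 percent of all records. The same multiplication links the 25, 48 and 12 percent figures quoted above.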

Defaulters living in rental houses make up 15 percent of the data set, which is more than 34 percent of the loans given to people with rental houses. Year in current job: From the chart below, it can be seen that most of the loans were given to employees who have more than 10 years of experience. Coding of categorical features: We identified four categorical features: Purpose, Term, Home Ownership and Year in Current Job.
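
Two common ways to code such categorical features with Pandas are sketched below: integer codes and one-hot (dummy) columns. The text does not state which scheme the study used, and the three example rows and their category values are invented.

```python
import pandas as pd

# Invented example rows; only the four column names come from the text.
loans = pd.DataFrame({
    "Purpose": ["Debt Consolidation", "Home Improvements", "Debt Consolidation"],
    "Term": ["Short Term", "Long Term", "Short Term"],
    "Home Ownership": ["Rent", "Own Home", "Home Mortgage"],
    "Year in Current Job": ["10+ years", "2 years", "< 1 year"],
})

# Option 1: integer codes, assigned in sorted order of the category labels.
label_encoded = loans.apply(lambda col: col.astype("category").cat.codes)

# Option 2: one-hot encoding, one indicator column per category value.
one_hot = pd.get_dummies(loans, columns=list(loans.columns))

print(label_encoded)
print(one_hot.columns.tolist())
```

Integer codes keep the table narrow but impose an artificial order on the categories, which distance-based models such as k-nearest neighbors will treat as meaningful. One-hot columns avoid this at the cost of more features.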

Imputation of missing values: Data sets are often full of missing data, extreme data points called outliers, and other odd values. Detecting missing values is the easy part; deciding how to deal with them is much harder. In cases where we have a lot of data and only a few missing values, it may make sense to simply delete the records with missing values.

On the other hand, if we have more than a handful of missing values, removing records with missing values can get rid of a lot of data. Missing values in categorical data are not of particular concern because NA can simply be treated as an additional category. Missing values in numeric variables are more difficult because we cannot treat a missing value as a number.
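
The two cases can be sketched with Pandas as follows. Filling numeric gaps with the column median is only one possible strategy and is an assumption here, as are the column names and the five example rows.

```python
import numpy as np
import pandas as pd

# Invented example rows with gaps (assumption: not the study's records).
loans = pd.DataFrame({
    "Annual Income": [50000.0, np.nan, 72000.0, 61000.0, np.nan],
    "Home Ownership": ["Rent", None, "Own Home", "Rent", "Home Mortgage"],
})

# Detecting missing values is the easy part.
missing_counts = loans.isna().sum()
print(missing_counts)

imputed = loans.copy()

# Categorical column: treat a missing value as its own category.
imputed["Home Ownership"] = imputed["Home Ownership"].fillna("Missing")

# Numeric column: replace a missing value with the median of the observed values.
median_income = imputed["Annual Income"].median()
imputed["Annual Income"] = imputed["Annual Income"].fillna(median_income)

print(imputed)
```

The median is less sensitive to outliers than the mean, which matters for skewed variables such as income.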

Correlation Coefficient: Before building a model and evaluating its results, we need to examine the relationships between the variables in the data set. The goal is to identify the variables that have a strong linear relationship. This is done by building a correlation matrix, which takes each continuous variable and computes the correlation coefficient for every pair of variables in the data set. The correlation coefficient is calculated using the Pearson or Spearman measure, with values ranging from -1 (perfect negative correlation) to 1 (perfect positive correlation).
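
A correlation matrix of this kind can be computed with DataFrame.corr in Pandas. The sketch below uses made-up columns built so that the expected coefficients are known in advance; it does not use the study's variables.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
x = rng.normal(size=200)

# Made-up columns with known relationships to x.
df = pd.DataFrame({
    "x": x,
    "twice_x": 2 * x,                 # perfect positive linear relationship
    "minus_x": -x,                    # perfect negative linear relationship
    "noise": rng.normal(size=200),    # independent of x
})

pearson = df.corr(method="pearson")    # measures linear association
spearman = df.corr(method="spearman")  # rank-based, measures monotonic association

print(pearson.round(2))
```

Pairs with coefficients close to 1 or -1 carry nearly the same information, so one variable of such a pair can often be dropped before modeling.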

Implementation Requirements

A total of four different machine learning algorithms (logistic regression, kNN, random forest and decision tree) were applied.

Experimental Results

So from the above bar chart we can conclude that Logistic Regression has lower training and testing times while also showing better accuracy for our data set and performing better on the other evaluation metrics.
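
Training and testing times of this kind can be measured with a small helper such as the one sketched below. The helper name time_model is hypothetical, the data is synthetic, and the text does not describe how the study took its timings.

```python
import time

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split


def time_model(model, X_train, y_train, X_test):
    """Return (training seconds, prediction seconds, predictions) for one model."""
    start = time.perf_counter()
    model.fit(X_train, y_train)
    train_seconds = time.perf_counter() - start

    start = time.perf_counter()
    predictions = model.predict(X_test)
    test_seconds = time.perf_counter() - start
    return train_seconds, test_seconds, predictions


# Synthetic stand-in data (assumption).
X, y = make_classification(n_samples=1000, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=0)

train_seconds, test_seconds, predictions = time_model(
    LogisticRegression(max_iter=1000), X_train, y_train, X_test
)
print(f"training: {train_seconds:.4f} s, testing: {test_seconds:.4f} s")
```

Single measurements vary from run to run, so averaging over several repetitions gives a fairer comparison between models.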

Descriptive Analysis

Evaluation of the result: In this section we list the results of each model. We used 40 percent of the data for the test set and 60 percent for the training set, and found that Logistic Regression is the best performing model.
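
The 60/40 split and a set of common evaluation metrics can be sketched as follows. The data is synthetic, and the stratify option and the choice of metrics are assumptions, since the text does not list the exact settings used.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in with roughly 25% positives, playing the role of defaulters (assumption).
X, y = make_classification(
    n_samples=1000, n_features=8, weights=[0.75, 0.25], random_state=0
)

# 60 percent for training, 40 percent for testing; stratify keeps the class ratio
# the same in both parts (an added assumption, not stated in the text).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, stratify=y, random_state=0
)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
y_pred = model.predict(X_test)

metrics = {
    "accuracy": accuracy_score(y_test, y_pred),
    "precision": precision_score(y_test, y_pred),
    "recall": recall_score(y_test, y_pred),
    "f1": f1_score(y_test, y_pred),
}
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

When defaulters are the minority class, precision and recall are more informative than accuracy alone, because a model that never predicts a default can still reach high accuracy.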

Summary

Summary of the Study

Conclusions

Recommendations

Implication for Further Study

After a lot of research, I settled on credit risk modeling because it is an important problem that affects all major financial institutions, including insurance companies and banks. Additionally, this particular dataset gave me the opportunity to experiment with different machine learning algorithms, which is exactly what I wanted in order to grow my knowledge in the field of Machine Learning and Data Science.

Error Rate vs. K Value

Term Loan

Purpose of Loan

House Ownership

Year in Current Job

Missing Value Representation

Correlation Coefficient

Final Result Comparison