Loan Default Prediction

Problem Definition

The Context:

The objective:

The key questions:

The problem formulation:

Data Description:

The Home Equity dataset (HMEQ) contains baseline and loan performance information for 5,960 recent home equity loans. The target (BAD) is a binary variable indicating whether an applicant ultimately defaulted or was severely delinquent. This adverse outcome occurred in 1,189 cases (20 percent). Twelve input variables were recorded for each applicant.

Import the necessary libraries and Data

Data Overview

Load the data
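As a sketch of the loading-and-overview step (the real notebook reads the full HMEQ CSV; the tiny frame below, with a subset of the columns, is a stand-in so the overview calls can be shown end to end):

```python
import pandas as pd
import numpy as np

# In the notebook this would be e.g. hmeq = pd.read_csv("hmeq.csv")
# (the filename is an assumption); a tiny synthetic frame with the same
# column names stands in here for illustration.
hmeq = pd.DataFrame({
    "BAD":     [0, 1, 0, 0, 1],
    "LOAN":    [1100, 1300, 1500, 1700, 2000],
    "MORTDUE": [25860.0, 70053.0, np.nan, 97800.0, 30548.0],
    "VALUE":   [39025.0, 68400.0, 16700.0, np.nan, 40320.0],
    "REASON":  ["HomeImp", "HomeImp", np.nan, "DebtCon", "HomeImp"],
    "JOB":     ["Other", "Other", "Office", np.nan, "Mgr"],
})

print(hmeq.shape)           # rows x columns
print(hmeq.dtypes)          # numeric vs categorical columns
print(hmeq.isnull().sum())  # missing-value count per column
```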

Data Overview

Observations

Missing Values

Observations

Summary Statistics

Observations from Summary Statistics

Observations for Categorical Summary

Exploratory Data Analysis (EDA) and Visualization

Leading Questions:

  1. What is the range of values for the loan amount variable "LOAN"?
  2. How does the distribution of years at present job "YOJ" vary across the dataset?
  3. How many unique categories are there in the REASON variable?
  4. What is the most common category in the JOB variable?
  5. Is there a relationship between the REASON variable and the proportion of applicants who defaulted on their loan?
  6. Do applicants who default have a significantly different loan amount compared to those who repay their loan?
  7. Is there a correlation between the value of the property and the loan default rate?
  8. Do applicants who default have a significantly different mortgage amount compared to those who repay their loan?

Univariate Analysis

Distributions and Outliers

Create count plots to identify the distribution of the data

Create box plots to determine if there are any outliers in the data.
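A minimal plotting sketch for one numeric column (assuming matplotlib; seaborn's histplot and boxplot would be common alternatives):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the script runs headless
import matplotlib.pyplot as plt
import pandas as pd

# Synthetic values stand in for the real LOAN column
loan = pd.Series([1100, 1300, 1500, 1700, 2000, 88900], name="LOAN")

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
loan.plot(kind="hist", ax=axes[0], title="LOAN distribution")  # shape of the data
loan.plot(kind="box", ax=axes[1], title="LOAN outliers")       # points past the whiskers
fig.tight_layout()
```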

Observations

Univariate Analysis - Categorical Data

Major Observations

Bivariate Analysis

BAD vs LOAN

Observations

BAD vs. MORTDUE

Observations

BAD vs. VALUE

Observations

BAD vs. DEBTINC

Observations

Continuous Variables

VALUE and DEROG

VALUE and DELINQ

Bivariate Analysis: BAD vs Categorical Variables

The stacked bar graph lets you compare the distribution of one variable across the levels of a categorical variable; here, it shows the proportion of defaulters within each category.
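One way to build such a plot, sketched with a pandas crosstab (toy data; BAD vs JOB is used as an example pairing):

```python
import pandas as pd

# Toy data standing in for the BAD and JOB columns
df = pd.DataFrame({
    "BAD": [0, 1, 0, 0, 1, 0, 1, 0],
    "JOB": ["Other", "Other", "Office", "Mgr", "Mgr", "Office", "Other", "Mgr"],
})

# Row-normalised crosstab: proportion of defaulters within each JOB level
ct = pd.crosstab(df["JOB"], df["BAD"], normalize="index")
print(ct)
# ct.plot(kind="bar", stacked=True) would render the stacked bar chart
```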

Plot stacked bar plot for LOAN and JOB

Observations

Plot stacked bar plot for LOAN and DEROG

Observations

Plot stacked bar plot for LOAN and DELINQ

Observations

Multivariate Analysis

Correlation heat map
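The heat map is typically drawn over a pandas correlation matrix; a sketch with toy numeric columns (the seaborn call is noted as a comment, as the plotting library is an assumption):

```python
import pandas as pd

# Toy numeric columns standing in for the dataset's continuous variables
num = pd.DataFrame({
    "LOAN":    [1100, 1300, 1500, 1700, 2000],
    "VALUE":   [39025, 68400, 16700, 40320, 57037],
    "MORTDUE": [25860, 70053, 13500, 30548, 46100],
})

corr = num.corr()          # pairwise Pearson correlations
print(corr.round(2))
# sns.heatmap(corr, annot=True, cmap="coolwarm") would render it as a heat map
# (assuming seaborn imported as sns)
```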

Observations

Pair Plot

Treating Outliers

Identifying outliers

Replace the outliers

  1. If an outlier is below the lower whisker, replace it with the lower whisker.
  2. If an outlier is above the upper whisker, replace it with the upper whisker.
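A sketch of this whisker-capping rule, assuming the standard 1.5 × IQR whiskers:

```python
import pandas as pd

def cap_outliers(series: pd.Series) -> pd.Series:
    """Clip values outside the box-plot whiskers (1.5 * IQR rule)."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    # Values below the lower whisker become the lower whisker,
    # values above the upper whisker become the upper whisker.
    return series.clip(lower=lower, upper=upper)

loan = pd.Series([10, 12, 11, 13, 12, 100])  # 100 is an obvious outlier
loan_capped = cap_outliers(loan)
print(loan_capped.tolist())
```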

Checking that outliers have been removed

Treating Missing Values

Adding a new indicator column to the dataset for each column that has missing values

Actions Taken

2,596 rows have missing values

Replace the missing values in the numerical columns with the mean of the column
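A combined sketch of both missing-value steps (indicator column plus mean imputation), on a toy frame:

```python
import pandas as pd
import numpy as np

# Toy frame standing in for the numeric columns with missing values
df = pd.DataFrame({"MORTDUE": [25860.0, np.nan, 97800.0, np.nan],
                   "VALUE":   [39025.0, 68400.0, np.nan, 40320.0]})

# Iterate over a snapshot of the column names, since we add columns as we go
for col in list(df.columns):
    if df[col].isnull().any():
        df[col + "_missing"] = df[col].isnull().astype(int)  # 1 where value was missing
        df[col] = df[col].fillna(df[col].mean())             # mean imputation

print(df)
```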

Important Insights from EDA

What are the most important observations and insights from the data based on the EDA performed?

Model Building - Approach

Data Preparation

Separating the target variable from other variables

Splitting the data into 70% train and 30% test set
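A sketch of the split (scikit-learn's train_test_split; stratify=y is an assumption, but it preserves the 80/20 class mix in both splits):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy feature matrix and imbalanced target standing in for X and BAD
X = np.arange(20).reshape(10, 2)
y = np.array([0] * 8 + [1] * 2)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1, stratify=y)
print(X_train.shape, X_test.shape)
```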

Shape

Logistic Regression

Observation

Undersampling

Oversampling

Resampling Training Set

Decision trees often perform well on imbalanced datasets. Because the splitting rules consider the class variable when growing the tree, they can force both classes to be addressed.
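As an illustration of the oversampling idea (shown with scikit-learn's resample utility; imblearn's RandomOverSampler or SMOTE would be common alternatives):

```python
import numpy as np
from sklearn.utils import resample

# Toy imbalanced training set: 8 non-defaulters, 2 defaulters
X = np.arange(10).reshape(-1, 1)
y = np.array([0] * 8 + [1] * 2)

# Randomly oversample the minority class with replacement until it
# matches the majority class size
X_min, y_min = X[y == 1], y[y == 1]
X_min_up, y_min_up = resample(X_min, y_min, n_samples=int((y == 0).sum()),
                              replace=True, random_state=1)

X_bal = np.vstack([X[y == 0], X_min_up])
y_bal = np.concatenate([y[y == 0], y_min_up])
print(np.bincount(y_bal))  # both classes now have the same count
```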

Model Evaluation Criterion

The model attempts to identify applicants who will default (1) on their loan; these are our True Positives (TP). Non-defaulters (0) are therefore our True Negatives (TN).

The model can make two kinds of wrong predictions:

  1. Predicting an applicant will not default on a loan when, in reality, the applicant would default. This is a major loss of profit for the bank.
  2. Predicting an applicant will default on a loan when, in reality, the applicant would have paid it off. This costs the bank the interest profit from that potential customer.

Which case is more important?

Predicting an applicant will not default on a loan when, in reality, the applicant would default. This is a major loss of profit for the bank.

How to reduce the losses?

The bank would want recall to be maximized: the greater the recall score, the fewer false negatives. In this case a false negative is predicting that an applicant will not default (0) when the applicant would in fact default (1).

That being said, a high F1-score is still preferable, as it translates into more profit, provided recall remains high.

METRIC Function
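A sketch of what such a metric function might look like (the notebook's actual helper may differ; recall is the headline number per the criterion above):

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

def model_metrics(y_true, y_pred):
    """Return the scores used to compare models; recall is the primary metric."""
    return {
        "accuracy":  accuracy_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred),
        "recall":    recall_score(y_true, y_pred),
        "f1":        f1_score(y_true, y_pred),
    }

# Toy labels: 3 actual defaulters, one of whom the model misses
scores = model_metrics([1, 0, 1, 1, 0], [1, 0, 0, 1, 0])
print(scores)
```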

Decision Tree

Build Decision tree model

Checking the performance on the train dataset

Checking the performance on the test dataset

Add data to results table

Observations

Decision Tree - Hyperparameter Tuning

Criterion {“gini”, “entropy”}

The function to measure the quality of a split. Supported criteria are “gini” for the Gini impurity and “entropy” for the information gain.

max_depth

The maximum depth of the tree. If None, then nodes are expanded until all leaves are pure or until all leaves contain less than min_samples_split samples.

min_samples_leaf

The minimum number of samples required to be at a leaf node. A split point at any depth will only be considered if it leaves at least min_samples_leaf training samples in each of the left and right branches. This may have the effect of smoothing the model, especially in regression.

You can learn about more hyperparameters at the link below and try tuning them.

https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html

Using GridSearchCV for Hyperparameter tuning on the model
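A GridSearchCV sketch over the three hyperparameters described above (the grid values and scoring="recall" are illustrative choices; synthetic imbalanced data stands in for the training set):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic ~80/20 imbalanced data standing in for the training set
X, y = make_classification(n_samples=200, weights=[0.8], random_state=1)

# Grid values are illustrative; the notebook's actual grid may differ
param_grid = {
    "criterion": ["gini", "entropy"],
    "max_depth": [3, 5, 7],
    "min_samples_leaf": [1, 5, 10],
}

# Recall is used as the selection metric, per the evaluation criterion above
grid = GridSearchCV(DecisionTreeClassifier(random_state=1),
                    param_grid, scoring="recall", cv=5)
grid.fit(X, y)
print(grid.best_params_)
```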

Checking the performance on the train dataset

Checking the performance on the test dataset

Observations

Plotting the Decision Tree

Observations

The next high priority splits are made on:

Plotting Feature Importance

Observations

Building a Random Forest Classifier

Random Forest is a bagging algorithm whose base models are decision trees. Bootstrap samples are drawn from the training data, and a decision tree is fit on each sample to make a prediction.

The results from all the decision trees are combined together and the final prediction is made using voting or averaging.
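A minimal fit of such a forest on synthetic imbalanced data (n_estimators=100 is scikit-learn's default, shown explicitly here):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic ~80/20 imbalanced data standing in for the training set
X, y = make_classification(n_samples=300, weights=[0.8], random_state=1)

# Each of the 100 trees is fit on a bootstrap sample of the training data;
# for classification, the forest's prediction combines the trees' votes.
rf = RandomForestClassifier(n_estimators=100, random_state=1)
rf.fit(X, y)
preds = rf.predict(X)
print(preds[:10])
```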

Checking the performance on the train dataset

Checking the performance on the test dataset

Add data to results table

Observations

Random Forest with class weights
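Class weighting is passed straight to the constructor; a sketch (class_weight="balanced" reweights samples inversely to class frequency, which is one assumed configuration, not necessarily the notebook's):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic ~80/20 imbalanced data standing in for the training set
X, y = make_classification(n_samples=300, weights=[0.8], random_state=1)

# "balanced" gives the 20% of defaulters as much total weight as the
# 80% of non-defaulters when fitting each tree
rf_weighted = RandomForestClassifier(class_weight="balanced", random_state=1)
rf_weighted.fit(X, y)
print(rf_weighted.score(X, y))
```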

Checking the performance on the train dataset

Checking the performance on the test dataset

Weighting the random forest dropped both the F1-score and the recall score.

Random Forest Classifier Hyperparameter Tuning

Checking the performance on the train dataset

Checking the performance on the test dataset

Observations

Plot the Feature importance of the tuned Random Forest

Observations

Conclusion and Recommendations

Conclusions: As we refined the models, it became clear that the size (5,960 data points) and composition (80/20 non-defaulters to defaulters) of the dataset were contributing to lower-than-ideal accuracy scores. Our initial objective of maximizing recall while maintaining a high F1-score was therefore unlikely to be met, so we shifted focus to maximizing recall alone, even at the expense of overall accuracy.

We built multiple decision-tree-based models that can predict whether a loan is likely to default; of those, two models performed and generalized well:

  1. Tuned Decision Tree Model: F1-score 67%, recall 75%
  2. Hyper-Tuned Random Forest Model: F1-score 63%, recall 75%

The two models are evenly balanced, with the tuned decision tree leaning more towards overall accuracy and the random forest leaning more towards recall. For the reasons stated above it is preferable to maximize recall even at the expense of accuracy; therefore, we recommend using the Hyper-Tuned Random Forest Model.

Even with the challenge of a limited dataset, we were able to tune our model to achieve a recall score of 75%.

Debt-to-income ratio is a very powerful predictor of default.

  1. The bank can use debt-to-income ratio as an initial indicator when evaluating a loan. Applicants with a higher debt-to-income ratio can also be made aware of the potential difficulty of paying off a loan while already carrying a large debt load, and could even be counseled on how to lower their debt or raise their income to qualify for future loans. Applicants with a higher current property value who ask for a larger loan are generally less likely to default; this makes sense, as wealthier applicants tend to be more financially stable.

  2. The credit age of an individual is also a feature to review.
    The longer someone has had credit, the better the bank can gauge how well they will repay their loans.

  3. A person's derogatory reports and delinquent credit lines are also very important indicators of who will default on their loans. DEROG > 6 and DELINQ > 5 resulted in defaults.

  4. Persons who are self-employed have a higher likelihood of defaulting on their loans. However, this split appears only at the fifth level of the tree model.

Overall

1. Comparison of various techniques and their relative performance based on chosen Metric (Measure of success):

2. Refined insights:

3. Proposal for the final solution design: