Credit_Risk_Analysis

Predicts the credit risk of individual loan applicants from information within their applications using supervised machine learning models.

Overview

The purpose of this project is to use supervised machine learning to help Fast Lending, a company that offers a peer-to-peer lending service, predict credit risk from its 2019 first-quarter data. The dataset contains 68,817 applications and 86 variables, comprising the features and the target.

I begin by converting all qualitative data to numerical form and splitting the dataset into training and testing sets. Loan applications are heavily imbalanced between high- and low-risk classes, so to reduce the bias introduced by the overwhelming number of low-risk loans, I re-sample the training data with several algorithms: naive random over-sampling, the synthetic minority over-sampling technique (SMOTE), cluster centroids under-sampling, and the SMOTE edited nearest neighbors (SMOTEENN) combination sampling algorithm. Each re-sampled dataset is then fitted to a logistic regression model. I also fit the data to two ensemble models, a balanced random forest classifier and an easy ensemble classifier. Finally, I evaluate every approach with three metrics: the balanced accuracy score, the confusion matrix, and the imbalanced classification report. A minimal sketch of the shared preprocessing step appears below.
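
The following sketch shows how the preprocessing described above might look in code. It is illustrative only: the target column name "loan_status" and its high-risk/low-risk labels are assumptions about the LoanStats_2019Q1.csv schema, not details confirmed by this README.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Load the first-quarter 2019 loan data (file not included in this repository).
df = pd.read_csv("LoanStats_2019Q1.csv")

# Separate the target, then one-hot encode the remaining qualitative columns.
y = df["loan_status"]                                # assumed target column name
X = pd.get_dummies(df.drop(columns="loan_status"))

# Stratify so the rare high-risk class appears in both the training and testing sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=1, stratify=y
)
```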


Results

  • Naive Random Over-Sampling Algorithm with Logistic Regression Model

    • The balanced accuracy score is roughly 0.64.
    • The precision, or positive predictive value (PPV), of the model is 0.01 and the sensitivity, or recall, of the model is 0.69 for predicting high-risk applications.
    • The precision is 1.00 and the sensitivity is 0.59 for predicting low-risk applications.
    • The F1 score, the harmonic mean of precision and sensitivity, is 0.02 for a high-risk prediction; for a low-risk prediction it is 0.74.

naive_random_oversampling_eval_metrics
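
As a rough sketch of how these numbers might be produced (reusing X_train, X_test, y_train, and y_test from the preprocessing sketch above; the solver and max_iter values are illustrative choices), naive random over-sampling duplicates minority-class rows at random before the logistic regression is fitted:

```python
from imblearn.over_sampling import RandomOverSampler
from imblearn.metrics import classification_report_imbalanced
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score, confusion_matrix

# Randomly duplicate high-risk rows until both classes are equally represented.
ros = RandomOverSampler(random_state=1)
X_resampled, y_resampled = ros.fit_resample(X_train, y_train)

# Fit logistic regression on the balanced training data.
model = LogisticRegression(solver="lbfgs", random_state=1, max_iter=200)
model.fit(X_resampled, y_resampled)
y_pred = model.predict(X_test)

# The three evaluation metrics reported above.
print(balanced_accuracy_score(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
print(classification_report_imbalanced(y_test, y_pred))
```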

  • Synthetic Minority Over-Sampling Technique (SMOTE) Algorithm with Logistic Regression Model

    • The balanced accuracy score is roughly 0.66.
    • The precision is 0.01 and the sensitivity is 0.63 for predicting high-risk applications.
    • The precision is 1.00 and the sensitivity is 0.69 for predicting low-risk applications.
    • The F1 score of a high-risk prediction is 0.02; the F1 score of a low-risk prediction is 0.82.

smote_oversampling_eval_metrics
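
A hedged sketch of the SMOTE variant, which synthesizes new minority-class samples by interpolating between nearest neighbors instead of duplicating rows (again reusing the earlier train/test split):

```python
from imblearn.over_sampling import SMOTE
from imblearn.metrics import classification_report_imbalanced
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score

# Generate synthetic high-risk samples rather than copying existing ones.
X_resampled, y_resampled = SMOTE(random_state=1).fit_resample(X_train, y_train)

model = LogisticRegression(solver="lbfgs", random_state=1, max_iter=200)
model.fit(X_resampled, y_resampled)
y_pred = model.predict(X_test)

print(balanced_accuracy_score(y_test, y_pred))
print(classification_report_imbalanced(y_test, y_pred))
```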

  • Cluster Centroids Under-Sampling Algorithm with Logistic Regression Model

    • The balanced accuracy score is roughly 0.54.
    • The precision is 0.01 and the sensitivity is 0.69 for predicting high-risk applications.
    • The precision is 1.00 and the sensitivity is 0.40 for predicting low-risk applications.
    • The F1 score of a high-risk prediction is 0.01; the F1 score of a low-risk prediction is 0.57.

cluster_centroids_undersampling_eval_metrics
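
Cluster centroids works in the opposite direction: it shrinks the majority (low-risk) class down to synthetic centroids rather than growing the minority class. A minimal sketch, under the same assumptions as above:

```python
from imblearn.under_sampling import ClusterCentroids
from imblearn.metrics import classification_report_imbalanced
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score

# Replace the many low-risk rows with a smaller set of cluster centroids.
cc = ClusterCentroids(random_state=1)
X_resampled, y_resampled = cc.fit_resample(X_train, y_train)

model = LogisticRegression(solver="lbfgs", random_state=1, max_iter=200)
model.fit(X_resampled, y_resampled)
y_pred = model.predict(X_test)

print(balanced_accuracy_score(y_test, y_pred))
print(classification_report_imbalanced(y_test, y_pred))
```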

  • Synthetic Minority Over-Sampling Technique Edited Nearest Neighbors (SMOTEENN) Algorithm with Logistic Regression Model

    • The balanced accuracy score is roughly 0.67.
    • The precision is 0.01 and the sensitivity is 0.76 for predicting high-risk applications.
    • The precision is 1.00 and the sensitivity is 0.59 for predicting low-risk applications.
    • The F1 score of a high-risk prediction is 0.02; the F1 score of a low-risk prediction is 0.74.

smoteenn_combination_sampling_eval_metrics
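
SMOTEENN combines both directions: SMOTE over-samples the high-risk class, then edited nearest neighbors prunes samples whose neighbors disagree with their label. A sketch under the same assumptions:

```python
from imblearn.combine import SMOTEENN
from imblearn.metrics import classification_report_imbalanced
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score

# Over-sample with SMOTE, then clean noisy points with edited nearest neighbors.
X_resampled, y_resampled = SMOTEENN(random_state=1).fit_resample(X_train, y_train)

model = LogisticRegression(solver="lbfgs", random_state=1, max_iter=200)
model.fit(X_resampled, y_resampled)
y_pred = model.predict(X_test)

print(balanced_accuracy_score(y_test, y_pred))
print(classification_report_imbalanced(y_test, y_pred))
```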

  • Balanced Random Forest Classifier Model

    • The balanced accuracy score is roughly 0.68.
    • The precision is 0.88 and the sensitivity is 0.37 for predicting high-risk applications.
    • The precision is 1.00 and the sensitivity is 1.00 for predicting low-risk applications.
    • The F1 score of a high-risk prediction is 0.52; the F1 score of a low-risk prediction is 1.00.

random_forest_eval_metrics
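
The balanced random forest handles the class imbalance internally by under-sampling each bootstrap sample, so no separate re-sampling step is needed. A sketch reusing the earlier split (n_estimators=100 is an illustrative choice):

```python
from imblearn.ensemble import BalancedRandomForestClassifier
from imblearn.metrics import classification_report_imbalanced
from sklearn.metrics import balanced_accuracy_score

# Each tree is trained on a bootstrap sample balanced by under-sampling.
brf = BalancedRandomForestClassifier(n_estimators=100, random_state=1)
brf.fit(X_train, y_train)
y_pred = brf.predict(X_test)

print(balanced_accuracy_score(y_test, y_pred))
print(classification_report_imbalanced(y_test, y_pred))
```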

  • Easy Ensemble Classifier Model

    • The balanced accuracy score is roughly 0.93.
    • The precision is 0.09 and the sensitivity is 0.92 for predicting high-risk applications.
    • The precision is 1.00 and the sensitivity is 0.94 for predicting low-risk applications.
    • The F1 score of a high-risk prediction is 0.16; the F1 score of a low-risk prediction is 0.97.

easy_ensemble_eval_metrics
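
The easy ensemble classifier trains an ensemble of boosted learners, each on a balanced, randomly under-sampled subset of the training data. A minimal sketch under the same assumptions as the blocks above:

```python
from imblearn.ensemble import EasyEnsembleClassifier
from imblearn.metrics import classification_report_imbalanced
from sklearn.metrics import balanced_accuracy_score

# An ensemble of AdaBoost learners, each fitted on a balanced under-sample.
eec = EasyEnsembleClassifier(n_estimators=100, random_state=1)
eec.fit(X_train, y_train)
y_pred = eec.predict(X_test)

print(balanced_accuracy_score(y_test, y_pred))
print(classification_report_imbalanced(y_test, y_pred))
```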


Summary

Given the nature of this project, correctly identifying high-risk applications (true positives) matters more than correctly identifying low-risk applications (true negatives). I would also rather accept low-risk applications falsely flagged as high-risk (false positives) than high-risk applications falsely cleared as low-risk (false negatives). In other words, although I want both types of error to be low, I favor a model with high sensitivity over one with high precision when predicting high-risk applications, as the sketch below illustrates.
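
To make this trade-off concrete, here is a small arithmetic sketch with hypothetical confusion-matrix counts (not taken from the results above) showing how a model can catch most high-risk loans yet still have very low precision:

```python
# Hypothetical counts for the high-risk class, chosen only for illustration.
tp = 70      # high-risk applications correctly flagged
fn = 30      # high-risk applications missed (falsely cleared as low-risk)
fp = 6000    # low-risk applications falsely flagged as high-risk

sensitivity = tp / (tp + fn)   # 0.70 -- most true high-risk loans are caught
precision = tp / (tp + fp)     # ~0.012 -- most high-risk flags are false alarms

print(f"sensitivity={sensitivity:.2f}, precision={precision:.3f}")
```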

Notice that, based on their evaluation metrics, re-sampling the data with the various algorithms had little effect on the performance of the logistic regression model. The balanced accuracy scores of all four algorithms fall below 0.7, and therefore also below the general minimum of 0.8 expected of a useful model. Notably, for predicting high-risk applications the precisions are all 0.01 while the sensitivities range from 0.63 to 0.76. The balanced random forest classifier likewise achieved a suboptimal balanced accuracy of 0.68, with a precision of 0.88 and a sensitivity of 0.37 for predicting high-risk applications.

In comparison, the easy ensemble classifier achieves a much higher balanced accuracy of 0.93, with a low precision of 0.09 but a high sensitivity of 0.92 for predicting high-risk applications. I therefore recommend at this point that the company employ the easy ensemble classifier to best predict high-risk applications, while remaining alert to the possibility of overfitting. Overfitting can be checked against additional validation data and reduced by obtaining more training data or lowering model capacity.


Resources

Data Source (file too large for upload):

LoanStats_2019Q1.csv

Software:

imbalanced-learn version 0.7.0
Jupyter Notebook version 1.0.0
NumPy version 1.20.3
Pandas
Python version 3.7.11
SciPy version 1.7.1
scikit-learn version 0.24.2

Contact

Email: show.wang94@gmail.com

LinkedIn: https://www.linkedin.com/in/s-k-wang
