Used Python (scikit-learn) to develop supervised machine learning models to predict credit risk

teresa-le/Credit_Risk_Analysis

Credit Risk Analysis

Purpose

The purpose of this analysis was to develop multiple supervised machine learning models to predict credit risk and determine which one(s) performed best. Each model used a different technique to address the class imbalance in the dataset, where low risk loans greatly outnumbered high risk loans.
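To illustrate what resolving a class imbalance means in practice, here is a minimal standard-library sketch of the idea behind random oversampling: minority-class rows are duplicated at random until the classes are the same size. It is not the code used in this analysis; the function name `random_oversample` and the toy data are hypothetical.

```python
import random
from collections import Counter


def random_oversample(rows, labels, seed=0):
    """Duplicate randomly chosen rows of each smaller class until every
    class has as many rows as the largest class."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_rows, out_labels = list(rows), list(labels)
    for cls, n in counts.items():
        pool = [r for r, y in zip(rows, labels) if y == cls]
        # Sample with replacement to make up the shortfall for this class.
        for _ in range(target - n):
            out_rows.append(rng.choice(pool))
            out_labels.append(cls)
    return out_rows, out_labels


# Hypothetical toy data: 8 low risk loans and 2 high risk loans.
rows = [[i] for i in range(10)]
labels = ["low_risk"] * 8 + ["high_risk"] * 2

resampled_rows, resampled_labels = random_oversample(rows, labels)
class_counts = Counter(resampled_labels)
print(class_counts)
```

Resampling of this kind is normally applied to the training split only, so that the test scores still reflect the original class proportions.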

Results

Random Oversampling

  • Balanced accuracy score: 0.657
  • Precision: 0.01
  • Recall: 0.71

SMOTE Oversampling

  • Balanced accuracy score: 0.662
  • Precision: 0.01
  • Recall: 0.63

Cluster Centroids Undersampling

  • Balanced accuracy score: 0.544
  • Precision: 0.01
  • Recall: 0.69

SMOTEENN Combination Sampling

  • Balanced accuracy score: 0.645
  • Precision: 0.01
  • Recall: 0.72

Balanced Random Forest Classifier

  • Balanced accuracy score: 0.789
  • Precision: 0.03
  • Recall: 0.70

Easy Ensemble AdaBoost Classifier

  • Balanced accuracy score: 0.932
  • Precision: 0.09
  • Recall: 0.92

Based on the balanced accuracy scores, the Easy Ensemble AdaBoost classifier performed best at predicting both classes (0.932), and the Cluster Centroids undersampling technique performed worst (0.544).
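Balanced accuracy is the average of the recall on each class, which is why it suits an imbalanced dataset better than plain accuracy. The sketch below shows the difference on hypothetical confusion-matrix counts (100 high risk and 9,900 low risk loans, not the counts from this dataset); the function names are my own. scikit-learn's `balanced_accuracy_score` computes the same quantity from label arrays.

```python
def accuracy(tp, fn, tn, fp):
    """Fraction of all loans that were classified correctly."""
    return (tp + tn) / (tp + fn + tn + fp)


def balanced_accuracy(tp, fn, tn, fp):
    """Mean of the recall on the positive class and the recall on the negative class."""
    recall_pos = tp / (tp + fn)
    recall_neg = tn / (tn + fp)
    return (recall_pos + recall_neg) / 2


# Hypothetical model that labels every loan as low risk.
naive = dict(tp=0, fn=100, tn=9900, fp=0)

naive_accuracy = accuracy(**naive)
naive_balanced = balanced_accuracy(**naive)
print(naive_accuracy)   # 0.99
print(naive_balanced)   # 0.5
```

A model that never flags a high risk loan still scores 0.99 on plain accuracy, but only 0.5 on balanced accuracy.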

All the models have low precision scores for high risk applications (0.01 to 0.09), which indicates that most of the loans flagged as high risk were false positives.
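Since precision = TP / (TP + FP), a precision score translates directly into a ratio of false positives to true positives: FP / TP = (1 - precision) / precision. The sketch below applies this to the three distinct precision values reported above. The scores are rounded to two decimals, so the ratios are approximate; the function name is hypothetical.

```python
def false_positives_per_true_positive(precision):
    """Number of false positives for every true positive at a given precision."""
    return (1 - precision) / precision


# Distinct precision scores reported in the Results section.
reported_precision = [0.01, 0.03, 0.09]

ratios = {p: round(false_positives_per_true_positive(p), 1) for p in reported_precision}
for p, ratio in ratios.items():
    print(f"precision {p:.2f}: about {ratio} false positives per true positive")
```

Even at the best precision of 0.09, roughly ten low risk loans are flagged for every high risk loan that is correctly caught.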

The Easy Ensemble AdaBoost model has the best recall score (0.92) for high risk applications, while the model produced with the SMOTE oversampling technique has the worst (0.63). The high recall of the AdaBoost model indicates that it produced few false negatives: it missed only a small share of the high risk loans.
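Recall is TP / (TP + FN), so 1 - recall is the share of high risk loans a model misses. As a rough illustration, the sketch below converts the reported recall scores into the number of missed loans per 100 high risk applications; the figure of 100 is an assumption chosen for readability.

```python
# Recall scores reported in the Results section.
reported_recall = {
    "Random Oversampling": 0.71,
    "SMOTE": 0.63,
    "Cluster Centroids": 0.69,
    "SMOTEENN": 0.72,
    "Balanced Random Forest": 0.70,
    "Easy Ensemble AdaBoost": 0.92,
}

# False negatives per 100 high risk applications, rounded to whole loans.
missed_per_100 = {name: round((1 - r) * 100) for name, r in reported_recall.items()}

for name, missed in sorted(missed_per_100.items(), key=lambda item: item[1]):
    print(f"{name}: misses about {missed} of 100 high risk loans")
```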

Summary

Out of the 6 machine learning models, the Easy Ensemble AdaBoost classifier has the best scores for balanced accuracy (0.932), precision (0.09) and recall (0.92).

Although it has a fairly low precision score, it has a high recall score. In this situation, recall is more important than precision because the lending company doesn't want to lose money by giving loans to people who are high risk and more likely to default. On the other hand, low precision means that many low risk applicants are flagged as high risk, so the company may miss out on potential opportunities by rejecting good loans.
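One way to weigh the two kinds of error is to attach a cost to each. The sketch below is only an illustration: the costs (10 units per missed high risk loan, 1 unit per rejected good loan) and the pool of 100 high risk applicants are hypothetical, and only the recall and precision come from the results above.

```python
def error_costs(recall, precision, n_high_risk, cost_fn, cost_fp):
    """Split the error cost into a false-negative part and a false-positive part."""
    tp = recall * n_high_risk
    fn = n_high_risk - tp
    # precision = tp / (tp + fp)  =>  fp = tp * (1 - precision) / precision
    fp = tp * (1 - precision) / precision
    return {"fn": fn, "fp": fp, "fn_cost": fn * cost_fn, "fp_cost": fp * cost_fp}


# Reported scores of the Easy Ensemble AdaBoost model; all other inputs are assumptions.
costs = error_costs(recall=0.92, precision=0.09, n_high_risk=100,
                    cost_fn=10.0, cost_fp=1.0)

print(f"missed high risk loans: {costs['fn']:.0f}, cost {costs['fn_cost']:.0f}")
print(f"rejected good loans:    {costs['fp']:.0f}, cost {costs['fp_cost']:.0f}")
```

Changing `cost_fn` and `cost_fp` to values that reflect the lender's actual losses shows how much precision is worth relative to recall.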

Therefore, I recommend revising the model to see whether its precision score can be increased without significantly decreasing its accuracy.
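One possible revision, which is my assumption rather than something tested in this analysis, is to raise the probability threshold at which a loan is flagged as high risk. The sketch below uses hypothetical predicted probabilities and labels to show how a higher threshold trades recall for precision.

```python
def precision_recall_at(threshold, scores, is_high_risk):
    """Precision and recall when loans scoring at or above the threshold are flagged."""
    flagged = [s >= threshold for s in scores]
    tp = sum(f and y for f, y in zip(flagged, is_high_risk))
    fp = sum(f and not y for f, y in zip(flagged, is_high_risk))
    fn = sum((not f) and y for f, y in zip(flagged, is_high_risk))
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical predicted probabilities of being high risk, with true labels.
scores = [0.95, 0.85, 0.70, 0.65, 0.55, 0.45, 0.30, 0.20]
is_high_risk = [True, True, False, True, False, False, False, False]

default_threshold = precision_recall_at(0.5, scores, is_high_risk)
stricter_threshold = precision_recall_at(0.8, scores, is_high_risk)
print(default_threshold)
print(stricter_threshold)
```

On this toy data, moving the threshold from 0.5 to 0.8 removes both false positives but also misses one of the three high risk loans.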
