
Lending Club Loan Analysis and Modeling

LendingClub is a US peer-to-peer lending company, headquartered in San Francisco, California. It was the first peer-to-peer lender to register its offerings as securities with the Securities and Exchange Commission (SEC), and to offer loan trading on a secondary market. Lending Club operates an online lending platform that enables borrowers to obtain a loan, and investors to purchase notes backed by payments made on loans. Lending Club is the world's largest peer-to-peer lending platform. (from Wikipedia).

The goal of this project is to analyze and model Lending Club's issued loans. A summary of the whole project can be found in the corresponding Jupyter notebook: 0. Summary.ipynb.


Data

The loan data is available through multiple sources, including Kaggle Lending Club Loan Data, All Lending Club Loan Data, or Lending Club Statistics. In this project, I use the data from Kaggle Lending Club Loan Data, which contains the issued loan data from 2007 to 2015. In addition, I also use the loan data issued in 2016 from Lending Club Statistics.

The data collection and concatenation process can be found in the corresponding notebook: 1. Data Collection and Concatenation.ipynb.


Data Cleaning


Feature Engineering


Visualization

Since the above notebooks have relatively large file sizes, there are two suggested ways to view them:

  • Download the corresponding HTML files from the ./htmls/ folder
  • View the notebooks in nbviewer: nbviewer.jupyter.org/

The corresponding nbviewer pages are as follows:


Machine Learning

For binary classification problems, there are several commonly used algorithms, ranging from the widely used Logistic Regression to tree-based ensemble models such as Random Forest and [Boosting](https://en.wikipedia.org/wiki/Boosting_(machine_learning)) algorithms.
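
As a minimal illustration of these model families, here is a sketch on synthetic placeholder data with scikit-learn; it is not the project's actual setup or packages, and `X`/`y` stand in for the engineered loan features and binary loan status.

```python
# Minimal sketch: three model families on synthetic placeholder data
# (illustrative only; the project itself uses H2O/LightGBM/CatBoost on the loan data).
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Placeholder data standing in for the prepared loan features and loan status.
X, y = make_classification(n_samples=5000, n_features=20, weights=[0.8, 0.2], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(n_estimators=200, random_state=42),
    "Boosting": GradientBoostingClassifier(random_state=42),
}
for name, model in models.items():
    model.fit(X_train, y_train)
    auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
    print(f"{name}: test AUC = {auc:.4f}")
```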

For imbalanced classification problems, besides the naive approach of training on the data as-is, there are several re-sampling based methods. The strategies considered here include (a sketch follows this list):

  • Without Sampling
  • Under-Sampling
  • Over-Sampling
  • Synthetic Minority Oversampling Technique (SMOTE)
  • Adaptive Synthetic (ADASYN) sampling
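
A minimal sketch of these re-sampling strategies, assuming the third-party imbalanced-learn package (an assumption; the notebooks may implement re-sampling differently), with toy data in place of the loan features and labels:

```python
# Minimal sketch: re-sampling an imbalanced dataset with imbalanced-learn
# (an assumed package choice; toy data stands in for the loan features/labels).
from collections import Counter

from imblearn.over_sampling import ADASYN, SMOTE, RandomOverSampler
from imblearn.under_sampling import RandomUnderSampler
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=5000, n_features=20, weights=[0.9, 0.1], random_state=42)
print("Original class counts:", Counter(y))

samplers = {
    "Under-Sampling": RandomUnderSampler(random_state=42),
    "Over-Sampling": RandomOverSampler(random_state=42),
    "SMOTE": SMOTE(random_state=42),
    "ADASYN": ADASYN(random_state=42),
}
for name, sampler in samplers.items():
    X_res, y_res = sampler.fit_resample(X, y)
    print(f"{name}: {Counter(y_res)}")
```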

Here, the performance of several commonly used algorithms is compared under two conditions: without sampling and with over-sampling. The metric used is AUC, the Area Under the ROC Curve.
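
A minimal sketch of that comparison on placeholder data (using scikit-learn and imbalanced-learn here for brevity; the actual results below come from the packages listed in the table):

```python
# Minimal sketch: AUC of one classifier trained without vs. with over-sampling.
from imblearn.over_sampling import RandomOverSampler
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Toy imbalanced data standing in for the loan features and binary loan status.
X, y = make_classification(n_samples=5000, n_features=20, weights=[0.9, 0.1], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)

# Without sampling: train directly on the imbalanced training set.
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
auc_plain = roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1])

# With over-sampling: re-sample only the training set, evaluate on the untouched test set.
X_over, y_over = RandomOverSampler(random_state=42).fit_resample(X_train, y_train)
clf_over = LogisticRegression(max_iter=1000).fit(X_over, y_over)
auc_over = roc_auc_score(y_test, clf_over.predict_proba(X_test)[:, 1])

print(f"Without sampling AUC: {auc_plain:.4f}, with over-sampling AUC: {auc_over:.4f}")
```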

While scikit-learn has been widely used for many problems, it requires manual transformation of categorical variables into a numerical format, which is not always a good choice. There are several newer packages that natively support categorical features, including H2O, LightGBM, and CatBoost.
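
For example, LightGBM can consume pandas categorical columns directly. Below is a minimal sketch with hypothetical column names (not the actual loan features) and random toy data:

```python
# Minimal sketch: LightGBM consuming a pandas 'category' column directly,
# without manual one-hot or label encoding. Column names are hypothetical.
import numpy as np
import pandas as pd
from lightgbm import LGBMClassifier

rng = np.random.default_rng(42)
n = 1000
df = pd.DataFrame({
    "loan_amnt": rng.uniform(1000, 35000, n),
    "grade": pd.Categorical(rng.choice(list("ABCDEFG"), n)),  # kept as a categorical feature
})
y = rng.integers(0, 2, n)  # toy binary target

# With the scikit-learn interface, LightGBM picks up pandas 'category' dtype
# columns as categorical features automatically.
model = LGBMClassifier(n_estimators=100, random_state=42)
model.fit(df, y)
print(model.predict_proba(df.head())[:, 1])
```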

In this project, several widely used algorithms are explored (a sketch of the stacked-model idea follows this list), including:

  • Logistic Regression
  • Random Forest
  • Boosting
  • Stacked Models
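
A minimal sketch of the stacking idea, using scikit-learn's StackingClassifier on placeholder data (the project's actual stacked models may combine the H2O/LightGBM/CatBoost learners differently):

```python
# Minimal sketch: stacking tree-based base learners under a logistic-regression meta-learner.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5000, n_features=20, weights=[0.8, 0.2], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)

stack = StackingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=200, random_state=42)),
        ("gbm", GradientBoostingClassifier(random_state=42)),
    ],
    final_estimator=LogisticRegression(max_iter=1000),  # meta-learner on out-of-fold predictions
    cv=5,
)
stack.fit(X_train, y_train)
print("Stacked model AUC:", roc_auc_score(y_test, stack.predict_proba(X_test)[:, 1]))
```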

Model Performance Comparison

| Model | Logistic Regression | Random Forest | Random Forest | Boosting | Boosting |
|:---|:---:|:---:|:---:|:---:|:---:|
| Package | H2O | H2O | LightGBM | LightGBM | CatBoost |
| Without over-sampling AUC | 0.6982 | 0.7007 | 0.6882 | 0.7204 | 0.7222 |
| With over-sampling AUC | 0.6982 | 0.7008 | 0.6893 | 0.7195 | 0.6814 |

As a comparison, I also use DataRobot, an automated machine learning platform for predictive modeling, to run the classification. Below is its performance:

| Model | GBM | GBM | GBM | GBM |
|:---|:---:|:---:|:---:|:---:|
| Package | H2O | LightGBM | LightGBM | XGBoost |
| Test AUC | 0.7155 | 0.7133 | 0.7147 | 0.7113 |

Note: detailed analysis can be found in my blog. Feel free to read through it.

Copyright © Jifu Zhao 2018
