
Lending Club Loan Analysis and Modeling

LendingClub is a US peer-to-peer lending company, headquartered in San Francisco, California. It was the first peer-to-peer lender to register its offerings as securities with the Securities and Exchange Commission (SEC), and to offer loan trading on a secondary market. Lending Club operates an online lending platform that enables borrowers to obtain a loan, and investors to purchase notes backed by payments made on loans. Lending Club is the world's largest peer-to-peer lending platform. (from Wikipedia).

The goal of this project is to analyze and model Lending Club's issued loans. A summary of the whole project can be found in the corresponding Jupyter notebook: 0. Summary.ipynb.


Data

The loan data is available through multiple sources, including Kaggle Lending Club Loan Data, All Lending Club Loan Data, or Lending Club Statistics. In this project, I use the data from Kaggle Lending Club Loan Data, which contains the issued loan data from 2007 to 2015. In addition, I also use the loan data issued in 2016 from Lending Club Statistics.

The data collection and concatenation process can be found in the corresponding notebook: 1. Data Collection and Concatenation.ipynb.


Data Cleaning


Feature Engineering


Visualization

Since the above notebooks have relatively large file sizes, there are two suggested ways to view them:

  • Download the corresponding HTML files from the ./htmls/ folder
  • View the notebooks in nbviewer: nbviewer.jupyter.org/

The corresponding nbviewer pages are as follows:


Machine Learning

For binary classification problems, there are several commonly used algorithms, ranging from the widely used Logistic Regression to tree-based ensemble models such as Random Forest and [Boosting](https://en.wikipedia.org/wiki/Boosting_(machine_learning)) algorithms.
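
As a minimal illustration of these model families, here is a sketch on synthetic placeholder data with scikit-learn; it is not the project's actual setup or packages, and `X`/`y` stand in for the engineered loan features and binary loan status.

```python
# Minimal sketch: three model families on synthetic placeholder data
# (illustrative only; the project itself uses H2O/LightGBM/CatBoost on the loan data).
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Placeholder data standing in for the prepared loan features and loan status.
X, y = make_classification(n_samples=5000, n_features=20, weights=[0.8, 0.2], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(n_estimators=200, random_state=42),
    "Boosting": GradientBoostingClassifier(random_state=42),
}
for name, model in models.items():
    model.fit(X_train, y_train)
    auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
    print(f"{name}: test AUC = {auc:.4f}")
```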

For imbalanced classification problems, besides the naive approach of training on the data as-is, there are several re-sampling based methods. The strategies considered here include (a sketch follows this list):

  • Without Sampling
  • Under-Sampling
  • Over-Sampling
  • Synthetic Minority Oversampling Technique (SMOTE)
  • Adaptive Synthetic (ADASYN) sampling
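
A minimal sketch of these re-sampling strategies, assuming the third-party imbalanced-learn package (an assumption; the notebooks may implement re-sampling differently), with toy data in place of the loan features and labels:

```python
# Minimal sketch: re-sampling an imbalanced dataset with imbalanced-learn
# (an assumed package choice; toy data stands in for the loan features/labels).
from collections import Counter

from imblearn.over_sampling import ADASYN, SMOTE, RandomOverSampler
from imblearn.under_sampling import RandomUnderSampler
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=5000, n_features=20, weights=[0.9, 0.1], random_state=42)
print("Original class counts:", Counter(y))

samplers = {
    "Under-Sampling": RandomUnderSampler(random_state=42),
    "Over-Sampling": RandomOverSampler(random_state=42),
    "SMOTE": SMOTE(random_state=42),
    "ADASYN": ADASYN(random_state=42),
}
for name, sampler in samplers.items():
    X_res, y_res = sampler.fit_resample(X, y)
    print(f"{name}: {Counter(y_res)}")
```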

Here, the performance of several commonly used algorithms is compared under two conditions: without sampling and with over-sampling. The metric used is AUC, the Area Under the ROC Curve.
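
A minimal sketch of that comparison on placeholder data (using scikit-learn and imbalanced-learn here for brevity; the actual results below come from the packages listed in the table):

```python
# Minimal sketch: AUC of one classifier trained without vs. with over-sampling.
from imblearn.over_sampling import RandomOverSampler
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Toy imbalanced data standing in for the loan features and binary loan status.
X, y = make_classification(n_samples=5000, n_features=20, weights=[0.9, 0.1], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)

# Without sampling: train directly on the imbalanced training set.
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
auc_plain = roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1])

# With over-sampling: re-sample only the training set, evaluate on the untouched test set.
X_over, y_over = RandomOverSampler(random_state=42).fit_resample(X_train, y_train)
clf_over = LogisticRegression(max_iter=1000).fit(X_over, y_over)
auc_over = roc_auc_score(y_test, clf_over.predict_proba(X_test)[:, 1])

print(f"Without sampling AUC: {auc_plain:.4f}, with over-sampling AUC: {auc_over:.4f}")
```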

While scikit-learn has been widely used for many problems, it requires manual transformation of categorical variables into a numerical format, which is not always a good choice. There are several newer packages that natively support categorical features, including H2O, LightGBM, and CatBoost.
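
For example, LightGBM can consume pandas categorical columns directly. Below is a minimal sketch with hypothetical column names (not the actual loan features) and random toy data:

```python
# Minimal sketch: LightGBM consuming a pandas 'category' column directly,
# without manual one-hot or label encoding. Column names are hypothetical.
import numpy as np
import pandas as pd
from lightgbm import LGBMClassifier

rng = np.random.default_rng(42)
n = 1000
df = pd.DataFrame({
    "loan_amnt": rng.uniform(1000, 35000, n),
    "grade": pd.Categorical(rng.choice(list("ABCDEFG"), n)),  # kept as a categorical feature
})
y = rng.integers(0, 2, n)  # toy binary target

# With the scikit-learn interface, LightGBM picks up pandas 'category' dtype
# columns as categorical features automatically.
model = LGBMClassifier(n_estimators=100, random_state=42)
model.fit(df, y)
print(model.predict_proba(df.head())[:, 1])
```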

In this project, several widely used algorithms are explored (a sketch of the stacked-model idea follows this list), including:

  • Logistic Regression
  • Random Forest
  • Boosting
  • Stacked Models
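
A minimal sketch of the stacking idea, using scikit-learn's StackingClassifier on placeholder data (the project's actual stacked models may combine the H2O/LightGBM/CatBoost learners differently):

```python
# Minimal sketch: stacking tree-based base learners under a logistic-regression meta-learner.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5000, n_features=20, weights=[0.8, 0.2], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)

stack = StackingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=200, random_state=42)),
        ("gbm", GradientBoostingClassifier(random_state=42)),
    ],
    final_estimator=LogisticRegression(max_iter=1000),  # meta-learner on out-of-fold predictions
    cv=5,
)
stack.fit(X_train, y_train)
print("Stacked model AUC:", roc_auc_score(y_test, stack.predict_proba(X_test)[:, 1]))
```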

Model Performance Comparison

| Model | Logistic Regression | Random Forest | Random Forest | Boosting | Boosting |
|:---|:---:|:---:|:---:|:---:|:---:|
| Package | H2O | H2O | LightGBM | LightGBM | CatBoost |
| Without over-sampling AUC | 0.6982 | 0.7007 | 0.6882 | 0.7204 | 0.7222 |
| With over-sampling AUC | 0.6982 | 0.7008 | 0.6893 | 0.7195 | 0.6814 |

As a comparison, I also use DataRobot, an automated machine learning platform for predictive modeling, to run the classification. Below is its performance:

| Model | GBM | GBM | GBM | GBM |
|:---|:---:|:---:|:---:|:---:|
| Package | H2O | LightGBM | LightGBM | XGBoost |
| Test AUC | 0.7155 | 0.7133 | 0.7147 | 0.7113 |

Note: detailed analysis can be found in my blog. Feel free to read through it.

Copyright © Jifu Zhao 2018
