Customer churn prediction for a telco provider

In this project we use a dataset from Kaggle. The main goal of the project is to predict whether a customer will change their telco provider (churn).

Overview of the data

The training dataset contains 4250 samples. Each sample has 19 features and one boolean variable, "churn", which indicates the class of the sample. The 19 input features and the target variable are listed below (a minimal loading sketch follows the list):

  1. state, string. 2-letter code of the US state of customer residence.
  2. account_length, numerical. Number of months the customer has been with the current telco provider.
  3. area_code, string. "area_code_AAA", where AAA is the 3-digit area code.
  4. international_plan, (yes/no). Whether the customer has an international plan.
  5. voice_mail_plan, (yes/no). Whether the customer has a voice mail plan.
  6. number_vmail_messages, numerical. Number of voice-mail messages.
  7. total_day_minutes, numerical. Total minutes of day calls.
  8. total_day_calls, numerical. Total number of day calls.
  9. total_day_charge, numerical. Total charge of day calls.
  10. total_eve_minutes, numerical. Total minutes of evening calls.
  11. total_eve_calls, numerical. Total number of evening calls.
  12. total_eve_charge, numerical. Total charge of evening calls.
  13. total_night_minutes, numerical. Total minutes of night calls.
  14. total_night_calls, numerical. Total number of night calls.
  15. total_night_charge, numerical. Total charge of night calls.
  16. total_intl_minutes, numerical. Total minutes of international calls.
  17. total_intl_calls, numerical. Total number of international calls.
  18. total_intl_charge, numerical. Total charge of international calls.
  19. number_customer_service_calls, numerical. Number of calls to customer service.
  20. churn, (yes/no). Whether the customer churned; the target variable.
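
To make the overview concrete, here is a minimal loading sketch. The file name train.csv is a hypothetical stand-in for whatever the Kaggle download provides:

```python
import pandas as pd

# Hypothetical local path to the Kaggle training data
df = pd.read_csv("train.csv")

print(df.shape)  # expected: (4250, 20) -- 19 features plus the churn target

# Class balance of the target variable
print(df["churn"].value_counts(normalize=True))

# Encode the yes/no columns as booleans for modelling convenience
for col in ["international_plan", "voice_mail_plan", "churn"]:
    df[col] = df[col].map({"yes": True, "no": False})
```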

Methods used

  • Exploratory Data Analysis (EDA)
  • Inferential Statistics
  • Data Visualisation
  • Oversampling & Undersampling for Class Imbalance (see the sketch after this list)
  • Feature Engineering
  • Feature Selection
  • Cross Validation
  • Clustering
  • Predictive Modeling
  • Machine Learning
  • Hyperparameter Tuning
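
The class-imbalance handling above relies on imbalanced-learn. A minimal sketch, using synthetic data as a stand-in for the encoded churn features (the imbalance ratio here is illustrative, not taken from the real dataset):

```python
from collections import Counter

from imblearn.combine import SMOTEENN
from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification

# Synthetic stand-in for the numeric, encoded churn features,
# with an illustrative minority-class share
X, y = make_classification(n_samples=4250, n_features=19,
                           weights=[0.86, 0.14], random_state=42)
print("original:", Counter(y))

# Pure oversampling: synthesize new minority-class samples
X_sm, y_sm = SMOTE(random_state=42).fit_resample(X, y)
print("SMOTE:   ", Counter(y_sm))

# Combined over- and undersampling: SMOTE followed by ENN cleaning
X_se, y_se = SMOTEENN(random_state=42).fit_resample(X, y)
print("SMOTEENN:", Counter(y_se))
```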

Technologies

  • Python, Jupyter Notebook
  • Pandas, NumPy
  • Seaborn, Matplotlib
  • imbalanced-learn (imblearn)
  • Scikit-learn, auto-sklearn
  • MLflow
  • SHAP (see the sketch after this list)
  • XGBoost, LightGBM, CatBoost
  • Hyperopt
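
SHAP appears in the list above, presumably for model explanation. A minimal sketch of computing SHAP values for a tree-based model; the toy data and model are illustrative stand-ins, not the project's tuned classifier:

```python
import shap
from sklearn.datasets import make_classification
from xgboost import XGBClassifier

# Toy data and model standing in for the tuned churn classifier
X, y = make_classification(n_samples=500, n_features=19, random_state=0)
model = XGBClassifier(n_estimators=50).fit(X, y)

# TreeExplainer computes SHAP values efficiently for tree ensembles
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)

# Global feature-importance summary (opens a matplotlib plot)
shap.summary_plot(shap_values, X)
```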

Notebooks & Python Scripts

For simplicity, brief information on each Python script and each notebook is given in the list below; for full details, look into the files themselves. The items are kept in the order in which they were developed and tried.

  • Research: first notebook, with EDA and Data Visualisation.
  • Preprocessing: Python script with functions for data handling.
  • Feature Engineering: notebook with feature-engineering experiments (all necessary techniques were included in preprocessing.py).
  • KMeans Research: notebook with the first clustering approach (unsuccessful).
  • KMeans + SVM: cluster label as a feature plus a Support Vector Machine classifier (no class-imbalance handling).
  • Undersample + KMeans + SVM: undersampling technique, everything else as in the notebook above.
  • SMOTE + KMeans + SVM: oversampling and undersampling techniques (SMOTE, SMOTETomek, SMOTEENN).
  • Logistic Regression: SMOTE, SMOTEENN, and no imbalance handling with a basic Logistic Regression.
  • SkLearn Models + XGB: different basic models and XGBoost tried on SMOTE and SMOTEENN data (best so far: XGBoost with SMOTEENN).
  • AutoSkLearn: auto-sklearn implementation (works only in Google Colab).
  • Feature Selection: notebook with different feature-selection techniques (the final selection function was included in preprocessing.py).
  • XGB Tuning: tuning of the XGBoost model, using Hyperopt for parameter search and MLflow for tracking (see the sketch after this list).
  • CatBoost Tuning: tuning of the CatBoost model.
  • LightGBM Tuning: tuning of the LightGBM model.
  • Train: final script for XGBoost model training and saving.
  • Model Inference: final script for test-data prediction (saved to submission.csv for Kaggle).
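
As a rough illustration of the XGB Tuning step, here is a sketch of a Hyperopt-driven search with MLflow tracking. The search space, metric, and toy data are assumptions for illustration, not the project's actual configuration:

```python
import mlflow
import numpy as np
from hyperopt import STATUS_OK, Trials, fmin, hp, tpe
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from xgboost import XGBClassifier

# Toy data as a stand-in for the resampled training set
X, y = make_classification(n_samples=1000, n_features=19, random_state=0)

# Illustrative search space -- not the project's actual one
space = {
    "max_depth": hp.choice("max_depth", [3, 4, 5, 6, 7]),
    "learning_rate": hp.loguniform("learning_rate", np.log(0.01), np.log(0.3)),
    "n_estimators": hp.choice("n_estimators", [100, 200, 400]),
}

def objective(params):
    # Log each trial as a nested MLflow run
    with mlflow.start_run(nested=True):
        mlflow.log_params(params)
        score = cross_val_score(XGBClassifier(**params), X, y,
                                cv=5, scoring="accuracy").mean()
        mlflow.log_metric("cv_accuracy", score)
    # Hyperopt minimizes, so negate the accuracy
    return {"loss": -score, "status": STATUS_OK}

with mlflow.start_run(run_name="xgb_tuning"):
    best = fmin(fn=objective, space=space, algo=tpe.suggest,
                max_evals=20, trials=Trials())
print(best)  # note: hp.choice entries come back as indices, not values
```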

Results

Many different models, methods, and frameworks were tried. The final score (accuracy) on the Kaggle platform is 0.88 on both the public and private leaderboards.

To sum up, with a result like this in production, the company could save a substantial amount of money by using the developed model to identify customers at risk of churning.