
Telco Customer Churn

Exploratory Data Analysis + Data Visualization + Modelling

1 - Abstract

In this project I performed Exploratory Data Analysis, Data Visualisation and, lastly, Modelling. The dataset contains 7043 rows in a csv file. Each row represents a churned or not-churned customer described by 21 pieces of information (columns). Before the modelling part I had to do Data Cleaning, and I needed to understand which features (columns) are most important for understanding customer behaviour. I also checked the distribution of churned versus not-churned customers to decide whether Undersampling or Oversampling was needed; luckily the distribution is acceptable, although with Oversampling accuracy might be slightly higher. Later I visualised each column to understand the dataset better, and of course looked at feature importance. In the modelling part I used 9 different algorithms: CatBoost, K-Neighbors, XGBoost, AdaBoost, LightGBM, Logistic Regression, Gradient Boosting, Random Forest and D-Tree Classifier. Gradient Boosting gives the best accuracy (0.8045), slightly better than CatBoost (0.8036) and AdaBoost (0.8036), while D-Tree Classifier gives the worst accuracy (0.7307). But my main target is the Recall Score, because the objective of this type of project is decreasing False Negatives (FN). In the end the Gradient Boosting recall score (0.5418) is decent, but Oversampling or using scale_pos_weight in the boosting algorithms could increase the Recall Score.

2 - Data

The dataset contains 21 columns and 7043 rows; a short loading sketch follows the column list.

Column descriptions:

  • CustomerID = A unique ID that identifies each customer.
  • Gender = The customer’s gender.
  • Senior Citizen = Indicates if the customer is 65 or older.
  • Partner = Indicates if the customer is married.
  • Dependents = Indicates if the customer lives with any dependents.
  • Tenure = Indicates the total number of months that the customer has been with the company.
  • Phone Service = Indicates if the customer subscribes to home phone service with the company.
  • Multiple Lines = Indicates if the customer subscribes to multiple telephone lines with the company.
  • Internet Service = Indicates if the customer subscribes to Internet service with the company.
  • Online Security = Indicates if the customer subscribes to an additional online security service provided by the company.
  • Device Protection = Indicates if the customer subscribes to an additional device protection plan for their Internet equipment provided by the company.
  • Tech Support = Indicates if the customer subscribes to an additional technical support plan from the company with reduced wait times.
  • Streaming TV = Indicates if the customer uses their Internet service to stream television programming from a third-party provider.
  • Streaming Movies = Indicates if the customer uses their Internet service to stream movies from a third-party provider.
  • Contract = Indicates the customer’s current contract type: Month-to-Month, One Year, Two Year.
  • Paperless Billing = Indicates if the customer has chosen paperless billing.
  • Payment Method = Indicates how the customer pays their bill.
  • Monthly Charges = Indicates the customer’s current total monthly charge for all their services from the company.
  • Total Charges = Indicates the customer’s total charges, calculated to the end of the quarter.
  • Churn = Yes means the customer left the company this quarter; No means the customer remained with the company.
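
As a minimal loading sketch (the filename below is the usual name of the IBM Telco csv and is an assumption, not taken from the notebook):

import pandas as pd

# Assumed filename; adjust the path to wherever the csv actually lives.
df = pd.read_csv('WA_Fn-UseC_-Telco-Customer-Churn.csv')

print(df.shape)   # expected: (7043, 21)
print(df.dtypes)  # note: TotalCharges often loads as object and may need cleaning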

3 - Exploratory Data Analysis

Firstly, I would like to see the distribution of the target, because we may need to use Undersampling or Oversampling. As the image below shows, the dataset is fairly acceptable (a quick check is sketched after the list).

  • Customers who churned: 26.54 % --> (1869 customers)
  • Customers who did not churn: 73.46 % --> (5174 customers)
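
A minimal sketch of this check, assuming df is the DataFrame loaded above:

# Class balance of the target column
counts = df['Churn'].value_counts()
shares = df['Churn'].value_counts(normalize=True) * 100
print(counts)
print(shares.round(2))  # roughly 73.46 / 26.54, as listed above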

Secondly, I visualized all columns according to their numerical and categorical types. As you can see in the images below, four columns are numerical and the rest are categorical.

In these plots we observe the following (one example plot is sketched after the list):

  • New clients are the most likely to churn
  • Clients with higher monthly charges are more likely to churn
  • Clients with lower total charges are more likely to churn
  • Senior citizens are less likely to churn
  • Tenure and MonthlyCharges are important features for churn
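
A hedged sketch of the kind of plot behind these observations (seaborn's kdeplot is my choice here; the notebook may have used different plot types):

import matplotlib.pyplot as plt
import seaborn as sns

# Distributions of the two key numerical features, split by the churn label
fig, axes = plt.subplots(1, 2, figsize=(12, 4))
sns.kdeplot(data=df, x='tenure', hue='Churn', common_norm=False, ax=axes[0])
sns.kdeplot(data=df, x='MonthlyCharges', hue='Churn', common_norm=False, ax=axes[1])
plt.tight_layout()
plt.show()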

In the other images we observe that gender is not important for churn.

After these steps I want to see the Pearson correlation and the Spearman correlation. I used both because Pearson evaluates the linear relationship between columns, while Spearman evaluates the monotonic relationship (a code sketch follows the two figures).

Important!: In a monotonic relationship, variables tend to move in the same direction but not necessarily at a constant rate; in a linear relationship, variables move in the same direction at a constant rate.

Pearson Correlation

Spearman Correlation
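
A minimal sketch of how both matrices can be computed; pandas accepts the correlation method directly, while the heatmap styling is an assumption:

import matplotlib.pyplot as plt
import seaborn as sns

numeric = df.select_dtypes(include='number')

# Pearson measures linear relationships; Spearman measures monotonic (rank) ones
for method in ('pearson', 'spearman'):
    plt.figure(figsize=(6, 5))
    sns.heatmap(numeric.corr(method=method), annot=True, cmap='coolwarm')
    plt.title(method.capitalize() + ' correlation')
    plt.show()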

Lastly, we look at feature importance with respect to churn, and the result is not surprising. The top 3 important features are 'Tenure', 'Monthly Contract' and 'Total Charges'.
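
The write-up does not show how the importances were obtained; a hedged sketch using a tree ensemble's feature_importances_ (RandomForestClassifier is my assumption, and X, y are built in the modelling section below):

from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(random_state=42)
rf.fit(X, y)

# Rank features by their impurity-based importance
importances = pd.Series(rf.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False).head(10))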

4 - Modelling

In this part we choose which algorithms to use and compare them according to the classification report, which covers Accuracy, Recall, Precision and F1 Score. The models are ranked by Accuracy here, but as noted in the abstract, Recall is the metric we ultimately care about. Let's look at each metric to understand them better.

  • Accuracy: The fraction of all predictions that the model got right.

  • Recall: Fraction of relevant instances that were retrieved.

  • Precision: Fraction of relevant instances among the retrieved instances.

  • F1: The harmonic mean of Precision and Recall.

Calculation of Precision, Recall and Accuracy from the confusion matrix.
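
In code form, the same quantities can be read off scikit-learn's confusion matrix; a small sketch, using y_test and y_pred as produced in the walkthrough below:

from sklearn.metrics import confusion_matrix

# For binary labels 0/1 scikit-learn returns [[TN, FP], [FN, TP]]
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()

accuracy  = (tp + tn) / (tp + tn + fp + fn)  # all correct / all predictions
recall    = tp / (tp + fn)                   # churners we actually caught
precision = tp / (tp + fp)                   # predicted churners that were right
f1        = 2 * precision * recall / (precision + recall)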

Let's examine the general structure of the modelling part; our example is LightGBM.

First we create an empty list for each metric, because we want to keep the results of every model together for comparison.

# One list per metric; each model's scores are appended for later comparison
models = []
accuracy = []
recall = []
roc_auc = []
precision = []
f1 = []

In this part we drop gender and PhoneService and apply Label Encoding (encoding target labels with values between 0 and n_classes-1) to the Churn column.

from sklearn.preprocessing import LabelEncoder

# Drop two low-importance columns, then encode Churn (Yes/No) as 1/0
df1 = df.drop(['gender', 'PhoneService'], axis=1).copy()
le = LabelEncoder()
df1['Churn'] = le.fit_transform(df1['Churn'])

Then we change the tenure data type from int to float and one-hot encode the remaining categorical columns with get_dummies; X gets every column except Churn, and y is just the Churn column.

df1['tenure'] = df1['tenure'].astype(float)
df1 = pd.get_dummies(df1)      # one-hot encode the categorical columns
X = df1.drop('Churn', axis=1)  # features
y = df1['Churn']               # target

Now we use train_test_split, choosing the test size and random state. We could add more parameters, such as shuffle.

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

Here we create and fit our classifier, then predict on the X_test part.

from lightgbm import LGBMClassifier

lgbmc = LGBMClassifier()
lgbmc.fit(X_train, y_train)
y_pred = lgbmc.predict(X_test)

After the classifier has run we need to see the results; we round them to 4 digits after the decimal point.

from sklearn.metrics import (accuracy_score, recall_score, roc_auc_score,
                             precision_score, f1_score)

accuracy.append(round(accuracy_score(y_test, y_pred), 4))
recall.append(round(recall_score(y_test, y_pred), 4))
roc_auc.append(round(roc_auc_score(y_test, y_pred), 4))
precision.append(round(precision_score(y_test, y_pred), 4))
f1.append(round(f1_score(y_test, y_pred), 4))

Finally, we add our results to a DataFrame and print it on the screen.

models = ['LightGBM']  # row label(s) for the results table
result_df5 = pd.DataFrame({'Accuracy': accuracy,
                           'Recall': recall,
                           'Roc_Auc': roc_auc,
                           'Precision': precision,
                           'F1 Score': f1},
                          index=models)
result_df5

Let's look at the output of our models.

Model                 Accuracy  Recall  Roc_Auc  Precision  F1 Score
CatBoost              0.8036    0.5035  0.7095   0.6897     0.5821
K-Neighbors           0.7747    0.4634  0.6771   0.6129     0.5278
XGBoost               0.7875    0.4861  0.6930   0.6443     0.5541
AdaBoost              0.8036    0.5244  0.7161   0.6795     0.5919
LightGBM              0.7993    0.5261  0.7137   0.6652     0.5875
Logistic Regression   0.7998    0.5331  0.7162   0.6638     0.5913
Gradient Boosting     0.8045    0.5418  0.7222   0.6746     0.6010
Random Forest         0.7875    0.4774  0.6903   0.6478     0.5496
D-Tree Classifier     0.7307    0.5122  0.6622   0.5043     0.5082

5 - Result & Future Work

In total we used 9 different algorithms: CatBoost, K-Neighbors, XGBoost, AdaBoost, LightGBM, Logistic Regression, Gradient Boosting, Random Forest and D-Tree Classifier. Gradient Boosting gives the best accuracy (0.8045), slightly better than CatBoost (0.8036) and AdaBoost (0.8036), while D-Tree Classifier gives the worst accuracy (0.7307). But my main target is the Recall Score, because the objective of this type of project is decreasing False Negatives (FN). In the end the Gradient Boosting recall score (0.5418) is decent, but Oversampling or using scale_pos_weight in the boosting algorithms could increase the Recall Score and give better predictions about customers.
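
A minimal sketch of the scale_pos_weight idea with LightGBM; the weight below (negatives divided by positives) is the usual heuristic and an assumption, not a value taken from the notebook:

from lightgbm import LGBMClassifier

# Weight the positive (churn) class by the class ratio, roughly 5174/1869 ≈ 2.77
ratio = (y_train == 0).sum() / (y_train == 1).sum()

lgbmc_weighted = LGBMClassifier(scale_pos_weight=ratio, random_state=42)
lgbmc_weighted.fit(X_train, y_train)
y_pred_weighted = lgbmc_weighted.predict(X_test)
# Recall should rise, usually at the cost of some precision and accuracy.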