
Fighting Churn with Data

[Figure: impact on retention of each metric]

This project demonstrates how the analytics event data and customer subscription data of a product or service can be combined to identify user behaviours which predict customer churn. That analysis can then be used to develop strategies to improve retention.

The SQL and Python scripts included in this repo are intended to be used on the simulated social network dataset 'SocialNet7', which can be generated by running the setup of the fight-churn repo for the book Fighting Churn with Data by Carl Gold.

The SQL scripts and Python notebooks of this project are arranged in the order in which they'd be executed when conducting a churn analysis, and should be run in that order. Any intermediate outputs are held in the output folder.

Part 1: Creating the Churn Dataset

Focuses on setting up the dataset of regular customer observations used for churn analysis.

  • churn-calculations includes SQL scripts for calculating:
    • activity-event-based churn
    • MRR churn
    • net retention
    • standard account-based churn
  • insert-metrics includes a SQL script for inserting aggregated metrics for each kind of analytics event.
  • event-quality-assurance contains a notebook and SQL scripts for plotting events over time.
  • metric-quality-assurance contains a notebook and SQL scripts for spotting anomalous metric values which might indicate problems with event collection.
  • account-tenure contains scripts for calculating account tenure (the length of time for which there is a continuous subscription for a single account) and inserting this into the data warehouse as its own metric.
  • identify-active-periods contains SQL scripts for calculating the active periods (allowing for a maximum 7-day gap between subscriptions) and inserting these into an active_period table. These are used to determine whether or not a metric observation ended in churn.
  • create-churn-dataset is where the fun begins! Here, we create a dataset of 'per-month' event metric observations which form the basis of our churn analysis.

Part 2: Cohort and Behaviour Analysis

Focuses on cohort analysis and on clustering metrics into groups of behaviours. These metrics and groups could then be used by the business to target interventions that stop customers churning from the product.

  • metric-summary-stats contains a notebook for checking summary statistics for all metrics (so that we can check the percentage of zero-values).
  • metric-scores contains a notebook for producing normalised ("scored") versions of each event metric.
  • metric-cohorts contains notebooks for performing cohort analysis on individual and grouped versions of our metrics.
  • metric-correlations contains a notebook for calculating and visualising the matrix of Pearson correlation coefficients between metrics.
  • group-behavioural-metrics contains notebooks for:
    • grouping metrics together using hierarchical clustering (using SciPy's linkage and fcluster) and generating a loading matrix for averaging together the scores of those groups
    • applying the loading matrix to create grouped scores.

Part 3: Churn Probability Predictions

Focuses on forecasting churn probability with logistic regression.

  • train-model contains notebooks for:
    • training the logistic regression model.
    • showing the impact of each model parameter (grouped or individual behaviour) on the likelihood of retention/churn, and checking model calibration.
  • forecast-churn contains a notebook and SQL script for creating a revised version of the metric dataset which contains current customers only, ready for forecasting.

The subscription data, analytics data and the churn metrics produced from them are stored locally in a PostgreSQL database.

About the model and dataset

The simulated product is a simple social network. Eight kinds of analytics events are measured, and the names are fairly self-explanatory:

Events:

  • AdView: sees a single instance of an ad
  • Dislike: dislikes a piece of content
  • Like: likes a piece of content
  • Message: sends a message
  • New friend: makes a new friend
  • Post: posts a piece of content
  • Reply: replies to a message
  • Unfriend: removes a friend connection

The simulation is designed so that there are realistic relationships between the occurrence of these events and the customers’ simulated churn and renewal. These events, along with customer subscription information, are stored locally in a Postgres data warehouse.

The simulation consists of 15,000 customers over a period of six months starting on 1 January 2020. The total count for each event type over time follows a similar pattern:

[Figure: total number of 'like' events over time]

Creating the metrics

We insert into the data warehouse a table of measurements. For each customer, for each event type, we count the number of times that event occurred in the last 28 days. We make these measurements every 7 days. The metric table looks like this:
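The repo computes these metrics in SQL; as a rough illustration, the same calculation in pandas might look like this (the file name and column names here are hypothetical):

```python
import pandas as pd

# Hypothetical input: one row per analytics event, with columns
# account_id, event_type and event_time.
events = pd.read_csv("events.csv", parse_dates=["event_time"])

# Take a measurement every 7 days across the simulation period.
rows = []
for obs_date in pd.date_range("2020-02-01", "2020-07-01", freq="7D"):
    window = events[(events["event_time"] > obs_date - pd.Timedelta(days=28))
                    & (events["event_time"] <= obs_date)]
    counts = (window.groupby(["account_id", "event_type"])
                    .size()                    # events per account per type
                    .rename("metric_value")
                    .reset_index())
    counts["metric_date"] = obs_date
    rows.append(counts)

metrics = pd.concat(rows, ignore_index=True)   # one row per account/metric/date
```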

[Figure: sample of the metric table]

where the metric names are:

[Figure: metric name table]

By checking how the maximum, average, minimum and total number of events of each type varies over time, we can be confident that there aren't any gaps in our events:

[Figure: how the 'like' event varies over time]

We also add a metric for account tenure, defined as the length of time a customer has been using the product during their current, uninterrupted period of activity, allowing for a short gap of up to 31 days.
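Once the start of a customer's current period of activity is known, the tenure metric itself is simple; a minimal sketch with hypothetical column names:

```python
# period_start: start of the customer's current uninterrupted period of
# activity (allowing gaps of up to 31 days), merged onto the metric table.
metrics["account_tenure_days"] = (metrics["metric_date"] - metrics["period_start"]).dt.days
```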

The churn dataset

The churn dataset is a set of customer observations, each of which tells us how many events of each type occurred over the 28 days prior to the observation, along with an indicator of whether or not that observation occurred immediately before the customer churned.

We can create this dataset by working out which customer active periods did or didn't end in churn. An active period is defined as a period of time when an account is continuously subscribed through one or more individual subscriptions. We allow for a maximum gap of 7 days between subscriptions in this model.
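The repo derives active periods in SQL; a minimal pandas sketch of the same idea, assuming a subscriptions table with columns account_id, start_date and end_date:

```python
import pandas as pd

MAX_GAP = pd.Timedelta(days=7)   # maximum allowed gap between subscriptions

def active_periods(subs: pd.DataFrame) -> pd.DataFrame:
    """Merge each account's subscriptions into continuous active periods."""
    periods = []
    for account_id, grp in subs.sort_values("start_date").groupby("account_id"):
        start = end = None
        for _, sub in grp.iterrows():
            if start is None:
                start, end = sub["start_date"], sub["end_date"]
            elif sub["start_date"] - end <= MAX_GAP:
                end = max(end, sub["end_date"])   # continue the current period
            else:
                periods.append((account_id, start, end))   # period closed: churn
                start, end = sub["start_date"], sub["end_date"]
        periods.append((account_id, start, end))
    return pd.DataFrame(periods, columns=["account_id", "start_date", "end_date"])
```

A metric observation is then labelled as churn if it is the last observation before the end of an active period that actually closed, rather than one still running at the end of the dataset.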

By knowing which active periods for customers did or did not end in churn, we know which metric observations did or did not end in churn, and can create a churn dataset like this:

[Figure: example churn dataset]

Clustering metrics and performing cohort analysis

If we suppose that people who use the product a lot are less likely to churn than people who use it a little or not at all, then a cohort analysis of churn based on common behaviours lets us test that hypothesis. If we divide our customers into quantile-based cohorts according to their level of activity for a particular behaviour, and that behaviour is related to lower churn, we should find that the most active cohort has the lowest churn rate.
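A minimal sketch of this cohort calculation, assuming dataset holds one row per observation with the metric columns and an is_churn flag:

```python
import pandas as pd

def cohort_churn_rates(dataset: pd.DataFrame, metric: str, n_cohorts: int = 10) -> pd.Series:
    """Churn rate per quantile-based cohort of a single metric."""
    cohorts = pd.qcut(dataset[metric], q=n_cohorts, duplicates="drop")
    return dataset.groupby(cohorts, observed=True)["is_churn"].mean()

# e.g. cohort_churn_rates(dataset, "like_per_month")
```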

We can see that for behaviours which we intuitively associate with engagement with the service, such as liking posts and making new friends, more active cohorts have lower churn rates (on the left we see the raw cohort averages, and on the right, we see standard-deviation-based scores centered on the mean):

[Figure: like-per-month cohort analysis]

[Figure: newfriend-per-month cohort analysis]

Whereas for the negative behaviour of unfriending, more active cohorts have a higher churn rate:

[Figure: unfriend-per-month cohort analysis]

Perhaps most interesting are some of the rate-based or time-based metrics which can be produced from the base metrics:

[Figure: days-since-new-friend cohort analysis]

[Figure: adview-per-post cohort analysis]

Based on these relationships and further investigation, the business may decide to create interventions which encourage users to make friends more often, or consider balancing the number of ads that users see.

Although the number of event types in our simulation is small, for real products and services it can be beneficial to group certain event types together into behaviours, both to prevent information overload and to discover clearer relationships with churn than individual event types might exhibit.

We can discover groups of correlated metrics by performing hierarchical clustering on the metrics using SciPy's linkage and fcluster. A brief binary search shows that a correlation threshold of 0.5 gives behaviour groups which make intuitive sense (in the context of this simulation). Running the clustering algorithm with this threshold on an expanded set of metrics gives us these groups:
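A sketch of this step, assuming scores is a DataFrame of the scored metrics (one column per metric, one row per observation); the averaging linkage method is an illustrative choice:

```python
import pandas as pd
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

corr = scores.corr()                                 # Pearson correlation matrix
dist = squareform(1.0 - corr.values, checks=False)   # condensed distance matrix
Z = linkage(dist, method="average")
labels = fcluster(Z, t=0.5, criterion="distance")    # distance 0.5 = correlation 0.5

# Loading matrix: each group averages the scores of its member metrics.
loading = pd.DataFrame(0.0, index=corr.columns,
                       columns=[f"metric_group_{g}" for g in sorted(set(labels))])
for metric, group in zip(corr.columns, labels):
    loading.loc[metric, f"metric_group_{group}"] = 1.0
loading = loading / loading.sum(axis=0)              # turn sums into averages
grouped_scores = scores @ loading                    # apply the loading matrix
```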

Group                  Descriptive name             Metrics
metric_group_1         Posting and making friends   adview_per_month|like_per_month|newfriend_per_month|post_per_message|post_per_month
metric_group_2         Unfriending                  unfriend_per_month|unfriend_per_newfriend
metric_group_3         Messaging                    message_per_month|reply_per_month
dislike_per_month      Dislikes per month           dislike_per_month
adview_per_post        Seeing lots of ads           adview_per_post
reply_per_message      reply_per_message            reply_per_message
dislike_pcnt           More dislikes than likes     dislike_pcnt
newfriend_pcnt_chng    Increase in new friends      newfriend_pcnt_chng
days_since_newfriend   Time since new friend        days_since_newfriend

The algorithm groups some metrics into intuitive behaviours, such as posting and making friends, unfriending, and using the platform's messaging features. The cohort analysis of the first group shows that the highest-scoring cohort has about a quarter of the churn rate of the lowest-scoring cohort:

[Figure: metric group 1 cohort analysis]

Regression analysis and forecasting

Building a simple logistic regression model allows us to:

  • better understand how much each metric/behaviour contributes to the probability of a customer churning

  • make a forecast of how likely it is that current customers churn before their next billing renewal date

In this project, we use LogisticRegression from sklearn.linear_model.

The model is built on the assumption that while increased customer engagement (as measured by the metrics) leads to an increased chance of retention, this relationship is subject to diminishing returns: for customers well above or below the average on a particular metric/behaviour, even large increases or decreases in that behaviour lead to only small changes in retention probability. To capture this relationship, we use a sigmoid (s-curve) function in the model.

Although engagement is not directly measurable, we assume that it can be estimated from the customer metrics that we've produced.

Each behavioural metric score is multiplied by an engagement strength (weight/coefficient) that captures how much the behaviour (or group of behaviours) contributes to engagement. Overall engagement is the sum of the contributions for each behaviour, plus an intercept (offset) term which shifts the sigmoid curve so that a user with zero engagement (an average user) has a realistic probability forecast for retention and churn. Without this offset term, a customer in our model with zero engagement would have a 50% churn probability, which is unrealistic.

We set up the model to predict retention because this is easier to interpret: a positive number to represent something good is more intuitive than a negative number.

The result of training the model is a set of weights which capture how much each metric (or group) contributes to engagement, and a set of retention impacts which represent how much difference in the retention probability it would make to be one standard deviation above average for this metric (or group), assuming a customer was average in all other aspects.
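A sketch of the training and interpretation steps, assuming X is a DataFrame of grouped scores and y is 1 for retained observations and 0 for churned (regularisation is left at scikit-learn's defaults here):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

model = LogisticRegression()
model.fit(X, y)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

offset = model.intercept_[0]
baseline = sigmoid(offset)           # retention probability of an average customer
for name, weight in zip(X.columns, model.coef_[0]):
    # Retention impact: one standard deviation above average on this metric,
    # average on everything else.
    impact = sigmoid(offset + weight) - baseline
    print(f"{name}: weight={weight:.3f}, retain_impact={impact:.4f}")
```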

group_metric_offset    weight         retain_impact   group_metrics
metric_group_1         0.610597898    0.01275172      adview_per_month|like_per_month|newfriend_per_month|post_per_message|post_per_month
metric_group_2         -0.10435934    -0.003020317    unfriend_per_month|unfriend_per_newfriend
metric_group_3         0.677811622    0.013750674     message_per_month|reply_per_month
dislike_per_month      -0.112771914   -0.003276928    dislike_per_month
adview_per_post        -0.318860417   -0.010235154    adview_per_post
reply_per_message      -0.026286454   -0.00073306     reply_per_message
dislike_pcnt           -0.00200644    -5.53E-05       dislike_pcnt
newfriend_pcnt_chng    0.175516025    0.004454125     newfriend_pcnt_chng
days_since_newfriend   -0.147217027   -0.004348952    days_since_newfriend
offset                 3.534485894    0.97165323      (baseline)

The model offset is included as the last row. Its 'weight' is not a true weight, but simply the value of the offset, and its retain_impact is the forecast retention probability for a perfectly average customer (i.e. one with zero scores on all metrics).

The relative impacts of each behaviour or group are more easily understood when represented like this:

[Figure: impact on retention of each metric]

If the retention impact for a metric is 2%, a customer who is one standard deviation above average on that metric and average in all the other metrics has a forecast retention probability 2% higher than the average retention probability.

These results would suggest that being above average in messaging activity, posting and continuing to make friends have the biggest positive impact on retention, while being above average for seeing lots of ads, going long periods between making new friends and disliking content have the biggest negative impact on retention.

Interestingly, the ratio of likes to dislikes has a negligible impact on churn/retention.

It's also important to note that:

  • Being below average will have an approximately equal and opposite effect.
  • If a customer is multiple standard deviations above average, there are diminishing returns, meaning that each additional standard deviation above average has less impact on the churn or retention probability.
  • The same diminishing returns apply when a customer is above average in multiple respects: the combined reduction in churn probability will be lower than the sum of the quoted retention probability impacts.

If we create a metrics dataset for just our current customers who have at least 14 days tenure (in order to account for the fact that most new customers will have low metrics due to the short observation period), we can produce a forecast of how likely each customer is to churn or retain before their next observation date:
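Continuing the model sketch above, the forecast itself is a call to predict_proba (current_X and current_accounts are assumed to hold the scored metrics and account ids of active customers):

```python
import pandas as pd

probs = model.predict_proba(current_X)        # columns ordered as model.classes_
churn_col = list(model.classes_).index(0)     # class 0 = churned in this encoding
forecast = pd.DataFrame({
    "account_id": current_accounts,
    "churn_prob": probs[:, churn_col],
    "retain_prob": 1.0 - probs[:, churn_col],
})
```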

account_id observation_date churn_prob retain_prob
3 10/05/2020 0.021592298 0.978408
4 10/05/2020 0.009901682 0.990098
5 10/05/2020 0.03178912 0.968211
6 10/05/2020 0.023116958 0.976883
7 10/05/2020 0.005575805 0.994424

The overall distribution of churn probabilities looks like this:

[Figure: distribution of churn probabilities for active customers]

There is a tail of accounts with significantly higher churn probability, where it would make sense for the business to focus its efforts with appropriate interventions, but no accounts have anywhere near 100% churn probability. The vast majority have less than 20% churn probability.

We can gain confidence in our model by comparing our predictions for current customers with predictions on historical data (i.e. checking that the model is calibrated) and with the actual historical outcomes. To do this, we compare the average churn rates:
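A sketch of that comparison, continuing from the sketches above and assuming hist_X and hist_y hold the historical observations (with y = 1 for retained):

```python
comparison = {
    "current forecasts": forecast["churn_prob"].mean(),
    "historical forecasts": model.predict_proba(hist_X)[:, churn_col].mean(),
    "historical average": float((hist_y == 0).mean()),   # actual churn rate
}
```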

Measurement             Average churn
current forecasts       0.044886
historical forecasts    0.047886
historical average      0.047846

We see that the mean predicted churn probability for current customers is very close to, although slightly lower than, that of the historical predictions, and close to the actual mean of the dataset. What might be responsible for this difference?

Comparing the average metric values for the current and historical datasets, we see that those for current customers are around 6-7% higher in most cases:

metric historical mean current mean current / historical
like_per_month 102.701945 110.268551 1.073675
newfriend_per_month 7.079585 7.599192 1.073395
post_per_month 42.991330 46.135846 1.073143
adview_per_month 41.563855 44.131359 1.061773
dislike_per_month 15.801018 16.870076 1.067658
unfriend_per_month 0.307689 0.313352 1.018404
message_per_month 62.230044 66.495200 1.068539
reply_per_month 23.687189 25.177120 1.062900
adview_per_post 1.618636 1.637879 1.011889
reply_per_message 0.378669 0.398294 1.051825
post_per_message 4.225650 4.486532 1.061738
unfriend_per_newfriend 0.086465 0.085726 0.991462
dislike_pcnt 0.953466 0.984926 1.032995
newfriend_pcnt_chng 0.184072 0.239295 1.300005
days_since_newfriend 7.193891 7.997308 1.111680

Because most metrics have a positive impact on engagement, higher metrics in the current dataset might explain the lower churn probability forecast.

Future work

Some ways in which this project could be expanded are:

  • Measure the accuracy of the forecasts using area under the curve (AUC) and lift.
  • Create churn cohorts with demographic and firmographic categories to see which kinds of customers tend to be more engaged.
