In this post I will outline how you can build a churn model in 8 simple steps.
The discussion is fairly high level and should give a good breakdown for anyone embarking on the journey, regardless of what you would like to build a model for; the general process can be applied to any domain. It does not delve into the nitty-gritty technical details of building a model, so if you came for more than that you might be disappointed.
Generally, churn splits your customers into two cohorts: those that churn and those that don’t. The exercise, in large part, is to predict which customers are most likely to churn over a given timeframe.
This post centres on a telco customer base use case, but the principles can be applied universally.
If you’re interested here’s a list of the tools that I’ve used:
- SQL to extract data from the data warehouse/lake
- Alteryx to do most of the post-SQL ETL as well as all of the modelling
- Tableau to explore the data as well as visualise and track the performance of the models
Here’s a video of me demonstrating how to fit models using Alteryx on a prepared set of university grant data from Kaggle.
With all that said let’s begin.
1. Define the Business Problem
The first step for all good data science projects is to DEFINE the problem you’re looking to address.
For example: I want to predict the customers that are likely to churn within the next 14 days, or the likelihood that a customer turns off renewal in the next 7 days, and so on.
Also identify the population of customers you want to model. Ensure that this population is not too diverse; don’t try to model your whole customer base with one model. Be specific about the group of customers you will build a model for. As an example, build a churn model for high-value customers on data-only plans.
2. Data Acquisition
You then want to acquire the data that you will need to build a model to address the business problem.
Extract or obtain all the information you can get your hands on that will remain readily available well into the future, so that you’ll be able to score and refit your models.
We acquired 13 data sources ranging from demographic, billing, product type, utilisation, customer contacts, etc.
These will be at varying grains of detail depending on how and where you extract the data from (hourly, daily, monthly to event level per subscriber).
The most important consideration when choosing which data sources to pull in is determining whether the data is reproducible and can be updated on an ongoing basis.
3. Subset the Data to the problem set
We then need to ensure that all data sources are subset to the problem that we’re modelling.
We might have datasets that record all events up until today; each one needs to be cut at the point in time at which we make the prediction. For example, take a dataset containing every time a customer logs in to the app. For a model that predicts churn in the next 14 days, we should only fit on logins that occurred before each customer’s event date, whether that event is churn or non-churn. For customers that are still active today, we filter login records to 14 days prior to the date the data was extracted, so the model never sees information from inside the prediction window.
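As a concrete sketch of that point-in-time filter (using pandas, with illustrative column names and dates rather than anything from our actual data):

```python
import pandas as pd

# Hypothetical login-event data; column names are illustrative only.
logins = pd.DataFrame({
    "customer_id": [1, 1, 2, 2],
    "login_date": pd.to_datetime(
        ["2023-01-05", "2023-01-20", "2023-01-03", "2023-01-28"]),
})

# One cutoff per customer: the churn/event date for churners, or
# (extract date minus 14 days) for customers still active today.
cutoffs = pd.DataFrame({
    "customer_id": [1, 2],
    "cutoff_date": pd.to_datetime(["2023-01-15", "2023-01-31"]),
})

# Keep only events the model could have seen at prediction time.
subset = logins.merge(cutoffs, on="customer_id")
subset = subset[subset["login_date"] < subset["cutoff_date"]]
```

The same join-then-filter pattern applies to any event-level source, whatever its grain.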
4. Univariate Analysis of all data
The next step is to cull any unnecessary, redundant or useless data fields.
What you’re looking for here is to remove any fields that show erroneous or uninformative data. For each field, look at the percentage of null or missing values, averages, the number of unique observations, the distribution or spread, and so on.
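A quick way to produce those univariate summaries, sketched here with pandas on made-up fields:

```python
import pandas as pd

# Illustrative raw fields with some missing values.
df = pd.DataFrame({
    "monthly_spend": [40.0, None, 55.0, None, 60.0, 41.0],
    "plan_code": ["A", "A", "B", "A", None, "B"],
})

# Simple univariate profile: % missing and distinct values per field.
# Fields that are mostly null or single-valued are candidates to drop.
profile = pd.DataFrame({
    "pct_null": df.isna().mean() * 100,
    "n_unique": df.nunique(),
})
```

For numeric fields, `df.describe()` adds the averages and spread mentioned above.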
5. Base table build (including feature creation)
Once you have a set of cleansed datasets ready to go you can begin the formation of an analytical base table. This will be used to build and fit the models.
We created approximately a thousand different features (metrics/fields) across the datasets. Features that we built included:
- Number of calls in the past X days
- Amount of data used in the past X days
- Last bill amount
- Overdue bill flag
- Payment made in the last X days
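Features like the ones above are typically windowed aggregations over event data. A minimal pandas sketch, with hypothetical column names and a 14-day window:

```python
import pandas as pd

# Hypothetical call-event records per subscriber.
calls = pd.DataFrame({
    "customer_id": [1, 1, 1, 2],
    "call_date": pd.to_datetime(
        ["2023-03-01", "2023-03-20", "2023-03-28", "2023-03-30"]),
})
as_of = pd.Timestamp("2023-03-31")  # prediction/extract date

# Feature: number of calls in the past 14 days per customer.
recent = calls[calls["call_date"] >= as_of - pd.Timedelta(days=14)]
calls_14d = recent.groupby("customer_id").size().rename("calls_past_14d")
```

Varying the window (7, 14, 30, 90 days) and the aggregation (count, sum, flag) over each event source is how a handful of raw datasets balloons into hundreds of candidate features.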
Feeding a thousand different features through any modelling exercise would be impractical due to the sheer number. So we tested these ‘new’ features against churn using a number of bivariate tests, such as correlation matrices. This culled the list down to roughly 70 features that showed some correlation with churn before we began modelling.
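A simple version of that bivariate culling, sketched on synthetic data: correlate each candidate feature with the churn flag and keep those above a threshold (the feature names and the 0.15 cutoff here are illustrative, not the ones we used):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 2000
churn = rng.integers(0, 2, n)  # synthetic 0/1 churn flag

# Two hypothetical features: one related to churn, one pure noise.
features = pd.DataFrame({
    "overdue_amount": churn * 10 + rng.normal(0, 3, n),
    "random_noise": rng.normal(0, 1, n),
})

# Keep features whose absolute correlation with churn clears a threshold.
corr = features.corrwith(pd.Series(churn)).abs()
kept = corr[corr > 0.15].index.tolist()
```

In practice you would also check feature-to-feature correlations, since two features that both pass can still be redundant with each other.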
6. Model building stage
Once you have an analytical base table the next step is to split your data into 2 or 3 sets depending on how rigorous you want to be.
- Training set (40% of records)
- Validation set (30% of records)
- Holdout set (30% of records)
Use the training set to fit a number of different models, e.g. decision trees, random forests, boosted models, GLMs, neural nets, etc. Then use a lift chart on the validation set to compare their performance. Iterate, fitting the models a number of times with different features, until you have the best performing model you’re content with.
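As an illustrative sketch of the split-and-compare loop (using scikit-learn on synthetic data, with a single top-decile lift number standing in for a full lift chart):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
n = 3000
X = rng.normal(size=(n, 5))
y = (X[:, 0] + rng.normal(0, 1, n) > 1).astype(int)  # synthetic churn flag

# 40/30/30 split: training, validation, holdout.
X_tr, X_rest, y_tr, y_rest = train_test_split(
    X, y, test_size=0.6, random_state=1)
X_val, X_hold, y_val, y_hold = train_test_split(
    X_rest, y_rest, test_size=0.5, random_state=1)

def top_decile_lift(model, X_v, y_v):
    """Churn rate among the top-scored 10%, divided by the overall rate."""
    scores = model.predict_proba(X_v)[:, 1]
    top = np.argsort(scores)[::-1][: len(scores) // 10]
    return y_v[top].mean() / y_v.mean()

# Fit candidates on the training set, compare on the validation set.
models = {
    "glm": LogisticRegression().fit(X_tr, y_tr),
    "forest": RandomForestClassifier(random_state=1).fit(X_tr, y_tr),
}
lifts = {name: top_decile_lift(m, X_val, y_val) for name, m in models.items()}
```

The holdout set stays untouched until the very end, for an honest estimate of the chosen model’s performance.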
We ended up with one champion model and three challenger models; their performance was quite comparable.
7. Further feature creation
Optional: You can then create further features to improve your model or cater for areas you didn’t consider prior to modelling and refit the models using these.
8. Put the model into production
You can then build a production workflow whereby you recreate the analytical base table (ABT) for existing customers that are active as of today and score their likelihood to churn based on the selected model.
We opted to do this on a fortnightly basis due to the amount of time it took to run (almost half a day!). You can then track the predicted likelihood to churn against the actual churn rate to monitor the performance of your models. Once this dips you’d be well advised to refit/retrain your models; otherwise they will go stale.
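A minimal sketch of that tracking check: compare the mean predicted churn probability from a scoring run against the churn rate actually observed over the prediction window, and flag when the gap exceeds a tolerance (the 5-point tolerance here is illustrative):

```python
def needs_refit(predicted_probs, actual_churned, tolerance=0.05):
    """Flag calibration drift between predicted and observed churn rates.

    predicted_probs: churn probabilities from a past scoring run.
    actual_churned:  1/0 outcomes for the same customers, observed
                     once the prediction window has elapsed.
    """
    predicted_rate = sum(predicted_probs) / len(predicted_probs)
    actual_rate = sum(actual_churned) / len(actual_churned)
    return abs(predicted_rate - actual_rate) > tolerance

# Example: predictions averaging ~2% against an observed 10% churn rate.
drifted = needs_refit([0.02] * 100, [1] * 10 + [0] * 90)
```

In practice you would also track ranking metrics (e.g. lift in the top decile), since a model can stay well calibrated on average while losing its ability to separate churners from non-churners.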
Thanks for Reading
I hope this gives you a rough idea of the general process for building models and that you found value in the above. This post has been sitting in my drafts for almost 3 years and I finally got around to finalising it and publishing it.
In compiling this I have made use of resources found across the web. For commercial reasons and obligations to my employer, I have not disclosed any of the specifics of the modelling and the underlying data.
I hope to write a part 2 which delves more into the learnings I got from building this churn model and the things I felt could’ve improved the results. That might take another 3 years.