8 Simple Steps To Build a Churn Model

In this post I will outline how you can build a churn model in 8 simple steps.

The discussion is fairly high level and should provide a useful breakdown for anyone embarking on the journey, regardless of what you would like to build a model for; the general process can be applied to any domain. I won’t delve into the nitty-gritty technical details of building a model, so if you came for more than an overview you might be disappointed.

Generally, churn splits your customers into two cohorts – those that churn and those that don’t. The exercise, in large part, is to predict which customers are most likely to churn over a given timeframe.

This post centres around a telco customer base use case – but the principles can be applied universally.

If you’re interested here’s a list of the tools that I’ve used:

  • SQL to extract data from the data warehouse/lake
  • Alteryx to do most of the post SQL ETL as well as all of the modelling
  • Tableau to explore the data as well as visualise and track the performance of the models

Here’s a video of me demonstrating how to fit models using Alteryx on a prepared set of university grant data from Kaggle.

With all that said let’s begin.

1. Define the Business Problem

The first step for all good data science projects is to DEFINE the problem you’re looking to address.

For example, I want to predict customers that are likely to churn within the next 14 days, or the likelihood of a customer to turn off renewal in the next 7 days, etc.

Also identify the population of customers you want to model. You want to ensure that this population is not too diverse. Don’t try and model your whole customer base with one model. Be specific about the group of customers you will build a model for. As an example, build a churn model for high value customers on data only plans.

2. Data Acquisition

You then want to acquire the data that you will need to build a model to address the business problem.

Extract or obtain all the information you can get your hands on that will remain readily available well into the future, so that you’ll be able to score and refit your models.

We acquired 13 data sources ranging from demographic, billing, product type, utilisation, customer contacts, etc.

These will be at varying grains of detail depending on how and where you extract the data from (hourly, daily, monthly to event level per subscriber).

The most important consideration when choosing which data sources to pull in is determining whether the data is reproducible and can be updated on an ongoing basis.

3. Subset the Data to the problem set

We then need to ensure that all data sources are subset to the problem that we’re modelling.

We might have datasets that record all events up until today – we need to cut these at the point in time at which we are going to make the prediction. For example, for a dataset that contains every time a customer logs in to the app, we need to filter out any logins after the event date. For a model that predicts churn in the next 14 days, that means only using logins prior to the event date (whether that event is churn or non-churn) in the fitting process. For customers that are still active today, we need to filter login records to 14 days prior to the date the data was extracted.
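
To make that concrete, here’s a minimal sketch of the point-in-time cut using pandas. The file and column names (subscriber_id, login_ts, churned, churn_date) and the extract date are illustrative assumptions, not the actual telco data.

```python
import pandas as pd

# Illustrative inputs: one row per login event, one row per subscriber.
logins = pd.read_parquet("app_logins.parquet")        # subscriber_id, login_ts
subscribers = pd.read_parquet("subscribers.parquet")  # subscriber_id, churned, churn_date

extract_date = pd.Timestamp("2018-06-30")  # date the data was pulled (assumed)

# Cutoff per subscriber: the churn date for churners, or the extract date minus
# the 14-day prediction window for customers who are still active.
subscribers["cutoff_date"] = subscribers["churn_date"].where(
    subscribers["churned"] == 1,
    extract_date - pd.Timedelta(days=14),
)

# Keep only logins that happened before each subscriber's cutoff.
logins = logins.merge(subscribers[["subscriber_id", "cutoff_date"]], on="subscriber_id")
logins = logins[logins["login_ts"] < logins["cutoff_date"]].drop(columns="cutoff_date")
```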

4. Univariate Analysis of all data

The next step is to cull any unnecessary, redundant or useless data fields.

What you’re looking for here is to remove any fields that show erroneous data – e.g. look at the % null or missing, averages, total unique observations, distribution or spread, etc.
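
If you’re doing this outside of Alteryx/Tableau, a quick profiling helper along these lines covers most of it. This is just a sketch – the 95% null and single-value thresholds in the usage comment are arbitrary examples, not rules.

```python
import pandas as pd

def profile(df: pd.DataFrame) -> pd.DataFrame:
    """One row per field: % missing, unique count, and basic spread for numeric fields."""
    summary = pd.DataFrame({
        "pct_null": df.isna().mean() * 100,
        "n_unique": df.nunique(),
    })
    numeric = df.select_dtypes("number")
    summary["mean"] = numeric.mean()
    summary["std"] = numeric.std()
    return summary.sort_values("pct_null", ascending=False)

# Example cull: fields that are almost entirely missing or constant.
# report = profile(billing_df)
# drop_cols = report[(report["pct_null"] > 95) | (report["n_unique"] <= 1)].index
```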

5. Base table build (including feature creation)

Once you have a set of cleansed datasets ready to go you can begin the formation of an analytical base table. This will be used to build and fit the models.

We created approximately a thousand different feature metrics/fields across the datasets. Features that we built included (a rough sketch of a few of these follows the list):

  • Number of calls in the past X days
  • Amount of data used in the past X days
  • Last bill amount
  • Overdue bill flag
  • Payment made in the last X days
  • etc.
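
Here’s a rough pandas sketch of how a few of those features might be derived. The tables, column names and the 30-day window are all made up for illustration – the real build was done across 13 sources in Alteryx.

```python
import pandas as pd

# Illustrative tables only: one row per call, per usage event and per bill.
calls = pd.read_parquet("calls.parquet")  # subscriber_id, call_ts
usage = pd.read_parquet("usage.parquet")  # subscriber_id, event_ts, mb_used
bills = pd.read_parquet("bills.parquet")  # subscriber_id, bill_date, amount, overdue

cutoff = pd.Timestamp("2018-06-30")  # in practice this is per subscriber (see step 3)
window = pd.Timedelta(days=30)       # the "past X days"

features = pd.DataFrame(index=bills["subscriber_id"].unique())
features["calls_past_x_days"] = (
    calls[calls["call_ts"].between(cutoff - window, cutoff)]
    .groupby("subscriber_id").size()
)
features["mb_past_x_days"] = (
    usage[usage["event_ts"].between(cutoff - window, cutoff)]
    .groupby("subscriber_id")["mb_used"].sum()
)
last_bill = (
    bills.sort_values("bill_date")
    .groupby("subscriber_id").tail(1)
    .set_index("subscriber_id")
)
features["last_bill_amount"] = last_bill["amount"]
features["overdue_bill_flag"] = last_bill["overdue"].astype(int)
features = features.fillna(0)
```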

Feeding a thousand different features through any modelling exercise would be impractical due to the sheer number. So we tested these ‘new’ features against churn using a number of bivariate tests (correlation matrices and the like) and culled them down to about 70-odd features that showed some correlation to churn before we began modelling.
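
The culling step can be as simple as the sketch below: keep features with at least some correlation to a 0/1 churn flag, then drop one of any near-duplicate pair. The abt table, the 'churned' column and both thresholds are assumptions for illustration.

```python
import pandas as pd

def cull_features(abt: pd.DataFrame, target: str = "churned",
                  min_abs_corr: float = 0.02, max_pairwise: float = 0.95) -> list:
    """Return the feature names worth carrying into the modelling stage."""
    corr = abt.corr(numeric_only=True)[target].drop(target)
    keep = corr[corr.abs() >= min_abs_corr].index.tolist()

    # Drop one of any pair of highly redundant features.
    pairwise = abt[keep].corr(numeric_only=True).abs()
    redundant = set()
    for i, a in enumerate(keep):
        for b in keep[i + 1:]:
            if b not in redundant and pairwise.loc[a, b] > max_pairwise:
                redundant.add(b)
    return [f for f in keep if f not in redundant]
```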

6. Model building stage

Once you have an analytical base table the next step is to split your data into 2 or 3 sets depending on how rigorous you want to be.

  1. Training set (40% of records)
  2. Validation set (30% of records)
  3. Holdout set (30% of records)

Use the training set to fit a number of different models, e.g. Tree, Random Forest, Boosted, GLMs, Neural Nets, etc. Then use a lift chart to compare the performance of each of these on the validation set. Iterate fitting the models a number of times using different features until you have the best performing model that you’re content with.
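
In the post this was all done with the Alteryx predictive tools, but the same loop looks roughly like this in scikit-learn: a 40/30/30 split, a handful of candidate models, and a simple top-decile lift on the validation set to compare them. The analytical_base_table.parquet file and ‘churned’ column are assumptions, and the ABT is assumed to be fully numeric.

```python
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

abt = pd.read_parquet("analytical_base_table.parquet")
X, y = abt.drop(columns="churned"), abt["churned"]

# 40 / 30 / 30 split into training, validation and holdout sets.
X_train, X_rest, y_train, y_rest = train_test_split(X, y, train_size=0.4, stratify=y, random_state=1)
X_valid, X_hold, y_valid, y_hold = train_test_split(X_rest, y_rest, train_size=0.5, stratify=y_rest, random_state=1)

def top_decile_lift(y_true, scores):
    """Churn rate in the top-scoring 10% divided by the overall churn rate."""
    ranked = pd.DataFrame({"y": y_true.values, "score": scores}).sort_values("score", ascending=False)
    top = ranked.head(max(1, len(ranked) // 10))
    return top["y"].mean() / ranked["y"].mean()

candidates = {
    "tree": DecisionTreeClassifier(max_depth=6),
    "random_forest": RandomForestClassifier(n_estimators=300),
    "boosted": GradientBoostingClassifier(),
    "glm": LogisticRegression(max_iter=1000),
}
for name, model in candidates.items():
    model.fit(X_train, y_train)
    scores = model.predict_proba(X_valid)[:, 1]
    print(f"{name}: top-decile lift = {top_decile_lift(y_valid, scores):.2f}")
```

A proper lift chart compares cumulative capture across all deciles, but a single top-decile number is usually enough to shortlist a champion and a few challengers before doing the final check on the holdout set.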

We ended up with one champion model and three challenger models, whose performance was quite comparable.

7. Further feature creation

Optional: You can then create further features to improve your model or cater for areas you didn’t consider prior to modelling and refit the models using these.

8. Put the model into production

You can then build a production workflow whereby you recreate the analytical base table (ABT) for existing customers that are active as of today and score their likelihood to churn based on the selected model.

We opted to do this on a fortnightly basis due to the amount of time it took to run (almost half a day!). You can then compare the predicted likelihood to churn against the actual churn rate to track the performance of your models. Once this dips, you’d be well advised to refit/retrain your models; otherwise they will become redundant.
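
A production scoring run, stripped to its bare bones, might look like the sketch below. The post used an Alteryx workflow for this step; the file names and the joblib-pickled champion model here are assumptions.

```python
import joblib
import pandas as pd

champion = joblib.load("champion_model.pkl")                 # the selected champion model
abt_today = pd.read_parquet("abt_active_customers.parquet")  # ABT rebuilt for customers active today

feature_cols = [c for c in abt_today.columns if c != "subscriber_id"]
abt_today["churn_score"] = champion.predict_proba(abt_today[feature_cols])[:, 1]
abt_today[["subscriber_id", "churn_score"]].to_csv("churn_scores.csv", index=False)

# Once the 14-day window has passed, join actual churn outcomes back on and compare
# observed churn rates by predicted-score decile; refit when that lift starts to decay.
```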

Thanks for Reading

I hope this gives you a rough idea of the general process for building models and that you found value in the above. This post has been sitting in my drafts for almost 3 years and I finally got around to finalising it and publishing it.

In compiling this I have made use of resources found across the web. For commercial reasons and obligations to my employer, I have not disclosed any of the specifics of the modelling and the underlying data.

I hope to write a part 2 which delves more into the learnings I got from building this churn model and the things I felt could’ve improved the results. That might take another 3 years.

Building Optimal Alteryx Workflows and Best Practice

These are a few things you should consider when building Alteryx workflows – I hope they help!

Build, Optimise, Repeat

Write an Aggregated Query or Use In-DB Tools

Writing a query directly against the DB, or using the In-DB tools, to do all the complex joins and aggregations will provide the best results, rather than doing these in Alteryx. Compared to an individual workstation, most databases will provide shorter data processing times. Remember to cache your results – you don’t want to overburden the database with needless requests, and caching also lets you develop workflows faster.
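
As a rough illustration of the push-down-and-cache pattern (the connection string, table names and columns are all placeholders):

```python
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("postgresql+psycopg2://user:pass@warehouse:5439/analytics")  # placeholder

# Let the database do the heavy joins and aggregation, not the workstation.
query = """
    SELECT s.subscriber_id,
           COALESCE(c.calls_30d, 0) AS calls_30d,
           COALESCE(u.mb_30d, 0)    AS mb_30d
    FROM subscribers s
    LEFT JOIN (SELECT subscriber_id, COUNT(*) AS calls_30d
               FROM calls GROUP BY subscriber_id) c
           ON c.subscriber_id = s.subscriber_id
    LEFT JOIN (SELECT subscriber_id, SUM(mb_used) AS mb_30d
               FROM data_usage GROUP BY subscriber_id) u
           ON u.subscriber_id = s.subscriber_id
"""
aggregated = pd.read_sql(query, engine)

# Cache the result locally so repeated development runs don't hit the database again.
aggregated.to_parquet("subscriber_aggregates.parquet")
```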

Manually ‘Parallelise’ Workflows

R is single threaded, so most of the R tools only use one processor core at a time. If you’re building/scoring models or using other R-based tools and have a multicore CPU, take advantage of it by running multiple workflows at the same time. Provided you stay within your core count, you won’t notice a significant increase in runtime per workflow, so run roughly as many workflows as you have cores/threads. Just be wary that some of your workflows may use a secondary core, so use trial & error to see what you can get away with.
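
One low-tech way to do this is simply to open several copies of Designer, but you can also script it. The sketch below assumes the command-line engine lives at its default install path and that you have four workflows to score – both are assumptions, so adjust for your environment.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor

ENGINE = r"C:\Program Files\Alteryx\bin\AlteryxEngineCmd.exe"  # assumed install path
workflows = [
    r"C:\workflows\score_segment_1.yxmd",
    r"C:\workflows\score_segment_2.yxmd",
    r"C:\workflows\score_segment_3.yxmd",
    r"C:\workflows\score_segment_4.yxmd",
]

# Roughly one workflow per core; each thread just waits on its own engine process.
with ThreadPoolExecutor(max_workers=4) as pool:
    exit_codes = list(pool.map(lambda wf: subprocess.run([ENGINE, wf]).returncode, workflows))
print(exit_codes)
```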

Memory Usage

Alteryx stores data in both RAM and a temporary location on disk. Minimising the ‘swapping’ between the two when dealing with large datasets can save a lot of time. Alteryx performs optimally when all of the data can be kept in RAM during processing. Generally, RAM is the fastest storage to access, while SSDs and regular HDDs are orders of magnitude slower. Before a computer can work with data it has to load it into RAM (and consequently write much of the data currently in RAM out to disk) – this is what is termed the ‘swap’.

It’s fortunate that Alteryx is very efficient at keeping data in memory for as long as possible.  It will look ahead in your stream to see how far it can get, and often load, process, and begin to write a smaller set of records rather than try and load the entire dataset into memory before beginning processing.  When it reaches a point where it can no longer process the data it has in RAM without processing other data, it will write out its RAM contents to an Alteryx temp file (much like the traditional way that computers use the ‘swap’ or ‘pagefile’).

Alteryx Discussion Board User

There are a few ways to help keep your data in memory:

  1. Use Select and Filter Tools – Keep only the fields or records that you need. Do this early in your process, and keep trimming as things become unnecessary. This is true of all data processing regardless of environment.
  2. Use enough RAM – If you are running multiple workflows simultaneously ensure you are allocating enough RAM to each of them based on what each will need.  This can be tricky to achieve with large datasets, and especially when you want to run multiple streams simultaneously.
  3. Use fast storage – Set your temp file location to your fastest source of storage. If you have the option, set it to an SSD.  You’ll see an immediate increase in performance.
  4. Limit the use of Block Until Done – Because Alteryx tries to do look-ahead processing downstream, a Block Until Done can reduce the efficiency with which Alteryx handles the data. Minimise the number of ‘checkpoints’ that require all data to be processed before the workflow can continue, and use the Block Until Done tool sparingly.
  5. Limit Sorts & Joins …

Sorts & Joins

Sorts and joins are the usual suspects for bad performance because they hit both RAM and processing power, and can often be a ‘checkpoint’ in your workflow.  Alteryx will pick up on the fact that data is already sorted, especially if it’s sorted by a key value going into a join (so sort on the way out of the database if you can!).  If you have to join multiple data streams together, it is usually best to use a Join Multiple rather than a series of regular Joins.

Use Index Keys

The concept of using Index Keys is to replace columns of large data types with a much smaller data type (to act as an index to limit the amount of data you need to process). So if you have a large text string as the key, replace it with an integer key early in the process (before running a crosstab, transposition, join etc.). Rejoin the large key when you need it afterwards.  It may not save you much with a 10 byte string, but it definitely saves you time with a large key.  That said, see if you can minimise crosstabs and transpositions, as they are also inherently expensive.
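
Outside of Alteryx the same trick looks like this in pandas – factorise the wide text key into a small integer, do the heavy work on the integer, then join the text back at the end. The toy key format below is made up for illustration.

```python
import pandas as pd

# Toy example: a long text key that would otherwise be carried through every join/crosstab.
df = pd.DataFrame({
    "account_key": ["AUS-2018-000001-PREPAID", "AUS-2018-000002-POSTPAID", "AUS-2018-000001-PREPAID"],
    "amount": [10.0, 25.0, 7.5],
})

# Replace the wide text key with a small integer surrogate early in the process.
df["account_idx"], uniques = pd.factorize(df["account_key"])
lookup = pd.DataFrame({"account_idx": range(len(uniques)), "account_key": uniques})
slim = df.drop(columns="account_key")

# ... do the expensive joins / crosstabs / transpositions on `slim` ...

# Re-attach the original key only when it's needed for the final output.
result = slim.merge(lookup, on="account_idx", how="left")
print(result)
```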

All said …

“Premature optimisation is the root of all evil.”

Donald Ervin Knuth, The Art of Computer Programming, Volume 1: Fundamental Algorithms

As always try and forget about optimising until you’ve got it working – optimise later.

Using Alteryx to Build Predictive Models

I recently filmed a video on how to use Alteryx to build predictive models for the Institute of Actuaries Australia for a competition they’re running called Weapons of Mass Deduction. My video was shortlisted as a finalist.

Check it out, like it and vote for me if you can.

The video demonstrates how to use the predictive capabilities of Alteryx to build and fit models. I’ve used a Kaggle dataset from a competition run 8 or so years ago. The dataset tracks the success of past grants for funding. The models I’ve built aren’t great but the video does give you a rough idea of how to do this.

I like to make my workflows look as neat and organised as possible. A bit OCD like that.

I didn’t compare my results on Kaggle after putting this together – I doubt they would have performed that well – and I used this dataset for demonstration purposes only. I find that to place high on Kaggle leaderboards you need to use bespoke packages in R or Python like XGBoost or H2O.ai. It would be difficult to produce really high-performing models using the standard packages in Alteryx. That said, the packages in Alteryx still allow you to produce pretty decent working models and are super easy and quick to set up and get working compared to building it all in Python or R. It’s great for doing a POC and iterating fairly quickly before you invest time and effort into building the best model possible using other methods.

The example I’ve shown here is pretty basic and I didn’t delve into building an ensemble model although that would be a relatively easy way to eke out more performance.

I also don’t focus too heavily on parameter selection and tuning to build the best possible model in this video although it’s relatively easy to do in Alteryx. I keep it pretty general and high level.

Enjoy.

Finding the Root Parent in a Hierarchy Using Alteryx

This week I’ve been attempting to solve the problem of finding the root parent in a hierarchy. In this post I’ll try and cover off the approaches I’ve taken to solve this problem.

There are quite a number of methods for doing this across different software. Most prominently, I’ve found solutions to this problem in SQL Server, MySQL, Python, etc.

Here is one really useful article I found that discusses a SQL based solution. It goes through a number of examples and possible situations you may need to solve for.

SQL-queries-to-manage-hierarchical-or-parent-child

It also may be worthwhile referring to the following Stack Overflow question:
Finding-a-top-level-parent-in-sql

This doesn’t help me much, as I’m tied to doing this in either Redshift, Alteryx or R due to the software at my disposal, and most of the solutions I’ve come across are not capable of scaling to millions of records with potentially that many hierarchies.

If you are really interested in delving into the theory behind hierarchies and hierarchical data structures then I would suggest reading up on graph theory.

The problem as it stands can be summarised as follows. I have a table with 2 columns: one that contains the user’s current ID and the other that contains the prior ID. I am trying to solve for the first ID the user ever had.

CurrentID | PreviousID
x1        | 
x2        | x1
x3        | x2
x4        | x3
x5        | x4
y1        | 
y2        | y1
y3        | y2
y4        | y3
y5        | y4
y6        | y5

In the first column I have a unique list of all IDs ever assigned. The above example is simplified – in the actual data the IDs are not formatted such that they can be easily grouped as above.

Basically I took a recursive approach to trace back to the root ID.

    1. Left join the table onto itself (using the Join and Union tools in conjunction), matching the previous ID to the current ID:
Merge on ID_Prev[0] = ID_Curr'[0]
    2. Rename the joined-in previous ID:
ID_prev'[0] -> ID_prev[-1]
    3. Keep only 3 variables from the original table:
ID_Prev[-1], ID_Prev[0], ID_Curr[0]

At this point you will have traced back 2 levels of parentage.

    4. Repeat until the root parent is found:
Using ID_prev[-1], merge against the original table on ID_Curr'[0]
Keep ID_prev'[0] and rename it to ID_prev[-2]

(Remember to drop those records that are only 2 levels deep.)
The table should now resemble:

ID_Prev[-2], ID_Prev[-1], ID_Prev[0], ID_Curr[0]
and you are 3 levels deep
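
For comparison, here’s a minimal sketch of the same idea outside Alteryx: a pandas ‘pointer doubling’ loop that repeatedly replaces each ID’s parent with its grandparent until nothing changes, so the number of passes grows with the log of the deepest chain rather than needing 30 hard-coded joins. It uses the example table above.

```python
import pandas as pd

# The example table from above; an empty PreviousID marks a root.
df = pd.DataFrame({
    "CurrentID":  ["x1", "x2", "x3", "x4", "x5", "y1", "y2", "y3", "y4", "y5", "y6"],
    "PreviousID": [None, "x1", "x2", "x3", "x4", None, "y1", "y2", "y3", "y4", "y5"],
})

# Each ID starts off pointing at its immediate parent (or itself if it is a root).
parent = df.set_index("CurrentID")["PreviousID"]
parent = parent.fillna(pd.Series(parent.index, index=parent.index))

# Pointer doubling: replace every pointer with its parent's pointer until stable.
# The fillna keeps the chain where a previous ID never appears as a current ID.
while True:
    hopped = parent.map(parent).fillna(parent)
    if hopped.equals(parent):
        break
    parent = hopped

df["RootID"] = df["CurrentID"].map(parent)
print(df)
```

On millions of records you would do the same thing with an iterative macro in Alteryx (like the linked solution below) or a recursive query where the database supports one.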

A solution to this problem using a non-serial join approach, presented by Adam Riley on the Alteryx Community, can be found at the link below. It uses an iterative macro – clearly miles ahead of my brute-force attempt. I did something like 30 left joins to achieve my goal and still hadn’t got to the root parent in some cases.

http://community.alteryx.com/t5/Data-Preparation-Blending/Producing-a-denormalised-output-from-Hierarchial-Data/m-p/5164

Useful Articles for Alteryx

Below are some handy articles that I’ve come across on the web that use Alteryx in intriguing ways.

These two articles delve into how to scrape websites for data – using a range of different tools.

A macro that provides the capability to parse strings containing anything from emails to social security numbers.

Automate email distribution of PDFs without the end user having to subscribe to a dashboard.

Alteryx Workflow Dependencies

Found an easy way to change local addresses in Alteryx workflows to UNC paths prior to scheduling a workflow to run.

May be handy.


The Workflow Dependencies window will display all destinations your workflow accesses.


Click Edit to change an address reference.


And you’re done.