Here’s a look back at how this blog has performed over the last year. Considering I started this blog as a study tool way back when I was doing my Fellowship exams I’m quite happy with the performance.
Yearly views were up close to double the year prior.
Most of the views clustered around the near the end of October. You can see the ARIMA post continue to drive most of the traffic while a number of newer posts also gained a fair bit of traffic. Likely due to sharing on cross platforms.
The ARIMA post was again the most visited.
Most of the traffic came from search.
Surprisingly Lithuania drove a lot of traffic.
My posting was fairly infrequent as you can see I barely posted this year. Although I did pick it up towards the end of the year.
You can see year on year the monthly views are tracking up. A lot of this is not due to any marketing or sharing on my part. Google is doing it’s thing. The more content I share the more people end up here.
This year I plan to write a lot more than I have previously. Not sure if I can top 2019 but here goes for 2020.
Fielding is one of the most crucial parts of the game and yet it almost always ranks after batting and bowling when it comes to order of importance on most player’s agendas. You would have noticed this at most in training sessions. Fielding has the potential to create the difference between a loss and a win. Most games are won and lost by narrow margins. The margin between victory and loss is usually in the vicinity of 10-20 runs, a few balls, or a couple of wickets. Missed fielding attempts, dropped catches, overthrows, missed run outs could all add up to this margin over the duration of the match.
There are certain factors in particular that lead to these errors:
Being aware of what is happening around you is a key fundamental to fielding. If you aren’t paying attention it becomes very hard to respond when the ball is hit in your direction. Fielding is no different to batting or bowling when it comes to concentration. A simple method most fielders use to aid their concentration is to walk in with the bowler as he runs in. This serves two purposes; the act of repetition just prior to a fielding attempt allows your mind to get into the zone where it knows to expect a ball to come flying at you, and secondly it keeps you on the move and agile to respond in either direction to chase down a ball faster than you would with a stop start attempt.
Pretty hard to field anything if you aren’t there. While you may be thinking it’s the responsibility of the captain to place you in the right spot, it’s still up to you make active decisions to your position as the game develops. The most obvious one is when a batsman changes ends and a new batsman is on strike. You will over time some intricacy of the playing
Lack of composure
If a ball is hit in your direction the worst thing you can do is panic. You don’t have very much time before the ball will reach you. So it’s best to remain composed at all times
Lack of technique
Using the correct technique when it comes to stopping, catching, or throwing the ball helps immensely. For catching, the simple rule is always use two hands, and one if two can’t get there.
For flat catches you can use the following decision making framework. If the ball is at or below waist height cup the hands with fingers pointing downwards. If the ball is at or above chest height then hands should be cupped pointing towards the sky. This can be difficult for the ones closer to your waist/chest which is a slight grey area. For these you should adjust your height using your knees to go into a slight crouching or jumping position and catch with hands pointing up or down respectively.
For the high ones you can cup your hands either way. The most important thing here is getting to where the ball is expected to land and intercept it right underneath it. To get to where the ball is going to land you need assess where the ball is headed. There are three proponents to this:
the line the ball is travelling
the distance it’s been hit
and the trajectory
Once there, use slight adjustments to pinpoint the best position. Keep your feet steady and hold a stable position. Maintain a slight flex in the knees to keep you agile. If you are catching with your fingers pointing backwards you will be more underneath the ball than if you are catching with your fingers pointing forward.
Keep your hands together. Some people interlace their smaller fingers one behind the other while others have their pinky’s and the side of their hands touching. It’s really personal preference on this.
Lack of Effort
Not much to be said here – some people are just too lazy and don’t put in enough because they don’t care enough about fielding or simply just give up.
If you address all these factors independently in our fielding, you will in aggregate improve.
In this post I will outline how you can build a churn model in 8 simple steps.
The discussion will be fairly high level and should provide a good breakdown for anyone embarking on the journey regardless of what you would like to build a model for. The general process can be applied to any domain. The concepts are discussed at a high level and do not delve into the nitty gritty technical details of building a model. If you came for more than just that then you might be disappointed.
Generally churn can be defined as two cohorts – those that churn and those that don’t. The exercise in a large part is to predict those that do with highest likelihood over a given timeframe.
This post centres around a telco customer base use case – but the principles can be applied universally.
If you’re interested here’s a list of the tools that I’ve used:
SQL to extract data from the data warehouse/lake
Alteryx to do most of the post SQL ETL as well as all of the modelling
Tableau to explore the data as well as visualise and track the performance of the models
Here’s a video of me demonstrating how to fit models using Alteryx on a prepared set of university grant data from Kaggle.
With all that said let’s begin.
1. Define the Business Problem
The first step for all good data science projects is to DEFINE the problem you’re looking to address.
For example, I want to predict customers that are likely to churn within the next 14 days, or the likelihood of a customer to turn off renewal in the next 7 days, etc.
Also identify the population of customers you want to model. You want to ensure that this population is not too diverse. Don’t try and model your whole customer base with one model. Be specific about the group of customers you will build a model for. As an example, build a churn model for high value customers on data only plans.
2. Data Acquisition
You then want to acquire the data that you will need to build a model to address the business problem.
Extract or obtain all the information you can get your hands on that will be readily available well in to the future so that you’ll be able to score and refit your models.
We acquired 13 data sources ranging from demographic, billing, product type, utilisation, customer contacts, etc.
These will be at varying grains of detail depending on how and where you extract the data from (hourly, daily, monthly to event level per subscriber).
The most important consideration when choosing which data sources to pull in is determining whether the data is reproducible and can be updated on an ongoing basis.
3. Subset the Data to the problem set
We then need to ensure that all data sources are subset to the problem that we’re modelling.
We might have datasets that records all events up until today – we need to cut this dataset at the point in time that we are going to make the prediction. E.g. for a dataset that contains every time a customer logs in to the app we need to filter any logins to the app after the event date. In the case of a model that predicts churn in the next 14 days then we need to subset the logins dataset prior to the event date whether that be churn or non churn and only use that in the fitting process of building the model. For those customers that are still active today we need to filter login records to 14 days prior to the date the data was extracted.
4. Univariate Analysis of all data
The next step is to cull any unnecessary, redundant or useless data fields.
What you’re looking for here is to remove any fields which show erroneous data. E.g. look at % null or missing, averages, total unique observations, distribution or spread, etc.
5. Base table build (including feature creation)
Once you have a set of cleansed datasets ready to go you can begin the formation of an analytical base table. This will be used to build and fit the models.
We created approximately a thousand different features metrics/fields across the datasets. Features that we built included:
Number of calls in the past X days
Amount of data used in the past X days
Last bill amount
Overdue bill flag
Payment made in the last X days
It would be impossible to feed a thousand different features through any modelling exercise due to the sheer number. So we did some testing of these ‘new’ features against churn using a number of bivariate tests like correlation matrices etc. We were able to cull these down to about 70 odd features that showed some correlation to churn before we began modelling.
6. Model building stage
Once you have an analytical base table the next step is to split your data into 2 or 3 sets depending on how rigorous you want to be.
Training set (40% of records)
Validation set (30% of records)
Holdout set (30% of records)
Use the training set to fit a number of different models e.g. Tree, Random Forest, Boosted, GLM’s, Neural Nets, etc. Then use a lift chart to compare the performance between each of these using the validation set. Iterate fitting the models a number of times using different features until you have the the best performing model possible that you’re content with.
We ended up with 1 champion model and 3 challenger models. You can see their performance is quite comparable.
7. Further feature creation
Optional: You can then create further features to improve your model or cater for areas you didn’t consider prior to modelling and refit the models using these.
8. Put the model into production
You can then build a production workflow whereby you recreate the analytical base table (ABT) for existing customers that are active as of today and score their likelihood to churn based on the selected model.
We opted to do this on a fortnightly basis due to the amount of time it took to run (almost half a day!). You can then track the predicted likelihood to churn against the actual churn rate to track the performance of your models. Once this dips you’d be well advised to refit/retrain your models. Otherwise they will become redundant.
Thanks for Reading
I hope this gives you a rough idea of the general process for building models and you found value in the above. This post has been sitting in my draft for almost 3 years and I finally got around to finalising it and publishing it.
In compiling this I have made use of resources found across the web. For commercial reasons and obligations to my employer, I have not disclosed any of the specifics of the modelling and the underlying data.
I hope to write a part 2 which delves more into the learnings I got from building this churn model and the things I felt could’ve improved the results. That might take another 3 years.
These are a few things you should consider when building Alteryx workflows, I hope they help!
Write an Aggregated Query or Use In-DB Tools
Writing a query direct to the DB or using the In-DB tools that does all the complex joins and aggregations will provide the best results, rather than doing these in Alteryx. Compared to an individual workstation most databases will provide shorter data processing times. Remember to cache your results as you don’t want to overburden the database with needless requests. It will also allow you to develop workflows faster.
Manually ‘Parallelise’ Workflows
R is single threaded and so most of the R tools only use one processor core at a time. If you’re building/scoring models or using other R based tools and have a multicore CPU take advantage of it by running multiple workflows at the same time. Depending on the number of cores you have, you won’t notice a significant increase in runtime per workflow. Run as many workflows as you cores/threads. Just be vary that some of your workflows may use a secondary core so use trial & error to see what you can get away with.
Alteryx stores data in both RAM and a temporary location on disk. Minimising the use of ‘swapping’ between the two when dealing with large datasets can save a lot time. Alteryx performs optimally when all of the data can be kept in RAM during processing. Generally, RAM is the fastest storage to access, while SSDs and regular HDDs are a magnitude slower than RAM. Before you can work with data, a computer has to load it into RAM (and consequently write much of the data currently in RAM to the disk). This is what is termed as the ‘swap’.
It’s fortunate that Alteryx is very efficient at keeping data in memory for as long as possible. It will look ahead in your stream to see how far it can get, and often load, process, and begin to write a smaller set of records rather than try and load the entire dataset into memory before beginning processing. When it reaches a point where it can no longer process the data it has in RAM without processing other data, it will write out its RAM contents to an Alteryx temp file (much like the traditional way that computers use the ‘swap’ or ‘pagefile’).
Alteryx Discussion Board User
There are a few ways to help keep your data in memory:
Use Select and Filter Tools –Keep only the fields or records that you need. Do this early in your process, and keep trimming as things become unnecessary. This is true of all data processing regardless of environment.
Use enough RAM – If you are running multiple workflows simultaneously ensure you are allocating enough RAM to each of them based on what each will need. This can be tricky to achieve with large datasets, and especially when you want to run multiple streams simultaneously.
Use fast storage – Set your temp file location to your fastest source of storage. If you have the option, set it to an SSD. You’ll see an immediate increase in performance.
Limit the use of block until done – Because Alteryx tries do look ahead processing down stream putting block until done can reduce the efficiency with which Alteryx can handle the data. So minimise the amount of ‘checkpoints’ that require all data to be processed before it can continue. Only use the block until done tool sparingly.
Limit Sorts & Joins …
Sorts & Joins
Sorts and joins are usual suspects for bad performance because it hits on both RAM and processing power, and can often be a ‘checkpoint’ in your workflow. Alteryx will pick up on the fact that data is already sorted, especially if it’s sorted by a key value going into a join (so sort on the way out of the database if you can!) If you have to join multiple data streams together, it is usually best to use a Join Multiple rather than a series of regular Joins.
Use Index Keys
The concept of using Index Keys is to replace columns of large data types with a much smaller data type (to act as an index to limit the amount of data you need to process). So if you have a large text string as the key, replace it with an integer key early in the process (before running a crosstab, transposition, join etc.). Rejoin the large key when you need it afterwards. It may not save you much with a 10 byte string, but it definitely saves you time with a large key. That said, see if you can minimise crosstabs and transpositions, as they are also inherently expensive.
I recently filmed a video on how to use Alteryx to build predictive models for the Institute of Actuaries Australia for a competition they’re running called Weapons of Mass Deduction. My video was shortlisted as a finalist.
Check it out, like it and vote for me if you can.
The video demonstrates how to use the predictive capabilities of Alteryx to build and fit models. I’ve used a Kaggle dataset from a competition run 8 or so years ago. The dataset tracks the success of past grants for funding. The models I’ve built aren’t great but the video does give you a rough idea of how to do this.
I like to make my workflows look as neat and organised as possible. A bit OCD like that.
I didn’t compare my results on Kaggle after putting this together, I doubt it would have performed that well. I purely used this dataset for demonstration purposes only. I find that to place high on kaggle leaderboards you need to use bespoke packages in R or Python like XGBoost or H20.ai etc. It would be difficult to produce really high performing models using the standard packages in Alteryx. That said the packages in Alteryx still allow you to produce pretty decent working models and are super easy and quick to set up and get working compared to building it in Python or R. It’s great for doing a POC and iterating fairly quickly before you invest time and effort into building the best model possible using other methods.
The example I’ve shown here is pretty basic and I didn’t delve into building an ensemble model although that would be a relatively easy way to eke out more performance.
I also don’t focus too heavily on parameter selection and tuning to build the best possible model in this video although it’s relatively easy to do in Alteryx. I keep it pretty general and high level.
Here is a breakdown of the performance of this blog over the last couple of years since I began writing it.
As you can see it’s been growing at a steady clip over time. A lot of this is likely due to being better represented in googles archives. This is the year on year reader/viewer counts. The overall number is pretty low however better than nothing.
Month wise you can see that the peaks in traffic happen around exam time – April and October get heavy traffic. Unusually December also sees a peak in traffic. There must be something festive about reading actuarial content. (wishful thinking)
The week on week performance is fairly volatile – however one thing is obvious someone really thoroughly went through my content on Christmas week. Here’s to you buddy. 🙂
So who’s been reading? Primarily Australians, Americans, Brits, Indians, and a good splattering of people from around the world. Not many Africans or middle eastern people by the looks of it.
My most popular piece by far was about finding a root parent from a long format dataset using a little known data manipulation software called Alteryx. My subsequent most popular pieces are based on modelling methodologies. My piece on the Actuarial Code of Conduct didn’t get much love with only 13 views.
Overall I’m pretty happy with what I’ve achieved with this blog thus far – and will continue to publish more pieces over time. Thanks for reading.
Ok here goes. It may be tough to get it all in layman’s terms but I’ve tried to simplify the concepts as much as I can without getting too mathematical. (note the charts below have been sourced from other websites)
Firstly differencing is as the name suggests taking the difference between two terms. For normal differencing you are trying to remove any trends. For seasonal differencing you are trying to remove any trends across seasons. The aim of this is to ensure the series is stationary before modelling. See below an example. The bottom one is non stationary while the one above is considered stationary.
Normal differencing as referred to in ARIMA (and more generally time series modelling) is used to de-trend the data. 1 level of differencing would mean you take the current value and subtract the prior value from it. This is done for all time points and you get another resultant time series. If this series still shows a trend then you can do another level of differencing with the first level differenced series. This would result in new series with 2 levels of differencing applied.
Seasonal differencing takes the difference between the current value and the previous seasons value at the same time. For example, if you have are modelling at the monthly level then the differencing applied would be x(t) – x(t-12). This will remove any first level seasonal trends. Higher levels would just be taking the difference of the differences. Below is an example where you can see that the seasonal component is getting larger and larger. This is a case where it would be reasonable to do seasonal differencing. Obviously in practice you are rarely working with such clean data. This series would also require normal differencing to remove the upwards trend.
If you were to combine both seasonal and normal differencing then it would look something like this. You take the normal difference first – then you take the seasonal difference.
(x(t) - x(t-1)) - (x(t-12) - x(t-13))
Moving on to the autoregressive and moving average components. If you want to think of it quite simply in terms of a regression model then autoregressive components refer to prior values of the current value. So if you consider x(t) as the current value then the first AR component is x(t-1) multiplied by a fitted coefficient. The second AR component would be x(t-2) and so on. These are often referred to as lagged terms. So the prior value is called the first lag, and the one prior that the second lag, and so on.
Moving average is where it gets a bit complicated. The name is misleading a bit. These components don’t actually refer to the moving average of the series itself. But rather try to use past errors to predict the current value. To put it simply if your series is stationary then it will have a constant mean, however at any point in time it may not actually be at the mean. The distance from the mean is called the error.
To put it mathematically:
x(t) = Mean + Error(t)
x(t-1) = Mean + Error(t-1)
So the first order of a moving average component would be Error(t-1), second order Error(t-2), and so on.
Hope that helps and improves your understanding of ARIMA time series models and it’s underlying components.
Alteryx has a few key tools that allow you to make your workflows design visually appealing.
The primary objects are:
You can also choose to have your workflow run vertically or horizontally This is personal preference in most scenarios I will tend to use horizontal. I have used vertical in some instances where I thought it would be better suited.
Elements of design
LINE – The linear marks made with a pen or brush or the edge created when two shapes meet.
SHAPE – A shape is a self contained defined area of geometric (squares and circles), or organic (free formed shapes or natural shapes). A positive shape automatically creates a negative shape.
DIRECTION – All lines have direction – Horizontal, Vertical or Oblique. Horizontal suggests calmness, stability and tranquillity. Vertical gives a feeling of balance, formality and alertness. Oblique suggests movement and action
SIZE – Size is simply the relationship of the area occupied by one shape to that of another.
TEXTURE – Texture is the surface quality of a shape – rough, smooth, soft hard glossy etc.
COLOUR – Colour is light reflected off objects. Color has three main characteristics: hue or its name (red, green, blue, etc.), value (how light or dark it is), and intensity (how bright or dull it is).