Using Alteryx to Build Predictive Models

I recently filmed a video on how to use Alteryx to build predictive models for Weapons of Mass Deduction, a competition run by the Institute of Actuaries Australia. My video was shortlisted as a finalist.

Check it out, like it and vote for me if you can.

The video demonstrates how to use the predictive capabilities of Alteryx to build and fit models. I’ve used a Kaggle dataset from a competition run 8 or so years ago. The dataset tracks the success of past grants for funding. The models I’ve built aren’t great, but the video gives you a rough idea of the process.

I like to make my workflows look as neat and organised as possible. A bit OCD like that.

I didn’t compare my results on Kaggle after putting this together; I doubt they would have performed that well, as I used the dataset purely for demonstration. I find that to place high on Kaggle leaderboards you need bespoke packages in R or Python like XGBoost or H2O.ai, and it would be difficult to produce really high-performing models using the standard packages in Alteryx. That said, the Alteryx packages still let you produce pretty decent working models, and they are far quicker to set up and get working than building the equivalent in Python or R. It’s great for doing a POC and iterating quickly before you invest time and effort into building the best model possible using other methods.

The example I’ve shown here is pretty basic, and I didn’t delve into building an ensemble model, although that would be a relatively easy way to eke out more performance.

I also don’t focus too heavily on parameter selection and tuning in this video, although both are relatively easy to do in Alteryx. I keep it pretty general and high-level.

Enjoy.

GLM Process

1-Pick distributions

  • Claim sizes and frequencies are generally modelled using Gamma and Poisson distributions respectively.
  • Otherwise, the response variable would need further examination to choose an appropriate distribution.

2-Pick link function

  • This depends on the nature of the response variable. For example, a non-negative variable would use the log link function, whereas a variable between 0 and 1 would use a logit link function.

3-Analyse data

  • What explanatory variables have been provided?
  • What does the response variable look like by each explanatory variable? One-way summaries or pivot tables could be analysed for insight.
  • Consider grouping for categorical variables.
  • Consider transformations for variables with non-linear shapes.

4-Optimising/fitting using maximum likelihood estimation (in practice done by software).

5-Assessing output and p-values

  • Some variables may be dropped based on their statistical insignificance.
  • Data mining or decision trees can help to find areas that are not fitting well and refine them.
  • Address any large individual observations or outliers distorting results.
  • Fitting smooth curves across factor levels can reduce over-fitting.
  • This may be an iterative process requiring judgment.

6-Testing how well the GLM predicts using a held-out subset of the data (see the sketch below)
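
For readers who want to see the same process outside Alteryx, here is a minimal end-to-end sketch in Python with statsmodels. Every column name and number below is a hypothetical stand-in, not from the video or the Kaggle dataset:

```python
# Minimal GLM sketch following steps 1-6 (all data here is synthetic).
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

rng = np.random.default_rng(42)

# Synthetic stand-in data: a non-negative response (claim size)
# and two hypothetical categorical rating factors.
n = 1000
df = pd.DataFrame({
    "vehicle_age": rng.integers(0, 4, n),
    "region": rng.choice(["metro", "rural"], n),
})
df["claim_size"] = rng.gamma(shape=2.0, scale=500 * (1 + 0.2 * df["vehicle_age"]))

# Step 6 prep: hold out a subset of the data for testing.
mask = rng.random(n) < 0.75
train, test = df[mask], df[~mask]

# Steps 1-2: a Gamma distribution with a log link suits a non-negative
# response like claim size; Poisson with a log link would suit claim counts.
model = smf.glm(
    "claim_size ~ C(vehicle_age) + C(region)",  # step 3: explanatory variables
    data=train,
    family=sm.families.Gamma(link=sm.families.links.Log()),
)

# Step 4: fitted by maximum likelihood (IRLS under the hood).
result = model.fit()

# Step 5: inspect estimates and p-values; drop insignificant terms and refit.
print(result.summary())

# Step 6: compare predictions against the held-out data.
pred = result.predict(test)
print("mean actual:", test["claim_size"].mean(), "mean predicted:", pred.mean())
```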

GLM Pros & Cons

Pros
– Removes bias caused by change in mix of business
– Gives the standard error of each estimate, quantifying variance around the mean
– Pure effects of independent variables on the dependent variable can be observed
– Assumption that “error terms have a variance independent of the mean” does not need to be made
– Overcomes shortcomings of standard Linear Modelling

Cons
– High-powered computing hardware required
– Processing time
– Large number of trial and error runs
– Difficult to review
– Requires considerable experience and large portfolios to build in-depth models

GLM Model Validation and Checks

Assess Output Reasonableness
– Standard errors < estimated values
– Base reflects categories with highest exposure
– Factors with similar estimates should be grouped
– Relativities for factors make intuitive sense
– Ensure continuous curves are only fitted to reasonable factors
– Ensure interaction effects have been allowed for
– Assess changes to the model one at a time

Compare Actual vs Modelled
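
As a quick illustration, an actual-vs-modelled comparison by rating factor might look like this, continuing from the hypothetical `result` and `test` objects in the sketch above:

```python
# Actual vs modelled average response by a rating factor,
# reusing result and test from the earlier (synthetic) GLM sketch.
comparison = test.assign(modelled=result.predict(test))
print(comparison.groupby("region")[["claim_size", "modelled"]].mean())
```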

Deviance
– how far the actual data differs from the fitted model

Degrees of Freedom
– number of places the actual data could differ from model
– number of observations less the number of fitted parameters

Statistical tests
– F-Test: [(D2 – D1)/(DF2 – DF1)] / [D1/DF1]
  • compares two nested models (model 1 the larger) to check overall goodness of fit
  • D = deviance, DF = degrees of freedom
– t/z-Test: checks the significance of individual parameter estimates
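
To make the F-test concrete, here is a worked sketch with made-up deviances and parameter counts (model 1 is the larger model, model 2 the smaller nested one):

```python
# Worked F-test between two nested GLMs (all numbers hypothetical).
from scipy import stats

n = 10_000                # observations
D2, p2 = 12_400.0, 10     # deviance and parameter count, smaller model
D1, p1 = 12_150.0, 14     # deviance and parameter count, larger model

DF2 = n - p2              # degrees of freedom = observations less parameters
DF1 = n - p1

F = ((D2 - D1) / (DF2 - DF1)) / (D1 / DF1)
p_value = stats.f.sf(F, DF2 - DF1, DF1)
print(f"F = {F:.2f}, p = {p_value:.4g}")
# A small p-value suggests the extra parameters in model 1 are worthwhile.
```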

Use Residual Plots to assess:
– Heteroskedasticity
– Outliers
– Bias
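
A residual plot along these lines can be produced from the hypothetical `result` object fitted in the earlier sketch:

```python
# Residual diagnostics, reusing the synthetic `result` from above.
import matplotlib.pyplot as plt

fitted = result.fittedvalues
resid = result.resid_deviance   # deviance residuals of the fitted GLM

plt.scatter(fitted, resid, alpha=0.3)
plt.axhline(0, color="grey", linewidth=1)
plt.xlabel("Fitted values")
plt.ylabel("Deviance residuals")
# A funnel shape suggests heteroskedasticity; isolated points are outliers;
# a systematic drift away from zero suggests bias.
plt.show()
```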

QQ Plot
– to assess residual quantiles against the expected residuals of a normal distribution
– should be a straight 45-degree line; otherwise:
  • points above the line mean the tail is heavier than assumed
  • points below the line mean the tail is lighter than assumed
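
And a QQ plot of the same residuals, again reusing the hypothetical `result`:

```python
# QQ plot of deviance residuals against normal quantiles,
# reusing the synthetic `result` fitted earlier.
import statsmodels.api as sm
import matplotlib.pyplot as plt

sm.qqplot(result.resid_deviance, line="45", fit=True)
plt.show()
```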