Building Optimal Alteryx Workflows and Best Practice

Here are a few things you should consider when building Alteryx workflows. I hope they help!

Build, Optimise, Repeat

Write an Aggregated Query or Use In-DB Tools

Writing a query directly against the database, or using the In-DB tools, to do all the complex joins and aggregations will give better results than doing these steps in Alteryx. Compared to an individual workstation, most databases will process the data in less time. Remember to cache your results so you don’t overburden the database with needless requests; caching will also let you develop workflows faster.
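
To illustrate the idea outside Alteryx itself, here is a minimal Python sketch that pushes the join and aggregation down to the database and pulls back only the summarised result. The connection string, table names and column names are all hypothetical.

import pandas as pd
from sqlalchemy import create_engine

# Hypothetical connection string - swap in your own database details.
engine = create_engine("postgresql://user:password@dbserver/sales_db")

# The database does the heavy joins and aggregation; only the small
# summarised table comes back over the network.
query = """
    SELECT c.region,
           DATE_TRUNC('month', o.order_date) AS order_month,
           SUM(o.amount)                     AS total_sales
    FROM   orders o
    JOIN   customers c ON c.customer_id = o.customer_id
    GROUP  BY c.region, DATE_TRUNC('month', o.order_date)
"""
summary = pd.read_sql(query, engine)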

Manually ‘Parallelise’ Workflows

R is single threaded, so most of the R-based tools only use one processor core at a time. If you’re building or scoring models, or using other R-based tools, and have a multi-core CPU, take advantage of it by running multiple workflows at the same time. Depending on the number of cores you have, you won’t notice a significant increase in runtime per workflow. Run roughly as many workflows as you have cores/threads. Just be aware that some of your workflows may use a secondary core, so use trial and error to see what you can get away with.
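
If you have the command-line engine available (AlteryxEngineCmd.exe, which comes with the Desktop Automation licence), one way to parallelise manually is to launch several workflows from a small script. This is only a sketch: the engine path and workflow file names below are assumptions you would replace with your own.

import subprocess
from concurrent.futures import ThreadPoolExecutor

# Assumed install path and workflow locations - adjust for your machine.
ENGINE = r"C:\Program Files\Alteryx\bin\AlteryxEngineCmd.exe"
workflows = [r"C:\workflows\score_model_a.yxmd",
             r"C:\workflows\score_model_b.yxmd",
             r"C:\workflows\score_model_c.yxmd"]

def run(workflow):
    # Each call blocks until that workflow finishes; the pool runs them concurrently.
    result = subprocess.run([ENGINE, workflow], capture_output=True, text=True)
    return result.returncode

with ThreadPoolExecutor(max_workers=3) as pool:  # roughly one per spare core
    for workflow, code in zip(workflows, pool.map(run, workflows)):
        print(workflow, "finished OK" if code == 0 else f"failed with code {code}")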

Memory Usage

Alteryx stores data both in RAM and in a temporary location on disk. Minimising the ‘swapping’ between the two when dealing with large datasets can save a lot of time. Alteryx performs best when all of the data can be kept in RAM during processing. Generally, RAM is the fastest storage to access, while SSDs and regular HDDs are orders of magnitude slower. Before a computer can work with data it has to load it into RAM (and consequently write much of the data currently in RAM out to disk). This is what is termed ‘swapping’.

Fortunately, Alteryx is very efficient at keeping data in memory for as long as possible. It will look ahead in your stream to see how far it can get, and will often load, process, and begin to write a smaller set of records rather than try to load the entire dataset into memory before beginning processing. When it reaches a point where it can no longer process the data it has in RAM without processing other data, it will write out its RAM contents to an Alteryx temp file (much like the traditional way that computers use the ‘swap’ or ‘pagefile’).

Alteryx Discussion Board User

There are a few ways to help keep your data in memory:

  1. Use Select and Filter tools – Keep only the fields or records that you need. Do this early in your process, and keep trimming as things become unnecessary. This is true of all data processing, regardless of environment.
  2. Use enough RAM – If you are running multiple workflows simultaneously ensure you are allocating enough RAM to each of them based on what each will need.  This can be tricky to achieve with large datasets, and especially when you want to run multiple streams simultaneously.
  3. Use fast storage – Set your temp file location to your fastest source of storage. If you have the option, set it to an SSD.  You’ll see an immediate increase in performance.
  4. Limit the use of Block Until Done – Because Alteryx does look-ahead processing downstream, adding a Block Until Done can reduce the efficiency with which Alteryx handles the data. Minimise the number of ‘checkpoints’ that require all data to be processed before the workflow can continue, and use the Block Until Done tool sparingly.
  5. Limit Sorts & Joins …

Sorts & Joins

Sorts and joins are the usual suspects for bad performance because they hit both RAM and processing power, and can often be a ‘checkpoint’ in your workflow. Alteryx will pick up on the fact that data is already sorted, especially if it’s sorted by a key value going into a join (so sort on the way out of the database if you can!). If you have to join multiple data streams together, it is usually better to use a single Join Multiple rather than a series of regular Joins.

Use Index Keys

The idea behind index keys is to replace columns of large data types with a much smaller data type that acts as an index, limiting the amount of data you need to process. So if you have a large text string as the key, replace it with an integer key early in the process (before running a crosstab, transpose, join, etc.), and rejoin the large key afterwards when you need it. It may not save you much with a 10-byte string, but it definitely saves time with a large key. That said, see if you can minimise crosstabs and transposes, as they are also inherently expensive.
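
As a rough illustration of the same trick outside Alteryx, here is a small pandas sketch (the column names and values are made up): swap the long text key for a small integer surrogate before the expensive step, then join the original key back at the end.

import pandas as pd

# Made-up data with a long text key.
df = pd.DataFrame({"policy_ref": ["POL-2017-000123-AU-SYD",
                                  "POL-2017-000124-AU-MEL",
                                  "POL-2017-000123-AU-SYD"],
                   "premium": [1200.0, 950.0, 300.0]})

# Replace the long text key with a small integer key, keeping a lookup table.
codes, uniques = pd.factorize(df["policy_ref"])
df["policy_key"] = codes
lookup = pd.DataFrame({"policy_key": range(len(uniques)), "policy_ref": uniques})

# Do the heavy lifting on the small integer key only...
summary = df.groupby("policy_key", as_index=False)["premium"].sum()

# ...then join the long text key back on at the end.
summary = summary.merge(lookup, on="policy_key")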

All said …

“Premature optimisation is the root of all evil.”

Donald Ervin Knuth, The Art of Computer Programming, Volume 1: Fundamental Algorithms

As always, try to forget about optimising until you’ve got it working – optimise later.


Using Alteryx to Build Predictive Models

I recently filmed a video on how to use Alteryx to build predictive models for the Institute of Actuaries Australia for a competition they’re running called Weapons of Mass Deduction. My video was shortlisted as a finalist.

Check it out, like it and vote for me if you can.

The video demonstrates how to use the predictive capabilities of Alteryx to build and fit models. I’ve used a Kaggle dataset from a competition run 8 or so years ago. The dataset tracks the success of past grants for funding. The models I’ve built aren’t great but the video does give you a rough idea of how to do this.

I like to make my workflows look as neat and organised as possible. A bit OCD like that.

I didn’t compare my results on Kaggle after putting this together; I doubt it would have performed that well. I used this dataset purely for demonstration purposes. I find that to place high on Kaggle leaderboards you need to use bespoke packages in R or Python such as XGBoost or H2O.ai. It would be difficult to produce really high-performing models using the standard packages in Alteryx. That said, the packages in Alteryx still let you produce pretty decent working models, and they are much quicker and easier to set up and get working than building the equivalent in Python or R. It’s great for doing a POC and iterating fairly quickly before you invest time and effort into building the best model possible using other methods.

The example I’ve shown here is pretty basic and I didn’t delve into building an ensemble model although that would be a relatively easy way to eke out more performance.

I also don’t focus too heavily on parameter selection and tuning to build the best possible model in this video although it’s relatively easy to do in Alteryx. I keep it pretty general and high level.

Enjoy.

Who’s been reading

Here is a breakdown of the performance of this blog over the last couple of years since I began writing it.

As you can see, it’s been growing at a steady clip over time. A lot of this is likely due to being better represented in Google’s search index. These are the year-on-year reader/viewer counts. The overall numbers are pretty low, but better than nothing.

[Chart: year-on-year visitor counts]

Month-wise, you can see that the peaks in traffic happen around exam time – April and October get heavy traffic. Unusually, December also sees a peak in traffic. There must be something festive about reading actuarial content (wishful thinking).

[Chart: monthly traffic]

The week-on-week performance is fairly volatile – however, one thing is obvious: someone went through my content really thoroughly during Christmas week. Here’s to you, buddy. 🙂

[Chart: weekly traffic]

So who’s been reading? Primarily Australians, Americans, Brits, Indians, and a good smattering of people from around the world. Not many readers from Africa or the Middle East, by the looks of it.

[Chart: visitors by country]

My most popular piece by far was about finding a root parent in a long-format dataset using a little-known data manipulation tool called Alteryx. My next most popular pieces are about modelling methodologies. My piece on the Actuarial Code of Conduct didn’t get much love, with only 13 views.

[Chart: views by post]

Overall I’m pretty happy with what I’ve achieved with this blog thus far – and will continue to publish more pieces over time. Thanks for reading.

ARIMA Modelling in Layman Terms

OK, here goes. It may be tough to get it all across in layman’s terms, but I’ve tried to simplify the concepts as much as I can without getting too mathematical. (Note: the charts below have been sourced from other websites.)

Firstly, differencing is, as the name suggests, taking the difference between two terms. With normal differencing you are trying to remove any trend; with seasonal differencing you are trying to remove any trend across seasons. The aim is to ensure the series is stationary before modelling. See the example below: green is non-stationary, while blue is considered stationary.

[Chart: non-stationary (green) vs stationary (blue) series]

 

Normal differencing, as referred to in ARIMA (and in time series modelling more generally), is used to de-trend the data. One level of differencing means you take the current value and subtract the prior value from it. Doing this for every time point gives you a new time series. If that series still shows a trend, you can apply another level of differencing to the first-differenced series, resulting in a series with two levels of differencing applied.

Seasonal differencing takes the difference between the current value and the previous season’s value at the same point in the cycle. For example, if you are modelling at the monthly level with yearly seasonality, the differencing applied would be x(t) - x(t-12). This will remove any first-level seasonal trend; higher levels would just be taking differences of the differences. Below is an example where you can see that the seasonal component is getting larger and larger. This is a case where it would be reasonable to do seasonal differencing. Obviously, in practice you are rarely working with such clean data. This series would also require normal differencing to remove the upward trend.

[Chart: series with a growing seasonal component and an upward trend]

If you were to combine both seasonal and normal differencing, it would look something like this (take the normal difference first, then the seasonal difference):



(x(t) - x(t-1)) - (x(t-12) - x(t-13))
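
If you prefer code to notation, here is a small pandas sketch of the same three operations on a monthly series (the numbers are invented purely to show the mechanics):

import pandas as pd

# A made-up monthly series, just to demonstrate the operations.
idx = pd.date_range("2015-01-31", periods=36, freq="M")
x = pd.Series(range(36), index=idx, dtype=float)

normal   = x.diff(1)           # x(t) - x(t-1): removes the trend
seasonal = x.diff(12)          # x(t) - x(t-12): removes yearly seasonality
combined = x.diff(1).diff(12)  # (x(t) - x(t-1)) - (x(t-12) - x(t-13))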

Moving on to the autoregressive and moving average components. If you want to think of it simply, in terms of a regression model, the autoregressive components are prior values of the series used to predict the current value. So if you consider x(t) to be the current value, then the first AR component is x(t-1) multiplied by a fitted coefficient, the second AR component is x(t-2), and so on. These are often referred to as lagged terms: the prior value is called the first lag, the one before that the second lag, and so on.
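
In code, a lag is just the series shifted back in time; a quick pandas sketch with made-up numbers:

import pandas as pd

x = pd.Series([10.0, 12.0, 11.0, 13.0, 14.0])
lags = pd.DataFrame({"x": x,
                     "lag1": x.shift(1),   # first lag, x(t-1)
                     "lag2": x.shift(2)})  # second lag, x(t-2)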

The moving average part is where it gets a bit complicated. The name is a little misleading: these components don’t actually refer to a moving average of the series itself, but rather use past errors to predict the current value. To put it simply, if your series is stationary then it will have a constant mean, but at any point in time it may not actually be at the mean. The distance from the mean is called the error.

To put it mathematically:

x(t) = Mean + Error(t)

x(t-1) = Mean + Error(t-1)

So the first-order moving average component would be Error(t-1), the second order Error(t-2), and so on, each multiplied by a fitted coefficient.
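
To tie the pieces together, a library such as statsmodels will estimate all of these coefficients for you once you choose the orders. The sketch below uses arbitrary example orders and a made-up series; it is not a recommendation for any particular dataset.

import numpy as np
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

# A made-up monthly series, just so the example runs end to end.
idx = pd.date_range("2013-01-31", periods=60, freq="M")
y = pd.Series(np.random.default_rng(0).normal(size=60).cumsum() + 100, index=idx)

# order=(1, 1, 1): one AR lag, one level of normal differencing, one MA term.
# seasonal_order=(0, 1, 0, 12): one level of seasonal differencing at lag 12.
model = ARIMA(y, order=(1, 1, 1), seasonal_order=(0, 1, 0, 12)).fit()
print(model.summary())
forecast = model.forecast(steps=12)  # forecast the next 12 months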

Hope that helps and was what you were looking for.

Alteryx – The elements of design

Alteryx has a few key tools that allow you to make your workflow designs visually appealing.

The primary objects are:

  • Comment boxes
  • Containers
  • Explorer box

You can also choose to have your workflow run vertically or horizontally. This is a personal preference; in most scenarios I tend to use horizontal, though I have used vertical in some instances where I thought it was better suited.

Elements of design

  • LINE – The linear marks made with a pen or brush or the edge created when two shapes meet.
  • SHAPE – A shape is a self-contained, defined area, either geometric (squares and circles) or organic (free-form or natural shapes). A positive shape automatically creates a negative shape.
  • DIRECTION – All lines have direction – horizontal, vertical or oblique. Horizontal suggests calmness, stability and tranquillity; vertical gives a feeling of balance, formality and alertness; oblique suggests movement and action.
  • SIZE – Size is simply the relationship of the area occupied by one shape to that of another.
  • TEXTURE – Texture is the surface quality of a shape – rough, smooth, soft, hard, glossy, etc.
  • COLOUR – Colour is light reflected off objects. Colour has three main characteristics: hue, or its name (red, green, blue, etc.); value (how light or dark it is); and intensity (how bright or dull it is).

 

 

Alteryx Workflow Design Principles

Alteryx, as we all know, makes it very easy to get work done fast. One thing that can become overwhelming very quickly is a growing workflow. Everyone who has used Alteryx has probably built a gargantuan web of a workflow – or, if not, will fall into that trap soon enough. Below are some principles of design that can be applied to building workflows.

The principles of design:

  • Balance – there should be an even dispersion of elements throughout the workspace
  • Proximity – related sections in a workflow should be kept close to each other
  • Alignment – tools and subsections should line up along a common axis (use ctrl + right/left arrow to make fine adjustments)
  • Repetition – elements in the workflow should be designed similarly so as to create a sense of consistency
  • Contrast – highlight key elements using contrast, e.g. heading comments can be shaded a darker colour so that sub-segments are easy to identify
  • Space – elements should be separated by an appropriate amount of space – clutter should be removed or hidden using containers

To achieve this, I primarily use Comment tools to section the workflow up into neat linear and parallel sections. You can use Container tools to group a bunch of tools together or to hide redundant groups of tools.

Alteryx Design Principles