Building Optimal Alteryx Workflows: Best Practices

These are a few things to consider when building Alteryx workflows – I hope they help!

Build, Optimise, Repeat

Write an Aggregated Query or Use In-DB Tools

Writing a query directly against the database, or using the In-DB tools, to perform the complex joins and aggregations will usually give the best results, rather than doing these in Alteryx. Compared to an individual workstation, most databases will deliver shorter data processing times. Remember to cache your results, as you don’t want to overburden the database with needless requests – it will also allow you to develop workflows faster.
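As a minimal sketch of the idea, the snippet below uses an in-memory SQLite database to stand in for your warehouse (the table and column names are illustrative, not from the original post). The aggregation runs inside the database, so only the summary rows come back:

```python
import sqlite3

# In-memory SQLite stands in for the real database; "sales" is a
# hypothetical table used purely for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("North", 100.0), ("North", 50.0), ("South", 75.0)],
)

# Push the aggregation to the database: only two summary rows cross
# the wire, instead of every detail row.
query = "SELECT region, SUM(amount) AS total FROM sales GROUP BY region ORDER BY region"
rows = conn.execute(query).fetchall()
print(rows)  # [('North', 150.0), ('South', 75.0)]
```

The same shape applies to an aggregated query pasted into an Input Data tool: the fewer rows Alteryx has to pull and process locally, the faster the workflow.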

Manually ‘Parallelise’ Workflows

R is single threaded, so most of the R tools only use one processor core at a time.  If you’re building/scoring models or using other R-based tools and have a multicore CPU, take advantage of it by running multiple workflows at the same time. Depending on the number of cores you have, you won’t notice a significant increase in runtime per workflow. Run as many workflows as you have cores/threads. Just be wary that some of your workflows may use a secondary core, so use trial and error to see what you can get away with.

Memory Usage

Alteryx stores data in both RAM and a temporary location on disk. Minimising the ‘swapping’ between the two when dealing with large datasets can save a lot of time.  Alteryx performs optimally when all of the data can be kept in RAM during processing.  Generally, RAM is the fastest storage to access, while SSDs and regular HDDs are orders of magnitude slower than RAM. Before a computer can work with data, it has to load it into RAM (and consequently write much of the data currently in RAM to disk). This is what is termed the ‘swap’.

It’s fortunate that Alteryx is very efficient at keeping data in memory for as long as possible.  It will look ahead in your stream to see how far it can get, and often load, process, and begin to write a smaller set of records rather than try and load the entire dataset into memory before beginning processing.  When it reaches a point where it can no longer process the data it has in RAM without processing other data, it will write out its RAM contents to an Alteryx temp file (much like the traditional way that computers use the ‘swap’ or ‘pagefile’).

Alteryx Discussion Board User

There are a few ways to help keep your data in memory:

  1. Use Select and Filter tools – Keep only the fields or records that you need. Do this early in your process, and keep trimming as things become unnecessary.  This is true of all data processing regardless of environment.
  2. Use enough RAM – If you are running multiple workflows simultaneously ensure you are allocating enough RAM to each of them based on what each will need.  This can be tricky to achieve with large datasets, and especially when you want to run multiple streams simultaneously.
  3. Use fast storage – Set your temp file location to your fastest source of storage. If you have the option, set it to an SSD.  You’ll see an immediate increase in performance.
  4. Limit the use of Block Until Done – Because Alteryx tries to look ahead and process downstream, adding a Block Until Done tool can reduce the efficiency with which Alteryx handles the data. Minimise the number of ‘checkpoints’ that require all data to be processed before the workflow can continue, and use the Block Until Done tool only sparingly.
  5. Limit Sorts & Joins …
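Point 1 – trim early – can be sketched outside Alteryx too. In this hypothetical example (the records and field names are invented for illustration), filtering records and selecting fields up front means every later step only ever touches the data it needs:

```python
# Hypothetical raw records; in a workflow these would come from an input tool.
raw = [
    {"id": 1, "region": "North", "amount": 120.0, "notes": "long free text..."},
    {"id": 2, "region": "South", "amount": 80.0,  "notes": "more free text..."},
    {"id": 3, "region": "North", "amount": 60.0,  "notes": "even more text..."},
]

# Filter records (like a Filter tool) and select fields (like a Select
# tool) as early as possible, before any expensive processing.
needed_fields = ("region", "amount")
trimmed = [
    {k: row[k] for k in needed_fields}
    for row in raw
    if row["region"] == "North"
]
print(trimmed)  # [{'region': 'North', 'amount': 120.0}, {'region': 'North', 'amount': 60.0}]
```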

Sorts & Joins

Sorts and joins are the usual suspects for bad performance because they hit both RAM and processing power, and can often be a ‘checkpoint’ in your workflow.  Alteryx will pick up on the fact that data is already sorted, especially if it’s sorted by a key value going into a join (so sort on the way out of the database if you can!).  If you have to join multiple data streams together, it is usually best to use a single Join Multiple rather than a series of regular Joins.
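To illustrate why a single multi-way join beats a chain of pairwise joins, here is a hedged sketch (the streams and field names are hypothetical): all three streams are combined over their shared key in one pass, rather than materialising an intermediate result after each pairwise join.

```python
# Three hypothetical data streams keyed on the same customer id.
orders   = {1: {"order_total": 250.0}, 2: {"order_total": 90.0}}
profiles = {1: {"region": "North"},    2: {"region": "South"}}
scores   = {1: {"score": 0.82},        2: {"score": 0.41}}

# One multi-way (inner) join over the shared keys, akin to Join Multiple,
# instead of chaining pairwise joins and re-scanning intermediates.
joined = {
    key: {**orders[key], **profiles[key], **scores[key]}
    for key in orders.keys() & profiles.keys() & scores.keys()
}
print(joined[1])  # {'order_total': 250.0, 'region': 'North', 'score': 0.82}
```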

Use Index Keys

The concept of using index keys is to replace columns of large data types with a much smaller data type (to act as an index that limits the amount of data you need to process). So if you have a large text string as the key, replace it with an integer key early in the process (before running a crosstab, transposition, join, etc.). Rejoin the large key when you need it afterwards.  It may not save you much with a 10-byte string, but it definitely saves you time with a large key.  That said, see if you can minimise crosstabs and transpositions, as they are also inherently expensive.
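The index-key pattern can be sketched as: swap the wide key for a small surrogate, do the heavy work, then rejoin the original key at the end. The rows and key strings below are invented for illustration:

```python
# Hypothetical rows keyed by a long text string.
rows = [
    {"key": "very-long-customer-identifier-aaaa", "value": 10},
    {"key": "very-long-customer-identifier-bbbb", "value": 20},
    {"key": "very-long-customer-identifier-aaaa", "value": 5},
]

# Assign a small integer surrogate to each distinct string key,
# then drop the wide column before the expensive steps.
lookup = {}
for row in rows:
    row["kid"] = lookup.setdefault(row["key"], len(lookup))
    del row["key"]

# Do the heavy work (here, a simple aggregation) on the small key only.
totals = {}
for row in rows:
    totals[row["kid"]] = totals.get(row["kid"], 0) + row["value"]

# Rejoin the original string key only at the end.
reverse = {key_id: key for key, key_id in lookup.items()}
result = {reverse[key_id]: total for key_id, total in totals.items()}
print(result)
```

In Alteryx terms the surrogate would typically come from a Record ID tool on the distinct keys, with a Join to swap the wide key out and back in.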

All said …

“Premature optimisation is the root of all evil.”

Donald Ervin Knuth, The Art of Computer Programming, Volume 1: Fundamental Algorithms

As always, forget about optimising until you’ve got it working – optimise later.
