This week I’ve been attempting to solve the problem of finding the root parent in a hierarchy. In this post I’ll try and cover off the approaches I’ve taken to solve this problem.
There is quite a number of methods to do this across multiple software. Most prominently I’ve found solutions to this problem in SQL Server, MySQL, Python, etc.
Here is one really useful article I found that discusses a SQL based solution. It goes through a number of examples and possible situations you may need to solve for.
It also may be worthwhile referring to the following Stack Overflow question:
This doesn’t help me as I am tied to having to do this in either Redshift, Alteryx or R due to the software at my disposal. Most solutions I’ve come across are not capable of scaling to millions of records with potentially that many hierarchies.
If you are really interested in delving into the theory behind hierarchies and hierarchical data structures then I would suggest reading up on graph theory.
The problem as it stands can be summarised as follows. I have a table with 2 columns; one that contains the users current ID and the other that contains the prior ID. I am trying to solve for the first ID the user ever had.
CurrentID | PreviousID x1 | x2 | x1 x3 | x2 x4 | x3 x5 | x4 y1 | y2 | y1 y3 | y2 y4 | y3 y5 | y4 y1 | y5
In the first column I have a unique list of all ID’s ever assigned – the above example is simplified in the actual data the ID is not formatted such that it can be easily grouped as above.
Basically I took a recursive approach to trace back to the root ID.
- Left join table on itself (using Join and Union tool in conjunction) by using the previous id and current id
Merge on ID_Prev = ID_Curr'
- Rename previous ID
ID_prev' -> ID_prev[-1]
- Keep only 3 variables from the original table
ID_Prev[-1], ID_Prev, ID_Curr
At this point you will have traced back to 2 levels of parentage
- Repeat until root parent is found
Using ID_prev[-1] merge against original table with ID_Curr' Keep and rename ID_prev' to ID_prev[-2]
(Remember to drop those records that only 2 levels deep)
The table should now resemble:
ID_Prev[-2], ID_Prev[-1], ID_Prev, ID_Curr and you are 3 levels deep
A solution to this problem using a non serial join approach presented by Adam Riley on the Alteryx Community can be found at the link below. It uses an iterative macro – clearly miles ahead of my brute force attempt. I did something like 30 left joins to achieve my goal and even still hadn’t got to the root parent in some cases.