Data Scaffolding – Easy steps to fill in missing data
Some visualisations have a need to show rows of data where no data has actually been reported. For example, a physiotherapist would like a visual representation of the location of injuries on the body. A data set of all injuries across all sports will show all body parts, however when filtering down to an individual athlete, there will be gaps where there have not been any injuries. This is an ideal use case for using a scaffold data set to fill in those gaps so that a complete picture can be seen.
In Tableau we need to use a scaffold data set because Tableau will only represent marks where there are rows of data. There was a useful ‘Think Data Thursday’ all about scaffold data which is worth a watch if you need additional information: TDT – Data Scaffold
I’m going to run through a practical use case for using a scaffold data set using the Sample – Superstore data. What I want to be able to view is a list of all customers and when they made purchases. I then want to be able to filter certain Categories and see which customers made purchases, as well as those that didn’t. Currently if I filter the data to ‘Technology’ the number of customers being represented is reduced from 793 to 687. This is because certain customers did not make any ‘Technology’ purchases, and therefore do not have any rows of data in the ‘Technology’ Category.
Before I go into how to create a scaffold data set I need to list some limitations / rules of thumb that should be considered before creating a scaffold:
- Any fields that need to be filtered in the visualisation need to be included in the scaffold
- Every possible value in a field needs to be added to the scaffold
- The scaffold greatly increases in size the more fields you add to the scaffold
- The size of the scaffold may affect performance
Luckily, creating a scaffold data set is relatively simple in Alteryx via the Append Fields Tool:
This tool will add a row for each value that is being added to the original field in the same way as a Cartesian join
Using the Sample – Superstore data from the practical example you would set up the workflow like so:
As you can see the number of rows in the combination of Customer Name and Category increases from 2182 in the original data set to 2379 in the appended scaffold data set.
You can keep adding fields to this append to add more functionality to your scaffold. For example, if you wanted to add a year filter, you would add an additional append tool with an input of the years in the data set.
Conventionally, the larger data set is then used as an additional data source in Tableau and then blended with the original data source. If you use the scaffold data set as the primary then you will be able to view all instances of the Customers, even when you filter to a specific Category.
This blog has links to examples of various blending approaches
However, these blending approaches can be a bit technical and, depending on the size of the scaffold, can performing poorly. What I would suggest to do instead is to union the scaffold underneath the original data set. This will create a data set that includes all instances of the scaffold that are not present in the original data set. This approach will perform a lot better than the blending approach. It’s worth noting that this union data set may perform worse than the original data set without the scaffold, due to the additional rows being added.
The additional steps in the Alteryx workflow are as follows:
The important steps to add here are;
- Joining the scaffold data to the combined summarise output. The left output to this join will contain the combinations of Customer Name and Category that are in the scaffold that aren’t in the actual data.
- Union the scaffold data to the original data. Now the data set will have additional rows for those Customers which didn’t have any data for certain Categories.
And there you have it; a simple approach to filling in missing data that will perform well, without the need for blending.
N.B. There are applications where a scaffold data set may still require blending, for example; when trying to represent values from multiple data sources a scaffold can be useful to fill in gaps in both data sources.