Tableau Prep: Generate rows with TabPy
Tableau Prep Builder is consistently increasing in functionality with every new version release. The recent integration of TabPy and RServe now allow for Python and R code to be executed from within a Tableau Prep workflow. Aside from equipping the user with the statistical modelling and machine learning capabilities that R/Python are known for, they can also compensate for some of the more basic ETL processing techniques that Tableau Prep Builder, in its current version, is still lacking.
In this blog post I’ll demonstrate how to use TabPy and a short, simple Python script to perform a ‘generate rows’ function, which is currently unavailable in Prep Builder. As a use case for generating rows in Tableau Prep, take for example this simple mock dataset – a set of invoices with start and end dates:
I want to transform this into a structure that has a row for every month of a given invoice (e.g. 15 months for Invoice_Number 1), and spread the total invoice value across the invoice billing period.
After connecting to the dataset in Prep, the first port of call is to create a calculated field called ‘DateDiffMonths’ that gives us the number of months between the start and end date of each invoice:
Next, click ‘Add Script’ in the Prep flow pane. To connect to a Python script, you first need to start up TabPy via the command line (see this Tableau blog for information on how to download, install and run TabPy). Here’s the basic configuration in Prep:
- Set the Connection Type to ‘Tableau Python (TabPy) Server’.
- In the ‘Connect to Tableau Python (TabPy) Server’ dialogue box, connect to port 9004 of ‘localhost’.
- Browse to the Python script on your machine (you can download from my Github)
- The name of the function in the Python script is ‘GenerateRowsPrep’, so this goes in ‘Function Name’
Let’s have a look at the contents of the script. First we load the pandas and numpy Python libraries. The script depends on functions from those libraries to work.
The next chunk is the body of the script that carries out the process. Here it is broken down with some annotation.
Line 4: This defines the function (GenerateRowsPrep) that will be applied to the dataset (df). df stands for dataframe, by the way.
Line 7: The Invoice_Number and DateDiffMonths fields are subsetted from our dataset and assigned to the variable inv_months.
Line 8: The repeat function from numpy is used to repeat the Invoice_Number value for the associated number of months in the DateDiffMonth field. We now have a table (inv_months) that contains a row for each month of a given invoice.
In the following line (Line 11), we join this table back to the original dataset (df) on the Invoice_Number field using the pandas merge function. For every distinct Invoice_Number in df, we are joining to multiple rows of Invoice_Number in inv_months, so this will explode out the df table to have the required number of rows per invoice. It will look something like this:
Something we may want to append to this dataset is a field that describes what month of the invoice billing period each row corresponds to. For this we are going to group by Invoice_Number and do a cumulative count of the rows in each Invoice (Line 14). Lastly, we can spread the invoice value over the billing period by dividing the Value field by the DateDiffMonths field. The code:
The last section of the script is required by Prep when using TabPy, to define the names and types of the fields that we are bringing back into Prep:
Back in Prep, it’s possible to create a monthly date field by combining the MonthsToAdd and Invoice_Start fields with a DATEADD function:
That’s one method of generating rows in Tableau Prep Builder. I tried and tested a few other methods today that worked fine in Python itself, but sadly not in Prep, so looks like there are a few bugs to iron out before the Prep-Python experience becomes absolutely seamless.
Hope you find this useful. You can download the mock data, Python script and Prep workflow from my Github.