<< Back

Sankey Charts in Tableau

Sankey Charts in Tableau have had some great work done on them previously by Jeffrey Shaffer and Olivier Catherine but I wanted to add my own method to the melting pot, this is a technique I’ve perfected using some of the data densification techniques demonstrated by Noah Salvaterra and initially promoted by Joe Mako. Talk about standing on the shoulders of giants!

A Sankey diagram is one that visualises relationships and flows between multiple elements, typically they are better used with multiple dimensions to show the flow through a system. My own use-case is that I’d love to be able to view University student progression and view dropouts at a faculty / department level over each year of a cohort study. However to show you the idea here’s a simple Sankey showing the relationship in Sales Volumes and how they are split between two dimensions in the Superstore Sales data. (Click to see interactive version)

Let’s not get into best practice arguments; yes they can be hard to read so consider when to use them and what alternatives are available <blah> <blah>.

Let’s talk through how I’ve built this, we’ll do a simple version and a nicer more complex version. My method extends on Jeffrey and Olivier’s methods by requiring minimal data manipulation.

Step 1: Duplicate the data

I need to duplicate my data and add a field to tell me which is the Real and data and which is the Dummy (duplicate) rows. There are many ways of doing this, Alteryx, Copy / Paste in Excel, but as I’m using an Excel data source I’m going to open the sheet with the the Legacy connector:

Legacy Connector

and then convert my Orders table to Custom SQL:

Convert to Custom SQL

I’ll then copy and paste the code and UNION it and create my extra field, highlighted in red below, everything else is just as Tableau provided it:

SELECT [Orders$].[City] AS [City],
[Orders$].[Customer ID] AS [Customer ID],
[Orders$].[Customer Name] AS [Customer Name],
[Orders$].[Customer Segment] AS [Customer Segment],
[Orders$].[Discount] AS [Discount],
[Orders$].[Order Date] AS [Order Date],
[Orders$].[Order ID] AS [Order ID],
[Orders$].[Order Priority] AS [Order Priority],
[Orders$].[Postal Code] AS [Postal Code],
[Orders$].[Product Base Margin] AS [Product Base Margin],
[Orders$].[Product Category] AS [Product Category],
[Orders$].[Product Container] AS [Product Container],
[Orders$].[Product Name] AS [Product Name],
[Orders$].[Product Sub-Category] AS [Product Sub-Category],
[Orders$].[Profit] AS [Profit],
[Orders$].[Quantity ordered new] AS [Quantity ordered new],
[Orders$].[Region] AS [Region],
[Orders$].[Row ID] AS [Row ID],
[Orders$].[Sales] AS [Sales],
[Orders$].[Ship Date] AS [Ship Date],
[Orders$].[Ship Mode] AS [Ship Mode],
[Orders$].[Shipping Cost] AS [Shipping Cost],
[Orders$].[State or Province] AS [State or Province],
[Orders$].[Unit Price] AS [Unit Price],
‘Dummy’ as RowType
FROM [Orders$]

UNION

SELECT [Orders$].[City] AS [City],
[Orders$].[Customer ID] AS [Customer ID],
[Orders$].[Customer Name] AS [Customer Name],
[Orders$].[Customer Segment] AS [Customer Segment],
[Orders$].[Discount] AS [Discount],
[Orders$].[Order Date] AS [Order Date],
[Orders$].[Order ID] AS [Order ID],
[Orders$].[Order Priority] AS [Order Priority],
[Orders$].[Postal Code] AS [Postal Code],
[Orders$].[Product Base Margin] AS [Product Base Margin],
[Orders$].[Product Category] AS [Product Category],
[Orders$].[Product Container] AS [Product Container],
[Orders$].[Product Name] AS [Product Name],
[Orders$].[Product Sub-Category] AS [Product Sub-Category],
[Orders$].[Profit] AS [Profit],
[Orders$].[Quantity ordered new] AS [Quantity ordered new],
[Orders$].[Region] AS [Region],
[Orders$].[Row ID] AS [Row ID],
[Orders$].[Sales] AS [Sales],
[Orders$].[Ship Date] AS [Ship Date],
[Orders$].[Ship Mode] AS [Ship Mode],
[Orders$].[Shipping Cost] AS [Shipping Cost],
[Orders$].[State or Province] AS [State or Province],
[Orders$].[Unit Price] AS [Unit Price],
‘Real’ as RowType
FROM [Orders$]

Step 2: Densification using Bins

Create a new Measure:

[ToPad] = if [RowType]=='Real' then 1 else 49 end

And now create a new Bin of Size 1 called [Padded]. To do this right click on [ToPad] and use the Create Bin option:

Padding

 

Because Bins are Range aware then they will effectively allow us to pad out our data for each row and create 49 rows of data for each row. Let’s test it out by creating a third function [t]:

[t] = (index()-25)/4

Now let’s Viz, drop [t] onto columns and our [Padded] bin onto Detail, change the mark type to circle. Finally make sure the index in [t] is Computing over [Padded].

Woah – we didn’t have a value of 12 in or data, we had 1 and 49 – but that’s densification in action! The index() will compute over the bins and build us values that weren’t there, there are lots of uses for this technique but it’s a true mind bender and Jedi technique.

Step 3: Build our Ranking Functions

We now need to build functions that will show our data at the right points vertically when we build the Sankey, these are identical:

[Rank 1] = RUNNING_SUM(sum(Sales))/TOTAL(sum(Sales))
[Rank 2] = RUNNING_SUM(sum(Sales))/TOTAL(sum(Sales))

They create a running total of Sales and divide by the overall Total, giving a cumulative percentage.

 Step 4: Build our Curve function(s):

Let’s start with a sigmoid function – the basis of the Viz (that gives the curve)

[Sigmoid] = 1/(1+EXP(1)^-[t])

and add the Curve:

[Curve] = [Rank 1]+(([Rank 2] - [Rank 1])*[Sigmoid])

 Step 5: Build the Viz

Let’s get Vizzical, drag Two Dimensions onto the Detail Shelf, these will be our Left and Right parts of our Sankey Diagram – in my case I’m using dynamic dimensions controlled by a parameter but please yourself (if you read Olivier’s post linked above he uses a set to create a Top N effect to control his dimensions).

Add the [Curve] to Rows and now let’s talk Nested Table Calcs (an excellent primer here from Nelson Davis, I also recommend my primer on Table Calcs).

Our [Curve] calc has three parts to it, [Rank 1], [Rank 2] and [t] we need each of these to address over different dimensions. We want to Rank over both dimensions, but we want each Rank calc in a different order, we want Rank 1 to sort by the “left” Dimension first, and Rank 2 to sort by the “right” dimension. That way the “groupings” of the flows will look right.

To do this Edit Table Calculation on the [Curve] pill on Rows. First we’ll set [Rank 1], click on Advanced in the dropdown and choose the options below:

Rank 1

Now switch the nested calc to change the Compute context for [Rank 2] using the menu below:

Nested

We want [Rank 2] to compute over a different ordering, which we can control by using a different order in the addressing:

Rank 2

Finally switch to [t] that should compute over just our Padded dimension:


Here’s the result:

Step 7: Tidy Up

Now we need to remove grid and zero lines, hide our [Curve] axis add a bit of colour, etc, but the main thing to do is fix that axes that has a wonky dot over on the right (caused by the domain padding). I fix [t] axis at -5 and 5. I also fix [Curve] at 0 and 1 and reverse the Axis – more to come later on why. With those few tweaks here’s what I have. To create line I changed to a Line mark type and added [Padded] to the Path:

Sankey Start

Step 8: Add Size

We now need to vary the lines by size, based on their sales. Adding Sum([Sales]) to Size won’t look good as the padded marks have no sales associated, so let’s trick tableau by using a Table Calc that will act over those passed marks:

RUNNING_AVG(SUM([Sales]))

Now add that to Size and Compute using [Padded], now tweak the overall Size until you’re happy:

Sankey

 Step 9: Add Left and Right Dimensions

Now build two single bar stacked chart sheets showing the breakdown of percentage Sales for each dimension (I shouldn’t have to show you how to do this, this is an advanced blog – but download check out my Viz if you’re not sure). Now combine into a dashboard, I use floating to minimise the spaces, and use Highlight Actions on Hover to improve readibility.

[tableau url=”https://public.tableau.com/views/SimpleSankey/Sankey?:embed=y&:display_count=no&:showTabs=y” width=”570px” height=”715px”][/tableau]

Getting More Complex

The above method has some downsides, namely the size needs to be configured manually to fit the Viz. Another option (aside from using Oliviers method above) is to draw one line per row of data. Yes it gets busy but that’s what I’ve done in the Viz below, it’s slower but neater. Have a look, it follows a very similar method to the above – I’ve just computed over RowID too.

Hope you enjoyed this post as much as I enjoy researching and testing these methods.

 

Chris Love

Nottingham, UK

43 thoughts on “Sankey Charts in Tableau

  1. Chris, going to have to have a good read, download and a further read to understand it.
    Sanseis are amazing creatures which as you mention Jeffrey has done a lot of work with this and I have wanted to create one for a while, but without all the back work required, which this method looks like it will provide.

    The elephant in the room is about the how to use them effectively over over visualisation best practices, but I really want to find good internal use cases.

    Thanks for the effort you have put in and I hope Mrs L understands lol

  2. Chris, going to have to have a good read, download and a further read to understand it.
    Sanseis are amazing creatures which as you mention Jeffrey has done a lot of work with this and I have wanted to create one for a while, but without all the back work required, which this method looks like it will provide.

    The elephant in the room is about the how to use them effectively over over visualisation best practices, but I really want to find good internal use cases.

    Thanks for the effort you have put in and I hope Mrs L understands lol

  3. Hello Chris,

    Thank you so much for posting this incredibly helpful tutorial! I work at a K-12 school district, but I am trying to use Sankey Charts to show progress towards high school graduation, much as you described with your university students.

    This tutorial is a great place for me to start, but I did run into a problem, which I hope you can help me with. My data is formatted differently from yours, which throws off the table calculations. I could reformat my data to look like yours, but that would be a fair amount of work and I have a feeling that if I were better with table calculations that would not be necessary.

    The issue is this:
    In your data, the category on the left of the diagram (dimension 1) and on the right of the diagram (dimension 2) are different columns in the same row. In my dataset, the two categories are different rows of the same column. That is, my data currently looks like this:

    Kid# Year# RiskGroup
    1 1 3
    1 2 4
    2 1 2
    2 2 2

    I want the category on the left of the diagram to be the kid’s risk group in year 1, and the category on the right of the diagram to be the kid’s risk group in year 2. If I were to format this like the data in your tutorial, it would look like this:

    Kid# RiskGroupYear1 RiskGroupYear2
    1 3 4
    2 2 2

    The problem is in the calculation of Rank 1 and Rank 2 (step 5 in the tutorial). I need to tell Tableau to do the running sum first by the RiskGroup in the first year, and then by the RiskGroup in the second year. I haven’t been able to find a way to do this. Do you have any suggestions?

    Thanks again!

    1. Hi JL – these kind of questions are difficult to sort out in comments, I’d recommend you post to the Tableau Community and then post the link here so we can look (assuming the community doesn’t get there first)

  4. This is awesome! I am using this technique for a dataviz around cost analysis.

    One question/comment was I had is around the sort order. I have ordered my Left and Right worksheet by Sum(Sales). This allows me to have the blocks to be sorted for quick view of what the are the highest sellers.

    The only issue I am trying to figure out is when the sum(sales) are equal between region and product category . It seems to get the ranking mixed up in the sankey area for the entries that are equal. Other than that, the rankings seem to work fine.

    I am trying to figure out a way to get around this. Any ideas?

    Thanks again for sharing this. Its a very good example and hopefully Tableau will add this type of graph in future versions.

    Wes

    1. Hi Wes – these kind of questions are difficult to sort out in comments, I’d recommend you post to the Tableau Community and then post the link here so we can look (assuming the community doesn’t get there first)

  5. This is fabulous, and great step-by-step for creating this type of visualization. I am curious if this was created in version 8.3 or 9? I am wondering if the expanding functionality in 9 would help in building these.

  6. I get this error when trying to UNION the custom SQL:

    Microsoft JET database error 0x80040E10: No value given for one or more required parameters.

    How come?

    1. Hi Jonathan, I assume there’s a mistake in the SQL – my assumption would be that there is a misspelled column. If you don’t like the SQL just use Excel to copy and paste the data below itself and create the columns manually

  7. Something doesn’t seem to be quite right. Even using your workbook, when I attempt to densify the data I get two marks, Padded = 1 and Padded = 49.

    Obviously missing something simple, any ideas?

    1. Hi KC,
      I stumbled upon the same obstacle and played a bit with it, not getting densification from the Bins built from [ToPad].
      I removed the bins and tried again after making [ToPad] Discrete and it worked…

      Chris: could you please confirm the above ?
      Thanks for all your wonderful work and enlightening the masses (well… at least the many that follow you 🙂 )

      1. Dear Alec!

        You just saved my computer! I was about to throw it out the window! 🙂

        Thank you for your solution!

        And Chris thank you for your simple but elegant instructions! I love the Sanky!

  8. I followed above steps on same workbook, but i am not able to generate the circle shapes for -6 to 6 , they are getting generated for only -6 and 5.75 having their values 1 and 49 respectively.

    Could you please help me out how to with the steps to be followed, also steps and solution created are not sync as nothing about RowID is mentioned.

      1. Hi Jeffrey,

        Still struggling with same issue.

        if you could share a video tutorial that would be great Help.

        Issue: I followed above steps on same workbook, but i am not able to generate the circle shapes for -6 to 6 , they are getting generated for only -6 and 5.75 having their values 1 and 49 respectively.
        Could you please help me out how to create sankey charts, also steps and solution created are not sync as nothing about RowID is mentioned.

  9. Hello,

    Thanks for the tutorial.

    Have a question:

    I want to have path tooltip showing the % of region associated sales for each product.

    For example, 20% of Sales for Central region is on Furniture.

    I have a added Sum(Sales) on details calculating Percent of Total addressing Product partitioning region..but I don’t know how ‘padded’ would play part in this. Nevertheless, different combinations only shows blank tooltips.

    Any idea?
    SM

  10. Chris – thanks for this! I looked all over for how to make one of these and your tutorial got me there.

    i want to do a gradient from the dimension 1 color to the dimension 2 color – so in your example, the curve would fade from blue to dark gray for Central > Furniture. My chart has a lot more unique values in each dimension, and i believe it would make a lot more visual sense to have this color blend.

    any ideas?
    Thanks!

      1. t alone messes up the Curve calculation… i did try combining t and the 2 dimensions (along with these other calculated fields)… got some psychedelic results, some boring results, etc., but nothing along what i’m looking for. i did see another post about saving every RGB value into the prefs file, then calculating the colors in Tableau. the application was different, but it might work with a big pair of CASE statements to get the start and finish colors, and calculate the gradient along t. was hoping for a couple clicks of the mouse, of course…

        1. I see. You want to go from blue to dark gray on one then orange to dark gray on another. That’s going to be a pain. You can at a T to the columns, make it dual axis, synchronize and create a light to dark within the same color. On the t(2) move dimension 1 and 2 to details and move t to color. Set left side white and right side dark. Then set transparency on color of original t. Then you get lighter blue to dark blue and lighter orange to dark orange.

          1. i think you’re on to something with the dual/sync axis! but i’m not able to replicate the color group by dim1 fading to transparent, and on t2 dim2 color to transparent – the entire chart changes transparency… basically, if it starts from a red box on the left and lands in a blue box on the right, the curve should fade between those colors, and ideally i’m assigning the colors to each dimension as normal. i’m beginning to think that i might need to calculate the colors, and just use calculations to assign colors to the dimensions? hrm..

  11. Hi Chris,

    Thank you so much for these instructions. They’ve been very helpful! I had a couple of follow up questions,

    1) Using your first version, my flow lines end up having a lot of empty space between them, even when I max out on Size. Can you think of something off the top of your head that might be causing this or how to fix it? If its something you’d have to look at to answer (and have the time to do so!) I can post a version of my work

    2) Your second version is visually amazing. However, I’m having performance issues with it because of how large my dataset is – the loading and hover times are quite unreasonable. Olivier Catherin seems to have had similar problems, which he fixed using polygons and very kindly provided instructions for as well. However, my first attempt to follow along wasn’t very successful. I’ll make another attempt to implement his version, but I was wondering if you’d created (or plan to create) a polygon version yourself that’s a natural extension of the instructions you’ve provided here. If so, would be very grateful if you shared some insights.

    Thank you again for the very detailed instructions you’ve already provided. They’ve been very helpful!

  12. Hi Chris, I am not able to get the Rowtype correctly. I followed your instructions and accordingly opened my excel sheet using a legacy connection and wrote a custom sql. But again I was not sure how the command ‘Real’ as RowType will be interpreted since my sheet did not have any column named RowType. I got an error. Am I missing something?

  13. Hi Chris,
    How can I put dimension 1 and dimension 2 into one legend?
    when I drag dimension 1 in color ,color changes,then I drag dimension 2 in color ,dimension 1 disappears.

    thanks,
    Roger

  14. Hi Chris,
    We were able to do a Sankey that follows the student progression of an incoming cohort of students through their 6th year (12 terms). We followed Olivier Catherine’s method from the Tableau Community post and made it work for our university data. I posted a version of it on Public, but by removing the majority of the data (down to one student per major), the viz doesn’t work as well. Our live version for internal university audience works great. Users can filter by incoming cohort, major, gender, income status, first gen status, etc.
    Thanks for sharing this option as well!

  15. Hi Chris – thanks a lot for this chart. While I copied your process to achieve Sankey chart, my dimension in stage 1 is not grouping categories while leading the curve to stage 2. I have posted the same query in the below link in tableau forum along with the twbx file. Could you please help. thank you.

    https://community.tableau.com/message/513717

    Regards,
    Naresh

  16. Hi Chris-

    Like several other posters I’ve tried rebuilding this viz new and from your supplied workbook. In both cases Step 2: Densification yields 2 marks: -6 and -5.75 instead of the desired range between -6 and 6.
    Naturally the rest of the viz does not computer properly.
    Can you please look to see which step is missing?
    Thanks,
    RCL1

  17. Hello
    Thank you a lot for this tutorial
    I have a question
    What is the benefits of Rank 1 and Rank 2
    I don’t find the different between them
    because in both of them that you are using the same formula?
    Thank you in advance for your support

  18. Your post is way over my head, Chris, but I do have a use case. I’m working with survey results and want to show the progression of participants through the survey. Some questions are yes-no, others have Likert Scales, etc.

    In the first level, call it Question 1, the two lines would show how many answered yes and how many answered no (or whatever the choices were).

    The second set of lines would start from the “yes” and “no” and go to the choices in Q2. For instance, of those who answered “yes” to Q1, how many also answered “yes” (or whatever) to Q2? And so on.

    Ideally, users would be able to choose the questions in Level 1 and Level 2.

    Now I’m going to go into a cave for a week and try to figure this out. Thanks, Chris!

  19. Howdy! I was walking through this example, and ran into the same issue others have mentioned above – at this step:

    Now let’s Viz, drop [t] onto columns and our [Padded] bin onto Detail, change the mark type to circle. Finally make sure the index in [t] is Computing over [Padded].

    I also only saw 2 marks instead of 49 at this step.

    To get past this, we need to ‘Show Missing Values’ for the [Padded] field, but this menu option isn’t available in the view built as described, with [Padded] on the Detail section of the Marks card.

    To get past this:
    1. Move [Padded] temporarily up to the Rows &/or Columns shelf
    2. Then we can Right-click & select ‘Show Missing Values’
    3. Move [Padded] back down to the Details section of the Marks card

    The Missing Values are still displayed, even though we no longer have this Menu option available when the [Padded] field is used on the Detail section of the Marks card.

  20. Hi,
    I am getting a lot of white spaces between my Sankey lines and curves. I used the same method that is being used on this discussion thread. How do I solve the white space issue?

  21. Thanks for the detailed instructions Chris. I am facing a similar issue as some of the other users.
    When I drag t into columns and padded on details and change the mark type to circle, I get just 1 circle (at -6).

    I followed everything described in your post but as the desktop version of Tableau 10+ does not seem to have the SQL functionality, I replicated my data once and added the ToPad column manually (1 for original, 49 for dummy).

    Followed your steps and got just 1 circle. I then changed ToPad to discrete as siggested by 1 user in comments. Still same result.

    Also, when you say “Finally make sure the index in [t] is Computing over [Padded]”, what does it mean, and how do I make sure?

    Thanks.

Leave a Reply

Your email address will not be published. Required fields are marked *