<< Back

Tableau Extracts–What / Why / How etc

image

 

What’s a Tableau Extract? 

An extract in Tableau is a special type of database, optimised for use by Tableau.  Its super fast, portable and a great way to handle large data sets.  But how do you create one, and what creative ways are there to make use of them?

 

Firstly, how do you create and extract?

 

This bit is simple, assuming you are using Tableau Desktop.  After you have connected to your data, go to the DATA menu and choose EXTRACT DATA – then accept the defaults on the dialog box (although more on this later).  Tableau will as where you want to save the extract – choose any location to save the file, although Tableau will probably direct you towards ‘My Tableau Repository | Datasources’ which is just fine too!

Now wait for the extract to be created, how long you’ll wait depends on the database technology you’re using, network speed, data volumes that sort of thing.

You’ll know it’s done when the data source icon changes – it will have another database icon behind it, presumably representing a ‘copy’, which is exactly what an extract is.

image

So, that’s how to create on (well the basics anyway) – but WHY would you want to?

 

Extract reason #1 – Performance

 

Extracts are FAST.  They are awesomely fast with a million rows, and they only slow down to ‘amazingly fast’ with 100 Million rows. 

But extracts are not just for big data sets.  Once your familiar with the Tableau interface, even a 1 second delay when using the interface can be annoying – you get used to the speed and responsiveness quickly.  I recommend using extracts for any data set which is anything other than super-responsive. 

For example:

If you have 50,000 rows in Excel – its likely to be annoyingly slow.  An extract solves this immediately.

If you’re using a Remote SQL server, create an extract over lunch, and you’ll have instant response times when you get back.

 

Extract reason #2 – Offline analysis

 

Extracts are files that you store on your PC (file extension .tde).  So of course you can take one with you on holiday.  If you just can’t sit by the pool without dreaming about histograms and scatterplots, then extracts could be for you…

They work on planes too.

 

Extract reason #3 – Accessing additional functionality

 

Some database technologies have restrictions that stop Tableau communicating with them as they would like.  The most common problem caused by this (in my experience) is the absence of COUNT DISTINCT when using Excel as a data source.

To see this in action, connect to an Excel source, and then drag a dimension out using the RIGHT MOUSE BUTTON.  When you drop this, you’ll get to choose from the following options:

image

 

Which is no good for me!!  I’m trying to get COUNT DISTINCT…

But, if I change to an extract…

image

 

Now I get COUNT DISTINCT – which allows some really insightful visualisations to be built.

 

The same holds true if you are trying to get a MEDIAN value for a measure, this is not available from Excel, so you’ll need an extract.

FROM EXCEL                                             FROM AN EXTRACT

imageimage

 

Extract reason #4 – Creating packaged workbooks

 

Packaged workbooks are a fantastic bit of Tableau functionality, allowing you to created a distributable file which can be an interactive visualisation.  This can be opened by users of the FREE Tableau Reader – which you can download here.

BUT – you can’t ‘package’ data which is held on remote servers, so you can’t package data from a SQL server, Oracle etc.

UNLESS you create an extract first.  Then you’ll have no problems.

 

Extract reason #5 – Publishing to Tableau Public

 

You can’t use packaged workbooks without extract as described above, but neither can you use Tableau Public.  If you want to publish data to the web, you’ll have to use an extract.

 

Extract reason #6 – Data security

 

Here’s a more subtle use case.  Suppose you work at a hospital, and you are REALLY NOT ALLOWED to share patient level information – yet you want to create a packaged workbook with a data source which does contain this information.  What do you do?  Here’s where the extra functionality of the data extract dialog comes into play.

When you extract data, you’ll see this window.  Notice the button at the bottom, this removes all the dimensions and measures which you have NOT used in ANY VIZ from the extract. 

image

The extract you have then created can be packaged, safe in the knowledge that any fields you did not want to be visible are not even in the extract, so they can’t be accessed.

Using this same dialog, you can restrict the ROWS (rather than columns) which are included in your data set by using the filter section.  Consider using a relative date filter for an extract which contains (say) the last 3 weeks of data).

 

Extract Reason #7 – Double aggregations

 

This is my personal favourite.

Suppose you have timesheet data which has one row per employee/day.  You want to know the average hours recorded per MONTH.

To solve this one, you first need to aggregate the data by employee/month, and then produce a view which then averages the data by month (hence the double aggregation).

Extracts address this problem as they can be used to perform the first level of aggregation, providing a new data source over which to run the Average calculation.

This step is performed in the extract dialog window, by ‘Aggregating visible dimensions’  and then rolling up dates to month.

 

image

 

You can also use this feature to massively reduce the size of the extract file by ‘’Hiding’ dimensions in the data window before you take this step.

 

I’m sure there are many more reasons to use extracts – and that in the future we’ll see many more developments in this area from Tableau.

 

Please drop me a line if you have other uses you think should be on this list.

Tom Brown

London, UK

25 thoughts on “Tableau Extracts–What / Why / How etc

  1. That was a great overview of some main point and factors of extracts with the Tableau Data Engine. The discussion on extracts in Tableau could go for quite a while. Here are a couple of other points in regards to extracts:

    a. If you use Custom SQL, the query slowness custom SQL adds (because Tableau cannot write the most efficient queries when it is used) is removed.

    b. When you publish to server, you can schedule the extract, so every interaction will not query a production database, and you can still get semi-live, or as live as you need it data.

    c. It is important to note that the majority of the speed gain is because the extract is a columnar data store (like Vertica or Infobright), and no longer a relational data store. Also just about anything is faster and a better choice than the MS Jet Engine (what Tableau uses to connect to Text/Excel/Access files). Columnar data store are designed to produce fast results aggregating data.

    d. if you create a calculated field that is not an aggregation (or use a parameter), you will want to Optimize the extract, so the result of the field for each row can be added to the extract to retain fast performance.

    e. The one downside of the new Data Engine is the lack of ability to use pass-through/RAWSQL functions. These are only useful in a select few cases now because custom table calculations and data blending can do most of what is needed, so not a major issue.

    f. an extract will attempt to retain some calculation logic of the original data source. Some databases would see "A"="a" as a true condition because their string comparison is not case sensitive, while other databases are, and when you create an extract, Tableau will attempt to honor that, at at least retain the values of your calculated fields. This is sometimes a source of confusion.

    g. If you refresh an extract, and the original data source structure has changed, Tableau will update the extract with those changes. This could cause some calculated fields to break, because of field name changes, but you can also add additional columns to your data this way as well.

    I could go on an on about the extract engine, in short, I love it, and use it as early and as often as I can.

    1. Joe,

      I pulled an extract of 950M rows assuming it would be faster to work locally off my machine than through Teradata. It feels like this is taking MUCH longer than just connecting live to the table on the user space. Do you have any insight or suggestions?

      1. If you pull it into Tableau Desktop it will be slower. You want to load this as a data extract on the Tableau server and make sure your Tableau server has adequate capacity.

  2. Wow joe, great comment!! Longer and more interesting than my original post. But you're right, extracts have much more to them.

    In my experience many people don't use them, or appreciate their value for day to day analysis. Hence the post.

    Thanks again

    Tom

  3. But what about the downside? The loss of table hierarchy within the dimensions and measures windows?
    How do you get around that little issue?

  4. So, here a question for you about extracts. Let’s say you have a 100 million row extract and you want it to run on the server…not your laptop. Everytime you say “Use Extract” it tries to go pull it. And it tries. Very hard I might add.

    I want to publish this datasource to the server without actually kicking it off…until it gets to the server.

    Thoughts?

    Thanks,
    Bryan

  5. Tom,

    “The extract you have then created can be packaged, safe in the knowledge that any fields you did not want to be visible are not even in the extract, so they can’t be accessed.”

    I believe it is necessary to clarify this statement from #6 some. Tableau does not physically remove the fields from the extract, they are simply hidden. I am unclear if I were to pass the extract off to another, if they’d have the ability to unhide the columns. Can anyone clarify?

    The most immediate reference I could find is the first bullet under “Some General things to keep in mind”. (Not my blog post!).
    http://www.tableau.com/tableau-data-extracts-part3

  6. Hi,

    We have a requirement in where the server which is exposed to external users should not connect to internal databases. Can you let me know a way how I can accomplish this. We plan to move the workbook which has extracts from internal prod server to external prod server but not sure how we can refresh the extracts.

    Thanks

  7. Hi Tom,

    Is there a way to edit the data within the export?

    For instance, if I wanted to share information from your hospital data example, would there be a way to change a female’s name to “Jane Doe” and so on then re-save the extract so it was masking the identity of individuals?

Leave a Reply

Your email address will not be published. Required fields are marked *