PDF Connector in Tableau 10.3
If you follow the features that Tableau plans for upcoming versions, you have probably noticed the PDF connector arriving in version 10.3, among many other new features that you can check on this website. (Note: Tableau 10.3 was in fact released a few hours ago, so you can now test the PDF connector just by updating Tableau.)
Personally, I find the PDF connector one of the most exciting additions, especially because I have suffered the pain of extracting data from PDF files in my previous jobs as a market researcher. So I wanted to look at how it works, what it can do and how accurate it is. If you also want to try upcoming features early, I recommend joining the pre-release programs, so you are among the first to learn what’s coming and to test new Tableau features.
How to connect to PDF files
Once you have installed and opened Tableau 10.3, connecting to PDF files is easy: Tableau has added PDF to the connection window as a new “connect to a file” option.
We just need to click on the new PDF file option, and a window appears where we select the PDF we want to connect to from our computer. I did a quick Google search for PDF reports and found this Competitiveness Report from the World Economic Forum, which seems like a good example to test the new connector. Once we select the PDF file, another window appears where we specify whether we want to scan all the pages of the document, a single page, or a range of pages.
Scanning a simple table
As this particular document contains pages that are all text, and others with just basic charts, I’m going to connect to a single page and select page 22, which is the first one that contains a data table. In this case, it is a list of countries with their status / relationship with the EU, and GDP:
If I have made a mistake and connected to the wrong page of the document, I can always right-click on the connections section and select Rescan PDF file… to choose a different page, which makes it very easy to test how different pages of the document look in Tableau.
Even though this particular case is a very simple table, it’s a great surprise to see how well Tableau identifies the data in the table, without showing any of the text from the title or the bottom of the table. The only small issues I can see in this table are that the headers are not identified properly, and that it also includes the rows that contain the text “EU Candidate countries” and “Comparator countries”.
But it is with this particular connector that Tableau’s Data Interpreter becomes your best friend. Just click the Use Data Interpreter check box, and Tableau recognizes that the first row is not actual data and should be used as the headers. After that, I just need to decide what to do with the other categories that Tableau reads as if they were part of the table. This is where data source filters can help me filter out the data I don’t need.
If I’m only interested in filtering out the two rows I mentioned previously, I could just add a data source filter that excludes all records with a null in the field Status/Relationship with the EU. If I prefer to filter out all the countries from the 2 additional categories and analyse only EU countries, I could exclude all records with a null in the EU code field. In this particular case I’m interested in just the EU countries, so I opted for the latter option.
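For readers who think in code, the effect of that check box on this table can be pictured with a small Python sketch. It only mimics the outcome described above (the first row becomes the field names); it says nothing about how Data Interpreter works internally, and the rows and the `promote_header` helper are hypothetical placeholders, not values from the report.

```python
# Hypothetical rows, as a PDF scan might return them before any cleanup.
# The first row holds header text that was read as ordinary data.
raw_rows = [
    ["Country", "EU code", "GDP"],
    ["Country A", "AA", "100"],
    ["Country B", "BB", "250"],
]


def promote_header(rows):
    """Use the first row as field names and return the remaining rows as dicts."""
    header, *body = rows
    return [dict(zip(header, row)) for row in body]


records = promote_header(raw_rows)
for record in records:
    print(record)
```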
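The difference between the two filters may be easier to see in a minimal Python sketch. The records below are made-up placeholders, the field name `Status` is shortened for readability, and `exclude_nulls` is a hypothetical helper; `None` stands in for a null cell in Tableau.

```python
# Hypothetical records; None plays the role of a null cell.
rows = [
    {"Country": "Country A", "EU code": "AA", "Status": "Member"},
    {"Country": "Country B", "EU code": "BB", "Status": "Member"},
    {"Country": "EU Candidate countries", "EU code": None, "Status": None},
    {"Country": "Country C", "EU code": None, "Status": "Candidate"},
    {"Country": "Comparator countries", "EU code": None, "Status": None},
    {"Country": "Country D", "EU code": None, "Status": "Comparator"},
]


def exclude_nulls(records, field):
    """Keep only the records whose value for `field` is not null."""
    return [r for r in records if r[field] is not None]


# Option 1: drop only the two category label rows.
without_labels = exclude_nulls(rows, "Status")

# Option 2: drop the label rows and every non-EU country.
eu_only = exclude_nulls(rows, "EU code")

print([r["Country"] for r in without_labels])
print([r["Country"] for r in eu_only])
```

Filtering on the status field removes only the label rows, while filtering on the EU code also removes the candidate and comparator countries, because in this sketch they carry a status but no EU code.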
Now I can go to Sheet 1 and build a chart with the data, for example to visualize GDP by country, colored by the year in which each country became a member of the EU.
Or maybe I want to add an additional dimension to the view and split the countries based on whether they are just members of the EU or also members of the Euro area.
Voilà. No more manually copying PDF tables into Excel!
Scanning pages with text and tables
But how does the new PDF connector behave when we scan pages with both text and tables? Let’s check what happens when we connect to page 32, which contains 90% text and a small table of data.
When we connect to page 32, Tableau identifies 2 tables: the first one is the table we want to connect to, and the second is the Notes section at the end of the page (which we are not interested in).
Again, the data is not 100% correct, but it is surprisingly accurate. We just need some quick filters and changes to get exactly what we want. In this case we can uncheck the option that specifies that the first row contains the field names, rename the fields, hide the last column (which is part of the text of the page), and filter out all null rows. And here’s the result:
Data clean and ready, with the additional benefit that I can keep it as an additional data source, so I can create a dashboard using as many pages of the document as I want. This is very useful for adding extra data to your dashboards while avoiding long hours of manually copying and pasting into an Excel file.
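As a rough recap of the page 32 cleanup, the same steps (no header row, renamed fields, last column hidden, null rows filtered out) could be sketched in Python as follows. Everything here is an assumption made for illustration: the scanned values, the field names `Indicator` and `Value`, and the `clean` helper are hypothetical and do not come from the report or from Tableau.

```python
# Hypothetical scan result: no header row, a trailing column that holds
# stray page text, and one completely empty row.
scanned = [
    ["Item 1", "10", "stray text"],
    [None, None, None],
    ["Item 2", "20", None],
]

# Hypothetical field names chosen by hand, like renaming fields in Tableau.
field_names = ["Indicator", "Value"]


def clean(rows, names):
    """Drop the last column, skip all-null rows and apply the field names."""
    cleaned = []
    for row in rows:
        kept = row[:-1]  # hide the last column
        if all(cell is None for cell in kept):
            continue  # filter out null rows
        cleaned.append(dict(zip(names, kept)))
    return cleaned


result = clean(scanned, field_names)
print(result)
```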