Web Scraping using Alteryx and Kimono Labs
Web scraping: let’s skirt round the “is it legal or not” question. It gets asked about a lot, so people are clearly doing it. There’s plenty of information available through search engines on whether it’s legal in your country; we’ll leave you to decide.
If you decide that, for whatever reason, you do need to perform some web scraping (be it for competitive intelligence, data journalism, etc.), you’re likely to be wondering where to start. Most people start by copying data into Excel and hacking around with it. It’s a fair enough approach, but what happens when your data source on the web is updated daily, is spread across a few pages, or you’re too busy to keep checking it, or you just want something that’s plain clever? That’s where the folks at KimonoLabs.com come in: their product, Kimono, lets you turn websites into APIs in seconds and is amazingly simple to use.
Let’s walk through an example. First we sign up to the Kimono beta, then drag the “Kimonify” link they provide to our bookmarks bar. We navigate to a website (in this case www.alteryx.com/events) and hit “Kimonify” in the bookmarks bar. Imagine we wanted to pull out a live feed of all the upcoming Alteryx events…
We start to define our fields. First we might want to pull out the date of the event, so we click on the first event’s date, and then the second, and Kimono picks up the rest (shown via the highlighting). We give the property a name (“Date” in this case) and we’re done (hit +):
Next we want to pick out the Start Time <click>. This time Kimono also asks whether we meant to highlight the end time by showing a tick box. Sometimes this is useful, but in this case let’s save that for another property, so we hit the cross to say No.
In this way we quickly build up a few more properties: End Time and Title. Finally we hit Done, and are greeted by the details of the API we’ve just created. How simple is that! Fewer than 20 clicks and we’ve built a real-time API:
So how do we use this in an automated BI process? Thankfully Alteryx can help there: using the CSV endpoint, we just feed the URL into a Text Input tool and use the Download tool to bring the data into our module.
I’ll leave it as an exercise for the reader to parse the CSV file (if you don’t have Alteryx, maybe use the free version). Here’s my module to help (note that this will become easier with JSON parsing in v9 of Alteryx):
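For anyone who wants to see the same download-and-parse step outside Alteryx, here is a rough Python sketch. It is not the Alteryx module itself, just an illustration using the standard library. The function names `fetch_text` and `parse_events` are made up for this example, the sample rows are invented, and it assumes the CSV endpoint returns a plain header row followed by one row per event (the real output may be laid out differently).

```python
import csv
import io
from urllib.request import urlopen


def fetch_text(url: str, timeout: float = 30.0) -> str:
    """Download the body of `url` and decode it as UTF-8 text."""
    with urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8")


def parse_events(csv_text: str) -> list[dict[str, str]]:
    """Turn CSV text with a header row into one dict per event."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [dict(row) for row in reader]


# Invented sample data, shaped like the properties defined above
# (Date, Start Time, End Time, Title). Not real Alteryx events.
SAMPLE = (
    "Date,Start Time,End Time,Title\n"
    '"Feb 20, 2014",10:00,11:00,Hypothetical Webinar\n'
    '"Mar 5, 2014",14:00,15:30,Hypothetical User Group\n'
)

events = parse_events(SAMPLE)
for event in events:
    print(event["Date"], "|", event["Title"])
```

With a real endpoint you would call `parse_events(fetch_text(your_csv_url))` instead of using the sample. The `csv` module takes care of quoted fields, which matters here because dates often contain commas.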
The possibilities are endless, especially as Kimono lets us feed in query parameters if we build an API from URLs that accept them. Using Kimono and Alteryx this evening, I’ve successfully automated:
- interrogating a supermarket store list (stores nearest a postcode) and plotting the postcodes on a Tableau map (imagine the competitive intelligence possibilities: we could pick our locality dynamically using Alteryx and build a live feed of our competitors’ stores in this way).
- a neater way to scrape data for Paul Banoub’s Winter Olympics Viz
- a neat Alteryx Events mobile webpage (all through Kimono, no Alteryx needed)
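The query-parameter idea can be sketched in a few lines of Python. This is only an illustration of building one URL per lookup value: the base URL, the `key` parameter, the `postcode` parameter name and the helper `with_query` are all hypothetical, so substitute whatever your own API actually expects.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def with_query(url: str, **params: str) -> str:
    """Return `url` with `params` added to (or overriding) its query string."""
    parts = urlsplit(url)
    query = dict(parse_qsl(parts.query, keep_blank_values=True))
    query.update(params)
    return urlunsplit(parts._replace(query=urlencode(query)))


# Hypothetical endpoint and parameter names, for illustration only.
BASE = "https://example.com/api/stores.csv?key=YOUR_KEY"

for postcode in ["SW1A 1AA", "M1 1AE"]:
    print(with_query(BASE, postcode=postcode))
```

Each generated URL could then be fed through the same download-and-parse step, which is roughly what the Text Input and Download tools do with a list of URLs in Alteryx.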
There’s no limit to what the BI community will build out of this; using Kimono, Alteryx and Tableau, we can imagine some amazing possibilities. Please share your results with us.