The workflow is configured as follows.

Input (Text Input tool, field TeamURL, type String):
  http://www.soccerbase.com/teams/team.sd?team_id=359&comp_id=1&teamTabs=stats&season_id=144

Download: GET request, code page 65001, result stored in the DownloadData field.

RegEx and Filter steps, in the order they are stored in the workflow:

  Field                   Method     Expression
  DownloadData            Tokenize   <tbody>.*?<tbody>
  DownloadData            Tokenize   <tr.*?>.*?</tr>
  DownloadData            Filter     Contains([DownloadData],"player_id")
  DownloadData            Tokenize   <td.*?>.*?</td>
  DownloadData            Tokenize   player_id=\d+
  DownloadData1           Tokenize   \d+
  League Goals            Parse      (\d+)
  FA Cup Goals            Parse      (\d+)
  League Cup Goals        Parse      (\d+)
  Other Goals             Parse      (\d+)
  League Apps (Sub)       Parse      (\d+\s\(\d+\))
  FA Cup Apps (Sub)       Parse      (\d+\s\(\d+\))
  League Cup Apps (Sub)   Parse      (\d+\s\(\d+\))
  Other Apps (Sub)        Parse      (\d+\s\(\d+\))
  League                  Parse      \((\d+)\)
  League                  Parse      (\d+)
  FA Cup                  Parse      \((\d+)\)
  FA Cup                  Parse      (\d+)
  League Cup              Parse      \((\d+)\)
  League Cup              Parse      (\d+)
  Other                   Parse      \((\d+)\)
  Other                   Parse      (\d+)

Apart from the first two Tokenize steps, each RegEx step is set to Warn on values that do not match.

This workflow may look daunting at first, but it becomes clear once you take each tool one step at a time. It mainly uses the RegEx tools to pick out particular elements of the web page's underlying code and break them into separate parts.

The first thing we need to do is download the web page's code. Here I have used a Text Input tool holding the web address of the page that contains the data we want. The Download tool then fetches the data from the website. Click on the Browse tool to see what has been downloaded; the "DownloadData" column is the one we are interested in for this workflow.

This is where the RegEx tools come into play. Each one does a different job. For instance, the first RegEx tool uses the Tokenize method to split out everything between the <tbody> tags.

If you run the workflow using the green "Run Workflow" button in the toolbar and look at each tool's output in the Results window, you will start to see what the workflow is doing.

Have an explore, and email me with any queries: brian.prestidge@theinformationlab.co.uk
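For anyone who wants to try the same tokenize-then-parse idea outside Alteryx, here is a rough Python sketch of the chain. It is only an illustration, not the workflow itself: the regular expressions are the ones from the RegEx tools, but the sample markup, the column order and the helper names tokenize, parse and scrape are all made up for the example, and nothing is downloaded from the real site.

```python
import re

# Hypothetical sample markup, invented for this sketch; the real page's
# layout and column order are not reproduced here.
SAMPLE_HTML = (
    "<table><tbody>"
    "<tr><th>Player</th><th>League</th><th>League Goals</th></tr>"
    '<tr class="odd"><td><a href="player.sd?player_id=12345">A. Player</a></td>'
    "<td>10 (2)</td><td>3</td></tr>"
    '<tr class="even"><td><a href="player.sd?player_id=678">B. Player</a></td>'
    "<td>0 (5)</td><td>0</td></tr>"
    "</tbody><tbody><tr><td>Totals</td></tr></tbody></table>"
)


def tokenize(pattern, text):
    """Return every non-overlapping match, roughly what a Tokenize step emits."""
    # re.DOTALL lets '.' cross line breaks in the markup.
    return [m.group(0) for m in re.finditer(pattern, text, flags=re.DOTALL)]


def parse(pattern, text):
    """Return the first capture group, roughly what a Parse step emits."""
    match = re.search(pattern, text)
    return match.group(1) if match else None


def scrape(html):
    """Chain the workflow's expressions: tbody, rows, filter, cells, values."""
    players = []
    for tbody in tokenize(r"<tbody>.*?<tbody>", html):
        for row in tokenize(r"<tr.*?>.*?</tr>", tbody):
            # Same test as the Filter step: keep only rows naming a player.
            if "player_id" not in row:
                continue
            cells = tokenize(r"<td.*?>.*?</td>", row)
            id_token = tokenize(r"player_id=\d+", row)[0]
            # Assumed column order: cell 1 is "Apps (Sub)", cell 2 is goals.
            apps = parse(r"(\d+\s\(\d+\))", cells[1])
            players.append({
                "player_id": tokenize(r"\d+", id_token)[0],
                "league_starts": int(parse(r"(\d+)", apps)),
                "league_subs": int(parse(r"\((\d+)\)", apps)),
                "league_goals": int(parse(r"(\d+)", cells[2])),
            })
    return players


if __name__ == "__main__":
    for player in scrape(SAMPLE_HTML):
        print(player)
```

Each helper mirrors one RegEx method, so printing the intermediate lists (the tbody block, the rows, the cells) gives much the same view as clicking through each tool's output in the Results window.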
Soccerbase Web Scraping Example