TeamURL
String
65001
GET
Compose
jbell-walker@bwfc.co.uk
45E63F98653D6F714B6C90D405CE155441840D7CF3605CAEB
1
600
DownloadData
ParseSimple
DownloadData_Matched
Tokenize:
<tbody>.*?<tbody>
http://www.soccerbase.com/teams/team.sd?team_id=359&comp_id=1&teamTabs=stats&season_id=144
C:\Users\Brian\AppData\Local\Temp\Engine_620_f9003c1d3595420cb821f0e2951bcc39_\Engine_4696_81970445ecc04788b83ace97d0409ce8_.yxdb
DownloadData
ParseSimple
DownloadData_Matched
Tokenize:
<tr.*?>.*?</tr>
Simple
DownloadData
Contains
player_id
Contains([DownloadData],"player_id")
DownloadData
ParseSimple
DownloadData
Warn
DownloadData_Matched
Tokenize:
<td.*?>.*?</td>
DownloadData
ParseSimple
DownloadData
Warn
DownloadData_Matched
Tokenize:
player_id=\d+
DownloadData1
ParseSimple
DownloadData1
Warn
DownloadData_Matched
Tokenize:
\d+
League Goals
ParseComplex
League Goals
Warn
League Goals_Matched
Parse:
(\d+)
FA Cup Goals
ParseComplex
FA Cup Goals
Warn
FA Cup Goals_Matched
Parse:
(\d+)
League Cup Goals
ParseComplex
League Cup Goals
Warn
League Cup Goals_Matched
Parse:
(\d+)
Other Goals
ParseComplex
Other Goals
Warn
Other Goals_Matched
Parse:
(\d+)
League Apps (Sub)
ParseComplex
League Apps (Sub)
Warn
League Apps (Sub)_Matched
Parse:
(\d+\s\(\d+\))
FA Cup Apps (Sub)
ParseComplex
FA Cup Apps (Sub)
Warn
FA Cup Apps (Sub)_Matched
Parse:
(\d+\s\(\d+\))
League Cup Apps (Sub)
ParseComplex
League Cup Apps (Sub)
Warn
League Cup Apps (Sub)_Matched
Parse:
(\d+\s\(\d+\))
Other Apps (Sub)
ParseComplex
Other Apps (Sub)
Warn
Other Apps (Sub)_Matched
Parse:
(\d+\s\(\d+\))
League
ParseComplex
League
Warn
League_Matched
Parse:
\((\d+)\)
League
ParseComplex
League
Warn
League_Matched
Parse:
(\d+)
FA Cup
ParseComplex
FA Cup
Warn
FA Cup_Matched
Parse:
\((\d+)\)
FA Cup
ParseComplex
FA Cup
Warn
FA Cup_Matched
Parse:
(\d+)
League Cup
ParseComplex
League Cup
Warn
League Cup_Matched
Parse:
\((\d+)\)
League Cup
ParseComplex
League Cup
Warn
League Cup_Matched
Parse:
(\d+)
Other
ParseComplex
Other
Warn
Other_Matched
Parse:
\((\d+)\)
Other
ParseComplex
Other
Warn
Other_Matched
Parse:
(\d+)
This workflow may look daunting to begin with, but it becomes clear once you take each tool one step at a time.
This workflow mainly uses the Regex tools to select certain elements of a web page and break its underlying code out into parts.
But the first thing we need to do is download the web page's code from the website.
Here I have used a Text Input tool that holds the web address of the page containing the data we want to download.
We then use the Download tool to download the data from the website. Click on the Browse tool to see what we have downloaded. It is the "DownloadData" column that we are interested in for this workflow.
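For readers who want to see the download step outside the workflow, here is a minimal Python sketch of the same idea. It assumes the values in the tool configuration above mean a GET request, UTF-8 text (code page 65001) and a 600-second timeout. The function names `build_request` and `download_page` are made up for this illustration and are not part of the workflow.

```python
from urllib.request import Request, urlopen

# The address held in the Text Input tool (the TeamURL field).
TEAM_URL = (
    "http://www.soccerbase.com/teams/team.sd"
    "?team_id=359&comp_id=1&teamTabs=stats&season_id=144"
)


def build_request(url):
    """Build a plain GET request for the page (hypothetical helper)."""
    return Request(url, method="GET")


def download_page(url, timeout=600):
    """Fetch the page and return its source as text.

    Assumption: the page is decoded as UTF-8, and undecodable bytes are
    replaced rather than raising an error.
    """
    with urlopen(build_request(url), timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")
```

Calling `download_page(TEAM_URL)` would return the page source as one string, which plays the role of the "DownloadData" column in the rest of the workflow.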
This is where the Regex tools come into play. Each one does a different thing: for instance, the first Regex tool uses the Tokenize method to split out everything between the <tbody> tags.
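The chain of Tokenize steps can be sketched roughly in Python with `re.findall`, which returns every match of a pattern much as Tokenize splits matches out into rows. This is only an approximation of what the tools do. The sample HTML is invented, the helper names are hypothetical, and the sketch assumes the table body ends with a closing `</tbody>` tag, whereas the pattern configured in the workflow is `<tbody>.*?<tbody>`.

```python
import re

# Invented sample standing in for the downloaded page source.
SAMPLE_HTML = (
    "<table><tbody>"
    "<tr><th>Player</th><th>League</th></tr>"
    '<tr class="row"><td><a href="player.sd?player_id=123">A. Player</a></td>'
    "<td>10 (2)</td></tr>"
    '<tr class="row"><td><a href="player.sd?player_id=456">B. Player</a></td>'
    "<td>3 (0)</td></tr>"
    "</tbody></table>"
)


def tokenize(pattern, text):
    """Return every match of pattern, letting '.' span line breaks."""
    return re.findall(pattern, text, flags=re.DOTALL)


def extract_players(html):
    """Mimic the chain: table body -> rows -> player rows -> cells and id."""
    players = []
    for body in tokenize(r"<tbody>.*?</tbody>", html):  # assumed closing tag
        for row in tokenize(r"<tr.*?>.*?</tr>", body):
            # Same idea as the Filter tool: Contains([DownloadData], "player_id")
            if "player_id" not in row:
                continue
            cells = tokenize(r"<td.*?>.*?</td>", row)
            id_token = tokenize(r"player_id=\d+", row)[0]
            player_id = tokenize(r"\d+", id_token)[0]
            players.append({"player_id": player_id, "cells": cells})
    return players
```

Running `extract_players(SAMPLE_HTML)` skips the header row, because it has no `player_id`, and keeps one entry per player row with its id and raw `<td>` cells.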
If you run this workflow using the green "Run Workflow" button in the toolbar and look at each tool's output in the Results window, you will start to see what the workflow is doing.
Have an explore, and email me with any queries: brian.prestidge@theinformationlab.co.uk
C:\Users\Brian\AppData\Local\Temp\Engine_620_f9003c1d3595420cb821f0e2951bcc39_\Engine_4696_09681e683a4d4dfca46c96cc05a80989_.yxdb
Horizontal
Soccerbase Web Scraping Example