Friday 3 July 2015

SFTW: Scraping data with Google Refine

For the first Something For The Weekend of 2012 I want to tackle a common problem when you’re trying to scrape a collection of webpage: they have some sort of structure in their URL like this, where part of the URL refers to the name or code of an entity:     http://www.ltscotland.org.uk/scottishschoolsonline/schools/freemealentitlement.asp?iSchoolID=5237521

  tp://www.ltscotland.org.uk/scottishschoolsonline/schools/freemealentitlement.asp?iSchoolID=5237629

    ttp://www.ltscotland.org.uk/scottishschoolsonline/schools/freemealentitlement.asp?iSchoolID=5237823

In this instance, you can see that the URL is identical apart from a 7 digit code at the end: the ID of the school the data refers to.

There are a number of ways you could scrape this data. You could use Google Docs and the =importXML formula, but Google Docs will only let you use this 50 times on any one spreadsheet (you could copy the results and select Edit > Paste Special > Values Only and then use the formula a further 50 times if it’s not too many – here’s one I prepared earlier).

And you could use Scraperwiki to write a powerful scraper – but you need to understand enough coding to do so quickly (here’s a demo I prepared earlier).

A middle option is to use Google Refine, and here’s how you do it.

Assembling the ingredients

With the basic URL structure identified, we already have half of our ingredients. What we need  next is a list of the ID codes that we’re going to use to complete each URL.

An advanced search for “list seed number scottish schools filetype:xls” brings up a link to this spreadsheet (XLS) which gives us just that.

The spreadsheet will need editing: remove any rows you don’t need. This will reduce the time that the scraper will take in going through them. For example, if you’re only interested in one local authority, or one type of school, sort your spreadsheet so that you can delete those above or below them.

Now to combine  the ID codes with the base URL.

Bringing your data into Google Refine

Open Google Refine and create a new project with the edited spreadsheet containing the school IDs.

At the top of the school ID column click on the drop-down menu and select Edit column > Add column based on this column…

In the New column name box at the top call this ‘URL’.

In the Expression box type the following piece of GREL (Google Refine Expression Language):

“http://www.ltscotland.org.uk/scottishschoolsonline/schools/freemealentitlement.asp?iSchoolID=”+value

(Type in the quotation marks yourself – if you’re copying them from a webpage you may have problems)

The ‘value’ bit means the value of each cell in the column you just selected. The plus sign adds it to the end of the URL in quotes.

In the Preview window you should see the results – you can even copy one of the resulting URLs and paste it into a browser to check it works. (On one occasion Google Refine added .0 to the end of the ID number, ruining the URL. You can solve this by changing ‘value’ to value.substring(0,7) – this extracts the first 7 characters of the ID number, omitting the ‘.0') UPDATE: in the comment Thad suggests “perhaps, upon import of your spreadsheet of IDs, you forgot to uncheck the importer option to Parse as numbers?”

Click OK if you’re happy, and you should have a new column with a URL for each school ID.

Grabbing the HTML for each page

Now click on the top of this new URL column and select Edit column > Add column by fetching URLs…

In the New column name box at the top call this ‘HTML’.

All you need in the Expression window is ‘value’, so leave that as it is.

Click OK.

Google Refine will now go to each of those URLs and fetch the HTML contents. As we have a couple thousand rows here, this will take a long time – hours, depending on the speed of your computer and internet connection (it may not work at all if either isn’t very fast). So leave it running and come back to it later.

Extracting data from the raw HTML with parseHTML

When it’s finished you’ll have another column where each cell is a bunch of HTML. You’ll need to create a new column to extract what you need from that, and you’ll also need some GREL expressions explained here.

First you need to identify what data you want, and where it is in the HTML. To find it, right-click on one of the webpages containing the data, and search for a key phrase or figure that you want to extract. Around that data you want to find a HTML tag like <table class=”destinations”> or <div id=”statistics”>. Keep that open in another window while you tweak the expression we come onto below…

Back in Google Refine, at the top of the HTML column click on the drop-down menu and select Edit column > Add column based on this column…

In the New column name box at the top give it a name describing the data you’re going to pull out.

In the Expression box type the following piece of GREL (Google Refine Expression Language):

value.parseHtml().select(“table.destinations”)[0].select(“tr”).toString()

(Again, type the quotation marks yourself rather than copying them from here or you may have problems)

I’ll break down what this is doing:

value.parseHtml()

parse the HTML in each cell (value)

.select(“table.destinations”)

find a table with a class (.) of “destinations” (in the source HTML this reads <table class=”destinations”>. If it was <div id=”statistics”> then you would write .select(“div#statistics”) – the hash sign representing an ‘id’ and the full stop representing a ‘class’.

[0]

This zero in square brackets tells Refine to only grab the first table – a number 1 would indicate the second, and so on. This is because numbering (“indexing”) generally begins with zero in programming.

.select(“tr”)

Now, within that table, find anything within the tag <tr>

.toString()

And convert the results into a string of text.

The results of that expression in the Preview window should look something like this:

<tr> <th></th> <th>Abbotswell School</th> <th>Aberdeen City</th> <th>Scotland</th> </tr> <tr> <th>Percentage of pupils</th> <td>25.5%</td> <td>16.3%</td> <td>22.6%</td> </tr>

This is still HTML, but a much smaller and manageable chunk. You could, if you chose, now export it as a spreadsheet file and use various techniques to get rid of the tags (Find and Replace, for example) and split the data into separate columns (the =SPLIT formula, for example).

Or you could further tweak your GREL code in Refine to drill further into your data, like so:

value.parseHtml().select(“table.destinations”)[0].select(“td”)[0].toString()

Which would give you this:

<td>25.5%</td>

Or you can add the .substring function to strip out the HTML like so (assuming that the data you want is always 5 characters long):

value.parseHtml().select(“table.destinations”)[0].select(“td”)[0].toString().substring(5,10)

When you’re happy, click OK and you should have a new column for that data. You can repeat this for every piece of data you want to extract into a new column.

Then click Export in the upper right corner and save as a CSV or Excel file.

Source: http://onlinejournalismblog.com/2012/01/13/sftw-scraping-data-with-google-refine/