Monday, 11 July 2016

Python 3 web-scraping examples with public data

Someone on the NICAR-L listserv asked for advice on the best Python libraries for web scraping. My advice below includes what I did for last spring’s Computational Journalism class, specifically, the Search-Script-Scrape project, which involved 101-web-scraping exercises in Python.

Best Python libraries for web scraping

For the remainder of this post, I assume you’re using Python 3.x, though the code examples will be virtually the same for 2.x. For my class last year, I had everyone install the Anaconda Python distribution, which comes with all the libraries needed to complete the Search-Script-Scrape exercises, including the ones mentioned specifically below:
The best package for general web requests, such as downloading a file or submitting a POST request to a form, is the simply-named requests library (“HTTP for Humans”).

Here’s an overly verbose example:

import requests
base_url = 'http://maps.googleapis.com/maps/api/geocode/json'
my_params = {'address': '100 Broadway, New York, NY, U.S.A',
             'language': 'ca'}
response = requests.get(base_url, params = my_params)
results = response.json()['results']
x_geo = results[0]['geometry']['location']
print(x_geo['lng'], x_geo['lat'])
# -74.01110299999999 40.7079445

For the parsing of HTML and XML, Beautiful Soup 4 seems to be the most frequently recommended. I never got around to using it because it was malfunctioning on my particular installation of Anaconda on OS X.
But I’ve found lxml to be perfectly fine. I believe both lxml and bs4 have similar capabilities – you can even specify lxml to be the parser for bs4. I think bs4 might have a friendlier syntax, but again, I don’t know, as I’ve gotten by with lxml just fine:

import requests
from lxml import html
page = requests.get("http://www.example.com").text
doc = html.fromstring(page)
link = doc.cssselect("a")[0]
print(link.text_content())
# More information...
print(link.attrib['href'])
# http://www.iana.org/domains/example

The standard urllib package also has a lot of useful utilities – I frequently use the methods from urllib.parse. Python 2 also has urllib but the methods are arranged differently.

Here’s an example of using the urljoin method to resolve the relative links on the California state data for high school test scores. The use of os.path.basename is simply for saving the each spreadsheet to your local hard drive:

from os.path import basename
from urllib.parse import urljoin
from lxml import html
import requests
base_url = 'http://www.cde.ca.gov/ds/sp/ai/'
page = requests.get(base_url).text
doc = html.fromstring(page)
hrefs = [a.attrib['href'] for a in doc.cssselect('a')]
xls_hrefs = [href for href in hrefs if 'xls' in href]
for href in xls_hrefs:
  print(href) # e.g. documents/sat02.xls
  url = urljoin(base_url, href)
  with open("/tmp/" + basename(url), 'wb') as f:
    print("Downloading", url)
    # Downloading http://www.cde.ca.gov/ds/sp/ai/documents/sat02.xls
    data = requests.get(url).content
    f.write(data)

And that’s about all you need for the majority of web-scraping work – at least the part that involves reading HTML and downloading files.
Examples of sites to scrape

The 101 scraping exercises didn’t go so great, as I didn’t give enough specifics about what the exact answers should be (e.g. round the numbers? Use complete sentences?) or even where the data files actually were – as it so happens, not everyone Googles things the same way I do. And I should’ve made them do it on a weekly basis, rather than waiting till the end of the quarter to try to cram them in before finals week.

The Github repo lists each exercise with the solution code, the relevant URL, and the number of lines in the solution code.

The exercises run the gamut of simple parsing of static HTML, to inspecting AJAX-heavy sites in which knowledge of the network panel is required to discover the JSON files to grab. In many of these exercises, the HTML-parsing is the trivial part – just a few lines to parse the HTML to dynamically find the URL for the zip or Excel file to download (via requests)…and then 40 to 50 lines of unzipping/reading/filtering to get the answer. That part is beyond what typically considered “web-scraping” and falls more into “data wrangling”.

I didn’t sort the exercises on the list by difficulty, and many of the solutions are not particulary great code. Sometimes I wrote the solution as if I were teaching it to a beginner. But other times I solved the problem using the style in the most randomly bizarre way relative to how I would normally solve it – hey, writing 100+ scrapers gets boring.

But here are a few representative exercises with some explanation:
1. Number of datasets currently listed on data.gov

I think data.gov actually has an API, but this script relies on finding the easiest tag to grab from the front page and extracting the text, i.e. the 186,569 from the text string, "186,569 datasets found". This is obviously not a very robust script, as it will break when data.gov is redesigned. But it serves as a quick and easy HTML-parsing example.
29. Number of days until Texas’s next scheduled execution

Texas’s death penalty site is probably one of the best places to practice web scraping, as the HTML is pretty straightforward on the main landing pages (there are several, for scheduled and past executions, and current inmate roster), which have enough interesting tabular data to collect. But you can make it more complex by traversing the links to collect inmate data, mugshots, and final words. This script just finds the first person on the scheduled list and does some math to print the number of days until the execution (I probably made the datetime handling more convoluted than it needs to be in the provided solution)
3. The number of people who visited a U.S. government website using Internet Explorer 6.0 in the last 90 days

The analytics.usa.gov site is a great place to practice AJAX-data scraping. It’s a very simple and robust site, but either you are aware of AJAX and know how to use the network panel (and in this case, locate ie.json, or you will have no clue how to scrape even a single number on this webpage. I think the difference between static HTML and AJAX sites is one of the tougher things to teach novices. But they pretty much have to learn the difference given how many of today’s websites use both static and dynamically-rendered pages.
6. From 2010 to 2013, the change in median cost of health, dental, and vision coverage for California city employees

There’s actually no HTML parsing if you assume the URLs for the data files can be hard coded. So besides the nominal use of the requests library, this ends up being a data-wrangling exercise: download two specific zip files, unzip them, read the CSV files, filter the dictionaries, then do some math.
90. The currently serving U.S. congressmember with the most Twitter followers

Another example with no HTML parsing, but probably the most complicated example. You have to download and parse Sunlight Foundation’s CSV of Congressmember data to get all the Twitter usernames. Then authenticate with Twitter’s API, then perform mulitple batch lookups to get the data for all 500+ of the Congressional Twitter usernames. Then join the sorted result with the actual Congressmember identity. I probably shouldn’t have assigned this one.
HTML is not necessary

I included no-HTML exercises because there are plenty of data programming exercises that don’t have to deal with the specific nitty-gritty of the Web, such as understanding HTTP and/or HTML. It’s not just that a lot of public data has moved to JSON (e.g. the FEC API) – but that much of the best public data is found in bulk CSV and database files. These files can be programmatically fetched with simple usage of the requests library.

It’s not that parsing HTML isn’t a whole boatload of fun – and being able to do so is a useful skill if you want to build websites. But I believe novices have more than enough to learn from in sorting/filtering dictionaries and lists without worrying about learning how a website works.

Besides analytics.usa.gov, the data.usajobs.gov API, which lists federal job openings, is a great one to explore, because its data structure is simple and the site is robust. Here’s a Python exercise with the USAJobs API; and here’s one in Bash.

There’s also the Google Maps geocoding API, which can be hit up for a bit before you run into rate limits, and you get the bonus of teaching geocoding concepts. The NYTimes API requires creating an account, but you not only get good APIs for some political data, but for content data (i.e. articles, bestselling books) that is interesting fodder for journalism-related analysis.

But if you want to scrape HTML, then the Texas death penalty pages are the way to go, because of the simplicity of the HTML and the numerous ways you can traverse the pages and collect interesting data points. Besides the previously mentioned Texas Python scraping exercise, here’s one for Florida’s list of executions. And here’s a Bash exercise that scrapes data from Texas, Florida, and California and does a simple demographic analysis.

If you want more interesting public datasets – most of which require only a minimal of HTML-parsing to fetch – check out the list I talked about in last week’s info session on Stanford’s Computational Journalism Lab.

Source URL :  http://blog.danwin.com/examples-of-web-scraping-in-python-3-x-for-data-journalists/

Sunday, 10 July 2016

Web Data Scraping: Practical Uses

Whether in the form of media, text or data in diverse other formats—the internet serves to be a huge storehouse of the world’s information. While browsing for commercial or business needs alike, users are exposed to numerous web pages that contain data in just about every form. Even though access to such data is extremely critical for garnering success in the contemporary world, unfortunately most of it is not open. More often than not, business websites restrict the accessibility options to such data and do not allow visitors to save or display them for reuse on their local storage devices, or onto their own websites.  This is where web data extraction tools come in handy.

Read on for a closer look into some of the common areas of data scraping usage.

• Gathering of data from diverse sources for analysis: In case a business necessitates the collection and analysis of data specific to certain categories from multiple websites, then it helps refer to web data integration experts or those related to the field of data scraping linked with categories like industrial equipment, real estate, automobiles, marketing, business contacts, electronic gadgets and so forth.

• Collection of data in different formats: Different websites are known to publish information and structured data in different formats. So, it may not be possible for organizations to see all the required data a one place, at any given time. Data scrapers allow the extraction of information spanning across multiple pages under various sections, on to a single database or spreadsheet.  This makes it easy for users to analyze (or visualize) the data.

• Helps Research: Data is an important and integral part of all kinds of research – marketing, academic or scientific. A data scraper helps in gathering structured data with ease.

• Market analysis for businesses: Companies that cater to products or services connected to specific domains require comprehensive data of products and services that are of similar kind, and which have a tendency of appearing in the market on a daily basis.

Web scraping software solutions from reputed companies are successful in keeping a constant watch on this kind of data and allow users to get access required information from diverse sources – all at the click of a button.
Go for data extraction to take your business to the next levels of success – you will not be disappointed.

Source URL : http://www.3idatascraping.com/web-data-scraping-practical-uses.php

Thursday, 7 July 2016

Data Scraping - What Are Hand-Scraped Hardwood Floors and What Are the Benefits?

If you love the look of hardwood flooring with lots of character, then you may want to check out hand-scraped hardwood flooring. Hand-scraped wood provides a warm vintage look, providing the floor instant character. These types of scraped hardwoods are suitable for living rooms, dining rooms, hallways and bedrooms. But what exactly is hand-scraped hardwood flooring?

Well, it is literally what you think it is. Hand-scraped hardwood flooring is created by hand using specialized wood working tools to make each board unique and giving an overall "old worn" appearance.

At Innovation Builders we offer solid wood floors finished on site with an actual hand-scraping technique followed by stain and sealer. Solid wood floors are installed by an expert team of technicians who work each board with skilled craftsman-like attention to detail. Following the scraping procedure the floor is stained by hand with a customer selected stain color, and then protected with multiple coats of sealing and finishing polyurethane. This finishing process of staining, sealing and coating the wood floors contributes to providing the look and durability of an old reclaimed wood floor, but with today's tough, urethane finishes.

There are many, many benefits to hand-scraped wood flooring. Overall, these floors are extremely durable and hard wearing, providing years of trouble-free use. These wood floors remain looking newer for longer because the texture that the process provides hides the typical dents, dings and scratches that other floors can't hide so easily. That's great news for households with kids, dogs, and cats.

These types of wood flooring have another unique advantage as well. When you do scratch these floors during their lifetime, the scratches are easily repaired. As long as the scratch isn't too deep you can make them practically disappear without ever having to hire a professional. It's simple to hide the scratch by using a color-matched stain marker or repair kit that is readily available through local flooring distributors. These features make hand-scraped hardwood flooring a lot more durable and hassle-free to maintain than other types of wood flooring.

The expert processes utilized in the creation of these floors provides a custom look of worn wood with deep color and subtle highlights. When the light hits the wood at different times during the day, it provides an understated but powerful effect of depth and beauty. They instantly offer your rooms a rustic look full of character, allowing your home to become a warm and inviting environment. The rustic look of this wood provides a texture, style and rustic appeal that cannot be matched by any other type of flooring.

Hand-Scraped Hardwood Flooring is a floor that says welcome and adds a touch of elegance to any home. If you are looking to buy a new home and you haven't had the opportunity to see or feel hand scraped hardwoods, stop in any of the model homes at Innovation Builders in Keller, North Richland Hills or Grand Prairie, Texas and check it out!

Source URL :   http://yellowpagesdatascraping.blogspot.in/2015/06/data-scraping-what-are-hand-scraped.html

Saturday, 18 June 2016

Increasing Accessibility by Scraping Information From PDF

You may have heard about data scraping which is a method that is being used by computer programs in extracting data from an output that comes from another program. To put it simply, this is a process which involves the automatic sorting of information that can be found on different resources including the internet which is inside an html file, PDF or any other documents. In addition to that, there is the collection of pertinent information. These pieces of information will be contained into the databases or spreadsheets so that the users can retrieve them later.

Most of the websites today have text that can be accessed and written easily in the source code. However, there are now other businesses nowadays that choose to make use of Adobe PDF files or Portable Document Format. This is a type of file that can be viewed by simply using the free software known as the Adobe Acrobat. Almost any operating system supports the said software. There are many advantages when you choose to utilize PDF files. Among them is that the document that you have looks exactly the same even if you put it in another computer so that you can view it. Therefore, this makes it ideal for business documents or even specification sheets. Of course there are disadvantages as well. One of which is that the text that is contained in the file is converted into an image. In this case, it is often that you may have problems with this when it comes to the copying and pasting.

This is why there are some that start scraping information from PDF. This is often called PDF scraping in which this is the process that is just like data scraping only that you will be getting information that is contained in your PDF files. In order for you to begin scraping information from PDF, you must choose and exploit a tool that is specifically designed for this process. However, you will find that it is not easy to locate the right tool that will enable you to perform PDF scraping effectively. This is because most of the tools today have problems in obtaining exactly the same data that you want without personalizing them.

Nevertheless, if you search well enough, you will be able to encounter the program that you are looking for. There is no need for you to have programming language knowledge in order for you to use them. You can easily specify your own preferences and the software will do the rest of the work for you. There are also companies out there that you can contact and they will perform the task since they have the right tools that they can use. If you choose to do things manually, you will find that this is indeed tedious and complicated whereas if you compare this to having professionals do the job for you, they will be able to finish it in no time at all. Scraping information from PDF is a process where you collect the information that can be found on the internet and this does not infringe copyright laws.

 Source  URL : http://ezinearticles.com/?Increasing-Accessibility-by-Scraping-Information-From-PDF&id=4593863

Thursday, 12 May 2016

Web Scraping to Create Open Data

Open data is the idea that some data should be freely available to everyone to use and republish as they wish, without restrictions from
copyright, patents or other mechanisms of control.

My first experience with open data was in the year 2010. I wanted to create a better app for Bicing, the local bike sharing system in
Barcelona. Their website was a nightmare to use and I was tired of needing to walk to each station, trying to guess which ones had bicycles.
There was no app for Android, other than a couple of unofficial attempts that didn’t work at all.

I began as most would; I searched the internet and found a library named python-bicing that was somehow able to retrieve station and
bike information. This was my first time using Python and, after some investigation, I learned what the code was doing: accessing the
official website, parsing the JavaScript that generated their buggy map and giving back a nice chunk of Python objects that represented
bike share stations.

This I learned was called web scraping. It was like I had figured out a magic trick that would allow me to always be able to access the data I
needed without having to rely on faulty websites.

The rise of OpenBicing and CityBikes

Shortly after, I launched OpenBicing, an Android app for the local bike sharing system in Barcelona, together with a backend that used
python-bicing. I also shared a public API that provided this information so that nobody else had to do the dirty work ever again.

Since other cities were having the same problem, we expanded the scope of the project worldwide and renamed it CityBikes. That was 6
years ago.

To date, CityBikes is the most comprehensive and widely used open API for bike sharing information, with support for over 400 cities
worldwide. Our API processes around 10 requests per second and we scrape each of the 418 feeds about every three minutes. Making our
core library available for anyone to contribute has been crucial in maintaining and adding coverage for all of the supported systems.

The open data fallacy

We are usually regarded as “an open data project” even though less than 10% of our feeds come from properly licensed, documented and
machine-readable feeds. The remaining 90% is composed of 188 feeds that are machine-readable, but not licensed nor documented and
230 that are entirely maintained by scraping HTML pages.

North American BikeShare Association) recently published GBFS (General Bikeshare Feed Specification). This is clearly a step in the right
direction, but I can’t help but look at the almost 60% of services we currently support through scraping and wonder how long it will take the
remaining organizations to release their information, if ever. This is even more the case considering these numbers aren’t even taking into
account worldwide coverage.

Over the last few years there has been a progression by transportation companies and city councils toward providing their information as
“open data”. Directive 2003/98/EC encourages EU member states to release information regarding public services.

Yet, in most cases, there’s little action in enforcing Public Private Partnerships (PPP) to release their public information under a non-
restrictive license or even to transfer ownership of the data to city councils to be included in their open data portals.

Even with the increasing number of companies and institutions interested in participating in open data, by no means should we consider
open data a reality or something to be taken for granted. I firmly believe in the future and benefits of open data, I have seen them
happening all around CityBikes, but as technologists we need to stress the fact that the data is not out there yet.

The benefits of open data

When I started this project, I sought to make a difference in Barcelona. Now you can find tons of bike sharing apps that use our API on all
major platforms. It doesn’t matter that these are not our own apps. They are solving the same problem we were trying to fix, so their
success is our success.

Besides popular apps like Moovit or CityMapper, there are many neat projects out there, some of which are published under free software
licenses. Ideally, a city council could create a customization of any of these apps for their own use.

Most official applications for bike sharing systems have terrible ratings. The core business of transportation companies is running a service,

so they have no real motivation to create an engaging UI or innovate further. In some cases, the city council does not even own the rights to
the data, being completely at the mercy of the company providing the transportation service.

Open data over apps

When providing public services, city councils and companies often get lost in what they should offer as an aid to the service. They focus on
a nice map or a flashy application, rather than providing the data behind these service aids. Maps, apps, and websites have a limited focus
and usually serve a single purpose. On the other hand, data is malleable and the purest form of representation. While you can’t create
something new from looking and playing with a static map (except, of course, if you scrape it), data can be used to create countless
different iterations. It can even provide a bridge that will allow anyone to participate, improve and build on top of these public services.

Wrap Up

At this point, you might wonder why I care so much about bike sharing. To me it’s not about bike sharing anymore. CityBikes is just too
good of an open data metaphor, a simulation in which public information is freely accessible to everyone. It shows the benefits of open
data and the deficiencies that arise from the lack thereof.

We shouldn’t have to create open data by scraping websites. This information should be already available, easily accessed and provided in
a machine-readable format from the original providers, be they city councils or transportation companies. However, until there’s another
option, we’ll always have scraping.


Source : https://blog.scrapinghub.com/2016/03/30/web-scraping-to-create-open-data/




Thursday, 28 April 2016

Web Scraping – Ethical Data Collection Activity or an Illegal Practice?

Abiding by the definition, web scrapping is a method to extract data from website. There can be different reasons to perform this task, such as for reporting, market research, to determine share indexes, know website updates, product rate updates, to monitor data, and so on. Besides these, data theft is another of the prominent motives behind web data extraction, which ultimately holds the use of a web scraper as unethical and at times, illegal.

Technical definition

In technical terms, data scraping is a method of collecting data from a website through specific software. These software programs or web scrapers give the website owners the impression of human web surfing and extract a big volume of data, which is usually difficult for any user visitor to access manually. The apps simulate human exploration of online data by embedding web browsers, or implementing HTTP to fulfill the cause of data extractors.

Relation with data mining

Usually, data mining refers to analyzing data from varied perspectives and transforming it to meaningful information that could help in boosting sales or mitigating financial risks in a business. As for web scraping, it involves extraction of analytical data from the web. At present, web scrapping comprises major source of data extraction carried out by data miners. This is because almost everything is now available online and for any data miner, this resource is no less than a gold mine.

The web scraping process

In this data scraping method, the experts look out for tricks to format the URLs into pages that include the usable information. The web scrapers then parse the DOM tree to extract data from the website. In simple language, the web scrapers process the semi-structured or unstructured data pages of the desired website and then convert the resulting data into a well structured form. The users can harvest or modify the structured data in a better manner.

Web scraping – legal or unethical?

It solely relies on your intentions, whether you are doing this activity in the interest of the masses or just wish to satisfy your personal interests. If it is for a goodwill, such as to research on share index to predict the market situation in the coming days, it is fine. Another positive example could be to identify the trend of market and suggest a client on viable business boosting methods accordingly.

However, if you are doing web scraping for personal gratification then it may well be termed as intrusion into one’s personal data. For example, if you are hacking into the database of a university to steal the academic articles and using them in your own project. Any such instance is definitely an act of stealth and may accompany relevant punishment. Concisely, to get hold of someone’s creative work for individual gains is unethical. Such people also deploy several bots to for data scraping or spinning, which in turn choke the search engine results and hardly useful to the internet.

Considerations that deem web scraping illegal

Generally, web scraping is illegal in two instances:

1. When you violate the terms and conditions of the service of the concerned website:

Most of the data-oriented websites disallow data scraping. Hence, if you are trying to extract data from that website, the owner has all the rights to sue you on the offense of breach of contract.

2. When you publish scraped content:

This is yet another condition that may delve you into violating the right of the copyright holders. If you are only scraping the content for fair use, it may be permissible. However, companies often hold all the publishing rights and may file suit against you if you publish their data without their permission.

Remedy to illegal web scraping

Despite running the apprehensions of getting identified, unethical web scrapers deter to steal data from websites. Hence, the web owners themselves need to be alert enough not to fall prey to such fraudulent activities. Indeed, it is your data and you won’t like it to get compromised at any cost. Just like there are many web scraping tools available online, you can also opt for applications that offer protection against web data extraction as a fruitful remedy. These software safeguard your website content from hacking attacks such as bots, denial of service, brute force, session opening and transaction anomalies, and more.

Summary: Technology has two facets – good and bad. It depends on us which one to adopt; the same holds in the case of web scraping as well. We should make sure to use this innovation for the benefit of society and not to steal away some one’s creativity, which is indeed unethical and at times, illegal

Source : http://www.web-parsing.com/blog/ethical-data-collection-activity-or-an-illegal-practice

Friday, 3 July 2015

SFTW: Scraping data with Google Refine

For the first Something For The Weekend of 2012 I want to tackle a common problem when you’re trying to scrape a collection of webpage: they have some sort of structure in their URL like this, where part of the URL refers to the name or code of an entity:     http://www.ltscotland.org.uk/scottishschoolsonline/schools/freemealentitlement.asp?iSchoolID=5237521

  tp://www.ltscotland.org.uk/scottishschoolsonline/schools/freemealentitlement.asp?iSchoolID=5237629

    ttp://www.ltscotland.org.uk/scottishschoolsonline/schools/freemealentitlement.asp?iSchoolID=5237823

In this instance, you can see that the URL is identical apart from a 7 digit code at the end: the ID of the school the data refers to.

There are a number of ways you could scrape this data. You could use Google Docs and the =importXML formula, but Google Docs will only let you use this 50 times on any one spreadsheet (you could copy the results and select Edit > Paste Special > Values Only and then use the formula a further 50 times if it’s not too many – here’s one I prepared earlier).

And you could use Scraperwiki to write a powerful scraper – but you need to understand enough coding to do so quickly (here’s a demo I prepared earlier).

A middle option is to use Google Refine, and here’s how you do it.

Assembling the ingredients

With the basic URL structure identified, we already have half of our ingredients. What we need  next is a list of the ID codes that we’re going to use to complete each URL.

An advanced search for “list seed number scottish schools filetype:xls” brings up a link to this spreadsheet (XLS) which gives us just that.

The spreadsheet will need editing: remove any rows you don’t need. This will reduce the time that the scraper will take in going through them. For example, if you’re only interested in one local authority, or one type of school, sort your spreadsheet so that you can delete those above or below them.

Now to combine  the ID codes with the base URL.

Bringing your data into Google Refine

Open Google Refine and create a new project with the edited spreadsheet containing the school IDs.

At the top of the school ID column click on the drop-down menu and select Edit column > Add column based on this column…

In the New column name box at the top call this ‘URL’.

In the Expression box type the following piece of GREL (Google Refine Expression Language):

“http://www.ltscotland.org.uk/scottishschoolsonline/schools/freemealentitlement.asp?iSchoolID=”+value

(Type in the quotation marks yourself – if you’re copying them from a webpage you may have problems)

The ‘value’ bit means the value of each cell in the column you just selected. The plus sign adds it to the end of the URL in quotes.

In the Preview window you should see the results – you can even copy one of the resulting URLs and paste it into a browser to check it works. (On one occasion Google Refine added .0 to the end of the ID number, ruining the URL. You can solve this by changing ‘value’ to value.substring(0,7) – this extracts the first 7 characters of the ID number, omitting the ‘.0') UPDATE: in the comment Thad suggests “perhaps, upon import of your spreadsheet of IDs, you forgot to uncheck the importer option to Parse as numbers?”

Click OK if you’re happy, and you should have a new column with a URL for each school ID.

Grabbing the HTML for each page

Now click on the top of this new URL column and select Edit column > Add column by fetching URLs…

In the New column name box at the top call this ‘HTML’.

All you need in the Expression window is ‘value’, so leave that as it is.

Click OK.

Google Refine will now go to each of those URLs and fetch the HTML contents. As we have a couple thousand rows here, this will take a long time – hours, depending on the speed of your computer and internet connection (it may not work at all if either isn’t very fast). So leave it running and come back to it later.

Extracting data from the raw HTML with parseHTML

When it’s finished you’ll have another column where each cell is a bunch of HTML. You’ll need to create a new column to extract what you need from that, and you’ll also need some GREL expressions explained here.

First you need to identify what data you want, and where it is in the HTML. To find it, right-click on one of the webpages containing the data, and search for a key phrase or figure that you want to extract. Around that data you want to find a HTML tag like <table class=”destinations”> or <div id=”statistics”>. Keep that open in another window while you tweak the expression we come onto below…

Back in Google Refine, at the top of the HTML column click on the drop-down menu and select Edit column > Add column based on this column…

In the New column name box at the top give it a name describing the data you’re going to pull out.

In the Expression box type the following piece of GREL (Google Refine Expression Language):

value.parseHtml().select(“table.destinations”)[0].select(“tr”).toString()

(Again, type the quotation marks yourself rather than copying them from here or you may have problems)

I’ll break down what this is doing:

value.parseHtml()

parse the HTML in each cell (value)

.select(“table.destinations”)

find a table with a class (.) of “destinations” (in the source HTML this reads <table class=”destinations”>. If it was <div id=”statistics”> then you would write .select(“div#statistics”) – the hash sign representing an ‘id’ and the full stop representing a ‘class’.

[0]

This zero in square brackets tells Refine to only grab the first table – a number 1 would indicate the second, and so on. This is because numbering (“indexing”) generally begins with zero in programming.

.select(“tr”)

Now, within that table, find anything within the tag <tr>

.toString()

And convert the results into a string of text.

The results of that expression in the Preview window should look something like this:

<tr> <th></th> <th>Abbotswell School</th> <th>Aberdeen City</th> <th>Scotland</th> </tr> <tr> <th>Percentage of pupils</th> <td>25.5%</td> <td>16.3%</td> <td>22.6%</td> </tr>

This is still HTML, but a much smaller and manageable chunk. You could, if you chose, now export it as a spreadsheet file and use various techniques to get rid of the tags (Find and Replace, for example) and split the data into separate columns (the =SPLIT formula, for example).

Or you could further tweak your GREL code in Refine to drill further into your data, like so:

value.parseHtml().select(“table.destinations”)[0].select(“td”)[0].toString()

Which would give you this:

<td>25.5%</td>

Or you can add the .substring function to strip out the HTML like so (assuming that the data you want is always 5 characters long):

value.parseHtml().select(“table.destinations”)[0].select(“td”)[0].toString().substring(5,10)

When you’re happy, click OK and you should have a new column for that data. You can repeat this for every piece of data you want to extract into a new column.

Then click Export in the upper right corner and save as a CSV or Excel file.

Source: http://onlinejournalismblog.com/2012/01/13/sftw-scraping-data-with-google-refine/