Friday, 3 July 2015

Web Scraping : To Scrape or Not to Scrape?

Web scraping is on the rise and its legality is being debated. The future of big data could hang in the balance.

I decided that I might want to stop writing about all these successful dot-com businesses and get into the act myself. I mean, how hard could it be, aside from the fact that I have no expertise in any particular vertical, no technological knowledge, and no money? That last was a bit of a problem because I was going to need big data, and big data doesn't come cheap.

So, one day I'm talking to a direct-tech wizard and he says, “Why don't you just find a business you want to be in, find the most successful company, go to their website, and scrape some of their data?”

Scrape? My dad was a house painter. I used to help him during summers. The only scraping I knew was done with a putty knife. But that's what God invented the Internet for. Google turned up endless Web scraping services and I went to one called Automation Anywhere.

Its homepage told me not to try scraping on my own, that I could pay them as little as $1,995 for a program that would have me scraping away in minutes with no programming expertise. A video showed me how. Suppose I wanted to locate all the assisted living facilities in Detroit? (Bedpans! There's a wide-open Web business!) Automation Anywhere showed me how I could request a data pattern—name, address, phone, and service area of targets—apply their program to a rehab facility listing, and minutes later be in possession of tidy customer list on a spreadsheet. A box popped up asking if I had any questions for a live account manager. I did.

“I'm interested in this, but is it totally legal?” I asked Shine, the account rep.

Shine was slow to respond and came back disappointingly noncommittal: “You need to install the software at your end. Hence, you will need to check at your end for legal documents for the website.”

Shine had obviously been reading the same European news sites that I had. Last year Irish airline Ryanair filed suit against PR Aviation, a Dutch airfare comparison site, charging it with copyright infringement and breach of contract for scraping flight data from its site. The Court of Justice of the European Union (ECJ) dismissed the suit, saying the scraping amounted to “normal use” of a website. However, the ECJ did potentially leave the door open for businesses with unprotected databases, such as Ryanair, to establish contractual limitations on use of their databases by third parties. That opening, should it be entered into by airlines, might have businesses such as Expedia, Orbitz, and Priceline reimagining their business plans.

And it could have any other businesses that load up third-party data files through scraping activities doing the same. The reality, however, is that that's a big “and.”

“The airlines have been halfway successful taking travel agents to court, but it can take five years and then they lose,” said Gus Cunningham, CEO of ScrapeSentry, one of only “three and a half” companies, in his words, that block scrapers from websites.

Professional scrapers are not only out-front and plentiful—as the Google search demonstrated—they're also nimble and expensive to chase away. Basically, Cunningham said, it's a matter of stopping scrapers at the website door among the airlines, e-coms, real estate sites, and online gambling companies that are scraped the most. Cunningham's company monitors inbound Web traffic and uses an analysis engine to block suspect visitors per parameters set down by clients. ScanSentry has a nine-year-old database that helps it identify bad actors, much like Interpol with its criminal database. Then a human element must enter the process.

Some of Cunningham's clients feed bogus information to competitors identified as scrapers to ferret them out. In many cases, though, they opt to turn their heads. “Airlines have some flights where they just want to get as many butts as they can in the seats, so they won't concentrate their blocking efforts on those. They'll concentrate on the routes that are always jammed,” Cunningham said.

Unlike botnets that steal money by, say, serving bogus websites to siphon off programmatic ad dollars, scraping is not overtly criminal. In the wide sphere of digital commerce, it's probably most common that the scrap-ee is also a scrap-er. How vigilant vulnerable industries become, and how protective courts and law enforcement agencies grow, will depend on how much scraping activity increases. Cunningham said it's growing fast. More than one fifth of visitors to client websites last year were scrapers, according to a ScanSentry study. Among travel companies, meanwhile, scrapers doubled from 15% in 2013 to 33% last year.

“And,” Cunningham noted, “It is stealing.”

Source: http://www.dmnews.com/direct-line-blog/to-scrape-or-not-to-scrape/article/422662/

Wednesday, 24 June 2015

Data Scraping - Increasing Accessibility by Scraping Information From PDF

You may have heard about data scraping which is a method that is being used by computer programs in extracting data from an output that comes from another program. To put it simply, this is a process which involves the automatic sorting of information that can be found on different resources including the internet which is inside an html file, PDF or any other documents. In addition to that, there is the collection of pertinent information. These pieces of information will be contained into the databases or spreadsheets so that the users can retrieve them later.

Most of the websites today have text that can be accessed and written easily in the source code. However, there are now other businesses nowadays that choose to make use of Adobe PDF files or Portable Document Format. This is a type of file that can be viewed by simply using the free software known as the Adobe Acrobat. Almost any operating system supports the said software. There are many advantages when you choose to utilize PDF files. Among them is that the document that you have looks exactly the same even if you put it in another computer so that you can view it. Therefore, this makes it ideal for business documents or even specification sheets. Of course there are disadvantages as well. One of which is that the text that is contained in the file is converted into an image. In this case, it is often that you may have problems with this when it comes to the copying and pasting.

This is why there are some that start scraping information from PDF. This is often called PDF scraping in which this is the process that is just like data scraping only that you will be getting information that is contained in your PDF files. In order for you to begin scraping information from PDF, you must choose and exploit a tool that is specifically designed for this process. However, you will find that it is not easy to locate the right tool that will enable you to perform PDF scraping effectively. This is because most of the tools today have problems in obtaining exactly the same data that you want without personalizing them.

Nevertheless, if you search well enough, you will be able to encounter the program that you are looking for. There is no need for you to have programming language knowledge in order for you to use them. You can easily specify your own preferences and the software will do the rest of the work for you. There are also companies out there that you can contact and they will perform the task since they have the right tools that they can use. If you choose to do things manually, you will find that this is indeed tedious and complicated whereas if you compare this to having professionals do the job for you, they will be able to finish it in no time at all. Scraping information from PDF is a process where you collect the information that can be found on the internet and this does not infringe copyright laws.

Source: http://ezinearticles.com/?Increasing-Accessibility-by-Scraping-Information-From-PDF&id=4593863

Friday, 19 June 2015

Web Scraping: working with APIs

APIs present researchers with a diverse set of data sources through a standardised access mechanism: send a pasted together HTTP request, receive JSON or XML in return. Today we tap into a range of APIs to get comfortable sending queries and processing responses.

These are the slides from the final class in Web Scraping through R: Web scraping for the humanities and social sciences

This week we explore how to use APIs in R, focusing on the Google Maps API. We then attempt to transfer this approach to query the Yandex Maps API. Finally, the practice section includes examples of working with the YouTube V2 API, a few ‘social’ APIs such as LinkedIn and Twitter, as well as APIs less off the beaten track (Cricket scores, anyone?).

I enjoyed teaching this course and hope to repeat and improve on it next year. When designing the course I tried to cram in everything I wish I had been taught early on in my PhD (resulting in information overload, I fear). Still, hopefully it has been useful to students getting started with digital data collection, showing on the one hand what is possible, and on the other giving some idea of key steps in achieving research objectives.

Download the .Rpres file to use in Rstudio here

A regular R script with code-snippets only can be accessed here

Slides from the first session here

Slides from the second session here

Slides from the third session here

Source: http://www.r-bloggers.com/web-scraping-working-with-apis/

Monday, 8 June 2015

Web Scraping Services : Making Modern File Formats More Accessible

Data scraping is the process of automatically sorting through information contained on the internet inside html, PDF or other documents and collecting relevant information to into databases and spreadsheets for later retrieval. On most websites, the text is easily and accessibly written in the source code but an increasing number of businesses are using Adobe PDF format (Portable Document Format: A format which can be viewed by the free Adobe Acrobat software on almost any operating system. See below for a link.). The advantage of PDF format is that the document looks exactly the same no matter which computer you view it from making it ideal for business forms, specification sheets, etc.; the disadvantage is that the text is converted into an image from which you often cannot easily copy and paste. PDF Scraping is the process of data scraping information contained in PDF files. To PDF scrape a PDF document, you must employ a more diverse set of tools.

There are two main types of PDF files: those built from a text file and those built from an image (likely scanned in). Adobe's own software is capable of PDF scraping from text-based PDF files but special tools are needed for PDF scraping text from image-based PDF files. The primary tool for PDF scraping is the OCR program. OCR, or Optical Character Recognition, programs scan a document for small pictures that they can separate into letters. These pictures are then compared to actual letters and if matches are found, the letters are copied into a file. OCR programs can perform PDF scraping of image-based PDF files quite accurately but they are not perfect.

Once the OCR program or Adobe program has finished PDF scraping a document, you can search through the data to find the parts you are most interested in. This information can then be stored into your favorite database or spreadsheet program. Some PDF scraping programs can sort the data into databases and/or spreadsheets automatically making your job that much easier.

Quite often you will not find a PDF scraping program that will obtain exactly the data you want without customization. Surprisingly a search on Google only turned up one business, that will create a customized PDF scraping utility for your project. A handful of off the shelf utilities claim to be customizable, but seem to require a bit of programming knowledge and time commitment to use effectively. Obtaining the data yourself with one of these tools may be possible but will likely prove quite tedious and time consuming. It may be advisable to contract a company that specializes in PDF scraping to do it for you quickly and professionally.

Let's explore some real world examples of the uses of PDF scraping technology. A group at Cornell University wanted to improve a database of technical documents in PDF format by taking the old PDF file where the links and references were just images of text and changing the links and references into working clickable links thus making the database easy to navigate and cross-reference. They employed a PDF scraping utility to deconstruct the PDF files and figure out where the links were. They then could create a simple script to re-create the PDF files with working links replacing the old text image.

A computer hardware vendor wanted to display specifications data for his hardware on his website. He hired a company to perform PDF scraping of the hardware documentation on the manufacturers' website and save the PDF scraped data into a database he could use to update his webpage automatically.

PDF Scraping is just collecting information that is available on the public internet. PDF Scraping does not violate copyright laws.

PDF Scraping is a great new technology that can significantly reduce your workload if it involves retrieving information from PDF files. Applications exist that can help you with smaller, easier PDF Scraping projects but companies exist that will create custom applications for larger or more intricate PDF Scraping jobs.

Source: http://ezinearticles.com/?PDF-Scraping:-Making-Modern-File-Formats-More-Accessible&id=193321

Tuesday, 2 June 2015

Scraping the Royal Society membership list

To a data scientist any data is fair game, from my interest in the history of science I came across the membership records of the Royal Society from 1660 to 2007 which are available as a single PDF file. I’ve scraped the membership list before: the first time around I wrote a C# application which parsed a plain text file which I had made from the original PDF using an online converting service, looking back at the code it is fiendishly complicated and cluttered by boilerplate code required to build a GUI. ScraperWiki includes a pdftoxml function so I thought I’d see if this would make the process of parsing easier, and compare the ScraperWiki experience more widely with my earlier scraper.

The membership list is laid out quite simply, as shown in the image below, each member (or Fellow) record spans two lines with the member name in the left most column on the first line and information on their birth date and the day they died, the class of their Fellowship and their election date on the second line.

Later in the document we find that information on the Presidents of the Royal Society is found on the same line as the Fellow name and that Royal Patrons are formatted a little differently. There are also alias records where the second line points to the primary record for the name on the first line.

pdftoxml converts a PDF into an xml file, wherein each piece of text is located on the page using spatial coordinates, an individual line looks like this:

<text top="243" left="135" width="221" height="14" font="2">Abbot, Charles, 1st Baron Colchester </text>

This makes parsing columnar data straightforward you simply need to select elements with particular values of the “left” attribute. It turns out that the columns are not in exactly the same positions throughout the whole document, which appears to have been constructed by tacking together the membership list A-J with that of K-Z, but this can easily be resolved by accepting a small range of positions for each column.

Attempting to automatically parse all 395 pages of the document reveals some transcription errors: one Fellow was apparently elected on 16th March 197 – a bit of Googling reveals that the real date is 16th March 1978. Another fellow is classed as a “Felllow”, and whilst most of the dates of birth and death are separated by a dash some are separated by an en dash which as far as the code is concerned is something completely different and so on. In my earlier iteration I missed some of these quirks or fixed them by editing the converted text file. These variations suggest that the source document was typed manually rather than being output from a pre-existing database. Since I couldn’t edit the source document I was obliged to code around these quirks.

ScraperWiki helpfully makes putting data into a SQLite database the simplest option for a scraper. My handling of dates in this version of the scraper is a little unsatisfactory: presidential terms are described in terms of a start and end year but are rendered 1st January of those years in the database. Furthermore, in historical documents dates may not be known accurately so someone may have a birth date described as “circa 1782″ or “c 1782″, even more vaguely they may be described as having “flourished 1663-1778″ or “fl. 1663-1778″. Python’s default datetime module does not capture this subtlety and if it did the database used to store dates would need to support it too to be useful – I’ve addressed this by storing the original life span data as text so that it can be analysed should the need arise. Storing dates as proper dates in the database, rather than text strings means we can query the database using date based queries.

ScraperWiki provides an API to my dataset so that I can query it using SQL, and since it is public anyone else can do this too. So, for example, it’s easy to write queries that tell you the the database contains 8019 Fellows, 56 Presidents, 387 born before 1700, 3657 with no birth date, 2360 with no death date, 204 “flourished”, 450 have birth dates “circa” some year.

I can count the number of classes of fellows:

select distinct class,count(*) from `RoyalSocietyFellows` group by class

Make a table of all of the Presidents of the Royal Society

select * from `RoyalSocietyFellows` where StartPresident not null order by StartPresident desc

…and so on. These illustrations just use the ScraperWiki htmltable export option to display the data as a table but equally I could use similar queries to pull data into a visualisation.

Comparing this to my earlier experience, the benefits of using ScraperWiki are:

•    Nice traceable code to provide a provenance for the dataset;

•    Access to the pdftoxml library;

•    Strong encouragement to “do the right thing” and put the data into a database;

•    Publication of the data;

•    A simple API giving access to the data for reuse by all.

My next target for ScraperWiki may well be the membership lists for the French Academie des Sciences, a task which proved too complex for a simple plain text scraper…

Source: https://scraperwiki.wordpress.com/2012/12/28/scraping-the-royal-society-membership-list/

Thursday, 28 May 2015

Data Scraping Services - Things to take care while doing Web Scraping!!!

In the present day and age, web scraping word becomes most popular in data science. Basically web scraping is extracting the information from the websites using pre-written programs and web scraping scripts. Many organizations have successfully used web site scraping to build relevant and useful database that they use on a daily basis to enhance their business interests. This is the age of the Big Data and web scraping is one of the trending techniques in the data science.

Throughout my journey of learning web scraping and implementing many successful scraping projects, I have come across some great experiences we can learn from.  In this post, I’m going to discuss some of the approaches to take and approaches to avoid while executing web scraping.

User Proxies: Anonymously scraping data from websites

One should not scrape website with a single IP Address. Because when you repeatedly request the web page for web scraping, there is a chance that the remote web server might block your IP address preventing further request to the web page. To overcome this situation, one should scrape websites with the help of proxy servers (anonymous scraping). This will minimize the risk of getting trapped and blacklisted by a website. Use of Proxies to hide your identity (network details) to remote web servers while scraping data. You may also use a VPN instead of proxies to anonymously scrape websites.

Take maximum data and store it.

Do not follow “process the web page as it comes from the remote server”. Instead take all the information and store it to disk. This approach will be useful when your scraping algorithm breaks in the middle. In this case you don’t have to start scraping again. Never download the same content more than once as you are just wasting bandwidth. Try and download all content to disk in one go and then do the processing.

Follow strict rules in parsing:

Check various rules while parsing the information from the web site. For example if you expect a value to be a date then check that it’s really a date. This may greatly improve the quality of information. When you get unexpected data, then the algorithm need to be changed accordingly.

Respect Robots.txt

Robots.txt specifies the set of rules that should be followed by web crawlers and robots. I strongly advise you to consider and adjust your crawler to fully respect robots.txt. Robots.txt contains instructions on the exact pages that you are allowed to crawl, user-agent, and the requisite intervals between page requests. Following to these instructions minimizes the chance of getting blacklisted and banned from website owner.

Use XPath Smartly

XPath is a nice option to select elements of the HTML document more flexibly than CSS Selectors.  Be careful about HTML structure change through page to page so one xpath you made may be failed to extract data on another page due to changes in HTML structure.

Obey Website TOC:

Some websites make it absolutely apparent in their terms and conditions that they are particularly against to web scraping activities on their content. This can make you vulnerable against possible ethical and legal implications.

Test sample scrape and verify the data with actual scrape

Once you are done with web scraping project set up, you need to test it for sometimes. Check the extracted data. If something is not good, find out the cause and make changes accordingly and finally come to a perfect web scraping project.

Source: http://webdata-scraping.com/things-take-care-web-scraping/

Tuesday, 26 May 2015

Web Scraping Services : What are the ethics of web scraping?

Someone recently asked: "Is web scraping an ethical concept?" I believe that web scraping is absolutely an ethical concept. Web scraping (or screen scraping) is a mechanism to have a computer read a website. There is absolutely no technical difference between an automated computer viewing a website and a human-driven computer viewing a website. Furthermore, if done correctly, scraping can provide many benefits to all involved.

There are a bunch of great uses for web scraping. First, services like Instapaper, which allow saving content for reading on the go, use screen scraping to save a copy of the website to your phone. Second, services like Mint.com, an app which tells you where and how you are spending your money, uses screen scraping to access your bank's website (all with your permission). This is useful because banks do not provide many ways for programmers to access your financial data, even if you want them to. By getting access to your data, programmers can provide really interesting visualizations and insight into your spending habits, which can help you save money.

That said, web scraping can veer into unethical territory. This can take the form of reading websites much quicker than a human could, which can cause difficulty for the servers to handle it. This can cause degraded performance in the website. Malicious hackers use this tactic in what’s known as a "Denial of Service" attack.

Another aspect of unethical web scraping comes in what you do with that data. Some people will scrape the contents of a website and post it as their own, in effect stealing this content. This is a big no-no for the same reasons that taking someone else's book and putting your name on it is a bad idea. Intellectual property, copyright and trademark laws still apply on the internet and your legal recourse is much the same. People engaging in web scraping should make every effort to comply with the stated terms of service for a website. Even when in compliance with those terms, you should take special care in ensuring your activity doesn't affect other users of a website.

One of the downsides to screen scraping is it can be a brittle process. Minor changes to the backing website can often leave a scraper completely broken. Herein lies the mechanism for prevention: making changes to the structure of the code of your website can wreak havoc on a screen scraper's ability to extract information. Periodically making changes that are invisible to the user but affect the content of the code being returned is the most effective mechanism to thwart screen scrapers. That said, this is only a set-back. Authors of screen scrapers can always update them and, as there is no technical difference between a computer-backed browser and a human-backed browser, there's no way to 100% prevent access.

Going forward, I expect screen scraping to increase. One of the main reasons for screen scraping is that the underlying website doesn't have a way for programmers to get access to the data they want. As the number of programmers (and the need for programmers) increases over time, so too will the need for data sources. It is unreasonable to expect every company to dedicate the resources to build a programmer-friendly access point. Screen scraping puts the onus of data extraction on the programmer, not the company with the data, which can work out well for all involved.

Source: https://quickleft.com/blog/is-web-scraping-ethical/

Monday, 25 May 2015

Screen Scraping with BeautifulSoup and lxml

Please enjoy this — a free Chapter of the Python network programming book that I revised for Apress in 2010!

I completely rewrote this chapter for the book's second edition, to feature two powerful libraries that have appeared since the book first came out. I show how to screen-scrape a real-life web page using both BeautifulSoup and also the powerful lxml library (their web sites are here and here).

I chose this chapter for release because screen scraping is often the first network task that a novice Python programmer tackles. Because this material is oriented towards beginners, it explains the entire process — from fetching web pages, to understanding HTML, to querying for specific elements in the document.

Program listings are available for this chapter in both Python 2 and also in Python 3. Let me know if you have any questions!

Most web sites are designed first and foremost for human eyes. While well-designed sites offer formal APIs by which you can construct Google maps, upload Flickr photos, or browse YouTube videos, many sites offer nothing but HTML pages formatted for humans. If you need a program to be able to fetch its data, then you will need the ability to dive into densely formatted markup and retrieve the information you need—a process known affectionately as screen scraping.

In one's haste to grab information from a web page sitting open in your browser in front of you, it can be easy for even experienced programmers to forget to check whether an API is provided for data that they need. So try to take a few minutes investigating the site in which you are interested to see if some more formal programming interface is offered to their services. Even an RSS feed can sometimes be easier to parse than a list of items on a full web page.

Also be careful to check for a “terms of service” document on each site. YouTube, for example, offers an API and, in return, disallows programs from trying to parse their web pages. Sites usually do this for very important reasons related to performance and usage patterns, so I recommend always obeying the terms of service and simply going elsewhere for your data if they prove too restrictive.

Regardless of whether terms of service exist, always try to be polite when hitting public web sites. Cache pages or data that you will need for several minutes or hours, rather than hitting their site needlessly over and over again. When developing your screen-scraping algorithm, test against a copy of their web page that you save to disk, instead of doing an HTTP round-trip with every test. And always be aware that excessive use can result in your IP being temporarily or permanently blocked from a site if its owners are sensitive to automated sources of load.

Fetching Web Pages

Before you can parse an HTML-formatted web page, you of course have to acquire some. Chapter 9 provides the kind of thorough introduction to the HTTP protocol that can help you figure out how to fetch information even from sites that require passwords or cookies. But, in brief, here are some options for downloading content.

From the Future

If you need a simple way to fetch web pages before scraping them, try Kenneth Reitz's requests library!

The library was not released until after the book was published, but has already taken the Python world by storm. The simplicity and convenience of its API has made it the tool of choice for making web requests from Python.

    You can use urllib2, or the even lower-level httplib, to construct an HTTP request that will return a web page. For each form that has to be filled out, you will have to build a dictionary representing the field names and data values inside; unlike a real web browser, these libraries will give you no help in submitting forms.

    You can to install mechanize and write a program that fills out and submits web forms much as you would do when sitting in front of a web browser. The downside is that, to benefit from this automation, you will need to download the page containing the form HTML before you can then submit it—possibly doubling the number of web requests you perform!

    If you need to download and parse entire web sites, take a look at the Scrapy project, hosted at scrapy.org, which provides a framework for implementing your own web spiders. With the tools it provides, you can write programs that follow links to every page on a web site, tabulating the data you want extracted from each page.

    When web pages wind up being incomplete because they use dynamic JavaScript to load data that you need, you can use the QtWebKit module of the PyQt4 library to load a page, let the JavaScript run, and then save or parse the resulting complete HTML page.

    Finally, if you really need a browser to load the site, both the Selenium and Windmill test platforms provide a way to drive a standard web browser from inside a Python program. You can start the browser up, direct it to the page of interest, fill out and submit forms, do whatever else is necessary to bring up the data you need, and then pull the resulting information directly from the DOM elements that hold them.

These last two options both require third-party components or Python modules that are built against large libraries, and so we will not cover them here, in favor of techniques that require only pure Python.

For our examples in this chapter, we will use the site of the United States National Weather Service, which lives at www.weather.gov.

Among the better features of the United States government is its having long ago decreed that all publications produced by their agencies are public domain. This means, happily, that I can pull all sorts of data from their web site and not worry about the fact that copies of the data are working their way into this book.

Of course, web sites change, so the online source code for this chapter includes the downloaded web page on which the scripts in this chapter are designed to work. That way, even if their site undergoes a major redesign, you will still be able to try out the code examples in the future. And, anyway—as I recommended previously—you should be kind to web sites by always developing your scraping code against a downloaded copy of a web page to help reduce their load.

Downloading Pages Through Form Submission

The task of grabbing information from a web site usually starts by reading it carefully with a web browser and finding a route to the information you need. Figure 10–1 shows the site of the National Weather Service; for our first example, we will write a program that takes a city and state as arguments and prints out the current conditions, temperature, and humidity. If you will explore the site a bit, you will find that city-specific forecasts can be visited by typing the city name into the small “Local forecast” form in the left margin.

Figure 10–1. The National Weather Service web site

(click to enlarge)

When using the urllib2 module from the Standard Library, you will have to read the web page HTML manually to find the form. You can use the View Source command in your browser, search for the words “Local forecast,” and find the following form in the middle of the sea of HTML:

<form method="post" action="http://forecast.weather.gov/zipcity.php" ...>

  ...

  <input type="text" id="zipcity" name="inputstring" size="9"

    value="City, St" onfocus="this.value='';" />

  <input type="submit" name="Go2" value="Go" />

</form>

The only important elements here are the <form> itself and the <input> fields inside; everything else is just decoration intended to help human readers.

This form does a POST to a particular URL with, it appears, just one parameter: an inputstring giving the city name and state. Listing 10–1 shows a simple Python program that uses only the Standard Library to perform this interaction, and saves the result to phoenix.html.

Listing 10–1. Submitting a Form with “urllib2”

#!/usr/bin/env python

# Foundations of Python Network Programming - Chapter 10 - fetch_urllib2.py

# Submitting a form and retrieving a page with urllib2

import urllib, urllib2

data = urllib.urlencode({'inputstring': 'Phoenix, AZ'})

info = urllib2.urlopen('http://forecast.weather.gov/zipcity.php', data)

content = info.read()

open('phoenix.html', 'w').write(content)

On the one hand, urllib2 makes this interaction very convenient; we are able to download a forecast page using only a few lines of code. But, on the other hand, we had to read and understand the form ourselves instead of relying on an actual HTML parser to read it. The approach encouraged by mechanize is quite different: you need only the address of the opening page to get started, and the library itself will take responsibility for exploring the HTML and letting you know what forms are present. Here are the forms that it finds on this particular page:

>>> import mechanize

>>> br = mechanize.Browser()

>>> response = br.open('http://www.weather.gov/')

>>> for form in br.forms():

...     print '%r %r %s' % (form.name, form.attrs.get('id'), form.action)

...     for control in form.controls:

...         print '   ', control.type, control.name, repr(control.value)

None None http://search.usa.gov/search

    hidden v:project 'firstgov'

    text query ''

    radio affiliate ['nws.noaa.gov']

    submit None 'Go'

None None http://forecast.weather.gov/zipcity.php

    text inputstring 'City, St'

    submit Go2 'Go'

'jump' 'jump' http://www.weather.gov/

    select menu ['http://www.weather.gov/alerts-beta/']

    button None None

Here, mechanize has helped us avoid reading any HTML at all. Of course, pages with very obscure form names and fields might make it very difficult to look at a list of forms like this and decide which is the form we see on the page that we want to submit; in those cases, inspecting the HTML ourselves can be helpful, or—if you use Google Chrome, or Firefox with Firebug installed—right-clicking the form and selecting “Inspect Element” to jump right to its element in the document tree.

Once we have determined that we need the zipcity.php form, we can write a program like that shown in Listing 10–2. You can see that at no point does it build a set of form fields manually itself, as was necessary in our previous listing. Instead, it simply loads the front page, sets the one field value that we care about, and then presses the form's submit button. Note that since this HTML form did not specify a name, we had to create our own filter function—the lambda function in the listing—to choose which of the three forms we wanted.

Listing 10–2. Submitting a Form with mechanize

#!/usr/bin/env python

# Foundations of Python Network Programming - Chapter 10 - fetch_mechanize.py

# Submitting a form and retrieving a page with mechanize

import mechanize

br = mechanize.Browser()

br.open('http://www.weather.gov/')

br.select_form(predicate=lambda(form): 'zipcity' in form.action)

br['inputstring'] = 'Phoenix, AZ'

response = br.submit()

content = response.read()


open('phoenix.html', 'w').write(content)

Many mechanize users instead choose to select forms by the order in which they appear in the page—in which case we could have called select_form(nr=1). But I prefer not to rely on the order, since the real identity of a form is inherent in the action that it performs, not its location on a page.

You will see immediately the problem with using mechanize for this kind of simple task: whereas Listing 10–1 was able to fetch the page we wanted with a single HTTP request, Listing 10–2 requires two round-trips to the web site to do the same task. For this reason, I avoid using mechanize for simple form submission. Instead, I keep it in reserve for the task at which it really shines: logging on to web sites like banks, which set cookies when you first arrive at their front page and require those cookies to be present as you log in and browse your accounts. Since these web sessions require a visit to the front page anyway, no extra round-trips are incurred by using mechanize.

The Structure of Web Pages

There is a veritable glut of online guides and published books on the subject of HTML, but a few notes about the format would seem to be appropriate here for users who might be encountering the format for the first time.

The Hypertext Markup Language (HTML) is one of many markup dialects built atop the Standard Generalized Markup Language (SGML), which bequeathed to the world the idea of using thousands of angle brackets to mark up plain text. Inserting bold and italics into a format like HTML is as simple as typing eight angle brackets:

The <b>very</b> strange book <i>Tristram Shandy</i>.

In the terminology of SGML, the strings <b> and </b> are each tags—they are, in fact, an opening and a closing tag—and together they create an element that contains the text very inside it. Elements can contain text as well as other elements, and can define a series of key/value attribute pairs that give more information about the element:

<p content="personal">I am reading <i document="play">Hamlet</i>.</p>

There is a whole subfamily of markup languages based on the simpler Extensible Markup Language (XML), which takes SGML and removes most of its special cases and features to produce documents that can be generated and parsed without knowing their structure ahead of time. The problem with SGML languages in this regard—and HTML is one particular example—is that they expect parsers to know the rules about which elements can be nested inside which other elements, and this leads to constructions like this unordered list <ul>, inside which are several list items <li>:

<ul><li>First<li>Second<li>Third<li>Fourth</ul>

At first this might look like a series of <li> elements that are more and more deeply nested, so that the final word here is four list elements deep. But since HTML in fact says that <li> elements cannot nest, an HTML parser will understand the foregoing snippet to be equivalent to this more explicit XML string:

<ul><li>First</li><li>Second</li><li>Third</li><li>Fourth</li></ul>

And beyond this implicit understanding of HTML that a parser must possess are the twin problems that, first, various browsers over the years have varied wildly in how well they can reconstruct the document structure when given very concise or even deeply broken HTML; and, second, most web page authors judge the quality of their HTML by whether their browser of choice renders it correctly. This has resulted not only in a World Wide Web that is full of sites with invalid and broken HTML markup, but also in the fact that the permissiveness built into browsers has encouraged different flavors of broken HTML among their different user groups.

If HTML is a new concept to you, you can find abundant resources online. Here are a few documents that have been longstanding resources in helping programmers learn the format:

    http://www.w3.org/MarkUp/Guide/

    http://www.w3.org/MarkUp/Guide/Advanced.html

    http://www.w3.org/MarkUp/Guide/Style

The brief Bare Bones Guide, and the long and verbose HTML standard itself, are good resources to have when trying to remember an element name or the name of a particular attribute value:

    http://werbach.com/barebones/barebones.html

    http://www.w3.org/TR/REC-html40/

When building your own web pages, try to install a real HTML validator in your editor, IDE, or build process, or test your web site once it is online by submitting it to

    http://validator.w3.org/

You might also want to consider using the tidy tool, which can also be integrated into an editor or build process:

    http://tidy.sourceforge.net/

We will now turn to that weather forecast for Phoenix, Arizona, that we downloaded earlier using our scripts (note that we will avoid creating extra traffic for the NWS by running our experiments against this local file), and we will learn how to extract actual data from HTML.

Three Axes

Parsing HTML with Python requires three choices:

    The parser you will use to digest the HTML, and try to make sense of its tangle of opening and closing tags

    The API by which your Python program will access the tree of concentric elements that the parser built from its analysis of the HTML page

    What kinds of selectors you will be able to write to jump directly to the part of the page that interests you, instead of having to step into the hierarchy one element at a time

The issue of selectors is a very important one, because a well-written selector can unambiguously identify an HTML element that interests you without your having to touch any of the elements above it in the document tree. This can insulate your program from larger design changes that might be made to a web site; as long as the element you are selecting retains the same ID, name, or whatever other property you select it with, your program will still find it even if after the redesign it is several levels deeper in the document.

I should pause for a second to explain terms like “deeper,” and I think the concept will be clearest if we reconsider the unordered list that was quoted in the previous section. An experienced web developer looking at that list rearranges it in her head, so that this is what it looks like:

<ul>

  <li>First</li>

  <li>Second</li>

  <li>Third</li>

  <li>Fourth</li>

</ul>

Here the <ul> element is said to be a “parent” element of the individual list items, which “wraps” them and which is one level “above” them in the whole document. The <li> elements are “siblings” of one another; each is a “child” of the <ul> element that “contains” them, and they sit “below” their parent in the larger document tree. This kind of spatial thinking winds up being very important for working your way into a document through an API.

In brief, here are your choices along each of the three axes that were just listed:

    The most powerful, flexible, and fastest parser at the moment appears to be the HTMLParser that comes with lxml; the next most powerful is the longtime favorite BeautifulSoup (I see that its author has, in his words, “abandoned” the new 3.1 version because it is weaker when given broken HTML, and recommends using the 3.0 series until he has time to release 3.2); and coming in dead last are the parsing classes included with the Python Standard Library, which no one seems to use for serious screen scraping.

    The best API for manipulating a tree of HTML elements is ElementTree, which has been brought into the Standard Library for use with the Standard Library parsers, and is also the API supported by lxml; BeautifulSoup supports an API peculiar to itself; and a pair of ancient, ugly, event-based interfaces to HTML still exist in the Python Standard Library.

    The lxml library supports two of the major industry-standard selectors: CSS selectors and XPath query language; BeautifulSoup has a selector system all its own, but one that is very powerful and has powered countless web-scraping programs over the years.

Given the foregoing range of options, I recommend using lxml when doing so is at all possible—installation requires compiling a C extension so that it can accelerate its parsing using libxml2—and using BeautifulSoup if you are on a machine where you can install only pure Python. Note that lxml is available as a pre-compiled package named python-lxml on Ubuntu machines, and that the best approach to installation is often this command line:

STATIC_DEPS=true pip install lxml

And if you consult the lxml documentation, you will find that it can optionally use the BeautifulSoup parser to build its own ElementTree-compliant trees of elements. This leaves very little reason to use BeautifulSoup by itself unless its selectors happen to be a perfect fit for your problem; we will discuss them later in this chapter.

But the state of the art may advance over the years, so be sure to consult its own documentation as well as recent blogs or Stack Overflow questions if you are having problems getting it to compile.

From the Future

The BeautifulSoup project has recovered! While the text below — written in late 2010 — has to carefully avoid the broken 3.2 release in favor of 3.0, BeautifulSoup has now released a rewrite named beautifulsoup4 on the Python Package Index that works with both Python 2 and 3. Once installed, simply import it like this:

from bs4 import BeautifulSoup

I just ran a test, and it reads the malformed phoenix.html page perfectly.

Diving into an HTML Document

The tree of objects that a parser creates from an HTML file is often called a Document Object Model, or DOM, even though this is officially the name of one particular API defined by the standards bodies and implemented by browsers for the use of JavaScript running on a web page.

The task we have set for ourselves, you will recall, is to find the current conditions, temperature, and humidity in the phoenix.html page that we have downloaded. Here is an excerpt: Listing 10–3, which focuses on the pane that we are interested in.

Listing 10–3. Excerpt from the Phoenix Forecast Page

<!doctype html public "-//W3C//DTD HTML 4.0 Transitional//EN"><html><head>

<title>7-Day Forecast for Latitude 33.45&deg;N and Longitude 112.07&deg;W (Elev. 1132 ft)</title><link rel="STYLESHEET" type="text/css" href="fonts/main.css">

...

<table cellspacing="0" cellspacing="0" border="0" width="100%"><tr align="center"><td><table width='100%' border='0'>

<tr>

<td align ='center'>

<span class='blue1'>Phoenix, Phoenix Sky Harbor International Airport</span><br>

Last Update on 29 Oct 7:51 MST<br><br>

</td>
</tr>
<tr>

<td colspan='2'>

<table cellspacing='0' cellpadding='0' border='0' align='left'>

<tr>

<td class='big' width='120' align='center'>

<font size='3' color='000066'>

A Few Clouds<br>

<br>71&deg;F<br>(22&deg;C)</td>

</font><td rowspan='2' width='200'><table cellspacing='0' cellpadding='2' border='0' width='100%'>

<tr bgcolor='#b0c4de'>

<td><b>Humidity</b>:</td>

<td align='right'>30 %</td>

</tr>

<tr bgcolor='#ffefd5'>

<td><b>Wind Speed</b>:</td><td align='right'>SE 5 MPH<br>

</td>
</tr>

<tr bgcolor='#b0c4de'>

<td><b>Barometer</b>:</td><td align='right' nowrap>30.05 in (1015.90 mb)</td></tr>

<tr bgcolor='#ffefd5'>

<td><b>Dewpoint</b>:</td><td align='right'>38&deg;F (3&deg;C)</td>

</tr>
</tr>

<tr bgcolor='#ffefd5'>

<td><b>Visibility</b>:</td><td align='right'>10.00 Miles</td>

</tr>

<tr><td nowrap><b><a href='http://www.wrh.noaa.gov/total_forecast/other_obs.php?wfo=psr&zone=AZZ023' class='link'>More Local Wx:</a></b> </td>

<td nowrap align='right'><b><a href='http://www.wrh.noaa.gov/mesowest/getobext.php?wfo=psr&sid=KPHX&num=72' class='link'>3 Day History:</a></b> </td></tr>

</table>

...

There are two approaches to narrowing your attention to the specific area of the document in which you are interested. You can either search the HTML for a word or phrase close to the data that you want, or, as we mentioned previously, use Google Chrome or Firefox with Firebug to “Inspect Element” and see the element you want embedded in an attractive diagram of the document tree. Figure 10–2 shows Google Chrome with its Developer Tools pane open following an Inspect Element command: my mouse is poised over the <font> element that was brought up in its document tree, and the element itself is highlighted in blue on the web page itself.

Image

Figure 10–2. Examining Document Elements in the Browser

(click to enlarge)

Note that Google Chrome does have an annoying habit of filling in “conceptual” tags that are not actually present in the source code, like the <tbody> tags that you can see in every one of the tables shown here. For that reason, I look at the actual HTML source before writing my Python code; I mainly use Chrome to help me find the right places in the HTML.

We will want to grab the text “A Few Clouds” as well as the temperature before turning our attention to the table that sits to this element's right, which contains the humidity.

A properly indented version of the HTML page that you are scraping is good to have at your elbow while writing code. I have included phoenix-tidied.html with the source code bundle for this chapter so that you can take a look at how much easier it is to read!

You can see that the element displaying the current conditions in Phoenix sits very deep within the document hierarchy. Deep nesting is a very common feature of complicated page designs, and that is why simply walking a document object model can be a very verbose way to select part of a document—and, of course, a brittle one, because it will be sensitive to changes in any of the target element's parent. This will break your screen-scraping program not only if the target web site does a redesign, but also simply because changes in the time of day or the need for the site to host different kinds of ads can change the layout subtly and ruin your selector logic.

To see how direct document-object manipulation would work in this case, we can load the raw page directly into both the lxml and BeautifulSoup systems.

>>> import lxml.etree
>>> parser = lxml.etree.HTMLParser(encoding='utf-8')
>>> tree = lxml.etree.parse('phoenix.html', parser)

The need for a separate parser object here is because, as you might guess from its name, lxml is natively targeted at XML files.

>>> from BeautifulSoup import BeautifulSoup

>>> soup = BeautifulSoup(open('phoenix.html'))

Traceback (most recent call last):
  ...
HTMLParseError: malformed start tag, at line 96, column 720

What on earth? Well, look, the National Weather Service does not check or tidy its HTML! I might have chosen a different example for this book if I had known, but since this is a good illustration of the way the real world works, let's press on. Jumping to line 96, column 720 of phoenix.html, we see that there does indeed appear to be some broken HTML:

<a href="http://www.weather.gov"<u>www.weather.gov</u></a>

You can see that the <u> tag starts before a closing angle bracket has been encountered for the <a> tag. But why should BeautifulSoup care? I wonder what version I have installed.

>>> BeautifulSoup.__version__

'3.1.0'

Well, drat. I typed too quickly and was not careful to specify a working version when I ran pip to install BeautifulSoup into my virtual environment. Let's try again:

$ pip install BeautifulSoup==3.0.8.1

And now the broken document parses successfully:

>>> from BeautifulSoup import BeautifulSoup

>>> soup = BeautifulSoup(open('phoenix.html'))

That is much better!

Now, if we were to take the approach of starting at the top of the document and digging ever deeper until we find the node that we are interested in, we are going to have to generate some very verbose code. Here is the approach we would have to take with lxml:
>>> fonttag = tree.find('body').find('div').findall('table')[3] \

...     .findall('tr')[1].find('td').findall('table')[1].find('tr') \

...     .findall('td')[1].findall('table')[1].find('tr').find('td') \

...     .find('table').findall('tr')[1].find('td').find('table') \

...     .find('tr').find('td').find('font')

>>> fonttag.text

'\nA Few Clouds'

An attractive syntactic convention lets BeautifulSoup handle some of these steps more beautifully:

>>> fonttag = soup.body.div('table', recursive=False)[3] \

...     ('tr', recursive=False)[1].td('table', recursive=False)[1].tr \

...     ('td', recursive=False)[1]('table', recursive=False)[1].tr.td \

...     .table('tr', recursive=False)[1].td.table \

...     .tr.td.font

>>> fonttag.text

u'A Few Clouds71&deg;F(22&deg;C)'

BeautifulSoup lets you choose the first child element with a given tag by simply selecting the attribute .tagname, and lets you receive a list of child elements with a given tag name by calling an element like a function—you can also explicitly call the method findAll()—with the tag name and a recursive option telling it to pay attention just to the children of an element; by default, this option is set to True, and BeautifulSoup will run off and find all elements with that tag in the entire sub-tree beneath an element!

Anyway, two lessons should be evident from the foregoing exploration.

First, both lxml and BeautifulSoup provide attractive ways to quickly grab a child element based on its tag name and position in the document.

Second, we clearly should not be using such primitive navigation to try descending into a real-world web page! I have no idea how code like the expressions just shown can easily be debugged or maintained; they would probably have to be re-built from the ground up if anything went wrong with them—they are a painful example of write-once code.

And that is why selectors that each screen-scraping library supports are so critically important: they are how you can ignore the many layers of elements that might surround a particular target, and dive right in to the piece of information you need.

Figuring out how HTML elements are grouped, by the way, is much easier if you either view HTML with an editor that prints it as a tree, or if you run it through a tool like HTML tidy from W3C that can indent each tag to show you which ones are inside which other ones:
$ tidy phoenix.html > phoenix-tidied.html

You can also use either of these libraries to try tidying the code, with a call like one of these:

lxml.html.tostring(html)

soup.prettify()

See each library's documentation for more details on using these calls.

Selectors

A selector is a pattern that is crafted to match document elements on which your program wants to operate. There are several popular flavors of selector, and we will look at each of them as possible techniques for finding the current-conditions <font> tag in the National Weather Service page for Phoenix. We will look at three:

•    People who are deeply XML-centric prefer XPath expressions, which are a companion technology to XML itself and let you match elements based on their ancestors, their own identity, and textual matches against their attributes and text content. They are very powerful as well as quite general.

•    If you are a web developer, then you probably link to CSS selectors as the most natural choice for examining HTML. These are the same patterns used in Cascading Style Sheets documents to describe the set of elements to which each set of styles should be applied.

•    Both lxml and BeautifulSoup, as we have seen, provide a smattering of their own methods for finding document elements.

Here are standards and descriptions for each of the selector styles just described— first, XPath:

    http://www.w3.org/TR/xpath/

    http://codespeak.net/lxml/tutorial.html#using-xpath-to-find-text

    http://codespeak.net/lxml/xpathxslt.html

And here are some CSS selector resources:

    http://www.w3.org/TR/CSS2/selector.html

    http://codespeak.net/lxml/cssselect.html

And, finally, here are links to documentation that looks at selector methods peculiar to lxml and BeautifulSoup:

    http://codespeak.net/lxml/tutorial.html#elementpath

  http://www.crummy.com/software/BeautifulSoup/documentation.html#Searching the Parse Tree

The National Weather Service has not been kind to us in constructing this web page. The area that contains the current conditions seems to be constructed entirely of generic untagged elements; none of them have id or class values like currentConditions or temperature that might help guide us to them.

Well, what are the features of the elements that contain the current weather conditions in Listing 10–3? The first thing I notice is that the enclosing <td> element has the class "big". Looking at the page visually, I see that nothing else seems to be of exactly that font size; could it be so simple as to search the document for every <td> with this CSS class? Let us try, using a CSS selector to begin with:

>>> from lxml.cssselect import CSSSelector

>>> sel = CSSSelector('td.big')

>>> sel(tree)

[<Element td at b72ec0a4>]

Perfect! It is also easy to grab elements with a particular class attribute using the peculiar syntax of BeautifulSoup:

>>> soup.find('td', 'big')

<td class="big" width="120" align="center">

<font size="3" color="000066">

A Few Clouds<br />

<br />71&deg;F<br />(22&deg;C)</font></td>

Writing an XPath selector that can find CSS classes is a bit difficult since the class="" attribute contains space-separated values and we do not know, in general, whether the class will be listed first, last, or in the middle.

>>> tree.xpath(".//td[contains(concat(' ', normalize-space(@class), ' '), ' big ')]")

[<Element td at a567fcc>]

This is a common trick when using XPath against HTML: by prepending and appending spaces to the class attribute, the selector assures that it can look for the target class name with spaces around it and find a match regardless of where in the list of classes the name falls.

Selectors, then, can make it simple, elegant, and also quite fast to find elements deep within a document that interest us. And if they break because the document is redesigned or because of a corner case we did not anticipate, they tend to break in obvious ways, unlike the tedious and deep procedure of walking the document tree that we attempted first.

Once you have zeroed in on the part of the document that interests you, it is generally a very simple matter to use the ElementTree or the old BeautifulSoup API to get the text or attribute values you need. Compare the following code to the actual tree shown in Listing 10–3:

>>> td = sel(tree)[0]

>>> td.find('font').text

'\nA Few Clouds'

>>> td.find('font').findall('br')[1].tail

u'71°F'

If you are annoyed that the first string did not return as a Unicode object, you will have to blame the ElementTree standard; the glitch has been corrected in Python 3! Note that ElementTree thinks of text strings in an HTML file not as entities of their own, but as either the .text of its parent element or the .tail of the previous element. This can take a bit of getting used to, and works like this:

<p>

  My favorite play is    # the <p> element's .text

  <i>

    Hamlet                 # the <i> element's .text

  </i>

  which is not really      # the <i> element's .tail

  <b>

    Danish                 # the <b> element's .text

  </b>

  but English.             # the <b> element's .tail

</p>


This can be confusing because you would think of the three words favorite and really and English as being at the same “level” of the document—as all being children of the <p> element somehow—but lxml considers only the first word to be part of the text attached to the <p> element, and considers the other two to belong to the tail texts of the inner <i> and <b> elements. This arrangement can require a bit of contortion if you ever want to move elements without disturbing the text around them, but leads to rather clean code otherwise, if the programmer can keep a clear picture of it in her mind.

BeautifulSoup, by contrast, considers the snippets of text and the <br> elements inside the <font> tag to all be children sitting at the same level of its hierarchy. Strings of text, in other words, are treated as phantom elements. This means that we can simply grab our text snippets by choosing the right child nodes:

>>> td = soup.find('td', 'big')

>>> td.font.contents[0]

u'\nA Few Clouds'

>>> td.font.contents[4]

u'71&deg;F'

Through a similar operation, we can direct either lxml or BeautifulSoup to the humidity datum. Since the word Humidity: will always occur literally in the document next to the numeric value, this search can be driven by a meaningful term rather than by something as random as the big CSS tag. See Listing 10–4 for a complete screen-scraping routine that does the same operation first with lxml and then with BeautifulSoup.

This complete program, which hits the National Weather Service web page for each request, takes the city name on the command line:

$ python weather.py Springfield, IL

Condition:

Traceback (most recent call last):

  ..

AttributeError: 'NoneType' object has no attribute 'text'

And here you can see, superbly illustrated, why screen scraping is always an approach of last resort and should always be avoided if you can possibly get your hands on the data some other way: because presentation markup is typically designed for one thing—human readability in browsers—and can vary in crazy ways depending on what it is displaying.

What is the problem here? A short investigation suggests that the NWS page includes only a <font> element inside of the <tr> if—and this is just a guess of mine, based on a few examples—the description of the current conditions is several words long and thus happens to contain a space. The conditions in Phoenix as I have written this chapter are “A Few Clouds,” so the foregoing code has worked just fine; but in Springfield, the weather is “Fair” and therefore does not need a <font> wrapper around it, apparently.

Listing 10–4. Completed Weather Scraper

#!/usr/bin/env python

# Foundations of Python Network Programming - Chapter 10 - weather.py

# Fetch the weather forecast from the National Weather Service.

import sys, urllib, urllib2

import lxml.etree

from lxml.cssselect import CSSSelector

from BeautifulSoup import BeautifulSoup

if len(sys.argv) < 2:

    print >>sys.stderr, 'usage: weather.py CITY, STATE'

    exit(2)

data = urllib.urlencode({'inputstring': ' '.join(sys.argv[1:])})

info = urllib2.urlopen('http://forecast.weather.gov/zipcity.php', data)

content = info.read()

# Solution #1

parser = lxml.etree.HTMLParser(encoding='utf-8')

tree = lxml.etree.fromstring(content, parser)

big = CSSSelector('td.big')(tree)[0]

if big.find('font') is not None:

    big = big.find('font')

print 'Condition:', big.text.strip()

print 'Temperature:', big.findall('br')[1].tail

tr = tree.xpath('.//td[b="Humidity"]')[0].getparent()

print 'Humidity:', tr.findall('td')[1].text

print

# Solution #2

soup = BeautifulSoup(content)  # doctest: +SKIP

big = soup.find('td', 'big')

if big.font is not None:

    big = big.font

print 'Condition:', big.contents[0].string.strip()

temp = big.contents[3].string or big.contents[4].string  # can be either

print 'Temperature:', temp.replace('°', ' ')

tr = soup.find('b', text='Humidity').parent.parent.parent

print 'Humidity:', tr('td')[1].string

print

If you look at the final form of Listing 10–4, you will see a few other tweaks that I made as I noticed changes in format with different cities. It now seems to work against a reasonable selection of locations; again, note that it gives the same report twice, generated once with lxml and once with BeautifulSoup:

$ python weather.py Springfield, IL

Condition: Fair

Temperature: 54 °F

Humidity: 28 %</code>

<code>Condition: Fair

Temperature: 54  F

Humidity: 28 %

$ python weather.py Grand Canyon, AZ

Condition: Fair

Temperature: 67°F

Humidity: 28 %

Condition: Fair

Temperature: 67 F

Humidity: 28 %

You will note that some cities have spaces between the temperature and the F, and others do not. No, I have no idea why. But if you were to parse these values to compare them, you would have to learn every possible variant and your parser would have to take them into account.

I leave it as an exercise to the reader to determine why the web page currently displays the word “NULL”—you can even see it in the browser—for the temperature in Elk City, Oklahoma. Maybe that location is too forlorn to even deserve a reading? In any case, it is yet another special case that you would have to treat sanely if you were actually trying to repackage this HTML page for access from an API:

$ python weather.py Elk City, OK

Condition: Fair and Breezy

Temperature: NULL

Humidity: NA

Condition: Fair and Breezy

Temperature: NULL

Humidity: NA

I also leave as an exercise to the reader the task of parsing the error page that comes up if a city cannot be found, or if the Weather Service finds it ambiguous and prints a list of more specific choices!

Summary

Although the Python Standard Library has several modules related to SGML and, more specifically, to HTML parsing, there are two premier screen-scraping technologies in use today: the fast and powerful lxml library that supports the standard Python “ElementTree” API for accessing trees of elements, and the quirky BeautifulSoup library that has powerful API conventions all its own for querying and traversing a document.

If you use BeautifulSoup before 3.2 comes out, be sure to download the most recent 3.0 version; the 3.1 series, which unfortunately will install by default, is broken and chokes easily on HTML glitches.

Screen scraping is, at bottom, a complete mess. Web pages vary in unpredictable ways even if you are browsing just one kind of object on the site—like cities at the National Weather Service, for example.

To prepare to screen scrape, download a copy of the page, and use HTML tidy, or else your screen-scraping library of choice, to create a copy of the file that your eyes can more easily read. Always run your program against the ugly original copy, however, lest HTML tidy fixes something in the markup that your program will need to repair!

Once you find the data you want in the web page, look around at the nearby elements for tags, classes, and text that are unique to that spot on the screen. Then, construct a Python command using your scraping library that looks for the pattern you have discovered and retrieves the element in question. By looking at its children, parents, or enclosed text, you should be able to pull out the data that you need from the web page intact.

When you have a basic script working, continue testing it; you will probably find many edge cases that have to be handled correctly before it becomes generally useful. Remember: when possible, always use true APIs, and treat screen scraping as a technique of last resort!

Source: http://rhodesmill.org/brandon/chapters/screen-scraping/


Friday, 22 May 2015

Web scraping using Python without using large frameworks like Scrapy

scrapy-big-logoIf you need publicly available data from scraping the Internet, before creating a webscraper, it is best to check if this data is already available from public data sources or APIs. Check the site’s FAQ section or Google for their API endpoints and public data.

Even if their API endpoints are available you have to create some parser for fetching and structuring the data according to your needs.

Scrapy is a well established framework for scraping, but it is also a very heavy framework. For smaller jobs, it may be overkill and for extremely large jobs it is very slow.

So if you would like to roll up your sleeves and build your own scraper, continue reading.

Here are some basic steps performed by most webspiders:

1) Start with a URL and use a HTTP GET or PUT request to access the URL
2) Fetch all the contents in it and parse the data
3) Store the data in any database or put it into any data warehouse
4) Enqueue all the URLs in a page
5) Use the URLs in queue and repeat from process 1
Here are the 3 major modules in every web crawler:
1) Request/Response handler.
2) Data parsing/data cleansing/data munging process.
3) Data serialization/data pipelines.

Lets look at each of these modules and see what they do and how to use them.

Request/Response handler

Request/response handlers are managers who make http requests to a url or a group of urls, and fetch the response objects as html contents and pass this data to the next module. If you use Python for performing request/response url-opening process libraries such as the following are most commonly used

1) urllib(20.5. urllib – Open arbitrary resources by URL – Python v2.7.8 documentation) -Basic python library yet high-level interface for fetching data across the World Wide Web.

2) urllib2(20.6. urllib2 – extensible library for opening URLs – Python v2.7.8 documentation) – extensible library of urllib, which would handle basic http requests, digest authentication, redirections, cookies and more.

3) requests(Requests: HTTP for Humans) – Much advanced request library

which is built on top of basic request handling libraries.

Data parsing/data cleansing/data munging process

This is the module where the fetched data is processed and cleaned. Unstructured data is transformed into structured during this processing. Usually  a set of Regular Expressions (regexes) which perform pattern matching and text processing tasks on the html data are used for this processing.

In addition to regexes, basic string manipulation and search methods are also used to perform this cleaning and transformation. You must have a thorough knowledge of regular expressions and so that you could design the regex patterns.

Data serialization/data pipelines

Once you get the cleaned data from the parsing and cleaning module, the data serialization module will be used to serialize the data according to the data models that you require. This is the final module that will output data in a standard format that can be stored in databases, JSON/CSV files or passed to any data warehouses for storage. These tasks are usually performed by libraries listed below

1) pickle (pickle – Python object serialization) –  This module implements a fundamental, but powerful algorithm for serializing and de-serializing a Python object structure

2) JSON (JSON encoder and decoder)

3) CSV (https://docs.python.org/2/library/csv.html)

4) Basic database interface libraries like pymongo (Tutorial – PyMongo),mysqldb ( on python.org), sqlite3(sqlite3 – DB-API interface for SQLite databases)

And many more such libraries based on the format and database/data storage.

Basic spider rules

The rules to follow while building a spider are to be nice to the sites you are scraping and follow the rules in the site’s spider policies outlined in the site’s robots.txt.

Limit the  number of requests in a second and build enough delays in the spiders so that  you don’t adversely affect the site.

It just makes sense to be nice.

We will cover more techniques in future articles

Source: http://learn.scrapehero.com/webscraping-using-python-without-using-large-frameworks-like-scrapy/

Tuesday, 19 May 2015

How to Generate Sales Leads Using Web Scraping Services

The first stage of any selling process is what is popularly known as “lead generation”. This phase is what most businesses place at the apex of their sales concerns. It is a driving force that governs decision-making at its highest levels, and influences business strategy and planning. If you are about to embark on an outbound sales campaign and are in the process of looking for leads, you would acknowledge the fact that lead generation process is of extreme importance for any business.

Different lead generation techniques have been used over and over again by companies around the world to satiate this growing business need. Newer, more innovative methods have also emerged to help marketers in this process. One such method of lead generation that is fast catching on, and is poised to play a big role for businesses in the coming years, is web scraping. With web scraping, you can easily get access to multiple relevant and highly customized leads – a perfect starting point for any marketing, promotional or sales campaign.

The prominence of Web Scraping in overall marketing strategy

At present, levels of competition have risen sky high for most businesses. For success, lead generation and gaining insight about customer behavior and preferences is an essential business requirement. Web scraping is the process of scraping or mining the internet for information. Different tools and techniques can be used to harvest information from multiple internet sources based on relevance, and the structured and organized in a way that makes sense to your business. Companies that provide web scraping services essentially use web scrapers to generate a targeted lead database that your company can then integrate into its marketing and sales strategies and plans.

The actual process of web scraping involves creating scraping scripts or algorithms which crawl the web for information based on certain preset parameters and options. The scraping process can be customized and tuned towards finding the kind of data that your business needs. The script can extract data from websites automatically, collate and put together a meaningful collection of leads for business development.

Lead Generation Basics

At a very high level, any person who has the resources and the intent to purchase your product or service qualifies as a lead. In the present scenario, you need to go far deeper than that. Marketers need to observe behavior patterns and purchasing trends to ensure that a particular person qualifies as a lead. If you have a group of people you are targeting, you need to decide who the viable leads will be, acquire their contact information and store it in a database for further action.

List buying used to be a popular way to get leads, but their efficacy has dwindled over time. Web scraping is the fast coming up as a feasible lead generation technique, allowing you to find highly focused and targeted leads in short amounts of time. All you need is a service provider that would carry out the data mining necessary for lead generation, and you end up with a list of actionable leads that you can try selling to.

How Web Scraping makes a substantial difference

With web scraping, you can extract valuable predictive information from websites. Web scraping facilitates high quality data collection and allows you to structure marketing and sales campaigns better. To drive sales and maximize revenue, you need strong, viable leads. To facilitate this, you need critical data which encompasses customer behavior, contact details, buying patterns and trends, willingness and ability to spend resources, and a myriad of other aspects critical to ascertain the potential of an entity as a rewarding lead. Data mining through web scraping can be a great way to get to these factors and identifying the leads that would make a difference for your business.

Crawling through many different web locales using different techniques, web scraping services pick up a wealth of information. This highly relevant and specialized information instantly provides your business with actionable leads. Furthermore, this exercise allows you to fine-tune your data management processes, make more accurate and reliable predictions and projections, arrive at more effective, strategic and marketing decisions and customize your workflow and business development to better suit the current market.

The Process and the Tools

Lead generation, being one of the most important processes for any business, can prove to be an expensive proposition if not handled strategically. Companies spend large amounts of their resources acquiring viable leads they can sell to. With web scraping, you can dramatically cut down the costs involved in lead generation and take your business forward with speed and efficiency. Here are some of the time-tested web scraping tools which can come in handy for lead generation –

•    Website download software – Used to copy entire websites to local storage. All website pages are downloaded and the hierarchy of navigation and internal links preserve. The stored pages can then be viewed and scoured for information at any later time.     Web scraper – Tools that crawl through bulk information on the internet, extracting specific, relevant data using a set of pre-defined parameters.

•    Data grabber – Sifts through websites and databases fast and extracts all the information, which can be sorted and classified later.

•    Text extractor – Can be used to scrape multiple websites or locations for acquiring text content from websites and web documents. It can mine data from a variety of text file formats and platforms.

With these tools, web scraping services scrape websites for lead generation and provide your business with a set of strong, actionable leads that can make a difference.

Covering all Bases

The strength of web scraping and web crawling lies in the fact that it covers all the necessary bases when it comes to lead generation. Data is harvested, structured, categorized and organized in such a way that businesses can easily use the data provided for their sales leads. As discussed earlier, cold and detached lists no longer provide you with enough actionable leads. You need to look at various factors and consider them during your lead generation efforts –

•    Contact details of the prospect

•    Purchasing power and purchasing history of the prospect

•    Past purchasing trends, willingness to purchase and history of buying preferences of the prospect

•    Social markers that are indicative of behavioral patterns

•    Commercial and business markers that are indicative of behavioral patterns

•    Transactional details

•    Other factors including age, gender, demography, social circles, language and interests

All these factors need to be taken into account and considered in detail if you have to ensure whether a lead is viable and actionable, or not. With web scraping you can get enough data about every single prospect, connect all the data collected with the help of onboarding, and ascertain with conviction whether a particular prospect will be viable for your business.

Let us take a look at how web scraping addresses these different factors –

1. Scraping website’s

During the scraping process, all websites where a particular prospect has some participation are crawled for data. Seemingly disjointed data can be made into a sensible unit by the use of onboarding- linking user activities with their online entities with the help of user IDs. Documents can be scanned for participation. E-commerce portals can be scanned to find comments and ratings a prospect might have delivered to certain products. Service providers’ websites can be scraped to find if the prospect has given a testimonial to any particular service. All these details can then be accumulated into a meaningful data collection that is indicative of the purchasing power and intent of the prospect, along with important data about buying preferences and tastes.

2. Social scraping

According to a study, most internet users spend upwards of two hours every day on social networks. Therefore, scraping social networks is a great way to explore prospects in detail. Initially, you can get important identification markers like names, addresses, contact numbers and email addresses. Further, social networks can also supply information about age, gender, demography and language choices. From this basic starting point, further details can be added by scraping social activity over long periods of time and looking for activities which indicate purchasing preferences, trends and interests. This exercise provides highly relevant and targeted information about prospects can be constructively used while designing sales campaigns.

3. Transaction scraping

Through the scraping of transactions, you get a clear idea about the purchasing power of prospects. If you are looking for certain income groups or leads that invest in certain market sectors or during certain specific periods of time, transaction scraping is the best way to harvest meaningful information. This also helps you with competition analysis and provides you with pointers to fine-tune your marketing and sales strategies.

Using these varied lead generation techniques and finding the right balance and combination is key to securing the right leads for your business. Overall, signing up for web scraping services can be a make or break factor for your business going forward. With a steady supply of valuable leads, you can supercharge your sales, maximize returns and craft the perfect marketing maneuvers to take your business to an altogether new dimension.

Source: https://www.promptcloud.com/blog/how-to-generate-sales-leads-using-web-scraping-services/

Sunday, 17 May 2015

Metadata Scraping Service

As mentioned in Robert's last blog post we set up a scraping service which supports users working with citations by extracting automatically references from digital library or publisher websites. We use a very similar service in BibSonomy to support our users while posting a new reference. However, the service is independent from BibSonomy. Our main goal is to make the metadata of other websites easily accessible to every user who needs bibliographic metadata. Therefore we offer the extracted information in BibTeX format. Most tools allow to import BibTeX so it should be very easy for everyone to get the data into his own tool. The service is running under the following URL:

http://scraper.bibsonomy.org/

Currently we support more than 60 different websites (here the full list) and we are working on further extensions. In the near future we will make the source code of our scrapers publicly available under GPL and we hope that other people will find it useful and start to help us by implementing their own scrapers.

How does the service work?

In principle there are two ways to use the service. One uses a so

called bookmarklet and the other is simply based on the URL. If you

have a webpage of a supported site e.g. from ACM digital library the

following page:

Logsonomy - social information retrieval with logdata

then you can copy this URL into the form on the service homepage and the service will return you the extracted BibTeX information. As this is not a very convenient way to access the data we provide a ScrapePublication button. This button is a small piece of JavaScript and can be copied to the toolbar of the browser. By pressing this button while visiting a digital library webpage the URL will be automatically copied and sent to the scraping service and the metadata is extracted.

The service has three options which can be used to customize it and to make it useful for other systems. Obviously one parameter is the URL itself which is used by the bookmarklet, too. The next is the selection parameter which allows to send text to the service and the last parameter allows to change the output format from html to plain BibTeX. This last parameter makes integration with other systems very simple.

If needed we can provide the metadata in other formats as well but currently we support only BibTeX.

Source: http://blog.bibsonomy.org/2008/11/metadata-scraping-service.html

Thursday, 30 April 2015

Customized Web Data Extraction Solutions for Business

As you begin leading your business on the path to success, competitive analysis forms a major part of your homework. You have already mobilized your efforts in finding the appropriate website data scrapping tool that will help you to collect relevant data from competitive websites and shape them up into useable information. There is however a need to look for a customized approach in your search for Data Extraction tools in order to leverage its benefits in the best possible way.

Off-the-shelf Tools Impede Data Extraction

 In the current scenario, Internet Technologies are evolving in abundance. Every organization leverages this development and builds their websites using a different programming language and technology. Off-the-shelf Website Data extraction tools are unable to interpret this difference. They fail to understand the data elements that need to be captured and end up in gathering data without any change in the software source codes.

As a result of this incapability in their technology, off-the-shelf solutions often deliver unclean, incomplete and also inaccurate data. Developers need to contribute a humungous effort in cleaning up and structuring the data to make it useable. However, despite the time-consuming activity, data seldom metamorphoses into the desired information. Also the personnel dealing with the clean-up process needs to have sufficient technical expertise in order to participate in the activities. The endeavor however results in an impediment to the whole process of data extraction leaving you thirsting for the required information to augment business growth.

Understanding how Web Extraction tools work

Web Scrapping tools are designed to extract data from the web automatically. They are usually small pieces of code written using programming languages such as Python, Ruby or PHP depending upon the expertise of the community building it. There are however several single-click models available which tends to make life easier for non-technical personnel.

The biggest challenge faced by a successful web extractor tool is to know how to tackle the right page and the right elements on that page in order to extract the desired information. Consequently, a web extractor needs to be designed to understand the anatomy of a web page in order to accomplish its task successfully. It should be designed to interpret the meaning of HTML elements like , table rows () within those tables, and table data (<td>) cells within those rows in order to extract the exact data. It will also be interfacing with the

element which are blocks of text and know how to extract the desired information from it.

Customized Solutions for your business

 Customized Solutions are provided by most Data Scraping experts. These software's help to minimize the cumbersome effort of writing elaborate codes to successfully accomplish the feat of data extraction. They are designed to seamlessly search competitive websites,identify relevant data elements, and extract appropriate data that will be useful for your business. Owing to their focused approach, these tools provide clean and accurate data thereby eliminating the need to waste valuable time and effort in any clean-up effort.

Most customized data extraction tools are also capable of delivering the extracted data in customized formats like XML or CSV. It also stores data in local databases like Microsoft Access, MySQL, or Microsoft SQL.

Customized Data scraping solutions therefore help you take accurate and informed decisions in order to define effective business strategies.

Source: http://scraping-solutions.blogspot.in/2014_07_01_archive.html 

Tuesday, 28 April 2015

A Guide to Web Scraping Tools

Web Scrapers are tools designed to extract / gather data in a website via crawling engine usually made in Java, Python, Ruby and other programming languages.Web Scrapers are also called as Web Data Extractor, Data Harvester , Crawler and so on which most of them are web-based or can be installed in local desktops.

Its main purpose is to enable webmasters, bloggers, journalist and virtual assistants to harvest data from a certain website whether text, numbers, contact details and images in a structured way which cannot be done easily thru manual copy and paste method. Typically, it transforms the unstructured data on the web, from HTML format into a structured data stored in a local database or spreadsheet or automates web human browsing.

Web Scraper Usage

Web Scrapers are also being used by SEO and Online Marketing Analyst to pull out some data privately from the competitor’s website such as high targeted keywords, valuable links, emails & traffic sources that were also perform by SEOClerk, Google and many other web crawling sites.

Includes:

•    Price comparison
•    Weather data monitoring
•    Website change detection
•    Research
•    Web mash up
•    Info graphics
•    Web data integration
•    Web Indexing & rank checking
•    Analyze websites quality links

List of Popular Web Scrapers

There are hundreds of Web Scrapers today available for both commercial and personal use. If you’ve never done any web scraping before, there are basic

Web scraping tools like YahooPipes, Google Web Scrapers and Outwit Firefox extensions that it’s good to start with but if you need something more flexible and has extra functionality then,  check out the following:

HarvestMan [ Free Open Source]

HarvestMan is a web crawler application written in the Python programming language. HarvestMan can be used to download files from websites, according to a number of user-specified rules. The latest version of HarvestMan supports as much as 60 plus customization options. HarvestMan is a console (command-line) application. HarvestMan is the only open source, multithreaded web-crawler program written in the Python language. HarvestMan is released under the GNU General Public License.Like Scrapy, HarvestMan is truly flexible however, your first installation would not be easy.

Scraperwiki [Commercial]

Using a minimal programming you will be able to extract anything. Off course, you can also request a private scraper if there’s an exclusive in there you want to protect. In other words, it’s a marketplace for data scraping.

Scraperwiki is a site that encourages programmers, journalists and anyone else to take online information and turn it into legitimate datasets. It’s a great resource for learning how to do your own “real” scrapes using Ruby, Python or PHP. But it’s also a good way to cheat the system a little bit. You can search the existing scrapes to see if your target website has already been done. But there’s another cool feature where you can request new scrapers be built.  All in all, a fantastic tool for learning more about scraping and getting the desired results while sharpening your own skills.

Best use: Request help with a scrape, or find a similar scrape to adapt for your purposes.

FiveFilters.org [Commercial]   

Is an online web scraper available for commercial use. Provides easy content extraction using Full-Text RSS tool which can identify and extract web content (news articles, blog posts, Wikipedia entries, and more) and return it in an easy to parse format. Advantages; speedy article extraction, Multi-page support, has a Autodetection and  you can deploy  on the cloud server without database required.

Kimono

Produced by Kimono labs this tool lets you convert data to into apis for automated export.   Benjamin Spiegel did a great Youmoz post on how to build a custom ranking tool with Kimono, well worth checking out!

Mozenda [Commercial]

This is a unique tool for web data extraction or web scarping.Designed for easiest and fastest way of getting data from the web for everyone. It has a point & click interface and with the power of the cloud you can scrape, store, and manage your data all with Mozenda’s incredible back-end hardware. More advance, you can automate your data extraction leaving without a trace using Mozenda’s  anonymous proxy feature that could rotate tons of IP’s .

Need that data on a schedule? Every day? Each hour? Mozenda takes the hassle out of automating and publishing extracted data. Tell Mozenda what data you want once, and then get it however frequently you need it. Plus it allows advanced programming using REST API the user can connect directly Mozenda account.

Mozenda’s Data Mining Software is packed full of useful applications especially for sales people. You can do things such as “lead generation, forecasting, acquiring information for establishing budgets, competitor pricing analysis. This software is a great companion for marketing plan & sales plan creating.

Using Refine Capture tetx tool, Mozenda is smart enough to filter the text you want stays clean or get  the specific text or split them into pieces.

80Legs [Commercial]

The first time I heard about 80Legs my mind really got confused of what really this software does. 80Legs like Mozenda is a web-based data extraction  tool with customizable features:

•    Select which websites to crawl by entering URLs or uploading a seed list
•    Specify what data to extract by using a pre-built extractor or creating your own
•    Run a directed or general web crawler
•    Select how many web pages you want to crawl
•    Choose specific file types to analyze

80 legs offers customized web crawling that lets you get very specific about your crawling parameters, which tell 80legs what web pages you want to crawl and what data to collect from those web pages and also the general web crawling which can collect data like web page content, outgoing links and other data. Large web crawls take advantage of 80legs’ ability to run massively parallel crawls.

Also crawls data feeds and offers web extraction design services. (No installation needed)

ScrapeBox [Commercial]

ScrapeBox are most popular web scraping tools to SEO experts, online marketers and even spammers with its very user-friendly interface you can easily harvest data from a website;

•    Grab Emails
•    Check page rank
•    Checked high value backlinks
•    Export URLS
•    Checked Index
•    Verify working proxies
•    Powerful RSS Submission

Using thousands of rotating proxies you will be able to sneak on the competitor’s site keywords, do research on .gov sites, harvesting data, and commenting without getting blocked.

The latest updates allow the users to spin comments and anchor text to avoid getting detected by search engines.

You can also check out my guide to using Scrapebox for finding guest posting opportunities:

Scrape.it [Commercial]

Using a simple point & click Chrome Extension tool, you can extract data from websites that render in javascript. You can automate filling out forms, extract data from popups, navigate and crawl links across multiple pages, extract images from even the most complex websites with very little learning curve. Schedule jobs to run at regular intervals.

When a website changes layout or your web scraper stops working, scrape.it  will fix it automatically so that you can continue to receive data uninterrupted and without the need for you to recreate or edit it yourself.

They work with enterprises using our own tool that we built to deliver fully managed solutions for competitive pricing analysis, business intelligence, market research, lead generation, process automation and compliance & risk management requirements.

Features:

    Very easy web date extraction with Windows like Explorer interface

    Allowing you to extract text, images and files from modern Web 2.0 and HTML5 websites which uses Javascript & AJAX.

    The user could select what features they’re going to pay with

    lifetime upgrade and support at no extra charge on premium license

Scrapy [Free Open Source]

Off course the list would not be cool without Scrapy, it is a fast high-level screen scraping and web crawling framework, used to crawl websites and extract structured data from their pages. It can be used for a wide range of purposes, from data mining to monitoring and automated testing.

Features:

•         Design with simplicity- Just writes the rules to extract the data from web pages and let Scrapy crawl the entire web site. It can crawl 500 retailers’ sites daily.

•         Ability to attach new code for extensibility without having to touch the framework core

•         Portable, open-source, 100% Python- Scrapy is completely written in Python and runs on Linux, Windows, Mac and BSD

•         Scrapy comes with lots of functionality built in.

•         Scrapy is extensively documented and has an comprehensive test suite with very good code coverage

•         Good community and commercial support

 Cons: The installation process is hard to perfect especially for beginners

Needlebase [Commercial]

Many organizations, from private companies to government agencies, store their info in a searchable database that requires you navigate a list page listing results, and a detail page with more information about each result.  Grabbing all this information could result in thousands of clicks, but as long as it fits the same formula, Needlebase can do it for you.  Point and click on example data from one page once to show Needlebase how your site is structured, and it will use that pattern to extract the information you’re looking for into a dataset.  You can query the data through Needle’s site, or you can output it as a CSV or other file format of your choice.  Needlebase can also rerun your scraper every day to continuously update your dataset.

OutwitHub [Free]

This Firefox extension is one of the more robust free products that exists Write your own formula to help it find information you’re looking for, or just tell it to download all the PDFs listed on a given page.  It will suggest certain pieces of information it can extract easily, but it’s flexible enough for you to be very specific in directing it.  The documentation for Outwit is especially well written, they even have a number of tutorials for what you might be looking to do.  So if you can’t easily figure out how to accomplish what you want, investing a little time to push it further can go a long way.

Best use: more text

irobotsoft [Free}

This is a free program that is essentially a GUI for web scraping. There’s a pretty steep learning curve to figure out how to work it, and the documentation appears to reference an old version of the software. It’s the latest in a long tradition of tools that lets a user click through the logic of web scraping. Generally, these are a good way to wrap your head around the moving parts of a scrape, but the products have drawbacks of their own that makes them little easier than doing the same thing with scripts.

Cons: The documentation seems outdated

Best use: Slightly complex scrapes involving multiple layers.

iMacros [Free]

The  same ethos on how microsoft macros works, iMacros automates repetitive task.Whether you choose the website, Firefox extension, or Internet Explorer add-on flavor of this tool, it can automate navigating through the structure of a website to get to the piece of info you care about. Record your actions once, navigating to a specific page, and entering a search term or username where appropriate.  Especially useful for navigating to a specific stock you care about, or campaign contribution data that’s mired deep in an agency website and lacks a unique Web address.  Extract that key piece (pieces) of info into a usable form.  Can also help convert Web tables into usable data, but OutwitHub is really more suited to that purpose.  Helpful video and text tutorials enable you to get up to speed quickly.

Best use: Eliminate repetition in navigating to a particular datapoint in a website that you’re checking up on often by recording a repeatable action that pulls the datapoint out of the clutter it’s naturally surrounded by.

InfoExtractor [Commercial]

This is a neat little web service that generates all sorts of information given a list of urls. Currently, it only works for YouTube video pages, YouTube user profile pages, Wikipedia entries, Huffingtonpost posts, Blogcatalog blog posts and The Heritage Foundation blog (The Foundry). Given a url, the tool will return structured information including title, tags, view count, comments and so on.

Google Web Scraper [Free]

A browser-based web scraper works like Firefox’s Outwit Hub, it’s designed for plain text extraction from any online pages and export to spreadsheets via Google docs. Google Web Scraper can be downloaded as an extension and you can install it in your Chrome browser without seconds. To use it: highlight a part of the webpage you’d like to scrape, right-click and choose “Scrape similar…”. Anything that’s similar to what you highlighted will be rendered in a table ready for export, compatible with Google Docs™. The latest version still had some bugs on spreadsheets.

Cons: It doesn’t work for images and sometimes it can’t perform well on huge volume of text but it’s easy and fast to use.


Tutorials:

Scraping Website Images Manually using Google Inspect Elements

The main purpose of Google Inspect Elements is for debugging like the Firefox Firebug however, if you’re flexible you can use this tool also for harvesting images in a website. Your main goal is to get the specific images like web backgrounds, buttons, banners, header images and product images which is very useful for web designers.

Now, this is a very easy task. First, you will definitely need to download and install the Google Chrome browser in your computer. After the installation do the following:

1. Open the desired webpage in Google Chrome

2. Highlight any part of the website and right click > choose Google Inspect Elements

3. In the Google Inspect Elements, go to Resources tab

4. Under Resources tab, expand all folders. You will eventually see script folders and IMAGES folders

5. In the Images folders, just use arrow keys to find the images you need to have (see the screenshot above)

6. Next, right click the images and choose Open the Image in New Tab

7. Finally, right click the image > choose Save Image As… . (save to your local folder)

You’re done!

How to Extract Links from a Web Page with OutWit Hub

In this tutorial we are going to learn how to extract links from a webpage with OutWit Hub.

Sometimes it can be useful to extract all links from a given web page. OutWit Hub is the easiest way to achieve this goal.

1. Launch OutWit Hub

If you haven’t installed OutWit Hub yet, please refer to the Getting Started with OutWit Hub tutorial.

Begin by launching OutWit Hub from Firefox. Open Firefox then click on the OutWit Button in the toolbar.

If the icon is not visible go to the menu bar and select Tools -> OutWit -> OutWit Hub

OutWit Hub will open displaying the Web page currently loaded on Firefox.


2. Go to the Desired Web Page

In the address bar, type the URL of the Website.

Go to the Page view where you can see the Web page as it would appear in a traditional browser.

Now, select “Links” from the view list.

In the “Links” widget, OutWit Hub displays all the links from the current page.

If you want to export results to Excel, just select all links using ctrl/cmd + A, then copy using ctrl/cmd + C and paste it in Excel (ctrl/cmd + V).

Source: http://www.garethjames.net/a-guide-to-web-scrapping-tools/