Thursday, September 17, 2009

Scraping in Python

Posted by Danny Tarlow
Not every website out there has their data available via a nice API. Now, I wish it weren't the case, but I can think of several good reasons for a company or organization not to release their data:
  1. The data is valuable and provides a competitive advantage to the company.
  2. It would require hardware upgrades to support lots of downloads or additional requests.
  3. There aren't enough engineering resources at the company to make it worthwhile to devote an engineer to building out an API.
At the same time, though, most websites show you a good bit of data in some format or another every time you visit their site: Google gives you a small data set of web pages relevant to a given keyword every time you enter a query; Digg gives you a data set of recent, popular links; Twitter gives you a data set of recent events from your sphere; you get the point.

Given that all of the sites I listed above have APIs, I have to think that (1) isn't actually as big of a concern as it might seem, at least as long as there are reasonable limits that keep somebody from flat-out duplicating the site. Anyhow, whatever the reason, sometimes the best way to get your hands on some data is to crawl the website and scrape the data from the raw HTML.

Now, there is some touchy legal ground involved with scraping. For example, the law firm WilmerHale (I don't know anything about the firm) has a 2003 article about the legality of web scraping:
Based on these cases, it would appear that anyone who, without authorization, uses a web "scraper" or similar computer program to access and download data from a third party website risks potential and perhaps serious legal claims from the website operator. However, the cases suggest that, for website operators that wish to protect the data available on their website, the failure to observe some basic precautions may compromise or even preclude such claims. Specifically:
  • website operators should ensure that their website terms and conditions specifically prohibit unauthorized access or downloading of data using any computer program; and
  • website operators should either clearly identify the terms and conditions of use on each webpage containing valuable data or provide an obvious link to a webpage with those conditions.
I do not in any way advocate using this code to scrape data from a website that disallows it. I also recommend that you contact the website administrators to get permission before scraping any data from any site.

With that out of the way, there may be some cases where you have permission to scrape data. This is the case I'm going to consider from here on out.

Now this should go without saying, but you definitely want to go easy on the site's servers. The last thing you want to do is fire off an accidental denial-of-service attack with thousands of requests a minute. Hopefully the server has some automated systems in place to deal with such a basic attack (e.g., by blocking you), but there's no reason to test it, especially when the site owner has so graciously allowed you to crawl their data. I've found that an average of one request per minute is reasonable for small crawl jobs, but some people advocate backing off even more if it's a big job. Yes, it might take weeks, but sometimes that's the price you have to pay.
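To make that concrete, here's a minimal sketch of the kind of throttling I have in mind. The helper name polite_sleep and the specific numbers (a roughly 60-second average with jitter) are just illustrative, not a recommendation tuned to any particular site:

import random
import time

def polite_sleep(base_seconds=45.0, jitter_seconds=30.0):
    # Wait roughly base_seconds between requests, plus some random jitter
    # so the gaps between hits don't form a perfectly regular pattern.
    time.sleep(base_seconds + jitter_seconds * random.random())

# In a crawl loop, call it after every request:
#   for url in urls:
#       fetch(url)
#       polite_sleep()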

There are other cases where you have permission to crawl the site, but the website has built-in mechanisms to block requests that appear to be crawlers. If a whitelist based on IP address or some other tag isn't already in place, it can be a bit of a pain to set one up, especially for sites where engineering resources are tight as it is. In this case, it may be helpful to make your requests look as much like typical web traffic as possible.

The two most useful things I've found to do in this case are:
  • Add some randomness to the visit frequency (beyond just waiting M + N * rand() seconds between requests).
  • Send realistic looking headers along with the request.
I've implemented a basic crawler that uses all of these strategies. You start by telling it the first page you want to visit, along with how many subsequent pages you want. You also need to define a pattern that pulls the link to the next page out of the most recently downloaded page. The crawler will then download a page, find the "next" link, download the next page, and so on, up to the number of pages you request. You still have to parse all of the HTML, but all of the pages will be there waiting for you, sitting in your output/ directory.

Remember, though, only use this in cases where you have permission from the website owner.
import sys
import time
import re
from subprocess import call
from datetime import datetime
import numpy as np


HEADERS = {
    "Host" : "www.SOME-SITE-HERE.com",
    "User-Agent" : "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.5; en-US; rv:1.9.0.10) Gecko/2009042315 Firefox/3.0.10",
    "Accept" : "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language" : "en-us,en;q=0.5",
    "Accept-Charset" : "ISO-8859-1,utf-8;q=0.7,*;q=0.7",
    "Keep-Alive" : "300",
    "Connection" : "keep-alive"
}


class GenericCrawler(object):

    base_url = 'http://www.SOME-SITE-HERE.com/'
    
    next_page_pattern = r'<a href="/([^"]*?)">Next Page</a>'


    def __init__(self, starting_page):
        self.starting_page = starting_page
        self.next_page_re = re.compile(self.next_page_pattern)


    def Get(self, num_pages):
        next_page_url = self.starting_page

        for i in range(num_pages):
            print "Page %s" % i
            self.FetchPage(i, next_page_url)
            next_page_url = self.NextLinkInLastResult()

            # Stop early if the last page didn't contain a "next" link.
            if next_page_url is None:
                print "No next-page link found; stopping."
                break


    def FetchPage(self, page_num, relative_url):
        
        request_url = '%s%s' % (self.base_url, relative_url)

        current_time = datetime.now()
        output_file = "output/%s_%s_%s_%s_%s_%s__p%s.html" % (self.starting_page,
                                                              current_time.year, current_time.month,
                                                              current_time.day, current_time.hour,
                                                              current_time.minute, page_num)
        self.last_result = output_file

        # Grab the page with curl and save it in the output/ directory.
        print request_url

        curl_args = ["curl", "-o", output_file]
        for h in HEADERS:
            curl_args.append('-H')
            curl_args.append("%s: %s" % (h, HEADERS[h]))
        curl_args.append(request_url)
            
        call(curl_args) 

        # Don't overload the server or trip the spider detector
        time.sleep(30 + 30 * np.random.random())

        if np.random.random() < .02:
            print "Taking a long break"
            time.sleep(300 + 300 * np.random.random())


    def NextLinkInLastResult(self):
        # Scan the most recently downloaded page for the "next page" link.
        with open(self.last_result, 'r') as f:
            for line in f:
                m = self.next_page_re.findall(line)
                if len(m) > 0:
                    print "Next page relative URL: ", m[0]
                    return m[0]

        return None


if __name__ == "__main__":

    starting_page = sys.argv[1]
    num_pages = int(sys.argv[2])

    h = GenericCrawler(starting_page)
    h.Get(num_pages)
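To run it, save the script (crawler.py is just a placeholder name), create the output/ directory first (curl won't create it for you), and pass the relative URL of the starting page plus the number of pages to fetch:

mkdir output
python crawler.py some/starting/page.html 20

Here some/starting/page.html and 20 are placeholder arguments; the relative URL gets appended to base_url, and the page count bounds the crawl.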

4 comments:

lorg said...

Several issues:
1. You might want to do that work with a proper HTML parser, such as BeautifulSoup, or lxml, or (shamelessly publicizing :) my own wrapper for lxml.

2. In your code, you'll be better served by using a queue to allow a breadth-first walk, and instead of specifying num_pages, specifying depth.

3. If you're already using Python, use urllib instead of an outside process (such as curl).

By the way, I started reading up your blog a short while ago, and I wanted to say, keep up the good work! I especially like your mathematical posts.

Danny Tarlow said...

Thanks lorg. I can always use some tips on the "right" way to do things like this.

And I have some new more mathy posts in the works. Stay tuned.

dwf said...

The tricky part is that the 'urlopen' method that people usually use from urllib2 is probably not going to suffice here, since you want to spoof real browser headers.

It looks like Request Objects will do the job, though.
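A minimal sketch of what that might look like, reusing the HEADERS dict from the post (illustrative only, not a drop-in replacement for the curl call):

import urllib2

def fetch_with_headers(url, headers):
    # Attach the spoofed browser headers to a Request object and
    # return the response body as a string.
    request = urllib2.Request(url, headers=headers)
    response = urllib2.urlopen(request)
    try:
        return response.read()
    finally:
        response.close()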

dwf said...

Hmm, strange that it decided to use just my 'nickname' with OpenID. Anyhow, welcome back to Toronto. :)