Saturday, September 26, 2009

Analysis of Pollster Fraud and Oklahoma Students

Posted by Danny Tarlow
I've been following with interest the recent series of posts at fivethirtyeight.com regarding the polling firm Strategic Vision, LLC. The short of it is that Nate Silver is presenting statistical evidence of anomalies in Strategic Vision LLC's polling numbers that are very unlikely to be the result of random statistical fluctuation. Nate is clear that there could be other explanations for the results, but one possibility is that the firm is just making up the numbers. From what I've read, the firm has not offered any alternative explanation for the anomalies. You can read the full details directly from the source:
http://www.fivethirtyeight.com/2009/09/strategic-vision-polls-exhibit-unusual.html
http://www.fivethirtyeight.com/2009/09/comparison-study-unusual-patterns-in.html
http://www.fivethirtyeight.com/2009/09/are-oklahoma-students-really-this-dumb.html

In the latest post about one poll of Oklahoma students, the first part of the argument takes this form:
1A. Assume a wrong model of the world.
1B. Notice that in the limit, this wrong model of the world matches Strategic Vision, LLC's reported numbers.

2A. Assume an arguably less wrong model of the world.
2B. Notice that in the limit, this less wrong model of the world doesn't really match Strategic Vision, LLC's reported numbers.

I say "in the limit" because the blue curves in Nate's graphs are generated from 50,000 simulated students, while the red curves are generated from the reported results for 1000 real (debatably) students. This feels a little bit weird to me, because it says nothing about how likely it is for a sample of 1000 points to deviate from the expectation. I understand that plotting one distribution directly against the other makes the graph easier to read, but I think what we're really interested in is something slightly different.
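To make that point concrete, here's a tiny sketch (my own illustration, not something from Nate's post): even for a single question that every student answers correctly with probability .28, the fraction correct in a 1000-student sample wanders noticeably around .28:
import numpy as np

# Sampling variation for one question: every student answers correctly with
# probability 0.28, but the observed fraction in a 1000-student sample varies.
np.random.seed(0)
p, n = 0.28, 1000
observed_fractions = np.random.binomial(n, p, size=20) / float(n)
print(observed_fractions.min(), observed_fractions.max())
# The standard deviation of the sample fraction is sqrt(p * (1 - p) / n),
# which is about 0.014 here.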

I want to tackle a related question. Suppose that we accept Nate's (admittedly overly simplistic) model of the world, where there are three equally sized classes of students (low, medium, and high achieving), and that we trust the survey's reported per-question percent-correct figures.

Now, let's generalize the model a bit -- let there be a heterogeneity parameter, h, that indicates how different the low, medium, and high achieving groups are. The low group gets each question right with probability (1 - h) * base_rate (where base_rate is the reported per-question percent correct), the medium group gets questions right at the base rate, and the high group gets questions right with probability (1 + h) * base_rate. Setting h to 0 gives Nate's first, homogeneous model (1A above), and setting h to .5 gives his second, more realistic model (2A above).
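In code, the per-group probabilities are just this (a minimal sketch; the example base rate is the first reported per-question percentage, and the clipping at 1.0 is a safeguard I added for large h -- the full simulation code is further down):
# A minimal sketch of the heterogeneity model: for a question with reported
# percent correct base_rate, the three equally sized groups answer correctly
# with these probabilities.  The min() is just a safeguard for large h.
def group_rates(base_rate, h):
    low = (1 - h) * base_rate
    medium = base_rate
    high = min(1.0, (1 + h) * base_rate)
    return low, medium, high

print(group_rates(0.28, 0.0))   # h = 0: everyone at the base rate (model 1A)
print(group_rates(0.28, 0.5))   # h = .5: Nate's more realistic model (2A)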

So now we can ask which settings of h are likely, given the data that Strategic Vision LLC has released. By Bayes' rule, P(h | counts_data, base_rate_data) is proportional to P(counts_data | h, base_rate_data) P(h | base_rate_data). To make progress, we need a measure for P(counts_data | h, base_rate_data): if we assume that h and the per-question percentages are fixed, how likely are we to generate a specific set of counts data? Nate presents the graphs in his post and asks us to visually inspect how different the curves are, so I use this as the basis for the likelihood function:
l = the sum over bins of the absolute differences between the observed and expected # of students with each number of questions correct

So if the expected counts_data is [x0, x1, x2], then we score the plausibility of observing [y0, y1, y2] by -[abs(x0 - y0) + abs(x1 - y1) + abs(x2 - y2)]. In other words, we are (very roughly) linearly more incredulous as the area between Nate's two curves gets bigger.
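With numpy that's one line; here's a toy example (the two histograms below are placeholders, not the real counts, which are hard-coded in the full script further down):
import numpy as np

# Curve mismatch between an expected and an observed histogram of
# number-of-questions-correct frequencies (toy placeholder arrays).
expected_counts = np.array([0.10, 0.20, 0.40, 0.20, 0.10])
observed_counts = np.array([0.05, 0.25, 0.35, 0.25, 0.10])

mismatch = np.sum(np.abs(expected_counts - observed_counts))
print(mismatch)  # about 0.20 here; bigger values mean the curves disagree more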

What I want to do now is estimate what Strategic Vision LLC's data says about h -- that is, P(h | counts_data, base_rate_data). We could just use a large sample of, say, 50,000 students to get expected counts, but I also want to capture some of the uncertainty in the counts data due to only pulling 1000 students.

I do this by replacing
l = sum(abs(expected_counts - counts_data))

with
l = (1/N) sum_i [sum(abs(sampled_count_i - counts_data))]

where I get sampled_count_i by dividing the students into three groups, flipping coins independently for each student, and counting a question as correct with the appropriate probability for the (group, question) combination. I repeat this with N=2000 for several values of h between 0 and .6. I posted the code I used for this below, so you can see everything down to the last detail. The following graph shows -l, which I call Curve Mismatch, versus h. I also show error bars giving the standard deviation of the individual sample likelihoods -- roughly, how much variation there is from sampling 1000 students and using a loop of 2000 samples to estimate l.

So this is showing the estimated (negative) likelihood of different values of h, given Strategic Vision LLC's data and the assumptions detailed above. What it shows is fairly interesting: values of h from .2 to .3 look like they may explain the data even better than a value of 0, although there is quite a bit of uncertainty across the range 0 to .3.

How do we interpret this? If Nate had chosen to give his low knowledge group a little more knowledge and his high knowledge group a little less (.2 worth, to be precise), then we'd expect the mismatch in his curves to be smaller than in the .5 case, and maybe even smaller than in the h=0 case. Actually, I'll just make the curve for h=.3 now and compare it to the h=0 one from Nate: pretty similar, huh?

So Nate's conclusion is that Strategic Vision LLC's data implies the absurd model of h = 0, and that we should therefore distrust their data -- sort of like a proof by contradiction, but not a proof. I argue instead that Strategic Vision LLC's data implies that h could very easily be in the range [0, .3], but that it's pretty unlikely that h is .5.

And of course, if you want to be Bayesian, you can also start to think about what your prior belief over h is. Most people probably don't think all students achieve equally, so it seems reasonable to think h is greater than zero, but I'm not entirely convinced that it's as high as .5. If I had to try to put numbers to it, my completely uninformed, personal prior would probably be increasing on the range [0, .2], flat from [.2, .6], then decreasing beyond .6.
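For what it's worth, here's a rough sketch of how such a prior could be combined with the curve mismatch to get a posterior over h. The prior shape follows what I described above, but mismatch_for_h and the temperature are placeholder choices for illustration, not anything estimated here:
import numpy as np

# Illustrative only: the prior over h described above (rising on [0, .2],
# flat on [.2, .6], falling beyond .6), and how it would combine with an
# estimated curve mismatch to give a normalized posterior over a grid of
# h values.  mismatch_for_h stands in for the simulation estimates below,
# and the temperature is an arbitrary choice.
def prior(h):
    if h <= 0.2:
        return h / 0.2
    elif h <= 0.6:
        return 1.0
    return max(0.0, 1.0 - (h - 0.6) / 0.2)

def posterior_over_grid(hs, mismatch_for_h, temperature=10.0):
    likelihood = np.exp(-np.array([mismatch_for_h(h) for h in hs]) / temperature)
    post = likelihood * np.array([prior(h) for h in hs])
    return post / post.sum()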

So my personal, Bayesian conclusion from all of this: if we're going to assume this three-class-of-achiever model, then Strategic Vision LLC's data implies that h is probably in the range of [.2, .3]. Is that absurd? I really don't know. Is Strategic Vision LLC a big fraud? I have no idea. I'd never even heard of them before Nate's first post.

For those who want to check my work and follow along at home, here's the code I used to generate the graphs. The hard-coded numbers are from the Strategic Vision LLC data posted with Nate's article. The first file is for the first graph, and the second file is for the second graph.
import numpy as np
from pylab import *

num_runs = 2000  # number of simulated data sets to make per model
num_students = 1000
fraction_right = np.array([.28, .26, .27, .1, .14, .61, .43, .11, .23, .29])
num_questions = len(fraction_right)

sv_correct_counts = np.array([46, 158, 246, 265, 177, 80, 22, 6, 0, 0, 0]) / 1000.0
heterogeneities = [.6, .5, .4, .3, .2, .1, 0]

result_means = []
result_stds = []

for heterogeneity in heterogeneities:
    count_diff = np.zeros(num_runs)

    for run in range(num_runs):
        sim_correct_counts = np.zeros(num_questions + 1)


        for i in range(num_students):
            # Each student answers each question independently; compare a
            # uniform draw to the group's per-question correct probability.
            answers = np.random.rand(num_questions)

            # First third of students: low group; middle third: medium;
            # last third: high.
            if i < num_students / 3:
                num_right = sum(answers < (1 - heterogeneity) * fraction_right)
            elif i < 2 * num_students / 3:
                num_right = sum(answers < fraction_right)
            else:
                num_right = sum(answers < (1 + heterogeneity) * fraction_right)

            sim_correct_counts[num_right] += 1

        sim_correct_counts /= num_students

        count_diff[run] = 100 * np.sum(np.abs(sim_correct_counts - sv_correct_counts))

    print(heterogeneity, np.mean(count_diff), np.std(count_diff))

    result_means.append(np.mean(count_diff))
    result_stds.append(np.std(count_diff))


errorbar(heterogeneities, result_means, yerr=result_stds)
xlim([-.1, .7])
title("Strategic Vision, LLC vs. Simulated Test Score Counts")
xlabel("Model Heterogeneity")
ylabel("Curve Mismatch")
show()
Second graph:
import numpy as np
from pylab import *

num_students = 50000
fraction_right = np.array([.28, .26, .27, .1, .14, .61, .43, .11, .23, .29])
num_questions = len(fraction_right)

sv_correct_counts = np.array([46, 158, 246, 265, 177, 80, 22, 6, 0, 0, 0]) / 1000.0
heterogeneity = .3

sim_correct_counts = np.zeros(num_questions + 1)

for i in range(num_students):
    answers = np.random.rand(num_questions)

    if i < num_students / 3:
        num_right = sum(answers < (1 - heterogeneity) * fraction_right)
    elif i < 2 * num_students / 3:
        num_right = sum(answers < fraction_right)
    else:
        num_right = sum(answers < (1 + heterogeneity) * fraction_right)

    sim_correct_counts[num_right] += 1


sim_correct_counts /= num_students
plot(range(num_questions+1), sim_correct_counts, 'b-o', lw=2)
plot(range(num_questions+1), sv_correct_counts, 'r-x', lw=2)
legend(['Simulated', 'Actual'])
show()

Monday, September 21, 2009

NIPS 2009 Accepted Papers

Posted by Danny Tarlow
The list of accepted papers for NIPS 2009 has been released:
http://nips.cc/Conferences/2009/Program/accepted-papers.php

As usual, it looks to be an interesting conference. I'm particularly interested in most of the ones that have "MAP" in the title.

Netflix Prize Awarded and New Contest Announced

Posted by Danny Tarlow
BellKor pulls out the last-minute win:
http://bits.blogs.nytimes.com/2009/09/21/netflix-awards-1-million-prize-and-starts-a-new-contest/

There is also a new contest in the works:
The new contest is going to present the contestants with demographic and behavioral data, and they will be asked to model individuals’ “taste profiles,” the company said. The data set of more than 100 million entries will include information about renters’ ages, gender, ZIP codes, genre ratings and previously chosen movies. Unlike the first challenge, the contest will have no specific accuracy target. Instead, $500,000 will be awarded to the team in the lead after six months, and $500,000 to the leader after 18 months.
This definitely sounds like a more interesting data set. It will be particularly interesting to see if this additional information improves the performance of the systems that did well on the original contest.

Thursday, September 17, 2009

Research Group Blogs

Posted by Danny Tarlow
Daniel Tunkelang has a post about, among other things, CSAIL at MIT's Haystack blog:
I wish that more universities and departments would encourage their faculty and students to blog. As Daniel Lemire has pointed out, it’s a great way for academic researchers to get their ideas out and build up their reputations and networks. He should know–he leads by example. Likewise, Haystack is setting a great example for university blogs, and is a credit to MIT and CSAIL.
I definitely agree. Anybody from Toronto want to join in on the fun?

Scraping in Python

Posted by Danny Tarlow
Not every website out there has its data available via a nice API. Now, I wish it weren't the case, but I can think of several good reasons for a company or organization not to release its data:
  1. The data is valuable and provides a competitive advantage to the company.
  2. It would require hardware upgrades to support lots of downloads or additional requests.
  3. There aren't enough engineering resources at the company to make it worthwhile to devote an engineer to building out an API.
At the same time, though, most websites show you a good bit of data in some format or another every time you visit their site: Google gives you a small data set of web pages relevant to a given keyword every time you enter a query; Digg gives you a data set of recent, popular links; Twitter gives you a data set of recent events from your sphere; you get the point.

Given that all of the sites I listed above have APIs, I have to think that (1) isn't actually as big of a concern as it might seem, at least as long as there are reasonable limits that keep somebody from flat-out duplicating the site. Anyhow, whatever the reason, sometimes the best way to get your hands on some data is to crawl the website and scrape the data from the raw HTML.

Now, there is some touchy legal ground involved with scraping. For example, the law firm WilmerHale (I don't know anything about the firm) has a 2003 article about the legality of web scraping:
Based on these cases, it would appear that anyone who, without authorization, uses a web "scraper" or similar computer program to access and download data from a third party website risks potential and perhaps serious legal claims from the website operator. However, the cases suggest that, for website operators that wish to protect the data available on their website, the failure to observe some basic precautions may compromise or even preclude such claims. Specifically:
  • website operators should ensure that their website terms and conditions specifically prohibit unauthorized access or downloading of data using any computer program; and
  • website operators should either clearly identify the terms and conditions of use on each webpage containing valuable data or provide an obvious link to a webpage with those conditions.
I do not in any way advocate using this code to scrape data from a website that disallows it. I also recommend that you contact the website administrators to get permission before scraping any data from any site.

With that out of the way, there may be some cases where you have permission to scrape data. This is the case I'm going to consider from here on out.

Now, this should go without saying, but you definitely want to go easy on the site's servers. The last thing you want to do is fire off an accidental denial-of-service attack with thousands of requests a minute. Hopefully the server has some automated systems in place to deal with such a basic attack (e.g., by blocking you), but there's no reason to test it, especially when the site owner has so graciously allowed you to crawl their data. I've found that an average of one request per minute is reasonable for small crawl jobs, but some people advocate backing off even more for big jobs. Yes, it might take weeks, but sometimes that's the price you have to pay.

There are other cases where you have permission to crawl the site, but the website has built-in mechanisms to block requests that appear to come from crawlers. If it's not already in place, it can be a bit of a pain to set up a whitelist based on IP address or some other tag, especially for sites where engineering resources are tight as is. In this case, it may be helpful to make your requests look as much like typical web traffic as possible.

The two most useful things I've found to do in this case are:
  • Add some randomness to the visit frequency (beyond just waiting M + N * rand() seconds between requests).
  • Send realistic looking headers along with the request.
I've implemented a basic crawler that uses all of these strategies. You start by telling it the first page that you want to visit, along with how many subsequent pages you want. You need to define a pattern that pulls a link to the next page from the most recently downloaded page. It will then download a page, find the "next" link, download the next page, and so on, up to the number of pages you request. You still have to parse all of the HTML, but all of the pages will be there waiting for you, sitting in your output/ directory.

Remember, though, only use this in cases where you have permission from the website owner.
import sys
import time
import re
from subprocess import call
from datetime import datetime
import numpy as np


HEADERS = {
    "Host" : "www.SOME-SITE-HERE.com",
    "User-Agent" : "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.5; en-US; rv:1.9.0.10) Gecko/2009042315 Firefox/3.0.10",
    "Accept" : "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language" : "en-us,en;q=0.5",
    "Accept-Charset" : "ISO-8859-1,utf-8;q=0.7,*;q=0.7",
    "Keep-Alive" : "300",
    "Connection" : "keep-alive"
}


class GenericCrawler():

    base_url = 'http://www.SOME-SITE-HERE.com/'
    
    next_page_pattern = r'<a href="/([^"]*?)">Next Page</a>'


    def __init__(self, starting_page):
        self.starting_page = starting_page
        self.next_page_re = re.compile(self.next_page_pattern)


    def Get(self, num_pages):
        next_page_url = self.starting_page

        for i in range(num_pages):
            print "Page %s" % i
            self.FetchPage(i, next_page_url)
            next_page_url = self.NextLinkInLastResult()


    def FetchPage(self, page_num, relative_url):
        
        request_url = '%s%s' % (self.base_url, relative_url)

        current_time = datetime.now()
        output_file = "output/%s_%s_%s_%s_%s_%s__p%s.html" % (self.starting_page,
                                                              current_time.year, current_time.month,
                                                              current_time.day, current_time.hour,
                                                              current_time.minute, page_num)
        self.last_result = output_file

        # Grab the page with curl and save it to output_file.
        print(request_url)

        curl_args = ["curl", "-o", output_file]
        for h in HEADERS:
            curl_args.append('-H')
            curl_args.append("%s: %s" % (h, HEADERS[h]))
        curl_args.append(request_url)
            
        call(curl_args) 

        # Don't overload the server or trip the spider detector
        time.sleep(30 + 30 * np.random.random())

        if np.random.random() < .02:
            print "Taking a long break"
            time.sleep(300 + 300 * np.random.random())


    def NextLinkInLastResult(self):
        f = open(self.last_result, 'r')

        for line in f:
            m = self.next_page_re.findall(line)
            if len(m) > 0:
                print("Next page relative URL: %s" % m[0])
                return m[0]

        return None


if __name__ == "__main__":

    starting_page = sys.argv[1]
    num_pages = int(sys.argv[2])

    h = GenericCrawler(starting_page)
    h.Get(num_pages)
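As a rough usage sketch (the module name, starting page, and page count below are made up): after editing base_url, the Host header, and next_page_pattern for a site you have permission to crawl, and creating an output/ directory, you can run the script from the command line with the starting page and page count as arguments, or drive it from another script like this:
# Hypothetical usage: assumes the code above is saved as generic_crawler.py
# and that an output/ directory exists alongside it.
from generic_crawler import GenericCrawler

crawler = GenericCrawler("some/listing/page.html")  # placeholder starting page
crawler.Get(5)  # fetch 5 pages, pausing between requests as defined above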

Wednesday, September 9, 2009

Advanced NFL Stats

Posted by Danny Tarlow
Every so often you come across somebody who is just doing a really great job and deserves a special mention. Brian at Advanced NFL Stats is a great example of doing thorough and insightful analysis of NFL football data:
http://www.advancednflstats.com/