Saturday, September 26, 2009

Analysis of Pollster Fraud and Oklahoma Students

Posted by Danny Tarlow
I've been following with interest the recent series of posts at fivethirtyeight.com regarding the polling firm Strategic Vision, LLC. The short of it is that Nate Silver is presenting statistical evidence of anomalies in Strategic Vision LLC's polling numbers -- anomalies that are very unlikely to be the result of random statistical fluctuation. Nate is clear that there could be other explanations for the results, but one possibility is that the firm is just making up the numbers. From what I've read, the firm has not offered any alternative explanation for the anomalies. You can read the full details directly from the source:
http://www.fivethirtyeight.com/2009/09/strategic-vision-polls-exhibit-unusual.html
http://www.fivethirtyeight.com/2009/09/comparison-study-unusual-patterns-in.html
http://www.fivethirtyeight.com/2009/09/are-oklahoma-students-really-this-dumb.html

In the latest post about one poll of Oklahoma students, the first part of the argument takes this form:
1A. Assume a wrong model of the world.
1B. Notice that in the limit, this wrong model of the world matches Strategic Vision, LLC's reported numbers.

2A. Assume an arguably less wrong model of the world.
2B. Notice that in the limit, this less wrong model of the world doesn't really match Strategic Vision, LLC's reported numbers.

I say in the limit because the blue curves in Nate's graphs are generated from 50,000 simulated students, while the red curves are generated from the reported results for 1000 (debatably) real students. This feels a little bit weird to me, because it doesn't say anything about how likely it is for a sample of 1000 points to deviate from the expectation. Beyond that, I understand that plotting one distribution against the other makes the graph easier to read, but I think what we're really interested in is something slightly different.
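To get a rough feel for how much a 1000-student histogram can wander from its expectation, here's a tiny sketch -- the bin probabilities below are made up purely for illustration, not taken from either poll:

import numpy as np

# Hypothetical probabilities for each "number correct" bin (illustration only).
probs = np.array([0.05, 0.16, 0.25, 0.26, 0.17, 0.07, 0.03, 0.01])
expected = 1000 * probs

# Draw several samples of 1000 students and measure the total absolute
# deviation of the observed bin counts from the expected bin counts.
for _ in range(5):
    counts = np.random.multinomial(1000, probs)
    print(np.abs(counts - expected).sum())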

I want to tackle a related question. Suppose we accept Nate's (admittedly overly simplistic) model of the world, in which there are three equally sized classes of student -- low, medium, and high achieving -- and suppose we trust the survey's reported per-question correct percentage for each question.

Now, let's generalize the model a bit -- let there be a heterogeneity parameter, h, that indicates how different the low, medium, and high achieving groups are. The low group gets each question right with probability (1 - h) * base_rate (where base_rate is the reported per-question correct percentage), the medium group gets questions right at the base rate, and the high group gets each question right with probability (1 + h) * base_rate. Setting h to 0 gives Nate's first, homogeneous model (1A above), and setting h to .5 gives his second, more realistic model (2A above).
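To make that concrete, here's a minimal sketch of the three groups' per-question correct probabilities. The base rates are the ones reported in the survey (the same numbers hard-coded in the full scripts below); the function name is just mine for illustration:

import numpy as np

# Reported per-question correct percentages (the base rates).
base_rate = np.array([.28, .26, .27, .1, .14, .61, .43, .11, .23, .29])

def group_probs(base_rate, h):
    """Per-question correct probabilities for the low, medium, and high groups."""
    return {"low": (1 - h) * base_rate,
            "medium": base_rate,
            "high": (1 + h) * base_rate}

# h = 0 collapses all three groups to the base rate (Nate's model 1A);
# h = .5 spreads them to half and one-and-a-half times the base rate (model 2A).
print(group_probs(base_rate, 0.5)["high"])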

So now we can ask which settings of h are likely, given the data that Strategic Vision LLC has released. By Bayes' rule, P(h | counts_data, base_rate_data) \propto P(counts_data | h, base_rate_data) P(h | base_rate_data). To make progress, we need to come up with a measure for P(counts_data | h, base_rate_data) -- in other words, if we assume that h and the per-question percentages are fixed, how likely are we to generate a specific set of counts data? Nate presents the graphs in his post and asks us to visually inspect how different the curves are, so I use this as the basis for the likelihood function:
l = the sum, over bins, of the absolute differences between the observed and expected counts of students getting each number of questions correct

So if the expected counts_data result is [x0, x1, x2], then we're assuming that the likelihood of generating [y0, y1, y2] is proportional to -[abs(x0 - y0) + abs(x1 - y1) + abs(x2 - y2)]. In other words, we are (very roughly) linearly more incredulous as the area between Nate's two curves gets bigger.
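As a small, self-contained illustration of that scoring rule -- the three-bin numbers below are made up, and only the formula matches what I use later:

import numpy as np

def curve_mismatch(expected_counts, observed_counts):
    """Sum of absolute bin-by-bin differences -- roughly the area between the curves."""
    return np.sum(np.abs(np.asarray(expected_counts) - np.asarray(observed_counts)))

expected = [0.20, 0.50, 0.30]  # hypothetical expected fractions per bin
observed = [0.25, 0.45, 0.30]  # hypothetical observed fractions per bin

# The likelihood score is taken to be proportional to the negative mismatch.
print(-curve_mismatch(expected, observed))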

What I want to do now is estimate what Strategic Vision LLC's data is saying about h -- that is, P(h | counts_data, base_rate_data). We could just use a large sample of, say, 50,000 students to get expected counts, but I also want to capture some of the uncertainty in the counts data due to only pulling 1000 students.

I do this by replacing
l = sum(abs(expected_counts - counts_data))

with
l = (1/N) sum_i [sum(abs(sampled_count_i - counts_data))]

where I get sampled_count_i by dividing the students into three groups, flipping a coin independently for each (student, question) pair, and counting a question as correct with the appropriate probability for the (group, question) combination. I repeat this with N = 2000 for several values of h between 0 and .6. I posted the code I used for this below, so you can see everything down to the last detail. The graph below shows -l, which I call Curve Mismatch, versus h. I also show error bars of the standard deviation of the individual sample likelihoods -- roughly, how much variation there is from sampling 1000 students and using a loop of 2000 samples to estimate l.

This shows the estimated (negative) likelihood of different values of h given Strategic Vision LLC's data and my assumptions detailed above. The result is fairly interesting: values of h from .2 to .3 look like they may explain the data even better than a value of 0, although there is quite a bit of uncertainty across the range 0 to .3.

How do we interpret this? If Nate had chosen to give his low-knowledge group a little more knowledge and his high-knowledge group a little less (.2 worth, to be precise), then we'd expect the mismatch in his curves to be smaller than in the .5 case, and maybe even smaller than in the h = 0 case. Actually, I'll just make the curve for h = .3 now. Compared to the h = 0 one from Nate, it's pretty similar, huh?

So Nate's conclusion is that Strategic Vision LLC's data implies the absurd model of h = 0, so we should distrust their data -- sort of like a proof by contradiction, but not a proof. I argue instead that Strategic Vision LLC's data implies that h could very easily be in the range of [0, .3], but it's pretty unlikely that h is .5.

And of course, if you want to be Bayesian, you can also start to think about what your prior belief over h is. Most people probably don't think all students achieve equally, so it seems reasonable to think h is greater than zero, but I'm not entirely convinced that it's as high as .5. If I had to try to put numbers to it, my completely uninformed, personal prior would probably be increasing on the range [0, .2], flat from [.2, .6], then decreasing beyond .6.
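Here's a rough sketch of how one might fold such a prior into the grid of h values. The prior shape below is just my guess made concrete, and treating the negative mismatch as a log-likelihood is one loose reading of the proportionality above, not something the analysis depends on:

import numpy as np

def rough_prior(h):
    """A hand-drawn prior: rising on [0, .2], flat on [.2, .6], falling beyond .6."""
    h = np.asarray(h, dtype=float)
    return np.where(h < 0.2, 0.5 + 2.5 * h,
                    np.where(h <= 0.6, 1.0, np.maximum(1.0 - 5 * (h - 0.6), 0.01)))

def posterior_over_h(h_grid, mean_mismatch):
    """Combine the prior with a likelihood built from the Curve Mismatch scores."""
    log_like = -np.asarray(mean_mismatch, dtype=float)  # negative mismatch as log-likelihood
    post = np.exp(log_like - log_like.max()) * rough_prior(h_grid)
    return post / post.sum()

# Usage: pass in the h grid and the mean Curve Mismatch values from the code below, e.g.
#   posterior_over_h([0, .1, .2, .3, .4, .5, .6], result_means)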

So my personal, Bayesian conclusion from all of this: if we're going to assume this three-class-of-achiever model, then Strategic Vision LLC's data implies that h is probably in the range of [.2, .3]. Is that absurd? I really don't know. Is Strategic Vision LLC a big fraud? I have no idea. I'd never even heard of them before Nate's first post.

For those who want to check my work and follow along at home, here's the code I used to generate the graphs. The numbers hard-coded in there are from the Strategic Vision LLC data posted at Nate's blog. The first file generates the first graph, and the second file generates the second graph.
import numpy as np
from pylab import *

num_runs = 2000  # number of simulated data sets to make per model
num_students = 1000
fraction_right = np.array([.28, .26, .27, .1, .14, .61, .43, .11, .23, .29])
num_questions = len(fraction_right)

sv_correct_counts = np.array([46, 158, 246, 265, 177, 80, 22, 6, 0, 0, 0]) / 1000.0
heterogeneities = [.6, .5, .4, .3, .2, .1, 0]

result_means = []
result_stds = []

for heterogeneity in heterogeneities:
    count_diff = np.zeros(num_runs)

    for run in range(num_runs):
        sim_correct_counts = np.zeros(num_questions + 1)


        for i in range(num_students):
            answers = np.random.rand(num_questions)

            # First third of students: low group; middle third: medium; last third: high.
            if i < num_students // 3:
                num_right = sum(answers < (1 - heterogeneity) * fraction_right)
            elif i < 2 * num_students // 3:
                num_right = sum(answers < fraction_right)
            else:
                num_right = sum(answers < (1 + heterogeneity) * fraction_right)

            sim_correct_counts[num_right] += 1

        sim_correct_counts /= num_students

        count_diff[run] = 100 * np.sum(np.abs(sim_correct_counts - sv_correct_counts))

    print(heterogeneity, np.mean(count_diff), np.std(count_diff))

    result_means.append(np.mean(count_diff))
    result_stds.append(np.std(count_diff))


errorbar(heterogeneities, result_means, yerr=result_stds)
xlim([-.1, .7])
title("Strategic Vision, LLC vs. Simulated Test Score Counts")
xlabel("Model Heterogeneity")
ylabel("Curve Mismatch")
show()
Second graph:
import numpy as np
from pylab import *

num_students = 50000
fraction_right = np.array([.28, .26, .27, .1, .14, .61, .43, .11, .23, .29])
num_questions = len(fraction_right)

sv_correct_counts = np.array([46, 158, 246, 265, 177, 80, 22, 6, 0, 0, 0]) / 1000.0
heterogeneity = .3

sim_correct_counts = np.zeros(num_questions + 1)

for i in range(num_students):
    answers = np.random.rand(num_questions)

    # First third of students: low group; middle third: medium; last third: high.
    if i < num_students // 3:
        num_right = sum(answers < (1 - heterogeneity) * fraction_right)
    elif i < 2 * num_students // 3:
        num_right = sum(answers < fraction_right)
    else:
        num_right = sum(answers < (1 + heterogeneity) * fraction_right)

    sim_correct_counts[num_right] += 1


sim_correct_counts /= num_students
plot(range(num_questions+1), sim_correct_counts, 'b-o', lw=2)
plot(range(num_questions+1), sv_correct_counts, 'r-x', lw=2)
legend(['Simulated', 'Actual'])
show()

2 comments:

Steve said...

Hi Danny,

If anyone doesn't believe your evidence that the Oklahoma "study" is very probably garbage, check out the supposed results of the same survey conducted in Arizona. You'll find the link (and links to your and Nate's blog also) here: http://stevekass.com/2009/11/05/frightening-but-not-for-the-obvious-reason.

Vile stuff.

Danny Tarlow said...

Hi Steve,

Thanks for the interest, but I think you might be overstating my conclusions. All I ended up saying was this:
So my personal, Bayesian conclusion from all of this: if we're going to assume this three-class-of-achiever model, then Strategic Vision LLC's data implies that h is probably in the range of [.2, .3]. Is that absurd? I really don't know. Is Strategic Vision LLC a big fraud? I have no idea. I'd never even heard of them before Nate's first post.