Sunday, December 27, 2009

Scientific Controversies and Hot-button Issues

Posted by Danny Tarlow
I haven't done much work over the holiday, but I've had a chance to do quite a bit of reading. The book next to my bed is Schrijver's three-volume series on Combinatorial Optimization: Polyhedra and Efficiency (Algorithms and Combinatorics), which lives up to the hype.

On the (arguably) lighter side, I've discovered the not-so-secret underworld of what I'll call "math 2.0" websites and blogs. Maybe I'm just out of the loop, but from my perspective, it seems that math and theoretical computer science have a more active internet community than more applied fields like machine learning, applied statistics, and algorithms (there are of course many notable exceptions).

Anyhow, what I really wanted to write about comes from reading the blogs. What struck me is how difficult it is for computer scientists and mathematicians to get their points across to the general public. Two examples particularly illustrate my point: in both cases, the overarching story is one of a mathematician against the press. As I understand it, in the first case, it's an information theorist arguing against an intelligent design argument based on information; in the second, it's a computational complexity theorist trying to dispel some of the possibly exaggerated claims made about a company's supposed quantum computer. The comment threads under both posts are quite interesting (and sometimes sadly comical) reads.

It's not just the stereotypical story of a mathematician being unable to communicate with normal people. Instead, the common theme in both cases is that the mathematician has spent a lot of time carefully setting down their position in a public and/or peer-reviewed form. The problem is that their opinion--though they claim it is still valid--is several years old. In the interim, their adversaries have come up with more to say and claim to have refuted the criticisms in more recent citations. The mathematicians disagree: they say their old arguments still hold, cite the original papers, and can't be bothered to restate their case from scratch every time a new, unrelated argument gets made that doesn't address their original concern.

Now this might all be well and good in an academic setting: get a few well-respected members of the field and ask their opinion. In academia, though, we have two advantages:
  • We are used to trusting the opinions of our peers expressed via blind peer review schemes, and
  • It is difficult enough to write a reasonably good conference paper that we aren't completely (some may disagree) overwhelmed with crackpot ideas to evaluate.
Unfortunately, in blogs and the popular press, neither holds. Everybody assumes everybody else has a hidden agenda, and there is absolutely no way to get a qualified person to review every possibly crazy idea out there, much less fight a prolonged battle over it. As we often see in politics, the side with the most money, cleverest spin, and loudest voice tends to be heard clearest.

So then the question becomes where to go from here. The mathematician making the original argument can't be expected to spend time fighting every little battle, but very few people have the time, inclination, and ability to credibly pick up the battle in their stead (in the two example cases, the issue is over different definitions of information, and subtleties about how "quantum" a quantum computer is. See here and comments 23, 24, and 26 here for examples). Even in ridiculous cases where an argument really has no merit, there is enough jargon and a long enough comment thread to make a casual observer think that the issue is "complicated," and therefore either side could be right with equal plausibility.

Yet, for lack of a better idea, we as academics by and large still use the same strategy for making an argument: write the paper, then move on. If we need to make a point, refer the reader to the paper. By putting the ideas in the record and possibly presenting the ideas to our peers, we've done our part.

I don't think this is good enough, but there isn't an easy answer. Going back to the politics setting, there are difficult questions about when to put up a fight and when not to even dignify some crazy assertion with a response. Minor wording issues can be blown out of proportion. It is often considered harmful in a political debate to give a long answer. And when it comes to fighting these battles on the public stage, we as academics are woefully inept, which is no surprise, since public relations is a tricky game and most of us have zero training.

So here's one idea: academic conference and journal bodies are quite good at deciding whether an idea deserves their stamp of approval. Peer review isn't perfect, but it's pretty good. Unfortunately, once a paper is accepted or rejected, the responsibility of the reviewing body ends. They will passively make the material accessible (either for free or behind a pay wall) and not give it further thought.

I'd like to see these bodies take a more active position in the public eye. The content published in the conference or journal becomes the agenda of the reviewing body. If somebody puts out an idea in the public eye that contradicts something published in the proceedings, the review body's public relations (PR) wing decides how to address it. In some cases, it may be proper to ignore it. In others, a real, prolonged fight may be needed.

I don't know all the exact details of how it would work, but there are several benefits to having a conference-specific PR body fight the battle:
  • It's harder to undermine the motivations of an entire organization than an individual scientist. It's also harder for one scientist to co-opt the opinion of the full organization.
  • By having the same organization's name come up repeatedly, it will begin to build a reputation for the body in the public eye.
  • Professional public relations people would be in the loop to help scientists make their point effectively.
There are other tangential benefits as well: increasing the field's exposure to the public and making the practical implications of the research clearer.

The downside is obviously that it would cost money and require additional organization to maintain a permanent PR body for each major conference. I can't help but think that a concerted PR effort would be a good investment for many of these hard-to-understand fields, though.

Edit: I didn't find a place to put it, but this story came up in one of the comment threads.

Edit 2: Another related example. This time an academic against possibly faulty sex ratio statistics (how often parents have boys vs girls) that got picked up by the popular press. The academic was then "refuted" by Wikipedia.

[Aside] I'd love it if machine learning people started using Math Overflow (MO). It seems like a nice way to get a bit of crossover between the fields, which in many cases aren't as different as they appear on the surface. Here are some example posts that have a fair number of upvotes (i.e., they are good questions according to the MO community) and that I find interesting as a machine learning researcher: There are plenty more examples in topics like combinatorics, statistics, probability, graph theory, and optimization. [/Aside]

Thursday, November 26, 2009

Data analysis in a presidential campaign

Posted by Danny Tarlow
Here's a good video by Dan Siroker about how the Obama campaign used data analysis and web analytics in the 2008 campaign. It might be a stretch to say that data won the campaign, but it sure didn't hurt to be doing lots of interesting stuff with it.
http://websiteoptimizer.blogspot.com/2009/11/new-video-how-we-used-data-to-win.html

He does a good job driving home how valuable it is to always be defining and measuring relevant performance indicators and to do it automatically and quantitatively. It's easy to say you're going to measure and keep on top of the indicators, but it's much harder to pull it off in a clean, focused way, especially as you get more and more data from more and more sources. The campaign did a great job of it, though, and in my (not unique) opinion, this gave them a huge leg up. His points also apply to his startup, CarrotSticks.com (which you should go try!).

My favorite part of the video? 43:30, because I think that almost counts as a "shout out" (I spent a little time working with the analytics team and fall somewhere between "friends from college" and "PhDs in computer science that happened to walk through the door").

Friday, October 30, 2009

New York Times data

Posted by Danny Tarlow
I haven't checked it out yet, but this looks pretty interesting:
For more than 150 years, The New York Times has meticulously indexed its archives. Through this process, we have developed an enormous collection of subject headings, ranging from “Absinthe” to “Zoos”. Unfortunately, our list of subject headings is an island. For example, even though we can show you every article written about “Colbert, Stephen,” our databases can’t tell you that he was born on May 13, 1964, or that he lost the 2008 Grammy for best spoken word album to Al Gore. To do this we would need to map our subject headings onto other Web databases such as Freebase and DBPedia. So that’s exactly what we did.
http://open.blogs.nytimes.com/2009/10/29/first-5000-tags-released-to-the-linked-data-cloud/

Monday, October 19, 2009

Emacs for C++

Posted by Danny Tarlow
I really need to try this: http://nflath.com/2009/10/c-customizations-for-emacs/

Tuesday, October 13, 2009

NIPS 2009 Workshops

Posted by Danny Tarlow
The list of NIPS 2009 workshops is also up (and has been for a while, I think):
http://nips.cc/Conferences/2009/Program/schedule.php?Session=Workshops

As usual, there look to be a number of interesting sessions. The ones that look most interesting to me are the following:

Twitter and Toronto

Posted by Danny Tarlow
In case the data collection falls through for the other alternative, my "Government 2.0" group is also planning out a Toronto-based app that uses Twitter data as the primary data source. The ground rules are pretty loose: come up with an application that uses data about the city of Toronto and serves some subset of people in Toronto.

Being a group of 4 AI/Machine Learning-ish PhD students, we think it would be fun to try to take a stab at some fairly substantive machine learning problems -- and there seems to be no shortage of need for this on Twitter. Of course, the first step towards an interesting machine learning application is an interesting way of looking at some interesting data. More to come...

Smart meters

Posted by Danny Tarlow
I had a meeting today with some U of T colleagues and a couple people at Toronto Hydro for the "Government 2.0" course I'm currently enrolled in. Like most people, I had heard the buzz about "The Smart Grid" and "Smart Meters," but up until a short time ago, I didn't realize that the smart meter project is not only real, but up and running, collecting data in hundreds of thousands of homes across the Toronto area. From the Toronto Hydro site:
As of August 31, 2009, we had 602,418 smart meters installed across our city and we continue with our installation schedule.
We went in today and got more details about the data, and a quick demo about what the data looks like: for each house that has a smart meter running, they get hourly electricity consumption readings 24 hours a day. This is pretty cool -- it's to the point where you can see a spike in electricity use at 10pm and start speculating about whether it was caused by somebody turning on the dryer. Even just looking at the raw readings, you can start to get a bit of an idea about what the causes of electricity consumption might be.
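
As a toy example of the kind of thing you could start to do with the raw readings (the numbers below are entirely made up, not actual Toronto Hydro data), you could flag hours that jump well above the recent average:
import numpy as np

def flag_spikes(hourly_kwh, window=3, threshold=1.5):
    """Return the hours whose reading exceeds the mean of the previous
    `window` hours by more than `threshold` kWh -- an arbitrary rule of
    thumb, just for illustration."""
    hourly_kwh = np.asarray(hourly_kwh, dtype=float)
    spikes = []
    for t in range(window, len(hourly_kwh)):
        baseline = hourly_kwh[t - window:t].mean()
        if hourly_kwh[t] - baseline > threshold:
            spikes.append(t)
    return spikes

# A fake day of hourly readings with a jump at hour 22 (10pm):
readings = [0.4] * 22 + [2.5, 0.5]
print(flag_spikes(readings))  # [22]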

So the question now is to formulate a project that could make good use of this data. There are some simple ground rules: no privacy violations, and nothing too controversial (e.g., no law enforcement applications). Other than that, it's pretty wide open. What would you do with this data?

Saturday, September 26, 2009

Analysis of Pollster Fraud and Oklahoma Students

Posted by Danny Tarlow
I've been following with interest the recent series of posts at fivethirtyeight.com regarding the polling firm Strategic Vision, LLC. It's all very interesting -- the short of it is that Nate Silver is presenting statistical evidence that there are anomalies in Strategic Vision LLC's polling numbers, which are very unlikely to be the result of random statistical fluctuation. Nate is clear that there could be other explanations for the results, but one possibility is that the firm is just making up the numbers. From what I read, the firm has not released any alternative explanations for the anomalies. You can read the full details directly from the source:
http://www.fivethirtyeight.com/2009/09/strategic-vision-polls-exhibit-unusual.html
http://www.fivethirtyeight.com/2009/09/comparison-study-unusual-patterns-in.html
http://www.fivethirtyeight.com/2009/09/are-oklahoma-students-really-this-dumb.html

In the latest post about one poll of Oklahoma students, the first part of the argument takes this form:
1A. Assume a wrong model of the world.
1B. Notice that in the limit, this wrong model of the world matches Strategic Vision, LLC's reported numbers.

2A. Assume an arguably less wrong model of the world.
2B. Notice that in the limit, this less wrong model of the world doesn't really match Strategic Vision, LLC's reported numbers.

I say in the limit, because the blue curves in Nate's graphs are generated from 50,000 simulated students, while the red curves are generated from the reported results for 1000 real (debatably) students. This feels a little bit weird to me, because it is not saying anything about how likely it is for a sample of 1000 points to deviate from the expectation. Beyond that, I understand that it makes the graph easier to read if you can just plot one distribution against the other, but I think what we're really interested in is something slightly different.

I want to tackle a related question. Suppose that we accept Nate's (admittedly overly simplistic) model of the world, in which there are three equally sized classes of student (low, medium, and high achieving), and that we trust the survey's reported per-question correct percentages for each question.

Now, let's generalize the model a bit -- let there be a heterogeneity parameter, h, that indicates how different the low, medium, and high achieving groups are. We would say that the low group gets questions right with probability (1 - h) * base_rate (where base_rate is the reported per-question correct percentage), the medium group gets questions right with probability base_rate, and the high group gets questions right with probability (1 + h) * base_rate. Setting h to 0 gives Nate's first, homogeneous model (1A above), and setting h to .5 gives his second, more realistic model (2A above).
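
To make that concrete, here's a quick sketch of the per-group probabilities for a single question (the .28 base rate below is just an example value, not a number from the poll; the full simulation code is at the bottom of the post):
def group_correct_probs(base_rate, h):
    """Correct-answer probabilities for the low, medium, and high
    achieving groups under heterogeneity parameter h."""
    return [(1 - h) * base_rate, base_rate, (1 + h) * base_rate]

print(group_correct_probs(0.28, 0.0))  # h = 0:  all three groups at the base rate
print(group_correct_probs(0.28, 0.5))  # h = .5: roughly [0.14, 0.28, 0.42]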

So now we can ask what the likely settings of h are, given the data that Strategic Vision LLC has released. By Bayes rule, P(h | counts_data, base_rate_data) \propto P(counts_data | h, base_rate_data) P(h | base_rate_data). In order to make progress now, we need to come up with a measure for P(counts_data | h, base_rate_data). In other words, if we assume that h and the per-question percentages are fixed, how likely are we to generate a specific set of counts data? Nate presents the graphs in his post and asks us to visually inspect how different the curves are, so I use this as the basis for the likelihood function:
l = the sum, over the #-correct bins, of the absolute difference between the observed and expected number of students in that bin

So if the expected counts_data result is [x0, x1, x2], then we're assuming that the likelihood of generating [y0, y1, y2] is proportional to -[abs(x0 - y0) + abs(x1 - y1) + abs(x2 - y2)]. In other words, we are (very roughly) linearly more incredulous as the area between Nate's two curves gets bigger.
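
Here's that mismatch score on its own, with some made-up three-bin numbers just to show the calculation (the full code below computes the same thing inside the simulation loop):
import numpy as np

def curve_mismatch(expected_counts, observed_counts):
    """Total absolute difference between the expected and observed
    fractions of students in each number-correct bin; the likelihood
    above is taken to be proportional to the negative of this."""
    expected_counts = np.asarray(expected_counts, dtype=float)
    observed_counts = np.asarray(observed_counts, dtype=float)
    return np.sum(np.abs(expected_counts - observed_counts))

print(curve_mismatch([0.2, 0.5, 0.3], [0.25, 0.45, 0.3]))  # ~0.1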

What I want to do now is estimate what Strategic Vision LLC's data is saying about h -- that is, P(h | counts_data, base_rate_data). We could just use a large sample of say 50,000 students to get expected counts, but I want to also capture some of the uncertainty in the counts data due to only pulling 1000 students.

I do this by replacing
l = sum(abs(expected_counts - counts_data))

with
l = (1/N) sum_i [sum(abs(sampled_count_i - counts_data))]

where I get sampled_count_i by dividing the students into three groups, flipping coins independently for each student, and counting a question as correct with the proper probability for the (group, question) combination. I repeat this with N=2000 for several values of h between 0 and .6. I posted the code I used for this below, so you can see everything down to the last detail. The following graph shows -l, which I call Curve Mismatch, versus h. I also show error bars of the standard deviation of the individual sample likelihoods -- roughly, how much variation there is from sampling 1000 students and using a loop of 2000 samples to estimate l.

So this is showing the estimated (negative) likelihood of different values of h given Strategic Vision LLC's data and my assumptions detailed above. What it shows is fairly interesting: values of h from .2 to .3 look like they may explain the data even better than a value of 0, although there is quite a bit of uncertainty in the range of 0 to .3.

How do we interpret this? If Nate had chosen to give his low knowledge group a little more knowledge and his high knowledge group a little less (.2 worth, to be precise), then we'd expect the mismatch in his curves to be less than we see in the .5 case, and maybe even less than we see in the h=0 case. Actually, I'll just make the curve for h=.3 now: Compared to the h=0 one from Nate: Pretty similar, huh?

So Nate's conclusion is that Strategic Vision LLC's data implies the absurd model of h = 0, so we should distrust their data -- sort of like a proof by contradiction, but not a proof. I argue instead that Strategic Vision LLC's data implies that h could very easily be in the range of [0, .3], but it's pretty unlikely that h is .5.

And of course, if you want to be Bayesian, you can also start to think about what your prior belief over h is. Most people probably don't think all students achieve equally, so it seems reasonable to think h is greater than zero, but I'm not entirely convinced that it's as high as .5. If I had to try to put numbers to it, my completely uninformed, personal prior would probably be increasing on the range [0, .2], flat from [.2, .6], then decreasing beyond .6.
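
If I had to write that prior down in code, it would be a rough, unnormalized bump like the following (the exact slopes and the cutoff at .8 are arbitrary choices, just to make the shape concrete):
def prior_over_h(h):
    """Rough, unnormalized prior over h: increasing on [0, .2],
    flat on [.2, .6], and decreasing beyond .6."""
    if h < 0.0:
        return 0.0
    elif h < 0.2:
        return h / 0.2
    elif h <= 0.6:
        return 1.0
    else:
        return max(0.0, 1.0 - (h - 0.6) / 0.2)

for h in [0.0, 0.1, 0.3, 0.5, 0.7, 0.9]:
    print("h = %.1f, prior = %.2f" % (h, prior_over_h(h)))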

So my personal, Bayesian conclusion from all of this: if we're going to assume this three-class-of-achiever model, then Strategic Vision LLC's data implies that h is probably in the range of [.2, .3]. Is that absurd? I really don't know. Is Strategic Vision LLC a big fraud? I have no idea. I'd never even heard of them before Nate's first post.

For those who want to check my work and follow along at home, here's the code I used to generate the graphs. The hard-coded numbers are from the Strategic Vision LLC data that's up at Nate's post. The first file is for the first graph, the second file is for the second graph.
import numpy as np
from pylab import *

num_runs = 2000  # number of simulated data sets to make per model
num_students = 1000
fraction_right = np.array([.28, .26, .27, .1, .14, .61, .43, .11, .23, .29])  # reported per-question correct fractions
num_questions = len(fraction_right)

sv_correct_counts = np.array([46, 158, 246, 265, 177, 80, 22, 6, 0, 0, 0]) / 1000.0  # reported fraction of students answering 0-10 questions correctly
heterogeneities = [.6, .5, .4, .3, .2, .1, 0]

result_means = []
result_stds = []

for heterogeneity in heterogeneities:
    count_diff = np.zeros(num_runs)

    for run in range(num_runs):
        sim_correct_counts = np.zeros(num_questions + 1)


        for i in range(num_students):
            answers = np.random.rand(num_questions)

            if i < num_students / 3:
                num_right = sum(answers < (1 - heterogeneity) * fraction_right)
            elif i < 2 * num_students / 3:
                num_right = sum(answers < fraction_right)
            else:
                num_right = sum(answers < (1 + heterogeneity) * fraction_right)

            sim_correct_counts[num_right] += 1

        sim_correct_counts /= num_students

        count_diff[run] = 100 * np.sum(np.abs(sim_correct_counts - sv_correct_counts))

    print heterogeneity, np.mean(count_diff), np.std(count_diff)

    result_means.append(np.mean(count_diff))
    result_stds.append(np.std(count_diff))


errorbar(heterogeneities, result_means, yerr=result_stds)
xlim([-.1, .7])
title("Strategic Vision, LLC vs. Simulated Test Score Counts")
xlabel("Model Heterogeneity")
ylabel("Curve Mismatch")
show()
Second graph:
import numpy as np
from pylab import *

num_students = 50000
fraction_right = np.array([.28, .26, .27, .1, .14, .61, .43, .11, .23, .29])
num_questions = len(fraction_right)

sv_correct_counts = np.array([46, 158, 246, 265, 177, 80, 22, 6, 0, 0, 0]) / 1000.0
heterogeneity = .3

sim_correct_counts = np.zeros(num_questions + 1)

for i in range(num_students):
    answers = np.random.rand(num_questions)

    if i < num_students / 3:
        num_right = sum(answers < (1 - heterogeneity) * fraction_right)
    elif i < 2 * num_students / 3:
        num_right = sum(answers < fraction_right)
    else:
        num_right = sum(answers < (1 + heterogeneity) * fraction_right)

    sim_correct_counts[num_right] += 1


sim_correct_counts /= num_students
plot(range(num_questions+1), sim_correct_counts, 'b-o', lw=2)
plot(range(num_questions+1), sv_correct_counts, 'r-x', lw=2)
legend(['Simulated', 'Actual'])
show()

Monday, September 21, 2009

NIPS 2009 Accepted Papers

Posted by Danny Tarlow
The list of accepted papers for NIPS 2009 has been released:
http://nips.cc/Conferences/2009/Program/accepted-papers.php

As usual, it looks to be an interesting conference. I'm particularly interested in most of the ones that have "MAP" in the title.

Netflix Prize Awarded and New Contest Announced

Posted by Danny Tarlow
BellKor pulls out the last minute win:
http://bits.blogs.nytimes.com/2009/09/21/netflix-awards-1-million-prize-and-starts-a-new-contest/

There is also a new contest in the works:
The new contest is going to present the contestants with demographic and behavioral data, and they will be asked to model individuals’ “taste profiles,” the company said. The data set of more than 100 million entries will include information about renters’ ages, gender, ZIP codes, genre ratings and previously chosen movies. Unlike the first challenge, the contest will have no specific accuracy target. Instead, $500,000 will be awarded to the team in the lead after six months, and $500,000 to the leader after 18 months.
This definitely sounds like a more interesting data set. It will be particularly interesting to see if this additional information improves the performance of the systems that did well on the original contest.

Thursday, September 17, 2009

Research Group Blogs

Posted by Danny Tarlow
Daniel Tunkelang has a post about, among other things, CSAIL at MIT's Haystack blog:
I wish that more universities and departments would encourage their faculty and students to blog. As Daniel Lemire has pointed out, it’s a great way for academic researchers to get their ideas out and build up their reputations and networks. He should know–he leads by example. Likewise, Haystack is setting a great example for university blogs, and is a credit to MIT and CSAIL.
I definitely agree. Anybody from Toronto want to join in on the fun?

Scraping in Python

Posted by Danny Tarlow
Not every website out there has their data available via a nice API. Now, I wish it weren't the case, but I can think of several good reasons for a company or organization not to release their data:
  1. The data is valuable and provides a competitive advantage to the company.
  2. It would require hardware upgrades to support lots of downloads or additional requests.
  3. There aren't enough engineering resources at the company to make it worthwhile to devote an engineer to building out an API.
At the same time, though, most websites show you a good bit of data in some format or another every time you visit their site: Google gives you a small data set of web pages relevant to a given keyword every time you enter a query; Digg gives you a data set of recent, popular links; Twitter gives you a data set of recent events from your sphere; you get the point.

Given that all of the sites I listed above have APIs, I have to think that (1) isn't actually as big of a concern as it might seem, at least as long as there are reasonable limits that keep somebody from flat-out duplicating the site. Anyhow, whatever the reason, sometimes the best way to get your hands on some data is to crawl the website and scrape the data from the raw HTML.

Now, there is some touchy legal ground involved with scraping. For example, the law firm WilmerHale (I don't know anything about the firm) has a 2003 article about the legality of web scraping:
Based on these cases, it would appear that anyone who, without authorization, uses a web "scraper" or similar computer program to access and download data from a third party website risks potential and perhaps serious legal claims from the website operator. However, the cases suggest that, for website operators that wish to protect the data available on their website, the failure to observe some basic precautions may compromise or even preclude such claims. Specifically:
  • website operators should ensure that their website terms and conditions specifically prohibit unauthorized access or downloading of data using any computer program; and
  • website operators should either clearly identify the terms and conditions of use on each webpage containing valuable data or provide an obvious link to a webpage with those conditions.
I do not in any way advocate using this code to scrape data from a website that disallows it. I also recommend that you contact the website administrators to get permission before scraping any data from any site.

With that out of the way, there may be some cases where you have permission to scrape data. This is the case I'm going to consider from here on out.

Now this should go without saying, but you definitely want to go easy on the site's servers. The last thing you want to do is fire off an accidental denial-of-service attack with thousands of requests a minute. Hopefully the server has some automated systems in place to deal with such a basic attack (e.g., by blocking you), but there's no reason to test it, especially when the site owner has so graciously allowed you to crawl their data. I've found that an average of a request per minute is reasonable for small crawl jobs, but some people advocate backing off even more if it's a big job. Yes, it might take weeks, but sometimes that's the price you have to pay.

There are sometimes other cases where you have permission to crawl the site, but the website has built in mechanisms to block requests that appear to be crawlers. If it's not already in place, it can be a bit of a pain to set up a whitelist based on IP address or some other tag, especially for sites where the engineering resources are tight as is. In this case, it may be helpful to make your requests look as much like typical web traffic as possible.

The two most useful things I've found to do in this case are:
  • Add some randomness to the visit frequency (beyond just waiting M + N * rand() seconds between requests).
  • Send realistic looking headers along with the request.
I've implemented a basic crawler that uses all of these strategies. You start by telling it the first page that you want to visit, along with how many subsequent pages you want. You need to define a pattern that pulls a link to the next page from the most recently downloaded page. It will then download a page, find the "next" link, download the next page, and so on, up to the number of pages you request. You still have to parse all of the HTML, but all of the pages will be there waiting for you, sitting in your output/ directory.

Remember, though, only use this in cases where you have permission from the website owner.
import sys
import time
import re
from subprocess import call
from datetime import datetime
import numpy as np


HEADERS = {
    "Host" : "www.SOME-SITE-HERE.com",
    "User-Agent" : "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.5; en-US; rv:1.9.0.10) Gecko/2009042315 Firefox/3.0.10",
    "Accept" : "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language" : "en-us,en;q=0.5",
    "Accept-Charset" : "ISO-8859-1,utf-8;q=0.7,*;q=0.7",
    "Keep-Alive" : "300",
    "Connection" : "keep-alive"
}


class GenericCrawler():

    base_url = 'http://www.SOME-SITE-HERE.com/'
    
    next_page_pattern = r'<a href="/([^"]*?)">Next Page</a>'


    def __init__(self, starting_page):
        self.starting_page = starting_page
        self.next_page_re = re.compile(self.next_page_pattern)


    def Get(self, num_pages):
        next_page_url = self.starting_page

        for i in range(num_pages):
            print "Page %s" % i
            self.FetchPage(i, next_page_url)
            next_page_url = self.NextLinkInLastResult()


    def FetchPage(self, page_num, relative_url):
        
        request_url = '%s%s' % (self.base_url, relative_url)

        current_time = datetime.now()
        output_file = "output/%s_%s_%s_%s_%s_%s__p%s.html" % (self.starting_page,
                                                              current_time.year, current_time.month,
                                                              current_time.day, current_time.hour,
                                                              current_time.minute, page_num)
        self.last_result = output_file

        # Grab the page with curl and write it to the output file
        print request_url

        curl_args = ["curl", "-o", output_file]
        for h in HEADERS:
            curl_args.append('-H')
            curl_args.append("%s: %s" % (h, HEADERS[h]))
        curl_args.append(request_url)
            
        call(curl_args) 

        # Don't overload the server or trip the spider detector
        time.sleep(30 + 30 * np.random.random())

        if np.random.random() < .02:
            print "Taking a long break"
            time.sleep(300 + 300 * np.random.random())


    def NextLinkInLastResult(self):
        f = open(self.last_result, 'r')

        for line in f:
            m = self.next_page_re.findall(line);
            if len(m) > 0:
                print "Next page relative URL: ", m[0]
                return m[0]

        return None


if __name__ == "__main__":

    starting_page = sys.argv[1]
    num_pages = int(sys.argv[2])

    h = GenericCrawler(starting_page)
    h.Get(num_pages)

Wednesday, September 9, 2009

Advanced NFL Stats

Posted by Danny Tarlow
Every so often you come across somebody who is just doing a really great job and deserves a special mention. Brian at Advanced NFL Stats is a great example of doing thorough and insightful analysis of NFL football data:
http://www.advancednflstats.com/

Monday, August 31, 2009

Why is it so hard to reduce energy usage in buildings?

Posted by Danny Tarlow
Whether you believe it or not, you have probably heard in some way or another that our current habits of energy use are not sustainable. That is, if we remove fossil fuels from the picture, we are a long way from being able to provide the energy that we as a society use (and take for granted) on a daily basis.

For those of us interested in the numbers, a great starting point is David MacKay's book (free online), Sustainable Energy - Without the Hot Air. He goes to great length to present an unbiased, rigorously derived analysis of the energy used, available, and potentially available for the UK in the future under different scenarios. Everything is presented in common units, so you start to get an intuition about, say, the relative energy savings of unplugging your phone charger when you're not using it versus carpooling to work.

In one of his thousand word summaries, he lays out the problem plainly:
Even if we imagine strong efficiency measures and smart technology-switches that halved our energy consumption [from 125 kWh per day per person to 60 kWh per day] (which would be lower than the per-capita consumption of any developed country today), we should not kid ourselves about the challenge of supplying 60 kWh per day without fossil fuels. Among the low-carbon energy supply options, the three with the biggest potential are wind power, nuclear power, and concentrating solar power in other people's deserts. And here is the scale that is required if (for simplicity) we wanted to get one third from each of these sources: we would have to build wind farms with an area equal to the area of Wales; we would have to build 50 Sizewells of nuclear power; and we would need solar power stations in deserts covering an area twice the size of Greater London.
Finding new sources of energy is half of the picture, and MacKay does a good job covering it. The other half of the picture comes from finding ways to use less energy. One of the most important places to look here is in energy consumed by buildings, which -- by most measures I've heard -- account for around 40% of all energy consumption in the developed world.

There are other good reasons for making buildings more energy efficient; namely, if you operate a number of large buildings (think Walmart, McDonalds, Disney), it can save you quite a bit of money, especially if energy costs rise. It is also good for a company's public image.

As with the energy generation side, there are plenty of suggestions for how to reduce energy use in buildings, from adding glazing or shading devices to windows, increasing insulation in the walls, using more efficient light bulbs, using natural ventilation for cooling, reusing heat produced from other sources to heat the building, storing energy in ice, water, or other mediums, etc.

Unfortunately, every building is unique due to its design, location, surrounding climate, construction imperfections, and operational patterns, so there is no one-size-fits-all solution that gives the most energy-savings bang for the buck. For example, adding extra insulation to a building that has its door or windows left open all day (say, a store that wants to attract customers inside) is going to have very little effect.

In light of this, how do we decide what energy-saving measures to implement?

One approach that has been gaining popularity in recent years is called LEED, which stands for Leadership in Energy and Environmental Design.

LEED is a checklist system; implementing a specific measure gives you a set number of points. A building designer can pick and choose any subset of measures, and if the sum of points is above a certain level, the building is called "LEED Certified":
In LEED 2009 there are 100 possible base points plus an additional 6 points for Innovation in Design and 4 points for Regional Priority.
  • Sustainable Sites (26 Possible Points)
  • Water Efficiency (10 Possible Points)
  • Energy and Atmosphere (35 Possible Points)
  • Materials and Resources (14 Possible Points)
  • Indoor Environmental Quality (15 Possible Points)
  • Innovation in Design (6 Possible Points)
  • Regional Priority (4 Possible Points)
Buildings can qualify for four levels of certification:
  • Certified - 40-49 points
  • Silver - 50-59 points
  • Gold - 60-79 points
  • Platinum - 80 points and above
(from Wikipedia)
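
The certification logic itself is as simple as it sounds; here's what the point-to-level mapping quoted above boils down to:
def leed_2009_level(points):
    """Map a LEED 2009 point total to a certification level,
    using the thresholds quoted above."""
    if points >= 80:
        return "Platinum"
    elif points >= 60:
        return "Gold"
    elif points >= 50:
        return "Silver"
    elif points >= 40:
        return "Certified"
    return "Not certified"

print(leed_2009_level(47))  # Certified
print(leed_2009_level(63))  # Gold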

The New York Times recently wrote an article on a study highlighting the shortcomings of LEED:
But in its own study last year of 121 new buildings certified through 2006, the Green Building Council found that more than half — 53 percent — did not qualify for the Energy Star label and 15 percent scored below 30 in that program, meaning they used more energy per square foot than at least 70 percent of comparable buildings in the existing national stock.
We begin to see the problem. Building designers like LEED because it is simple: check off the items you want, and you get the certification. Unfortunately, LEED doesn't reliably predict the actual energy savings that you will see in a building.

A second approach, which is actually used as a subroutine in some of the LEED checkboxes, is to use building simulation tools to predict how much energy a building will use given its design, location, and purpose. The leading community that develops building simulation tools is the International Building Performance Simulation Association (IBPSA, pronounced "ih-bip-suh"), which organizes several conferences and a journal, and its members are active in developing most of the popular building simulation tools, such as EnergyPlus and ESP-r.

The tools operate at an impressive level of detail and simulate a great deal of physics, but it is still surprisingly difficult to predict how changing a building will affect how much energy it consumes. The greatest difficulty is in modeling human occupancy and operational patterns, but there are also significant difficulties dealing with weather, construction imperfections, and model simplifications that are made to speed up the computations. Further, and perhaps most insidiously, the person doing the energy modeling is often unsure about the inputs to give the tool, since it can be a serious challenge to track down exact material and equipment properties. Worse, these tools do not really deal with uncertainty, so the output is an estimate of energy usage with no attempt to quantify how certain the model is, and errors of 50-100% are not that uncommon.

Now before I go too far criticizing these building simulation tools, I think it's important to repeat that making good predictions for any building we can imagine is a very hard problem. Each building really is a little world in itself, and no two are the same. The number of options for materials, equipment, and configurations is huge, and peculiarities of any piece of the system can have a large impact on the outcome. The people who work on these tools are aware of their shortcomings, and many very smart people are working on ways to improve them.

I was recently at a panel discussion on Validation at Building Simulation 2009, though, and I think I fundamentally disagree with most of what I heard there. The big question at play was how to go forward validating building simulation tools. Current practice relies heavily on validating tools against themselves, arguing that the first step in checking a tool is to see that it gets the physics correct and doesn't diverge from the existing tools. There is some validation on real buildings, but it is quite limited.

At the beginning of the discussion, I was encouraged. Each of the panelists was very critical of current validation methods, and they acknowledged that the errors made in predictions are one of the major reasons these tools haven't seen more widespread adoption outside the community. The way forward as they seemed to see it, though, was to focus more on the physics and thermodynamics. They argued that the best thing to do would be to build a synthetic test building, outfitted with a huge array of sensors that could measure every internal component in a model, then simulation tools could be validated based on whether their internal state matched the measured state for the test building.

Though this would undoubtedly produce interesting data, it would be expensive, and I don't think it's the most important type of data to collect. If it were possible to know all of the inputs and occupancy patterns of a building, the tools would already do a pretty good job.

Instead, I think there needs to be more of a focus on the generative model -- what the characteristics of the building stock are, and how those characteristics are related. For example, a tool should know that if a building is a grocery store in Arizona, it's probably running the air conditioner extremely high in the summer, or that if a building was built in the 1950's in Chicago, it probably uses material X and is more likely than average to have leaky window installations. Maybe people in New York are more likely to leave their heaters on overnight even when buildings aren't occupied.

In order to learn these types of things, we need a different type of data. Rather than complete, extremely detailed data about one controlled building, we need a huge amount of possibly incomplete data about real buildings from all over the world with all sorts of designs, constructions, and operational patterns. The problem is not that we don't understand the core physics. The problem is that we don't know how to take data that we've collected and analysis that we've done on one building and apply it to other buildings.

I raised my hand and asked during the panel session if anybody had considered creating a central repository to hold data from all of the case studies that were presented at these conferences, so that we could evaluate simulation tools as they are actually used. My argument was (a) if the tools came up with good estimates of energy use across tens of thousands of real buildings, then validation would be pretty much done, and we could be fairly confident that they would come up with good estimates of energy use for the buildings that we're interested in modeling; and (b) even if the estimates are way off, at least the tool would be able to report that it has no confidence in its answer -- it could know that it has large errors on certain types of buildings in certain locations (e.g., using cross validation).

I was surprised to be met with quite a bit of resistance. The one liner that I got in return was, "maybe it would be right, but it would probably be right for all the wrong reasons." So my question is this: would you rather be right for all the wrong reasons or wrong for all the right reasons?
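
For what it's worth, here's a toy sketch of the kind of per-building-type error report I had in mind; the case-study records and numbers are entirely hypothetical:
from collections import defaultdict

def error_by_group(records):
    """Given (building_type, location, predicted_kwh, measured_kwh) records
    from hypothetical case studies, report the mean absolute percentage
    error of the simulation tool for each (type, location) group."""
    groups = defaultdict(list)
    for btype, loc, predicted, measured in records:
        groups[(btype, loc)].append(abs(predicted - measured) / measured)
    return dict((key, 100.0 * sum(errs) / len(errs)) for key, errs in groups.items())

# Made-up case studies, just to show the shape of the output:
case_studies = [
    ("grocery store", "Arizona", 950e3, 1.4e6),
    ("grocery store", "Arizona", 1.1e6, 1.6e6),
    ("office", "Chicago", 2.0e6, 2.1e6),
]
for (btype, loc), mape in sorted(error_by_group(case_studies).items()):
    print("%s, %s: %.1f%% error" % (btype, loc, mape))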

To be continued...

Thursday, August 20, 2009

Python for Scientific Computing Conference

Posted by Danny Tarlow
The SciPy 2009 conference is happening right now in Pasadena, California. I am unfortunately in the wrong region of California to be attending, but the contents of the conference -- and particularly the tutorials -- look great.

The ones I would be most excited to attend would be the sympy and pyCUDA tutorials.

I've long been a big fan of sympy for hairy symbolic math computations. There have been some frustrations using it, but it is growing to be a quite usable library, and I've found the support via the mailing list to be top notch.

I'd never heard of pyCUDA before today, but it promises to be a clean, friendly, Python interface to GPU programming. It is based on the Nvidia CUDA API, which looks to have been used to significantly speed up some cool and relevant applications, like sliding window-based object detection.

The best part? Twitter is buzzing that the tutorial videos are available online:
http://www.archive.org/search.php?query=scipy

City data: Toronto and San Francisco

Posted by Danny Tarlow
My year of "sabbatical" is quickly coming to an end, so I'm packing up my San Francisco apartment and getting ready to head back to Toronto.

The upsides of moving back to Toronto are
(a) I'll have more time to focus on research, and
(b) I need to take a few more classes.

I'm obviously excited about (a).

I'm looking forward to (b), mostly because the (only) class I've decided on at this point is Greg Wilson's software consulting project course. It is supposed to be a cool course every year, but this year it sounds particularly interesting: each project will somehow make use of data that the city of Toronto is getting ready to release. I'm teaming up with another one of my AI group buddies, so together we make up the graduate student contingent.

We're now brainstorming interesting projects -- hopefully somehow incorporating cool machine learning. Of course, the difficulty so far is that we don't yet know exactly what data is going to be available to us. Interestingly, though, my other city -- the city of San Francisco -- just released its own variant of publicly available city data:
http://www.datasf.org/

Lately I've found myself arguing quite a bit that we should be collecting and releasing more data. Now that these cities seem to be doing it, hopefully we can turn around and find something interesting to do with it. If you have great ideas, I'd love to hear them.

Thursday, August 13, 2009

Mutable heaps in C++ (for updates to priority_queue priorities)

Posted by Danny Tarlow
I'm usually very happy with the STL C++ data structures and libraries, so I was a bit surprised to find that the built-in heap doesn't support updating a value. This comes up, for example, in iterative algorithms where in each iteration we make a change to a small number of variables at the front of a queue. The changes may affect some other non-front variable values as well, but we want to reuse most of the sorting work that we've done in previous iterations. In this case, we keep the same heap from iteration to iteration and just call update() on the changed values when necessary at a cost of O(log N) for binary heaps.

After not finding what I needed in the core STL, I turned to the Boost C++ libraries and found the mutable_queue. The class goes pretty far down the generic programming rabbit hole, but I won't comment on that.

Anyway, to define a mutable_queue, there are four template arguments that need to be filled in:
  template <class IndexedType, 
            class RandomAccessContainer = std::vector<IndexedType>, 
            class Comp = std::less<typename RandomAccessContainer::value_type>,
            class ID = identity_property_map >
  class mutable_queue { ...
Conveniently, though, when we are working with basic types, we can just give it one template argument, and the defaults will work for the other template parameters. This worked great for me when I tried it out with a simple mutable_queue<double>, and it was pretty easy to get it running and working correctly.

What I need, however, is a mutable_queue of Node * objects. So it seemed fairly straightforward to fill in the template arguments:
typedef Node * entry_t;
typedef vector<entry_t> storage_t;
typedef CompareNodeStar comp_t;
typedef identity_property_map prop_map_t;
typedef mutable_exposed_queue<entry_t, storage_t,
    comp_t, prop_map_t> queue_t;

queue_t *q = new queue_t(N, comp_t(), prop_map_t());
Which says give me a queue of Node *'s, use a vector of Node *'s as the underlying storage, compare elements using the CompareNodeStar comparison function that I defined, and use the default property_map*.

Unfortunately, this gives me errors, as do several other attempts at filling in property_map types. *By the way, here's what a property map is:
The property map interface consists of a set of concepts (see definition of "concept" in [1] and [2]) that define a syntax for mapping key objects to corresponding value objects. Since the property map operations are global functions (actually they don't have to be global, but they are always called unqualified and may be found via argument dependent lookup), it is possible to overload the map functions such that nearly arbitrary property map types and key types can be used. The interface for property maps consists of three functions: get(), put(), and operator[].
At this point, I'd already sunk a couple hours into the boost library, and I was getting a bit frustrated -- why should I have to learn about the internal details of property_maps when all I want to do is apply the data structure in a fairly natural way? Yes, maybe I'm being lazy, and granted I don't use C++ as much as I used to, but it seems to me that this should be easier than it is. (Maybe somebody can come along and point out the correct way of doing this -- I put in a reasonable effort to figure it out, but by no means am I confident that I'm even on the right track).

The other source of my frustration is probably that heaps are not that complicated. To update a heap, all you really have to do is figure out whether the element is ok where it is, needs to be pushed up, or needs to be pushed down. So I changed tactics and found a less generic (but still pretty generic) heap update implementation, called HEAPPlus:
http://www.cse.ust.hk/~lihw/code/heapplus/

For my purposes, I like this implementation quite a bit better. It uses the same style of interface as STL heaps, where you have a vector as the core storage structure, then heaps are just a set of algorithms that maintain a proper ordering of elements in the vector. All that heapplus.h does is define update_heap(start_iterator, end_iterator, edit_position_iterator, new_value).

This worked (almost) on the first try for me for doubles, Nodes, and Node *'s, but it's still not exactly what I want. I want to be able to change the value of an element in the heap separately from the call to updating the heap. This comes up in the iterative algorithm example, where the iterative algorithm maintains pointers to the Node elements, so it can change their priority directly. Granted, you could have the algorithm change the priority, then pass a pointer to the changed node as the new_value, but that seems redundant to me. It's more natural for me to think of leaving the last argument off: update_heap_pos(start_iterator, end_iterator, edit_position_iterator).

So that's what I did. I took the HEAPPlus implementation and defined a couple new functions to support update_heap_pos, then I wrote some simple examples to show the usage. Here's the test file, test.cpp:
// test.cpp
// Used to test heapplus.h
// lihw
// 2006.12.18
// ChangeLog:
//  2009.08.10  - Added tests for Node and Node * heaps, and added Comparator
//                argument to print_heap, so that pop_heap works as expected.
//                (Danny Tarlow: dtarlow@cs.toronto.edu)
//  2006.12.18  - Initially created (lihw)

#include <iostream>
#include <ostream>
#include <functional>
#include <algorithm>
#include <vector>
#include "heapplus.h"

#include <ctime>
#include <cstdlib>

using namespace std;


template<typename Container>
inline
void PRINT_ELEMENTS(const Container& container, const char* title= 0) 
{
    if (title)
        std::cout << title << std::endl;

    typename Container::const_iterator iter, beg = container.begin(),
             end = container.end();
    for (iter = beg; iter != end; ++iter) 
        std::cout << *iter << " ";
    std::cout << std::endl;
}

template<typename Container, typename Compare>
void print_heap(const Container& container, Compare comp, const char* title = 0)
{
    if (title)
        std::cout << title << std::endl;

    std::vector<typename Container::value_type> temp(container.begin(),
            container.end());
    while (temp.size()) {
        std::cout << temp[0] << " ";
        std::pop_heap(temp.begin(), temp.end(), comp);
        temp.pop_back();
    }
    std::cout << std::endl;
}


/**
 *  A very simple node class just to test the update_heap_pos functions
 * and to show an example usage.  Each node lies on an x-y grid, and it
 * has a priority associated with it, which we assume will be computed and
 * and modified by some procedure outside the scope of these functions.
 *
 * This example shows how to maintain a queue of grid points, ordered by
 * priority, where updates take at worst O(lg N) time.
 **/

class Node {

  friend ostream &operator<<(ostream &os, Node &n);
  friend bool operator<(Node a, Node b);
  friend bool operator==(Node &a, Node &b);
  friend ostream &operator<<(ostream &os, Node *n);

public:

  Node() : x_(0), y_(0), priority_(0.0) { }

  Node(int x, int y) : x_(x), y_(y), priority_(0.0) { }

  double Priority() { return priority_; }

  void SetPriority(double priority) {
    priority_ = priority;
  }

protected:
  double priority_;
  int x_;
  int y_;
};

ostream &operator<<(ostream &os, Node &n) {
  os << "(" << n.x_ << "," << n.y_ << "):" << n.Priority();
  return os;
}
bool operator<(Node a, Node b) {
  return (a.Priority() < b.Priority());
}
bool operator==(Node &a, Node &b) {
  return (a.x_ == b.x_ && a.y_ == b.y_);
}
ostream &operator<<(ostream &os, Node *n) {
  os << "(" << n->x_ << "," << n->y_ << "):" << n->Priority();
  return os;
}
struct CompareNodeStar {
  bool operator()(Node *a, Node *b) {
    return (a->Priority() < b->Priority());
  }
};


int main(void) {

    int side = 4;
    int N = side * side;

    vector<double> doubleheap(0);
    vector<Node> nodeheap(0);
    vector<Node *> nodestarheap(0);

    for (int x = 0; x < side; x++) {
      for (int y = 0; y < side; y++) {
        Node node = Node(x, y);
        Node *nodestar = new Node(x, y);
        double priority = (x - side / 2.0) * (y - side / 4.0);

        node.SetPriority(priority);
        nodestar->SetPriority(priority);

        doubleheap.push_back(priority);
        nodeheap.push_back(node);
        nodestarheap.push_back(nodestar);
      }
    }

    cout << "heap<double>" << endl;
    //PRINT_ELEMENTS(doubleheap, "Initial heap");
    print_heap(doubleheap, less<double>());


    make_heap(doubleheap.begin(), doubleheap.end());
    //PRINT_ELEMENTS(doubleheap, "Made heap");
    print_heap(doubleheap, less<double>());

    doubleheap[4] = 100;
    //PRINT_ELEMENTS(doubleheap, "edited");
    print_heap(doubleheap, less<double>());

    vector<double>::iterator itdouble = doubleheap.begin() + 4;
    update_heap_pos(doubleheap.begin(), doubleheap.end(), itdouble);
    //PRINT_ELEMENTS(doubleheap, "Updated and sorted");
    print_heap(doubleheap, less<double>());



    cout << endl;
    cout << "heap<Node>" << endl;
    print_heap(nodeheap, less<Node>());

    make_heap(nodeheap.begin(), nodeheap.end());
    print_heap(nodeheap, less<Node>());

    nodeheap[4].SetPriority(100);
    print_heap(nodeheap, less<Node>());

    vector<Node>::iterator it = nodeheap.begin() + 4;
    update_heap_pos(nodeheap.begin(), nodeheap.end(), it);
    print_heap(nodeheap, less<Node>());



    cout << endl;
    cout << "heap<Node *>" << endl;
    //PRINT_ELEMENTS(nodestarheap, "Initial heap");
    print_heap(nodestarheap, CompareNodeStar());

    make_heap(nodestarheap.begin(), nodestarheap.end(), CompareNodeStar());
    //PRINT_ELEMENTS(nodestarheap, "Made heap");
    print_heap(nodestarheap, CompareNodeStar());

    nodestarheap[4]->SetPriority(100);
    //PRINT_ELEMENTS(nodestarheap, "edited");
    print_heap(nodestarheap, CompareNodeStar());

    vector<Node *>::iterator itstar = nodestarheap.begin() + 4;
    update_heap_pos(nodestarheap.begin(), nodestarheap.end(), itstar, CompareNodeStar());
    //PRINT_ELEMENTS(nodestarheap, "Updated and sorted");
    print_heap(nodestarheap, CompareNodeStar());


    return 0;
}

And here is the modified heapplus.h file:
// Heap Plus implementation 1.01alpha
// lihw 
// 2006.12.18
// ChangeLog:
//  2009.08.10  - Added update_heap_pos functions so that we can adjust
//                heap values outside of update_heap functions -- e.g., if
//                we have external pointers into the heap entries -- then
//                call update_heap on the position only, regardless of the value.
//                (Danny Tarlow: dtarlow@cs.toronto.edu)
//  2006.12.18  - Initially created (lihw)

#ifndef HEAPPLUS_H_
#define HEAPPLUS_H_

#include <iterator>

namespace std {

template<typename _RandomAccessIterator, typename _Distance, typename _Tp, typename _Compare>
inline void
__up_heap(_RandomAccessIterator __first, _RandomAccessIterator __end, _RandomAccessIterator __pos,
        _Distance, _Tp __value,  _Compare __comp)
{
    _Distance __parent = (__pos - __first - 1) /  2;
    _Distance __index = __pos - __first;

    while (__index > 0 && __comp(*(__first + __parent), __value)) {
        *(__first + __index) = *(__first + __parent);
        __index = __parent;
        __parent = (__parent - 1) / 2;
    }
       
    if (__pos != (__first + __index))   
        *(__first + __index) = __value;
}

template<typename _RandomAccessIterator, typename _Distance, typename _Tp, typename _Compare>
inline void
__down_heap(_RandomAccessIterator __first, _RandomAccessIterator __last, _RandomAccessIterator __pos,
        _Distance, _Tp __value, _Compare __comp)
{
    _Distance __len = __last - __first;
    _Distance __index = __pos - __first;
    _Distance __left = __index * 2 + 1;
    _Distance __right = __index * 2 + 2;
    _Distance __largest;

    while (__left < __len) {
        if (__right < __len) {
            // Both children exist: pick the larger one under __comp.
            __largest = __comp(*(__first + __right), *(__first + __left)) ? __left : __right;
        } else {
            // Only the left child exists.
            __largest = __left;
        }
        if (__comp(__value, *(__first + __largest))) {
            *(__first + __index) = *(__first + __largest);
            __index = __largest;
            __left = __index * 2 + 1;
            __right = __index * 2 + 2;
        } else
            break;
    }

    if (__pos != (__first + __index))   
        *(__first + __index) = __value;
}

template<typename _RandomAccessIterator, typename _Distance, typename _Tp, 
    typename _Compare>
inline void
__update_heap(_RandomAccessIterator __first, _RandomAccessIterator __last,
        _RandomAccessIterator __pos, _Distance, _Tp __value, _Compare __comp) 
{
    *(__pos) = __value;
    
    _Distance __index = (__pos - __first);
    _Distance __parent = (__index - 1) / 2;

    if (__index > 0 && __comp(*(__first + __parent), __value))
        __up_heap(__first, __last, __pos, _Distance(), __value, __comp);
    else
        __down_heap(__first, __last, __pos, _Distance(), __value, __comp);
}

template<typename _RandomAccessIterator, typename _Distance, typename _Compare>
inline void
__update_heap(_RandomAccessIterator __first, _RandomAccessIterator __last,
        _RandomAccessIterator __pos, _Distance, _Compare __comp) 
{
    _Distance __index = (__pos - __first);
    _Distance __parent = (__index - 1) / 2;

    if (__index > 0 && __comp(*(__first + __parent), *(__pos)))
      __up_heap(__first, __last, __pos, _Distance(), *(__pos), __comp);
    else
      __down_heap(__first, __last, __pos, _Distance(), *(__pos), __comp);
}

template<typename _RandomAccessIterator, typename _Tp>
inline void
update_heap(_RandomAccessIterator __first, _RandomAccessIterator __last,
        _RandomAccessIterator __pos, _Tp __value) 
{
      typedef typename iterator_traits<_RandomAccessIterator>::value_type _ValueType;
      typedef typename iterator_traits<_RandomAccessIterator>::difference_type _DistanceType;

      __update_heap(__first, __last, __pos, _DistanceType(), __value, less<_ValueType>());
}

template<typename _RandomAccessIterator, typename _Tp, typename _Compare>
inline void
update_heap(_RandomAccessIterator __first, _RandomAccessIterator __last,
        _RandomAccessIterator __pos, _Tp __value, _Compare __comp) 
{
      typedef typename iterator_traits<_RandomAccessIterator>::difference_type _DistanceType;

      // Pass an explicit distance tag so the overload of __update_heap that
      // assigns __value is selected; otherwise __value is silently ignored.
      __update_heap(__first, __last, __pos, _DistanceType(), __value, __comp);
}


template<typename _RandomAccessIterator>
inline void
update_heap_pos(_RandomAccessIterator __first, _RandomAccessIterator __last,
  _RandomAccessIterator __pos) {
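      // Re-establish the heap ordering after *__pos has been modified in
      // place, using less<value_type> as the comparison (see the _Compare
      // overload below for custom comparators).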
      typedef typename iterator_traits<_RandomAccessIterator>::value_type _ValueType;
      typedef typename iterator_traits<_RandomAccessIterator>::difference_type _DistanceType;

      __update_heap(__first, __last, __pos, _DistanceType(), less<_ValueType>());
}


template<typename _RandomAccessIterator, typename _Compare>
inline void
update_heap_pos(_RandomAccessIterator __first, _RandomAccessIterator __last,
  _RandomAccessIterator __pos, _Compare __comp) {
      typedef typename iterator_traits<_RandomAccessIterator>::value_type _ValueType;
      typedef typename iterator_traits<_RandomAccessIterator>::difference_type _DistanceType;

      __update_heap(__first, __last, __pos, _DistanceType(), __comp);
}



} // namespace std

#endif // !HEAPPLUS_H_

And finally, here is the output from compiling and running the test case:
heap<double>
2 2 -0 1 1 0 0 0 0 -0 -1 -0 -1 -2 -2 -4 
2 2 1 1 0 0 -0 0 0 -0 -0 -1 -1 -2 -2 -4 
2 2 100 1 0 0 -0 0 0 -0 -0 -1 -1 -2 -2 -4 
100 2 2 1 0 0 -0 0 0 -0 -0 -1 -1 -2 -2 -4 

heap<Node>
(0,0):2 (3,3):2 (0,1):-0 (1,0):1 (3,2):1 (2,2):0 (3,1):0 (2,1):0 (2,3):0 (2,0):-0 (3,0):-1 (1,1):-0 (1,2):-1 (0,2):-2 (1,3):-2 (0,3):-4 
(3,3):2 (0,0):2 (3,2):1 (1,0):1 (3,1):0 (2,3):0 (1,1):-0 (2,2):0 (2,1):0 (2,0):-0 (0,1):-0 (1,2):-1 (3,0):-1 (0,2):-2 (1,3):-2 (0,3):-4 
(3,3):2 (0,0):2 (1,0):100 (3,2):1 (3,1):0 (2,3):0 (1,1):-0 (2,2):0 (2,1):0 (2,0):-0 (0,1):-0 (3,0):-1 (1,2):-1 (1,3):-2 (0,2):-2 (0,3):-4 
(1,0):100 (3,3):2 (0,0):2 (3,2):1 (3,1):0 (2,3):0 (1,1):-0 (2,2):0 (2,1):0 (2,0):-0 (0,1):-0 (3,0):-1 (1,2):-1 (1,3):-2 (0,2):-2 (0,3):-4 

heap<Node *>
(0,0):2 (3,3):2 (0,1):-0 (1,0):1 (3,2):1 (2,2):0 (3,1):0 (2,1):0 (2,3):0 (2,0):-0 (3,0):-1 (1,1):-0 (1,2):-1 (0,2):-2 (1,3):-2 (0,3):-4 
(3,3):2 (0,0):2 (3,2):1 (1,0):1 (3,1):0 (2,3):0 (1,1):-0 (2,2):0 (2,1):0 (2,0):-0 (0,1):-0 (1,2):-1 (3,0):-1 (0,2):-2 (1,3):-2 (0,3):-4 
(3,3):2 (0,0):2 (1,0):100 (3,2):1 (3,1):0 (2,3):0 (1,1):-0 (2,2):0 (2,1):0 (2,0):-0 (0,1):-0 (3,0):-1 (1,2):-1 (1,3):-2 (0,2):-2 (0,3):-4 
(1,0):100 (3,3):2 (0,0):2 (3,2):1 (3,1):0 (2,3):0 (1,1):-0 (2,2):0 (2,1):0 (2,0):-0 (0,1):-0 (3,0):-1 (1,2):-1 (1,3):-2 (0,2):-2 (0,3):-4 


Update: Yanbin Lu has taken this code and updated it further. See the software section of his homepage for the details.

Tuesday, August 4, 2009

Dead simple parallelization in Python

Posted by Danny Tarlow
I'm at the stage in one research project where I've done all of the brainstorming, derivations, and prototyping, and most indicators suggest that we might have a good idea. So the next step is to figure out how it works on a lot of real-world data. For me, this usually means running the program on lots of different inputs and/or with many combinations of parameters (e.g., for cross validation). Further, right now I'm working a lot with images and doing computations at the individual pixel level -- which can get pretty expensive pretty quickly if you're not careful.

I started this phase (as I usually do, actually) by re-writing all of my code in C++. I like to do this first to speed things up, but it also is useful to me as a debugging and sanity check stage, because I have to go through everything I've done in great detail, and I get to think about whether it's still the right idea.

Anyhow, when the task just involves lots of executions of an executable with different inputs, there's no real need for any fancy parallelization. Instead, I use a dead simple version, where a script spawns off N threads, waits for them all to finish, and then spawns off N more. It keeps doing this until it has exhausted the list of commands.

Now I know there are plenty of other ways to do this, but for my purposes, the following Python code works great for me.
import os
from threading import Thread

NUM_THREADS = 7

class ThreadedCommand(Thread):

    def __init__(self, shell_call):
        Thread.__init__(self)
        self.shell_call = shell_call

    def run(self):
        os.system(self.shell_call)


commands = []
for parameter_1 in [5, 25, 100]:
    for parameter_2 in [5, 25, 100]:
        for parameter_3 in [.1, .25, .35]:
            command = "./command do_stuff --parameter_1 %s --parameter_2 %s --parameter_3 %s" % (parameter_1, parameter_2, parameter_3)
            commands.append(command)

while len(commands) > 0:
    running_threads = []
    for i in range(NUM_THREADS):
        try:
            command = commands.pop(0)
        except IndexError:
            # No commands left; wait for the threads that are already running.
            break

        new_thread = ThreadedCommand(command)
        running_threads.append(new_thread)
        new_thread.start()

    for thread in running_threads:
        thread.join()
        print "Joined %s" % thread.shell_call
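If you want a version where a new command is launched as soon as any worker frees up (instead of waiting for a whole batch of N to finish), a worker pool does the trick. Here is a minimal sketch using the standard library's multiprocessing module; the command strings and pool size are placeholders, not the ones from my experiments:
import subprocess
from multiprocessing import Pool

NUM_WORKERS = 7

def run_command(shell_call):
    # Run one shell command and return its exit code.
    return subprocess.call(shell_call, shell=True)

if __name__ == "__main__":
    # Placeholder commands; substitute the real executable and parameters.
    commands = ["./command do_stuff --parameter_1 %s" % p for p in [5, 25, 100]]

    pool = Pool(NUM_WORKERS)
    exit_codes = pool.map(run_command, commands)
    pool.close()
    pool.join()
    print(exit_codes)

The batch-of-N loop above is easier to eyeball, but the pool keeps every worker busy when commands take very different amounts of time.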

The quest for artificial intelligence

Posted by Danny Tarlow
Nils Nilsson is a Professor Emeritus in Computer Science at Stanford. Anybody who has taken a course on algorithms knows about A* search -- Nils co-invented it. He was also instrumental in developing one of the earliest mobile robots, SHAKEY, back in 1966. It's pretty amazing to look at the old pictures and read about the project.
From http://www.ai.sri.com/shakey/


If you're looking for somebody who lived through and helped shape modern robotics, machine learning, and artificial intelligence, you'd be hard-pressed to find a better exemplar than Nils.

He's also an extremely friendly and interesting guy. I know because I spent the summer after my third year of undergrad as one of his two research assistants. We were looking at how to do unsupervised learning in an incremental way, so that we could use what we'd already learned to build up progressively richer and more meaningful representations of the world, but that's a story for another day.

While I was working with him -- which was about four years ago now -- Nils told me about a book project he was working on, trying to round up information and write a history of artificial intelligence and robotics. I learned today that he's finished the book, which is set to be published later in the year. The best part is that the book is available free online in the meantime:

THE QUEST FOR ARTIFICIAL INTELLIGENCE: A HISTORY OF IDEAS AND ACHIEVEMENTS
Nils Nilsson
http://ai.stanford.edu/~nilsson/QAI/qai-webpage.html
I haven't sunk my teeth into it, but I skimmed through most of the chapters. It looks to be a great read -- Nils has always seemed to be conscious of the big picture, and I think it shows in this book. Chapter 1 starts by chronicling dreams of artificial intelligence in ancient literature. It's a great way to start the book. It is also very comprehensive, weighing in at 700 pages. I'll likely be buying a hard copy when it hits the presses.

Wednesday, July 29, 2009

Python RESTful Objects

Posted by Danny Tarlow
I like this Python class design idea:
http://www.pythonprogramminglanguage.info/2009/07/28/restful-python-objects/

Wednesday, July 22, 2009

Personal rapid transit

Posted by Danny Tarlow
I had the privilege of touring the soon-to-be-opened ULTra personal rapid transit station at Heathrow airport in London yesterday. It's essentially a mass transit system that uses personal, autonomous cars as the unit of transport, rather than buses or subway trains. I think it is a really cool project, and there are several interesting angles.

First, it greatly simplifies transport, because you just enter the destination you want to end up at, and it will take you there -- no more listening to garbled announcements on the bus and subway to figure out if you're at the right stop, and no trying to navigate maps and transfers. If you want to go from parking lot A to terminal 5, it will just take you there.

Second, there's a real focus on efficiency -- not transporting "empty space." Look at a normal bus route during the off-hours of the day: a huge bus gets driven around while transporting only one person. That's a lot of extra weight, and thus extra energy, to be using every 15 minutes of every day of the year. The PRT cars have a four-person capacity, and they only drive when there are passengers inside.

Finally, the technology is just really cool. There is an elevated track that the cars have to stay on, but they're able to make turns, plan routes, and more-or-less drive themselves. It's not a full blown autonomous vehicle, but it's yet another case where they've simplified how much technology is needed for the problem (by building special tracks) while still maintaining the "cool" factor.

You can see a video of it in action here. And this isn't science fiction -- it's actually built, and about to be opened:


Then there are several more videos at the website:
http://www.atsltd.co.uk/media/videos/

Monday, July 20, 2009

Upcoming talks

Posted by Danny Tarlow
For those of you who think that my month in Europe is one big vacation... well, you're mostly right. However, I would like to remind you, for the record, that I am doing a bit of work as well. Right now, I'm putting together slides for two upcoming talks:
  • I'll be speaking on July 23 at Microsoft Research in Cambridge on some recent work I've been doing on how to make max-product belief propagation tractable even in the case of non-local interactions. The tentative title is "Tractability in High Order Binary Factor Graphs".
  • I'll be co-presenting with Andrew Peterman on July 28 at the IBPSA conference on Building Simulation in Glasgow, talking about our paper, "Automatically Calibrating a Probabilistic Graphical Model of Building Energy Consumption."
I'll put some slides up publicly (and maybe write a bit about each project) sometime fairly soon, when the time is right.

Summer school wrap up

Posted by Danny Tarlow
I've been quiet recently on the blog front for two pretty significant reasons: (a) I've been busy attending summer school, talking with interesting people, hanging out on the beach, and traveling through Italy; and (b) I've had very little prolonged, reliable, free internet access. Most times I've signed online have been to hastily answer a few emails and maybe to book a train, flight, or hotel for the next city.

I'm now settled down a bit in Paris, so I thought I'd give a quick recap of summer school.

The over-arching theme of the school was how to use machine learning to attack hard problems in computer vision. There is plenty of overlap in the tools and problems that researchers in both fields think about, but there is a bit of prejudice on both sides. The oversimplified version is that machine learning people think computer vision people use ad-hoc algorithms that are over-engineered to only work well on the specific data sets they're using; and computer vision people think that machine learning people are too high on their pedestals, coming up with methods that may be "pretty," but that would never work on large enough problems to be of any interest.

In reality, neither of these views is completely true, and we saw some great talks at the school showing how machine learning can be used to build well-justified approaches to computer vision problems that are both efficient and powerful.

All of the talks were very good, but two of the highlights for me were Pushmeet Kohli's talk on using graph cuts for energy minimization, and Rob Fergus's talk about internet-scale image retrieval.

Pushmeet's abstract:
Over the last few years energy minimization has emerged as an indispensable tool in computer vision. The reason for this rising popularity has been the successes of efficient Graph cut based minimization algorithms in solving many low level vision problems such as image segmentation, object recognition, and stereo reconstruction. This tutorial will explain how these algorithms work, and what classes of minimization problems they can solve.
The nice part of the talk was that he went far beyond the standard case, where you have only two classes (e.g., foreground and background) with submodular potentials (crudely: nearby pixels should prefer to take the same label). In the past five or so years, some serious advances have been made in making these methods applicable to a wide variety of interesting, realistic problems. He talked about the non-discrete case, the non-submodular case, the multi-label case, and the higher-order potentials case.

Rob Fergus's abstract:
Existing object recognition methods are not practical to use on huge image collections found on the Internet and elsewhere. Recently, a number of computer vision approaches have been proposed that scale to millions of images. At their core, many of them rely on machine learning techniques to learn compact representations that can be used in efficient indexing schemes. My talk will review this work, highlighting the common learning techniques used and discuss open problems in this area.
The nice part about this talk was that he chose approaches that were extremely practical. Rather than having a method (say, machine learning) that you want to apply to a problem whether it fits or not, this was a gentler approach: use efficient algorithms and data structures, and then apply a small amount of machine learning only in the places where it produces the largest gains -- in this case, learning hash functions.

Also, if you're clicking around the ICVSS page, don't miss this one =P:
http://svg.dmi.unict.it/icvss2009/Studentship.htm

Wednesday, July 1, 2009

$1000 for rediscovering forgotten discoveries

Posted by Danny Tarlow
I will be heading off to Sicily on Friday for the ICVSS Summer School on Machine Learning for Computer Vision. I'm pretty excited to see Sicily, meet some fellow students, and see what should be some great lectures -- the list of speakers looks very promising.

There is an interesting exercise at the school that I haven't seen before. To make it even more interesting, they're attaching a $1000 prize to the winner. Here's the email I got describing the exercise:
READING GROUP: A critical review of old ideas, with treasure hunt

PRIZE: $1000

Computer Vision is a fairly "horizontal" field, where often ideas are proposed, forgotten, and then re-invented. This is typical of nascent fields and "high entropy" research environments. At various points, it is useful to revisit old literature, and try to read them in a modern key. What ideas survived? Have they taken "new forms"? What leads have turned into dead-ends? In this reading group we will attempt such a re-reading for two authors that influenced the field in its beginning, but whose influence has waned over the years.

Students will be asked to read two books: One is David Marr's 1980 book Vision. The other is James Jerome Gibson's 1986 book "The ecological approach to visual perception". More than 20 years have passed, which is enough time to reflect.

Students will be asked to compile a list of "ideas" or "leads" proposed or discussed by these authors. For each idea, they will be asked to evaluate it in relation to the current literature: Is this idea still valid? Is it reflected in the current literature? Has this idea been proven fruitless? Has it been forgotten but may still be worthy of pursuit? They will be asked to give a few examples from the modern literature (say ICCV/CVPR/ECCV proceedings from the past 4-5 years) where this idea is adopted or disputed.

The competition will then proceed as follows: Students will be called to present and describe these ideas, and discuss whether it is a "thumbs up" or a "thumbs down" idea. If the student manages to convince the audience, the idea is put into a bucket. If others have found the same idea, the number of people that found it is a "score" for that idea.

There will be two prizes for the competition. One goes to the individual or group that found the most number of ideas. The prize will be a bottle of Champagne to share among the winners. The second prize ($1000) will be for the treasure hunt, and will be given to the individual that will have found one idea that he/she can convince the audience is worth pursuing, and that nobody else in the audience has found.

Students will be asked to submit their list of ideas found on the first day of the school. This competition will require that students consult the library of their institution ahead of time, and plan enough time to read these books, possibly more than once, in relation to current literature.

I bought a copy of the Gibson book and will take a look at it on the airplane to Europe, and maybe also in my hotel the night of my layover. Regardless, I think it sounds like a fun exercise -- I don't have any examples off the top of my head, but it seems that there are many cases in history where good ideas have been discovered, forgotten, then rediscovered. If nothing else, hopefully I'll get a glimpse of how the world appeared in 1986.

PS. If anybody has some prize-winning ideas, I'm more than happy to fairly divvy up the prize =P

Monday, June 29, 2009

Solar energy -- what's the hold up?

Posted by Danny Tarlow
Guest post by Andrew Peterman from an op-ed piece in the Idaho Statesman, dated 05/28/09.

Andrew and I will be giving a joint presentation in Scotland next month on a data-driven approach to predicting energy consumption in buildings -- I was the data guy, and he was the energy guy. He grew up in Boise, has B.S. and M.S. degrees in Civil and Environmental Engineering from Stanford University, and currently works for the Walt Disney Company on worldwide sustainability, energy and environmental issues.

I've written a little bit about energy issues in the past, but most of the posts have been inspired by conversations with Andrew. This is meant as a test -- if it's received well, we might ask Andrew to write guest posts here more often about a variety of energy and energy policy issues. Let me know what you think.


Idaho should be a leader when it comes to solar energy
By Andrew Peterman
(Previously published in the Idaho Statesman)

I often find myself wondering why, as a state, we are not yet a leader in centralized or rooftop solar energy generation. According to the Energy Information Administration (EIA) and Interstate Renewable Energy Council, Idaho is not even ranked in the top 10, yet our close neighbor, Nevada, is ranked third in large-scale generation and second in per capita small and large solar PV generation.

The sun has incredible potential within Idaho to provide a key source of non-carbon-emitting electricity and heating. Solar is on the cutting edge of providing a financially smart method to reduce our dependence on foreign oil and reduce the deadly buildup of greenhouse gas emissions. We can bring Idaho to a state of foreign oil independence with an array of options, but not without harnessing the power we naturally absorb from the sun.

Idaho gets a lot of sun. Across the state, engineers and energy analysts have estimated that we have the potential to produce 60 billion kilowatt-hours of electricity annually from solar energy (Department of Energy) -- over 40 percent of Idaho's total energy demand using solar alone (EIA).

Solar technology can scale from the single-family homeowner to the large utility companies building massive collector systems and investing in our state.

Given Idaho's premium location for solar collection combined with nationally rising energy costs driven by high oil prices, solar technology has the potential to bring foreign oil independence to the state, reduce Idahoans' monthly utility bill, and protect Idaho's lakes, streams, mountains and valleys from the potentially harmful effects of global warming.

Solar is one of the cleanest technologies that makes financial sense. Incentives are currently in place to make the initial investment in solar panels more affordable and often financially beneficial for the individual or family. Through incentive programs such as Northwest Solar Cooperative - Green Tag Purchase, every Idahoan has an opportunity to receive money for every kW of solar electricity produced.

Solar-produced electricity is estimated in some cases to be as low as 2 cents/kWh. Solar projects have the potential to create numerous jobs locally to help install, maintain, and monitor the performance of solar projects. Installation and maintenance jobs -- nearly 50 percent of the total cost -- cannot be shipped overseas.

Solar benefits all. Solar has the potential to provide the following meaningful benefits to Idahoans, without singling out a particular group:
  • Lowers peak electricity demand across the state.
  • Reduces the need to build new transmission and distribution.
  • Protects Idaho streams and rivers.
  • Lowers emissions of global warming pollution.
  • Challenges citizens of Idaho to take action and own the solution to the climate change problem.

The benefits of solar sound obvious.

Why is our state lagging? In these tough economic times, the last thing on Idahoans' minds is solar electricity production. This is why we need our local leaders to help bring about the necessary incentives and training to implement progressive technologies. As individuals, we have to take on much of the responsibility of reducing energy consumption and switching to cleaner sources. Idahoans have to own the solution to global warming, but our leaders must first provide resources to help create these new market opportunities.

The global imperative to move toward cleaner energy sources has presented unprecedented opportunity to create new jobs and bolster the struggling economy of Idaho. Through hard work, we have the potential to be on the cutting edge in this newly forming green economy - but not without solar.

Saturday, June 27, 2009

Netflix algorithm: Prize Tribute Recommendation Algorithm in Python

Posted by Danny Tarlow
The Netflix Prize
My little region of the internet is abuzz with news that the 10% improvement needed to win the $1M Netflix prize has been achieved. The prize is now in the "last call" stage:
As of the submission by team "BellKor's Pragmatic Chaos" on June 26, 2009 18:42:37 UTC, the Netflix Prize competition entered the "last call" period for the Grand Prize. In accord with the Rules, teams have thirty (30) days, until July 26, 2009 18:42:37 UTC, to make submissions that will be considered for this Prize. Good luck and thank you for participating!

The winning method is a combination of several different teams' approaches, blended together using some form of ensemble learning. It's an interesting lesson -- to really eke out the last bit of performance, it appears to work better to throw the kitchen sink of methods at the problem and combine them in an intelligent way. This is rather contrary to the "ideal" research approach, where some clever, (relatively) simple, elegant breakthrough would lead to the best performance.

In honor of the prize barrier being broken, I put together a little implementation of an early leader's approach to the problem. They experimented with several different approaches, but the one I use the most is the original probabilistic matrix factorization (PMF) approach. To see all the details, including how it performs on the full Netflix problem, see Russ and Andrei's paper.

The Problem
The rating prediction problem is easy to explain: there is a set of users, a set of items (movies, in the Netflix case) and a set of ratings. For some combinations of user and movie, you have a rating, which says how much the user liked the movie. This is called the training set. From the ratings in the training set, we want to build a model that will predict the rating given by some previously unseen combination of user and movie.

In order to evaluate the model, some known ratings are taken out of the training set and put in a test set. The model then takes the training set as input, along with the user-movie combinations that are in the test set, but it isn't shown the actual ratings in the test set. Instead, it is asked to make predictions based on what it has learned from the training set. In the case of the Netflix challenge, Netflix never reveals the actual values of these ratings -- they will only take in a set of predictions and report back your score.

For example, a small data set might look as follows:
Danny - Good Will Hunting: 10
Danny - Prefontaine: 10
Danny - Casino Royale: 5
Tyler - Toy Story: 2
Tyler - Top Gun: 10
Tyler - Casino Royale: 10
Tyler - The Little Mermaid: 2
Tyler - Prefontaine: 6
Andrew - Up: 10
Andrew - Toy Story: 10
Andrew - Top Gun: 3
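
For concreteness, here is one way to encode that toy data set in the (user_index, movie_index, rating) tuple format that the implementation at the bottom of this post expects (the integer indices are an arbitrary choice of mine):
# Arbitrary integer indices for the toy users and movies above.
users = {"Danny": 0, "Tyler": 1, "Andrew": 2}
movies = {"Good Will Hunting": 0, "Prefontaine": 1, "Casino Royale": 2,
          "Toy Story": 3, "Top Gun": 4, "The Little Mermaid": 5, "Up": 6}

# (user_index, movie_index, rating) tuples -- the training set.
ratings = [
    (users["Danny"], movies["Good Will Hunting"], 10),
    (users["Danny"], movies["Prefontaine"], 10),
    (users["Danny"], movies["Casino Royale"], 5),
    (users["Tyler"], movies["Toy Story"], 2),
    (users["Tyler"], movies["Top Gun"], 10),
    (users["Tyler"], movies["Casino Royale"], 10),
    (users["Tyler"], movies["The Little Mermaid"], 2),
    (users["Tyler"], movies["Prefontaine"], 6),
    (users["Andrew"], movies["Up"], 10),
    (users["Andrew"], movies["Toy Story"], 10),
    (users["Andrew"], movies["Top Gun"], 3),
]

A test set would just be a few of these tuples held out before training.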

Probabilistic Matrix Factorization
The big idea of PMF is to describe each user and each movie by a small set of attributes. In this case, we could say something along these lines: Danny likes dramas and doesn't really like action or Disney movies; Tyler loves action movies but hates Disney movies; and Andrew loves Disney movies but nothing else. Toy Story and Up are Disney movies, Top Gun and Casino Royale are action movies, and Good Will Hunting and Prefontaine are dramas.

This could be represented in the PMF model by using three-dimensional vectors to describe each user and each movie. If the first dimension is how much of a Disney movie it is, the second how much of an action movie, and the third how much of a drama, then the latent vectors might look something like this:
Danny: [0, 3, 10]
Andrew: [10, 3, 5]
Tyler: [0, 8, 4]

Good Will Hunting: [0, 0, 1]
Prefontaine: [0, .5, 1]
Casino Royale: [0, 1, .2]
Toy Story: [1, .1, .1]
Top Gun: [0, 1, .2]
...

To predict how a user rates a movie, we take the dot product of the user vector with the item vector. So to predict how Tyler would rate Top Gun:
0 * 0 + 8 * 1 + .2 * 4 = 8.8

Not bad, given that we know his actual rating was a 10.
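
To make the arithmetic concrete, the same prediction is a one-line dot product (using the made-up vectors from above):
import numpy

tyler = numpy.array([0, 8, 4])       # [disney, action, drama]
top_gun = numpy.array([0, 1, .2])

print(numpy.dot(tyler, top_gun))     # prints 8.8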

Learning
However, in the PMF model, the descriptions of the users and movies are not known ahead of time. Instead, the learning algorithm automatically discovers these latent characteristics, finding latent representations of users and movies such that the model's predicted ratings match the ratings in the training set as well as possible. Going back to the example, if we wanted to make the predicted rating match the actual rating even more closely, we could bump up Tyler's love for action movies (or make several other changes to the latent vectors). When you have many users rating many movies, it gets trickier to know how to change the latent vectors, because moving Tyler's predicted rating for Top Gun closer might make several of his other predicted ratings worse. Thankfully, calculus and the PMF algorithm help us out here.

For my implementation, I don't worry too much about making it efficient enough to scale to something like the Netflix problem, but I use the same model presented in the paper, and I use a similar gradient-ascent optimization scheme (though I use batch updates with an adaptive learning rate instead of stochastic updates).
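
For reference, the batch gradient that each update climbs can be written compactly as follows. This is just a vectorized restatement of what the loops in update() and try_updates() below compute, assuming the squared-error-plus-L2 objective returned by likelihood(); the variable names are my own, and the factor of 2 from differentiating the square is folded into the learning rate, as in the code:
import numpy

def batch_gradients(users, items, ratings, reg):
    # users: (num_users, d), items: (num_items, d),
    # ratings: iterable of (user_index, item_index, rating).
    grad_users = -reg * users            # gradient of the L2 penalty
    grad_items = -reg * items
    for i, j, r in ratings:
        err = r - numpy.dot(users[i], items[j])   # residual for this rating
        grad_users[i] += err * items[j]           # push users[i] toward explaining r
        grad_items[j] += err * users[i]           # and likewise for items[j]
    return grad_users, grad_items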

I test my code using synthetic data, where I first make up latent vectors for users and items, then I generate some training set ratings by multiplying some latent user vectors by latent item vectors then adding some noise. I then discard the latent vectors and just give the model the synthetic ratings.

The code progresses iteratively, increasing the likelihood of the data given the latent vectors at each iteration. I use 5 dimensional latent vectors for these experiments, but you could use any small number. Here's what the likelihood (which you can think of as roughly the (negative) sum of squared error between predictions on the training set and actual training set values) looks like over time:

Results
Upon convergence of the training, the latent vectors look like this. There is one row for each individual user and item, then there are five columns per row -- these are the learned descriptions of users and movies:

Finally, using the learned latent vectors, we can then make rating predictions for any combination of user and movie. I do that here: The left side of the image has the vertical stacking of user latent vectors, the top side has the horizontal stacking of movie latent vectors, then the large bottom right region has the predicted ratings for every combination of user and item. Note that if a user or item has relatively small entries in all of its latent vectors, that represents something along the lines of being non-polar (apathetic, almost), while users or items with larger positive or negative latent values will be more polarizing -- the love-it-or-hate-it idea. You can see these sorts of patterns appear in the rows and columns.

The Code
And here is the code that generates the synthetic data, trains the model, and produces the figures:

Update (10/25/09): Fixed a bug in try_updates that was causing overcounting of the regularization component of the update. Thanks, Conrad.
import pylab
import matplotlib.pyplot as plt
import matplotlib.cm as cm

import numpy
import pickle

class ProbabilisticMatrixFactorization():

    def __init__(self, rating_tuples, latent_d=1):
        self.latent_d = latent_d
        self.learning_rate = .0001
        self.regularization_strength = 0.1
        
        self.ratings = numpy.array(rating_tuples).astype(float)
        self.converged = False

        self.num_users = int(numpy.max(self.ratings[:, 0]) + 1)
        self.num_items = int(numpy.max(self.ratings[:, 1]) + 1)
        
        print (self.num_users, self.num_items, self.latent_d)
        print self.ratings

        self.users = numpy.random.random((self.num_users, self.latent_d))
        self.items = numpy.random.random((self.num_items, self.latent_d))

        self.new_users = numpy.random.random((self.num_users, self.latent_d))
        self.new_items = numpy.random.random((self.num_items, self.latent_d))           


    def likelihood(self, users=None, items=None):
        if users is None:
            users = self.users
        if items is None:
            items = self.items
            
        sq_error = 0
        
        for rating_tuple in self.ratings:
            if len(rating_tuple) == 3:
                (i, j, rating) = rating_tuple
                weight = 1
            elif len(rating_tuple) == 4:
                (i, j, rating, weight) = rating_tuple
            
            r_hat = numpy.sum(users[i] * items[j])

            sq_error += weight * (rating - r_hat)**2

        L2_norm = 0
        for i in range(self.num_users):
            for d in range(self.latent_d):
                L2_norm += users[i, d]**2

        for i in range(self.num_items):
            for d in range(self.latent_d):
                L2_norm += items[i, d]**2

        return -sq_error - self.regularization_strength * L2_norm
        
        
    def update(self):
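        # One sweep of batch gradient ascent on the likelihood: compute the
        # gradients once, then search for a step size that improves the
        # likelihood (growing the learning rate on success, halving it on failure).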

        updates_o = numpy.zeros((self.num_users, self.latent_d))
        updates_d = numpy.zeros((self.num_items, self.latent_d))        

        for rating_tuple in self.ratings:
            if len(rating_tuple) == 3:
                (i, j, rating) = rating_tuple
                weight = 1
            elif len(rating_tuple) == 4:
                (i, j, rating, weight) = rating_tuple
            
            r_hat = numpy.sum(self.users[i] * self.items[j])
            
            for d in range(self.latent_d):
                updates_o[i, d] += self.items[j, d] * (rating - r_hat) * weight
                updates_d[j, d] += self.users[i, d] * (rating - r_hat) * weight

        while (not self.converged):
            initial_lik = self.likelihood()

            print "  setting learning rate =", self.learning_rate
            self.try_updates(updates_o, updates_d)

            final_lik = self.likelihood(self.new_users, self.new_items)

            if final_lik > initial_lik:
                self.apply_updates(updates_o, updates_d)
                self.learning_rate *= 1.25

                if final_lik - initial_lik < .1:
                    self.converged = True
                    
                break
            else:
                self.learning_rate *= .5
                self.undo_updates()

            if self.learning_rate < 1e-10:
                self.converged = True

        return not self.converged
    

    def apply_updates(self, updates_o, updates_d):
        for i in range(self.num_users):
            for d in range(self.latent_d):
                self.users[i, d] = self.new_users[i, d]

        for i in range(self.num_items):
            for d in range(self.latent_d):
                self.items[i, d] = self.new_items[i, d]                

    
    def try_updates(self, updates_o, updates_d):        
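        # Propose new latent vectors (self.new_users, self.new_items) by taking
        # one gradient step of size learning_rate; beta folds the L2
        # regularization gradient into the step.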
        alpha = self.learning_rate
        beta = -self.regularization_strength

        for i in range(self.num_users):
            for d in range(self.latent_d):
                self.new_users[i,d] = self.users[i, d] + \
                                       alpha * (beta * self.users[i, d] + updates_o[i, d])
        for i in range(self.num_items):
            for d in range(self.latent_d):
                self.new_items[i, d] = self.items[i, d] + \
                                       alpha * (beta * self.items[i, d] + updates_d[i, d])
        

    def undo_updates(self):
        # Don't need to do anything here
        pass


    def print_latent_vectors(self):
        print "Users"
        for i in range(self.num_users):
            print i,
            for d in range(self.latent_d):
                print self.users[i, d],
            print
            
        print "Items"
        for i in range(self.num_items):
            print i,
            for d in range(self.latent_d):
                print self.items[i, d],
            print    


    def save_latent_vectors(self, prefix):
        self.users.dump(prefix + "%sd_users.pickle" % self.latent_d)
        self.items.dump(prefix + "%sd_items.pickle" % self.latent_d)
    

def fake_ratings(noise=.25):
    u = []
    v = []
    ratings = []
    
    num_users = 100
    num_items = 100
    num_ratings = 30
    latent_dimension = 10
    
    # Generate the latent user and item vectors
    for i in range(num_users):
        u.append(2 * numpy.random.randn(latent_dimension))
    for i in range(num_items):
        v.append(2 * numpy.random.randn(latent_dimension))
        
    # Get num_ratings ratings per user.
    for i in range(num_users):
        items_rated = numpy.random.permutation(num_items)[:num_ratings]

        for jj in range(num_ratings):
            j = items_rated[jj]
            rating = numpy.sum(u[i] * v[j]) + noise * numpy.random.randn()
        
            ratings.append((i, j, rating))  # thanks sunquiang

    return (ratings, u, v)


def plot_ratings(ratings):
    xs = []
    ys = []
    
    for i in range(len(ratings)):
        xs.append(ratings[i][1])
        ys.append(ratings[i][2])
    
    pylab.plot(xs, ys, 'bx')
    pylab.show()


def plot_latent_vectors(U, V):
    fig = plt.figure()
    ax = fig.add_subplot(121)
    cmap = cm.jet
    ax.imshow(U, cmap=cmap, interpolation='nearest')
    plt.title("Users")
    plt.axis("off")

    ax = fig.add_subplot(122)
    ax.imshow(V, cmap=cmap, interpolation='nearest')
    plt.title("Items")
    plt.axis("off")

def plot_predicted_ratings(U, V):
    r_hats = -5 * numpy.ones((U.shape[0] + U.shape[1] + 1, 
                              V.shape[0] + V.shape[1] + 1))

    for i in range(U.shape[0]):
        for j in range(U.shape[1]):
            r_hats[i + V.shape[1] + 1, j] = U[i, j]

    for i in range(V.shape[0]):
        for j in range(V.shape[1]):
            r_hats[j, i + U.shape[1] + 1] = V[i, j]

    for i in range(U.shape[0]):
        for j in range(V.shape[0]):
            r_hats[i + U.shape[1] + 1, j + V.shape[1] + 1] = numpy.dot(U[i], V[j]) / 10

    fig = plt.figure()
    ax = fig.add_subplot(111)
    ax.imshow(r_hats, cmap=cm.gray, interpolation='nearest')
    plt.title("Predicted Ratings")
    plt.axis("off")


if __name__ == "__main__":

    DATASET = 'fake'

    if DATASET == 'fake':
        (ratings, true_o, true_d) = fake_ratings()

    #plot_ratings(ratings)

    pmf = ProbabilisticMatrixFactorization(ratings, latent_d=5)
    
    liks = []
    while (pmf.update()):
        lik = pmf.likelihood()
        liks.append(lik)
        print "L=", lik
        pass
    
    plt.figure()
    plt.plot(liks)
    plt.xlabel("Iteration")
    plt.ylabel("Log Likelihood")

    plot_latent_vectors(pmf.users, pmf.items)
    plot_predicted_ratings(pmf.users, pmf.items)
    plt.show()

    pmf.print_latent_vectors()
    pmf.save_latent_vectors("models/")
Related Posts: