Monday, February 27, 2012

Preliminary 2011 Season Data

Posted by Lee

In addition to the data from the 2006-2010 seasons shared publicly via Google Docs, we've published some preliminary data for the 2011 season. It uses the same format as past seasons' data and spans the beginning of the 2011 season through 2/26.

After Selection Sunday (March 11th), we will publish an updated set of data for the 2011 season. Please let us know if you find any problems with the preliminary data.

Machine March Madness 2012: Starter Code

Posted by Danny Tarlow
I've started a GitHub repository for the 2012 March Madness competition, to which I've committed some Python code that I worked on over the weekend:
https://github.com/dtarlow/Machine-March-Madness

Here, you can find code that parses data from previous seasons, constructs the past brackets, and learns a few different models based on past data. More details are in the README.

I will post in more detail about the models once I get them working a bit better, but I encourage you to take a look at the high level structure in learn_synthetic.py and model.py.

I've brainstormed a bunch of TODOs at the bottom of the README, so if you'd like to jump in and work on some of those, please do. Or feel free to go off in your own direction.

For detailed discussions of the code, questions, or bug reports/fixes, head on over to the official Google group.

Saturday, February 25, 2012

Google group for March Madness competition...

Posted by Danny Tarlow
... here.

We'll use the Google group for discussion of issues related to rules, but other posts are fair game: maybe you're looking for somebody to team up with, or maybe you want to brainstorm modeling ideas, etc.

Thursday, February 23, 2012

Machine March Madness 2012

Posted by Danny Tarlow
Every year, the NCAA College Basketball season ends with a tournament of 64 teams. Humans around the US (but also elsewhere in the world) fill in brackets with predictions of the outcome, enter pools, and wait excitedly for the results.

College basketball is a streaky and fairly high variance game, so there are many chances for an underdog to make a run deep into the tournament. We see this often -- for example, last year's tournament featured a final four made up of 3, 4, 8, and 11 seeds -- leading to the colloquial tournament name, "March Madness".

So without further ado, it is my pleasure to announce that this year, this blog, in conjunction with commissioner Lee, will host another "Machine March Madness" contest. The big idea is simple: using data from this season and from past seasons (which we will provide -- e.g., past data here: full and simple), build a computer system that fills out a bracket, then pit yourself against the field of silicon competition. You can see posts from last season's tournament here, and some press coverage here.

More details are coming soon, including details about prizes. For now, you can do a few things.
  1. Download the past data (full and simple), and start thinking about how you'd model the tournament. To get some starter ideas, I recommend this timeless post by George Dahl.
  2. Let us know in the comments if there is any other data that you would like to use. The rule we have is that all systems must be built using the same data, but we're open to suggestions about what this data is.
  3. Get started!
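
To make the "fills out a bracket" idea concrete, here is a minimal sketch, assuming you already have a single numeric strength rating per team. The function name `fill_bracket`, the placeholder team names, and the rating values are all hypothetical; this is not the contest's starter code or data format, just one simple way to turn ratings into a full set of picks.

```python
def fill_bracket(teams, rating):
    """Pick a winner for every game by always advancing the higher-rated team.

    teams:  team names in bracket order (length must be a power of two),
            so teams[0] plays teams[1], teams[2] plays teams[3], and so on.
    rating: dict mapping team name -> strength (hypothetical model output).
    Returns a list of rounds, each a list of the teams picked to advance.
    """
    rounds = []
    current = list(teams)
    while len(current) > 1:
        # Pair up adjacent teams and keep the higher-rated one from each pair.
        winners = [a if rating[a] >= rating[b] else b
                   for a, b in zip(current[0::2], current[1::2])]
        rounds.append(winners)
        current = winners
    return rounds


# Tiny four-team example with made-up ratings.
example_teams = ["A", "B", "C", "D"]
example_rating = {"A": 3.0, "B": 1.0, "C": 2.0, "D": 5.0}
print(fill_bracket(example_teams, example_rating))  # [['A', 'D'], ['D']]
```

A deterministic "higher rating wins" rule like this never picks an upset, so a real entry would probably want something smarter, but it shows the basic shape of the task: a model produces per-team (or per-matchup) scores, and a small loop turns them into picks for every round.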


Update: Here's a question about additional data to use, posted on Quora.

Thursday, August 11, 2011

Testing Intuitions about Markov Chain Monte Carlo: Do I have a bug?

Posted by Danny Tarlow
For one project I've been working on recently, I'm using a Markov Chain Monte Carlo (MCMC) method known as slice sampling. There are some good tutorials, examples, and implementations out there (e.g., by Iain Murray or Radford Neal), but for various reasons, I wanted to implement my own version.
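
For readers who haven't seen slice sampling before, here is a minimal univariate sketch using the stepping-out and shrinkage procedure described by Radford Neal. It is not the implementation behind the plot in this post; the function name `slice_sample`, the `width` parameter, and the standard normal target in the usage line are illustrative assumptions.

```python
import math
import random


def slice_sample(log_prob, x0, n_samples, width=1.0, rng=None):
    """Univariate slice sampler (stepping out, then shrinkage).

    log_prob: function returning the unnormalized log density at x.
    Returns (samples, log_probs), where log_probs[i] = log_prob(samples[i]).
    """
    rng = rng or random.Random(0)
    x = x0
    samples, log_probs = [], []
    for _ in range(n_samples):
        # Pick the slice height uniformly under the density, in log space.
        log_y = log_prob(x) + math.log(1.0 - rng.random())
        # Stepping out: place an interval of size `width` randomly around x,
        # then widen each end until it falls outside the slice.
        left = x - width * rng.random()
        right = left + width
        while log_prob(left) > log_y:
            left -= width
        while log_prob(right) > log_y:
            right += width
        # Shrinkage: propose uniformly in the interval; on rejection,
        # shrink the interval toward the current point and try again.
        while True:
            x_new = rng.uniform(left, right)
            if log_prob(x_new) >= log_y:
                x = x_new
                break
            if x_new < x:
                left = x_new
            else:
                right = x_new
        samples.append(x)
        log_probs.append(log_prob(x))
    return samples, log_probs


# Usage: sample from a standard normal (log density up to a constant).
samples, log_probs = slice_sample(lambda x: -0.5 * x * x, x0=0.0, n_samples=1000)
```

The returned `log_probs` list is the kind of per-iteration trace discussed below. For a model with many parameters, one common approach is to apply an update like this to each coordinate in turn.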

Now, debugging MCMC algorithms is somewhat troublesome, due to their random nature and the fact that chains just sometimes mix slowly, but there are some good ways to be pretty sure that you get things right. For example, the Geweke method is highly regarded as _the_ method to make sure you're getting it right. So this exercise is not actually really about debugging. It's more about testing intuitions about the behavior of a sampler.

With that out of the way, on to the question:
I implemented my sampler, initialized it with small random numbers for the parameters, and set it running on a simple test model (which I'm intentionally not describing in detail). One high level statistic that is relevant to look at is the (log) probability of samples versus iteration of the sampler, so I made that plot. It looks like this:
[Plot: log probability of samples versus sampler iteration]
This plot looks a bit surprising. Upon initialization, the sampler moves directly to regions of space that have very low probability (remember, this is a _log_ probability*), and it appears to just keep going to exponentially less and less probable regions. The point of a sampler is that it should spend an amount of time in a state in proportion to the state's probability. And this sampler is making a beeline to a state that is a factor of e^600 less probable than where it started.

So here's the question: do I have a bug? In other words, if you were my supervisor and I came to you with this plot, would you dismiss this plot and send me back to debugging? If not, explain how this could possibly make sense.

I'll post my answer sometime in the next couple days.

* I'm leaving out constants, so the graph would be shifted down (but wouldn't change shape or scale) if I were to include all the constants.

Sunday, April 10, 2011

Crawling Code

Posted by Lee

One of the contestants requested that I upload the code to crawl the boxscores. I have done so and it is available on GitHub: https://github.com/leezen/boxscore-crawler

Note that Yahoo changed its format starting around March 10th and the code uses the flag to get the old boxscore format. It is unclear how long this option will remain available from Yahoo.

2011 Predictive Analytics Challenge Winner

Posted by Lee

We knew it would be a machine, but we didn't know which one until UConn's victory ensured The Pain Machine the title as winner of the 2011 March Madness Predictive Analytics Challenge! Congratulations to Scott Turner and his entry on the victory -- his second in a row! He will be receiving a $25 gift certificate to Amazon.com. It doesn't sound like he'll be resting on his laurels as he's started a blog that will detail further development of his system.

Thank you to all the participants and entrants this year. We would love to know how you thought the contest went, how we can improve for next year, and any other feedback you might have! We look forward to your participation again next year!