Thursday, February 23, 2012

Machine March Madness 2012

Posted by Danny Tarlow
Every year, the NCAA College Basketball season ends with a tournament of 64 teams. Humans around the US (but also elsewhere in the world) fill in brackets with predictions of the outcome, enter pools, and wait excitedly for the results.

College basketball is a streaky and fairly high-variance game, so there are many chances for an underdog to make a run deep into the tournament. We see this often -- for example, last year's tournament featured a final four made up of 3, 4, 8, and 11 seeds -- which is how the tournament earned its colloquial name, "March Madness".

So without further ado, it is my pleasure to announce that this year, this blog, in conjunction with commissioner Lee, will host another "Machine March Madness" contest. The big idea is simple: using data from this season and from past seasons (which we will provide -- e.g., past data here: full and simple), build a computer system that fills out a bracket, then pit yourself against the field of silicon competition. You can see posts from last season's tournament here, and some press coverage here.

We'll have more details soon, including details about prizes. For now, you can do a few things.
  1. Download the past data (full and simple), and start thinking about how you'd model the tournament. To get some starter ideas, I recommend this timeless post by George Dahl.
  2. Let us know in the comments if there is any other data that you would like to use. The rule we have is that all systems must be built using the same data, but we're open to suggestions about what this data is.
  3. Get started!
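
If you want something concrete to start from, here is a minimal sketch of what "a computer system that fills out a bracket" could look like. It is only an illustration, not the contest's reference method: the game-tuple format, the team names, and the helper names `rate_teams` and `fill_bracket` are all assumptions made up for this example, and the real data files may be laid out differently. It rates each team by its average point margin, then advances the higher-rated team in every matchup.

```python
from collections import defaultdict


def rate_teams(games):
    """Rate each team by average point margin.

    games: iterable of (team_a, score_a, team_b, score_b) tuples.
    This tuple layout is a hypothetical format, not the contest's.
    """
    margin = defaultdict(float)
    played = defaultdict(int)
    for team_a, score_a, team_b, score_b in games:
        margin[team_a] += score_a - score_b
        margin[team_b] += score_b - score_a
        played[team_a] += 1
        played[team_b] += 1
    return {team: margin[team] / played[team] for team in played}


def fill_bracket(field, ratings):
    """Pick the higher-rated team in every matchup, round by round.

    field: teams in bracket order (length must be a power of two);
    adjacent entries play each other. Unrated teams default to 0.0.
    Returns a list of rounds, each a list of predicted winners.
    """
    rounds = []
    current = list(field)
    while len(current) > 1:
        winners = []
        for i in range(0, len(current), 2):
            a, b = current[i], current[i + 1]
            winners.append(a if ratings.get(a, 0.0) >= ratings.get(b, 0.0) else b)
        rounds.append(winners)
        current = winners
    return rounds


# Made-up toy season and a four-team field, purely for illustration.
games = [
    ("A", 70, "B", 60),
    ("B", 65, "C", 64),
    ("C", 80, "D", 50),
    ("D", 55, "A", 75),
]
field = ["A", "D", "B", "C"]

ratings = rate_teams(games)
rounds = fill_bracket(field, ratings)
print(rounds)
```

Always picking the higher-rated team ignores exactly the variance that makes the tournament interesting, so treat this as a baseline to beat rather than a serious entry.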

Update: Here's a question about additional data to use, posted on Quora.


Scott Turner said...

I'm looking forward to it!

I wonder, though, if you shouldn't throw it open to let contestants use any data they'd like. Wouldn't it be interesting to see what data people end up using?

(Although getting data is pretty difficult...)

Danny Tarlow said...

Lee and I are meeting tomorrow to talk about stuff like this, so I'll add it to the agenda.

In general, I'm fairly open to the idea, although part of me thinks it's more interesting if no human input enters the loop. Are you proposing that people can use human-based inputs, e.g., the tournament seeding, human-created power rankings, etc.?

On the other hand, perhaps figuring out which data to trust is also an interesting question. Maybe there should be two divisions, but we'd want to make sure there are enough competitors in each division to justify that.

What do other people think?