Friday, March 11, 2011

Official 2011 March Madness Predictive Analytics Challenge

Posted by Lee
Welcome to the second annual March Madness Predictive Analytics Challenge! I'm very excited about this event and I hope you are, too! We're still trying to line up some prizes, but for sure, like last year, there will be a gift certificate to

This year's format will be more or less the same as last year's.


Most readers of this blog are probably familiar with the general idea of what this contest is about. In case you aren't a frequent reader or a fan of college basketball, this section will serve as a brief introduction. March 13th is "Selection Sunday," when the teams for the NCAA College Basketball tournament will be selected. In total, there will be 68 teams, with 8 of them playing four "play-in" games on March 15th and 16th to determine the field of 64. For the purposes of this contest, you do not need to worry about these initial play-in games. The remaining 64 teams are then pitted against each other in a bracket, with one national champion emerging as the winner. Every year, millions of people fill in their predictions of who will win and lose each game. People participate in leagues or pools with other people to see who has the best bracket. We would like YOU to participate in our algorithm-only pool. That is, your bracket must be completed by a computer algorithm based upon historical data without the use of human judgment.

Contest Format

The format is fairly simple. We will have two pools: a Tournament pool and a Sweet Sixteen pool. Entries in both pools will be evaluated on the typical exponential point scoring system. Correct picks get 1, 2, 4, 8, 16, and 32 points depending on the depth in the bracket (1 point in the first round, 2 points in the second round, etc.). The entry only needs to pick the winning team of each game. Thus, if the opponent you predicted for that game is no longer in the tournament but the team you picked still wins, points are still awarded. Each person is limited to one entry per pool. The winner of each pool is the submission scoring the most points.
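As a rough illustration of the scoring rule described above, here is a minimal Python sketch. The function name score_bracket and the idea of storing a bracket as one set of winning teams per round are assumptions made for this example; they are not the format used by Yahoo's bracket system or by the contest data.

```python
# Points per correct pick, from the round of 64 through the championship game.
ROUND_POINTS = [1, 2, 4, 8, 16, 32]


def score_bracket(picks, results):
    """Score a bracket under the 1-2-4-8-16-32 system.

    picks and results are lists with one set of team names per round:
    the teams predicted (or observed) to WIN a game in that round.
    Only the winner matters, so a pick still scores even if the
    predicted opponent was eliminated earlier.
    """
    total = 0
    for points, picked, actual in zip(ROUND_POINTS, picks, results):
        total += points * len(set(picked) & set(actual))
    return total


# Hypothetical two-round example with made-up team names.
picks = [{"A", "C"}, {"A"}]    # predicted: A and C win round 1, A beats C in round 2
results = [{"A", "D"}, {"A"}]  # actual: A and D win round 1, A beats D in round 2
print(score_bracket(picks, results))  # 3 (1 point for A in round 1, 2 for A in round 2)
```

In the example, the pick of A in the second round still earns 2 points even though the predicted opponent C had already lost. A perfect 64-team bracket would score 192 points under this system.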


TOURNAMENT pool entries must be submitted no later than March 17, 2011 (the first day of play in the round of 64).
SWEET SIXTEEN pool entries must be submitted no later than March 24, 2011 (the beginning of the Sweet Sixteen round).


  • Your bracket must be chosen completely by a computer algorithm.
  • The computer algorithm must base the decision upon historical data.
  • You may not hard-code selections into your algorithm (e.g., "Always pick Stanford over Cal").
  • Your algorithm may only use the data set published for the tournament. The data will be released on Sunday, March 13.
  • The above rule is fairly restrictive, but I believe it provides a more even playing field. The contest should be about your algorithm's predictive capabilities and not a data advantage one person has over another.
  • You must be able to provide code that shows how your entry picks the winners. In other words, your bracket and the selection of winning teams in your bracket must be reproducible by me on a machine.
  • In the event of a tie, the entry with the EARLIER submission time wins.
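
The winner selection and tie-break rule can be sketched in a few lines of Python as well. The entry fields (name, score, submitted) and the sample entries below are hypothetical and exist only to show the ordering: highest score first, with ties going to the earlier submission.

```python
from datetime import datetime


def rank_entries(entries):
    """Sort entries best-first: highest score, then earliest submission time."""
    return sorted(entries, key=lambda e: (-e["score"], e["submitted"]))


# Made-up entries for illustration only.
entries = [
    {"name": "entry_a", "score": 90, "submitted": datetime(2011, 3, 16, 20, 0)},
    {"name": "entry_b", "score": 90, "submitted": datetime(2011, 3, 15, 9, 30)},
    {"name": "entry_c", "score": 75, "submitted": datetime(2011, 3, 14, 12, 0)},
]

winner = rank_entries(entries)[0]
print(winner["name"])  # entry_b (tied on score with entry_a, but submitted earlier)
```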


We'll be using Yahoo's bracket system for the contest submissions. Please send an e-mail to leezen+MarchMadness at gmail for the group password to join. Please include your team name, team members, and a brief description.




As described above, only the official contest data on this blog is acceptable for use in this contest. You can get a sample of the data, which has all games from the 2006 season through February 2011. Please see this post for details. I will also update this post on Sunday with a link to the full data set.

UPDATE: I have an updated post with details on the final data: Selection Sunday Data

Additional Information

Please be aware that algorithm computation time will be somewhat important in this task. You will be able to predict most of your games ahead of time between March 13th and 17th. However, because the match-ups in the round of 64 will not be fully known until the four play-in games are complete, you will need to predict the outcome of four games between March 15th and 17th.

If you have other questions, concerns, etc. please comment on this post and I'll do my best to answer.


DN said...

I didn't know you can put + into an email address

DN said...

Few more questions

How come some teams are unknown UNK?

So we don't know the march draw or the players who will play?

What form is the input of the prediction code? home and away team?


Lee said...

The "UNK" teams are teams that do not have a mapping in the team codes file - they are typically non-Division 1 teams or no longer in Division 1 (which is where the tournament is played).

You'll know today (March 13th) who the teams are and who they'll play. However, 4 games won't be determined until March 16th at which point you can make those predictions.

Go said...

Hi I'm interested in the event even though it is a bit late. Looking at the games.tsv, I'm wondering why the scores/results of the games are not there? Thanks.

Danny Tarlow said...

Great! We'd love to have you join.

The fine-grained data only has scores implicit in the data. See Lee's explanation of the data here:

If you're just looking for scores, there's also simpler, aggregate data available here: