Saturday, March 13, 2010

Official March Madness Predictive Analytics Challenge Announcement

Posted by Lee
NEW UPDATE Please see the 2011 March Madness Algorithm Prediction Contest.

OLD UPDATE Player data has been updated to fix a bug. 2010 player data is also now included. Please refer to the more recent posts for the latest data.


Welcome to the inaugural March Madness Predictive Analytics Challenge! I'm very excited about this event and I hope you are, too! We already have some prizes lined up (keep reading for details) and we're hoping to get some more prizes set up.

As Danny said in his previous post, I'll be acting as the commissioner for this contest. In this post, I'll be explaining the format for the challenge, rules, available data, and prizes.

Background

Most readers of this blog are probably familiar with the general idea of what this contest is about. In case you aren't a frequent reader or a fan of college basketball, this section will serve as a brief introduction. Tomorrow is "Selection Sunday," when the teams for the NCAA College Basketball tournament will be selected. In total, there will be 65 teams, with 2 teams playing a "play-in" game to determine the field of 64. These 64 teams are then pitted against each other in a bracket, with one national champion emerging as the winner. Every year, millions of people fill in their predictions of who will be the winners and losers of the games. People participate in leagues or pools with other people to see who has the best bracket. We would like YOU to participate in our algorithm-only pool. That is, your bracket must be completed by a computer algorithm based upon historical data, without the use of human judgment. With that said, let's take a quick look at the format.

Contest Format

The format is fairly simple. We will have two pools: a Tournament pool and a Sweet Sixteen pool. Entries in both pools will be evaluated on the typical exponential point scoring system. Correct picks get 1, 2, 4, 8, 16, and 32 points depending on the depth in the bracket (1 point in the first round, 2 points in the second round, etc.). An entry only needs to pick the winning team of each game. Thus, if the opponent you predicted is no longer in the tournament but the team you picked still wins, points are still awarded. Each person is limited to one entry per pool. Each pool will have a winner determined by the submission scoring the most points.
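For anyone who wants to score their own bracket while testing, here is a minimal sketch of the point system described above. The function name and the representation of a bracket as one set of winning teams per round are my own assumptions for illustration, not an official scoring script.

```python
# Points for a correct pick in rounds 1 through 6: 1, 2, 4, 8, 16, 32.
ROUND_POINTS = [2 ** r for r in range(6)]


def score_bracket(picks, results):
    """Score a bracket against actual results.

    `picks` and `results` are lists with one set of team names per round:
    the teams predicted (or observed) to WIN a game in that round.
    Only the winner matters, so a pick earns points even if the
    predicted opponent was eliminated earlier.
    """
    total = 0
    for points, picked, actual in zip(ROUND_POINTS, picks, results):
        # Each team that was picked to win this round and did win scores.
        total += points * len(picked & actual)
    return total


# Hypothetical two-round example with made-up team labels.
example_picks = [{"A", "B"}, {"A"}]
example_results = [{"A", "C"}, {"A"}]
print(score_bracket(example_picks, example_results))
```

In the example, team A is a correct first-round pick (1 point) and a correct second-round pick (2 points), while B is a miss. Because each round of a full 64-team bracket is worth 32 points in total, a perfect Tournament pool entry would score 192.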

Deadlines

TOURNAMENT pool entries must be submitted no later than March 18, 2010 at 1am.
SWEET SIXTEEN pool entries must be submitted no later than March 25, 2010 at 1am.
Entries past the deadline will not be accepted.

Rules

  • Your bracket must be chosen completely by a computer algorithm.
  • The computer algorithm must base the decision upon historical data.
  • You may not hard code selections into your algorithm (e.g., "Always pick Stanford over Cal")
  • Your algorithm may only use the data published on this blog. This includes the data described in this post as well as the other data Danny has published beforehand.
  • The above rule is fairly restrictive, but I believe it provides a more even playing field. The contest should be about your algorithm's predictive capabilities and not a data advantage one person has over another.
  • You must be able to provide code that shows how your entry is chosen. In other words, your bracket and the selection of winning teams in your bracket must be reproducible by me on a machine.
  • In the event of a tie, the entry with the EARLIER submission time wins.

Submissions

EDIT Thanks to Matt's suggestion in the comments, we'll be using Yahoo's bracket system for the contest submissions. Please send an e-mail to leezen+MarchMadness at gmail for the group password to join. UPDATE Sweet Sixteen prediction bracket is open to contestants.

Prizes

Tournament Bracket: First Place - a custom vinyl sticker or laser etching featuring the yet-to-be-revealed, super secret "Smell the Data" logo, courtesy of Doug Tarlow
Sweet Sixteen Bracket: First Place - $25 Amazon.com Gift Certificate

Data

As described above, data previously posted on this blog is acceptable for use in this contest (description of the previous data). In addition, we've provided CSV dumps of the 2006-2009 seasons. These are available here. I apologize that the format is not exactly the same as Danny's, as it includes some additional attributes. The GameDataCsv.zip files contain game result data, while PlayerDataUpdated.zip contains two files: one for 2006-2009 (Players2.csv) and one for 2010. Please see http://blog.smellthedata.com/2010/03/updated-player-data.html for why there is updated player data. The player data columns are:
  • ID (GUID)
  • Name
  • Height
  • Position
  • Team
  • Year
  • Class (Freshman, Sophomore, Junior, Senior)
  • Games - the number of games the player participated in
  • Field goals (shots) made, excluding three point shots
  • Field goal attempts, excluding three point shots
  • Three point shots made
  • Three point shots attempted
  • Free throws made
  • Free throw attempts
  • Assists
  • Blocks
  • Rebounds
  • Steals

The 2010 player data has a slightly different schema (sorry!). It includes three sets of field goal figures -- field goals made and attempted without 3 pointers, field goals made and attempted including 3 pointers, and 3 pointers made and attempted. Also note that the last four columns are in a slightly different order.
  • ID (GUID)
  • Name
  • Height
  • Position
  • Team
  • Year
  • Class (Freshman, Sophomore, Junior, Senior)
  • Games - the number of games the player participated in
  • Field goals (shots) made, excluding three point shots
  • Field goal attempts, excluding three point shots
  • Field goals (shots) made, including three point shots
  • Field goal attempts, including three point shots
  • Three point shots made
  • Three point shots attempted
  • Free throws made
  • Free throw attempts
  • Rebounds
  • Assists
  • Steals
  • Blocks
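Since the two player files differ in column count and order, it may help to map both onto one set of field names before combining them. The sketch below assumes the files have no header row and that the columns appear exactly in the order listed above; the short field names (`fg2_made`, `fg_made`, and so on) and the `read_players` helper are hypothetical labels I made up, not names from the files.

```python
import csv
import io

# Column order for the 2006-2009 file, as listed above (assumed, no header row).
COLUMNS_2006_2009 = [
    "id", "name", "height", "position", "team", "year", "class", "games",
    "fg2_made", "fg2_att", "fg3_made", "fg3_att", "ft_made", "ft_att",
    "assists", "blocks", "rebounds", "steals",
]

# Column order for the 2010 file: two extra field goal columns, and the
# last four columns reordered.
COLUMNS_2010 = [
    "id", "name", "height", "position", "team", "year", "class", "games",
    "fg2_made", "fg2_att", "fg_made", "fg_att", "fg3_made", "fg3_att",
    "ft_made", "ft_att", "rebounds", "assists", "steals", "blocks",
]


def read_players(handle, columns):
    """Yield one dict per CSV row, restricted to the fields both schemas share."""
    for row in csv.reader(handle):
        if not row:
            continue  # skip blank lines
        record = dict(zip(columns, row))
        yield {key: record[key] for key in COLUMNS_2006_2009}


# Made-up 2010-style row, only to show the reordering being undone.
sample_2010 = io.StringIO(
    "g1,Pat Example,6-2,G,Sample U,2010,Junior,30,"
    "100,200,140,300,40,100,50,60,120,90,30,10\n"
)
for player in read_players(sample_2010, COLUMNS_2010):
    print(player["rebounds"], player["assists"], player["steals"], player["blocks"])
```

Values are left as strings here. The extra 2010 columns are dropped rather than kept, because the including-3-pointers figures can be recovered by adding the other two sets if you need them.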

The game result data columns are:
  • Game Date
  • Days Count
  • Home Points
  • Away Points
  • Overtime
  • Home Team
  • Away Team
  • Home Team Name
  • Away Team Name
  • Game Type

I tried to make this file as backward compatible with Danny's file as possible. I've kept most of the columns, including Days Count. Note that some Days Count values are negative from 2009-11-08. "Overtime" is either "True" or "False", and True indicates that the game went into overtime. The Game Type will be either Other, Regular, NCAA Tournament, or Conference Tournament, with the latter two being postseason. Regular refers to the regular season, while Other usually means the game was played at an invitational or other tournament. One thing to note here is that not all games have a true home and away team; games are often played on neutral courts. To preserve the formatting, however, I had to pick a home and away team for neutral court games. In these games, I decided to list the winning team as the home team. If you'd prefer this broken out or altered, please let me know and I can change it sooner rather than later. Note in particular that all NCAA tournament games are considered neutral court games.
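To make the neutral court convention concrete, here is a rough sketch of how a single game row might be interpreted. It assumes each row has already been read into a dict keyed by the column names listed above (for example, with `csv.DictReader` and those names supplied as field names); the `summarize_game` helper and its output keys are hypothetical.

```python
def summarize_game(row):
    """Interpret one game row (a dict keyed by the column names above)."""
    home_points = int(row["Home Points"])
    away_points = int(row["Away Points"])
    home_won = home_points > away_points
    return {
        "winner": row["Home Team Name"] if home_won else row["Away Team Name"],
        "margin": abs(home_points - away_points),
        "overtime": row["Overtime"] == "True",
        "postseason": row["Game Type"] in ("NCAA Tournament", "Conference Tournament"),
        # Only NCAA Tournament games are certain to be on a neutral court.
        # Other neutral games list the winner as "home" and are not flagged
        # in the file, so they cannot be detected from this column alone.
        "known_neutral": row["Game Type"] == "NCAA Tournament",
    }


# Made-up row, only to show the shape of the input and output.
sample_row = {
    "Home Points": "75", "Away Points": "70", "Overtime": "True",
    "Home Team Name": "Sample U", "Away Team Name": "Example State",
    "Game Type": "NCAA Tournament",
}
print(summarize_game(sample_row))
```

If your model estimates a home court advantage, you may want to exclude the rows where `known_neutral` is true, since the "home" team there is simply the winner.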

Additional Information

If you have other questions, concerns, etc. please comment on this post and I'll do my best to answer.

11 comments:

Danny Tarlow said...

Awesome! Great job putting this all together!

Matt Curry said...

Cool idea, but what about just making a group at espn.com (or any other site)? Seems like it would be a ton easier than dealing with all the email submissions.

Lee said...

Thanks for the comment, Matt. We could do that, but I wasn't sure about the sweet sixteen bracket on those sites. I thought it might be easier to just do it manually for now. If you know how to do that part, I'd definitely be interested. Thanks!

Matt Curry said...

Looks like yahoo does a second chance bracket that starts with the sweet 16 (http://tournament.fantasysports.yahoo.com/t1).

ESPN has nothing but the normal.

CBS Sportsline has round by round options.

Those were the only 3 I checked.

Either way, thanks again for the data. Not sure if I'll be able to pull anything together in time, but it's fun to mess with anyway.

Shelley said...

This sounds like a lot of fun! I would especially like to win so that I could have a laser etching done by Product Designer Doug Tarlow!

Danny Tarlow said...

Nice, Matt. Lee updated it to use Yahoo.

Ryan said...

Is there no 2010 player data allowed then?

Lee said...

Ryan: I was not able to find the 2010 player data in the same format as the 2006-2009 data. If you have it, we can make it available to everyone and allow its use.

Lee said...

This looks pretty good: http://stats.ncaa.org/team/inst_team_list/10260?division=1

I will try to find time tonight to crawl and parse this and provide it to everyone.

Danny Tarlow said...

bjfish's script here looks like a good starting point as well:
http://gist.github.com/332084

Lee said...

@Ryan sorry for the delay, I'll be posting the 2010 player data soon. I'll also write a post about it as the data is the same format, but with some caveats