Sunday, March 27, 2011

2011 Algorithmic March Madness: Machines Lock in Victory over Humans

Posted by Danny Tarlow
Well that was an exciting and surprising weekend of basketball. (8)Butler beat (2)Florida in overtime, and (11)Virginia Commonwealth handily took care of (1)Kansas, eliminating the last remaining #1 seed. Rounding out the Final Four are (3)UConn and (4)Kentucky.

March Madness is usually good for some Cinderella stories, but this Final Four seems particularly improbable. There are no #1 seeds remaining (this has been the case 3 times in March Madness history, according to the TV announcers), and never before have teams seeded as low as #11 and #8 met in a Final Four game.

None of the entries in our second annual March Madness Algorithm Challenge saw this amount of madness coming. Two entrants correctly predicted that (3)UConn would make the final four (Team Delete Kernel and The Pain Machine), but no other algorithms or baselines got any Final Four teams correct. In fact, the only entry that has any more chance at points is The Pain Machine, which has UConn winning one more game.

So the question of which algorithm will win the contest is still not settled: a UConn victory on April 2, and The Pain Machine walks home with the prize; a UConn loss, and Team Delete Kernel is our winner.

What is settled at this point is that a machine will claim victory over the human-aided competition. The human baselines include our commissioner Lee's bracket; the Higher Seed bracket (where the human intervention came via the committee that chose seeds); and the Nate Silver baseline, which was a part-human, part-computer effort.

To give you an idea of the potentially winning methodologies, Scott Turner (The Pain Machine) describes his approach here, and Kevin Lee (Team Delete Kernel) based his model on the method described here.

So it's premature to congratulate a winner yet, but let me tritely say that I, for one, welcome our new March Madness algorithm overlords.
ENTRANT                         R1  R2  R3  R4  R5  Winner      Pts  Possible
Team Delete Kernel              23  20  16  8   -   Ohio St.    67   67
Human (Lee)*                    25  18  16  0   -   Kansas      59   59
Baseline (Higher Seed)*         25  20  12  0   -   Ohio St.    57   57
The Pain Machine                19  18  8   8   -   Kansas      53   69
Baseline (Nate Silver)*         25  20  8   0   -   Ohio St.    53   53
InItToWinIt                     22  20  8   0   -   Kansas      50   50
Baseline (TrueSkill)            26  18  4   0   -   Ohio St.    48   48
Danny's Dangerous Picks         22  16  8   0   -   Duke        46   46
Baseline (LRMC)                 25  16  4   0   -   Ohio St.    45   45
DukeRepeats                     23  16  4   0   -   Duke        43   43
Point Differential Centrality   23  16  4   0   -   Ohio St.    43   43
dirknbr1                        23  8   4   0   -   Ohio St.    35   35

* Denotes human-involvement.
You can see the full brackets here (Update: actually, it looks like Yahoo took them down). Also, there is a second, Sweet 16 contest that we haven't mentioned lately. Stay tuned for an update on that front.

Friday, March 25, 2011

Algorithmic March Madness: Elite 8 Update

Posted by Danny Tarlow
The Sweet 16 saw some major upsets, with two of the remaining three number one seeds falling. Perhaps most impressive was Arizona's (5 seed) 93-77 thumping of Duke (1 seed). In the Elite 8, we have the following seeds: 4, 2, 5, 3, 1, 11, 8, 2.

No entrants predicted that Baylor Butler (8 seed) or VCU (11 seed) would have made it this far, but we're starting to see the first signs of weakness in the Higher Seed and Nate Silver baselines. At this point, our Commissioner's human-chosen baseline bracket is tied for first with Team Delete Kernel's algorithmically chosen bracket. Don't count out InItToWinIt or The Pain Machine quite yet, though. I haven't worked through all the scenarios, but if Kansas wins it all and Arizona beats UConn, InItToWinIt has a good shot at the title. If UConn makes it to the final game, The Pain Machine is looking strong. And if Kentucky wins it all, Team Delete Kernel looks to have at least a share of the title locked up. Interestingly, I haven't been able to find a potential outcome where any baseline other than Lee's human-chosen bracket has a chance at the title. So after a tough start, the algorithmically-chosen brackets are making their move!

Currently, here are the standings:
ENTRANT                         R1  R2  R3  R4  R5  Winner      Pts  Possible
Human (Lee)                     25  18  16  -   -   Kansas      59   131
Team Delete Kernel              23  20  16  -   -   Ohio St.    59   91
Baseline (Higher Seed)          25  20  12  -   -   Ohio St.    57   81
Baseline (Nate Silver)          25  20  8   -   -   Ohio St.    53   77
InItToWinIt                     22  20  8   -   -   Kansas      50   106
Baseline (TrueSkill)            26  18  4   -   -   Ohio St.    48   48
Danny's Dangerous Picks         22  16  8   -   -   Duke        46   70
The Pain Machine                19  18  8   -   -   Kansas      45   125
Baseline (LRMC)                 25  16  4   -   -   Ohio St.    45   69
DukeRepeats                     23  16  4   -   -   Duke        43   67
Point Differential Centrality   23  16  4   -   -   Ohio St.    43   51
dirknbr1                        23  8   4   -   -   Ohio St.    35   59

Tuesday, March 22, 2011

Sweet Sixteen Bracket

Posted by Lee

If you did not have a chance to get your entry in for the full 63-pick tournament, you still have a chance to make picks in the condensed 15-pick format for our Sweet Sixteen Bracket! The next set of games begin Thursday night (March 24) Eastern time, so you will need to submit your brackets by then. If you already entered in the original bracket, you have been invited to the new one. If you are new and would like to participate, please e-mail leezen+MarchMadness at gmail.

Currently, in the full bracket prediction contest, Team Delete Kernel and InItToWinIt are sitting in the top half of the pack. Don't count The Pain Machine out yet, though, it can still score a possible 153 points, which is more than TrueSkill has available even though it currently ranks third.

P.S. Sorry I did not get around to posting competitor profiles this weekend.

Monday, March 21, 2011

After Round 2: Points from Upsets

Posted by Danny Tarlow
You can see the current standings here. At the moment, 3 of the top 5 (including the top 2) entries used some sort of human judgement in their algorithm: picking based on Higher Seed (tied for 1st) is the judgement of the seeding committee; the Nate Silver (tied for 1st) baseline uses seed information along with several other power ratings, some of which are human based, and some of which are computer based; and the Lee bracket (tied for 4th) was filled out by Lee with no computer assistance. The top two computer entries are the TrueSkill algorithm (3rd) that Scott Turner implemented and the Delete Kernel (tied for 4th) entry from Kevin, who built his entry based on the simplest 1D probabilistic matrix factorization model that I wrote about previously (and released code for).

I think the take-away at this point is that the real winner right now is whoever decided on the seeding of teams. The Higher Seed bracket is tied in first place, and the strength of the other brackets mostly comes from how closely their picks matched the higher seed. Yes, there have been a lot of upsets, including some big ones, and the entrants did indeed pick some upsets, but the entrants didn't generally pick the right upsets.

Here's an alternative way of looking at results that reveals this. I took the point totals for each contestant and split off the contribution to the point total that came from picking an upset. For the two rounds, I report "A/B" where B is the total number of points the entrant earned in the Yahoo bracket, and A is the number of points that came from predicting upsets. I define "upset points" to be points gained from a correct prediction, where the winning team is not the best ranked seed that could possibly have made it to that point. So even though Richmond (12) was the favorite over Morehead (13), predicting Richmond making it to the Sweet 16 would give a contestant 2 "upset points", because the best ranked seed that could have made it to that point was Louisville (4). Here are the results:
Points from upsets

TEAM                   R1      R2     Total
Delete Kernel:        4/23    2/20    6/43
InItToWinIt:          2/22    2/20    4/42
Human (Lee):          1/25    2/18    3/43
Point Differential:   3/23    0/16    3/39
Silver:               3/25    0/20    3/45
LRMC:                 3/25    0/16    3/41
Dirknbr1:             0/23    2/8     2/31
Danny:                2/22    0/16    2/38
TrueSkill:            2/26    0/18    2/44
The Pain Machine:     0/19    0/18    0/37
Higher Seed:          0/25    0/20    0/45
Under this evaluation measure (which is admittedly not what anybody was trying to optimize), the completely computer-based models are doing better. Perhaps the real take-away at the moment, though, is that predicting upsets is hard!

Friday, March 18, 2011

Current Standings

Posted by Lee

I've opened up the Yahoo Group for anyone who wants to see the standings and the submitted brackets.

Remember that if you were not able to submit a bracket in time for the full tournament, you will have a second chance as we will run a Sweet Sixteen bracket as well. The Sweet Sixteen does not start until 3/24, so you have almost a week to tweak and get something working before then!

The first round of play is over and after 32 completed games, the baseline predictors are doing very well against the algorithms. I have pasted the full standings below (I realized after last year's tournament that Yahoo does not preserve the brackets and standings, so I will try to do a better job keeping track of them on the blog this year).

Rank Bracket R1 R2 R3 R4 Semis Finals Points Possible
1 Baseline (TrueSkill) 24         Ohio St. 24 185
2 Baseline (Higher Seed) 23         Ohio St. 23 184
2 Baseline (Nate Silver) 23         Ohio St. 23 184
2 Baseline (LRMC) 23         Ohio St. 23 182
2 Human (Lee) 23         Kansas 23 182
6 Point Differential Centrality 22         Ohio St. 22 153
7 Danny's Dangerous Picks 21         Duke 21 182
7 DukeRepeats 21         Duke 21 180
7 dirknbr1 21         Ohio St. 21 168
10 Team Delete Kernel 20         Ohio St. 20 181
10 InItToWinIt 20         Kansas 20 165
12 The Pain Machine 17         Kansas 17 174

I will be posting introductions of the various contenders over the course of this weekend. In the meantime, cheer on the algorithms... or human baselines - the higher seed and "Lee" - if you fear the machines! Enjoy the exciting set of games coming this weekend!

Early Analysis of Algorithmic March Madness Entries

Posted by Danny Tarlow
Scott Turner has been analyzing entrants into our 2011 March Madness Predictive Analytics Challenge, and he has some interesting insights and comments. I'm posting them verbatim here.

In the last 10 years, the upset rate in the first round has been 22%. (Ignoring 8-9 matchups.) Looking at the other 6 entries in the contest (the ones I think are entries, anyway), they predict 16 upsets -- about 8%. Of the baselines, only LMRC comes close with 5 upsets (15%).

For my entry, I forced 15% upset picks over the entire tournament (~9). (15% is about the overall upset rate in the tournament for the past 10 years.) So I have 4 in the first round and 5 in the later rounds.

Also, everyone has a #1 seed winning it all, but no one has Pittsburgh.

(I'm assuming that Point Diff, InitToWinIt, DukeRepeats, Danny's, dirknbr1 and mine are the competitors.)

Ignoring upset picks where there's only a 1 seed differential (e.g., 9 over an 8), these are the upset picks across all the competitors:

Qty     Upset   Delta   Details
[1]     3 > 1   2       (BYU > Pitt)
[2]     6 > 3   3       (Xav > Syr, SJU > BYU)
[3]     7 > 2   5       (Was > UNC, A&M > ND, UCLA > Fla)
[1]     9 > 1   8       (ODU > Pitt)
[4]     10 > 7  3       (FSU > A&M, UGA > Was, PSU > Temple, MSU > UCLA)
[1]     10 > 2  8       (UGA > UNC)
[3]     11 > 6  5       (Missou > Cinc, Gonz > SJU x 2)
[1]     12 > 9  3       (USU > ODU)
[4]     12 > 5  7       (USU > KSU x 4)
[2]     12 > 4  8       (USU > Wis x 2)
[2]     12 > 3  8       (USU > BYU x 2)
[1]     12 > 2  12      (USU > Pitt, USU > Kan)
[2]     13 > 4  11      (Bel > Wis x 2)
[1]     14 > 6  8       (St. Pete > Georgetown)
[1]     14 > 3  11      (St. Pete > Purdue)
which if I add correctly is 30 upsets picks. The upset % over the last ten years (again ignoring 1 seed differential) is 119/519 = ~23%. By that metric these brackets ought to show 86 (!) upsets - almost 3x more. (The number is even worse if you take out my bracket, where I tried to force the right upset metric, but even my bracket should show more upsets.)

I think there's an interesting discussion to be had over whether a computer predictor ought to pick the most likely event on a game by game basis or not. The former leads to mostly "chalk" picks -- the only time you're going to go against the seeding is where the seeding is wrong (according to your model). Imagine a tournament where in every game the better team has a 1% advantage. The hypothetical predictor will pick the favorites in every game -- and miss almost half the games. Is that the best strategy?

It's also interesting to note that USU counts for a full third of the upset picks. If we assume that most of these models are only picking upsets if they think there's been an actual seeding mistake, then USU was the only "consensus" mis-seeding. That's kind of an interesting result in itself. That suggests that the committee does a good job seeding (at least in some sense) and/or that these predictors don't do a good job of discerning "hidden" information about the teams.

Of the 30 upset predictions, after the first day we know that 2 are right (the picks of Gonz > SJU), 13 are wrong, and the remaining 13 are undecided. That doesn't look like such a good record for the predictors :-). The correct upset pick was a delta of 5, which is a pretty big upset.

I can't see right now whether the ESPN Tourney Challenge shows statistics like how many people in the tourney picked particular upsets, but it would be interesting to compare the competitor upset picks to how people picked.

-- Scott Turner

Thursday, March 17, 2011

New Scientist Article

Posted by Danny Tarlow
MacGregor Campbell over at wrote a great article about our March Madness contest:

My only complaint is that Lee doesn't get the recognition for all of the hard work that he's put into making the contest happen. The contest literally would not have happened without him. (Thanks, Lee.) The other details are quite good, though. I'm particularly impressed by the clear and accurate description of the algorithms.

Thanks MacGregor! And thanks to all the participants who have entered the main bracket contest. If you didn't make it in time for the main contest entry, there's still the Sweet 16 contest. Get those algorithms running!

Wednesday, March 16, 2011

Less than 24 hours left

Posted by Lee
There are less than 24 hours left before the tip-off of the first game in the round of 64 in the NCAA tournament. A friendly reminder that you will need to have your entries submitted before the first game begins. If you're feeling crunched for time, we'll have a sweet sixteen bracket like last year -- so even if you haven't started coding/modeling/training/etc. it's not too late!

Tuesday, March 15, 2011

Dr. Scott Turner on March Madness: The Pain Machine 2011

Posted by Danny Tarlow
This is a guest post by Dr. Scott Turner, the force behind team "The Pain Machine," which was the co-winner of the Sweet 16 contest from last year. In this post, he'll describe what to expect from his entry to this year's algorithmic March Madness prediction contest. If you are planning to enter and would like to contribute a guest post, please email Danny or Lee.

Dr. Turner has a Ph.D. in Artificial Intelligence from UCLA. His dissertation subject was a program called MINSTREL that told stories about King Arthur and his knights, as a way to explore issues in creativity and storytelling. Since obtaining his Ph.D. in 1993, Dr. Turner has worked for the Aerospace Corporation, where he advises the nation's space programs on software and systems engineering issues.

As a lifelong college basketball fan and a student of AI, I was intrigued last year when I saw Danny's call for participants in a tournament prediction contest. I put together a program and managed to tie Danny in the Sweet Sixteen bracket.

The program I wrote last year used a genetic algorithm to evolve a scoring equation based upon features such RPI, strength of schedule, wins and losses, etc., and selected the equation that did the best job of predicting the same outcome as the games in the training set. I felt the key in winning a tournament picking contest was in guessing the upsets, so I added some features to the prediction model intended to identify and pick likely upsets.

In retrospect, the competition pool for this contest is so small that picking upsets is probably not as important as it is in (say) the full Yahoo pool, where you need to gamble more if you hope to distinguish yourself from the mass of consensus picks.

Since the contest last year I have continued to work on the Pain Machine. The focus of my effort shifted more towards predicting the margins of regular season games. My latest models predict regular season games about 80% correctly with an error of about 8 points. Since 2/5, the Pain Machine has predicted 63% correctly against the spread (from for selected games where it has identified a strong advantage. However, the sample size for that is very small so the result may not be meaningful. Performance of the model against the 2009 and 2010 tournament games is similar (or better), although again the sample size is very small.

In the course of the last year I have identified several interesting (i.e., non-obvious) keys to understanding and predicting college basketball games. But perhaps the most striking realization has been the variability of the outcome of college game. In Stokol's original paper on his LMRC method for predicting college basketball games, he compares pairwise home-and-home matchups and determines that a team needs to win at home by >21 points to have an even chance to beat the same team on the road! While I don't agree with the magnitude of that result, it's clear that what I would have evaluated as a "convincing" win -- say by 10 or 12 points -- is actually surprisingly weak evidence of the superiority of Team A over Team B. A case in point is Virginia Tech's home win over Duke late in the season. Virginia Tech is at best a bubble team and still managed to beat one of the best two or three teams in the country. Does this mean that Virginia Tech is significantly better than we thought? Or Duke significantly worse? Probably not. So it's hard to imagine a predictor that could consistently predict those sorts of outcomes, which makes the problem all the more interesting!

-- Scott Turner

Monday, March 14, 2011

March Madness Predictions: Code Description

Posted by Danny Tarlow
In the last post, I showed outputs of the 1D version of my matrix factorization model for predicting March Madness results. Here, I'm posting the code along with a brief description of how to get it running, so you can replicated my results, possibly as the basis for your entry into the 2011 March Madness Predictive Analytics Challenge.

To start, there are two Python files you need:
You'll also need the pickle file with the data:
Put all the files in a directory, and make a directory named "models" to store the output.

Now there are two steps:
  1. Train a model by running "python".
  2. Print out the offensive and defensive scores by running, "python".
That's it!

If you'd like to simulate particular games, there is a function in called simulate_game(team_code_A, team_code_B). There is also old code to simulate a full bracket, but that hasn't been updated from previous years (yet).

If you'd like to train higher dimensional models or play around with the regularization parameter, feel free to change things at the top of __init__() in Higher dimensional models are harder to visualize, so one idea would be to sort teams based on how they are predicted to fare against a favorite like Duke ("dau" team code).

Happy predicting!

The Algorithm's 2011 March Madness Predictions Part 1

Posted by Danny Tarlow
We seem to have ironed out most issues related to data for the 2011 March Madness Predictive Analytics Challenge. As a reminder, check the March Madness label to see all contest-related posts in one place. If you'd like to enter the competition, there's still time! (And Lee has posted new, easier to use data.)

With announcements out of the way, it is with great pleasure that I announce that I've got my algorithm running on the 2011 season data. In this post, I will show outputs from the simple 1D version of my probabilistic matrix factorization model. I'll run the higher dimensional version that I use to actually make predictions and write about it in a later post (or you can follow the instructions here, run the code, and train the model yourself).

How the Model Works: The basic idea is very simple: we want to represent each team's offensive and defensive strength with a single number (in the 1D version) for each. We will make a prediction for the result of team i playing team j as follows:

Predicted score for i = offense_strength_i * defense_strength_j
Predicted score for j = offense_strength_j * defense_strength_i

It should be clear that higher numbers mean better offense, and lower numbers mean better defense.

The learning algorithm looks over all games played this season and tries to find a setting of offensive and defensive strengths for each team such that scores predicted by the model best match the actual outcomes observed*. (If you want the short answer, this is achieved via the miracles of calculus.)

What I'm Showing: I will first report a combined measure, which takes the offensive strength and subtracts the defensive strength. If you think of having each team play against a baseline team with offensive and defensive strength 1, then the difference tells you how much you expect to the team to win by (or, if it's negative, to lose by). Afterwards, I show the top offenses and the top defenses, along with their strengths learned by the model. In all cases, I report the top 50 teams. Keep in mind that the algorithm knows nothing about rankings, players, or anything other than the final score of each game. Also keep in mind that I know less than the algorithm about what happened this year in college basketball. There are some important caveats in the comments under this post.

If you'd like to reproduce these results at home, follow the instructions and run the code in the next post.

So without further ado, here are the outputs. You can use these to predict the score of any game by plugging into the formula above. (Only the top 50 teams are shown in the post for each measure. The outputs for all teams in the database are here.)

Combined Rating

Duke (3.15)
Ohio St. (3.10)
Kansas (3.07)
Washington (2.71)
Pittsburgh (2.64)
Texas (2.57)
Kentucky (2.52)
Purdue (2.43)
Louisville (2.42)
BYU (2.41)
Notre Dame (2.41)
Syracuse (2.39)
North Carolina (2.34)
San Diego St. (2.26)
Wisconsin (2.21)
Connecticut (2.15)
Georgetown (2.01)
Missouri (2.00)
Arizona (2.00)
Illinois (1.99)
Cincinnati (1.96)
West Virginia (1.95)
Vanderbilt (1.95)
Florida (1.94)
Marquette (1.93)
UNLV (1.93)
Villanova (1.92)
Kansas St. (1.84)
Maryland (1.83)
Gonzaga (1.78)
Michigan St. (1.72)
St. Mary's (1.72)
Virginia Tech (1.72)
St. John's (1.70)
Utah St. (1.68)
Washington St. (1.65)
Florida St. (1.63)
Texas A&M (1.63)
Belmont (1.62)
Clemson (1.59)
Xavier (1.58)
Michigan (1.57)
Temple (1.55)
New Mexico (1.50)
George Mason (1.49)
UCLA (1.47)
Colorado (1.47)
Minnesota (1.46)
Northwestern (1.44)
USC (1.43)


Washington (10.72)
Duke (10.43)
Kansas (10.39)
BYU (10.32)
Oakland (10.20)
Missouri (10.18)
Ohio St. (9.97)
North Carolina (9.89)
Virginia Military (9.87)
Notre Dame (9.79)
Colorado (9.79)
Kentucky (9.76)
Vanderbilt (9.73)
Louisville (9.69)
Marquette (9.67)
Maryland (9.65)
Arizona (9.64)
Providence (9.62)
Pittsburgh (9.56)
Long Island (9.56)
St. Mary's (9.56)
Connecticut (9.54)
Syracuse (9.54)
La Salle (9.49)
Texas (9.49)
South Dakota St. (9.47)
Purdue (9.47)
Iona (9.46)
California (9.42)
Duquesne (9.42)
Belmont (9.40)
Villanova (9.39)
Georgetown (9.37)
Gonzaga (9.35)
Texas Tech (9.35)
Iowa St. (9.30)
Washington St. (9.29)
Mississippi (9.29)
Illinois (9.29)
Northwestern (9.29)
UNLV (9.25)
Boston Coll. (9.21)
Kansas St. (9.20)
Florida (9.18)
Xavier (9.17)
Detroit (9.14)
New Mexico (9.11)
St. John's (9.10)
Michigan St. (9.06)
Charleston (9.06)


Wisconsin (6.61)
San Diego St. (6.74)
New Orleans (6.79)
Cincinnati (6.84)
Utah St. (6.85)
Ohio St. (6.87)
Nebraska (6.88)
Texas (6.92)
Pittsburgh (6.92)
USC (6.96)
Alabama (6.98)
Penn St. (6.99)
Old Dominion (6.99)
West Virginia (7.00)
Texas A&M (7.01)
Clemson (7.02)
Michigan (7.03)
Purdue (7.04)
Cal Poly (7.05)
Virginia (7.12)
Virginia Tech (7.13)
Florida St. (7.15)
Syracuse (7.15)
Drexel (7.18)
Temple (7.22)
Kentucky (7.23)
Florida (7.24)
Stephen F. Austin (7.25)
Louisville (7.26)
Duke (7.29)
Richmond (7.29)
Illinois (7.30)
Kansas (7.32)
UNLV (7.32)
Michigan St. (7.34)
Fairfield (7.34)
Saint Louis (7.35)
Georgetown (7.36)
Kansas St. (7.36)
St. Peter's (7.37)
Northern Iowa (7.37)
Denver (7.38)
Notre Dame (7.39)
UAB (7.39)
St. John's (7.39)
Connecticut (7.39)
Montana (7.40)
UCLA (7.40)
Seton Hall (7.41)
South Florida (7.42)

* There's also some regularization.

Sunday, March 13, 2011

More Data Updates

Posted by Lee
To remedy the caveat about team field goals I mentioned in the previous post, I am now explicitly generating the score data by crawling the scoreboard. I have posted this to this file with the following format: date, home team, away team, home score, away score, home team won (1 yes, 0 no).

I also found out that March 11, 2011 was missing from the data. I have re-crawled this date and added it to the files. If you want to use this date in your data, please re-download the .zip files. The data is inclusive through March 12, 2011.

Data Caveats

Posted by Lee
As we've seen in the past 12 hours, there have been some hiccups with the data. I apologize that data quality has been a bit lacking. Hopefully things will be relatively stable from here. I also wanted to point out a few other caveats in the data.
  • Some games are missing completely. This is because at crawl time, sometimes the server hiccups and returns an empty page even though I am not being throttled.
  • Some games exist in Games.tsv but there is no player data for it. This is because in some cases, the Yahoo! box score format is missing columns like minutes and the parser does not expect this.
  • Some games have inaccurate full aggregate point totals. In a few cases, field goals exist at the team level, but not the individual player level. I expect this to be relatively rare, but does cause some score discrepancies in the aggregate.
I'll try to keep this list up to date. Please comment/e-mail if you run into other items such as the above.

Aggregate Game Results

Posted by Lee
To make things even easier, I've uploaded a file containing aggregate game results.

The format for this file is:
  1. Game Date
  2. Home Team
  3. Away Team
  4. Home Team Points
  5. Away Team Points
  6. Home Team Win (home team points > away team points) - 1 if the home team won, 0 if the away team won
The columns are comma-separated.

Selection Sunday Today!

Posted by Lee
UPDATE 3: 3/11/2011 was missing from the data. The archive has been corrected. It was also renamed to reflect that it contains all games through 3/12 and not 3/13.

UPDATE 2: It turns out that I was incorrectly parsing boxscores where the teams had a seed number in front of their name, leading to this being interpreted as "UNK" when in fact, these are incredibly important games (thanks, Rob). I have fixed this and uploaded new data. Please re-download the data. The links below have been corrected to point at the new data as well.

UPDATE: Thanks to the keen observation of one of our contestants (thanks, Scott), we've updated the data as the boxscore format we were crawling changed starting 3/10. The link below has been updated to the new data.

Today is Selection Sunday, when we will find out who is playing in this year's NCAA tournament. This also means that all the conferences have completed their games and it is time to release the data. Full contest details are available here.

Please download these two files: The team code mapping is there to help you convert between the codes in the actual boxscore data and the actual names of the teams. As with previous data release, there is a team code called "UNK" that refers to a team that is not within the list of codes. These are often small teams either in non-Division 1 conferences or no longer in Division 1.

There are two files within the data archive: Players.tsv and Games.tsv.

Games.tsv - each row corresponds to a game played during the season.
  1. Game ID
  2. Date
  3. Home team
  4. Away team

Players.tsv - each row corresponds to a single player's performance in one game.
  1. Player name
  2. Game ID
  3. Player's team
  4. Minutes played
  5. Field goals made
  6. Field goals attempted
  7. Three pointers made
  8. Three pointers attempted
  9. Free throws made
  10. Free throws attempted
  11. Offensive rebounds
  12. Defensive rebounds
  13. Assists
  14. Turnovers
  15. Steals
  16. Blocks
  17. Personal fouls
Note that three pointers are included in the number of field goals. Thus if a player has made 3 field goals and 1 three pointer, then that player has scored 7 (2 + 2 + 3) points.

In the previous post, I had a faulty link to join the group on Yahoo. The correct link is As with before, please e-mail leezen+MarchMadness at gmail for the password. Please include your team name, team members, and brief description.

Good luck!

Friday, March 11, 2011

Official 2011 March Madness Predictive Analytics Challenge

Posted by Lee
Welcome to the second annual March Madness Predictive Analytics Challenge! I'm very excited about this event and I hope you are, too! We're still trying to line up some prizes, but for sure, like last year, there will be a gift certificate to

This year's format will be more or less the same as last year's.


Most readers of this blog are probably familiar with the general idea of what this contest is about. In case you aren't a frequent reader or a fan of college basketball, this section will serve as a brief introduction. March 11th is "Selection Sunday" where the teams for the NCAA College Basketball tournament will be selected. In total, there will be 68 teams with 8 teams playing four "play-in" games on March 15th and 16th to determine the field of 64. For the purposes of this contest, you do not need to worry about these initial play-in games. The remaining 64 teams are then pit against each other in a bracket with one national champion emerging as the winner. Every year, millions of people fill in their predictions of who will be the winners and losers of the games. People participate in leagues or pools with other people to see who has the best bracket. We would like YOU to participate in our algorithm-only pool. That is, your bracket must be completed by a computer algorithm based upon historical data without the use of human judgment.

Contest Format

The format is fairly simple. We will have two pools: a Tournament pool and a Sweet Sixteen pool. Entries in both pools will be evaluated on the typical exponential point scoring system. Correct picks get 1, 2, 4, 8, 16, and 32 points depending on the depth in the bracket (1 point in the first round, 2 points in the second round, etc). The entry only needs to pick the winning team. Thus, if the other team is no longer in the tournament, but the winning team is picked, points are still awarded. Each person is limited to one entry per pool. Each pool will have a winner determined by the submission scoring the most points.


TOURNAMENT pool entries must be submitted no later than March 17, 2011 (the first day of play in the round of 64).
SWEET SIXTEEN pool entries must be submitted no later than March 24, 2010 (the beginning of the sweet sixteen round).


  • Your bracket must be chosen completely by a computer algorithm.
  • The computer algorithm must base the decision upon historical data.
  • You may not hard code selections into your algorithm (e.g., "Always pick Stanford over Cal")
  • Your algorithm may only use the data set published for the tournament. The data will be released on Sunday, March 13.
  • The above rule is fairly restricting, but I believe this provides a more even playing field. The contest should be about your algorithm's predictive capabilities and not a data advantage one person has over another.
  • You must be able to provide code that shows how your entry picks the winners. In other words, your bracket and the selection of winning teams in your bracket must be reproducible by me on a machine.
  • In the event of a tie, the entry with the EARLIER submission time wins.


We'll be using Yahoo's bracket system for the contest submissions. Please send an e-mail to leezen+MarchMadness at gmail for the group password to join. Please include your team name, team members, and brief description.




As described above, only the official contest data on this blog is acceptable for use in this contest. You can get a sample of the data, which has all games from the 2006 season through February 2011. Please see this post for details. I will also update this post on Sunday with a link to the full data set.

UPDATE: I have an updated post with details on the final data: Selection Sunday Data

Additional Information

Please be aware that algorithm computation time will be somewhat important in this task. You will be able to predict most of your games ahead of time between March 13th and 17th but because of the four play-in games, you will need to predict the outcome of four games between March 15th and 17th as the match-ups in the round of 64 will not be known until the play-in games are complete.

If you have other questions, concerns, etc. please comment on this post and I'll do my best to answer.

Tuesday, March 8, 2011

Predicting March Madness: Tournament Structure and Bracket Scoring Measures

Posted by Danny Tarlow
This is the second post aimed at getting you all excited about entering the 2011 March Madness computer predictions challenge. If you missed George's earlier post, go check it out first, to get the creative juices flowing.

George wrote about the problem of given two teams, how likely is it is that one will beat the other. What I will talk about now is how the tournament structure comes into play--the concerns that go beyond which of two teams will win the next game they're scheduled to play against each other. I'll assume throughout that we have at our disposal a good model of individual games.

The first natural question is whether any additional thought is necessary. If we have a good model of individual games, can't we just greedily use the model to predict all first round outcomes, then all second round outcomes, ..., then be done? Maybe it's not such a bad idea. That's what I've done the past few years, after all.

I'm going to argue that there may be good reasons to work on factoring the tournament structure and contest scoring rules into your model. Let's begin by thinking about the tournament probabilistically. In the most general setup, there is a probability distribution P over the joint space of all outcomes. In a bracket with 64 teams (ignore the play-in game for now), there are 2^32 possible results for the first round. Given those, there are 2^16 possible results for the second round. Given those, 2^8, then 2^4, then 2^2, then 2 possible outcomes for later rounds. Multiplying all those, we get 2^63 possible joint outcomes. A model should be to be able to assign a probability to any of those full outcomes.

Compact representation of probabilities (aka the model): 2^63 is a very large number, so it's not feasible to represent the probability of each outcome as a separate entry in a very large table. Beyond the practical concerns, it also wouldn't be a good idea from a statistical estimation perspective--such a model would make inefficient use of training data.

To illustrate this, imagine for a moment that the same tournament happens repeatedly (the same teams start in each position of the bracket, but there may be different outcomes), and suppose we want to use data from the last N tournaments to estimate the probability of outcomes in the next repetition of the tournament. One approach is to just count how often each full bracket resulted in the last N tournaments. This would use on the order of 2^63 parameters. If N is infinite (or _extremely_ large), then this is a perfectly good approach; however, when we have limited data (i.e., nearly always), nearly all entries in the big table will be zero; in this case, it is important to have a more compact representation.

The standard solution is to find a more compact representation of P. Here's one, which we'll call the fully factorized model: the probability of a full tournament outcome P(g_1, ..., g_63) over 63 games g_1 ... g_63 is proportional to f(g_1) * f(g_2) * ... * f(g_63). For each game, we can look at who the participants are, then estimate f by looking over the last N repetitions of the tournament where the appropriate two teams met in a game. It should be clear that there are exponentially many fewer parameters in the compact model than there are in the big joint table model--in the simplest version, we just need to represent one number for each pair of teams, so there are on the order of M^2 parameters if there are M teams in the tournament. (Going further, which is necessary in practice, when there is only one tournament per year, if we were to use e.g., a matrix factorization model, we could represent an "offensive strengths" vector (or scalar) and a "defensive strengths" vector (or scalar), then we could produce

f(team i, team j) = g(team i offense * team j defense - team i defense * team j offense),

which requires only on the order of M parameters.)

Regardless of how f is computed, the fully factorized model makes some practical sense as well, because on first pass, we typically don't expect the result of an early game in the South bracket to affect the outcome of an early game in the East bracket, for example. In the fully factorized model, that assumption is built in. Statistically, the rough basic intuition is that all else being equal, if we have fewer parameters in the model, we need fewer past data instances in order to pin down their values accurately.

The flip side of using a simpler model is that we lose the ability to model certain interactions. Let's think critically about this: do we really believe that the outcome of each game depends only on which teams are playing in the current game? If you listen to sports announcers, it's not clear to me that the answer is yes. For example,
  • Some teams might do better or worse under pressure, players might get tired after several intense games in a row, or the location of the game affect the outcome. In all these cases, the round in which the game takes place could be important.
  • If an underdog pulls off a big upset over a top seed, the underdog might gain confidence and become stronger in their next games.
  • If a top seed just barely beats a big underdog, the top seed might either lose confidence or snap out of a lull.
  • If an underdog in say the East bracket beats a top seed in say the South bracket on Day 1, it might give other underdogs a boost of confidence on Day 2.
  • Suppose we have a structure where A is playing B and C is playing D, then the winners play each other. If B is weak and C is a big rival of A's, it's possible that A could look past B to C and not come prepared for the game against B.

Are these things important to model? I'm not sure, but it would be interesting to try. If we do, we'll need a more complex representation of our tournament outcome model than the simple fully factorized approach gives.

Loss functions: Now let's switch gears and consider precisely what our goal is. I'll assume that each bracket challenge contestant can only enter one bracket into the contest. In the simplest case, we might be concerned only with whether we get the perfect bracket or not; this would be the case in a contest where a perfect bracket wins $5M, but there is no payout for a less-than-perfect bracket. Then, the optimal strategy is to choose the joint outcome that the model thinks is most likely. The optimality is easy to see, because the expected value is equal to $0 * P(not perfect) + $5M * P(perfect). So assuming the model captures all we know about the tournament, we should clearly choose the outcome that the model assigns the highest probability.

What about the case where the goal is to get as many predictions right as possible?

Maybe we are equally unhappy if we do worse than the baseline that makes all predictions based solely on seed, then we're linearly more happy as we beat that baseline by more. (For example, if you're betting against your mom, and she has a history of always choosing the better seed.)

Or maybe you are linearly more happy as you get more "points", but points are determined by some contest-specific scoring rules, like the first round is worth 1 point, the second round is worth 2, the third 4, etc.

Whatever your goal, there is likely some loss function that can be formulated to capture your desires for a model. Formally, a loss function is just a function of a model's prediction and the ground truth that tells how unhappy we are with the prediction. If the prediction is perfect (all games right!), loss should be 0, and if the prediction is the worst possible (all games wrong?!) the loss should be maximum (let's say 1). Obviously we'd prefer to just have a perfect bracket, but given such a hard problem, that's unrealistic. The job of a loss function is to tell the model how to trade off different kinds of mistakes. It's rarely the case that we really think there is one bracket that is perfect and all others (like the one that gets everything right except the play-in game) are equally bad, so formulating a good loss function generally takes some thought.

These types of questions are near and dear to my heart (though I generally think more about them in the context of computer vision than March Madness), so hopefully you now have some intuitions about why we might want to go beyond simple per-game models, and hopefully you're convinced that part of defining a good model is defining a good loss function.

A question that remains is how to actually build such a model. How do we formulate this idea that we are making multiple correlated predictions at once (predicting the whole bracket rather than just one game at a time), and how do we ask a learning algorithm to try to optimize a particular loss function? I'm glad you asked! Luckily, there is a whole subfield of machine learning that looks at these types of problems. It goes by the name of structured prediction, or structured output learning.

I'd write more about it here, but Hal Daumé III has written a great series on the exact subject over at his blog. So if you are interested in more, I encourage you to go check out the following posts:
And it should go with out saying: keep working on those models! The competition will be starting soon!