Thursday, August 11, 2011

Testing Intuitions about Markov Chain Monte Carlo: Do I have a bug?

Posted by Danny Tarlow
For one project I've been working on recently, I'm using a Markov Chain Monte Carlo (MCMC) method known as slice sampling. There are some good tutorials, examples, and implementations out there (e.g., by Iain Murray or Radford Neal), but for various reasons, I wanted to implement my own version.

Now, debugging MCMC algorithms is somewhat troublesome, due to their random nature and the fact that chains sometimes just mix slowly, but there are good ways to be pretty sure you've got things right. For example, the Geweke method is widely regarded as _the_ way to check that a sampler is correct. So this exercise is not really about debugging; it's more about testing intuitions about the behavior of a sampler.
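To give a flavor of the Geweke idea: draw (parameters, data) jointly in two ways -- once by pure forward simulation from the model, and once by alternating data resampling with your sampler's parameter update -- and check that the two sets of parameter samples have matching statistics. Here's a minimal sketch on a toy conjugate model; everything below is made up for illustration and is not the model from this post.

```python
import numpy as np

# Toy Geweke-style check on the model theta ~ N(0, 1), y | theta ~ N(theta, 1).
#   1. Forward simulation: draw theta from the prior.
#   2. Successive-conditional simulation: alternate resampling y | theta with
#      the sampler's update theta | y (here the exact posterior N(y/2, 1/2)).
rng = np.random.default_rng(1)
N = 20000

theta_forward = rng.standard_normal(N)

theta_chain = np.empty(N)
theta = rng.standard_normal()
for t in range(N):
    y = theta + rng.standard_normal()                     # y | theta
    theta = y / 2 + np.sqrt(0.5) * rng.standard_normal()  # theta | y
    theta_chain[t] = theta

# If the transition operator is correct, both sets of samples target the same
# marginal N(0, 1), so their moments should agree up to Monte Carlo error.
```

If the conditional update had a bug (say, the wrong posterior variance), the moments of `theta_chain` would drift away from those of `theta_forward`, flagging the error.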

With that out of the way, on to the question:
I implemented my sampler, initialized it with small random numbers for the parameters, and set it running on a simple test model (which I'm intentionally not describing in detail). One high level statistic that is relevant to look at is the (log) probability of samples versus iteration of the sampler, so I made that plot. It looks like this:
This plot looks a bit surprising. Upon initialization, the sampler moves directly to regions of space that have very low probability (remember, this is a _log_ probability*), and it appears to just keep heading into exponentially less and less probable regions. The point of a sampler is that it should spend time in a state in proportion to that state's probability, yet this sampler is making a beeline toward a state that is a factor of e^600 less probable than where it started.

So here's the question: do I have a bug? In other words, if you were my supervisor and I came to you with this plot, would you dismiss it and send me back to debugging? If not, explain how this could possibly make sense.

I'll post my answer sometime in the next couple days.

* I'm leaving out constants, so the graph would be shifted down (but wouldn't change shape or scale) if I were including all the constants.

Update: This is long overdue, but Iain Murray nailed it in the comments:
No there isn’t (necessarily) a bug. This type of plot is very easily produced with valid code: e.g. by slice sampling a unit spherical Gaussian distribution in D=5000 dimensions and initializing at an atypical point of high-probability (much closer to the origin than sqrt(D) away). Simple Metropolis and slice samplers can’t change the log-prob by more than ≈1 per iteration, so large log-prob changes are slow and painful. Intelligent proposals, reparameterizations, or auxiliary variable methods can improve matters. This is a nice illustration that initializing with an optimizer (without some form of early stopping) can be a bad idea!
In fact, I was using Matlab's built-in slice sampler, along with a zero-mean, spherical, many thousand-dimensional Gaussian distribution for the likelihood, and initializing near the mode (0).
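For the curious, the phenomenon is easy to reproduce. The sketch below (my own illustration, not the code from the post) runs a simple random-direction slice sampler on a unit spherical Gaussian in D = 1000 dimensions, initialized at the mode. The log-probability marches steadily away from ~0 toward its typical value of about -D/2, changing only O(1) per iteration, which is exactly the shape of the plot above.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 1000  # dimensionality of the spherical Gaussian

def logp(x):
    # Unnormalized log-density of a unit spherical Gaussian.
    return -0.5 * np.dot(x, x)

def slice_step(x, w=1.0):
    """One slice-sampling update along a random direction (step out, then shrink)."""
    d = rng.standard_normal(D)
    d /= np.linalg.norm(d)
    log_height = logp(x) + np.log(rng.random())  # log of the auxiliary slice height
    # Step out until both bracket ends fall below the slice.
    lo = -w * rng.random()
    hi = lo + w
    while logp(x + lo * d) > log_height:
        lo -= w
    while logp(x + hi * d) > log_height:
        hi += w
    # Shrink the bracket until a point inside the slice is found.
    while True:
        t = rng.uniform(lo, hi)
        proposal = x + t * d
        if logp(proposal) > log_height:
            return proposal
        if t < 0:
            lo = t
        else:
            hi = t

x = np.zeros(D)  # initialize at the mode: a high-probability but atypical point
log_probs = []
for _ in range(200):
    x = slice_step(x)
    log_probs.append(logp(x))
# log_probs drifts steadily downward from ~0 toward roughly -D/2,
# moving only about a unit per iteration.
```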

Sunday, April 10, 2011

Crawling Code

Posted by Lee

One of the contestants requested that I upload the code to crawl the boxscores. I have done so and it is available on github: https://github.com/leezen/boxscore-crawler

Note that Yahoo changed its format starting around March 10th and the code uses the flag to get the old boxscore format. It is unclear how long this option will remain available from Yahoo.

2011 Predictive Analytics Challenge Winner

Posted by Lee

We knew it would be a machine, but we didn't know which one until UConn's victory secured the title for The Pain Machine in the 2011 March Madness Predictive Analytics Challenge! Congratulations to Scott Turner and his entry on the victory -- his second in a row! He will be receiving a $25 gift certificate to Amazon.com. It doesn't sound like he'll be resting on his laurels, as he's started a blog that will detail further development of his system.

Thank you to all the participants and entrants this year. We would love to know how you thought the contest went, how we can improve for next year, and any other feedback you might have! We look forward to your participation again next year!

Sunday, March 27, 2011

2011 Algorithmic March Madness: Machines Lock in Victory over Humans

Posted by Danny Tarlow
Well, that was an exciting and surprising weekend of basketball. (8)Butler beat (2)Florida in overtime, and (11)Virginia Commonwealth handily took care of (1)Kansas, eliminating the last remaining #1 seed. Rounding out the Final Four are (3)UConn and (4)Kentucky.

March Madness is usually good for some Cinderella stories, but this Final Four seems particularly improbable. There are no #1 seeds remaining (this has been the case 3 times in March Madness history, according to the TV announcers), and never before have teams seeded as low as #11 and #8 met in a Final Four game.

None of the entries in our second annual March Madness Algorithm Challenge saw this much madness coming. Two entrants correctly predicted that (3)UConn would make the Final Four (Team Delete Kernel and The Pain Machine), but no other algorithms or baselines got any Final Four teams correct. In fact, the only entry with points still in play is The Pain Machine, which has UConn winning one more game.

So the question of which algorithm will win the contest is still not settled: a UConn victory on April 2, and The Pain Machine walks home with the prize; a UConn loss, and Team Delete Kernel is our winner.

What is settled at this point is that a machine will claim victory over the human-aided competition. The human baselines include our commissioner Lee's bracket; the Higher Seed bracket (where the human intervention came via the committee that chose seeds); and the Nate Silver baseline, which was a part-human, part-computer effort.

To give you an idea of the potentially winning methodologies, Scott Turner (The Pain Machine) describes his approach here, and Kevin Lee (Team Delete Kernel) based his model on the method described here.

So it's premature to congratulate a winner yet, but let me tritely say that I, for one, welcome our new March Madness algorithm overlords.
ENTRANT                         R1  R2  R3  R4  R5  Winner      Pts  Possible
Team Delete Kernel              23  20  16  8   -   Ohio St.    67   67
Human (Lee)*                    25  18  16  0   -   Kansas      59   59
Baseline (Higher Seed)*         25  20  12  0   -   Ohio St.    57   57
The Pain Machine                19  18  8   8   -   Kansas      53   69
Baseline (Nate Silver)*         25  20  8   0   -   Ohio St.    53   53
InItToWinIt                     22  20  8   0   -   Kansas      50   50
Baseline (TrueSkill)            26  18  4   0   -   Ohio St.    48   48
Danny's Dangerous Picks         22  16  8   0   -   Duke        46   46
Baseline (LRMC)                 25  16  4   0   -   Ohio St.    45   45
DukeRepeats                     23  16  4   0   -   Duke        43   43
Point Differential Centrality   23  16  4   0   -   Ohio St.    43   43
dirknbr1                        23  8   4   0   -   Ohio St.    35   35

* Denotes human involvement.
You can see the full brackets here (Update: actually, it looks like Yahoo took them down). Also, there is a second, Sweet 16 contest that we haven't mentioned lately. Stay tuned for an update on that front.

Friday, March 25, 2011

Algorithmic March Madness: Elite 8 Update

Posted by Danny Tarlow
The Sweet 16 saw some major upsets, with two of the remaining three number one seeds falling. Perhaps most impressive was Arizona's (5 seed) 93-77 thumping of Duke (1 seed). In the Elite 8, we have the following seeds: 4, 2, 5, 3, 1, 11, 8, 2.

No entrants predicted that Butler (8 seed) or VCU (11 seed) would make it this far, but we're starting to see the first signs of weakness in the Higher Seed and Nate Silver baselines. At this point, our Commissioner's human-chosen baseline bracket is tied for first with Team Delete Kernel's algorithmically chosen bracket. Don't count out InItToWinIt or The Pain Machine quite yet, though. I haven't worked through all the scenarios, but if Kansas wins it all and Arizona beats UConn, InItToWinIt has a good shot at the title. If UConn makes it to the final game, The Pain Machine is looking strong. And if Kentucky wins it all, Team Delete Kernel looks to have at least a share of the title locked up. Interestingly, I haven't been able to find a potential outcome where any baseline other than Lee's human-chosen bracket has a chance at the title. So after a tough start, the algorithmically chosen brackets are making their move!

Currently, here are the standings:
ENTRANT                         R1  R2  R3  R4  R5  Winner      Pts  Possible
Human (Lee)                     25  18  16  -   -   Kansas      59   131
Team Delete Kernel              23  20  16  -   -   Ohio St.    59   91
Baseline (Higher Seed)          25  20  12  -   -   Ohio St.    57   81
Baseline (Nate Silver)          25  20  8   -   -   Ohio St.    53   77
InItToWinIt                     22  20  8   -   -   Kansas      50   106
Baseline (TrueSkill)            26  18  4   -   -   Ohio St.    48   48
Danny's Dangerous Picks         22  16  8   -   -   Duke        46   70
The Pain Machine                19  18  8   -   -   Kansas      45   125
Baseline (LRMC)                 25  16  4   -   -   Ohio St.    45   69
DukeRepeats                     23  16  4   -   -   Duke        43   67
Point Differential Centrality   23  16  4   -   -   Ohio St.    43   51
dirknbr1                        23  8   4   -   -   Ohio St.    35   59

Tuesday, March 22, 2011

Sweet Sixteen Bracket

Posted by Lee

If you did not have a chance to get your entry in for the full 63-pick tournament, you still have a chance to make picks in the condensed 15-pick format for our Sweet Sixteen Bracket! The next set of games begins Thursday night (March 24) Eastern time, so you will need to submit your brackets by then. If you already entered in the original bracket, you have been invited to the new one. If you are new and would like to participate, please e-mail leezen+MarchMadness at gmail.

Currently, in the full bracket prediction contest, Team Delete Kernel and InItToWinIt are sitting in the top half of the pack. Don't count The Pain Machine out yet, though: it can still score a possible 153 points, which is more than TrueSkill has available even though TrueSkill currently ranks third.

P.S. Sorry I did not get around to posting competitor profiles this weekend.

Monday, March 21, 2011

After Round 2: Points from Upsets

Posted by Danny Tarlow
You can see the current standings here. At the moment, 3 of the top 5 (including the top 2) entries used some sort of human judgement in their algorithm: picking based on Higher Seed (tied for 1st) is the judgement of the seeding committee; the Nate Silver (tied for 1st) baseline uses seed information along with several other power ratings, some of which are human based, and some of which are computer based; and the Lee bracket (tied for 4th) was filled out by Lee with no computer assistance. The top two computer entries are the TrueSkill algorithm (3rd) that Scott Turner implemented and the Delete Kernel (tied for 4th) entry from Kevin, who built his entry based on the simplest 1D probabilistic matrix factorization model that I wrote about previously (and released code for).

I think the take-away at this point is that the real winner right now is whoever decided on the seeding of teams. The Higher Seed bracket is tied in first place, and the strength of the other brackets mostly comes from how closely their picks matched the higher seed. Yes, there have been a lot of upsets, including some big ones, and the entrants did indeed pick some upsets, but the entrants didn't generally pick the right upsets.

Here's an alternative way of looking at results that reveals this. I took the point totals for each contestant and split off the contribution to the point total that came from picking an upset. For the two rounds, I report "A/B" where B is the total number of points the entrant earned in the Yahoo bracket, and A is the number of points that came from predicting upsets. I define "upset points" to be points gained from a correct prediction, where the winning team is not the best ranked seed that could possibly have made it to that point. So even though Richmond (12) was the favorite over Morehead (13), predicting Richmond making it to the Sweet 16 would give a contestant 2 "upset points", because the best ranked seed that could have made it to that point was Louisville (4). Here are the results:
Points from upsets

TEAM                   R1      R2     Total
Delete Kernel:        4/23    2/20    6/43
InItToWinIt:          2/22    2/20    4/42
Human (Lee):          1/25    2/18    3/43
Point Differential:   3/23    0/16    3/39
Silver:               3/25    0/20    3/45
LRMC:                 3/25    0/16    3/41
Dirknbr1:             0/23    2/8     2/31
Danny:                2/22    0/16    2/38
TrueSkill:            2/26    0/18    2/44
The Pain Machine:     0/19    0/18    0/37
Higher Seed:          0/25    0/20    0/45
Under this evaluation measure (which is admittedly not what anybody was trying to optimize), the completely computer-based models are doing better. Perhaps the real take-away at the moment, though, is that predicting upsets is hard!
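As a footnote, the upset-points rule above is simple enough to state in code. This is a hypothetical sketch of the scoring, not the actual script used to produce the table:

```python
def upset_points(picked_seed, candidate_seeds, round_points):
    """Score one correct pick under the upset-points rule.

    picked_seed: seed of the (correctly picked) winning team.
    candidate_seeds: seeds of every team that could have reached this slot.
    round_points: points the bracket awards for a correct pick in this round.
    Returns (total points, points counted as upset points).
    """
    best_possible = min(candidate_seeds)  # best-ranked seed that could be here
    is_upset = picked_seed != best_possible
    return round_points, round_points if is_upset else 0

# The Richmond example from above: Richmond (12) reaching the Sweet 16 earns
# 2 upset points, because Louisville (4) was the best seed in that pod.
print(upset_points(12, [4, 13, 5, 12], 2))  # (2, 2)
```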

Friday, March 18, 2011

Current Standings

Posted by Lee

I've opened up the Yahoo Group for anyone who wants to see the standings and the submitted brackets.

Remember that if you were not able to submit a bracket in time for the full tournament, you will have a second chance as we will run a Sweet Sixteen bracket as well. The Sweet Sixteen does not start until 3/24, so you have almost a week to tweak and get something working before then!

The first round of play is over and after 32 completed games, the baseline predictors are doing very well against the algorithms. I have pasted the full standings below (I realized after last year's tournament that Yahoo does not preserve the brackets and standings, so I will try to do a better job keeping track of them on the blog this year).

Rank  Bracket                         R1  Winner      Pts  Possible
1     Baseline (TrueSkill)            24  Ohio St.    24   185
2     Baseline (Higher Seed)          23  Ohio St.    23   184
2     Baseline (Nate Silver)          23  Ohio St.    23   184
2     Baseline (LRMC)                 23  Ohio St.    23   182
2     Human (Lee)                     23  Kansas      23   182
6     Point Differential Centrality   22  Ohio St.    22   153
7     Danny's Dangerous Picks         21  Duke        21   182
7     DukeRepeats                     21  Duke        21   180
7     dirknbr1                        21  Ohio St.    21   168
10    Team Delete Kernel              20  Ohio St.    20   181
10    InItToWinIt                     20  Kansas      20   165
12    The Pain Machine                17  Kansas      17   174

I will be posting introductions of the various contenders over the course of this weekend. In the meantime, cheer on the algorithms... or human baselines - the higher seed and "Lee" - if you fear the machines! Enjoy the exciting set of games coming this weekend!

Early Analysis of Algorithmic March Madness Entries

Posted by Danny Tarlow
Scott Turner has been analyzing entrants into our 2011 March Madness Predictive Analytics Challenge, and he has some interesting insights and comments. I'm posting them verbatim here.

In the last 10 years, the upset rate in the first round has been 22%. (Ignoring 8-9 matchups.) Looking at the other 6 entries in the contest (the ones I think are entries, anyway), they predict 16 upsets -- about 8%. Of the baselines, only LRMC comes close with 5 upsets (15%).

For my entry, I forced 15% upset picks over the entire tournament (~9). (15% is about the overall upset rate in the tournament for the past 10 years.) So I have 4 in the first round and 5 in the later rounds.

Also, everyone has a #1 seed winning it all, but no one has Pittsburgh.

(I'm assuming that Point Diff, InitToWinIt, DukeRepeats, Danny's, dirknbr1 and mine are the competitors.)

Ignoring upset picks where there's only a 1 seed differential (e.g., 9 over an 8), these are the upset picks across all the competitors:

Qty     Upset   Delta   Details
[1]     3 > 1   2       (BYU > Pitt)
[2]     6 > 3   3       (Xav > Syr, SJU > BYU)
[3]     7 > 2   5       (Was > UNC, A&M > ND, UCLA > Fla)
[1]     9 > 1   8       (ODU > Pitt)
[4]     10 > 7  3       (FSU > A&M, UGA > Was, PSU > Temple, MSU > UCLA)
[1]     10 > 2  8       (UGA > UNC)
[3]     11 > 6  5       (Missou > Cinc, Gonz > SJU x 2)
[1]     12 > 9  3       (USU > ODU)
[4]     12 > 5  7       (USU > KSU x 4)
[2]     12 > 4  8       (USU > Wis x 2)
[2]     12 > 3  9       (USU > BYU x 2)
[2]     12 > 2  10      (USU > Pitt, USU > Kan)
[2]     13 > 4  9       (Bel > Wis x 2)
[1]     14 > 6  8       (St. Pete > Georgetown)
[1]     14 > 3  11      (St. Pete > Purdue)
which if I add correctly is 30 upset picks. The upset % over the last ten years (again ignoring 1 seed differential) is 119/519 = ~23%. By that metric these brackets ought to show 86 (!) upsets - almost 3x more. (The number is even worse if you take out my bracket, where I tried to force the right upset metric, but even my bracket should show more upsets.)

I think there's an interesting discussion to be had over whether a computer predictor ought to pick the most likely event on a game by game basis or not. The former leads to mostly "chalk" picks -- the only time you're going to go against the seeding is where the seeding is wrong (according to your model). Imagine a tournament where in every game the better team has a 1% advantage. The hypothetical predictor will pick the favorites in every game -- and miss almost half the games. Is that the best strategy?

It's also interesting to note that USU counts for a full third of the upset picks. If we assume that most of these models are only picking upsets if they think there's been an actual seeding mistake, then USU was the only "consensus" mis-seeding. That's kind of an interesting result in itself. That suggests that the committee does a good job seeding (at least in some sense) and/or that these predictors don't do a good job of discerning "hidden" information about the teams.

Of the 30 upset predictions, after the first day we know that 2 are right (the picks of Gonz > SJU), 13 are wrong, and the remaining 13 are undecided. That doesn't look like such a good record for the predictors :-). The correct upset pick was a delta of 5, which is a pretty big upset.

I can't see right now whether the ESPN Tourney Challenge shows statistics like how many people in the tourney picked particular upsets, but it would be interesting to compare the competitor upset picks to how people picked.

-- Scott Turner

Thursday, March 17, 2011

New Scientist Article

Posted by Danny Tarlow
MacGregor Campbell over at newscientist.com wrote a great article about our March Madness contest.

My only complaint is that Lee doesn't get the recognition for all of the hard work that he's put into making the contest happen. The contest literally would not have happened without him. (Thanks, Lee.) The other details are quite good, though. I'm particularly impressed by the clear and accurate description of the algorithms.

Thanks MacGregor! And thanks to all the participants who have entered the main bracket contest. If you didn't make it in time for the main contest entry, there's still the Sweet 16 contest. Get those algorithms running!

Wednesday, March 16, 2011

Less than 24 hours left

Posted by Lee
There are less than 24 hours left before the tip-off of the first game in the round of 64 in the NCAA tournament. A friendly reminder that you will need to have your entries submitted before the first game begins. If you're feeling crunched for time, we'll have a sweet sixteen bracket like last year -- so even if you haven't started coding/modeling/training/etc. it's not too late!

Tuesday, March 15, 2011

Dr. Scott Turner on March Madness: The Pain Machine 2011

Posted by Danny Tarlow
This is a guest post by Dr. Scott Turner, the force behind team "The Pain Machine," which was the co-winner of the Sweet 16 contest from last year. In this post, he'll describe what to expect from his entry to this year's algorithmic March Madness prediction contest. If you are planning to enter and would like to contribute a guest post, please email Danny or Lee.

Dr. Turner has a Ph.D. in Artificial Intelligence from UCLA. His dissertation subject was a program called MINSTREL that told stories about King Arthur and his knights, as a way to explore issues in creativity and storytelling. Since obtaining his Ph.D. in 1993, Dr. Turner has worked for the Aerospace Corporation, where he advises the nation's space programs on software and systems engineering issues.

As a lifelong college basketball fan and a student of AI, I was intrigued last year when I saw Danny's call for participants in a tournament prediction contest. I put together a program and managed to tie Danny in the Sweet Sixteen bracket.

The program I wrote last year used a genetic algorithm to evolve a scoring equation based upon features such as RPI, strength of schedule, wins and losses, etc., and selected the equation that did the best job of predicting the outcomes of the games in the training set. I felt the key to winning a tournament picking contest was guessing the upsets, so I added some features to the prediction model intended to identify and pick likely upsets.

In retrospect, the competition pool for this contest is so small that picking upsets is probably not as important as it is in (say) the full Yahoo pool, where you need to gamble more if you hope to distinguish yourself from the mass of consensus picks.

Since the contest last year I have continued to work on the Pain Machine. The focus of my effort shifted more towards predicting the margins of regular season games. My latest models predict regular season games about 80% correctly with an error of about 8 points. Since 2/5, the Pain Machine has predicted 63% correctly against the spread (from Bodog.com) for selected games where it has identified a strong advantage. However, the sample size for that is very small so the result may not be meaningful. Performance of the model against the 2009 and 2010 tournament games is similar (or better), although again the sample size is very small.

In the course of the last year I have identified several interesting (i.e., non-obvious) keys to understanding and predicting college basketball games. But perhaps the most striking realization has been the variability of the outcome of a college game. In Sokol's original paper on his LRMC method for predicting college basketball games, he compares pairwise home-and-home matchups and determines that a team needs to win at home by >21 points to have an even chance to beat the same team on the road! While I don't agree with the magnitude of that result, it's clear that what I would have evaluated as a "convincing" win -- say by 10 or 12 points -- is actually surprisingly weak evidence of the superiority of Team A over Team B. A case in point is Virginia Tech's home win over Duke late in the season. Virginia Tech is at best a bubble team and still managed to beat one of the best two or three teams in the country. Does this mean that Virginia Tech is significantly better than we thought? Or Duke significantly worse? Probably not. So it's hard to imagine a predictor that could consistently predict those sorts of outcomes, which makes the problem all the more interesting!

-- Scott Turner

Monday, March 14, 2011

March Madness Predictions: Code Description

Posted by Danny Tarlow
In the last post, I showed outputs of the 1D version of my matrix factorization model for predicting March Madness results. Here, I'm posting the code along with a brief description of how to get it running, so you can replicate my results, possibly as the basis for your entry into the 2011 March Madness Predictive Analytics Challenge.

To start, there are two Python files you need:
You'll also need the pickle file with the data:
Put all the files in a directory, and make a directory named "models" to store the output.

Now there are two steps:
1. Train a model by running "python learner.py".
2. Print out the offensive and defensive scores by running "python bracket.py".
That's it!

If you'd like to simulate particular games, there is a function in bracket.py called simulate_game(team_code_A, team_code_B). There is also old code to simulate a full bracket, but that hasn't been updated from previous years (yet).

If you'd like to train higher dimensional models or play around with the regularization parameter, feel free to change things at the top of __init__() in learner.py. Higher dimensional models are harder to visualize, so one idea would be to sort teams based on how they are predicted to fare against a favorite like Duke ("dau" team code).

Happy predicting!

The Algorithm's 2011 March Madness Predictions Part 1

Posted by Danny Tarlow
We seem to have ironed out most issues related to data for the 2011 March Madness Predictive Analytics Challenge. As a reminder, check the March Madness label to see all contest-related posts in one place. If you'd like to enter the competition, there's still time! (And Lee has posted new, easier to use data.)

With announcements out of the way, it is with great pleasure that I announce that I've got my algorithm running on the 2011 season data. In this post, I will show outputs from the simple 1D version of my probabilistic matrix factorization model. I'll run the higher dimensional version that I use to actually make predictions and write about it in a later post (or you can follow the instructions here, run the code, and train the model yourself).

How the Model Works: The basic idea is very simple: we want to represent each team's offensive and defensive strength with a single number (in the 1D version) for each. We will make a prediction for the result of team i playing team j as follows:

Predicted score for i = offense_strength_i * defense_strength_j
Predicted score for j = offense_strength_j * defense_strength_i

It should be clear that higher numbers mean better offense, and lower numbers mean better defense.

The learning algorithm looks over all games played this season and tries to find a setting of offensive and defensive strengths for each team such that scores predicted by the model best match the actual outcomes observed*. (If you want the short answer, this is achieved via the miracles of calculus.)
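For concreteness, here is a toy version of that learning step: plain gradient descent on the squared difference between predicted and actual scores, plus a little L2 regularization. The four games and every constant below are made up for illustration; the released learner.py is the real implementation.

```python
import numpy as np

n_teams = 4
# Made-up season data: (team i, team j, score of i, score of j).
games = [(0, 1, 70, 60), (1, 2, 65, 58), (2, 3, 55, 72), (3, 0, 61, 64)]

offense = np.ones(n_teams)  # offensive strength per team
defense = np.ones(n_teams)  # defensive strength per team
lr, reg = 0.001, 0.1

for _ in range(2000):
    g_off = reg * (offense - 1.0)  # regularize toward a neutral strength of 1
    g_def = reg * (defense - 1.0)
    for i, j, s_i, s_j in games:
        # Model: predicted score for i is offense[i] * defense[j], and vice versa.
        r_i = offense[i] * defense[j] - s_i
        r_j = offense[j] * defense[i] - s_j
        g_off[i] += r_i * defense[j]
        g_def[j] += r_i * offense[i]
        g_off[j] += r_j * defense[i]
        g_def[i] += r_j * offense[j]
    offense -= lr * g_off
    defense -= lr * g_def
```

After training, each predicted score `offense[i] * defense[j]` sits close to the observed score, with the regularizer keeping the strengths from drifting to arbitrary scales.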

What I'm Showing: I will first report a combined measure, which takes the offensive strength and subtracts the defensive strength. If you think of having each team play against a baseline team with offensive and defensive strength 1, then the difference tells you how much you expect the team to win by (or, if it's negative, to lose by). Afterwards, I show the top offenses and the top defenses, along with their strengths learned by the model. In all cases, I report the top 50 teams. Keep in mind that the algorithm knows nothing about rankings, players, or anything other than the final score of each game. Also keep in mind that I know less than the algorithm about what happened this year in college basketball. There are some important caveats in the comments under this post.

If you'd like to reproduce these results at home, follow the instructions and run the code in the next post.

So without further ado, here are the outputs. You can use these to predict the score of any game by plugging into the formula above. (Only the top 50 teams are shown in the post for each measure. The outputs for all teams in the database are here.)
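For example, plugging the strengths listed below into the formula gives a hypothetical Duke vs. Ohio St. score line:

```python
# Offense/defense strengths copied from the lists below.
offense = {"Duke": 10.43, "Ohio St.": 9.97}
defense = {"Duke": 7.29, "Ohio St.": 6.87}

def predict_score(i, j):
    # Predicted score for team i against team j, per the formula above.
    return offense[i] * defense[j]

duke_pts = predict_score("Duke", "Ohio St.")  # 10.43 * 6.87 ~= 71.7
osu_pts = predict_score("Ohio St.", "Duke")   # 9.97 * 7.29 ~= 72.7
```

Interestingly, even though Duke edges Ohio St. on the combined rating, the head-to-head formula calls this one a near coin flip, with Ohio St. up by about a point.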

Combined Rating

Duke (3.15)
Ohio St. (3.10)
Kansas (3.07)
Washington (2.71)
Pittsburgh (2.64)
Texas (2.57)
Kentucky (2.52)
Purdue (2.43)
Louisville (2.42)
BYU (2.41)
Notre Dame (2.41)
Syracuse (2.39)
North Carolina (2.34)
San Diego St. (2.26)
Wisconsin (2.21)
Connecticut (2.15)
Georgetown (2.01)
Missouri (2.00)
Arizona (2.00)
Illinois (1.99)
Cincinnati (1.96)
West Virginia (1.95)
Vanderbilt (1.95)
Florida (1.94)
Marquette (1.93)
UNLV (1.93)
Villanova (1.92)
Kansas St. (1.84)
Maryland (1.83)
Gonzaga (1.78)
Michigan St. (1.72)
St. Mary's (1.72)
Virginia Tech (1.72)
St. John's (1.70)
Utah St. (1.68)
Washington St. (1.65)
Florida St. (1.63)
Texas A&M (1.63)
Belmont (1.62)
Clemson (1.59)
Xavier (1.58)
Michigan (1.57)
Temple (1.55)
New Mexico (1.50)
George Mason (1.49)
UCLA (1.47)
Minnesota (1.46)
Northwestern (1.44)
USC (1.43)

Offenses

Washington (10.72)
Duke (10.43)
Kansas (10.39)
BYU (10.32)
Oakland (10.20)
Missouri (10.18)
Ohio St. (9.97)
North Carolina (9.89)
Virginia Military (9.87)
Notre Dame (9.79)
Kentucky (9.76)
Vanderbilt (9.73)
Louisville (9.69)
Marquette (9.67)
Maryland (9.65)
Arizona (9.64)
Providence (9.62)
Pittsburgh (9.56)
Long Island (9.56)
St. Mary's (9.56)
Connecticut (9.54)
Syracuse (9.54)
La Salle (9.49)
Texas (9.49)
South Dakota St. (9.47)
Purdue (9.47)
Iona (9.46)
California (9.42)
Duquesne (9.42)
Belmont (9.40)
Villanova (9.39)
Georgetown (9.37)
Gonzaga (9.35)
Texas Tech (9.35)
Iowa St. (9.30)
Washington St. (9.29)
Mississippi (9.29)
Illinois (9.29)
Northwestern (9.29)
UNLV (9.25)
Boston Coll. (9.21)
Kansas St. (9.20)
Florida (9.18)
Xavier (9.17)
Detroit (9.14)
New Mexico (9.11)
St. John's (9.10)
Michigan St. (9.06)
Charleston (9.06)

Defenses

Wisconsin (6.61)
San Diego St. (6.74)
New Orleans (6.79)
Cincinnati (6.84)
Utah St. (6.85)
Ohio St. (6.87)
Texas (6.92)
Pittsburgh (6.92)
USC (6.96)
Alabama (6.98)
Penn St. (6.99)
Old Dominion (6.99)
West Virginia (7.00)
Texas A&M (7.01)
Clemson (7.02)
Michigan (7.03)
Purdue (7.04)
Cal Poly (7.05)
Virginia (7.12)
Virginia Tech (7.13)
Florida St. (7.15)
Syracuse (7.15)
Drexel (7.18)
Temple (7.22)
Kentucky (7.23)
Florida (7.24)
Stephen F. Austin (7.25)
Louisville (7.26)
Duke (7.29)
Richmond (7.29)
Illinois (7.30)
Kansas (7.32)
UNLV (7.32)
Michigan St. (7.34)
Fairfield (7.34)
Saint Louis (7.35)
Georgetown (7.36)
Kansas St. (7.36)
St. Peter's (7.37)
Northern Iowa (7.37)
Denver (7.38)
Notre Dame (7.39)
UAB (7.39)
St. John's (7.39)
Connecticut (7.39)
Montana (7.40)
UCLA (7.40)
Seton Hall (7.41)
South Florida (7.42)

* There's also some regularization.

Sunday, March 13, 2011

Posted by Lee
To remedy the caveat about team field goals I mentioned in the previous post, I am now explicitly generating the score data by crawling the scoreboard. I have posted it to this file with the following format: date, home team, away team, home score, away score, home team won (1 yes, 0 no).

I also found out that March 11, 2011 was missing from the data. I have re-crawled this date and added it to the files. If you want to use this date in your data, please re-download the .zip files. The data is inclusive through March 12, 2011.

Data Caveats

Posted by Lee
As we've seen in the past 12 hours, there have been some hiccups with the data. I apologize that data quality has been a bit lacking. Hopefully things will be relatively stable from here. I also wanted to point out a few other caveats in the data.
• Some games are missing completely. This is because at crawl time, sometimes the server hiccups and returns an empty page even though I am not being throttled.
• Some games exist in Games.tsv but there is no player data for it. This is because in some cases, the Yahoo! box score format is missing columns like minutes and the parser does not expect this.
• Some games have inaccurate full aggregate point totals. In a few cases, field goals exist at the team level, but not the individual player level. I expect this to be relatively rare, but it does cause some score discrepancies in the aggregate.
I'll try to keep this list up to date. Please comment/e-mail if you run into other items such as the above.

Aggregate Game Results

Posted by Lee
To make things even easier, I've uploaded a file containing aggregate game results.

The format for this file is:
1. Game Date
2. Home Team
3. Away Team
4. Home Team Points
5. Away Team Points
6. Home Team Win (home team points > away team points) - 1 if the home team won, 0 if the away team won
The columns are comma-separated.
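For anyone who wants to get started quickly, here is a minimal loader for this format. The filename and the dictionary field names are my own choices, not part of the official data release:

```python
import csv

def load_game_results(path):
    # Each row: date, home team, away team, home points, away points,
    # home team win flag (1 if the home team won, 0 otherwise).
    games = []
    with open(path) as f:
        for row in csv.reader(f):
            date, home, away, home_pts, away_pts, home_win = row
            games.append({
                "date": date,
                "home": home,
                "away": away,
                "home_pts": int(home_pts),
                "away_pts": int(away_pts),
                "home_win": int(home_win),
            })
    return games
```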

Selection Sunday Today!

Posted by Lee
UPDATE 3: 3/11/2011 was missing from the data. The archive has been corrected. It was also renamed to reflect that it contains all games through 3/12 and not 3/13.

UPDATE 2: It turns out that I was incorrectly parsing box scores where the teams had a seed number in front of their name, causing those teams to be interpreted as "UNK" when in fact these are incredibly important games (thanks, Rob). I have fixed this and uploaded new data. Please re-download the data. The links below have been corrected to point at the new data as well.

UPDATE: Thanks to the keen observation of one of our contestants (thanks, Scott), we've updated the data as the boxscore format we were crawling changed starting 3/10. The link below has been updated to the new data.

Today is Selection Sunday, when we will find out who is playing in this year's NCAA tournament. This also means that all the conferences have completed their games and it is time to release the data. Full contest details are available here.

Please download these two files: The team code mapping is there to help you convert between the codes in the actual boxscore data and the actual names of the teams. As with the previous data release, there is a team code called "UNK" that refers to any team not within the list of codes. These are often small teams either in non-Division 1 conferences or no longer in Division 1.

There are two files within the data archive: Players.tsv and Games.tsv.

Games.tsv - each row corresponds to a game played during the season.
1. Game ID
2. Date
3. Home team
4. Away team

Players.tsv - each row corresponds to a single player's performance in one game.
1. Player name
2. Game ID
3. Player's team
4. Minutes played
5. Field goals made
6. Field goals attempted
7. Three pointers made
8. Three pointers attempted
9. Free throws made
10. Free throws attempted
11. Offensive rebounds
12. Defensive rebounds
13. Assists
14. Turnovers
15. Steals
16. Blocks
17. Personal fouls
Note that three pointers are included in the number of field goals. Thus if a player has made 3 field goals and 1 three pointer, then that player has scored 7 (2 + 2 + 3) points.
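Since field goals include three pointers, computing a player's point total from these columns takes one small adjustment. A sketch, with my own function and argument names:

```python
def player_points(fgm, tpm, ftm):
    # fgm: field goals made (includes threes), tpm: three pointers made,
    # ftm: free throws made. Two-point baskets are fgm - tpm.
    return 2 * (fgm - tpm) + 3 * tpm + ftm

# The example from the post: 3 field goals, 1 of them a three pointer.
assert player_points(3, 1, 0) == 7
```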

In the previous post, I had a faulty link to join the group on Yahoo. The correct link is http://tournament.fantasysports.yahoo.com/t1/register/joinprivategroup_assign_team?GID=55350. As with before, please e-mail leezen+MarchMadness at gmail for the password. Please include your team name, team members, and brief description.

Good luck!

Friday, March 11, 2011

Official 2011 March Madness Predictive Analytics Challenge

Posted by Lee
Welcome to the second annual March Madness Predictive Analytics Challenge! I'm very excited about this event and I hope you are, too! We're still trying to line up some prizes, but for sure, like last year, there will be a gift certificate to Amazon.com.

This year's format will be more or less the same as last year's.

Background

Most readers of this blog are probably familiar with the general idea of what this contest is about. In case you aren't a frequent reader or a fan of college basketball, this section will serve as a brief introduction. March 13th is "Selection Sunday," when the teams for the NCAA College Basketball tournament will be selected. In total, there will be 68 teams, with 8 teams playing four "play-in" games on March 15th and 16th to determine the field of 64. For the purposes of this contest, you do not need to worry about these initial play-in games. The remaining 64 teams are then pitted against each other in a bracket, with one national champion emerging as the winner. Every year, millions of people fill in their predictions of who will be the winners and losers of the games. People participate in leagues or pools with other people to see who has the best bracket. We would like YOU to participate in our algorithm-only pool. That is, your bracket must be completed by a computer algorithm based upon historical data, without the use of human judgment.

Contest Format

The format is fairly simple. We will have two pools: a Tournament pool and a Sweet Sixteen pool. Entries in both pools will be evaluated on the typical exponential point scoring system. Correct picks get 1, 2, 4, 8, 16, and 32 points depending on the depth in the bracket (1 point in the first round, 2 points in the second round, etc). The entry only needs to pick the winning team. Thus, if the other team is no longer in the tournament, but the winning team is picked, points are still awarded. Each person is limited to one entry per pool. Each pool will have a winner determined by the submission scoring the most points.
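As a sketch, the scoring rule above can be implemented in a few lines. The names and the round-to-picks representation are my own; note that a pick scores as long as the picked team wins that round's game, regardless of earlier mistakes:

```python
def score_entry(picks, winners):
    # picks, winners: dicts mapping round number (1-6) to the list of
    # teams predicted/observed to win a game in that round. A correct
    # pick in round r is worth 2 ** (r - 1) points.
    total = 0
    for rnd, predicted in picks.items():
        actual = set(winners.get(rnd, []))
        total += sum(2 ** (rnd - 1) for team in predicted if team in actual)
    return total
```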

TOURNAMENT pool entries must be submitted no later than March 17, 2011 (the first day of play in the round of 64).
SWEET SIXTEEN pool entries must be submitted no later than March 24, 2011 (the beginning of the sweet sixteen round).

Rules

• Your bracket must be chosen completely by a computer algorithm.
• The computer algorithm must base the decision upon historical data.
• You may not hard code selections into your algorithm (e.g., "Always pick Stanford over Cal")
• Your algorithm may only use the data set published for the tournament. The data will be released on Sunday, March 13.
• The above rule is fairly restricting, but I believe this provides a more even playing field. The contest should be about your algorithm's predictive capabilities and not a data advantage one person has over another.
• You must be able to provide code that shows how your entry picks the winners. In other words, your bracket and the selection of winning teams in your bracket must be reproducible by me on a machine.
• In the event of a tie, the entry with the EARLIER submission time wins.

Submissions

We'll be using Yahoo's bracket system for the contest submissions. Please send an e-mail to leezen+MarchMadness at gmail for the group password to join. Please include your team name, team members, and brief description.

TBD

Data

As described above, only the official contest data on this blog is acceptable for use in this contest. You can get a sample of the data, which has all games from the 2006 season through February 2011. Please see this post for details. I will also update this post on Sunday with a link to the full data set.

UPDATE: I have an updated post with details on the final data: Selection Sunday Data

Please be aware that algorithm computation time will be somewhat important in this task. You will be able to predict most of your games ahead of time between March 13th and 17th but because of the four play-in games, you will need to predict the outcome of four games between March 15th and 17th as the match-ups in the round of 64 will not be known until the play-in games are complete.

If you have other questions, concerns, etc. please comment on this post and I'll do my best to answer.

Tuesday, March 8, 2011

Predicting March Madness: Tournament Structure and Bracket Scoring Measures

Posted by Danny Tarlow
This is the second post aimed at getting you all excited about entering the 2011 March Madness computer predictions challenge. If you missed George's earlier post, go check it out first, to get the creative juices flowing.

George wrote about the problem of estimating, given two teams, how likely it is that one will beat the other. What I will talk about now is how the tournament structure comes into play--the concerns that go beyond which of two teams will win the next game they're scheduled to play against each other. I'll assume throughout that we have at our disposal a good model of individual games.

The first natural question is whether any additional thought is necessary. If we have a good model of individual games, can't we just greedily use the model to predict all first round outcomes, then all second round outcomes, ..., then be done? Maybe it's not such a bad idea. That's what I've done the past few years, after all.

I'm going to argue that there may be good reasons to work on factoring the tournament structure and contest scoring rules into your model. Let's begin by thinking about the tournament probabilistically. In the most general setup, there is a probability distribution P over the joint space of all outcomes. In a bracket with 64 teams (ignore the play-in games for now), there are 2^32 possible results for the first round. Given those, there are 2^16 possible results for the second round. Given those, 2^8, then 2^4, then 2^2, then 2 possible outcomes for later rounds. Multiplying all those, we get 2^63 possible joint outcomes. A model should be able to assign a probability to any of those full outcomes.
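The counting argument above is easy to verify:

```python
# A 64-team single-elimination bracket: 32 first-round games, then 16,
# 8, 4, 2, and finally 1. Each game has 2 possible winners, so the
# joint space has 2 ** 63 outcomes.
games_per_round = [32, 16, 8, 4, 2, 1]

total_outcomes = 1
for games in games_per_round:
    total_outcomes *= 2 ** games

assert sum(games_per_round) == 63
assert total_outcomes == 2 ** 63
```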

Compact representation of probabilities (aka the model): 2^63 is a very large number, so it's not feasible to represent the probability of each outcome as a separate entry in a very large table. Beyond the practical concerns, it also wouldn't be a good idea from a statistical estimation perspective--such a model would make inefficient use of training data.

To illustrate this, imagine for a moment that the same tournament happens repeatedly (the same teams start in each position of the bracket, but there may be different outcomes), and suppose we want to use data from the last N tournaments to estimate the probability of outcomes in the next repetition of the tournament. One approach is to just count how often each full bracket outcome occurred over the last N tournaments. This would use on the order of 2^63 parameters. If N is infinite (or _extremely_ large), then this is a perfectly good approach; however, when we have limited data (i.e., nearly always), nearly all entries in the big table will be zero; in this case, it is important to have a more compact representation.

The standard solution is to find a more compact representation of P. Here's one, which we'll call the fully factorized model: the probability of a full tournament outcome P(g_1, ..., g_63) over 63 games g_1 ... g_63 is proportional to f(g_1) * f(g_2) * ... * f(g_63). For each game, we can look at who the participants are, then estimate f by looking over the last N repetitions of the tournament where the appropriate two teams met in a game. It should be clear that there are exponentially fewer parameters in the compact model than in the big joint table model--in the simplest version, we just need one number for each pair of teams, so there are on the order of M^2 parameters if there are M teams in the tournament. (Going further, which is necessary in practice since there is really only one tournament per year: if we were to use, e.g., a matrix factorization model that represents an "offensive strengths" vector (or scalar) and a "defensive strengths" vector (or scalar) for each team, we could produce

f(team i, team j) = g(team i offense * team j defense - team i defense * team j offense),

which requires only on the order of M parameters.)
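A sketch of this kind of per-game factor, with scalar strengths and a logistic squashing function g. All names and the specific choice of g here are illustrative assumptions on my part, not a model anyone in the contest is committed to:

```python
import math

def f(off_i, def_i, off_j, def_j):
    # Probability that team i beats team j: the interaction
    # off_i * def_j - def_i * off_j squashed through a logistic g.
    margin = off_i * def_j - def_i * off_j
    return 1.0 / (1.0 + math.exp(-margin))

def bracket_log_prob(game_probs):
    # Fully factorized model: the log probability of a joint bracket
    # outcome is (up to a normalizing constant) the sum of per-game
    # log factors.
    return sum(math.log(p) for p in game_probs)
```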

Regardless of how f is computed, the fully factorized model makes some practical sense as well, because on first pass, we typically don't expect the result of an early game in the South bracket to affect the outcome of an early game in the East bracket, for example. In the fully factorized model, that assumption is built in. Statistically, the rough basic intuition is that all else being equal, if we have fewer parameters in the model, we need fewer past data instances in order to pin down their values accurately.

The flip side of using a simpler model is that we lose the ability to model certain interactions. Let's think critically about this: do we really believe that the outcome of each game depends only on which teams are playing in the current game? If you listen to sports announcers, it's not clear to me that the answer is yes. For example,
• Some teams might do better or worse under pressure, players might get tired after several intense games in a row, or the location of the game might affect the outcome. In all these cases, the round in which the game takes place could be important.
• If an underdog pulls off a big upset over a top seed, the underdog might gain confidence and become stronger in their next games.
• If a top seed just barely beats a big underdog, the top seed might either lose confidence or snap out of a lull.
• If an underdog in say the East bracket beats a top seed in say the South bracket on Day 1, it might give other underdogs a boost of confidence on Day 2.
• Suppose we have a structure where A is playing B and C is playing D, then the winners play each other. If B is weak and C is a big rival of A's, it's possible that A could look past B to C and not come prepared for the game against B.

Are these things important to model? I'm not sure, but it would be interesting to try. If we do, we'll need a more complex representation of our tournament outcome model than the simple fully factorized approach gives.

Loss functions: Now let's switch gears and consider precisely what our goal is. I'll assume that each bracket challenge contestant can only enter one bracket into the contest. In the simplest case, we might be concerned only with whether we get the perfect bracket or not; this would be the case in a contest where a perfect bracket wins $5M, but there is no payout for a less-than-perfect bracket. Then, the optimal strategy is to choose the joint outcome that the model thinks is most likely. The optimality is easy to see, because the expected value is equal to $0 * P(not perfect) + $5M * P(perfect). So assuming the model captures all we know about the tournament, we should clearly choose the outcome that the model assigns the highest probability.

What about the case where the goal is to get as many predictions right as possible?

Maybe we are equally unhappy if we do worse than the baseline that makes all predictions based solely on seed, then we're linearly more happy as we beat that baseline by more. (For example, if you're betting against your mom, and she has a history of always choosing the better seed.)

Or maybe you are linearly more happy as you get more "points", but points are determined by some contest-specific scoring rules, like the first round is worth 1 point, the second round is worth 2, the third 4, etc.

Whatever your goal, there is likely some loss function that can be formulated to capture your desires for a model. Formally, a loss function is just a function of a model's prediction and the ground truth that tells how unhappy we are with the prediction. If the prediction is perfect (all games right!), loss should be 0, and if the prediction is the worst possible (all games wrong?!) the loss should be maximum (let's say 1). Obviously we'd prefer to just have a perfect bracket, but given such a hard problem, that's unrealistic. The job of a loss function is to tell the model how to trade off different kinds of mistakes. It's rarely the case that we really think there is one bracket that is perfect and all others (like the one that gets everything right except the play-in game) are equally bad, so formulating a good loss function generally takes some thought.
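As a concrete sketch, here is one way to turn the contest's point scheme into a loss in [0, 1]. The names are my own; a perfect bracket gets loss 0 and an all-wrong bracket gets loss 1:

```python
def bracket_loss(predicted, actual, round_of):
    # predicted, actual: dicts mapping game id -> winning team.
    # round_of: dict mapping game id -> round number (1-indexed).
    # Loss is 1 minus the fraction of available points earned under
    # the doubling point scheme (round r is worth 2 ** (r - 1)).
    max_points = sum(2 ** (round_of[g] - 1) for g in actual)
    earned = sum(2 ** (round_of[g] - 1)
                 for g in actual if predicted.get(g) == actual[g])
    return 1.0 - earned / max_points
```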

These types of questions are near and dear to my heart (though I generally think more about them in the context of computer vision than March Madness), so hopefully you now have some intuitions about why we might want to go beyond simple per-game models, and hopefully you're convinced that part of defining a good model is defining a good loss function.

A question that remains is how to actually build such a model. How do we formulate this idea that we are making multiple correlated predictions at once (predicting the whole bracket rather than just one game at a time), and how do we ask a learning algorithm to try to optimize a particular loss function? I'm glad you asked! Luckily, there is a whole subfield of machine learning that looks at these types of problems. It goes by the name of structured prediction, or structured output learning.

I'd write more about it here, but Hal Daumé III has written a great series on the exact subject over at his blog. So if you are interested in more, I encourage you to go check out the following posts:
And it should go without saying: keep working on those models! The competition will be starting soon!

Monday, February 21, 2011

2010 March Madness Contest Winning Algorithm: Code

Posted by Danny Tarlow
Last year's March Madness algorithm predictions contest ended with me winning the full bracket, and Scott Turner (aka "The Pain Machine") and I tied for the sweet 16 bracket:
http://blog.smellthedata.com/2010/04/duke-is-not-only-winner-from-last-night.html

I've gotten a few emails asking for the code I used. I haven't looked at it since last year, but the Python code is available here:

You can follow links for the description of the probabilistic matrix factorization model, and a post using a similar model applied to a recommendation system.

Unfortunately, the documentation is a bit lacking beyond that. Hopefully the posts are enough to explain things at a high level. If you have more detailed questions (or find bugs), feel free to ask (or point them out) in the comments.

If somebody adapts this code to work with this year's data format, please share! (and let me know)

This code may also be of interest.

Saturday, February 19, 2011

Memory Hacking

Posted by Danny Tarlow
I wrote a while back about the idea of using machine learning to make scrabble words easier for me to memorize. One of the main problems was how to describe the "me complexity" (or "danny complexity") of some representation of the dictionary. This article doesn't directly address the question, but I found it quite interesting to think about in this context:
http://www.nytimes.com/interactive/2011/02/20/magazine/mind-secrets.html?ref=magazine

I find that building intuitions and stories around concepts in research helps me greatly when coming up with new ideas. Are there corresponding classes of mind-hacks for other things, like creativity or mathematical manipulations?

(Via @igrigorik)

Modeling Chess with Probabilistic Matrix Factorization

Posted by Danny Tarlow
Dirk writes...
Thanks for posting your code and ideas on your blog. I was playing around with your PMF code applying it to chess games, where users are white and items are black. Because players can play more than once against each other, would you remove the duplicate pairs?
Cool. It sounds like an interesting application.

I don't see any reason to remove duplicate pairs. If different games have different outcomes, you definitely want to keep both to prevent biasing the model. If there are duplicate games with the same outcome, we should be more confident that the outcome wasn't a fluke, so it roughly makes sense that we'd want to weight it a bit more (which is the effect of having duplicate entries in the data set).

It's a bit strange to apply the vanilla model to chess, though. There is no relationship in the model between player X as white and player X as black--they might as well be two separate people. It may be interesting to look at extensions where the latent vectors for player X as white are constrained to be similar to the latent vectors for player X as black.

Do any readers disagree?

(The relevant code is here.)

Sunday, February 13, 2011

NCAA Boxscore Data 2006-2010

Posted by Lee
All the boxscores from the 2006-2010 seasons (up until the first few days of February) are now available for download!

The format is specified in the last post (here). However, please note that the team codes now include an "UNK" which refers to a team that is not within the list of codes. These are often small teams either in non-Division 1 conferences or no longer in Division 1. UNK can map to multiple schools. Please let me know via the comments if this will pose any major problems. I could include the actual team name, but all of the "UNK" teams will not be in the tournament and it is likely that almost all the schools in the tournament will not have an "UNK" team in their regular season schedule.

Saturday, February 12, 2011

Thoughts on Modeling Basketball

Posted by Danny Tarlow
This is a guest post by the great George Dahl, who is one of my go-to guys for sports modeling. I invited George to write a guest post related to the upcoming 2011 March Madness Predictive Analytics Challenge, because he has experience with applying machine learning to several different sports predictions problems, particularly in tennis and basketball. In the past, I've posted about work he was involved with related to Bayesian NBA Basketball Predictions.

George is a Ph.D. student in the Toronto Machine Learning Group, and he is supervised by Geoffrey Hinton. George focuses most of his research energy on deep learning approaches to speech and language processing and is a part of a collaboration that has obtained some of the best results on the TIMIT dataset.

With what will certainly be the most exciting March Madness prediction competition yet looming on the horizon, Danny asked me to share a few of my thoughts on using machine learning for basketball prediction. In this discussion, I want to focus on predicting everything possible about the outcomes of individual games and avoid discussing predicting an entire bracket. Certainly a reasonable approach for predicting a bracket is to predict individual games in sequence, but it should be obvious that there might be some problems with this since mistakes in early games can have a very large impact on the quality of the final bracket. This discussion also won’t really focus on the details of the competition too much, but instead more on whatever I feel like rambling about.

Before we start, I want to mention the two people that made it possible for me to work on basketball prediction and coauthor a paper featuring it. Last year, I became very interested in predicting NBA game results (and betting on them) and I wanted to see how far I could push the basic matrix factorization approach in Danny’s original post. One area where probabilistic matrix factorization (PMF) has been highly effective is in the famous Netflix Prize competition. Even with the number of training cases in the Netflix competition, a Bayesian approach provides substantial benefits in prediction accuracy by preventing overfitting. In a basketball game prediction system, we should expect it to be even more important to be protected by the cloak of the Reverend* since teams may only play each other a few times in a given season and the composition of the team, injury status of the players, and other features of a team constantly evolve. Thus it seemed natural to try Bayesian PMF on basketball prediction. Knowing that the latest and greatest practical sampling wisdom might prove essential, I recruited two sampling wizards, Doctors Ryan Adams and Iain Murray, the Radagast the Brown and Pallando the Blue of Bayesian machine learning. Without their wizardly might, I wouldn’t have nearly as much to say about this topic.

Although my real goal is to give a thought or two on how to go beyond the work Ryan, Iain, and I did to predict the outcomes of basketball games, I also want to briefly introduce *what* we did and tell you where you can find our code (the final version was mostly written by Ryan--much of my code wasn’t used after some initial prototypes) and paper if you want to try something similar for the challenge. I am not planning on directly adapting our method for predicting individual games to predicting brackets, but if any enterprising soul is planning on trying out what we did, I would be quite curious to see how well it does.

So how should we think about the basketball game prediction problem? My philosophy is to always start at the data. I was able to collect data on the outcome of NBA games from 2002 to about 2009. Each data case is a game and has the date of the game, the home team, the away team, the home team final score, the away team final score, a bookie spread, and a bookie over/under line. One obvious thing to do is to design some features f(g) that you can compute for each game g, perhaps something about the strength of the home team’s coach, lineup, roster, etc. for that day and anything else you can think of and calculate or set based on the knowledge of some NBA expert. Then we could just train some “off the shelf” classifier and predict a binary label that denotes if the home team won the game. Although not unreasonable, I think this approach tries to fit the problem into a particular solution mold instead of designing a solution that is well suited to the problem.

The most important thing about a basketball game is that it is an interaction between two teams. Instead of learning or engineering features of games, we should learn features for teams and model the interaction between teams directly. If we have additional information specific to the interaction (such as the venue of the game, time and date of the game, identity of referees, etc.), we should use it to modulate the team features or even make the team features functions of this side information. When we observe the result of a game, the way we will get the most information out of it is by building in as much problem structure as naturally as possible. Doing binary classification of games throws away score information that tells us about the offensive and defensive capabilities of the teams. Treating game records as i.i.d. regardless of when the games took place may similarly overgeneralize from temporary effects and ignore permanent changes in the offensive and defensive (and other) capabilities of the teams involved.

An important point relating to using score information is that the numbers of points two teams earn in a game are highly correlated, simply because of the possession rules of basketball (other factors cause team dependent correlations of various sorts between scores as well, such as how good at getting rebounds players on the team are and the potentially complicated decisions coaches might make that would cause these players to be on or off the bench). Seeing the Suns at home beat the Timberwolves 152 to 114 could be explained in a variety of ways. 114 is a reasonably high score, perhaps Minnesota has a strong offense, weak defense, and the Suns have an equally strong offense, but a stronger defense (and they are at home so maybe that helps them score more than it helps them defend?). One simplistic way of predicting the number of points a team will score in a game is to use the average number of points they scored over all their recent games. Then, to predict the final score of a game between the Suns and the Timberwolves, we could use the average number of points scored by the Suns and the average number of points scored by the Timberwolves. However, a different (and equally simplistic) way of predicting the number of points scored by a team in a game is to use the average number of points lost by the opposing team over all the opposing team's recent games. This method, in turn, gives us another way of predicting the final score of a game between the Suns and the Timberwolves. We could predict that the Suns will score the average number of points the Timberwolves typically lose and that the Timberwolves will score the average number of points the Suns typically lose. These two estimates of the score in a new game will not in general agree.

Beyond not agreeing with each other, what makes these averages even worse is that they don’t do a good job of even encapsulating the offensive or defensive strength of a team. Scoring a lot of points gives the opponent many more chances to score against you, so a strong offensive team will distort our estimates of its defensive capability and a strong defensive team might distort our estimates of its offensive capability. How do we reconcile these conflicting ways of viewing the evidence?
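To make the disagreement concrete, here are the two simplistic estimators side by side. The per-game record schema here is a hypothetical one of my own, not the contest data format:

```python
def avg_points_scored(team, games):
    # Estimator 1: the team's own scoring average over recent games.
    pts = [g["pts_for"] for g in games if g["team"] == team]
    return sum(pts) / len(pts)

def avg_points_allowed(team, games):
    # Estimator 2: the average number of points this team gives up,
    # used to predict what its *opponent* will score.
    pts = [g["pts_against"] for g in games if g["team"] == team]
    return sum(pts) / len(pts)
```

Predicting the Suns' score against the Timberwolves via `avg_points_scored("Suns", games)` or via `avg_points_allowed("Timberwolves", games)` will generally give different numbers, which is exactly the conflict a proper model has to reconcile.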

We need a probabilistic model of basketball game scores that incorporates the possession-based correlation between scores or we risk learning distorted values for potentially any model parameter meant to capture how good teams are offensively or defensively. Conversely, since the correlation in basketball is so strong, it can be really helpful to give our model the power to learn it. Hopefully if we integrate out parameters of the model that correspond to any of these difficult-to-infer “offensive” and “defensive” strength qualities and weight them by their posterior probability in the typical Bayesian way we will do well as long as the likelihood is sensible and incorporates a learned correlation between the two scores in a game.

What I have been talking about above is the basic modeling framework we used in our paper. Use a matrix factorization model to capture the dyadic nature of the data, use a Bayesian approach to avoid overfitting and integrate over our unknown latent variables, model offensive and defensive features of teams as smooth functions of time and home/away status (the only side information we had access to in our data), and use a sensible likelihood that can incorporate any correlation between the scores of the two teams in a game the model might discover or need to explain the data.
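The likelihood piece of this can be sketched with a bivariate Gaussian over the two scores in a game. The means, standard deviation, and correlation below are made-up illustrative numbers, not values inferred in the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

mean = np.array([102.0, 98.0])   # expected (home, away) points
std = 11.0                       # per-team score standard deviation
rho = 0.5                        # possession-induced score correlation
cov = np.array([[std ** 2, rho * std ** 2],
                [rho * std ** 2, std ** 2]])

# Sample many game scores and check that the correlation is recovered.
scores = rng.multivariate_normal(mean, cov, size=20000)
empirical_rho = np.corrcoef(scores[:, 0], scores[:, 1])[0, 1]
```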

But how do we go beyond this? What sort of enhancements to the model in our paper are the most promising? Once again, I like to focus on the data. How well are we dealing with the structure in the data we already have, what other sorts of data could we get or would we want to get, and how would we exploit any such new data to improve the model? Are there any trends in that data that give us insight into problems with the model or potential improvements?

We can’t really get more game records if we already have all the games each team played in the last few years, but we could get more information about each game or the state of the teams at different points in time. We want pieces of information that are either really useful, really easy to collect, or both. I have two ideas in this vein that I think are promising. The first idea isn’t too ambitious. Right now, the model only uses the final score, but it shouldn’t be too hard to collect box scores. Can we use this new information to improve the model? Surely something better than ignoring it can be done. If we are being more ambitious, can we get access to injury reports or player specific data on a large scale? With a lot of player information, we can move from modeling games as interactions between teams which are modeled as monolithic (if time varying) entities with a large set of latent features, to modeling games as interactions between groups of players on opposing sides. Just as modeling games as interactions between teams and learning latent team features was more natural than classifying games, we could in principle use our knowledge of basketball to model games as interactions between players and learn latent features for individual players.

Let’s talk about box scores first, since it is less daunting a prospect. The way Ryan, Iain, and I dealt with the correlation between scores in the same game is by placing a bivariate Gaussian over them in the model. One natural thing to do with box scores is to just stick a reasonable joint distribution over them and maybe share some of the parameters of that distribution across teams. Since the points scored by a team in a single quarter are at most the number of points scored by the team in the entire game, the use of a Gaussian distribution might no longer be as appropriate as it is for modeling the total number of points scored in a game, but the thing to do would be to look at some histograms and find a reasonable distribution to use. A potential advantage of using box scores is that it might give the model the capability of, at least in part, dealing with the so-called “garbage time” phenomenon common in basketball games. The points scored in the final quarter of a game (or maybe even some earlier quarters) may or may not be representative of the actual capabilities of the teams, because a very different set of players is on the court when the outcome is almost completely determined than when the outcome of the game is still in question.

Representing individual players explicitly in the model is a fascinating possibility, but it requires more data than what I have collected. At a minimum, we would need the team roster for every game, but it would be great to also have lineups and injury reports. The ultimate dream would be to have a time series that follows everything happening to the ball and who is on the court, but this quickly leaves the realm of possibility, since only the teams themselves might collect that sort of information.

The competition data looks like it will have the same data I collected for the NBA, without the bookie information, plus the set of players on each team and some information specific to each player in each game. The player information for player r in game g includes things like the number of minutes r was on the court during g, the number of field goals r made during g, the number of field goals r attempted during g, and so on. I don’t know the best way to model the interactions between players, but one possibility is to model a player’s performance in a game as depending on the sum of all pairwise interactions involving that player, perhaps weighted by how long the other player was in the game. Each player would then have a set of latent features for their inter-team interactions and another for their intra-team interactions, and the effect of an interaction would be based on the dot product between the relevant feature vectors. With a sensible set of priors that make the features functions of side information as before, and a reasonable conditional likelihood for the specific pieces of data we get about how players perform in games, we should be able, at least in principle, to integrate out all the unknown latent features and make predictions.
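The pairwise-interaction idea above can be sketched numerically. Everything here is an assumption for illustration: the number of latent features, the random latent vectors, and the single shared feature vector per player (a fuller model would keep separate intra-team and inter-team vectors, and would place priors over the features rather than fixing them):

```python
import numpy as np

rng = np.random.default_rng(0)

n_players, k = 10, 4  # 10 players, 4 latent features each (arbitrary choices)

# One latent feature vector per player, drawn randomly for this sketch.
features = rng.normal(size=(n_players, k))

def interaction_effect(p, others, minutes):
    """Predicted effect on player p's performance: a minutes-weighted sum of
    dot products between p's latent features and those of the other players."""
    effects = features[others] @ features[p]  # one dot product per other player
    weights = minutes / minutes.sum()         # weight by the other player's minutes
    return float(weights @ effects)

others = np.array([1, 2, 3, 4])          # indices of the other players in the game
minutes = np.array([30., 25., 20., 15.]) # how long each of them played
print(interaction_effect(0, others, minutes))
```

In the full model, this scalar would feed into the conditional likelihood for player 0's observed box-score statistics in that game, rather than being used directly.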

Good luck with the competition!

*The Reverend here of course being the Reverend Thomas Bayes. This delightful idiom is, I believe, originally due to the late Sam Roweis.

Monday, February 7, 2011

March Madness 2011 Data Update

Posted by Lee
There is just a little over one month left before the beginning of March Madness! Hopefully, you're happily coding away and building some really cool models.

I am still in the process of crawling all the box score data. Unfortunately, it is taking longer than I had anticipated. However, I did want to give everyone a chance to see what the data will look like and the opportunity to use some sample data while developing their algorithms and models.

The sample data contains two tab-delimited files. One contains a list of all the games played and the other contains a list of each player's performance within a game. I plan on using this format for the final set of data, but if you have any major issues with it, feedback is welcome via the comments.

The game file has four columns: a game identifier, the date the game was played, the home team, and the away team.

The players file has the following columns: the player's name, the game identifier corresponding to that row's performance, the team the player was playing for, the number of minutes the player played in the game, field goals made, field goals attempted, three pointers made, three pointers attempted, free throws made, free throws attempted, offensive rebounds, defensive rebounds, assists, turnovers, steals, blocks, and personal fouls.

While there is no explicit points data in these files (to avoid redundancy), it can easily be reconstructed. To determine the number of points scored by a player, compute: free throws made + 2 * field goals made + three pointers made (added once, not three times, since three pointers are already counted among the field goals made). To determine a team's score, sum the points of that team's players over all rows with the corresponding game identifier.
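The reconstruction above is a one-liner per player row. This sketch assumes the rows have already been parsed out of the tab-delimited players file; the player names, teams, and stat values are made up, and only the three columns needed for scoring are shown:

```python
import collections

# Hypothetical parsed rows: (player, game_id, team, ftm, fgm, tpm),
# i.e. free throws made, field goals made, three pointers made.
rows = [
    ("Player A", "G1", "Duke", 5, 10, 2),
    ("Player B", "G1", "Duke", 2,  4, 1),
    ("Player C", "G1", "UNC",  3,  8, 0),
]

def player_points(ftm, fgm, tpm):
    # Free throws + 2 per field goal made + 1 extra point per three,
    # since threes are already counted once among field goals made.
    return ftm + 2 * fgm + tpm

# Team score = sum of player points over rows with the same game id and team.
team_scores = collections.defaultdict(int)
for player, game_id, team, ftm, fgm, tpm in rows:
    team_scores[(game_id, team)] += player_points(ftm, fgm, tpm)

print(dict(team_scores))
```

For example, a player with 5 free throws, 10 field goals, and 2 threes scored 5 + 20 + 2 = 27 points (8 twos, 2 threes, 5 free throws).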

Good luck and please let us know via the comments if you run into any problems.

Wednesday, January 19, 2011

Get ready for the 2011 March Madness Predictive Analytics Challenge

Posted by Lee

For all posts related to the contest, check the March Madness category