Monday, February 21, 2011

2010 March Madness Contest Winning Algorithm: Code

Posted by Danny Tarlow
Last year's March Madness algorithm predictions contest ended with me winning the full bracket, and Scott Turner (aka "The Pain Machine") and I tied for the sweet 16 bracket:
http://blog.smellthedata.com/2010/04/duke-is-not-only-winner-from-last-night.html

I've gotten a few emails asking for the code I used. I haven't looked at it since last year, but the Python code is available here:
http://www.cs.toronto.edu/~dtarlow/march_madness_public.tgz

You can follow links for the description of the probabilistic matrix factorization model, and a post using a similar model applied to a recommendation system.

Unfortunately, the documentation is a bit lacking beyond that. Hopefully the posts are enough to explain things at a high level. If you have more detailed questions (or find bugs), feel free to ask (or point them out) in the comments.

If somebody adapts this code to work with this year's data format, please share! (and let me know)

This code may also be of interest.

Saturday, February 19, 2011

Memory Hacking

Posted by Danny Tarlow
I wrote a while back about the idea of using machine learning to make scrabble words easier for me to memorize. One of the main problems was how to describe the "me complexity" (or "danny complexity") of some representation of the dictionary. This article doesn't directly address the question, but I found it quite interesting to think about in this context:
http://www.nytimes.com/interactive/2011/02/20/magazine/mind-secrets.html?ref=magazine

I find that building intuitions and stories around concepts in research helps me greatly when coming up with new ideas. Are there corresponding classes of mind-hacks for other things, like creativity or mathematical manipulations?

(Via @igrigorik)

Modeling Chess with Probabilistic Matrix Factorization

Posted by Danny Tarlow
Dirk writes...
Thanks for posting your code and ideas on your blog. I was playing around with your PMF code applying it to chess games, where users are white and items are black. Because players can play more than once against each other, would you remove the duplicate pairs?
Cool. It sounds like an interesting application.

I don't see any reason to remove duplicate pairs. If different games have different outcomes, you definitely want to keep both to prevent biasing the model. If there are duplicate games with the same outcome, we should be more confident that the outcome wasn't a fluke, so it roughly makes sense that we'd want to weight it a bit more (which is the effect of having duplicate entries in the data set).

It's a bit strange to apply the vanilla model chess, though. There is no relationship in the model between player X as white and player X as black--they might as well be two separate people. It may be interesting to look at extensions where the latent vectors for player X as white are constrained to be similar as the latent vectors for player X as black.

Do any readers disagree?

(The relevant code is here.)

Sunday, February 13, 2011

NCAA Boxscore Data 2006-2010

Posted by Lee
All the boxscores from the 2006-2010 seasons (up until the first few days of February) are now available for download!

The format is specified in the last post (here). However, please note that the team codes now include an "UNK" which refers to a team that is not within the list of codes. These are often small teams either in non-Division 1 conferences or no longer in Division 1. UNK can map to multiple schools. Please let me know via the comments if this will pose any major problems. I could include the actual team name, but all of the "UNK" teams will not be in the tournament and it is likely that almost all the schools in the tournament will not have an "UNK" team in their regular season schedule.

Saturday, February 12, 2011

Thoughts on Modeling Basketball

Posted by Danny Tarlow
This is a guest post by the great George Dahl, who is one of my go-to guys for sports modeling. I invited George to write a guest post related to the upcoming 2011 March Madness Predictive Analytics Challenge, because he has experience with applying machine learning to several different sports predictions problems, particularly in tennis and basketball. In the past, I've posted about work he was involved with related to Bayesian NBA Basketball Predictions.

George is a Ph.D. student in the Toronto Machine Learning Group, and he is supervised by Geoffrey Hinton. George focuses most of his research energy on deep learning approaches to speech and language processing and is a part of a collaboration that has obtained some of the best results on the TIMIT dataset.



With what will certainly be the most exciting March Madness prediction competition yet looming on the horizon, Danny asked me to share a few of my thoughts on using machine learning for basketball prediction. In this discussion, I want to focus on predicting everything possible about the outcomes of individual games and avoid discussing predicting an entire bracket. Certainly a reasonable approach for predicting a bracket is to predict individual games in sequence, but it should be obvious that there might be some problems with this since mistakes in early games can have a very large impact on the quality of the final bracket. This discussion also won’t really focus on the details of the competition too much, but instead more on whatever I feel like rambling about.

Before we start, I want to mention the two people that made it possible for me to work on basketball prediction and coauthor a paper featuring it. Last year, I became very interested in predicting NBA game results (and betting on them) and I wanted to see how far I could push the basic matrix factorization approach in Danny’s original post. One area where probabilistic matrix factorization (PMF) has been highly effective is in the famous Netflix Prize competition. Even with the number of training cases in the Netflix competition, a Bayesian approach provides substantial benefits in prediction accuracy by preventing overfitting. In a basketball game prediction system, we should expect it to be even more important to be protected by the cloak of the Reverend* since teams may only play each other a few times in a given season and the composition of the team, injury status of the players, and other features of a team constantly evolve. Thus it seemed natural to try Bayesian PMF on basketball prediction. Knowing that the latest and greatest practical sampling wisdom might prove essential, I recruited two sampling wizards, Doctors Ryan Adams and Iain Murray, the Radagast the Brown and Pallando the Blue of Bayesian machine learning. Without their wizardly might, I wouldn’t have nearly as much to say about this topic.

Although my real goal is to give a thought or two on how to go beyond the work Ryan, Iain, and I did to predict the outcomes of basketball games, I also want to briefly introduce *what* we did and tell you where you can find our code (the final version was mostly written by Ryan--much of my code wasn’t used after some initial prototypes) and paper if you want to try something similar for the challenge. I am not planning on directly applying our method to predicting games to predicting brackets, but if any enterprising soul is planning on trying out what we did I would be quite curious to see how well it does.

So how should we think about the basketball game prediction problem? My philosophy is to always start at the data. I was able to collect data on the outcome of NBA games from 2002 to about 2009. Each data case is a game and has the date of the game, the home team, the away team, the home team final score, the away team final score, a bookie spread, and a bookie over/under line. One obvious thing to do is to design some features f(g) that you can compute for each game g, perhaps something about the strength of the home team’s coach, lineup, roster, etc. for that day and anything else you can think of and calculate or set based on the knowledge of some NBA expert. Then we could just train some “off the shelf” classifier and predict a binary label that denotes if the home team won the game. Although not unreasonable, I think this approach tries to fit the problem into a particular solution mold instead of designing a solution that is well suited to the problem.

The most important thing about a basketball game is that it is an interaction between two teams. Instead of learning or engineering features of games, we should learn features for teams and model the interaction between teams directly. If we have additional information specific to the interaction (such as the venue of the game, time and date of the game, identity of referees, etc.), we should use it to modulate the team features or even make the team features functions of this side information. When we observe the result of a game, the way we will get the most information out of it is by building in as much problem structure as naturally as possible. Doing binary classification of games throws away score information that tells us about the offensive and defensive capabilities of the teams. Treating game records as i.i.d. regardless of when the games took place may similarly overgeneralize from temporary effects and ignore permanent changes in the offensive and defensive (and other) capabilities of the teams involved.

An important point relating to using score information is that the numbers of points two teams earn in a game are highly correlated, simply because of the possession rules of basketball (other factors cause team dependent correlations of various sorts between scores as well, such as how good at getting rebounds players on the team are and the potentially complicated decisions coaches might make that would cause these players to be on or off the bench). Seeing the Suns at home beat the Timberwolves 152 to 114 could be explained in a variety of ways. 114 is a reasonably high score, perhaps Minnesota has a strong offense, weak defense, and the Suns have an equally strong offense, but a stronger defense (and they are at home so maybe that helps them score more than it helps them defend?). One simplistic way of predicting the number of points a team will score in a game is to use the average number of points they scored over all their recent games. Then, to predict the final score of a game between the Suns and the Timberwolves, we could use the average number of points scored by the Suns and the average number of points scored by the Timberwolves. However, a different (and equally simplistic) way of predicting the number of points scored by a team in a game is to use the average number of points lost by the opposing team over all the opposing team's recent games. This method, in turn, gives us another way of predicting the final score of a game between the Suns and the Timberwolves. We could predict that the Suns will score the average number of points the Timberwolves typically lose and that the Timberwolves will score the average number of points the Suns typically lose. These two estimates of the score in a new game will not in general agree.

Beyond not agreeing with each other, what makes these averages even worse is that they don’t do a good job of even encapsulating the offensive or defensive strength of a team. Scoring a lot of points gives the opponent many more chances to score against you, so a strong offensive team will distort our estimates of its defensive capability and a strong defensive team might distort our estimates of its offensive capability. How do we reconcile these conflicting ways of viewing the evidence?

We need a probabilistic model of basketball game scores that incorporates the possession-based correlation between scores or we risk learning distorted values for potentially any model parameter meant to capture how good teams are offensively or defensively. Conversely, since the correlation in basketball is so strong, it can be really helpful to give our model the power to learn it. Hopefully if we integrate out parameters of the model that correspond to any of these difficult-to-infer “offensive” and “defensive” strength qualities and weight them by their posterior probability in the typical Bayesian way we will do well as long as the likelihood is sensible and incorporates a learned correlation between the two scores in a game.

What I have been talking about above is the basic modeling framework we used in our paper. Use a matrix factorization model to capture the dyadic nature of the data, use a Bayesian approach to avoid overfitting and integrate over our unknown latent variables, model offensive and defensive features of teams as smooth functions of time and home/away status (the only side information we had access to in our data), and use a sensible likelihood that can incorporate any correlation between the scores of the two teams in a game the model might discover or need to explain the data.

But how do we go beyond this? What sort of enhancements to the model in our paper are the most promising? Once again, I like to focus on the data. How well are we dealing with the structure in the data we already have, what other sorts of data could we get or would we want to get, and how would we exploit any such new data to improve the model? Are there any trends in that data that give us insight into problems with the model or potential improvements?

We can’t really get more game records if we already have all the games each team played in the last few years, but we could get more information about each game or the state of the teams at different points in time. We want pieces of information that are either really useful, really easy to collect, or both. I have two ideas in this vein that I think are promising. The first idea isn’t too ambitious. Right now, the model only uses the final score, but it shouldn’t be too hard to collect box scores. Can we use this new information to improve the model? Surely something better than ignoring it can be done. If we are being more ambitious, can we get access to injury reports or player specific data on a large scale? With a lot of player information, we can move from modeling games as interactions between teams which are modeled as monolithic (if time varying) entities with a large set of latent features, to modeling games as interactions between groups of players on opposing sides. Just as modeling games as interactions between teams and learning latent team features was more natural than classifying games, we could in principle use our knowledge of basketball to model games as interactions between players and learn latent features for individual players.

Let’s talk about box scores first, since it is less daunting a prospect. The way Ryan, Iain, and I dealt with the correlation between scores in the same game is by placing a bivariate Gaussian over them in the model. One natural thing to do with box scores is to just stick a reasonable joint distribution over them and maybe share some of the parameters of that distribution across teams. Since the points scored by a team in a single quarter are at most the number of points scored by the team in the entire game, the use of a Gaussian distribution might no longer be as appropriate as it is for modeling the total number of points scored in a game, but the thing to do would be to look at some histograms and find a reasonable distribution to use. A potential advantage of using box scores is that it might give the model the capability of, at least in part, dealing with the so-called “garbage time” phenomenon common in basketball games. The points scored in the final quarter of a game (or maybe even some earlier quarters) may or may not be representative of the actual capabilities of the teams, because a very different set of players is on the court when the outcome is almost completely determined than when the outcome of the game is still in question.

Representing individual players explicitly in the model is a fascinating possibility, but requires more data than what I have collected. At the minimum, we would need the team roster for every game, but it would be great to also have lineups and injury reports. The ultimate dream would be to have the time series that follows everything happening to the ball and who is on the court, but this quickly leaves the realm of possibility since only the teams themselves might be collect that sort of information.

The competition data looks like it will have the same data I collected for the NBA, without the bookie information, plus the set of players on each team and some information specific to each player in each game. The player information for player r in game g includes things like the number of minutes r was on the court during g, the number of field goals r made during g, the number of field goals r attempted during g, etc. I don’t know the best way to model the interactions between players, but one possibility is to model the way a player performs in a game as being based on the sum of all pairwise interactions between players involving that player, perhaps weighted by how long the other player was in the game. Each player would then have a set of latent features for their inter-team interactions and intra-team interactions, and the effect of an interaction would be based on the dot product between the relevant feature vectors. With a sensible set of priors that make the features functions of side information as before and a reasonable conditional likelihood function for the specific pieces of data we get about how players perform in games, we should be able, at least in principle, to integrate out all the unknown latent features and make predictions.

Good luck with the competition!

*The Reverend here of course being the Reverend Thomas Bayes. This delightful idiom is, I believe, originally due to the late Sam Roweis.

Monday, February 7, 2011

March Madness 2011 Data Update

Posted by Lee
There is just a little over one month left before the beginning of March Madness! Hopefully, you're happily coding away and building some really cool models.

I am still in the process of crawling all the boxscore data. Unfortunately, it is taking longer than I had anticipated. However, I did want to give everyone a chance to see what the data would look like and the opportunity to use some sample data while developing their algorithms and models.

The sample data contains two tab-delimited files. One contains a list of all the games played and the other contains a list of each player's performance within a game. I plan on using this format for the final set of data, but if you have any major issues with it, feedback is welcome via the comments.

The game file has four columns: a game identifier, the date the game was played, the home team, and the away team.

The players file has the following columns: the player's name, the game ID corresponding to this particular's row performance, the team the player was playing for, the number of minutes the player played in the game, field goals made, field goals attempted, three pointers made, three pointers attempted, free throws made, free throws attempted, offensive rebounds, defensive rebounds, assists, turnovers, steals, blocks, and personal fouls.

While there is no explicit points data in these files (to avoid redundancy), it can easily be reconstructed. For example, to determine the number of points scored by a player, simply perform the following calculation: free throws made + 2 * field goals made + three pointers made (not times three as they are already counted in the field goals). To determine the team's scores, simply sum the scores of the players for each team with the corresponding game identifier.

Good luck and please let us know via the comments if you run into any problems.