Posted by Danny TarlowThis is a guest post by the great George Dahl, who is one of my go-to guys for sports modeling. I invited George to write a guest post related to the upcoming 2011 March Madness Predictive Analytics Challenge, because he has experience with applying machine learning to several different sports predictions problems, particularly in tennis and basketball. In the past, I've posted about work he was involved with related to Bayesian NBA Basketball Predictions.
George is a Ph.D. student in the Toronto Machine Learning Group, and he is supervised by Geoffrey Hinton. George focuses most of his research energy on deep learning approaches to speech and language processing and is a part of a collaboration that has obtained some of the best results on the TIMIT dataset.
With what will certainly be the most exciting March Madness prediction competition yet looming on the horizon, Danny asked me to share a few of my thoughts on using machine learning for basketball prediction. In this discussion, I want to focus on predicting everything possible about the outcomes of individual games and avoid discussing predicting an entire bracket. Certainly a reasonable approach for predicting a bracket is to predict individual games in sequence, but it should be obvious that there might be some problems with this since mistakes in early games can have a very large impact on the quality of the final bracket. This discussion also won’t really focus on the details of the competition too much, but instead more on whatever I feel like rambling about.
Before we start, I want to mention the two people that made it possible for me to work on basketball prediction and coauthor a paper featuring it. Last year, I became very interested in predicting NBA game results (and betting on them) and I wanted to see how far I could push the basic matrix factorization approach in Danny’s original post. One area where probabilistic matrix factorization (PMF) has been highly effective is in the famous Netflix Prize competition. Even with the number of training cases in the Netflix competition, a Bayesian approach provides substantial benefits in prediction accuracy by preventing overfitting. In a basketball game prediction system, we should expect it to be even more important to be protected by the cloak of the Reverend* since teams may only play each other a few times in a given season and the composition of the team, injury status of the players, and other features of a team constantly evolve. Thus it seemed natural to try Bayesian PMF on basketball prediction. Knowing that the latest and greatest practical sampling wisdom might prove essential, I recruited two sampling wizards, Doctors Ryan Adams and Iain Murray, the Radagast the Brown and Pallando the Blue of Bayesian machine learning. Without their wizardly might, I wouldn’t have nearly as much to say about this topic.
Although my real goal is to give a thought or two on how to go beyond the work Ryan, Iain, and I did to predict the outcomes of basketball games, I also want to briefly introduce *what* we did and tell you where you can find our code (the final version was mostly written by Ryan--much of my code wasn’t used after some initial prototypes) and paper if you want to try something similar for the challenge. I am not planning on directly applying our method to predicting games to predicting brackets, but if any enterprising soul is planning on trying out what we did I would be quite curious to see how well it does.
So how should we think about the basketball game prediction problem? My philosophy is to always start at the data. I was able to collect data on the outcome of NBA games from 2002 to about 2009. Each data case is a game and has the date of the game, the home team, the away team, the home team final score, the away team final score, a bookie spread, and a bookie over/under line. One obvious thing to do is to design some features f(g) that you can compute for each game g, perhaps something about the strength of the home team’s coach, lineup, roster, etc. for that day and anything else you can think of and calculate or set based on the knowledge of some NBA expert. Then we could just train some “off the shelf” classifier and predict a binary label that denotes if the home team won the game. Although not unreasonable, I think this approach tries to fit the problem into a particular solution mold instead of designing a solution that is well suited to the problem.
The most important thing about a basketball game is that it is an interaction between two teams. Instead of learning or engineering features of games, we should learn features for teams and model the interaction between teams directly. If we have additional information specific to the interaction (such as the venue of the game, time and date of the game, identity of referees, etc.), we should use it to modulate the team features or even make the team features functions of this side information. When we observe the result of a game, the way we will get the most information out of it is by building in as much problem structure as naturally as possible. Doing binary classification of games throws away score information that tells us about the offensive and defensive capabilities of the teams. Treating game records as i.i.d. regardless of when the games took place may similarly overgeneralize from temporary effects and ignore permanent changes in the offensive and defensive (and other) capabilities of the teams involved.
An important point relating to using score information is that the numbers of points two teams earn in a game are highly correlated, simply because of the possession rules of basketball (other factors cause team dependent correlations of various sorts between scores as well, such as how good at getting rebounds players on the team are and the potentially complicated decisions coaches might make that would cause these players to be on or off the bench). Seeing the Suns at home beat the Timberwolves 152 to 114 could be explained in a variety of ways. 114 is a reasonably high score, perhaps Minnesota has a strong offense, weak defense, and the Suns have an equally strong offense, but a stronger defense (and they are at home so maybe that helps them score more than it helps them defend?). One simplistic way of predicting the number of points a team will score in a game is to use the average number of points they scored over all their recent games. Then, to predict the final score of a game between the Suns and the Timberwolves, we could use the average number of points scored by the Suns and the average number of points scored by the Timberwolves. However, a different (and equally simplistic) way of predicting the number of points scored by a team in a game is to use the average number of points lost by the opposing team over all the opposing team's recent games. This method, in turn, gives us another way of predicting the final score of a game between the Suns and the Timberwolves. We could predict that the Suns will score the average number of points the Timberwolves typically lose and that the Timberwolves will score the average number of points the Suns typically lose. These two estimates of the score in a new game will not in general agree.
Beyond not agreeing with each other, what makes these averages even worse is that they don’t do a good job of even encapsulating the offensive or defensive strength of a team. Scoring a lot of points gives the opponent many more chances to score against you, so a strong offensive team will distort our estimates of its defensive capability and a strong defensive team might distort our estimates of its offensive capability. How do we reconcile these conflicting ways of viewing the evidence?
We need a probabilistic model of basketball game scores that incorporates the possession-based correlation between scores or we risk learning distorted values for potentially any model parameter meant to capture how good teams are offensively or defensively. Conversely, since the correlation in basketball is so strong, it can be really helpful to give our model the power to learn it. Hopefully if we integrate out parameters of the model that correspond to any of these difficult-to-infer “offensive” and “defensive” strength qualities and weight them by their posterior probability in the typical Bayesian way we will do well as long as the likelihood is sensible and incorporates a learned correlation between the two scores in a game.
What I have been talking about above is the basic modeling framework we used in our paper. Use a matrix factorization model to capture the dyadic nature of the data, use a Bayesian approach to avoid overfitting and integrate over our unknown latent variables, model offensive and defensive features of teams as smooth functions of time and home/away status (the only side information we had access to in our data), and use a sensible likelihood that can incorporate any correlation between the scores of the two teams in a game the model might discover or need to explain the data.
But how do we go beyond this? What sort of enhancements to the model in our paper are the most promising? Once again, I like to focus on the data. How well are we dealing with the structure in the data we already have, what other sorts of data could we get or would we want to get, and how would we exploit any such new data to improve the model? Are there any trends in that data that give us insight into problems with the model or potential improvements?
We can’t really get more game records if we already have all the games each team played in the last few years, but we could get more information about each game or the state of the teams at different points in time. We want pieces of information that are either really useful, really easy to collect, or both. I have two ideas in this vein that I think are promising. The first idea isn’t too ambitious. Right now, the model only uses the final score, but it shouldn’t be too hard to collect box scores. Can we use this new information to improve the model? Surely something better than ignoring it can be done. If we are being more ambitious, can we get access to injury reports or player specific data on a large scale? With a lot of player information, we can move from modeling games as interactions between teams which are modeled as monolithic (if time varying) entities with a large set of latent features, to modeling games as interactions between groups of players on opposing sides. Just as modeling games as interactions between teams and learning latent team features was more natural than classifying games, we could in principle use our knowledge of basketball to model games as interactions between players and learn latent features for individual players.
Let’s talk about box scores first, since it is less daunting a prospect. The way Ryan, Iain, and I dealt with the correlation between scores in the same game is by placing a bivariate Gaussian over them in the model. One natural thing to do with box scores is to just stick a reasonable joint distribution over them and maybe share some of the parameters of that distribution across teams. Since the points scored by a team in a single quarter are at most the number of points scored by the team in the entire game, the use of a Gaussian distribution might no longer be as appropriate as it is for modeling the total number of points scored in a game, but the thing to do would be to look at some histograms and find a reasonable distribution to use. A potential advantage of using box scores is that it might give the model the capability of, at least in part, dealing with the so-called “garbage time” phenomenon common in basketball games. The points scored in the final quarter of a game (or maybe even some earlier quarters) may or may not be representative of the actual capabilities of the teams, because a very different set of players is on the court when the outcome is almost completely determined than when the outcome of the game is still in question.
Representing individual players explicitly in the model is a fascinating possibility, but requires more data than what I have collected. At the minimum, we would need the team roster for every game, but it would be great to also have lineups and injury reports. The ultimate dream would be to have the time series that follows everything happening to the ball and who is on the court, but this quickly leaves the realm of possibility since only the teams themselves might be collect that sort of information.
The competition data looks like it will have the same data I collected for the NBA, without the bookie information, plus the set of players on each team and some information specific to each player in each game. The player information for player r in game g includes things like the number of minutes r was on the court during g, the number of field goals r made during g, the number of field goals r attempted during g, etc. I don’t know the best way to model the interactions between players, but one possibility is to model the way a player performs in a game as being based on the sum of all pairwise interactions between players involving that player, perhaps weighted by how long the other player was in the game. Each player would then have a set of latent features for their inter-team interactions and intra-team interactions, and the effect of an interaction would be based on the dot product between the relevant feature vectors. With a sensible set of priors that make the features functions of side information as before and a reasonable conditional likelihood function for the specific pieces of data we get about how players perform in games, we should be able, at least in principle, to integrate out all the unknown latent features and make predictions.
Good luck with the competition!
*The Reverend here of course being the Reverend Thomas Bayes. This delightful idiom is, I believe, originally due to the late Sam Roweis.