Thursday, March 21, 2013

Predicting March Madness by Jasper Snoek


Now that March Madness is officially underway and the deadline to submit new bracket predictions has passed, I'm ready to divulge the details of my super secret, possibly excessively advanced, March Madness prediction model.  For a few years now, there has been a special "elite" pool to predict March Madness.  The twist is that all the predictions have to be made by a computer algorithm - no humans allowed.  This means we can't use seed information, predictions from experts, or the POTUS's executive insight.  Instead, we predict based only on data (my model uses only scores).  This is the second year that I am entering an algorithm.  My entry from last year, which won the pool and beat the vast majority of humans in the Yahoo challenge, is being used as a baseline.  This means I have to submit something more sophisticated this year to stay on top.

The model:

The Simple Version:
A few years ago, the world of machine learning (a subfield of artificial intelligence that combines statistics, math and computer science to get computers to learn and infer from data) was rocked by the Netflix challenge.  Netflix offered a prize of a million dollars to anyone who could beat their movie recommendation system by 10%.  One of the most powerful and surprisingly simple algorithms to come out of that challenge was Probabilistic Matrix Factorization (PMF).  The idea was that a movie rating could be modeled as a simple product of a set of hidden or 'latent' factors pertaining to the movie and the user.  Although the factors are not pre-defined, you could imagine that the model might learn a factor for a movie corresponding to the amount of action, and a matching factor for a user encoding how much they like action (and similarly for, e.g., romance).  We learn the model by adjusting these factors to maximize the probability of the ratings we can actually see.  To predict someone's rating for a movie they haven't seen yet, you just multiply their factors by the movie's factors.
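To make the idea concrete, here is a tiny PMF-style sketch (nothing like Netflix scale; the users, movies, ratings, and hyperparameters are all made up).  Latent factors for users and movies are fit with stochastic gradient steps on the squared rating error, which corresponds to maximum a posteriori learning under Gaussian noise and Gaussian priors on the factors:

```python
import numpy as np

rng = np.random.default_rng(0)

n_users, n_movies, n_factors = 5, 4, 2
U = 0.1 * rng.standard_normal((n_users, n_factors))   # user factors
V = 0.1 * rng.standard_normal((n_movies, n_factors))  # movie factors

# observed ratings: (user, movie, rating) -- made-up data
ratings = [(0, 1, 4.0), (0, 2, 1.0), (1, 1, 5.0), (2, 3, 2.0), (3, 0, 3.0)]

lr, reg = 0.05, 0.01
for epoch in range(200):
    for u, m, r in ratings:
        err = r - U[u] @ V[m]
        # gradient step on squared error, plus a small weight-decay term
        U[u] += lr * (err * V[m] - reg * U[u])
        V[m] += lr * (err * U[u] - reg * V[m])

# predict a rating by multiplying user factors by movie factors
print(round(U[0] @ V[1], 2))
```

After training, the predicted rating for an observed pair lands close to the data; for an unseen pair, the same dot product gives the model's best guess.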

Similarly to movie ratings, we can learn factors for basketball teams to predict game scores.  Here the factors (again learned by the model) could correspond to, say, offensive skill and defensive capabilities.  This was the basis of my model last year, with one small twist: I altered the way the model was learned so that it focused only on scores for which it predicted the wrong winner.

This year my model is significantly more complex, but it builds on the same principles.  It has two levels of latent or hidden factors.  The first encodes factors for each team - such as offensive skill, defensive skill, etc.  The second layer combines team factors just like standard PMF, but instead of mapping directly to the scores, they map to a hidden representation that encodes the game.  My reasoning is that the score of a game is much more complex than a product of simple factors pertaining to each team.  The idea is that the game representation now encodes things like: will the game be close or will it be a blowout?  Will it be high scoring or a defensive brawl?  From the game representation I have a mapping to the difference between the home team score and the away team score.  Now this is where things get a little complicated.  Since there are only a relatively small number of games in a season (just over 5,000), and this model is already fairly complex, rather than directly trying to learn a single function mapping from the game factors to the scores, I model a distribution over all possible mappings.  The idea: given all (infinitely many) reasonable mappings from the factors representing a game to scores, what is the most probable outcome?  To do this I use a statistical model called a Gaussian process.
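As a rough shape-only illustration of the two levels (the weights here are random and made up; the real model learns them and is far larger), the forward pass from two teams to a low-dimensional game code might look like:

```python
import numpy as np

rng = np.random.default_rng(1)

n_teams, n_team_factors, n_game_factors = 16, 8, 2

# first level: latent factors per team, with separate home and away roles
W_home = 0.1 * rng.standard_normal((n_teams, n_team_factors))
W_away = 0.1 * rng.standard_normal((n_teams, n_team_factors))

# second level: combine the two teams' factors into a game representation
W_game = 0.1 * rng.standard_normal((2 * n_team_factors, n_game_factors))

def game_representation(home, away):
    # look up each team's factors, pass through a nonlinearity,
    # and project down to the latent code for this particular game
    h = np.tanh(np.concatenate([W_home[home], W_away[away]]))
    return h @ W_game

z = game_representation(3, 7)
print(z.shape)
```

In the full model this game code is then fed to a Gaussian process rather than a fixed output layer, which is the "distribution over all possible mappings" part.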
The factors encoding teams.

Now to learn the model:
I take all of the game scores from the past season.  For each game, I tell the model which team is the home team and which is the away team, and then adjust the team factors and game factors to maximize the probability of the real score.  To choose the number of factors at each level, I use a new automatic parameter-tuning algorithm that I helped develop, called Bayesian optimization.

What do the factors look like?
Just to the right is an example of the factors that are learned if I train the model using just two factors per team (for those of you in machine learning, these are the weights of the neural network); I have plotted where each team sits in this factor space, along with its seed.  You can see that the model puts the better teams in the lower left and the worse teams near the top right.  It doesn't seem to fancy the odds of South Dakota...  I'll explain later why I call the model "Turducken".

Below this is a picture zoomed in on just the bottom left.  You can see that the powerhouses are all encoded in this region.  (You can click on these images to zoom in.)  Two factors already encode quite a bit about which teams are better; my model uses two hundred factors, so it is encoding something significantly more complex.

Zoomed in on the bottom left.
Below there is a picture of the factors learned to encode games.  There is a dot for each game, colored by relative score: a 1 means the home team wins by a lot and a 2 means the home team loses by a lot ("a lot" here actually means about 50 points).  So the model takes the team factors on the right and multiplies them to get the game factors below.  Then, from the game factors, it predicts by how much the home team will win or lose.

What is this Bayesian optimization?
One really exciting area of machine learning that has advanced a lot over the past year is building systems that work more automatically.  To really eke out the best performance, you usually need an expert to sit and tweak a bunch of knobs, see what happens, and repeat many times.  It's really time-consuming and nearly impossible for a non-expert (and difficult even for experts).  But there is work on automating this process: building a system that automatically tunes the knobs and deciphers the results.  I used Bayesian optimization, left running overnight, to automatically determine how many factors to use for teams and for games, based on how well the model predicted the scores of 500 games that I held out from the data the model learns from.  The procedure decided to use 200 factors per team and just two per game.
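For the curious, here is a toy sketch of the kind of loop Bayesian optimization runs (this is not the actual tool I used, and the 1-d "objective" is a made-up stand-in for validation error as a function of one knob): fit a Gaussian process to the evaluations so far, then pick the next setting to try by maximizing expected improvement.

```python
import numpy as np
from math import erf, sqrt, pi

def objective(x):
    # made-up stand-in for "validation error as a function of a hyperparameter"
    return (x - 0.3) ** 2 + 0.05 * np.sin(20 * x)

def gp_posterior(X, y, Xs, length=0.2, noise=1e-6):
    # GP regression with a squared exponential kernel
    def k(a, b):
        return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / length ** 2)
    K = k(X, X) + noise * np.eye(len(X))
    Ks = k(X, Xs)
    sol = np.linalg.solve(K, Ks)
    mu = sol.T @ y
    var = 1.0 - np.sum(Ks * sol, axis=0)     # prior variance is 1 on the diagonal
    return mu, np.maximum(var, 1e-12)

def expected_improvement(mu, var, best):
    # how much do we expect to beat the best value seen so far?
    s = np.sqrt(var)
    z = (best - mu) / s
    Phi = 0.5 * (1 + np.vectorize(erf)(z / sqrt(2)))
    phi = np.exp(-0.5 * z ** 2) / sqrt(2 * pi)
    return (best - mu) * Phi + s * phi

X = np.array([0.05, 0.95])                   # two initial evaluations
y = objective(X)
grid = np.linspace(0, 1, 201)
for _ in range(8):                           # eight sequential evaluations
    mu, var = gp_posterior(X, y, grid)
    x_next = grid[np.argmax(expected_improvement(mu, var, y.min()))]
    X = np.append(X, x_next)
    y = np.append(y, objective(x_next))

print(round(X[np.argmin(y)], 2))             # knob setting with the lowest "error" seen
```

The interesting part is that each new evaluation is chosen to balance exploring uncertain regions against exploiting regions the GP already believes are good, which is exactly what you want when each evaluation means retraining a model overnight.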

In Machine Learning Speak:
The devil is of course in the details.  The model I am using is a buzz-word powerhouse.  I call it a deep semi-parametric Bayesian probabilistic matrix factorization, optimized using Bayesian optimization.  My fellow machine learning PhD friend, George Dahl, calls it a "statistical Turducken".  It uses a neural network trained with 'dropout' to perform a nonlinear probabilistic matrix factorization into a latent space that encodes games.  A Gaussian process mapping is then used to map from games to the score difference.  The input to the neural network is a binary encoding of which team is the home team (so the number of dimensions equals the number of teams) and then, similarly, a binary encoding of which team is the away team.  So the input to the model is a numTeams x 2 dimensional binary encoding with two bits on.  This may seem wasteful, but note that the weights to be learned by the neural network now correspond exactly to latent factors pertaining to each team.  The teams get different factors depending on whether they are home or away (as I personally have no college basketball expertise, I have no idea if this is a wise design choice).  The neural network maps these factors into a hidden unit representation and then to a latent space.  From the latent space I map to the score difference using a Gaussian process with a squared exponential kernel.

An example of the factors learned by the model to encode 'games'.
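The input encoding itself is trivial to construct; a sketch (team indices and sizes made up):

```python
import numpy as np

def encode_game(home, away, n_teams):
    # numTeams x 2 binary input with exactly two bits on:
    # one in the home-team block, one in the away-team block
    x = np.zeros(2 * n_teams)
    x[home] = 1.0
    x[n_teams + away] = 1.0
    return x

x = encode_game(3, 7, n_teams=10)
print(int(x.sum()), len(x))
```

Multiplying this vector by the first weight matrix simply selects the home team's row and the away team's row of weights, which is why those weights are exactly per-team latent factors.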

The model is trained using backpropagation: from the marginal likelihood of the Gaussian process, I backpropagate the error through the kernel of the GP to the weights of the neural network.  I use stochastic gradient descent on randomly chosen minibatches of 250 games at a time, with a 50% dropout rate on the hidden units of the neural network.  I used Bayesian optimization on a validation set of 500 games to determine the number of hidden units in the neural network (i.e. the number of factors in the PMF), the latent dimension of the input to the GP, and the number of epochs to train the model for.
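The dropout part can be sketched in a few lines.  Note this uses the now-common 'inverted' form that rescales surviving units at training time; the original formulation instead scales the weights down at test time, so treat this purely as an illustration of the idea:

```python
import numpy as np

rng = np.random.default_rng(3)

def forward_with_dropout(h, rate=0.5, training=True):
    # during training, drop each hidden unit with probability `rate`
    # and scale the survivors so the expected activation is unchanged;
    # at test time, pass activations through untouched
    if not training:
        return h
    mask = rng.random(h.shape) >= rate
    return h * mask / (1.0 - rate)

h = np.ones(8)
print(forward_with_dropout(h, training=False))
```

Randomly silencing half the hidden units on every minibatch prevents the factors from co-adapting too tightly, which matters when you are fitting a model this flexible to only about 5,000 games.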

What did it predict?
You can check out the bracket that it predicted here:
As of this writing, the model is 4/4, including a minor upset of Wichita State over Pittsburgh.

You can take a look at our pool here:
Interestingly, even though it doesn't know anything about the seeds, it predicted the four number-one seeds in the final four.  According to the Turducken, Indiana is going all the way.  This is pretty remarkable: the algorithm is in close agreement with some of the top human basketball experts, which is already a validation that it is doing something reasonable.  There are not too many controversial predictions here, though it is predicting some upsets (e.g. Notre Dame over Ohio St.).  It will be really exciting to see how it does as the next few days play out!

1 comment:

Harry Snoek said...

Wow, can you use this for stocks too?