Monday, March 10, 2014

Machine March Madness 2014

Posted by Danny Tarlow
You may have seen this year that Kaggle is running a Machine Learning March Madness competition.

Lee and I talked it over, and it looks like they're doing a pretty good job with it, so with just a slight tinge of regret, we decided to put our annual Machine March Madness competition on hiatus. However, we reserve the right to revive the competition in future years, depending on how things go.

Our most loyal competitor is not so easily discouraged, though, so it is with great pleasure that we announce that Scott Turner will be running the Machine March Madness competition this year on his blog. If you feel like competing in the traditional style, please head over there and register your entry!

Tuesday, April 9, 2013

Congratulations to the Machine March Madness Winner

Posted by Danny Tarlow
Well, after another exciting March Madness tournament, Louisville emerged as the winner of March Madness, and Ryan Boesch emerged as the winner of Machine March Madness, with his algorithm beating out the field of 22 other machine competitors and all the human baselines. Congratulations, Ryan!

I asked him a few questions, which he answers below:

1. What inspired you to compete in the Machine March Madness competition?

Last year I finished a class on Convex Optimization during the winter quarter and was planning to take a Machine Learning class in the spring quarter. I was looking for a project to apply what I had learned. I saw this competition and submitted a last minute bracket.

2. What do you attribute your win to? What is your model best at?

The win was of course very lucky. Basketball games are random in nature, so finding which model is actually the best would require many years of tournaments. One tournament is not statistically significant.

There is nothing particularly special about my model. I used Danny's model, only I fit the parameters using convex optimization instead of batch gradient descent.

3. What do you think the most promising direction(s) towards improving your model would be?

Most Promising: My current model simply matches up teams and sees which has the higher predicted score. It doesn't account for the difficulty of previously played games in the tournament. For example, say team 1 has a 51% chance to win its first-round game and also a 51% chance to beat team 2 in the second round. If team 2 has a 95% chance of winning its first-round game, then team 2 is more likely to make it to round 3, even though it only has a 49% chance of beating team 1 head-to-head. This is taken into account in Nate Silver's picks, for example.
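Ryan's example can be checked with a quick calculation. This is a sketch; for simplicity it assumes team 1's 51% second-round win probability holds against either possible opponent:

```python
# Probability each team reaches round 3, using the hypothetical numbers above.
# Assumption (not in the original): team 1 wins its second-round game with
# probability 0.51 regardless of which opponent emerges from the other game.

p1_r1 = 0.51    # team 1 wins its first-round game
p2_r1 = 0.95    # team 2 wins its first-round game
p1_vs_2 = 0.51  # team 1 beats team 2 head-to-head

# Team 1 reaches round 3: win round 1, then win round 2.
p1_r3 = p1_r1 * 0.51

# Team 2 reaches round 3: win round 1, then beat whoever comes out of the
# other game (assumed to be the same 49% chance either way, for simplicity).
p2_r3 = p2_r1 * (1 - p1_vs_2)

print(p1_r3)  # 0.2601
print(p2_r3)  # 0.4655 -- the better round-3 bet despite being the head-to-head underdog
```

So a bracket that only compares head-to-head predictions would favor team 1 in round 2, even though team 2 is nearly twice as likely to be playing in round 3.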

Second Most Promising: When in a pool with other competitors the goal is no longer to maximize your expected score, but instead to maximize your expected chance of winning. These two optimizations do not always result in the same picks. I may consider taking this into account in future years. I found this paper on Nate Silver's blog which analyzes this idea.

4. What advice would you give to future competitors?

Be wary of overfitting your model.

5. What would you change about the competition in future years?

We should try to get out and advertise for the competition earlier and to a broader audience to maximize participation.

Friday, April 5, 2013

Machine March Madness Final Four Outlook

Posted by Danny Tarlow
Once again, we have a guest post by Scott Turner, our local Machine March Madness competitor and analyst, who also runs a prediction blog of his own. Thanks for another great post, Scott!

The Machine March Madness Contest finds itself in peculiar waters this year -- no doubt because the Tournament is in strange waters itself. There is only a single team left alive from the twelve #1, #2, and #3 seeds. Going into the Final Four, only four brackets have their champion prediction left alive -- and all four have Louisville. In fact, only one bracket has anyone left alive other than Louisville -- and that's "Tim J's Nets for Nets", who has Syracuse to the final game (but losing to Ohio State). But despite the craziness in the Tournament this year, the top four predictors in the Machine March Madness Contest are in the top 5% of all brackets.

If Louisville manages to win out, the champion will be "Ryan's Rank 1 Approximation" with 121 points. He will beat out all the human competitors as well.

If Louisville loses in the final game, the champion will be "Predict the Madness" (tied with "Danny's Dangerous Picks"), with "Ryan's Rank 1 Approximation" and my own Prediction Machine both a single point behind.

If Louisville somehow loses to Wichita State and Syracuse beats Michigan, then "Tim J's Nets for Nets" will vault all the way from fifteenth into a tie for first with "Predict the Madness". If Syracuse loses to Michigan, then "Predict the Madness" will win outright.

It has certainly been a crazy year for the Tournament!

Monday, March 25, 2013

Upset Analysis by Scott Turner

Posted by Danny Tarlow
This is a guest post by Scott Turner, a perennial Machine March Madness competitor who runs a prediction blog of his own.

There are 22 entries in the Machine Madness contest this year, so analyzing them is a much bigger task than in past years.  Nonetheless I dug through all the brackets and looked at all the first round upset predictions to see how well the machines did.

Correct Upset Predictions

Interestingly enough, every first round upset was picked by at least two of the predictors except for Harvard -- which no one picked -- and Florida Gulf Coast, which only "Larry's Upsetting Picks" predicted.   The only consensus upset pick was Minnesota over UCLA, which was predicted by exactly half of the predictors.  Iowa State also got broad support (41%) but none of the rest of the picks had more than 4 predictors in support.  Here's the full table of the upsets that occurred and who predicted them:

Entry                            Correct upsets
*Danny's Dangerous Picks              2
Andy's Astounding Bracket             2
Ask me about my T-Rex                 3
Curtis Lehmann's Crazy Bracket        0
Dan Tran's Dazzling Bracket           3
Guess O'Bot 3000                      0
K. V. Southwood's Fine Bracket        2
Larry's upsetting picks               4
LA's Machine Mad Pick                 1
Leon's Super Legendary Bracket        1
Marginal Madness                      1
Mark's LR bracket                     4
MatrixFactorizer                      3
natebrix's Neat Bracket               2
noodlebot                             1
Predict the Madness                   1
Ryan's Rank 1 Approximation           1
Scott Turner's Prediction Mach        2
ScottyJ's Grand Bracket               0
The Rosenthal Fit                     2
TheSentinel                           2
Tim J's Nets for Nets                 4
Average correct: 1.9

Pick rates per upset: Minnesota 50%, Iowa St. 41%, Oregon 18%, Wichita St. 18%, Mississippi 18%, Temple 14%, California 14%, La Salle 9%, Fla GC 5%, Harvard 0%.

My conclusion here is that UCLA-Minnesota and Notre Dame-Iowa State were probably mis-seeded.

UCLA-Minnesota is an interesting case in human psychology.  Minnesota lost 11 of its last 16 games, finished 8th in its conference and lost in the first game of the conference tournament, while UCLA won 11 of its last 16, won the Pac-12 regular season conference title, and lost in the title game of the conference tournament.  It's no wonder UCLA got a 6 seed and Minnesota an 11.  But in fact, Minnesota was playing against much better competition through the conference games, and most of its losses came to ranked opponents and/or on the road.  Machines understand the concept of a "good loss" much better than people.  

The Notre Dame-Iowa State mis-seeding wasn't so egregious.  This probably should have been an 8-9 matchup instead of a 7-10, in which case a win by Iowa State would have hardly been surprising.

All of the rest of the games were probably true upsets.

Incorrect Upset Predictions

Most of the predictors also made a number of incorrect upset predictions.  Most predictors had one or two missed upsets, although six of the predictors made no missed upset predictions (primarily because they made mostly chalk predictions).  Here's the full table:

Entry                            Missed upsets
*Danny's Dangerous Picks              2
Andy's Astounding Bracket             0
Ask me about my T-Rex                 2
Curtis Lehmann's Crazy Bracket        1
Dan Tran's Dazzling Bracket           2
Guess O'Bot 3000                      3
K. V. Southwood's Fine Bracket        4
Larry's upsetting picks               7
LA's Machine Mad Pick                 0
Leon's Super Legendary Bracket        8
Marginal Madness                      0
Mark's LR bracket                     4
MatrixFactorizer                      2
natebrix's Neat Bracket               2
noodlebot                             0
Predict the Madness                   1
Ryan's Rank 1 Approximation           1
Scott Turner's Prediction Mach        2
ScottyJ's Grand Bracket               0
The Rosenthal Fit                     0
TheSentinel                           2
Tim J's Nets for Nets                 4
Average missed: 2.1

Pick rates per missed upset: 45%, 36%, 23%, 23%, 18%, 14%, 14%, 9%, 9%, 5%, 5%, 5%, 5%, 5%.

As a general rule, the predictors that made the most correct upset picks also made the most incorrect upset picks.  Notably, "Larry's Upsetting Picks" made the incredible call of the FGCU upset (and also called the second round upset) but also made seven incorrect upset picks.

There was almost a consensus (45%) on Colorado over Illinois.  That's an interesting contrast with the Minnesota pick -- Illinois should have benefited, in most of the predictors' models, from a tough B1G conference schedule, but many of the predictors thought Illinois was still vulnerable.  Illinois had a 16-point halftime lead in this game but let it slip away and needed some late-game heroics to win, so this was certainly a reasonable prediction.

St. Mary's over Memphis was another popular pick.  Memphis won by 2 when a last-second shot by St. Mary's missed, so this also seemed like a reasonable upset pick.


Upset Profits

An important question is whether any of the predictors profited from their upset predictions -- that is, whether the points they gained from correct upset predictions were more than the points they lost from missed upsets.  In general, this is complex to calculate because we have to look at how the predictions affect the later rounds of the tournament.  But it's easy enough to look at just the first round scoring.  Here's the table:

Entry                            Correct  Missed  Net
Andy's Astounding Bracket           2        0    +2
The Rosenthal Fit                   2        0    +2
Ask me about my T-Rex               3        2    +1
Dan Tran's Dazzling Bracket         3        2    +1
LA's Machine Mad Pick               1        0    +1
Marginal Madness                    1        0    +1
MatrixFactorizer                    3        2    +1
noodlebot                           1        0    +1
*Danny's Dangerous Picks            2        2     0
Mark's LR bracket                   4        4     0
natebrix's Neat Bracket             2        2     0
Predict the Madness                 1        1     0
Ryan's Rank 1 Approximation         1        1     0
Scott Turner's Prediction Mach      2        2     0
ScottyJ's Grand Bracket             0        0     0
TheSentinel                         2        2     0
Tim J's Nets for Nets               4        4     0
Curtis Lehmann's Crazy Bracket      0        1    -1
K. V. Southwood's Fine Bracket      2        4    -2
Guess O'Bot 3000                    0        3    -3
Larry's upsetting picks             4        7    -3
Leon's Super Legendary Bracket      1        8    -7

We see that a couple of the predictors ("The Rosenthal Fit" and "Andy's Astounding Bracket") came out two points positive, fifteen of the predictors gained one or zero points, and five of the predictors lost points.  Interestingly, both "The Rosenthal Fit" and "Andy's Astounding Bracket" made only two upset predictions and got both of them right -- and there was no overlap in their predictions.  Furthermore, neither of them predicted the "easiest" upset of Minnesota over UCLA.



None of the predictors performed very well at picking upsets, and there wasn't wide agreement on the upset picks.  The consensus would have selected only the Minnesota-UCLA upset and been +1 in scoring, but no individual predictor did that.  Most of the predictors did not hurt themselves with their upset picks (at least looking at only the first round), but none really saw significant benefit.  Given the potentially large downside of missing upset predictions, in future contests it wouldn't be an unreasonable strategy to force your predictor to make all chalk selections in the first round. 

Thursday, March 21, 2013

Predicting March Madness by Jasper Snoek

Posted by Jasper

Now that March Madness is officially underway, and the deadline to submit new bracket predictions has passed, I'm ready to divulge the details of my super secret, possibly excessively advanced, March Madness prediction model.  For a few years now, there has been a special "elite" pool to predict March Madness.  The twist is that all the predictions have to be made by a computer algorithm - no humans allowed.  This means we can't use seed information, predictions from experts, or the POTUS's executive insight.  Instead, we predict based only on data (my model uses only scores).  This is the second year that I am entering an algorithm.  My entry from last year, which won the pool and beat the vast majority of humans in the Yahoo challenge, is being used as a baseline.  This means I have to submit something more sophisticated this year to stay on top.

The model:

The Simple Version:
A few years ago, the world of machine learning (a subfield of artificial intelligence that combines statistics, math and computer science to get computers to learn and infer from data) was rocked by the Netflix challenge.  Netflix offered a prize of a million dollars to anyone who could beat their movie recommendation system by 10%. One of the most powerful and surprisingly simple algorithms to come out of that challenge was Probabilistic Matrix Factorization (PMF).  The idea was that a movie rating was a simple product of a set of hidden or 'latent' factors pertaining to the movie and the user.  Although the factors are not pre-defined, you could imagine that the model may learn one factor for a movie that corresponds to the amount of action and then a user would have a factor encoding how much they like action (and similarly for e.g. romance).  We learn the model by adjusting these factors to maximize the probability that the user would give the ratings that we can see.  To predict someone's rating for a given movie they haven't seen yet, you just multiply their factors by the movie factors.
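As a rough sketch of the PMF idea, here is a plain squared-error matrix factorization trained by stochastic gradient descent on toy synthetic data (leaving out PMF's probabilistic priors for brevity):

```python
import numpy as np

rng = np.random.default_rng(0)
n_users, n_movies, k = 20, 15, 3

# Hypothetical toy data: ratings come from a hidden low-rank matrix,
# and we only observe a random half of the entries.
true_u = rng.normal(size=(n_users, k))
true_v = rng.normal(size=(n_movies, k))
ratings = true_u @ true_v.T
mask = rng.random((n_users, n_movies)) < 0.5

# Latent factors to be learned: one k-vector per user and per movie.
U = 0.1 * rng.normal(size=(n_users, k))
V = 0.1 * rng.normal(size=(n_movies, k))

lr, reg = 0.02, 0.01
for epoch in range(200):
    for i, j in zip(*np.nonzero(mask)):
        err = ratings[i, j] - U[i] @ V[j]  # predicted rating is a dot product
        u_old = U[i].copy()
        U[i] += lr * (err * V[j] - reg * U[i])
        V[j] += lr * (err * u_old - reg * V[j])

# Fit is measured on the observed entries; unseen entries are predicted
# the same way, by multiplying the learned factors.
train_rmse = np.sqrt(np.mean((ratings - U @ V.T)[mask] ** 2))
print(train_rmse)
```

The update rule is just gradient descent on the squared error of each observed entry, with a small L2 penalty standing in for PMF's Gaussian priors.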

Similarly to movie ratings we can create factors for basketball teams to predict game scores.  Here the factors (again learned by the model) could correspond to offensive skill and defensive capabilities.  This was the basis of my model for last year.  There was a small twist in that I altered the way that the model was learned - to focus only on scores for which it predicted the wrong winner.

This year my model is significantly more complex but builds on the same principles.  It has two levels of latent or hidden factors.  The first encodes factors for each team - such as offensive skill, defensive skill, etc.  The second layer combines team factors just like in standard PMF, but instead of mapping directly to the scores they map to a hidden representation that encodes the game.  My reasoning is that the resulting score of a game is much more complex than a product of simple factors pertaining to each team.   The idea is that the game representation now encodes things like: will the game be close or will it be a blowout - will it be high scoring or a defensive brawl?  From the game representation I have a mapping to the difference between the home team score and the away team score.  Now this is where things get a little complicated.  Since there are only a relatively small number of games in a season (just over 5,000) and this model is already fairly complex, rather than directly try to learn a function mapping from the game factors to the scores, I model a distribution over all possible mappings.  The idea: given all (infinite) reasonable mappings from factors representing the game to scores, what is the most probable outcome?  To do this I use a statistical model called a Gaussian process.
[Figure: The factors encoding teams.]

Now to learn the model:
I take all of the game scores from the past season.  For each game, I tell the model which team is the home team, which is the away team, and then adjust the team factors and game factors in order to maximize the probability of the real score.  In order to choose the number of factors at each step, I use a new automatic parameter tuning algorithm I personally helped develop called Bayesian optimization.

What do the factors look like?
Just to the right I have an example of the factors that are learned if I train the model using just two factors for the teams (for those of you in machine learning, these are the weights of the neural network) and I have plotted where each of the teams are in this factor space (along with their seeds).  You can see that the model is putting the better teams in the lower left and the worse teams near the top right.  It doesn't seem to fancy the odds of South Dakota...  I'll explain later why I call the model "Turducken".

Below this I have a picture zoomed in on just the bottom left.  You can see that the powerhouses are all encoded in this region.  You can click on these images to zoom in.  Now you can see that two factors already encode quite a bit about which teams are better.  My model uses two hundred factors - so it is encoding something that is quite significantly more complex.

[Figure: Zoomed in on the bottom left.]
Below there is a picture of the factors learned to encode games.  There is a dot for each game, colored by relative score: a 1 means that the home team wins by a lot and a 2 means that the home team loses by a lot ("a lot" here actually means about 50 points).  So the model takes the team factors on the right and multiplies them to get to the game factors below.  Then from the game factors it predicts by how much the home team will win or lose.

What is this Bayesian optimization?
One really exciting area of machine learning that has advanced a lot over the past year is related to how to build systems that work more automatically. To really eke out the best performance, you usually need an expert to sit and tweak a bunch of knobs, see what happens, and repeat many times. It's really time-consuming and nearly impossible for a non-expert (and even difficult for experts). But there is work on automating this process, building a system to automatically tune the knobs and decipher the results.  I am using Bayesian optimization that I left running overnight to automatically determine how many factors to use for teams and for games, based on how well the model can predict the scores of 500 games that I pulled out of the set of data that the model learns from.  The procedure decided to use 200 factors per team and just two per game.

In Machine Learning Speak:
The devil is of course in the details.  The model I am using is a buzz-word powerhouse.  I call it a deep semi-parametric Bayesian probabilistic matrix factorization that is optimized using Bayesian optimization.  My fellow machine learning PhD friend, George Dahl, calls it a "statistical Turducken".  It uses a neural network trained with 'dropout' to perform a nonlinear probabilistic matrix factorization into a latent space that encodes games.  A Gaussian process mapping is then used to map from games to the score difference.  The input to the neural network is a binary encoding of which team is the home team (so the number of dimensions equals the number of teams) and then similarly a binary encoding of which team is the away team.  So the input to the model is a numTeams x 2 dimensional binary encoding with two bits on.  This may seem wasteful, but note that now the weights to be learned by the neural network correspond exactly to latent factors pertaining to each team.  The teams get different factors depending on whether they are home or away (as I personally have no college basketball expertise, I have no idea if this is a wise design choice).  The neural network maps these factors into a hidden unit representation and then to a latent space.  From the latent space I map using a Gaussian process with a squared exponential kernel to score difference.

[Figure: An example of the factors learned by the model to encode 'games'.]

The model is trained using backpropagation - from the marginal likelihood of the Gaussian process I backpropagate error through the kernel of the GP to the weights of the neural network.  I use stochastic gradient descent on randomly chosen minibatches of 250 games at a time and a 50% dropout rate on the hidden units of the neural network.  I used Bayesian optimization on a validation set of 500 games to determine the number of hidden units in the neural network (i.e. the number of factors in the PMF), the latent dimension of the input to the GP and the number of epochs to train the model for.

What did it predict?
You can check out the bracket that it predicted here:
As of this writing, the model is 4/4 including a minor upset of Wichita over Pittsburgh.

You can take a look at our pool here:
Interestingly, even though it doesn't know anything about the seeds, it predicted the four number one seeds in the final four.  According to the turducken, Indiana is going all the way.  This is pretty remarkable - the algorithm is in close agreement with some of the top human basketball experts.  That is already a validation that it is doing something reasonable.  There are not too many
controversial predictions here, though it is predicting some upsets (e.g. Notre Dame over Ohio St.).  It will be really exciting to see how it does as the next days play out!

The 2013 Machine March Madness Field

Posted by Danny Tarlow
Thanks everybody who entered this year's Machine March Madness competition. Based on the descriptions of the approaches, it's clear that a lot of hard work and ingenuity has gone into the contest. I'm excited to see how all the different approaches do.

Below, you can see the competitors' descriptions of their approaches. We'll also have some longer posts diving into more details coming up in the near future. If there are any in particular that you're itching to hear more about, leave a note in the comments.

If you have entered but not sent me a description of your approach yet, please do. I'll update this post as more descriptions come in.

Without further ado, here is your 2013 Machine March Madness field!


Marginal Madness
Kevin Swersky

I'm using variational Bayesian matrix factorization with normal priors on the latent factors, and Gaussian-inverse Wishart hyperpriors on the hyperparameters of the priors. Inference is performed using mean-field (no direct optimization of any model parameters is done). The entries of the matrix are R(i,j) = P(team i beats team j) using the empirical counts over the 2012-2013 season. I found that the brackets produced using this were much more stable with respect to the number of factors than any other representation. I used 20 factors, the number of which was chosen based on squared error on 25% randomly held-out entries of R. For my predictions, I just took the mean vectors and ignored any uncertainty learned by the model. Ideally, I should have selected the number of factors, or assessed the stability of the model by using the variational lower bound, but I was lazy. To predict the final score, I used gradient-boosted regression trees from scikit-learn on the feature vectors produced by the factorization.


Larry's Upsetting Picks

I'm using a PMF-based model and I'm also modelling several other aspects such as teams' strength over time (both over a season and across seasons) as well as conferences' strength. These different aspects are combined linearly together to form a prediction.

I also tried using a team's winning percentage (both over the season and over the last few games) but that didn't lead to an improvement.

On a technical note, I also noticed that in PMF, using the difference in scores instead of the raw scores gives slightly better accuracy at determining the winner.


K. V. Southwood's Fine Bracket
K.V. Southwood

I created an ensemble model based on 3 individual models:

1) multiple linear regression model based on predicting the points margin

2) multiple linear regression model based on predicting offensive points scored

3) logistic regression model based on predicting win vs. loss

Ryan's Rank 1 Approximation
Ryan B.

Brief description of approach (same as last year): For each season (e.g. 2006-2007) I have enumerated the teams and compiled the scores of the games into a matrix S. For example, if team 1 beat team 2 with a score of 82-72 then S12=82 and S21=72. Ideally, each team would play every other team at least once, but this is obviously not the case, so the matrix S is sparse. Using the method proposed by George Dahl, I define vectors o and d which correspond to each team's offensive and defensive ability. The approximation to the matrix S is then just the outer product od' (for example, (od')_12 = o1*d2 = S12est). This is a simple rank-one approximation to the matrix. If each team played every other team at least once then the matrix S would be dense and the vectors o and d could be found from the SVD of S. Because this is not the case, we instead define a matrix P that represents which teams played that season. For example, P12=P21=1 if teams 1 and 2 played a game. Now the problem stated by George can be expressed compactly as: minimize ||P.*(o*d')-S||_F. Here, '.*' represents the Hadamard product and ||.||_F is the Frobenius norm. In this form, it is easy to see that, for a constant vector o and variable vector d, this is a convex problem. Likewise, for a constant vector d and variable vector o it is a convex problem. Therefore, by solving a series of convex problems, alternating the variable between o and d, the problem converges rapidly, in about 5 to 10 steps (see "Nonnegative Matrix Factorizations"). From this point the problem is easily expanded to handle higher-rank approximations.
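A minimal sketch of this alternating scheme on synthetic data (a hypothetical setup; here each convex subproblem is solved with its closed-form least-squares solution):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 30  # number of teams (hypothetical synthetic setup)

# "True" offensive and defensive vectors, and a score matrix S observed
# only where the mask P says two teams actually played.
o_true = rng.uniform(60, 90, n)
d_true = rng.uniform(0.8, 1.2, n)
P = rng.random((n, n)) < 0.3
np.fill_diagonal(P, False)
S = np.where(P, np.outer(o_true, d_true), 0.0)

# Alternate the closed-form least-squares solutions of the two convex
# subproblems: minimize ||P .* (o d' - S)||_F over o with d fixed, then
# over d with o fixed.
o = np.ones(n)
d = np.ones(n)
for _ in range(20):
    o = (S * d).sum(axis=1) / np.maximum((P * d**2).sum(axis=1), 1e-12)
    d = (S.T * o).sum(axis=1) / np.maximum((P.T * o**2).sum(axis=1), 1e-12)

err = np.linalg.norm(P * (np.outer(o, d) - S))
print(err)  # essentially zero: the rank-1 structure is recovered
```

Note that o and d are only identified up to a scale swap (o*c, d/c), which doesn't matter for predicting scores.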


Scott Turner's Prediction Machine
Scott Turner

Linear regression on a number of statistics, including strength ratings to predict MOV (Margin of Victory). The basic model is used to predict game outcomes throughout the year, but there are some modifications for the Tournament. Additions this year include a new metric for analyzing possible upsets, an algorithm for forcing upset selections based upon the (predicted) score required to win the pool, and some modifications for neutral-court and tournament games. More details at


See my blog post and project page.

Danny's Dad (Human Baseline)
Danny's Dad.

Literally, Danny's Dad's picks.

Obama's Bracket (Human Baseline)
Barack Obama

The President's picks.

Jasper Snoek

Probabilistic matrix factorization augmented with Gaussian Processes and Bayesian optimization. More details will be forthcoming in a longer blog post (Update: here).


LA's Machine Mad Pick
LeAnthony M.

I used 2011 Final Four stats data rather than last year's, including RPI, offensive efficiency, turnovers, and defensive efficiency. A fitness function based on the final NCAA tournament standings feeds into an evolving genetic program, giving me a final equation. I then feed this year's field of 64 into that equation to compute the final standings of the 2013 tournament.


Predict the Madness
Monte McNair




Similar strategy as last year. Used Ken Pomeroy's Pythag ratings with the log5 calculation to determine probability of winning the game.
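The log5 calculation combines two teams' Pythag ratings (each team's expected winning percentage against an average team) into a head-to-head win probability. A sketch of the standard formula, shown with made-up ratings:

```python
def log5(p_a: float, p_b: float) -> float:
    """Probability that team A beats team B, given each team's Pythag
    rating (expected winning percentage against an average opponent)."""
    return (p_a - p_a * p_b) / (p_a + p_b - 2 * p_a * p_b)

# Equally rated teams are a coin flip; a strong team is heavily
# favored over an average one.
print(log5(0.7, 0.7))  # 0.5
print(log5(0.9, 0.5))  # 0.9
```

Feeding these pairwise probabilities into a Monte Carlo simulation of the bracket is then just a matter of sampling each game with its log5 probability and tallying how often each team advances.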

Used a Monte Carlo simulation with 65 iterations, which provided a few interesting upsets, e.g. Oregon over Oklahoma St. (I believe they were mis-seeded myself!).


Danny's Dangerous Picks

Developed a variant on probabilistic matrix factorization, where the scores of a game are modeled as the output of a neural network that takes as input a learned latent vector for each team as well as the elementwise product of the latent vectors for the two teams. Latent vectors are learned for each team for each season, jointly with the neural net parameters, which are shared across all seasons from 2006-2007 through the present. I used 5D latent vectors and a one-hidden-layer neural net with 50 hidden units.
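As a rough sketch, the forward pass of such a model might look like the following (hypothetical shapes and random initialization; in the actual entry the latent vectors and net weights are learned jointly):

```python
import numpy as np

rng = np.random.default_rng(2)
n_teams, latent_dim, hidden = 100, 5, 50

# One learned latent vector per team (per season, in the real model).
Z = rng.normal(size=(n_teams, latent_dim))

# One-hidden-layer net whose input is [z_home, z_away, z_home * z_away].
W1 = rng.normal(scale=0.1, size=(3 * latent_dim, hidden))
b1 = np.zeros(hidden)
W2 = rng.normal(scale=0.1, size=(hidden, 2))  # outputs (home score, away score)
b2 = np.zeros(2)

def predict_scores(home: int, away: int) -> np.ndarray:
    zh, za = Z[home], Z[away]
    x = np.concatenate([zh, za, zh * za])  # latent vectors + elementwise product
    h = np.tanh(x @ W1 + b1)
    return h @ W2 + b2

print(predict_scores(0, 1).shape)  # (2,)
```

Training would then backpropagate the score error into both the net weights and the per-team latent vectors.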


Human Bracket

The Commissioner's human bracket.


The Rosenthal Fit
Jeffrey Rosenthal

Details here:


Last Year's Winner (Baseline)
Jasper Snoek

(The winning algorithm from last year, run on this year's data but otherwise unmodified. Entered as a baseline.) I modified Danny's starter code in two ways: First, I added an asymmetric component to the loss function, so the model is rewarded for getting the prediction correct even if the absolute predicted scores are wrong. Second, I changed the regularization so that latent vectors are penalized for deviating from the global average over latent vectors, rather than being penalized for being far from 0. This can be interpreted as imposing a basic hierarchical prior.

I then ran a search over model parameters (e.g., latent dimension, regularization strength, the parameter that trades off the two parts of the loss function) to find the setting that did best on the number of correct predictions made in the past 5 years' tournaments.
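The asymmetric-loss idea might be sketched as follows; the exact form used in the entry isn't specified, so the `discount` factor and the winner-match test here are assumptions for illustration:

```python
import numpy as np

def asymmetric_loss(pred_home, pred_away, true_home, true_away, discount=0.5):
    """Squared score error, scaled down when the predicted winner is correct.

    `discount` is a hypothetical trade-off parameter of the kind the
    search over model parameters would tune.
    """
    sq_err = (pred_home - true_home) ** 2 + (pred_away - true_away) ** 2
    correct_winner = np.sign(pred_home - pred_away) == np.sign(true_home - true_away)
    return np.where(correct_winner, discount * sq_err, sq_err)

# Same size of score error, but the wrong-winner prediction costs more.
print(asymmetric_loss(70, 65, 75, 60))  # 25.0  (winner correct: 0.5 * (25 + 25))
print(asymmetric_loss(65, 70, 75, 60))  # 200.0 (winner wrong: 100 + 100)
```

The effect is that the optimizer cares first about getting winners right and only secondarily about the exact scores.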


Leon's Super Legendary Bracket

Defensive efficiency vs Offensive efficiency; tie-breakers favored defense over offense. Chose final score using season averages in wins/losses.


Tim J's Nets for Nets
Tim J.

Based on full-season statistics for each team, I ran a discriminant analysis for correlation with wins, including seasons 2000-present.

Then I trained a neural network only on neutral-location games, measuring performance both in mean squared error and in actual bracket scores for the 2007-2012 tournaments, and predicted the bracket for this year.


natebrix's Neat Bracket

The method is a variation on Boyd Nation's Iterative Strength Rating that incorporates margin of victory and weights late-season games more strongly. This link has more:



Mark's LR bracket

Logistic Regression???


Ask me about my T-Rex
Zach Mayer



ScottyJ's Grand Bracket



Guess O'Bot 3000



Andy's Astounding Bracket




Dan Tran's Dazzling Bracket