Posted by Danny Tarlow
This is a guest post by Monte McNair, the man behind team "Predict the Madness," which is the leader of the machine competitors after the second round.
Developing a system to fill out the best NCAA Tournament bracket involves two parts: matchup prediction and bracket optimization.
The first thing to do is come up with a method to predict the likelihood of one team beating another. Since we only care about advancement, I want a system that produces a win percentage as opposed to a point spread or something else. Therefore, I use a logistic regression with the outcome of each game as the dependent variable. For the independent variables, I use the location of the game, metrics for the team's offense and defense, and metrics for the averages of the team's opponents on both offense and defense. All NCAA Tournament games are played at neutral sites, but since I'm training on all games, I want to know how important playing at home is so that I can strip this out for neutral-site games. The reason to use components of a team's offense and defense, as opposed to simply points, is that the different components that contribute to points have varying levels of reliability. As KenPom figured out this year, for example, defensive 3P% is extremely unreliable. My model takes this into account and weights it less than it would if we only used its influence on points against. By breaking points down into components, we let the model determine which factors are most reliable in predicting future performance.
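To make the home-court idea concrete, here is a minimal sketch of how a fitted logistic model turns a weighted sum of features into a win probability, and how the location term can be zeroed out for a neutral-site game. This is only an illustration, not the actual model: the feature names, the coefficient values, and the helper functions `win_probability` and `neutral_site` are all made up for the example.

```python
import math

# Hypothetical coefficients for illustration only; a real model would
# learn these by fitting a logistic regression to game results.
COEFFICIENTS = {
    "home": 0.60,             # +1 if team A is home, -1 if away, 0 if neutral
    "off_efficiency": 0.12,   # team A minus team B, offensive metric
    "def_efficiency": -0.10,  # team A minus team B, defensive metric (lower is better)
}


def win_probability(features, coefficients=COEFFICIENTS):
    """Return P(team A beats team B) from a dict of feature values."""
    score = sum(coefficients[name] * value for name, value in features.items())
    return 1.0 / (1.0 + math.exp(-score))  # logistic function


def neutral_site(features):
    """Return a copy of the features with the location term set to zero."""
    stripped = dict(features)
    stripped["home"] = 0.0
    return stripped


# A made-up game in which team A is at home and slightly better on both ends.
game = {"home": 1.0, "off_efficiency": 2.0, "def_efficiency": -1.0}
p_home = win_probability(game)
p_neutral = win_probability(neutral_site(game))
print(f"at home: {p_home:.3f}, neutral site: {p_neutral:.3f}")
```

Because the location effect is its own term, training on home and away games still yields coefficients that can be applied to tournament games simply by setting that term to zero.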
The main thing we care about is that the model does a good job of predicting future games. Instead of waiting for future games, however, we can just use out-of-sample games. I took about 1/3 of our games and made them training games and left the other 2/3 as testing games. One thing I did that may be different from most is that I used all of a team's games for the season except for the game in question to create their profile. For example, say North Carolina played Duke on January 7th in one of my training games. For North Carolina's profile, I used stats from all of their games before AND after January 7th. I'm not sure what other systems do, but I think they might use all games (without excluding the game in question) or perhaps just games PRIOR to the game in question. In any case, after training the model, I can test it against the out-of-sample games I set aside for testing. I divided up all the test games into 100 buckets ordered by their predicted win percentage and compared each bucket's predicted win percentage to the actual win percentage in those games. As we can see, the buckets are closely aligned, meaning the predictions are fairly accurate.
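The bucket comparison described above can be sketched roughly as follows. This is an illustrative version, not the code used for the post: the function name `calibration_buckets` is hypothetical, and the games are simulated from a perfectly calibrated predictor because the real test set is not reproduced here.

```python
import random


def calibration_buckets(predictions, outcomes, n_buckets=100):
    """Sort games by predicted win probability, split them into roughly
    equal-sized buckets, and return one (mean predicted probability,
    actual win rate, number of games) tuple per bucket."""
    games = sorted(zip(predictions, outcomes))
    n_buckets = min(n_buckets, len(games))  # avoid empty buckets
    rows = []
    for b in range(n_buckets):
        start = b * len(games) // n_buckets
        end = (b + 1) * len(games) // n_buckets
        bucket = games[start:end]
        mean_pred = sum(p for p, _ in bucket) / len(bucket)
        win_rate = sum(won for _, won in bucket) / len(bucket)
        rows.append((mean_pred, win_rate, len(bucket)))
    return rows


# Simulated test games (an assumption for the demo): each outcome is drawn
# with exactly the predicted probability, so the buckets should line up.
random.seed(0)
predictions = [random.random() for _ in range(20000)]
outcomes = [1 if random.random() < p else 0 for p in predictions]

rows = calibration_buckets(predictions, outcomes, n_buckets=10)
for mean_pred, win_rate, count in rows:
    print(f"predicted {mean_pred:.3f}  actual {win_rate:.3f}  games {count}")
```

With real predictions, a bucket whose actual win rate sits well above or below its mean prediction would point to a range of probabilities where the model is under- or over-confident.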
The next thing to do is to take our matchup predictions and maximize our expected points based on the scoring system we are presented with. While this is most beneficial when scoring systems provide bonuses for picking upsets or some other unique scoring, it can still be helpful in basic scoring systems and is better than simply advancing winners round by round.
As an example, take Louisville and New Mexico, the 4 and 5 seeds in the West region. My model predicts New Mexico as the favorite in a game against Louisville, projected to win 51.2% of the time. Both are favored in their 1st round matchups as well, so if we were to simply advance them both, we'd then choose New Mexico to advance over Louisville in the 2nd round. However, New Mexico has a tougher 1st round opponent in Long Beach State than Louisville does against Davidson. In the table below, we see that New Mexico wins just 65% against LBSU while Louisville wins 75% of the time against Davidson. This is enough to make it more likely that Louisville advances to the Sweet 16 than New Mexico, despite UNM being the better team.
New Mexico over Long Beach State: 65%
Louisville over Davidson: 75%
New Mexico over Louisville: 51.2%
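The Louisville and New Mexico example can be worked through with a short calculation. The sketch below uses the three probabilities quoted in the text; the other two matchups (Louisville against Long Beach State and New Mexico against Davidson) are not given in the post, so the values used for them are placeholder assumptions, and the function names are hypothetical.

```python
# P(first team beats second team). The first three values come from the post;
# the last two are assumed placeholders, since the post does not give them.
WIN_PROB = {
    ("New Mexico", "Louisville"): 0.512,
    ("New Mexico", "Long Beach State"): 0.65,
    ("Louisville", "Davidson"): 0.75,
    ("Louisville", "Long Beach State"): 0.64,  # assumed
    ("New Mexico", "Davidson"): 0.76,          # assumed
}


def p_beats(a, b):
    """Look up P(a beats b), using the complement if only (b, a) is stored."""
    if (a, b) in WIN_PROB:
        return WIN_PROB[(a, b)]
    return 1.0 - WIN_PROB[(b, a)]


def p_reach_sweet_16(team, first_opponent, other_game):
    """Probability that `team` wins its first game and then beats whichever
    team comes out of `other_game`, a (team_x, team_y) pair."""
    x, y = other_game
    p_x_arrives = p_beats(x, y)
    p_second = (p_x_arrives * p_beats(team, x)
                + (1.0 - p_x_arrives) * p_beats(team, y))
    return p_beats(team, first_opponent) * p_second


louisville = p_reach_sweet_16(
    "Louisville", "Davidson", ("New Mexico", "Long Beach State"))
new_mexico = p_reach_sweet_16(
    "New Mexico", "Long Beach State", ("Louisville", "Davidson"))
print(f"Louisville: {louisville:.4f}, New Mexico: {new_mexico:.4f}")
```

Under these assumed values, Louisville comes out ahead even though New Mexico is favored head to head, because the easier first game outweighs the small head-to-head edge. In a basic scoring system, the pick that maximizes expected points for that slot is whichever team has the higher probability of reaching it.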
In a basic scoring system, this rarely comes into play and when it does, it provides little benefit. But it still is best to be accurate if you can.