Posted by Danny TarlowAs usual, the first round was full of upsets, with two of the #2 ranked teams falling. None of our competitors predicted either of those upsets, but they are still putting on a respectable performance. Here are details of each competitor's entry, along with the current performance.
The favorites at this point look like "The Matrix Factorizer" and "The Pain Machine". Both did quite well in the first round, and both have 7/8 elite eight teams still surviving, along with all 4/4 final four teams still alive.
The Matrix Factorizer
I modified Danny's starter code in two ways: First, I added an asymmetric component to the loss function, so the model is rewarded for getting the prediction correct even if the absolute predicted scores are wrong. Second, I changed the regularization so that latent vectors are penalized for deviating from the global average over latent vectors, rather than being penalized for being far from 0. This can be interpreted as imposing a basic hierarchical prior.
I then ran a search over model parameters (e.g., latent dimension, regularization strength, parameter that trades off the two parts of the loss function) to find the setting that did best on number of correct predictions made in the past 5 years's tournaments.
24 of 33 Correct, 25 Pts, 171 Pts Possible
The Pain Machine
Methodology: Linear regression on a number of statistics, including strength ratings to predict MOV (Margin of Victory). Some modifications for tournament use, particularly to force a likely number of upsets.
23 of 33 Correct, 24 Pts, 170 Pts Possible
Methodology: Using Ken Pomeroy's Pythag formula to rate teams, then calculated the actual game probabilities with the log5 formula. Used a random number generator to determine outcome of games. This provided some randomness which created a few interesting upsets. Simulate the tournament 50 times and record each team's probability to reach subsequent rounds. Step through each round of the bracket choosing winners based on the team that had a higher probability to win that round.
I found that running the simulation 50 times gave me the most variability in the final four, running the simulation more than 100 times gave me a bracket that had almost no upsets and most all of the higher seeded teams progressed through the tournament.
23 of 33 Correct, 24 Pts, 172 Pts Possible
Always pick the higher seed.
23 of 33 Correct, 24 Pts, 168 Pts Possible
For each season (e.g. 2006-2007) I have enumerated the teams and compiled the scores of the games into a matrix S. For example, if team 1 beat team 2 with a score of 82-72 then S12=82 and S21=72. Ideally, each team would play every other team at least once, but this is obviously not the case so the matrix S is sparse. Using the method proposed by George Dahl, I define vectors o and d which correspond to each teams offensive and defensive ability. The approximation to the matrix S is then just the outer product od' (for example (od')_12=o1d2=S12est). This is a simple rank one approximation for the matrix. If each team played each other at least once then the matrix S would be dense and the vectors o and d could be found by finding the SVD of S (see http://www.stanford.edu/~boyd/ee263/notes/low_rank_approx.pdf). Because this is not the case, we instead define a matrix P that represents which teams played that season. For example, P12=P21=1 if teams 1 and 2 played a game. Now the problem stated by George can be expressed compactedly as, "minimize ||P.*(o*d')-S||_F". Here, '.*' represents the Hadamard product and ||.||_F is the Frobenius norm. In this from, it is easy to see that, for constant vector o and variable vector d, this is a convex problem. Also, for constant vector d and variable vector o this is a convex problem. Therefore, by solving a series of convex problems, alternating the vector variable between o and d, the problem converges rapidly in about 5 to 10 steps (see "Nonnegative Matrix Factorizations" code here http://cvxr.com/cvx/examples/).
See this post for more details.
23 of 33 Correct, 24 Pts
Danny's Dangerous Picks
I started with the basic matrix factorization approach from my starter code, then I added small neural networks that applied a transformation to the base latent vectors based on whether the team was playing at home, away, or in the tournament. These transformation vectors were learned based on season and tournament performance of teams from other years. I split the data into 5 cross-validation sets, and looked for hyperparameter settings that did best on tournament prediction in past years. Like Jon, I also added an asymmetric component to the loss function.
Interestingly (disappointingly), after finding the setting of parameters that did best on past data, my method made some pretty conservative predictions for this year, predicting only 3 upsets.
22 of 33 Correct, 23 Pts, 165 Pts Possible
Predict the Madness
Methodology: To determine the probability of any matchup (Team 1 beating Team 2), I use a logistic regression using statistics for offense/defense of team and team's opponents plus location, dependent variable is outcome of the game. To select bracket, I use a program to calculate the best possible bracket by maximizing number of points based on scoring system, this correctly accounts for situations where simply advancing favored teams round by round would fail.
22 of 33 Correct, 23 Pts, 157 Pts Possible
The methodology is that I selected various stats and gave weight to those that I feel are important, such as points for and against, offensive rebounds, and turnover margin. I also factored in whether they were from one of the big conferences, the level of experience and success the coach has had, and then overlaid the formula with a strength of schedule formula that would reduce certain teams scores based on how good or bad the competition was that they played to get those stats.
22 of 33 Correct, 23 Pts, 139 Pts Possible
Machine Learning First Try
My methodology is as follows:
1. Develop a matrix that contains only 2011 scores (done using your data)
2. Develop a matrix that contains all of your teams and generate columns for averages over all players in 2011: minutes played, FT attempted/made, 3P attempted/made (done), rebounds, turnovers, fouls (again using your data)
3. Use machine learning, specifically a traditional Forest algorithm to predict each team's score for each game based on the 2011 data only
4. Select the winner for each round and repeat step 3 for the next round to determine the next winners
Currently, the algorithm predicted the first round modeling each team's score as an "Away" team since they are all technically on the road. I think I may change it so that the scores are based on a mean value of the model for an Away team and Home team because currently it is predicting LIU Brooklyn over MSU in the 1st round...if it comes true then so be it.
20 of 33 Correct, 21 Pts, 91 Pts Possible
By The Numbers
I took the data so generously provided, trained a couple of neural networks on the past performance, then used average away performance for each team to predict performance in the tourney. The networks are training as I type.
17 of 33 Correct, 18 Pts, 166 Pts Possible