Monday, March 14, 2011

The Algorithm's 2011 March Madness Predictions Part 1

Posted by Danny Tarlow
We seem to have ironed out most of the data issues for the 2011 March Madness Predictive Analytics Challenge. As a reminder, check the March Madness label to see all contest-related posts in one place. If you'd like to enter the competition, there's still time! (And Lee has posted new, easier-to-use data.)

With announcements out of the way, I'm pleased to report that I've got my algorithm running on the 2011 season data. In this post, I will show outputs from the simple 1D version of my probabilistic matrix factorization model. I'll run the higher-dimensional version that I use to actually make predictions and write about it in a later post (or you can follow the instructions here, run the code, and train the model yourself).

How the Model Works: The basic idea is very simple. We want to represent each team's offensive strength and defensive strength with a single number each (in the 1D version). We then predict the result of team i playing team j as follows:

Predicted score for i = offense_strength_i * defense_strength_j
Predicted score for j = offense_strength_j * defense_strength_i

Because the two strengths are multiplied, higher numbers mean better offense, and lower numbers mean better defense (a low defensive strength shrinks the opponent's predicted score).
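To make the formula concrete, here is a minimal Python sketch that plugs in the 1D strengths listed further down in this post for Duke and Ohio St. The function name `predict_score` and the dictionary layout are illustrative choices for this example, not part of the contest code.

```python
# 1D strengths copied from the Offenses and Defenses lists in this post.
# The dictionary layout is an assumption made for this example.
strengths = {
    "Duke": {"offense": 10.43, "defense": 7.29},
    "Ohio St.": {"offense": 9.97, "defense": 6.87},
}


def predict_score(team_i, team_j, table):
    """Return (predicted score for i, predicted score for j)."""
    score_i = table[team_i]["offense"] * table[team_j]["defense"]
    score_j = table[team_j]["offense"] * table[team_i]["defense"]
    return score_i, score_j


duke_pts, osu_pts = predict_score("Duke", "Ohio St.", strengths)
print(f"Duke {duke_pts:.1f}, Ohio St. {osu_pts:.1f}")
```

Note that each team's predicted score depends on its own offense and on the opponent's defense, so a team with a slightly weaker offense can still come out ahead if its defense is strong enough.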

The learning algorithm looks over all games played this season and tries to find a setting of offensive and defensive strengths for each team such that scores predicted by the model best match the actual outcomes observed*. (If you want the short answer, this is achieved via the miracles of calculus.)
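For readers who want the slightly longer answer, the fitting step can be sketched roughly as below. This is not the code used for the contest: it assumes a squared-error loss, an L2 penalty as the regularizer, and plain gradient descent, and the function name `fit_strengths` along with the values of `reg`, `lr`, and `n_steps` are hypothetical choices that happen to work on a tiny toy schedule.

```python
import numpy as np


def fit_strengths(games, n_teams, reg=0.01, lr=5e-4, n_steps=5000, seed=0):
    """Fit 1D offensive/defensive strengths by gradient descent.

    games: list of (i, j, score_i, score_j) with integer team indices.
    Assumed loss: 0.5 * squared error + 0.5 * reg * (|offense|^2 + |defense|^2).
    """
    rng = np.random.default_rng(seed)
    offense = 1.0 + 0.1 * rng.random(n_teams)
    defense = 1.0 + 0.1 * rng.random(n_teams)

    i = np.array([g[0] for g in games])
    j = np.array([g[1] for g in games])
    s_i = np.array([g[2] for g in games], dtype=float)
    s_j = np.array([g[3] for g in games], dtype=float)

    for _ in range(n_steps):
        # Prediction errors for both teams in every game.
        err_i = offense[i] * defense[j] - s_i
        err_j = offense[j] * defense[i] - s_j

        # Gradients: regularization term plus per-game contributions.
        g_off = reg * offense
        g_def = reg * defense
        np.add.at(g_off, i, err_i * defense[j])
        np.add.at(g_off, j, err_j * defense[i])
        np.add.at(g_def, j, err_i * offense[i])
        np.add.at(g_def, i, err_j * offense[j])

        offense -= lr * g_off
        defense -= lr * g_def
    return offense, defense


# Toy schedule (made-up scores, three teams), just to exercise the function.
toy_games = [(0, 1, 75, 63), (0, 2, 80, 56), (1, 2, 72, 60)]
toy_offense, toy_defense = fit_strengths(toy_games, n_teams=3)
print(toy_offense * toy_defense[[1, 2, 0]])  # predictions for 0v1, 1v2, 2v0
```

One thing the sketch makes visible is that only the products of offense and defense are pinned down by the scores: multiplying every offense by a constant and dividing every defense by the same constant gives identical predictions. Under the L2 penalty assumed here, the regularizer is what settles the overall scale.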

What I'm Showing: I will first report a combined measure, which takes the offensive strength and subtracts the defensive strength. If you think of having each team play against a baseline team with offensive and defensive strength 1, then the difference tells you how much you expect the team to win by (or, if it's negative, to lose by). Afterwards, I show the top offenses and the top defenses, along with their strengths learned by the model. In all cases, I report the top 50 teams. Keep in mind that the algorithm knows nothing about rankings, players, or anything other than the final score of each game. Also keep in mind that I know less than the algorithm about what happened this year in college basketball. There are some important caveats in the comments under this post.
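As a quick illustration of the combined measure, the sketch below computes offense minus defense for three teams using the values from the Offenses and Defenses lists in this post, then sorts by the result. The variable names are illustrative only, and because the listed strengths are rounded to two decimals, the last digit can differ slightly from the Combined Rating list.

```python
# (offense, defense) pairs copied from the lists in this post.
teams = {
    "Kansas": (10.39, 7.32),
    "Duke": (10.43, 7.29),
    "Ohio St.": (9.97, 6.87),
}

# Against a baseline team with offense 1 and defense 1, a team scores
# offense * 1 and allows 1 * defense, so the margin is offense - defense.
combined = {name: off - dfn for name, (off, dfn) in teams.items()}

# Rank from the largest expected margin to the smallest.
ranking = sorted(combined, key=combined.get, reverse=True)
for name in ranking:
    print(f"{name} ({combined[name]:.2f})")
```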

If you'd like to reproduce these results at home, follow the instructions and run the code in the next post.

So without further ado, here are the outputs. You can use these to predict the score of any game by plugging into the formula above. (Only the top 50 teams are shown in the post for each measure. The outputs for all teams in the database are here.)

Combined Rating

Duke (3.15)
Ohio St. (3.10)
Kansas (3.07)
Washington (2.71)
Pittsburgh (2.64)
Texas (2.57)
Kentucky (2.52)
Purdue (2.43)
Louisville (2.42)
BYU (2.41)
Notre Dame (2.41)
Syracuse (2.39)
North Carolina (2.34)
San Diego St. (2.26)
Wisconsin (2.21)
Connecticut (2.15)
Georgetown (2.01)
Missouri (2.00)
Arizona (2.00)
Illinois (1.99)
Cincinnati (1.96)
West Virginia (1.95)
Vanderbilt (1.95)
Florida (1.94)
Marquette (1.93)
UNLV (1.93)
Villanova (1.92)
Kansas St. (1.84)
Maryland (1.83)
Gonzaga (1.78)
Michigan St. (1.72)
St. Mary's (1.72)
Virginia Tech (1.72)
St. John's (1.70)
Utah St. (1.68)
Washington St. (1.65)
Florida St. (1.63)
Texas A&M (1.63)
Belmont (1.62)
Clemson (1.59)
Xavier (1.58)
Michigan (1.57)
Temple (1.55)
New Mexico (1.50)
George Mason (1.49)
UCLA (1.47)
Minnesota (1.46)
Northwestern (1.44)
USC (1.43)

Offenses

Washington (10.72)
Duke (10.43)
Kansas (10.39)
BYU (10.32)
Oakland (10.20)
Missouri (10.18)
Ohio St. (9.97)
North Carolina (9.89)
Virginia Military (9.87)
Notre Dame (9.79)
Kentucky (9.76)
Vanderbilt (9.73)
Louisville (9.69)
Marquette (9.67)
Maryland (9.65)
Arizona (9.64)
Providence (9.62)
Pittsburgh (9.56)
Long Island (9.56)
St. Mary's (9.56)
Connecticut (9.54)
Syracuse (9.54)
La Salle (9.49)
Texas (9.49)
South Dakota St. (9.47)
Purdue (9.47)
Iona (9.46)
California (9.42)
Duquesne (9.42)
Belmont (9.40)
Villanova (9.39)
Georgetown (9.37)
Gonzaga (9.35)
Texas Tech (9.35)
Iowa St. (9.30)
Washington St. (9.29)
Mississippi (9.29)
Illinois (9.29)
Northwestern (9.29)
UNLV (9.25)
Boston Coll. (9.21)
Kansas St. (9.20)
Florida (9.18)
Xavier (9.17)
Detroit (9.14)
New Mexico (9.11)
St. John's (9.10)
Michigan St. (9.06)
Charleston (9.06)

Defenses

Wisconsin (6.61)
San Diego St. (6.74)
New Orleans (6.79)
Cincinnati (6.84)
Utah St. (6.85)
Ohio St. (6.87)
Texas (6.92)
Pittsburgh (6.92)
USC (6.96)
Alabama (6.98)
Penn St. (6.99)
Old Dominion (6.99)
West Virginia (7.00)
Texas A&M (7.01)
Clemson (7.02)
Michigan (7.03)
Purdue (7.04)
Cal Poly (7.05)
Virginia (7.12)
Virginia Tech (7.13)
Florida St. (7.15)
Syracuse (7.15)
Drexel (7.18)
Temple (7.22)
Kentucky (7.23)
Florida (7.24)
Stephen F. Austin (7.25)
Louisville (7.26)
Duke (7.29)
Richmond (7.29)
Illinois (7.30)
Kansas (7.32)
UNLV (7.32)
Michigan St. (7.34)
Fairfield (7.34)
Saint Louis (7.35)
Georgetown (7.36)
Kansas St. (7.36)
St. Peter's (7.37)
Northern Iowa (7.37)
Denver (7.38)
Notre Dame (7.39)
UAB (7.39)
St. John's (7.39)
Connecticut (7.39)
Montana (7.40)
UCLA (7.40)
Seton Hall (7.41)
South Florida (7.42)

* There's also some regularization.