## Wednesday, March 18, 2009

### Data driven march madness predictions

Posted by Danny Tarlow

I didn't really pay attention to college basketball this year, so I decided to take a different approach to filling out my bracket.

I started by downloading the full Division 1 Men's Basketball schedule (scraped from from rivals.yahoo.com), along with the score of each game, the date, and the home team.

The Model
For the model, I assume that each team has two (unknown) vectors of real numbers describing how good its offense and defense are in several attributes, respectively. For example, we might want to represent how good the guards on each team are, how good the forwards are, and how good the centers are -- both at offense and defense. We could do this using an offensive and defensive vector:
Offense: [5, 10, 4]
Defense: [2, 3, 10]

which would mean that the guards are a 5 on offense and a 2 on defense, etc. In my model, it's going to be easier if we assume that high numbers are better for offenses, and low numbers are better for defenses.

The score of a game between team i and team j can then be generated as the dot product of team i's offensive vector with team j's defensive vector, and vice versa. In our running example, if our team from before played a team with vectors:
Offense: [3, 2, 4]
Defense: [2, 5, 5]

Then the first team's score would be predicted to be
5 * 2 + 10 * 5 + 4 * 5 = 80

and the second team's score would be predicted to be
3 * 2 + 2 * 3 + 4 * 10 = 52

a real blowout!

Learning
Now, the only problem is that we don't actually know the vectors describing each team's offense and defense. That's ok -- we'll learn them from the data.

Formally, the goal is to find latent matrices O and D that minimize the sum of squared error between predicted scores and observed scores. In math,

\sum_g (score_{g(i)} - O_i: • D_j:)^2 + (score_{g(j)} - O_j: • D_i:)^2

where I use the notation that team i played team j in game g (i and j are dependent on g, but I drop this dependence in the notation to keep things simpler)*.

I won't go into details, but we can take the derivative of the error function with respect to each latent vector in order to find changes to the vectors that will make them closer match the results of all of the games earlier in the season. I repeat this until there aren't any changes that will improve the error (batch gradient descent, for the detail-oriented folks out there).

Results
In the case where I choose latent vectors to be 1 dimensional, I get as output an offensive and defensive rating for each team. Remember, to predict the first team's score against another team, multiply the first team's offensive rating (higher is better) by the second team's defensive rating (lower is better).

Here are the top 10 offenses and defenses, as learned by the 1D version of my model:

Offenses
North Carolina (9.79462281797)
Pittsburgh (9.77375501699)
Connecticut (9.74628326851)
Memphis (9.71693872544)
Louisville (9.69785532917)
Duke (9.65866585522)
UCLA (9.59945808934)
West Virginia (9.56811566735)
Arizona St. (9.56282860536)
Missouri (9.55043151623)

Defenses
North Carolina (7.02359489844)
Pittsburgh (7.0416251036)
Memphis (7.05499448413)
Connecticut (7.07696194481)
Louisville (7.14778041166)
Duke (7.18950625894)
UCLA (7.21883856723)
Gonzaga (7.22607569868)
Kansas (7.2289767174)
Missouri (7.2395184452)

And here are the results of simulating the full tournament with a 5 dimensional model. For each game, I report the predicted score, but for the bracket I just chose the predicted winner.

==================== ROUND 1 ====================
Ohio St. 74.9907105909, Siena 69.6702059811
Utah 69.7205426091, Arizona 69.2592708246
Wake Forest 72.3264784371, Cleveland St. 64.3143396939
West Virginia 66.7025939102, Dayton 57.550404701
Kansas 84.0565034675, North Dakota St. 71.281863854
Boston Coll. 65.0669174572, USC 68.7027018576
Michigan St. 77.3858437718, Robert Morris 59.6407479

Connecticut 91.9763662649, Chattanooga 63.9941388666
BYU 74.7464520646, Texas A&M 70.5677646712
Purdue 69.8634461612, Northern Iowa 59.4892887466
Washington 81.8475059935, Mississippi St. 74.6374151171
Marquette 73.4307446299, Utah St. 69.1796188404
Missouri 83.8888903275, Cornell 68.1053984941
California 74.9638076999, Maryland 71.2565877894
Memphis 78.3145709447, CSU Northridge 59.0206289492

Pittsburgh 85.5983991252, E. Tennessee St. 64.8099546261
Oklahoma St. 81.6131739754, Tennessee 81.8021658489
Florida St. 59.994769086, Wisconsin 60.9139371828
Xavier 77.3537694, Portland St. 63.8161558802
UCLA 76.790261041, VCU 65.2726887151
Villanova 72.9957948506, American 58.6863439306
Texas 64.5805075558, Minnesota 62.3595994418
Duke 85.084666484, Binghamton 61.1984347353

LSU 65.0807263343, Butler 64.9895028812
Illinois 70.6250577544, West. Kentucky 57.6646396014
Gonzaga 75.0447785407, Akron 61.0678281691
Arizona St. 64.7151394863, Temple 58.0578420156
Syracuse 74.7825424779, Stephen F. Austin 60.5056731732
Clemson 74.4054903161, Michigan 70.8395522274
Oklahoma 78.5992492855, Morgan St. 59.7587888038

==================== ROUND 2 ====================
Louisville 67.3059313968, Ohio St. 60.5835683909
Utah 71.3007847464, Wake Forest 73.2895225467
West Virginia 67.9574088476, Kansas 67.4869037187
USC 62.1192840465, Michigan St. 64.56295945

Connecticut 76.8719158147, BYU 71.8412099454
Purdue 74.245343296, Washington 73.6100911982
Marquette 76.4607554812, Missouri 80.5497967091
California 64.7143532135, Memphis 70.9373235427

Pittsburgh 79.1278381289, Tennessee 70.6786108051
Wisconsin 63.0943233452, Xavier 63.5379857382
UCLA 74.1282015782, Villanova 71.4919550735
Texas 66.3817261194, Duke 70.9875941571

North Carolina 86.2296333847, LSU 73.8695973309
Illinois 62.6218220536, Gonzaga 65.6078661776
Arizona St. 74.0588194422, Syracuse 71.254787147
Clemson 76.9943827197, Oklahoma 78.9108038697

==================== SWEET 16 ====================
Louisville 72.8097088102, Wake Forest 68.2411945982
West Virginia 66.1905929215, Michigan St. 65.2198396254

Connecticut 70.4975234274, Purdue 67.014115714
Missouri 66.6046145365, Memphis 69.9964130636

Pittsburgh 72.8975484716, Xavier 64.8486151341
UCLA 72.3676109557, Duke 73.1522519556

North Carolina 84.6606149747, Gonzaga 80.3910425893
Arizona St. 67.8668018941, Oklahoma 67.0441371239

==================== ELITE EIGHT ====================
Louisville 64.0822047092, West Virginia 61.7652102534

Connecticut 64.875382557, Memphis 65.9485921907

Pittsburgh 72.8027424093, Duke 70.5222034022

North Carolina 76.2640153058, Arizona St. 72.3363504426

==================== FINAL FOUR ====================
Louisville 60.7832463768, Memphis 61.4830569498

Pittsburgh 80.3421788636, North Carolina 81.0056716364

==================== FINAL GAME ====================
Memphis 73.8935857273, North Carolina 74.259537592


Here are the full 1D learned offensive and defensive outputs:

Let me know if you'd like the data I gathered or the code I wrote to make this work -- I'm happy to share.

* In addition, I regularize the latent vectors by adding independent zero-mean Gaussian priors (or equivalently, a linear penalty on the squared L2 norm of the latent vectors). This is known to improve these matrix-factorization-like models by encouraging them to be simpler, and less willing to pick up on spurious characteristics of the data.

Rolf Steier said...

haha, this is awesome!

Evan said...

A Professor at Georgia Tech has a model for predicting the tournament, check out:

http://www2.isye.gatech.edu/people/faculty/Joel_Sokol/

It seems like you get pretty similar results, but there are a few differences.

May the best model win!

Danny Tarlow said...

Evan -- thanks for that reference. It looks like the paper describing their method isn't up. I'll try to dig in and find it tonight and maybe post some comments about the differences in our methods.

In the meantime, for those who just want the predictions, here is the direct link to their predicted bracket.

nick said...

This is awesome man! I just filled out my bracket yesterday, and coincidentally I have similar results as you. Suddenly, I'm feeling a lot more confident about my picks! haha!

Zack said...

I saw your post on 538.com, and I'm also hoping that Mr. Silver gives us his bracket.

Regarding your model, it doesn't appear that you did anything to account for strength of schedule. It seems that your analysis would favor teams that built their data set against inferior competition. Indeed, it seems like this is the case, with teams from weaker conferences like the Pac10 an Conference USA doing significantly better in your bracket than in other prognostications.

Still, intriguing model.

nick said...

Also...I don't have much of a background in Machine Learning, but I'm part of DGP here at UofT and I'd be excited to see the code and better understand how you did this. Really cool!

Danny Tarlow said...

Hi Zack,

I agree. It would be great if Nate would take a crack at the bracket -- I wonder if there is anything he learned from his baseball days that would be applicable here.

In regards to your comment about strength of schedule, you're right that it's not explicitly included in the model.

However, it is implicitly taken into account.

For example, suppose team A is much better than B, and crushes B. Then there will be pressure on B's latent vectors to be weak (low offense values and high defense values) and pressure on A's to be strong.

Now imagine that team C comes along and is either going to beat team A or team B. In order to beat team C, it just needs medium strength offense and defense, but in order to overcome A's low defensive values and high offensive values, it's going to need high offensive values and low defensive values itself. In this way, we see that the value of a victory over teams _is_ dependent on the strength of the other team.

Now, the counter-argument is that there might not be that much flow of information between different regions of the model due to the league structure of the NCAA. If there is a weak connection (e.g., only a few teams from the PAC-10 played teams from the Big East), and there were some anomalies in those games, then the model could get confused. I think you're right this might be a valid concern, but I haven't done much to test for its effect.

Danny Tarlow said...

Hey Nick,

Thanks, but don't trust these too much! It's just a fun little project I did to kill some time on an airplane -- I'm not about to put any money behind my predictions =P

And sure, send me an email and I'll send you the code: dtarlow@cs

GBOGH said...

Can I get the code as well please? I've been trying to do some learning algorithms like this as well.

Thanks.
Matt

Jeremy said...

I'd like to see the ratings you gave all of the teams. The ratings seem pretty arbitrary. I don't know if you know anything about college basketball, so why would I trust your predictions?

Danny Tarlow said...

Hey Jeremy,

I added links to the learned 1D offense and defense ratings for every Division 1 team. Let me know if there is something systematic that you think is off.

I will be the first to admit that I know very little about college basketball, so you should definitely be skeptical. I'm not doing anything more than interpreting data about scores over the course of the season in a particular way. If you don't like my interpretation, then you should by all means ignore it -- I'm just explaining what I did (which I think is at least somewhat reasonable) and what resulted from me doing it.

Peter Prakash said...

Danny,

Care to share your email id, so I can contact you to get the code? I would love to take a look.

Thanks,
Peter

Danny Tarlow said...

Peter: my email is dtarlow at cs.toronto.edu.

Will Dwinnell said...

It's the revenge of the nerds: Time to make some money!

twangbio said...

Can i get the code please? Interesting to look into it.
Thanks.

Danny Tarlow said...

twangbio (and anybody else) -- I'm happy to send you the code, but please leave a way for me to contact you. Better yet, send me email: dtarlow at cs toronto edu