Tuesday, March 19, 2013

March Madness Team Embeddings

Posted by Danny Tarlow
I went with a new approach to Machine March Madness predictions this year. I won't go into the details right now, but here's a neat visualization that comes out of the algorithm. What you need to know is that I'm sticking with the basic original idea of using latent real-valued descriptors for each team, but I'm dropping the requirement that each team have separate offensive and defensive descriptors. Instead, this year's model represents each team with a single set of numbers that is used to explain both offensive and defensive performance.
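For readers who want something concrete to play with, here is a rough sketch of the general idea only. It is not the model used for this post, whose details aren't given here. It assumes that the points team i scores against team j can be approximated by a bilinear function of the two teams' descriptors, fit by stochastic gradient descent on squared error. The bilinear form, the names `fit_embeddings`, `predict_points`, and `mean_squared_error`, the hyperparameters, and the toy scores for teams "A" through "D" are all made up for illustration.

```python
import random

def bilinear(vi, M, vj):
    """Compute v_i^T M v_j for small lists of floats."""
    return sum(vi[a] * M[a][b] * vj[b]
               for a in range(len(vi)) for b in range(len(vj)))

def fit_embeddings(games, dim=2, epochs=500, lr=0.05, seed=0):
    """Fit one `dim`-number descriptor per team from final scores.

    games: list of (team_i, team_j, points_i, points_j) tuples.
    Assumed model (not the post's): standardized points scored by i
    against j are approximated by bias + v_i^T M v_j.
    """
    rng = random.Random(seed)
    teams = sorted({t for g in games for t in g[:2]})
    points = [p for g in games for p in g[2:]]
    mean = sum(points) / len(points)
    scale = 10.0  # assumed rough spread of scores; keeps targets near +/-1

    vecs = {t: [rng.gauss(0, 0.5) for _ in range(dim)] for t in teams}
    M = [[rng.gauss(0, 0.5) for _ in range(dim)] for _ in range(dim)]
    bias = 0.0

    # Each game gives two training examples, one per team's score.
    examples = []
    for i, j, pi, pj in games:
        examples.append((i, j, (pi - mean) / scale))
        examples.append((j, i, (pj - mean) / scale))

    for _ in range(epochs):
        rng.shuffle(examples)
        for i, j, y in examples:
            vi, vj = vecs[i][:], vecs[j][:]  # copies of the old values
            err = bias + bilinear(vi, M, vj) - y
            # Gradient steps on 0.5 * err**2, all using the old parameters.
            for a in range(dim):
                vecs[i][a] -= lr * err * sum(M[a][b] * vj[b] for b in range(dim))
            for b in range(dim):
                vecs[j][b] -= lr * err * sum(vi[a] * M[a][b] for a in range(dim))
            for a in range(dim):
                for b in range(dim):
                    M[a][b] -= lr * err * vi[a] * vj[b]
            bias -= lr * err
    return {"vecs": vecs, "M": M, "bias": bias, "mean": mean, "scale": scale}

def predict_points(model, i, j):
    """Predicted points for team i against team j, back on the raw scale."""
    z = model["bias"] + bilinear(model["vecs"][i], model["M"], model["vecs"][j])
    return model["mean"] + model["scale"] * z

def mean_squared_error(model, games):
    """Average squared error over both teams' scores in every game."""
    errs = []
    for i, j, pi, pj in games:
        errs.append((predict_points(model, i, j) - pi) ** 2)
        errs.append((predict_points(model, j, i) - pj) ** 2)
    return sum(errs) / len(errs)

# Toy, made-up scores for four hypothetical teams (not real game data).
TOY_GAMES = [
    ("A", "B", 78, 65), ("A", "C", 70, 72), ("A", "D", 85, 60),
    ("B", "C", 66, 71), ("B", "D", 74, 69), ("C", "D", 80, 62),
]

model = fit_embeddings(TOY_GAMES)
for team, (x, y) in sorted(model["vecs"].items()):
    print(f"{team}: x={x:.2f}, y={y:.2f}")
```

With `dim=2`, each team's learned pair of numbers can be read directly as x and y coordinates. Because a single vector per team appears on both sides of the bilinear form, the same numbers have to account for points scored and points allowed.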

So I'll skip all of the details and jump straight to showing you what the model has learned from this year's regular season. Below is a visualization of what happens when I ask the model to use two numbers to describe each team and then plot the learned numbers as x and y coordinates on a standard scatter plot.

These results lose the easy interpretability as offensive and defensive strengths, but the model is such that teams in similar locations on the plot will typically be predicted to perform similarly. To help with eyeballing the results, I've color coded 1 through 4 seeds: #1 seeds are blue, #2's are green, #3's are red, and #4's are magenta.
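To make "similar locations" a bit more concrete, here is a small sketch that lists the teams closest to a given team on the plot. Euclidean distance is an assumed stand-in for "close on the plot", the function name `nearest_teams` is invented for this example, and the coordinates are placeholders, not values read off the actual figure.

```python
import math

def nearest_teams(coords, team, k=3):
    """Return the k teams closest to `team` by Euclidean distance."""
    x0, y0 = coords[team]
    distances = [
        (math.hypot(x - x0, y - y0), name)
        for name, (x, y) in coords.items()
        if name != team
    ]
    return [name for _, name in sorted(distances)[:k]]

# Placeholder coordinates for hypothetical teams (not from the real plot).
coords = {
    "Team A": (-1.2, -0.8),
    "Team B": (-1.0, -0.9),
    "Team C": (0.9, 1.1),
    "Team D": (1.3, 0.7),
    "Team E": (-0.2, 0.1),
}

print(nearest_teams(coords, "Team A", k=2))  # ['Team B', 'Team E']
```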

I won't try too hard to explain what's going on, but it does seem to group the stronger teams in the lower and left parts of the plot, and the weaker teams in the upper and right parts. Anybody notice any other interesting patterns?


Scott Turner said...

Interesting -- does this just use game scores to drive the ratings? Iona is the #2 scoring team in the nation, and Georgetown is something like 250th, so it suggests that diagonal is "scoring" but maybe I'm just seeing something that's not there.

Danny Tarlow said...

Interesting! Based on the top and bottom teams in terms of points per game (which I'm looking at here: http://slice-publish.s3-website-us-east-1.amazonaws.com/rrqTvopyJGc/# ), that explanation does seem to fit.

And it does just use game scores as the supervision in a manner similar to my previous models, where the descriptors are modulated by the opposing team's descriptors.

Ben said...

What dataset did you use to create this?

Danny Tarlow said...

The scores from all games this season, from the GitHub repo:

hambown said...

Not sure if you can reveal your secrets here, but is this some form of k-SNE or t-SNE?

Danny Tarlow said...

Nope, no SNE here.