Tuesday, March 8, 2011

Predicting March Madness: Tournament Structure and Bracket Scoring Measures

Posted by Danny Tarlow
This is the second post aimed at getting you all excited about entering the 2011 March Madness computer predictions challenge. If you missed George's earlier post, go check it out first, to get the creative juices flowing.

George wrote about the problem of given two teams, how likely is it is that one will beat the other. What I will talk about now is how the tournament structure comes into play--the concerns that go beyond which of two teams will win the next game they're scheduled to play against each other. I'll assume throughout that we have at our disposal a good model of individual games.

The first natural question is whether any additional thought is necessary. If we have a good model of individual games, can't we just greedily use the model to predict all first round outcomes, then all second round outcomes, ..., then be done? Maybe it's not such a bad idea. That's what I've done the past few years, after all.

I'm going to argue that there may be good reasons to work on factoring the tournament structure and contest scoring rules into your model. Let's begin by thinking about the tournament probabilistically. In the most general setup, there is a probability distribution P over the joint space of all outcomes. In a bracket with 64 teams (ignore the play-in game for now), there are 2^32 possible results for the first round. Given those, there are 2^16 possible results for the second round. Given those, 2^8, then 2^4, then 2^2, then 2 possible outcomes for later rounds. Multiplying all those, we get 2^63 possible joint outcomes. A model should be to be able to assign a probability to any of those full outcomes.

Compact representation of probabilities (aka the model): 2^63 is a very large number, so it's not feasible to represent the probability of each outcome as a separate entry in a very large table. Beyond the practical concerns, it also wouldn't be a good idea from a statistical estimation perspective--such a model would make inefficient use of training data.

To illustrate this, imagine for a moment that the same tournament happens repeatedly (the same teams start in each position of the bracket, but there may be different outcomes), and suppose we want to use data from the last N tournaments to estimate the probability of outcomes in the next repetition of the tournament. One approach is to just count how often each full bracket resulted in the last N tournaments. This would use on the order of 2^63 parameters. If N is infinite (or _extremely_ large), then this is a perfectly good approach; however, when we have limited data (i.e., nearly always), nearly all entries in the big table will be zero; in this case, it is important to have a more compact representation.

The standard solution is to find a more compact representation of P. Here's one, which we'll call the fully factorized model: the probability of a full tournament outcome P(g_1, ..., g_63) over 63 games g_1 ... g_63 is proportional to f(g_1) * f(g_2) * ... * f(g_63). For each game, we can look at who the participants are, then estimate f by looking over the last N repetitions of the tournament where the appropriate two teams met in a game. It should be clear that there are exponentially many fewer parameters in the compact model than there are in the big joint table model--in the simplest version, we just need to represent one number for each pair of teams, so there are on the order of M^2 parameters if there are M teams in the tournament. (Going further, which is necessary in practice, when there is only one tournament per year, if we were to use e.g., a matrix factorization model, we could represent an "offensive strengths" vector (or scalar) and a "defensive strengths" vector (or scalar), then we could produce

f(team i, team j) = g(team i offense * team j defense - team i defense * team j offense),

which requires only on the order of M parameters.)

Regardless of how f is computed, the fully factorized model makes some practical sense as well, because on first pass, we typically don't expect the result of an early game in the South bracket to affect the outcome of an early game in the East bracket, for example. In the fully factorized model, that assumption is built in. Statistically, the rough basic intuition is that all else being equal, if we have fewer parameters in the model, we need fewer past data instances in order to pin down their values accurately.

The flip side of using a simpler model is that we lose the ability to model certain interactions. Let's think critically about this: do we really believe that the outcome of each game depends only on which teams are playing in the current game? If you listen to sports announcers, it's not clear to me that the answer is yes. For example,
• Some teams might do better or worse under pressure, players might get tired after several intense games in a row, or the location of the game affect the outcome. In all these cases, the round in which the game takes place could be important.
• If an underdog pulls off a big upset over a top seed, the underdog might gain confidence and become stronger in their next games.
• If a top seed just barely beats a big underdog, the top seed might either lose confidence or snap out of a lull.
• If an underdog in say the East bracket beats a top seed in say the South bracket on Day 1, it might give other underdogs a boost of confidence on Day 2.
• Suppose we have a structure where A is playing B and C is playing D, then the winners play each other. If B is weak and C is a big rival of A's, it's possible that A could look past B to C and not come prepared for the game against B.

Are these things important to model? I'm not sure, but it would be interesting to try. If we do, we'll need a more complex representation of our tournament outcome model than the simple fully factorized approach gives.

Loss functions: Now let's switch gears and consider precisely what our goal is. I'll assume that each bracket challenge contestant can only enter one bracket into the contest. In the simplest case, we might be concerned only with whether we get the perfect bracket or not; this would be the case in a contest where a perfect bracket wins \$5M, but there is no payout for a less-than-perfect bracket. Then, the optimal strategy is to choose the joint outcome that the model thinks is most likely. The optimality is easy to see, because the expected value is equal to \$0 * P(not perfect) + \$5M * P(perfect). So assuming the model captures all we know about the tournament, we should clearly choose the outcome that the model assigns the highest probability.

What about the case where the goal is to get as many predictions right as possible?

Maybe we are equally unhappy if we do worse than the baseline that makes all predictions based solely on seed, then we're linearly more happy as we beat that baseline by more. (For example, if you're betting against your mom, and she has a history of always choosing the better seed.)

Or maybe you are linearly more happy as you get more "points", but points are determined by some contest-specific scoring rules, like the first round is worth 1 point, the second round is worth 2, the third 4, etc.

Whatever your goal, there is likely some loss function that can be formulated to capture your desires for a model. Formally, a loss function is just a function of a model's prediction and the ground truth that tells how unhappy we are with the prediction. If the prediction is perfect (all games right!), loss should be 0, and if the prediction is the worst possible (all games wrong?!) the loss should be maximum (let's say 1). Obviously we'd prefer to just have a perfect bracket, but given such a hard problem, that's unrealistic. The job of a loss function is to tell the model how to trade off different kinds of mistakes. It's rarely the case that we really think there is one bracket that is perfect and all others (like the one that gets everything right except the play-in game) are equally bad, so formulating a good loss function generally takes some thought.

These types of questions are near and dear to my heart (though I generally think more about them in the context of computer vision than March Madness), so hopefully you now have some intuitions about why we might want to go beyond simple per-game models, and hopefully you're convinced that part of defining a good model is defining a good loss function.

A question that remains is how to actually build such a model. How do we formulate this idea that we are making multiple correlated predictions at once (predicting the whole bracket rather than just one game at a time), and how do we ask a learning algorithm to try to optimize a particular loss function? I'm glad you asked! Luckily, there is a whole subfield of machine learning that looks at these types of problems. It goes by the name of structured prediction, or structured output learning.

I'd write more about it here, but Hal DaumÃ© III has written a great series on the exact subject over at his blog. So if you are interested in more, I encourage you to go check out the following posts:
And it should go with out saying: keep working on those models! The competition will be starting soon!

1 comment:

Tlow said...

The bit about choosing good loss functions is a nice general point that I find is ignored in many machine learning set ups: if you're going to work so hard making a model optimize some objective, make sure it's the right objective!