*Posted by Danny Tarlow*

This is the second post aimed at getting you all excited about entering the
2011 March Madness computer predictions challenge. If you missed George's
earlier post, go check it out first, to get the creative juices flowing.

George wrote about the problem of estimating, given two teams, how likely
it is that one will beat the other. What
I will talk about now is how the tournament structure comes into play--the concerns
that go beyond which of two teams will win the next game they're scheduled to play
against each other. I'll assume throughout that we have at our disposal a good model
of individual games.

The first natural question is whether any additional thought is necessary. If
we have a good model of individual games, can't we just greedily use the model to
predict all first round outcomes, then all second round outcomes, ..., then be done?
Maybe it's not such a bad idea. That's what *I've* done the past few years, after all.
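To make the greedy strategy concrete, here is a minimal sketch of filling in a bracket round by round. The `win_prob` function is a hypothetical stand-in for whatever per-game model you have (here a toy rule, purely for illustration):

```python
def win_prob(team_a, team_b):
    # Toy stand-in for a real per-game model: the team with the lower
    # index is always the favorite. Replace with your own model.
    return 0.9 if team_a < team_b else 0.1

def greedy_bracket(teams):
    """Predict a full bracket by greedily advancing each game's favorite."""
    rounds = []
    current = list(teams)
    while len(current) > 1:
        winners = [a if win_prob(a, b) >= 0.5 else b
                   for a, b in zip(current[::2], current[1::2])]
        rounds.append(winners)
        current = winners
    return rounds

rounds = greedy_bracket(range(8))  # toy 8-team bracket
```

Note that this never looks at the bracket as a whole: each pick depends only on the two teams in that game.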

I'm going to argue that there may be good reasons to work on
factoring the tournament structure and contest scoring rules into your model.
Let's begin by thinking about the tournament probabilistically.
In the most general setup, there is a probability distribution P over the
joint space of all outcomes. In a bracket
with 64 teams (ignore the play-in game for now), there are 2^32 possible results
for the first round. Given those, there are 2^16 possible results for the second
round. Given those, 2^8, then 2^4, then 2^2, then 2 possible outcomes for
later rounds. Multiplying all those, we get 2^63 possible joint outcomes.
A model should be able to assign a probability to any of those full outcomes.
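The counting argument above is easy to sanity-check:

```python
# A 64-team single-elimination bracket has 32 + 16 + 8 + 4 + 2 + 1 = 63
# games, and each game has 2 possible outcomes, giving 2^63 joint outcomes.
games_per_round = [32, 16, 8, 4, 2, 1]
total_games = sum(games_per_round)    # 63
total_outcomes = 2 ** total_games     # 2^63 possible full brackets
```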

**Compact representation of probabilities (aka the model):**
2^63 is a very large number, so it's not feasible to represent the probability
of each outcome as a separate entry in a very large table. Beyond the practical
concerns, it also wouldn't be a good idea from a statistical estimation perspective--such
a model would make inefficient use of training data.

To illustrate this, imagine for a moment that the same tournament happens repeatedly
(the same teams start
in each position of the bracket, but there may be different outcomes), and suppose we
want to use data from the last N tournaments
to estimate the probability of outcomes in the next repetition of the tournament.
One approach is to just count how often each full bracket resulted in the last N tournaments.
This would use on the order of 2^63 parameters.
If N is infinite (or _extremely_ large), then this is a perfectly good approach; however,
when we have limited data (i.e., nearly always), nearly all entries in the big table will
be zero; in this case, it is important to have a more compact representation.

The standard solution is to find a more compact representation of P. Here's one, which
we'll call the *fully factorized* model:
the probability of a full tournament outcome P(g_1, ..., g_63) over 63 games g_1 ... g_63
is proportional to f(g_1) * f(g_2) * ... * f(g_63). For each game, we can look
at who the participants are, then estimate f by looking over the last N repetitions
of the tournament where the appropriate two teams met in a game.
It should be clear that there are exponentially many fewer parameters in the compact model
than there are in the big joint table model--in the simplest version, we just need to represent one number for
each pair of teams, so there are on the order of M^2 parameters if there are M teams in the
tournament. (Going further, which is necessary in practice when there is only one tournament per year, if we were to use e.g., a
matrix factorization model, we could represent an "offensive strengths" vector (or scalar) and a "defensive
strengths" vector (or scalar), then produce
f(team i, team j) = g(team i offense * team j defense - team i defense * team j offense),
which requires only on the order of M parameters.)
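Here is a sketch of this fully factorized model with the rank-1 strengths parameterization, taking g to be a sigmoid. The offense and defense numbers are invented purely for illustration:

```python
import math

# Made-up per-team strengths, for illustration only.
offense = {"A": 1.2, "B": 0.8, "C": 1.0}
defense = {"A": 0.9, "B": 1.1, "C": 1.0}

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def f(i, j):
    """P(team i beats team j) under the strengths model."""
    return sigmoid(offense[i] * defense[j] - defense[i] * offense[j])

def bracket_probability(games):
    """Fully factorized model: P(full outcome) is the product of
    per-game probabilities. `games` is a list of (winner, loser) pairs."""
    p = 1.0
    for winner, loser in games:
        p *= f(winner, loser)
    return p
```

A nice property of this parameterization is that f(i, j) + f(j, i) = 1 automatically, since swapping the teams negates the argument of the sigmoid.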

Regardless of how f is computed, the fully factorized model makes some practical sense as well,
because on first pass, we typically don't expect the result of an early game in the South bracket to affect
the outcome of an early game in the East bracket, for example. In the fully factorized model,
that assumption is built in. Statistically, the rough basic intuition is that all else being
equal, if we have fewer parameters in the model, we need fewer past data instances in
order to pin down their values accurately.

The flip side of using a simpler model is that we lose the ability to model certain
interactions. Let's think critically about this: do we really believe that the outcome
of each game depends only on which teams are playing in the current game? If you listen
to sports announcers, it's not clear to me that the answer is yes. For example,

- Some teams might do better or worse under pressure, players might get tired after several intense
games in a row, or the location of the game might affect the outcome.
In all these cases, the round in which the game takes place could be important.
- If an underdog pulls off a big upset over a top seed, the underdog might gain confidence and
become stronger in their next games.
- If a top seed just barely beats a big underdog, the top seed might either lose
confidence or snap out of a lull.
- If an underdog in say the East bracket beats a top seed in say the South bracket on Day 1, it
might give other underdogs a boost of confidence on Day 2.
- Suppose we have a structure where A is playing B and C is playing D, then the winners play each
other. If B is weak and C is a big rival of A's, it's possible that A could look past B to C and not
come prepared for the game against B.

Are these things important to model? I'm not sure, but it would be interesting to try.
If we do, we'll need a more complex representation of our tournament outcome model
than the simple fully factorized approach gives.

**Loss functions:** Now let's switch gears and consider precisely what our goal is.
I'll assume that each bracket challenge contestant can only enter one bracket into the contest.
In the simplest case, we might be concerned only with whether we get the perfect
bracket or not; this would be the case in a contest where a perfect bracket
wins $5M, but there is no payout for a less-than-perfect bracket. Then, the optimal
strategy is to choose the joint outcome that the model thinks is most likely. The optimality
is easy to see, because the expected value is equal to $0 * P(not perfect) + $5M * P(perfect).
So assuming the model captures all we know about the tournament,
we should clearly choose the outcome that the model
assigns the highest probability.
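The argmax argument can be illustrated on a toy bracket small enough to enumerate. The probabilities below are invented, and the payout structure is the hypothetical all-or-nothing contest just described:

```python
from itertools import product

# Toy 4-team bracket: 3 games, each encoded as 0 (favorite wins) or
# 1 (upset). Per-game probabilities are made up for illustration.
P_FAVORITE = [0.8, 0.6, 0.55]

def outcome_prob(outcome):
    """P(joint outcome) under a simple independent per-game model."""
    p = 1.0
    for g, pf in zip(outcome, P_FAVORITE):
        p *= pf if g == 0 else 1.0 - pf
    return p

# With an all-or-nothing prize, expected payout is prize * P(exactly right),
# so the best entry is simply the most probable joint outcome.
best = max(product([0, 1], repeat=3), key=outcome_prob)
expected_payout = 5_000_000 * outcome_prob(best)
```

(In this independent toy model the most probable joint outcome coincides with picking each game's favorite, but that equivalence breaks once the model allows dependencies between games.)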

What about the case where the goal is to get as many predictions right as possible?

Maybe we are equally unhappy with any result that does worse than a baseline that makes all
predictions based solely on seed, and then linearly happier as we beat that baseline by more.
(For example, if you're betting against your mom, and she has a history of always choosing
the better seed.)

Or maybe you are linearly more happy as you get more "points", but points are determined
by some contest-specific scoring rules, like the first round is worth 1 point, the second
round is worth 2, the third 4, etc.

Whatever your goal, there is likely some *loss function* that can be formulated to
capture your desires for a model.
Formally, a loss function is just a function of a model's prediction and the ground
truth that tells how unhappy we are with the prediction. If the prediction is perfect (all games right!),
loss should be 0, and if the prediction is the worst possible (all games wrong?!) the
loss should be maximum (let's say 1).
Obviously we'd prefer to just have a perfect bracket, but given such a hard problem,
that's unrealistic. The job of a loss function is to tell the model how to trade off
different kinds of mistakes. It's rarely the case that we really think there is
one bracket that is perfect and all others (like the one that gets
everything right except the play-in game) are equally bad, so formulating
a good loss function generally takes some thought.
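As one concrete example, here is a sketch of a loss built from the contest-style doubling scoring rule mentioned above (1 point per first-round game, 2 per second-round game, and so on), normalized so that a perfect bracket gets loss 0 and an all-wrong bracket gets loss 1:

```python
# Points per correct pick in each round (common doubling scheme) and the
# number of games per round in a 64-team bracket.
ROUND_POINTS = [1, 2, 4, 8, 16, 32]
GAMES_PER_ROUND = [32, 16, 8, 4, 2, 1]

MAX_POINTS = sum(p * g for p, g in zip(ROUND_POINTS, GAMES_PER_ROUND))

def bracket_loss(correct_per_round):
    """Loss in [0, 1]; correct_per_round[r] = games predicted right in round r."""
    points = sum(p * c for p, c in zip(ROUND_POINTS, correct_per_round))
    return 1.0 - points / MAX_POINTS
```

Note how this loss trades off mistakes: under the doubling scheme, getting the champion right is worth as much as getting every first-round game right, which is very different from a loss that weights all 63 games equally.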

These types of questions are near and dear to my heart (though I generally think more about
them in the context of computer vision than March Madness),
so hopefully you now have some intuitions about why we might want to go beyond simple
per-game models, and hopefully you're convinced that part of defining a good model is
defining a good loss function.

A question that remains is how to actually build such a model. How do we formulate
this idea that we are making multiple correlated predictions at once (predicting the
whole bracket rather than just one game at a time), and how do we ask a learning
algorithm to try to optimize a particular loss function? I'm glad you asked! Luckily,
there is a whole subfield of machine learning that looks at these types of problems.
It goes by the name of *structured prediction*, or *structured output learning*.

I'd write more about it here, but Hal Daumé III has written a great series on the
exact subject over at his blog. So if you are interested in more, I encourage you to go
check out his posts.

And it should go without saying: keep working on those models! The competition will be starting soon!