## Monday, March 12, 2012

### How to pick upsets?

Posted by Danny Tarlow

Scott Turner writes...
Doing well in a tournament picking contest probably comes down to picking the right upsets. Anyone can pick the higher seeds to win.

Define an upset as a lower-seeded team beating a higher-seeded one, and ignore upsets where the seeds differ by only one (e.g., a #9 beating a #8). If my math from last year is correct, the upset rate in the tournament is around 22%. About half of those upsets, roughly 7, happen in the first round.

http://harvardsportsanalysis.wordpress.com/2012/03/12/predicting-ncaa-tournament-upsets-the-importance-of-turnovers-and-rebounding/
http://courtsideanalyst.wordpress.com/2012/03/12/two-potential-ncaa-upset-picks-with-supporting-math/

I leave it to Danny / Lee to turn this into a blog posting :-)

My response...

From a machine learning perspective, I think Scott raises an interesting issue here. Let me rephrase the problem a little more abstractly, to more clearly get at the crux of the issue. Suppose that some oracle were to come down and tell us that exactly 15 of the games in this year's March Madness tournament will be upsets. How should this affect our prediction strategy?

There are probably two natural answers:
• Don't change anything. I have my prediction for each game, and I think it's going to lead to the largest number of correct predictions.
• Make my base predictions, but go back and find the games that I'm most uncertain about, and flip predictions until I am predicting exactly 15 upsets.

Both of these are reasonable strategies, but they imply different objective functions for our picks. If the goal is simply to get as many games right as possible, and we believe that our model captures all of the information we have about the outcomes (and that the game outcomes are statistically independent), then the first strategy still maximizes the expected number of correct picks. However, unless our model already predicts exactly 15 upsets, this choice eliminates us from contention for the \$5 million prize that Yahoo offers to anybody who picks a perfect bracket.

So if the goal is to win the \$5 million prize and you believe the oracle, then the right strategy is to pick the 15 upsets that the model thinks are most likely.
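To make the two objectives concrete, here is a minimal Python sketch. It assumes a flat list of games with independent outcomes, which ignores the fact that later-round matchups depend on earlier results. The function names and the probabilities in `upset_probs` are made up for illustration; they are not from any real model.

```python
def pick_max_expected(upset_probs):
    """Strategy 1: pick the upset only when the model thinks it is more likely than not."""
    return [q > 0.5 for q in upset_probs]


def pick_exactly_k_upsets(upset_probs, k):
    """Strategy 2: pick the k games with the highest upset probability as upsets."""
    ranked = sorted(range(len(upset_probs)),
                    key=lambda i: upset_probs[i], reverse=True)
    chosen = set(ranked[:k])
    return [i in chosen for i in range(len(upset_probs))]


def expected_correct(upset_probs, picks):
    """Expected number of correct picks, assuming independent games."""
    return sum(q if pick else 1.0 - q for q, pick in zip(upset_probs, picks))


# Hypothetical model outputs: probability that each game is an upset.
upset_probs = [0.10, 0.45, 0.30, 0.55, 0.20, 0.40]

unconstrained = pick_max_expected(upset_probs)
forced = pick_exactly_k_upsets(upset_probs, k=3)  # pretend the oracle said "3 upsets"

print(round(expected_correct(upset_probs, unconstrained), 2))  # 4.1
print(round(expected_correct(upset_probs, forced), 2))         # 3.8
```

Forcing three upsets costs some expected accuracy in this toy example (3.8 versus 4.1 correct picks), but under the independence assumption it is the most probable set of picks among those with exactly three upsets, which is what matters if only a perfect bracket pays out.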

However, while both of these strategies make some sense, they both seem too extreme. Perhaps the more natural objective should be to ensure that we win this year's Machine March Madness prediction contest. If that's our goal, what's the best strategy? What if we had the predictions from all of the competitors for past years, and I told you that this year's field was going to be drawn from a similar set of competitors?
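One way to explore the "win the contest" objective is a small Monte Carlo simulation. The sketch below is an illustration only: it assumes independent games, one point per correct pick, ties split evenly among the tied entries, and that we know every competitor's picks in advance. The function names and all of the numbers are hypothetical.

```python
import random


def score(picks, outcomes):
    """Number of games where the pick matches the simulated outcome."""
    return sum(p == o for p, o in zip(picks, outcomes))


def estimate_win_probability(my_picks, field_picks, upset_probs,
                             n_sims=20000, seed=0):
    """Estimate the chance that my entry beats the field (ties are shared)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_sims):
        # Simulate each game independently: True means the upset happened.
        outcomes = [rng.random() < q for q in upset_probs]
        mine = score(my_picks, outcomes)
        rival_scores = [score(p, outcomes) for p in field_picks]
        best_rival = max(rival_scores, default=-1)
        if mine > best_rival:
            total += 1.0
        elif mine == best_rival:
            # Split the win among everyone tied for first place.
            total += 1.0 / (1 + rival_scores.count(best_rival))
    return total / n_sims


# Hypothetical setup: four games, and three rivals who all pick no upsets.
upset_probs = [0.45, 0.30, 0.20, 0.10]
chalk = [False, False, False, False]
contrarian = [True, False, False, False]  # flip only the closest game
field = [chalk, chalk, chalk]

print(estimate_win_probability(chalk, field, upset_probs))
print(estimate_win_probability(contrarian, field, upset_probs))
```

In this toy setup the answer can also be worked out by hand. An entry identical to three rivals shares every win, so it is worth 1/4. Flipping the single 0.45 game wins outright whenever that upset happens and loses otherwise, so it is worth about 0.45, even though it lowers the expected number of correct picks. With a more varied field of competitors, the same function can be used to compare candidate entries.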

See Scott's picks for most likely upsets over at his blog.

Scott Turner said...

Well, I only posted the most likely #4, #3, #2, and #1 seeds to be upset -- and I suspect they're all pretty unlikely. The 5, 6, and 7 upsets I'm saving for my entry :-).

If you want to win Machine Madness, and you assume that your competitors are competent, then I think you have to force some upsets into your picks. I don't think you can rely on having a better strength assessment than the other competitors.

It would be a different story if we were predicting margin of victory (MOV).

Danny Tarlow said...

I agree that if you expect even a single competitor to be fairly conservative, then you need to pick some upsets. But how many should you pick? How should that vary based on the field of competitors that you expect to see?