Tuesday, March 9, 2010

2009 and 2010 March Madness Data

Posted by Danny Tarlow
As promised, here is data from the 2009 and 2010 NCAA Division 1 college basketball seasons. Available for each game is the date, who played (in id, short code, and full name format), who the home team was, and the resulting score. The database view that the data is coming from looks like this:
| date_played | day_counter | home_id | away_id | home_score | away_score | home_code | away_code | home_name     | away_name     |
| 2009-03-18  |         128 |      61 |      28 |         73 |         71 | cbt       | bap       | Creighton     | Bowling Green | 
| 2009-03-18  |         128 |      88 |     131 |         84 |         62 | fak       | jab       | Florida       | Jacksonville  | 
| 2009-03-18  |         128 |     135 |     122 |         83 |         79 | kab       | iag       | Kansas St.    | Illinois St.  | 
| 2009-03-18  |         128 |     309 |     209 |         68 |         59 | tba       | nbd       | Tulsa         | Northwestern  | 
| 2009-03-18  |         128 |      15 |     294 |         87 |         82 | abb       | tae       | Auburn        | Tenn-Martin   | 
| 2009-03-18  |         128 |      22 |     101 |         74 |         72 | bae       | gae       | Baylor        | Georgetown    | 
| 2009-03-18  |         128 |     236 |     166 |         66 |         78 | pas       | mav       | Providence    | Miami (FL)    | 
| 2009-03-18  |         128 |     323 |      77 |        116 |        108 | vah       | dav       | Virginia Tech | Duquesne      | 
| 2009-03-19  |         129 |     211 |     195 |         70 |         68 | nbf       | nam       | Notre Dame    | New Mexico    | 
| 2009-03-19  |         129 |      33 |     300 |         66 |         79 | baw       | tan       | BYU           | Texas A&M     | 
And the data itself is just a CSV export of this. For example:
"2008-11-10",0,113,98,63,65,"ham","gab","Houston","Georgia Southern"
"2008-11-11",1,76,98,97,54,"dau","gab","Duke","Georgia Southern"

In the linked file, there is one CSV for the 2008-2009 season, and one for the 2009-2010 season. Two notes
  • There are still a few more games to be played this season, so I will update the data when the season officially ends and the brackets are announced.
  • I do not have all the tournament data from last year. If some kind soul would like to transcribe last year's final bracket into text form, I'd appreciate it (I don't have the time or inclination to do it right now).

So with that all out of the way, and without further ado, here is the data (Update: this is version 3 of the data, which has entries for four teams that were missing in the first version and has removed some game duplicates that were in version 2).

I encourage you all to try out your favorite machine learning algorithms on it. If you need a starter idea, I still like my strategy from last year, which you can read about here, but there are likely plenty of other good approaches. I'm curious to see what other people can come up with.

I'll try to get my model's predictions posted by March 16, when the first game begins. If you have a bracket and a model that you'd like to share, please send me the predictions along with a brief description of your approach. It'd be fun to see the different ideas then see how the different approaches stack up. I can't promise a prize for the winner (other than a showcase on this blog), but perhaps there is some generous sponsor out there who would like to donate a better prize to the pool? =P

We have a prize (!!) generously offered up by Doug in the comments: "a custom vinyl sticker, or laser etching featuring the yet to be revealed, super secret "Smell the data" logo."

*For those curious, here is the SQL query that generated the data. If you'd like to give me an alternate SQL script to run, I might be willing to give you a custom export. The raw table descriptions are here.
SELECT  gr.date_played, TO_DAYS(gr.date_played) - TO_DAYS(date('2008-11-10')) as day_counter,
  tc1.team_id as home_id, tc2.team_id as away_id, 
  home_score, away_score,
  gr.home_code, gr.away_code, 
  tc1.team_name as home_name, tc2.team_name as away_name
INTO OUTFILE '/tmp/2009_season.csv'
 game_result gr, team_code tc1, team_code tc2 
 gr.date_played < '2009-06-01' and
 gr.home_code = tc1.team_code and gr.away_code = tc2.team_code
ORDER by gr.date_played;


Joseph Turian said...

Do I understand correctly that _id, _code, and _name are all identical (in the sense of labels in a discrete vocabulary)?

Danny Tarlow said...

Yes, they're redundant but useful for different things.
_id is easy to work with in the code

_code is what rivals.yahoo.com uses, so if you want to, say, scrape more detailed team stats, you can use that code in the urls. e.g.

_name is just for pretty display

Tlow said...

I'm willing to sponsor a prize, but what happens if my sigmoid functions rock the socks off of your SQL data tables or whatnot?

But on a more serious note, is your challenge essentially an algorithm competition with finite data inputs, or is selection of judgment criterion also a significant aspect ?

The prize that I'm willing to offer is a custom vinyl sticker, or laser etching featuring the yet to be revealed, super secret "Smell the data" logo.

Danny Tarlow said...

Tlow: if you come up with an algorithm that beats all comers, I will make sure you get a prize.

Everybody else: my little brother, Doug (Tlow), has generously offered up the first prize for the pool! You heard him: "a custom vinyl sticker, or laser etching featuring the yet to be revealed, super secret "Smell the data" logo."

If any of you have seen my laptop, he is the one that made the stickers that are on it now, so this should be a highly coveted prize. Thanks, Doug!

Daniel said...

It should be worth noting, when I looked at the data, that the NCAA and NIT tournaments are actually in the 2009 season(as well as conference tournaments). I believe the cutoff for the large tournaments in 2009 was the 17th of March. As for the conference tourneys, those are probably harder to locate.

Scott Turner said...

The data seems to have some glitches, e.g.,


I've got 96 occurences of \N in the 2010 data. Maybe it's just me;
decoding the tar on Windows is somewhat problematic.

Danny Tarlow said...

Ah, thanks Scott. It looks like there are four teams from 2010 that weren't there in 2009:
Seattle (san)
North Dakota (nci)
SIU Edwardsville (saw)
South Dakota (say)

I'll get this fixed up shortly. I believe the 2009 data should be OK, though. Is that right?

Danny Tarlow said...

Ok, try this version. It should have the four new teams filled in proper:

Matt Curry said...

Hey Danny,
Thanks for data. I see a problem where games are duplicated, just the home and away teams are flipped.

For example:
Line 689 of 2010 season has lan vs wal on 11/26/2009. Then line 697 has the same game, just with the teams flipped.

This comes up other places as well.

Danny Tarlow said...

Thanks for the careful eye, Matt. It was a parsing error that came up when there was a ranked team.

It should be fixed in this new version of the data:

bjfish said...

Here's a script to scrape the stats as you had mentioned: http://gist.github.com/332084

Danny Tarlow said...

Very clean script--nice!

Also check out the additional data Lee posted here: