Sunday, March 13, 2011

Selection Sunday Today!

Posted by Lee
UPDATE 3: 3/11/2011 was missing from the data. The archive has been corrected. It was also renamed to reflect that it contains all games through 3/12 and not 3/13.

UPDATE 2: It turns out that I was incorrectly parsing boxscores where the teams had a seed number in front of their name, which caused those teams to be coded as "UNK" even though these are incredibly important games (thanks, Rob). I have fixed this and uploaded new data. Please re-download the data. The links below have been corrected to point at the new data as well.

UPDATE: Thanks to the keen observation of one of our contestants (thanks, Scott), we've updated the data as the boxscore format we were crawling changed starting 3/10. The link below has been updated to the new data.

Today is Selection Sunday, when we will find out who is playing in this year's NCAA tournament. This also means that all the conferences have completed their games and it is time to release the data. Full contest details are available here.

Please download these two files: The team code mapping is there to help you convert between the codes in the boxscore data and the actual names of the teams. As with the previous data release, there is a team code called "UNK" that refers to any team that is not in the list of codes. These are often small teams that are either in non-Division 1 conferences or no longer in Division 1.

There are two files within the data archive: Players.tsv and Games.tsv.

Games.tsv - each row corresponds to a game played during the season.
  1. Game ID
  2. Date
  3. Home team
  4. Away team
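
As a rough illustration of the Games.tsv layout above, here is a minimal Python sketch for reading it. It assumes the file is tab-separated with no header row and with the four columns in the order listed; the `Game` type, the `read_games` helper, and the sample rows (including the date format) are made up for the example and are not part of the data release.

```python
import csv
import io
from typing import NamedTuple


class Game(NamedTuple):
    # Field order follows the Games.tsv column list above.
    game_id: str
    date: str
    home_team: str
    away_team: str


def read_games(handle):
    """Read tab-separated game rows from an open text handle.

    Assumes there is no header row; rows with fewer than four
    fields are skipped rather than guessed at.
    """
    reader = csv.reader(handle, delimiter="\t")
    return [Game(*row[:4]) for row in reader if len(row) >= 4]


# Made-up sample rows; the real IDs and date format may differ.
sample = io.StringIO("g001\t2011-03-12\tkaa\tUNK\ng002\t2011-03-12\tUNK\tkaa\n")
games = read_games(sample)
for game in games:
    print(game.game_id, game.home_team, "vs", game.away_team)
```

To read the real file, something like `open("Games.tsv", newline="")` could be passed to `read_games` in place of the in-memory sample.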

Players.tsv - each row corresponds to a single player's performance in one game.
  1. Player name
  2. Game ID
  3. Player's team
  4. Minutes played
  5. Field goals made
  6. Field goals attempted
  7. Three pointers made
  8. Three pointers attempted
  9. Free throws made
  10. Free throws attempted
  11. Offensive rebounds
  12. Defensive rebounds
  13. Assists
  14. Turnovers
  15. Steals
  16. Blocks
  17. Personal fouls
Note that three pointers are included in the number of field goals. Thus, if a player has made 3 field goals, 1 of which was a three pointer, and no free throws, then that player has scored 7 (2 + 2 + 3) points.
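
The scoring rule in the note can be sketched in Python as follows. Since Games.tsv does not carry scores, the same rule can also be summed per team and game to approximate a final score. This is only a sketch: the column positions are taken from the Players.tsv list above, the helper names and sample rows are made up, free throws are counted at one point each, and any player rows missing from the data would make a summed total too low.

```python
from collections import defaultdict

# 0-based positions of the columns used here, following the Players.tsv
# column list above (Game ID is column 2, field goals made is column 5, ...).
GAME_ID, TEAM, FGM, TPM, FTM = 1, 2, 4, 6, 8


def points_scored(fgm, tpm, ftm=0):
    """Points from made field goals (three pointers included) and free throws."""
    two_pointers = fgm - tpm  # field goals that were not three pointers
    return 2 * two_pointers + 3 * tpm + ftm


def points_from_row(row):
    """Points for one Players.tsv row, given as a list of 17 string fields."""
    return points_scored(int(row[FGM]), int(row[TPM]), int(row[FTM]))


def team_scores(rows):
    """Sum player points per (game ID, team) pair."""
    totals = defaultdict(int)
    for row in rows:
        totals[(row[GAME_ID], row[TEAM])] += points_from_row(row)
    return dict(totals)


# Made-up rows in the 17-column layout; not taken from the real data.
sample_rows = [
    ["Player A", "g001", "kaa", "30", "3", "8", "1", "4", "2", "2",
     "1", "4", "3", "2", "1", "0", "2"],
    ["Player B", "g001", "kaa", "25", "5", "9", "0", "1", "1", "2",
     "2", "3", "1", "1", "0", "1", "3"],
]

print(points_scored(3, 1))  # 7, matching the example in the note
print(team_scores(sample_rows))
```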

In the previous post, I had a faulty link to join the group on Yahoo. The correct link is As before, please e-mail leezen+MarchMadness at gmail for the password. Please include your team name, team members, and a brief description.

Good luck!


Rob Schroeder said...

Right now most of the data for Kansas is coded as UNK. Considering it's a 1 seed, you might want to make sure it gets changed to kaa.

Lee said...

Thanks for pointing this out, Rob. I'll look into it. I also noticed an issue with not parsing games correctly where teams had seed numbers in front of their names. I'll be posting an update soon.

Lee said...

I think I've tracked down the issue and corrected the data. Please let me know if you see other problems.

Jeff said...

Hi Lee - Am I misinterpreting, or is the away ID not formatted correctly? I am importing the tab-delimited file into Microsoft Access.

Probably my mistake, but I just want to rule out other problems.


Lee said...

Jeff, I assume you're referring to Games.tsv? I'm not seeing any issues with that column. Can you please describe your specific issue?

Jeff said...

Sorry Lee - my mistake - your data is perfectly fine. Thank you so much for posting this data. Win, lose or draw, having a lot of data to play with is a great thing.

wdeupree said...

It seems like there is data that is missing. For example, in looking at Kentucky's current season, I see two games that are missing. I don't see the 1/3/11 game against Penn or the 2/26/11 game against Florida. Moreover, when I look at each team's record for the season, it seems like each one is missing at least a few games. Am I not looking at something correctly? Does anyone else see missing data?

Lee said...

wdeupree: unfortunately, due to the nature of crawling and parsing data from the web, games sometimes go missing, either because of an erroneous response from the server or because the parser could not interpret the boxscore data. We do try to make the data as good as possible, but we know we won't always capture every game. This is also why we ask all competitors to use the same data set: the goal of the competition is to see who can make the best predictions, not who can do the best job crawling data. It turns out that the latter is not always that straightforward.

Steven said...

I would like to enter the competition. I have my bracket filled out. I cannot find where I can formally submit my predictions, but I do have it done before noon on Tuesday. My email address is stevenjackson121 @ with no spaces. I'd be glad to send my predictions anywhere you like. Unfortunately, I did not find out about this contest until this morning, so it may not be in the format you want/need, and the data I use may differ slightly from yours.

Go said...

Hi, I'm interested in the event even though it is a bit late. Looking at Games.tsv, I'm wondering why the scores/results of the games are not there? Thanks.

Danny Tarlow said...

If you're just looking for scores, there's simpler, aggregate data available here: