Posted by Lee
There is just a little over one month left before the beginning of March Madness! Hopefully, you're happily coding away and building some really cool models.
I am still in the process of crawling all the boxscore data. Unfortunately, it is taking longer than I had anticipated. However, I did want to give everyone a chance to see what the data would look like and the opportunity to use some sample data while developing their algorithms and models.
The sample data contains two tab-delimited files. One contains a list of all the games played and the other contains a list of each player's performance within a game. I plan on using this format for the final set of data, but if you have any major issues with it, feedback is welcome via the comments.
The game file has four columns: a game identifier, the date the game was played, the home team, and the away team.
The players file has the following columns: the player's name, the game ID corresponding to this particular row's performance, the team the player was playing for, the number of minutes the player played in the game, field goals made, field goals attempted, three pointers made, three pointers attempted, free throws made, free throws attempted, offensive rebounds, defensive rebounds, assists, turnovers, steals, blocks, and personal fouls.
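For anyone who wants a starting point, here is a minimal Python sketch of reading the two files into dictionaries. The column names are my own shorthand for the columns described above, not names that appear in the files. The sketch also assumes the files have no header row and that every statistic (including minutes) is a whole number, so adjust it if the sample data differs. The rows at the bottom are made up purely for illustration.

```python
import csv
import io

# Hypothetical column names, in the order the post describes them.
GAME_COLUMNS = ["game_id", "date", "home_team", "away_team"]
PLAYER_COLUMNS = [
    "player", "game_id", "team", "minutes",
    "fgm", "fga", "tpm", "tpa", "ftm", "fta",
    "oreb", "dreb", "ast", "tov", "stl", "blk", "pf",
]
# Columns kept as text; everything else is assumed to be an integer.
TEXT_COLUMNS = {"player", "game_id", "team", "date", "home_team", "away_team"}


def read_tsv(handle, columns):
    """Read headerless tab-delimited rows into dicts keyed by column name."""
    rows = []
    # QUOTE_NONE keeps quote characters inside names from confusing the parser.
    for fields in csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE):
        if not fields:
            continue  # skip blank lines
        if len(fields) != len(columns):
            raise ValueError(f"expected {len(columns)} fields, got {len(fields)}")
        row = {}
        for name, value in zip(columns, fields):
            row[name] = value if name in TEXT_COLUMNS else int(value)
        rows.append(row)
    return rows


# Made-up rows standing in for the real files.
sample_games = "g1\t2010-01-02\tHome U\tAway State\n"
sample_players = (
    "Example Player\tg1\tHome U\t30\t5\t10\t2\t4\t3\t4\t1\t4\t2\t1\t1\t0\t2\n"
)

games = read_tsv(io.StringIO(sample_games), GAME_COLUMNS)
players = read_tsv(io.StringIO(sample_players), PLAYER_COLUMNS)
print(games[0]["home_team"], players[0]["fgm"])
```

To read an actual file instead of the in-memory samples, pass a handle from open(path, newline="") to read_tsv.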
While there is no explicit points data in these files (to avoid redundancy), it can easily be reconstructed. For example, to determine the number of points scored by a player, simply perform the following calculation: free throws made + 2 * field goals made + three pointers made (not times three as they are already counted in the field goals). To determine the team's scores, simply sum the scores of the players for each team with the corresponding game identifier.
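The calculation above can be sketched in a few lines of Python. The key names (game_id, team, ftm, fgm, tpm) are hypothetical labels for the corresponding columns, and the rows are invented numbers rather than real boxscore data.

```python
from collections import defaultdict


def player_points(ftm, fgm, tpm):
    """Points = free throws + 2 per field goal + 1 extra per three pointer.

    Three pointers are already counted among field goals made, so each one
    only adds a single extra point on top of the 2 counted there.
    """
    return ftm + 2 * fgm + tpm


def team_scores(player_rows):
    """Sum player points for each (game_id, team) pair."""
    totals = defaultdict(int)
    for row in player_rows:
        key = (row["game_id"], row["team"])
        totals[key] += player_points(row["ftm"], row["fgm"], row["tpm"])
    return dict(totals)


# Invented rows: two players for one team, one for the other.
rows = [
    {"game_id": "g1", "team": "Home U", "ftm": 3, "fgm": 5, "tpm": 2},
    {"game_id": "g1", "team": "Home U", "ftm": 0, "fgm": 4, "tpm": 0},
    {"game_id": "g1", "team": "Away State", "ftm": 2, "fgm": 6, "tpm": 1},
]

scores = team_scores(rows)
print(scores[("g1", "Home U")])      # 15 + 8 = 23
print(scores[("g1", "Away State")])  # 2 + 12 + 1 = 15
```

Grouping by both game identifier and team keeps the two sides of a game separate, so the home and away scores can then be looked up using the team names from the game file.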
Good luck and please let us know via the comments if you run into any problems.