Sunday, March 13, 2011

Data Caveats

Posted by Lee
As we've seen in the past 12 hours, there have been some hiccups with the data. I apologize that data quality has been a bit lacking. Hopefully things will be relatively stable from here. I also wanted to point out a few other caveats in the data.
  • Some games are missing completely. This is because at crawl time the server sometimes hiccups and returns an empty page, even though I am not being throttled.
  • Some games exist in Games.tsv but have no player data. This is because in some cases the Yahoo! box score format is missing columns such as minutes, and the parser does not expect this.
  • Some games have inaccurate aggregate point totals. In a few cases, field goals exist at the team level but not at the individual player level. I expect this to be relatively rare, but it does cause some score discrepancies in the aggregate.
I'll try to keep this list up to date. Please comment/e-mail if you run into other items such as the above.
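
For anyone who wants to check their own copy of the data against the second and third caveats, here is a minimal sketch in Python. The column names (`game_id`, `team`, `team_points`, `points`) and the one-row-per-team layout are assumptions made for illustration, not the actual schema of Games.tsv or the player data, so adjust them to match the real headers.

```python
import csv
import io
from collections import defaultdict


def read_tsv(text):
    """Parse tab-separated text with a header row into a list of dicts."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))


def games_without_players(games, players):
    """Return sorted IDs of games that have no player rows at all.

    Assumes (hypothetically) that both tables carry a `game_id` column.
    """
    with_players = {row["game_id"] for row in players}
    return sorted({row["game_id"] for row in games} - with_players)


def score_discrepancies(games, players):
    """Return (game_id, team, team_points, summed_player_points) tuples
    for every team whose total differs from the sum of its players' points.

    Assumes (hypothetically) one games row per team with a `team_points`
    column, and one players row per player with a `points` column.
    Teams with no player rows are skipped; games_without_players covers those.
    """
    player_totals = defaultdict(int)
    for row in players:
        player_totals[(row["game_id"], row["team"])] += int(row["points"])

    mismatches = []
    for row in games:
        key = (row["game_id"], row["team"])
        team_points = int(row["team_points"])
        if key in player_totals and player_totals[key] != team_points:
            mismatches.append((row["game_id"], row["team"], team_points, player_totals[key]))
    return mismatches
```

To use it, read each file's contents into a string, pass it through `read_tsv`, and hand the resulting rows to the two checks. Games that are missing completely (the first caveat) cannot be found this way, since detecting them would need an independent schedule to compare against.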

