Saturday, December 6, 2008

Data schema

Posted by Danny Tarlow
I've been working a lot with Python and MySQL lately, and I've been very happy with them as the basis for doing research and other data analysis. In particular, MySQLdb, matplotlib, networkx, and CVXOPT have all helped to make the experience very pleasant. So I'm more or less ready to abandon Matlab as my primary research work environment. One of the things I'll miss about Matlab is the save command, which can store the entire environment to a file. It probably encourages some bad habits, but it's also quite convenient and has served me well in the past. To do things right, though, I want to store everything in my database, including data sets, algorithm parameter settings, internal behavior of the algorithm, and results. To do this, I need a schema. I might change my mind on this later, but I think I want to do something general rather than come up with a bunch of project-dependent schemas. It will likely be a bit more work, but hopefully it will also lead to me writing more general code on top of it, which may make future projects go quicker. Here's an early draft of a MySQL script to create the tables I'm thinking of. I'll edit this and add more as I make progress.
CREATE TABLE project (
  project_id INT(11) NOT NULL AUTO_INCREMENT,
  project_title VARCHAR(128),
  time_started TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
  PRIMARY KEY(project_id)
);

CREATE TABLE data_set (
  data_set_id INT(11) NOT NULL AUTO_INCREMENT,
  project_id INT(11),
  data_source VARCHAR(64), # Could be "synthetic", "netflix", etc.
  time_created TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
  PRIMARY KEY (data_set_id)
);

CREATE TABLE data_instance (
  variable_instance_id INT(11) NOT NULL AUTO_INCREMENT,
  data_set_id INT(11),
  variable_name VARCHAR(32),
  variable_real_value DOUBLE,  # I'm not sure how to deal with different variable types
  PRIMARY KEY (data_instance_id)
);

CREATE TABLE execution (
  execution_id INT(11) NOT NULL AUTO_INCREMENT,
  data_set_id INT(11),
  algorithm_id INT(11),
  code_version INT(11),
  PRIMARY KEY (execution_id)
);

CREATE TABLE iteration (
  iteration_id INT(11) NOT NULL AUTO_INCREMENT,
  execution_id INT(11),
  step_number INT(11),
  is_final_iteration INT(1),
  time_started TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
  time_finished TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
  PRIMARY KEY (iteration_id)
);

CREATE TABLE parameter_value (
  parameter_value_id INT(11) NOT NULL AUTO_INCREMENT,
  iteration_id INT(11),
  parameter_name VARCHAR(32),
  parameter_value DOUBLE,
  PRIMARY KEY(parameter_value_id)
);


No comments: