Posted by Danny Tarlow
I like the term "evidence-based medicine." Hopefully we'll be hearing more about it in the future: http://www.nytimes.com/2008/12/27/business/27record.html
Saturday, December 27, 2008
Friday, December 26, 2008
Posted by Danny Tarlow
I've been playing around with sympy, a symbolic math library for Python: http://docs.sympy.org/index.html A lot of the work I do involves writing down hairy likelihood functions, taking partial derivatives, and solving for some update that increases the likelihood. Up until now, I've always done the calculus and algebra by hand. I'm wondering now why I put myself through the tedium. I was never one to use a TI-89 back in high school calculus, but at least then I could justify it to myself by saying I was learning more. Now, though, sympy seems to have all the answers. It even has a print feature that exports directly to LaTeX. Is there any downside?
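As a minimal sketch of the workflow I have in mind (the single-observation Gaussian log-likelihood here is just a toy example, not from any particular project):

```python
from sympy import symbols, diff, latex, simplify

# Toy example: log-likelihood of one Gaussian observation, constants dropped
x, mu = symbols('x mu')
sigma = symbols('sigma', positive=True)
log_lik = -(x - mu)**2 / (2 * sigma**2)

# Partial derivative with respect to mu, simplified
d_mu = simplify(diff(log_lik, mu))
print(d_mu)         # the gradient, equal to (x - mu)/sigma**2
print(latex(d_mu))  # the same expression as LaTeX source
```

Setting the derivative to zero and solving (sympy's `solve` handles that step too) recovers the usual update.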
Thursday, December 18, 2008
Posted by Danny Tarlow
Fun little piece on two schools of interpreting probabilities: http://www.johndcook.com/blog/2008/02/26/what-a-probability-means/
Posted by Danny Tarlow
From David Paterson via CNN:
For example, a study by Harvard researchers found that each additional 12-ounce soft drink consumed per day increases the risk of a child becoming obese by 60 percent. For adults, the association is similar.
I think this is really cool -- I'm all for using taxes to combat externalities, and obesity seems like a major one. However, my first thought (without knowing anything about the actual study) is that I'm not convinced soft drinks play a causal role in obesity. It seems to me that there are an awful lot of unhealthy lifestyles that include drinking more soft drinks. The unhealthy lifestyle can jointly cause both higher rates of obesity and higher consumption of soft drinks. I haven't been able to find a full list of the new proposed taxes. Is the soft drink tax just a representative example of a larger obesity-fighting program?
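To make the confounding story concrete, here's a small simulation (all numbers made up) where an unobserved "lifestyle" variable drives both soda consumption and obesity, with no direct link between the two:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Hypothetical confounder: an "unhealthy lifestyle" score
lifestyle = rng.normal(0, 1, n)

# Soda consumption and obesity each depend on lifestyle,
# but neither depends on the other
soda = lifestyle + rng.normal(0, 1, n)
obesity = lifestyle + rng.normal(0, 1, n)

# Despite the absence of any direct causal link, the two are correlated
r = np.corrcoef(soda, obesity)[0, 1]
print(round(r, 2))  # close to 0.5
```

So an observed soda-obesity association is consistent with either story; it can't tell us how much a soda tax would actually reduce obesity.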
Monday, December 15, 2008
Posted by Danny Tarlow
First, let's assume that success in some endeavor is distributed as the product of talent in several relevant characteristics (e.g., success in musical performance can be modeled as the product of talent in musical thinking, physical coordination, and dedication to practice). Then, I want to ask how we can explain different tails of the distributions that could correspond to success in high-powered jobs. We know that the top-level success variable will not be normally distributed, even if the underlying traits are. How would Larry Summers's arguments change under a multiplicative model instead of the simple Gaussian model that his calculations are based on?
Friday, December 12, 2008
Wednesday, December 10, 2008
Saturday, December 6, 2008
Posted by Danny Tarlow
I've been working a lot with Python and MySQL lately, and I've been very happy with them as the basis for doing research and other data analysis. In particular, MySQLdb, matplotlib, networkx, and CVXOPT have all helped to make the experience very pleasant. So I'm more or less ready to abandon Matlab as my primary research work environment. One of the things I'll miss about Matlab is the save command, which can store the entire environment to a file. It probably encourages some bad habits, but it's also quite convenient and has served me well in the past. To do things right, though, I want to store everything in my database, including data sets, algorithm parameter settings, internal behavior of the algorithm, and results. To do this, I need a schema. I might change my mind on this later, but I think I want to do something general rather than come up with a bunch of project-dependent schemas. It will likely be a bit more work, but hopefully it will also lead to me writing more general code on top of it, which may make future projects go more quickly. Here's an early draft of a MySQL script to create the tables I'm thinking of. I'll edit this and add more as I make progress.
CREATE TABLE project (
  project_id INT(11) NOT NULL AUTO_INCREMENT,
  project_title VARCHAR(128),
  time_started TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
  PRIMARY KEY (project_id)
);

CREATE TABLE data_set (
  data_set_id INT(11) NOT NULL AUTO_INCREMENT,
  project_id INT(11),
  data_source VARCHAR(64),  # Could be "synthetic", "netflix", etc.
  time_created TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
  PRIMARY KEY (data_set_id)
);

CREATE TABLE data_instance (
  data_instance_id INT(11) NOT NULL AUTO_INCREMENT,
  data_set_id INT(11),
  variable_name VARCHAR(32),
  variable_real_value DOUBLE,  # I'm not sure how to deal with different variable types
  PRIMARY KEY (data_instance_id)
);

CREATE TABLE execution (
  execution_id INT(11) NOT NULL AUTO_INCREMENT,
  data_set_id INT(11),
  algorithm_id INT(11),
  code_version INT(11),
  PRIMARY KEY (execution_id)
);

CREATE TABLE iteration (
  iteration_id INT(11) NOT NULL AUTO_INCREMENT,
  execution_id INT(11),
  step_number INT(11),
  is_final_iteration INT(1),
  time_started TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
  time_finished TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
  PRIMARY KEY (iteration_id)
);

CREATE TABLE parameter_value (
  parameter_value_id INT(11) NOT NULL AUTO_INCREMENT,
  iteration_id INT(11),
  parameter_name VARCHAR(32),
  parameter_value DOUBLE,
  PRIMARY KEY (parameter_value_id)
);
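Here's a rough sketch of what logging into this schema might look like from Python. I'm using sqlite3 as a stand-in so the snippet is self-contained (with MySQLdb the cursor calls look the same apart from %s-style placeholders and the connect arguments), the table definitions are trimmed and adapted to SQLite's types, and the parameter names are made up:

```python
import sqlite3

# In-memory SQLite database as a stand-in for the MySQL server
conn = sqlite3.connect(':memory:')
cur = conn.cursor()

# Trimmed, SQLite-flavored versions of the execution-related tables
cur.execute("""CREATE TABLE execution (
    execution_id INTEGER PRIMARY KEY AUTOINCREMENT,
    data_set_id INTEGER, algorithm_id INTEGER, code_version INTEGER)""")
cur.execute("""CREATE TABLE iteration (
    iteration_id INTEGER PRIMARY KEY AUTOINCREMENT,
    execution_id INTEGER, step_number INTEGER, is_final_iteration INTEGER)""")
cur.execute("""CREATE TABLE parameter_value (
    parameter_value_id INTEGER PRIMARY KEY AUTOINCREMENT,
    iteration_id INTEGER, parameter_name TEXT, parameter_value REAL)""")

# Log one execution, its first iteration, and the parameter settings
cur.execute("INSERT INTO execution (data_set_id, algorithm_id, code_version) "
            "VALUES (?, ?, ?)", (1, 1, 3))
execution_id = cur.lastrowid

cur.execute("INSERT INTO iteration (execution_id, step_number, is_final_iteration) "
            "VALUES (?, ?, ?)", (execution_id, 0, 0))
iteration_id = cur.lastrowid

for name, value in [('learning_rate', 0.01), ('num_latent_dims', 20.0)]:
    cur.execute("INSERT INTO parameter_value (iteration_id, parameter_name, "
                "parameter_value) VALUES (?, ?, ?)", (iteration_id, name, value))
conn.commit()
```

The nice part is that replaying an old run is then just a SELECT over `parameter_value` joined to `iteration` and `execution`.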
Friday, December 5, 2008
Posted by Danny Tarlow
I like Gelman's philosophy better: http://www.stat.columbia.edu/~cook/movabletype/archives/2008/12/greg-mankiws-wo.html
Thursday, December 4, 2008
Posted by Danny Tarlow
It seems like the low-hanging fruit with visualizing data is to have built-in ways of viewing things in time and space. If you know something has a location and a time associated with it, you can make some really cool visualizations. I like this one, for example: http://www.spatialkey.com/
Tuesday, December 2, 2008
Posted by Danny Tarlow
It's always fun to pull together several different posts. Here are the ones I want to tie together. Can you figure out how? http://randomcrunching.blogspot.com/2008/11/cocktail-party-conversation-starter.html http://randomcrunching.blogspot.com/2008/11/multiplying-for-success.html http://randomcrunching.blogspot.com/2008/11/summers-time.html Here's a start, along with the simple Python script that generated the figure so you can play along at home (you'll need numpy and pylab installed):
from numpy import *
from pylab import *

a = 50 + 10 * randn(10000, 1)
b = 50 + 20 * randn(10000, 1)
c = 50 + 30 * randn(10000, 1)

e = a + b + c
subplot(311)
hist(e, 500)
xlim(mean(e) - 5 * std(e), mean(e) + 5 * std(e))
axis('off')
title('Additive Model')

d = a * b * c
subplot(313)
hist(d, 500)
xlim(mean(d) - 5 * std(d), mean(d) + 5 * std(d))
title('Multiplicative Model')
axis('off')

show()